re — Python Regular Expressions#
What it is#
re is Python's standard-library regular-expression engine. It implements a Perl-style dialect close to PCRE (named groups, lookaround, non-greedy quantifiers) with a small number of Python-specific divergences: no \K, no recursive patterns, no \p{…} Unicode property classes, and named groups written as (?P<name>…) rather than PCRE's usual (?<name>…). It is the engine Django uses to route URLs, that pandas uses for str.contains, and that Python scripts reach for whenever a plain `in` substring check is no longer enough.
The third-party regex module is largely API-compatible with re and adds variable-length lookbehind, \K, recursive patterns, and \p{…} property classes; consider it if you need those features. For comparison with the dialect used by grep -P, ripgrep, and nginx, see linux/pcre.
Install#
re is part of the CPython standard library — available everywhere Python is, with no install step.
python -c "import re; print(re.__doc__.splitlines()[0])"
Output:
Support for regular expressions (RE).
API overview#
The module exposes a small set of top-level functions that mirror methods on compiled pattern objects. The two are functionally identical — re.search(pat, s) is re.compile(pat).search(s) with internal caching.
| Function | Method | Returns | Notes |
|---|---|---|---|
| re.match(pat, s) | p.match(s) | Match or None | anchors at start of string |
| re.fullmatch(pat, s) | p.fullmatch(s) | Match or None | must match the entire string |
| re.search(pat, s) | p.search(s) | Match or None | first match anywhere |
| re.findall(pat, s) | p.findall(s) | list[str] or list[tuple] | all non-overlapping matches |
| re.finditer(pat, s) | p.finditer(s) | iterator of Match | lazy; preferred for many matches |
| re.sub(pat, repl, s) | p.sub(repl, s) | str | substitute |
| re.subn(pat, repl, s) | p.subn(repl, s) | (str, count) | sub + count |
| re.split(pat, s) | p.split(s) | list[str] | split by pattern |
| re.compile(pat, flags) | n/a | Pattern | compile once, reuse |
| re.escape(s) | n/a | str | escape regex metachars in s |
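As a quick check of the function/method equivalence described above, the sketch below runs the same pattern both ways. The sample text and the `ORDER` name are illustrative, not taken from any real codebase.

```python
import re

text = 'order 66 shipped, order 77 pending'

# Module-level call: the pattern is compiled (or fetched from the cache) per call.
via_module = re.findall(r'order (\d+)', text)

# Compiled pattern: the same operation as a method on the Pattern object.
ORDER = re.compile(r'order (\d+)')
via_method = ORDER.findall(text)

print(via_module == via_method)  # True
print(via_method)                # ['66', '77']
```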
Compile vs not#
re.compile(pattern, flags=0) parses the pattern once and returns a reusable Pattern object. Use it whenever the same regex is applied repeatedly — inside a loop, in a hot function, or as a module-level constant. For one-off uses the module-level functions are fine because re keeps an internal LRU cache (size 512) of recently compiled patterns.
import re
PHONE = re.compile(r'\d{3}-\d{3}-\d{4}')
for line in ['call 555-123-4567', 'no phone', 'two: 555-000-1111 and 555-222-3333']:
    for match in PHONE.finditer(line):
        print(match.group())
Output:
555-123-4567
555-000-1111
555-222-3333
[!TIP] In a function called millions of times, a precompiled pattern is measurably faster than the module-level functions because each call skips the cache lookup; the exact speedup depends on the pattern and the Python version. For one-off matches, the difference is negligible.
match vs search vs fullmatch vs findall vs finditer#
These five differ in where the regex engine starts looking and what it returns. Mixing them up is the most common re bug.
| Function | Where it anchors | Returns |
|---|---|---|
| match | start of string only | first match (or None) |
| fullmatch | start AND end of string | match covering the whole string (or None) |
| search | anywhere in string | first match (or None) |
| findall | anywhere, all non-overlapping | list of strings (or tuples if there are groups) |
| finditer | anywhere, all non-overlapping | iterator of Match objects |
import re
s = 'aaa 123 bbb 456'
print(re.match(r'\d+', s)) # None — string starts with 'a'
print(re.search(r'\d+', s)) # finds '123'
print(re.fullmatch(r'\d+', s)) # None: the digits do not cover the whole string
print(re.findall(r'\d+', s)) # all numeric runs
print(list(re.finditer(r'\d+', s))) # same, as Match objects
Output:
None
<re.Match object; span=(4, 7), match='123'>
None
['123', '456']
[<re.Match object; span=(4, 7), match='123'>, <re.Match object; span=(12, 15), match='456'>]
[!WARNING]
`match` doesn't mean "match the whole string"; it means "match starting from position 0". For end-anchored full matches, use `fullmatch` or wrap the pattern with `\A...\Z`.
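A small sketch of the warning above, using an all-digits check as the (assumed) validation task. It also shows one subtlety that is easy to miss: `$` matches just before a trailing newline, while `\Z` and `fullmatch` do not.

```python
import re

# match() only anchors the start, so trailing junk is accepted.
print(bool(re.match(r'\d+', '123abc')))       # True
# fullmatch() requires the pattern to cover the whole string.
print(bool(re.fullmatch(r'\d+', '123abc')))   # False

# '$' also matches just before a trailing newline; '\Z' and fullmatch() do not.
print(bool(re.match(r'\d+$', '123\n')))       # True
print(bool(re.match(r'\d+\Z', '123\n')))      # False
print(bool(re.fullmatch(r'\d+', '123\n')))    # False
```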
The Match object#
A successful match, search, or fullmatch returns a Match object — never a string. To get the matched text use .group(), to get capture groups use .group(n) or .groups(), and to get the position use .span().
import re
m = re.search(r'(\w+)=(\d+)', 'value=42 elsewhere')
print(m.group()) # full match
print(m.group(0)) # same as .group()
print(m.group(1)) # first capture
print(m.group(2)) # second capture
print(m.groups()) # all captures as tuple
print(m.span()) # (start, end)
print(m.start(), m.end())
Output:
value=42
value=42
value
42
('value', '42')
(0, 8)
0 8
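Because `search` returns None when nothing matches, calling `.group()` on the result unguarded raises AttributeError. The sketch below shows one way to guard it; `parse_assignment` is a hypothetical helper written for this illustration.

```python
import re

def parse_assignment(text):
    """Return (key, int_value) for the first key=digits pair, or None."""
    m = re.search(r'(\w+)=(\d+)', text)
    if m is None:                  # search() returns None when nothing matches
        return None
    key, value = m.group(1, 2)     # group() accepts several indices at once
    return key, int(value)

print(parse_assignment('value=42 elsewhere'))  # ('value', 42)
print(parse_assignment('nothing here'))        # None
```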
Named groups#
(?P<name>...) captures into a named slot accessible via .group('name') or .groupdict(). Named groups make patterns, groupdict() results, and sub replacements self-documenting; prefer them over numeric groups whenever there are more than two captures.
import re
DATE = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')
m = DATE.search('shipped on 2026-05-25 to customer 42')
print(m.group('year'), m.group('month'), m.group('day'))
print(m.groupdict())
Output:
2026 05 25
{'year': '2026', 'month': '05', 'day': '25'}
In a sub replacement, refer to named groups with \g<name> (or \1/\2 for numeric).
import re
print(re.sub(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
r'\g<day>/\g<month>/\g<year>',
'2026-05-25'))
Output:
25/05/2026
Backreferences#
\1, \2, … inside the pattern match the same text that the corresponding capture group did. Use them to find repeats, palindromes, or matching delimiters.
import re
# Find adjacent duplicated words
print(re.findall(r'\b(\w+) \1\b', 'the the quick brown fox fox jumps'))
# Match content wrapped in matching quote chars (single OR double)
m = re.search(r"(['\"])(.*?)\1", 'set name="alice" and age=30')
print(m.group(2))
Output:
['the', 'fox']
alice
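The quote-matching example can also be written with a named backreference, `(?P=name)`, which is the in-pattern counterpart of a named group. This is a sketch with an invented input string; the group names `quote` and `body` are arbitrary.

```python
import re

# (?P=quote) matches whatever text the group named 'quote' captured.
QUOTED = re.compile(r"(?P<quote>['\"])(?P<body>.*?)(?P=quote)")

text = 'name="alice" nick=\'al\' mixed="it\'s"'
bodies = [m['body'] for m in QUOTED.finditer(text)]
print(bodies)  # ['alice', 'al', "it's"]
```

The apostrophe inside `"it's"` does not end the match, because the closing delimiter has to be the same character that opened it.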
Lookaround — (?=…), (?!…), (?<=…), (?<!…)#
Lookaround assertions test whether something does or does not appear adjacent to the current position without consuming characters. Python’s re supports all four variants. The lookbehind must be fixed-length (use the regex library if you need variable-length).
import re
# Match price digits only when followed by USD (lookahead)
print(re.findall(r'\d+(?=\s*USD)', 'cost: 100 USD or 50 EUR or 30 USD'))
# Match digit runs NOT preceded by $ (negative lookbehind; \b stops a match from starting mid-number)
print(re.findall(r'(?<!\$)\b\d+', 'price $100 quantity 50 items'))
# Identifiers not followed by a paren — "variables, not function calls"
print(re.findall(r'\b\w+\b(?!\s*\()', 'foo() bar baz() qux'))
Output:
['100', '30']
['50']
['bar', 'qux']
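To illustrate the fixed-length rule for lookbehind, the sketch below shows the compile-time rejection and one possible workaround with the standard module: one fixed-width lookbehind per alternative. The exact error message is version-dependent, so it is printed rather than assumed.

```python
import re

# A lookbehind whose alternatives differ in width is rejected at compile time.
try:
    re.compile(r'(?<=ab|abc)x')
    rejected = False
except re.error as exc:
    rejected = True
    print('rejected:', exc)

# Workaround sketch: alternate between fixed-width lookbehinds instead.
AFTER_AB_OR_ABC = re.compile(r'(?:(?<=ab)|(?<=abc))x')
results = [bool(AFTER_AB_OR_ABC.search(s)) for s in ('abx', 'abcx', 'acx')]
print(results)  # [True, True, False]
```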
Flags#
Flags change how the regex engine interprets the pattern. They can be passed as the flags= keyword argument or embedded inline with (?aiLmsux) at the start of the pattern (or (?i:...) to scope to a group).
| Flag | Short | Effect |
|---|---|---|
| re.IGNORECASE | re.I | case-insensitive matching |
| re.MULTILINE | re.M | ^ and $ match at every line boundary |
| re.DOTALL | re.S | . matches newlines too |
| re.VERBOSE | re.X | ignore whitespace and # comments inside the pattern |
| re.ASCII | re.A | \w, \d, \s match ASCII only (not Unicode) |
| re.UNICODE | re.U | default in Python 3; explicit is rarely needed |
| re.DEBUG | n/a | print compiled-pattern debug info |
Combine flags with |:
import re
pat = re.compile(r'^error: (.+)$', re.I | re.M)
log = """
INFO: started
ERROR: out of memory
WARN: retrying
Error: connection lost
"""
print(pat.findall(log))
Output:
['out of memory', 'connection lost']
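The inline forms mentioned above behave differently in scope. The following sketch, on an invented sample string, contrasts a global `(?i)` with a group-scoped `(?i:...)`, and shows the effect of `re.S` on `.`.

```python
import re

text = 'Error: DISK error: net ERROR: CPU'

# (?i) at the start applies IGNORECASE to the whole pattern,
# so [A-Z]+ also matches lowercase letters.
whole = re.findall(r'(?i)error: [A-Z]+', text)
print(whole)   # ['Error: DISK', 'error: net', 'ERROR: CPU']

# (?i:...) scopes the flag to the group; [A-Z]+ stays case-sensitive.
scoped = re.findall(r'(?i:error): [A-Z]+', text)
print(scoped)  # ['Error: DISK', 'ERROR: CPU']

# Without DOTALL, '.' stops at a newline.
print(re.search(r'a.b', 'a\nb'))                 # None
print(re.search(r'a.b', 'a\nb', re.S).span())    # (0, 3)
```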
Verbose mode (re.X)#
re.VERBOSE (a.k.a. re.X) tells the engine to ignore unescaped whitespace and #-comments inside the pattern, allowing you to format regexes like prose. This is the single most useful feature for keeping non-trivial regexes maintainable.
import re
# A semver-like version pattern, commented and formatted
VERSION = re.compile(r"""
^ # start of string
v? # optional leading 'v'
(?P<major>\d+) \.
(?P<minor>\d+) \.
(?P<patch>\d+)
(?: - (?P<pre>[\w.]+) )? # optional pre-release tag
(?: \+ (?P<build>[\w.]+) )? # optional build metadata
$
""", re.VERBOSE)
for s in ['1.2.3', 'v1.2.3-beta.4+exp.sha.5114f85', 'not a version']:
    m = VERSION.match(s)
    print(s, '→', m.groupdict() if m else None)
Output:
1.2.3 → {'major': '1', 'minor': '2', 'patch': '3', 'pre': None, 'build': None}
v1.2.3-beta.4+exp.sha.5114f85 → {'major': '1', 'minor': '2', 'patch': '3', 'pre': 'beta.4', 'build': 'exp.sha.5114f85'}
not a version → None
[!TIP] Inside a verbose pattern, write a literal whitespace as `\ ` (escaped space) or `[ ]` (a class), and a literal `#` as `\#`.
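A minimal sketch of the tip above: a verbose pattern that needs both a literal `#` and a literal space. The `ISSUE` pattern and its input are invented for the illustration.

```python
import re

ISSUE = re.compile(r"""
    \#               # a literal '#': escaped so it does not start a comment
    (?P<num>\d+)     # issue number
    [ ]              # a literal space, written as a one-character class
    (?P<word>\w+)    # the word that follows
""", re.VERBOSE)

m = ISSUE.search('see #42 fixed today')
print(m.groupdict())  # {'num': '42', 'word': 'fixed'}
```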
re.sub with a callback#
re.sub(pattern, repl, string) accepts either a replacement string (with \1, \g<name> etc.) or a callable that takes a Match and returns a replacement string. The callable form is the cleanest way to do conditional, computed, or stateful substitutions.
import re
# Uppercase every word longer than three characters
def capitalize_long(m):
    word = m.group()
    return word.upper() if len(word) > 3 else word
print(re.sub(r'\b\w+\b', capitalize_long, 'the quick brown fox'))
# Increment every number in a string
print(re.sub(r'\d+', lambda m: str(int(m.group()) + 1), 'a=1 b=2 c=99'))
# Mask emails
print(re.sub(
    r'(?P<name>[\w.]+)@(?P<domain>[\w.]+)',
    lambda m: f"{m['name'][0]}***@{m['domain']}",
    'contact alice@example.com or bob@example.com',
))
Output:
the QUICK BROWN fox
a=2 b=3 c=100
contact a***@example.com or b***@example.com
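Another common use of the callable form is replacing several different tokens in a single pass, which avoids the ordering problems of chained `str.replace` calls (swapping two words, for instance). The `SWAPS` table below is a made-up example.

```python
import re

SWAPS = {'cat': 'dog', 'dog': 'cat'}   # hypothetical replacement table

# Longest keys first so a shorter key never shadows a longer one;
# re.escape guards against metacharacters in the keys.
keys = sorted(SWAPS, key=len, reverse=True)
pattern = re.compile('|'.join(re.escape(k) for k in keys))

result = pattern.sub(lambda m: SWAPS[m.group()], 'cat chases dog')
print(result)  # dog chases cat
```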
re.subn and re.split#
subn is sub that also returns a count of substitutions made — convenient for “did anything change?” checks. split is the regex-aware equivalent of str.split, useful when the delimiter is a pattern, not a literal.
import re
print(re.subn(r'foo', 'bar', 'foo foo baz foo'))
print(re.split(r'\s*,\s*', 'apple , banana, cherry, date'))
print(re.split(r'(\d+)', 'abc123def456ghi')) # keep delimiters by capturing them
Output:
('bar bar baz bar', 3)
['apple', 'banana', 'cherry', 'date']
['abc', '123', 'def', '456', 'ghi']
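Both functions take an optional limit: `count=` for `sub`/`subn` and `maxsplit=` for `split`. A short sketch with invented inputs:

```python
import re

# count= caps the number of substitutions.
print(re.sub(r'foo', 'bar', 'foo foo foo', count=1))   # bar foo foo

# maxsplit= caps the number of splits; the remainder stays in the last item.
print(re.split(r'\s*,\s*', 'a, b ,c,d', maxsplit=2))   # ['a', 'b', 'c,d']

# subn's count answers "did anything change?" without comparing strings.
new_text, n = re.subn(r'\s+$', '', 'trailing spaces   ')
print(repr(new_text), n)   # 'trailing spaces' 1
```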
re.escape#
re.escape(s) returns s with every regex metacharacter prefixed by a backslash — essential whenever you need to embed user-supplied text or an unknown string into a regex.
import re
needle = 'price: $5.99'
pattern = re.escape(needle)
print(pattern)
print(re.search(pattern, 'the price: $5.99 today'))
Output:
price:\ \$5\.99
<re.Match object; span=(4, 16), match='price: $5.99'>
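The sketch below shows why escaping matters for user-supplied text. `contains_literal` is a hypothetical helper; the case-insensitive flag is only there to give a reason to use a regex instead of the `in` operator.

```python
import re

def contains_literal(haystack, needle):
    """True if `needle` occurs in `haystack` as literal text, ignoring case."""
    return re.search(re.escape(needle), haystack, re.IGNORECASE) is not None

print(bool(re.search('a.c', 'abc')))               # True: unescaped '.' matches any character
print(contains_literal('abc', 'a.c'))              # False
print(contains_literal('File A.C saved', 'a.c'))   # True
```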
Differences from PCRE#
Python’s re is very close to PCRE but not identical. The table below captures the differences you’ll actually hit; for the full PCRE syntax reference see linux/pcre.
| Feature | Python re | PCRE |
|---|---|---|
| Named group syntax | (?P<name>…) | (?<name>…) |
| Named backreference (pattern) | (?P=name) | \k<name> |
| Named backreference (replacement) | \g<name> | $name or \k<name> |
| \K (reset match start) | not supported | supported |
| Recursive patterns (?R) / (?1) | not supported | supported |
| Atomic groups (?>…) and possessive quantifiers *+, ++, ?+ | supported since 3.11 | supported |
| Variable-length lookbehind | not supported (fixed-length only) | PCRE2 supports it |
| Unicode property classes \p{…} | not supported (available in the regex package) | supported |
| Inline flag scoping (?i:…) | supported | supported |
| Conditionals (?(name)yes\|no) | supported | supported |
| Default Unicode | yes (\w matches Unicode letters) | depends on tool config |
[!TIP] If you need `\K`, recursive patterns, `\p{…}` classes, or variable-length lookbehind in Python (or atomic groups before 3.11), install the `regex` package (`pip install regex`); it is a near-superset of `re` with the same API.
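When porting a PCRE pattern that relies on `\K`, the standard module usually needs a small rewrite. The sketch below shows two common substitutes; the PCRE pattern appears only in a comment for comparison and is not executed.

```python
import re

text = 'price:   42 units'

# PCRE:  price:\s*\K\d+   (report only the digits)
# Substitute 1: capture the part you want and read group(1).
m = re.search(r'price:\s*(\d+)', text)
print(m.group(1))            # 42

# Substitute 2: a lookbehind, usable when the prefix has a fixed width.
m2 = re.search(r'(?<=id=)\d+', 'id=7')
print(m2.group())            # 7
```

The capture-group form works for any prefix; the lookbehind form keeps `.group()` clean but is limited by the fixed-length rule.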
Performance tips#
A short list of patterns that the engine optimizes well, plus the anti-patterns that cause catastrophic backtracking.
Anchor when you can#
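Patterns that start with ^, \A, or a literal prefix skip ahead quickly because the engine can fail fast without trying every position. The gain is small when a match exists near the start, as in the timing below, and grows on long non-matching input, where an unanchored .* is retried from every starting position.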
import re, timeit
PAT_NAIVE = re.compile(r'.*error')
PAT_FAST = re.compile(r'^.*error')
text = 'a' * 10_000 + 'error'
print(timeit.timeit(lambda: PAT_NAIVE.search(text), number=1000))
print(timeit.timeit(lambda: PAT_FAST.search(text), number=1000))
Output:
0.0185
0.0142
Avoid nested quantifiers#
(a+)+, (a|a)+, and (.*)+ cause exponential backtracking on non-matching input. Refactor to a single quantifier or use a possessive quantifier (3.11+).
import re
# DANGEROUS — exponential on non-match
# re.match(r'^(a+)+b$', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!')
# SAFE
print(re.match(r'^a+b$', 'aaaaaaaaaab'))
Output:
<re.Match object; span=(0, 11), match='aaaaaaaaaab'>
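For cases where the nesting cannot simply be flattened, a possessive quantifier is one option on Python 3.11+. The following is a sketch with a version guard, so older interpreters fall back to the single-quantifier form; the 40-character inputs are arbitrary.

```python
import re
import sys

text_bad = 'a' * 40 + '!'
text_ok = 'a' * 40 + 'b'

if sys.version_info >= (3, 11):
    # a++ never gives characters back, so the failing input is rejected
    # without exploring other ways to split the run of 'a's.
    SAFE = re.compile(r'^(?:a++)+b$')
else:
    # Fallback for older interpreters: collapse to a single quantifier.
    SAFE = re.compile(r'^a+b$')

print(SAFE.match(text_bad))               # None
print(SAFE.match(text_ok) is not None)    # True
```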
Use non-capturing groups (?:…)#
If you only need grouping for alternation or quantification, use (?:…) — it avoids the overhead of remembering the capture for later retrieval.
import re
# capturing — slower, populates .groups()
print(re.findall(r'(foo|bar)+', 'foofoobar'))
# non-capturing — faster, returns full matches
print(re.findall(r'(?:foo|bar)+', 'foofoobar'))
Output:
['bar']
['foofoobar']
Common pitfalls#
- Forgetting the raw-string prefix `r''`: `'\d'` works by accident (`\d` isn't a Python escape, and recent Python versions warn about it) but `'\b'` does not (`\b` is the backspace character). Always use `r'...'` for regex literals.
- Using `match` when you wanted `search`: `match` only checks the start of the string. Use `search` for "does this appear anywhere" or `fullmatch` for "is this the entire string".
- `findall` returns groups when the pattern has them: with one capture group it returns that group's text, with several it returns tuples, and in neither case the full match. Use non-capturing groups `(?:…)` to keep the full string, or switch to `finditer`.
- `.` doesn't match newlines by default: pass `re.S` (DOTALL) when scanning across line boundaries.
- `^` and `$` only match string ends by default: pass `re.M` (MULTILINE) for per-line anchoring.
- Greedy vs non-greedy `*`/`+`: `<.+>` on `<a><b>` matches the entire `<a><b>`. Use `<.+?>` for non-greedy, or `[^>]+` for a negated class (faster).
- Catastrophic backtracking: patterns like `(a+)+$` lock up on non-matching input. Refactor to a single quantifier, or use possessive quantifiers (`a++`) and atomic groups, available in `re` from 3.11 and in the `regex` package on older versions.
- Mixing bytes and str: a bytes pattern (`rb'\d'`) can only match a bytes input, and vice versa. Mismatching raises `TypeError`.
- Named groups must be unique: `(?P<a>x)(?P<a>y)` raises `re.error`. Use unique names or numeric groups.
- Variable-length lookbehind: `(?<=ab|abc)` is rejected because the alternatives differ in length. Either split into two patterns or switch to the `regex` library.
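Several of the pitfalls above are easy to reproduce. The sketch below demonstrates four of them on invented inputs; the TypeError message is printed rather than assumed, since its wording may vary between versions.

```python
import re

# 1. Raw strings: '\b' is a backspace character, r'\b' is a word boundary.
print(re.findall('\bcat\b', 'cat concat'))    # []
print(re.findall(r'\bcat\b', 'cat concat'))   # ['cat']

# 2. findall with groups returns the groups, not the full match.
print(re.findall(r'(\d+)-(\d+)', '10-20 30-40'))
# [('10', '20'), ('30', '40')]
print([m.group() for m in re.finditer(r'(\d+)-(\d+)', '10-20 30-40')])
# ['10-20', '30-40']

# 3. Greedy vs non-greedy.
print(re.findall(r'<.+>', '<a><b>'))    # ['<a><b>']
print(re.findall(r'<.+?>', '<a><b>'))   # ['<a>', '<b>']

# 4. A str pattern cannot search bytes.
try:
    re.search(r'\d', b'123')
except TypeError as exc:
    print('TypeError:', exc)
```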
Real-world recipes#
Parse a multi-line config file#
A .ini-style config parser using verbose regex for clarity. Handles section headers, key=value pairs, and # comments.
import re
CONFIG = re.compile(r"""
^\s*
(?:
\[ (?P<section>[^\]]+) \] # section header
| (?P<key>[\w.]+) \s* = \s* (?P<value>.*?) # key = value
| \# .* # comment
)?
\s* $
""", re.VERBOSE | re.MULTILINE)
text = """
# database config
[database]
host = localhost
port = 5432
[logging]
level = INFO
"""
current = None
config = {}
for m in CONFIG.finditer(text):
    if m['section']:
        current = m['section']
        config[current] = {}
    elif m['key'] and current is not None:
        config[current][m['key']] = m['value']
print(config)
Output:
{'database': {'host': 'localhost', 'port': '5432'}, 'logging': {'level': 'INFO'}}
Extract structured records from logs#
Pull timestamp, level, and message from each line of a typical log file using named groups.
import re
LINE = re.compile(r"""
^
(?P<ts>\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2})
\s+
(?P<level>DEBUG|INFO|WARN|ERROR)
\s+
(?P<msg>.*)
$
""", re.VERBOSE)
logs = [
'2026-05-25 12:01:03 INFO starting service',
'2026-05-25 12:01:04 ERROR connection refused',
'malformed line',
]
for line in logs:
    m = LINE.match(line)
    if m:
        print(m.groupdict())
Output:
{'ts': '2026-05-25 12:01:03', 'level': 'INFO', 'msg': 'starting service'}
{'ts': '2026-05-25 12:01:04', 'level': 'ERROR', 'msg': 'connection refused'}
Strip ANSI escape sequences#
Normalize colored terminal output before saving to a file.
import re
ANSI = re.compile(r'\x1b\[[0-9;]*m')
colored = '\x1b[31mERROR\x1b[0m: \x1b[1mfile\x1b[0m missing'
print(ANSI.sub('', colored))
Output:
ERROR: file missing
URL slug from title#
Lowercase, drop non-alphanumerics, collapse whitespace into a single hyphen.
import re
def slugify(title):
    s = title.lower()
    s = re.sub(r'[^a-z0-9\s-]', '', s)     # strip punctuation
    s = re.sub(r'\s+', '-', s)             # spaces → hyphens
    s = re.sub(r'-+', '-', s).strip('-')   # collapse hyphens
    return s
print(slugify('Hello, World! — Python 3.12 & re.X'))
Output:
hello-world-python-312-rex
Reformat phone numbers#
Normalize many input shapes to a single canonical format.
import re
PHONE = re.compile(r"""
    \D*                    # any leading non-digits
    (?: 1 \D* )?           # optional country code 1, so '+1 555 ...' also matches
    (?P<area>\d{3}) \D*
    (?P<prefix>\d{3}) \D*
    (?P<line>\d{4})
    \D*$
""", re.VERBOSE)
for raw in ['(555) 123-4567', '555.123.4567', '5551234567', '+1 555 123 4567']:
    m = PHONE.match(raw)
    if m:
        print(f"({m['area']}) {m['prefix']}-{m['line']}")
Output:
(555) 123-4567
(555) 123-4567
(555) 123-4567
(555) 123-4567
Find unmatched braces#
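Strip balanced {...} pairs with a negated character class, then count whatever { is left over. A single pass removes only innermost pairs, so the outer braces of a nested block (here {nested }) remain in the residue and are counted alongside the genuinely stray {.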
import re
# Match `{...}` blocks; collect text outside them, look for stray `{`
text = '{ok} stray { and {nested {inner}} {ok}'
balanced = re.sub(r'\{[^{}]*\}', '', text)
print('residue:', repr(balanced))
print('stray opens:', balanced.count('{'))
Output:
residue: ' stray { and {nested } '
stray opens: 2
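The single pass above leaves the outer braces of nested blocks in the residue. One way to handle nesting, sketched here under the assumption that braces are the only delimiters of interest, is to repeat the substitution until nothing changes. `stray_braces` is a hypothetical helper name.

```python
import re

def stray_braces(text):
    """Return (unmatched '{' count, unmatched '}' count) after removing balanced pairs."""
    pair = re.compile(r'\{[^{}]*\}')
    while True:
        text, n = pair.subn('', text)   # strip the innermost balanced pairs
        if n == 0:                      # nothing left to strip: fixpoint reached
            break
    return text.count('{'), text.count('}')

print(stray_braces('{ok} stray { and {nested {inner}} {ok}'))  # (1, 0)
print(stray_braces('a } b { c'))                               # (1, 1)
```

Using `subn` gives the loop its stopping condition directly, without comparing the string before and after each pass.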
Replace with a counter#
Number every occurrence of a pattern, using a closure as the sub callback.
import re
def numberer():
    n = 0
    def repl(m):
        nonlocal n
        n += 1
        return f'[{n}]{m.group()}'
    return repl
print(re.sub(r'\b\w+\b', numberer(), 'one two three four'))
Output:
[1]one [2]two [3]three [4]four
Tokenize source code#
A toy tokenizer using re.Scanner (a hidden gem in re).
import re
scanner = re.Scanner([
    (r'\d+', lambda s, t: ('NUM', int(t))),
    (r'[+\-*/=]', lambda s, t: ('OP', t)),  # '=' included so the assignment is tokenized
    (r'[a-zA-Z_]\w*', lambda s, t: ('IDENT', t)),
    (r'\s+', None),
])
tokens, remainder = scanner.scan('x = 10 + 20 * y')
print(tokens)
Output:
[('IDENT', 'x'), ('OP', '='), ('NUM', 10), ('OP', '+'), ('NUM', 20), ('OP', '*'), ('IDENT', 'y')]
[!NOTE]
`re.Scanner` is undocumented but stable since Python 2.4. For production tokenizers prefer `tokenize` (for Python source) or a dedicated lexer like `ply` or `lark`.
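If relying on an undocumented class is a concern, a similar toy tokenizer can be sketched with documented pieces only: one alternation of named groups, `finditer`, and `Match.lastgroup`. The token names and the `tokenize` function here are illustrative choices, not part of the re API.

```python
import re

TOKEN = re.compile(r"""
    (?P<NUM>\d+)
  | (?P<IDENT>[A-Za-z_]\w*)
  | (?P<OP>[+\-*/=])
  | (?P<SKIP>\s+)
  | (?P<ERROR>.)          # any other single character
""", re.VERBOSE)

def tokenize(src):
    """Yield (kind, value) pairs; raise ValueError on an unexpected character."""
    for m in TOKEN.finditer(src):
        kind = m.lastgroup            # name of the alternative that matched
        if kind == 'SKIP':
            continue
        if kind == 'ERROR':
            raise ValueError(f'unexpected {m.group()!r} at offset {m.start()}')
        value = int(m.group()) if kind == 'NUM' else m.group()
        yield kind, value

print(list(tokenize('x = 10 + 20 * y')))
# [('IDENT', 'x'), ('OP', '='), ('NUM', 10), ('OP', '+'), ('NUM', 20), ('OP', '*'), ('IDENT', 'y')]
```

The catch-all `ERROR` branch makes unknown input fail loudly instead of being skipped silently.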
See also#
- linux/pcre: the PCRE dialect used by `grep -P`, `ripgrep`, `nginx`, PHP
- linux/grep and linux/sed: shell tools that consume the same patterns
- javascript/regex: JavaScript regex differences, when you cross language boundaries
- `regex`: third-party drop-in with variable-length lookbehind, atomic groups, `\K`