re — Python Regular Expressions#

What it is#

re is Python’s standard-library regular-expression engine. It implements a near-PCRE dialect (named groups, lookaround, non-greedy quantifiers, Unicode-aware \w, \d, and \s) with a small number of Python-specific divergences: no \K, no recursive patterns, no \p{…} property classes, and named groups written as (?P<name>…) instead of PCRE’s (?<name>…). It is the engine Django uses for regex URL routes, that pandas uses for str.contains, and that most Python scripts reach for once a simple substring test (needle in haystack) is no longer enough.

The third-party regex module is largely API-compatible with re and adds features such as variable-length lookbehind, recursive patterns, and \p{…} Unicode properties; consider it if you need them. For comparison with the dialect used by grep -P, ripgrep, and nginx, see linux/pcre.

Install#

re is part of the CPython standard library — available everywhere Python is, with no install step.

python -c "import re; print(re.__doc__.splitlines()[0])"

Output:

Support for regular expressions (RE).

API overview#

The module exposes a small set of top-level functions that mirror methods on compiled pattern objects. The two are functionally identical — re.search(pat, s) is re.compile(pat).search(s) with internal caching.

Function	Method	Returns	Notes
`re.match(pat, s)`	`p.match(s)`	`Match` or `None`	anchors at start of string
`re.fullmatch(pat, s)`	`p.fullmatch(s)`	`Match` or `None`	must match the entire string
`re.search(pat, s)`	`p.search(s)`	`Match` or `None`	first match anywhere
`re.findall(pat, s)`	`p.findall(s)`	`list[str]` or `list[tuple]`	all non-overlapping matches
`re.finditer(pat, s)`	`p.finditer(s)`	iterator of `Match`	lazy — preferred for many matches
`re.sub(pat, repl, s)`	`p.sub(repl, s)`	`str`	substitute
`re.subn(pat, repl, s)`	`p.subn(repl, s)`	`(str, count)`	sub + count
`re.split(pat, s)`	`p.split(s)`	`list[str]`	split by pattern
`re.compile(pat, flags)`	—	`Pattern`	compile once, reuse
`re.escape(s)`	—	`str`	escape regex metachars in `s`
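
As a quick illustration of the function/method symmetry in the table, the sketch below runs the same search both ways. The pattern and the sample string are made up for the example.

```python
import re

text = 'order 66 shipped, order 67 pending'   # sample input (made up)
pattern = r'order (\d+)'

# Module-level call: re compiles the pattern (or fetches it from its cache) per call
via_module = re.search(pattern, text)

# Compiled pattern: parse once, then call the method directly
compiled = re.compile(pattern)
via_method = compiled.search(text)

print(via_module.group(1), via_method.group(1))   # 66 66
print(via_module.span() == via_method.span())     # True
print(compiled.findall(text))                     # ['66', '67']
```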

Compile vs not#

re.compile(pattern, flags=0) parses the pattern once and returns a reusable Pattern object. Use it whenever the same regex is applied repeatedly: inside a loop, in a hot function, or as a module-level constant. For one-off uses the module-level functions are fine because re keeps an internal cache of recently compiled patterns (up to 512 entries in current CPython).

import re

PHONE = re.compile(r'\d{3}-\d{3}-\d{4}')

for line in ['call 555-123-4567', 'no phone', 'two: 555-000-1111 and 555-222-3333']:
    for match in PHONE.finditer(line):
        print(match.group())

Output:

555-123-4567
555-000-1111
555-222-3333

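[!TIP] For a function called millions of times, compiling explicitly is measurably faster than relying on the cache, because every module-level call still pays for a cache lookup. The exact ratio depends on the pattern and the Python version, so time it on your own workload. For one-off matches, the difference is negligible.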

match vs search vs fullmatch vs findall vs finditer#

These five differ in where the regex engine starts looking and what it returns. Mixing them up is the most common re bug.

Function	Where it anchors	Returns
`match`	start of string only	first match (or `None`)
`fullmatch`	start AND end of string	first match (or `None`) — must cover all
`search`	anywhere in string	first match (or `None`)
`findall`	anywhere, all non-overlapping	list of strings (or tuples if there are groups)
`finditer`	anywhere, all non-overlapping	iterator of `Match` objects

import re
s = 'aaa 123 bbb 456'

print(re.match(r'\d+', s))                  # None — string starts with 'a'
print(re.search(r'\d+', s))                 # finds '123'
print(re.fullmatch(r'\d+', s))              # None: the string is not all digits
print(re.findall(r'\d+', s))                # all numeric runs
print(list(re.finditer(r'\d+', s)))         # same, as Match objects

Output:

None
<re.Match object; span=(4, 7), match='123'>
None
['123', '456']
[<re.Match object; span=(4, 7), match='123'>, <re.Match object; span=(12, 15), match='456'>]

[!WARNING] match doesn’t mean “match the whole string” — it means “match starting from position 0”. For end-anchored full matches, use fullmatch or wrap the pattern with \A...\Z.
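
To make the warning concrete, here is a small sketch of a validation check where match lets trailing junk through while fullmatch and explicit \A...\Z anchoring do not. The input string is invented for the example.

```python
import re

candidate = '2026-05-25 trailing junk'   # made-up input
DATE = r'\d{4}-\d{2}-\d{2}'

# match() is satisfied by a prefix, so trailing junk slips through
prefix_ok = re.match(DATE, candidate) is not None

# fullmatch() requires the pattern to cover the whole string
whole_ok = re.fullmatch(DATE, candidate) is not None

# Equivalent explicit anchoring with search()
anchored_ok = re.search(r'\A' + DATE + r'\Z', candidate) is not None

print(prefix_ok, whole_ok, anchored_ok)               # True False False
print(re.fullmatch(DATE, '2026-05-25') is not None)   # True
```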

The Match object#

A successful match, search, or fullmatch returns a Match object — never a string. To get the matched text use .group(), to get capture groups use .group(n) or .groups(), and to get the position use .span().

import re
m = re.search(r'(\w+)=(\d+)', 'value=42 elsewhere')

print(m.group())              # full match
print(m.group(0))             # same as .group()
print(m.group(1))             # first capture
print(m.group(2))             # second capture
print(m.groups())             # all captures as tuple
print(m.span())               # (start, end)
print(m.start(), m.end())

Output:

value=42
value=42
value
42
('value', '42')
(0, 8)
0 8
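
Because a failed search returns None rather than a Match, calling .group() on the result without a check raises AttributeError. The helper below is a minimal sketch of the usual guard; parse_pair is a hypothetical name, not part of re.

```python
import re

PAIR = re.compile(r'(\w+)=(\d+)')

def parse_pair(text):
    """Return (key, int_value) for the first key=value pair, or None."""
    m = PAIR.search(text)
    if m is None:            # no match: m.group() here would raise AttributeError
        return None
    return m[1], int(m[2])   # m[n] is shorthand for m.group(n)

print(parse_pair('value=42 elsewhere'))   # ('value', 42)
print(parse_pair('nothing to see'))       # None
```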

Named groups#

(?P<name>...) captures into a named slot accessible via .group('name') or .groupdict(). Named groups make patterns, match-handling code, and back-references in sub self-documenting; prefer them over numeric groups whenever there are more than two captures.

import re

DATE = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

m = DATE.search('shipped on 2026-05-25 to customer 42')
print(m.group('year'), m.group('month'), m.group('day'))
print(m.groupdict())

Output:

2026 05 25
{'year': '2026', 'month': '05', 'day': '25'}

In a sub replacement, refer to named groups with \g<name> (or \1/\2 for numeric).

import re
print(re.sub(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})',
             r'\g<day>/\g<month>/\g<year>',
             '2026-05-25'))

Output:

25/05/2026

Backreferences#

\1, \2, … inside the pattern match the same text that the corresponding capture group did. Use them to find repeats, palindromes, or matching delimiters.

import re

# Find adjacent duplicated words
print(re.findall(r'\b(\w+) \1\b', 'the the quick brown fox fox jumps'))

# Match content wrapped in matching quote chars (single OR double)
m = re.search(r"(['\"])(.*?)\1", 'set name="alice" and age=30')
print(m.group(2))

Output:

['the', 'fox']
alice
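
Numbered backreferences also have a named form inside the pattern, (?P=name). The sketch below redoes the matching-quotes example with named groups; the sample strings are invented.

```python
import re

# (?P=quote) must repeat whatever the group named "quote" captured
QUOTED = re.compile(r"""(?P<quote>['"])(?P<body>.*?)(?P=quote)""")

samples = ['say "hi" now', "use 'ok' here", 'mismatched "quote\'']
for s in samples:
    m = QUOTED.search(s)
    print(m['body'] if m else None)
# hi
# ok
# None
```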

Lookaround — `(?=…)`, `(?!…)`, `(?<=…)`, `(?<!…)`#

Lookaround assertions test whether something does or does not appear adjacent to the current position without consuming characters. Python’s re supports all four variants. The lookbehind must be fixed-length (use the regex library if you need variable-length).

import re

# Match price digits only when followed by USD (lookahead)
print(re.findall(r'\d+(?=\s*USD)', 'cost: 100 USD or 50 EUR or 30 USD'))

# Match whole numbers NOT preceded by $ (negative lookbehind);
# \d in the class stops a match from starting in the middle of '100'
print(re.findall(r'(?<![$\d])\d+', 'price $100 quantity 50 items'))

# Identifiers not followed by a paren — "variables, not function calls"
print(re.findall(r'\b\w+\b(?!\s*\()', 'foo() bar baz() qux'))

Output:

['100', '30']
['50']
['bar', 'qux']
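
Because lookarounds consume nothing, they can also match an empty position between characters. As a sketch of that idea, the pattern below combines a lookbehind and a lookahead to find the spots where a thousands separator belongs; group_digits is a hypothetical helper name.

```python
import re

# Match the empty position that has a digit before it and a multiple of
# three digits (followed by a non-digit or the end) after it.
THOUSANDS = re.compile(r'(?<=\d)(?=(?:\d{3})+(?!\d))')

def group_digits(text):
    # Substituting at a zero-width match inserts text without removing any
    return THOUSANDS.sub(',', text)

print(group_digits('1234567'))             # 1,234,567
print(group_digits('total 98765 units'))   # total 98,765 units
print(group_digits('999'))                 # 999
```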

Flags#

Flags change how the regex engine interprets the pattern. They can be passed as the flags= keyword argument or embedded inline with (?aiLmsux) at the start of the pattern (or (?i:...) to scope to a group).

Flag	Short	Effect
`re.IGNORECASE`	`re.I`	case-insensitive matching
`re.MULTILINE`	`re.M`	`^` and `$` match at every line boundary
`re.DOTALL`	`re.S`	`.` matches newlines too
`re.VERBOSE`	`re.X`	ignore whitespace and `#` comments inside the pattern
`re.ASCII`	`re.A`	`\w`, `\d`, `\s` match ASCII only (not Unicode)
`re.UNICODE`	`re.U`	default in Python 3 — explicit is rarely needed
`re.DEBUG`	—	print compiled-pattern debug info

Combine flags with |:

import re

pat = re.compile(r'^error: (.+)$', re.I | re.M)
log = """
INFO: started
ERROR: out of memory
WARN: retrying
Error: connection lost
"""
print(pat.findall(log))

Output:

['out of memory', 'connection lost']
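
The inline forms mentioned above can be sketched the same way: (?im) at the start of the pattern is equivalent to passing re.I | re.M, while (?i:...) limits the flag to one group. The log text is made up.

```python
import re

text = 'Error: disk full\nerror: retrying\nERROR: giving up'   # made-up log

# Global inline flags: (?im) is the same as passing re.I | re.M
inline = re.findall(r'(?im)^error: (.+)$', text)

# Scoped flag: only the word "error" is case-insensitive; the rest is not
scoped = re.findall(r'(?i:error): disk', text)
scoped_miss = re.findall(r'(?i:error): DISK', text)

print(inline)        # ['disk full', 'retrying', 'giving up']
print(scoped)        # ['Error: disk']
print(scoped_miss)   # []
```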

Verbose mode (`re.X`)#

re.VERBOSE (a.k.a. re.X) tells the engine to ignore unescaped whitespace and #-comments inside the pattern, allowing you to format regexes like prose. This is the single most useful feature for keeping non-trivial regexes maintainable.

import re

# A semver-like version pattern, commented and formatted
VERSION = re.compile(r"""
    ^                       # start of string
    v?                      # optional leading 'v'
    (?P<major>\d+) \.
    (?P<minor>\d+) \.
    (?P<patch>\d+)
    (?: - (?P<pre>[\w.]+) )?   # optional pre-release tag
    (?: \+ (?P<build>[\w.]+) )?  # optional build metadata
    $
""", re.VERBOSE)

for s in ['1.2.3', 'v1.2.3-beta.4+exp.sha.5114f85', 'not a version']:
    m = VERSION.match(s)
    print(s, '→', m.groupdict() if m else None)

Output:

1.2.3 → {'major': '1', 'minor': '2', 'patch': '3', 'pre': None, 'build': None}
v1.2.3-beta.4+exp.sha.5114f85 → {'major': '1', 'minor': '2', 'patch': '3', 'pre': 'beta.4', 'build': 'exp.sha.5114f85'}
not a version → None

[!TIP] Inside a verbose pattern, write a literal whitespace as \ (escaped space) or [ ] (a class), and a literal # as \#.
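
A small sketch of what the tip guards against: in the second pattern the bare # starts a comment, so the capture group silently disappears. The patterns and input are invented for illustration.

```python
import re

# In verbose mode, bare spaces and "#..." are ignored, so literals must be
# escaped or placed in a character class.
TAG = re.compile(r"""
    issue [ ] \# (?P<num>\d+)    # [ ] is a literal space, \# a literal hash
""", re.VERBOSE)

NAIVE = re.compile(r"""
    issue # (?P<num>\d+)
""", re.VERBOSE)   # everything after the bare "#" is a comment: pattern is just "issue"

print(TAG.search('see issue #123')['num'])      # 123
print(NAIVE.search('see issue #123').group())   # issue
print(NAIVE.groups)                             # 0 (the group was commented out)
```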

re.sub with a callback#

re.sub(pattern, repl, string) accepts either a replacement string (with \1, \g<name> etc.) or a callable that takes a Match and returns a replacement string. The callable form is the cleanest way to do conditional, computed, or stateful substitutions.

import re

# Uppercase every word longer than three characters
def capitalize_long(m):
    word = m.group()
    return word.upper() if len(word) > 3 else word

print(re.sub(r'\b\w+\b', capitalize_long, 'the quick brown fox'))

# Increment every number in a string
print(re.sub(r'\d+', lambda m: str(int(m.group()) + 1), 'a=1 b=2 c=99'))

# Mask emails
print(re.sub(
    r'(?P<name>[\w.]+)@(?P<domain>[\w.]+)',
    lambda m: f"{m['name'][0]}***@{m['domain']}",
    'contact alice@example.com or bob@example.com',
))

Output:

the QUICK BROWN fox
a=2 b=3 c=100
contact a***@example.com or b***@example.com
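
Another common use of the callable form is a table-driven replacement, where the callback looks each match up in a dict. The sketch below assumes a made-up abbreviation table; EXPANSIONS and expand are hypothetical names.

```python
import re

# Hypothetical abbreviation table for the example
EXPANSIONS = {'db': 'database', 'cfg': 'configuration', 'svc': 'service'}

# One alternation built from the keys, escaped in case a key has metacharacters
ABBREV = re.compile(r'\b(?:' + '|'.join(map(re.escape, EXPANSIONS)) + r')\b')

def expand(text):
    # The callback looks up each matched abbreviation in the table
    return ABBREV.sub(lambda m: EXPANSIONS[m.group()], text)

print(expand('restart the db svc after editing cfg'))
# restart the database service after editing configuration
print(expand('dbx is untouched'))
# dbx is untouched
```

A single pass with one compiled alternation avoids the ordering problems of chaining several str.replace calls, where an earlier replacement can be re-matched by a later one.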

re.subn and re.split#

subn is sub that also returns a count of substitutions made — convenient for “did anything change?” checks. split is the regex-aware equivalent of str.split, useful when the delimiter is a pattern, not a literal.

import re

print(re.subn(r'foo', 'bar', 'foo foo baz foo'))
print(re.split(r'\s*,\s*', 'apple , banana, cherry,  date'))
print(re.split(r'(\d+)', 'abc123def456ghi'))      # keep delimiters by capturing them

Output:

('bar bar baz bar', 3)
['apple', 'banana', 'cherry', 'date']
['abc', '123', 'def', '456', 'ghi']
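
Both functions also accept a limit: maxsplit for split and count for sub/subn. A brief sketch with an invented input line, passing the limits as keyword arguments:

```python
import re

line = 'key: value: with: colons'   # made-up input

# maxsplit=1 splits only at the first delimiter
print(re.split(r'\s*:\s*', line, maxsplit=1))   # ['key', 'value: with: colons']

# count=1 replaces only the first occurrence; subn reports how many were made
print(re.subn(r':', ' =', line, count=1))       # ('key = value: with: colons', 1)

# A count of zero means nothing changed
cleaned, n = re.subn(r'\t', ' ', line)
print(n == 0)                                   # True
```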

re.escape#

re.escape(s) returns s with every regex metacharacter prefixed by a backslash — essential whenever you need to embed user-supplied text or an unknown string into a regex.

import re

needle = 'price: $5.99'
pattern = re.escape(needle)
print(pattern)
print(re.search(pattern, 'the price: $5.99 today'))

Output:

price:\ \$5\.99
<re.Match object; span=(4, 16), match='price: $5.99'>
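
A typical place for re.escape is a search box, where the user's term should be treated as literal text. The sketch below assumes a hypothetical highlight helper that wraps hits in ** markers.

```python
import re

def highlight(term, text):
    """Wrap every literal occurrence of term in ** markers (case-insensitive)."""
    # Without re.escape, characters such as "+" and "." in the term would be
    # treated as regex syntax instead of literal text
    pattern = re.compile(re.escape(term), re.IGNORECASE)
    return pattern.sub(lambda m: f'**{m.group()}**', text)

print(highlight('c++', 'I write C++ and c++ daily'))
# I write **C++** and **c++** daily
print(highlight('a.b', 'a.b matches, axb does not'))
# **a.b** matches, axb does not
```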

Differences from PCRE#

Python’s re is very close to PCRE but not identical. The table below captures the differences you’ll actually hit; for the full PCRE syntax reference see linux/pcre.

Feature	Python `re`	PCRE
Named group syntax	`(?P<name>…)`	`(?<name>…)`
Named backreference (pattern)	`(?P=name)`	`\k<name>`
Named backreference (replacement)	`\g<name>`	`$name` or `\k<name>`
`\K` (reset match start)	not supported	supported
Recursive patterns `(?R)` / `(?1)`	not supported	supported
Atomic groups `(?>…)`	supported since 3.11 (together with possessive `*+`, `++`, `?+`)	supported
Variable-length lookbehind	not supported (fixed-length only)	PCRE2 supports it
Unicode property classes `\p{…}`	not supported (available in the third-party `regex` package)	supported
Inline flag scoping `(?i:…)`	supported	supported
Conditionals `(?(name)yes\|no)`	supported	supported
Default Unicode	yes (`\w` matches Unicode letters)	depends on tool config

[!TIP] If you need \K, \p{…}, recursion, or variable-length lookbehind in Python (or atomic groups on a Python older than 3.11), install the regex package (pip install regex). It is a near-superset of re with the same API.
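
When staying on the standard library, the effect of \K can usually be approximated. The sketch below shows three common workarounds on an invented string; none is an exact equivalent of \K in every situation.

```python
import re

text = 'price=100 cost=250'   # made-up input

# PCRE would write: price=\K\d+  (match the digits, drop the "price=" prefix)

# Option 1: a fixed-length lookbehind keeps the prefix out of the match
print(re.search(r'(?<=price=)\d+', text).group())     # 100

# Option 2: capture the part you want; works for variable-length prefixes too
print(re.search(r'\w+\s*=\s*(\d+)', text).group(1))   # 100

# Option 3: in a substitution, re-emit the prefix with a backreference
print(re.sub(r'(price=)\d+', r'\g<1>0', text))        # price=0 cost=250
```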

Performance tips#

A short list of patterns that the engine optimizes well, plus the anti-patterns that cause catastrophic backtracking.

Anchor when you can#

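Patterns that start with ^, \A, or a literal prefix skip ahead quickly because the engine can fail fast without trying every position. The gain shows up mainly on input that does not match, where an unanchored pattern is retried from each start position before search gives up. When the match succeeds at position 0, as in the timing below, the difference is small.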

import re, timeit

PAT_NAIVE  = re.compile(r'.*error')
PAT_FAST   = re.compile(r'^.*error')

text = 'a' * 10_000 + 'error'
print(timeit.timeit(lambda: PAT_NAIVE.search(text),  number=1000))
print(timeit.timeit(lambda: PAT_FAST.search(text),   number=1000))

Output (illustrative; timings vary by machine and Python version):

0.0185
0.0142

Avoid nested quantifiers#

(a+)+, (a|a)+, and (.*)+ cause exponential backtracking on non-matching input. Refactor to a single quantifier or use a possessive quantifier (3.11+).

import re

# DANGEROUS — exponential on non-match
# re.match(r'^(a+)+b$', 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa!')

# SAFE
print(re.match(r'^a+b$', 'aaaaaaaaaab'))

Output:

<re.Match object; span=(0, 11), match='aaaaaaaaaab'>
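
The possessive alternative can be sketched as follows. Possessive quantifiers and atomic groups need Python 3.11 or newer, so that part is guarded by a version check; the subject string is chosen to make the nested-quantifier version slow.

```python
import re
import sys

subject = 'a' * 30 + '!'   # input that makes ^(a+)+b$ backtrack exponentially

# Single quantifier: fails in linear time
print(re.match(r'^a+b$', subject))               # None

# Possessive and atomic forms need Python 3.11 or newer
if sys.version_info >= (3, 11):
    # a++ never gives characters back, so there is nothing to backtrack into
    print(re.match(r'^(?:a++)+b$', subject))     # None
    # (?>...) is the atomic-group spelling of the same idea
    print(re.match(r'^(?>a+)+b$', subject))      # None
```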

Use non-capturing groups `(?:…)`#

If you only need grouping for alternation or quantification, use (?:…) — it avoids the overhead of remembering the capture for later retrieval.

import re
# capturing — slower, populates .groups()
print(re.findall(r'(foo|bar)+', 'foofoobar'))
# non-capturing — faster, returns full matches
print(re.findall(r'(?:foo|bar)+', 'foofoobar'))

Output:

['bar']
['foofoobar']

Common pitfalls#

Forgetting the raw-string prefix r'': '\d' works by accident (\d isn’t a Python escape, though Python 3.12+ emits a SyntaxWarning for it) but '\b' does not (\b is the backspace character). Always use r'...' for regex literals.
Using match when you wanted search: match only checks the start of the string. Use search for “does this appear anywhere” or fullmatch for “is this the entire string”.
findall returns groups, not matches, when the pattern has groups: with one capture group it returns a list of that group’s text, with several it returns a list of tuples, and the full match is dropped either way. Use non-capturing groups (?:…) to keep the full string, or switch to finditer.
. doesn’t match newlines by default: pass re.S (DOTALL) when scanning across line boundaries.
^ and $ only match string ends by default ($ also matches just before a trailing newline): pass re.M (MULTILINE) for per-line anchoring.
Greedy vs non-greedy */+: <.+> on <a><b> matches the entire string <a><b>. Use <.+?> for non-greedy, or <[^>]+> with a negated class (faster).
Catastrophic backtracking: patterns like (a+)+$ lock up on non-matching input. Refactor to a single quantifier, or use possessive quantifiers (a++) or atomic groups ((?>…)), available in re from Python 3.11 and in the regex package.
Mixing bytes and str: a bytes pattern (rb'\d') can only match a bytes input, and vice versa. Mismatching raises TypeError.
Named groups must be unique: (?P<a>x)(?P<a>y) raises re.error. Use unique names or numeric groups.
Variable-length lookbehind: (?<=ab|abc) is rejected because the alternatives differ in length. Either split into two patterns or switch to the regex library.
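
Several of these pitfalls are easy to reproduce in a few lines. The snippet below is a sketch with made-up inputs; the exact wording of the exception messages varies by Python version, so it is not shown.

```python
import re

# Raw strings: '\b' is a backspace character, r'\b' is a word boundary
print(len('\b'), len(r'\b'))                    # 1 2
print(re.findall('\bcat\b', 'cat concat'))      # []
print(re.findall(r'\bcat\b', 'cat concat'))     # ['cat']

# findall with one group returns the group, not the whole match
print(re.findall(r'(\d+)px', '10px 20px'))      # ['10', '20']
print(re.findall(r'(?:\d+)px', '10px 20px'))    # ['10px', '20px']

# Greedy vs non-greedy vs negated class
print(re.findall(r'<.+>', '<a><b>'))            # ['<a><b>']
print(re.findall(r'<.+?>', '<a><b>'))           # ['<a>', '<b>']
print(re.findall(r'<[^>]+>', '<a><b>'))         # ['<a>', '<b>']

# Mixing a bytes pattern with a str input raises TypeError
try:
    re.search(rb'\d', 'abc1')
except TypeError as exc:
    print('TypeError:', exc)

# Duplicate group names are rejected at compile time
try:
    re.compile(r'(?P<a>x)(?P<a>y)')
except re.error as exc:
    print('re.error:', exc)
```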

Real-world recipes#

Parse a multi-line config file#

A .ini-style config parser using verbose regex for clarity. Handles section headers, key=value pairs, and # comments.

import re

CONFIG = re.compile(r"""
    ^\s*
    (?:
        \[ (?P<section>[^\]]+) \]            # section header
      | (?P<key>[\w.]+) \s* = \s* (?P<value>.*?)  # key = value
      | \# .*                                # comment
    )?
    \s* $
""", re.VERBOSE | re.MULTILINE)

text = """
# database config
[database]
host = localhost
port = 5432

[logging]
level = INFO
"""

current = None
config = {}
for m in CONFIG.finditer(text):
    if m['section']:
        current = m['section']
        config[current] = {}
    elif m['key'] and current is not None:
        config[current][m['key']] = m['value']

print(config)

Output:

{'database': {'host': 'localhost', 'port': '5432'}, 'logging': {'level': 'INFO'}}

Extract structured records from logs#

Pull timestamp, level, and message from each line of a typical log file using named groups.

import re

LINE = re.compile(r"""
    ^
    (?P<ts>\d{4}-\d{2}-\d{2}\ \d{2}:\d{2}:\d{2})
    \s+
    (?P<level>DEBUG|INFO|WARN|ERROR)
    \s+
    (?P<msg>.*)
    $
""", re.VERBOSE)

logs = [
    '2026-05-25 12:01:03 INFO  starting service',
    '2026-05-25 12:01:04 ERROR connection refused',
    'malformed line',
]
for line in logs:
    m = LINE.match(line)
    if m:
        print(m.groupdict())

Output:

{'ts': '2026-05-25 12:01:03', 'level': 'INFO', 'msg': 'starting service'}
{'ts': '2026-05-25 12:01:04', 'level': 'ERROR', 'msg': 'connection refused'}

Strip ANSI escape sequences#

Normalize colored terminal output before saving to a file.

import re

ANSI = re.compile(r'\x1b\[[0-9;]*m')
colored = '\x1b[31mERROR\x1b[0m: \x1b[1mfile\x1b[0m missing'
print(ANSI.sub('', colored))

Output:

ERROR: file missing

URL slug from title#

Lowercase, drop non-alphanumerics, collapse whitespace into a single hyphen.

import re

def slugify(title):
    s = title.lower()
    s = re.sub(r'[^a-z0-9\s-]', '', s)         # strip punctuation
    s = re.sub(r'\s+', '-', s)                  # spaces → hyphens
    s = re.sub(r'-+', '-', s).strip('-')        # collapse hyphens
    return s

print(slugify('Hello, World! — Python 3.12 & re.X'))

Output:

hello-world-python-312-rex

Reformat phone numbers#

Normalize many input shapes to a single canonical format.

import re

PHONE = re.compile(r"""
    \D*                              # any leading non-digits
    (?: 1 \D+ )?                     # optional country code 1, as in '+1 555 ...'
    (?P<area>\d{3}) \D*
    (?P<prefix>\d{3}) \D*
    (?P<line>\d{4})
    \D*$
""", re.VERBOSE)

for raw in ['(555) 123-4567', '555.123.4567', '5551234567', '+1 555 123 4567']:
    m = PHONE.match(raw)
    if m:
        print(f"({m['area']}) {m['prefix']}-{m['line']}")

Output:

(555) 123-4567
(555) 123-4567
(555) 123-4567
(555) 123-4567

Find unmatched braces#

Repeatedly delete innermost balanced {…} pairs using a negated character class until nothing changes; any brace left in the residue is unmatched.

import re

# Strip innermost `{...}` blocks until none remain, then look for stray `{`
text = '{ok} stray { and {nested {inner}} {ok}'
residue = text
while True:
    residue, n = re.subn(r'\{[^{}]*\}', '', residue)
    if n == 0:        # nothing left to strip
        break
print('residue:', repr(residue))
print('stray opens:', residue.count('{'))

Output:

residue: ' stray { and  '
stray opens: 1

Replace with a counter#

Number every occurrence of a pattern, using a closure as the sub callback.

import re

def numberer():
    n = 0
    def repl(m):
        nonlocal n
        n += 1
        return f'[{n}]{m.group()}'
    return repl

print(re.sub(r'\b\w+\b', numberer(), 'one two three four'))

Output:

[1]one [2]two [3]three [4]four

Tokenize source code#

A toy tokenizer using re.Scanner (a hidden gem in re).

import re

scanner = re.Scanner([
    (r'\d+',           lambda s, t: ('NUM', int(t))),
    (r'[+\-*/=]',      lambda s, t: ('OP', t)),      # '=' included so the assignment scans fully
    (r'[a-zA-Z_]\w*',  lambda s, t: ('IDENT', t)),
    (r'\s+',           None),                         # None action: skip whitespace
])

tokens, remainder = scanner.scan('x = 10 + 20 * y')
print(tokens)

Output:

[('IDENT', 'x'), ('OP', '='), ('NUM', 10), ('OP', '+'), ('NUM', 20), ('OP', '*'), ('IDENT', 'y')]

[!NOTE] re.Scanner is undocumented but stable since Python 2.4. For production tokenizers prefer tokenize (for Python source) or a dedicated lexer like ply or lark.
