Python | Regular expression operations (the re module): usage notes and examples for each function

Official documentation: https://docs.python.org/zh-cn/3/library/re.html

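4 functions that match a regular expression and return the match result

Python's re module provides the following 4 functions. Each one checks whether the regular expression pattern matches at some position in the string string, and returns (or yields) the result as a Match object.

Note that, for all 4 functions, finding no position in the string where the pattern matches is not the same as finding a zero-length match at some position: the former produces None (or no items at all), while the latter produces a Match object whose matched text is empty.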
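A minimal sketch of this distinction, using patterns and input made up for illustration: `x*` can match zero characters and therefore succeeds with an empty match, while `x+` finds nothing.

```python
import re

text = "abc"

# "x*" can match zero characters, so it succeeds with an empty match at index 0
empty_match = re.search(r"x*", text)

# "x+" needs at least one "x", so there is no match at all
no_match = re.search(r"x+", text)

print(empty_match)  # <re.Match object; span=(0, 0), match=''>
print(no_match)  # None
```

A Match object is always truthy, even when the matched text is empty, so `if match:` tells the two cases apart.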

`re.search(pattern, string, flags=0)`

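Scans through the entire string looking for the first position where the regular expression pattern produces a match, and returns the corresponding Match object. Returns None if no position in the string matches the pattern. For example: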

import re

# The string to search
text = "Hello, how are you today?"

# The regular expression to look for
word = "y.*u"

# Use re.search to find the first position where the pattern matches
match = re.search(word, text)

print(match)  # Output: <re.Match object; span=(15, 18), match='you'>

`re.match(pattern, string, flags=0)`

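If zero or more characters at the beginning of string match the regular expression pattern, returns the corresponding Match object. Returns None if the string does not match the pattern.

Note also that, even in multiline mode, re.match() only matches at the beginning of the string, not at the beginning of each line. To match at the beginning of every line, use search() together with ^ instead. A basic example of re.match():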

import re

# The string to match against
text = "Hello, World!"

# The regular expression to look for
pattern = r"Hel+o"

# Use re.match to match at the beginning of the string
match = re.match(pattern, text)

print(match)  # Output: <re.Match object; span=(0, 5), match='Hello'>
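The multiline behaviour described above can be sketched as follows. The two-line input is made up for illustration; the point is only that re.match() ignores line starts while re.search() with ^ and re.MULTILINE does not.

```python
import re

text = "first line\nsecond line"

# re.match only tries position 0, even with re.MULTILINE
m1 = re.match(r"second", text, flags=re.MULTILINE)

# re.search with ^ and re.MULTILINE can match at the start of any line
m2 = re.search(r"^second", text, flags=re.MULTILINE)

print(m1)  # None
print(m2)  # <re.Match object; span=(11, 17), match='second'>
```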

`re.fullmatch(pattern, string, flags=0)`

If the whole string matches the regular expression pattern, returns the corresponding Match object. Returns None if the string does not match the pattern. For example:

import re

# The strings to check
string1 = "Hello123"
string2 = "Hello World"
string3 = "Hello_123"

# The regular expression to look for (with fullmatch, the ^ and $ anchors are redundant but harmless)
pattern = r"^[A-Za-z0-9_]+$"

# Use re.fullmatch to match each whole string
match1 = re.fullmatch(pattern, string1)
match2 = re.fullmatch(pattern, string2)
match3 = re.fullmatch(pattern, string3)

print(match1)  # Output: <re.Match object; span=(0, 8), match='Hello123'>
print(match2)  # Output: None
print(match3)  # Output: <re.Match object; span=(0, 9), match='Hello_123'>

`re.finditer(pattern, string, flags=0)`

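Returns an iterator yielding Match objects over all non-overlapping matches of the regular expression pattern in string. The string is scanned from left to right, and matches are returned in the order found.

Empty matches are included in the result. For example: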

import re

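# The string to search
string = "I have 3 apples and 5 oranges."

# The regular expression to look for
pattern = r"\d+"

# Use re.finditer to find every match
matches = re.finditer(pattern, string)

for match in matches:  # Iterate over the matches
    print(match.group())

The example above prints:

3
5

Changed in Python 3.7: a non-empty match can now start immediately after a previous empty match.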
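To see that empty matches really are part of the result, here is a small sketch with a pattern that can match the empty string. The pattern and input are chosen for illustration, and the spans shown assume Python 3.7 or later.

```python
import re

# "\d*" matches a run of digits, or the empty string where there is no digit
spans = [m.span() for m in re.finditer(r"\d*", "a1b")]

print(spans)  # [(0, 0), (1, 2), (2, 2), (3, 3)]
```

Only the span (1, 2) is a non-empty match (the digit "1"); the other three are zero-length matches.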

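Match objects (`re.Match`)

All 4 functions above return their results as re.Match objects. A Match object has the following commonly used methods and attributes:

`Match.expand(template)`

Performs backslash substitution on template, replacing escapes such as \1 or \g<name> with the contents of the corresponding matched group, and returns the resulting string. For example: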

import re

pattern = r"(\d{4})-(\d{2})-(\d{2})"
text = "Today is 2023-09-17"

match = re.search(pattern, text)
if match:
    new_text = match.expand(r"Year \1, month \2, day \3")
    print(new_text)  # Output: Year 2023, month 09, day 17

`Match.group([group1, ...])`

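Returns one or more subgroups of the match. With a single argument, the result is a single string; with multiple arguments, the result is a tuple with one item per argument; with no arguments, the whole match is returned.

Passing 0 as the argument also returns the whole match.

For a match whose regular expression marks subgroups with the () syntax, integer arguments retrieve the corresponding subgroups. For example: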

import re

pattern = r"(\d{4})-(\d{2})-(\d{2})"
date = "2023-09-07"

match = re.match(pattern, date)
if match:
    year = match.group(1)
    month = match.group(2)
    day = match.group(3)
    print(f"Year: {year}, Month: {month}, Day: {day}")  # Output: Year: 2023, Month: 09, Day: 07

For a match whose regular expression marks subgroups with the (?P<name>...) syntax, string arguments retrieve the corresponding subgroups by name. For example:

import re

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
date = "2023-09-07"

match = re.match(pattern, date)
if match:
    year = match.group("year")
    month = match.group("month")
    day = match.group("day")
    print(f"Year: {year}, Month: {month}, Day: {day}")  # Output: Year: 2023, Month: 09, Day: 07

With no arguments, group() returns the entire matched text as a string. For example:

import re

pattern = r"(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})"
date = "2023-09-07"

match = re.match(pattern, date)
if match:
    matched_string = match.group()
    print(matched_string)  # Output: 2023-09-07
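The multi-argument form mentioned above has no example yet, so here is a brief sketch that reuses the same date pattern. The variable names are made up for illustration.

```python
import re

match = re.match(r"(\d{4})-(\d{2})-(\d{2})", "2023-09-07")

# Several arguments give a tuple with one item per requested group
year_and_day = match.group(1, 3)

# group(0) and group() both return the whole match
whole_match_is_same = match.group(0) == match.group()

print(year_and_day)  # ('2023', '07')
print(whole_match_is_same)  # True
```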

`Match.__getitem__(g)`

Equivalent to Match.group(g), so a single group can be accessed with the subscript syntax m[g].
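A short sketch of the subscript syntax, with a pattern and input made up for illustration:

```python
import re

m = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2023-09")

print(m[0])  # 2023-09
print(m[1])  # 2023
print(m["month"])  # 09
```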

`Match.groups(default=None)`

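Returns a tuple containing all the subgroups of the match.

When default is None, a subgroup that did not participate in the match is returned as None. For example: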

import re

pattern = r"(\d{3})-(\d{3})-(\d{3})?"
text = "Phone numbers: 123-456-"

match = re.search(pattern, text)

# Get all groups of the match
groups_without_default = match.groups()

print(groups_without_default)  # Output: ('123', '456', None)

When default is not None, subgroups that did not participate in the match are replaced by the value of default. For example:

import re

pattern = r"(\d{3})-(\d{3})-(\d{3})?"
text = "Phone numbers: 123-456-"

match = re.search(pattern, text)

# Get all groups of the match; groups that matched nothing are replaced by the default value "-"
groups_with_default = match.groups("-")

print(groups_with_default)  # Output: ('123', '456', '-')

`Match.groupdict(default=None)`

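Returns a dictionary containing all the named subgroups of the match, keyed by group name. When default is None, a subgroup that did not participate in the match maps to None; when default is not None, such subgroups map to the value of default instead. For example:

import re

pattern = r"(?P<first_name>\w+) (?P<last_name>\w+)"
text = "John Doe"

match = re.search(pattern, text)

# Use groupdict() to get a dictionary of the named groups
name_dict = match.groupdict()

print(name_dict)  # Output: {'first_name': 'John', 'last_name': 'Doe'}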
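The effect of the default parameter only shows up when a named group does not take part in the match. A small sketch, with a hypothetical pattern whose last name is optional:

```python
import re

# The last name is optional, so the "last" group may not take part in the match
pattern = r"(?P<first>\w+)(?: (?P<last>\w+))?"
match = re.fullmatch(pattern, "Madonna")

without_default = match.groupdict()
with_default = match.groupdict(default="")

print(without_default)  # {'first': 'Madonna', 'last': None}
print(with_default)  # {'first': 'Madonna', 'last': ''}
```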

`Match.start([group])`、`Match.end([group])`

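Return the start and end indices, in the original string, of the substring matched by group. When the group took part in the match, the slice m.string[m.start(g):m.end(g)] of the original string yields the matched substring and is equivalent to m.group(g). group defaults to 0, meaning the whole match.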

import re

pattern = r"\b\w{3,5}\b"
text = "Hello, world! This is a test."

for match in re.finditer(pattern, text):
    # Start and end indices of the matched substring
    start_index = match.start()
    end_index = match.end()

    # Print the result
    print(f"Match: {text[start_index:end_index]}, Start: {start_index}, End: {end_index}")

The example above prints:

Match: Hello, Start: 0, End: 5
Match: world, Start: 7, End: 12
Match: This, Start: 14, End: 18
Match: test, Start: 24, End: 28

If group exists but did not contribute to the match, -1 is returned. For example:

import re

pattern = r"(\w+),? (\w+)?"
text = "John, "

match = re.search(pattern, text)

print(f"Group 1: {text[match.start(1):match.end(1)]}, Start: {match.start(1)}, End: {match.end(1)}")  # Output: Group 1: John, Start: 0, End: 4
print(f"Group 2: {text[match.start(2):match.end(2)]}, Start: {match.start(2)}, End: {match.end(2)}")  # Output: Group 2: , Start: -1, End: -1

`Match.span([group])`

Returns the 2-tuple (m.start(group), m.end(group)). If group exists but did not contribute to the match, returns (-1, -1).
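As a quick illustration, the sketch below reuses the pattern and input from the Match.start() example and reads the spans of the whole match and of both groups:

```python
import re

match = re.search(r"(\w+),? (\w+)?", "John, ")

whole = match.span()  # the whole match
first = match.span(1)  # group 1 matched "John"
second = match.span(2)  # group 2 did not contribute to the match

print(whole, first, second)  # (0, 6) (0, 4) (-1, -1)
```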

`Match.pos`、`Match.endpos`

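The start index and end index of the search within the string, that is, the values of pos and endpos that were in effect for the search. These two parameters can only be set when using a compiled regular expression object.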
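A minimal sketch of how these attributes reflect the search bounds. The pattern and text are made up for illustration.

```python
import re

pattern = re.compile(r"\d+")
text = "a1b22c333"

# Search only within text[2:5], i.e. "b22"
limited = pattern.search(text, 2, 5)
print(limited.group(), limited.pos, limited.endpos)  # 22 2 5

# Without explicit bounds, pos is 0 and endpos is len(text)
default = pattern.search(text)
print(default.group(), default.pos, default.endpos)  # 1 0 9
```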

`Match.lastindex`

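The integer index of the last matched capturing group. For instance, if group 2 did not match but group 3 did, lastindex is 3. If no group matched at all, it is None.

For example: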

import re

pattern = r"(\w+)?\s+(\d+)?\s+(\w+)?"
text = "John 4 Dog, Jane  Tiger, Tom  ,  "

for match in re.finditer(pattern, text):
    print(f"Name: {match[1]}, Age: {match[2]}, Species: {match[3]}, Last captured group index: {match.lastindex}")

The example above prints:

Name: John, Age: 4, Species: Dog, Last captured group index: 3
Name: Jane, Age: None, Species: Tiger, Last captured group index: 3
Name: Tom, Age: None, Species: None, Last captured group index: 1
Name: None, Age: None, Species: None, Last captured group index: None

`Match.lastgroup`

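The name of the last matched capturing group. For instance, if group 2 did not match but group 3 did, lastgroup is the name of group 3. If that group has no name, or if no group matched at all, it is None. The rules follow those of Match.lastindex.

For example: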

import re

pattern = r"(?P<name>\w+)?\s+(?P<age>\d+)?\s+(?P<species>\w+)?"
text = "John 4 Dog, Jane  Tiger, Tom  ,  "

for match in re.finditer(pattern, text):
    print(f"Name: {match['name']}, Age: {match['age']}, Species: {match['species']}, "
          f"Last captured group name: {match.lastgroup}")

The example above prints:

Name: John, Age: 4, Species: Dog, Last captured group name: species
Name: Jane, Age: None, Species: Tiger, Last captured group name: species
Name: Tom, Age: None, Species: None, Last captured group name: name
Name: None, Age: None, Species: None, Last captured group name: None

`Match.re`

Returns the regular expression object that produced this match instance.

`Match.string`

Returns the string that was passed to match() or search().
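Both attributes can be checked with a short sketch; the pattern and input are made up for illustration.

```python
import re

pattern = re.compile(r"\d+")
text = "abc123"
match = pattern.search(text)

print(match.re is pattern)  # True
print(match.string == text)  # True
print(match.re.pattern)  # \d+
```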

5 functions that process strings directly

`re.split(pattern, string, maxsplit=0, flags=0)`

Splits string by the occurrences of pattern and returns the resulting list of strings. For example:

import re

pattern = r"\W+"  # Split on any run of non-word characters
text = "Words, words, words."

print(re.split(pattern, text))  # Output: ['Words', 'words', 'words', '']

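If pattern contains capturing groups, the text of all groups in the pattern is also included in the resulting list.

import re

pattern = r"(\W+)"  # Split on any run of non-word characters and keep the separators
text = "Words, words, words."

print(re.split(pattern, text))  # Output: ['Words', ', ', 'words', ', ', 'words', '.', '']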

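If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

import re

pattern = r"\W+"  # Split on any run of non-word characters
text = "Words, words, words."

print(re.split(pattern, text, maxsplit=1))  # Output: ['Words', 'words, words.']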

`re.findall(pattern, string, flags=0)`

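Returns all non-overlapping matches of pattern in string, as a list of strings or a list of tuples of strings. The string is scanned from left to right, and matches are returned in the order found. Empty matches are included in the result.

The shape of the result depends on the number of capturing groups in pattern. If there are no groups, a list of strings matching the whole pattern is returned. If there is exactly one group, a list of strings matching that group is returned. If there are multiple groups, a list of tuples of strings matching the groups is returned. For example: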

import re

print(re.findall(r"\d+", "I have 3 apples and 5 oranges."))  # Output: ['3', '5']
print(re.findall(r"(\w+)=(\d+)", "set width=20 and height=10"))  # Output: [('width', '20'), ('height', '10')]
print(re.findall(r"(\d+)", "set width=20 and height=10"))  # Output: ['20', '10']

Changed in Python 3.7: a non-empty match can now start immediately after a previous empty match.

`re.sub(pattern, repl, string, count=0, flags=0)`

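Returns the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string with repl, in a single left-to-right pass. If the pattern is not found, string is returned unchanged. repl can be a string or a function.

When repl is a string, each match of pattern is replaced by repl (after any backslash escapes in repl, such as \1, have been processed). For example: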

import re

text = "Helo! Hello, World!"

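# Use re.sub() to replace every match of "Hel+o" with "Hi"
new_text = re.sub("Hel+o", "Hi", text)

print(new_text)  # Output: Hi! Hi, World!

When repl is a function, it is called for every non-overlapping match of pattern. The function takes a single re.Match argument and returns the replacement string. For example: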

import re


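def my_repl(match: re.Match) -> str:
    """Custom replacement: append one "o" per character of the matched text."""
    return match.group(0) + len(match.group(0)) * "o"


text = "Helo! Hello, World!"

# Use re.sub() to replace every match of "Hel+o" with the result of my_repl
new_text = re.sub("Hel+o", my_repl, text)

print(new_text)  # Output: Helooooo! Helloooooo, World!

The optional argument count is the maximum number of pattern occurrences to replace; count must be a non-negative integer. If it is omitted or zero, all occurrences are replaced.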
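A brief sketch of count, reusing the text from the examples above; with count=1 only the first match is replaced.

```python
import re

text = "Helo! Hello, World!"

# Replace at most one occurrence, scanning from the left
first_only = re.sub("Hel+o", "Hi", text, count=1)

print(first_only)  # Hi! Hello, World!
```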

`re.subn(pattern, repl, string, count=0, flags=0)`

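Same as re.sub(), but returns a tuple (new_string, number_of_substitutions). For example:

import re

text = "Helo! Hello, World!"

# Use re.subn() to replace every match of "Hel+o" with "Hi" and count the replacements
new_text = re.subn("Hel+o", "Hi", text)

print(new_text)  # Output: ('Hi! Hi, World!', 2)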

`re.escape(pattern)`

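Escapes the special characters in pattern. This is useful when you want to match a literal string that may contain regular expression metacharacters.

import re

# Suppose we want to find the literal character "[" in a string
text = "Hello! How are you? [This] is a test."

# Use re.escape to escape the special character
pattern = re.escape("[")

print(pattern)  # Output: \[

# Match using the escaped pattern
matches = re.findall(pattern, text)

print(matches)  # Output: ['[']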
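The difference escaping makes is easiest to see by running the same literal text with and without re.escape(). The sketch below uses a hypothetical `user_text` value that happens to contain the metacharacter ".".

```python
import re

user_text = "a.b"  # hypothetical literal text that happens to contain "."

# Unescaped, "." matches any character, so "axb" matches as well
unescaped = re.findall(user_text, "a.b axb")

# Escaped, only the literal text "a.b" matches
escaped = re.findall(re.escape(user_text), "a.b axb")

print(unescaped)  # ['a.b', 'axb']
print(escaped)  # ['a.b']
```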

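Regular expression objects (`re.Pattern`) and regex performance optimization

If a regular expression is used frequently, it can first be compiled into a re.Pattern object with re.compile(), so that it does not have to be compiled over and over.

`re.compile(pattern, flags=0)`

Compiles a regular expression into a regular expression object (re.Pattern). The object can be passed as the pattern argument of re module functions such as re.search() and re.match(), and its own match(), search() and other methods can also be called directly. For example: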

import re

# Compile the regular expression
pattern = re.compile("Hel+o")

print(type(pattern), pattern)  # Output: <class 're.Pattern'> re.compile('Hel+o')

# Match using the compiled pattern
match = re.search(pattern, "Helo! Hello, World!")

print(match)  # Output: <re.Match object; span=(0, 4), match='Helo'>

`re.Pattern`

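A Pattern object implements the following matching methods. They behave the same as the corresponding re module functions, except that the pattern is already compiled and some of them accept the additional optional parameters pos and endpos.

Method	Equivalent re module function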
`Pattern.search()`	`re.search()`
`Pattern.match()`	`re.match()`
`Pattern.fullmatch()`	`re.fullmatch()`
`Pattern.split()`	`re.split()`
`Pattern.findall()`	`re.findall()`
`Pattern.finditer()`	`re.finditer()`
`Pattern.sub()`	`re.sub()`
`Pattern.subn()`	`re.subn()`
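A minimal sketch of calling these methods directly on a compiled pattern. The `digits` pattern and the sample text are made up for illustration.

```python
import re

digits = re.compile(r"\d+")
text = "I have 3 apples and 15 oranges."

found = digits.findall(text)
replaced = digits.sub("N", text)
parts = digits.split(text)
first = digits.search(text).group()

print(found)  # ['3', '15']
print(replaced)  # I have N apples and N oranges.
print(parts)  # ['I have ', ' apples and ', ' oranges.']
print(first)  # 3
```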

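In addition, a re.Pattern object has 4 attributes. Their meanings are listed below, followed by an example:

Attribute	Meaning
`Pattern.flags`	The regex matching flags: the flags passed to `re.compile()`, combined with any inline flags in the pattern and implicit flags such as UNICODE
`Pattern.groups`	The number of capturing groups in the pattern
`Pattern.groupindex`	A dictionary mapping the symbolic group names defined by `(?P<id>)` to group numbers; empty if the pattern has no named groups
`Pattern.pattern`	The pattern string from which the pattern object was compiled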

import re

pattern = re.compile(r"(?P<name>\w+)?\s+(?P<age>\d+)?\s+(?P<species>\w+)?")

print(pattern.flags)  # Output: 32 (re.UNICODE, set implicitly for str patterns)
print(pattern.groups)  # Output: 3
print(pattern.groupindex)  # Output: {'name': 1, 'age': 2, 'species': 3}
print(pattern.pattern)  # Output: (?P<name>\w+)?\s+(?P<age>\d+)?\s+(?P<species>\w+)?

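The `pos` and `endpos` parameters

The 5 re.Pattern methods Pattern.search(), Pattern.match(), Pattern.fullmatch(), Pattern.findall() and Pattern.finditer() all accept the optional parameters pos and endpos.

The optional parameter pos gives the index in the string where the search starts; it defaults to 0. Note that this is not completely equivalent to slicing the string: the "^" pattern character matches at the real beginning of the string and (in multiline mode) at positions just after a newline, but not necessarily at the index where the search starts.

The optional parameter endpos limits how far the string is searched: the string is treated as if it were only endpos characters long.

So only the characters from pos to endpos - 1 are searched for a match. If pos is greater than endpos, no match is found.

For example: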

import re

pattern = re.compile(r"(?P<name>\w+)?\s+(?P<age>\d+)?\s+(?P<species>\w+)?")
text = "John 4 Dog, Jane  Tiger"

print(pattern.findall(text))  # Output: [('John', '4', 'Dog'), ('Jane', '', 'Tiger')]
print(pattern.findall(text, 2))  # Output: [('hn', '4', 'Dog'), ('Jane', '', 'Tiger')]
print(pattern.findall(text, 2, 9))  # Output: [('hn', '4', 'Do')]
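The difference between pos and slicing can be sketched with a pattern anchored by "^". The two-character input is made up for illustration.

```python
import re

pattern = re.compile(r"^b")

# With pos=1 the search starts at "b", but "^" still refers to the real start of "ab"
with_pos = pattern.search("ab", 1)

# Slicing creates a new string whose real start is "b"
with_slice = pattern.search("ab"[1:])

print(with_pos)  # None
print(with_slice)  # <re.Match object; span=(0, 1), match='b'>

# When pos is greater than endpos, nothing can match
reversed_bounds = re.compile(r"b").search("ab", 1, 0)
print(reversed_bounds)  # None
```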

`re.purge()`

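In addition, the re module provides the re.purge() function, which clears the regular expression cache. This cache mainly holds compiled regular expression objects (re.Pattern), including those created with re.compile() and those built automatically when other re module functions are called.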
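A small usage sketch, assuming only that purging affects the module's internal cache and not Pattern objects that the program already holds:

```python
import re

# Module-level functions compile the pattern on first use and cache the result
re.search(r"\d+", "abc123")

# Clear the module's internal cache of compiled patterns
re.purge()

# A Pattern object that is already held stays usable after purging
pattern = re.compile(r"\d+")
re.purge()
result = pattern.search("abc123").group()

print(result)  # 123
```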