String Search and Replace

Searching

The usual way to match or search for specific text in a string is to use string methods such as find(), startswith(), and endswith(). For example:

>>> text = 'yeah, but no, but yeah, but no, but yeah'
>>> # Exact match 
>>> text == 'yeah' 
False

>>> # Match at start or end 
>>> text.startswith('yeah') 
True
>>> text.endswith('no') 
False
>>> # Search for the location of the first occurrence 
>>> text.find('no')
10
>>>
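
A few related behaviors of these string methods are easy to overlook. The sketch below is only illustrative (the variable names are not from the original): find() returns -1 rather than raising when nothing is found, the in operator is enough for a yes/no test, startswith() accepts a tuple of candidates, and rfind() reports the last occurrence.

```python
text = 'yeah, but no, but yeah, but no, but yeah'

# find() returns -1 when the substring is absent, so compare against -1
# instead of relying on truthiness (index 0 is a valid hit but is falsy).
missing = text.find('maybe')

# The in operator is sufficient when only a yes/no answer is needed.
has_no = 'no' in text

# startswith() and endswith() accept a tuple of alternatives.
starts = text.startswith(('yep', 'yeah'))

# rfind() searches from the right and returns the last occurrence.
last_no = text.rfind('no')

print(missing, has_no, starts, last_no)
```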

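For more complicated matching and searching, you usually need regular expressions and the re module. Note that most regular expression operations are available both as module-level functions in re and as methods on compiled regular expression objects. The module-level functions are shortcuts that do not require compiling a pattern object first, but they give up some fine-tuning parameters.
For example, to check whether a string contains a date written in the form "11/27/2012", you can do the following: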

>>> text1 = '11/27/2012'
>>> text2 = 'Nov 27, 2012'
>>>
>>> import re
>>> # Simple matching: \d+ means match one or more digits 
>>> if re.match(r'\d+/\d+/\d+', text1):
...     print('yes')
... else:
...     print('no')
...
yes
>>> if re.match(r'\d+/\d+/\d+', text2):
...     print('yes')
... else:
...     print('no')
...
no
>>>

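Note that match() only checks for a match at the beginning of the string; if there is none, it returns None (not False). To search the whole string and find every occurrence, use findall(). The example below first compiles the pattern into a pattern object with re.compile() and then calls the methods on that object.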

>>> datepat = re.compile(r'\d+/\d+/\d+') 
>>> if datepat.match(text1):
...     print('yes')
... else:
...     print('no')
...
yes
>>> if datepat.match(text2):
...     print('yes')
... else:
...     print('no')
...
no
>>>
>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> datepat.findall(text)
['11/27/2012', '3/13/2013']
>>>
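
Because match() is anchored only at the start, it will happily accept a string with trailing text after the date. The following sketch contrasts match() with fullmatch() and search(), both standard re methods; the sample strings are made up for illustration.

```python
import re

datepat = re.compile(r'\d+/\d+/\d+')

# match() anchors at the start only, so trailing text is silently ignored.
m = datepat.match('11/27/2012abcdef')
prefix = m.group()

# fullmatch() succeeds only when the entire string matches the pattern.
whole = datepat.fullmatch('11/27/2012abcdef')

# search() scans forward and returns the first match anywhere in the string.
first = datepat.search('PyCon starts 3/13/2013.')

# A failed match returns None, which is why "if datepat.match(...)" works.
nothing = datepat.match('Nov 27, 2012')

print(prefix, whole, first.group(), nothing)
```

If an exact match is required, fullmatch() (or a pattern ending in $) is usually the safer choice.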

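When defining a regular expression, it is common to introduce capture groups by enclosing parts of the pattern in parentheses. For example: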

>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
>>> m = datepat.match('11/27/2012') 
>>> m
<re.Match object; span=(0, 10), match='11/27/2012'>
>>> # Extract the contents of each group 
>>> m.group(0)
'11/27/2012'
>>> m.group(1)
'11'
>>> m.group(2) 
'27'
>>> m.group(3) 
'2012'
>>> m.groups()
('11', '27', '2012')
>>> month, day, year = m.groups() 
>>>
>>> # Find all matches (notice splitting into tuples) 
>>> text
'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> datepat.findall(text)
[('11', '27', '2012'), ('3', '13', '2013')]
>>> for month, day, year in datepat.findall(text): 
...        print('{}-{}-{}'.format(year, month, day)) 
...
2012-11-27
2013-3-13
>>>
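
Captured groups are always strings, so they normally need converting before further use. As a hedged sketch (using datetime.date here is an assumption for illustration, not something the original does), the groups can be turned into date objects:

```python
import re
from datetime import date

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Each findall() result is a (month, day, year) tuple of strings;
# convert to int before building a date object.
dates = [date(int(year), int(month), int(day))
         for month, day, year in datepat.findall(text)]

# isoformat() zero-pads the month and day.
iso = [d.isoformat() for d in dates]
print(iso)
```

Note that date() raises ValueError for impossible dates such as 13/45/2012, which the pattern itself would accept.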

The findall() method searches the text, finds all matches, and returns them as a list. To find matches iteratively instead, use the finditer() method. For example:

>>> for m in datepat.finditer(text): 
...        print(m.groups())
...
('11', '27', '2012')
('3', '13', '2013')
>>>
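
One practical difference is that finditer() yields full match objects, so position information is available as well as the groups. A minimal sketch, reusing the same pattern and text:

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Each item is a match object: start() and end() give the slice
# of the original text that matched, group() gives the matched text.
found = [(m.start(), m.end(), m.group()) for m in datepat.finditer(text)]

for start, end, matched in found:
    print(start, end, matched)
```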

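Note, however, that if you are going to perform many matches or searches, it usually pays to compile the pattern first and reuse it. The module-level functions keep a cache of recently compiled patterns, so the performance penalty is small, but using your own compiled pattern object saves a few lookups and some extra processing.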
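
Besides speed, a compiled pattern exposes options the module-level functions lack. One example is the optional pos argument accepted by the search(), match(), findall(), and finditer() methods of a pattern object; the offset 20 below is chosen purely for illustration.

```python
import re

datepat = re.compile(r'\d+/\d+/\d+')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Module-level shortcut: always scans from the beginning of the string.
first = re.search(r'\d+/\d+/\d+', text).group()

# Pattern method with pos: start scanning at index 20, skipping the first date.
later = datepat.search(text, 20)

print(first, later.group(), later.span())
```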

Replacing

For simple text substitution, use the replace() method of str, as follows:

>>> text = 'yeah, but no, but yeah, but no, but yeah'

>>> text.replace('yeah', 'yep')
'yep, but no, but yep, but no, but yep'
>>>
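
Two details of replace() are worth keeping in mind: it returns a new string and leaves the original untouched, and it accepts an optional third argument limiting the number of replacements. A short sketch:

```python
text = 'yeah, but no, but yeah, but no, but yeah'

# The optional count argument limits how many occurrences are replaced.
first_only = text.replace('yeah', 'yep', 1)

# Strings are immutable: text itself still holds all three 'yeah's.
remaining = text.count('yeah')

print(first_only)
print(remaining)
```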

For more complicated patterns, use the sub() function in the re module. For example, to rewrite dates of the form "11/27/2012" as "2012-11-27":

>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.' 
>>> import re
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text) 
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>>

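The first argument to sub() is the pattern to match and the second is the replacement pattern. Backslashed digits such as \3 refer to capture group numbers in the pattern.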
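
As a variation on the same idea, sub() also takes a count argument, and the replacement can refer to named groups with the \g<name> syntax. The group names month, day, and year below are chosen for this sketch and do not appear in the original.

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# count=1 replaces only the first match and leaves the rest alone.
first_only = re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text, count=1)

# Named groups make the replacement pattern easier to read than \3-\1-\2.
named = re.sub(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)',
               r'\g<year>-\g<month>-\g<day>', text)

print(first_only)
print(named)
```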

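If you are going to perform repeated substitutions with the same pattern, consider compiling it first for better performance. For example: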

>>> import re
>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)') 
>>> datepat.sub(r'\3-\1-\2', text)
'Today is 2012-11-27. PyCon starts 2013-3-13.' 
>>>

For more complicated substitutions, you can specify a substitution callback function. For example:

>>> from calendar import month_abbr
>>> def change_date(m):
...        mon_name = month_abbr[int(m.group(1))]
...        return '{} {} {}'.format(m.group(2), mon_name, m.group(3)) 
...
>>> datepat.sub(change_date, text)
'Today is 27 Nov 2012. PyCon starts 13 Mar 2013.'
>>>

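The argument passed to the substitution callback is a match object, of the kind returned by match() or search(). Use its .group() method to extract specific parts of the match. The function should return the replacement text.
If, in addition to the substituted text, you want to know how many substitutions were made, use re.subn() instead. For example: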

>>> newtext, n = datepat.subn(r'\3-\1-\2', text) 
>>> newtext
'Today is 2012-11-27. PyCon starts 2013-3-13.' 
>>> n
2 
>>>
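
The callback form and subn() can be combined. The sketch below uses a hypothetical helper, to_iso, to zero-pad the month and day (which the plain \3-\1-\2 replacement cannot do) while also reporting the number of substitutions.

```python
import re

datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

def to_iso(m):
    # m is the match object for one date; groups() gives (month, day, year).
    month, day, year = m.groups()
    # {:0>2} right-aligns the string in a width of 2, padding with '0'.
    return '{}-{:0>2}-{:0>2}'.format(year, month, day)

# subn() accepts a callback just like sub() and returns (new_text, count).
newtext, n = datepat.subn(to_iso, text)

print(newtext)
print(n)
```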
