String Processing in Python

Splitting strings

Splitting on a single delimiter

>>> a = '1,2,3,4'
>>> a.split(',')
['1', '2', '3', '4']
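
As a small supplementary sketch (the sample strings below are illustrative and not from the original text), str.split() also accepts a maxsplit argument, and calling it with no separator splits on runs of whitespace:

```python
# Illustrative sample values; not taken from the original text.
record = 'name,age,city,notes, with, commas'

# maxsplit=3 performs at most three splits, so the tail stays intact.
head = record.split(',', 3)
print(head)   # ['name', 'age', 'city', 'notes, with, commas']

# With no separator, split() treats any run of whitespace as a single
# delimiter and ignores leading and trailing whitespace.
words = '  hello   world \n'.split()
print(words)  # ['hello', 'world']
```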

Splitting on multiple delimiters

>>> line = 'asdf fjdk; afed, fjek,asdf, foo'
>>> import re
>>> re.split(r'[;,\s]\s*', line)  # use re to match the delimiters
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']

If you want to keep the delimiters in the result list, use a capture group:

>>> fields = re.split(r'(;|,|\s)\s*', line)
>>> fields
['asdf', ' ', 'fjdk', ';', 'afed', ',', 'fjek', ',', 'asdf', ',', 'foo']

If you do not want to keep the delimiters but still need a group in the regular expression, use a non-capturing group:

>>> re.split(r'(?:,|;|\s)\s*', line)
['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
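
One possible use of the captured delimiters is to rebuild the string from its parts. The following is only a sketch; the names values, delimiters and rebuilt are chosen here for illustration:

```python
import re

line = 'asdf fjdk; afed, fjek,asdf, foo'
fields = re.split(r'(;|,|\s)\s*', line)

# Even indices hold the values, odd indices hold the captured delimiters.
values = fields[::2]
delimiters = fields[1::2] + ['']  # pad so both lists have the same length

# Re-join each value with the delimiter that followed it. Whitespace that
# trailed a delimiter was not captured, so it is not restored.
rebuilt = ''.join(v + d for v, d in zip(values, delimiters))
print(values)   # ['asdf', 'fjdk', 'afed', 'fjek', 'asdf', 'foo']
print(rebuilt)  # asdf fjdk;afed,fjek,asdf,foo
```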

Matching the start or end of a string

To check whether a string starts or ends with a given piece of text, use startswith() and endswith():

>>> filename = 'spam.txt'
>>> filename.endswith('.txt')
True
>>> filename.startswith('file:')
False
>>> url = 'http://www.python.org'
>>> url.startswith('http:')
True

If there are several possible matches to check, pass a tuple of candidates:

>>> import os
>>> filenames = os.listdir('.')
>>> filenames
[ 'Makefile', 'foo.c', 'bar.py', 'spam.c', 'spam.h' ]

>>> [name for name in filenames if name.endswith(('.c', '.h')) ]
['foo.c', 'spam.c', 'spam.h']
>>> any(name.endswith('.py') for name in filenames)
True
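
Note that the argument really has to be a tuple. The sketch below uses a hypothetical list named choices to show that a list is rejected with a TypeError and that converting it with tuple() fixes the call:

```python
# Hypothetical list of prefixes, used only for illustration.
choices = ['http:', 'ftp:']
url = 'http://www.python.org'

try:
    url.startswith(choices)   # a list is not accepted here
    list_accepted = True
except TypeError:
    list_accepted = False

# Converting the list to a tuple makes the call valid.
tuple_result = url.startswith(tuple(choices))
print(list_accepted, tuple_result)  # False True
```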

Alternatively, you can use slicing or re matching:

>>> url = 'http://www.python.org'
>>> url[:5] == 'http:' or url[:6] == 'https:' or url[:4] == 'ftp:'
True

>>> import re
>>> url = 'http://www.python.org'
>>> re.match('http:|https:|ftp:', url)
<re.Match object; span=(0, 5), match='http:'>

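Matching strings with shell wildcards:

*	matches any number of characters, including zero
?	matches exactly one character
[char]	matches any one character listed in the brackets
[!char]	matches any one character not listed in the brackets
[:alnum:]	matches any one letter or digit
[:alpha:]	matches any one letter
[:digit:]	matches any one digit
[:lower:]	matches any one lowercase letter
[:upper:]	matches any one uppercase letter

The [:alnum:]-style POSIX character classes are a shell feature. Python's fnmatch module, used below, supports only *, ?, [seq] and [!seq].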

>>> from fnmatch import fnmatch, fnmatchcase
>>> fnmatch('foo.txt', '*.txt')
True
>>> fnmatch('foo.txt', '?oo.txt')
True
>>> fnmatch('Dat45.csv', 'Dat[0-9]*')
True
>>> names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']
>>> [name for name in names if fnmatch(name, 'Dat*.csv')]
['Dat1.csv', 'Dat2.csv']
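
The list comprehension above can also be written with fnmatch.filter(), and fnmatch.translate() converts a wildcard pattern into a regular expression string. A brief sketch follows; the exact text returned by translate() varies between Python versions, so it is only compiled here and not printed:

```python
import fnmatch
import re

names = ['Dat1.csv', 'Dat2.csv', 'config.ini', 'foo.py']

# fnmatch.filter() returns the names that match the pattern.
csv_names = fnmatch.filter(names, 'Dat*.csv')
print(csv_names)  # ['Dat1.csv', 'Dat2.csv']

# fnmatch.translate() produces an equivalent regular expression string,
# which can be compiled and reused like any other pattern.
regex = re.compile(fnmatch.translate('Dat[0-9]*'))
is_match = regex.match('Dat45.csv') is not None
print(is_match)   # True
```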

fnmatch() uses the case-sensitivity rules of the underlying operating system, which differ between systems:

>>> # On OS X (Mac)
>>> fnmatch('foo.txt', '*.TXT')
False
>>> # On Windows
>>> fnmatch('foo.txt', '*.TXT')
True

If this difference matters to you, use fnmatchcase() instead. It matches using exactly the case given in your pattern. For example:

>>> fnmatchcase('foo.txt', '*.TXT')
False

>>> fnmatchcase('foo.txt', '*.txt')
True

This function is also useful for strings that are not filenames:

addresses = [
'5412 N CLARK ST',
'1060 W ADDISON ST',
'1039 W GRANVILLE AVE',
'2122 N CLARK ST',
'4802 N BROADWAY',
]

>>> from fnmatch import fnmatchcase
>>> [addr for addr in addresses if fnmatchcase(addr, '* ST')]
['5412 N CLARK ST', '1060 W ADDISON ST', '2122 N CLARK ST']
>>> [addr for addr in addresses if fnmatchcase(addr, '54[0-9][0-9] *CLARK*')]
['5412 N CLARK ST']

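Summary: fnmatch sits between the string methods and regular expressions in power. If simple wildcards are enough for the data you are processing, fnmatch or fnmatchcase is a good choice. If you need to match filenames on the file system, prefer the glob module.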
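
For the glob case, a minimal sketch might look like the following. The temporary directory and the file names are made up for illustration, so that the example does not depend on any existing files:

```python
import glob
import os
import tempfile

# Create a throwaway directory with a few empty files (illustrative names).
with tempfile.TemporaryDirectory() as tmp:
    for name in ('Dat1.csv', 'Dat2.csv', 'config.ini'):
        open(os.path.join(tmp, name), 'w').close()

    # glob.glob() expands the wildcard against the file system.
    # glob.escape() protects any special characters in the directory path.
    pattern = os.path.join(glob.escape(tmp), 'Dat*.csv')
    matches = sorted(os.path.basename(p) for p in glob.glob(pattern))

print(matches)  # ['Dat1.csv', 'Dat2.csv']
```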

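Matching and searching strings

For simple matching, the string methods are enough, for example str.find(), str.startswith() and str.endswith().

More complex matching calls for regular expressions and the re module: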

>>> text1 = '11/27/2012'
>>> text2 = 'Nov 27, 2012'
>>>
>>> import re
>>> # Simple matching: \d+ means match one or more digits
>>> if re.match(r'\d+/\d+/\d+', text1):
...     print('yes')
... else:
...     print('no')
...
yes
>>> if re.match(r'\d+/\d+/\d+', text2):
...     print('yes')
... else:
...     print('no')
...
no
>>>

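re.match() always tries to match at the start of the string. It returns a Match object on success and None otherwise.

To reuse the same regular expression, compile the pattern string into a pattern object: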
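
Because re.match() anchors only at the start, it accepts trailing text after the match. The sketch below (the sample string is made up for illustration) shows this, along with two ways to require an exact match: ending the pattern with $ or using re.fullmatch():

```python
import re

# re.match() only anchors at the start, so trailing text is allowed.
prefix_match = re.match(r'\d+/\d+/\d+', '11/27/2012abcdef')
print(prefix_match.group())  # 11/27/2012

# For an exact match, end the pattern with $ or use re.fullmatch().
exact_dollar = re.match(r'\d+/\d+/\d+$', '11/27/2012abcdef')
exact_full = re.fullmatch(r'\d+/\d+/\d+', '11/27/2012')
print(exact_dollar)          # None
print(exact_full.group())    # 11/27/2012
```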

>>> datepat = re.compile(r'\d+/\d+/\d+')
>>> if datepat.match(text1):
...     print('yes')
... else:
...     print('no')
...
yes
>>> if datepat.match(text2):
...     print('yes')
... else:
...     print('no')
...
no

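To match somewhere other than the start of the string, use re.search() or re.findall(). re.search() returns a Match object for the first match it finds, or None if there is no match.

re.findall() returns all matching substrings in a list.

If the pattern contains capture groups, re.findall() returns the groups instead: a list of strings for a single group, or a list of tuples (one tuple of groups per match) for several groups.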

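>>> # Redefine the pattern with capture groups
>>> datepat = re.compile(r'(\d+)/(\d+)/(\d+)')
>>> m = datepat.match('11/27/2012')
>>> m
<re.Match object; span=(0, 10), match='11/27/2012'>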
>>> # Extract the contents of each group
>>> m.group(0)
'11/27/2012'
>>> m.group(1)
'11'
>>> m.group(2)
'27'
>>> m.group(3)
'2012'
>>> m.groups()
('11', '27', '2012')
>>> month, day, year = m.groups()
>>>
>>> # Find all matches (notice splitting into tuples)
>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> datepat.findall(text)
[('11', '27', '2012'), ('3', '13', '2013')]
>>> for month, day, year in datepat.findall(text):
...     print('{}-{}-{}'.format(year, month, day))
...
2012-11-27
2013-3-13
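
To put re.match(), re.search() and re.findall() side by side, here is a short self-contained sketch on the same sample text. The variable names are chosen for illustration only:

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# match() fails because the text does not start with a date.
at_start = re.match(r'\d+/\d+/\d+', text)

# search() finds the first date anywhere in the text.
first = re.search(r'\d+/\d+/\d+', text)

# Without groups, findall() returns the full matches as strings.
plain = re.findall(r'\d+/\d+/\d+', text)

# With several groups, findall() returns one tuple of groups per match.
grouped = re.findall(r'(\d+)/(\d+)/(\d+)', text)

print(at_start)       # None
print(first.group())  # 11/27/2012
print(plain)          # ['11/27/2012', '3/13/2013']
print(grouped)        # [('11', '27', '2012'), ('3', '13', '2013')]
```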

findall() returns its results as a list. To iterate over the matches lazily instead, use finditer():

>>> for m in datepat.finditer(text):
...     print(m.groups())
...
('11', '27', '2012')
('3', '13', '2013')
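
Since finditer() yields full Match objects, each match also carries its position and, if the pattern uses them, its named groups. The sketch below assumes a variant of the date pattern with named groups (month, day, year), which is not part of the original text:

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Named groups are an illustrative choice, not taken from the original text.
named = re.compile(r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)')

found = []
for m in named.finditer(text):
    # m.start() is the offset of the match; groupdict() maps names to values.
    found.append((m.start(), m.group('year'), m.groupdict()))

for start, year, parts in found:
    print(start, year, parts)
```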

Searching and replacing text

For simple search and replace, use str.replace():

>>> text = 'yeah, but no, but yeah, but no, but yeah'
>>> text.replace('yeah', 'yep')
'yep, but no, but yep, but no, but yep'

For more complex search and replace, use re.sub():

>>> text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'
>>> import re
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
'Today is 2012-11-27. PyCon starts 2013-3-13.'

Backslashed digits such as \3 refer to the capture groups in the pattern.
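
Groups can also be referenced by name in the replacement string using the \g<name> syntax. The following sketch assumes a version of the pattern with named groups, introduced here only for illustration:

```python
import re

text = 'Today is 11/27/2012. PyCon starts 3/13/2013.'

# Named groups (month, day, year) are an illustrative choice.
pattern = r'(?P<month>\d+)/(?P<day>\d+)/(?P<year>\d+)'

# \g<name> in the replacement refers to the group with that name.
result = re.sub(pattern, r'\g<year>-\g<month>-\g<day>', text)
print(result)  # Today is 2012-11-27. PyCon starts 2013-3-13.
```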

For even more complex substitutions, pass a callback function:

>>> from calendar import month_abbr
>>> def change_date(m):
...     mon_name = month_abbr[int(m.group(1))]
...     return '{} {} {}'.format(m.group(2), mon_name, m.group(3))
...
>>> datepat.sub(change_date, text)
'Today is 27 Nov 2012. PyCon starts 13 Mar 2013.'

If you also want to know how many substitutions were made, in addition to the result, use re.subn() instead:

>>> newtext, n = datepat.subn(r'\3-\1-\2', text)
>>> newtext
'Today is 2012-11-27. PyCon starts 2013-3-13.'
>>> n
2

To ignore case when matching, pass the re.IGNORECASE flag to the re function:

>>> text = 'UPPER PYTHON, lower python, Mixed Python'
>>> re.findall('python', text, flags=re.IGNORECASE)
['PYTHON', 'python', 'Python']
>>> re.sub('python', 'snake', text, flags=re.IGNORECASE)
'UPPER snake, lower snake, Mixed snake'
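
The flag can also be supplied in two other ways: when compiling the pattern, or inline at the very start of the pattern as (?i). A short sketch of both, with variable names chosen for illustration:

```python
import re

text = 'UPPER PYTHON, lower python, Mixed Python'

# Option 1: attach the flag when compiling the pattern.
pat = re.compile('python', re.IGNORECASE)
compiled_hits = pat.findall(text)

# Option 2: put the inline flag (?i) at the start of the pattern.
inline_hits = re.findall('(?i)python', text)

print(compiled_hits)  # ['PYTHON', 'python', 'Python']
print(inline_hits)    # ['PYTHON', 'python', 'Python']
```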

This example has a small flaw: the replacement text does not follow the case of the matched text. It can be fixed as follows:

def matchcase(word):
    def replace(m):
        text = m.group()
        if text.isupper():
            return word.upper()
        elif text.islower():
            return word.lower()
        elif text[0].isupper():
            return word.capitalize()
        else:
            return word
    return replace

>>> re.sub('python', matchcase('snake'), text, flags=re.IGNORECASE)
'UPPER SNAKE, lower snake, Mixed Snake'
