Python基础知识—正则表达式

正则表达式

不用正则的判断
- re.compile()
- ptn.search()
正则给额外信息
- re.search()
中文
- string.encode()
查找替换等更多功能
- re.search()
- re.match()
- re.findall()
- re.finditer()
- re.split()
- re.sub()
- re.subn()

1）不用正则的判断

在文字中寻找某个信息，若不用正则表达式时：

pattern1 = "file"
pattern2 = "files"
string = "the file is in the folder"
print("file in string", pattern1 in string)
print("files in string", pattern2 in string)

file in string True
files in string False

局限：若种类繁多之后，处理能力就变得十分有限

用正则表达式做一个通用的邮箱地址判断(仅做一个观察，原理下面叙述)

import re
ptn = re.compile(r"\w+?@\w+?\.com")

matched = ptn.search("[email protected]")
print("[email protected] is a valid email:", mached)
matched = ptn.search("nvuwgt123@163;com")
print("ncuwgt123@163;com is a valid email:", mached)

ncuwgt123@163.com is a valid email: <re.Match object; span=(0, 17), match='[email protected]'>
ncuwgt123@163;com is a valid email: None

2）正则表达式会给出额外的信息

从上面的例子可以看出，正则表达式除了帮你判断有没有某个模式以外，还可以做很多事情，不仅仅返回一个True，而且还会返回很多额外的信息

import re
matched = re.search(r"\w+?@\w+?\.com", "[email protected]")
print("[email protected]:", matched)
matched = re.search(r"\w+?@\w+?\.com", "the email is [email protected].")
print("the email is [email protected]:", matched)

ncuwgt123@163.com: <re.Match object; span=(0, 17), match='[email protected]'>
the email is ncuwgt123@163.com: <re.Match object; span=(13, 30), match='[email protected]'>

可以看到返回的信息中，有一个span=(xx,xx)，有一个match="[email protected]"，它们分别代表着，原始字符中，找到的pattern时从哪一位到哪一位，patter找到的具体字符串是什么

match = re.search(r"run", "I run to you")
print(match)
print(match.group)

<re.Match object; span=(2, 5), match='run'>
run

写一个pattern时，都是用r"xxx"，因为正则表达式很多时候都要包含\,r代表原生字符串，使用r开头的字符串是为了不混淆pattern字符串中到底要写几个\，所以写pattern的时候，都写上一个r在前面就好了
match.group()可以取出字符串中的匹配串

2）同时满足多种条件

若有俩个条件，只要满足其中的一个条件，便可以为匹配串，则可以使用|

re.search(r"ran|run", "I run to you")

<re.Match object; span=(2, 5), match='run'>

观察发现，ran和run之间，只差了中间的字母，故而我们还可以使用[au]来简化上例的pattern，让它同时接受中间是a或者u的情况

re.search(r"r[au]n", "I run to you")

<re.Match object; span=(2, 5), match='run'>

如果一个pattern的前后端都是固定的，但是要同时满足多个字符的不同匹配，比如想要同时找到find和found。(同上例，只不过中间差的不止是一个字母了)

print(re.search(r"f(ou|i)nd", "I find you"))
print(re.search(r"f(ou|i)nd", "I found you"))

<re.Match object; span=(2, 6), match='find'>
<re.Match object; span=(2, 7), match='found'>

3）按类型匹配

特定标识	含义	范围
\d	任何数字	[0-9]
\D	不是数字的
\s	任何空白字符	[\t\n\r\f\v]
\S	空白字符以外的
\w	任何大小写字母，数字和_	[a-zA-Z0-9_]
\W	\w以外的
\b	匹配一个单词边界	比如er\b可以匹配never中的er，但不能匹配verb中的er
\B	匹配非单词边界	比如er\B可以匹配verb中的er，但不能匹配never中的er
\\	强制匹配\
.	匹配任何字符(除了\n)
?	前面的模式可有可无
*	重复零次或多次
+	重复一次或多次
{n,m}	重复n至m次
{n}	重复n次
+?	非贪婪，最小方式匹配+
*?	非贪婪，最小方式匹配*
??	非贪婪，最小方式匹配?
^	匹配一行开头，在re.M下，每行开头都匹配
$	匹配一行结尾，在re.M下，每行结尾都匹配
\A	匹配最开始，在re.M下，也从文本最开始
\B	匹配最结尾，在re.M下，也从文本最结尾

解释email匹配字例子：

re.search(r"\w+?@\w+?\.com", "[email protected]")

<re.Match object; span=(0, 17), match='[email protected]'>

首先，\w，标识任意的字符、数字、下划线

+?表示\w至少匹配一次，并且当识别到@之后做非贪婪模式匹配，也就是遇到@就跳过当前的重复匹配模式，进行下一个匹配阶段

匹配常见电话号码

print(re.search(r"138\d{8}", "13812343123"))
print(re.search(r"138\d{8}", "138999911112222"))

<re.Match object; span=(0, 11), match='13812343123'>
<re.Match object; span=(0, 11), match='13899991111'>

4）中文识别

print(re.search(r"不?我", "我爱你"))
print(re.search(r"不?我", "我不爱你"))
print(re.search(r"不.*?爱", "我不是很爱你"))

<re.Match object; span=(1, 2), match='爱'>
<re.Match object; span=(1, 3), match='不爱'>
<re.Match object; span=(1, 5), match='不是很爱'>

因为汉字通常是使用Unicode来表示的，如果把汉子变成Unicode，我们就可以看到汉字在计算机中的Unicode编码

"中".encode("unicode-escape")

b'\\u4e2d'

通过上例可以看出，Unicode就是一串英文字符，当然也可以使用之前的正则表达式表示，而且Unicode还是连续的，故而：

re.search(r"[\u4e00-\u9fa5]+", "这样就可以匹配所有的中文了，但English")

<re.Match object; span=(0, 13), match='这样就可以匹配所有的中文了'>

虽然上述pattern可以识别所有中文，但是碰到英文or标点就会停止了

解决办法：将中文标点的识别范围补充进去

re.search(r"[\u4e00-\u9fa5！？。，¥【】「」]+", "【这样就可以识别标点了，！】")

<re.Match object; span=(0, 14), match='【这样就可以识别标点了，！】'>

5）查找替换等更多功能

功能	说明	举例
re.search()	扫描查找整个字符串，找到第一个模式匹配的	re.search(r"run",“I run to you”).
re.match()	从字符的最开头匹配，找到第一个模式匹配的，即使用re.M多行匹配，也是从最最开头开始匹配	re.match(r"run", “I run to you”)
re.findall()	返回一个不重复的pattern的匹配列表	re.findall(r"r[ua]n", “I run to you. you ran run to him”)
re.finditer()	和findall一样，只是用迭代器的方式使用	for i in re.finditer(r"r[ua]n", “I run to you. you ran to him”):
re.split()	用正则分开字符串	re.split(r"r[ua]n", “I run to you. you ran to him”)
re.sub()	用正则替换字符	re.sub(r"r[ua]n", “jump”, “I run to you. you ran to him”)
re.subn()	和sub一样，但额外返回一个替代次数	re.subn(r"r[ua]n", “jump”, “I run to you. you ran to him”)

print("search:", re.search(r"run", "I run to you"))
print("math:", re.match(r"run", "I run to you"))
print("findall:", re.findall(r"r[ua]n", "I run to you. you can run to him"))

for i in re.finditer(r"r[ua]n", "I run to you. you ran run to him"):
    print("finditer:", i)
print("split:", re.split(r"r[ua]n", "I run to you. you ran to him"))
print("sub:", re.sub(r"r[ua]n", "jump", "I run to you. you ran to him"))
print("subn:", re.subn(r"r[ua]n", "jump", "I run to you. you ran to him"))

search: <re.Match object; span=(2, 5), match='run'>
match: None
findall: ['run', 'ran', 'run']
finditer: <re.Match object; span=(2, 5), match='run'>
finditer: <re.Match object; span=(18, 21), match='ran'>
split: ['I ', ' to you. you ', ' to him']
sub: I jump to you. you jump to him
subn: ('I jump to you. you jump to him', 2)

6）在模式中获取特定信息

eg：想要在文件名中提取*.jpg图片文件，而且只返回去掉.jpg之后的纯文件名

found = []
for i in re.finditer(r"[\w-]+?\.jpg", "I have 2021-02-01.jpg, 2021-02-02.jpg, 2021-02-03.jpg"):
    found.append(re.sub(r".jpg", "", i.group()))

['2021-02-01', '2021-02-02', '2021-02-03']

上例虽然可行，但比较复杂，还可以更加的简单

只要我们在正则表达中，加入一个()选定要接去返回的位置，它就直接返回括号里的内容

string = "I have 2021-02-01.jpg, 2021-02-02.jpg, 2021-02-03.jpg"
print("without():", re.findall(r"[\w-]+?\.jpg", string))
print("with():", re.findall(r"([\w-]+?)\.jpg", string))

without (): ['2021-02-01.jpg', '2021-02-02.jpg', '2021-02-03.jpg']
with (): ['2021-02-01', '2021-02-02', '2021-02-03']

若想获取更加详细的信息，将年月日分开获取，可以多是用几个括号，然后用group功能获取到不同括号中匹配的字符串

string = "I have 2021-02-01.jpg, 2021-02-02.jpg, 2021-02-03.jpg"
match = re.finditer(r"(\d+?)-(\d+?)-(\d+?)\.jpg", string)
for file in match:
  print("matched string:", file.group(0), ",year:", file.group(1), ", month:", file.group(2), ", day:", file.group(3))

matched string: 2021-02-01.jpg ,year: 2021 , month: 02 , day: 01
matched string: 2021-02-02.jpg ,year: 2021 , month: 02 , day: 02
matched string: 2021-02-03.jpg ,year: 2021 , month: 02 , day: 03

同样，findall也可以达到同样的目的，只是没有提供file.group(0)这种全匹配的信息

string = "I have 2021-02-01.jpg, 2021-02-02.jpg, 2021-02-03.jpg"
match = re.findall(r"(\d+?)-(\d+?)-(\d+?)\.jpg", string)
for file in match:
    print("year:", file[0], "month:", file[1], "day:", file[2])
    
year: 2021 , month: 02 , day: 01
year: 2021 , month: 02 , day: 02
year: 2021 , month: 02 , day: 03

有时，group的信息太多，括号太多。这时可以使用一个名字来索引匹配好的字段，然后用group("索引")的方式获取对应的片段。

tip:findall不提供名字索引的方法，所以一般使用search或者finditer来调用group方法。为了索引，需要在括号中写上?P<索引名>

string = "I have 2021-02-01.jpg, 2021-02-02.jpg, 2021-02-03.jpg"
match = re.finditer(r"(?P<y>\d+?)-(?P<m>\d+?)-(?P<d>\d+?)\.jpg", string)
for file in match:
    print("matched string:", file.group(0),
          ", year:", file.group("y"),
          ", month:", file.group("m"),
          ", day:", file.group("d")
         )
    
matched string: 2021-02-01.jpg , year: 2021 , month: 02 , day: 01
matched string: 2021-02-02.jpg , year: 2021 , month: 02 , day: 02
matched string: 2021-02-03.jpg , year: 2021 , month: 02 , day: 03

7）多模式匹配（re.M）

在正则表达式中还有一些特别的flags, 可以在``re.match(), re.findall()`等功能中使用。主要目的也是为了编写出更加简单、复杂的正则表达式

模式	全称	说明
re.I	re.IGNORECASE	忽略大小写
re.M	re.MULTILINE	多行模式，改变’^'和"$"的行为
re.S	re.DOTALL	点任意匹配模式，改变’.‘的行为，使’.'可以匹配任意字符(这里包括了`\n`)
re.L	re.LOCAL	是预定字符类\w \W \b \B \s \S取决于当前区域设定
re.U	re.UNICODE	使预定字符累\w \W \b \B \s \S \d \D取决于unicode定义的字符属性
re.X	re.VERBOSE	详细模式。这个模式下增则表达式可以是多行，忽略空白字符，并可以加入注释。

举例

re.I

ptn, string = r"r[ua]n", "I Ran to you"
print("without re.I:", re.search(ptn, string))
print("with re.I:", re.search(ptn, string, flag=re.I))

without re.I: None
with re.I: <re.Match object; span=(2, 5), match='Ran'>

re.M

想在每行文字的开头匹配特定的字符

如果使用^ran固定样式开头，我们是匹配不到第二行的ran to you的，所以我们需要设置flag = re.M

ptn = r"^ran"
string = """I
ran to you"""
print("without re.M:", re.search(ptn, string))
print("with re.M:", re.search(ptn, string ,flags=re.M))
print("with re.M and match:", re.match(ptn, string, flags=re.M))

without re.M: None
with re.M: <re.Match object; span=(2, 5), match='ran'>
with re.M and match: None

还记得``re.match()和re.search()`的区别吗？

如果想同时使用多种flags，也可以写成flags = re.M|re.I

ptn = r"^ran"
string = """I
Ran to you"""
print("with re.M and re.I:", re.search(ptn, string, flags=re.M|re.I))

with re.M and re.I: <re.Match object; span=(2, 5), match='Ran'>

还有一种flags的定义写法，可以在模式开头著名要采用哪几个flags:

eg：(?im)说明要使用re.I和re.M

string = """I
Ran to you"""
re.search(r"(?im)^ran", string)

<re.Match object; span=(2, 5), match='Ran'>

8）更快的执行

如果你想要重复判断一个正则表达式，通常会在使用函数之前定义一个pattern，然后直接使用这个pattern循环执行查找，这样可以更加有效率

提前定义patter使用函数re.compile()

import time
n = 1000000
# 不提前定义
t0 = time.time()
for _ in range(n):
    re.search(r"ran", "I ran to you")
t1 = time.time()
print("不提前定义的运行时间为：", t1-t0)

#先做complie
ptn = re.compile(r"ran")
for _ in range(n):
    ptn.search("I ran to you")
print("提前compile运行时间："， time.time() - t1)

不提前 compile 运行时间： 1.2579998970031738
提前 compile 运行时间： 0.3410003185272217

参考：莫烦python

Python基础知识—正则表达式

Python基础知识—正则表达式

正则表达式

1）不用正则的判断

2）正则表达式会给出额外的信息

2）同时满足多种条件

3）按类型匹配

4）中文识别

5）查找替换等更多功能

6）在模式中获取特定信息

7）多模式匹配（re.M）

8）更快的执行

猜你喜欢