So handy! 50 Python regular expressions worth bookmarking!

What is a regular expression?

Regular expressions are typically used to search for and replace text that matches a given pattern (rule).

Here, "regular" means rule or law, so a regular expression is "an expression that describes a rule".

This article collects common regular expression use cases for easy reference, and includes a detailed regular expression syntax reference at the end.

Cases covered: email address, ID number, mobile phone number, fixed phone, domain name, IP address, date, postal code, password, Chinese characters, numbers, and strings

How does Python support regular expressions?

The examples in this article are written in Python, in a Jupyter Notebook.

Python supports regular expressions through the re module, which gives the language a full set of regular expression operations.

Two functions are used throughout:

re.compile compiles a regular expression and returns a regular expression (Pattern) object;

.findall finds all the substrings of a string that the regular expression matches and returns them as a list. If no match is found, an empty list is returned.

# Import the re module   
import re 
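
As a minimal sketch of how these two calls fit together (the sample text and variable names are made up for illustration): findall returns the whole match only when the pattern has no capturing group. With one capturing group it returns just that group, which is presumably why the cases in this article rely on non-capturing (?:...) groups.

```python
import re

text = "order 12 shipped, order 345 pending"  # made-up sample text

# Compile once, then reuse the Pattern object.
digits = re.compile(r"\d+")
print(digits.findall(text))        # ['12', '345']
print(digits.findall("no match"))  # []

# With a capturing group, findall returns the group, not the whole match.
captured = re.compile(r"order (\d+)").findall(text)
non_captured = re.compile(r"order (?:\d+)").findall(text)
print(captured)      # ['12', '345']
print(non_captured)  # ['order 12', 'order 345']
```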

1. Email address

May contain uppercase and lowercase letters, Arabic numerals, underscores, hyphens, and dots

expression:

[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(?:\.[a-zA-Z0-9_-]+)

Case:

pattern = re.compile(r"[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(?:\.[a-zA-Z0-9_-]+)")   
strs ='My private email is [email protected], and my company email is [email protected], please register? '   
result = pattern.findall(strs)   
print(result) 

['[email protected]', '[email protected]']
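
The sketch below runs the same expression on two hypothetical example.com addresses (they are placeholders, not addresses from this article). It also shows one limitation: the expression accepts exactly one dot-separated part after the host, so a longer domain is cut short. Repeating the last group with + is one possible adjustment, offered here as an assumption rather than as the intended pattern.

```python
import re

# Expression from this section: exactly one ".xxx" part after "@host".
single = re.compile(r"[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(?:\.[a-zA-Z0-9_-]+)")
# Assumed variant: allow one or more dot-separated parts.
multi = re.compile(r"[a-zA-Z0-9_-]+@[a-zA-Z0-9_-]+(?:\.[a-zA-Z0-9_-]+)+")

# Hypothetical addresses, for illustration only.
text = "Write to xiao_ming@example.com or xiao-ming@mail.example.com today"

print(single.findall(text))
# ['xiao_ming@example.com', 'xiao-ming@mail.example']
print(multi.findall(text))
# ['xiao_ming@example.com', 'xiao-ming@mail.example.com']
```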

2. ID number

xxxxxx yyyy MM dd 375 0 (eighteen digits)

  •  District: [1-9]\d{5}
  •  The first two digits of the year: (18|19|([23]\d)) covers the years 1800-3999
  •  The last two digits of the year: \d{2}
  •  Month: ((0[1-9])|(10|11|12))
  •  Day: (([0-2][1-9])|10|20|30|31) the expression cannot rule out impossible dates such as February 30, or February 29 in a non-leap year
  •  Three-digit sequence code: \d{3}
  •  Two-digit sequence code: \d{2}
  •  Check code: [0-9Xx]

expression:

[1-9]\d{5}(18|19|([23]\d))\d{2}((0[1-9])|(10|11|12))(([0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]

Case:

pattern = re.compile(r"[1-9]\d{5}(?:18|19|(?:[23]\d))\d{2}(?:(?:0[1-9])|(?:10|11|12))(?:(?:[0-2][1-9])|10|20|30|31)\d{3}[0-9Xx]")
strs = "Xiao Ming's ID number is 342623198910235163, and his phone number is 13987692110"
result = pattern.findall(strs)   
print(result) 

['342623198910235163']
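
Because the expression alone cannot reject impossible calendar dates, one option is to capture the eight date digits and let datetime check them. This is only a sketch: the named group birth and the function has_valid_birth_date are names invented for this example, and the check code is not verified here.

```python
import re
from datetime import datetime

# Same structure as the expression above, with the date captured
# in a named group (the name "birth" is an assumption for this sketch).
ID_PATTERN = re.compile(
    r"[1-9]\d{5}"
    r"(?P<birth>(?:18|19|[23]\d)\d{2}(?:0[1-9]|1[0-2])(?:[0-2][1-9]|10|20|30|31))"
    r"\d{3}[0-9Xx]"
)

def has_valid_birth_date(id_number):
    """Return True if the string has the ID shape and a real calendar date."""
    m = ID_PATTERN.fullmatch(id_number)
    if m is None:
        return False
    try:
        # strptime raises ValueError for dates such as February 30.
        datetime.strptime(m.group("birth"), "%Y%m%d")
    except ValueError:
        return False
    return True

print(has_valid_birth_date("342623198910235163"))  # True
print(has_valid_birth_date("342623198902305163"))  # False (February 30)
```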

3. Domestic mobile phone number

Mobile phone numbers are 11 digits long and start with 1; the second digit is generally 3, 5, 6, 7, 8, or 9, and the remaining nine digits are arbitrary

For example: 13987692110, 15610098778

expression:

1[356789]\d{9}

Case:

pattern = re.compile(r"1[356789]\d{9}")   
strs = "Xiao Ming's mobile phone number is 13987692110, call him tomorrow"
result = pattern.findall(strs)   
print(result) 

['13987692110']
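
One thing to watch for: without boundaries, this expression also matches 11 digits inside a longer number, such as the ID number from the previous case. A possible refinement, sketched here as an assumption rather than as part of the original expression, is to add lookarounds that refuse a digit on either side.

```python
import re

loose = re.compile(r"1[356789]\d{9}")
# Assumed refinement: reject matches that sit inside a longer run of digits.
strict = re.compile(r"(?<!\d)1[356789]\d{9}(?!\d)")

text = "ID number 342623198910235163, phone number 13987692110"

print(loose.findall(text))   # ['19891023516', '13987692110']
print(strict.findall(text))  # ['13987692110']
```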

4. Domestic fixed telephone

Area code 3~4 digits, number 7~8 digits

For example: 0511-1234567, 021-87654321

expression:

\d{3}-\d{8}|\d{4}-\d{7}

Case:

pattern = re.compile(r"\d{3}-\d{8}|\d{4}-\d{7}")   
strs = '0511-1234567 is Xiao Ming’s phone number and his office phone number is 021-87654321'   
result = pattern.findall(strs)   
print(result) 

['0511-1234567', '021-87654321']

5. Domain Name

May begin with http:// or https://, and ends with a slash

expression:

(?:(?:http:\/\/)|(?:https:\/\/))?(?:[\w](?:[\w\-]{0,61}[\w])?\.)+[a-zA-Z]{2,6}(?:\/)

Case:

pattern = re.compile(r"(?:(?:http:\/\/)|(?:https:\/\/))?(?:[\w](?:[\w\-]{0,61}[\w])?\.)+[a-zA-Z]{2,6}(?:\/)")
strs = 'The official website of Python is https://www.python.org/'
result = pattern.findall(strs)   
print(result) 

['https://www.python.org/']

6. IP address

An IPv4 address is 32 bits long (2^32 addresses in total), divided into 4 segments of 8 bits each, written as decimal numbers

Each segment ranges from 0 to 255, and the segments are separated by periods

expression:

((?:(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d?\d))

Case:

pattern = re.compile(r"((?:(?:25[0-5]|2[0-4]\d|[01]?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|[01]?\d?\d))")
strs ='''Please enter a legal IP address, illegal IP addresses and other characters will be filtered!  
After adding, deleting or changing the IP address, please save and close the notepad!  
192.168.8.84   
192.168.8.85   
192.168.8.86   
0.0.0.1   
256.1.1.1   
192.256.256.256   
192.255.255.255   
aa.bb.cc.dd'''   
result = pattern.findall(strs)   
print(result) 

['192.168.8.84', '192.168.8.85', '192.168.8.86', '0.0.0.1', '56.1.1.1', '192.255.255.255']
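
Note that '56.1.1.1' in the output is the tail of the illegal address 256.1.1.1. If that is not wanted, one possible fix is to forbid a digit or dot immediately before and after the match. The sketch below is an assumption, not the article's expression: the names octet and strict_ip are invented, and this octet pattern also rejects leading zeros such as 01, which the original accepts.

```python
import re

# One segment, 0-255, without leading zeros (an assumption of this sketch).
octet = r"(?:25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)"
# Lookarounds stop the match from starting or ending inside a longer number.
strict_ip = re.compile(rf"(?<![\d.])(?:{octet}\.){{3}}{octet}(?![\d.])")

text = ("192.168.8.84 0.0.0.1 256.1.1.1 "
        "192.256.256.256 192.255.255.255 aa.bb.cc.dd")

print(strict_ip.findall(text))
# ['192.168.8.84', '0.0.0.1', '192.255.255.255']
```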

7. Date

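Common date formats: yyyyMMdd, yyyy-MM-dd, yyyy/MM/dd, yyyy.MM.dd

The expression below handles the three forms that use a separator; the dot is escaped so that it matches only a literal period

expression:

\d{4}(?:-|\/|\.)\d{1,2}(?:-|\/|\.)\d{1,2}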

Case:

pattern = re.compile(r"\d{4}(?:-|\/|\.)\d{1,2}(?:-|\/|\.)\d{1,2}")
strs = "Today is 2020/12/20, this day last year was 2019.12.20, and this day next year will be 2021-12-20"
result = pattern.findall(strs)   
print(result) 

['2020/12/20', '2019.12.20', '2021-12-20']
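
The expression accepts mixed separators, for example 2019.12-20. If both separators should be the same, a backreference can enforce that. This is a sketch of that idea (the name same_sep is invented here), and it uses finditer because findall would return only the captured separator.

```python
import re

# \1 must repeat whatever separator group 1 captured.
same_sep = re.compile(r"\d{4}([-/.])\d{1,2}\1\d{1,2}")

text = "2020/12/20, 2019.12-20, 2021-12-20"

# group(0) is the whole match; group 1 is only the separator.
dates = [m.group(0) for m in same_sep.finditer(text)]
print(dates)  # ['2020/12/20', '2021-12-20']
```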

8. Domestic Postal Code

China's postal codes use a four-level, six-digit code structure

The first two digits indicate the province (municipalities, autonomous regions)

The third digit indicates the postal area; the fourth digit indicates the county (city)

The last two digits indicate the delivery bureau (place)

expression:

[1-9]\d{5}(?!\d)

Case:

pattern = re.compile(r"[1-9]\d{5}(?!\d)")   
strs = "The postcode of Jing'an District, Shanghai is 200040"
result = pattern.findall(strs)   
print(result) 

['200040']

9. Password

Password (beginning with a letter, length between 6~18, can only contain letters, numbers and underscores)

expression:

[a-zA-Z]\w{5,17}

Strong password (begins with a letter, must contain uppercase letters, lowercase letters, and digits, no special characters, length between 8 and 10)

expression:

\b(?=[a-zA-Z0-9]*\d)(?=[a-zA-Z0-9]*[a-z])(?=[a-zA-Z0-9]*[A-Z])[a-zA-Z][a-zA-Z0-9]{7,9}\b

pattern = re.compile(r"[a-zA-Z]\w{5,17}")  
strs = '密码:q123456_abc'  # "密码" means "password"
result = pattern.findall(strs)  
print(result)  

['q123456_abc']

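# The lookaheads require a digit, a lowercase letter and an uppercase letter;
# \b keeps the match to a whole word of 8 to 10 letters and digits.
pattern = re.compile(r"\b(?=[a-zA-Z0-9]*\d)(?=[a-zA-Z0-9]*[a-z])(?=[a-zA-Z0-9]*[A-Z])[a-zA-Z][a-zA-Z0-9]{7,9}\b")
strs = 'Strong password: q123456ABc, weak password: q123456abc'
result = pattern.findall(strs)
print(result)

['q123456ABc']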
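
When the goal is to validate one candidate password rather than to find passwords inside a sentence, fullmatch is a natural fit. The sketch below is one way to encode the strong-password rule stated above; STRONG and is_strong are names made up for this example.

```python
import re

# Starts with a letter, only letters and digits, 8-10 characters in total,
# with at least one digit, one lowercase and one uppercase letter.
STRONG = re.compile(r"(?=.*\d)(?=.*[a-z])(?=.*[A-Z])[a-zA-Z][a-zA-Z0-9]{7,9}")

def is_strong(password):
    # fullmatch requires the whole string to fit the pattern.
    return STRONG.fullmatch(password) is not None

print(is_strong("q123456ABc"))  # True
print(is_strong("q123456abc"))  # False (no uppercase letter)
print(is_strong("q123456AB_"))  # False (special character)
```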

10. Chinese characters

expression:

[\u4e00-\u9fa5]

Case:

pattern = re.compile(r"[\u4e00-\u9fa5]")  
strs = 'apple:苹果'  
result = pattern.findall(strs)  
print(result) 

['苹', '果']
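
The expression matches one character at a time, so findall returns each character separately. As a small variation (the sample string is made up), adding + collects consecutive Chinese characters into whole runs.

```python
import re

chinese_run = re.compile(r"[\u4e00-\u9fa5]+")
words = chinese_run.findall("apple:苹果, banana:香蕉")
print(words)  # ['苹果', '香蕉']
```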

11. Numbers

  •  Verify digits only: ^[0-9]*$
  •  Verify an n-digit number: ^\d{n}$
  •  Verify a number of at least n digits: ^\d{n,}$
  •  Verify a number of m to n digits: ^\d{m,n}$
  •  Verify zero or a number that does not start with zero: ^(0|[1-9][0-9]*)$
  •  Verify a positive real number with two decimal places: ^[0-9]+(\.[0-9]{2})?$
  •  Verify a positive real number with 1-3 decimal places: ^[0-9]+(\.[0-9]{1,3})?$
  •  Verify a non-zero positive integer: ^\+?[1-9][0-9]*$
  •  Verify a non-zero negative integer: ^\-[1-9][0-9]*$
  •  Verify a non-negative integer (positive integer + 0): ^\d+$
  •  Verify a non-positive integer (negative integer + 0): ^((-\d+)|(0+))$
  •  Integer: ^-?\d+$
  •  Non-negative floating-point number (positive floating-point number + 0): ^\d+(\.\d+)?$
  •  Positive floating-point number: ^(([0-9]+\.[0-9]*[1-9][0-9]*)|([0-9]*[1-9][0-9]*\.[0-9]+)|([0-9]*[1-9][0-9]*))$
  •  Non-positive floating-point number (negative floating-point number + 0): ^((-\d+(\.\d+)?)|(0+(\.0+)?))$
  •  Negative floating-point number: ^(-(([0-9]+\.[0-9]*[1-9][0-9]*)|([0-9]*[1-9][0-9]*\.[0-9]+)|([0-9]*[1-9][0-9]*)))$
  •  Floating-point number: ^(-?\d+)(\.\d+)?$
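
These validation expressions are meant to test a whole string. A minimal sketch of how a few of them might be used in Python follows; the CHECKS table and the check function are names invented here. It uses fullmatch, which makes the ^ and $ anchors unnecessary and avoids a subtlety: $ also matches just before a trailing newline.

```python
import re

# A few of the expressions above, written without ^ and $ because
# fullmatch already requires the whole string to match.
CHECKS = {
    "integer": re.compile(r"-?\d+"),
    "non_zero_positive_integer": re.compile(r"\+?[1-9][0-9]*"),
    "two_decimal_places": re.compile(r"[0-9]+(\.[0-9]{2})?"),
}

def check(kind, value):
    return CHECKS[kind].fullmatch(value) is not None

print(check("integer", "-42"))                   # True
print(check("non_zero_positive_integer", "007")) # False
print(check("two_decimal_places", "3.14"))       # True
print(check("two_decimal_places", "3.1"))        # False

# With match and $, a trailing newline slips through; fullmatch rejects it.
print(re.match(r"^\d+$", "12\n") is not None)    # True
print(re.fullmatch(r"\d+", "12\n") is not None)  # False
```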

12. String

  •  English letters and digits: ^[A-Za-z0-9]+$ or ^[A-Za-z0-9]{4,40}$
  •  Any characters, with a length of 3-20: ^.{3,20}$
  •  A string consisting of the 26 English letters: ^[A-Za-z]+$
  •  A string consisting of the 26 uppercase English letters: ^[A-Z]+$
  •  A string consisting of the 26 lowercase English letters: ^[a-z]+$
  •  A string consisting of digits and the 26 English letters: ^[A-Za-z0-9]+$
  •  A string consisting of digits, the 26 English letters, or underscores: ^\w+$ or ^\w{3,20}$
  •  Chinese, English, and digits, including underscores: ^[\u4E00-\u9FA5A-Za-z0-9_]+$
  •  Chinese, English, and digits, but no underscores or other symbols: ^[\u4E00-\u9FA5A-Za-z0-9]+$ or ^[\u4E00-\u9FA5A-Za-z0-9]{2,20}$
  •  Text that contains none of the characters %&',;=?$": [^%&',;=?$\x22]+
  •  Text that contains neither ~ nor ": [^~\x22]+

Appendix: Detailed explanation of regular expression syntax

Character Description
\ Marks the next character as a special character (a file-format escape; see the entries in this table), as a literal character (an identity escape; there are 12 such characters: ^ $ ( ) * + ? . [ \ { |), as a backreference, or as an octal escape. For example, "n" matches the character "n", while "\n" matches a newline. The sequence "\\" matches "\", and "\(" matches "(".
^ Matches the start of the input string.
$ Matches the end of the input string.
* Matches the preceding subexpression zero or more times. For example, "zo*" matches "z", "zo", and "zoo". * is equivalent to {0,}.
+ Matches the preceding subexpression one or more times. For example, "zo+" matches "zo" and "zoo", but not "z". + is equivalent to {1,}.
? Matches the preceding subexpression zero or one time. For example, "do(es)?" matches "do" and "does". ? is equivalent to {0,1}.
{n} n is a non-negative integer. Matches exactly n times. For example, "o{2}" does not match the "o" in "Bob", but matches the two o's in "food".
{n,} n is a non-negative integer. Matches at least n times. For example, "o{2,}" does not match the "o" in "Bob", but matches all the o's in "foooood". "o{1,}" is equivalent to "o+", and "o{0,}" is equivalent to "o*".
{n,m} m and n are both non-negative integers, with n<=m. Matches at least n times and at most m times. For example, "o{1,3}" matches the first three o's in "fooooood". "o{0,1}" is equivalent to "o?". Note that there must be no spaces between the comma and the two numbers.
? Non-greedy quantifier: when this character immediately follows any other quantifier (*, +, ?, {n}, {n,}, {n,m}), matching becomes non-greedy. Non-greedy mode matches as little of the searched string as possible, while the default greedy mode matches as much as possible. For example, for the string "oooo", "o+?" matches a single "o", while "o+" matches all the o's.
. Matches any single character except "\r" and "\n". To match any character including "\r" and "\n", use a pattern like "(.|\r|\n)". (In Python's re module, "." excludes only "\n" by default, and the re.DOTALL flag lifts that exception.)
(pattern) Matches pattern and captures the matched substring, which can then be used in a backreference. The captured matches can be retrieved from the resulting Matches collection: VBScript uses the SubMatches collection, and JScript uses the $0...$9 properties. To match parenthesis characters, use "\(" or "\)". A quantifier suffix is allowed.
(?:pattern) Matches pattern but does not capture the matched substring (a shy group), that is, the match is not stored for later backreference. This is useful when combining parts of a pattern with the or character "|". For example, "industr(?:y|ies)" is a more concise expression than "industry|industries".
(?=pattern) Positive lookahead assertion: matches at any position where the text that follows matches pattern. This is a non-capturing match, that is, the match is not stored for later use. For example, "Windows(?=95|98|NT|2000)" matches the "Windows" in "Windows2000", but not the "Windows" in "Windows3.1". A lookahead does not consume characters: after a match, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead.
(?!pattern) Negative lookahead assertion: matches at any position where the text that follows does not match pattern. This is a non-capturing match, that is, the match is not stored for later use. For example, "Windows(?!95|98|NT|2000)" matches the "Windows" in "Windows3.1", but not the "Windows" in "Windows2000". A lookahead does not consume characters: after a match, the search for the next match starts immediately after the last match, not after the characters that made up the lookahead.
(?<=pattern) Positive lookbehind assertion: like the positive lookahead, but in the opposite direction. For example, "(?<=95|98|NT|2000)Windows" matches the "Windows" in "2000Windows", but not the "Windows" in "3.1Windows". (Python's re module requires a lookbehind pattern of fixed width, so alternatives of different lengths, as in this example, are rejected there.)
(?<!pattern) Negative lookbehind assertion: like the negative lookahead, but in the opposite direction. For example, "(?<!95|98|NT|2000)Windows" matches the "Windows" in "3.1Windows", but not the "Windows" in "2000Windows". (The same fixed-width restriction applies in Python.)
x|y Alternation. When not enclosed in (), its scope is the entire regular expression. For example, "z|food" matches "z" or "food", while "(?:z|f)ood" matches "zood" or "food".
[xyz] Character class. Matches any one of the contained characters. For example, "[abc]" matches the "a" in "plain". Inside the brackets only the backslash \ keeps its special meaning, for escaping characters; other special characters such as the asterisk, the plus sign, and the various brackets are treated as ordinary characters. A caret ^ in the first position negates the class; anywhere else it is an ordinary character. A hyphen - in the middle denotes a character range; in the first or last position it is an ordinary character. A right square bracket must be escaped, or may appear unescaped as the first character.
[^xyz] Negated character class. Matches any character not listed. For example, "[^abc]" matches the "p", "l", "i", and "n" in "plain".
[a-z] Character range. Matches any character in the specified range. For example, "[a-z]" matches any lowercase letter from "a" to "z".
[^a-z] Negated character range. Matches any character that is not in the specified range. For example, "[^a-z]" matches any character that is not in the range "a" to "z".
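[:name:] Adds the characters of a named character class to the expression. Can only be used inside a bracket expression. (Not supported by Python's re module.)
[=elt=] Adds the elements that collate as equivalent to the character "elt" in the current locale. For example, [=a=] may add ä, á, à, ă, ắ, ằ, ẵ, ẳ, â, ấ, ầ, ẫ, ẩ, ǎ, å, ǻ, ä, ǟ, ã, ȧ, ǡ, ą, ā, ả, ȁ, ȃ, ạ, ặ, ậ, ḁ, ⱥ, ᶏ, ɐ, ɑ. Can only be used inside a bracket expression. (Not supported by Python's re module.)
[.elt.] Adds the collating element elt to the expression, because some collating elements consist of more than one character. For example, in the 29-letter Spanish alphabet, "CH" is a single letter sorted after C, which yields the ordering "cinco, credo, chispa". Can only be used inside a bracket expression. (Not supported by Python's re module.)
\b Matches a word boundary, that is, the position between a word and a space. For example, "er\b" matches the "er" in "never", but not the "er" in "verb".
\B Matches a non-word boundary. "er\B" matches the "er" in "verb", but not the "er" in "never".
\cx Matches the control character indicated by x. x must be one of A-Z or a-z; otherwise c is treated as a literal "c". The value of the control character is the lowest 5 bits of the value of x (that is, the remainder modulo 32). For example, \cM matches Control-M, the carriage return. \ca is equivalent to \u0001, \cb is equivalent to \u0002, and so on. (Not supported by Python's re module.)
\d Matches a digit. Equivalent to [0-9]. Note that Unicode regular expressions also match full-width digits.
\D Matches a non-digit character. Equivalent to [^0-9].
\f Matches a form feed. Equivalent to \x0c and \cL.
\n Matches a newline. Equivalent to \x0a and \cJ.
\r Matches a carriage return. Equivalent to \x0d and \cM.
\s Matches any whitespace character, including space, tab, form feed, and so on. Equivalent to [ \f\n\r\t\v]. Note that Unicode regular expressions also match the full-width space.
\S Matches any non-whitespace character. Equivalent to [^ \f\n\r\t\v].
\t Matches a tab. Equivalent to \x09 and \cI.
\v Matches a vertical tab. Equivalent to \x0b and \cK.
\w Matches any word character, including the underscore. Equivalent to "[A-Za-z0-9_]". Note that Unicode regular expressions also match Chinese characters.
\W Matches any non-word character. Equivalent to "[^A-Za-z0-9_]".
\xnn Hexadecimal escape. Matches the character given by the two hexadecimal digits nn. For example, "\x41" matches "A". "\x041" is equivalent to "\x04" followed by "1". This allows ASCII codes to be used in regular expressions.
\num Backreference to a substring, namely the one matched by the num-th parenthesized capture group of the regular expression. num is a positive decimal integer starting from 1; its upper limit may be 9, 31, 99, or even unlimited. For example, "(.)\1" matches two identical consecutive characters.
\n (n is a digit) Identifies an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, if n is an octal digit (0-7), n is an octal escape value.
\nm Identifies an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captures, n is a backreference followed by the literal m. If neither condition holds and n and m are both octal digits (0-7), \nm matches the octal escape value nm.
\nml If n is an octal digit (0-3) and m and l are both octal digits (0-7), matches the octal escape value nml.
\un Unicode escape, where n is a Unicode character expressed as four hexadecimal digits. For example, \u00A9 matches the copyright symbol (©).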
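
A few entries from the table can be checked directly in Python. The snippet below is a small sketch using the table's own examples for greedy and non-greedy quantifiers, lookahead, word boundaries, and backreferences; the word "bookkeeper" is an extra sample chosen here.

```python
import re

# Greedy vs. non-greedy quantifiers.
print(re.findall(r"o+", "oooo"))   # ['oooo']
print(re.findall(r"o+?", "oooo"))  # ['o', 'o', 'o', 'o']

# Lookahead: "Windows" matches only when one of the versions follows.
lookahead = re.compile(r"Windows(?=95|98|NT|2000)")
print(lookahead.search("Windows2000").group())  # Windows
print(lookahead.search("Windows3.1"))           # None

# Word boundary: "er" at the end of "never", but not inside "verb".
print(re.findall(r"er\b", "never verb"))  # ['er']

# Backreference: two identical consecutive characters.
# findall returns group 1, i.e. the repeated character.
print(re.findall(r"(.)\1", "bookkeeper"))  # ['o', 'k', 'e']
```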

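Precedence

Precedence Symbols (listed from highest to lowest)
Highest \
(), (?:), (?=), []
*, +, ?, {n}, {n,}, {n,m}
^, $, metacharacters
Second lowest Concatenation, that is, adjacent characters joined together
Lowest |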
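
The effect of these precedence rules can be seen in a short sketch (the pattern names and test words are chosen for illustration). Because | binds most loosely, an unparenthesized alternation splits the whole expression, anchors included; and because a quantifier binds more tightly than concatenation, it applies only to the element just before it.

```python
import re

# "|" has the lowest precedence: this reads as (^z) or (food$).
loose = re.compile(r"^z|food$")
# Grouping limits the alternation to the first letter.
grouped = re.compile(r"^(?:z|f)ood$")

for word in ["zebra", "seafood", "zood", "food"]:
    print(word, bool(loose.search(word)), bool(grouped.search(word)))

# A quantifier applies to the preceding element only, not to "ab" as a whole.
print(bool(re.fullmatch(r"ab*", "abbb")))      # True
print(bool(re.fullmatch(r"ab*", "abab")))      # False
print(bool(re.fullmatch(r"(?:ab)*", "abab")))  # True
```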

Origin blog.csdn.net/weixin_43881394/article/details/112605566