Python regular expressions: the re module


1 Regular expression concepts

Regular expressions (REs) are a small, highly specialized programming language; in Python they are made available through the re module.

Regular expressions provide the following capabilities:

  • Specify rules for the set of strings you want to match;
  • Match character sequences of variable length;
  • Specify how many times a portion of the regular expression is repeated;
  • Modify or split strings in various ways.
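
The four capabilities listed above can be illustrated with a minimal sketch. The sample strings and variable names are made up for illustration only.

```python
import re

# Specify a rule for the set of strings to match: "cat" or "cot"
rule = re.findall(r'c[ao]t', 'cat cot cut')

# Match a variable-length run of characters: one or more digits
digits = re.findall(r'\d+', 'a1 b22 c333')

# Repeat a portion of the expression a fixed number of times
repeated = re.findall(r'(?:ab){2}', 'ab abab')

# Modify or split a string
replaced = re.sub(r'\s+', ' ', 'too   many    spaces')
pieces = re.split(r',\s*', 'a, b,c')

print(rule, digits, repeated, replaced, pieces)
```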

A regular expression pattern is compiled into a series of bytecodes, which are then executed by a matching engine written in C.

2 Matching characters

2.1 Ordinary characters

Most letters and characters simply match themselves.
For example, the regular expression test matches the string "test" exactly.

import re
r = re.findall('el','hello world')
print(r)    # ['el']

2.2 Metacharacters: . ^ $ * + ? {} [] | () \

2.2.1 The metacharacter .

By default, . matches any single character except the newline character (\n).
If the DOTALL flag is specified, it matches any single character, including the newline.

import re

ret = re.findall('r..time', 'helloruntime')
print(ret)  # ['runtime']
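
To make the DOTALL behaviour concrete, here is a small sketch that runs the same pattern with and without the flag. The test string is invented for this example.

```python
import re

text = 'r\n\ntime'   # two newlines sit between 'r' and 'time'

# Default: . refuses to match \n, so nothing is found
default = re.findall(r'r..time', text)

# With re.DOTALL, . also matches \n
dotall = re.findall(r'r..time', text, re.DOTALL)

print(default)
print(dotall)
```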

2.2.2 The metacharacter ^

^ matches the beginning of a line. Unless the MULTILINE flag is set, it matches only at the beginning of the string;
in MULTILINE mode, it also matches immediately after each newline in the string.

import re

ret = re.findall('^r..time', 'hellorootime')
print(ret)  # []

ret = re.findall('^h..lo', 'hellorootime')
print(ret)  # ['hello']
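
The MULTILINE case is not covered by the example above, so here is a short sketch of it. The three-line sample text is an assumption made for illustration.

```python
import re

text = 'hello\nhey\nworld'

# Without MULTILINE, ^ anchors only at the start of the whole string
single = re.findall(r'^h\w+', text)

# With MULTILINE, ^ also anchors right after every newline
multi = re.findall(r'^h\w+', text, re.MULTILINE)

print(single)
print(multi)
```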

2.2.3 The metacharacter $

$ matches the end of a line, which is defined as either the end of the string or a position followed by a newline character. Without the MULTILINE flag, that newline must be the last character of the string.

import re

ret = re.findall('time$', 'hellorootime')
print(ret)  # ['time']
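
A brief sketch of how $ treats newlines, with and without MULTILINE. The two-line sample text, which ends in a newline, is made up for this example.

```python
import re

text = 'runtime\nbedtime\n'

# Without MULTILINE, $ matches at the very end of the string
# or just before a newline that ends the string
end_only = re.findall(r'\w+time$', text)

# With MULTILINE, $ matches before every newline
each_line = re.findall(r'\w+time$', text, re.MULTILINE)

print(end_only)
print(each_line)
```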

2.2.4 The metacharacter *

  • * specifies that the preceding character can be matched zero or more times, rather than exactly once.
    The matching engine tries to repeat it as many times as possible (bounded by an internal integer limit, given here as 2,000,000,000).
import re

ret=re.findall('abc*','abcccc')     # greedy match, [0, +oo) repetitions of c
print(ret)  # ['abcccc']

ret=re.findall('abc*','ab')
print(ret)  # ['ab']

2.2.5 The metacharacter +

  • + matches the preceding character one or more times.
import re

ret=re.findall('abc+','abcccc')     #[1,+oo]
print(ret)  # ['abcccc']

ret=re.findall('abc+','ab')
print(ret)  # []

2.2.6 The metacharacter ?

? matches the preceding character zero or one time; you can think of it as marking something as optional.
Added after a repetition (+ or *), it enables lazy mode, which produces the shortest possible match.

import re

ret=re.findall('abc?','abccc')      # [0,1]
print(ret)      # ['abc']

ret=re.findall('abc*?','abccc')      # [0]
print(ret)      # ['ab']

ret=re.findall('abc+?','abccc')      # [1]
print(ret)      # ['abc']

Note: *, + and ? are all greedy by default and match as much as possible; appending a ? to any of them makes the match lazy.
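
The difference between greedy and lazy matching is easiest to see on text with repeated delimiters. The HTML-like sample string below is an assumption chosen for illustration.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* runs to the last '>' in the string
greedy = re.findall(r'<.*>', html)

# Lazy: .*? stops at the first '>' it can
lazy = re.findall(r'<.*?>', html)

print(greedy)
print(lazy)
```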

2.2.7 The metacharacter {m,n}

{m,n} matches a specified number of times, where m and n are decimal integers: at least m repetitions and at most n.

Omitting m is interpreted as a lower bound of 0; omitting n gives an upper bound of infinity (in practice about 2 billion).

{0,} is equivalent to *, {1,} is equivalent to +, and {0,1} is the same as ?. Where possible, it is better to use *, + or ?.

import re

ret=re.findall('abc{1,3}','abccc')
print(ret)      # ['abccc']

ret=re.findall('abc{1,3}','abc')
print(ret)      # ['abc']

ret=re.findall('abc{1,3}','ab')
print(ret)      # []

ret=re.findall('abc{0,}','abccccc')
print(ret)      # ['abccccc']

ret=re.findall('abc{0,1}','abccccc')
print(ret)      # ['abc']

ret=re.findall('abc{1}','abccccc')
print(ret)      # ['abc']

ret=re.findall('abc{1,3}','abccccccccc')
print(ret)      # ['abccc']
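
The equivalences between {m,n} and the shorter qualifiers can be checked directly. This is a minimal sketch; the sample text and the names `pairs` and `equivalent` are made up for the example.

```python
import re

text = 'ab abc abcc abccc'

# Each pair should produce identical results on any input
pairs = [
    (r'abc*', r'abc{0,}'),
    (r'abc+', r'abc{1,}'),
    (r'abc?', r'abc{0,1}'),
]
equivalent = [re.findall(a, text) == re.findall(b, text) for a, b in pairs]

# Exactly two repetitions of c
exact = re.findall(r'abc{2}', text)

# m omitted: between zero and two repetitions of c
at_most = re.findall(r'abc{,2}', text)

print(equivalent, exact, at_most)
```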

2.2.8 The metacharacter []

[] specifies a character set: [abc] or [a-z].
Metacharacters are not active inside a character set: [akm$].
A ^ at the beginning of the set matches any character not in the set or range: [^a-z].

import re

ret = re.findall('a[bc]d', 'acd')
print(ret)  # ['acd']

ret = re.findall('[a-z]', 'acd')
print(ret)  # ['a', 'c', 'd']

ret = re.findall('[.*+]', 'a.cd+')
print(ret)  # ['.', '+']

# Symbols that keep a special function inside a character set: - ^ \

ret = re.findall('[1-9]', '45dha3')
print(ret)  # ['4', '5', '3']

ret = re.findall('[^ab]', '45bdha3')
print(ret)  # ['4', '5', 'd', 'h', '3']

ret = re.findall(r'[\d]', '45bdha3')
print(ret)  # ['4', '5', '3']
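
A few more character-set cases, sketched with invented sample strings: metacharacters taken literally, a complemented set, and the ways - and ^ can be used as plain characters.

```python
import re

# Inside [], the characters $ . ( ) are plain characters
literal = re.findall(r'[$.()]', 'price: $4.50 (approx.)')

# A leading ^ complements the set: anything except a-z and space
not_lower = re.findall(r'[^a-z ]', 'ab CD 12')

# An escaped - is a literal hyphen rather than a range
hyphen = re.findall(r'[a\-z]', 'a-z b')

# A ^ that is not first in the set is a literal caret
caret = re.findall(r'[a^]', 'a^b')

print(literal, not_lower, hyphen, caret)
```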

2.2.9 The metacharacter \

A backslash followed by different characters takes on different special meanings:

A backslash followed by a metacharacter removes that character's special function, e.g. \.
A backslash followed by an ordinary character gives it a special function, e.g. \d

ret = re.findall('I\b','I am LIST')
print(ret)      # []
ret = re.findall(r'I\b','I am LIST')
print(ret)      # ['I']
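
As a quick reference, the sketch below exercises some commonly used backslash sequences from the re module (\d, \w, \s, \W and an escaped metacharacter). The sample strings are made up for illustration.

```python
import re

text = 'Room 42, floor_3!'

digits = re.findall(r'\d', text)      # \d: a decimal digit
words = re.findall(r'\w+', text)      # \w: letter, digit or underscore
spaces = re.findall(r'\s', text)      # \s: a whitespace character
non_word = re.findall(r'\W', text)    # \W: anything that \w does not match

# A backslash before a metacharacter makes it literal
dots = re.findall(r'\.', 'a.b.c')

print(digits, words, spaces, non_word, dots)
```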

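Now let's look at the following two examples.

#-----------------------------eg1:
import re
# The subject string is written as r'abc\le' so that it holds a literal backslash.
try:
    # 'c\\l' (like 'c\l') reaches the regex engine as c\l, an unknown escape
    ret = re.findall('c\\l', r'abc\le')
    print(ret)  # [] on old Python versions
except re.error as err:
    print('re.error:', err)     # Python 3.6+ rejects the unknown escape \l
ret = re.findall('c\\\\l', r'abc\le')
print(ret)      # ['c\\l']
ret = re.findall(r'c\\l', r'abc\le')
print(ret)      # ['c\\l']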
 
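#-----------------------------eg2:
# \b is used here because it also has a meaning in the ASCII table (backspace)
m = re.findall('\bblow', 'blow')    # '\b' is a backspace character here
print(m)        # []
m = re.findall(r'\bblow', 'blow')   # r'\b' is the word-boundary assertion
print(m)        # ['blow']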


2.2.10 The grouping metacharacter ()

() groups part of a pattern so that it can be treated as a single unit.
When a pattern contains a group, findall returns only the content captured by the group.

email = r'\w+@\w+\.(com|cn|com\.cn)'

m = re.findall(r'(ad)+', 'add')
print(m)        # ['ad']

n = re.findall(r'(ad)+', 'adaad')
print(n)        # ['ad', 'ad']

ret=re.search(r'(?P<id>\d{2})/(?P<name>\w{3})','23/com')
print(ret.group())      # 23/com
print(ret.group('id'))  # 23
print(ret.group('name'))  # com
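
How findall reports groups depends on how many capturing groups the pattern has. The sketch below shows the cases side by side; the e-mail-like sample text and variable names are invented for illustration, and the pattern is not meant as a real address validator.

```python
import re

text = 'alice@example.com, bob@test.cn'

# One capturing group: findall returns only that group's text
suffixes = re.findall(r'\w+@\w+\.(com|cn)', text)

# Several capturing groups: findall returns a tuple per match
pairs = re.findall(r'(\w+)@(\w+)\.(?:com|cn)', text)

# Only non-capturing groups (?:...): findall returns the whole match
full = re.findall(r'\w+@\w+\.(?:com|cn)', text)

# Named groups are available through groupdict() on a match object
m = re.search(r'(?P<user>\w+)@(?P<host>\w+)', text)
named = m.groupdict()

print(suffixes, pairs, full, named)
```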

2.2.11 The metacharacter |

| matches either the expression on its left or the one on its right.
Whichever alternative matches first in the string becomes the return value of the group() method.

ret = re.search(r'(ab)|\d','rabhdg8sd')
print(ret.group())      # ab

ret = re.search('(hello)|(world)','abcworldhhahello')
print(ret.group())      # world
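
One point worth sketching is that | has very low precedence, so it splits the whole pattern unless a group limits its scope. The sample strings here are made up for the example.

```python
import re

text = 'ab cd abd acd'

# Without a group, the alternatives are the whole of 'ab' and the whole of 'cd'
loose = re.findall(r'ab|cd', text)

# A (non-capturing) group limits | to the choice between b and c
grouped = re.findall(r'a(?:b|c)d', text)

# Alternatives are tried left to right; the first one that succeeds wins
first_wins = re.search(r'a|ab', 'ab').group()

print(loose, grouped, first_wins)
```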

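3 Compiling regular expressions

The re module provides an interface to the regular expression engine: the re.compile() function compiles an RE string into a pattern object, which is then used to perform matching.

A compiled regular expression can be reused without being compiled again, so it is faster than passing the pattern string on every call when the same expression is used many times.

Compiling a regular expression: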

import re

phone_number = re.compile(r'^\d{3,4}-?\d{8}$')
print(phone_number)     # re.compile('^\\d{3,4}-?\\d{8}$')

print(phone_number.findall('010-12345678'))     # ['010-12345678']

print(phone_number.findall('010-123456789'))    # []

print(phone_number.findall('0120-12345678'))    # ['0120-12345678']

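re.compile() also accepts an optional flags argument, commonly used to enable special features and syntax variations.

p = re.compile(r'ab*',re.IGNORECASE)

Compilation flags: each flag has a full name and a single-letter short form that can be used in its place, e.g. re.S can replace re.DOTALL.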
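
The following sketch exercises a few of the standard flags (IGNORECASE, DOTALL, MULTILINE, VERBOSE) and their short forms. The sample strings and the `phone` pattern layout are assumptions made for illustration.

```python
import re

# Short and long names refer to the same flag
same_flag = (re.S == re.DOTALL) and (re.I == re.IGNORECASE) and (re.M == re.MULTILINE)

# IGNORECASE: letters match regardless of case
ignore_case = re.compile(r'ab*', re.IGNORECASE).findall('AbB aB')

# Flags are combined with |: here case is ignored and . may match \n
combined = re.findall(r'hello.world', 'HELLO\nWORLD', re.I | re.S)

# VERBOSE (re.X): whitespace and # comments inside the pattern are ignored
phone = re.compile(r"""
    ^\d{3,4}   # area code
    -?         # optional hyphen
    \d{8}$     # subscriber number
""", re.VERBOSE)
verbose_ok = phone.search('010-12345678') is not None

print(same_flag, ignore_case, combined, verbose_ok)
```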


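3.1 Common functions for performing matches

Pattern object ('RegexObject') instances have several methods and attributes; the complete list can be found in the Python Library Reference.

  • match(): determines whether the RE matches at the beginning of the string
  • search(): scans the string for a location where the RE matches, wherever in the string it is
  • findall(): finds all substrings matched by the RE and returns them as a list
  • finditer(): finds all substrings matched by the RE and returns them as an iterator

Note: match() and search() return None if no match is found. If the match succeeds, they return a 'MatchObject' instance.

MatchObject instance methods:

  • group(): returns the string matched by the RE
  • groupdict(): returns a dictionary mapping each named group to the substring it matched
  • start(): returns the starting position of the match
  • end(): returns the ending position of the match
  • span(): returns a tuple containing the (start, end) positions of the match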
import re
ret = re.search('(?P<id>[0-9]+)','abc1234daf@34')
print(ret.group())          # 1234

print(ret.groupdict())     # {'id': '1234'}
print(ret.start())          # 3
print(ret.end())            # 7
print(ret.span())           # (3, 7)

In real programs, the most common approach is to store the 'MatchObject' in a variable and then check whether it is None.

import re
p = re.compile('(?P<id>[0-9]+)')
m = p.match('abc1234daf@34')
if m:
    print('Match found: ',m.group())
else:
    print('No match')
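
The example above prints No match because match() only tries the start of the string. A small sketch contrasting it with search() on the same input follows; the helper name `first_number` is hypothetical and not part of the re module.

```python
import re

p = re.compile(r'[0-9]+')
text = 'abc1234daf@34'

at_start = p.match(text)      # None: the string does not start with a digit
anywhere = p.search(text)     # finds the first run of digits

def first_number(s):
    """Return the first run of digits in s, or None if there is none."""
    m = p.search(s)
    return m.group() if m else None

print(at_start)
print(anywhere.group(), anywhere.span())
print(first_number('no digits here'))
```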

3.2 Module-level functions

The re module also provides top-level functions, such as match(), search(), sub(), subn(), split() and findall().

3.2.1 sub() and subn()

Given a regular expression pattern, sub() applies it to the string string and replaces the matched content with repl:
re.sub(pattern, repl, string, count=0)
The example below changes hello world into hello python.

import re
ret = re.sub(r'w...d','python','hello world')
print(ret)      # hello python

ret = re.sub(r'w...d','python','hello world world world wordd',count=2)
print(ret)      # hello python python world wordd


# subn returns a tuple of the resulting string and the number of substitutions made
ret = re.subn(r'w...d','python','hello world world wordd')
print(ret)      # ('hello python python python', 3)
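
Besides a fixed string, repl can refer back to groups or be a function. The following is a minimal sketch of both; the sample strings and the helper name `double` are made up for illustration.

```python
import re

# \1 and \2 in repl refer to the text captured by groups 1 and 2
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'hello world')

# repl may be a function that receives the match object
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d+', double, 'a1 b22')

# count limits the number of replacements; subn also reports how many were made
limited, n = re.subn(r'o', '0', 'foo boo', count=1)

print(swapped, doubled, limited, n)
```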

3.2.2 split()

Given a regular expression pattern, split() uses it as the delimiter to split the string string:
re.split(pattern, string)
The example below splits the string on the delimiters -, * and +.

import re
ret = re.split('[-+*]','123+456-789*000')
print(ret)      # ['123', '456', '789', '000']

ret = re.split('[ab]','haabcd')     # split on 'a' first to get 'h', '' and 'bcd', then split each of 'h', '' and 'bcd' on 'b'
print(ret)      # ['h', '', '', 'cd']

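Could [-+*] be written as [+-*] instead? No: there the - sits between two characters and is read as a range, so keep the - first or last in the set, or escape it.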
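
The placement of - inside a character set can be checked with a short sketch, which also shows two other split() features: a capturing group that keeps the delimiters, and maxsplit. Variable names are chosen for this example only.

```python
import re

text = '123+456-789*000'

first = re.split(r'[-+*]', text)      # - first in the set: literal
last = re.split(r'[+*-]', text)       # - last in the set: literal
escaped = re.split(r'[+\-*]', text)   # escaped -: literal anywhere

# - between + and * is read as a range, and +-* is not a valid range
try:
    re.compile(r'[+-*]')
    middle_ok = True
except re.error:
    middle_ok = False

# A capturing group in the pattern keeps the delimiters in the result
with_ops = re.split(r'([-+*])', text)

# maxsplit limits the number of splits
once = re.split(r'[-+*]', text, maxsplit=1)

print(first, middle_ok, with_ops, once)
```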

3.2.3 findall()

Returns all results that match the pattern, as a list.

import re
ret = re.findall('e','sean cheng')
print(ret)      # ['e', 'e']

ret = re.findall(r'www\.(baidu|runtime)\.com', 'www.runtime.com')
print(ret)  # ['runtime']     findall gives priority to the group's content; to get the whole match, make the group non-capturing with ?:

ret = re.findall(r'www\.(?:baidu|runtime)\.com', 'www.runtime.com')
print(ret)  # ['www.runtime.com']

ret = re.findall(r"<(?P<tag_name>\w+)>\w+</(?P=tag_name)>","<h1>hello</h1>")
print(ret)      # ['h1']

3.2.4 search()

Scans the string for the pattern and returns an object containing the match information for the first match found. The matched string can be obtained by calling the object's group() method; if nothing in the string matches, None is returned.

import re
ret = re.search('e','sean cheng').group()
print(ret)      # e

ret = re.search(r"<(?P<tag_name>\w+)>\w+</(?P=tag_name)>","<h1>hello</h1>")
print(ret.group())      # <h1>hello</h1>

ret = re.search(r"<(\w+)>\w+</\1>","<h1>hello</h1>")
print(ret.group())      # <h1>hello</h1>

3.2.5 match()

Same as search(), but matches only at the beginning of the string.

import re
ret = re.match('s','sean cheng').group()
print(ret)      # s

3.2.6 finditer()

Returns the matches as an iterator. Results can be taken out one at a time with next(), or in sequence with a for loop.

import re
ret = re.finditer(r'\d', 'ds3sy4784a')
print(ret)  # <callable_iterator object at 0x0000000002100AC8>

# print(next(ret).group())    # 3
# print(next(ret).group())    # 4
# print(next(ret).group())    # 7

for i in ret:
    print(i.group())
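
Because finditer() yields match objects rather than strings, each result also carries its position. A minimal sketch, reusing the sample string from above with variable names chosen for illustration:

```python
import re

# Collect each match together with its (start, end) span
matches = [(m.group(), m.span()) for m in re.finditer(r'\d+', 'ds3sy4784a')]

# The iterator is consumed as it is read; next() with a default avoids StopIteration
it = re.finditer(r'\d+', 'ds3sy4784a')
first = next(it).group()
second = next(it).group()
third = next(it, None)

print(matches)
print(first, second, third)
```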

Origin blog.csdn.net/qq_43141726/article/details/104614656