Rules for Splitting Text with Regular Expressions

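While working through Chapter 4 (naive Bayes) of Machine Learning in Action, I found that the text-splitting step in the data preparation stage did not behave the way the book describes. The likely cause is that the book uses Python 2.x while I use Python 3.x, and the regular-expression behavior differs slightly between the two. This post summarizes how text splitting works.
Every Python string (in both Python 2.x and Python 3.x) has a basic split() method. By default it splits the string on whitespace, which includes spaces, tabs, newlines, and so on, and returns a list that contains no empty strings. split() takes two optional parameters. The first, sep, is the delimiter to split on; its default is None, which means splitting on whitespace. The second, maxsplit, is the maximum number of splits to perform; its default is -1, which means no limit.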

>>> help(d.split)  # d is any str instance
Help on built-in function split:

split(sep=None, maxsplit=-1) method of builtins.str instance
    Return a list of the words in the string, using sep as the delimiter string.
    
    sep
      The delimiter according which to split the string.
      None (the default value) means split according to any whitespace,
      and discard empty strings from the result.
    maxsplit
      Maximum number of splits to do.
      -1 (the default value) means no limit.

>>> a = "I am iron man."
>>> a.split()
['I', 'am', 'iron', 'man.']
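
To make the sep and maxsplit options described above concrete, here is a minimal sketch. The sample strings and variable names are made up for illustration.

```python
text = "I  am\tiron\nman."

# Default (sep=None): split on runs of whitespace and drop empty strings.
default_tokens = text.split()
# ['I', 'am', 'iron', 'man.']

# Explicit sep: every single space is a delimiter, so two consecutive
# spaces leave an empty string in the result; tabs and newlines are kept.
space_tokens = text.split(" ")
# ['I', '', 'am\tiron\nman.']

# maxsplit caps the number of cuts; the remainder stays in one piece.
limited_tokens = "a,b,c,d".split(",", maxsplit=2)
# ['a', 'b', 'c,d']

print(default_tokens, space_tokens, limited_tokens)
```

Note that passing sep explicitly changes the behavior: empty strings are no longer discarded.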

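The regular-expression module re also has a split() function, and it is far more powerful than the built-in string method. After importing re, re.split() accepts four parameters. The first is the pattern, usually a regular expression such as r'\W*', r'\W+', r'\s+', r'\a*', or r' '; the r prefix marks a raw string, so backslashes reach the regex engine unchanged. The second is the string to split. The third, maxsplit, plays the same role as in the built-in split(), although its default here is 0 rather than -1 (both mean no limit). The fourth, flags, is rarely needed. re.split() therefore requires at least two arguments. Each of the patterns above is tested below.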

import re
>>> help(re.split)
Help on function split in module re:

split(pattern, string, maxsplit=0, flags=0)
    Split the source string by the occurrences of the pattern,
    returning a list containing the resulting substrings.  If
    capturing parentheses are used in pattern, then the text of all
    groups in the pattern are also returned as part of the resulting
    list.  If maxsplit is nonzero, at most maxsplit splits occur,
    and the remainder of the string is returned as the final element
    of the list.
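
The help text above mentions maxsplit, flags, and capturing parentheses. The following sketch shows what each does to the result; the second sample string is an assumption chosen for illustration.

```python
import re

a = "I am iron man."

# maxsplit=2: at most two cuts, the rest is returned as the last element.
limited = re.split(r"\s+", a, maxsplit=2)
# ['I', 'am', 'iron man.']

# Capturing parentheses: the matched delimiters are returned as well.
with_delimiters = re.split(r"(\s+)", a)
# ['I', ' ', 'am', ' ', 'iron', ' ', 'man.']

# flags: re.IGNORECASE makes the delimiter 'i' match both 'I' and 'i'.
case_insensitive = re.split(r"i", "Iron Giant", flags=re.IGNORECASE)
# ['', 'ron G', 'ant']

print(limited, with_delimiters, case_insensitive)
```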

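Mind the details: W is uppercase, s and a are lowercase, and r' ' contains a single space. Also note that \a is not a character class; it is the escape for the ASCII bell character, which never occurs in the sample text. Patterns such as \W* and \a* can match the empty string, and starting with Python 3.7 re.split() also splits at those empty matches, which is why they cut the text into individual characters below. Python 2 and Python 3 before 3.7 ignored empty matches, which explains the difference from the book. These are the patterns most commonly used for splitting.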

>>> a = "I am iron man."
>>> re.split(r'\W*', a)
['', 'I', '', 'a', 'm', '', 'i', 'r', 'o', 'n', '', 'm', 'a', 'n', '', '']
>>> re.split(r'\W+', a)
['I', 'am', 'iron', 'man', '']
>>> re.split(r'\s+', a)
['I', 'am', 'iron', 'man.']
>>> re.split(r'\a*', a)
['', 'I', ' ', 'a', 'm', ' ', 'i', 'r', 'o', 'n', ' ', 'm', 'a', 'n', '.', '']
>>> re.split(r' ', a)
['I', 'am', 'iron', 'man.']
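
Based on the results above, \W+ is the pattern that gives word-level tokens under Python 3. Below is a minimal sketch of a tokenizer built on it; the function name text_parse, the length filter, and the sample sentence are assumptions for illustration, not code taken from the book.

```python
import re

def text_parse(big_string):
    """Split on runs of non-word characters and return the lowercase
    tokens that are longer than two characters (illustrative helper)."""
    tokens = re.split(r"\W+", big_string)
    # The length filter also drops the empty string that re.split()
    # leaves when the text ends with punctuation.
    return [tok.lower() for tok in tokens if len(tok) > 2]

sample = "I am iron man. Hello, World!"
parsed = text_parse(sample)
print(parsed)  # ['iron', 'man', 'hello', 'world']
```

Using + instead of * means the pattern can never match the empty string, so the result is the same across Python versions.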

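You can also build the pattern ahead of time with re.compile(). It compiles a regular expression and returns a compiled pattern object, whose split() method works like re.split().
To wrap up, here is a summary of the regular-expression rules used for splitting:
Character  Function
.          Matches any single character except \n
[]         Matches any one of the characters listed inside the brackets
\d         Matches a digit, i.e. 0-9
\D         Matches any non-digit character
\s         Matches whitespace: space, tab (\t), newline (\n), and so on
\S         Matches any non-whitespace character (the inverse of \s)
\w         Matches a word character: a-z, A-Z, 0-9, _ (for str patterns in Python 3, other Unicode letters and digits as well)
\W         Matches any non-word character (the inverse of \w)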

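Character  Function
*          Matches the preceding element zero or more times (it may be absent or repeated without limit), e.g. \W*
+          Matches the preceding element one or more times (it must occur at least once), e.g. \W+
?          Matches the preceding element zero or one time (either present once or absent)
{m}        Matches the preceding element exactly m times
{m,}       Matches the preceding element at least m times
{m,n}      Matches the preceding element between m and n times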
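
The quantifier rules listed above can be checked with a short sketch. It uses re.fullmatch() so that a pattern only counts as matching when it covers the whole string; the helper name fully_matches and the sample strings are assumptions for illustration.

```python
import re

def fully_matches(pattern, text):
    """Return True when the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

# * accepts zero repetitions, + needs at least one.
star_on_empty = fully_matches(r"\d*", "")          # True
plus_on_empty = fully_matches(r"\d+", "")          # False

# ? makes the preceding character optional.
color_us = fully_matches(r"colou?r", "color")      # True
color_uk = fully_matches(r"colou?r", "colour")     # True

# {m}, {m,} and {m,n} bound the repetition count.
exactly_four = fully_matches(r"\d{4}", "2024")     # True
at_least_two = fully_matches(r"\d{2,}", "7")       # False
two_to_four = fully_matches(r"\d{2,4}", "12345")   # False

print(star_on_empty, plus_on_empty, color_us, color_uk,
      exactly_four, at_least_two, two_to_four)
```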

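Combining the character classes in the first table with the quantifiers in the second yields regular expressions that can express almost any text-splitting rule.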

>>> a
'I am iron man.'
>>> reg_1 = re.compile(r'\W*')
>>> reg_1.split(a)
['', 'I', '', 'a', 'm', '', 'i', 'r', 'o', 'n', '', 'm', 'a', 'n', '', '']
>>> reg_2 = re.compile(r'\W+')
>>> reg_2.split(a)
['I', 'am', 'iron', 'man', '']
>>> reg_3 = re.compile(r'\s+')
>>> reg_3.split(a)
['I', 'am', 'iron', 'man.']
>>> reg_4 = re.compile(r'\a*')
>>> reg_4.split(a)
['', 'I', ' ', 'a', 'm', ' ', 'i', 'r', 'o', 'n', ' ', 'm', 'a', 'n', '.', '']
>>> reg_5 = re.compile(r' ')
>>> reg_5.split(a)
['I', 'am', 'iron', 'man.']
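
As a further sketch of combining character classes with quantifiers in a compiled pattern, the example below splits on mixed delimiters. The sample strings and variable names are made up for illustration.

```python
import re

# A bracketed set plus +: any run of commas, semicolons, or whitespace
# counts as one delimiter.
delimiters = re.compile(r"[,;\s]+")
names = delimiters.split("alice, bob;carol   dave")
# ['alice', 'bob', 'carol', 'dave']

# \d+ : split on runs of digits.
letters = re.compile(r"\d+").split("abc123def45ghi")
# ['abc', 'def', 'ghi']

# ' {2,}' : split only where two or more spaces occur in a row,
# so single spaces inside a field are preserved.
columns = re.compile(r" {2,}").split("one two   three    four")
# ['one two', 'three', 'four']

print(names, letters, columns)
```

Compiling once is mainly useful when the same pattern is applied to many strings, for example to every document in a corpus.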

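A few other regular-expression rules are listed here.

Character  Function
^          Matches the start of the string
$          Matches the end of the string
\b         Matches a word boundary
\B         Matches a position that is not a word boundary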

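Character    Function
|            Matches either the expression on its left or the one on its right
(ab)         Treats the characters inside the parentheses as a group
\num         Refers back to the text matched by group number num
(?P<name>)   Gives the group the name "name"
(?P=name)    Refers back to the text matched by the group named "name"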
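
The anchors and grouping rules listed above can be illustrated with a short sketch. All sample strings and variable names here are assumptions chosen for illustration.

```python
import re

# ^ and $ anchor a match to the start and end of the string.
starts_with_iron = re.search(r"^iron", "iron man") is not None   # True
ends_with_man = re.search(r"man$", "iron man") is not None       # True

# \b keeps 'man' from matching inside 'woman' or 'manual'.
whole_words = re.findall(r"\bman\b", "man woman manual man.")
# ['man', 'man']

# | splits on either delimiter.
parts = re.split(r",|;", "a,b;c")
# ['a', 'b', 'c']

# \1 refers back to group 1, which finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)
# 'is'

# (?P<name>...) names a group; (?P=name) refers back to it.
date = re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2024-05")
year, month = date.group("year"), date.group("month")
# '2024', '05'
quoted = re.match(r"(?P<quote>['\"]).*?(?P=quote)", "'hello' world").group(0)
# "'hello'"

print(starts_with_iron, ends_with_man, whole_words, parts,
      doubled, year, month, quoted)
```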
