Introduction to the collections module and the re module

Supplementary modules

  1. collections module: provides specialized data types that extend Python's built-in ones.

    • In addition to the built-in data types (dict, list, set, tuple), the collections module provides several extra types: Counter, deque, defaultdict, namedtuple, OrderedDict, and others.

    • namedtuple: creates a tuple type whose elements can be accessed by name

      A tuple can represent an immutable collection. For example, the two-dimensional coordinates of a point can be written as p = (1, 2). However, looking at (1, 2) alone, it is hard to tell that the tuple represents a coordinate. This is where namedtuple comes in:
      >>> from collections import namedtuple
      >>> Point = namedtuple('Point', ['x', 'y'])
      >>> p = Point(1, 2)
      >>> p.x
      1
      >>> p.y
      2
      Similarly, a circle described by its coordinates and radius can be defined with namedtuple, using the form namedtuple('name', [list of attributes]):
      Circle = namedtuple('Circle', ['x', 'y', 'r'])

      from collections import namedtuple
      Point = namedtuple('Point', ['x', 'y'])
      print(type(Point))   # <class 'type'>
      p = Point(1, 2)
      print(type(p))
      print(p)      # Point(x=1, y=2)
      print(p[0])   # access by index
      print(p[1])
      print(p.x)    # access by name
      print(p.y)

      # time.struct_time behaves like a named tuple
      import time
      struct_time = time.strptime('2019-7-2', '%Y-%m-%d')
      print(struct_time)
      print(struct_time[0])        # 2019
      print(struct_time.tm_yday)   # 183

      # A similar type can be built with namedtuple
      from collections import namedtuple
      struct_time = namedtuple('struct_time', ['tm_year', 'tm_mon', 'tm_mday'])
      st = struct_time(2019, 7, 2)
      print(st)   # struct_time(tm_year=2019, tm_mon=7, tm_mday=2)
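The Circle definition above can be extended into a small sketch of what a namedtuple offers beyond attribute access. The `area` helper is hypothetical and exists only for illustration; `_replace`, `_asdict`, and `_fields` are part of the namedtuple API.

```python
from collections import namedtuple
import math

# Circle mirrors the definition in the text: center (x, y) and radius r.
Circle = namedtuple('Circle', ['x', 'y', 'r'])


def area(circle):
    """Hypothetical helper: area of a circle given as a Circle namedtuple."""
    return math.pi * circle.r ** 2


c = Circle(x=0, y=0, r=2)
bigger = c._replace(r=3)   # namedtuples are immutable; _replace returns a new instance
as_dict = c._asdict()      # mapping of field name -> value
x, y, r = c                # still unpacks like a plain tuple

print(Circle._fields)      # ('x', 'y', 'r')
print(bigger)              # Circle(x=0, y=0, r=3)
```

Because a namedtuple is still a tuple, it can be passed anywhere a plain tuple is expected, which makes it a low-cost way to give existing tuple-based code readable field names.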
    • deque: a double-ended queue that supports fast appends and pops at both ends

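      When data is stored in a list, accessing elements by index is fast, but inserting or deleting elements anywhere except at the end is slow: a list is stored contiguously, so the elements after the change must be shifted, which becomes costly for large lists.
      deque is a double-ended queue designed for efficient insertion and deletion at both ends, which makes it well suited to queues and stacks:
      >>> from collections import deque
      >>> q = deque(['a', 'b', 'c'])
      >>> q.append('x')
      >>> q.appendleft('y')
      >>> q
      deque(['y', 'a', 'b', 'c', 'x'])
      Besides append() and pop(), which list also provides, deque supports appendleft() and popleft(), so elements can be added to or removed from the head very efficiently.
      deque: a list-like container with efficient insertion and deletion at both ends.
      from collections import deque
      q = deque(['a', 1, 'c', 'd'])
      print(q)
      q.append('e')
      q.append('f')
      print(q)
      q.appendleft('ly')
      q.appendleft('dsb')
      print(q)
      q.pop()
      q.popleft()
      print(q)

      # access a value by index
      print(q[0])

      # delete any value by index
      del q[2]
      print(q)
      q.insert(1, '2')
      print(q)

      # a plain dict compared with an OrderedDict
      d = dict([('a', 1), ('b', 2), ('c', 3)])
      print(d)
      from collections import OrderedDict
      od = OrderedDict([('a', 1), ('b', 2), ('c', 3)])
      print(od)
      print(od['a'])
      print(od['b'])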
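As a small sketch of the queue use case mentioned above, the following shows a bounded buffer and a first-in, first-out queue. The variable names are illustrative; `maxlen` and `rotate` are standard deque features not used in the original examples.

```python
from collections import deque

# A bounded buffer: once maxlen items are stored, appending drops the oldest one.
recent = deque(maxlen=3)
for value in [1, 2, 3, 4, 5]:
    recent.append(value)
print(recent)            # deque([3, 4, 5], maxlen=3)

# FIFO queue: append on the right, popleft on the left.
tasks = deque(['a', 'b', 'c'])
first = tasks.popleft()  # 'a'
tasks.rotate(1)          # shift every element one step to the right
print(tasks)             # deque(['c', 'b'])
```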
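    • Counter: a counter, mainly used for counting occurrences

      The purpose of the Counter class is to track how many times each value occurs. It is a dict subclass that stores elements as keys and their counts as values. A count can be any integer, including zero and negative numbers. The Counter class is similar to bags or multisets in other languages.
      from collections import Counter
      c = Counter('abcdeabcdabcaba')
      print(c)
      # Output: Counter({'a': 5, 'b': 4, 'c': 3, 'd': 2, 'e': 1})

      c = Counter('flkjdasffdfakjsfdsaklfdsalf')   # counter
      print(c)
      print(c['f'])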
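A short sketch of a few other Counter operations, assuming a made-up sentence as input: `most_common`, lookups of missing keys, `update`, and subtraction.

```python
from collections import Counter

words = 'the quick brown fox jumps over the lazy dog the end'.split()
counts = Counter(words)

top = counts.most_common(1)      # the single most frequent word with its count
missing = counts['cat']          # absent keys count as 0 instead of raising KeyError

counts.update(['fox', 'fox'])    # add more observations to the existing counts
diff = Counter('aab') - Counter('ab')   # subtraction keeps only positive counts

print(top)       # [('the', 3)]
print(diff)      # Counter({'a': 1})
```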
    • OrderedDict: ordered dictionary

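      Before Python 3.7, the keys of a dict were not guaranteed to keep any particular order, so the key order seen during iteration could not be relied on. Since Python 3.7 a regular dict preserves insertion order, but OrderedDict is still useful for code that runs on older versions or needs its order-aware features.
      To keep keys in order explicitly, use OrderedDict:
      >>> from collections import OrderedDict
      >>> d = dict([('a', 1), ('b', 2), ('c', 3)])
      >>> d  # on Python versions before 3.7 the key order may differ
      {'a': 1, 'b': 2, 'c': 3}
      >>> od = OrderedDict([('a', 1), ('b', 2), ('c', 3)])
      >>> od  # the keys of an OrderedDict are ordered (the exact repr varies by Python version)
      OrderedDict([('a', 1), ('b', 2), ('c', 3)])
      Note that the keys of an OrderedDict are kept in insertion order, not sorted by the keys themselves:
      >>> od = OrderedDict()
      >>> od['z'] = 1
      >>> od['y'] = 2
      >>> od['x'] = 3
      >>> list(od.keys())  # keys are returned in insertion order
      ['z', 'y', 'x']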
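The following sketch shows a few order-aware features that OrderedDict has and a plain dict lacks: `move_to_end`, `popitem(last=False)`, and order-sensitive equality. The keys and values are arbitrary.

```python
from collections import OrderedDict

od = OrderedDict([('a', 1), ('b', 2), ('c', 3)])
od.move_to_end('a')               # 'a' becomes the last key
print(list(od))                   # ['b', 'c', 'a']

oldest = od.popitem(last=False)   # removes and returns the first item
print(oldest)                     # ('b', 2)

# Equality between two OrderedDicts takes order into account; plain dicts ignore it.
same_order = OrderedDict([('a', 1), ('b', 2)]) == OrderedDict([('b', 2), ('a', 1)])
plain_dicts = {'a': 1, 'b': 2} == {'b': 2, 'a': 1}
print(same_order, plain_dicts)    # False True
```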
    • defaultdict: Dictionary with default values

      Given the collection of values [11, 22, 33, 44, 55, 66, 77, 88, 99, 90, ...], store all values greater than 66 under the first key of a dictionary and all remaining values (66 or less) under the second key.
      That is: {'k1': values greater than 66, 'k2': values of 66 or less}
      Solution with a plain dict:
      li = [11, 22, 33, 44, 55, 77, 88, 99, 90]
      result = {}
      for row in li:
          if row > 66:
              if 'k1' not in result:
                  result['k1'] = []
              result['k1'].append(row)
          else:
              if 'k2' not in result:
                  result['k2'] = []
              result['k2'].append(row)
      print(result)

      Solution with defaultdict:
      from collections import defaultdict

      values = [11, 22, 33, 44, 55, 66, 77, 88, 99, 90]

      my_dict = defaultdict(list)

      for value in values:
          if value > 66:
              my_dict['k1'].append(value)
          else:
              my_dict['k2'].append(value)

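      With a plain dict, looking up a key that does not exist raises KeyError. To get a default value for missing keys instead, use defaultdict:
      >>> from collections import defaultdict
      >>> dd = defaultdict(lambda: 'N/A')
      >>> dd['key1'] = 'abc'
      >>> dd['key1']  # key1 exists
      'abc'
      >>> dd['key2']  # key2 does not exist, so the default value is returned
      'N/A'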
      
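      Example 2
      from collections import defaultdict

      # Plain dict version
      l1 = [11, 22, 33, 44, 55, 77, 88, 99]
      dic = {}
      for i in l1:
          if i < 66:
              if 'key1' not in dic:
                  dic['key1'] = []
              dic['key1'].append(i)
          else:
              if 'key2' not in dic:
                  dic['key2'] = []
              dic['key2'].append(i)
      print(dic)

      # defaultdict version
      l1 = [11, 22, 33, 44, 55, 77, 88, 99]
      dic = defaultdict(list)
      for i in l1:
          if i < 66:
              dic['key1'].append(i)
          else:
              dic['key2'].append(i)
      print(dic)

      # The argument to defaultdict must be a callable
      dic = defaultdict(list)
      dic['1'] = 222
      dic['2']   # missing key: list() is called and the resulting [] is stored
      dic['3']
      print(dic)   # defaultdict(<class 'list'>, {'1': 222, '2': [], '3': []})

      print(list())   # []; each call creates a new empty list
      print(list())
      dic = dict.fromkeys('123', [])   # caution: all three keys share the same list object
      print(dic)

      dic = defaultdict(lambda: None)
      # dic = defaultdict(None) would set no default factory, so dic[i] would raise KeyError
      for i in range(1, 4):
          dic[i]
      print(dic)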
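Two common defaultdict idioms, sketched with made-up input: grouping with `defaultdict(list)` and counting with `defaultdict(int)`. The last lines illustrate that a membership test does not create a key, while indexing does.

```python
from collections import defaultdict

names = ['alex', 'barry', 'wusir', 'anna', 'bob']

# Group names by their first letter; a missing key starts as an empty list.
by_initial = defaultdict(list)
for name in names:
    by_initial[name[0]].append(name)

# Count characters; a missing key starts at int() == 0.
letter_counts = defaultdict(int)
for ch in 'hello':
    letter_counts[ch] += 1

had_z_before = 'z' in letter_counts   # False: `in` does not insert anything
value = letter_counts['z']            # 0, and the key 'z' is now stored
had_z_after = 'z' in letter_counts    # True
```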
  2. The re module: regular expressions. Given a large body of text, a regular expression finds the strings you want; the key is to describe precisely the string you are looking for.

    • What is a regular expression: a regular expression is a combination of symbols with special meanings that describes a set of strings; in other words, a rule that describes a class of things. In Python, regular expressions are built in and implemented by the re module. A regular expression pattern is compiled into a series of bytecodes, which are then executed by a matching engine written in C.

      Metacharacter   Matches
      \w              A letter (including Chinese characters), digit, or underscore
      \W              Any character that is not a letter (including Chinese characters), digit, or underscore
      \s              Any whitespace character
      \S              Any non-whitespace character
      \d              A digit
      \D              Any non-digit character
      \A              The start of the string
      \Z              The end of the string
      \n              A newline
      \t              A tab
      ^               The start of the string
      $               The end of the string, or just before a newline at the end of the string
      .               Any character except a newline; with the re.DOTALL flag, any character including a newline
      [...]           Any one character in the set
      [^...]          Any one character not in the set
      *               Zero or more repetitions of the preceding expression (greedy)
      +               One or more repetitions of the preceding expression (greedy)
      ?               Zero or one repetition of the preceding expression; placed after another quantifier, it makes that quantifier non-greedy
      {n}             Exactly n repetitions of the preceding expression
      {n,m}           Between n and m repetitions of the preceding expression (greedy)
      a|b             Either a or b
      ()              Groups the enclosed expression and captures what it matches

      Matching pattern examples

      # ---------------- Matching patterns ----------------

      # 1. Common string operations covered earlier: literal, one-to-one matching
      s1 = 'fdskahf太白金星'
      print(s1.find('太白'))   # 7

      # 2. Regular expression matching

      # Single-character matching
      import re
      # \w and \W
      print(re.findall(r'\w', '太白jx 12*() _'))   # ['太', '白', 'j', 'x', '1', '2', '_']
      print(re.findall(r'\W', '太白jx 12*() _'))   # [' ', '*', '(', ')', ' ']


      # \s and \S
      print(re.findall(r'\s', '太白barry*(_ \t \n'))   # [' ', '\t', ' ', '\n']
      print(re.findall(r'\S', '太白barry*(_ \t \n'))   # ['太', '白', 'b', 'a', 'r', 'r', 'y', '*', '(', '_']


      # \d and \D
      print(re.findall(r'\d', '1234567890 alex *(_'))   # ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0']
      print(re.findall(r'\D', '1234567890 alex *(_'))   # [' ', 'a', 'l', 'e', 'x', ' ', '*', '(', '_']

      # \A and ^
      print(re.findall(r'\Ahel', 'hello 太白金星 -_- 666'))   # ['hel']
      print(re.findall(r'^hel', 'hello 太白金星 -_- 666'))   # ['hel']


      # \Z and $ (\z is rejected as a bad escape by many Python 3 versions; use \Z instead)
      print(re.findall(r'666\Z', 'hello 太白金星 *-_-* \n666'))   # ['666']
      print(re.findall(r'666$', 'hello 太白金星 *-_-* \n666'))   # ['666']

      # \n and \t
      print(re.findall(r'\n', 'hello \n 太白金星 \t*-_-*\t \n666'))   # ['\n', '\n']
      print(re.findall(r'\t', 'hello \n 太白金星 \t*-_-*\t \n666'))   # ['\t', '\t']
      
      
      # Repetition matching

      # . ? * + {m,n} .* .*?

      # . matches any character except a newline (with re.DOTALL it also matches \n).
      print(re.findall('a.b', 'ab aab a*b a2b a牛b a\nb'))   # ['aab', 'a*b', 'a2b', 'a牛b']
      print(re.findall('a.b', 'ab aab a*b a2b a牛b a\nb', re.DOTALL))   # ['aab', 'a*b', 'a2b', 'a牛b', 'a\nb']


      # ? matches zero or one occurrence of the expression to its left.
      print(re.findall('a?b', 'ab aab abb aaaab a牛b aba**b'))   # ['ab', 'ab', 'ab', 'b', 'ab', 'b', 'ab', 'b']


      # * matches zero or more occurrences of the expression to its left (greedy).
      print(re.findall('a*b', 'ab aab aaab abbb'))   # ['ab', 'aab', 'aaab', 'ab', 'b', 'b']
      print(re.findall('ab*', 'ab aab aaab abbbbb'))   # ['ab', 'a', 'ab', 'a', 'a', 'ab', 'abbbbb']


      # + matches one or more occurrences of the expression to its left (greedy).
      print(re.findall('a+b', 'ab aab aaab abbb'))   # ['ab', 'aab', 'aaab', 'ab']


      # {m,n} matches between m and n occurrences of the expression to its left (greedy).
      print(re.findall('a{2,4}b', 'ab aab aaab aaaaabb'))   # ['aab', 'aaab', 'aaaab']


      # .* is greedy: it matches as much as possible, from the first candidate to the last.
      print(re.findall('a.*b', 'ab aab a*()b'))   # ['ab aab a*()b']


      # .*? Here the ? does not mean "zero or one occurrence of the character to its left";
      # it modifies the greedy .* so that it matches as little as possible (non-greedy). Recommended!
      print(re.findall('a.*?b', 'ab a1b a*()b, aaaaaab'))   # ['ab', 'a1b', 'a*()b', 'aaaaaab']
      
      
      # []: a pair of brackets holds a set of characters and matches exactly one character from it
      # - inside [] denotes a range; to match a literal -, do not put it in the middle (place it first or last).
      # ^ at the start of [] negates the set.
      print(re.findall('a.b', 'a1b a3b aeb a*b arb a_b'))   # ['a1b', 'a3b', 'aeb', 'a*b', 'arb', 'a_b']
      print(re.findall('a[abc]b', 'aab abb acb adb afb a_b'))   # ['aab', 'abb', 'acb']
      print(re.findall('a[0-9]b', 'a1b a3b aeb a*b arb a_b'))   # ['a1b', 'a3b']
      print(re.findall('a[a-z]b', 'a1b a3b aeb a*b arb a_b'))   # ['aeb', 'arb']
      print(re.findall('a[a-zA-Z]b', 'aAb aWb aeb a*b arb a_b'))   # ['aAb', 'aWb', 'aeb', 'arb']
      print(re.findall('a[0-9][0-9]b', 'a11b a12b a34b a*b arb a_b'))   # ['a11b', 'a12b', 'a34b']
      print(re.findall('a[*-+]b', 'a-b a*b a+b a/b a6b'))   # ['a*b', 'a+b']
      # Above, - sits between * and +, so it is read as the range from * to + and does not match a literal -.
      print(re.findall('a[-*+]b', 'a-b a*b a+b a/b a6b'))   # ['a-b', 'a*b', 'a+b']
      print(re.findall('a[^a-z]b', 'acb adb a3b a*b'))   # ['a3b', 'a*b']

      # Exercise:
      # From the string 'alex_sb ale123_sb wusir12_sb wusir_sb ritian_sb', extract alex, wusir, and ritian
      print(re.findall('([a-z]+)_sb', 'alex_sb ale123_sb wusir12_sb wusir_sb ritian_sb'))   # ['alex', 'wusir', 'ritian']
      
      
      # Grouping:

      # () defines a sub-pattern; findall returns only the part matched inside the group
      print(re.findall('(.*?)_sb', 'alex_sb wusir_sb 日天_sb'))   # ['alex', ' wusir', ' 日天']

      # Practical example:
      print(re.findall('href="(.*?)"', '<a href="http://www.baidu.com">点击</a>'))   # ['http://www.baidu.com']


      # | matches either the left side or the right side
      print(re.findall('alex|太白|wusir', 'alex太白wusiraleeeex太太白odlb'))   # ['alex', '太白', 'wusir', '太白']
      print(re.findall('compan(y|ies)', 'Too many companies have gone bankrupt, and the next one is my company'))   # ['ies', 'y']
      print(re.findall('compan(?:y|ies)', 'Too many companies have gone bankrupt, and the next one is my company'))   # ['companies', 'company']
      # Adding ?: at the start of a group makes it non-capturing, so findall returns the whole match rather than only the content of ().
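To make the two distinctions above concrete (greedy versus non-greedy, capturing versus non-capturing), here is a minimal side-by-side sketch. The HTML snippet and sentence are made-up inputs.

```python
import re

html = '<b>bold</b> and <i>italic</i>'

# Greedy: .* runs to the last '>' in the string, giving one long match.
greedy = re.findall(r'<.*>', html)
# Non-greedy: .*? stops at the first '>' it can, giving one match per tag.
lazy = re.findall(r'<.*?>', html)

text = 'one company, two companies'
# With a capturing group, findall returns only what the group matched.
captured = re.findall(r'compan(y|ies)', text)
# With a non-capturing group (?:...), findall returns the whole match.
whole = re.findall(r'compan(?:y|ies)', text)

print(greedy)    # ['<b>bold</b> and <i>italic</i>']
print(lazy)      # ['<b>', '</b>', '<i>', '</i>']
print(captured)  # ['y', 'ies']
print(whole)     # ['company', 'companies']
```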

      Examples of common methods

      import re

      # 1. findall: finds all matches and returns them as a list.
      print(re.findall('a', 'alexwusirbarryeval'))   # ['a', 'a', 'a']


      # 2. search: scans for the first match and returns a match object; call group() on it to get the matched string. Returns None if nothing matches.
      print(re.search('sb|alex', 'alex sb sb barry 日天'))   # <re.Match object; span=(0, 4), match='alex'>
      print(re.search('alex', 'alex sb sb barry 日天').group())   # alex


      # 3. match: like search, but matches only at the start of the string and returns None otherwise; search with ^ can replace match.
      print(re.match('barry', 'barry alex wusir 日天'))   # <re.Match object; span=(0, 5), match='barry'>
      print(re.match('barry', 'barry alex wusir 日天').group())   # barry


      # 4. split: splits on any of the given separators
      print(re.split('[ :;,]', 'alex wusir,日天,太白;女神;肖锋:吴超'))   # ['alex', 'wusir', '日天', '太白', '女神', '肖锋', '吴超']


      # 5. sub: replacement

      print(re.sub('barry', '太白', 'barry是最好的讲师,barry就是一个普通老师,请不要将barry当男神对待。'))
      # 太白是最好的讲师,太白就是一个普通老师,请不要将太白当男神对待。
      print(re.sub('barry', '太白', 'barry是最好的讲师,barry就是一个普通老师,请不要将barry当男神对待。', count=2))
      # 太白是最好的讲师,太白就是一个普通老师,请不要将barry当男神对待。
      print(re.sub(r'([a-zA-Z]+)([^a-zA-Z]+)([a-zA-Z]+)([^a-zA-Z]+)([a-zA-Z]+)', r'\5\2\3\4\1', 'alex is sb'))
      # sb is alex

      # 6. compile: compiles a pattern into a reusable object
      obj = re.compile(r'\d{2}')

      print(obj.search('abc123eeee').group())   # 12
      print(obj.findall('abc123eeee'))   # ['12'], reusing obj


      # 7. finditer: returns an iterator over the match objects
      ret = re.finditer(r'\d', 'ds3sy4784a')
      print(ret)   # <callable_iterator object at 0x10195f940>
      print(next(ret).group())   # the first result
      print(next(ret).group())   # the second result
      print([i.group() for i in ret])   # the remaining results
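A compact sketch combining several of the methods above: a compiled pattern reused for `finditer`, `sub` with a replacement function, and a guarded `search`. The pattern name `number` is illustrative; the `None` check matters because calling `.group()` on a failed search raises AttributeError.

```python
import re

number = re.compile(r'\d+')   # compiled once, reused below
text = 'ds3sy4784a'

# finditer yields match objects, which also carry the match position.
positions = [(m.group(), m.span()) for m in number.finditer(text)]
print(positions)   # [('3', (2, 3)), ('4784', (5, 9))]

# sub accepts a function that receives each match and returns the replacement text.
doubled = number.sub(lambda m: str(int(m.group()) * 2), text)
print(doubled)     # ds6sy9568a

# search returns None when nothing matches, so check before calling group().
m = number.search('no digits here')
found = m.group() if m else None
print(found)       # None
```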

      Named group examples (for reference)

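      # Named group matching:
      ret = re.search(r"<(?P<tag_name>\w+)>\w+</(?P=tag_name)>", "<h1>hello</h1>")
      # A group can be given a name with the form (?P<name>...)
      # The matched value can then be retrieved directly with group('name')
      print(ret.group('tag_name'))  # result: h1
      print(ret.group())  # result: <h1>hello</h1>

      ret = re.search(r"<(\w+)>\w+</\1>", "<h1>hello</h1>")
      # Without a name, a group can be referenced by number with \N, meaning the text here must equal what that earlier group matched
      # The matched value can be retrieved directly with group(N)
      print(ret.group(1))  # result: h1
      print(ret.group())  # result: <h1>hello</h1>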
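Named groups are also handy when a pattern has several parts. The sketch below assumes a simple YYYY-MM-DD date format and uses `groupdict()` plus the `\g<name>` syntax in a replacement string; the names `date_re`, `year`, `month`, and `day` are illustrative.

```python
import re

date_re = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})-(?P<day>\d{2})')

m = date_re.search('released on 2019-07-02')
parts = m.groupdict()   # all named groups as a dict
print(parts)            # {'year': '2019', 'month': '07', 'day': '02'}

# In a replacement string, \g<name> refers to a named group.
swapped = date_re.sub(r'\g<day>/\g<month>/\g<year>', '2019-07-02')
print(swapped)          # 02/07/2019
```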
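      import re
      # re.findall()
      # Regular expressions: find the strings you want in a large body of text.
      # The key is to describe precisely the string you are looking for.

      s1 = 'fdsa太白金星'
      print(s1.find('白'))   # 5

      # Single-character matching
      # \W and \w
      # \w matches digits, letters, underscores, and Chinese characters
      # \W matches anything that is not a digit, letter, underscore, or Chinese character
      print(re.findall(r'\w', '太白jx 12*() _'))
      print(re.findall(r'\W', '太白jx 12*() _'))

      # \s matches whitespace: space, \t, \n
      # \S matches anything that is not whitespace
      print(re.findall(r'\s', '太白barry*(_ \t \n'))
      print(re.findall(r'\S', '太白barry*(_ \t \n'))

      # \d matches any digit
      # \D matches any non-digit
      print(re.findall(r'\d\d', '1234567890 alex *(_'))   # ['12', '34', '56', '78', '90']
      print(re.findall(r'\D', '1234567890 alex *(_'))

      # \A and ^ match at the start of the string
      print(re.findall(r'\Ahello', 'hello hello 太白 hell'))   # ['hello']
      print(re.findall(r'^hello', 'hello hello 太白 hell'))   # ['hello']

      # \Z matches at the end of the string
      # \z is problematic: many Python 3 versions reject it as a bad escape
      # $ matches at the end of the string
      print(re.findall(r'fjkdsla太白金星\Z', 'fjkdsla太白金星'))
      print(re.findall(r'金星$', 'fjkdsla太白金星'))

      # \n and \t
      print(re.findall(r'\n', 'fdsak\n fkjdlas\n \t'))
      print(re.findall(r'\t', 'fdsak\n fkjdlas\n \t'))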
      
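      # Metacharacter matching

      # . ? * + {m,n} .* .*?
      # . matches any single character
      # After a successful match, scanning resumes right after the matched text
      # After a failed attempt, the start position moves forward one character and matching is retried
      print(re.findall('a.b', 'aaabbb'))   # ['aab']

      # ? matches zero or one occurrence of the expression to its left.
      print(re.findall('a?b', 'ab aab'))   # ['ab', 'ab']
      print(re.findall('a?b', 'sb ab aabb'))   # ['b', 'ab', 'ab', 'b']

      # * matches zero or more occurrences of the expression to its left (greedy).
      print(re.findall('a*b', 'aaab ab b'))   # ['aaab', 'ab', 'b']
      print(re.findall('a*b', 'aasab ab b'))   # ['ab', 'ab', 'b']

      # + matches one or more occurrences of the expression to its left (greedy).
      print(re.findall('a+b', 'aaab ab b'))   # ['aaab', 'ab']

      # {m,n} matches between m and n (inclusive) occurrences of the expression to its left (greedy).
      print(re.findall('a{1,5}b', 'ab aab aaab aaaab aaaaaab aaaaabb'))

      # .* is greedy: it matches as much as possible.
      print(re.findall('a.*b', 'aab abbliye aaab abbb aab'))
      print(re.findall('a.*b', 'asb abbliyeaaab \nabbb aay', re.DOTALL))   # from the first a to the last b, across the newline

      # .*? Here the ? does not mean "zero or one occurrence of the character to its left";
      # it modifies the greedy .* so that it matches as little as possible (non-greedy). Recommended!
      # It still matches zero or more characters.
      print(re.findall('a.*?b', 'ab abbbbbb aaab'))   # ['ab', 'ab', 'aaab']
      print(re.findall('a.*b', 'abbbbbb'))   # ['abbbbbb']
      print(re.findall('a.*?b', 'abbbbbb'))   # ['ab']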
      
      
      # [] character sets

      print(re.findall('a[abc]b', 'aab abb acb adb afb a_b'))
      print(re.findall('a[abc][bd]b', 'aabb aaabc abd acdbb'))   # ['aabb', 'acdb']

      print(re.findall('a[0-9]b', 'a1b a3b aeb a*b arb a_b'))

      print(re.findall('a[a-z]b', 'a1b a3b aeb a*b arb a_b'))
      print(re.findall('a[A-Z]b', 'aAb a3b aEb a*b aRb a_b'))
      print(re.findall('a[a-zA-Z]b', 'aab a3b aAb a*b aTb a_b'))
      # To match a literal -, put it first or last inside []
      print(re.findall('a[-*$]b', 'a-b a$b a)b a*b '))   # ['a-b', 'a$b', 'a*b']
      # ^ at the very start of the brackets negates the set; elsewhere it is a literal ^
      print(re.findall('a[^0-9]b', 'a1b a$b a5b a*b '))   # ['a$b', 'a*b']
      print(re.findall('a[*^)]b', 'a^b a$b a5b a*b '))   # ['a^b', 'a*b']

      s = 'alex_sb wusir_sb ritian_sb 太白_nb yuanbao_sb dsb_sb'
      print(re.findall(r'\w+_sb', s))   # whole matches such as 'alex_sb'
      # () returns only the grouped part
      print(re.findall(r'(\w+)_sb', s))   # ['alex', 'wusir', 'ritian', 'yuanbao', 'dsb']

      # | alternation
      print(re.findall('alex|太白|wusir', 'alex太白wusiraleeeex太太白odlb'))



      # Adding ?: to a group returns the whole match instead of only the content of the group
      print(re.findall('companies|company',
                       'Too many companies have gone bankrupt, and the next one is my company'))   # ['companies', 'company']

      print(re.findall('compan(?:ies|y)',
                       'Too many companies have gone bankrupt, and the next one is my company'))   # ['companies', 'company']
      
      
      # search and match
      import re
      # search returns the first string that matches, wrapped in a match object; use .group() on the object to get it
      ret = re.search('sb|alex', 'alex sb sb barry 日天')
      ret = re.search('alex', 'fdsjkfd fjdsklalex gfdlgjfdlgjfggfjlgjfkdl')
      print(ret)
      print(ret.group())

      # match only matches at the start of the string; it returns None if the string does not start with the pattern
      ret = re.match('alex', 'alexfdskfd fjdsklalex gfdlgjfdlgjfggfjlgjfkdl')
      print(ret)
      print(ret.group())

      # split
      s1 = 'alex;wusir,太白 吴超~宝元'
      print(re.split('[;, ~]', s1))   # ['alex', 'wusir', '太白', '吴超', '宝元']

      # sub
      print(re.sub('barry', '太白', 'barry是最好的讲师,barry就是一个普通老师,请不要将barry当男神对待。'))

      # compile
      obj = re.compile(r'\d{2}')
      print(obj.search('fdsa12fds435454').group())   # 12
      print(obj.findall('fjdskalf2134fkjsd3245fdjsl545'))   # ['21', '34', '32', '45', '54']

      # finditer
      ret = re.finditer(r'\d', '54fjdkls4535lsdfj6776')
      print(ret)
      print(next(ret))
      print(next(ret).group())
      print(next(ret).group())
      for i in ret:
          print(i.group())
      print(list(ret))   # []: the iterator is already exhausted
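      # Extract every date from a block of text
      s1 = '''
      时间就是f4321995-04-27,2005-04-27
      1999-04-27 老男孩教育创始人
      老男孩老师 alex 1980-04-27:1980-04-27
      2018-12-08
      '''
      print(re.findall(r'\d{4}-\d{2}-\d{2}', s1))
      # ['1995-04-27', '2005-04-27', '1999-04-27', '1980-04-27', '1980-04-27', '2018-12-08']


      # Match a QQ account number: numbers start at 10000, so the first digit is nonzero and is followed by at least four more digits (five or more in total).
      s2 = '56546326546757'
      print(re.findall('[1-9][0-9]{4,}', s2))   # ['56546326546757']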
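The two patterns above find matches anywhere in the text, including inside longer digit runs. The sketch below shows one way to tighten them, assuming stricter requirements than the original exercise states: `fullmatch` to validate an entire string, and lookarounds so that a date is not picked out of a longer number. The helper `is_qq_number` is hypothetical.

```python
import re

QQ_RE = re.compile(r'[1-9][0-9]{4,}')


def is_qq_number(text):
    """Hypothetical helper: True only if the whole string looks like an account number."""
    return QQ_RE.fullmatch(text) is not None


print(is_qq_number('10000'))     # True
print(is_qq_number('01234'))     # False: leading zero
print(is_qq_number('abc10000'))  # False: findall would still find '10000' here

# (?<!\d) and (?!\d) require that no digit touches the date on either side,
# so '1995-04-27' inside 'f4321995-04-27' is skipped.
strict_dates = re.findall(r'(?<!\d)\d{4}-\d{2}-\d{2}(?!\d)', 'f4321995-04-27,2005-04-27')
print(strict_dates)              # ['2005-04-27']
```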

Origin www.cnblogs.com/-777/p/11134660.html