Python Regular Expressions and Web Crawlers

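Let's start with a simple Python crawler. I am using Python 3.6. In Python 3 the old urllib2 module is gone and urlopen lives in the urllib.request submodule, so a bare import of urllib is not enough: you need to import urllib.request. How regular expressions combine with crawling is covered in detail further down.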

import urllib.request

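def getHtml(url):
    # urlopen returns a response object; read() returns the body as bytes
    page = urllib.request.urlopen(url)
    html = page.read()
    return html

html = getHtml("http://tieba.baidu.com/p/2738151262")

print(html)  # a bytes object; decode it before applying str regex patterns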
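
Since read() hands back bytes in Python 3, a str pattern cannot be applied to the result directly. Here is a minimal sketch of decoding first, assuming the page is UTF-8 encoded. The raw sample and the to_text helper are made up for illustration and stand in for a real response body.

```python
import re

# Hypothetical stand-in for the bytes that page.read() would return.
raw = '<title>Python 爬虫</title>'.encode('utf-8')

def to_text(raw_bytes, encoding='utf-8'):
    """Decode a response body; undecodable bytes are replaced instead of raising."""
    return raw_bytes.decode(encoding, errors='replace')

# Applying a str pattern to bytes raises TypeError.
try:
    re.search('<title>(.*?)</title>', raw)
except TypeError as err:
    print('str pattern on bytes fails:', err)

# After decoding, the same pattern works.
text = to_text(raw)
title = re.search('<title>(.*?)</title>', text).group(1)
print(title)
```

If the site uses another encoding, pass it as the second argument instead of relying on the UTF-8 default.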

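Regular expressions are one of the harder topics in Python to get comfortable with, and they are most often put to use together with crawlers. Today we'll briefly go over the regular expressions most commonly used in Python.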

The package we use is re. Let's first go over the main pattern elements:

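1 . : matches any single character except a newline

2 * : matches the preceding character zero or more times

3 .* : greedy match; it takes as many characters as possible, so between repeated markers it runs from the first marker to the last one

4 .*? : non-greedy match; it takes as few characters as possible, so every shortest segment of the form A...A is returned separately

5 () : only the text matched inside the parentheses is returned as the result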

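The functions from the re package that we mainly use are:

findall: matches everything that fits the pattern and returns a list of the results
search: matches and extracts the first thing that fits the pattern and returns a match object (or None if nothing matches)
sub: replaces whatever fits the pattern and returns the resulting string
search stops as soon as it has found a match; it does not keep scanning the rest of the text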
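
Because search returns None when nothing matches, calling .group() on the result directly can blow up with an AttributeError. A small sketch of guarding against that follows; first_group is a hypothetical helper name, not something from the re package.

```python
import re

def first_group(pattern, text, flags=0):
    """Return group 1 of the first match, or None when there is no match."""
    match = re.search(pattern, text, flags)
    return match.group(1) if match else None

print(first_group('xx(.*?)xx', 'abxxhixxcd'))  # hi
print(first_group('xx(.*?)xx', 'no markers'))  # None
```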

import re
from re import findall, search, S  # not recommended: later on it is hard to tell which package these names came from
# ***************************************
# Using .
a = 'xy123'
b = re.findall('x...', a)
print(b)  # ['xy12']
# ***************************************
# Using *
a = 'xyxy123'
b = re.findall('x*', a)
print(b)  # ['x', '', 'x', '', '', '', '', '']
# ***************************************
# Using ?
a = 'xy123'
b = re.findall('x?', a)
print(b)  # ['x', '', '', '', '', '']
# Everything above only needs to be understood; the one combination you really need to master is (.*?)
secret_code = 'hadkfalifexxIxxfasdjifja134xxlovexx23345sdfxxyouxx8dfse'
# ***************************************
# Using .* (greedy)
b = re.findall('xx.*xx', secret_code)
print(b)  # ['xxIxxfasdjifja134xxlovexx23345sdfxxyouxx']
# ***************************************
# Using .*? (non-greedy)
c = re.findall('xx.*?xx', secret_code)
print(c)  # ['xxIxx', 'xxlovexx', 'xxyouxx']
# ***************************************
# (.*?) with parentheses returns only the captured part
d = re.findall('xx(.*?)xx', secret_code)
print(d)  # ['I', 'love', 'you']
for each in d:
    print(each)  # I, love, you on separate lines
# ***************************************
# Matching across line breaks
s = '''sdfxxhello
xxfsdfxxworldxxasdf'''
d = re.findall('xx(.*?)xx', s)
print(d)  # ['fsdf']  . does not match a newline by default
d = re.findall('xx(.*?)xx', s, re.S)
print(d)  # ['hello\n', 'world']  re.S lets . match newlines too
# ***************************************
# Comparing findall and search
s2 = 'asdfxxIxx123xxlovexxdfd'
f = re.search('xx(.*?)xx123xx(.*?)xx', s2).group(2)
print(f)  # love; the argument to group picks which pair of parentheses
f2 = re.findall('xx(.*?)xx123xx(.*?)xx', s2)
print(f2[0][1])  # love; findall returns a list of tuples here
# ***************************************
# Using sub
s = '123rrrrr123'
output = re.sub('123(.*?)123', '123%d123' % 789, s)
print(output)  # 123789123; the matched text is found and replaced
# ***************************************
# Calling compile yourself is not recommended here: the re functions already compile the pattern internally
pattern = 'xx(.*?)xx'
new_pattern = re.compile(pattern, re.S)
output = re.findall(new_pattern, secret_code)
print(output)  # ['I', 'love', 'you']
# ***************************************
# \d+ matches runs of digits; the raw string keeps the backslash intact
a = 'asdfasf1234567fasd555fas'
b = re.findall(r'(\d+)', a)
print(b)  # ['1234567', '555']

# The code above is not original; it is reposted.
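
One habit worth adding on top of the listing above: write patterns that contain backslashes as raw strings. Newer Python versions warn about sequences such as \d inside an ordinary string literal. The sketch below is my own addition rather than part of the reposted code, and it also shows that a compiled pattern object gives the same result as the module-level function.

```python
import re

text = 'asdfasf1234567fasd555fas'

# A raw string passes the backslash through to the regex engine unchanged.
digits = re.compile(r'\d+')

print(digits.findall(text))        # ['1234567', '555']
print(re.findall(r'(\d+)', text))  # ['1234567', '555']
```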

Next, let's look at using Python regular expressions to process the text in a file.

We create a new file named test.txt

with the following content:

<html>
    <head>
        <title>python crawler</title>
    </head>
    <body>
        <div><a href = "http://jikexueyuan.com/welcome.html">Python crawler</a>
        <div>
            <ul>
                <li><a href = "www.baidu.com">First item</a></li>
                <li><a href = "www.baidu1.com">First item</a></li>
                <li><a href = "www.baidu2.com">First item</a></li>
            </ul>
        </div>

        </div>
    </body>
</html>

Then we use regular expressions to pull data out of this TXT file:

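import re

total_page = 20
# Hypothetical starting URL for the paging example; the original never defines old_url
old_url = 'http://www.example.com/list?pageNum=1'
# Read the test.txt file created above
with open('test.txt', 'r', encoding='utf-8') as f:
    html = f.read()
######################################################
# Extract the title
title = re.search('<title>(.*?)</title>', html, re.S).group(1)
print(title)  # the text inside <title>
######################################################
# Extract the links; \s* allows for the spaces around = in the test file
links = re.findall(r'href\s*=\s*"(.*?)"', html, re.S)
for each in links:
    print(each)
######################################################
# Extract part of the text: grab the large block first, then the small pieces
text_field = re.findall('<ul>(.*?)</ul>', html, re.S)[0]
the_text = re.findall('">(.*?)</a>', text_field, re.S)
for every_text in the_text:
    print(every_text)
######################################################
# Use sub to page through results
for i in range(2, total_page + 1):
    # re.S is dropped here: as the fourth positional argument it would be read as count, and this pattern has no dot anyway
    new_link = re.sub(r'pageNum=\d+', 'pageNum=%d' % i, old_url)
    print(new_link)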
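
A note on the sub call in the paging loop: the fourth positional parameter of re.sub is count, not flags, so a flag such as re.S passed in that position is silently read as a replacement limit. The numeric value of re.S is 16, which this small sketch makes visible. The sample string is made up for the demonstration.

```python
import re

text = 'a' * 20

# Equivalent to passing re.S as the fourth positional argument:
# the flag's integer value (16) is taken as the maximum number of replacements.
wrong = re.sub('a', 'b', text, count=int(re.S))

# Passing the flag by keyword leaves count unlimited.
right = re.sub('a', 'b', text, flags=re.S)

print(int(re.S))  # 16
print(wrong)      # bbbbbbbbbbbbbbbbaaaa
print(right)      # bbbbbbbbbbbbbbbbbbbb
```

With short inputs the mistake goes unnoticed, which is why it is safer to always pass flags by keyword to sub.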
