Three Ways to Parse Data

re　　xpath　　bs4

1. Regex parsing

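Single characters:
        .  : any character except a newline
        [] : [aoe] [a-w] any one character from the set
        \d : a digit, [0-9]
        \D : a non-digit
        \w : a digit, letter, underscore, or Chinese character
        \W : anything \w does not match
        \s : any whitespace character, including space, tab, form feed and so on; equivalent to [ \f\n\r\t\v]
        \S : a non-whitespace character
    Quantifiers:
        *     : any number of times, >= 0
        +     : at least once, >= 1
        ?     : optional, 0 or 1 times
        {m}   : exactly m times, e.g. hello{3}
        {m,}  : at least m times, e.g. hello{3,}
        {m,n} : between m and n times
    Anchors:
        $ : ends with
        ^ : starts with
    Grouping:
        (ab)
    Greedy mode: .*
    Non-greedy (lazy) mode: .*?

    re.I : ignore case
    re.M : multi-line mode; ^ and $ match at the start and end of every line
    re.S : single-line (DOTALL) mode; . also matches newlines

    re.sub(pattern, replacement, string)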
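
The flags, the greedy and lazy modes, and re.sub from the list above can be tried out on a small string. This is only a sketch; the sample text is made up for illustration.

```python
import re

# Made-up sample text with two <b> tags on the first line and a second line.
text = "Title: <b>Hello</b> and <b>World</b>\nsecond LINE"

# Greedy .* runs to the last </b>; lazy .*? stops at the first one.
greedy = re.findall(r"<b>(.*)</b>", text)
lazy = re.findall(r"<b>(.*?)</b>", text)

# re.I ignores case, so "line" matches "LINE".
ignore_case = re.findall(r"line", text, re.I)

# re.M lets ^ match at the start of every line, not only at the start of the string.
line_starts = re.findall(r"^\w+", text, re.M)

# re.S lets . match "\n", so one match can span both lines.
across_lines = re.findall(r"World.*second", text, re.S)

# re.sub(pattern, replacement, string) returns a new string.
no_tags = re.sub(r"</?b>", "", text)

print(greedy, lazy, ignore_case, line_starts, across_lines, no_tags, sep="\n")
```

Without re.S the last findall would return an empty list, because . stops at the newline.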

Regex parsing

- Basic practice:

import re

# Extract "python"
key = "javapythonc++php"
re.findall('python', key)

# Extract "hello world"
key = "<html><h1>hello world</h1></html>"
re.findall('<h1>(.*?)</h1>', key)[0]

# Extract 170
string = 'I like girls who are 170 tall'
re.findall(r'\d+', string)[0]

# Extract "http://" and "https://"
key = 'http://www.baidu.com and https://boob.com'
re.findall('https?://', key)

# Extract "hello": the whole match is <hTml>hello</HtMl>, the group captures only hello
key = 'lalala<hTml>hello</HtMl>hahah'
re.findall('<[hH][tT][mM][lL]>(.*)</[hH][tT][mM][lL]>', key)[0]

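# Extract "hit" in non-greedy mode: .*? matches as little as possible, so the group stops before the first dot
# placeholder address: the original was masked as [email protected]
key = 'user@hit.example.com'  # we want to match "hit"
pl = r'(h.*?)\.'
re.findall(pl, key)[0]

# Match "sas" and "saas" but not "saaas"
key = 'saas and sas and saaas'
re.findall('sa{1,2}s', key)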


# Match the lines that start with "i"
string = '''fall in love with you
i love you very much
i love she
i love her'''
re.findall('^i.*',string,re.M)


# Match across all lines (the sample text is the poem 静夜思)
string1 = """<div>静夜思
窗前明月光
疑是地上霜
举头望明月
低头思故乡
</div>"""
re.findall('<div>(.*)</div>',string1,re.S)


　　- Comprehensive exercise:

　　　　Task: scrape the images from a given page of Qiushibaike (糗事百科) and save them to a given folder

import requests
import re
import os
url = 'https://www.qiushibaike.com/pic/'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}
# Create a folder to store the images
dir_name = 'qiutu'
if not os.path.exists(dir_name):
    os.mkdir(dir_name)

response = requests.get(url=url, headers=header)

page_text = response.text

# Parse with a regex: capture the value of the src attribute of each img tag
src_list = re.findall('<div class="thumb">.*?<img src="(.*?)".*?>.*?</div>', page_text, re.S)
for src in src_list:
    # Prepend the scheme to get the full image URL
    src = 'https:' + src
    # Download the image (send a request)
    image_data = requests.get(url=src, headers=header).content

    fileName = src.split('/')[-1]
    filePath = dir_name + '/' + fileName

    with open(filePath, 'wb') as fp:
        fp.write(image_data)
        print('Downloaded one image')
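
The pattern used above can be checked offline before sending any requests. The sketch below runs it against a hand-written HTML fragment. The fragment, the pic.example.com host and the file names are assumptions made up for illustration, not the real page markup. It also uses urllib.parse.urljoin as one possible alternative to prepending 'https:' by hand.

```python
import re
from urllib.parse import urljoin

# Hypothetical HTML fragment shaped like the markup the pattern expects.
sample_html = '''
<div class="thumb">
<a href="/article/1" target="_blank">
<img src="//pic.example.com/medium/app1.jpg" alt="first">
</a>
</div>
<div class="thumb">
<a href="/article/2" target="_blank">
<img src="//pic.example.com/medium/app2.jpg" alt="second">
</a>
</div>
'''

# Same pattern as in the exercise; re.S lets .*? cross the line breaks.
pattern = r'<div class="thumb">.*?<img src="(.*?)".*?>.*?</div>'
src_list = re.findall(pattern, sample_html, re.S)

# urljoin resolves scheme-relative links ("//host/path") against the page URL.
page_url = 'https://www.qiushibaike.com/pic/'
full_urls = [urljoin(page_url, src) for src in src_list]

# The last path segment is used as the local file name.
file_names = [u.split('/')[-1] for u in full_urls]

print(full_urls)
print(file_names)
```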


2. xpath

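　　- Install the xpath browser extension: it lets you run xpath expressions directly in the browser

　　　　1. Drag the xpath extension into the Chrome extensions page (under More tools) to install it

　　　　2. Turn the extension on and off with ctrl + shift + x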

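Install lxml

from lxml import etree
    Two ways to use it: turn the HTML document into an object, then call the object's methods to find the nodes you want
    (1) Local file
        tree = etree.parse(filename)
        (etree.parse uses the XML parser by default; for HTML that is not well-formed XML, pass etree.HTMLParser() as the second argument)
    (2) Data from the network
        tree = etree.HTML(page_source_string)

    ret = tree.xpath(path_expression)
    [Note] ret is a list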
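
A minimal sketch of both entry points, assuming lxml is installed. The HTML string is made up, and io.StringIO stands in for a local file so the example needs nothing on disk.

```python
from io import StringIO
from lxml import etree

# Made-up page source for illustration.
html = "<html><body><div id='main'><p>first</p><p>second</p></div></body></html>"

# Network data: etree.HTML takes the page source as a string.
tree = etree.HTML(html)

# Local file: etree.parse takes a filename or a file object.
# StringIO stands in for the file; HTMLParser makes it tolerate real-world HTML.
file_tree = etree.parse(StringIO(html), etree.HTMLParser())

paragraphs = tree.xpath('//div[@id="main"]/p')          # list of Element objects
texts = tree.xpath('//div[@id="main"]/p/text()')        # list of strings
file_texts = file_tree.xpath('//div[@id="main"]/p/text()')
missing = tree.xpath('//div[@id="nope"]/text()')        # no match gives an empty list

# Because the result is always a list, guard before indexing.
first = texts[0] if texts else None
print(len(paragraphs), texts, file_texts, missing, first)
```

Indexing an empty result with [0] raises IndexError, which is why the guard on the last line is worth keeping in real scrapers.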


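- Common expressions:

    /bookstore/book           selects every book that is a direct child of the root node bookstore
    //book                    selects every book anywhere in the document
    /bookstore//book          selects every book at any depth under bookstore
    /bookstore/book[1]        the first book in bookstore
    /bookstore/book[last()]   the last book in bookstore
    /bookstore/book[position()<3]  the first two books
    //title[@lang]            every title node that has a lang attribute
    //title[@lang='eng']      every title node whose lang attribute is eng
    Locating by attribute
            //li[@id="hua"]
            //div[@class="song"]
    Locating by hierarchy and index
            //div[@id="head"]/div/div[2]/a[@class="toindex"]
            [Note] indexes start at 1
            //div[@id="head"]//a[@class="toindex"]
            [Note] the double slash selects matching a nodes at any depth below
    Logical operators
            //input[@class="s_ipt" and @name="wd"]
    Fuzzy matching:
            contains
                //input[contains(@class, "s_i")]
                every input whose class attribute contains s_i
                //input[contains(text(), "爱")]
            starts-with
                //input[starts-with(@class, "s")]
                every input whose class attribute starts with s
    Getting text
            //div[@id="u1"]/a[5]/text()  the text directly inside the node
            //div[@id="u1"]//text()      all the text inside the node at any depth, without the tags
    Getting an attribute
            //div[@id="u1"]/a[5]/@href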

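　- Using xpath in code:

　　　　1. Import: from lxml import etree

　　　　2. Convert the HTML or XML document into an etree object, then call the object's methods to find the nodes you want

　　　　　　2.1 Local file: tree = etree.parse(filename)

　　　　　　2.2 Network data: tree = etree.HTML(page_content_string)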

　　- Comprehensive exercise:

　　　　Task: get the content and title of each joke on Haoduanzi (好段子) http://www.haoduanzi.com

# //div[@id="main"]/div[@class="log cate10 auth1"]/h3/a/text()  title
# //div[@id="main"]/div[@class="log cate10 auth1"][2]/div//text()  content

from lxml import etree
import requests
url = 'http://www.haoduanzi.com/category-10.html'
headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36',
    }

page_text = requests.get(url=url,headers=headers).text

# Convert the page data into an etree object
tree = etree.HTML(page_text)

# Parse: the list below holds the child divs that contain each joke's content and title
div_list = tree.xpath('//div[@id="main"]/div[@class="log cate10 auth1"]')
with open('./duanzi001.txt', 'w', encoding='utf-8') as fp:
    for div in div_list:
        # An Element object can call xpath again to parse its own subtree
        content = div.xpath('./div//text()')
        str_content = ''.join(content)
        title = div.xpath('./h3/a/text()')[0]
        fp.write(title + ":" + str_content + '\n\n\n')


3. bs4

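- Environment setup:

- If pip downloads are slow, point pip at a mirror hosted in China (Aliyun, Douban, NetEase, etc.)
   - windows
    (1) Open File Explorer
    (2) Type %appdata% in the address bar
    (3) Create a folder named pip there
    (4) In the pip folder create a file named pip.ini with the following content
        [global]
        timeout = 6000
        index-url = https://mirrors.aliyun.com/pypi/simple/
        trusted-host = mirrors.aliyun.com
   - linux
    (1) cd ~
    (2) mkdir ~/.pip
    (3) vi ~/.pip/pip.conf
    (4) Edit the content; it is exactly the same as on windows
  - Install: pip install bs4 (this pulls in the beautifulsoup4 package)
    bs4 needs a third-party parser library when it is used, so install that as well
    pip install lxml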


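- Basic usage:

        - from bs4 import BeautifulSoup
        - How to use it: convert an HTML document into a BeautifulSoup object, then use the object's methods or attributes to find the content you want
          (1) From a local file:
              - soup = BeautifulSoup(open('local_file'), 'lxml')
          (2) From network data:
              - soup = BeautifulSoup('a str or bytes value', 'lxml')
          (3) Printing the soup object shows the content of the HTML document
    (1) Find by tag name
        - soup.a   finds only the first matching tag
    (2) Get attributes
        - soup.a.attrs  all attributes and values of a, returned as a dict
        - soup.a.attrs['href']   the href attribute
        - soup.a['href']   shorthand for the same thing
    (3) Get content
        - soup.a.string
        - soup.a.text
        - soup.a.get_text()
       [Note] If the tag contains more than one child (for example text mixed with nested tags), string returns None, while the other two still return the text
    (4) find: returns the first matching tag
        - soup.find('a')  the first match
        - soup.find('a', title="xxx")
        - soup.find('a', alt="xxx")
        - soup.find('a', class_="xxx")
        - soup.find('a', id="xxx")
    (5) find_all: returns every matching tag
        - soup.find_all('a')
        - soup.find_all(['a','b']) every a and b tag
        - soup.find_all('a', limit=2)  only the first two
    (6) select: soup.select('#feng')
        - selects content with a CSS selector
        - common selectors: tag selector (a), class selector (.), id selector (#), hierarchy selectors
            - hierarchy selectors:
                div .dudu #lala .meme .xixi  descendants at any depth
                div > p > a > .lala          direct children only
        [Note] select always returns a list; use an index to get a specific object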
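
The rules above can be tried on a small document. This is a sketch that assumes bs4 and lxml are installed; the HTML, the example.com links and the class names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML fragment for illustration.
html = """
<div id="feng" class="song">
  <a href="http://example.com/1" title="one" class="du">first link</a>
  <a href="http://example.com/2" title="two">second <b>bold</b> link</a>
  <p class="meme">paragraph</p>
</div>
"""
soup = BeautifulSoup(html, 'lxml')

# Tag name access and attributes: only the first <a> is returned.
first_href = soup.a['href']
first_title = soup.a.attrs['title']

# find with an attribute filter returns the first match.
second = soup.find('a', title='two')

# .string is None when the tag has several children; .text still joins all the text.
second_string = second.string
second_text = second.text

# find_all returns every match; a list of names matches any of them.
links_and_paragraphs = soup.find_all(['a', 'p'])
limited = soup.find_all('a', limit=1)

# select takes a CSS selector and always returns a list.
direct_links = soup.select('#feng > a')
nested_bold = soup.select('#feng b')[0].string
meme_text = soup.select('div .meme')[0].text

print(first_href, first_title, second_string, second_text, nested_bold, meme_text)
```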


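　- Comprehensive exercise:

　　　　Task: use bs4 to scrape every chapter of the novel Romance of the Three Kingdoms (三国演义) from the Shicimingju (诗词名句) site and save it to local disk http://www.shicimingju.com/book/sanguoyanyi.html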

# Task: use bs4 to scrape every chapter of Romance of the Three Kingdoms from the Shicimingju site and save it to local disk
from bs4 import BeautifulSoup
import requests
header = {
    
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}
def get_content(url):
    data = requests.get(url=url,headers=header).text
    soup = BeautifulSoup(data,'lxml')
    return soup.find('div',class_="chapter_content").get_text()

url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_content = requests.get(url=url,headers=header).text
soup = BeautifulSoup(page_content,'lxml')

# Parse the chapter links: each a tag holds a chapter title and the path to its page
a_list = soup.select('.book-mulu > ul > li > a')
with open('./xiaoshu.txt', 'w', encoding='utf-8') as fp:
    for a in a_list:
        second_url = 'http://www.shicimingju.com' + a['href']

        content = get_content(second_url)
        title = a.string
        fp.write(title + '\n' + content + "\n\n\n")
        print('Finished downloading one chapter')
