04 Three Data Parsing Methods

Introduction

  In most cases the requirement calls for a focused crawler, i.e. crawling only a specified portion of the data on a page rather than the entire page, and that is where data parsing comes in. The crawling workflow is therefore:

  • Specify the url
  • Send the request with the requests module
  • Get the data from the response
  • Parse the data
  • Persist the results

  Data parsing:

  - is used in focused crawlers;

  - the data to be parsed lives either between tags or in a tag's attributes.

  The five steps are sketched end to end in the short example below.
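A minimal end-to-end sketch of the five steps, assuming httpbin.org as a placeholder target and result.txt as a placeholder output file (neither appears in the original examples):

import requests
import re

# 1. specify the url (httpbin.org is only a stand-in target)
url = 'https://httpbin.org/html'
headers = {'User-Agent': 'Mozilla/5.0'}

# 2. send the request with the requests module
response = requests.get(url=url, headers=headers)

# 3. get the data from the response
page_text = response.text

# 4. parse the data: here, pull the <h1> text out with a regex
titles = re.findall(r'<h1>(.*?)</h1>', page_text, re.S)

# 5. persist the result
with open('./result.txt', 'w', encoding='utf-8') as fp:
    fp.write(''.join(titles))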

 

1. Regex Parsing

Commonly used regular expressions (a few quick usage illustrations follow the reference):

    Single characters:
        .  : any character except a newline
        [] : [aoe], [a-w] — hmm; matches any one character from the set
        \d : a digit, same as [0-9]
        \D : a non-digit
        \w : a word character: digit, letter, underscore (and other Unicode word characters such as Chinese)
        \W : anything that is not \w
        \s : any whitespace character, including space, tab, form feed, etc.; equivalent to [ \f\n\r\t\v]
        \S : any non-whitespace character

    Quantifiers:
        *     : any number of times, >= 0
        +     : at least once, >= 1
        ?     : optional, 0 or 1 time
        {m}   : exactly m times
        {m,}  : at least m times
        {m,n} : between m and n times

    Anchors:
        $ : matches at the end
        ^ : matches at the beginning

    Grouping:
        (ab)
    Greedy mode: .*
    Non-greedy (lazy) mode: .*?

    re.I : ignore case
    re.M : multi-line matching (^ and $ match at every line)
    re.S : let . match newlines as well (DOTALL)

re.sub(pattern, replacement, string)
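A few quick illustrations of the syntax above; the sample string is made up for this sketch:

import re

text = 'price: 12, price: 345'

print(re.findall(r'\d+', text))             # ['12', '345'], \d with the + quantifier
print(re.findall(r'price: (\d+)', text))    # grouping with () keeps only the digits
print(re.findall(r'p.*?e', text, re.I))     # non-greedy .*? stops early: ['price', 'price']
print(re.sub(r'\d+', 'N', text))            # 'price: N, price: N'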

Example: crawl all the "qiutu" images from Qiushibaike (糗事百科)

import requests
import re
import os

# Create a folder to hold the images
if not os.path.exists('./qiutuLibs'):
    os.mkdir('./qiutuLibs')

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

# A generic url template covering every page
url = 'https://www.qiushibaike.com/pic/page/%d/?s=5185803'

for page in range(1, 36):
    new_url = url % page
    page_text = requests.get(url=new_url, headers=headers).text

    # Parse out the image addresses
    ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
    src_list = re.findall(ex, page_text, re.S)

    # The src attribute is not a complete url: it lacks the protocol prefix
    for src in src_list:
        src = 'https:' + src
        # Request each image url separately; .content returns the binary response data
        img_data = requests.get(url=src, headers=headers).content
        img_name = src.split('/')[-1]

        img_path = './qiutuLibs/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
            print(img_name, 'downloaded!')

2. XPath Parsing

  Install: pip install lxml -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

  The xpath parsing workflow:

        - 1. Instantiate an etree object and load the page source to be parsed into it.

        - 2. Call the etree object's xpath method with an xpath expression to locate tags and extract data; a minimal sketch of both steps follows.
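A minimal sketch of the two steps above, run against a small in-memory HTML snippet (the snippet and its tag names are made up for illustration; in the real examples page_text comes from requests):

from lxml import etree

# a hypothetical page source, standing in for page_text fetched with requests
html = '''
<div class="job-list">
  <ul>
    <li><a href="/job/1"><div>Python Crawler Engineer</div></a><span>15k-25k</span></li>
    <li><a href="/job/2"><div>Data Analyst</div></a><span>10k-18k</span></li>
  </ul>
</div>
'''

# step 1: instantiate an etree object from the page source
tree = etree.HTML(html)

# step 2: locate tags and extract data with xpath expressions
li_list = tree.xpath('//div[@class="job-list"]/ul/li')
for li in li_list:
    # relative (local) parsing starts with ./ and is evaluated against the li element itself
    name = li.xpath('./a/div/text()')[0]
    salary = li.xpath('./span/text()')[0]
    href = li.xpath('./a/@href')[0]
    print(name, salary, href)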

 

Example 1: Boss Zhipin (boss直聘) job listings

import requests
from lxml import etree
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

url = 'https://www.zhipin.com/job_detail/?query=python%E7%88%AC%E8%99%AB&city=101010100&industry=&position='
page_text = requests.get(url=url, headers=headers).text

tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="job-list"]/ul/li')
job_data_list = []

for li in li_list:
    # Local (relative) parsing must start with ./ ; both the etree object and element objects can call xpath
    job_name = li.xpath('.//div[@class="info-primary"]/h3/a/div/text()')[0]
    salary = li.xpath('.//div[@class="info-primary"]/h3/a/span/text()')[0]
    company = li.xpath('.//div[@class="company-text"]/h3/a/text()')[0]
    detail_url = 'https://www.zhipin.com' + li.xpath('.//div[@class="info-primary"]/h3/a/@href')[0]

    # Page source of the detail page
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    detail_tree = etree.HTML(detail_page_text)
    job_desc = detail_tree.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]/div//text()')
    job_desc = ''.join(job_desc)

    dic = {
        'job_name': job_name,
        'salary': salary,
        'company': company,
        'job_desc': job_desc
    }
    job_data_list.append(dic)

fp = open('job.json', 'w', encoding='utf-8')
json.dump(job_data_list, fp, ensure_ascii=False)
fp.close()
print('over')

Example 2: download the images from Jandan (煎蛋网): http://jandan.net/ooxx. Key point: the img src is encrypted (base64-encoded).

import requests
from lxml import etree
from fake_useragent import UserAgent
import base64
import urllib.request

url = 'http://jandan.net/ooxx'
ua = UserAgent(verify_ssl=False, use_cache_server=False).random
headers = {
    'User-Agent': ua
}
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)

# Get the encrypted image url data
imgCode_list = tree.xpath('//span[@class="img-hash"]/text()')

imgUrl_list = []
for url in imgCode_list:
    # base64.b64decode(url) returns bytes, so decode it back into str
    img_url = 'http:' + base64.b64decode(url).decode()
    imgUrl_list.append(img_url)

for url in imgUrl_list:
    filePath = url.split('/')[-1]
    urllib.request.urlretrieve(url=url, filename=filePath)
    print(filePath + ' downloaded')

Example 3: crawl the free résumé templates from 站长素材 (sc.chinaz.com)

import requests
import random
from lxml import etree

headers = {
    'Connection': 'close',        # close the connection as soon as the request completes (promptly release the pooled resource)
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
url = 'http://sc.chinaz.com/jianli/free_%d.html'
for page in range(1, 4):
    # The first page's url format differs from the rest, so handle it separately
    if page == 1:
        new_url = 'http://sc.chinaz.com/jianli/free.html'
    else:
        new_url = url % page

    response = requests.get(url=new_url, headers=headers)
    response.encoding = 'utf-8'        # the Chinese text comes back garbled, so fix the encoding first
    page_text = response.text

    tree = etree.HTML(page_text)
    div_list = tree.xpath('//div[@id="container"]/div')
    for div in div_list:
        detail_url = div.xpath('./a/@href')[0]
        name = div.xpath('./a/img/@alt')[0]

        detail_page = requests.get(url=detail_url, headers=headers).text
        tree = etree.HTML(detail_page)
        # This yields every download link for the template
        download_list = tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')
        # Pick one at random so no single mirror is hit too often and blocked
        download_url = random.choice(download_list)
        data = requests.get(url=download_url, headers=headers).content
        fileName = name + '.rar'
        with open(fileName, 'wb') as fp:
            fp.write(data)
            print(fileName, 'downloaded')

3. BeautifulSoup Parsing

  Install: pip install bs4

  Parsing workflow:

        - Instantiate a BeautifulSoup object and load the page source into it

        - Call the object's attributes and methods to locate tags and extract data; a minimal sketch of both steps follows
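A minimal sketch of the two steps above against a small in-memory HTML snippet (the snippet is made up for illustration; in the real example page_text comes from requests):

from bs4 import BeautifulSoup

# a hypothetical page source, standing in for page_text fetched with requests
html = '''
<div class="book-mulu">
  <ul>
    <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
    <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
  </ul>
</div>
'''

# step 1: instantiate a BeautifulSoup object from the page source
soup = BeautifulSoup(html, 'lxml')

# step 2: locate tags with CSS selectors / find, then pull out text and attributes
for a in soup.select('.book-mulu > ul > li > a'):
    print(a.string, a['href'])        # tag text and the href attribute
first_li = soup.find('li')            # find returns the first matching tag
print(first_li.text)                  # .text joins all nested text into one string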

Example: use bs4 to crawl every chapter of Romance of the Three Kingdoms (三国演义) from shicimingju.com and save it to local disk

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}
url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url=url, headers=headers).text

# Parse out the chapter titles and chapter contents
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select('.book-mulu > ul > li > a')
fp = open('./sanguo.txt', 'w', encoding='utf-8')
for a in a_list:
    # Each a tag can be used like a soup object, since it is itself markup
    title = a.string
    detail_url = 'http://www.shicimingju.com' + a['href']
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    soup = BeautifulSoup(detail_page_text, 'lxml')
    # Unlike xpath, .text in bs4 already joins the extracted pieces into a single string
    content = soup.find('div', class_="chapter_content").text

    fp.write(title + ':' + content + '\n')
    print(title, 'saved!')
fp.close()

 

 

Reprinted from www.cnblogs.com/Summer-skr--blog/p/11397434.html