Introduction
For most real-world requirements we use a focused crawler, i.e. we scrape only a specified portion of a page's data rather than the whole page, and data parsing is what makes focused crawling possible. The scraping workflow is therefore (a minimal sketch follows these notes):
- Specify the URL
- Send the request with the requests module
- Fetch the data from the response
- Parse the data
- Persist the results

Data parsing:
- Is used in focused crawlers.
- The data to extract lives either between tags or in a tag's attributes.
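As a minimal, end-to-end sketch of that workflow (the URL here is only a placeholder, and the bare regex in step 4 stands in for any of the parsing techniques covered below):

```python
import requests
import re

# 1. specify the url (placeholder page)
url = 'http://example.com/'
headers = {'User-Agent': 'Mozilla/5.0'}

# 2./3. send the request with requests and take the text of the response
page_text = requests.get(url=url, headers=headers).text

# 4. parse: pull the text between the <h1> tags (data stored between tags)
title = re.findall('<h1>(.*?)</h1>', page_text, re.S)

# 5. persist the result
with open('result.txt', 'w', encoding='utf-8') as fp:
    fp.write(''.join(title))
```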
1. Regex parsing
Common regular expressions (a short demo follows the table):
```
Single characters:
    .  : any character except a newline
    [] : a character set, e.g. [aoe] or [a-w]; matches any one character in the set
    \d : a digit, same as [0-9]
    \D : a non-digit
    \w : a digit, letter, underscore, or Chinese character
    \W : the complement of \w
    \s : any whitespace character, including space, tab, form feed, etc.; equivalent to [ \f\n\r\t\v]
    \S : a non-whitespace character

Quantifiers:
    *     : any number of times (>= 0)
    +     : at least once (>= 1)
    ?     : optional, 0 or 1 time
    {m}   : exactly m times
    {m,}  : at least m times, e.g. hello{3,}
    {m,n} : m to n times

Anchors:
    $ : matches at the end
    ^ : matches at the start

Grouping:
    (ab)

Greedy matching:            .*
Non-greedy (lazy) matching: .*?

Flags:
    re.I : ignore case
    re.M : multi-line mode (^ and $ match at every line)
    re.S : "single-line" mode, i.e. . also matches newlines (DOTALL)

Substitution:
    re.sub(pattern, replacement, string)
```
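The two pieces that matter most in practice are non-greedy matching and re.S. A quick self-contained demonstration (the HTML snippet is made up):

```python
import re

html = '''<div class="thumb">
<img src="//pic.example.com/a.jpg" alt="one">
</div>
<div class="thumb">
<img src="//pic.example.com/b.jpg" alt="two">
</div>'''

# without re.S, . cannot cross the newline after <div class="thumb">,
# so the pattern finds nothing
print(re.findall('<div class="thumb">.*?<img src="(.*?)" alt', html))        # []

# with re.S, . also matches newlines while .*? stays non-greedy,
# so each src is captured separately
print(re.findall('<div class="thumb">.*?<img src="(.*?)" alt', html, re.S))
# ['//pic.example.com/a.jpg', '//pic.example.com/b.jpg']

# greedy vs. lazy: .* grabs as much as possible, .*? as little as possible
print(re.findall('src="(.*)"', '<img src="a.jpg"><img src="b.jpg">'))   # ['a.jpg"><img src="b.jpg']
print(re.findall('src="(.*?)"', '<img src="a.jpg"><img src="b.jpg">'))  # ['a.jpg', 'b.jpg']

# re.sub(pattern, replacement, string)
print(re.sub(r'\d+', '#', 'tel: 12345'))  # tel: #
```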
Example: scrape all the pictures from the 糗图 (funny pictures) section of Qiushibaike (糗事百科)
```python
import requests
import re
import os

# create a folder for the downloads
if not os.path.exists('./qiutuLibs'):
    os.mkdir('./qiutuLibs')

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

# a general-purpose url template covering every page
url = 'https://www.qiushibaike.com/pic/page/%d/?s=5185803'

for page in range(1, 36):
    new_url = url % page
    page_text = requests.get(url=new_url, headers=headers).text

    # parse out the image addresses; re.S lets .*? cross line breaks
    ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
    src_list = re.findall(ex, page_text, re.S)

    # the src values are not complete urls: the protocol prefix is missing
    for src in src_list:
        src = 'https:' + src
        # request each image url separately; .content returns the binary response data
        img_data = requests.get(url=src, headers=headers).content
        img_name = src.split('/')[-1]

        img_path = './qiutuLibs/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        print(img_name, 'downloaded!')
```
2. XPath parsing
Install: pip install lxml -i http://pypi.douban.com/simple --trusted-host pypi.douban.com
XPath parsing workflow (a minimal sketch follows this list):
- 1. Instantiate an etree object and load the page source to be parsed into it.
- 2. Call the etree object's xpath method with an XPath expression to locate tags and extract data.
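A minimal sketch of those two steps, run against a made-up HTML string rather than a live page:

```python
from lxml import etree

html = '''<html><body>
<div class="job-list"><ul>
  <li><a href="/job/1"><span>10k-15k</span>Python crawler</a></li>
  <li><a href="/job/2"><span>15k-20k</span>Data engineer</a></li>
</ul></div>
</body></html>'''

# step 1: load the page source into an etree object
tree = etree.HTML(html)

# step 2: locate tags / extract data with xpath expressions
li_list = tree.xpath('//div[@class="job-list"]/ul/li')
for li in li_list:
    # relative (local) expressions start with ./
    salary = li.xpath('./a/span/text()')[0]
    href = li.xpath('./a/@href')[0]
    print(salary, href)
```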
Example 1: BOSS直聘 (zhipin.com) job listings
```python
import requests
from lxml import etree
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

url = 'https://www.zhipin.com/job_detail/?query=python%E7%88%AC%E8%99%AB&city=101010100&industry=&position='
page_text = requests.get(url=url, headers=headers).text

tree = etree.HTML(page_text)
li_list = tree.xpath('//div[@class="job-list"]/ul/li')
job_data_list = []

for li in li_list:
    # local (relative) parsing must start with ./ ; both etree and Element objects expose xpath()
    job_name = li.xpath('.//div[@class="info-primary"]/h3/a/div/text()')[0]
    salary = li.xpath('.//div[@class="info-primary"]/h3/a/span/text()')[0]
    company = li.xpath('.//div[@class="company-text"]/h3/a/text()')[0]
    detail_url = 'https://www.zhipin.com' + li.xpath('.//div[@class="info-primary"]/h3/a/@href')[0]

    # page source of the detail page
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    detail_tree = etree.HTML(detail_page_text)
    job_desc = detail_tree.xpath('//*[@id="main"]/div[3]/div/div[2]/div[2]/div[1]/div//text()')
    job_desc = ''.join(job_desc)

    dic = {
        'job_name': job_name,
        'salary': salary,
        'company': company,
        'job_desc': job_desc
    }
    job_data_list.append(dic)

with open('job.json', 'w', encoding='utf-8') as fp:
    json.dump(job_data_list, fp, ensure_ascii=False)
print('over')
```
Example 2: download the image data from Jiandan (煎蛋网): http://jandan.net/ooxx. [Key point: the src values are obfuscated]
```python
import requests
from lxml import etree
from fake_useragent import UserAgent
import base64
import urllib.request

url = 'http://jandan.net/ooxx'
ua = UserAgent(verify_ssl=False, use_cache_server=False).random
headers = {
    'User-Agent': ua
}
page_text = requests.get(url=url, headers=headers).text
tree = etree.HTML(page_text)

# grab the obfuscated image url strings
imgCode_list = tree.xpath('//span[@class="img-hash"]/text()')

imgUrl_list = []
for url in imgCode_list:
    # base64.b64decode(url) returns bytes, so decode it into str
    img_url = 'http:' + base64.b64decode(url).decode()
    imgUrl_list.append(img_url)

for url in imgUrl_list:
    filePath = url.split('/')[-1]
    urllib.request.urlretrieve(url=url, filename=filePath)
    print(filePath + ' downloaded')
```
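The "encryption" above is nothing more than base64: each img-hash span holds a base64-encoded, protocol-less URL. A round trip with a made-up value shows the decode step in isolation:

```python
import base64

# hypothetical img-hash value: encode one ourselves to demonstrate
encoded = base64.b64encode(b'//wx1.example.com/mw600/pic.jpg').decode()
print(encoded)  # the raw base64 text, as it appears in the span

# b64decode returns bytes; .decode() turns it into str, then prepend the protocol
print('http:' + base64.b64decode(encoded).decode())  # http://wx1.example.com/mw600/pic.jpg
```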
Example 3: scrape the résumé templates from 站长素材 (sc.chinaz.com)
```python
import requests
import random
from lxml import etree

headers = {
    # close the connection as soon as the request finishes (frees the pool promptly)
    'Connection': 'close',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'
}
url = 'http://sc.chinaz.com/jianli/free_%d.html'
for page in range(1, 4):
    # the first page has a different url format from the rest, so handle it separately
    if page == 1:
        new_url = 'http://sc.chinaz.com/jianli/free.html'
    else:
        new_url = url % page

    response = requests.get(url=new_url, headers=headers)
    response.encoding = 'utf-8'  # fix garbled Chinese by setting the encoding first
    page_text = response.text

    tree = etree.HTML(page_text)
    div_list = tree.xpath('//div[@id="container"]/div')
    for div in div_list:
        detail_url = div.xpath('./a/@href')[0]
        name = div.xpath('./a/img/@alt')[0]

        detail_page = requests.get(url=detail_url, headers=headers).text
        tree = etree.HTML(detail_page)
        # this yields every download mirror for the template
        download_list = tree.xpath('//div[@class="clearfix mt20 downlist"]/ul/li/a/@href')
        # pick a mirror at random so no single link gets hit too often and banned
        download_url = random.choice(download_list)
        data = requests.get(url=download_url, headers=headers).content
        fileName = name + '.rar'
        with open(fileName, 'wb') as fp:
            fp.write(data)
        print(fileName, 'downloaded')
```
3. BeautifulSoup parsing
Install: pip install bs4
Parsing workflow (a minimal sketch follows this list):
- Instantiate a BeautifulSoup object and load the page source into it
- Call that object's attributes and methods to locate tags and extract data
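A minimal sketch of those two steps over a made-up HTML string:

```python
from bs4 import BeautifulSoup

html = '''<div class="book-mulu"><ul>
  <li><a href="/book/sanguoyanyi/1.html">Chapter 1</a></li>
  <li><a href="/book/sanguoyanyi/2.html">Chapter 2</a></li>
</ul></div>'''

# step 1: load the page source into a BeautifulSoup object
soup = BeautifulSoup(html, 'lxml')

# step 2: locate tags with a css selector, then read text and attributes
for a in soup.select('.book-mulu > ul > li > a'):
    print(a.string, a['href'])  # the tag's text and its href attribute
```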
Example: use bs4 to scrape every chapter of Romance of the Three Kingdoms (三国演义) from 诗词名句网 (shicimingju.com) and save it to local disk
```python
import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}
url = 'http://www.shicimingju.com/book/sanguoyanyi.html'
page_text = requests.get(url=url, headers=headers).text

# parse out the chapter titles and chapter contents
soup = BeautifulSoup(page_text, 'lxml')
a_list = soup.select('.book-mulu > ul > li > a')
fp = open('./sanguo.txt', 'w', encoding='utf-8')
for a in a_list:  # an a tag can be used like a soup object, since it is page source too
    title = a.string
    detail_url = 'http://www.shicimingju.com' + a['href']
    detail_page_text = requests.get(url=detail_url, headers=headers).text
    soup = BeautifulSoup(detail_page_text, 'lxml')
    # unlike xpath, bs4's .text joins all the extracted text into one string directly
    content = soup.find('div', class_="chapter_content").text

    fp.write(title + ':' + content + '\n')
    print(title, 'saved!')
fp.close()
```