Using Parsing Libraries in Python Web Scraping (XPath and Beautiful Soup)

XPath

XPath (XML Path Language) is a language for addressing parts of an XML document. Its selection syntax is expressive enough that a scraper can rely on it for most information extraction tasks.

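Preparation

The examples below use Python's lxml library to parse HTML and evaluate XPath expressions.

On Windows, open a command prompt and run pip3 install lxml to install the library. Afterwards, start Python and run import lxml; if no error is raised, the installation succeeded.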
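The same check can be run from a script. This is a minimal sketch that only assumes lxml is installed; it prints the installed version through the LXML_VERSION tuple exposed by lxml.etree.

```python
from lxml import etree

# LXML_VERSION is a tuple of integers; the exact numbers depend on the installed release.
version = ".".join(str(part) for part in etree.LXML_VERSION)
print("lxml version:", version)
```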

We also saved a file named run.html in the same directory as the Python scripts. Its contents are:

<!DOCTYPE html>
<html>
    <body>
        <div>
            <ul>
                <ul>
                    <li class="name-0" color ="red"><a href="link1.html" ><span>first</span></a></li>
                    <li class="name-1" color ="red" old="21"><a href="link2.html">secode</a></li>
                    <li class="name-0" color="blue"><a href="link3.html">third</a></li>
                    <li class="name-3 boy"><a href="link4.html">fourth</a></li>
                    <li class="name-5 boy"><a href="link5.html" time="2020-7-29">fifth</a></li>
                    </ul>
            </ul>
        </div>
    </body>
</html>

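Common XPath rules

Expression	Description
/	Selects direct children of the current node
//	Selects descendants of the current node at any depth
`.`	Selects the current node
`..`	Selects the parent of the current node
@	Selects attributes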
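The rules in the table can be tried side by side. The sketch below parses a short inline HTML string with etree.HTML(); the string is an assumed stand-in for run.html so that the snippet runs on its own.

```python
from lxml import etree

# Inline stand-in for run.html (an assumption, not the article's file).
doc = etree.HTML('<div><ul><li class="x"><a href="link1.html">first</a></li></ul></div>')

# "/" walks one level at a time from the document root.
print(doc.xpath('/html/body/div/ul/li/a/text()'))
# "//" searches at any depth.
print(doc.xpath('//a/text()'))

li = doc.xpath('//li')[0]
# "." starts from the current node, "@" reads an attribute.
print(li.xpath('./a/@href'))
# ".." climbs to the parent: li -> ul -> div.
print(li.xpath('../..')[0].tag)
```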

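Selecting all nodes

An XPath expression that starts with // selects every node in the document that matches the rest of the expression.

from lxml import etree

if __name__ == '__main__':
    # Parse the local file with lxml's HTML parser
    html = etree.parse('run.html', etree.HTMLParser())
    print(type(html))
    # //* matches every element in the document
    res = html.xpath('//*')
    print(type(res))
    print(res)

Output: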

<class 'lxml.etree._ElementTree'>
<class 'list'>
[<Element html at 0x1ed5a81e248>, <Element body at 0x1ed5a8cfa48>, <Element div at 0x1ed5a8cf988>, <Element ul at 0x1ed5a8cf9c8>, <Element ul at 0x1ed5a8cfa88>, <Element li at 0x1ed5a8cfb08>, <Element a at 0x1ed5a8cfb48>, <Element span at 0x1ed5a8cfb88>, <Element li at 0x1ed5a8cfbc8>, <Element a at 0x1ed5a8cfac8>, <Element li at 0x1ed5a8cfc08>, <Element a at 0x1ed5a8cfc48>, <Element li at 0x1ed5a8cfc88>, <Element a at 0x1ed5a8cfcc8>, <Element li at 0x1ed5a8cfd08>, <Element a at 0x1ed5a8cfd48>]

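Starting the expression with // searches the whole document, and the * that follows matches elements of any name, so the result contains every element. The return value is a list in which each item is an Element object; its printed form shows the node name, such as html, body, div, or li.

We can also select all nodes with a given name: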

from lxml import etree

if __name__ == '__main__':
    html = etree.parse('run.html', etree.HTMLParser())
    # Select every li element in the document
    res = html.xpath('//li')
    print(res)
    # Index into the returned Python list (0-based)
    print(res[3])

Output:

[<Element li at 0x1c46fa23c48>, <Element li at 0x1c46fa23c08>, <Element li at 0x1c46fa23cc8>, <Element li at 0x1c46fa23d08>, <Element li at 0x1c46fa23d48>]
<Element li at 0x1c46fa23d08>

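This prints every li node, followed by the li node at index 3 of the returned Python list (the fourth one, since list indexes start at 0).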
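The printed form of an Element shows only its name and memory address. To look inside a selected node, lxml's element API can be used; the following is a small sketch on an assumed inline snippet rather than on run.html.

```python
from lxml import etree

# Inline stand-in for one li node of run.html (an assumption).
doc = etree.HTML('<ul><li class="name-3 boy"><a href="link4.html">fourth</a></li></ul>')
li = doc.xpath('//li')[0]

print(li.tag)                 # element name
print(li.get('class'))        # value of one attribute
print(dict(li.attrib))        # all attributes as a dict
markup = etree.tostring(li, encoding='unicode')   # serialized markup of the element
print(markup)
```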

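Selecting child nodes

Use / to select the direct children of an element and // to select its descendants at any depth.

Suppose we want every a node that is a direct child of an li node:

from lxml import etree

if __name__ == '__main__':
    html = etree.parse('run.html', etree.HTMLParser())
    # a elements that are direct children of any li element
    res = html.xpath('//li/a')
    print(res)

Output: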

[<Element a at 0x286380a3c48>, <Element a at 0x286380a3c08>, <Element a at 0x286380a3cc8>, <Element a at 0x286380a3d08>, <Element a at 0x286380a3d48>]

Here //li selects all li nodes, and /a then selects the a nodes that are direct children of those li nodes.
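The difference between / and // shows up as soon as the target is more than one level down. A minimal sketch, using an assumed two-item list instead of run.html:

```python
from lxml import etree

# Inline stand-in for run.html (an assumption).
doc = etree.HTML('<ul><li><a href="link1.html"><span>first</span></a></li>'
                 '<li><a href="link2.html">second</a></li></ul>')

direct = doc.xpath('//ul/a')      # a is not a direct child of ul, so nothing matches
nested = doc.xpath('//ul//a')     # a is a descendant of ul, so both links match
spans = doc.xpath('//li//span')   # span is a grandchild of the first li
print(len(direct), len(nested), len(spans))
```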

Selecting the parent node

How do we get the class attribute of the parent of the a node whose href attribute is link4.html?

from lxml import etree

if __name__ == '__main__':
    html = etree.parse('run.html', etree.HTMLParser())
    # Find the a element, step up to its parent with .., then read @class
    res = html.xpath('//a[@href="link4.html"]/../@class')
    print(res)

Output:

['name-3 boy']

The class attribute value is printed as expected. The .. step selects the parent of the current node.
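The parent can be reached in more than one way. The sketch below compares the .. shorthand, the explicit parent axis, and lxml's getparent() method on an assumed inline snippet.

```python
from lxml import etree

# Inline stand-in for run.html (an assumption).
doc = etree.HTML('<ul><li class="name-3 boy"><a href="link4.html">fourth</a></li></ul>')

short_form = doc.xpath('//a[@href="link4.html"]/../@class')
axis_form = doc.xpath('//a[@href="link4.html"]/parent::*/@class')

a = doc.xpath('//a[@href="link4.html"]')[0]
via_api = a.getparent().get('class')   # element API instead of XPath

print(short_form, axis_form, via_api)
```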

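Attribute matching

To select nodes whose attribute has a specific value, filter with the @ symbol inside a predicate. For example, to select every li node whose class attribute is exactly "name-0":

from lxml import etree

if __name__ == '__main__':
    html = etree.parse('run.html', etree.HTMLParser())
    # The predicate [@class="name-0"] keeps only li elements with that exact class value
    res = html.xpath('//li[@class="name-0"]')
    print(res)

Output: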

[<Element li at 0x20ddb8d3c08>, <Element li at 0x20ddb8d3cc8>]

The result contains the two matching li nodes.
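A predicate can also test whether an attribute is present at all, without comparing its value. This is a short sketch on an assumed inline snippet; not() is a standard XPath 1.0 function.

```python
from lxml import etree

# Inline stand-in for run.html (an assumption).
doc = etree.HTML('<ul><li class="name-0" color="red"><a href="link1.html">first</a></li>'
                 '<li class="name-3 boy"><a href="link4.html">fourth</a></li></ul>')

with_color = doc.xpath('//li[@color]/a/text()')           # li nodes that have a color attribute
without_color = doc.xpath('//li[not(@color)]/a/text()')   # li nodes that lack it
any_tag = doc.xpath('//*[@href="link4.html"]/text()')     # any element with this href
print(with_color, without_color, any_tag)
```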

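Getting text

The text() node test in XPath selects the text nodes of an element. The following gets the text of every li node whose class attribute is "name-0":

from lxml import etree

if __name__ == '__main__':
    html = etree.parse('run.html', etree.HTMLParser())
    # //text() collects text nodes at any depth below the matched li elements
    res = html.xpath('//li[@class="name-0"]//text()')
    print(res)

Output: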

['first', 'third']

The expected text is printed.

What happens if we change html.xpath('//li[@class="name-0"]//text()') to html.xpath('//li[@class="name-0"]/text()')?

[]

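The result is empty because the text sits deeper in the tree, inside the <span> and <a> nodes. /text() returns only the text nodes that are direct children of the current node, whereas //text() returns the text nodes of all its descendants.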
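The two forms can be compared on a node that has text at both levels. The sketch below also shows two ways to get the combined text as a single string: the XPath string() function and lxml's itertext() method. The inline HTML is an assumed example, not run.html.

```python
from lxml import etree

# Assumed snippet: the li has its own text plus text nested inside a/span.
doc = etree.HTML('<ul><li class="name-0">note: <a href="link1.html"><span>first</span></a></li></ul>')

direct_text = doc.xpath('//li/text()')    # only text nodes directly under li
all_text = doc.xpath('//li//text()')      # text nodes at any depth below li
joined = doc.xpath('string(//li)')        # string() concatenates all descendant text

li = doc.xpath('//li')[0]
joined_api = ''.join(li.itertext())       # same concatenation through the element API

print(direct_text, all_text, joined, joined_api)
```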

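Getting attributes

A predicate such as [@xxx="xxx"] restricts a selection by attribute; the same @ symbol can also be used as the last step of an expression to read an attribute's value.

To get the href attribute of every a node:

from lxml import etree

if __name__ == '__main__':
    html = etree.parse('run.html', etree.HTMLParser())
    # The final step @href returns attribute values instead of elements
    res = html.xpath('//li/a/@href')
    print(res)

Output: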

['link1.html', 'link2.html', 'link3.html', 'link4.html', 'link5.html']
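In a scraper, extracted href values are usually relative and need to be resolved against the page address before they can be requested. A minimal sketch with urllib.parse.urljoin; BASE_URL and the inline HTML are hypothetical values chosen for illustration.

```python
from urllib.parse import urljoin
from lxml import etree

BASE_URL = 'http://example.com/list/'   # hypothetical page address, not from the article

# Inline stand-in for run.html (an assumption).
doc = etree.HTML('<ul><li><a href="link1.html">first</a></li>'
                 '<li><a href="/top/link2.html">second</a></li></ul>')

# Resolve each extracted href against the page address.
absolute = [urljoin(BASE_URL, href) for href in doc.xpath('//li/a/@href')]
for url in absolute:
    print(url)
```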

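Matching multi-valued attributes

The class attribute of the <li class="name-3 boy"> node holds two values.

If we want the text of the node whose class attribute contains "name-3", the earlier form //li[@class="name-3"]//text() no longer works, because the equality test compares against the whole attribute string "name-3 boy".

from lxml import etree

if __name__ == '__main__':
    html = etree.parse('run.html', etree.HTMLParser())
    # Exact comparison: "name-3" is not equal to "name-3 boy", so nothing matches
    res = html.xpath('//li[@class="name-3"]//text()')
    print(res)

The output is empty: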

[]

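This is where contains() comes in. As the name suggests, contains(@class, "name-3") is true whenever the attribute value contains the given string, so the node is selected.

from lxml import etree

if __name__ == '__main__':
    html = etree.parse('run.html', etree.HTMLParser())
    # contains() does a substring test on the attribute value
    res = html.xpath('//li[contains(@class,"name-3")]//text()')
    print(res)

Output: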

['fourth']
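Because contains() is a plain substring test, it would also match a class such as name-30. A common XPath 1.0 idiom for matching a whole class token pads the attribute with spaces first. The sketch below contrasts the two; the second li is a made-up edge case that does not appear in run.html.

```python
from lxml import etree

# The second li is a made-up edge case added for illustration.
doc = etree.HTML('<ul><li class="name-3 boy">fourth</li>'
                 '<li class="name-30 boy">extra</li></ul>')

# Substring match: also catches name-30.
loose = doc.xpath('//li[contains(@class, "name-3")]/text()')

# Token match: pad the normalized class list with spaces and look for " name-3 ".
exact = doc.xpath(
    '//li[contains(concat(" ", normalize-space(@class), " "), " name-3 ")]/text()'
)

print(loose)
print(exact)
```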

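Matching multiple attributes

Sometimes a node can only be identified by several attributes together, in which case we match all of them in a single predicate.

For example, to get the text of the li node whose class attribute is "name-0" and whose color attribute is "blue":

from lxml import etree

if __name__ == '__main__':
    html = etree.parse('run.html', etree.HTMLParser())
    # Both conditions must hold for the li element to be selected
    res = html.xpath('//li[@class="name-0" and @color="blue"]//text()')
    print(res)

Output: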

['third']

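The and used here is an XPath operator. Other commonly used operators are listed below.

Operator	Description
or	Logical or
and	Logical and
mod	Remainder of a division
|	Union of two node sets
+	Addition
-	Subtraction
*	Multiplication
div	Division
=	Equal to
!=	Not equal to
<	Less than
<=	Less than or equal to
>	Greater than
>=	Greater than or equal to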
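A few of these operators in action. This is a sketch on an assumed three-item list modeled on run.html, showing or, a numeric comparison on an attribute, mod on position(), the | union, and arithmetic on count().

```python
from lxml import etree

# Inline stand-in modeled on the first three li nodes of run.html (an assumption).
doc = etree.HTML(
    '<ul>'
    '<li class="name-0" color="red"><a href="link1.html">first</a></li>'
    '<li class="name-1" color="red" old="21"><a href="link2.html">second</a></li>'
    '<li class="name-0" color="blue"><a href="link3.html">third</a></li>'
    '</ul>'
)

either = doc.xpath('//li[@class="name-1" or @color="blue"]/a/text()')      # or
older = doc.xpath('//li[@old > 20]/a/text()')                              # numeric comparison
odd = doc.xpath('//li[position() mod 2 = 1]/a/text()')                     # mod on the position
union = doc.xpath('(//li[@old] | //li[@color="blue"])/a/text()')           # union of two node sets
doubled = doc.xpath('count(//li) * 2')                                     # arithmetic yields a float

print(either, older, odd, union, doubled)
```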

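Selecting by position

Sometimes a predicate matches several nodes but we only want one of them. Passing an index in square brackets selects the node at that position. Note that XPath positions start at 1, not 0.

from lxml import etree

if __name__ == '__main__':
    html = etree.parse('run.html', etree.HTMLParser())
    # First li node; its text sits inside a/span
    res = html.xpath('//li[1]/a/span/text()')
    print(res)
    # Last li node
    res = html.xpath('//li[last()]/a/text()')
    print(res)
    # li nodes at positions 1 to 3
    res = html.xpath('//li[position()<4]/a/text()')
    print(res)

Output:

['first']
['fifth']
['secode', 'third']

The third result does not include first, because in the first li node the text is inside a span rather than directly inside the a node.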
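One detail worth knowing is that a positional predicate such as //li[1] is evaluated per parent, while (//li)[1] indexes into the whole result. The sketch below uses an assumed document with two lists to make the difference visible.

```python
from lxml import etree

# Assumed document with two separate lists.
doc = etree.HTML('<div><ul><li>a1</li><li>a2</li><li>a3</li></ul>'
                 '<ul><li>b1</li><li>b2</li></ul></div>')

per_parent = doc.xpath('//li[1]/text()')             # first li within each ul
whole_doc = doc.xpath('(//li)[1]/text()')            # first li in the whole document
second_last = doc.xpath('//li[last()-1]/text()')     # second-to-last li within each ul
after_first = doc.xpath('//li[position()>1]/text()') # everything except the first li of each ul

print(per_parent, whole_doc, second_last, after_first)
```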

Selecting along axes

XPath also provides a number of axes for selecting nodes relative to the current one, such as children, siblings, and ancestors.

from lxml import etree

if __name__ == '__main__':
    html = etree.parse('run.html', etree.HTMLParser())
    res = html.xpath('//li[1]/ancestor::*')  # all ancestors
    print(res)
    res = html.xpath('//li[1]/ancestor::div')  # only div ancestors
    print(res)
    res = html.xpath('//li[2]/attribute::*')  # all attribute values
    print(res)
    res = html.xpath('//li[1]/child::a[@href="link1.html"]')  # child a node with the given href
    print(res)
    res = html.xpath('//li[1]/descendant::span')  # span descendants
    print(res)
    res = html.xpath('//li[1]/following::*[2]')  # second node after the current node's end tag
    print(res)
    res = html.xpath('//li[1]/following-sibling::*')  # all later siblings
    print(res)

Output:

[<Element html at 0x1e741073ec8>, <Element body at 0x1e741140748>, <Element div at 0x1e741140808>, <Element ul at 0x1e741140848>, <Element ul at 0x1e741140888>]
[<Element div at 0x1e741140808>]
['name-1', 'red', '21']
[<Element a at 0x1e741140748>]
[<Element span at 0x1e741140848>]
[<Element a at 0x1e741140988>]
[<Element li at 0x1e741140848>, <Element li at 0x1e741140808>, <Element li at 0x1e741140948>, <Element li at 0x1e741140708>]

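The available axes are:

Axis name	Result
ancestor	Selects all ancestors (parent, grandparent, and so on) of the current node.
ancestor-or-self	Selects all ancestors (parent, grandparent, and so on) of the current node, plus the current node itself.
attribute	Selects all attributes of the current node.
child	Selects all children of the current node.
descendant	Selects all descendants (children, grandchildren, and so on) of the current node.
descendant-or-self	Selects all descendants (children, grandchildren, and so on) of the current node, plus the current node itself.
following	Selects all nodes in the document after the end tag of the current node.
following-sibling	Selects all siblings after the current node.
namespace	Selects all namespace nodes of the current node.
parent	Selects the parent of the current node.
preceding	Selects all nodes in the document before the start tag of the current node, excluding its ancestors.
preceding-sibling	Selects all siblings before the current node.
self	Selects the current node.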

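The code above exercised some of these axes. A few more examples:

Example	Result
child::book	Selects all book nodes that are children of the current node
attribute::lang	Selects the lang attribute of the current node
`child::*`	Selects all element children of the current node
`attribute::*`	Selects all attributes of the current node
child::text()	Selects all text-node children of the current node
child::a[@href="xxx"]	Selects the a children of the current node whose href is xxx
descendant::book	Selects all book descendants of the current node
ancestor::book	Selects all book ancestors of the current node
ancestor-or-self::book	Selects all book ancestors of the current node, plus the current node itself if it is a book node
`child::*/child::price`	Selects all price grandchildren of the current node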
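Some of the axes that were not exercised earlier can be tried the same way. The following sketch runs preceding-sibling, following-sibling, parent, descendant-or-self, and ancestor-or-self against an assumed three-item list.

```python
from lxml import etree

# Inline stand-in for run.html (an assumption).
doc = etree.HTML(
    '<ul><li class="name-0"><a href="link1.html">first</a></li>'
    '<li class="name-1"><a href="link2.html">second</a></li>'
    '<li class="name-2"><a href="link3.html">third</a></li></ul>'
)

before = doc.xpath('//li[2]/preceding-sibling::li/@class')      # siblings before the second li
after = doc.xpath('//li[2]/following-sibling::li/@class')       # siblings after the second li
parent = doc.xpath('//a[@href="link3.html"]/parent::li/@class') # parent, restricted to li
subtree = [e.tag for e in doc.xpath('//li[1]/descendant-or-self::*')]
chain = [e.tag for e in doc.xpath('//a[@href="link1.html"]/ancestor-or-self::*')]

print(before, after, parent, subtree, chain)
```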

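Beautiful Soup

In short, Beautiful Soup is a Python library for parsing HTML and XML that makes it easy to extract data from web pages.

Beautiful Soup relies on an underlying parser. Besides the HTML parser in the Python standard library, it supports third-party parsers such as lxml and html5lib. The lxml parser handles both HTML and XML, is very fast, and tolerates malformed markup well, so it is the recommended choice.

Preparation

Install Beautiful Soup before starting. The current major version is 4.x; on Windows, run pip3 install beautifulsoup4 in a command prompt to install it.

In the working directory, create a text file named story.txt with the following contents: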

<!DOCTYPE html>
<html>
<head>
<title>The frog prince</title></head>
<body>
<p class="title" name="prince"><b>The frog prince</b></p>
<p class="story">In ancient times, good wishes in people's hearts could often come true.In those wonderful times there once lived a king. The king had several daughters, all of whom were very beautiful.They were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>and
<a href="http://example.com/nana" class="sister" id="link3">Nana</a>!
and they live at the bottom of a well.
</p>
<p class="story">...</p>
</body>
</html>

Let's start with a first taste of Beautiful Soup:

from bs4 import BeautifulSoup

if __name__ == '__main__':
    with open('story.txt') as file:
        text = file.read()
    # Parse the text with the lxml parser
    soup = BeautifulSoup(text, 'lxml')
    print(type(soup))
    print(soup.prettify())
    print(soup.title.string)

Output:

<class 'bs4.BeautifulSoup'>
<!DOCTYPE html>
<html>
 <head>
  <title>
   The frog prince
  </title>
 </head>
 <body>
  <p class="title" name="prince">
   <b>
    The frog prince
   </b>
  </p>
  <p class="story">
   In ancient times, good wishes in people's hearts could often come true.In those wonderful times there once lived a king. The king had several daughters, all of whom were very beautiful.They were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/nana" id="link3">
    Nana
   </a>
   !
and they live at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>
The frog prince

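BeautifulSoup() returns an object of type bs4.BeautifulSoup. Calling its prettify() method prints the parsed document with standard indentation.

soup.title.string then prints the text of the title node: soup.title selects the title node in the HTML, and its string attribute returns the text inside it.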

Node selectors

Accessing a node name as an attribute selects that node directly, and the string attribute then returns its text. This style of selection is very fast.

from bs4 import BeautifulSoup

def select_element(html):
    soup = BeautifulSoup(html, 'lxml')
    print(soup.title)
    print(type(soup.title))
    print(soup.title.string)
    print(soup.head)
    print(soup.p)

if __name__ == '__main__':
    with open('story.txt') as file:
        text = file.read()
    select_element(text)

Output:

<title>The frog prince</title>
<class 'bs4.element.Tag'>
The frog prince
<head>
<title>The frog prince</title></head>
<p class="title" name="prince"><b>The frog prince</b></p>

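We first print the result of selecting the title node, which is the title node together with its text. Its type is <class 'bs4.element.Tag'>, an important type: every result returned by a selector is a Tag, and a Tag has attributes such as string, which returns the node's text. Next we select the head node and get the head node with everything inside it. Note that when we select the p node, only the first p node and its contents are printed, even though the document contains several p nodes.

Extracting information

A node carries several pieces of information, such as its name, attributes, and text, and we usually need to extract them when parsing a page.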

from bs4 import BeautifulSoup

def get_information(html):
    soup = BeautifulSoup(html, 'lxml')
    # Get the node name
    print(soup.title.name)
    # Get attributes
    print(soup.p.attrs)
    print(soup.p['name'])
    print(soup.p['class'])  # class is returned as a list
    # Get the text content
    print(soup.p.string)

if __name__ == '__main__':
    with open('story.txt') as file:
        text = file.read()
    get_information(text)

Output:

title
{'class': ['title'], 'name': 'prince'}
prince
['title']
The frog prince

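Getting the name: the name attribute returns the name of a node; above we got the name of the title node.
Getting attributes: attrs returns all attributes as a dictionary. Because it is an ordinary dictionary, a single value can be read by key, and indexing the node itself with ['key'] is a convenient shortcut that does the same thing. Be careful: some attribute values come back as strings, such as name, while others come back as lists, such as class, because an element can have several classes.
Getting the content: the string attribute returns the text inside the node; here it is the text of the first p node.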
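Two edge cases are easy to trip over here: indexing a missing attribute raises KeyError, and string returns None when a node has more than one child. The sketch below shows the usual alternatives, get() and get_text(). The markup is an assumed variation of the title paragraph, and html.parser is used only so that the snippet runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Assumed variation of the title paragraph, with two classes and two child tags.
html = '<p class="title big" name="prince"><b>The frog prince</b> <i>(tale)</i></p>'
soup = BeautifulSoup(html, 'html.parser')   # html.parser chosen to avoid extra dependencies

p = soup.p
print(p.get('id'))       # get() returns None for a missing attribute instead of raising KeyError
print(p['class'])        # a multi-valued attribute comes back as a list
print(p.string)          # None, because p has more than one child
print(p.get_text())      # concatenated text of all descendants
```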

Nested selection

Every selected node is of type <class 'bs4.element.Tag'>, so we can keep selecting from it.

from bs4 import BeautifulSoup

def nested_select(html):
    soup = BeautifulSoup(html, 'lxml')
    # Select head first, then title inside it
    print(soup.head.title)
    print(type(soup.head.title))
    print(soup.head.title.string)

if __name__ == '__main__':
    with open('story.txt') as file:
        text = file.read()
    nested_select(text)

<title>The frog prince</title>
<class 'bs4.element.Tag'>
The frog prince

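The first line of output is the result of two chained node selections. The nested selection succeeded and returned the title node with its text; its type is still <class 'bs4.element.Tag'>, so we could keep nesting further.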
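Chained selection only works while every tag in the chain exists: selecting a missing tag returns None, and the next attribute access on None raises AttributeError. A minimal sketch of guarding against that, on an assumed snippet without a body tag and with html.parser so that no tags are added automatically:

```python
from bs4 import BeautifulSoup

# Assumed snippet with no body tag; html.parser does not add one.
soup = BeautifulSoup('<html><head><title>The frog prince</title></head></html>', 'html.parser')

print(soup.head.title.string)   # every tag in the chain exists
print(soup.body)                # None: there is no body tag here

# Guard each step before going deeper.
body = soup.body
link_text = body.a.string if body is not None and body.a is not None else None
print(link_text)
```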

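Related selection

A selection cannot always be made in one step. Often we first pick a node as a reference point and then select its parent, children, siblings, and so on relative to it.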

str = '''
<!DOCTYPE html>
<html>
<head>
<title>The frog prince</title></head>
<body>
<p class="title" name="prince"><b>The frog prince</b><a>time:2020-7-30<span>nice!</span></a></p>
<p class="story">...</p>
</body>
</html>'''

from bs4 import BeautifulSoup

def link_select(html, index):
    """Print nodes related to a reference node; index picks the relation."""
    soup = BeautifulSoup(html, 'lxml')
    if index == 1:
        print('**************direct child*************')
        print(soup.p.contents)  # list of direct children
        print(soup.p.children)  # iterator over the same children
        for i, child in enumerate(soup.p.children):
            print(i, child)
    elif index == 2:
        print('**************descendants***********')
        print(soup.p.descendants)
        for i, child in enumerate(soup.p.descendants):
            print(i, child)
    elif index == 3:
        print('**************parents****************')
        print(soup.a.parent)
    elif index == 4:
        print('**************ancestors**************')
        print(type(soup.a.parents))
        for i, parent in enumerate(soup.a.parents):
            print(i, parent)
    elif index == 5:
        print('**************brothers**************')
        print('Next Sibling', soup.a.next_sibling)
        print('Prev Sibling', soup.a.previous_sibling)
        print('Next Siblings', list(enumerate(soup.a.next_siblings)))
        print('Prev Siblings', list(enumerate(soup.a.previous_siblings)))

if __name__ == '__main__':
    link_select(str, 5)

Selecting direct children

Calling link_select(str, 1) prints:

**************direct child*************
[<b>The frog prince</b>, <a>time:2020-7-30<span>nice!</span></a>]
<list_iterator object at 0x000001BE424D84C8>
0 <b>The frog prince</b>
1 <a>time:2020-7-30<span>nice!</span></a>

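The contents attribute returns the direct children of a node as a list. The children attribute yields the same nodes, but as an iterator (a list_iterator in the output above), so we loop over it with for to print them.

Selecting descendants

Calling link_select(str, 2) prints: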

**************descendants***********
<generator object Tag.descendants at 0x00000223EC2C0E48>
0 <b>The frog prince</b>
1 The frog prince
2 <a>time:2020-7-30<span>nice!</span></a>
3 time:2020-7-30
4 <span>nice!</span>
5 nice!

The result is a generator. The for loop shows that it yields every descendant, including text nodes.

Selecting the parent node

Calling link_select(str, 3) prints:

**************parents****************
<p class="title" name="prince"><b>The frog prince</b><a>time:2020-7-30<span>nice!</span></a></p>

The output is the p node, the direct parent of the a node, with everything inside it.

Selecting ancestors

Calling link_select(str, 4) prints:

**************ancestors**************
<class 'generator'>
0 <p class="title" name="prince"><b>The frog prince</b><a>time:2020-7-30<span>nice!</span></a></p>
1 <body>
<p class="title" name="prince"><b>The frog prince</b><a>time:2020-7-30<span>nice!</span></a></p>
<p class="story">...</p>
</body>
2 <html>
<head>
<title>The frog prince</title></head>
<body>
<p class="title" name="prince"><b>The frog prince</b><a>time:2020-7-30<span>nice!</span></a></p>
<p class="story">...</p>
</body>
</html>
3 <!DOCTYPE html>
<html>
<head>
<title>The frog prince</title></head>
<body>
<p class="title" name="prince"><b>The frog prince</b><a>time:2020-7-30<span>nice!</span></a></p>
<p class="story">...</p>
</body>
</html>

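The for loop prints every ancestor of the a node with its contents. The last item is the BeautifulSoup object itself, which represents the whole document.

Selecting siblings

Calling link_select(str, 5) prints: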

**************brothers**************
Next Sibling None
Prev Sibling <b>The frog prince</b>
Next Siblings []
Prev Siblings [(0, <b>The frog prince</b>)]

The next_sibling attribute returns the sibling right after the current node and previous_sibling the one right before it, while next_siblings and previous_siblings iterate over all siblings after and before the current node respectively.
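In the example above the tags sit directly next to each other. When the markup contains text or line breaks between tags, as story.txt does, next_sibling returns that text node rather than the next tag. The sketch below shows this on an assumed two-link paragraph and uses find_next_sibling() to skip to the next tag; html.parser is used only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup, Tag

# Assumed snippet modeled on the links in story.txt.
html = '''<p>
<a id="link1">Elsie</a>,
<a id="link2">Lacie</a>
</p>'''
soup = BeautifulSoup(html, 'html.parser')

first = soup.a
print(repr(first.next_sibling))        # the text between the two tags, not a Tag
next_tag = first.find_next_sibling('a')
print(next_tag)                        # the next a tag, text nodes skipped

# Keep only real tags when iterating over siblings.
tag_names = [s.name for s in first.next_siblings if isinstance(s, Tag)]
print(tag_names)
```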

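Method selectors

The approaches above select through attributes. They are very fast, but they become cumbersome for complicated queries. Beautiful Soup also provides query methods such as find_all() and find(); calling them with the appropriate arguments allows flexible searches.

find_all()

As the name suggests, it finds all elements that match the given conditions.

API: find_all(name, attrs, recursive, text, **kwargs)

name

Query elements by node name: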

from bs4 import BeautifulSoup

def find_all_forname(html):
    soup = BeautifulSoup(html, 'lxml')
    print(soup.find_all(name='a'))
    print(soup.find_all(name='a')[0])
    # Nested query: search for a nodes inside each p node
    for p in soup.find_all(name='p'):
        for a in p.find_all(name='a'):
            print(a.string)

if __name__ == '__main__':
    with open('story.txt') as file:
        text = file.read()
    find_all_forname(text)

[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, <a class="sister" href="http://example.com/nana" id="link3">Nana</a>]
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
Elsie
Lacie
Nana

Every element returned by find_all() is a bs4.element.Tag, so we can run nested queries on the results.

attrs

We can also query by passing attributes:

from bs4 import BeautifulSoup

def find_all_forattrs(html):
    soup = BeautifulSoup(html, 'lxml')
    # Match on a single attribute
    print(soup.find_all(attrs={'id': 'link3'}))
    # Match on several attributes at once
    print(soup.find_all(attrs={'class': 'sister', 'id': 'link1'}))

if __name__ == '__main__':
    with open('story.txt') as file:
        text = file.read()
    find_all_forattrs(text)

[<a class="sister" href="http://example.com/nana" id="link3">Nana</a>]
[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
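Common attributes can also be passed as keyword arguments instead of an attrs dictionary. Since class is a reserved word in Python, Beautiful Soup accepts class_ for it. The following sketch also shows the limit argument; the markup is made up for illustration, and html.parser is used only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = ('<p><a class="sister" id="link1">Elsie</a>'
        '<a class="sister" id="link2">Lacie</a>'
        '<a class="brother" id="link3">Tom</a></p>')
soup = BeautifulSoup(html, 'html.parser')

by_class = [a['id'] for a in soup.find_all('a', class_='sister')]   # class_ avoids the reserved word
by_id = [a['id'] for a in soup.find_all(id='link3')]                # keyword filter on id
limited = [a['id'] for a in soup.find_all('a', limit=2)]            # stop after two matches

print(by_class, by_id, limited)
```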

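text

The text argument matches against the text of nodes. It accepts either a string or a regular expression. Since Beautiful Soup 4.4.0 the same argument is also available under the name string; text is kept for backward compatibility.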

import re
from bs4 import BeautifulSoup

def find_all_fortext(html):
    soup = BeautifulSoup(html, 'lxml')
    # Match text nodes against a regular expression
    print(soup.find_all(text=re.compile('Lacie')))

if __name__ == '__main__':
    with open('story.txt') as file:
        text = file.read()
    find_all_fortext(text)

['Lacie']
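Note that the result is a list of text nodes, not tags. The sketch below contrasts an exact string match with a regular expression and shows how to get back to the enclosing tag through parent. It uses the string keyword (the newer name of text) on made-up markup, with html.parser to avoid extra dependencies.

```python
import re
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = '<p><a id="link1">Elsie</a><a id="link2">Lacie</a><a id="link3">Lacie Jr.</a></p>'
soup = BeautifulSoup(html, 'html.parser')

exact = soup.find_all(string='Lacie')                  # whole string must be equal
pattern = soup.find_all(string=re.compile('Lacie'))    # regex may match anywhere in the string
owners = [s.parent['id'] for s in pattern]             # step from each text node back to its tag

print(exact, pattern, owners)
```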

find()

find() works like find_all(), except that find_all() returns all matching elements while find() returns only the first match.
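The two methods also differ in what they return when nothing matches, which matters for error handling. A short sketch on made-up markup, with html.parser to avoid extra dependencies:

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
soup = BeautifulSoup('<p><a id="link1">Elsie</a><a id="link2">Lacie</a></p>', 'html.parser')

first = soup.find('a')                   # first match only, as a Tag
as_list = soup.find_all('a', limit=1)    # same node, but wrapped in a list
missing_one = soup.find('table')         # None when nothing matches
missing_all = soup.find_all('table')     # empty list when nothing matches

print(first, as_list, missing_one, missing_all)
```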

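Other query methods

Method	Description
find_parents()	Returns all matching ancestors
find_parent()	Returns the closest matching ancestor (the direct parent when no filter is given)
find_next_siblings()	Returns all matching siblings after the node
find_next_sibling()	Returns the first matching sibling after the node
find_previous_siblings()	Returns all matching siblings before the node
find_previous_sibling()	Returns the first matching sibling before the node
find_all_next()	Returns all matching nodes after the node
find_next()	Returns the first matching node after the node
find_all_previous()	Returns all matching nodes before the node
find_previous()	Returns the first matching node before the node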
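A few of these methods applied to one reference node. This is a sketch on made-up markup, starting from the middle link; html.parser is used only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration.
html = ('<div><p class="story"><a id="link1">Elsie</a><a id="link2">Lacie</a>'
        '<a id="link3">Nana</a></p><p class="end">...</p></div>')
soup = BeautifulSoup(html, 'html.parser')
middle = soup.find(id='link2')   # reference node

parent_class = middle.find_parent('p')['class']                    # closest p ancestor
ancestors = [t.name for t in middle.find_parents(['p', 'div'])]    # matching ancestors, nearest first
next_id = middle.find_next_sibling('a')['id']                      # first a sibling after it
prev_id = middle.find_previous_sibling('a')['id']                  # first a sibling before it
later_links = [t['id'] for t in middle.find_all_next('a')]         # all a nodes later in the document
next_p = middle.find_next('p')['class']                            # first p node later in the document

print(parent_class, ancestors, next_id, prev_id, later_links, next_p)
```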