Python 3 Web Scraping (Part 2): The Beautiful Soup Parsing Library

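Beautiful Soup is a Python library for extracting data from HTML and XML documents. Once you understand the structure of the HTML or XML, pulling out the data you need is straightforward.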

I. Preparation

1. Installing the library and a parser

This article uses beautifulsoup4 with lxml as the parser.

pip install beautifulsoup4
pip install lxml

Sample HTML

The following HTML snippet is reused as the example throughout this article.

html_doc = '''
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
'''

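II. Objects

1. Creating the object

Parsing HTML with BeautifulSoup() returns a BeautifulSoup object, which can print the document in a standard indented layout.
BeautifulSoup(markup, features)

markup: the HTML or XML document (a string or an open file handle)
features: the parser to use, e.g. 'lxml', 'html.parser', or 'xml'; optional, an HTML parser is chosen by default

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')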

print(soup.prettify())
# <html>
#  <head>
#   <title>
#    The Dormouse's story
#   </title>
#  </head>
#  <body>
#   <p class="title">
#    <b>
#     The Dormouse's story
#    </b>
#   </p>
#   <p class="story">
#    Once upon a time there were three little sisters; and their names were
#    <a class="sister" href="http://example.com/elsie" id="link1">
#     Elsie
#    </a>
#    ,
#    <a class="sister" href="http://example.com/lacie" id="link2">
#     Lacie
#    </a>
#    and
#    <a class="sister" href="http://example.com/tillie" id="link3">
#     Tillie
#    </a>
#    ; and they lived at the bottom of a well.
#   </p>
#   <p class="story">
#    ...
#   </p>
#  </body>
# </html>
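
The parser named in the second argument also decides how incomplete markup is repaired. As a minimal sketch, the snippet below uses html.parser, the parser bundled with Python's standard library (an assumption made here so the example runs without lxml installed); lxml would typically also wrap the same fragment in html and body tags. The names fragment and repaired are illustrative only.

```python
from bs4 import BeautifulSoup

# Deliberately invalid markup: an unclosed <a> followed by a stray </p>.
fragment = "<a></p>"

# html.parser drops the stray </p> and closes the <a> itself.
soup = BeautifulSoup(fragment, "html.parser")
repaired = str(soup)
print(repaired)
```

Because parsers repair broken markup differently, it is safer to name the parser explicitly than to rely on whichever one happens to be installed.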

2. Types of objects

Beautiful Soup turns an HTML document into a tree of Python objects. Every node is one of four types: Tag, NavigableString, BeautifulSoup, or Comment.

(1) The Tag class

A Tag object corresponds to a tag in the original HTML or XML document, such as html, body, title, p, div, or span.
Access: soup.tagName

soup = BeautifulSoup('<p class="boldest">Extremely bold</p>', 'lxml')
tag = soup.p
type(tag)    # <class 'bs4.element.Tag'>

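Properties of a Tag:

Name: the tag's name, tag.name
- Changing a tag's name is reflected in any HTML subsequently generated from the same BeautifulSoup object

print(tag.name)    # p
tag.name = 'span'
print(tag)
# <span class="boldest">Extremely bold</span>

Attributes: the tag's attributes
- A tag can have any number of attributes
- They are handled like a dictionary
  - Access: square brackets, tag['attr']; the whole dictionary is available as tag.attrs
  - Attributes can be added, removed, and modified
- Multi-valued attributes:
  - HTML defines several multi-valued attributes (class is the most common); XML has none
  - Reading a multi-valued attribute returns a list
  - Reading any other attribute returns a string
  - Assigning a list to a multi-valued attribute joins the values into a single space-separated value on output

# Access
print(tag['class'])    # ['boldest']
print(tag.attrs)       # {'class': ['boldest']}

# Add, modify
tag['class'] = 'verybold'
tag['id'] = 1
tag
# <span class="verybold" id="1">Extremely bold</span>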

# Delete
del tag['class']
del tag['id']
tag
# <span>Extremely bold</span>

# Accessing a missing attribute
tag['class']
# KeyError: 'class'
print(tag.get('class'))
# None

# A multi-valued attribute is returned as a list
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
css_soup.p['class']
# ['body', 'strikeout']

# Any other attribute is returned as a string
id_soup = BeautifulSoup('<p id="my id"></p>', 'lxml')
id_soup.p['id']
# 'my id'

# Assigning a list to a multi-valued attribute joins the values on output
rel_soup = BeautifulSoup('<p>Back to the <a rel="index">homepage</a></p>', 'lxml')
rel_soup.a['rel']
# ['index']
rel_soup.a['rel'] = ['index', 'contents']
print(rel_soup.p)
# <p>Back to the <a rel="index contents">homepage</a></p>

# XML has no multi-valued attributes
xml_soup = BeautifulSoup('<p class="body strikeout"></p>', 'xml')
xml_soup.p['class']
# 'body strikeout'

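(2) The NavigableString class

Text usually sits inside a tag. Beautiful Soup wraps that text in the NavigableString class, available as tag.string.

soup = BeautifulSoup('<p class="boldest">Extremely bold</p>', 'lxml')
tag = soup.p
tag.string          # 'Extremely bold'
type(tag.string)    # <class 'bs4.element.NavigableString'>

A NavigableString behaves like a Python string and can be converted to a plain one with str().

plain_string = str(tag.string)    # 'Extremely bold'
type(plain_string)                # <class 'str'>

A string cannot be edited in place, but it can be replaced with replace_with().

tag.string.replace_with('hello world')
tag    # <p class="boldest">hello world</p>

NavigableString supports most of the attributes and methods described under navigating and searching the tree, but not .contents, .string, or find().

Note: a NavigableString holds only text and cannot contain anything else (such as a tag).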
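
As a small sketch of how a NavigableString relates to an ordinary Python string, the snippet below checks its type, converts it, and reads its parent. It assumes the html.parser backend so that it runs without lxml; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="boldest">Extremely bold</p>', "html.parser")
text = soup.p.string

# NavigableString subclasses str, so ordinary string methods work on it.
is_str = isinstance(text, str)
upper = text.upper()

# str() yields a plain string that is detached from the parse tree.
plain = str(text)

# A string node still knows which tag contains it.
parent_name = text.parent.name

print(is_str, upper, type(plain).__name__, parent_name)
```

Converting with str() is useful when the text needs to outlive the parsed document, since a NavigableString keeps a reference to the whole tree.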

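(3) The BeautifulSoup object

It represents the document as a whole.
It supports most of the methods described under navigating and searching the tree.
Its name is '[document]'.
It has no attributes of its own.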
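
These properties can be checked directly. The following is a minimal sketch on a one-tag document, assuming the html.parser backend; the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="x">hi</p>', "html.parser")

doc_name = soup.name        # the special name of the document object
doc_attrs = soup.attrs      # empty: the document itself has no attributes
doc_parent = soup.parent    # nothing sits above the document
first_p = soup.find("p")    # search methods work as they do on a tag

print(doc_name, doc_attrs, doc_parent, first_p)
```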

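(4) The Comment class

A Comment is the comment part of a document and is a special kind of NavigableString.
When it appears in output, it is rendered in comment syntax.

markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'lxml')
comment = soup.b.string
type(comment)
# <class 'bs4.element.Comment'>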
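
Because a Comment is a NavigableString, it shows up when iterating over a document's strings, which is rarely wanted when scraping visible text. One possible way to separate the two, sketched here with html.parser and the string argument of find_all() (the newer name for text), is an isinstance check. The markup and variable names are made up for the illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

markup = "<p>visible<!-- hidden note --></p>"
soup = BeautifulSoup(markup, "html.parser")

# string=True matches every string node, comments included.
all_strings = soup.find_all(string=True)

comments = [node for node in all_strings if isinstance(node, Comment)]
texts = [str(node) for node in all_strings if not isinstance(node, Comment)]

print(comments, texts)
print(str(soup))  # the comment is written back in <!-- --> syntax
```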

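III. Navigating the tree

1. Child nodes

A tag may contain strings and other tags; all of these are its children. A string has no children.

(1) .tagName

Access a tag directly by its name.
This returns only the first tag with that name.
To get every tag with that name, use find_all(), described under searching the tree.

from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, 'lxml')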
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

(2) .contents

Returns the tag's children as a list.

head_tag = soup.head
head_tag.contents
# [<title>The Dormouse's story</title>]

(3) .children

The .children generator lets you loop over a tag's children.

for child in head_tag.contents[0].children:
    print(child)    # The Dormouse's story

(4) .descendants

.descendants iterates recursively over all of a tag's descendants.

for child in head_tag.descendants:
    print(child)
    # <title>The Dormouse's story</title>
    # The Dormouse's story

(5) .string

If a tag has exactly one child and that child is a NavigableString, .string returns it.
If a tag has exactly one child tag, and that child has a single NavigableString child, .string on the outer tag returns that same string.
If a tag contains more than one child, .string is None.

head_tag.contents[0].string    # "The Dormouse's story"
head_tag.string                # "The Dormouse's story"
soup.html.string               # None

(6) .strings

Iterates over every string inside the tag, including whitespace-only strings.

for string in soup.strings:
    print(repr(string))
    # "The Dormouse's story"
    # '\n\n'
    # "The Dormouse's story"
    # '\n\n'
    # 'Once upon a time there were three little sisters; and their names were\n'
    # 'Elsie'
    # ',\n'
    # 'Lacie'
    # ' and\n'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '\n\n'
    # '...'
    # '\n'

(7) .stripped_strings

Iterates over every string inside the tag, skipping whitespace-only strings and stripping leading and trailing whitespace from the rest.

for string in soup.stripped_strings:
    print(repr(string))
    # "The Dormouse's story"
    # "The Dormouse's story"
    # 'Once upon a time there were three little sisters; and their names were'
    # 'Elsie'
    # ','
    # 'Lacie'
    # 'and'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
    # '...'
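
A common use of .stripped_strings is to rebuild the readable text of a tag whose .string is None because it has several children. The sketch below assumes the html.parser backend and a made-up one-paragraph document; joining with a single space is just one reasonable choice.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>  Hello <b> world </b>\n</p>", "html.parser")

# The <p> has several children, so .string has no single answer.
single = soup.p.string

raw = [str(s) for s in soup.p.strings]    # whitespace kept as parsed
clean = list(soup.p.stripped_strings)     # trimmed, blank strings skipped
joined = " ".join(clean)

print(single, raw, clean, joined)
```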

2. Parent nodes

A node's parent is the node that contains it.
Strings have parents too.
The parent of the top-level tag is the BeautifulSoup object.
The .parent of the BeautifulSoup object is None.

(1) .parent: get an element's parent

soup.title.parent           # <head><title>The Dormouse's story</title></head>
soup.title.string.parent    # <title>The Dormouse's story</title>
type(soup.html.parent)      # <class 'bs4.BeautifulSoup'>
soup.parent                 # None

(2) .parents: iterate over all of an element's ancestors

link = soup.a
for parent in link.parents:
    print(parent.name)
# p
# body
# html
# [document]
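
To make the ancestor chain concrete, the names yielded by .parents can be collected into a list. This is a minimal sketch on a made-up four-level document, assuming the html.parser backend.

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>x</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk upward from the <a> tag, nearest ancestor first.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)
```

The last item is the BeautifulSoup object itself, whose name is '[document]'.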

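3. Sibling nodes

All children of the same node are siblings of one another (they need not be the same kind of node; a sibling can be a tag or a string).

(1) .next_sibling

Returns the next sibling.
It is None when the node is the last of its siblings.

sibling_soup = BeautifulSoup("<a><b>text1</b><c>text2</c></a>", 'lxml')
sibling_soup.b.next_sibling    # <c>text2</c>
sibling_soup.c.next_sibling    # None

(2) .previous_sibling

Returns the previous sibling.
It is None when the node is the first of its siblings.

sibling_soup.c.previous_sibling    # <b>text1</b>
sibling_soup.b.previous_sibling    # None

(3) .next_siblings, .previous_siblings

Iterate over the siblings that follow or precede the current node.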
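
The sketch below shows both iterators on a small made-up paragraph, assuming the html.parser backend. Note that strings between tags count as siblings, and that .previous_siblings starts with the nearest one.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>a<b>x</b>,<i>y</i>.</p>", "html.parser")

# Everything after <b> inside the same parent, in document order.
following = [str(node) for node in soup.b.next_siblings]

# Everything before <i>, nearest sibling first.
preceding = [str(node) for node in soup.i.previous_siblings]

print(following)
print(preceding)
```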

4. Going back and forth

These attributes return the object that was parsed immediately before or after the current one.
Parsing order resembles a depth-first traversal: if a tag contains tags or strings, those are parsed first, before moving on to whatever follows the tag.

(1) .next_element, .previous_element

last_a_tag = soup.find("a", id="link3")
last_a_tag                     # <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
last_a_tag.next_sibling        # ';\nand they lived at the bottom of a well.'
last_a_tag.next_element        # 'Tillie'
last_a_tag.previous_element    # ' and\n'

(2) .next_elements, .previous_elements

for element in last_a_tag.next_elements:
    print(repr(element))
# 'Tillie'
# ';\nand they lived at the bottom of a well.'
# '\n\n'
# <p class="story">...</p>
# '...'
# '\n'

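IV. Searching the tree

Beautiful Soup defines many search methods. This section concentrates on find() and find_all(); the others work in much the same way.

1. Filters

A filter can be applied to a tag's name, to its attributes, to string content, or to a combination of these.

(1) Strings

Pass a string to a search method and it matches content that equals the string exactly.
A byte string is assumed to be UTF-8 encoded; pass a Unicode string (str) instead to avoid decoding errors.

soup.find_all('b')
# [<b>The Dormouse's story</b>]

(2) Regular expressions

Pass a compiled regular expression and Beautiful Soup tests each candidate with the pattern's search() method.

import re
for tag in soup.find_all(re.compile("^b")):
    print(tag.name)
# body
# b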
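
Because the pattern is applied with search(), an unanchored expression matches anywhere in the tag name. The following sketch contrasts an anchored and an unanchored pattern on a small made-up document, assuming the html.parser backend.

```python
import re
from bs4 import BeautifulSoup

html = "<html><head><title>T</title></head><body><p><b>x</b></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# "^b" only matches names that start with b.
anchored = [tag.name for tag in soup.find_all(re.compile("^b"))]

# "t" matches any name that contains the letter t.
anywhere = [tag.name for tag in soup.find_all(re.compile("t"))]

print(anchored)
print(anywhere)
```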

(3) Lists

Pass a list and the search returns anything that matches any item in the list.

soup.find_all(["a", "b"])
# [<b>The Dormouse's story</b>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

(4) True

True matches every tag, but no string nodes.

for tag in soup.find_all(True):
    print(tag.name)
# html
# head
# title
# body
# p
# b
# p
# a
# a
# a
# p

(5) Functions

Pass a function that takes a single element as its argument. It should return True if the element matches and False otherwise.

def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')

soup.find_all(has_class_but_no_id)
# [<p class="title"><b>The Dormouse's story</b></p>,
#  <p class="story">Once upon a time there were...</p>,
#  <p class="story">...</p>]

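2. find_all()

Searches all descendants of the current node and returns every tag that matches, as a list.
If nothing matches, it returns an empty list.

(1) The name argument

Finds all tags named name; string nodes are ignored.
The value can be any kind of filter (a string, a regular expression, a list, a function, or True).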

soup.find_all("title")
# [<title>The Dormouse's story</title>]

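(2) Keyword arguments

Any keyword argument that is not one of the built-in argument names is treated as a filter on the tag attribute of that name.
The value can be a string, a regular expression, a list, or True.
Some attribute names cannot be used as keyword arguments, for example HTML5 data-* attributes, because they are not valid Python identifiers.
Such attributes can still be searched by passing a dictionary as the attrs argument.

soup.find_all(href=re.compile("elsie"))
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
data_soup.find_all(attrs={"data-foo": "value"})
# [<div data-foo="value">foo!</div>]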
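
The other filter types work for attributes as well, and several keyword arguments can be combined. Below is a minimal sketch on a made-up list of links (the fourth link, without an id, is added purely for contrast), assuming the html.parser backend.

```python
import re
from bs4 import BeautifulSoup

html = (
    '<p><a href="http://example.com/elsie" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" id="link2">Lacie</a>'
    '<a href="http://example.com/tillie" id="link3">Tillie</a>'
    '<a href="#top">Top</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

# True: any tag that has an id attribute, whatever its value.
with_id = [a["id"] for a in soup.find_all(id=True)]

# List: the id may equal any of the listed values.
picked = [a["id"] for a in soup.find_all(id=["link1", "link3"])]

# Several keyword arguments must all match at once.
both = [a["id"] for a in soup.find_all(href=re.compile("lacie"), id="link2")]

print(with_id, picked, both)
```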

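(3) Searching by CSS class

The class_ argument finds tags with a given CSS class (class itself is a reserved word in Python).
It accepts the same kinds of filter: a string, a regular expression, a function, or True.
When a tag has several classes, each one can be searched for individually.
You can also match the exact string value of the class attribute, but the search fails if the classes are in a different order.

css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'lxml')
css_soup.find_all("p", class_="strikeout")
# [<p class="body strikeout"></p>]

css_soup.find_all("p", class_="body")
# [<p class="body strikeout"></p>]

css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]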
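
The ordering caveat is easy to trip over, so here is a short sketch of it, assuming the html.parser backend. It also shows one order-independent alternative, the CSS selector p.strikeout.body passed to select().

```python
from bs4 import BeautifulSoup

css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")

# Matches: the string equals the attribute value exactly.
exact = css_soup.find_all("p", class_="body strikeout")

# No match: same classes, different order.
reordered = css_soup.find_all("p", class_="strikeout body")

# A CSS selector requires both classes but ignores their order.
via_select = css_soup.select("p.strikeout.body")

print(len(exact), len(reordered), len(via_select))
```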

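(4) The text argument

Searches the string content of the document instead of tags. Since Beautiful Soup 4.4.0 the same argument is named string; text still works as the older name.
It accepts a string, a regular expression, a list, a function, or True.
Combined with other arguments it filters tags: the result is the tags whose .string matches the text value.

soup.find_all(text=["Tillie", "Elsie", "Lacie"])
# ['Elsie', 'Lacie', 'Tillie']

soup.find_all("a", text="Elsie")
# [<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>]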
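
The regular-expression and function filters can be sketched the same way. The example below uses the string argument (the newer name for text) on a made-up four-paragraph document with the html.parser backend; the lambda is just an illustrative predicate.

```python
import re
from bs4 import BeautifulSoup

html = "<p>Elsie</p><p>Lacie</p><p>Tillie</p><p>...</p>"
soup = BeautifulSoup(html, "html.parser")

# Regular expression: strings that end in "ie".
by_regex = [str(s) for s in soup.find_all(string=re.compile("ie$"))]

# Function: called with each string node; keep those starting with "L".
by_func = [str(s) for s in soup.find_all(string=lambda s: s.startswith("L"))]

# Combined with a tag name, the result is tags rather than strings.
tags = [str(t) for t in soup.find_all("p", string=re.compile("^T"))]

print(by_regex, by_func, tags)
```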

(5) The limit argument

Limits the number of results returned.

soup.find_all("a", limit=2)
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

(6) The recursive argument

When True (the default), all descendants of the current tag are searched.
When False, only the tag's direct children are searched.

ss = '''
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
'''
soup.html.find_all("title")
# [<title>The Dormouse's story</title>]

soup.html.find_all("title", recursive=False)
# []

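(7) Shorthand

Calling a tag (or the BeautifulSoup object) like a function is the same as calling find_all() on it.

# These two lines are equivalent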
soup.title.find_all(text=True)
soup.title(text=True)

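3. find()

Takes the same arguments as find_all().
Use it when you want only one result (much like find_all() with limit=1).
It searches all descendants of the current node and returns the first match.
The result is returned directly rather than inside a list.
If nothing matches, it returns None.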

soup.find_all('title', limit=1)
# [<title>The Dormouse's story</title>]

soup.find('title')
# <title>The Dormouse's story</title>

print(soup.find("nosuchtag"))
# None
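
Since find() returns None when nothing matches, using the result without a check raises AttributeError. A minimal sketch of chaining and guarding, assuming the html.parser backend and a made-up fallback string:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>The Dormouse's story</title></head></html>"
soup = BeautifulSoup(html, "html.parser")

# find() calls can be chained because each returns a single Tag.
title_text = soup.find("head").find("title").string

# A missing tag yields None, so check before using the result.
missing = soup.find("nosuchtag")
fallback = missing.string if missing is not None else "(no match)"

print(title_text, fallback)
```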

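4. find_parents(), find_parent()

These work like find_all() and find(); only the part of the document they search differs.
They return the ancestors (direct and indirect) of the current node that match.

a_string = soup.find(text="Lacie")
a_string
# 'Lacie'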

a_string.find_parents("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

a_string.find_parent("p")
# <p class="story">Once upon a time there were three little sisters; and their names were
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
#  <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
#  and they lived at the bottom of a well.</p>

5. find_next_siblings(), find_next_sibling()

These work like find_all() and find().
They return the matching siblings that come after the current node.

first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

first_link.find_next_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

6. find_previous_siblings(), find_previous_sibling()

These work like find_all() and find().
They return the matching siblings that come before the current node.

last_link = soup.find("a", id="link3")
last_link
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

last_link.find_previous_siblings("a")
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
#  <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]
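
The singular forms return only the nearest match, or None when there is none. A short sketch on a made-up paragraph with three links, assuming the html.parser backend:

```python
from bs4 import BeautifulSoup

html = (
    '<p><a id="link1">Elsie</a>, '
    '<a id="link2">Lacie</a> and '
    '<a id="link3">Tillie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")
middle = soup.find("a", id="link2")

# Nearest matching sibling in each direction; text nodes are skipped
# because the filter asks for <a> tags.
next_id = middle.find_next_sibling("a")["id"]
prev_id = middle.find_previous_sibling("a")["id"]

# The last link has no following <a> sibling.
after_last = soup.find("a", id="link3").find_next_sibling("a")

print(next_id, prev_id, after_last)
```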

7. find_all_next(), find_next()

find_all_next() returns every matching node that comes after the current tag; find_next() returns only the first.
The nodes returned can be tags or strings.

first_link = soup.a
first_link
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>

first_link.find_all_next(text=True)
# ['Elsie', ',\n', 'Lacie', ' and\n', 'Tillie',
#  ';\nand they lived at the bottom of a well.', '\n\n', '...', '\n']

8. find_all_previous(), find_previous()

These work like find_all_next() and find_next(), except that they search the nodes that come before the current tag.
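
A minimal sketch of the backward search, assuming the html.parser backend and a made-up document. "Before" means earlier in parse order, so the paragraph that contains the link is also returned, because its opening tag was parsed first.

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>T</title></head><body>"
    '<p class="title">A</p>'
    '<p class="story"><a id="link1">Elsie</a></p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")
link = soup.a

# All <p> tags parsed before the link, nearest first.
previous_p_classes = [p["class"] for p in link.find_all_previous("p")]

# Only the nearest matching node.
previous_title = str(link.find_previous("title").string)

print(previous_p_classes, previous_title)
```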

9. CSS selectors

Pass a string to the .select() method of a tag or of the BeautifulSoup object to find tags using CSS selector syntax.
CSS selector syntax reference: http://www.runoob.com/cssref/css-selectors.html

soup.select("title")                    # by tag name
soup.select("body a")                   # descendant, at any depth
soup.select("p > a:nth-of-type(2)")     # direct child
soup.select("#link1 + .sister")         # adjacent sibling
soup.select(".sister")                  # by class
soup.select("a#link2")                  # by id
soup.select('a[href]')                  # has an attribute
soup.select('a[href$="tillie"]')        # by attribute value
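
To see what a few of these selectors return, the sketch below runs them against a trimmed copy of the story paragraph, assuming the html.parser backend. The ids() helper is hypothetical and exists only to keep the output short; select_one() returns the first match instead of a list.

```python
from bs4 import BeautifulSoup

html = """<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
</p>"""
soup = BeautifulSoup(html, "html.parser")

def ids(selector):
    """Hypothetical helper: ids of all tags matching a CSS selector."""
    return [tag["id"] for tag in soup.select(selector)]

second = ids("p > a:nth-of-type(2)")    # second <a> child of a <p>
adjacent = ids("#link1 + .sister")      # sibling immediately after #link1
following = ids("#link1 ~ .sister")     # every later sibling of #link1
suffix = ids('a[href$="tillie"]')       # href ends with "tillie"
first = soup.select_one(".sister")["id"]

print(second, adjacent, following, suffix, first)
```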