Python Web Scraping: BeautifulSoup

These notes record what I learned while studying Cui Qingcai's book 《Python3网络爬虫开发实战》 and the Beautiful Soup 4.4.0 documentation.

Installing BeautifulSoup4 and a parser

Install BeautifulSoup4

$ pip install beautifulsoup4
$ easy_install beautifulsoup4

Install a parser

Beautiful Soup relies on a parser to do the actual parsing. It supports the HTML parser in Python's standard library as well as several third-party parsers, including lxml and html5lib.

lxml:

$ easy_install lxml
$ pip install lxml

html5lib:

$ easy_install html5lib
$ pip install html5lib

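The table below lists the main parsers and their advantages and disadvantages:

Parser	Typical usage	Advantages	Disadvantages
Python standard library	`BeautifulSoup(markup, "html.parser")`	Built into Python; decent speed; lenient with malformed documents	Much less lenient in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	`BeautifulSoup(markup, "lxml")`	Very fast; lenient with malformed documents	Requires an external C library
lxml XML parser	`BeautifulSoup(markup, ["lxml", "xml"])` `BeautifulSoup(markup, "xml")`	Very fast; the only XML parser supported	Requires an external C library
html5lib	`BeautifulSoup(markup, "html5lib")`	Most lenient; parses pages the way a web browser does; produces valid HTML5	Slow; requires an external Python package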

Beautiful Soup automatically converts the input document to Unicode, parses it with a suitable parser, and converts output documents to UTF-8.
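As a minimal sketch of the encoding behaviour described above, the snippet below feeds UTF-8 bytes to Beautiful Soup and reads them back as Unicode. The sample markup is made up for illustration, and from_encoding is passed explicitly so the result does not depend on encoding detection.

```python
from bs4 import BeautifulSoup

# Hypothetical sample: UTF-8 encoded bytes containing a non-ASCII character
raw = "<p>café</p>".encode("utf-8")
soup = BeautifulSoup(raw, "html.parser", from_encoding="utf-8")

text = soup.p.string                  # text inside the tree is Unicode
as_utf8 = soup.p.encode("utf-8")      # serialize the tag to UTF-8 bytes
as_latin1 = soup.p.encode("latin-1")  # or to another encoding on request

print(repr(text), as_utf8, as_latin1)
```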

import requests
from bs4 import BeautifulSoup

# Send the request
headers = {
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.157 Safari/537.36"
    }
response = requests.get("https://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#id5", headers=headers)
print(response.status_code)  # HTTP status code of the response
print(response.encoding)     # encoding that requests inferred for the body

# Parse the page with BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
print(soup.prettify())  # prettify is a method, so it has to be called

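When it parses a document, BeautifulSoup builds a tree of objects from the page and returns it, here assigned to the variable soup. From then on you can navigate or search the tree to extract information, or modify the tree.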

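Kinds of objects

BeautifulSoup converts a complex HTML document into a complex tree structure in which every node is a Python object. All of these objects fall into four kinds: Tag, NavigableString, BeautifulSoup, and Comment.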

<a accesskey="I" href="genindex.html" title="General Index">index</a>

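This is one node in the tree, and it corresponds to the same tag in the original HTML document. Here:

a is the name of the Tag object, available as Tag.name.
{'href': 'genindex.html', 'title': 'General Index', 'accesskey': ['I']} are the attributes of the Tag object; a tag can have zero or more. They are available through the Tag.attrs attribute, and they can be added, deleted, and modified in the same way as dictionary entries.
'index' is a NavigableString object, available through the Tag.string attribute. It cannot be edited in place, but it can be replaced with the replace_with() method.
A Comment object holds a comment in the document. It is a special type of NavigableString; for these notes it is enough to know that it represents comment content.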
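The object kinds above can be sketched on the same a tag. This is only an illustration: the trailing comment in the markup and the new attribute values are made up for the example.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# The a tag from the text, followed by a made-up HTML comment
html = '<a accesskey="I" href="genindex.html" title="General Index">index</a><!-- note -->'
soup = BeautifulSoup(html, "html.parser")

a = soup.a
tag_name = a.name              # name of the Tag
href = a["href"]               # attributes are read like dictionary entries
a["id"] = "index-link"         # add an attribute
a["title"] = "Index"           # modify an attribute
del a["accesskey"]             # delete an attribute

old_text = a.string            # the NavigableString inside the tag
a.string.replace_with("home")  # strings are replaced, not edited in place

comment = a.next_sibling       # the comment node that follows the tag
print(a, type(old_text).__name__, type(comment).__name__)
```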

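The document tree is built from nodes like the one above. Once you know the kinds of objects in a node and how to extract or modify the information in each, you can navigate or search the tree to extract the information you want, or to modify the document.

Navigating the tree

Nodes are related as parents and children, or as siblings. To get a particular node, you can simply use its tag name as an attribute, for example: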

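soup = BeautifulSoup(html, "html.parser")  # BeautifulSoup converts the HTML document into a tree object and assigns it to soup
soup.head  # returns the head node
soup.body  # returns the body node
soup.a  # returns the a node; if soup contains many a nodes, only the first one is returned. To get all of them, search the tree instead
soup.a.name  # the name of the a node
soup.a.attrs  # the attributes of the a node
soup.a.string  # the NavigableString object inside the a node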

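Child nodes

A node is related to other nodes as a parent, child, or sibling. To look at the children of the current node, use contents, children, or descendants.

contents: returns the direct children of the current node as a list. Each direct child is one list element; the children of those children are not unpacked.
children: returns an iterator over the direct children of the current node. It does not descend into the children of each direct child.
descendants: iterates over all descendants of the current node, recursively.

# Suppose soup.head is the following: the head node has three child tags, and the title node in turn contains one child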
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Beautiful Soup 4.2.0 文档 — Beautiful Soup 4.2.0 documentation</title>
<link href="_static/pygments.css" rel="stylesheet" type="text/css"/>
</head>

soup.head.contents
['\n', <meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>,
 '\n', <title>Beautiful Soup 4.2.0 文档 — Beautiful Soup 4.2.0 documentation</title>, 
'\n', <link href="_static/pygments.css" rel="stylesheet" type="text/css"/>,'\n']

children

for i in soup.head.children:
    print(i)
# Output (the blank lines printed for the '\n' strings between tags are omitted):
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Beautiful Soup 4.2.0 文档 — Beautiful Soup 4.2.0 documentation</title>
<link href="_static/pygments.css" rel="stylesheet" type="text/css"/>

descendants

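for child in soup.head.descendants:  # visits every descendant, not just direct children
    print(child)
# Output (the blank lines printed for the '\n' strings between tags are omitted):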
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Beautiful Soup 4.2.0 文档 — Beautiful Soup 4.2.0 documentation</title>
Beautiful Soup 4.2.0 文档 — Beautiful Soup 4.2.0 documentation
<link href="_static/pygments.css" rel="stylesheet" type="text/css"/>
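The outputs above hide the newline strings that sit between the tags. As a small sketch on a shortened, made-up head element, the snippet below counts what each attribute really returns and shows one way to keep only the Tag children.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Shortened, hypothetical version of the head element used above
html = '<head>\n<meta charset="utf-8"/>\n<title>Doc</title>\n<link href="a.css"/>\n</head>'
soup = BeautifulSoup(html, "html.parser")
head = soup.head

direct_count = len(head.contents)  # direct children, including '\n' strings
child_tags = [c.name for c in head.children if isinstance(c, Tag)]  # tags only
all_nodes = list(head.descendants)  # direct children plus the text inside title

print(direct_count, child_tags, len(all_nodes))
```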

String

If a Tag has exactly one child and that child is a NavigableString, the child is available through the .string attribute.

<title>Beautiful Soup 4.2.0 文档 — Beautiful Soup 4.2.0 documentation</title>

print(soup.title.string)

# Output:
Beautiful Soup 4.2.0 文档 — Beautiful Soup 4.2.0 documentation
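A short sketch of the edge cases of .string, using made-up markup: when the only child is another tag, .string looks inside that tag, and when a tag has more than one child, .string is None.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: p has a single child tag, div has two children
soup = BeautifulSoup("<p><b>bold</b></p><div>one<i>two</i></div>", "html.parser")

single = soup.p.string        # passes through the only child, <b>
multiple = soup.div.string    # more than one child, so there is no single string
all_text = soup.div.get_text()  # get_text() joins every string instead

print(single, multiple, all_text)
```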

Strings and stripped_strings

If a Tag object contains more than one string, you can loop over them with .strings.

<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Beautiful Soup 4.2.0 文档 — Beautiful Soup 4.2.0 documentation</title>
<link href="_static/default.css" rel="stylesheet" type="text/css"/>
<link href="_static/pygments.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    './',
        VERSION:     '4.2.0',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>
<script src="_static/jquery.js" type="text/javascript"></script>
<script src="_static/underscore.js" type="text/javascript"></script>
<script src="_static/doctools.js" type="text/javascript"></script>
<link href="index.html" rel="top" title="Beautiful Soup 4.2.0 documentation"/>
</head>

for s in soup.head.strings:
    print(s)
# Output:
Beautiful Soup 4.2.0 文档 — Beautiful Soup 4.2.0 documentation







      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    './',
        VERSION:     '4.2.0',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };

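The output may contain extra spaces and blank lines. Use stripped_strings to remove the extra whitespace: strings that consist entirely of whitespace are skipped, and leading and trailing whitespace is stripped from the rest.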

for s in soup.head.stripped_strings:
    print(s)

# Output:
Beautiful Soup 4.2.0 文档 — Beautiful Soup 4.2.0 documentation
var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    './',
        VERSION:     '4.2.0',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };

Parent nodes

parent and parents

parent gives the direct parent of the current node, and parents iterates over all of its ancestors.

<head>
    <title>
        <p>BeautifulSoup</p>
    </title>
</head>

print(soup.p.parent.name)
# Output:
title

for parent in soup.p.parents:
    print(parent.name)

# Output (the last entry is the BeautifulSoup object itself):
title
head
[document]
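The same idea on a more conventional, made-up document: walking .parents from a p tag visits each ancestor in turn and ends at the BeautifulSoup object, whose name is [document].

```python
from bs4 import BeautifulSoup

# Hypothetical document with a p nested three levels deep
html = "<html><body><div><p>text</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

direct_parent = soup.p.parent.name               # nearest ancestor only
ancestors = [node.name for node in soup.p.parents]  # all ancestors, nearest first

print(direct_parent, ancestors)
```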

Sibling nodes

next_sibling and previous_sibling

The first of a group of siblings has no previous_sibling, and the last has no next_sibling.

<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Beautiful Soup 4.2.0 文档 — Beautiful Soup 4.2.0 documentation</title>
<link href="_static/default.css" rel="stylesheet" type="text/css"/>
<link href="_static/pygments.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    './',
        VERSION:     '4.2.0',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>
<script src="_static/jquery.js" type="text/javascript"></script>
<script src="_static/underscore.js" type="text/javascript"></script>
<script src="_static/doctools.js" type="text/javascript"></script>
<link href="index.html" rel="top" title="Beautiful Soup 4.2.0 documentation"/>
</head>

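Intuitively, the next sibling of meta should be title, but in a real document the tags are separated by text such as punctuation and newlines, and those strings are nodes too.

soup.meta.next_sibling
# Result:
'\n'
soup.title.previous_sibling
# Result:
'\n'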

next_siblings and previous_siblings

You can iterate over the siblings of the current node with next_siblings and previous_siblings.

for i in soup.meta.next_siblings:
    print(i)

# Output:
<title>Beautiful Soup 4.2.0 文档 — Beautiful Soup 4.2.0 documentation</title>


<link href="_static/default.css" rel="stylesheet" type="text/css"/>


<link href="_static/pygments.css" rel="stylesheet" type="text/css"/>


<script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    './',
        VERSION:     '4.2.0',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>


<script src="_static/jquery.js" type="text/javascript"></script>


<script src="_static/underscore.js" type="text/javascript"></script>


<script src="_static/doctools.js" type="text/javascript"></script>


<link href="index.html" rel="top" title="Beautiful Soup 4.2.0 documentation"/>
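Because whitespace strings count as siblings, it is often convenient to skip them. The sketch below, on a made-up list, shows two ways: filtering next_siblings by type, or using find_next_sibling, which only considers tags that match.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical list with a newline between every item
html = '<ul>\n<li id="a">A</li>\n<li id="b">B</li>\n<li id="c">C</li>\n</ul>'
soup = BeautifulSoup(html, "html.parser")
first = soup.find("li")

raw_next = first.next_sibling               # the newline string, not the next li
next_tag = first.find_next_sibling("li")    # skips strings, returns the next li tag
later_ids = [s["id"] for s in first.next_siblings if isinstance(s, Tag)]

print(repr(raw_next), next_tag["id"], later_ids)
```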

Going back and forth

An HTML parser processes the document in order, parsing every node from top to bottom.

next_element and previous_element

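These look similar to next_sibling and previous_sibling, but they are different: next_element and previous_element follow the order in which nodes were parsed, while the sibling attributes only move among nodes that share the same parent.

soup.meta.previous_sibling
# Result:
'\n'
soup.meta.previous_sibling.previous_sibling
# Result: None, meaning there are no more siblings before it

soup.meta.previous_element
# Result:
'\n'
soup.meta.previous_element.previous_element
# Result: the <head> node, which was parsed just before the newline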

<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Beautiful Soup 4.2.0 文档 — Beautiful Soup 4.2.0 documentation</title>
<link href="_static/default.css" rel="stylesheet" type="text/css"/>
<link href="_static/pygments.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    './',
        VERSION:     '4.2.0',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>
<script src="_static/jquery.js" type="text/javascript"></script>
<script src="_static/underscore.js" type="text/javascript"></script>
<script src="_static/doctools.js" type="text/javascript"></script>
<link href="index.html" rel="top" title="Beautiful Soup 4.2.0 documentation"/>
</head>
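To make the contrast concrete, here is a small sketch on made-up markup without whitespace: next_sibling stays at the same level of the tree, while next_element steps to whatever was parsed next, including the contents of the tag itself.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: p contains a string and a b tag, and is followed by span
html = "<div><p>one<b>bold</b></p><span>two</span></div>"
soup = BeautifulSoup(html, "html.parser")
p, b = soup.p, soup.b

sibling_of_p = p.next_sibling         # the span tag, at the same level as p
element_after_p = p.next_element      # the string "one", parsed right after <p>
sibling_of_b = b.next_sibling         # None: b is the last child of p
after_bold = b.string.next_element    # parsing continues with the span tag

print(sibling_of_p.name, repr(element_after_p), sibling_of_b, after_bold.name)
```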

next_elements and previous_elements

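Likewise, the next_elements and previous_elements attributes iterate over nodes in the order they were parsed.

Iterating backward eventually reaches the root <html> node, which contains the entire document.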
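A minimal sketch of previous_elements on a small made-up document: collecting only the tag names shows the reverse parse order, ending with the enclosing head and html tags.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, whitespace-free document
html = '<html><head><meta charset="utf-8"/><title>T</title></head></html>'
soup = BeautifulSoup(html, "html.parser")

# Walk backward from title in parse order, keeping only tags
names = [el.name for el in soup.title.previous_elements if isinstance(el, Tag)]

print(names[:3])
```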

for i in soup.title.previous_elements:
    print(i)
# Output:
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>


<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>Beautiful Soup 4.2.0 文档 — Beautiful Soup 4.2.0 documentation</title>
<link href="_static/default.css" rel="stylesheet" type="text/css"/>
<link href="_static/pygments.css" rel="stylesheet" type="text/css"/>
<script type="text/javascript">
      var DOCUMENTATION_OPTIONS = {
        URL_ROOT:    './',
        VERSION:     '4.2.0',
        COLLAPSE_INDEX: false,
        FILE_SUFFIX: '.html',
        HAS_SOURCE:  true
      };
    </script>
<script src="_static/jquery.js" type="text/javascript"></script>
<script src="_static/underscore.js" type="text/javascript"></script>
<script src="_static/doctools.js" type="text/javascript"></script>
<link href="index.html" rel="top" title="Beautiful Soup 4.2.0 documentation"/>
</head>


<html>....</html>

html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"

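Searching the tree

find() and find_all()

find and find_all search only the descendants of the current node (children, grandchildren, and so on). find returns a single Tag object, or None if nothing matches, and find_all returns a list of Tag objects. Because the results are Tag objects, you can call any Tag method on them.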

find_all(name=None, attrs={}, recursive=True, text=None, limit=None, **kwargs) method of bs4.BeautifulSoup instance
Extracts a list of Tag objects that match the given
criteria. You can specify the name of the Tag and any
attributes you want the Tag to have.

The value of a key-value pair in the 'attrs' map can be a
string, a list of strings, a regular expression object, or a
callable that takes a string and returns whether or not the
string matches for some custom definition of 'matches'. The
same is true of the tag name

In other words, both the tag name and each attribute value can be matched against a string, a list of strings, a regular expression, a custom matching function, or True.

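name: the name of the Tag to match
attrs: the attributes of the Tag to match
recursive: by default BeautifulSoup searches all descendants of the current node; set recursive=False to search only its direct children
text: searches the strings in the document rather than the tags (since Beautiful Soup 4.4.0 this argument is called string; text is the older name)
limit: the maximum number of results to return
**kwargs: any other keyword argument is treated as a filter on the attribute of that name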

name

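import re

soup.find_all("title")  # string: finds every Tag named 'title'
soup.find_all('a')  # string: finds every Tag named 'a'; the name must match the string exactly
soup.find_all(re.compile('^b'))  # regular expression: finds every Tag whose name starts with 'b'
soup.find_all(["a","b"])  # list of strings: returns every Tag that matches any item, here all <a> and <b> tags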
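The name filters and the recursive and limit arguments can be tried on a small made-up document. This is only a sketch; the markup and variable names are chosen for illustration.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical document: one a directly under body, one nested in a div
html = "<html><body><b>bold</b><a href='1'>one</a><div><a href='2'>two</a></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find_all("a")                                     # exact tag name
by_regex = [t.name for t in soup.find_all(re.compile("^b"))]     # names starting with b
by_list = [t.name for t in soup.find_all(["a", "b"])]            # any name in the list
by_func = soup.find_all(lambda tag: tag.has_attr("href"))        # custom function
direct_only = soup.body.find_all("a", recursive=False)           # direct children of body only
limited = soup.find_all("a", limit=1)                            # stop after one result
first = soup.find("a")                                           # a single Tag, or None

print(len(by_name), by_regex, by_list, len(direct_only), first["href"])
```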

attrs

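soup.find_all(id="beautiful-soup-4-2-0")  # string
soup.find_all(id=re.compile("beautiful-soup-4-2-0"))  # regular expression
soup.find_all(id=True)  # True: matches every Tag that has an id attribute, whatever its value
soup.find_all(attrs={"data-foo":"value"})  # some HTML attribute names, such as data-*, cannot be used as keyword arguments, so pass them in a dictionary. The keyword form above is a shortcut for the same thing: id="beautiful-soup-4-2-0" is equivalent to attrs={"id":"beautiful-soup-4-2-0"}

Because class is a reserved word in Python, use the class_ argument to search for Tags by CSS class name.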

soup.find_all(class_="image-link")
soup.find_all(class_=re.compile("image-link"))
soup.find_all(class_=True)
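A short sketch of the attribute filters on made-up markup. Note that class_ matches against each individual class, so a tag with class="image-link big" is found by class_="image-link".

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with an id, a data-* attribute, and CSS classes
html = (
    '<div id="main" data-foo="value">'
    '<a class="image-link big" href="x.png">x</a>'
    '<a class="text-link" href="y.html">y</a>'
    "</div>"
)
soup = BeautifulSoup(html, "html.parser")

by_id = soup.find_all(id="main")                      # keyword argument as attribute filter
has_id = soup.find_all(id=True)                       # any tag that has an id
by_data = soup.find_all(attrs={"data-foo": "value"})  # name not usable as a keyword
by_class = soup.find_all(class_="image-link")         # matches one of several classes
by_class_re = soup.find_all(class_=re.compile("link"))
has_class = soup.find_all(class_=True)

print(by_id[0].name, by_class[0]["href"], len(by_class_re))
```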

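BeautifulSoup has 10 more search APIs. Five take the same arguments as find and the other five take the same arguments as find_all; they differ only in which part of the document they search. All the examples above call the methods on soup, the whole tree, so they search the descendants of the root node. You can also search under a specific node, for example:

head = soup.find("head")  # assign the <head> node to head
head.find("title")  # search under the head node for a Tag named "title"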

find_parent and find_parents

These work like the methods above, but search upward through the ancestors of the current node.

<body> 
<p>  
 <div>
        <p>
            <a>sdf</a>
        </p>
    </div> 
</p>   
</body>
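a = soup.a  # assuming soup was built from the markup above with "html.parser", which keeps the nesting as written
a.find_parent("p")
# Result: searches the ancestors of the current node for "p" nodes and returns the one nearest to it
        <p>
            <a>sdf</a>
        </p>    
a.find_parents("p")
# Result: searches the ancestors of the current node and returns all "p" nodes, nearest first
[<p>
            <a>sdf</a>
        </p>, <p>  
 <div>
        <p>
            <a>sdf</a>
        </p>
    </div> 
</p>]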
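The same behaviour can be checked on a small made-up document that uses nested div tags, which every parser keeps nested. The ids are hypothetical and only serve to tell the two ancestors apart.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: an a tag inside two nested divs
html = '<div id="outer"><div id="inner"><a>x</a></div></div>'
soup = BeautifulSoup(html, "html.parser")
a = soup.a

nearest = a.find_parent("div")                         # closest matching ancestor
all_ids = [d["id"] for d in a.find_parents("div")]     # every matching ancestor, nearest first

print(nearest["id"], all_ids)
```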

find_previous_sibling and find_previous_siblings

These search backward among the siblings of the current node.

find_next_sibling and find_next_siblings

These search forward among the siblings of the current node.

find_all_next and find_next

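These follow parse order, like the *_element attributes above. For example, you can find a particular node that was parsed after the current node.

dd = soup.find("dd")
dd.find_next("div")  # the first "div" node parsed after the dd node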

find_all_previous and find_previous

These find a particular node that was parsed before the current node.
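The sibling and parse-order search methods can be compared side by side on a made-up definition list. This is a sketch under the assumption that the first dd is the starting node.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: a definition list followed by a div
html = "<dl><dt>t1</dt><dd>d1</dd><dt>t2</dt><dd>d2</dd></dl><div>after</div>"
soup = BeautifulSoup(html, "html.parser")
dd = soup.find("dd")  # the first dd, containing "d1"

next_dd = dd.find_next_sibling("dd")                         # later sibling named dd
prev_dt = dd.find_previous_sibling("dt")                     # earlier sibling named dt
later_dts = [t.string for t in dd.find_next_siblings("dt")]  # all later dt siblings
next_div = dd.find_next("div")                               # parse order: div is outside dl
earlier_dts = [t.string for t in dd.find_all_previous("dt")]  # parsed before dd
enclosing = dd.find_previous("dl")                           # dl was opened before dd

print(next_dd.string, prev_dt.string, later_dts, next_div.string, earlier_dts)
```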

CSS selectors

Pass a string to the .select method to search for Tags using CSS selector syntax.

Find by tag name: soup.select("a")
Find tags nested at any depth: soup.select("html body a")
Find direct children of a tag: soup.select("p > a")
Find following siblings: soup.select("#link1 ~ .sister") # e.g. matches <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
Find by CSS class: soup.select(".sister")
Find by id: soup.select("#link1")
Find by the presence of an attribute: soup.select('a[href]')
Find by attribute value: soup.select('a[href="http://example.com/elsie"]')
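The selectors above can be tried on a small document. The markup below is an assumption modelled on the link1 and sister names used in the list, and the ids helper is only there to keep the output short. In recent Beautiful Soup releases, select relies on the soupsieve package, which pip normally installs alongside beautifulsoup4.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the selectors in the text
html = """
<html><body><p class="story">
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
</p></body></html>
"""
soup = BeautifulSoup(html, "html.parser")


def ids(tags):
    """Return the id attribute of each tag, for compact output."""
    return [tag["id"] for tag in tags]


by_tag = ids(soup.select("a"))
nested = ids(soup.select("html body a"))
children = ids(soup.select("p > a"))
later_siblings = ids(soup.select("#link1 ~ .sister"))  # every later sibling
next_sibling = ids(soup.select("#link1 + .sister"))    # only the adjacent one
by_class = ids(soup.select(".sister"))
by_id = ids(soup.select("#link1"))
has_href = ids(soup.select("a[href]"))
by_value = ids(soup.select('a[href="http://example.com/elsie"]'))
first_match = soup.select_one(".sister")               # a single Tag, or None

print(later_siblings, next_sibling, first_match["id"])
```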