Web Crawlers

I. Core Idea

Fetch the page
Parse the page
- Locate the pages (multiple pages)
- Locate the fields
- Traverse the pages (recursively)
Store the data
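
The three steps above can be sketched end to end with nothing but the standard library. This is only a minimal illustration, not the approach used later in the article: the page is a hard-coded string so the sketch runs offline, html.parser.HTMLParser stands in for a real parsing library, and the names PAGE, LinkCollector, fetch, parse and store are all made up for this example.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical stand-in for a downloaded page, so the sketch runs offline.
PAGE = """
<html><body>
<a href="http://example.com/elsie">Elsie</a>
<a href="http://example.com/lacie">Lacie</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collects (text, href) pairs for every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None  # href of the <a> that was just opened, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")

    def handle_data(self, data):
        if self._href is not None:
            self.links.append((data.strip(), self._href))
            self._href = None


def fetch(url):
    # Step 1: fetch the page. A real crawler would download `url` here.
    return PAGE


def parse(html):
    # Step 2: parse the page and locate the fields of interest.
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def store(rows):
    # Step 3: store the data, here as CSV text kept in memory.
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["name", "url"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = parse(fetch("http://example.com/"))
csv_text = store(rows)
print(csv_text)
```

Traversing multiple pages would wrap these three calls in a loop (or a recursive function) over the links collected in step 2.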

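II. Fetching Pages

Commonly used packages for fetching pages: urllib, requests, scrapy

urllib

It is part of the Python Standard Library (PSL) and is mainly used to fetch pages. If you want to stick to what ships with Python, you can also parse the fetched page directly with regular expressions.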

Usage

from urllib import request
from urllib import parse
url = 'http://technomerc.com/'  # also used in a later example
# url = parse.quote('中文')  # text containing Chinese must be percent-encoded with quote first
req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
resp = request.urlopen(req)
html = resp.read().decode("utf-8")  # decode the response bytes into a string

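Remarks
- urllib sends the request URL as ASCII, so a URL passed to urllib.request.Request() cannot contain Chinese characters as-is; they must first be percent-encoded with the quote function.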
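
To make the remark concrete, here is a small sketch of what quote does. No request is actually sent; the example.com address and its q parameter are invented for illustration.

```python
from urllib import parse, request

# Percent-encode a standalone piece of Chinese text (UTF-8 by default).
encoded = parse.quote("中文")
print(encoded)  # %E4%B8%AD%E6%96%87

# For a whole URL, list the URL delimiters in `safe` so they are left alone.
# The address below is a made-up example.
raw_url = "http://example.com/search?q=中文"
safe_url = parse.quote(raw_url, safe=":/?=&")
print(safe_url)  # http://example.com/search?q=%E4%B8%AD%E6%96%87

# The encoded URL is plain ASCII, so it can be handed to Request as usual.
req = request.Request(safe_url, headers={"User-Agent": "Mozilla/5.0"})
print(req.full_url)
```

parse.unquote reverses the encoding, which is handy when reading such URLs back.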

requests

Similar to urllib, except that URLs containing Chinese characters are percent-encoded for you.
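
A quick way to see this without touching the network is to prepare a request and inspect its final URL. This is only a sketch: the host and the q parameter are placeholders, and requests must be installed separately (pip install requests).

```python
import requests

# Build (but do not send) a request, just to inspect the final URL.
# The host and the `q` parameter are placeholders.
prepared = requests.Request(
    "GET", "http://example.com/search", params={"q": "中文"}
).prepare()
print(prepared.url)
```

In everyday use the same encoding happens inside requests.get(url, params=...), so there is normally no need to call quote yourself.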

scrapy

A full crawling framework: it not only fetches and parses pages but also provides more specialized tools that cover far more functionality.

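III. Parsing Pages

Commonly used packages for parsing pages: re, beautifulsoup4, scrapy

beautifulsoup4

Parses the page into a bowl of "soup".

scrapy

Very mature.

Appendix: the simple tree structure of an HTML page

A page is made up of tags (also called nodes), the strings inside them, and attributes such as CSS classes, which control how the tags are styled.

Tags can be siblings, for example <head> and <body>, or the three <p> tags; they can also be parent and child, for example <head> and <title>, or <body> and <p>.

Take The Dormouse's story as an example: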

<html>
	<head>
		<title>The Dormouse's story</title>
	</head>
	<body>
		<p class="title">
			<b>The Dormouse's story</b>
		</p>
		<p class="story">
			Once upon a time there were three little sisters; and their names were
			<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
			<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
			 and
			<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
			;and they lived at the bottom of a well.
		</p>
		<p class="story">...</p>
	</body>
</html>

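IV. BeautifulSoup: Usage Notes and Worked Examples

Brief overview of concepts (assumes some background; it is fine if not everything makes sense yet)

Official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/

Use __file__ (for example bs4.__file__) to see where the module is installed

Constructor: class BeautifulSoup(markup, parser); in bs4 the parser argument is named features

markup is the downloaded page (.html) decoded into a string

Classes contained in element.py, the most important file

class NavigableString(str, PageElement)
The string inside a tag, e.g. p.string

class Tag(PageElement)  # operate on the page tag by tag
A tag can nest other tags and strings, and can carry any number of attributes (attrs)

class PageElement(object)

class SoupStrainer(object)  # lets you selectively make a "condensed" bowl of soup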
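
The class relationships listed above can be checked directly. The snippet is a small sketch on a one-line document; it uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, PageElement, Tag

soup = BeautifulSoup('<p class="title"><b>The Dormouse\'s story</b></p>', "html.parser")

tag = soup.p          # a Tag
text = soup.p.string  # the NavigableString inside it

print(type(tag).__name__, type(text).__name__)
print(isinstance(tag, PageElement), isinstance(text, PageElement))
print(isinstance(text, str))  # a NavigableString is also an ordinary str
print(tag.attrs)              # the tag's attributes, stored as a dict
```

Because NavigableString subclasses str, ordinary string methods such as upper() or split() work on it as well.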

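Basic data structures

The soup contains two kinds of things: tags and navigable strings
1. Tag
 1. An HTML tag; it can be handled much like a string or a list
 2. Contents: name, string, attrs
```
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>
>>> soup.p.name
'p'
>>> soup.p.string
"The Dormouse's story"
>>> soup.p['class']  # attributes are read like dictionary keys
['title']
```
 3. Ways to access it (locate via attrs, or move around the tree)
  1. Node access: follow the HTML tag names, e.g. soup.p.a
  2. Method search: similar in spirit to regular-expression matching, e.g. soup.find_all(), soup.find()
  3. CSS selectors: match on HTML tags and the class/id attributes inside them, e.g. soup.select(), soup.select_one()
  4. Navigation starting from a NavigableString
2. NavigableString
 1. A string; apart from what the tags themselves locate, everything else in the soup is a string you can navigate from
 2. Ways to access it: move back and forth, sideways, or up, e.g. next_element / previous_element, next_sibling / previous_sibling, parent / parents (a string is a leaf node, so it has no children)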
Workflow in practice (this is the most important part)

Make a bowl of soup

from bs4 import BeautifulSoup, SoupStrainer
soup = BeautifulSoup(html, 'lxml')
# soup1 = BeautifulSoup(html, 'lxml', parse_only=SoupStrainer(id='some-unique-id'))
# soup.prettify()

- markup: a string or a file
- parser
	- lxml, lxml-xml, html.parser, html5lib
- prettify()
	- shows the page source with one tag per line, indented
	- makes the HTML look nicer when saved to a local file
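
As a rough illustration of the difference between a full soup and a "condensed" one, the sketch below parses the same markup twice. It assumes the built-in html.parser instead of lxml so nothing extra has to be installed, and the html string is a shortened copy of the sample page.

```python
from bs4 import BeautifulSoup, SoupStrainer

html = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    "</p></body></html>"
)

# Full soup: every tag in the document is parsed.
soup = BeautifulSoup(html, "html.parser")

# Condensed soup: only the <a> tags are kept while parsing.
only_links = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a"))

# find_all(True) matches every tag, so the lengths show how much was kept.
print(len(soup.find_all(True)), len(only_links.find_all(True)))

# prettify() returns the markup as an indented string, one tag per line.
print(only_links.prettify())
```

On large pages, parse_only can save time and memory because the unwanted parts of the document never become objects.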

Locating fields with the Tag class

Node access (by tag name)

>>> soup.head
<head><title>The Dormouse's story</title></head>
>>> soup.p
<p class="title"><b>The Dormouse's story</b></p>

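Method search

soup.find(name, attrs, recursive, string, **kwargs) \ soup.find_all(name, attrs, recursive, string, limit, **kwargs)

name is the tag name; attrs filters on attributes; recursive decides whether all descendants are searched; string filters on the text; limit (find_all only) caps the number of results; the remaining keyword arguments specify the values that particular attributes must have.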

>>> soup.find('a')
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>
>>> soup.find('a','sister',False,'Lacie')  # recursive=False only checks the direct children of soup; no <a> is one of them, so nothing is returned
>>> soup.find('a','sister',True,'Lacie')
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
>>>
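
The positional calls above are compact but hard to read. A keyword-argument version might look like the sketch below, which uses a shortened copy of the sample page and the built-in html.parser.

```python
from bs4 import BeautifulSoup

html = (
    '<html><body><p class="story">'
    '<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>, '
    '<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

# Keyword form of soup.find('a', 'sister', True, 'Lacie').
lacie = soup.find("a", class_="sister", string="Lacie")
print(lacie["id"])

# recursive=False only looks at direct children of the object being searched.
not_found = soup.find("a", recursive=False)           # no <a> directly under soup
direct = soup.body.p.find("a", recursive=False)       # <a> is a direct child of <p>
print(not_found, direct["id"])
```

class_ carries a trailing underscore because class is a reserved word in Python.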

CSS selectors
Locate elements with CSS selectors that match on tag names and on attributes such as class and id

>>> soup.select("p[class='title']")
[<p class="title"><b>The Dormouse's story</b></p>]
>>> soup.select("p.title")
[<p class="title"><b>The Dormouse's story</b></p>]
>>> soup.select("a[id='link2']")
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>> soup.select("a#link2")
[<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]
>>> soup.select("head")
[<head><title>The Dormouse's story</title></head>]

Locating by navigation

>>> soup.head
<head><title>The Dormouse's story</title></head>
>>> soup.head.children
<list_iterator object at 0x000001E3A5ABFD00>
>>> for i in soup.head.children:print(i)

<title>The Dormouse's story</title>

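Locating via NavigableString

class NavigableString(str, PageElement): the navigable string

# How it differs from a plain str
import bs4
s1 = set(x for x in dir(str) if not x.startswith('_'))
s2 = set(x for x in dir(bs4.element.NavigableString) if not x.startswith('_'))
print(sorted(s2 - s1))

Usage
ns = soup.a.string  # ns is an instance of the NavigableString (NS) class

ns.parent  # navigate to the parent node
# ns.children is not available: a string is a leaf node and cannot contain anything
ns.next_sibling  # navigate to the next sibling node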

>>> soup.head
<head><title>The Dormouse's story</title></head>
>>> soup.head.string   # a NavigableString object
"The Dormouse's story"
>>> ns = soup.head.string
>>> ns.parent
<title>The Dormouse's story</title>
>>> ns.next_sibling
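
The empty result of ns.next_sibling above is None: the string is the only child of its tag. A slightly richer sketch, on a made-up one-paragraph document, shows where each navigation attribute leads.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="story">Once upon a time '
    '<a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

ns = soup.a.string  # the NavigableString "Elsie"

print(repr(ns.parent))                   # the <a id="link1"> tag that holds it
print(repr(ns.next_sibling))             # None: "Elsie" is the only child of <a>
print(repr(ns.next_element))             # ', ' (the next thing that was parsed)
print(repr(ns.parent.previous_sibling))  # 'Once upon a time '
```

next_sibling stays inside the same parent, while next_element follows parse order and can therefore leave the current tag.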

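V. The Three Ways of Locating in Detail: Tag Navigation, Methods, and CSS

1. Tag navigation

Four directions of movement (similar to NavigableString navigation, but performed on Tag objects rather than NS objects)

Going down
- .contents, .children, .descendants
- .string, .strings, .stripped_strings
Going up
- .parent, .parents
Going sideways
- .next_sibling, .previous_sibling, and their plural forms
- these move among siblings
Going back and forth
- .next_element, .previous_element, and their plural forms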
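
The four directions can be tried out on a small document. The sketch below uses a shortened copy of the sample page with the whitespace removed (so that whitespace strings do not show up as extra children) and the built-in html.parser.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = (
    "<html><head><title>The Dormouse's story</title></head>"
    '<body><p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story"><a id="link1">Elsie</a>, <a id="link2">Lacie</a></p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")
link = soup.find(id="link1")

# Going down
children = [child.name for child in soup.body.contents]                  # direct children
nested = [d.name for d in soup.body.descendants if isinstance(d, Tag)]   # every nested tag
texts = list(soup.body.stripped_strings)                                 # all text, stripped

# Going up
ancestors = [p.name for p in link.parents]

# Going sideways
after = link.next_sibling           # the string ", "
second = after.next_sibling         # the <a id="link2"> tag

# Going back and forth (parse order)
forward = link.next_element         # the string "Elsie" inside the link
backward = link.previous_element    # the <p class="story"> tag opened just before

print(children, nested, texts, ancestors, sep="\n")
print(repr(after), second["id"], repr(forward), backward.name)
```

Note how next_sibling skips over the contents of the link, whereas next_element steps into it.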

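2. The find_all and find methods

find_all(name, attrs, recursive, string, limit, **kwargs)

name parameter
- the HTML tag name
attrs parameter
- attributes of the tag
  - pass a dict: {'key': value}
  - a value that is not a dict is treated as a class filter, i.e. class='value'
recursive parameter
- recursion
  - True searches the children's children and so on (descendants)
  - False searches direct children only (children)
string parameter
- filters on strings that satisfy the condition
- regular expressions can also be used
  - a pattern
  - re.compile(r"a*")
Searches by name and attrs and returns a list of results
find(name, attrs, recursive, string, **kwargs)
- equivalent to find_all with limit=1, except that it returns the single result (or None) instead of a list
The other methods work in a similar way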
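
The parameters above can be exercised one at a time. What follows is only a sketch on a shortened copy of the sample page; the variable names are arbitrary.

```python
import re

from bs4 import BeautifulSoup

html = (
    '<html><body><p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>'
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>'
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")

by_dict = soup.find_all("a", {"id": "link2"})             # attrs given as a dict
by_class = soup.find_all("a", "sister")                   # non-dict attrs means class
limited = soup.find_all("a", "sister", limit=2)           # stop after two matches
by_regex = soup.find_all("a", string=re.compile(r"^L"))   # text starting with "L"
shallow = soup.find_all("a", recursive=False)             # direct children of soup only
first = soup.find("a")                                    # like limit=1, but not a list

print(len(by_dict), len(by_class), len(limited), len(by_regex), len(shallow))
print(first["id"])
```

Passing arguments by keyword, as with string= and limit= here, avoids having to remember the positional order.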

Method search (related methods)

find_parents(),find_parent()
find_next_siblings(),find_next_sibling()
find_previous_siblings(),find_previous_sibling()
find_all_next(),find_next()
find_all_previous(),find_previous()
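
These methods take the same arguments as find_all and find but search in a different direction from the starting tag. A brief sketch, again on a shortened copy of the sample page:

```python
from bs4 import BeautifulSoup

html = (
    '<html><body><p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">'
    '<a id="link1">Elsie</a>, <a id="link2">Lacie</a> and <a id="link3">Tillie</a>'
    "</p></body></html>"
)
soup = BeautifulSoup(html, "html.parser")
link2 = soup.find(id="link2")

parent_p = link2.find_parent("p")                        # nearest enclosing <p>
all_parents = [t.name for t in link2.find_parents()]     # every ancestor, nearest first
next_a = link2.find_next_sibling("a")                    # next <a> under the same parent
prev_as = link2.find_previous_siblings("a")              # earlier <a> siblings
later = link2.find_all_next("a")                         # <a> tags parsed after link2
earlier_b = link2.find_previous("b")                     # nearest <b> parsed before link2

print(parent_p["class"], all_parents)
print(next_a["id"], [a["id"] for a in prev_as], [a["id"] for a in later])
print(earlier_b.string)
```

The sibling variants stay within one parent, while find_next and find_previous follow parse order across the whole document.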

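3. CSS selectors

CSS describes how HTML elements are displayed; its selectors can also be used to locate them
select, select_one
- soup.select("p")
  - selects all p tags
- soup.select("p.sister")
  - selects all p tags whose class includes sister
- soup.select(".sister")
  - selects all tags whose class includes sister
- soup.select("p:nth-of-type(1)")
  - selects each p tag that is the first p among its siblings
- soup.select("a b")
  - selects all descendants of a that match b
  - soup.select("html .title")  # tags with class='title' inside html
  - soup.select("html #title")  # the tag with id='title' inside html
  - soup.select("#link1,#link2")  # tags whose id is link1 or link2
- soup.select("a > b")
  - selects the direct children of a that match b
- soup.select('a[href]')
  - selects all a tags that have an href attribute
  - soup.select('a[href^="http:"]')  # href starts with the given value
  - href$="http"  # href ends with http
  - href*="http"  # href contains http anywhere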

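a b  # descendants of a that match b
a > b  # direct children only

# denotes an id

. denotes a class

p[attribute]

href^="http:"  # starts with the given value
href$="http"  # ends with http
href*="http"  # contains http anywhere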
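
To see the selectors in action, the sketch below runs several of them against a condensed copy of the sample page. It assumes a bs4 installation where select() is available (it is backed by the soupsieve package, which is installed together with recent bs4 releases).

```python
from bs4 import BeautifulSoup

html = (
    "<html><head><title>The Dormouse's story</title></head><body>"
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">'
    '<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>, '
    '<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and '
    '<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>'
    "</p>"
    '<p class="story">...</p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")


def ids(selector):
    """Return the id attribute of every element matched by the selector."""
    return [tag.get("id") for tag in soup.select(selector)]


print(len(soup.select("p")))                # every <p> tag
print(len(soup.select("p.sister")))         # <p> tags with class "sister" (none here)
print(ids(".sister"))                       # any tag with class "sister"
print(soup.select("p:nth-of-type(1)"))      # the first <p> among its siblings
print(soup.select("html .title"))           # class "title" somewhere inside <html>
print(ids("#link1,#link2"))                 # either id
print(ids("p > a"), ids("body > a"))        # direct children only
print(ids('a[href^="http:"]'))              # href starts with "http:"
print(ids('a[href$="tillie"]'))             # href ends with "tillie"
print(ids('a[href*="example.com/l"]'))      # href contains "example.com/l"
print(soup.select_one("a.sister")["id"])    # first match only
```

select() always returns a list, even for a single match, while select_one() returns the first matching tag or None.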