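Web page structure, fetching page content with urlopen(), a brief introduction to common regular expressions, extracting page content with regular expressions, extracting page content with BeautifulSoup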

Structure of a web page:

We use a very simple page to look at how a web page is structured.

https://morvanzhou.github.io/static/scraping/basic-structure.html

<!DOCTYPE html>
<html lang="cn">
<head>
	<meta charset="UTF-8">
	<title>Scraping tutorial 1 | 莫烦Python</title>
	<link rel="icon" href="https://morvanzhou.github.io/static/img/description/tab_icon.png">
</head>
<body>
	<h1>爬虫测试1</h1>
	<p>
		这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
		<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.
	</p>

</body>
</html>

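Every piece of content in an HTML page is wrapped in a tag. The top-level tags divide the page into two parts: head and body.

The head holds the page's metadata. The title, for example, ends up on the browser tab.

For example: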

	<title>Scraping tutorial 1 | 莫烦Python</title>

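The body holds the main content of the page, such as headings, videos, images, and text.

The <h1></h1> tag is the main heading, which renders as larger text.

For example:

<h1>爬虫测试1</h1>

Text inside <p></p> is a paragraph. Each <a></a> is a link.

For example: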

这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.
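
To make the nesting described above visible, here is a minimal sketch that walks a trimmed copy of the sample page with the standard library's html.parser and records each opening tag with its depth. The class name OutlinePrinter and the VOID_TAGS set are illustrative choices, not part of the original tutorial.

```python
from html.parser import HTMLParser

# Trimmed copy of the sample page shown above.
PAGE = """<!DOCTYPE html>
<html lang="cn">
<head>
    <meta charset="UTF-8">
    <title>Scraping tutorial 1 | 莫烦Python</title>
</head>
<body>
    <h1>爬虫测试1</h1>
    <p>
        这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
    </p>
</body>
</html>"""

# Void elements never get a closing tag, so they must not change the depth.
VOID_TAGS = {"meta", "link", "img", "br", "hr", "input"}


class OutlinePrinter(HTMLParser):
    """Record each opening tag together with its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []  # list of (depth, tag) pairs

    def handle_starttag(self, tag, attrs):
        self.outline.append((self.depth, tag))
        if tag not in VOID_TAGS:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS:
            self.depth -= 1


parser = OutlinePrinter()
parser.feed(PAGE)
for depth, tag in parser.outline:
    # Indent each tag by its depth: head and body sit directly under html.
    print("  " * depth + tag)
```

The printed outline shows head and body as the two children of html, with title inside head and h1, p, and a inside body.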

Fetching page content with urlopen():

For example:

from urllib.request import urlopen

html = urlopen(
	"https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
# The page declares charset UTF-8, so decode the bytes as 'utf-8'
print(html)

Output:

<!DOCTYPE html>
<html lang="cn">
<head>
	<meta charset="UTF-8">
	<title>Scraping tutorial 1 | 莫烦Python</title>
	<link rel="icon" href="https://morvanzhou.github.io/static/img/description/tab_icon.png">
</head>
<body>
	<h1>爬虫测试1</h1>
	<p>
		这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
		<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.
	</p>

</body>
</html>

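As you can see, we have downloaded the page source. Next, that source is handed to a matching tool, regular expressions first and BeautifulSoup later, to pull out the information we need.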
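
The call above hard-codes 'utf-8' and has no timeout. A slightly more defensive sketch might read the charset from the response headers and fall back to UTF-8 when none is declared. The helper names decode_body and fetch_html and the 10-second timeout are assumptions made for illustration.

```python
from typing import Optional
from urllib.request import urlopen


def decode_body(raw: bytes, charset: Optional[str]) -> str:
    """Decode response bytes, falling back to UTF-8 when no charset is declared."""
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it with the charset named in its Content-Type header."""
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()
        # get_content_charset() returns None when the header names no charset.
        charset = response.headers.get_content_charset()
    return decode_body(raw, charset)


# Usage (needs network access):
# print(fetch_html("https://morvanzhou.github.io/static/scraping/basic-structure.html"))
```

Keeping the decoding step in its own function makes it easy to check without touching the network.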

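A brief introduction to common regular expressions:

^ matches the start of the string.

$ matches the end of the string.

/ (forward slash) is the delimiter that marks the start and end of a pattern in languages such as JavaScript and Perl. Python's re module takes the pattern as an ordinary string, so no delimiters are needed.

\ (backslash) is the escape character.

. matches any single character except a newline (unless re.DOTALL is set).

+ matches the preceding expression one or more times (for example, .+ matches one or more of any character).

* matches the preceding expression zero or more times (greedy matching).

? matches the preceding expression zero or one time. Placed after another quantifier, as in +? or *?, it makes that quantifier non-greedy (minimal matching).

.+ is greedy by default. .+? matches any characters but as few as possible. Wrapping it in parentheses, (.+?), turns it into a capturing group, so the text matched inside the parentheses can be retrieved on its own.

For example: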

<a href="xxx"><span>
Matching with <.+> gives
<a href="xxx"><span>
Matching with <.+?> gives (first match)
<a href="xxx">
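
The comparison above can be checked directly with Python's re module. This is a small sketch using the same sample string. The variable names are illustrative.

```python
import re

text = '<a href="xxx"><span>'

greedy = re.search(r"<.+>", text).group()  # runs on to the last '>'
lazy = re.search(r"<.+?>", text).group()   # stops at the first '>'
all_tags = re.findall(r"<.+?>", text)      # every minimal match, left to right

print(greedy)    # <a href="xxx"><span>
print(lazy)      # <a href="xxx">
print(all_tags)  # ['<a href="xxx">', '<span>']
```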

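.* matches any character any number of times, as many as possible (greedy matching). .*? matches the smallest span that still satisfies the pattern around it (minimal matching).

For example:

<H1>Chapter 1 - Introducing regular expressions</H1>
With /<.*>/, the .* part matches:
H1>Chapter 1 - Introducing regular expressions</H1
With /<.*?>/, the .*? part matches:
H1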
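
The results listed above are the text consumed by .* or .*? alone, not the whole match, which also includes the angle brackets. A short sketch with a capturing group makes the difference explicit. The sample string is shortened for illustration.

```python
import re

text = "<H1>Chapter 1</H1>"

m_greedy = re.search(r"<(.*)>", text)
m_lazy = re.search(r"<(.*?)>", text)

print(m_greedy.group(0))  # whole match:        <H1>Chapter 1</H1>
print(m_greedy.group(1))  # captured by (.*):   H1>Chapter 1</H1
print(m_lazy.group(0))    # whole match:        <H1>
print(m_lazy.group(1))    # captured by (.*?):  H1
```

re.findall returns the content of group 1 whenever the pattern has exactly one group, which is why the scripts in the next section print the text between the tags rather than the tags themselves.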

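Extracting page content with regular expressions:

For basic matching, regular expressions alone are enough. Let's look at them first.

Extracting a tag's content with a regular expression: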

from urllib.request import urlopen
import re

html = urlopen(
	"https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
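# The page declares charset UTF-8, so decode the bytes as 'utf-8'
res = re.findall(r"<title>(.+?)</title>", html)
# The downloaded page source html is the string being searched
print("\n The title of this page is", res)
print("\n The title of this page is", res[0])

Output:

 The title of this page is ['Scraping tutorial 1 | 莫烦Python']

 The title of this page is Scraping tutorial 1 | 莫烦Python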
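
Indexing res[0] raises IndexError when a page has no title. One possible variant, sketched here on inline HTML so it runs without network access, uses re.search and returns None instead. The function name extract_title is an assumption made for illustration.

```python
import re
from typing import Optional


def extract_title(html: str) -> Optional[str]:
    """Return the text inside the first <title> element, or None if there is none."""
    match = re.search(r"<title>(.+?)</title>", html, flags=re.IGNORECASE | re.DOTALL)
    return match.group(1).strip() if match else None


with_title = extract_title("<head><title>Scraping tutorial 1 | 莫烦Python</title></head>")
without_title = extract_title("<head></head>")

print(with_title)     # Scraping tutorial 1 | 莫烦Python
print(without_title)  # None
```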

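Extracting paragraph content with a regular expression:

To find the <p> paragraph in the middle of the page, note that its HTML contains tabs and newlines. By default . does not match a newline, so we pass flags=re.DOTALL, which lets . match newlines as well.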

from urllib.request import urlopen
import re

html = urlopen(
	"https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
# The page declares charset UTF-8, so decode the bytes as 'utf-8'
res = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)
# The downloaded page source html is the string being searched
print("\n The paragraph of this page is", res)
print("\n The paragraph of this page is", res[0])

Output:

 The paragraph of this page is ['\n\t\t这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>\n\t\t<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.\n\t']

 The paragraph of this page is 
		这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
		<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.
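
To see what re.DOTALL changes, the following sketch runs the same pattern with and without the flag on a small multi-line paragraph, then collapses the leftover tabs and newlines. The sample text is made up for illustration.

```python
import re

html = "<p>\n\t\tfirst line\n\t\tsecond line\n\t</p>"

without_flag = re.findall(r"<p>(.*?)</p>", html)
with_flag = re.findall(r"<p>(.*?)</p>", html, flags=re.DOTALL)

print(without_flag)  # []  ('.' stops at the first newline, so nothing matches)
print(with_flag)     # ['\n\t\tfirst line\n\t\tsecond line\n\t']

# Collapse every run of whitespace into a single space.
cleaned = " ".join(with_flag[0].split())
print(cleaned)       # first line second line
```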

Extracting links with a regular expression:

from urllib.request import urlopen
import re

html = urlopen(
	"https://morvanzhou.github.io/static/scraping/basic-structure.html"
).read().decode('utf-8')
# The page declares charset UTF-8, so decode the bytes as 'utf-8'
res = re.findall(r'href="(.*?)"', html)
# The downloaded page source html is the string being searched
print("\n The links in this page are", res)
print("\n The first link in this page is", res[0])

Output:

 The links in this page are ['https://morvanzhou.github.io/static/img/description/tab_icon.png', 'https://morvanzhou.github.io/', 'https://morvanzhou.github.io/tutorials/data-manipulation/scraping/']

 The first link in this page is https://morvanzhou.github.io/static/img/description/tab_icon.png
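
Note that the first result is the icon from the <link> tag in the head, not a hyperlink. If only <a> links are wanted, one option is to anchor the pattern on the tag name, as in this sketch on a trimmed copy of the page. The stricter pattern is an illustrative suggestion, not part of the original tutorial.

```python
import re

html = """<head>
<link rel="icon" href="https://morvanzhou.github.io/static/img/description/tab_icon.png">
</head>
<body>
<p>
<a href="https://morvanzhou.github.io/">莫烦Python</a>
<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a>
</p>
</body>"""

# Every href attribute, whatever tag it belongs to.
every_href = re.findall(r'href="(.*?)"', html)
# Only href attributes that sit inside an <a ...> tag.
anchor_href = re.findall(r'<a\s[^>]*?href="(.*?)"', html)

print(len(every_href))  # 3
print(anchor_href)
```

Patterns like this grow fragile as pages get more complex, which is one reason to switch to a real HTML parser.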

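Extracting page content with BeautifulSoup:

Installing BeautifulSoup (the examples below also select the lxml parser, a separate package installed the same way with python3 -m pip install lxml):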

python3 -m pip install beautifulsoup4

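Steps for extracting page content with BeautifulSoup:

Choose the URL to crawl;

Open the URL with urlopen (or similar) and read the page with read();

Pass the result to BeautifulSoup and use BeautifulSoup, instead of regular expressions, to find the information you want.

Parsing a basic page with BeautifulSoup: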

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("https://morvanzhou.github.io/static/scraping/basic-structure.html").read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
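# BeautifulSoup now replaces the regular expressions for finding the heading, paragraph, and so on; features='lxml' selects lxml as the parser
print(soup.h1)
# The first <h1> element (the page heading, not the <title>)
print('\n', soup.p)
# The first <p> element (the paragraph)
all_href = soup.find_all('a')
# Find every <a> tag
print(all_href)
# At this point each item is a whole tag, including <a ...> and </a>
all_href = [l['href'] for l in all_href]
# List comprehension. It builds the same list as:
# hrefs = []
# for l in all_href:
#  hrefs.append(l['href'])
# A tag's attributes are read like dictionary keys, so l['href'] is the link itself
print(all_href)
# Now only the links remain, without the surrounding <a ...> and </a>

Output: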

<h1>爬虫测试1</h1>

 <p>
		这是一个在 <a href="https://morvanzhou.github.io/">莫烦Python</a>
<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> 中的简单测试.
	</p>
[<a href="https://morvanzhou.github.io/">莫烦Python</a>, <a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a>]
['https://morvanzhou.github.io/', 'https://morvanzhou.github.io/tutorials/data-manipulation/scraping/']
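
For comparison, the same link extraction can be sketched without any third-party package, using the standard library's html.parser in place of BeautifulSoup. This is a swapped-in technique shown only to illustrate what a parser does that a regular expression does not: it sees tags and attributes rather than raw text. The class name LinkCollector and the inline HTML are illustrative.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, skipping <a> tags that have none."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


html = (
    '<p><a href="https://morvanzhou.github.io/">莫烦Python</a> '
    '<a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/">爬虫教程</a> '
    "<a>no href here</a></p>"
)
collector = LinkCollector()
collector.feed(html)
print(collector.links)
```

With BeautifulSoup, l['href'] raises KeyError for a tag without that attribute, while l.get('href') returns None, which mirrors the check made here.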

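Parsing a page that uses CSS with BeautifulSoup:

CSS styles each component of a page individually. To style a component, the page gives it a name (a class), and components of the same kind can share one name. The font color, background color, and font size of a component are all controlled by CSS.

First, look at the structure of a page that uses CSS: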

<!DOCTYPE html>
<html lang="cn">
<head>
	<meta charset="UTF-8">
	<title>爬虫练习 列表 class | 莫烦 Python</title>
	<style>
	.jan {
		background-color: yellow;
	}
	.feb {
		font-size: 25px;
	}
	.month {
		color: red;
	}
	</style>
</head>

<body>

<h1>列表 爬虫练习</h1>

<p>这是一个在 <a href="https://morvanzhou.github.io/" >莫烦 Python</a> 的 <a href="https://morvanzhou.github.io/tutorials/data-manipulation/scraping/" >爬虫教程</a>
	里无敌简单的网页, 所有的 code 让你一目了然, 清晰无比.</p>

<ul>
	<li class="month">一月</li>
	<ul class="jan">
		<li>一月一号</li>
		<li>一月二号</li>
		<li>一月三号</li>
	</ul>
	<li class="feb month">二月</li>
	<li class="month">三月</li>
	<li class="month">四月</li>
	<li class="month">五月</li>
</ul>

</body>
</html>

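In the example above, the <style></style> block inside <head></head> is the CSS. It is written directly in <head></head> here because the page is so simple; normally it would be saved as a separate CSS file.

Inside <ul></ul>, the elements are grouped by class.

BeautifulSoup can use these CSS classes to extract the information we need.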

from bs4 import BeautifulSoup
from urllib.request import urlopen

html = urlopen("https://morvanzhou.github.io/static/scraping/list.html").read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
# BeautifulSoup parses the page; features='lxml' selects lxml as the parser
month = soup.find_all('li', {'class': 'month'})
# Passing only 'li' selects every <li>; adding {'class': 'month'} keeps only the <li> elements whose class includes month
for m in month:
	print(m.get_text())
	print(m)
# The two prints differ: the first shows only the text between <li> and </li>, the second shows the whole tag including <li> and </li>
jan = soup.find('ul', {'class': 'jan'})
d_jan = jan.find_all('li')
for d in d_jan:
	print(d.get_text())
# Print every date inside jan

Output:

一月
<li class="month">一月</li>
二月
<li class="feb month">二月</li>
三月
<li class="month">三月</li>
四月
<li class="month">四月</li>
五月
<li class="month">五月</li>
一月一号
一月二号
一月三号
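
Notice that <li class="feb month"> is matched by {'class': 'month'}: a class attribute is a space-separated list of names, and a match on any one of them counts. The sketch below reproduces that rule with the standard library's html.parser on a trimmed copy of the list. It is a stand-in for illustration, not how BeautifulSoup is implemented, and the class name ClassTextCollector is made up.

```python
from html.parser import HTMLParser

LIST_HTML = """<ul>
  <li class="month">一月</li>
  <ul class="jan">
    <li>一月一号</li>
  </ul>
  <li class="feb month">二月</li>
  <li class="month">三月</li>
</ul>"""


class ClassTextCollector(HTMLParser):
    """Collect the text of every <li> whose class list contains a given name."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.inside = False
        self.texts = []

    def handle_starttag(self, tag, attrs):
        # The class attribute is a space-separated list of names.
        classes = (dict(attrs).get("class") or "").split()
        if tag == "li" and self.wanted_class in classes:
            self.inside = True
            self.texts.append("")

    def handle_data(self, data):
        if self.inside:
            self.texts[-1] += data

    def handle_endtag(self, tag):
        if tag == "li":
            self.inside = False


months = ClassTextCollector("month")
months.feed(LIST_HTML)
print(months.texts)  # ['一月', '二月', '三月']

feb = ClassTextCollector("feb")
feb.feed(LIST_HTML)
print(feb.texts)     # ['二月']
```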

Combining BeautifulSoup with regular expressions to parse a page:

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re

html = urlopen("https://morvanzhou.github.io/static/scraping/table.html").read().decode('utf-8')
soup = BeautifulSoup(html, features='lxml')
# BeautifulSoup parses the page; features='lxml' selects lxml as the parser
# We want the image links in the page
img_links = soup.find_all("img", {"src": re.compile(r'.*?\.jpg')})
# re.compile() compiles a regular expression into a Pattern object
for link in img_links:
	print(link['src'])
# Find the other links, the ones that are not images
course_links = soup.find_all('a', {'href': re.compile(r'https://morvan.*')})
for link in course_links:
	print(link['href'])

Running this prints the src of every <img> whose src contains .jpg, followed by the href of every <a> whose href contains https://morvan.
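
When a compiled pattern is passed as an attribute filter, BeautifulSoup tests it with the pattern's search() method, so the pattern may match anywhere in the attribute value. The sketch below shows what that means for the .jpg pattern, using plain re on a few made-up URLs (the example.com addresses are hypothetical), and how anchoring with $ tightens it.

```python
import re

srcs = [
    "https://example.com/img/course_cover/tf.jpg",
    "https://example.com/img/logo.png",
    "https://example.com/download.jpg.html",
]

loose = re.compile(r".*?\.jpg")  # the tutorial's pattern: '.jpg' anywhere
strict = re.compile(r"\.jpg$")   # '.jpg' only at the very end

loose_hits = [s for s in srcs if loose.search(s)]
strict_hits = [s for s in srcs if strict.search(s)]

print(loose_hits)   # the .jpg file and the .jpg.html page
print(strict_hits)  # only the .jpg file
```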

Exercise: crawling Baidu Baike

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random
import time

base_url = "https://baike.baidu.com"
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]
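for i in range(20):
	# Crawl 20 pages
	time.sleep(2)
	# Pause two seconds between requests, so the crawler does not overload the site and is less likely to be flagged
	url = base_url + his[-1]
	# The URL has two parts; his[-1] is the last item in the his list
	html = urlopen(url).read().decode('utf-8')
	soup = BeautifulSoup(html, features='lxml')
	print(soup.find('h1').get_text(), '    url:  ', his[-1])
	# On this site the entry name (e.g. 网络爬虫) is the <h1>; print it with its link
	# Next, find every Baidu Baike link on this page
	sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})
	# target="_blank" marks links that open in a new window; the pattern (%.{2})+$ keeps only hrefs of the form /item/%xx%xx%xx...
	if len(sub_urls) != 0:
		# Pick one link at random and append its href to his
		his.append(random.sample(sub_urls, 1)[0]['href'])
	else:
		# Dead end: drop this page and go back to the previous one
		his.pop()

Sample output (the walk is random, so every run differs):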

网络爬虫     url:   /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711
宽度优先搜索     url:   /item/%E5%B9%BF%E5%BA%A6%E4%BC%98%E5%85%88%E6%90%9C%E7%B4%A2
队列     url:   /item/%E9%98%9F%E5%88%97
数组     url:   /item/%E6%95%B0%E7%BB%84
字节     url:   /item/%E5%AD%97%E8%8A%82
乱码     url:   /item/%E4%B9%B1%E7%A0%81
电子邮件程序     url:   /item/%E7%94%B5%E5%AD%90%E9%82%AE%E4%BB%B6%E7%A8%8B%E5%BA%8F
点击     url:   /item/%E7%82%B9%E5%87%BB
网络营销     url:   /item/%E7%BD%91%E7%BB%9C%E8%90%A5%E9%94%80
百度百科：多义词     url:   /item/%E7%99%BE%E5%BA%A6%E7%99%BE%E7%A7%91%EF%BC%9A%E5%A4%9A%E4%B9%89%E8%AF%8D
赵氏孤儿大报仇     url:   /item/%E8%B5%B5%E6%B0%8F%E5%AD%A4%E5%84%BF
义项     url:   /item/%E4%B9%89%E9%A1%B9
赵氏孤儿大报仇     url:   /item/%E8%B5%B5%E6%B0%8F%E5%AD%A4%E5%84%BF
百度百科：多义词     url:   /item/%E7%99%BE%E5%BA%A6%E7%99%BE%E7%A7%91%EF%BC%9A%E5%A4%9A%E4%B9%89%E8%AF%8D
赵氏孤儿大报仇     url:   /item/%E8%B5%B5%E6%B0%8F%E5%AD%A4%E5%84%BF
义项     url:   /item/%E4%B9%89%E9%A1%B9
约翰内斯·勃拉姆斯     url:   /item/%E7%BA%A6%E7%BF%B0%E5%A5%88%E6%96%AF%C2%B7%E5%8B%83%E6%8B%89%E5%A7%86%E6%96%AF
约瑟夫·约阿希姆     url:   /item/%E7%BA%A6%E7%91%9F%E5%A4%AB%C2%B7%E7%BA%A6%E9%98%BF%E5%B8%8C%E5%A7%86
种族歧视     url:   /item/%E7%A7%8D%E6%97%8F%E6%AD%A7%E8%A7%86
美国最高法院     url:   /item/%E7%BE%8E%E5%9B%BD%E6%9C%80%E9%AB%98%E6%B3%95%E9%99%A2
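
To see which links the pattern /item/(%.{2})+$ keeps, here is a small offline sketch that applies it to a few hand-written hrefs and decodes the survivors with urllib.parse.unquote. The href list is made up for illustration; only the first entry follows the /item/%xx%xx... shape.

```python
import re
from urllib.parse import unquote

item_pattern = re.compile(r"/item/(%.{2})+$")

hrefs = [
    "/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB",          # only percent-encoded bytes
    "/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711",  # trailing numeric id
    "/item/Python",                                        # plain ASCII name
    "https://www.baidu.com/",                              # not an entry link
]

kept = [h for h in hrefs if item_pattern.search(h)]
decoded = [unquote(h) for h in kept]

print(kept)     # ['/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB']
print(decoded)  # ['/item/网络爬虫']
```

The pattern therefore skips entries with ASCII names or a numeric suffix, which keeps the walk on simple entry pages at the cost of missing some links.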
