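[Class 45] [Example] "Automate the Boring Stuff with Python": Filling the Gaps, Part 10. Chapter 11, Web: Parsing HTML with BeautifulSoup

Beautiful Soup is a module for extracting information from HTML pages. The name used to import the BeautifulSoup module is bs4.

Install bs4 with pip (the bs4 package on PyPI is a small wrapper that pulls in beautifulsoup4):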

C:\Users\Administrator>pip install bs4
Collecting bs4
  Downloading https://files.pythonhosted.org/packages/10/ed/7e8b97591f6f456174139ec089c769f89a94a1a4025fe967691de971f314/bs4-0.0.1.tar.gz
Collecting beautifulsoup4 (from bs4)
  Downloading https://files.pythonhosted.org/packages/1d/5d/3260694a59df0ec52f8b4883f5d23b130bc237602a1411fa670eae12351e/beautifulsoup4-4.7.1-py3-none-any.whl (94kB)
Collecting soupsieve>=1.2 (from beautifulsoup4->bs4)
  Downloading https://files.pythonhosted.org/packages/77/78/bca00cc9fa70bba1226ee70a42bf375c4e048fe69066a0d9b5e69bc2a79a/soupsieve-1.8-py2.py3-none-any.whl (88kB)
Installing collected packages: soupsieve, beautifulsoup4, bs4
  Running setup.py install for bs4 ... done
Successfully installed beautifulsoup4-4.7.1 bs4-0.0.1 soupsieve-1.8

C:\Users\Administrator>

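1. Creating a BeautifulSoup object from HTML

bs4.BeautifulSoup() is called with a string containing the HTML to be parsed. It is good practice to also pass a parser name, such as 'html.parser', as the second argument.
bs4.BeautifulSoup() returns a BeautifulSoup object.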
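
As a minimal sketch of the call described above, the snippet below builds a BeautifulSoup object from an inline HTML string. The names `html_text` and `soup` are illustrative and not from the original post, and choosing 'html.parser' (the parser that ships with Python) is an assumption made here so that bs4 does not have to guess one.

```python
import bs4

# A small HTML string to parse (illustrative only).
html_text = '<html><head><title>The Website Title</title></head><body><p>Hello</p></body></html>'

# Name the parser explicitly; 'html.parser' is bundled with Python.
soup = bs4.BeautifulSoup(html_text, 'html.parser')

print(type(soup))         # <class 'bs4.BeautifulSoup'>
print(soup.title.string)  # The Website Title
```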

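2. Finding elements with the select() method

soup.select('div') 						All elements named <div>
soup.select('#author') 					The element whose id attribute is author
soup.select('.notice') 					All elements whose CSS class attribute is notice
soup.select('div span') 				All <span> elements anywhere inside a <div> element
soup.select('div > span') 				All <span> elements directly inside a <div> element, with no other element in between
soup.select('input[name]') 				All <input> elements that have a name attribute, whatever its value
soup.select('input[type="button"]') 	All <input> elements that have a type attribute whose value is button

Create an HTML file to parse: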
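
The selector patterns listed above can be tried on a small document. The following is only a sketch: the HTML in `html_text` is made up for illustration, and `counts` is a hypothetical helper dictionary that records how many elements each selector matches.

```python
import bs4

# Made-up HTML that exercises every selector from the list above.
html_text = '''
<div>
  <span id="author">Al</span>
  <p class="notice"><span>nested</span></p>
  <input name="q" type="text">
  <input type="button" value="Go">
</div>
'''
soup = bs4.BeautifulSoup(html_text, 'html.parser')

selectors = ['div', '#author', '.notice', 'div span', 'div > span',
             'input[name]', 'input[type="button"]']

# select() always returns a list of Tag objects, so len() gives the match count.
counts = {selector: len(soup.select(selector)) for selector in selectors}

for selector, count in counts.items():
    print(selector, '->', count)
```

Note the difference between 'div span' and 'div > span': the first also matches the <span> nested inside the <p>, while the second matches only the <span> that is a direct child of the <div>.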

<!-- This is the example.html example file. --> 
<html><head><title>The Website Title</title></head>

<body>
	<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>
	<p class="slogan">Learn Python the easy way!</p>
	<p>By <span id="author">Al Sweigart</span> </p>
</body>

</html>

Example of parsing the HTML file:

#! python3
# -*- coding: utf-8 -*-

import bs4

# Read the HTML page content from the local file
with open('example.html', encoding='utf-8') as exampleFile:
    # Create a BeautifulSoup object; naming the parser avoids the "no parser specified" warning
    exampleSoup = bs4.BeautifulSoup(exampleFile.read(), 'html.parser')
print("==================================================")
print(type(exampleSoup))

# soup.select('#author'): the element whose id attribute is author
get_content = exampleSoup.select('#author')
print("Tags with id = author --- ", get_content)
print("Text with id = author --- ", get_content[0].getText())
print("Attribute dict with id = author --- ", get_content[0].attrs)
print(get_content[0].get('id'))

# soup.select('.slogan'): all elements whose CSS class attribute is slogan
get_content = exampleSoup.select('.slogan')
print("Tags with CSS class = slogan --- ", get_content)
print("Text with CSS class = slogan --- ", get_content[0].getText())
print("Attribute dict with CSS class = slogan --- ", get_content[0].attrs)
print(get_content[0].get('class'))

# soup.select('p'): all elements named <p>
get_content = exampleSoup.select('p')
print("<p> tags --- ", get_content)
for elem in get_content:
    print("<p> text --- ", elem.getText())
    print("<p> attribute dict --- ", elem.attrs)

The output is:
==================================================
<class 'bs4.BeautifulSoup'>
Tags with id = author ---  [<span id="author">Al Sweigart</span>]
Text with id = author ---  Al Sweigart
Attribute dict with id = author ---  {'id': 'author'}
author

Tags with CSS class = slogan ---  [<p class="slogan">Learn Python the easy way!</p>]
Text with CSS class = slogan ---  Learn Python the easy way!
Attribute dict with CSS class = slogan ---  {'class': ['slogan']}
['slogan']

<p> tags ---  [<p>Download my <strong>Python</strong> book from <a href="http://inventwithpython.com">my website</a>.</p>, 
<p class="slogan">Learn Python the easy way!</p>, 
<p>By <span id="author">Al Sweigart</span> </p>]

<p> text ---  Download my Python book from my website.
<p> attribute dict ---  {}
<p> text ---  Learn Python the easy way!
<p> attribute dict ---  {'class': ['slogan']}
<p> text ---  By Al Sweigart
<p> attribute dict ---  {}

PS C:\Users\Administrator\Desktop\tmp>
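
The example above indexes the result of select() with [0], which raises an IndexError when nothing matches. A minimal sketch of a guard follows; `first_text` is a hypothetical helper written for illustration, not part of bs4, and the `#editor` selector is deliberately one that matches nothing.

```python
import bs4

html_text = '<p>By <span id="author">Al Sweigart</span></p>'
soup = bs4.BeautifulSoup(html_text, 'html.parser')

def first_text(soup, selector):
    """Return the text of the first element matching selector, or None if nothing matches."""
    matches = soup.select(selector)  # always a list, possibly empty
    if not matches:
        return None
    return matches[0].getText()

found = first_text(soup, '#author')
missing = first_text(soup, '#editor')  # no element has this id

print(found)    # Al Sweigart
print(missing)  # None
```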

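3. Getting data from an element's attributes

The get() method of a Tag object makes it easy to read attribute values from an element. Pass it the attribute name as a string and it returns that attribute's value, or None if the element has no such attribute.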

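exampleSoup.select('.slogan')[0].get('class')    # ['slogan']
exampleSoup.select('.slogan')[0].get('id')       # None, because this <p> has no id attribute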
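
To make the behaviour of get() concrete, here is a short sketch that uses a fragment of the example HTML inline. The variable names `span` and `slogan` and the fallback string 'n/a' are illustrative choices, not from the original post.

```python
import bs4

html_text = ('<p class="slogan">Learn Python the easy way!</p>'
             '<p>By <span id="author">Al Sweigart</span></p>')
soup = bs4.BeautifulSoup(html_text, 'html.parser')

span = soup.select('#author')[0]
slogan = soup.select('.slogan')[0]

print(span.get('id'))           # author
print(slogan.get('class'))      # ['slogan'] (class values come back as a list)
print(slogan.get('id'))         # None: the attribute is missing
print(slogan.get('id', 'n/a'))  # n/a: an optional default replaces None
print(span.attrs)               # {'id': 'author'}
```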
