Python Crawler Libraries (1): Using BeautifulSoup

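Beautiful Soup is a Python library for extracting data from HTML or XML files. Put simply, it parses an HTML document into a tree (a web page is a tree of tags to begin with) and then lets you read the attributes of whichever tag you pick out.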

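With Beautiful Soup you can pass a class or id value as an argument and get the data of the matching tags directly. It is one of the most commonly used libraries in Python crawlers. The examples in this article assume Python 3.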

Outline:

  1. Installation
  2. Importing beautifulsoup4 (bs4)
  3. Page parsing: fetch the page and convert it to a bs4 object
  4. Scraping: get individual elements from the bs4 object

Recommended environment: Anaconda + VS Code.

1. Install beautifulsoup4

In the VS Code terminal, run pip install beautifulsoup4. urllib is part of the Python 3 standard library, so it does not need to be installed with pip.

2. Import bs4

Once installation finishes, try importing the library:

from bs4 import BeautifulSoup

If no error is raised, the library is installed correctly.

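3. Fetching the page

This article uses the page http://reeoo.com for its examples, as shown in the figure below.

First import urllib.request, build a Request for the URL, open it to get the response body, and then use that body to initialize a BeautifulSoup object: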

from bs4 import BeautifulSoup
import urllib.request

url = 'http://reeoo.com'

request = urllib.request.Request(url)

response = urllib.request.urlopen(request, timeout=20)

content = response.read()

soup = BeautifulSoup(content, 'html.parser')

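Passing a document to the BeautifulSoup constructor returns a document object in Beautiful Soup's own object format. Here the document comes from the URL request above, and printing soup gives output like the following: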

" rel="EditURI" title="RSD" type="application/rsd+xml"/>
<link href="http://reeoo.com/wp-includes/wlwmanifest.xml" rel="wlwmanifest" type="application/wlwmanifest+xml"/>
<meta content="WordPress 4.9.8" name="generator"/>
</link></meta></meta></meta></meta></meta></meta></head>
<body>
<header id="header">
<div id="main_menu">
<div class="box">
<h1 id="logo"><a href="https://reeoo.com" title="Web design inspiration and gallery"><span class="icon-reeoo"></span></a></h1>
<ul>
<li class="active" id="link_web"><a href="https://reeoo.com" title="Web Design Gallery">Web Design</a></li>
<li id="link_iphone"><a href="https://iphone.reeoo.com" title="iPhone Patterns">iPhone App</a></li>
<li id="link_ipad"><a href="https://ipad.reeoo.com" title="iPad Patterns">iPad App</a></li>
<li id="link_icon"><a href="https://icon.reeoo.com" title="iOS Icon Design">Icon</a></li>
<li id="link_designer"><a href="https://designer.reeoo.com" title="Designer Show">Designer</a></li>
<li id="link_download"><a href="https://download.reeoo.com" title="Design resources download">Download</a></li>
</ul>
<div id="more">
<div id="search">
<span class="icon-search"></span>
<form action="https://reeoo.com" id="searchform" method="get">
<input id="s" name="s" placeholder="Search name or tag" required="" size="20" type="text" value=""/>
</form>
</div>
<div id="contact"><a href="http://weibo.com/reeoocom" target="_blank"><span class="icon-weibo"></span></a><a href="https://twitter.com/reeoocom" target="_blank"><span class="icon-twitter"></span></a><a href="mailto:[email protected]" target="_blank"><span class="icon-email"></span></a></div>
</div>
</div>
</div>
<div id="submenu">
<div class="box">
<div class="menu-color-menu-container"><ul class="menu" id="menu-color-menu"><li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3865" id="menu-item-3865"><a href="https://reeoo.com/category/black" title="Black Web Design">Black</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3866" id="menu-item-3866"><a href="https://reeoo.com/category/blue" title="Blue Web Design">Blue</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3867" id="menu-item-3867"><a href="https://reeoo.com/category/brown" title="Brown Web Design">Brown</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3869" id="menu-item-3869"><a href="https://reeoo.com/category/green" title="Green Web Design">Green</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3868" id="menu-item-3868"><a href="https://reeoo.com/category/gray" title="Gray Web Design">Gray</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3871" id="menu-item-3871"><a href="https://reeoo.com/category/orange" title="Orange Web Design">Orange</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3872" id="menu-item-3872"><a href="https://reeoo.com/category/purple" title="Purple Web Design">Purple</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-13232" id="menu-item-13232"><a href="https://reeoo.com/category/pink">Pink</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3873" id="menu-item-3873"><a href="https://reeoo.com/category/red" title="Red Web Design">Red</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3874" id="menu-item-3874"><a href="https://reeoo.com/category/white" title="White Web Design">White</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3875" id="menu-item-3875"><a href="https://reeoo.com/category/yellow" title="Yellow Web Design">Yellow</a></li>
<li class="menu-item menu-item-type-taxonomy menu-item-object-category menu-item-3870" id="menu-item-3870"><a href="https://reeoo.com/category/multicolored" title="Multicolored Web Design">Multicolored</a></li>
</ul></div> <div class="filter">
<span class="icon-category"></span>
<div class="menu-header-menu-container"><ul class="menu" id="menu-header-menu"><li class="menu-item menu-item-type-custom menu-item-object-custom current-menu-item menu-item-11736" id="menu-item-11736"><a href="http://reeoo.com/">All</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11737" id="menu-item-11737"><a href="http://reeoo.com/?s=app">App</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11750" id="menu-item-11750"><a href="http://reeoo.com/tag/software">Software</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11754" id="menu-item-11754"><a href="http://reeoo.com/tag/icon">Icon</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11747" id="menu-item-11747"><a href="http://reeoo.com/?s=agency">Agency</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11752" id="menu-item-11752"><a href="http://reeoo.com/tag/company">Company</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11740" id="menu-item-11740"><a href="http://reeoo.com/?s=studio">Studio</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11738" id="menu-item-11738"><a href="http://reeoo.com/tag/coming-soon">Coming Soon</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11739" id="menu-item-11739"><a href="http://reeoo.com/tag/onepage">Onepage</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11751" id="menu-item-11751"><a href="http://reeoo.com/tag/cartoon">Cartoon</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11764" id="menu-item-11764"><a href="http://reeoo.com/?s=animation">Animation</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11766" id="menu-item-11766"><a href="http://reeoo.com/?s=develop">Develop</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11743" id="menu-item-11743"><a href="http://reeoo.com/tag/designer">Designer</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11741" id="menu-item-11741"><a href="http://reeoo.com/tag/food">Food</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11742" id="menu-item-11742"><a href="http://reeoo.com/tag/music">Music</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11749" id="menu-item-11749"><a href="http://reeoo.com/?s=movie">Movie</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11763" id="menu-item-11763"><a href="http://reeoo.com/?s=metting">Metting</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11744" id="menu-item-11744"><a href="http://reeoo.com/?s=shop">Shop</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11756" id="menu-item-11756"><a href="http://reeoo.com/tag/fashion">Fashion</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11745" id="menu-item-11745"><a href="http://reeoo.com/?s=wordpress">WordPress</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11746" id="menu-item-11746"><a href="http://reeoo.com/?s=theme">Theme</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11748" id="menu-item-11748"><a href="http://reeoo.com/?s=official">Official</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11753" id="menu-item-11753"><a href="http://reeoo.com/tag/travel">Travel</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11757" id="menu-item-11757"><a href="http://reeoo.com/?s=tool">Tool</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11755" id="menu-item-11755"><a href="http://reeoo.com/tag/product">Product</a></li>
<li class="menu-item menu-item-type-custom menu-item-object-custom menu-item-11758" id="menu-item-11758"><a href="http://reeoo.com/?s=bike">Bike</a></li>
</ul></div> </div>
</div>
</div>
</header>
<article class="box">
<div id="main">
<ul id="list">
<li class="sponsor">
<script async="" id="_carbonads_js" src="//cdn.carbonads.com/carbon.js?serve=CKYIVKJ7&amp;placement=reeoocom" type="text/javascript"></script>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/loop">
<img alt="Loop" class="lazy" data-original="https://reeoo.xnny.net/Loop.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Loop" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/loop">Loop</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/programatorio">
<img alt="Programatório" class="lazy" data-original="https://reeoo.xnny.net/Programatorio.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="Programatório" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/programatorio">Programatório</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/ultraviolet-way">
<img alt="Ultraviolet Way" class="lazy" data-original="https://reeoo.xnny.net/Ultraviolet Way.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Ultraviolet Way" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/ultraviolet-way">Ultraviolet Way</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/misatoto-town">
<img alt="みさとと。" class="lazy" data-original="https://reeoo.xnny.net/Misatoto Town.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="みさとと。" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/misatoto-town">みさとと。</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/block-studio">
<img alt="Block Studio" class="lazy" data-original="https://reeoo.xnny.net/Block Studio.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Block Studio" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/block-studio">Block Studio</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/composition-no-24">
<img alt="Composition No. 24" class="lazy" data-original="https://reeoo.xnny.net/Composition No. 24.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Composition No. 24" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/composition-no-24">Composition No. 24</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/discovery-land-company">
<img alt="Discovery Land Company" class="lazy" data-original="https://reeoo.xnny.net/Discovery Land Company.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Discovery Land Company" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/discovery-land-company">Discovery Land Company</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/hardies">
<img alt="Hardies" class="lazy" data-original="https://reeoo.xnny.net/Hardies.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Hardies" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/hardies">Hardies</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/welchs-fruit-snacks">
<img alt="Welch’s Fruit Snacks" class="lazy" data-original="https://reeoo.xnny.net/Welch's Fruit Snacks.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Welch’s Fruit Snacks" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/welchs-fruit-snacks">Welch’s Fruit Snacks</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/exeron">
<img alt="EXERON" class="lazy" data-original="https://reeoo.xnny.net/EXERON.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="EXERON" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/exeron">EXERON</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/pop-weaver">
<img alt="Pop Weaver" class="lazy" data-original="https://reeoo.xnny.net/Pop Weaver.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Pop Weaver" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/pop-weaver">Pop Weaver</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/edesign-interactive">
<img alt="eDesign Interactive" class="lazy" data-original="https://reeoo.xnny.net/eDesign Interactive.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="eDesign Interactive" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/edesign-interactive">eDesign Interactive</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/obsolete">
<img alt="OBSOLETE" class="lazy" data-original="https://reeoo.xnny.net/OBSOLETE.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="OBSOLETE" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/obsolete">OBSOLETE</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/minibricks">
<img alt="Minibricks" class="lazy" data-original="https://reeoo.xnny.net/Minibricks.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Minibricks" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/minibricks">Minibricks</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/your-sport-agent">
<img alt="Your Sport Agent" class="lazy" data-original="https://reeoo.xnny.net/Your Sport Agent.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Your Sport Agent" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/your-sport-agent">Your Sport Agent</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/modulz">
<img alt="Modulz" class="lazy" data-original="https://reeoo.xnny.net/Modulz.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Modulz" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/modulz">Modulz</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/shift-2">
<img alt="Shift" class="lazy" data-original="https://reeoo.xnny.net/Shift.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Shift" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/shift-2">Shift</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/rand">
<img alt="Rand" class="lazy" data-original="https://reeoo.xnny.net/Rand.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Rand" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/rand">Rand</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/rappipay-2">
<img alt="RappiPay" class="lazy" data-original="https://reeoo.xnny.net/RappiPay 2.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="RappiPay" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/rappipay-2">RappiPay</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/real-happiness-project-from-bbc-earth">
<img alt="Real Happiness Project from BBC Earth" class="lazy" data-original="https://reeoo.xnny.net/Real Happiness Project from BBC Earth.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Real Happiness Project from BBC Earth" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/real-happiness-project-from-bbc-earth">Real Happiness Project from BBC Earth</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/opera">
<img alt="OPERA" class="lazy" data-original="https://reeoo.xnny.net/OPERA.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="OPERA" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/opera">OPERA</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/kyoto-shin-nyo-do">
<img alt="真如堂を楽しむ" class="lazy" data-original="https://reeoo.xnny.net/Kyoto Shin nyo-do.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="真如堂を楽しむ" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/kyoto-shin-nyo-do">真如堂を楽しむ</a></div>
</li>
<li>
<div class="thumb">
<a href="https://reeoo.com/bitbiome">
<img alt="bitBiome" class="lazy" data-original="https://reeoo.xnny.net/bitBiome.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="bitBiome" width="300"/>
</a>
</div>
<div class="title"><a href="https://reeoo.com/bitbiome">bitBiome</a></div>
</li>
</ul>
<!-- pb265 --><div class="pagebar"><span> </span><span class="this-page">1</span>
<a href="https://reeoo.com/page/2" title="Page 2">2</a>
<a href="https://reeoo.com/page/3" title="Page 3">3</a>
<a href="https://reeoo.com/page/4" title="Page 4">4</a>
<a href="https://reeoo.com/page/5" title="Page 5">5</a>
<a href="https://reeoo.com/page/6" title="Page 6">6</a>
<a href="https://reeoo.com/page/7" title="Page 7">7</a>
<a href="https://reeoo.com/page/8" title="Page 8">8</a>
<a href="https://reeoo.com/page/9" title="Page 9">9</a>
<span class="break">...</span>
<a href="https://reeoo.com/page/172" title="Page 172">172</a>
<a href="https://reeoo.com/page/173" title="Page 173">173</a>
<a href="https://reeoo.com/page/174" title="Page 174">174</a>
<a href="https://reeoo.com/page/175" title="Page 175">175</a>
<a href="https://reeoo.com/page/176" title="Page 176">176</a>
<a href="https://reeoo.com/page/177" title="Page 177">177</a>
<a href="https://reeoo.com/page/2" title="Page 2">&gt;</a>
</div></div>
</article>
<footer id="footer">
<div class="box">
<p>
<span class="link">
<a href="http://designlol.net" target="_blank" title="全球设计精华分享站">Design lol</a>
<a href="http://logojoy.com" target="_blank">Logojoy</a>
<a href="http://www.pplock.com/" target="_blank" title="分享艺术·设计·创意">PPLock</a>
<a href="http://reader.mx/?utm_source=reeoo&amp;utm_medium=web&amp;utm_campaign=link" target="_blank" title="Reader APP">ReaderMX</a>
<a href="http://www.ui.cn" target="_blank">UICN</a>
<a href="http://www.uisdc.com/" target="_blank" title="优秀网页设计联盟">UISDC</a>
<a href="http://zmingcx.com/" target="_blank" title="知更鸟">Zmingcx</a>
</span>
<span class="link">
<a href="https://logomaster.ai/" rel="noopener" target="_blank">Online Logo Maker</a>
<a href="http://www.treasurebox.co.nz/outdoor-garden/greenhouse.html" rel="noopener" target="_blank">greenhouse nz</a>
<a href="https://www.payformathhomework.com" target="_blank">Pay For Math Homework</a>- math help
				</span>
<a href="https://www.zessay.com/" target="_blank">Essay services</a> for college students.   
				<a href="https://myhomeworkdone.com/" target="_blank">My Homework Done</a> really makes your homework done.   
				<a href="http://mydissertations.com/" target="_blank">MyDissertations</a> - dissertation help on design topics.   
						<br/>
			Powered by <a href="http://wordpress.org/" target="_blank">WordPress</a>. © <a href="https://reeoo.com" rel="home" title="Reeoo">Reeoo.com</a>.</p>
</div>
</footer>
<script type="text/javascript">
/* <![CDATA[ */
var image_lazy_load = {"image_unveil_load":"0"};
/* ]]> */
</script>
<script src="http://reeoo.com/wp-content/plugins/image-lazy-load/js/min/frontend-min.js?ver=1.0.9" type="text/javascript"></script>
<script src="http://reeoo.com/wp-includes/js/wp-embed.min.js?ver=4.9.8" type="text/javascript"></script>
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-11594399-2', 'auto');
  ga('send', 'pageview');

</script>
</body>
</html>

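The request above has no exception handling, which we skip for now; normally you would use urllib to check whether the request succeeded. The second argument of the BeautifulSoup constructor (lxml or html.parser) names the document parser; lxml has to be installed separately, while html.parser ships with Python. If the argument is omitted, Beautiful Soup picks the best parser available on the system and emits a warning. See the bs4 documentation for details (https://www.crummy.com/software/BeautifulSoup/bs4/doc/).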
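The exception handling skipped above could look roughly like this. It is a minimal sketch, not the only way to do it: fetch_html is a hypothetical helper name, and printing the error and returning None is just one possible policy.

```python
import urllib.error
import urllib.request


def fetch_html(url, timeout=20):
    """Return the response body as bytes, or None if the request failed.

    fetch_html is a hypothetical helper written for this example.
    """
    request = urllib.request.Request(url)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"HTTP error {err.code} for {url}")
    except urllib.error.URLError as err:
        # No valid answer at all (DNS failure, refused connection, ...).
        print(f"Could not reach {url}: {err.reason}")
    return None
```

A caller would check the result before parsing, for example: content = fetch_html('http://reeoo.com'), followed by BeautifulSoup(content, 'html.parser') only when content is not None.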

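You can also initialize from a file handle. Save the page's HTML source as reo.html in the same directory as the script, then pass the open file object (not the file name) to the constructor. The UTF-8 encoding here is an assumption about how the file was saved:

with open('reo.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')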

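This way you can download the page once and analyze the local copy, which avoids hitting the site repeatedly during testing and getting blocked. Printing soup with print() gives output that is essentially the same as the HTML text, but soup is now a tree structure in which every node is a Python object.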
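The download-once, parse-locally workflow can be sketched as follows. The markup and the file name reo_sample.html are made up for illustration; in practice content would be the bytes returned by response.read().

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Stand-in for the bytes returned by response.read(); this markup is made up.
content = b"<html><head><title>Saved page</title></head><body><p>Hello</p></body></html>"

# Hypothetical cache location in the system temp directory.
cache_file = Path(tempfile.gettempdir()) / "reo_sample.html"

# Download once and write the raw bytes to disk...
cache_file.write_bytes(content)

# ...then parse the local copy as often as needed without touching the network.
with cache_file.open("rb") as fh:
    soup = BeautifulSoup(fh, "html.parser")

print(soup.title.string)  # Saved page
```

Opening the file in binary mode leaves the encoding detection to Beautiful Soup, which is one reasonable choice when the page's encoding is not known in advance.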

4. Getting specific tags

Every soup in the example code below refers to the soup object created above.

4.1 Tag

A Tag object corresponds to a tag in the original HTML document and can be accessed directly by its name:

tag = soup.title
print(tag)

Output:

<title>Reeoo - web design inspiration and website gallery</title>

4.2 Name

A Tag object's name attribute gives the name of the tag:

print(tag.name)

# title

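4.3 Attributes

A tag can have many attributes, such as id and class, and they are accessed the same way as dictionary entries.

For example, the page wraps its thumbnail area in an article tag: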

...

<article class="box">

   <div id="main">

   <ul id="list">

       <li id="sponsor"><div class="sponsor_tips"></div>

           <script async type="text/javascript" src="//cdn.carbonads.com/carbon.js?zoneid=1696&serve=CVYD42T&placement=reeoocom" id="_carbonads_js"></script>

       </li>

...

Get the value of its class attribute:

tag = soup.article

c = tag['class']

# ['box']

You can also get all of the attributes at once through .attrs:

tag = soup.article

attrs = tag.attrs

print(attrs)

# {'class': ['box']}

P.S. class is a multi-valued attribute, so its value is a list.
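The name and attribute access described above can be tried on a small standalone document. The markup below is made up and only loosely modelled on the page's article element:

```python
from bs4 import BeautifulSoup

# Made-up markup, loosely modelled on the page's <article> element.
html = '<article class="box wide" id="main-article"><p>text</p></article>'
soup = BeautifulSoup(html, "html.parser")

tag = soup.article
print(tag.name)          # article
print(tag["id"])         # main-article (single-valued attribute, a str)
print(tag["class"])      # ['box', 'wide'] (multi-valued attribute, a list)
print(tag.get("title"))  # None; .get() avoids a KeyError for a missing attribute
print(tag.attrs)         # all attributes as a dict
```

Using tag.get() instead of tag[...] is a common choice when an attribute may be absent on some tags.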

-1- Strings inside a tag

The .string attribute returns the string contained in a tag:

tag = soup.title

s = tag.string

print(s)

# Reeoo - web design inspiration and website gallery
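One detail worth knowing: .string only returns a value when the tag has a single child. A small sketch on made-up markup shows the difference, with get_text() as the usual fallback:

```python
from bs4 import BeautifulSoup

# Made-up markup: <h1> holds one string, <p> holds a mix of strings and a tag.
html = "<div><h1>Only text</h1><p>Some <b>bold</b> text</p></div>"
soup = BeautifulSoup(html, "html.parser")

print(soup.h1.string)     # Only text
print(soup.p.string)      # None, because <p> has several children
print(soup.p.get_text())  # Some bold text
```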

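-2- Traversing the document tree

A Tag may contain several strings or other Tags; all of these are its child nodes. Beautiful Soup provides many attributes for navigating and iterating over child nodes.

Child nodes

Accessing a tag name as an attribute of a Tag returns the matching tag beneath it, and chaining these accesses walks down through the children.

For example, to get the li inside the article tag: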

tag = soup.article.div.ul.li

print(tag)

Output:

<li id="sponsor"><div class="sponsor_tips"></div>

<script async="" id="_carbonads_js" src="//cdn.carbonads.com/carbon.js?zoneid=1696&serve=CVYD42T&placement=reeoocom" type="text/javascript"></script>

</li>

You can also leave out some of the intermediate nodes and get the same result:

tag = soup.article.li

Attribute access with . only returns the first matching tag. To get all of the li tags, use the find_all() method:

ls = soup.article.div.ul.find_all('li')

The result is a list containing every li tag.
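The difference between dotted access and find_all() is easy to see on a tiny made-up list:

```python
from bs4 import BeautifulSoup

# Made-up markup with the same article > div > ul > li nesting as the page.
html = """
<article><div><ul>
<li>one</li><li>two</li><li>three</li>
</ul></div></article>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.article.li              # shortcut for soup.article.div.ul.li
items = soup.article.find_all("li")  # every <li> below <article>

print(first.string)                  # one
print(len(items))                    # 3
print([li.string for li in items])   # ['one', 'two', 'three']
```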

A tag's .contents attribute returns its child nodes as a list:

tag = soup.article.div.ul
contents = tag.contents
print(contents)
for i in contents:
    print(i)

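Printing contents shows that the list holds not only the li tags but also newline strings ('\n'). Looping over the list prints each node on its own, which makes the difference easier to see.

A tag's .children iterator lets you loop over its child nodes: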

tag = soup.article.div.ul

children = tag.children

print(children)

for child in children:

   print(child)

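Printing children shows an iterator object rather than a list. The two for loops print the same nodes; the difference is at the start of the output, where print(contents) dumps the whole list at once and print(children) shows only the iterator object.

.contents and .children cover only a tag's direct children. To walk the children of children as well, use the .descendants attribute. It works like the other two, so it is not shown here.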
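A short comparison of the three attributes on made-up markup, including one way to skip the newline strings (the isinstance filter is just one option):

```python
from bs4 import BeautifulSoup, Tag

# Made-up list; the "\n" between tags become string nodes of their own.
html = "<ul>\n<li>one</li>\n<li><a>two</a></li>\n</ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

direct = ul.contents                 # list of direct children
via_children = list(ul.children)     # same nodes, produced lazily
everything = list(ul.descendants)    # children, grandchildren, and so on

print(len(direct))        # 5: three "\n" strings and two <li> tags
print(len(via_children))  # 5
print(len(everything))    # 8: also the <a> tag and the strings "one" and "two"

# Keep only real tags and drop the whitespace strings.
tags_only = [child.name for child in ul.children if isinstance(child, Tag)]
print(tags_only)          # ['li', 'li']
```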

-3- Parent nodes

The .parent attribute returns an element's parent node. The parent of article is body.

tag = soup.article

print(tag.parent.name)

# body

Or iterate over all of the ancestors with the .parents attribute:

tag = soup.article

for p in tag.parents:

   print(p.name)

-4- Sibling nodes

The .next_sibling and .previous_sibling attributes look up sibling nodes, and they are used in the same way as the attributes above.
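Parent and sibling navigation can be sketched on a small made-up document. Note that a sibling is often a whitespace string rather than a tag, which is why find_next_sibling() is used at the end:

```python
from bs4 import BeautifulSoup

# Made-up markup: two paragraphs separated by a newline.
html = "<html><body><article><p>a</p>\n<p>b</p></article></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p
print(first_p.parent.name)                # article
print([p.name for p in first_p.parents])  # ['article', 'body', 'html', '[document]']

# The direct sibling is the newline string between the two tags...
sibling = first_p.next_sibling
# ...so find_next_sibling() is handy for jumping to the next tag instead.
next_tag = first_p.find_next_sibling("p")
print(next_tag)                           # <p>b</p>
```

The last entry in the parents list, '[document]', is the name of the BeautifulSoup object itself, which sits at the root of the tree.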

-5- Searching the document tree

Searching a tree-structured document for specific nodes is the most common operation in crawling.

find_all()

find_all(name, attrs, recursive, string, limit, **kwargs)
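Before going through the arguments one by one, here is a rough sketch of what recursive, limit and string do, run on a made-up nested list:

```python
from bs4 import BeautifulSoup

# Made-up markup: an outer list with one nested list inside its first item.
html = "<ul><li>a<ul><li>nested</li></ul></li><li>b</li><li>c</li></ul>"
soup = BeautifulSoup(html, "html.parser")
outer = soup.ul

all_items = outer.find_all("li")                      # searches every descendant
direct_items = outer.find_all("li", recursive=False)  # direct children only
first_two = outer.find_all("li", limit=2)             # stop after two matches
by_text = outer.find_all(string="b")                  # match strings, not tags

print(len(all_items))     # 4
print(len(direct_items))  # 3
print(len(first_two))     # 2
print(by_text)            # ['b']
```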

4.4 The name argument

Finds all tags whose name is name:

soup.find_all('title')

# [<title>Reeoo - web design inspiration and website gallery</title>]

soup.find_all('footer')

# [<footer id="footer"> <div class="box"> <p> ... </div> </footer>]

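4.5 Keyword arguments

If an argument's name is not one of the built-in parameter names (name, attrs, recursive, string, limit), it is treated as a tag attribute to search on. If no tag name is given, all tags are searched.

For example, search for every tag whose id is footer: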

soup.find_all(id='footer')

# [<footer id="footer"> <div class="box"> <p> ... </div> </footer>]

Adding the tag name as well:

soup.find_all('footer', id='footer')

[<footer id="footer">
 <div class="box">
 <p>
 <span class="link">
 <a href="http://designlol.net" target="_blank" title="全球设计精华分享站">Design lol</a>
 <a href="http://logojoy.com" target="_blank">Logojoy</a>
 <a href="http://www.pplock.com/" target="_blank" title="分享艺术·设计·创意">PPLock</a>
 <a href="http://reader.mx/?utm_source=reeoo&amp;utm_medium=web&amp;utm_campaign=link" target="_blank" title="Reader APP">ReaderMX</a>
 <a href="http://www.ui.cn" target="_blank">UICN</a>
 <a href="http://www.uisdc.com/" target="_blank" title="优秀网页设计联盟">UISDC</a>
 <a href="http://zmingcx.com/" target="_blank" title="知更鸟">Zmingcx</a>
 </span>
 <span class="link">
 <a href="https://logomaster.ai/" rel="noopener" target="_blank">Online Logo Maker</a>
 <a href="http://www.treasurebox.co.nz/outdoor-garden/greenhouse.html" rel="noopener" target="_blank">greenhouse nz</a>
 <a href="https://www.payformathhomework.com" target="_blank">Pay For Math Homework</a>- math help
 				</span>
 <a href="https://www.zessay.com/" target="_blank">Essay services</a> for college students.   
 				<a href="https://myhomeworkdone.com/" target="_blank">My Homework Done</a> really makes your homework done.   
 				<a href="http://mydissertations.com/" target="_blank">MyDissertations</a> - dissertation help on design topics.   
 						<br/>
 			Powered by <a href="http://wordpress.org/" target="_blank">WordPress</a>. © <a href="https://reeoo.com" rel="home" title="Reeoo">Reeoo.com</a>.</p>
 </div>
 </footer>]

Get every thumbnail div tag. Thumbnails are marked with the class thumb:

soup.find_all('div', class_='thumb')

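Note that class is a reserved word in Python, so the keyword argument is written with a trailing underscore: class_.

The value of an attribute keyword argument can be a string, a regular expression, a list, or True/False.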
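The kinds of values listed above can be compared side by side. This is a small sketch on made-up links; the id values only imitate the ones used on the page:

```python
import re

from bs4 import BeautifulSoup

# Made-up links; the ids imitate the style used in the page's menu.
html = """
<a id="link_web" href="https://reeoo.com" target="_blank">Web</a>
<a id="link_icon" href="https://icon.reeoo.com">Icon</a>
<a id="about" href="/about">About</a>
"""
soup = BeautifulSoup(html, "html.parser")

by_string = soup.find_all(id="about")               # exact string
by_regex = soup.find_all(id=re.compile(r"^link_"))  # regular expression
by_list = soup.find_all(id=["about", "link_web"])   # any value in the list
by_true = soup.find_all(target=True)                # attribute is present

print([a["id"] for a in by_string])  # ['about']
print([a["id"] for a in by_regex])   # ['link_web', 'link_icon']
print([a["id"] for a in by_list])    # ['link_web', 'about']
print([a["id"] for a in by_true])    # ['link_web']
```

Results always come back in document order, which is why the list filter returns link_web before about.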

True/False

Whether the given attribute is present.

Search for all tags that have a target attribute:

soup.find_all(target=True)

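Search for all tags without a target attribute. (Look closely and the results still show tags with target; those are children nested inside matched tags that have no target themselves, so keep this in mind.)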

soup.find_all(target=False)

Several arguments can be combined as filter conditions. For example, the thumbnail part of the page is marked up like this:

<li>

   <div class="thumb">

       <a href="http://reeoo.com/aim-creative-studios">![AIM Creative Studios](http://upload-images.jianshu.io/upload_images/1346917-f6281ffe1a8f0b18.gif?imageMogr2/auto-orient/strip)</a>

   </div>

   <div class="title">

       <a href="http://reeoo.com/aim-creative-studios">AIM Creative Studios</a>

   </div>

</li>

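Search for tags whose src attribute contains the string reeoo and whose class is lazy:

Note: re is Python's regular expression module and has to be imported first (import re).

soup.find_all(src=re.compile(r"reeoo\.com"), class_='lazy')

The result is every thumbnail img tag.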
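Combining filters means a tag has to satisfy all of them. A sketch on made-up img tags (the second and third are invented to show what gets excluded):

```python
import re

from bs4 import BeautifulSoup

# Made-up images: only the first one matches both conditions.
html = """
<img class="lazy" src="https://reeoo.com/assets/white.gif" alt="thumb 1">
<img class="lazy big" src="https://example.com/other.gif" alt="thumb 2">
<img class="logo" src="https://reeoo.com/logo.png" alt="logo">
"""
soup = BeautifulSoup(html, "html.parser")

# Both conditions must hold: src matches the pattern AND class includes "lazy".
both = soup.find_all("img", src=re.compile(r"reeoo\.com"), class_="lazy")
print([img["alt"] for img in both])     # ['thumb 1']

# class_ matches a tag if any one of its classes equals the given value.
lazy = soup.find_all(class_="lazy")
print([img["alt"] for img in lazy])     # ['thumb 1', 'thumb 2']
```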

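Some attributes cannot be used as keyword arguments, for example data-* attributes. In the example above, data-original cannot be passed as a keyword: the hyphen is not valid in a Python identifier, so the call fails with SyntaxError: keyword can't be an expression.

4.6 The attrs argument

Pass a dictionary to search for tags by attribute. To some extent this gets around the problem above of attributes that cannot be written as keyword arguments.

For example, search for tags that have a data-original attribute:

print(soup.find_all(attrs={'data-original': True}))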

[<img alt="Travelshift" class="lazy" data-original="https://reeoo.xnny.net/Travelshift.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Travelshift" width="300"/>,
 <img alt="Loop" class="lazy" data-original="https://reeoo.xnny.net/Loop.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Loop" width="300"/>,
 <img alt="Programatório" class="lazy" data-original="https://reeoo.xnny.net/Programatorio.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="Programatório" width="300"/>,
 <img alt="Ultraviolet Way" class="lazy" data-original="https://reeoo.xnny.net/Ultraviolet Way.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Ultraviolet Way" width="300"/>,
 <img alt="みさとと。" class="lazy" data-original="https://reeoo.xnny.net/Misatoto Town.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="みさとと。" width="300"/>,
 <img alt="Block Studio" class="lazy" data-original="https://reeoo.xnny.net/Block Studio.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Block Studio" width="300"/>,
 <img alt="Composition No. 24" class="lazy" data-original="https://reeoo.xnny.net/Composition No. 24.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Composition No. 24" width="300"/>,
 <img alt="Discovery Land Company" class="lazy" data-original="https://reeoo.xnny.net/Discovery Land Company.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Discovery Land Company" width="300"/>,
 <img alt="Hardies" class="lazy" data-original="https://reeoo.xnny.net/Hardies.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Hardies" width="300"/>,
 <img alt="Welch’s Fruit Snacks" class="lazy" data-original="https://reeoo.xnny.net/Welch's Fruit Snacks.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Welch’s Fruit Snacks" width="300"/>,
 <img alt="EXERON" class="lazy" data-original="https://reeoo.xnny.net/EXERON.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="EXERON" width="300"/>,
 <img alt="Pop Weaver" class="lazy" data-original="https://reeoo.xnny.net/Pop Weaver.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Pop Weaver" width="300"/>,
 <img alt="eDesign Interactive" class="lazy" data-original="https://reeoo.xnny.net/eDesign Interactive.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="eDesign Interactive" width="300"/>,
 <img alt="OBSOLETE" class="lazy" data-original="https://reeoo.xnny.net/OBSOLETE.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="OBSOLETE" width="300"/>,
 <img alt="Minibricks" class="lazy" data-original="https://reeoo.xnny.net/Minibricks.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Minibricks" width="300"/>,
 <img alt="Your Sport Agent" class="lazy" data-original="https://reeoo.xnny.net/Your Sport Agent.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Your Sport Agent" width="300"/>,
 <img alt="Modulz" class="lazy" data-original="https://reeoo.xnny.net/Modulz.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Modulz" width="300"/>,
 <img alt="Shift" class="lazy" data-original="https://reeoo.xnny.net/Shift.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Shift" width="300"/>,
 <img alt="Rand" class="lazy" data-original="https://reeoo.xnny.net/Rand.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Rand" width="300"/>,
 <img alt="RappiPay" class="lazy" data-original="https://reeoo.xnny.net/RappiPay 2.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="RappiPay" width="300"/>,
 <img alt="Real Happiness Project from BBC Earth" class="lazy" data-original="https://reeoo.xnny.net/Real Happiness Project from BBC Earth.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Real Happiness Project from BBC Earth" width="300"/>,
 <img alt="OPERA" class="lazy" data-original="https://reeoo.xnny.net/OPERA.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="OPERA" width="300"/>,
 <img alt="真如堂を楽しむ" class="lazy" data-original="https://reeoo.xnny.net/Kyoto Shin nyo-do.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="真如堂を楽しむ" width="300"/>]
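The same attrs lookup can be reproduced offline. In this sketch the first img is copied from the output above and the second one is made up, so that there is something that does not match:

```python
import re

from bs4 import BeautifulSoup

# First tag taken from the output above; the second (no data-original) is made up.
html = """
<img alt="Loop" class="lazy" data-original="https://reeoo.xnny.net/Loop.jpg!page" src="https://reeoo.com/assets/white.gif">
<img alt="Logo" src="https://reeoo.com/logo.png">
"""
soup = BeautifulSoup(html, "html.parser")

# Attribute names with a hyphen go into the attrs dictionary.
with_data = soup.find_all(attrs={"data-original": True})
jpgs = soup.find_all("img", attrs={"data-original": re.compile(r"\.jpg")})

# Collect the real image URLs stored in data-original.
urls = [img["data-original"] for img in with_data]
print(urls)       # ['https://reeoo.xnny.net/Loop.jpg!page']
print(len(jpgs))  # 1
```

Reading data-original rather than src is what a crawler would want on this page, since src only points at the white.gif placeholder used for lazy loading.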

Search for tags whose data-original attribute contains the string reeoo:

soup.find_all(attrs={'data-original':re.compile('reeoo')})

[<img alt="Travelshift" class="lazy" data-original="https://reeoo.xnny.net/Travelshift.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Travelshift" width="300"/>,
 <img alt="Loop" class="lazy" data-original="https://reeoo.xnny.net/Loop.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Loop" width="300"/>,
 <img alt="Programatório" class="lazy" data-original="https://reeoo.xnny.net/Programatorio.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="Programatório" width="300"/>,
 <img alt="Ultraviolet Way" class="lazy" data-original="https://reeoo.xnny.net/Ultraviolet Way.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Ultraviolet Way" width="300"/>,
 <img alt="みさとと。" class="lazy" data-original="https://reeoo.xnny.net/Misatoto Town.png!page" height="200" src="https://reeoo.com/assets/white.gif" title="みさとと。" width="300"/>,
 <img alt="Block Studio" class="lazy" data-original="https://reeoo.xnny.net/Block Studio.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Block Studio" width="300"/>,
 <img alt="Composition No. 24" class="lazy" data-original="https://reeoo.xnny.net/Composition No. 24.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Composition No. 24" width="300"/>,
 <img alt="Discovery Land Company" class="lazy" data-original="https://reeoo.xnny.net/Discovery Land Company.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Discovery Land Company" width="300"/>,
 <img alt="Hardies" class="lazy" data-original="https://reeoo.xnny.net/Hardies.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Hardies" width="300"/>,
 <img alt="Welch’s Fruit Snacks" class="lazy" data-original="https://reeoo.xnny.net/Welch's Fruit Snacks.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Welch’s Fruit Snacks" width="300"/>,
 <img alt="EXERON" class="lazy" data-original="https://reeoo.xnny.net/EXERON.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="EXERON" width="300"/>,
 <img alt="Pop Weaver" class="lazy" data-original="https://reeoo.xnny.net/Pop Weaver.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Pop Weaver" width="300"/>,
 <img alt="eDesign Interactive" class="lazy" data-original="https://reeoo.xnny.net/eDesign Interactive.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="eDesign Interactive" width="300"/>,
 <img alt="OBSOLETE" class="lazy" data-original="https://reeoo.xnny.net/OBSOLETE.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="OBSOLETE" width="300"/>,
 <img alt="Minibricks" class="lazy" data-original="https://reeoo.xnny.net/Minibricks.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Minibricks" width="300"/>,
 <img alt="Your Sport Agent" class="lazy" data-original="https://reeoo.xnny.net/Your Sport Agent.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Your Sport Agent" width="300"/>,
 <img alt="Modulz" class="lazy" data-original="https://reeoo.xnny.net/Modulz.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Modulz" width="300"/>,
 <img alt="Shift" class="lazy" data-original="https://reeoo.xnny.net/Shift.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="Shift" width="300"/>,
 <img alt="Rand" class="lazy" data-original="https://reeoo.xnny.net/Rand.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Rand" width="300"/>,
 <img alt="RappiPay" class="lazy" data-original="https://reeoo.xnny.net/RappiPay 2.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="RappiPay" width="300"/>,
 <img alt="Real Happiness Project from BBC Earth" class="lazy" data-original="https://reeoo.xnny.net/Real Happiness Project from BBC Earth.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="Real Happiness Project from BBC Earth" width="300"/>,
 <img alt="OPERA" class="lazy" data-original="https://reeoo.xnny.net/OPERA.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="OPERA" width="300"/>,
 <img alt="真如堂を楽しむ" class="lazy" data-original="https://reeoo.xnny.net/Kyoto Shin nyo-do.jpg!page" height="200" src="https://reeoo.com/assets/white.gif" title="真如堂を楽しむ" width="300"/>]

Search for tags whose data-original attribute equals a given value

soup.find_all(attrs={'data-original': 'https://reeoo.xnny.net/OBSOLETE.png!page'})

[<img alt="OBSOLETE" class="lazy" data-original="https://reeoo.xnny.net/OBSOLETE.png!page" height="200" src="https://reeoo.com/assets/white.gif" tile="OBSOLETE" width="300"/>]
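The two attrs searches above can be tried offline on a small hand-written snippet. This is only a minimal sketch: the HTML string and the image names in it are made up for illustration and are not taken from reeoo.com. The attrs dictionary is used because data-original contains a hyphen and so cannot be written as a Python keyword argument.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup imitating lazy-loaded images (not real page content)
html = """
<div>
  <img alt="A" data-original="https://reeoo.xnny.net/A.jpg!page" src="white.gif"/>
  <img alt="B" data-original="https://example.com/B.png" src="white.gif"/>
  <img alt="C" src="white.gif"/>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Regex filter: a match anywhere inside the attribute value is enough
partial = soup.find_all(attrs={'data-original': re.compile('reeoo')})

# String filter: the attribute value must equal the string exactly
exact = soup.find_all(attrs={'data-original': 'https://example.com/B.png'})

print([img['alt'] for img in partial])
print([img['alt'] for img in exact])
```

The image without a data-original attribute is skipped by both searches.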

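4.7、The string parameter

Works like the name parameter, but matches against the text content of the document instead of tag names.

Search for strings that contain Reeoo. The result is a list of the matching text nodes (NavigableString objects), not the tags that contain them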

soup.find_all(string=re.compile("Reeoo"))
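A minimal sketch of what a string search returns, using a made-up HTML fragment (an assumption for illustration, not the real page). Each result is a text node, and its .parent attribute leads back to the enclosing tag.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment; only two of the three text nodes mention Reeoo
html = '<p>Reeoo gallery</p><a href="#">About Reeoo</a><span>Other</span>'
soup = BeautifulSoup(html, 'html.parser')

# Returns the matching strings themselves, not tags
texts = soup.find_all(string=re.compile('Reeoo'))

# Walk up from each string to the tag that contains it
parents = [t.parent.name for t in texts]

print(texts)
print(parents)
```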

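4.8、The limit parameter

find_all() returns every match in the whole document, so searching a large document can take a long time. With limit set, the search stops and returns as soon as the number of results reaches the limit.

Search for div tags whose class is thumb, returning at most 3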

soup.find_all('div', class_='thumb', limit=3)

The result is a list of 3 elements, even though more than 3 tags in the document match.
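The effect of limit can be checked on generated markup. A small sketch, assuming a made-up document with five matching div tags:

```python
from bs4 import BeautifulSoup

# Hypothetical document containing five div.thumb elements
html = ''.join(f'<div class="thumb">item {i}</div>' for i in range(5))
soup = BeautifulSoup(html, 'html.parser')

# Stops after the first three matches in document order
first_three = soup.find_all('div', class_='thumb', limit=3)

# Without limit, every match is returned
all_thumbs = soup.find_all('div', class_='thumb')

print(len(first_three), len(all_thumbs))
```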

4.9、The recursive parameter

find_all() searches all descendants of the current tag. To search only the tag's direct children, pass recursive=False.
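A short sketch of the difference, using a made-up nested list (the markup and the id menu are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical nested list: the outer ul has two direct li children,
# and the first of them contains another li one level deeper
html = '<ul id="menu"><li>one<ul><li>nested</li></ul></li><li>two</li></ul>'
soup = BeautifulSoup(html, 'html.parser')
menu = soup.find('ul', id='menu')

# Default: every li descendant, at any depth
all_items = menu.find_all('li')

# recursive=False: only li tags that are direct children of menu
direct_items = menu.find_all('li', recursive=False)

print(len(all_items), len(direct_items))
```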

4.10、find()

find(name, attrs, recursive, string, **kwargs)

find() takes essentially the same arguments as find_all(), but returns only the first match. It is equivalent to calling find_all() with limit set to 1 and taking the single result.

soup.find_all('div', class_='thumb', limit=1)

soup.find('div', class_='thumb')

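Both calls find the same tag. The only difference is that find_all() returns a list containing it, while find() returns the element itself.

When no tag matches, find() returns None, while find_all() returns an empty list.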
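Because find() may return None, it is worth checking the result before using it. A minimal sketch on made-up markup; the class name no-such-class is a hypothetical value chosen so that nothing matches:

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with two matching div tags
html = '<div class="thumb">first</div><div class="thumb">second</div>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('div', class_='thumb')                 # a single Tag
as_list = soup.find_all('div', class_='thumb', limit=1)  # a list with one Tag

missing = soup.find('div', class_='no-such-class')           # None
missing_list = soup.find_all('div', class_='no-such-class')  # empty list

# Guard against None before calling methods on the result
text = missing.get_text() if missing is not None else ''

print(first.get_text(), len(as_list), missing, len(missing_list))
```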

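4.11、CSS selectors

Pass a CSS selector string to the select() method of a Tag or BeautifulSoup object to find tags using CSS selector syntax.

The semantics match CSS. Search for li tags inside ul tags inside article tags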

print(soup.select('article ul li'))

Search by class name. Both lines below return the same result: tags whose class includes thumb

soup.select('.thumb')

soup.select('[class~=thumb]')

Search by id: tags whose id is submenu

soup.select('#submenu')

Search by attribute presence: li tags that have an id attribute

soup.select('li[id]')

Search by attribute value: li tags whose class is sponsor

soup.select('li[class="sponsor"]')
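The selectors above can be compared side by side on a small made-up document. This is a sketch under assumptions: the markup is invented for illustration, and it relies on the soupsieve-backed select() bundled with recent Beautiful Soup 4 releases. Note that [class="sponsor"] compares the whole attribute value, so a tag with class="sponsor wide" is matched by .sponsor but not by the attribute form.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, loosely modelled on a menu inside an article
html = """
<article>
  <ul id="submenu">
    <li id="link_web" class="active">Web</li>
    <li class="sponsor">Ad</li>
    <li class="sponsor wide">Ad 2</li>
  </ul>
</article>
"""
soup = BeautifulSoup(html, 'html.parser')

descendants = soup.select('article ul li')        # descendant combinator
with_id = soup.select('li[id]')                   # attribute presence
exact_class = soup.select('li[class="sponsor"]')  # whole attribute value
word_class = soup.select('li.sponsor')            # any one of the classes
first_match = soup.select_one('#submenu')         # first match or None

print(len(descendants), len(with_id), len(exact_class), len(word_class))
print(first_match.name)
```

select_one() is the CSS counterpart of find(): it returns a single tag rather than a list.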

Other

Other search methods include:

find_parents() and find_parent()

find_next_siblings() and find_next_sibling()

find_previous_siblings() and find_previous_sibling()

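Their parameters work much like those of find_all() and find(), so their usage is not listed here. In practice, find_all() and find() already cover the vast majority of queries.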
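For reference, a brief sketch of how the navigation-based searches behave, starting from one tag in a made-up list (the markup and ids are assumptions for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical list with three sibling items
html = '<ul id="menu"><li id="a">A</li><li id="b">B</li><li id="c">C</li></ul>'
soup = BeautifulSoup(html, 'html.parser')
middle = soup.find('li', id='b')

# Search upwards through the ancestors
parent = middle.find_parent('ul')

# Search forwards and backwards among the siblings
next_one = middle.find_next_sibling('li')
previous_all = middle.find_previous_siblings('li')

print(parent['id'], next_one['id'], [t['id'] for t in previous_all])
```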

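There are also methods that modify the document tree. A crawler mostly just retrieves information from pages and rarely needs to change the page source, so they are not covered here either.

For full details, refer to the official Beautiful Soup documentation.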

Reposted from blog.csdn.net/u010472858/article/details/103483496