Downloading Geek Comics: A Practical Beautiful Soup Example

Contents

I. Background

II. Approach

III. Result

IV. Design

V. Implementation Details

VI. Complete Code

VII. Program Output

VIII. Appendix

IX. Summary

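I. Background

XKCD is a webcomic of romance, sarcasm, math, and language. Below, we use two powerful and widely used Python libraries, Requests and Beautiful Soup, to download every comic on the site to a local folder in about 30 lines of code.

Official site: https://xkcd.com/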

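II. Approach

The home page has a Prev button that takes the user to the previous comic. That comic's page also has a Prev button, and so on. Together these links form a trail that leads from the most recent page back to the very first page of the site.

The program needs to do the following:
1. Load the XKCD home page.
2. Save the comic image on that page.
3. Follow the link to the previous comic.
4. Repeat until it reaches the first comic.

The corresponding code structure:

1. Download the page with the requests module.
2. Use Beautiful Soup to find the URL of the comic image in the page.
3. Download the comic image and save it to disk using iter_content().
4. Find the URL of the previous comic's link, then repeat.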
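
The control flow of these steps can be tried without touching the network. The following is a minimal sketch, assuming a hypothetical in-memory dictionary FAKE_SITE that maps each page URL to the href of its Prev link. The names FAKE_SITE and walk_back, and the href values, are illustrative only and are not part of the original program.

```python
# Hypothetical stand-in for the site: page URL -> href of its Prev link.
# The first comic's Prev link is assumed to be '#', which ends the walk.
FAKE_SITE = {
    'https://xkcd.com/': '/2/',
    'https://xkcd.com/2/': '/1/',
    'https://xkcd.com/1/': '#',
}


def walk_back(start_url, site):
    """Return the pages visited while following Prev links from start_url."""
    visited = []
    url = start_url
    while not url.endswith('#'):
        visited.append(url)                   # steps 1-2 would download and save here
        url = 'https://xkcd.com' + site[url]  # step 3: move to the previous comic
    return visited


pages = walk_back('https://xkcd.com/', FAKE_SITE)
print(pages)
```

The loop stops as soon as the next URL ends with '#', which is the same stop condition the real program uses.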

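III. Result

IV. Design

Open the home page, press F12 to open the developer tools, and inspect the elements on the page. You will find that the comic image is an <img> element inside a <div> element whose id is comic, and that the Prev button is an <a> element whose rel attribute is prev.

Code skeleton: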

import requests, os, bs4

url = 'https://xkcd.com/'            # starting URL
os.makedirs('xkcd', exist_ok=True)   # create the folder that stores the downloaded images
while not url.endswith('#'):
    # TODO: Download the page.
    # TODO: Find the URL of the comic image.
    # TODO: Download the image.
    # TODO: Save the image to ./xkcd.
    # TODO: Get the Prev button's url.
    break  # placeholder so the skeleton runs; remove once the steps are filled in
print('Done.')

V. Implementation Details

1. Step one: download the web page

# Download the page.
print('Downloading page %s...' % url)
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')

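Notes:

First, the program prints url so that the user knows which URL is about to be downloaded.

It then downloads the page with the requests module's requests.get() function.

Next, it calls the Response object's raise_for_status() method. If the download failed, this raises an exception and the program stops; otherwise, the program creates a BeautifulSoup object from the text of the downloaded page.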
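
To see what raise_for_status() does without going online, here is a small sketch that builds a requests.Response object by hand and sets its status code. Constructing a Response manually is only for demonstration (in the real program it is returned by requests.get()), and the helper name describe is hypothetical.

```python
import requests


def describe(status_code):
    """Build a bare Response with the given status and report what raise_for_status() does."""
    res = requests.Response()       # normally returned by requests.get()
    res.status_code = status_code
    try:
        res.raise_for_status()      # raises requests.HTTPError for 4xx and 5xx codes
    except requests.HTTPError:
        return 'error'
    return 'ok'


print(describe(200))
print(describe(404))
```

In the downloader the exception is deliberately left uncaught, so a failed download stops the program instead of letting it save a broken file.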

2. Find and download the comic image

# Find the URL of the comic image.
comicElem = soup.select('#comic img')
if not comicElem:
    print('Could not find comic image.')
else:
    comicUrl = 'https:' + comicElem[0].get('src')
    # Download the image.
    print('Downloading image %s...' % (comicUrl))
    res = requests.get(comicUrl)
    res.raise_for_status()

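Notes:

Inspecting the XKCD home page with the developer tools shows that the comic's <img> element sits inside a <div> element whose id attribute is comic. The selector '#comic img' therefore picks the right <img> element out of the BeautifulSoup object. The selector returns a list containing one <img> element; if nothing matches, the list is empty and the program prints an error message. You can read the src attribute from the <img> element and pass it to requests.get() to download the comic's image file.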
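
The selector can be tried on a small HTML string instead of a live page. This is a minimal sketch, assuming a simplified, made-up fragment (SAMPLE_HTML, with a placeholder host) that only imitates the structure described above. The protocol-relative src is also an assumption, suggested by the 'https:' prefix the program adds.

```python
import bs4

# Hypothetical, simplified stand-in for the relevant part of a comic page.
SAMPLE_HTML = '''
<div id="comic">
  <img src="//imgs.example.com/comics/sample.png" alt="Sample">
</div>
'''

soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')
comic_elems = soup.select('#comic img')   # list of matching <img> tags
if not comic_elems:
    comic_url = None                      # nothing matched; skip this page
else:
    # The src here starts with '//', so a scheme is prepended.
    comic_url = 'https:' + comic_elems[0].get('src')

print(comic_url)
```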

3. Save the image and find the previous comic

# Save the image to ./xkcd.
imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
for chunk in res.iter_content(100000):
    imageFile.write(chunk)
imageFile.close()
# Get the Prev button's url.
prevLink = soup.select('a[rel="prev"]')[0]
url = 'https://xkcd.com' + prevLink.get('href')

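Notes:

At this point, the comic's image data is held in the variable res. You need to write that data to a file on disk.

First, prepare a file name for the local image file and pass it to open().

The value of comicUrl looks something like 'http://imgs.****/comics/heartbleed_explanation.png', which resembles a file path. In fact, calling os.path.basename() with comicUrl returns just the last part of the URL: 'heartbleed_explanation.png'. You can use this as the file name when saving the image to disk.

os.path.join() joins this name with the name of the xkcd folder, so the program uses backslashes (\) on Windows and forward slashes (/) on macOS and Linux.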
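
As a quick illustration of the file name handling, the sketch below applies os.path.basename() and os.path.join() to a sample URL. The host imgs.example.com is a placeholder, and the urlsplit() variant is an optional extra that the original program does not use.

```python
import os
from urllib.parse import urlsplit

comic_url = 'https://imgs.example.com/comics/heartbleed_explanation.png'  # placeholder host

file_name = os.path.basename(comic_url)        # last component of the URL
local_path = os.path.join('xkcd', file_name)   # separator chosen by the operating system

# Variant: take the basename of the URL path only, so that a query string
# such as '?v=2' does not end up in the file name.
safe_name = os.path.basename(urlsplit(comic_url + '?v=2').path)

print(file_name)
print(local_path)
print(safe_name)
```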

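With the file name in hand, call open() to open a new file in 'wb' (write binary) mode.

Loop over the return value of the iter_content() method. The code in the for loop writes the image data to the file one chunk at a time (at most 100,000 bytes each), and the file is then closed. The image is now saved to disk.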
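
The chunked write can be demonstrated offline. In the sketch below, iter_chunks is a hypothetical stand-in for res.iter_content(), and the "image" is just a block of zero bytes written to a temporary folder. It also uses a with statement, an alternative to the explicit close() call that closes the file even if a write fails.

```python
import os
import tempfile


def iter_chunks(data, chunk_size):
    """Stand-in for res.iter_content(): yield data in pieces of at most chunk_size bytes."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


fake_image = bytes(250000)   # 250,000 zero bytes standing in for image data

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'sample.png')
    chunk_sizes = []
    with open(path, 'wb') as image_file:       # closed automatically on exit
        for chunk in iter_chunks(fake_image, 100000):
            image_file.write(chunk)
            chunk_sizes.append(len(chunk))
    saved_size = os.path.getsize(path)

print(chunk_sizes)
print(saved_size)
```

Writing in chunks keeps memory use bounded by the chunk size rather than by the size of the whole file.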

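The selector 'a[rel="prev"]' identifies the <a> element whose rel attribute is set to prev. The href attribute of that <a> element gives the URL of the previous comic, which is stored in url. The while loop then starts the whole download process again for that comic.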
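
The Prev lookup can also be exercised on sample HTML. This is a sketch under stated assumptions: the two fragments are made up, the href values ('/1/' and '#') are assumed to resemble what the site serves, and prev_comic_url is a hypothetical helper. Unlike the original line, it returns None instead of raising IndexError when no Prev link is present.

```python
import bs4

# Hypothetical navigation fragments for a middle page and for the first comic.
MIDDLE_PAGE = '<ul><li><a rel="prev" href="/1/">&lt; Prev</a></li></ul>'
FIRST_PAGE = '<ul><li><a rel="prev" href="#">&lt; Prev</a></li></ul>'


def prev_comic_url(html):
    """Return the URL of the previous comic, or None if no Prev link is found."""
    soup = bs4.BeautifulSoup(html, 'html.parser')
    prev_links = soup.select('a[rel="prev"]')
    if not prev_links:
        return None
    return 'https://xkcd.com' + prev_links[0].get('href')


print(prev_comic_url(MIDDLE_PAGE))
print(prev_comic_url(FIRST_PAGE))
```

The URL built for the first comic ends with '#', which is exactly what the while condition tests for.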

VI. Complete Code

import requests, os, bs4

url = 'https://xkcd.com/'            # starting URL
os.makedirs('xkcd', exist_ok=True)   # store comics in ./xkcd

while not url.endswith('#'):
    # Download the page.
    print('Downloading page %s...' % url)
    res = requests.get(url)
    res.raise_for_status()
    soup = bs4.BeautifulSoup(res.text, 'html.parser')

    # Find the URL of the comic image.
    comicElem = soup.select('#comic img')
    if not comicElem:
        print('Could not find comic image.')
    else:
        comicUrl = 'https:' + comicElem[0].get('src')
        # Download the image.
        print('Downloading image %s...' % comicUrl)
        res = requests.get(comicUrl)
        res.raise_for_status()

        # Save the image to ./xkcd. This stays inside the else branch so that
        # a page without a comic image is skipped instead of being saved.
        imageFile = open(os.path.join('xkcd', os.path.basename(comicUrl)), 'wb')
        for chunk in res.iter_content(100000):
            imageFile.write(chunk)
        imageFile.close()

    # Get the Prev button's url.
    prevLink = soup.select('a[rel="prev"]')[0]
    url = 'https://xkcd.com' + prevLink.get('href')

print('Done.')

VII. Program Output

VIII. Appendix

CSS selector examples

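Selector passed to select()	Matches ...
soup.select('div')	All <div> elements
soup.select('#author')	The element with an id attribute of author
soup.select('.notice')	All elements with a CSS class of notice
soup.select('div span')	All <span> elements anywhere inside a <div> element
soup.select('div > span')	All <span> elements directly inside a <div> element, with no other element in between
soup.select('input[name]')	All <input> elements that have a name attribute with any value
soup.select('input[type="button"]')	All <input> elements that have a type attribute with the value button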
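
The selectors in the table can be checked against a small document. The following sketch assumes a made-up HTML fragment (SAMPLE_HTML) and simply counts how many elements each selector matches.

```python
import bs4

# Hypothetical fragment containing one example of each pattern in the table.
SAMPLE_HTML = '''
<div>
  <span id="author" class="notice">Al</span>
  <p><span>nested</span></p>
</div>
<input name="q" type="text">
<input type="button" value="Go">
'''

soup = bs4.BeautifulSoup(SAMPLE_HTML, 'html.parser')

selectors = ['div', '#author', '.notice', 'div span', 'div > span',
             'input[name]', 'input[type="button"]']

# Map each selector to the number of elements it matches.
counts = {selector: len(soup.select(selector)) for selector in selectors}

for selector, count in counts.items():
    print(selector, count)
```

Note the difference between 'div span', which also matches the <span> nested inside the <p>, and 'div > span', which matches only the direct child.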

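IX. Summary

This project is a good example of a program that automatically follows links to scrape large amounts of data from the web. It also gives a first look at how Beautiful Soup is used. For more features, see the Beautiful Soup documentation.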
