For work I needed some photos of pretty women to use as illustrations, so I searched online and found the Win4000 (美桌) site. Under its meinvtag*** tags there are high-resolution photos that fit my needs exactly, so I wrote a simple crawler to download them.
First, a look at how the site is laid out. Take the meinvtag2 tag as an example: it has 5 listing pages with URLs like http://www.win4000.com/meinvtag2_1.html, where the trailing number is the page index. This is the friendliest URL scheme a crawler could ask for, and each of the 5 pages can be handled by its own thread. The HTML source of each listing page contains the URL of every album, and opening an album lets you page through its images one by one. The image-page URLs are just as simple, e.g. http://www.win4000.com/meinv198397_2.html, but there is no way to tell in advance how many images an album contains. My approach is to loop over indices 1 through 50 and break out of the loop as soon as a page fails to load. The code is as follows:
import requests
from bs4 import BeautifulSoup
import threading

def download_img_from_url(path, url):
    # Write the raw image bytes straight to disk
    with open(path, 'wb') as f:
        f.write(requests.get(url).content)

def get_BS(url):
    # Fetch a page and return its parsed soup, or None on any request error
    try:
        html = requests.get(url)
        html.raise_for_status()
        return BeautifulSoup(html.text, "lxml")
    except requests.RequestException:
        return None

def download(i):
    page_url = page_url_format.format(i)
    bs = get_BS(page_url)
    lists = bs.find_all("div", {'class': 'tab_box'})
    tags = lists[1].find_all('a')  # the second tab_box holds the album links
    for tag in tags:
        album_url = tag.get('href')
        album_url = album_url[0:-5] + '_'  # drop ".html" so the image index can be appended
        for id in range(1, 51):  # image count is unknown, so probe pages 1 to 50
            img_page_url = album_url + str(id) + ".html"
            #print(img_page_url)
            bs2 = get_BS(img_page_url)
            if bs2:
                img = bs2.find("img", class_='pic-large')
                if not img:
                    break
                img_url = img.get('data-original')
                name = img_page_url[28:-5]  # e.g. "198397_2"
                download_img_from_url(save_path.format(name), img_url)
            else:
                break  # no more pages in this album

page_url_format = 'http://www.win4000.com/meinvtag2_{}.html'
save_path = 'D:\\image\\{}.jpg'
threads = []
for i in range(1, 6):  # one thread per listing page
    # Pass i via args; target=lambda: download(i) would capture i by reference,
    # so every thread could end up processing only the last page
    thread = threading.Thread(target=download, args=(i,))
    thread.start()
    threads.append(thread)
for thread in threads:
    thread.join()
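As an aside, the same one-thread-per-page fan-out can be written more compactly with the standard library's concurrent.futures, which passes arguments to the task explicitly and so sidesteps the lambda late-binding pitfall entirely. A minimal sketch, reusing the download(i) function above:

from concurrent.futures import ThreadPoolExecutor

# One worker per listing page; submit() takes the argument explicitly,
# so each task is bound to its own value of i.
with ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(download, i) for i in range(1, 6)]
    for future in futures:
        future.result()  # re-raises any exception from the worker thread

Calling result() at the end also surfaces errors that a bare Thread would otherwise swallow silently.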