Scraping Hupu 步行街 for the 秋名山论美 Beauty Wallpaper Images

I've recently been learning web scraping, and since I browse Hupu all the time, I decided to put it into practice by scraping the beauty wallpapers posted there. Here is a write-up of the experience.

The project mainly uses the requests and BeautifulSoup libraries, with urllib handling the image downloads.

First, open Hupu 步行街 and search for the keyword "秋名山论美"; you get a results page like this:


Press F12 to open the browser inspector, switch to the Network tab, refresh the page, and copy the request headers. Observing the URL, the 1 in &page=1 is the page number, so we can loop over it to build the URL of every list page:

for i in range(1, 17):
    url = ('https://my.hupu.com/search?q=%E3%80%90%E7%A7%8B%E5%90%8D%E5%B1%B1%E8%AE%BA%E7%BE%8E%E3%80%91'
           + '&page=' + str(i))
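
All of the requests.get calls below assume a headers dict carrying the User-Agent copied from DevTools; the exact value depends on your own browser, but at a minimum it looks like the one used in the full source at the end:

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36'}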

Next, right-click the title of each issue to see where it sits in the page source.


Use BeautifulSoup's CSS selector support (select) to extract each thread's URL:

response = requests.get(url=url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
# each search-result row carries the thread link in its .p_title cell
img_ = soup.select('.mytopic.topiclisttr tbody tr .p_title a')
for _url in img_:
    img_url = _url['href']
    url_list.append(img_url)    # url_list = [] is initialised before the page loop

Then click into one issue of 秋名山论美, right-click a wallpaper image and inspect it to locate the images in the DOM, and pick them out with a selector in the same way:

response = requests.get(url=url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
# the wallpapers sit inside quoted blocks on each floor of the thread
img_ = soup.select(".floor .floor-show .floor_box tbody tr td .quote-content img")

Create a folder for the images. Observing the image URLs shows that the part before the ? is the actual image address, so split on '?' to extract it, then download and save each file:

if not os.path.exists(path):
    os.mkdir(path)
os.chdir(path)

try:
    # the part before '?' is the actual image URL
    if '?' in img['data-original']:
        img_url = img['data-original'].split('?')[0]
    else:
        continue
except KeyError:        # some <img> tags carry no data-original at all
    print(i)
    continue
try:
    content = urllib.request.urlopen(img_url)
except urllib.error.HTTPError:      # skip images that fail to download
    continue
content = content.read()
with open(name, 'wb') as f:
    f.write(content)
time.sleep(0.2)         # small pause so we don't hammer the server
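
For illustration, the split behaves like this (the URL below is a made-up example of the pattern, not a real Hupu address):

img_url = 'https://img.hupucdn.example/pic/wallpaper.jpg?x-oss-process=image/resize'   # hypothetical URL
print(img_url.split('?')[0])   # -> https://img.hupucdn.example/pic/wallpaper.jpg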

Finally, here is the full source code for reference:

import requests
import urllib.request
import urllib.error
import os
import re
import time
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 UBrowser/6.2.3964.2 Safari/537.36'}
url_list = []
for i in range(1, 17):    # 16 result pages; some titles write the issue number in Chinese numerals (15 issues in total), so number_i can be used for naming instead
    url = ('https://my.hupu.com/search?q=%E3%80%90%E7%A7%8B%E5%90%8D%E5%B1%B1%E8%AE%BA%E7%BE%8E%E3%80%91'
           + '&page=' + str(i))

    response = requests.get(url=url,headers=headers)
    soup = BeautifulSoup(response.content,'lxml')
    img_ = soup.select('.mytopic.topiclisttr tbody tr .p_title a')
    for _url in img_:
        img_url = _url['href']
        url_list.append(img_url)

# number_i = 15    # alternative: count issues down manually instead of parsing the title
for url in url_list:
    response = requests.get(url=url,headers=headers)
    soup = BeautifulSoup(response.content,'lxml')
    img_ = soup.select(".floor .floor-show .floor_box tbody tr td .quote-content img")

    # *********************************************************************
    # extract the issue number from the thread title
    title = soup.select(".subhead span")[0].string
    s = ''.join(title)          # convert the title to a plain str
    print(s)
    try:
        number = re.findall(r'(\d+)', s)[0]
    except IndexError:          # title has no Arabic numeral; skip this thread
        continue
    # number = str(number_i)
    # number_i-=1
    path = os.path.join(r'E:\project\pachong', number)    # one folder per issue
    # ************************************************************************
    if not os.path.exists(path):
        os.mkdir(path)
    os.chdir(path)
    for i, img in enumerate(img_):
        i = str(i)
        print(img)

        try:
            # the part before '?' is the actual image URL
            if '?' in img['data-original']:
                img_url = img['data-original'].split('?')[0]
            else:
                continue
        except KeyError:        # <img> without a data-original attribute
            print(i)
            continue
        print(i, img_url)
        name = number + '_' + i + '.' + 'jpg'
        try:
            content = urllib.request.urlopen(img_url)
        except urllib.error.HTTPError:      # skip images that fail to download
            continue
        content = content.read()
        with open(name, 'wb') as f:
            f.write(content)
        time.sleep(0.2)         # small pause between images
    time.sleep(1)               # pause between threads
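
If everything runs, each issue should end up in its own folder under E:\project\pachong, named after the issue number parsed from the thread title, with files like 10\10_0.jpg, 10\10_1.jpg, and so on.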

Reposted from blog.csdn.net/qq_34180674/article/details/80775417