Brother Cong Teaches You Python: How to Scrape Pictures of Beautiful Women

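Today's topic is Python. Right now Python is most popular in artificial intelligence and data analysis. Here we'll talk about the data analysis side. So what is data analysis?

Put simply: you start from known data, analyze it, and draw a conclusion. That is data analysis.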
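To make "known data, analysis, conclusion" concrete, here is a tiny sketch. The file names are made up for illustration and the helper `most_common_extension` is a hypothetical name of mine; the point is only the shape of the process.

```python
from collections import Counter

# Hypothetical "known data": names of files a crawler might have saved
saved_files = ['a.jpg', 'b.jpg', 'c.png', 'd.jpg', 'e.gif']


def most_common_extension(names):
    """Analyze the names and conclude which extension dominates."""
    counts = Counter(name.rsplit('.', 1)[-1] for name in names)
    return counts.most_common(1)[0]


print(most_common_extension(saved_files))  # ('jpg', 3)
```

The data goes in, a count is the analysis, and "most of what we saved is jpg" is the conclusion.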

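Today I, Brother Cong, will use a simple crawler example to teach you how to scrape pictures of beautiful women, but before that I have a few odds and ends to get through.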

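This tutorial assumes some Python basics, an understanding of TCP/IP (HTTP in particular), and some experience with browser debugging tools or packet capture.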
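If you have never looked at a request in the browser's debugging tools, the sketch below shows roughly the same information from Python. It builds a request with the requests library but does not send it, so it runs offline. The User-Agent value and the helper name `preview_request` are placeholders of mine, not something the site requires.

```python
import requests

# Illustrative header value, not a real browser string
header = {'User-Agent': 'Mozilla/5.0 (illustrative value)'}


def preview_request(url, headers):
    """Build the request without sending it, so it can be inspected offline."""
    return requests.Request('GET', url, headers=headers).prepare()


prepared = preview_request('http://www.mzitu.com', header)
print(prepared.method, prepared.url)
for name, value in prepared.headers.items():
    print(name + ': ' + value)
```

What gets printed is the method, the URL, and the headers, which is the part of a request a crawler usually has to imitate.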

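Most important of all, of course, is a heart that wants to learn, one that is positive and driven.

Desire works too, of course. I once read a book called 《人类简史》 (Sapiens: A Brief History of Humankind). I didn't read it very deeply at the time, but it left me with a bold idea and a bold conclusion: the reason humans evolved and made it this far comes down to one word, "desire". That may have been going on for millions of years, though the experts' conclusions are not necessarily reliable either; this world has too many unknowns.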

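For those who don't yet know Python well, I recommend a tutorial, Liao Xuefeng's Python tutorial: https://www.liaoxuefeng.com/wiki/0014316089557264a6b348958f449949df42a6d3a2e542c000

It has both examples and theory, and combines the two.

My personal advice for working through it is a simple strategy: read + practice.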

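Read + practice is aimed at people with some programming background, say you've learned C/C++, or PHP, hailed as the most powerful language in the world. Some programming background helps a great deal. Even more important, of course, is interest. I recall some master once said: interest is the best teacher. I think that if a person wants to go far down the technical road, interest is a very important factor.

You can look at this interest from several angles, though.

For example, you genuinely love the language from the bottom of your heart. Or you've had enough of PHP's twisted syntax and find Python ever so approachable.

Or you've had enough of C's many hard-to-tame features, for example the pain of procedural ("structure-oriented") programming being less practical than object-oriented programming. Structure-oriented: just hearing the word is a turn-off. What's the fun in structure? Objects are more practical. Hear the word "object" and there's one word for it: great.

Or say you're interested in a certain something. Let's skip the analogy and say it straight: you're very interested in pictures of beautiful women and can't sleep without looking at them every day. I remember a former classmate of mine was like that. Rather than racking your brains searching everywhere every day, why not write a crawler, scrape pictures in bulk, and upload them to Baidu Cloud or some other cloud storage, to look at whenever you want? How nice. You see, if such a person channeled that interest into learning, I won't say he'd be sure to achieve something great, but at the very least an annual salary of a million would not be a dream.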

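Now to the main topic (time to paste the code, clack clack clack. A little humor there. I recall a master named YOU-something once said: a person with no sense of humor is a truly frightening thing):

test001.py (single process)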

# coding=utf-8
import os
import sys

import requests
from bs4 import BeautifulSoup

# The Android build needed the two lines below. They are Python 2 only
# (neither exists on Python 3), so they stay commented out.
# reload(sys)
# sys.setdefaultencoding('utf-8')

if os.name == 'nt':
    print('You are on Windows')
else:
    print('You are on Linux')

# HTTP request headers
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
all_url = 'http://www.mzitu.com'
start_html = requests.get(all_url, headers=header)

# Directory where the images are saved
path = 'D:/test/'

# Find the highest page number: the second-to-last pagination link
soup = BeautifulSoup(start_html.text, "html.parser")
page = soup.find_all('a', class_='page-numbers')
max_page = page[-2].text

same_url = 'http://www.mzitu.com/page/'
for n in range(1, int(max_page) + 1):
    ul = same_url + str(n)
    start_html = requests.get(ul, headers=header)
    soup = BeautifulSoup(start_html.text, "html.parser")
    all_a = soup.find('div', class_='postlist').find_all('a', target='_blank')
    for a in all_a:
        title = a.get_text()  # link text, used as the album title
        if title != '':
            print("About to scrape: " + title)

            # Windows cannot create a directory whose name contains '?'
            folder = path + title.strip().replace('?', '')
            if os.path.exists(folder):
                flag = 1  # directory already exists
            else:
                os.makedirs(folder)
                flag = 0
            os.chdir(folder)
            href = a['href']
            html = requests.get(href, headers=header)
            mess = BeautifulSoup(html.text, "html.parser")
            pic_max = mess.find_all('span')
            # Highest page number of the album, read from the 11th <span>
            # (this depends on the page layout, so it is fragile)
            pic_max = pic_max[10].text
            if flag == 1 and len(os.listdir(folder)) >= int(pic_max):
                print('Already saved, skipping')
                continue
            for num in range(1, int(pic_max) + 1):
                pic = href + '/' + str(num)
                html = requests.get(pic, headers=header)
                mess = BeautifulSoup(html.text, "html.parser")
                pic_url = mess.find('img', alt=title)
                html = requests.get(pic_url['src'], headers=header)
                file_name = pic_url['src'].split('/')[-1]
                # The with-block closes the file even if the write fails
                with open(file_name, 'wb') as f:
                    f.write(html.content)
            print('Done')
    print('Page', n, 'done')
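Why does the script take `page[-2]` for the highest page number? Here is a minimal sketch of the idea, run against a small HTML snippet. The snippet is hypothetical markup that I modeled on what the script expects, not a copy of the real page, and `find_max_page` is a name of mine.

```python
from bs4 import BeautifulSoup

# Hypothetical pagination markup, modeled on what the script expects
SAMPLE_HTML = """
<div class="nav-links">
  <span class="page-numbers current">1</span>
  <a class="page-numbers" href="/page/2/">2</a>
  <a class="page-numbers" href="/page/3/">3</a>
  <span class="page-numbers dots">...</span>
  <a class="page-numbers" href="/page/12/">12</a>
  <a class="next page-numbers" href="/page/2/">Next</a>
</div>
"""


def find_max_page(html):
    """Return the highest page number found in the pagination links."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.find_all('a', class_='page-numbers')
    # links[-1] is the "Next" link, so the last numbered link is links[-2]
    return int(links[-2].text)


print(find_max_page(SAMPLE_HTML))  # 12
```

If the site ever changes its pagination layout, this index breaks, so taking the largest numeric link text would probably be the sturdier choice.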

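If this Python script fails when you run it, complaining that the requests module is not installed,

then you can install the missing dependency with pip install requests (the bs4 import comes from pip install beautifulsoup4). pip has something in common with npm in Node.js: both act as dependency managers. Or, put another way, pip's approach is quite similar to installing software on Ubuntu with sudo apt-get install. Exactly how they differ is not the point here. The other thing I want to pass on is a bit of IT philosophy: however much technology changes, grasp its essence and you can meet every change with constancy.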
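Rather than waiting for the script to crash, you can check up front whether the dependencies are there. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper name of mine.

```python
import importlib.util


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# 'bs4' is the import name of the beautifulsoup4 package
missing = missing_modules(['requests', 'bs4'])
if missing:
    print('Missing modules:', ', '.join(missing))
else:
    print('All dependencies are installed')
```

Anything it reports as missing is what you hand to pip install, keeping in mind that the package name (beautifulsoup4) can differ from the import name (bs4).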

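Of course, meeting every change with constancy does not mean you stop learning. Learning is something you have to do your whole life. A boy growing into a man, for instance, is also a kind of learning.

Learning is everywhere; work it out for yourselves.

test002.py (multi-process)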

# coding=utf-8
import os
import sys
from multiprocessing import Pool

import requests
from bs4 import BeautifulSoup


def find_MaxPage(header):
    all_url = 'http://www.mzitu.com'
    start_html = requests.get(all_url, headers=header)
    # Find the highest page number: the second-to-last pagination link
    soup = BeautifulSoup(start_html.text, "html.parser")
    page = soup.find_all('a', class_='page-numbers')
    max_page = page[-2].text
    return max_page


def Download(href, header, title, path):
    html = requests.get(href, headers=header)
    soup = BeautifulSoup(html.text, 'html.parser')
    pic_max = soup.find_all('span')
    pic_max = pic_max[10].text  # highest page number of the album
    # Windows cannot create a directory whose name contains '?'
    folder = path + title.strip().replace('?', '')
    if os.path.exists(folder) and len(os.listdir(folder)) >= int(pic_max):
        print('Already complete, skipping ' + title)
        return 1
    print("Start scraping: " + title)
    # exist_ok avoids a crash when a half-downloaded directory already exists
    os.makedirs(folder, exist_ok=True)
    os.chdir(folder)
    for num in range(1, int(pic_max) + 1):
        pic = href + '/' + str(num)
        html = requests.get(pic, headers=header)
        mess = BeautifulSoup(html.text, "html.parser")
        pic_url = mess.find('img', alt=title)
        html = requests.get(pic_url['src'], headers=header)
        file_name = pic_url['src'].split('/')[-1]
        with open(file_name, 'wb') as f:
            f.write(html.content)
    print('Done: ' + title)


def download(href, header, title):
    # Debug helper, not called by the main flow: prints an album's page count
    html = requests.get(href, headers=header)
    soup = BeautifulSoup(html.text, 'html.parser')
    pic_max = soup.find_all('span')
    pic_max = pic_max[10].text  # highest page number of the album
    print(pic_max)


# The Android build needed the two lines below. They are Python 2 only
# (neither exists on Python 3), so they stay commented out.
# reload(sys)
# sys.setdefaultencoding('utf-8')


if __name__ == '__main__':
    if os.name == 'nt':
        print('You are on Windows')
    else:
        print('You are on Linux')

    # HTTP request headers
    header = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 UBrowser/6.1.2107.204 Safari/537.36'}
    path = 'D:/test/'
    max_page = find_MaxPage(header)
    same_url = 'http://www.mzitu.com/page/'

    # Number of worker processes in the pool
    pool = Pool(5)
    for n in range(1, int(max_page) + 1):
        each_url = same_url + str(n)
        start_html = requests.get(each_url, headers=header)
        soup = BeautifulSoup(start_html.text, "html.parser")
        all_a = soup.find('div', class_='postlist').find_all('a', target='_blank')
        for a in all_a:
            title = a.get_text()  # link text, used as the album title
            if title != '':
                href = a['href']
                pool.apply_async(Download, args=(href, header, title, path))
    pool.close()
    pool.join()
    print('All pictures downloaded')
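One thing to know about `pool.apply_async`: if the worker function raises, nothing is printed unless you ask for the result. The sketch below shows one way to collect the results and the failures. `fake_download` is a stand-in I made up so the example runs without the network, and `run_all` is a hypothetical helper, not part of the script above.

```python
from multiprocessing import Pool


def fake_download(title):
    """Stand-in for Download(): fails for one title to simulate a bad page."""
    if title == 'broken':
        raise ValueError('no <img> found for ' + title)
    return title


def run_all(titles, workers=2):
    """Submit every title to a process pool and collect results and errors."""
    done, failed = [], []
    with Pool(workers) as pool:
        pending = [(t, pool.apply_async(fake_download, args=(t,))) for t in titles]
        for title, result in pending:
            try:
                # get() re-raises in the parent whatever the worker raised
                done.append(result.get(timeout=30))
            except Exception as exc:
                failed.append((title, str(exc)))
    return done, failed


if __name__ == '__main__':
    done, failed = run_all(['a', 'broken', 'b'])
    print('done:', done)
    print('failed:', failed)
```

Keeping the `AsyncResult` objects and calling `get()` on each is what surfaces the errors; the script above throws those objects away, so a failed album simply goes missing without a word.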

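When the first script finishes, you'll be puzzled: why can't any of the downloaded images be opened? The files are right there, yet there's nothing to see, and it's instantly frustrating.

Then you notice there's a second script, so you run it, only to find that apart from single-process versus multi-process execution there is no difference between the two.

The reason is actually simple: there is one thing you've overlooked, and that is hotlink protection. You can think of hotlink protection as an anti-crawler measure. Judging by the fix in the last script, the image server looks at the Referer request header and won't hand over the real image to a request that doesn't appear to come from its own pages. Crawlers were hugely popular a few years ago, and quite a few people reached financial freedom through them. But as more and more people learned to write crawlers, the websites, being no fools, weren't going to let themselves be led by the nose forever; some anti-hotlinking strategy is a must.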
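What a hotlink check does on the server side can be sketched in a few lines. This is only my guess at the general idea, not the site's actual code: the whitelist `ALLOWED_HOSTS` and the function `server_would_serve` are assumptions for illustration.

```python
from urllib.parse import urlparse

# Assumed whitelist; a real site decides this on its own servers
ALLOWED_HOSTS = {'www.mzitu.com', 'i.meizitu.net'}


def server_would_serve(headers):
    """Mimic a hotlink check: serve the image only for a trusted Referer."""
    referer = headers.get('Referer', '')
    return urlparse(referer).hostname in ALLOWED_HOSTS


plain = {'User-Agent': 'Mozilla/5.0'}
with_referer = {'User-Agent': 'Mozilla/5.0', 'Referer': 'http://i.meizitu.net'}
print(server_would_serve(plain))         # False
print(server_would_serve(with_referer))  # True
```

The first two scripts send only a User-Agent, which corresponds to the `plain` case here.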

With the final script below, you'll find that once it has finished running you can enjoy yourself to your heart's content, hehehe.

test003.py

# coding=utf-8
import os

import requests
from bs4 import BeautifulSoup

all_url = 'http://www.mzitu.com'

# HTTP request headers for the HTML pages
Hostreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://www.mzitu.com'
}
# HTTP request headers for the image files;
# this Referer is what gets past the hotlink protection
Picreferer = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
    'Referer': 'http://i.meizitu.net'
}

start_html = requests.get(all_url, headers=Hostreferer)

# Directory where the images are saved
path = 'D:/test/'

# Find the highest page number: the second-to-last pagination link
soup = BeautifulSoup(start_html.text, "html.parser")
page = soup.find_all('a', class_='page-numbers')
max_page = page[-2].text

same_url = 'http://www.mzitu.com/page/'
for n in range(1, int(max_page) + 1):
    ul = same_url + str(n)
    start_html = requests.get(ul, headers=Hostreferer)
    soup = BeautifulSoup(start_html.text, "html.parser")
    all_a = soup.find('div', class_='postlist').find_all('a', target='_blank')
    for a in all_a:
        title = a.get_text()  # link text, used as the album title
        if title != '':
            print("About to scrape: " + title)

            # Windows cannot create a directory whose name contains '?'
            folder = path + title.strip().replace('?', '')
            if os.path.exists(folder):
                flag = 1  # directory already exists
            else:
                os.makedirs(folder)
                flag = 0
            os.chdir(folder)
            href = a['href']
            html = requests.get(href, headers=Hostreferer)
            mess = BeautifulSoup(html.text, "html.parser")
            pic_max = mess.find_all('span')
            # Highest page number of the album, read from the 11th <span>
            # (this depends on the page layout, so it is fragile)
            pic_max = pic_max[10].text
            if flag == 1 and len(os.listdir(folder)) >= int(pic_max):
                print('Already saved, skipping')
                continue
            for num in range(1, int(pic_max) + 1):
                pic = href + '/' + str(num)
                html = requests.get(pic, headers=Hostreferer)
                mess = BeautifulSoup(html.text, "html.parser")
                pic_url = mess.find('img', alt=title)
                print(pic_url['src'])
                # The image request carries the image host's Referer
                html = requests.get(pic_url['src'], headers=Picreferer)
                file_name = pic_url['src'].split('/')[-1]
                # The with-block closes the file even if the write fails
                with open(file_name, 'wb') as f:
                    f.write(html.content)
            print('Done')
    print('Page', n, 'done')
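The scripts strip only '?' from the album title before using it as a directory name, but Windows rejects a few more characters than that. A slightly more careful version could look like the sketch below; `safe_dirname` is a hypothetical helper of mine, not something the scripts define.

```python
import re

# Characters Windows does not allow in file or directory names
FORBIDDEN = r'[<>:"/\\|?*]'


def safe_dirname(title):
    """Strip whitespace and drop every character Windows would reject."""
    return re.sub(FORBIDDEN, '', title.strip())


print(safe_dirname(' What? A "title": part 1/2 '))  # What A title part 12
```

Swapping this in for `title.strip().replace('?', '')` should keep `os.makedirs` from tripping over titles that contain a colon or a slash.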

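The final result is shown in the figure:

Summary: one last point to stress. The result is not what matters most; what matters most is what you learned along the way.

In a word, learning needs a clear goal; that way you learn faster. Also, the figure above is only a case study. I hope it stirs up some enthusiasm for learning among my fellow IT folks and makes everyone's IT road a little smoother. If it does, then I, Brother Cong, will feel it was worth it.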
