Python crawler: scraping high-resolution images from a website

1. Scraping high-resolution images from 58pic (千图网)
```python
import urllib.request
import urllib.error
import re

for i in range(1, 10):
    # Pages 1-9 of the 58pic category listing
    pageurl = 'https://www.58pic.com/piccate/3-156-909-se1-p' + str(i) + '.html'
    data = urllib.request.urlopen(pageurl).read().decode("utf-8", "ignore")

    # Extract detail-page links with a regex (dots escaped to match literally)
    pat = r'(//www\.58pic\.com/newpic/.*?\.html)'
    imglist = re.compile(pat).findall(data)
    print(imglist)
```
Sample output for one page:

```
['//www.58pic.com/newpic/34666756.html', '//www.58pic.com/newpic/34664475.html', '//www.58pic.com/newpic/34664471.html', '//www.58pic.com/newpic/34664397.html', '//www.58pic.com/newpic/34664383.html', '//www.58pic.com/newpic/34663375.html', '//www.58pic.com/newpic/34663183.html', '//www.58pic.com/newpic/34662278.html', '//www.58pic.com/newpic/34480033.html', '//www.58pic.com/newpic/34479938.html', '//www.58pic.com/newpic/34479937.html', '//www.58pic.com/newpic/34479855.html', '//www.58pic.com/newpic/34479854.html', '//www.58pic.com/newpic/34479549.html', '//www.58pic.com/newpic/34479548.html', '//www.58pic.com/newpic/34479381.html', '//www.58pic.com/newpic/34479010.html', '//www.58pic.com/newpic/34478964.html', '//www.58pic.com/newpic/34478963.html', '//www.58pic.com/newpic/34432574.html', '//www.58pic.com/newpic/34432554.html', '//www.58pic.com/newpic/34432517.html', '//www.58pic.com/newpic/34426270.html', '//www.58pic.com/newpic/34426034.html', '//www.58pic.com/newpic/34425959.html', '//www.58pic.com/newpic/34425710.html', '//www.58pic.com/newpic/34425658.html', '//www.58pic.com/newpic/34425570.html', '//www.58pic.com/newpic/34425469.html', '//www.58pic.com/newpic/34425122.html', '//www.58pic.com/newpic/34424954.html', '//www.58pic.com/newpic/34424934.html', '//www.58pic.com/newpic/34424029.html', '//www.58pic.com/newpic/34424028.html', '//www.58pic.com/newpic/34423912.html']
```
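The pattern works because the non-greedy `.*?` stops at the first `.html`; escaping the dots makes them match literal periods rather than any character. A minimal offline check against a made-up HTML fragment (the snippet below is fabricated for illustration):

```python
import re

# A fabricated fragment mimicking the structure of the category page.
sample = ('<a href="//www.58pic.com/newpic/34666756.html">a</a>'
          '<a href="//www.58pic.com/newpic/34664475.html">b</a>')

# Dots escaped so "." matches a literal period, not any character.
pat = r'(//www\.58pic\.com/newpic/.*?\.html)'
links = re.findall(pat, sample)
print(links)
# → ['//www.58pic.com/newpic/34666756.html', '//www.58pic.com/newpic/34664475.html']
```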
The download loop (disabled in the original post by wrapping it in a triple-quoted string) nests inside the page loop above. Note that the extracted links are protocol-relative (`//...`), so a scheme must be prefixed before calling `urlretrieve`:

```python
    for j in range(0, len(imglist)):
        try:
            # Append the site's resize/watermark suffix to request a 1024px rendition
            thisimg = imglist[j] + "/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a0"
            # Alternative suffix: keeps the small portion the site forcibly crops off
            # thisimg = imglist[j] + "/fw/1024/watermark/url/L2ltYWdlcy93YXRlcm1hcmsvZGF0dS5wbmc=/repeat/true/crop/0x1024a0a1024"
            file = "F:/jupyterpycodes/python_pachongfenxi/result/" + str(i) + str(j) + ".jpg"
            urllib.request.urlretrieve("https:" + thisimg, filename=file)
            print("Page " + str(i) + ", image " + str(j) + " downloaded successfully")
        except urllib.error.URLError as e:
            if hasattr(e, "code"):
                print(e.code)
            if hasattr(e, "reason"):
                print(e.reason)
        except Exception as e:
            print(e)
```
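One fragility in the loop above is the hard-coded Windows output path: `urlretrieve` raises an error if the target directory does not exist. A small sketch that builds the output folder portably before downloading (the `result` directory name is an assumption carried over from the original path):

```python
import os

# Build the output directory relative to the working directory instead of
# hard-coding an absolute Windows path; create it if it is missing.
outdir = os.path.join(os.getcwd(), "result")
os.makedirs(outdir, exist_ok=True)

# Compose a filename the same way the loop does (page index + image index).
i, j = 1, 0
file = os.path.join(outdir, str(i) + "_" + str(j) + ".jpg")
```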



2. Packet capture analysis

Packet capture means intercepting the data packets sent and received over the network. When writing a crawler, the data you want is not necessarily in the HTML source; it may well be hidden behind other URLs. So to scrape such data, you first capture the traffic, work out which URL actually carries the data, then analyze the pattern and crawl it.
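Once a capture session reveals the hidden URL, the usual workflow is: request that URL, parse the response (often JSON), and pull out the fields you need. A sketch against a hypothetical captured response body (the endpoint shape and the field names `list`, `title`, and `img` are made up; a real site's API will differ):

```python
import json

# Hypothetical response body as it might appear in a capture tool's
# session inspector; real field names depend on the site's actual API.
captured = ('{"list": ['
            '{"title": "poster1", "img": "//cdn.example.com/a.jpg"}, '
            '{"title": "poster2", "img": "//cdn.example.com/b.jpg"}]}')

data = json.loads(captured)
# Image links are protocol-relative, so prefix a scheme before downloading.
urls = ["https:" + item["img"] for item in data["list"]]
print(urls)
# → ['https://cdn.example.com/a.jpg', 'https://cdn.example.com/b.jpg']
```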

3. Packet capture analysis with Fiddler

(For scraping data that is absent from the page source.) By default Fiddler only captures HTTP traffic, not HTTPS. To capture HTTPS you need to enable HTTPS decryption in Fiddler's settings and trust its root certificate.
Reference: https://ask.hellobi.com/blog/weiwei/5159
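Fiddler listens on 127.0.0.1:8888 by default, so a crawler's requests can be made visible in Fiddler's session list by routing them through that proxy. A minimal sketch using urllib's `ProxyHandler` (port 8888 assumes the default Fiddler configuration; for HTTPS, Fiddler's root certificate must also be trusted or certificate verification will fail):

```python
import urllib.request

# Fiddler's default listening address; adjust if you changed the port.
proxy = urllib.request.ProxyHandler({
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
})
opener = urllib.request.build_opener(proxy)
urllib.request.install_opener(opener)
# Subsequent urllib.request.urlopen(...) calls now flow through the proxy,
# so each request/response pair shows up in Fiddler for inspection.
```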



Reposted from blog.csdn.net/weixin_43412569/article/details/104855097