(Otaku perks) A Python crawler script: grabbing 1000+ photo sets from the Tuwanjun sharing program

Xiao Xia is back with another post. At the end of last semester I crawled the photos of every grad student at my school, hahahaha, and the results proved that good-looking girls at Southeast University are a biiit scarce; everyone here gets by on sheer ability (said purely in self-preservation)~~
PS: If you have friends at schools around Nanjing, please, please share student accounts for the various schools so I can hunt for holes; I want to crawl photos of girls at Nanjing universities~~


This time the crawler targets every paid photo set in the Tuwanjun sharing program.
If you haven't heard of it, have a look: www.tuwanjun.com




Briefly: the site loads its images via Ajax + JS, and each set lets you view only three images for free; the rest sit behind a paywall.
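Everything the script needs comes back from a single Ajax endpoint. A minimal probe of it (the set id 42 below is arbitrary; the endpoint and query parameters are the same ones the full script uses):

import requests
from urllib.parse import urlencode

# Fetch the detail response for one photo set; the body is JSON text whose
# "thumb" blocks carry the (backslash-escaped) image URLs parsed out below.
params = {'type': 'image', 'dpr': 3, 'id': 42}
resp = requests.get('https://api.tuwan.com/apps/Welfare/detail?' + urlencode(params))
print(resp.text[:300])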

But I found a small hole: a thumbnail URL and its full-size counterpart share the same filename, and the path is built out of parameters. Base64-decoding the URLs shows that the small and large images differ only in part of those parameters, so the full-size address of every photo in a set can be recovered by simple substitution.
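A sketch of the substitution idea with made-up URLs (the real parameter blobs are base64 strings sitting between .../thumb/jpg/ and /u/ in the path):

# Hypothetical URLs for illustration only; the real parameter segments
# are base64-encoded and differ per size, while filenames stay the same.
small = 'http://img4.tuwandata.com/v3/thumb/jpg/MzAwwx,thumb_params/u/photo01.jpg'
big = 'http://img4.tuwandata.com/v3/thumb/jpg/MzAwwx,full_params/u/photo03.jpg'

# The full-size URL of ANY photo in the set keeps its thumbnail's filename
# and swaps in the parameter segment taken from the one free full-size image:
full01 = small.replace('thumb_params', 'full_params')
print(full01)  # http://img4.tuwandata.com/v3/thumb/jpg/MzAwwx,full_params/u/photo01.jpg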

Even with a pool of worker processes it has been running for over an hour and still isn't done; several hundred sets are already on disk.
A quick peek at the haul:


The source code is below. Tips and sponsorships welcome, and feel free to join as a spiritual shareholder! Newbies, major shareholders, big spenders, and cute girls can add me at xwd2363 and I'll send the whole thing straight over Baidu Cloud!





import os
import re
from hashlib import md5
from multiprocessing import Pool
from urllib.parse import urlencode

import requests
from requests.exceptions import RequestException


def get_page(offset):
    """Fetch the Ajax detail response for photo set `offset` as raw text."""
    params = {
        'type': 'image',
        'dpr': 3,
        'id': offset,
    }
    url = 'https://api.tuwan.com/apps/Welfare/detail?' + urlencode(params)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.text
        return None
    except RequestException:
        print('Page request failed:', url)
        return None


def getUrl(html):
    """Extract one full-size sample URL plus all thumbnail URLs from the response."""
    # Each "thumb" block in the response holds backslash-escaped image URLs.
    pattern1 = re.compile('"thumb":(.*?)}', re.S)
    result = re.findall(pattern1, html)

    # The first block contains a free full-size image; take its first URL.
    bigUrl = result[0].replace('"', '').replace('\\', '')
    pattern2 = re.compile(r'(http.*?\.jpg),', re.S)
    bigUrl = re.findall(pattern2, bigUrl)[0]

    # The fourth block holds the thumbnails, recognizable by the base64
    # padding ("==") inside their parameter segment.
    pattern3 = re.compile(r'(http.*?==.*?\.jpg)', re.S)
    result3 = re.findall(pattern3, result[3])
    smallUrl = [item.replace('\\', '') for item in result3]
    return bigUrl, smallUrl


# The path segment between ".../thumb/jpg/" and "/u/" encodes the image
# parameters; group(2) is the part that differs between thumbnail and
# full-size URLs.
URL_PATTERN = re.compile(r'.*?thumb/jpg/+(.*?wx+)(.*?)(/u/.*?)\.jpg', re.S)


def findReplaceStr(url):
    """Pull the full-size parameter segment out of the sample big-image URL."""
    result = re.match(URL_PATTERN, url)
    return result.group(2)


def getBigImageUrl(url, replaceStr):
    """Rebuild a thumbnail URL into its full-size counterpart."""
    result = re.match(URL_PATTERN, url)
    # Splice the full-size parameter segment into the thumbnail's path,
    # re-appending the extension that the capture groups drop.
    return ('http://img4.tuwandata.com/v3/thumb/jpg/'
            + result.group(1) + replaceStr + result.group(3) + '.jpg')


def save_image(content, offset):
    """Write one image under ./image/<offset>/, named by its MD5 digest."""
    path = os.path.join(os.getcwd(), 'image', str(offset))
    file_path = os.path.join(path, md5(content).hexdigest() + '.jpg')
    os.makedirs(path, exist_ok=True)
    if not os.path.exists(file_path):
        with open(file_path, 'wb') as f:
            f.write(content)


def download_images(url, offset):
    print('downloading:', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            save_image(response.content, offset)
    except RequestException:
        print('Image request failed:', url)


def download(bigImageUrl, smallImageUrl, offset):
    # One substitution string per set: lift it from the free full-size
    # sample, then apply it to every thumbnail in the set.
    replaceStr = findReplaceStr(bigImageUrl)
    for url in smallImageUrl:
        download_images(getBigImageUrl(url, replaceStr), offset)


def main(offset):
    try:
        html = get_page(offset)
        bigUrl, smallUrls = getUrl(html)
        download(bigUrl, smallUrls, offset)
    except Exception:
        # Plenty of ids are empty or malformed; just log and move on.
        print('Bad set id:', offset)


if __name__ == '__main__':
    groups = list(range(1, 3000))
    pool = Pool()
    pool.map(main, groups)
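One knob worth noting: Pool() with no argument starts one worker process per CPU core, and map blocks until every id in 1..2999 has been tried. To throttle the crawl (and go easier on the server), cap the pool size explicitly:

pool = Pool(4)  # at most 4 concurrent workers instead of one per CPU core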




Reposted from www.cnblogs.com/CooperXia-847550730/p/10533558.html