Working overtime today, and it's really tough!!
Out of boredom I wrote an image crawler in Python, and I'm pretty pleased with it, ha ha.
Here's the code first (Python version: 2.7.9):
__author__ = 'bloodchilde'

import urllib
import urllib2
import re
import os


class Spider:
    def __init__(self):
        self.siteUrl = "http://sc.chinaz.com/biaoqing/"
        self.user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko'
        self.headers = {'User-Agent': self.user_agent}

    def getPage(self, pageIndex):
        # Fetch the HTML source of one list page, e.g. index_3.html
        url = self.siteUrl + "index_" + str(pageIndex) + ".html"
        request = urllib2.Request(url, headers=self.headers)
        response = urllib2.urlopen(request)
        return response.read().decode("utf-8")

    def getContents(self, pageIndex):
        # Return a list of [title, image URL] pairs found on the page
        page = self.getPage(pageIndex)
        pattern = re.compile('''<div.*?class='num_1'.*?>.*?<p>.*?<a.*?href='.*?'.*?target='_blank'.*?title='(.*?)'.*?><img.*?src2="(.*?)".*?>.*?</a>.*?</p>.*?</div>''', re.S)
        items = re.findall(pattern, page)
        contents = []
        for item in items:
            contents.append([item[0], item[1]])
        return contents

    def mk_dir(self, path):
        # Create the directory if it does not exist yet
        isExist = os.path.exists(path)
        if not isExist:
            os.makedirs(path)
            return True
        else:
            return False

    def downImage(self, url, dirname):
        # Request the image URL; the response body is the image itself
        imageUrl = url
        request = urllib2.Request(imageUrl, headers=self.headers)
        response = urllib2.urlopen(request)
        imageContents = response.read()
        # The last segment of the URL is the file name (with extension)
        urlArr = imageUrl.split(u"/")
        imageName = str(urlArr[len(urlArr) - 1])
        print imageName
        path = u"C:/Users/bloodchilde/Desktop/image_python/" + dirname
        self.mk_dir(path)
        imagePath = path + u"/" + imageName
        f = open(imagePath, 'wb')
        f.write(imageContents)
        f.close()

    def downLoadAllPicture(self, PageIndex):
        contents = self.getContents(PageIndex)
        # "item" instead of "list", so the built-in name is not shadowed
        for item in contents:
            dirname = item[0]
            imageUrl = item[1]
            self.downImage(imageUrl, dirname)


demo = Spider()
for page in range(3, 100):
    demo.downLoadAllPicture(page)
The results are as follows:
It downloaded this many pictures in an instant. Now let's walk through the program:
First of all, my target page is:
http://sc.chinaz.com/biaoqing/index_3.html
What the program does: download the emoticon images on this page.
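As a rough sketch of how the list-page URLs are put together, the snippet below builds the request for a given page index without actually sending it. Note that this is written for Python 3 (`urllib.request` instead of `urllib2`), which is my assumption here since the post's code is Python 2.7.9, and `build_page_request` is a hypothetical helper name, not something from the code above.

```python
import urllib.request

SITE_URL = "http://sc.chinaz.com/biaoqing/"
USER_AGENT = "Mozilla/5.0 (Windows NT 6.1; WOW64; Trident/7.0; rv:11.0) like Gecko"


def build_page_request(page_index):
    """Build (but do not send) the request for one list page.

    Hypothetical helper: mirrors the URL scheme used by getPage().
    """
    url = SITE_URL + "index_" + str(page_index) + ".html"
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


if __name__ == "__main__":
    request = build_page_request(3)
    print(request.full_url)
```

Passing the returned object to `urllib.request.urlopen` would perform the actual download, which is what `getPage` does with `urllib2` in the Python 2 version.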
How the program works:
1. Fetch the HTML source of the web page
2. Parse the source with a regular expression to get the URLs of the pictures to download
3. Send a request to each picture URL; the body of the response is the picture's content (contents)
4. The picture URL also gives us the picture's file name, including its extension (imageName)
5. Create a local file named imageName and write contents into it
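Steps 4 and 5 can be sketched on their own, without touching the network. This is a minimal Python 3 version under my own assumptions: the helper names `image_name_from_url` and `save_image` are hypothetical, the URL is a made-up example, and I use `urlsplit` plus `posixpath.basename` rather than the `split("/")` from the code above (for a plain URL without a query string, both give the same file name).

```python
import os
import posixpath
import tempfile
from urllib.parse import urlsplit


def image_name_from_url(image_url):
    """Step 4: the last path segment of the URL is the file name with extension."""
    return posixpath.basename(urlsplit(image_url).path)


def save_image(contents, base_dir, dirname, image_name):
    """Step 5: create base_dir/dirname if needed and write the bytes into a file."""
    target_dir = os.path.join(base_dir, dirname)
    os.makedirs(target_dir, exist_ok=True)  # no error if it already exists
    image_path = os.path.join(target_dir, image_name)
    with open(image_path, "wb") as f:
        f.write(contents)
    return image_path


if __name__ == "__main__":
    # Hypothetical URL and placeholder bytes, for illustration only.
    name = image_name_from_url("http://example.com/pics/smile_01.gif")
    with tempfile.TemporaryDirectory() as base:
        print(save_image(b"GIF89a", base, "demo", name))
```

The `with` block closes the file even if the write fails, which the manual `open` / `close` pair in the original does not guarantee.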
Open http://sc.chinaz.com/biaoqing/index_3.html, view the page source, and find the code segment we need to handle:
The corresponding regular expression is:
'''<div.*?class='num_1'.*?>.*?<p>.*?<a.*?href='.*?'.*?target='_blank'.*?title='(.*?)'.*?><img.*?src2="(.*?)".*?>.*?</a>.*?</p>.*?</div>'''
From this snippet we extract title and src2: title is used as the folder name, and src2 is the URL of the picture to download.
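To see what the regular expression captures, here it is run against a small piece of HTML. The sample markup is hypothetical: I wrote it to match the shape the pattern expects (single-quoted attributes, a double-quoted `src2`), and it is not copied from the real page. `extract_items` is likewise just an illustrative name. The sketch is Python 3.

```python
import re

# Same pattern as in getContents(); re.S lets ".*?" run across line breaks.
PATTERN = re.compile(
    '''<div.*?class='num_1'.*?>.*?<p>.*?<a.*?href='.*?'.*?target='_blank'.*?title='(.*?)'.*?><img.*?src2="(.*?)".*?>.*?</a>.*?</p>.*?</div>''',
    re.S,
)

# Hypothetical sample markup, for illustration only.
SAMPLE_HTML = """
<div class='num_1'>
<p><a href='/biaoqing/a.htm' target='_blank' title='Pack A'><img src2="http://example.com/a.gif" alt='A'></a></p>
</div>
<div class='num_1'>
<p><a href='/biaoqing/b.htm' target='_blank' title='Pack B'><img src2="http://example.com/b.gif" alt='B'></a></p>
</div>
"""


def extract_items(page):
    """Return a list of (title, src2) tuples, one per matched block."""
    return PATTERN.findall(page)


if __name__ == "__main__":
    for title, image_url in extract_items(SAMPLE_HTML):
        print(title, image_url)
```

Because every quantifier is lazy (`.*?`), each match stops at the first closing `</div>`, so the two blocks come back as two separate (title, src2) pairs instead of one merged match.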
----------------
Disclaimer: This is an original article by CSDN blogger "Little Wei", published under the CC 4.0 BY-SA license. When reposting, please include a link to the original source and this statement.
Original link: https://blog.csdn.net/dai_jing/article/details/46661969