Implement a subject-oriented (themed) Python web crawler and complete the following:
(Note: one topic per person, subject of your choice; all design content and source code are to be submitted to the cnblogs blog platform.)
First, themed web crawler design (15 points)
1. Name of the themed web crawler
The novels published over the years by the author Fenghuo Xi Zhuhou (烽火戏诸侯)
2. Analysis of the content and data features to be crawled
The name of each of the author's novels
The total hit count of each novel
3. Overview of the themed crawler design (including the implementation approach and technical difficulties)
http://home.zongheng.com/show/userInfo/166130.html
http://book.zongheng.com/book/{}.html
Start by crawling the URL of each book from the author's information page, then crawl each book's name and hit count from its URL, and finally write the names and hit counts to an Excel file and chart them.
Second, structural analysis of the target pages (15 points)
1. Structural features of the target pages
http://home.zongheng.com/show/userInfo/166130.html
URL of the author information page for Fenghuo Xi Zhuhou
http://book.zongheng.com/book/{}.html
URL pattern of each work crawled from the author information page; the braces are filled with each book's numeric id.
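As a small sketch of how that pattern is used (the id 672340 below is a made-up placeholder, not a real book id):

```python
# Template for a book's detail page on Zongheng (from the analysis above).
book_url_template = "http://book.zongheng.com/book/{}.html"

def make_book_url(book_id):
    # fill the {} placeholder with the book id
    return book_url_template.format(book_id)

# hypothetical id, purely for illustration
print(make_book_url(672340))  # http://book.zongheng.com/book/672340.html
```

In the actual crawler the book URLs are scraped from the author page rather than generated, so this template mainly documents the shape of the links being collected.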
2. HTML page parsing
The work's URL is taken from the href of the <a> tag inside the div with class imgbox.
The work's name is crawled from under the div tag with class book-info.
The hit count is crawled from the div with class nums, taking the text of its third <i> tag (index 2).
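The three parsing steps above can be sketched with BeautifulSoup against a miniature HTML snippet. The snippet is invented to mirror the structure described here; the real Zongheng markup may differ:

```python
from bs4 import BeautifulSoup

# Made-up markup mirroring the structure described above.
html = '''
<div class="imgbox"><a href="http://book.zongheng.com/book/123.html">cover</a></div>
<div class="book-name">雪中悍刀行</div>
<div class="nums"><i>1.2万</i><i>3456</i><i>78.9万</i></div>
'''
soup = BeautifulSoup(html, 'html.parser')

# the work's url: href of the <a> inside div.imgbox
url = soup.find('div', class_='imgbox').a['href']
# the title: text of div.book-name
title = soup.find('div', class_='book-name').string
# the hit count: text of the third <i> under div.nums
hits = soup.find('div', class_='nums').find_all('i')[2].string

print(url, title, hits)
```
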
3. Node (tag) lookup and traversal method
(node tree structure shown if necessary)
def namesinfo(html):
    soup = BeautifulSoup(html, 'html.parser')
    # get the div whose class is book-name
    name = soup.find_all("div", attrs='book-name')
    # regex to extract the Chinese book name
    namess = re.findall(r"[\u4e00-\u9fa5]+", str(name[0]))
Nodes are found with find_all, then a regular expression extracts the Chinese title.
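A minimal illustration of that regex step, using an invented tag string:

```python
import re

# str(tag) mixes markup with the Chinese title; [\u4e00-\u9fa5]+
# keeps only runs of CJK characters, so the ASCII markup drops out.
tag_text = '<div class="book-name">雪中悍刀行</div>'  # invented sample
names = re.findall(r"[\u4e00-\u9fa5]+", tag_text)
print(names)  # ['雪中悍刀行']
```
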
Third, web crawler design (60 points)
The crawler body should include the following parts; attach the source code with detailed comments, and provide the output of each part of the program.
from bs4 import BeautifulSoup
import requests, matplotlib, re, xlwt
import matplotlib.pyplot as plt

# fetch a page
def gethtml(url):
    info = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36'}
    try:
        data = requests.get(url, headers=info)
        data.raise_for_status()
        data.encoding = data.apparent_encoding
        return data.text
    except:
        return ""

# collect the url of each book
def urlinfo(url):
    books = []
    book = gethtml(url)
    soup = BeautifulSoup(book, "html.parser")
    # get the p tags with class tit
    p = soup.find_all("p", attrs="tit")
    for item in p:
        # collect each book's address
        books.append(item.a.attrs['href'])
    return books

# hit-count information
def numsinfo(html):
    n = []
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find_all("div", attrs='nums')
    nums = div[0]
    i = 0
    for spa in nums.find_all("i"):
        if i == 2:
            # get the click count (third <i> tag)
            n.append(spa.string.split('.')[0])
            break
        i += 1
    return n

# title information
def namesinfo(html):
    soup = BeautifulSoup(html, 'html.parser')
    # get the div whose class is book-name
    name = soup.find_all("div", attrs='book-name')
    # regex to extract the Chinese book name
    namess = re.findall(r"[\u4e00-\u9fa5]+", str(name[0]))
    return namess

# let matplotlib render Chinese characters
matplotlib.rcParams['font.sans-serif'] = ['SimHei']
matplotlib.rcParams['font.family'] = 'sans-serif'
matplotlib.rcParams['axes.unicode_minus'] = False

# bar chart
def Bar(x, y, user):
    plt.bar(x, height=y, color='y', width=0.5)
    plt.ylabel('点击量')   # hits
    plt.xlabel('书名')     # book title
    plt.title(user)
    plt.savefig(user, dpi=300)
    plt.show()

def file(book, nums, address):
    # create a Workbook, the equivalent of an Excel file
    excel = xlwt.Workbook(encoding='utf-8')
    # create a sheet named One
    sheet1 = excel.add_sheet(u'One', cell_overwrite_ok=True)
    # write the column headers
    sheet1.write(0, 0, 'book')
    sheet1.write(0, 1, 'nums')
    # data rows start below the header row
    for i in range(len(book)):
        sheet1.write(i + 1, 0, book[i])
    for j in range(len(nums)):
        sheet1.write(j + 1, 1, nums[j])
    excel.save(address)

# flatten the single-element lists returned by the parsers
def convert(lista):
    listb = []
    for i in lista:
        listb.append(i[0])
    return listb

def main():
    # the author's information page
    author = 'http://home.zongheng.com/show/userInfo/166130.html'
    user = '烽火戏诸侯'
    urls = urlinfo(author)
    namelist = []
    countlist = []
    for url in urls:
        html = gethtml(url)
        namelist.append(namesinfo(html))
        countlist.append(numsinfo(html))
    namelist = convert(namelist)
    countlist = convert(countlist)
    for i in range(len(countlist)):
        countlist[i] = int(countlist[i])
    # save path
    addr = f'D:\\{user}.xls'
    file(namelist, countlist, addr)
    Bar(namelist, countlist, user)

if __name__ == '__main__':
    main()
1. Data acquisition and crawling
def urlinfo(url):
    books = []
    book = gethtml(url)
    soup = BeautifulSoup(book, "html.parser")
    # get the p tags with class tit
    p = soup.find_all("p", attrs="tit")
    for item in p:
        # collect each book's address
        books.append(item.a.attrs['href'])
    return books
def numsinfo(html):
    n = []
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find_all("div", attrs='nums')
    nums = div[0]
    i = 0
    for spa in nums.find_all("i"):
        if i == 2:
            # get the click count (third <i> tag)
            n.append(spa.string.split('.')[0])
            break
        i += 1
    return n
2. Data processing and cleaning
Data cleaning
for spa in nums.find_all("i"):
    if i == 2:
        # get the click count; keep only the integer part
        n.append(spa.string.split('.')[0])
        break
    i += 1
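A quick sketch of what this cleaning step does to one raw value (the sample string is invented, not taken from the live site):

```python
# split('.')[0] keeps only the digits before the decimal point,
# so the result can later be converted with int().
raw = '12345.6'   # invented sample hit-count text
count = int(raw.split('.')[0])
print(count)  # 12345
```

Note that this truncates rather than rounds; that matches what the crawler stores.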
Data cleaning
namess = re.findall(r"[\u4e00-\u9fa5]+", str(name[0]))
return namess
3. Text analysis (optional): jieba word segmentation, wordcloud visualization
def file(book, nums, address):
    # create a Workbook, the equivalent of an Excel file
    excel = xlwt.Workbook(encoding='utf-8')
    # create a sheet named One
    sheet1 = excel.add_sheet(u'One', cell_overwrite_ok=True)
    # write the column headers
    sheet1.write(0, 0, 'book')
    sheet1.write(0, 1, 'nums')
    # data rows start below the header row
    for i in range(len(book)):
        sheet1.write(i + 1, 0, book[i])
    for j in range(len(nums)):
        sheet1.write(j + 1, 1, nums[j])
    excel.save(address)
4. Data analysis and visualization
(for example: bar chart, histogram, scatter plot, box plot, distribution plot, regression analysis of the data, etc.)
def Bar(x, y, user):
    plt.bar(x, height=y, color='y', width=0.5)
    plt.ylabel('点击量')   # hits
    plt.xlabel('书名')     # book title
    plt.title(user)
    plt.savefig(user, dpi=300)
    plt.show()
5. Data Persistence
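xlwt is a third-party package; as a hedged alternative sketch for this persistence step, the standard-library csv module can write the same two columns. The function name, file name, and sample data below are invented for illustration:

```python
import csv

def save_csv(books, nums, address):
    # write a header row, then one row per (title, hit-count) pair;
    # zip() keeps the two columns aligned for every element
    with open(address, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['book', 'nums'])
        for name, count in zip(books, nums):
            writer.writerow([name, count])

# invented sample data
save_csv(['书一', '书二'], [100, 200], 'demo.csv')
```

A .csv opens in Excel just as the .xls does, and avoids the extra dependency.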
Fourth, conclusions (10 points)
1. What conclusions can be drawn from the analysis and visualization of the data?
Sword Snow Stride (《雪中悍刀行》) is the author's best-selling novel.
2. Briefly summarize how this programming task was completed.
Through this assignment I put Python into practice and gained a deeper understanding of web crawling.