Steps:
Crawling static pages with Python's Requests and BeautifulSoup modules, a web crawler essentially comes down to two steps:
1. Set the request parameters (URL, headers, cookies, GET/POST, authentication, etc.) and send the request to the target site's server;
2. Parse the document the server returns and extract the information you need.
Target site: http://beijing.8684.cn/
First, environment configuration
# -*- coding: utf-8 -*-
import requests
# import BeautifulSoup from bs4
from bs4 import BeautifulSoup
import os

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
# Start URL
all_url = 'http://beijing.8684.cn'
start_html = requests.get(all_url, headers=headers)
# print(start_html.text)
# Parse the HTML document with lxml
Soup = BeautifulSoup(start_html.text, 'lxml')
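Before handing the response to BeautifulSoup, it can be worth checking that the request actually succeeded and that the encoding was detected correctly. A minimal sketch (the status check and the apparent_encoding fallback are additions, not part of the original code):

if start_html.status_code == 200:
    # Fall back to the encoding requests sniffs from the body, in case the header lies
    start_html.encoding = start_html.apparent_encoding
    Soup = BeautifulSoup(start_html.text, 'lxml')
else:
    raise RuntimeError('Request failed with status %d' % start_html.status_code)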
Second, site analysis
1) Beijing bus routes are grouped by their starting number. Press "F12" to open the Developer Tools, click "Elements", then click "1" on the page: the number links turn out to be stored in <div class="bus_kt_r1">, so we just extract the href from the <a> tags inside that div.
all_a = Soup.find('div', class_='bus_kt_r1').find_all('a')
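A quick sanity check of what the selector captured, purely illustrative and not part of the original code:

# Print the text and href of every number link to verify the selector matched
for a in all_a:
    print(a.get_text(), a['href'])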
2) The link for each bus route sits in an <a> tag inside <div id="con_site_1" class="site_list">: the href attribute of each <a> is the route's URL, and its text is the route's name.
# Take the href attribute of the <a> tag (a is one element of all_a)
href = a['href']
html = all_url + href
second_html = requests.get(html, headers=headers)
# print(second_html.text)
Soup2 = BeautifulSoup(second_html.text, 'lxml')
all_a2 = Soup2.find('div', class_='cc_content').find_all('div')[-1].find_all('a')
3) Open a route link to view its detailed stop information. Analyzing the document structure of the page shows that the route's basic information is stored in <div class="bus_i_content">, while the stop information is stored in <div class="bus_line_top"> and <div class="bus_line_site">.
# Take the text of the <a> tag (the route name)
title1 = a2.get_text()
# Take the href attribute of the <a> tag (a2 is one element of all_a2)
href1 = a2['href']
# print(title1, href1)
# Build the URL of the route page
html_bus = all_url + href1
third_html = requests.get(html_bus, headers=headers)
Soup3 = BeautifulSoup(third_html.text, 'lxml')
# Route name
bus_name = Soup3.find('div', class_='bus_i_t1').find('h1').get_text()
# Route type
bus_type = Soup3.find('div', class_='bus_i_t1').find('a').get_text()
# Operating hours
bus_time = Soup3.find_all('p', class_='bus_i_t4')[0].get_text()
# Fare
bus_cost = Soup3.find_all('p', class_='bus_i_t4')[1].get_text()
# Bus company
bus_company = Soup3.find_all('p', class_='bus_i_t4')[2].find('a').get_text()
# Last update time
bus_update = Soup3.find_all('p', class_='bus_i_t4')[3].get_text()
bus_label = Soup3.find('div', class_='bus_label')
if bus_label:
    # Route length
    bus_length = bus_label.get_text()
else:
    bus_length = []
# print(bus_name, bus_type, bus_time, bus_cost, bus_company, bus_update)
# Route summary
all_line = Soup3.find_all('div', class_='bus_line_top')
# Bus stops
all_site = Soup3.find_all('div', class_='bus_line_site')
line_x = all_line[0].find('div', class_='bus_line_txt').get_text()[:-9] + all_line[0].find_all('span')[-1].get_text()
sites_x = all_site[0].find_all('a')
# Outbound (uplink) stops
sites_x_list = []
for site_x in sites_x:
    sites_x_list.append(site_x.get_text())
line_num = len(all_line)
# If a return direction exists, the page yields two lists; otherwise only one, and the second stays empty
if line_num == 2:
    line_y = all_line[1].find('div', class_='bus_line_txt').get_text()[:-9] + all_line[1].find_all('span')[-1].get_text()
    sites_y = all_site[1].find_all('a')
    # Return (downlink) stops
    sites_y_list = []
    for site_y in sites_y:
        sites_y_list.append(site_y.get_text())
else:
    line_y, sites_y_list = [], []
information = [bus_name, bus_type, bus_time, bus_cost, bus_company, bus_update, bus_length, line_x, sites_x_list, line_y, sites_y_list]
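The collected fields can then be persisted. A minimal sketch that appends each route's information to a CSV file (the file name and the '|' joining of the stop lists are assumptions, not from the original):

import csv

with open('beijing_bus.csv', 'a', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    # Join the stop lists so each route occupies a single row
    writer.writerow([bus_name, bus_type, bus_time, bus_cost, bus_company, bus_update,
                     bus_length, line_x, '|'.join(sites_x_list), line_y, '|'.join(sites_y_list)])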
With that, a route's basic information and its outbound and return stop lists are all parsed. If you want to crawl the entire city's bus network, you simply wrap the snippets above in loops over the number links (all_a) and the route links (all_a2), as sketched below.
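A hedged sketch of what those added loops could look like, assuming the step-3 snippet is wrapped in a helper; parse_route() is a hypothetical name, and the sleep delay is an assumption, neither is from the original:

import time

for a in all_a:  # every number link from step 1
    second_html = requests.get(all_url + a['href'], headers=headers)
    Soup2 = BeautifulSoup(second_html.text, 'lxml')
    all_a2 = Soup2.find('div', class_='cc_content').find_all('div')[-1].find_all('a')
    for a2 in all_a2:  # every route under that number
        information = parse_route(all_url + a2['href'])  # hypothetical helper wrapping the step-3 parsing code
        time.sleep(0.5)  # assumed politeness delay between requests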