Using the Python Requests / BeautifulSoup modules to crawl urban transit network stop data

Steps:

We use Python's Requests and BeautifulSoup modules to crawl static pages. A web crawler essentially performs two steps:
1. Set the request parameters (url, headers, cookies, GET or POST authentication, etc.) and access the target site's server;
2. Parse the documents returned by the server and extract the information you need.

Site: http://beijing.8684.cn/

First, environment configuration

# -*- coding: utf-8 -*-
import requests                  # import Requests
from bs4 import BeautifulSoup    # import BeautifulSoup from bs4
import os

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36 SE 2.X MetaSr 1.0'}
# start URL
all_url = 'http://beijing.8684.cn'
start_html = requests.get(all_url, headers=headers)
# print(start_html.text)
# parse the html document with lxml
Soup = BeautifulSoup(start_html.text, 'lxml')
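
Before parsing, it is worth confirming the request actually succeeded; a minimal optional check (not part of the walkthrough above):

# optional: stop early on a bad HTTP status
start_html.raise_for_status()
# optional: guard against garbled Chinese text in the response
start_html.encoding = start_html.apparent_encoding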

Second, site analysis

1) Beijing bus lines are grouped by their starting number. Press "F12" to open the Developer Tools, click "Elements", then click "1": the category links are stored inside <div class="bus_kt_r1">, so we just extract the href of each <a> inside that div.

all_a = Soup.find('div', class_='bus_kt_r1').find_all('a')
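
To verify the selector matched, a quick check can print the category links (each href is a relative URL):

# quick check: list the href of every starting-number category link
for a in all_a:
    print(a['href'])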

2) Each bus line's link is inside the <a> tags of <div id="con_site_1" class="site_list">; the href is the line's URL and the tag's text is the line name.

# extract the href attribute of the a tag (a is one category link from all_a)
href = a['href']
html = all_url + href
second_html = requests.get(html, headers=headers)
# print(second_html.text)
Soup2 = BeautifulSoup(second_html.text, 'lxml')
all_a2 = Soup2.find('div', class_='cc_content').find_all('div')[-1].find_all('a')
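
Assuming the selectors above hold, a similar quick check prints each line's name and relative URL; the per-line parsing in the next step runs once for each such a2:

# quick check: each a2 carries a line name (text) and a relative URL (href)
for a2 in all_a2:
    print(a2.get_text(), a2['href'])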

3) Open a line's link to view its specific stop information. Analyzing the page structure shows that the line's basic information is stored inside <div class="bus_i_content">, while the stop information is stored inside <div class="bus_line_top"> and <div class="bus_line_site">.

# extract the text of the a tag (a2 is one line link from all_a2)
title1 = a2.get_text()
# extract the href attribute of the a tag
href1 = a2['href']
# print(title1, href1)
# build the line page URL
html_bus = all_url + href1
third_html = requests.get(html_bus, headers=headers)
Soup3 = BeautifulSoup(third_html.text, 'lxml')
# extract the line name
bus_name = Soup3.find('div', class_='bus_i_t1').find('h1').get_text()
# extract the line attribute
bus_type = Soup3.find('div', class_='bus_i_t1').find('a').get_text()
# running time
bus_time = Soup3.find_all('p', class_='bus_i_t4')[0].get_text()
# fare
bus_cost = Soup3.find_all('p', class_='bus_i_t4')[1].get_text()
# bus company
bus_company = Soup3.find_all('p', class_='bus_i_t4')[2].find('a').get_text()
# update time
bus_update = Soup3.find_all('p', class_='bus_i_t4')[3].get_text()
# line length (mileage)
bus_label = Soup3.find('div', class_='bus_label')
if bus_label:
    bus_length = bus_label.get_text()
else:
    bus_length = []
# print(bus_name, bus_type, bus_time, bus_cost, bus_company, bus_update)
# line summary
all_line = Soup3.find_all('div', class_='bus_line_top')
# bus stops
all_site = Soup3.find_all('div', class_='bus_line_site')
# line header text plus its last span; [:-9] trims the tail of the header text, as in the original
line_x = all_line[0].find('div', class_='bus_line_txt').get_text()[:-9] + all_line[0].find_all('span')[-1].get_text()
sites_x = all_site[0].find_all('a')
sites_x_list = []
# uplink stops
for site_x in sites_x:
    sites_x_list.append(site_x.get_text())
line_num = len(all_line)
# if a return direction exists there are two lists; otherwise only one, and the second stays empty
if line_num == 2:
    line_y = all_line[1].find('div', class_='bus_line_txt').get_text()[:-9] + all_line[1].find_all('span')[-1].get_text()
    # downlink stops
    sites_y = all_site[1].find_all('a')
    sites_y_list = []
    for site_y in sites_y:
        sites_y_list.append(site_y.get_text())
else:
    line_y, sites_y_list = [], []
Information = [bus_name, bus_type, bus_time, bus_cost, bus_company, bus_update, bus_length, line_x, sites_x_list, line_y, sites_y_list]

At this point, one line's basic information and its uplink and downlink stop information have all been parsed. To crawl the stop data for the city's whole transit network, you simply need to wrap the steps above in loops.
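
A condensed sketch of how the loops could be assembled end to end, keeping just the line name and the stop lists (the sleep and CSV-save steps are illustrative additions, not part of the walkthrough above):

import csv
import time
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.221 Safari/537.36'}
all_url = 'http://beijing.8684.cn'

rows = []
soup = BeautifulSoup(requests.get(all_url, headers=headers).text, 'lxml')
# loop 1: every starting-number category on the home page
for a in soup.find('div', class_='bus_kt_r1').find_all('a'):
    soup2 = BeautifulSoup(requests.get(all_url + a['href'], headers=headers).text, 'lxml')
    # loop 2: every line link on the category page
    for a2 in soup2.find('div', class_='cc_content').find_all('div')[-1].find_all('a'):
        soup3 = BeautifulSoup(requests.get(all_url + a2['href'], headers=headers).text, 'lxml')
        bus_name = soup3.find('div', class_='bus_i_t1').find('h1').get_text()
        # one entry per direction: uplink, plus downlink when it exists
        stops = ['/'.join(s.get_text() for s in d.find_all('a'))
                 for d in soup3.find_all('div', class_='bus_line_site')]
        rows.append([bus_name] + stops)
        time.sleep(0.5)  # illustrative: avoid hammering the server

# illustrative: one CSV row per line, stop names joined with '/'
with open('beijing_bus.csv', 'w', newline='', encoding='utf-8') as f:
    csv.writer(f).writerows(rows)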

Origin: www.cnblogs.com/5211314jackrose/p/11307930.html