The HTTP Protocol
HTTP (HyperText Transfer Protocol) is the most widely used network protocol on the Internet; all WWW documents must conform to it. HTTP was originally designed as a way to publish and retrieve HTML pages. It is a stateless application-layer protocol built on a request/response model. HTTP uses URLs to identify network resources, in the form:
http://host[:port][path]
host: a valid Internet host domain name or IP address
port: the port number, defaulting to 80
path: the path of the requested resource
Understanding HTTP URLs:
A URL is the Internet path through which a resource is accessed via HTTP; each URL corresponds to one data resource.
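The host/port/path components described above can be inspected with Python's standard library. This sketch uses urllib.parse (not requests) on an example URL chosen for illustration:

```python
from urllib.parse import urlparse

# Split an example URL into the components described above.
parts = urlparse("http://www.dianping.com:80/shopall/2/0")
print(parts.hostname)  # www.dianping.com
print(parts.port)      # 80
print(parts.path)      # /shopall/2/0
```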
The Requests Library
The seven methods of the requests library, corresponding to HTTP operations:
Method | Description |
------ | ----------- |
requests.request() | Constructs a request; the base method underlying all of the methods below |
requests.get() | The primary method for fetching an HTML page; corresponds to HTTP GET |
requests.head() | Fetches an HTML page's header information; corresponds to HTTP HEAD |
requests.post() | Submits a POST request to an HTML page; corresponds to HTTP POST |
requests.put() | Submits a PUT request to an HTML page; corresponds to HTTP PUT |
requests.patch() | Submits a partial-modification request; corresponds to HTTP PATCH |
requests.delete() | Submits a delete request to an HTML page; corresponds to HTTP DELETE |
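All of these helpers delegate to requests.request(). As a sketch, a request can also be built and inspected without ever touching the network (the Baidu URL here is just an illustrative target):

```python
import requests

# requests.get(url) is shorthand for requests.request("GET", url); the same
# pattern holds for head/post/put/patch/delete. A Request object can be
# constructed and prepared without performing any network I/O.
req = requests.Request("GET", "https://www.baidu.com", params={"wd": "python"})
prepared = req.prepare()
print(prepared.method)  # GET
print(prepared.url)     # https://www.baidu.com/?wd=python
```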
Installation
pip install requests
A quick test of the requests installation:
import requests
url = 'https://www.baidu.com'
r = requests.get(url)
r.encoding = r.apparent_encoding
print(r.text[-200:])
Sample output (the last 200 characters of the page):
'w.baidu.com/duty/>使用百度前必读</a> <a href= >意见反馈</a> 京ICP证030173号 <img src=//www.baidu.com/img/gs.gif> </p> </div> </div> </div> </body> </html>\r\n'
A detailed introduction to the requests library:
(to be added later)
The standard template for using the requests library:
import requests

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx status codes
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""
The BeautifulSoup Library
An introduction to BeautifulSoup:
(to be added later)
Key point: in basic BeautifulSoup usage, watch out for the NavigableString element type and the behavior of <Tag>.string. For example:
from bs4 import BeautifulSoup

text = "<li><a class=B href=//www.dianping.com/beijing/ch10/g112>大众点评</a></li>"
soup = BeautifulSoup(text, "html.parser")
s = soup.li.string
s1 = soup.li.a.string
s2 = soup.a.string
# here s == s1 == s2 == "大众点评"
# type(s) is bs4.element.NavigableString

text1 = "<li>大众点评<a class=B href=//www.dianping.com/beijing/ch10/g112></a></li>"
soup = BeautifulSoup(text1, "html.parser")
s = soup.li.string  # None — not an error, but not the text either

text2 = "<li><a class=B href=//www.dianping.com/beijing/ch10/g112></a>大众点评</li>"
soup = BeautifulSoup(text2, "html.parser")
s = soup.li.string  # None again

# When a tag contains both child tags (tag.children) and a NavigableString,
# <Tag>.string cannot decide which string is meant and returns None.
In that case, use <Tag>.text (or <Tag>.get_text()) to extract the tag's string content.
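The distinction can be verified directly. This condensed sketch reuses the example markup (the dianping URL is never fetched):

```python
from bs4 import BeautifulSoup

# <li> contains both a child tag and a bare string, so .string is ambiguous.
html = "<li><a href=//www.dianping.com/beijing/ch10/g112></a>大众点评</li>"
soup = BeautifulSoup(html, "html.parser")
print(soup.li.string)  # None
print(soup.li.text)    # 大众点评
```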
A complete code example:
# -*- coding: utf-8 -*-
# Crawlandmark_trading area.py
# Scrapes trading areas (商区) and landmarks (地标) from dianping.com
import pandas as pd
import requests
from bs4 import BeautifulSoup

def getHTMLText(url):
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.RequestException:
        return ""

def fillSQList(sqlist, html):
    soup = BeautifulSoup(html, "html.parser")
    for div in soup.find_all("div", "box shopallCate"):
        # Swap in the commented condition to scrape landmarks instead.
        if div.find("h2").text == "商区":      # "trading areas"
        # if div.find("h2").text == "地标":    # "landmarks"
            for dl in div.find_all("dl"):
                NaN = ""  # placeholder for an empty cell
                sqlist.append([dl.find("dt").text, NaN])
                for li in dl.find_all("li"):
                    sqlist.append([NaN, li.a.string])

def toCsv(sqlist):
    df = pd.DataFrame(data=sqlist)
    # utf_8_sig writes a BOM so Excel displays the Chinese text correctly
    df.to_csv('商区.csv', encoding='utf_8_sig')
    # df.to_csv('地标.csv', encoding='utf_8_sig')

if __name__ == '__main__':
    sqlist = []
    url = "http://www.dianping.com/shopall/2/0#BDBlock"
    html = getHTMLText(url)
    fillSQList(sqlist, html)
    toCsv(sqlist)