The requests module
A crawler extracts the data you want from the source code of the target web pages.
In essence, it imitates a client's request, receives the server's response, and pulls out the information the programmer asked for.
(In theory, anything a browser can do, a crawler can do.)
Crawler categories:
    General-purpose crawlers:
        the crawlers behind search engines
    Focused crawlers:
        crawlers aimed at a specific site
How a search engine works:
    crawl pages > store the data > preprocess > provide search and ranking
Types of requests sent to a site:
    GET
        what the browser sends from its address bar
    POST
        form submissions
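A minimal side-by-side sketch of the two request types, using the httpbin.org echo service (an assumed test endpoint for illustration, not a site from these notes):

import requests

# GET: parameters travel in the URL's query string
r1 = requests.get("https://httpbin.org/get", params={"q": "test"})
print(r1.url)  # the query string is visible in the URL

# POST: parameters travel in the request body, like a submitted form
r2 = requests.post("https://httpbin.org/post", data={"q": "test"})
print(r2.request.body)  # the form data is in the body, not the URL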
Fetch Baidu's source code and save it locally.
import requests

def main():
    url = "https://www.baidu.com"
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    responses = requests.get(url, headers=headers)  # send the request with a custom header
    # print(responses.content)  # inspect the source code Baidu returned
    with open("baidu.txt", "wb") as f:
        f.write(responses.content)

if __name__ == '__main__':
    main()
responses.status_code: get the status code
responses.headers: get the response headers
responses.url: get the URL of the response
responses.request.headers: get the request headers
responses.content: get the page source (bytes)
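A quick sketch printing each of these attributes for one request, reusing the Baidu example above:

import requests

resp = requests.get("https://www.baidu.com",
                    headers={"User-Agent": "Mozilla/5.0"})
print(resp.status_code)      # e.g. 200 on success
print(resp.headers)          # response headers, a dict-like object
print(resp.url)              # the final URL, after any redirects
print(resp.request.headers)  # the headers we actually sent
print(resp.content[:100])    # the first 100 bytes of the page source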
Using query parameters in the URL to perform a specific request.
import requests

def main():
    url = "https://www.baidu.com/s"
    #url = "https://www.baidu.com/s?wd={}".format("12306")  # with format() baked into the URL you don't need params
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    params = {"wd": "12306"}  # the search keyword is 12306
    responses = requests.get(url, headers=headers, params=params)  # request with query parameters
    print(responses.request.url)
    with open("baidu.txt", "wb") as f:
        f.write(responses.content)

if __name__ == '__main__':
    main()
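Worth noting: requests percent-encodes params for you, which is why printing responses.request.url is useful. A small sketch (the Chinese keyword below is just an illustrative assumption):

import requests

resp = requests.get("https://www.baidu.com/s",
                    params={"wd": "火车票"},
                    headers={"User-Agent": "Mozilla/5.0"})
print(resp.request.url)  # the keyword appears percent-encoded: wd=%E7%81%AB%E8%BD%A6%E7%A5%A8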
A Baidu Tieba crawler
import requests

def main():
    tieba_name = input("Enter the tieba name: ")
    url_temp = "http://tieba.baidu.com/f?kw=" + tieba_name + "&ie=utf-8&pn={}"
    url_list = []  # list of page URLs to fetch
    # crawl 2 pages; each page holds 50 posts, so pn advances by 50
    for i in range(2):
        url_list.append(url_temp.format(i * 50))
    # define the headers
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    # request every URL in url_list and save each response
    for i, url in enumerate(url_list):
        resp = requests.get(url, headers=headers)
        with open("tieba_{}.html".format(i), "wb") as f:  # one file per page so later pages don't overwrite earlier ones
            f.write(resp.content)

if __name__ == '__main__':
    main()
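The pn arithmetic above generalizes: Tieba lists 50 posts per page, so page i starts at pn = i * 50. A small helper sketch (the function name is my own, not from the notes):

def build_page_urls(tieba_name, pages):
    # Tieba shows 50 posts per page, so pn = page_index * 50
    base = "http://tieba.baidu.com/f?kw={}&ie=utf-8&pn={}"
    return [base.format(tieba_name, i * 50) for i in range(pages)]

print(build_page_urls("python", 3))  # pn = 0, 50, 100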
Sending a POST request
import requests

def main():
    url = "http://njeclissi.com:81/Less-20/index.php"
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    data = {"uname": "admin", "passwd": "admin"}  # the form data to POST
    resp = requests.post(url, headers=header, data=data)  # send the POST request
    print(resp.content)

if __name__ == '__main__':
    main()
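The data= argument sends a classic form body (application/x-www-form-urlencoded). For JSON APIs, requests can serialize the body and set the Content-Type itself via json=. A short sketch against the httpbin.org echo service (an assumed endpoint for illustration):

import requests

resp = requests.post("https://httpbin.org/post", json={"uname": "admin"})
print(resp.json()["json"])  # httpbin echoes the parsed JSON body back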
Countering anti-crawler defenses
    Techniques a crawler uses to avoid being blocked
        Keep your real address from being traced by routing through a proxy:
            proxies = {"http": "http://1.2.3.4:9999"}
            resp = requests.post(url, headers=header, data=data, proxies=proxies)
        Anonymous proxy:
            the server knows you are using a proxy, but not who you are
        Distorting proxy:
            the server knows you are using a proxy, but sees a fake IP address
        Elite (high-anonymity) proxy:
            the server does not know you are using a proxy
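A hedged sketch of sending a request through a proxy, with a timeout and error handling; 1.2.3.4:9999 is a placeholder address, not a working proxy:

import requests

proxies = {
    "http": "http://1.2.3.4:9999",
    "https": "http://1.2.3.4:9999",  # https traffic can also be tunneled through an http proxy
}
try:
    resp = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=5)
    print(resp.text)  # should report the proxy's IP, not yours
except requests.exceptions.RequestException as e:
    print("proxy failed:", e)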
How sites detect crawlers:
    the volume of requests from one IP within a time window
    checks on the cookie, user-agent, headers, and referer
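Since per-IP request volume and header fingerprints are what get checked, the usual countermeasures are pacing requests and rotating the User-Agent. A minimal sketch (the URL list and UA strings below are illustrative assumptions):

import random
import time
import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
]
url_list = ["https://www.baidu.com"]  # placeholder targets

for url in url_list:
    headers = {"User-Agent": random.choice(user_agents)}  # rotate the UA per request
    resp = requests.get(url, headers=headers)
    time.sleep(random.uniform(1, 3))  # pause so one IP doesn't flood the site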
Handling cookies
import requests

def main():
    url = "http://njeclissi.com:81/Less-20/index.php"
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36", "Cookie": "uname=admin"}  # send the cookie inside the headers
    resp = requests.get(url, headers=header)
    print(resp.content)

if __name__ == '__main__':
    main()
Getting the cookies after logging in via POST:
    cookies = requests.utils.dict_from_cookiejar(resp.cookies)
    print(cookies)
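A sketch of the same login flow with requests.Session, which stores the cookies from the login response and sends them on every later request automatically (URL and form fields reuse the Less-20 example above):

import requests

session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0"})
# log in; any Set-Cookie from the response lands in session.cookies
session.post("http://njeclissi.com:81/Less-20/index.php",
             data={"uname": "admin", "passwd": "admin"})
print(requests.utils.dict_from_cookiejar(session.cookies))
# subsequent requests through the same session carry those cookies
resp = session.get("http://njeclissi.com:81/Less-20/index.php")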
import requests

def main():
    url = "http://njeclissi.com:81/Less-20/index.php"
    header = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    #data = {"uname": "admin", "passwd": "admin"}  # the form data we would POST
    cookie = {"uname": "admin"}
    resp = requests.get(url, headers=header, cookies=cookie)  # pass the cookie via the cookies parameter
    print(resp.content)

if __name__ == '__main__':
    main()