Python Learning ----- Web Scraping Techniques

The requests module

A crawler's job is to extract the data you want from the source code of the target web pages.

Essentially, it simulates a client's request, receives the server's response, and extracts the relevant information according to the programmer's requirements.
(In theory, anything a browser can do, a crawler can do.)

Types of crawlers:
General-purpose crawler:
the crawlers behind search engines

Focused crawler:
a crawler targeting a specific website

How a search engine works:
crawl pages -> store the data -> preprocess -> provide search and ranking services

Types of requests sent to a website
GET
what the browser sends when you type a URL in the address bar

POST
form submission

Example: fetch Baidu's homepage source code and save it locally.

import requests

def main():
    url = "https://www.baidu.com"
    headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    responses = requests.get(url, headers=headers)  # request with custom headers
    # print(responses.content)  # inspect the source code Baidu returned
    with open("baidu.txt", "wb") as f:
        f.write(responses.content)

if __name__ == '__main__':
    main()

responses.status_code: get the status code
responses.headers: get the response headers
responses.url: get the response URL
responses.request.headers: get the request headers
responses.content: get the page content (raw bytes)
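A quick illustration of these attributes (a minimal sketch; the URL is just the Baidu homepage used above, and the exact output will vary):

import requests

def main():
    resp = requests.get("https://www.baidu.com",
                        headers={"User-Agent": "Mozilla/5.0"})
    print(resp.status_code)      # e.g. 200
    print(resp.headers)          # response headers returned by the server
    print(resp.url)              # final URL after any redirects
    print(resp.request.headers)  # headers that were actually sent
    print(len(resp.content))     # size of the raw page body in bytes

if __name__ == '__main__':
    main()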

Query parameters in the URL can be used to make a specific request.

import requests

def main():
    url = "https://www.baidu.com/s"
    # url = "https://www.baidu.com/s?wd={}".format("12306")  # with format() you don't need params

    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    params = {"wd": "12306"}  # the search keyword is 12306
    responses = requests.get(url, headers=headers, params=params)  # request with query parameters
    print(responses.request.url)
    with open("baidu.txt", "wb") as f:
        f.write(responses.content)

if __name__ == '__main__':
    main()

Baidu Tieba crawler

import requests

def main():
    tieba_name = input("Enter the Tieba name: ")
    url_temp = "http://tieba.baidu.com/f?kw=" + tieba_name + "&ie=utf-8&pn={}"
    url_list = []  # list of URLs to crawl
    # crawl 2 pages (50 posts per page, so pn = 0, 50)
    for i in range(2):
        url_list.append(url_temp.format(i * 50))
    # define the request headers
    headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    # request each URL in url_list and save each page to its own file
    for page, url in enumerate(url_list, start=1):
        resp = requests.get(url, headers=headers)
        with open("tieba_page_{}.html".format(page), "wb") as f:
            f.write(resp.content)

if __name__ == '__main__':
    main()

Sending a POST request

import requests

def main():
    url = "http://njeclissi.com:81/Less-20/index.php"
    header = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    data = {"uname": "admin", "passwd": "admin"}  # the POST form data
    resp = requests.post(url, headers=header, data=data)  # send the POST request
    print(resp.content)

if __name__ == '__main__':
    main()

Getting around anti-crawler measures

Techniques for getting around anti-crawler measures
Keep your real IP address from being traced by routing requests through a proxy:

proxies = {"http": "http://1.2.3.4:9999"}
resp = requests.post(url, headers=header, data=data, proxies=proxies)
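A more complete, runnable sketch of a proxied request (the address 1.2.3.4:9999 is only a placeholder from the snippet above; substitute a working proxy before running):

import requests

def main():
    url = "https://www.baidu.com"
    headers = {"User-Agent": "Mozilla/5.0"}
    # placeholder proxy; both http and https traffic are routed through it
    proxies = {"http": "http://1.2.3.4:9999", "https": "http://1.2.3.4:9999"}
    try:
        resp = requests.get(url, headers=headers, proxies=proxies, timeout=5)
        print(resp.status_code)
    except requests.exceptions.RequestException as e:
        print("proxy request failed:", e)

if __name__ == '__main__':
    main()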

Anonymous proxy
The server knows you are using a proxy, but not who you are.

Distorting proxy
The server knows you are using a proxy, but it sees a fake IP address.

High-anonymity (elite) proxy
The server does not know you are using a proxy at all.

How sites detect crawlers:
request volume from a single IP over a period of time
cookie, User-Agent, headers, Referer
(a sketch of one possible mitigation follows below)
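One possible countermeasure, sketched here as an illustration (the User-Agent strings and delay range are arbitrary examples, not from the original post): add a random pause between requests and rotate the User-Agent so that request volume and headers look less uniform.

import random
import time

import requests

# example User-Agent pool; any real browser strings will do
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0",
]

def fetch(url):
    headers = {"User-Agent": random.choice(USER_AGENTS)}  # rotate the UA per request
    resp = requests.get(url, headers=headers)
    time.sleep(random.uniform(1, 3))  # random delay to spread out the request rate
    return resp

if __name__ == '__main__':
    print(fetch("https://www.baidu.com").status_code)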

Handling cookies

import requests

def main():
    url = "http://njeclissi.com:81/Less-20/index.php"
    header = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36","Cookie":"uname=admin"}
    resp = requests.get(url,headers=header)
    print(resp.content)

if __name__ == '__main__':
    main()

Getting the cookies after a POST login:

cookies = requests.utils.dict_from_cookiejar(resp.cookies)
print(cookies)
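Putting it together, a sketch of a login flow using requests.Session, which keeps the cookies returned by the POST and sends them on later requests automatically (the URL and form fields reuse the earlier example; adjust them to the real site):

import requests

def main():
    login_url = "http://njeclissi.com:81/Less-20/index.php"
    headers = {"User-Agent": "Mozilla/5.0"}
    data = {"uname": "admin", "passwd": "admin"}

    session = requests.Session()
    session.post(login_url, headers=headers, data=data)  # log in

    # cookies the server set during login, converted to a plain dict
    cookies = requests.utils.dict_from_cookiejar(session.cookies)
    print(cookies)

    # later requests on the same session carry those cookies automatically
    resp = session.get(login_url, headers=headers)
    print(resp.status_code)

if __name__ == '__main__':
    main()

Cookies can also be passed in explicitly as a dict through the cookies= parameter, as in the example below.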

import requests

def main():
    url = "http://njeclissi.com:81/Less-20/index.php"
    header = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36"}
    # data = {"uname": "admin", "passwd": "admin"}  # the POST form data (not needed when sending the cookie directly)
    cookie = {"uname": "admin"}  # pass the cookie as a plain dict
    resp = requests.get(url, headers=header, cookies=cookie)
    print(resp.content)

if __name__ == '__main__':
    main()


Reposted from blog.csdn.net/weixin_46097280/article/details/104171507