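Prerequisites

1. The Python language

Writing crawlers is an intermediate-to-advanced Python topic, so learners should already have a solid grounding in Python programming.

Familiarity with Python's multiprocessing and multithreading, and with regular-expression syntax, will also help you write crawler programs.

Note: Python provides a dedicated re module for regular expressions; for details, see the article "Python re module".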
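
As a small illustration of how the re module helps in a crawler, the sketch below pulls link targets out of an HTML fragment. The sample HTML string and the name extract_links are made up for this example, and a pattern this simple only handles double-quoted href attributes; it is not a general HTML parser.

```python
import re

# Matches href="..." and captures the value between the double quotes.
HREF_PATTERN = re.compile(r'href="([^"]*)"')


def extract_links(html):
    """Return every double-quoted href value found in html, in document order."""
    return HREF_PATTERN.findall(html)


# Hypothetical sample fragment, for illustration only.
sample_html = '<a href="http://www.baidu.com/">Baidu</a> <a href="/about">About</a>'
links = extract_links(sample_html)
print(links)
```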

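2. Web front end

A basic understanding of Web front-end technologies such as HTML, CSS, and JavaScript helps you analyze the structure of a page and extract the useful information from it.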
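
To show why knowing the HTML structure matters, here is a minimal sketch that reads the page title and the link targets from a small document using html.parser from the standard library. The class name TitleAndLinkParser and the sample page are assumptions for this example; real pages are messier, and many crawlers use a third-party parser instead.

```python
from html.parser import HTMLParser


class TitleAndLinkParser(HTMLParser):
    """Collect the text of <title> and the href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ''
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == 'title':
            self.in_title = True
        elif tag == 'a':
            href = dict(attrs).get('href')
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == 'title':
            self.in_title = False

    def handle_data(self, data):
        # Only text that appears between <title> and </title> is kept.
        if self.in_title:
            self.title += data


# Hypothetical sample page, for illustration only.
sample_html = """<html><head><title>Demo page</title></head>
<body><a href="/news">News</a><a href="/map">Map</a></body></html>"""

parser = TitleAndLinkParser()
parser.feed(sample_html)
print(parser.title, parser.links)
```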

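3. The HTTP protocol

Learn the OSI seven-layer network model and become familiar with TCP/IP and HTTP. This knowledge helps you understand network requests (GET and POST) and the basic principles of network transmission, and it also makes the logic of a crawler program easier to follow.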
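
The difference between GET and POST is easiest to see in the text of the request itself. The sketch below builds a simplified HTTP/1.1 request message by hand: for GET the parameters travel in the URL's query string, and for POST they travel in the body. The function build_http_request and the address http://example.com/search are hypothetical, and a real client sends more headers than these.

```python
from urllib.parse import urlencode, urlsplit


def build_http_request(method, url, form=None):
    """Return a simplified HTTP/1.1 request message as a string."""
    parts = urlsplit(url)
    path = parts.path or '/'
    body = ''
    if form and method == 'GET':
        # GET: parameters are appended to the path as a query string.
        path += '?' + urlencode(form)
    elif form:
        # POST: parameters are form-encoded and placed in the body.
        body = urlencode(form)

    lines = [f'{method} {path} HTTP/1.1', f'Host: {parts.netloc}']
    if body:
        lines.append('Content-Type: application/x-www-form-urlencoded')
        lines.append(f'Content-Length: {len(body.encode())}')
    # A blank line separates the headers from the (possibly empty) body.
    return '\r\n'.join(lines) + '\r\n\r\n' + body


get_message = build_http_request('GET', 'http://example.com/search', {'q': 'python'})
post_message = build_http_request('POST', 'http://example.com/search', {'q': 'python'})
print(get_message)
print(post_message)
```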

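Environment setup

Before writing a Python crawler you need to set up a development environment, which is straightforward. First install Python on your computer, then download and install the PyCharm IDE (integrated development environment).

A first crawler program

The following uses Python's requests module to fetch the HTML of a web page.

Fetching a page's HTML

Getting the response object
Send a request to Baidu (http://www.baidu.com/) and fetch the HTML of the Baidu home page. The code is as follows: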

import requests

if __name__ == "__main__":
    # Send a request to the URL; the URL must be complete, including the scheme
    # get() returns a response object
    response = requests.get('http://www.baidu.com/')
    # Get the data: text returns the response body as a string
    page_text = response.text
    print(page_text)
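
A slightly more defensive variant of the same request is sketched below. It adds a timeout, turns HTTP error statuses into exceptions, and re-guesses the text encoding when requests has fallen back to ISO-8859-1, which it does for text responses that declare no charset and which can garble Chinese pages. The function name fetch_html and the 10-second default are assumptions for this example.

```python
import requests


def fetch_html(url, timeout=10):
    """Fetch url and return its body as text, raising on HTTP errors."""
    response = requests.get(url, timeout=timeout)
    # Raise requests.exceptions.HTTPError for 4xx and 5xx responses.
    response.raise_for_status()
    # With no declared charset, requests assumes ISO-8859-1 for text content;
    # in that case, let it guess the encoding from the body instead.
    if response.encoding is None or response.encoding.lower() == 'iso-8859-1':
        response.encoding = response.apparent_encoding
    return response.text
```

Calling fetch_html('http://www.baidu.com/') would then return the home page HTML as a string, or raise an exception if the request fails.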

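Common methods

Method	Description
requests.request()	Constructs a request; the base method that underlies all the methods below
requests.get()	The main method for fetching an HTML page; corresponds to HTTP GET
requests.head()	Fetches only the header information of an HTML page; corresponds to HTTP HEAD
requests.post()	Submits a POST request to a page; corresponds to HTTP POST
requests.put()	Submits a PUT request to a page; corresponds to HTTP PUT
requests.patch()	Submits a partial-modification request to a page; corresponds to HTTP PATCH
requests.delete()	Submits a delete request to a page; corresponds to HTTP DELETE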

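Calling one of the Requests library's methods returns an object. Two objects are involved: the request object and the response object.

The request object describes what we send, chiefly the URL we are requesting; the response object holds what the server returns.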
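
The two objects can be inspected side by side. In requests, the response keeps a reference to the request that produced it in its request attribute, so a helper like the one below can report both halves of the exchange. The name summarize_exchange and the choice of fields are assumptions for this sketch.

```python
import requests


def summarize_exchange(response):
    """Return a few facts about a response and the request that produced it."""
    sent = response.request  # the prepared request that was actually sent
    return {
        'request_method': sent.method,
        'request_url': sent.url,
        'status_code': response.status_code,
        'content_type': response.headers.get('Content-Type'),
    }
```

For example, summarize_exchange(requests.get('http://www.baidu.com/')) would report the GET method and the requested URL alongside the status code and content type the server sent back.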

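User-Agent and proxies

GET method

The code is as follows; the only addition is a User-Agent (UA) header, passed through the headers parameter: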

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
r = requests.get('https://www.baidu.com', headers=headers)
print(r.status_code)
print(r.text)
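
The reason for setting this header is that, without it, requests identifies itself with a User-Agent beginning with python-requests/, which a site can easily recognize as a script. The sketch below prepares a request without sending it and compares the header in both cases. The helper name effective_user_agent is an assumption for this example.

```python
import requests

CUSTOM_UA = ('Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 '
             '(KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36')


def effective_user_agent(headers=None):
    """Return the User-Agent a Session would send for a GET with these headers."""
    session = requests.Session()
    request = requests.Request('GET', 'https://www.baidu.com', headers=headers)
    # prepare_request merges the session's default headers with ours;
    # nothing is sent over the network.
    prepared = session.prepare_request(request)
    return prepared.headers['User-Agent']


default_ua = effective_user_agent()
custom_ua = effective_user_agent({'User-Agent': CUSTOM_UA})
print(default_ua)
print(custom_ua)
```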

POST method

Simply change get to post:

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
r = requests.post('https://www.baidu.com', headers=headers)
print(r.status_code)
print(r.text)

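Try running it. POST is generally used to submit form data, so here we pick a URL that accepts submitted data and POST to it.
data is the payload to POST; it is passed to the post method through the data parameter.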

import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
# Data to POST
data = {
    "info": "biu~~~ send post request"}
r = requests.post('http://dev.kdlapi.com/testproxy', headers=headers, data=data)  # add a data parameter
print(r.status_code)
print(r.text)

An HTTP status code of 200 indicates that the POST succeeded.
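
What the data parameter actually does can be checked without contacting the server. The sketch below prepares two POST requests to the same test URL: with data= the dictionary is form-encoded, while with the json= parameter it is serialized as JSON, and the Content-Type header changes accordingly. The payload value 'hello' is a made-up example.

```python
import json

import requests

URL = 'http://dev.kdlapi.com/testproxy'  # the test URL from the text; not contacted here
payload = {'info': 'hello'}  # hypothetical payload for illustration

# prepare() builds the request that would be sent, without sending it.
form_request = requests.Request('POST', URL, data=payload).prepare()
json_request = requests.Request('POST', URL, json=payload).prepare()

print(form_request.headers['Content-Type'], form_request.body)
print(json_request.headers['Content-Type'], json_request.body)
```

Which form a server expects depends on the server, so this is worth checking when a POST returns 200 but the submitted data seems to be ignored.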

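Using a proxy

Most websites have blocking policies. If you crawl from your own IP address and it gets banned, you can no longer reach the site. A proxy IP solves this problem: any ban then falls on the proxy's IP rather than on your own machine's. To use a proxy, you first need to find a proxy IP.
PS: Writing your own proxy server is too much trouble, so the example below uses a ready-made proxy IP.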

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'}
# Data to POST
data = {
    "info": "biu~~~ send post request"}

# Proxy information, sponsored by Kuaidaili (快代理)
proxy = '115.203.28.25:16584'
proxies = {
    "http": "http://%(proxy)s/" % {'proxy': proxy},
    "https": "http://%(proxy)s/" % {'proxy': proxy},
}

r = requests.post('http://dev.kdlapi.com/testproxy', headers=headers, data=data, proxies=proxies)  # add a proxies parameter
print(r.status_code)
print(r.text)
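
Building the proxies dictionary can be factored into a small helper, sketched below. The name build_proxies and its optional username and password arguments are assumptions for this example; requests accepts credentials embedded in the proxy URL in the form http://user:password@host:port/, and they are percent-encoded here so that special characters do not break the URL.

```python
from urllib.parse import quote


def build_proxies(host_port, username=None, password=None):
    """Return a requests-style proxies dict for an HTTP proxy at host_port."""
    auth = ''
    if username:
        # Percent-encode credentials so characters such as '@' stay unambiguous.
        auth = quote(username, safe='') + ':' + quote(password or '', safe='') + '@'
    proxy_url = 'http://' + auth + host_port + '/'
    # The same HTTP proxy is used for both http:// and https:// target URLs.
    return {'http': proxy_url, 'https': proxy_url}


proxies = build_proxies('115.203.28.25:16584')
print(proxies)
```

The result can be passed directly as the proxies argument of requests.get or requests.post.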

Configuring a single proxy IP versus a pool of proxy IPs

Single-IP proxy mode

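import requests
# Keys are lowercase URL schemes; an uppercase 'HTTPS' key would not match
# any request URL and the proxy would be silently skipped.
proxy = {
    'https': 'http://162.105.30.101:8080'
}
url = 'URL to crawl'  # placeholder: replace with the address you want to fetch
response = requests.get(url, proxies=proxy)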

Multi-IP proxy mode

import requests
# Import random to pick a proxy from the pool at random
import random
proxy = [
    {
        'http': 'http://61.135.217.7:80',
        'https': 'http://61.135.217.7:80',
    },
    {
        'http': 'http://118.114.77.47:8080',
        'https': 'http://118.114.77.47:8080',
    },
    {
        'http': 'http://112.114.31.177:808',
        'https': 'http://112.114.31.177:808',
    },
    {
        'http': 'http://183.159.92.117:18118',
        'https': 'http://183.159.92.117:18118',
    },
    {
        'http': 'http://110.73.10.186:8123',
        'https': 'http://110.73.10.186:8123',
    },
]
url = 'URL to crawl'  # placeholder: replace with the address you want to fetch
response = requests.get(url, proxies=random.choice(proxy))
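
Free proxy IPs often stop working, so a single random choice can fail. One possible extension, sketched below, tries several different proxies from the pool in random order and returns the first successful response. The function name fetch_with_proxy_pool, the limit of three attempts, and the 5-second timeout are assumptions for this example, not part of the original code.

```python
import random

import requests


def fetch_with_proxy_pool(url, proxy_pool, max_attempts=3, timeout=5):
    """Try up to max_attempts distinct proxies; return the first good response."""
    # Pick distinct proxies in random order so one dead proxy is not retried.
    candidates = random.sample(proxy_pool, k=min(max_attempts, len(proxy_pool)))
    last_error = None
    for proxies in candidates:
        try:
            response = requests.get(url, proxies=proxies, timeout=timeout)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as exc:
            # Covers proxy errors, timeouts, and HTTP error statuses alike.
            last_error = exc
    raise RuntimeError('all proxies failed for ' + url) from last_error
```

Called as fetch_with_proxy_pool(url, proxy) with the list defined above, it behaves like the single random choice when the first proxy works and falls back to others when it does not.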
