[Advanced Python Crawling Techniques] Advanced requests Usage: Proxies, SSL, and Streaming in One Go

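Hi everyone, Uncle Tang here! Last time we covered the basics of requests. Today we move on to something more hardcore: advanced requests usage. These techniques come in handy when you crawl unusual sites, handle large files, or need security verification.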

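1. Proxy Settings: A Handy Tool for Hiding Your Real IP

1.1 Why use a proxy?

Get around IP-based access limits
Hide your real IP address
Build distributed crawlers

1.2 Configuring proxies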

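import requests

proxies = {
    'http': 'http://10.10.1.10:3128',   # proxy used for http:// URLs
    'https': 'http://10.10.1.10:1080',  # proxy used for https:// URLs
}

# Proxy that requires authentication: credentials go inside the proxy URL
# (user, password, and host here are placeholders)
auth_proxies = {
    'http': 'http://user:password@10.10.1.10:3128/'
}

response = requests.get('http://example.com', proxies=proxies, timeout=5)
print(response.status_code)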
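Credentials embedded in a proxy URL need percent-encoding when they contain characters such as `@` or `:`, otherwise the URL is parsed incorrectly. Here is a minimal sketch of one way to build such a URL. `build_proxy_url` is a hypothetical helper written for this post, not part of requests, and the host and credentials are placeholders.

```python
from urllib.parse import quote


def build_proxy_url(host, port, user=None, password=None, scheme="http"):
    """Return a proxy URL, percent-encoding any credentials.

    Hypothetical helper for illustration; it is not part of requests.
    """
    if user is None:
        return f"{scheme}://{host}:{port}"
    # safe="" makes quote() escape reserved characters such as '@', ':' and '/'
    auth = quote(user, safe="")
    if password is not None:
        auth += ":" + quote(password, safe="")
    return f"{scheme}://{auth}@{host}:{port}"


# Placeholder credentials containing characters that must be escaped
proxy_url = build_proxy_url("10.10.1.10", 3128, "user", "p@ss:word")
proxies = {"http": proxy_url, "https": proxy_url}
print(proxies["http"])
```

The resulting dictionary can be passed as `proxies=` to `requests.get`, just like the hand-written ones above.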

1.3 Proxy pools in practice

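import random

import requests

proxy_pool = [
    'http://123.123.123.123:8888',
    'http://111.111.111.111:9999',
    'http://222.222.222.222:7777'
]

def get_with_random_proxy(url):
    proxy_url = random.choice(proxy_pool)
    # Set both keys so https:// URLs also go through the proxy
    proxy = {'http': proxy_url, 'https': proxy_url}
    try:
        return requests.get(url, proxies=proxy, timeout=3)
    except requests.RequestException:
        return None  # proxy failed; the caller can retry with another one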
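Picking proxies at random works better when dead proxies are removed from the pool. The sketch below shows one possible approach: count failures per proxy and evict a proxy once it reaches a limit. `ProxyPool`, its method names, and the failure threshold are all assumptions made for illustration.

```python
import random


class ProxyPool:
    """Tiny in-memory proxy pool that drops proxies after repeated failures.

    Hypothetical helper for illustration; the threshold is an assumption.
    """

    def __init__(self, proxy_urls, max_failures=3):
        self.failures = {url: 0 for url in proxy_urls}
        self.max_failures = max_failures

    def pick(self):
        """Return a requests-style proxies dict, or None if the pool is empty."""
        if not self.failures:
            return None
        url = random.choice(list(self.failures))
        return {"http": url, "https": url}

    def report_failure(self, url):
        """Count a failure and evict the proxy once it hits the limit."""
        if url not in self.failures:
            return
        self.failures[url] += 1
        if self.failures[url] >= self.max_failures:
            del self.failures[url]

    def report_success(self, url):
        """Reset the failure counter after a successful request."""
        if url in self.failures:
            self.failures[url] = 0


pool = ProxyPool(
    ["http://123.123.123.123:8888", "http://111.111.111.111:9999"],
    max_failures=2,
)
pool.report_failure("http://123.123.123.123:8888")
pool.report_failure("http://123.123.123.123:8888")  # second failure evicts it
print(pool.pick())
```

In a real crawler you would call `report_failure` in the `except` branch and `report_success` after a good response.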

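2. SSL Verification: A Must for Secure Requests

2.1 Why SSL verification matters

Prevents man-in-the-middle attacks
Keeps data secure in transit
Required by some APIs

2.2 Disabling SSL verification (use with caution!)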

# Not recommended in production!
response = requests.get('https://example.com', verify=False)

2.3 Custom CA certificates

# Use a custom CA certificate
response = requests.get('https://example.com',
                        verify='/path/to/certificate.pem')

# Session-level setting
s = requests.Session()
s.verify = '/path/to/certificate.pem'

2.4 Client certificate authentication

# Mutual (two-way) SSL authentication
cert = ('/path/client.cert', '/path/client.key')
response = requests.get('https://example.com', cert=cert)
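The CA bundle and the client certificate can both be set once on a Session so that every request made through it reuses them. A minimal sketch, assuming a hypothetical helper `make_tls_session` and placeholder file paths:

```python
import requests


def make_tls_session(ca_bundle=None, client_cert=None):
    """Build a Session with TLS settings applied once.

    ca_bundle: path to a CA bundle file, or None to keep the default (verify=True).
    client_cert: (cert_path, key_path) tuple for mutual TLS, or None.
    Hypothetical helper for illustration.
    """
    session = requests.Session()
    if ca_bundle is not None:
        session.verify = ca_bundle
    if client_cert is not None:
        session.cert = client_cert
    return session


# The paths below are placeholders, not real files
session = make_tls_session(
    ca_bundle="/path/to/certificate.pem",
    client_cert=("/path/client.cert", "/path/client.key"),
)
```

The files are only read when a request is actually sent, so a wrong path shows up as an error on the first HTTPS call rather than here.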

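3. Streaming Requests: The Right Way to Handle Large Files

3.1 What is a streaming request?

The response body is not downloaded all at once
Data is read in chunks as needed
Saves memory, which suits large files

3.2 Basic streaming reads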

# Example: downloading a large file
url = 'http://example.com/bigfile.zip'
with requests.get(url, stream=True) as r:
    with open('bigfile.zip', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            if chunk:  # skip empty keep-alive chunks
                f.write(chunk)
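Because the data arrives in chunks, you can process it on the fly, for example by computing a checksum while writing, without reading the file a second time. The sketch below works on any iterable of byte chunks, which is what `iter_content` yields. `save_chunks` is a hypothetical helper, and the in-memory buffer stands in for a real file.

```python
import hashlib
import io


def save_chunks(chunks, fileobj):
    """Write an iterable of byte chunks to fileobj; return (size, sha256 hex)."""
    digest = hashlib.sha256()
    size = 0
    for chunk in chunks:
        if not chunk:  # skip empty keep-alive chunks
            continue
        fileobj.write(chunk)
        digest.update(chunk)
        size += len(chunk)
    return size, digest.hexdigest()


# Offline demo: a list of chunks stands in for r.iter_content(...)
buffer = io.BytesIO()
size, checksum = save_chunks([b"hello ", b"", b"world"], buffer)
print(size, checksum)
```

With a real response you would pass `r.iter_content(chunk_size=8192)` as `chunks` and an open binary file as `fileobj`.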

3.3 Enhanced version with a progress bar

def download_file(url, filename):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        # Content-Length may be missing (e.g. chunked responses), giving 0
        total_size = int(r.headers.get('content-length', 0))
        downloaded = 0

        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
                downloaded += len(chunk)
                if total_size:  # avoid dividing by zero when the size is unknown
                    done = min(50, int(50 * downloaded / total_size))
                    print(f"\r[{'=' * done}{' ' * (50 - done)}] "
                          f"{downloaded}/{total_size} bytes", end='')
                else:
                    print(f"\r{downloaded} bytes", end='')

    print(f"\n{filename} download complete!")

download_file('http://example.com/bigfile.zip', 'bigfile.zip')
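The progress-bar arithmetic can also live in a small pure function, which makes the case of a missing Content-Length header easy to handle and to test. A sketch, with `render_progress` as a hypothetical name:

```python
def render_progress(downloaded, total, width=50):
    """Return a one-line progress string; total <= 0 means the size is unknown."""
    if total <= 0:
        return f"{downloaded} bytes"
    # Clamp to width in case more bytes arrive than Content-Length announced
    done = min(width, int(width * downloaded / total))
    return f"[{'=' * done}{' ' * (width - done)}] {downloaded}/{total} bytes"


print(render_progress(50, 100, width=10))
print(render_progress(10, 0))
```

Inside the download loop you would print `"\r" + render_progress(downloaded, total_size)` with `end=''`.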

4. Other Advanced Techniques

4.1 Configuring transport adapters

from requests.adapters import HTTPAdapter

s = requests.Session()
# Set the connection pool size
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=100)
s.mount('http://', adapter)
s.mount('https://', adapter)
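The same adapter can also carry a retry policy. The sketch below combines the pool settings above with urllib3's `Retry` class. The retry count, backoff factor, and status codes are example values, not recommendations.

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Example values: up to 3 retries, growing delays, retry on common 5xx codes
retry = Retry(
    total=3,
    backoff_factor=0.5,
    status_forcelist=[500, 502, 503, 504],
)
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=100, max_retries=retry)

session = requests.Session()
session.mount("http://", adapter)
session.mount("https://", adapter)
```

Every request sent through `session` now goes through this adapter. Note that by default urllib3 only retries idempotent methods such as GET, so POST requests are not retried unless you configure that explicitly.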

4.2 Custom authentication

from requests.auth import AuthBase

class TokenAuth(AuthBase):
    def __init__(self, token):
        self.token = token

    def __call__(self, r):
        r.headers['Authorization'] = f'Bearer {self.token}'
        return r

response = requests.get('https://api.example.com', auth=TokenAuth('your-token'))
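To check what an auth class does without sending anything over the network, you can prepare the request and inspect its headers. A minimal sketch: `BearerAuth` repeats the same idea as `TokenAuth` above so the block is self-contained, and the token is a placeholder.

```python
import requests
from requests.auth import AuthBase


class BearerAuth(AuthBase):
    """Attach a bearer token to every request (same idea as TokenAuth above)."""

    def __init__(self, token):
        self.token = token

    def __call__(self, r):
        r.headers["Authorization"] = f"Bearer {self.token}"
        return r


# prepare() builds the request, including auth, without sending it
prepared = requests.Request(
    "GET", "https://api.example.com", auth=BearerAuth("your-token")
).prepare()
print(prepared.headers["Authorization"])
```

This is handy in unit tests, since no real API is contacted.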

4.3 Event hooks

def print_url(response, *args, **kwargs):
    print(f"Request URL: {response.url}")

hooks = {'response': [print_url]}
requests.get('http://example.com', hooks=hooks)
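Hooks can also be registered on a Session, so they run for every response it receives. The sketch below records the status code and URL of each response. The hook is called by hand on a manually built `Response` object purely so the example runs offline; `record_response` and `seen` are hypothetical names.

```python
import requests

seen = []


def record_response(response, *args, **kwargs):
    """Response hook: remember the status code and URL of each response."""
    seen.append((response.status_code, response.url))


session = requests.Session()
session.hooks["response"].append(record_response)

# Simulate a response offline; during real use requests calls the hook itself
fake = requests.Response()
fake.status_code = 200
fake.url = "http://example.com/"
record_response(fake)
print(seen)
```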

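5. Putting It Together: A Complete Example for Getting Past Anti-Crawling Measures

Now let's combine these techniques into a crawler that can get past common anti-crawling measures: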

import random
import time

import requests
from fake_useragent import UserAgent

class AdvancedCrawler:
    def __init__(self, proxy_pool=None):
        self.session = requests.Session()
        self.ua = UserAgent()
        self.proxy_pool = proxy_pool or []

    def get_with_retry(self, url, max_retry=3):
        for attempt in range(max_retry):
            # Pick a fresh User-Agent and proxy for every attempt
            headers = {'User-Agent': self.ua.random}
            proxies = None
            if self.proxy_pool:
                proxy_url = random.choice(self.proxy_pool)
                # Set both keys so https:// URLs also go through the proxy
                proxies = {'http': proxy_url, 'https': proxy_url}

            try:
                response = self.session.get(
                    url,
                    headers=headers,
                    proxies=proxies,
                    timeout=10,
                    verify=False  # example only; use proper certificates in production
                )

                # Check the response for anti-crawling content
                if 'access denied' in response.text.lower():
                    raise ValueError('Anti-crawling mechanism triggered')

                return response

            except Exception as e:
                print(f"Attempt {attempt + 1} failed: {e}")
                if attempt == max_retry - 1:
                    raise
                time.sleep(2 ** attempt)  # exponential backoff

# Usage example
if __name__ == '__main__':
    proxy_pool = [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080'
    ]

    crawler = AdvancedCrawler(proxy_pool)
    try:
        response = crawler.get_with_retry('https://target-site.com/data')
        print("Success!")
        print(response.text[:500])  # print the first 500 characters
    except Exception as e:
        print("Final failure:", e)
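The retry-with-exponential-backoff part of the crawler can be pulled out into a small reusable function. In the sketch below, `retry_with_backoff` is a hypothetical helper, the optional `jitter` parameter is an addition not present in the crawler above, and the `sleep` argument is injectable so the demo runs without actually waiting.

```python
import random
import time


def retry_with_backoff(func, max_retry=3, base=1.0, jitter=0.0, sleep=time.sleep):
    """Call func() until it succeeds, sleeping base * 2**attempt between tries.

    jitter adds a random extra delay in [0, jitter] seconds; sleep is
    injectable so the logic can be tested without waiting.
    """
    for attempt in range(max_retry):
        try:
            return func()
        except Exception:
            if attempt == max_retry - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base * 2 ** attempt + random.uniform(0, jitter))


calls = []
delays = []


def flaky():
    """Fail twice, then succeed (stands in for a network request)."""
    calls.append(1)
    if len(calls) < 3:
        raise ValueError("simulated failure")
    return "ok"


# delays.append replaces time.sleep, so the delays are recorded, not waited
result = retry_with_backoff(flaky, max_retry=3, sleep=delays.append)
print(result, delays)
```

In real code you would pass something like `lambda: session.get(url, timeout=10)` as `func` and keep the default `sleep`.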

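6. Summary and Recommendations

Today we took a deep dive into advanced requests usage:

Proxy settings: a key tool for getting around access limits
SSL verification: essential for secure communication
Streaming requests: an elegant way to handle large payloads
We also built a combined advanced crawler example

A few tips from Uncle Tang:

Check regularly that your proxy IPs still work
Do not disable SSL verification in production
Pick a sensible chunk_size for streaming requests
Set reasonable timeouts and retry policies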
