Hello everyone, Uncle Tang here! Last time we covered the basics of requests; today we're going for something more hardcore: the advanced side of requests. These techniques come in very handy when you're scraping tricky sites, handling large files, or dealing with security verification!
1. Proxy Settings: Hiding Your Real IP
1.1 Why use a proxy?
- Get around IP-based access limits
- Hide your real IP address
- Enable distributed crawling
1.2 How to configure a proxy
import requests

proxies = {
    'http': 'http://10.10.1.10:3128',   # HTTP proxy
    'https': 'http://10.10.1.10:1080',  # HTTPS proxy
}

# Proxy with authentication
auth_proxies = {
    'http': 'http://user:pass@10.10.1.10:3128/'
}

response = requests.get('http://example.com', proxies=proxies, timeout=5)
print(response.status_code)
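If you want every request in a session to go through the same proxy, you can also set proxies at the session level (note that requests will additionally pick up proxies from the HTTP_PROXY / HTTPS_PROXY environment variables unless trust_env is turned off). A quick sketch:

import requests

s = requests.Session()
s.proxies.update({
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
})
# Every request made through this session now goes via the proxies above
response = s.get('http://example.com', timeout=5)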
1.3 Practical tips: a proxy pool
import random
import requests

proxy_pool = [
    'http://123.123.123.123:8888',
    'http://111.111.111.111:9999',
    'http://222.222.222.222:7777'
]

def get_with_random_proxy(url):
    proxy = {'http': random.choice(proxy_pool)}
    try:
        return requests.get(url, proxies=proxy, timeout=3)
    except requests.RequestException:
        return None  # Proxy failed; let the caller decide what to do
2. SSL Verification: A Must-Know for Secure Requests
2.1 Why SSL verification matters
- Prevents man-in-the-middle attacks
- Keeps data transmission secure
- A hard requirement for some APIs
2.2 Disabling SSL verification (use with caution!)
# Not recommended in production!
response = requests.get('https://example.com', verify=False)
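One more thing worth knowing: with verify=False, urllib3 (which requests uses under the hood) emits an InsecureRequestWarning for every request. If you really must disable verification, say in a throwaway test script, you can silence the warning like this:

import urllib3
import requests

# Suppress the InsecureRequestWarning that verify=False triggers (still not safe for production)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
response = requests.get('https://example.com', verify=False)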
2.3 Custom CA certificates
# Use a custom certificate
response = requests.get('https://example.com',
                        verify='/path/to/certificate.pem')

# Session-level setting
s = requests.Session()
s.verify = '/path/to/certificate.pem'
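For reference, when you don't pass verify at all, requests validates certificates against the CA bundle shipped with the certifi package; you can check where that bundle lives like this:

import certifi

print(certifi.where())  # path to the default CA bundle used by requests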
2.4 Client certificate authentication
# Mutual (two-way) SSL authentication
cert = ('/path/client.cert', '/path/client.key')
response = requests.get('https://example.com', cert=cert)
3. Streaming Requests: The Right Way to Handle Large Files
3.1 What is a streaming request?
- The response body is not downloaded all at once
- Data is read in chunks, on demand
- Saves memory, ideal for large files
3.2 Basic streaming reads
# Example: downloading a large file
import requests

url = 'http://example.com/bigfile.zip'
with requests.get(url, stream=True) as r:
    with open('bigfile.zip', 'wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            if chunk:  # Filter out keep-alive chunks
                f.write(chunk)
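Streaming isn't just for binary downloads. For line-oriented responses (logs, NDJSON feeds and the like), requests also offers iter_lines(), which yields one line at a time; here's a minimal sketch with a placeholder URL:

import requests

# Stream a line-oriented response without loading the whole body into memory
with requests.get('http://example.com/stream', stream=True) as r:
    for line in r.iter_lines(decode_unicode=True):
        if line:  # skip keep-alive blank lines
            print(line)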
3.3 An enhanced version with a progress bar
import requests

def download_file(url, filename):
    with requests.get(url, stream=True) as r:
        r.raise_for_status()
        total_size = int(r.headers.get('content-length', 0))
        downloaded = 0
        with open(filename, 'wb') as f:
            for chunk in r.iter_content(chunk_size=8192):
                f.write(chunk)
                downloaded += len(chunk)
                # Guard against a missing Content-Length header
                done = int(50 * downloaded / total_size) if total_size else 0
                print(f"\r[{'=' * done}{' ' * (50 - done)}] {downloaded}/{total_size} bytes", end='')
    print(f"\n{filename} downloaded!")

download_file('http://example.com/bigfile.zip', 'bigfile.zip')
4. Other Advanced Tricks
4.1 Configuring connection adapters
import requests
from requests.adapters import HTTPAdapter

s = requests.Session()
# Set the connection pool size
adapter = HTTPAdapter(pool_connections=10, pool_maxsize=100)
s.mount('http://', adapter)
s.mount('https://', adapter)
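The adapter is also the standard place to hang automatic retries, which pairs nicely with the timeout advice later in this post. Here's a small sketch using urllib3's Retry class (the parameter values are only examples, tune them to your own needs):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

s = requests.Session()
# Retry up to 3 times on connection errors and 5xx responses, with exponential backoff
retries = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503, 504])
adapter = HTTPAdapter(max_retries=retries, pool_connections=10, pool_maxsize=100)
s.mount('http://', adapter)
s.mount('https://', adapter)
response = s.get('http://example.com', timeout=5)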
4.2 Custom authentication
import requests
from requests.auth import AuthBase

class TokenAuth(AuthBase):
    def __init__(self, token):
        self.token = token

    def __call__(self, r):
        r.headers['Authorization'] = f'Bearer {self.token}'
        return r

response = requests.get('https://api.example.com', auth=TokenAuth('your-token'))
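The same auth object can also be attached to a session (reusing the TokenAuth class above), so every request made through that session carries the token automatically:

s = requests.Session()
s.auth = TokenAuth('your-token')
response = s.get('https://api.example.com')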
4.3 Event hooks
import requests

def print_url(response, *args, **kwargs):
    print(f"Request URL: {response.url}")

hooks = {'response': [print_url]}
requests.get('http://example.com', hooks=hooks)
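Hooks can be registered on a session too, which is handy if you want every response checked automatically; the sketch below simply calls raise_for_status on each response:

import requests

s = requests.Session()
# Raise an exception whenever a response comes back with a 4xx/5xx status
s.hooks['response'].append(lambda r, *args, **kwargs: r.raise_for_status())
response = s.get('http://example.com')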
5. Putting It All Together: Beating Common Anti-Scraping Measures
Now let's combine these advanced techniques into a crawler that can get past common anti-scraping defenses:
import random
import time

import requests
from fake_useragent import UserAgent

class AdvancedCrawler:
    def __init__(self, proxy_pool=None):
        self.session = requests.Session()
        self.ua = UserAgent()
        self.proxy_pool = proxy_pool or []

    def get_with_retry(self, url, max_retry=3):
        headers = {'User-Agent': self.ua.random}
        proxies = {'http': random.choice(self.proxy_pool)} if self.proxy_pool else None
        for attempt in range(max_retry):
            try:
                response = self.session.get(
                    url,
                    headers=headers,
                    proxies=proxies,
                    timeout=10,
                    verify=False  # For demonstration only; use proper certificates in production
                )
                # Check whether we hit an anti-scraping page
                if 'access denied' in response.text.lower():
                    raise ValueError('Anti-scraping mechanism triggered')
                return response
            except Exception as e:
                print(f"Attempt {attempt + 1} failed: {e}")
                if attempt == max_retry - 1:
                    raise
                time.sleep(2 ** attempt)  # Exponential backoff
# Usage example
if __name__ == '__main__':
    proxy_pool = [
        'http://proxy1.example.com:8080',
        'http://proxy2.example.com:8080'
    ]
    crawler = AdvancedCrawler(proxy_pool)
    try:
        response = crawler.get_with_retry('https://target-site.com/data')
        print("Fetched successfully!")
        print(response.text[:500])  # Print the first 500 characters
    except Exception as e:
        print("Failed after all retries:", e)
6. Summary and Recommendations
Today we took a deep dive into the advanced side of requests:
- Proxy settings: a powerful way to get around access restrictions
- SSL verification: the key to secure communication
- Streaming requests: an elegant way to handle large downloads
- Plus a complete advanced crawler that ties it all together
A few tips from Uncle Tang:
- Check your proxy IPs regularly to make sure they still work (see the sketch after this list)
- Never disable SSL verification in production
- Pick a sensible chunk_size when streaming
- Set reasonable timeouts and retry policies
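On that first tip, here is a minimal sketch of what a periodic proxy health check might look like. It simply fires a test request through each proxy and keeps the ones that respond; the test URL and timeout are placeholder assumptions:

import requests

def check_proxies(proxy_pool, test_url='http://httpbin.org/ip', timeout=5):
    """Return only the proxies that can still complete a test request."""
    alive = []
    for proxy in proxy_pool:
        try:
            r = requests.get(test_url, proxies={'http': proxy, 'https': proxy}, timeout=timeout)
            if r.status_code == 200:
                alive.append(proxy)
        except requests.RequestException:
            pass  # Dead or slow proxy, drop it
    return alive

# Run this periodically (e.g. from a scheduled job) and refresh proxy_pool with the result
proxy_pool = check_proxies(['http://123.123.123.123:8888', 'http://111.111.111.111:9999'])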