Python web scraping: the urllib library's modules and how to use them

Contents

The urllib.request module
Applications

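The urllib.request module

urllib.request is the Python standard-library module for opening URLs. It provides functions and classes for sending HTTP requests and reading the responses. Its main features are described below.

Opening a URL:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
- Opens the given URL.
- url is the URL to open, either a string or a Request object.
- data is optional request data. It must be bytes (or a file-like object or iterable of bytes); when it is supplied, an HTTP request is sent as POST instead of GET.
- timeout is an optional timeout in seconds for blocking operations such as the connection attempt.
- cafile, capath and cadefault specify CA certificates for SSL/TLS connections. They have been deprecated since Python 3.6 and were removed in Python 3.13; use context instead.
- context is an ssl.SSLContext instance describing the SSL options.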
```
from urllib.request import urlopen

with urlopen('https://www.example.com') as response:
    html = response.read()
    print(html)
```
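Network requests can fail or hang, so a call to urlopen is usually given a timeout and wrapped in error handling. The following is a minimal sketch of that pattern. The helper name fetch and the 10-second default are assumptions made for illustration; they are not part of urllib.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    `fetch` is a hypothetical helper; the timeout default is an assumption.
    """
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except HTTPError as err:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"HTTP error {err.code}: {err.reason}")
    except URLError as err:
        # No usable response: DNS failure, refused connection, unknown scheme.
        print(f"Could not open {url}: {err.reason}")
    except socket.timeout:
        # A slow read can raise a bare timeout instead of URLError.
        print(f"Timed out after {timeout} seconds")
    return None
```

Calling fetch('https://www.example.com') would return the page bytes on success and None on failure, so the caller can test the result instead of handling exceptions itself.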

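Sending an HTTP request:

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
- Builds an HTTP request object that can be passed to urlopen.
- url is the URL to request.
- data is optional request data, as bytes.
- headers is an optional dictionary of request headers.
- method is an optional request method string, such as 'GET' or 'POST'.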
```
from urllib.request import Request, urlopen

req = Request('https://www.example.com', headers={'User-Agent': 'Mozilla/5.0'})
with urlopen(req) as response:
    html = response.read()
    print(html)
```
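A Request object also determines which HTTP method is used. As a small sketch (nothing is sent over the network here), the following shows how the method follows from the data and method arguments, and how a header can be added after construction. The variable names are illustrative only.

```python
from urllib.request import Request

# Without data, the request defaults to GET.
plain = Request("https://www.example.com")

# Supplying data (bytes) switches the default to POST.
with_data = Request("https://www.example.com", data=b"q=python")

# The method argument overrides the default explicitly.
head = Request("https://www.example.com", method="HEAD")
head.add_header("User-Agent", "Mozilla/5.0")

print(plain.get_method())      # GET
print(with_data.get_method())  # POST
print(head.get_method())       # HEAD

# Request stores header names capitalized, so look them up as "User-agent".
print(head.get_header("User-agent"))  # Mozilla/5.0
```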

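Handling the response:

The HTTPResponse object:

For HTTP and HTTPS URLs, urlopen returns an http.client.HTTPResponse instance.
It provides methods and attributes for reading the response body, the response headers and the status code.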

```
from urllib.request import urlopen

with urlopen('https://www.example.com') as response:
    status_code = response.getcode()   # HTTP status code, e.g. 200
    headers = response.getheaders()    # list of (name, value) tuples
    html = response.read()             # response body as bytes
    print(f"Status Code: {status_code}")
    print(f"Headers: {headers}")
    print(html)
```
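Individual header values can also be read through the response's headers attribute, which behaves like an email-style message object. The sketch below is only an illustration: describe is a hypothetical helper, and a data: URL is used as a stand-in so the example runs without network access. A real HTTP response exposes the same headers attribute.

```python
from urllib.request import urlopen


def describe(response):
    """Summarize a response returned by urlopen (hypothetical helper)."""
    headers = response.headers
    return {
        "url": response.url,
        # Media type without parameters, e.g. "text/html".
        "content_type": headers.get_content_type(),
        # Header lookup is case-insensitive; the value is a string or None.
        "length": headers.get("Content-Length"),
    }


# A data: URL keeps this sketch runnable offline.
with urlopen("data:text/plain;charset=utf-8,hello") as response:
    info = describe(response)

print(info)
```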

These are the main functions and classes of the urllib.request module. With them, a Python program can send URL requests and retrieve the content of remote resources.

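Applications

Reading and displaying a web page

To read and display the content of a web page, use Python's urllib.request module as follows:

**Import the urllib.request module:** First, import urllib.request, which provides the functions for opening URLs.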
```
import urllib.request
```
**Specify the URL of the page to read:** Set the URL of the page you want to read in your code.
```
url = 'http://www.example.com'
```
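Replace 'http://www.example.com' with the URL of the page you are interested in.
**Open the URL and get a response object:** Call urllib.request.urlopen with the URL. It returns a file-like response object from which the page content can be read.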
```
with urllib.request.urlopen(url) as response:
    # Work with the page content here
    pass
```
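The with statement ensures the response object is closed automatically once the block finishes, which is good practice.
**Read the page content:** Call the response object's read() method to read the content of the page.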
```
with urllib.request.urlopen(url) as response:
    web_content = response.read()
```
web_content now holds the page content as bytes.
**Convert the bytes to a string and display it:** Use decode() to convert the bytes to a string, then print it.
```
with urllib.request.urlopen(url) as response:
    web_content = response.read()
    print(web_content.decode('utf-8'))
```
This assumes the page is encoded as UTF-8. If the page uses a different encoding, pass that encoding to decode() instead.
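When the encoding is not known in advance, one option is to read the charset the server declares in its Content-Type header and fall back to UTF-8 only if none is given. The following is a sketch under that assumption. read_text is a hypothetical helper, and the data: URL stands in for a real page so the example runs offline.

```python
from urllib.request import urlopen


def read_text(url, fallback="utf-8"):
    """Fetch a URL and decode it with the charset declared by the server."""
    with urlopen(url) as response:
        # get_content_charset() returns None when no charset is declared.
        charset = response.headers.get_content_charset() or fallback
        # errors="replace" avoids an exception on bytes that do not decode.
        return response.read().decode(charset, errors="replace")


# Offline demonstration: this data: URL declares ISO-8859-1, and %E9 is
# the single byte for "e with acute accent" in that encoding.
text = read_text("data:text/plain;charset=iso-8859-1,caf%E9")
```

Pages that declare their encoding only inside the HTML (for example in a meta tag) are not covered by this sketch.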

Complete example:

```
import urllib.request

url = 'http://www.example.com'

with urllib.request.urlopen(url) as response:
    web_content = response.read()
    print(web_content.decode('utf-8'))
```

This code opens the given URL, reads the page content, and prints it to the console as a string.

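Submitting parameters to a web page

To submit parameters to a web page, you can send an HTTP POST request. The steps below use the third-party requests library, which is not part of urllib:

**Install requests:** If requests is not installed yet, install it with: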
```
pip install requests
```
**Import the requests module:** Import requests in your Python script.
```
import requests
```
**Specify the URL to submit to:** Set the URL of the page that will receive the parameters.
```
url = 'http://www.example.com/post_endpoint'
```
Replace 'http://www.example.com/post_endpoint' with the actual address you are submitting to.
**Prepare the parameters:** Create a dictionary containing the parameters you want to submit.
```
payload = {'param1': 'value1', 'param2': 'value2'}
```
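Here payload is a dictionary with two parameters, param1 and param2, and their values.
**Send the POST request with the parameters:** Call requests.post and pass the dictionary through the data argument; requests form-encodes it into the request body.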
```
response = requests.post(url, data=payload)
```
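Here url is the address specified above and data is the dictionary of parameters to submit.
**Check the response:** Inspect the server's response to see whether the request succeeded.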
```
if response.status_code == 200:
    print('Request succeeded!')
    print('Response body:', response.text)
else:
    print(f'Request failed, status code: {response.status_code}')
```
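response.status_code holds the HTTP status code of the response; 200 means the request succeeded. Handle other status codes as your application requires.

Complete example:

```
import requests

url = 'http://www.example.com/post_endpoint'

payload = {'param1': 'value1', 'param2': 'value2'}

response = requests.post(url, data=payload)

if response.status_code == 200:
    print('Request succeeded!')
    print('Response body:', response.text)
else:
    print(f'Request failed, status code: {response.status_code}')
```

This code submits the parameters to the given URL with a POST request and prints the server's response.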
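The same form submission can also be sketched with the standard library alone, using urllib.parse.urlencode to build the request body. This is an illustrative alternative to the requests version above, not a replacement for it. build_form_post and send are hypothetical helper names, and the endpoint is the same placeholder URL.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_form_post(url, params):
    """Build (but do not send) a form-encoded POST request."""
    # urlencode yields "param1=value1&param2=value2"; data must be bytes.
    body = urlencode(params).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def send(request, timeout=10):
    """Send a prepared request and return (status, body bytes)."""
    with urlopen(request, timeout=timeout) as response:
        return response.status, response.read()


# Placeholder URL and parameters, as in the requests example.
req = build_form_post(
    "http://www.example.com/post_endpoint",
    {"param1": "value1", "param2": "value2"},
)
print(req.get_method(), req.data)
```

send(req) is deliberately not called here because the endpoint is a placeholder. Against a real endpoint it would return the status code and the raw response body.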

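Accessing a page through an HTTP proxy

To access a page through an HTTP proxy, you can use the requests library and configure the proxy on the request. The steps are:

**Install requests:** If requests is not installed yet, install it with: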
```
pip install requests
```
**Import the requests module:** Import requests in your Python script.
```
import requests
```
**Specify the URL to access:** Set the URL of the page you want to access.
```
url = 'http://www.example.com'
```
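Replace 'http://www.example.com' with the actual address you want to access.
**Specify the proxy:** Set the address of the proxy server. The dictionary keys are the scheme of the target URL (http or https), and each value is the proxy to use for that scheme; which proxy addresses to use depends on your proxy setup.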
```
proxy = {
    'http': 'http://your_http_proxy_address',
    'https': 'http://your_https_proxy_address'
}
```
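Replace your_http_proxy_address and your_https_proxy_address with the addresses of the proxy servers you actually use.
**Send the request through the proxy:** Pass the dictionary through the proxies argument of requests.get, requests.post, or a similar function.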
```
response = requests.get(url, proxies=proxy)
```
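Here url is the address specified above and proxies is the dictionary of proxy addresses.
**Check the response:** Inspect the server's response to see whether the request succeeded.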
```
if response.status_code == 200:
    print('Request succeeded!')
    print('Response body:', response.text)
else:
    print(f'Request failed, status code: {response.status_code}')
```
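response.status_code holds the HTTP status code of the response; 200 means the request succeeded. Handle other status codes as your application requires.

Complete example:

```
import requests

url = 'http://www.example.com'

proxy = {
    'http': 'http://your_http_proxy_address',
    'https': 'http://your_https_proxy_address'
}

response = requests.get(url, proxies=proxy)

if response.status_code == 200:
    print('Request succeeded!')
    print('Response body:', response.text)
else:
    print(f'Request failed, status code: {response.status_code}')
```

Note that the exact proxy settings depend on your network environment and proxy type. Make sure you use the correct proxy information for your own network.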
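A proxy can also be configured with urllib.request itself, by building an opener around a ProxyHandler. The sketch below is one way to do that with the standard library. The proxy address 127.0.0.1:8080 and the helper name fetch_via_proxy are assumptions for illustration and need to be replaced with real values.

```python
from urllib.request import ProxyHandler, build_opener

# Hypothetical proxy address; replace it with a proxy you actually use.
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}

# The opener routes http and https requests through the proxies above.
opener = build_opener(ProxyHandler(proxies))


def fetch_via_proxy(url, timeout=10):
    """Fetch a URL through the configured proxy and return the body bytes."""
    with opener.open(url, timeout=timeout) as response:
        return response.read()
```

Using opener.open affects only requests made through this opener. Calling urllib.request.install_opener(opener) would instead make it the default for every later urlopen call.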