Python: A Comprehensive Guide to Web Crawling

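I. urllib

1. Introduction to request

request is the most basic HTTP request module. It simulates sending a request, just as if you typed a URL into a browser and pressed Enter: pass a URL and any extra parameters to the library's methods and the request is sent.

1.1 urlopen

The urllib.request module simulates the way a browser issues requests, and it also handles authentication, redirection, browser cookies, and more.

Basic usage: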

import urllib.request

response = urllib.request.urlopen("https://www.python.org/")
print(response.read().decode('utf-8'))

Calling urlopen with only a URL sends a GET request. Use type() to inspect the type of the response:

print(type(response))

Output: <class 'http.client.HTTPResponse'>

So the response is an object of type HTTPResponse.

Its attributes and methods give access to the status code and the response headers:

print(response.status)  # status code of the response
print(response.getheaders())  # all response headers as a list of (name, value) tuples
print(response.getheader("Server"))  # value of the Server response header
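The headers can also decide how the body is decoded, instead of hard-coding 'utf-8' as in the basic example. The following is a minimal sketch, not part of the original tutorial: the helper name read_text and its default_charset parameter are hypothetical, and it assumes the server declares its charset in the Content-Type header.

```python
import urllib.request


def read_text(url, default_charset="utf-8"):
    """Fetch url and decode the body using the charset from Content-Type."""
    # The with statement closes the response even if decoding fails.
    with urllib.request.urlopen(url) as response:
        # response.headers behaves like an email.message.Message;
        # get_content_charset() returns None when no charset is declared.
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)


# Example call (needs network access):
# print(read_text("https://www.python.org/")[:200])
```

Falling back to a default charset keeps the helper usable for servers that send a bare Content-Type without a charset parameter.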

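The urlopen API signature:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Note that cafile, capath, and cadefault were deprecated in Python 3.6 and removed in Python 3.13; pass an ssl.SSLContext through context instead.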
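Of these parameters, timeout is the one most crawlers set on every call. The sketch below shows one way to combine it with error handling; the function name fetch_text and the 5-second default are illustrative assumptions, not something the original article prescribes.

```python
import socket
import urllib.error
import urllib.request


def fetch_text(url, timeout=5.0):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404 or 500.
        # HTTPError is a subclass of URLError, so it must be caught first.
        print(f"HTTP error {exc.code} for {url}")
    except urllib.error.URLError as exc:
        # No usable response: DNS failure, refused connection, connect timeout.
        print(f"Request failed for {url}: {exc.reason}")
    except socket.timeout:
        # A timeout while reading the body is raised directly, not wrapped.
        print(f"Timed out while reading {url}")
    return None


# Example call (needs network access):
# html = fetch_text("https://www.python.org/", timeout=3)
```

Returning None on failure keeps the calling loop simple; a crawler that needs to distinguish failure kinds might re-raise or return the exception instead.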

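1.2 The data parameter

The data parameter is optional. When it is supplied, its value must first be converted to bytes, for example with bytes(). Once data is passed, the request method becomes POST instead of GET.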
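The switch from GET to POST can be checked without sending anything. This is a small sketch that only builds Request objects and asks them which method they would use; the variable names without_data and with_data are chosen here for illustration.

```python
import urllib.parse
import urllib.request

url = "https://www.httpbin.org/post"

# urlencode() builds the form string; encode() turns it into bytes.
data = urllib.parse.urlencode({"name": "germey"}).encode("utf-8")
print(data)       # b'name=germey'
print(len(data))  # 11 bytes

# The Request objects are only constructed, never sent, so no network is needed.
without_data = urllib.request.Request(url)
with_data = urllib.request.Request(url, data=data)

print(without_data.get_method())  # GET
print(with_data.get_method())     # POST
```

Calling .encode("utf-8") on the encoded string is equivalent to wrapping it in bytes(..., encoding='utf-8').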

Example:

import urllib.request
import urllib.parse

data = bytes(urllib.parse.urlencode({'name':'germey'}), encoding = 'utf-8')
response = urllib.request.urlopen('https://www.httpbin.org/post', data = data)
print(response.read().decode('utf-8'))

Result:

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "name": "germey"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Content-Length": "11", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "www.httpbin.org", 
    "User-Agent": "Python-urllib/3.11", 
    "X-Amzn-Trace-Id": "Root=1-64997bdd-011711375cc64ba54dba4056"
  }, 
  "json": null, 
  "origin": "1.202.187.118", 
  "ur
