Python 3 Web Crawlers: Using the Basic Libraries

1. HTTP Fundamentals

1. URL, URN & URI

  • URL: Uniform Resource Locator. Example: https://github.com/favicon.ico

  • URN: Uniform Resource Name. Example: urn:isbn:0451450523 identifies a book by its ISBN

  • URI: Uniform Resource Identifier

    URI = URL + URN. URNs are rarely used today, so in practice almost every URL is also a URI.

2. HTTP & HTTPS

URLs usually start with http or https; you may also come across URLs that start with ftp, smb, and other protocols.

  • HTTP: Hyper Text Transfer Protocol
  • HTTPS: HTTP over SSL/TLS, i.e. the secure variant of HTTP

3. Requests

  • Request methods

    • GET: the parameters are carried in the URL itself, so the data is visible there; a POST request's
      URL does not contain the data, which is sent as a form in the request body instead
    • Data submitted via GET is limited to roughly 1024 bytes (URL length limits), while POST has no such limit
    • Use POST for forms, sensitive information, and file uploads
  • Request URL

    The Uniform Resource Locator (URL), which uniquely identifies the resource we want to request

  • Request headers

    Important additional information for the server

    • Accept: a request header field specifying which content types the client can accept
    • Accept-Language: the languages the client can accept
    • Accept-Encoding: the content encodings the client can accept
    • Host: the host (IP) and port of the requested resource, i.e. the location of the origin server or
      gateway for the request URL. Since HTTP/1.1, every request must include this header
    • Cookie (often plural, Cookies): data the website stores on the client to identify the user and
      track the session; its main job is to maintain the current session
    • Referer: identifies which page the request came from; the server can use it for things such as
      source statistics and hot-link protection
    • User-Agent: UA for short, a special string that lets the server identify the client's operating
      system and browser, including their versions. Setting it in a crawler disguises the request as a
      browser; without it, the request is very likely to be identified as a crawler
    • Content-Type: also called the Internet Media Type or MIME type; in an HTTP message header it
      indicates the media type of the body. For example, text/html means HTML, image/gif a GIF image, and application/json JSON. A fuller list of mappings is at http://tool.oschina.net/commons
  • Request body

    The request body generally carries the form data of a POST request; for a GET request it is empty.
    Common Content-Type values and what they submit (a short sketch follows the table):

    Content-Type                        Data submitted
    application/x-www-form-urlencoded   Form data
    multipart/form-data                 Form file upload
    application/json                    Serialized JSON data
    text/xml                            XML data
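
    A minimal sketch (using the requests library introduced later in this post; httpbin.org simply echoes the request back) showing how the Content-Type follows from how the body is passed:

    import requests
    # form submission: the body is URL-encoded, Content-Type becomes application/x-www-form-urlencoded
    r1 = requests.post('http://httpbin.org/post', data={'name': 'Germey'})
    print(r1.json()['headers']['Content-Type'])
    # JSON submission: the body is serialized JSON, Content-Type becomes application/json
    r2 = requests.post('http://httpbin.org/post', json={'name': 'Germey'})
    print(r2.json()['headers']['Content-Type'])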

4. Responses

A response is returned by the server to the client and consists of three parts: the response status code (Response Status Code), the response headers (Response Headers), and the response body (Response Body).

  • Response status code

    Common status codes and their meanings

    Status code  Meaning                Details
    100          Continue               The requester should continue; the server received part of the request and is waiting for the rest
    101          Switching Protocols    The requester asked the server to switch protocols and the server has agreed to do so
    200          OK                     The server successfully processed the request
    201          Created                The request succeeded and the server created a new resource
    202          Accepted               The server has accepted the request but has not yet processed it
    203          Non-Authoritative Information  The server processed the request, but the returned information may come from another source
    204          No Content             The server processed the request successfully but returned no content
    205          Reset Content          The server processed the request successfully and the content was reset
    206          Partial Content        The server successfully processed part of the request
    300          Multiple Choices       The server can perform several different actions for the request
    301          Moved Permanently      The requested page has moved permanently to a new location (permanent redirect)
    302          Found                  The requested page temporarily redirects to another page (temporary redirect)
    303          See Other              If the original request was a POST, the redirect target should be fetched with GET
    304          Not Modified           The page has not changed since the last request; keep using the cached copy
    305          Use Proxy              The requester should access the page through a proxy
    307          Temporary Redirect     The requested resource temporarily responds from another location
    400          Bad Request            The server could not parse the request
    401          Unauthorized           The request lacks authentication or authentication failed
    403          Forbidden              The server refuses to fulfil the request
  • Response headers

    The response headers contain the server's answer to the request, for example:

    • Date: when the response was generated.
    • Last-Modified: when the resource was last modified.
    • Content-Encoding: the encoding of the response content.
    • Server: information about the server, such as its name and version.
    • Content-Type: the document type of the returned data, e.g. text/html for an HTML document,
      application/x-javascript for a JavaScript file, image/jpeg for an image.
    • Set-Cookie: sets cookies. It tells the browser to store this content in its cookies and to send
      them back with the next request.
    • Expires: when the response expires; it lets proxy servers or the browser cache the content so a
      later visit can load it straight from the cache, reducing server load and loading time.
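
  • Inspecting a response

    A minimal sketch (using the requests library introduced in the next section; httpbin.org is a convenient echo service) that prints the three parts described above:

    import requests
    r = requests.get('http://httpbin.org/get')
    # response status code
    print(r.status_code)
    # a few of the response headers described above
    for name in ('Date', 'Content-Type', 'Server'):
        print(name, ':', r.headers.get(name))
    # response body
    print(r.text)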

2. Using the Basic Libraries

1. urllib

Detecting timeouts

import urllib.request
import urllib.error
import socket
try:
    # a deliberately tiny timeout (0.1 s) to force a timeout error
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
    

Constructing HTTP request headers

from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
form_data = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(form_data), encoding='utf-8')
req = request.Request(url=url, headers=headers, data=data, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
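
An alternative sketch of the same request: headers can also be attached after constructing the Request object, using its add_header() method.

from urllib import request, parse
url = 'http://httpbin.org/post'
data = bytes(parse.urlencode({'name': 'Germey'}), encoding='utf-8')
req = request.Request(url=url, data=data, method='POST')
# add_header() sets one header at a time; equivalent to passing headers= above
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
req.add_header('Host', 'httpbin.org')
response = request.urlopen(req)
print(response.read().decode('utf-8'))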

3. requests

A simple scrape of Zhihu Explore questions

import requests
import re
import sys
# Set a realistic browser User-Agent or the site returns 400; best to copy a current one from Chrome F12
headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36' 
}
r=requests.get('https://www.zhihu.com/explore',headers=headers)
# If anything looks wrong, send the request to http://httpbin.org/get first to inspect what is actually sent
#r=requests.get('http://httpbin.org/get',headers=headers)
if r.status_code != 200 :
    print("return status_code : %s" % r.status_code)
    sys.exit()
pattern=re.compile('explore-feed.*?question_link.*?>(.*?)</a>', re.S)
titles=re.findall(pattern,r.text)
print(titles)
    

Downloading images, video, and audio

import requests
headers={
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36' 
}
r=requests.get('http://github.com/favicon.ico' , headers=headers )
#print( r.text )    -- prints the body as text (str)
#print( r.content ) -- prints the body as bytes
with open( 'favicon.ico' , 'wb' ) as f:
    f.write( r.content )

  

Status code lookup

import requests
import sys
r = requests.get('http://www.jianshu.com')
# requests.codes.ok == 200; exit if the request failed
sys.exit() if r.status_code != requests.codes.ok else print('Request Successfully')
# <class 'requests.structures.CaseInsensitiveDict'>
print( type(r.headers) , r.headers)
# <class 'requests.cookies.RequestsCookieJar'>
print( type(r.cookies) , r.cookies)
print( type(r.url) , r.url )

requests.codes

import requests
codes_dict = requests.codes.__dict__
# swap keys and values: (status_code, name) pairs
si = [(status_code, info) for info, status_code in codes_dict.items()]
dist_si = {}
# merge all the names that map to the same status code into one list per code
for code_dict in si:
    code_dict_key = code_dict[0]
    code_dict_val = code_dict[1]
    print(code_dict_key, code_dict_val)
    if dist_si.get(code_dict_key):
        dist_si[code_dict_key].append(code_dict_val)
    else:
        dist_si[code_dict_key] = [code_dict_val]
for (status_code, info) in dist_si.items():
    print(status_code, info)
================================== Output ==================================
status_codes ['name']
# informational status codes
100 ['continue', 'CONTINUE']
101 ['switching_protocols', 'SWITCHING_PROTOCOLS']
102 ['processing', 'PROCESSING']
103 ['checkpoint', 'CHECKPOINT']
122 ['uri_too_long', 'URI_TOO_LONG', 'request_uri_too_long', 'REQUEST_URI_TOO_LONG']
# success status codes
200 ['ok', 'OK', 'okay', 'OKAY', 'all_ok', 'ALL_OK', 'all_okay', 'ALL_OKAY', 'all_good', 'ALL_GOOD', '\\o/', '✓']
201 ['created', 'CREATED']
202 ['accepted', 'ACCEPTED']
203 ['non_authoritative_info', 'NON_AUTHORITATIVE_INFO', 'non_authoritative_information', 'NON_AUTHORITATIVE_INFORMATION']
204 ['no_content', 'NO_CONTENT']
205 ['reset_content', 'RESET_CONTENT', 'reset', 'RESET']
206 ['partial_content', 'PARTIAL_CONTENT', 'partial', 'PARTIAL']
207 ['multi_status', 'MULTI_STATUS', 'multiple_status', 'MULTIPLE_STATUS', 'multi_stati', 'MULTI_STATI', 'multiple_stati', 'MULTIPLE_STATI']
208 ['already_reported', 'ALREADY_REPORTED']
226 ['im_used', 'IM_USED']
# redirection status codes
300 ['multiple_choices', 'MULTIPLE_CHOICES']
301 ['moved_permanently', 'MOVED_PERMANENTLY', 'moved', 'MOVED', '\\o-']
302 ['found', 'FOUND']
303 ['see_other', 'SEE_OTHER', 'other', 'OTHER']
304 ['not_modified', 'NOT_MODIFIED']
305 ['use_proxy', 'USE_PROXY']
306 ['switch_proxy', 'SWITCH_PROXY']
307 ['temporary_redirect', 'TEMPORARY_REDIRECT', 'temporary_moved', 'TEMPORARY_MOVED', 'temporary', 'TEMPORARY']
308 ['permanent_redirect', 'PERMANENT_REDIRECT', 'resume_incomplete', 'RESUME_INCOMPLETE', 'resume', 'RESUME']
# client error status codes
400 ['bad_request', 'BAD_REQUEST', 'bad', 'BAD']
401 ['unauthorized', 'UNAUTHORIZED']
402 ['payment_required', 'PAYMENT_REQUIRED', 'payment', 'PAYMENT']
403 ['forbidden', 'FORBIDDEN']
404 ['not_found', 'NOT_FOUND', '-o-', '-O-']
405 ['method_not_allowed', 'METHOD_NOT_ALLOWED', 'not_allowed', 'NOT_ALLOWED']
406 ['not_acceptable', 'NOT_ACCEPTABLE']
407 ['proxy_authentication_required', 'PROXY_AUTHENTICATION_REQUIRED', 'proxy_auth', 'PROXY_AUTH', 'proxy_authentication', 'PROXY_AUTHENTICATION']
408 ['request_timeout', 'REQUEST_TIMEOUT', 'timeout', 'TIMEOUT']
409 ['conflict', 'CONFLICT']
410 ['gone', 'GONE']
411 ['length_required', 'LENGTH_REQUIRED']
412 ['precondition_failed', 'PRECONDITION_FAILED']
428 ['precondition', 'PRECONDITION', 'precondition_required', 'PRECONDITION_REQUIRED']
413 ['request_entity_too_large', 'REQUEST_ENTITY_TOO_LARGE']
414 ['request_uri_too_large', 'REQUEST_URI_TOO_LARGE']
415 ['unsupported_media_type', 'UNSUPPORTED_MEDIA_TYPE', 'unsupported_media', 'UNSUPPORTED_MEDIA', 'media_type', 'MEDIA_TYPE']
416 ['requested_range_not_satisfiable', 'REQUESTED_RANGE_NOT_SATISFIABLE', 'requested_range', 'REQUESTED_RANGE', 'range_not_satisfiable', 'RANGE_NOT_SATISFIABLE']
417 ['expectation_failed', 'EXPECTATION_FAILED']
418 ['im_a_teapot', 'IM_A_TEAPOT', 'teapot', 'TEAPOT', 'i_am_a_teapot', 'I_AM_A_TEAPOT']
421 ['misdirected_request', 'MISDIRECTED_REQUEST']
422 ['unprocessable_entity', 'UNPROCESSABLE_ENTITY', 'unprocessable', 'UNPROCESSABLE']
423 ['locked', 'LOCKED']
424 ['failed_dependency', 'FAILED_DEPENDENCY', 'dependency', 'DEPENDENCY']
425 ['unordered_collection', 'UNORDERED_COLLECTION', 'unordered', 'UNORDERED']
426 ['upgrade_required', 'UPGRADE_REQUIRED', 'upgrade', 'UPGRADE']
429 ['too_many_requests', 'TOO_MANY_REQUESTS', 'too_many', 'TOO_MANY']
431 ['header_fields_too_large', 'HEADER_FIELDS_TOO_LARGE', 'fields_too_large', 'FIELDS_TOO_LARGE']
444 ['no_response', 'NO_RESPONSE', 'none', 'NONE']
449 ['retry_with', 'RETRY_WITH', 'retry', 'RETRY']
450 ['blocked_by_windows_parental_controls', 'BLOCKED_BY_WINDOWS_PARENTAL_CONTROLS', 'parental_controls', 'PARENTAL_CONTROLS']
451 ['unavailable_for_legal_reasons', 'UNAVAILABLE_FOR_LEGAL_REASONS', 'legal_reasons', 'LEGAL_REASONS']
499 ['client_closed_request', 'CLIENT_CLOSED_REQUEST']
# server error status codes
500 ['internal_server_error', 'INTERNAL_SERVER_ERROR', 'server_error', 'SERVER_ERROR', '/o\\', '✗']
501 ['not_implemented', 'NOT_IMPLEMENTED']
502 ['bad_gateway', 'BAD_GATEWAY']
503 ['service_unavailable', 'SERVICE_UNAVAILABLE', 'unavailable', 'UNAVAILABLE']
504 ['gateway_timeout', 'GATEWAY_TIMEOUT']
505 ['http_version_not_supported', 'HTTP_VERSION_NOT_SUPPORTED', 'http_version', 'HTTP_VERSION']
506 ['variant_also_negotiates', 'VARIANT_ALSO_NEGOTIATES']
507 ['insufficient_storage', 'INSUFFICIENT_STORAGE']
509 ['bandwidth_limit_exceeded', 'BANDWIDTH_LIMIT_EXCEEDED', 'bandwidth', 'BANDWIDTH']
510 ['not_extended', 'NOT_EXTENDED']
511 ['network_authentication_required', 'NETWORK_AUTHENTICATION_REQUIRED', 'network_auth', 'NETWORK_AUTH', 'network_authentication', 'NETWORK_AUTHENTICATION']

File upload (Content-Type: multipart/form-data)

import requests
# '1.pem' is just an example file in the current directory
files={
    'file':open('1.pem','rb')
}
r=requests.post('http://httpbin.org/post', files=files)
print(r.text)

Cookies

  • # Get cookies
    import requests
    r=requests.get('https://baidu.com')
    for key , val in r.cookies.items() :
        print( "%s=%s" % (key , val) )
    
  • # Set cookies manually --- copy the Cookie string from Chrome DevTools (F12)
    ##################### Method 1 #####################################
    import requests
    headers={
        'Cookie':'_zap=cc672834-3e63-4a4e-9246-93b54dc74a76; __DAYU_PP=yuUeiiVeaVZEayUab2rFffffffffd3f1f0f5bc9c; d_c0="AMCkrWxHuw2PTh4QnK1aQBQcA2l7rd2aSjY=|1528686380"; l_n_c=1; q_c1=35d4a692ec7d4c3c88351f8b8959668b|1553738732000|1516775913000; _xsrf=d632891773e10dc462a07feb2f829368; n_c=1; _xsrf=aDKGdn6TfOkYfk43vsekRV75FfebYNba; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1; __utmc=51854390; __utmz=51854390.1553738668.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); BL_D_PROV=; BL_T_PROV=; tgw_l7_route=66cb16bc7f45da64562a077714739c11; l_cap_id="YzJjYzEyY2ExZGMxNGJkMmFjNmNkNTM3MDg1ZWRiM2E=|1553762062|9d1547776eebfb3b42ca92369b2d3a9df4245339"; r_cap_id="Yjg3NTg0YjRhNmZjNDEyMDk2MmFkMjI4NzgyODgzYzU=|1553762062|efff30851f845765634ec9bae5bde07dce11315e"; cap_id="M2M0MjNjMzUyNzdlNGQxMThlNTRhOGVhOTY5ZDkwMjM=|1553762062|48aac3689381c89f5ecccbdc02c001de923e6fe2"; __utma=51854390.1821104099.1553738668.1553738668.1553761992.2; __utmb=51854390.0.10.1553761992; capsion_ticket="2|1:0|10:1553762071|14:capsion_ticket|44:ODBmZjRiMWMzN2MxNDM1OTlkMDUzNTA5NTNjM2ZlMDI=|6a6ccc9cf7d944da04671d627a7be433a0911b39d8918dc4ae65184d1d7fff89"; z_c0="2|1:0|10:1553762113|4:z_c0|92:Mi4xVHg3NkRnQUFBQUFBd0tTdGJFZTdEU1lBQUFCZ0FsVk5RZFdKWFFBU2RTWmpnTUIwSXF3ODZ1TEFNTlJraFJsbjh3|fb442f693e4ef8cc9837064a6e4e1bdd766d26db24f0bb4b0b765f36e7672ac8"; tst=r; __utmv=51854390.100--|2=registration_date=20190328=1^3=entry_date=20180124=1' ,
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    r=requests.get('https://www.zhihu.com/collections' ,headers=headers)
    print(r.text)
    ##################### Method 2 #####################################
    import requests
    headers={
        'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    cookies='_zap=cc672834-3e63-4a4e-9246-93b54dc74a76; __DAYU_PP=yuUeiiVeaVZEayUab2rFffffffffd3f1f0f5bc9c; d_c0="AMCkrWxHuw2PTh4QnK1aQBQcA2l7rd2aSjY=|1528686380"; l_n_c=1; q_c1=35d4a692ec7d4c3c88351f8b8959668b|1553738732000|1516775913000; _xsrf=d632891773e10dc462a07feb2f829368; n_c=1; _xsrf=aDKGdn6TfOkYfk43vsekRV75FfebYNba; SL_GWPT_Show_Hide_tmp=1; SL_wptGlobTipTmp=1; __utmc=51854390; __utmz=51854390.1553738668.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); BL_D_PROV=; BL_T_PROV=; tgw_l7_route=66cb16bc7f45da64562a077714739c11; l_cap_id="YzJjYzEyY2ExZGMxNGJkMmFjNmNkNTM3MDg1ZWRiM2E=|1553762062|9d1547776eebfb3b42ca92369b2d3a9df4245339"; r_cap_id="Yjg3NTg0YjRhNmZjNDEyMDk2MmFkMjI4NzgyODgzYzU=|1553762062|efff30851f845765634ec9bae5bde07dce11315e"; cap_id="M2M0MjNjMzUyNzdlNGQxMThlNTRhOGVhOTY5ZDkwMjM=|1553762062|48aac3689381c89f5ecccbdc02c001de923e6fe2"; __utma=51854390.1821104099.1553738668.1553738668.1553761992.2; __utmb=51854390.0.10.1553761992; capsion_ticket="2|1:0|10:1553762071|14:capsion_ticket|44:ODBmZjRiMWMzN2MxNDM1OTlkMDUzNTA5NTNjM2ZlMDI=|6a6ccc9cf7d944da04671d627a7be433a0911b39d8918dc4ae65184d1d7fff89"; z_c0="2|1:0|10:1553762113|4:z_c0|92:Mi4xVHg3NkRnQUFBQUFBd0tTdGJFZTdEU1lBQUFCZ0FsVk5RZFdKWFFBU2RTWmpnTUIwSXF3ODZ1TEFNTlJraFJsbjh3|fb442f693e4ef8cc9837064a6e4e1bdd766d26db24f0bb4b0b765f36e7672ac8"; tst=r; __utmv=51854390.100--|2=registration_date=20190328=1^3=entry_date=20180124=1'
    jar=requests.cookies.RequestsCookieJar()
    for cookie in cookies.split(';'):
        # strip the leading space left after splitting on ';' so cookie names stay clean
        key, val = cookie.strip().split('=', 1)
        jar.set(key, val)
    r=requests.get('https://www.zhihu.com/collections' ,cookies=jar,headers=headers)
    print(r.text)
    
    
  • # Maintain a session (a Session object keeps cookies across requests)
    import requests
    s=requests.Session()
    r=s.get('http://httpbin.org/cookies/set/number/123456789')
    print(r.text)
    r=s.get('http://httpbin.org/cookies')
    print(r.text)
    

SSL certificate verification

References

[Understanding server certificates: CA & SSL][https://www.v2ex.com/t/436240]

[SSL/TLS explained in detail][https://segmentfault.com/a/1190000002554673]

Python ships its own CA bundle (it does not use the operating system's store the way IE or Chrome do); it is provided by the certifi module. The CA file in my test environment:

(site_test) wujun@wujun-VirtualBox:~$ sudo find ./ -name cacert.pem 
./env_site_test/lib/python3.6/site-packages/pip/_vendor/certifi/cacert.pem
(site_test) wujun@wujun-VirtualBox:~$ python
Python 3.6.7 (default, Oct 22 2018, 11:32:17) 
[GCC 8.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import certifi
>>> 
# Ignore warning messages
import logging
import requests
# route warnings into the logging system instead of printing them
logging.captureWarnings(True)
# By the time I ran this, 12306 no longer used a self-signed certificate, so the error cannot be reproduced
response=requests.get('https://www.12306.cn')
# For mutual (two-way) TLS you must supply the client certificate and private key; requests requires the private key to be unencrypted
#response=requests.get('https://www.12306.cn',cert=('/path/ser.crt','/path/key'))
print(response.status_code)
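
If a site really does use a self-signed certificate, verification can be skipped explicitly. A minimal sketch (verify=False makes urllib3 emit an InsecureRequestWarning, which can be silenced as shown):

import requests
import urllib3
# silence the InsecureRequestWarning that verify=False triggers
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)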

Proxies

  • If you need the SOCKS protocol, install the extra separately: pip install 'requests[socks]'
import requests
proxies={
    'http':'http://211.149.172.228:9999',
    'https':'https://182.150.35.173:80',
    # HTTP Basic Auth in a SOCKS proxy URL; note that a dict key can appear only once,
    # so uncommenting this would replace the 'https' entry above
    #'https':'socks5://user:[email protected]:3128/'
}
# A timeout a little above 3 seconds is advisable (related to TCP retransmission timing). requests accepts
# either a single number or a (connect, read) tuple; the default timeout=None blocks and waits indefinitely.
requests.get('http://httpbin.org/get', proxies=proxies, timeout=(4, 5))

A tcpdump capture shows that the destination address in the TCP/IP headers has become the proxy's address (211.149.172.228).

[Free and paid proxies][http://www.qydaili.com/free/]


Authentication

  • basic auth

    import requests
    from requests.auth import HTTPBasicAuth 
    # Test user: test_name, password: 123456; the basic-auth path tells httpbin to require Basic authentication
    r=requests.get( 'http://httpbin.org/basic-auth/test_name/123456' ,auth = HTTPBasicAuth('test_name','123456'))
    r.text
    '''
    Output test 1: correct password (200 OK):
    >>> r=requests.get( 'http://httpbin.org/basic-auth/test_name/123456' ,auth = HTTPBasicAuth('test_name','123456'))
    >>> print(r.headers)
    {'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Content-Encoding': 'gzip', 'Content-Type': 'application/json', 'Date': 'Sun, 31 Mar 2019 12:27:00 GMT', 'Server': 'nginx', 'Content-Length': '68', 'Connection': 'keep-alive'}
    >>> print(r.status_code)
    200
    
    Output test 2: wrong password (401 Unauthorized):
    >>> r=requests.get( 'http://httpbin.org/basic-auth/test_name/123456' ,auth = HTTPBasicAuth('test_name','1234567'))
    >>> print(r.status_code)
    401
    >>> print(r.headers)    
    {'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Date': 'Sun, 31 Mar 2019 12:30:17 GMT', 'Server': 'nginx', 'WWW-Authenticate': 'Basic realm="Fake Realm"', 'Content-Length': '0', 'Connection': 'keep-alive'}
    >>> 
    
    Request test: what a Basic auth request looks like on the wire
    >>> r=requests.get( 'http://httpbin.org/get' ,auth = HTTPBasicAuth('test_name','1234567')) 
    >>> print(r.text)
    {
      "args": {}, 
      "headers": {
        "Accept": "*/*", 
        "Accept-Encoding": "gzip, deflate", 
        "Authorization": "Basic dGVzdF9uYW1lOjEyMzQ1Njc=", 
        "Host": "httpbin.org", 
        "User-Agent": "python-requests/2.18.4"
      }, 
      "origin": "218.88.16.199, 218.88.16.199", 
      "url": "https://httpbin.org/get"
    }
    '''
    

    1. As shown above, when the server requires Basic auth it responds with 401, and the WWW-Authenticate header announces that the "Fake Realm" realm needs credentials

    2. The client then adds "Authorization": "Basic dGVzdF9uYW1lOjEyMzQ1Njc="; user:password is base64-encoded, placed after "Basic" and sent to the server (a quick check follows point 3 below)

    3. If the username and password do not match, the server responds with 401 again, indicating that Basic authentication is still required
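
    The Authorization value is easy to verify by hand; a small sketch showing that it is simply base64 of user:password (the test credentials used above):

    import base64
    # 'test_name:1234567' -> 'dGVzdF9uYW1lOjEyMzQ1Njc=', exactly the value seen in the Authorization header above
    print(base64.b64encode(b'test_name:1234567').decode())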

  • Digest authentication

    import requests
    from requests.auth import HTTPDigestAuth
    url = 'http://httpbin.org/digest-auth/auth/user/pass'
    r=requests.get(url, auth=HTTPDigestAuth('user', 'pass'))
    r.status_code
    print(r.headers)
    '''
    # Output test 1
    >>> r.status_code
    200
    >>> print(r.headers)
    {'Access-Control-Allow-Credentials': 'true', 'Access-Control-Allow-Origin': '*', 'Content-Encoding': 'gzip', 'Content-Type': 'application/json', 'Date': 'Mon, 01 Apr 2019 03:14:28 GMT', 'Server': 'nginx', 'Set-Cookie': 'fake=fake_value; Path=/, stale_after=never; Path=/', 'Content-Length': '59', 'Connection': 'keep-alive'}
    # Output test 2: the server returns 401
    import requests
    from requests.auth import HTTPDigestAuth
    text=requests.get('http://httpbin.org/digest-auth/auth/user/pass1', auth=HTTPDigestAuth('user', 'pass')).headers
    for head,response_msg in text.items():
    	print(head,response_msg) 
        
    Access-Control-Allow-Credentials true
    Access-Control-Allow-Origin *
    Content-Type text/html; charset=utf-8
    Date Mon, 01 Apr 2019 04:26:46 GMT
    Server nginx
    Set-Cookie stale_after=never; Path=/, last_nonce=d0d5882d37dcf4b76dee54e9c0d2bb5a; Path=/, fake=fake_value; Path=/
    WWW-Authenticate Digest realm="[email protected]", nonce="3969731c4f2ce3545a8266fe7d41a67c", qop="auth", opaque="3f15a8256cb961c0e0add04854f1f15d", algorithm=MD5, stale=FALSE
    Content-Length 0
    Connection keep-alive
    >>> 
    Request test: what the request messages look like on the wire (captured with tcpdump, see the notes below)
    '''
    
    
    


    1. The tcpdump capture shows that requests actually made two requests: the first fetches the server nonce, digest algorithm and related parameters, and only the second carries the username and password

    2. In the second request, the response field inside Authorization is the computed digest value. See also [OAuth 2.0: Bearer Token Usage][https://www.cnblogs.com/XiongMaoMengNan/p/6785155.html]

prepared request

  • A Request can be represented as a data structure (a Prepared Request), which makes it convenient to queue and schedule requests

    from requests import Request, Session
    url='http://httpbin.org/post'
    data={
        'name':'wujun'
    }
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'
    }
    s=Session()
    req=Request('POST',url,data=data,headers=headers)
    prepped=s.prepare_request(req)
    r=s.send(prepped)
    print(r.text)
    

4. Regular Expressions

Online regex tester: http://tool.oschina.net/regex#

Pattern  Description
\w       Matches a letter, digit or underscore
\W       Matches anything that is not a letter, digit or underscore
\s       Matches any whitespace character, equivalent to [ \t\n\r\f]
\S       Matches any non-whitespace character
\d       Matches any digit, equivalent to [0-9]
\D       Matches any non-digit character
\A       Matches the start of the string
\Z       Matches the end of the string; if there is a trailing newline, only up to the newline
\z       Matches the end of the string, including a trailing newline
\G       Matches the position where the last match finished
\n       Matches a newline character
\t       Matches a tab character
^        Matches the start of a line
$        Matches the end of a line
.        Matches any character except a newline; with the re.DOTALL flag it also matches newlines
[...]    A character set; matches any single character listed
[^...]   Matches any character not listed in the brackets
*        Matches 0 or more of the preceding expression
+        Matches 1 or more of the preceding expression
?        Matches 0 or 1 of the preceding expression, non-greedy
{n}      Matches exactly n of the preceding expression
{n,m}    Matches n to m of the preceding expression, greedy
a|b      Matches a or b
( )      Matches the expression inside the parentheses and marks a group
  • match()

    Wrap the parts you want to extract in parentheses (), then read them out in order with group()

    import re
    content = 'Hello 1234567 World_tHIS is Regex Demo'
    result = re.match(r'^Hello\s(\d+)\s', content)
    print(result)
    print(result.group(1))
    print(result.span())
    # Non-greedy mode 1: prints 1234567
    result = re.match(r'^Hello.*?(\d+).*Demo$', content)
    >>> print(result.group(1))
    1234567
    # Non-greedy mode 2: prints '' - surprising at first, because (.*?) matches as few characters as possible
    result = re.match(r'^Hello.*Regex (.*?)', content)
    >>> print(result.group(1))
    
    # Greedy mode: prints 7, because the greedy .* swallows 123456 and leaves only one digit for (\d+)
    result = re.match(r'^Hello.*(\d+).*Demo$', content)
    >>> print(result.group(1))
    7
    # Newlines: add the re.S modifier, which makes . match any character including newlines
    content = '''Hello 1234567 World_tHIS 
    is Regex Demo'''
    result = re.match(r'^Hello\s(\d+)\s', content, re.S)
    >>> print(result.group(1))
    1234567
    # Escaping: use "\"
    
    
  • search()

    It scans the entire string and returns the first successful match (unlike match(), which only matches from the beginning).

    import re
    content = 'extra Hello 1234567 World_tHIS is Regex Demo'
    result = re.search(r'Hello\s(\d+)\s', content)
    >>> print(result.group(1))
    1234567
    
  • findall()

    Extracts every match; mind the difference between greedy and non-greedy patterns

    import re
    html='''
    <li data-view="5"><a href="/4.mp3" singer="beyond">尤辉岁月</a></li>
    <li data-view="5"><a href="/4.mp3" 
    singer="beyond">尤辉岁月</a></li>
    '''
    result= re.findall('<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a></li>',html,re.S)
    for r in result:
        print(r[0],r[1],r[2])
    
  • sub()

    String substitution

    import re
    content = '123wujun456'
    result = re.sub(r'\d+', '', content)
    >>> print(result)
    wujun
    
  • compile()

    Compiles a regex string into a pattern object so that it can be reused in later matches

    import re
    content1 = '2019-12-15 12:00'
    content2 = '2019-12-16 12:00'
    content3 = '2019-12-17 12:00'
    pattern = re.compile(r'\d{2}:\d{2}', re.S)
    result1 = re.sub(pattern, '', content1)
    result2 = re.sub(pattern, '', content2)
    result3 = re.sub(pattern, '', content3)
    >>> print( result1 , result2 , result3)
    2019-12-15  2019-12-16  2019-12-17 
    
    
  • Scraping the Maoyan Top 100 movies

    import requests
    import re
    import json
    def write_to_file(content):
    	with open('result.txt' , 'a' , encoding='utf-8') as f :
    		f.write( json.dumps(content , ensure_ascii = False) + '\n' ) 
    		
    def get_one_page(url):
    	headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
    	response = requests.get( url , headers = headers )
    	if response.status_code != 200 :
    		print(response.status_code)
    		return None
    	return response.text
    def parse_one_page(html):
    	pattern=re.compile('<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)".*?name.*?<a.*?>(.*?)</a>.*?star.*?>(.*?)</p>.*?releasetime.*?>(.*?)</p>.*?score.*?integer.*?>(.*?)</i>.*?fraction.*?>(.*?)</i>' , re.S)
    	items = re.findall( pattern , html )
    	'''
    	print( "items=",items)
    	for item in items:
    		print( 
    		item[0] , 
    		item[1] , 
    		item[2].strip() , 
    		item[3].strip()[3:]   if len(item[3].strip()) > 3 else ''  , 
    		item[4][5:]   if len(item[4]) > 5 else ''  , 
    		item[5]+item[6] ) 
    		print( "="*50 )
    	'''
    	for item in items:
    		yield {
    			'index' : item[0],
    			'image' : item[1],
    			'title' : item[2].strip(),
    			'actor' : item[3].strip()[3:]   if len(item[3].strip()) > 3 else '',
    			'time'  : item[4][5:]   if len(item[4]) > 5 else '',
    			'score' : item[5]+item[6]
    		}
    if __name__ == "__main__":
    	for pages in range( 10 ):
    		url='https://maoyan.com/board/4?offset=' + str(pages*10)
    		html=get_one_page(url)
    		for content in parse_one_page(html) :
    			print(content)
    			write_to_file(content)
    	
    
    

5. XPath

  • A first XPath program
from lxml import etree
text='''
<div>
<ul>
<li class ="item-0"><a href="link1.html">first item</a></li>
<li class ="item-1"><a href="link2.html">second item</a></li>
<li class ="item-inactive"><a href="link3.html">third item</a></li>
<li class ="item-1"><a href="link4.html">fourth item</a></li>
<li class ="item-0"><a href="link5.html">程序</a>
<li class ="item-3 item-4" name = "name" ><a href="link5.html">程序</a>
</ul>
</div>
'''
html=etree.HTML(text)
#etree.HTML() automatically fixes up broken HTML
result=etree.tostring(html)
#convert bytes to str
print(result.decode('utf-8'))

### Or parse an HTML file directly (etree.parse() takes a file name; assuming the markup above was saved as test.html)
html=etree.parse('test.html', etree.HTMLParser())
result=etree.tostring(html)
print(result.decode('utf-8'))
### Attribute matching
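#(illustrative addition) re-parse the text above and filter li nodes by attribute value
html=etree.HTML(text)
print(html.xpath('//li[@class="item-0"]'))
#contains() helps when the attribute holds several values, e.g. class="item-3 item-4"
print(html.xpath('//li[contains(@class, "item-3")]'))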

  • Selecting by order

    from lxml import etree
    text='''
    <div>
    <ul>
    <li class ="item-0"><a href="link1.html">first item</a></li>
    <li class ="item-1"><a href="link2.html">second item</a></li>
    <li class ="item-inactive"><a href="link3.html">third item</a></li>
    <li class ="item-1"><a href="link4.html">fourth item</a></li>
    <li class ="item-0"><a href="link5.html">程序</a>
    <li class ="item-3 item-4" name = "name" ><a href="link5.html">程序</a>
    </ul>
    </div>
    '''
    html=etree.HTML(text)
    #the first li node
    html.xpath('//li[1]')
    #the last one
    html.xpath('//li[last()]')
    #nodes at positions less than 3
    html.xpath('//li[position()<3]')
    #the third li from the end (last()-1 would be the second from the end)
    html.xpath('//li[last()-2]')
    
    
  • Node axis selection

    from lxml import etree
    text='''
    <div>
    <ul>
    <li class ="item-0"><a href="link1.html">first item</a></li>
    <li class ="item-1"><a href="link2.html">second item</a></li>
    <li class ="item-inactive"><a href="link3.html">third item</a></li>
    <li class ="item-1"><a href="link4.html">fourth item</a></li>
    <li class ="item-0"><a href="link5.html">程序</a>
    <li class ="item-3 item-4" name = "name" ><a href="link5.html">程序</a>
    </ul>
    <ul>
    <li class ="a-item-0"><a href="link1.html">first item</a></li>
    </ul>
    <ul>
    <li class ="b-item-0"><a href="link1.html">first item</a></li>
    </ul>
    </div>
    '''
    html=etree.HTML(text)
    #all ancestor nodes
    html.xpath('//li[1]/ancestor::*')
    #the ancestor body node
    html.xpath('//li[1]/ancestor::body')
    #all attributes of the selected node
    html.xpath('//li[1]/attribute::*')
    #direct child nodes (the a child whose href contains "link1.html")
    html.xpath('//li[1]/child::a[contains(@href , "link1.html")]')
    #all descendant nodes
    html.xpath('//li[1]/descendant::*')
    #all following sibling nodes
    html.xpath('//li[1]/following-sibling::*')
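    #(illustrative addition) extracting text and attribute values rather than whole nodes
    #text() returns the text of the matched nodes
    html.xpath('//li[@class="item-1"]/a/text()')
    #@href returns the value of the href attribute
    html.xpath('//li/a/@href')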
    

6. Beautiful Soup

  • Basic usage

    text='''
    <html><head><title>The Dormouse's story </title></head>
    <body>	
    <p class = "title 1 2 3" name = "dromouse"> <b>The Dormouse's story</b></p>
    <div>
    <ul>
    <li class ="item-0"><a href="link1.html">first item</a></li>
    <li class ="item-1"><a href="link2.html">second item</a></li>
    <li class ="item-inactive"><a href="link3.html">third item</a></li>
    <li class ="item-1"><a href="link4.html">fourth item</a></li>
    <li class ="item-0"><a href="link5.html">程序</a>
    <li class ="item-3 item-4" name = "name" ><a href="link5.html">程序</a>
    </ul>
    <ul>
    <li class ="a-item-0"><a href="link1.html">first item</a></li>
    <ul>
    <li class ="b-item-0"><a href="link1.html">first item</a></li>
    </div>
    '''
    from bs4 import BeautifulSoup
    #use the lxml parser
    soup= BeautifulSoup(text,'lxml')
    #print the prettified, fixed-up HTML
    print(soup.prettify())
    #soup.title is of type Tag; string is an attribute of a Tag
    print(type(soup.title))
    <class 'bs4.element.Tag'>
    #text of the li tag (only the first li is selected)
    print(soup.li.string)
    #selecting by tag name without attributes returns the whole element
    print(soup.head)
    
  • Extracting information

    #node name: name
    >>> print(soup.head.name)
    head
    #attributes: attrs
    >>> print(soup.p.attrs['name'])
    dromouse
    >>> print(soup.p['name'])      
    dromouse
    >>> print(soup.p['class'])
    ['title', '1', '2', '3']
    >>> 
    #getting the text content
    >>> print(soup.title.string)
    The Dormouse's story
    #nested selection
    >>> print(soup.p.b.string)  
    The Dormouse's story
    >>> 
    #direct children: contents / children
    >>> soup.div.contents
    ['\n', <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">程序</a>
    </li><li class="item-3 item-4" name="name"><a href="link5.html">程序</a>
    </li></ul>, '\n', <ul>
    <li class="a-item-0"><a href="link1.html">first item</a></li>
    </ul>, '\n', <ul>
    <li class="b-item-0"><a href="link1.html">first item</a></li>
    </ul>, '\n']
    >>> soup.div.children
    <list_iterator object at 0x7f1fbcea9908>
    
    >>> for i , child  in enumerate(soup.div.children): 
    ...     print(i, child)
    ... 
    0 
    
    1 <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">程序</a>
    </li><li class="item-3 item-4" name="name"><a href="link5.html">程序</a>
    </li></ul>
    2 
    
    3 <ul>
    <li class="a-item-0"><a href="link1.html">first item</a></li>
    </ul>
    4 
    
    5 <ul>
    <li class="b-item-0"><a href="link1.html">first item</a></li>
    </ul>
    6 
    
    #all descendants: descendants
    >>> for i , child  in enumerate(soup.div.descendants):
    ...     print(i,child)
    
    #parent node: the parent of the first li
    >>> soup.li.parent
    <ul>
    <li class="item-0"><a href="link1.html">first item</a></li>
    <li class="item-1"><a href="link2.html">second item</a></li>
    <li class="item-inactive"><a href="link3.html">third item</a></li>
    <li class="item-1"><a href="link4.html">fourth item</a></li>
    <li class="item-0"><a href="link5.html">程序</a>
    </li><li class="item-3 item-4" name="name"><a href="link5.html">程序</a>
    </li></ul>
    #all ancestor nodes
    >>> list(enumerate(soup.div.parents))
    #sibling nodes
    text='''
    <p>a<a>a</a>c<a></a>d</p>
    '''
    soup= BeautifulSoup(text,'lxml')
    soup.a.previous_sibling
    soup.a.next_sibling
    list(enumerate(soup.a.previous_siblings))
    list(enumerate(soup.a.next_siblings))
    'a'
    >>> soup.a.next_sibling
    'c'
    >>> list(enumerate(soup.a.previous_siblings))
    [(0, 'a')]
    >>> list(enumerate(soup.a.next_siblings))
    [(0, 'c'), (1, <a></a>), (2, 'd')]
    >>> 
    #extracting information from siblings and parents
    text='''
    <p class="1234">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    '''
    soup= BeautifulSoup(text,'lxml')
    soup.a.previous_sibling
    soup.a.next_sibling.string
    list(soup.a.parents)
    list(soup.a.parents)[0]
    list(soup.a.parents)[0].attrs['class']
    
    
  • find_all()

    #search by tag name
    text='''
    <p class="1234">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    '''
    soup= BeautifulSoup(text,'lxml')
    print(soup.find_all(name='a'))
    print(type(soup.find_all(name='a')[0]))
    for a in soup.find_all(name='a'):
        print(a.string)
    #search by attributes
    text='''
    <p class="1234">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    <p id = "1" class="12345">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    '''
    soup= BeautifulSoup(text,'lxml')
    #print(soup.find_all(attrs={'class':'12345'}))  or equivalently  print(soup.find_all(class_='12345'))
    print( soup.find_all(id="1") )
    print(type(soup.find_all(attrs={'class':'1234'})[0]))
    for a in soup.find_all(attrs={'class':'1234'}):
        print(a.string)
    #text: a regex matched against the node TEXT (not the tags)
    text='''
    <p>
    Hello,this is link
    </p>
    <p>
    Hello,this is link,too
    </p>
    '''
    soup= BeautifulSoup(text,'lxml')
    import re
    print(soup.find_all(text=re.compile('link')))
    
  • find()

    Compared with find_all(), it returns a single Tag: the first match (or None if nothing matches).
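
    A minimal self-contained sketch of the difference, reusing a small snippet of the earlier sample HTML:

    from bs4 import BeautifulSoup
    text = '''
    <p class="1234">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    '''
    soup = BeautifulSoup(text, 'lxml')
    # find() returns only the first matching Tag
    print(soup.find(name='a'))
    # find_all() returns every match as a list
    print(soup.find_all(name='a'))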

  • Other functions

    Function                Purpose
    find_parents            returns all ancestor nodes
    find_parent             returns the direct parent node
    find_next_siblings      returns all following sibling nodes
    find_next_sibling       returns the first following sibling node
    find_previous_siblings  returns all preceding sibling nodes
    find_previous_sibling   returns the first preceding sibling node
    find_all_next           returns all matching nodes after the current node
    find_next               returns the first matching node after the current node
    find_all_previous       returns all matching nodes before the current node
    find_previous           returns the first matching node before the current node
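
    A small sketch of two of these, using a fresh two-paragraph snippet (the class names are just for illustration):

    from bs4 import BeautifulSoup
    text = '''
    <p class="first">a
    <a>a1</a>
    <a>a2</a>
    </p>
    <p class="second">b</p>
    '''
    soup = BeautifulSoup(text, 'lxml')
    # nearest enclosing <p> of the first <a>
    print(soup.a.find_parent('p'))
    # the sibling <p> that follows the first <p>
    print(soup.p.find_next_sibling('p'))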
  • CSS selectors

[w3c-css选择器][http://www.w3school.com.cn/cssref/css_selectors.asp]

#query with CSS selectors
text='''
<div class ='panle'>
<div class = 'panle-heading' >
<p class="1234">a
<a>a1</a>
<a>a2</a>
d</p>
</div>
<div>
<ul class='ul-1'>
<li id = "item-1">test1</li>
<li id = "item-3">test2</li>
</ul>
<ul class='ul-2'>
<li id = "item-1">test1</li>
<li id = "item-3">test2</li>
</ul>
</div>
'''
from bs4 import BeautifulSoup
soup= BeautifulSoup(text,'lxml')
print(soup.select('.panle .panle-heading'))
print(soup.select('ul li'))
print(soup.select('.ul-1 #item-1'))
print(type(soup.select('ul')[0]))
print(soup.select('ul')[0])
>>> for ul in soup.select('ul'):
...     print( ul.select('li'))
... 
[<li id="item-1">test1</li>, <li id="item-3">test2</li>]
[<li id="item-1">test1</li>, <li id="item-3">test2</li>]
>>> print(soup.select('ul li')[0].get_text())
test1
>>> print(soup.select('ul li')[0].string)
test1
>>> 

7. pyquery

  • Initializing from a string

    text='''
    <div class ='panle'>
    <div class = 'panle-heading' >
    <p class="1234">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    </div>
    <div>
    <ul class='ul-1'>
    <li id = "item-1">test1</li>
    <li id = "item-3">test2</li>
    </ul>
    <ul class='ul-2'>
    <li id = "item-1">test1</li>
    <li id = "item-3">test2</li>
    </ul>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc=pq(text)
    >>>print(doc('li'))
    <li id="item-1">test1</li>
    <li id="item-3">test2</li>
    <li id="item-1">test1</li>
    <li id="item-3">test2</li>
    
  • Initializing from a URL

    from pyquery import PyQuery as pq
    >>> html=pq(url='http://www.sina.com.cn',encoding='utf-8')     
    >>> print(html('title'))                                   
    <title>新浪首页</title>
    
  • Initializing from a file

    from pyquery import PyQuery as pq
    html=pq(filename='demo.html',encoding='utf-8') 
    
  • CSS

    text='''
    <div id='AAA' class ='panle'>
    <div class = 'panle-heading' >
    <p class="1234">a
    <a>a1</a>
    <a>a2</a>
    d</p>
    </div>
    <div>
    <ul class='ul-1'>
    <li id = "item-1">test1</li>
    <li id = "item-3">test2</li>
    </ul>
    <ul class='ul-2'>
    <li id = "item-1">test1</li>
    <li id = "item-3">test2</li>
    </ul>
    </div>
    </div>
    '''
    from pyquery import PyQuery as pq
    doc=pq(text)
    >>> print(doc('.panle .panle-heading a')) 
    <a>a1</a>
    <a>a2</a>
    d
    >>> print(type(doc('.panle .panle-heading a')) )
    <class 'pyquery.pyquery.PyQuery'>
    
    
  • Finding nodes

    1. Child nodes: find() selects all descendants, children() selects direct children only

      #reuse the HTML text above
      from pyquery import PyQuery as pq
      doc=pq(text)
      items=doc('.ul-1')
      >>> print(type(items))
      <class 'pyquery.pyquery.PyQuery'>
      >>> print(items)
      <ul class="ul-1">
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      </ul>
      
      >>> lis=items.find('li')
      >>> print(type(lis))
      <class 'pyquery.pyquery.PyQuery'>
      >>> print(lis)
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      
      >>> lis=items.children()
      >>> print(lis)
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      #filter by id
      >>> lis=items.children('#item-1')
      >>> print(lis)                   
      <li id="item-1">test1</li>
      
      
    2. Parent nodes: parent() returns the direct parent, parents() returns all ancestors

      #reuse the HTML text above
      from pyquery import PyQuery as pq
      doc=pq(text)
      items=doc('.ul-1')
      container=items.parent()
      print(type(container))
      print(container)
      >>> items=doc('.ul-1')
      >>> container=items.parent()
      >>> print(type(container))
      <class 'pyquery.pyquery.PyQuery'>
      >>> print(container)
      <div>
      <ul class="ul-1">
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      </ul>
      <ul class="ul-2">
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      </ul>
      </div>
      
      >>> container=items.parents('.panle')        
      >>> print(container)                 
      <div id="AAA" class="panle">
      <div class="panle-heading">
      <p class="1234">a
      <a>a1</a>
      <a>a2</a>
      d</p>
      </div>
      <div>
      <ul class="ul-1">
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      </ul>
      <ul class="ul-2">
      <li id="item-1">test1</li>
      <li id="item-3">test2</li>
      </ul>
      </div>
      </div>
      
      
    3. Sibling nodes

      from pyquery import PyQuery as pq
      doc=pq(text)
      >>> li=doc( '#item-1')     
      >>> print(li)
      <li id="item-1">test1</li>
      <li id="item-1">test1</li>
      
      >>> print(li.siblings())
      <li id="item-3">test2</li>
      <li id="item-3">test2</li>
      
      
    4. Iteration

      text='''
      <div class= "div0 div1">
      <li id="1" >li-1</li>
      <li>li-2</li>
      <li>li-3</li>
      <li>li-3</li>
      </div>
      '''
      from pyquery import PyQuery as pq
      doc=pq(text)
      >>> lis=doc('li').items()
      >>> print(type(lis))     
      <class 'generator'>
      >>> for li in lis:
      ...     print(li,type(li))
      ... 
      <li id="1">li-1</li>
       <class 'pyquery.pyquery.PyQuery'>
      <li>li-2</li>
       <class 'pyquery.pyquery.PyQuery'>
      <li>li-3</li>
       <class 'pyquery.pyquery.PyQuery'>
      <li>li-3</li>
       <class 'pyquery.pyquery.PyQuery'>
      
      
    5. Getting information (attributes)

      text='''
      <div class= "div0 div1">
      <li id="1" ><span class='bold1'>li-1</span></li>
      <li id="2" ><span class='bold2'>li-2</span></li>
      <li id="3" ><span class='bold3'>li-3</span></li>
      <li id="4" ><span class='bold4'>li-4</span></li>
      </div>
      '''
      from pyquery import PyQuery as pq
      doc=pq(text)
      >>> a=doc('li')
      >>> print(a , type(a))
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>
       <class 'pyquery.pyquery.PyQuery'>
      >>> print(a.attr('id'))
      1
      >>> print(a.attr.id)
      1
      #iterate to read every id
      >>> a=doc('li').items()
      >>> for li in a:
      ...     print(li.attr.id)
      ... 
      1
      2
      3
      4
      
      
    6. Getting text

      text='''
      <div class= "div0 div1">
      <li id="1" ><span class='bold1'>li-1</span></li>
      <li id="2" ><span class='bold2'>li-2</span></li>
      <li id="3" ><span class='bold3'>li-3</span></li>
      <li id="4" ><span class='bold4'>li-4</span></li>
      </div>
      '''
      from pyquery import PyQuery as pq
      doc=pq(text)
      li_text=doc('li')
      >>> print(a,li_text.text())
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>
       li-1 li-2 li-3 li-4
      >>> print(a,li_text.html())
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>    
       <span class="bold1">li-1</span>
      >>> text=li_text.items()       
      >>> for html in text:
      ...     print(html.html())
      ... 
      <span class="bold1">li-1</span>
      <span class="bold2">li-2</span>
      <span class="bold3">li-3</span>
      <span class="bold4">li-4</span>
          
      
      
    7. Node manipulation

      text='''
      <div class= "div0 div1">
      <li id="1" ><span class='bold1'>li-1</span></li>
      <li id="2" ><span class='bold2'>li-2</span></li>
      <li id="3" ><span class='bold3'>li-3</span></li>
      <li id="4" ><span class='bold4'>li-4</span></li>
      </div>
      '''
      from pyquery import PyQuery as pq
      doc=pq(text)
      >>> li_text=doc('div')      
      >>> print(li_text)          
      <div class="div0 div1">
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>
      </div>
      >>> li_text.removeClass('div0')
      [<div.div1>]
      >>> print(li_text)             
      <div class="div1">
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>
      </div>
      >>> li_text.addClass('div2')   
      [<div.div1.div2>]
      >>> print(li_text)          
      <div class="div1 div2">
      <li id="1"><span class="bold1">li-1</span></li>
      <li id="2"><span class="bold2">li-2</span></li>
      <li id="3"><span class="bold3">li-3</span></li>
      <li id="4"><span class="bold4">li-4</span></li>
      </div>
      
      >>> li_text=doc('#1')    
      >>> print(li_text)
      <li id="1"><span class="bold1">li-1</span></li>
      
      >>> print(li_text.attr('name','modify'))
      <li id="1" name="modify"><span class="bold1">li-1</span></li>
      
      >>> print(li_text.text('test modify'))  
      <li id="1" name="modify">test modify</li>
      
      >>> print(li_text.html('<b>AAA</b>'))     
      <li id="1" name="modify"><b>AAA</b></li>
      >>>
      
      


Reposted from blog.csdn.net/weixin_39555721/article/details/89281216