Common Python Web Scraping Code

Other 2021-01-12 10:16:22

Contents

Common Python Web Scraping Code

Core Request code:

BeautifulSoup:

Web basics:

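Request Headers hold what the browser sends to the website, including browser information and cookies. (Response Headers are what the server sends back.)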

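Status codes: 200 means the request succeeded; 418 means the site has detected the request as coming from a crawler.

(On a 418, disguise the request as a browser; see the User-Agent section below.)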
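To see why a site can tell a script from a browser, it helps to look at the headers urllib would send. The following is a minimal offline sketch; the URL and the shortened User-Agent value are placeholders chosen for illustration, not values from the original post.

```python
import urllib.request

# Headers urllib adds by default when no custom headers are supplied.
default_headers = dict(urllib.request.build_opener().addheaders)
print(default_headers["User-agent"])  # something like "Python-urllib/3.x"

# A Request object carries the headers that will be sent to the site.
# The URL and the User-Agent value are placeholders.
req = urllib.request.Request(
    "http://example.com",
    headers={"User-Agent": "Mozilla/5.0"},
)

# urllib normalizes header names, so the key is stored as "User-agent".
print(req.header_items())
print(req.get_header("User-agent"))
```

The default value announces the request as coming from Python, which is the kind of signal a site can use to reject it.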

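Core Request code:

In Python 3, urllib2 has been merged into urllib. Import the submodule you need, e.g. import urllib.request; a bare import urllib does not load the submodules for you.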

import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode("utf-8"))

decode converts the raw bytes returned by read() into readable text.
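Not every page is UTF-8, so a hard-coded "utf-8" can fail or produce garbage. A possible sketch, assuming the server declares its charset in the Content-Type header; decode_body and fetch_text are hypothetical helper names, not part of urllib.

```python
import urllib.request


def decode_body(raw, charset=None):
    """Decode response bytes, falling back to UTF-8 when no charset is given."""
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_text(url):
    """Fetch a page and decode it with the charset declared by the server."""
    with urllib.request.urlopen(url, timeout=5) as response:
        charset = response.headers.get_content_charset()  # None if undeclared
        return decode_body(response.read(), charset)


# Offline demonstration of the decoding step alone.
print(decode_body("爬虫".encode("gbk"), "gbk"))  # 爬虫
print(decode_body(b"hello"))                     # hello
```

With network access, fetch_text("http://www.baidu.com") would return the page as text. errors="replace" swaps undecodable bytes for a replacement character instead of raising.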

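POST:

A commonly used test site for POST requests: httpbin.org

A test that POSTs some data. The form fields are urlencoded and converted to binary with bytes; the same mechanism is used to simulate a real user login (cookies).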

import urllib.request,urllib.parse

if __name__ == '__main__':
    data = bytes(urllib.parse.urlencode({"hello":"world"}),encoding="utf-8")
    response = urllib.request.urlopen("http://httpbin.org/post",data = data)
    print(response.read().decode("utf-8"))
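The encoding step in the snippet above can be inspected without sending anything. A small sketch; encode_form is a hypothetical helper name, and the second field is an invented example.

```python
import urllib.parse


def encode_form(fields):
    """Turn a dict of form fields into the bytes urlopen expects as data."""
    return urllib.parse.urlencode(fields).encode("utf-8")


print(encode_form({"hello": "world"}))       # b'hello=world'
print(encode_form({"q": "web scraping"}))    # b'q=web+scraping'

# parse_qs reverses the encoding, which is handy when checking a payload.
print(urllib.parse.parse_qs("hello=world"))  # {'hello': ['world']}
```

Passing a non-None data argument is what makes urlopen send a POST instead of a GET.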

Remember to wrap requests in try...except urllib.error.URLError: to handle timeouts and other request errors (common practice).

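import urllib.request, urllib.parse, urllib.error

# Placeholder header so the snippet runs on its own; use your browser's real User-Agent
headers = {"User-Agent": "Mozilla/5.0"}
data = bytes(urllib.parse.urlencode({"hello": "world"}), encoding="utf-8")
try:
    req = urllib.request.Request("http://douban.com", data=data, headers=headers, method="POST")
    response = urllib.request.urlopen(req, timeout=5)  # give up after 5 seconds
    print(response.read().decode("utf-8"))
except urllib.error.URLError as e:
    # HTTPError (a subclass of URLError) carries a status code
    if hasattr(e, "code"):
        print(e.code)
    # reason describes what went wrong, e.g. a timeout
    if hasattr(e, "reason"):
        print(e.reason)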
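The hasattr checks work because HTTPError is a subclass of URLError. One possible alternative, sketched here, is to test the exception type explicitly; describe_error and fetch are hypothetical helper names, and the messages are invented for illustration.

```python
import socket
import urllib.error
import urllib.request


def describe_error(exc):
    """Summarize a request failure as a short message."""
    # HTTPError must be checked first because it subclasses URLError.
    if isinstance(exc, urllib.error.HTTPError):
        return f"HTTP error {exc.code}: {exc.reason}"
    if isinstance(exc, urllib.error.URLError):
        return f"URL error: {exc.reason}"
    if isinstance(exc, (socket.timeout, TimeoutError)):
        return "timed out"
    return f"unexpected error: {exc!r}"


def fetch(url, timeout=5.0):
    """Return the page text, or None if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, socket.timeout) as exc:
        print(describe_error(exc))
        return None


# Offline demonstration with a hand-built exception.
print(describe_error(urllib.error.URLError("connection refused")))
```

socket.timeout is caught separately because a timeout while reading the body is not always wrapped in URLError.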

Avoiding 418: (also an example of using Request and response together: build a Request instance, then open it with urlopen)

import urllib.request, urllib.parse

headers = {  # mimic a browser's request headers when sending to the Douban server
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36"  # copy this from your own browser
}

data = bytes(urllib.parse.urlencode({"hello": "world"}), encoding="utf-8")
req = urllib.request.Request("http://douban.com", data=data, headers=headers, method="POST")
response = urllib.request.urlopen(req)
print(response.read().decode("utf-8"))
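The effect of the User-Agent header can be reproduced locally without touching a real site. The sketch below starts a throwaway server on 127.0.0.1 that answers 418 to urllib's default User-Agent. PickyHandler, status_for, run_demo and the shortened BROWSER_UA are all hypothetical, and the server's rule is an assumption made for the demo, not a description of how Douban decides.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # shortened placeholder


class PickyHandler(BaseHTTPRequestHandler):
    """Demo server: rejects urllib's default User-Agent with 418."""

    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        status = 418 if agent.startswith("Python-urllib") else 200
        body = b"ok" if status == 200 else b"crawler detected"
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


def status_for(url, headers=None):
    """Return the HTTP status code the server answers with."""
    req = urllib.request.Request(url, headers=headers or {})
    # An opener without proxies, so the local demo is not routed through one.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(req, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as exc:
        return exc.code


def run_demo():
    """Request the demo server without and with a browser User-Agent."""
    server = HTTPServer(("127.0.0.1", 0), PickyHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_port}/"
    try:
        return status_for(url), status_for(url, {"User-Agent": BROWSER_UA})
    finally:
        server.shutdown()
        server.server_close()
```

Calling run_demo() should return (418, 200): the same request is rejected with the default header and accepted once it looks like a browser.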

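BeautifulSoup:

Purpose: converts a complex HTML document into a tree structure in which every node is a Python object.

(A powerful tool for searching HTML tags and their contents, sparing you tedious manual find work.)

All objects fall into four types:

--Tag (an HTML tag)

--NavigableString (the text inside a tag)

--BeautifulSoup (the document as a whole)

--Comment (an HTML comment, a special kind of NavigableString)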
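The four types can be seen side by side on a tiny document. A minimal sketch, assuming bs4 is installed; the HTML string is invented for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# A small made-up document; the <a> tag contains only a comment.
html = (
    "<html><head><title>Demo</title></head>"
    '<body><a class="mnav" href="/news"><!--hidden--></a></body></html>'
)
soup = BeautifulSoup(html, "html.parser")

print(type(soup))               # the BeautifulSoup object: the whole document
print(type(soup.a))             # Tag
print(type(soup.title.string))  # NavigableString
print(type(soup.a.string))      # Comment

# A Comment prints without its <!-- --> markers, so check the type
# before treating .string as ordinary page text.
if isinstance(soup.a.string, Comment):
    print("comment text:", soup.a.string)
```

Because Comment is a subclass of NavigableString, .string returns comments as well as visible text; an isinstance check is one way to tell them apart.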

Basic example:

from bs4 import BeautifulSoup

with open("baidu.html", "rb") as file:  # closes the file automatically
    html = file.read()
bs = BeautifulSoup(html, "html.parser")


print(bs.a)  # the first <a> tag
print(bs.title.string)  # the string inside the <title> tag
print(bs.a.attrs)  # the tag's attributes as a dict

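Common search functions:

import re

bs.find_all("a")  # exact match on the tag name
bs.find_all(re.compile("a"))  # regex match: any tag whose name contains "a"
bs.find_all(id="head")  # tags whose id is "head"


# Search by function: pass a function that takes a tag and returns True for a match
def rules(tag):
    return tag.has_attr("name")
bs.find_all(rules)

tlist = bs.select('title')  # select by tag name
tlist = bs.select('.mnav')  # select by class name (use '#' for an id)
tlist = bs.select('head > title')  # select a direct child
tlist = bs.select('.mnav ~ .bri')  # select following siblings
# Get the text of a match
print(tlist[0].get_text())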
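To see what each search returns, the calls above can be run against an inline document instead of baidu.html. A sketch assuming bs4 is installed; the HTML, its class names and the has_name helper are made up for illustration.

```python
import re

from bs4 import BeautifulSoup

HTML = """
<html><head><title>Demo page</title></head>
<body>
<div id="head">
<a class="mnav" name="tj_news" href="/news">News</a>
<a class="mnav" href="/map">Map</a>
<a class="bri" href="/more">More</a>
</div>
</body></html>
"""
soup = BeautifulSoup(HTML, "html.parser")


def has_name(tag):
    """Match only tags that carry a name attribute."""
    return tag.has_attr("name")


# Exact match: only <a> tags.
print([t["href"] for t in soup.find_all("a")])
# Regex match: every tag whose name contains "a", so <head> matches too.
print(sorted({t.name for t in soup.find_all(re.compile("a"))}))
# Keyword match on the id attribute.
print([t.name for t in soup.find_all(id="head")])
# Function match.
print([t["href"] for t in soup.find_all(has_name)])
# CSS selectors: direct child, then following siblings.
print(soup.select("head > title")[0].get_text())
print([t.get_text() for t in soup.select(".mnav ~ .bri")])
```

The regex case is the one most likely to surprise: re.compile("a") is applied as a search on the tag name, so anchor it (for example "^a$") when only <a> tags are wanted.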

Reposted from blog.csdn.net/ShuoCHN/article/details/112389441
