Python web scraping: bs4 BeautifulSoup

Category: Other. Published 2019-08-08 12:48:05

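How to use it:

Core idea: convert the HTML document into a BeautifulSoup object, then call that object's attributes and methods to locate specific content in the document.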

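Import: from bs4 import BeautifulSoup

Create the BeautifulSoup object:

- If the HTML comes from a local file:

BeautifulSoup(open('local HTML file'), 'lxml')

- If the HTML comes from the network:

BeautifulSoup(page_text, 'lxml'), where page_text is the page data returned by the request

The second argument names the parser; 'lxml' requires the lxml package to be installed.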
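The two ways of creating the object can be sketched as below. This is a minimal sketch, not the original author's code: the sample HTML and the temporary file are made up for illustration, and the built-in 'html.parser' is swapped in for 'lxml' so the example runs without the lxml package.

```python
import os
import tempfile

from bs4 import BeautifulSoup

# Hypothetical sample page used by both cases.
html = "<html><body><a href='/index.html' title='home'>Home</a></body></html>"

# Case 1: the HTML comes from a local file. Write a sample file first so the
# example is self-contained, then pass the open file handle to BeautifulSoup.
path = os.path.join(tempfile.mkdtemp(), "sample.html")
with open(path, "w", encoding="utf-8") as f:
    f.write(html)

with open(path, encoding="utf-8") as f:
    soup_from_file = BeautifulSoup(f, "html.parser")

# Case 2: the HTML comes from the network. In real code this string would be
# the body of an HTTP response; here the literal above stands in for it.
soup_from_text = BeautifulSoup(html, "html.parser")

print(soup_from_file.a["href"])
print(soup_from_text.a["href"])
```

Opening the file in a with block closes the handle once parsing is done, which the one-line open(...) form does not guarantee.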

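Attributes and methods:

(1) Look up by tag name

- soup.a returns only the first matching tag

(2) Get attributes

- soup.a.attrs returns all of the tag's attributes and their values as a dictionary

- soup.a.attrs['href'] returns the href attribute

- soup.a['href'] is the shorthand form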
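A short sketch of tag-name lookup and attribute access follows. The sample HTML is invented for illustration, and 'html.parser' is used instead of 'lxml' to avoid an extra dependency.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup with two a tags.
html = """
<div>
  <a href="/first" title="one" class="link">First</a>
  <a href="/second" title="two">Second</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

first_a = soup.a                      # only the first a tag in the document
attrs = first_a.attrs                 # dict of every attribute on that tag
href_long = first_a.attrs["href"]     # explicit lookup through attrs
href_short = first_a["href"]          # shorthand, same result
missing = first_a.get("target")       # get() gives None for an absent attribute

print(attrs)
print(href_long, href_short, missing)
```

Note that class is a multi-valued attribute, so its value in the dictionary is a list of class names rather than a single string.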

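(3) Get content

- soup.a.string (comparable to XPath /text())

- soup.a.text (comparable to XPath //text())

- soup.a.get_text() (comparable to XPath //text())

[Note] If the tag contains more than one child node, for example text mixed with a nested tag, string returns None, while the other two still return the text content.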
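The difference between string and the two text accessors is easiest to see on small inputs. This is an illustrative sketch; the helper first_a and the three snippets are hypothetical, and 'html.parser' stands in for 'lxml'.

```python
from bs4 import BeautifulSoup


def first_a(html):
    """Parse a snippet and return its first a tag (helper for this example)."""
    return BeautifulSoup(html, "html.parser").a


plain = first_a("<a>Hello</a>")                 # one text child
mixed = first_a("<a>Hello <b>world</b></a>")    # text plus a nested tag
wrapped = first_a("<a><b>only</b></a>")         # a single nested tag

print(plain.string)      # the text itself
print(mixed.string)      # None, because there are two child nodes
print(mixed.text)        # text of all descendants joined together
print(mixed.get_text())  # same as .text
print(wrapped.string)    # falls through to the single child's string
```

The last case shows why the note is worded in terms of child count: a tag whose only child is another tag still has a usable string.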

(4) find: returns the first matching tag

- soup.find('a') returns the first a tag

- soup.find('a', title="xxx")

- soup.find('a', alt="xxx")

- soup.find('a', class_="xxx")

- soup.find('a', id="xxx")
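The find variants above might be exercised as follows. The markup is made up for illustration, 'html.parser' replaces 'lxml', and the alt filter is shown on an img tag since that is where alt normally appears.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = """
<div>
  <a href="/a1" title="first" class="nav">One</a>
  <a href="/a2" title="second" class="nav item" id="target">Two</a>
  <img src="x.png" alt="logo">
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find("a")                     # first a tag
by_title = soup.find("a", title="second")    # filter on an attribute value
by_class = soup.find("a", class_="item")     # class_ because class is a keyword
by_id = soup.find("a", id="target")
by_alt = soup.find("img", alt="logo")
nothing = soup.find("a", title="missing")    # no match gives None, not an error

print(by_name["href"], by_title["href"], by_class["href"], by_id["href"])
print(by_alt["src"], nothing)
```

Because find returns None when nothing matches, it is worth checking the result before indexing into it.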

(5) find_all: returns all matching tags

- soup.find_all('a')

- soup.find_all(['a','b']) returns all a and b tags

- soup.find_all('a', limit=2) returns only the first two matches
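A small sketch of the three find_all calls, using invented markup and 'html.parser' in place of 'lxml':

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: three a tags and two b tags.
html = "<p><a>1</a><b>x</b><a>2</a><a>3</a><b>y</b></p>"
soup = BeautifulSoup(html, "html.parser")

all_a = soup.find_all("a")               # every a tag
a_and_b = soup.find_all(["a", "b"])      # every a or b tag, in document order
first_two = soup.find_all("a", limit=2)  # stop after two matches

print([t.get_text() for t in all_a])
print([t.get_text() for t in a_and_b])
print([t.get_text() for t in first_two])
```

Unlike find, find_all returns an empty list when nothing matches, so the result can be looped over without a None check.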

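(6) Select content with CSS selectors

select: soup.select('#feng')

- Common selectors: tag selector (a), class selector (.), id selector (#), hierarchy selectors

- Hierarchy selectors:

div .dudu #lala .meme .xixi : descendant selector, matches at any depth below, like XPath div//img

div > p > a > .lala : child selector, each step matches only the direct child level, like XPath div/img

[Note] select always returns a list, so use an index to pull out a specific element.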
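The selector forms above can be compared on one small document. This is a sketch under assumptions: the markup is made up (reusing the id and class names from the notes), and 'html.parser' is used instead of 'lxml'.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup: one a tag directly under p, one nested deeper.
html = """
<div class="dudu">
  <p id="feng">
    <a class="lala" href="/direct">direct</a>
    <span><a class="lala" href="/nested">nested</a></span>
  </p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_id = soup.select("#feng")        # id selector, still a list of one element
by_class = soup.select(".lala")     # class selector
descendants = soup.select("div a")  # space: any depth below div
children = soup.select("p > a")     # >: direct children of p only
empty = soup.select("#missing")     # no match gives an empty list

first = by_id[0]                    # index into the list to get the tag itself

print(first.name, len(by_class), len(descendants), len(children), empty)
```

The descendant selector finds both links while the child selector finds only the one directly under p, which is the same distinction as div//img versus div/img in XPath.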

Reposted from www.cnblogs.com/person1-0-1/p/11320392.html
