Introduction to Python Web Crawlers

Other 2018-12-13 12:49:31

1. Fetching the page source

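from urllib import request

# Open the URL and read the raw response body (bytes).
# The with-block closes the connection automatically.
with request.urlopen("https://blog.csdn.net") as fp:
    content = fp.read()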
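
Some sites reject requests that do not carry a browser-like User-Agent header, and a network call can fail or hang. The sketch below shows one way to make the fetch step more defensive. The helper name `fetch`, the User-Agent string, and the 10-second timeout are illustrative assumptions, not part of the original article.

```python
from urllib import request


def fetch(url, timeout=10):
    """Return the response body as bytes, or None if the request fails.

    `fetch`, the User-Agent value and the default timeout are
    hypothetical choices made for this example.
    """
    req = request.Request(url, headers={"User-Agent": "Mozilla/5.0 (tutorial example)"})
    try:
        # The with-block closes the connection even if read() raises
        with request.urlopen(req, timeout=timeout) as fp:
            return fp.read()
    except OSError as err:
        # URLError, HTTPError and socket timeouts are all OSError subclasses
        print("request failed:", err)
        return None
```

Calling `fetch("https://blog.csdn.net")` would then return the same bytes as the snippet above, or `None` when the request fails, so the caller can decide whether to retry or skip the page.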

2. Extracting information from the source

This step uses Beautiful Soup, a Python library for extracting data from HTML or XML files.
Install the library:

pip3 install beautifulsoup4

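import re
from bs4 import BeautifulSoup

# content holds the page source fetched in step 1
soup = BeautifulSoup(content, 'html.parser')
for a in soup.find_all(name='a'):  # find all <a> tags
    print('attrs:', a.attrs)  # print the tag's attributes

# Use a regular expression to find all tags whose id is "link" followed by a digit
for a in soup.find_all(attrs={'id': re.compile(r'link\d')}):
    print(a)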
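
To try the two lookups without touching the network, the same calls can be run against a small hand-written document. This is only a sketch: the `sample` markup and the variable names are invented for illustration, and it assumes `beautifulsoup4` is installed.

```python
import re
from bs4 import BeautifulSoup

# A small hand-written document (an assumption for this example),
# so the code runs without network access
sample = """
<html><body>
<a id="link1" href="https://example.com/a">A</a>
<a id="link2" href="/relative">B</a>
<a id="other" href="http://example.com/c">C</a>
</body></html>
"""

soup = BeautifulSoup(sample, "html.parser")

# Every <a> tag, regardless of its attributes
all_links = soup.find_all("a")

# Only tags whose id contains "link" followed by a digit
numbered = soup.find_all(attrs={"id": re.compile(r"link\d")})
numbered_ids = [tag["id"] for tag in numbered]

# Tag attributes behave like a dictionary; .get() avoids a KeyError
# when a tag has no href
hrefs = [tag.get("href") for tag in all_links]
```

Beautiful Soup applies a compiled pattern with its `search()` method, so `link\d` matches anywhere inside the attribute value. Anchor the pattern with `^` when only a prefix match is wanted.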

3. Processing the information

The extracted data can be written to a file or processed further, for example by cleaning it.
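
As one possible reading of "cleaning", the sketch below resolves relative links, drops fragments and non-HTTP schemes, and removes duplicates before writing the result as CSV. The helper `clean_links`, the sample `raw` list, and the use of `io.StringIO` in place of a real file are all assumptions made for illustration.

```python
import csv
import io
from urllib.parse import urljoin, urldefrag


def clean_links(hrefs, base_url):
    """Return absolute, de-duplicated http(s) links in first-seen order.

    `clean_links` is a hypothetical helper written for this example.
    """
    seen = set()
    cleaned = []
    for href in hrefs:
        if not href:
            continue  # skip missing or empty href values
        # Resolve relative links against the page URL and drop any #fragment
        absolute, _fragment = urldefrag(urljoin(base_url, href.strip()))
        if not absolute.startswith(("http://", "https://")):
            continue  # skip mailto:, javascript: and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            cleaned.append(absolute)
    return cleaned


# Made-up input resembling href values pulled from a page
raw = ["https://example.com/a", " /b ", "https://example.com/a#top",
       "mailto:someone@example.com", None]
links = clean_links(raw, "https://example.com/")

# Write one link per row; io.StringIO stands in for a real file here
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["url"])
writer.writerows([link] for link in links)
```

Replacing `buffer` with `open(path, "w", newline="", encoding="utf8")` would write the same rows to disk.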

A complete example covering all three steps:

from bs4 import BeautifulSoup
from urllib import request
import re
import chardet
import sys
import io

# Change the default encoding of standard output
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='utf8')
with request.urlopen("https://blog.csdn.net") as fp:
    html = fp.read()
# Detect the page's encoding
det = chardet.detect(html)
# Decode with the detected encoding (fall back to UTF-8 if detection fails)
soup = BeautifulSoup(html.decode(det['encoding'] or 'utf-8'), 'html.parser')
# Open the output file once, then find every tag whose href starts with http or https
with open(r'C:\Users\Van\Desktop\test.csv', 'a+', encoding='utf8') as file:
    for tag in soup.find_all(attrs={'href': re.compile(r'^https?')}):
        print(tag)
        content = tag.attrs['href'] + '\n'
        file.write(content)  # write to file
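
Encoding detection is a guess and can be wrong or return nothing, in which case `decode` raises an exception. A minimal sketch of a more forgiving decode step follows. The helper name `decode_page` and the fallback order (guessed encoding, then UTF-8, then GB18030) are assumptions for illustration, not part of the original article.

```python
from typing import Optional


def decode_page(raw: bytes, guessed: Optional[str] = None) -> str:
    """Decode page bytes, trying several encodings in turn.

    `guessed` would typically be chardet's det['encoding']; it may be None.
    `decode_page` and the fallback list are hypothetical choices.
    """
    for encoding in (guessed, "utf-8", "gb18030"):
        if not encoding:
            continue
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            # Wrong guess or unknown codec name: try the next candidate
            continue
    # Last resort: keep what can be decoded and mark the rest
    return raw.decode("utf-8", errors="replace")
```

In the example above this would replace `html.decode(...)` with `decode_page(html, det['encoding'])`.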

Reposted from blog.csdn.net/qq_27466827/article/details/84112716
