Python----Web Scraping for Beginners (A Regular-Expression Implementation)

Other 2018-07-07 01:10:00

This time the target to scrape is Baidu Images.

First, the URL:

https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=index&fr=&hs=0&xthttps=111111&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%E7%A8%8B%E5%BA%8F%E5%91%98&oq=%E7%A8%8B%E5%BA%8F%E5%91%98&rsp=-1

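Viewing the page source shows that the image URLs are stored inside JavaScript rather than in ordinary HTML tags.

BeautifulSoup works on the tag structure of a page, so it cannot extract values directly from inside that JavaScript.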
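To make the problem concrete, here is a minimal sketch that uses the standard library's `html.parser` as a stand-in for a tag-based parser (it is not BeautifulSoup itself). The `page` string and the `ImgSrcCollector` class are hypothetical and only imitate the situation described above: the image URL sits in script text, so looking for `<img>` tags finds nothing.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src":
                    self.sources.append(value)


# Hypothetical page: no <img> tags, the URL only appears inside a <script>.
page = (
    '<html><body><div id="imgs"></div>'
    '<script>var d = {"objURL":"http://example.com/a.jpg"};</script>'
    '</body></html>'
)

collector = ImgSrcCollector()
collector.feed(page)

# Tag-based extraction finds no image sources at all...
print(collector.sources)
# ...even though the URL is present in the raw text of the page.
has_obj_url = '"objURL"' in page
print(has_obj_url)
```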

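In this situation we can fall back on a more traditional method, one that is also harder to understand:

regular expressions.

A regular expression is a pattern, that is, a rule describing what text should look like. Operations are then carried out against whatever the rule matches; the most common ones are searching and replacing.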
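As a small illustration of those two operations, the sketch below uses `re.findall` to search and `re.sub` to replace. The sample string is made up for the example and has nothing to do with Baidu's page.

```python
import re

# Made-up sample text for illustration only.
text = "cat.jpg, dog.png, bird.jpg"

# Search: collect the base name of every file ending in .jpg.
jpg_names = re.findall(r"(\w+)\.jpg", text)

# Replace: rewrite every .png extension to .jpg.
renamed = re.sub(r"\.png\b", ".jpg", text)

print(jpg_names)
print(renamed)
```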

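import os
import re
import time
from urllib.request import urlretrieve

import requests

# Spoof the request headers so the site believes the request comes from a browser
headers = {
    'Host': 'image.baidu.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36',
}
# Target URL to scrape
url = "https://image.baidu.com/search/index?tn=baiduimage&ipn=r&ct=201326592&cl=2&lm=-1&st=-1&fm=index&fr=&hs=0&xthttps=111111&sf=1&fmq=&pv=&ic=0&nc=1&z=&se=1&showtab=0&fb=0&width=&height=&face=0&istype=2&ie=utf-8&word=%E7%A8%8B%E5%BA%8F%E5%91%98&oq=%E7%A8%8B%E5%BA%8F%E5%91%98&rsp=-1"
# Request the URL
response = requests.get(url, headers=headers)
# response.content is bytes, so decode it to str first
html = response.content.decode('utf-8')
# Then match with a regular expression
# (named image_urls rather than list, so the built-in list is not shadowed)
image_urls = re.findall(r'"objURL":"(.*?)"', html)
# Printed 30 in the author's run: only thirty images are loaded initially
print(len(image_urls))
print(image_urls)
# Download the images
# Make sure the output directory exists; urlretrieve does not create it
os.makedirs('images', exist_ok=True)
# The index is used as the image's file name
for index, image_url in enumerate(image_urls):
    # Catch exceptions: an image that cannot be downloaded is skipped,
    # and the loop moves on to the next one
    try:
        # Build the download path
        path = os.path.join('images', str(index) + ".jpg")
        # Download
        urlretrieve(image_url, filename=path)
        time.sleep(2)
    except Exception as e:
        # Report the failure instead of discarding it silently
        print(f"Skipping {image_url}: {e}")
        continue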
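The pattern `"objURL":"(.*?)"` relies on the non-greedy quantifier `.*?`, which stops at the first closing quote. The sketch below compares it with the greedy `.*` on a hypothetical fragment that only imitates JSON embedded in a script tag; the real Baidu markup may differ.

```python
import re

# Hypothetical fragment imitating JSON embedded in a <script> tag.
sample_html = (
    '<script>var data = {"data":['
    '{"objURL":"http://example.com/a.jpg","width":100},'
    '{"objURL":"http://example.com/b.png","width":200}'
    ']};</script>'
)

# Non-greedy: each capture ends at the first closing quote after the URL.
lazy = re.findall(r'"objURL":"(.*?)"', sample_html)

# Greedy: the capture runs on to the last quote in the text,
# swallowing both entries into a single match.
greedy = re.findall(r'"objURL":"(.*)"', sample_html)

print(lazy)
print(len(greedy))
```

With the non-greedy form each URL comes back as its own list element, which is what the download loop needs.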


Reposted from blog.csdn.net/qq_36457148/article/details/80869289
