Do you really know how to get the best deal when shopping? Learn to scrape prices with Python!

I. Coding process

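1 Set the goal: use the regular expressions I just learned to scrape product names and product prices from an e-commerce site.

2 Decide on the approach:

① Choose an e-commerce site: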

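Searching Taobao by keyword requires logging in first. I looked into Taobao's login flow: it requires capturing network traffic and some time to analyze, with no guarantee of success, so I gave up on Taobao.

JD supports keyword search simply by putting the keyword into the URL:

'https://search.jd.com/Search?keyword=' + keyword + '&enc=utf-8&wq=' + keyword

In addition, the source of the search results page contains the product prices and product names directly, without relying on JavaScript, so I decided to practice on the friendlier JD.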
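
The URL pattern above can also be assembled with the standard library instead of string concatenation. This is only a sketch: `build_search_url` is a hypothetical helper name, and the optional `page` parameter is taken from the paging URL used in the full code later in this post. `requests` generally percent-encodes non-ASCII keywords on its own, so the explicit encoding here mainly makes the final URL visible.

```python
from urllib.parse import urlencode


def build_search_url(keyword, page=None):
    """Build a JD search URL following the pattern quoted in the post."""
    params = {'keyword': keyword, 'enc': 'utf-8', 'wq': keyword}
    if page is not None:
        params['page'] = page
    # urlencode percent-encodes non-ASCII text as UTF-8 by default
    return 'https://search.jd.com/Search?' + urlencode(params)


url = build_search_url('书包', page=1)
print(url)
```

Keeping the parameters in a dict makes it harder to drop a separator such as `&enc` by accident.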

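② Work out the regular expressions that extract the price and the product name from the product list page source. Copy a fragment of the source to experiment on; once the experiment succeeds, move the expressions into the code.

The experiment code is as follows: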

# Experiment: extract the product price and the product name
import re

goods1 = '手机壳'  # search keyword ("phone case")
html1 = '<div class="p-price"><strong class="J_55346447381" data-done="1"><em>¥</em><i>28.80</i></strong>		</div>		<div class="p-name p-name-type-2">			<a target="_blank" title="【推荐苹果11/X系列隐形钻石膜,防爆不碎边】专享买1送1,领券59减3、满79减5!京配免邮,次日达!多买多优惠!猛戳这里去购买!" href="//item.jd.com/55346447381.html" onclick="searchlog(1,55346447381,0,1,flagsClk=20971660)">				<em><span class="p-tag" style="background-color:#c81623">京东超市</span>亿色(ESR)苹果11/11Pro<font class="skcolor_ljg">手机壳</font> iPhone11 Pro max保护套超薄全透明防摔硅胶壳 苹果11【6.1英寸】送钢化膜</em>				<i class="promo-words" id="J_AD_55346447381">【推荐苹果11/X系列隐形钻石膜,防爆不碎边】专享买1送1,领券59减3、满79减5!京配免邮,次日达!多买多优惠!猛戳这里去购买!</i>			</a>		</div>'
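# Product price: match strings that start with <em>¥</em><i> and end with a dot followed by two digits
plt = re.findall(r'<em>¥</em><i>.*?\.\d\d', html1)
print(plt)
price = plt[0].split('<i>')[1]  # keep only the part after <i>, i.e. the number
print(price)
# Product name: match from <em> up to the first </em> that follows it.
# Note: [^(<em>¥</em>)] is a negated character class, not a negated group. It matches any single
# character other than ( ) < > / e m ¥. The price's <em> directly follows '>' and is therefore skipped,
# while the name's <em> follows whitespace and is matched (the match includes that whitespace character).
tlt1 = re.findall(r'[^(<em>¥</em>)]<em>.*?' + goods1 + r'.*?</em>', html1)  # the name must contain the keyword
tlt2 = re.findall(r'[^(<em>¥</em>)]<em>.*?[\u4e00-\u9fa5].*?</em>', html1)  # the name must contain at least one Chinese character
print(tlt1)
print(tlt2)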
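
As a possible alternative to the split-and-character-class approach in the experiment, capture groups and a negative lookahead can return the price and name directly. This is a sketch, not the method used in this post; `sample` is a hypothetical, shortened stand-in for `html1` with the same tag layout.

```python
import re

# Hypothetical shortened stand-in for html1 (same tag layout, less text)
sample = (
    '<div class="p-price"><strong><em>¥</em><i>28.80</i></strong></div>'
    '<div class="p-name"><a href="#"> '
    '<em>ESR <font class="skcolor_ljg">手机壳</font> iPhone11</em></a></div>'
)

# The capture group (...) makes findall return only the number, so no split is needed
prices = re.findall(r'<em>¥</em><i>(\d+\.\d\d)</i>', sample)

# The negative lookahead (?!¥) rejects the <em> that holds the currency sign
names = re.findall(r'<em>(?!¥)(.*?)</em>', sample)

print(prices)
print(names)
```

Unlike the character-class version, the lookahead does not depend on which character happens to precede `<em>`, and the match does not include the surrounding `<em>` tags.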

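II. Full scraping code

It is generous of JD to leave this open for us beginners to learn from. The code below is for learning and discussion only; please crawl the way a person would browse, not at machine-like frequency.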

# Scraping code
import re
import time

import requests

goods = '书包'  # search keyword ("school bag")
depth = 2  # search depth of 2, i.e. scrape page 1 and page 2
start_url = 'https://search.jd.com/Search?keyword=' + goods + '&enc=utf-8&wq=' + goods
infoList = []
hd = {'user-agent': 'Mozilla/5.0'}
for j in range(depth):  # process each results page in turn
    # Build the paged URL, e.g. https://search.jd.com/Search?keyword=书包&enc=utf-8&wq=书包&page=1
    url = start_url + '&page=' + str(j + 1)
    try:
        r = requests.get(url, headers=hd, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding  # decode with the detected encoding so the Chinese text is not garbled
        print(r.status_code)
        print(r.url)
        html = r.text
    except requests.RequestException:
        print("Fetch failed")
        time.sleep(2)
        continue  # skip this page and go on to the next one
    # Product price: starts with <em>¥</em><i> and ends with a dot followed by two digits
    plt = re.findall(r'<em>¥</em><i>.*?\.\d\d', html)
    # Product name: from <em> to the first </em>, containing at least one Chinese character.
    # [^(<em>¥</em>)] is a negated character class: the <em> must not directly follow any of ( ) < > / e m ¥,
    # which skips the <em>¥</em> of the price.
    tlt = re.findall(r'[^(<em>¥</em>)]<em>.*?[\u4e00-\u9fa5].*?</em>', html)
    # zip stops at the shorter list, so a count mismatch cannot raise IndexError
    for price_html, title in zip(plt, tlt):
        price = price_html.split('<i>')[1]
        infoList.append([price, title])  # append() adds a new object to the end of the list
    time.sleep(2)  # pause between pages

tplt = "{:^10}\t{:^10}\t{:^20}"  # print template; each pair of braces {} defines a slot
# str.format() has been available since Python 2.6, e.g. print("Site: {name}, URL: {url}".format(name="菜鸟教程", url="www.runoob.com"))
print(tplt.format("No.", "Price", "Product name"))
count = 0
for g in infoList:
    count = count + 1
    print(tplt.format(count, g[0], g[1]))  # print price and name; the title string is left unprocessed
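
The loop above waits a fixed two seconds between pages. In line with the advice to crawl the way a person would, one option is to randomize the wait. This is a minimal sketch; `polite_pause` and its default values are assumptions for illustration, not something taken from JD or from `requests`.

```python
import random
import time


def polite_pause(base=2.0, jitter=1.5, sleep=time.sleep):
    """Sleep for `base` seconds plus a random extra of up to `jitter` seconds.

    Returns the delay that was used. `sleep` can be swapped out for testing.
    """
    delay = base + random.uniform(0, jitter)
    sleep(delay)
    return delay


# Demo with a no-op sleep so the example finishes immediately
demo_delay = polite_pause(sleep=lambda seconds: None)
print(round(demo_delay, 2))
```

In the scraping loop, `polite_pause()` would take the place of `time.sleep(2)`.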

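III. Scraped data

The first 10 entries of the list are shown; the extra markup in the product titles has not been stripped.

No. Price Product name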

1 49.00 <em>多功能学生挂书袋可调课桌挂袋书本收纳袋 学生挂书袋 书挂袋书桌收纳袋文件文具挂书袋课桌神器挂架 蓝色</em>

2 99.00 <em>稻草人双肩包男女14/15.6英寸大容量笔记本电脑包多功能旅行出差背包防泼水商务休闲学生<font class="skcolor_ljg">书包</font>50470黑色</em>

3 69.00 <em>双肩包男士背包大容量时尚休闲商务旅行笔记本电脑包高中大学生<font class="skcolor_ljg">书包</font>男潮流USb充电包包65199 黑色</em>

4 159.00 <em>七匹狼双肩包 背包男15.6英寸电脑包商务休闲通勤防泼水牛津布<font class="skcolor_ljg">书包</font> 黑色B0301872-201</em>

5 168.00 <em>瑞士SWICKY瑞驰双肩包男士背包新款大容量休闲商务旅行笔记本电脑包学生<font class="skcolor_ljg">书包</font>出差包USb充电包 黑色 大号带usb送多功能刀+锁</em>

6 169.00 <em>七匹狼背包男 牛津布双肩包休闲简约15.6寸电脑包时尚潮流旅行包大容量学生<font class="skcolor_ljg">书包</font>男 黑色B0301062-201</em>

7 69.80 <em><font class="skcolor_ljg">书包</font>男士夜光双肩背包中小学生男韩版休闲电脑包大学usb旅行包 USB大号音乐小子+笔袋+防盗锁</em>

8 159.00 <em><img class="p-tag3" src="//img14.360buyimg.com/uba/jfs/t6919/268/501386350/1257/92e5fb39/5976fcf9Nd915775f.png" />第九城V.NINE 小学生<font class="skcolor_ljg">书包</font>男女孩儿童护脊<font class="skcolor_ljg">书包</font>1-3-6年级减负双肩背包初中学生休闲<font class="skcolor_ljg">书包</font> VD9BV33972J 蓝配粉</em>

9 79.00 <em><span class="p-tag" style="background-color:#c81623">京东超市</span>第九城V.NINE 双肩包男女卡通印花<font class="skcolor_ljg">书包</font>六件套帆布休闲背包校园中小学生<font class="skcolor_ljg">书包</font> VB7BV32884J 粉色套装</em>

10 59.90 <em>2020新款<font class="skcolor_ljg">书包</font>男背包女初中生开学包休闲简约时尚潮流帆布包百搭高中学生 黑色</em>
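
The titles above still carry `<em>`, `<font>`, `<span>` and `<img>` tags. A small post-processing step could strip them. This is a sketch; `clean_title` is a hypothetical helper, not part of the original code.

```python
import re


def clean_title(raw):
    """Remove HTML tags from a scraped title and collapse runs of whitespace."""
    text = re.sub(r'<[^>]+>', '', raw)  # drop anything that looks like <...>
    return re.sub(r'\s+', ' ', text).strip()


raw = ('<em>七匹狼双肩包 背包男15.6英寸电脑包商务休闲通勤防泼水牛津布'
       '<font class="skcolor_ljg">书包</font> 黑色B0301872-201</em>')
print(clean_title(raw))
```

Note that this only removes the tags themselves: label text inside a tag, such as the 京东超市 badge in entry 9, stays in the title.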

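IV. Part of the product list page source as fetched at the time (2020-4-12)

Reading this source side by side with the code makes it clear why the code is written the way it is.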

<div class="p-scroll">
			<span class="ps-prev">&lt;</span>
			<span class="ps-next">&gt;</span>
			<div class="ps-wrap">
				<ul class="ps-main">
					<li class="ps-item"><a href="javascript:;" class="curr" title="蓝色"><img data-url="https://item.jd.com/64923971966.html"  data-presale="" data-sku="64923971966" data-img="1" data-lazy-img="//img11.360buyimg.com/n9/jfs/t1/89937/11/18011/133904/5e8e7872E5d238ffa/e2752ecd1eb188cc.jpg" class="err-product" width="25" height="25" /></a></li>
									</ul>
			</div>
		</div>
		<div class="p-price">
<strong class="J_64923971966" data-done="1"><em>¥</em><i>49.00</i></strong>		</div>
		<div class="p-name p-name-type-2">
			<a target="_blank" title="多功能学生挂书袋可调课桌挂袋书本收纳袋 学生挂书袋 书挂袋书桌收纳袋文件文具挂书袋课桌神器挂架 蓝色" href="https://item.jd.com/64923971966.html" onclick="searchlog(1,64923971966,0,1,'','adwClk=1');searchAdvPointReport('https://ccc-x.jd.com/dsp/nc?ext=aHR0cHM6Ly9pdGVtLmpkLmNvbS82NDkyMzk3MTk2Ni5odG1s&log=4o6yQPJy6XmVSDUPaAlnilzQoTl0WfQq_iFkBg-nAELRr_jWgST6F3gHkDceKGeLFNVwe-soMnCpciBNs23mQ-Ilfi01tO75IDlJJX-6zhuGhAHxgFmEvKNeQT_qOIh8ZDU-NBcY8BsO9QLaz0X57aPu3e23a54_KScadwVylpD691LvcQa8ZbIjXHcQ17QOvtke4mexTr2lONtxOUaqrutZv5jV-h-7aOPjf_pruYgj_evBk7UICQoYrHVO0KZ_lui2p5hOalWxF3oKDmIkyo4ZwP8laIw9XFGI5tSiiOm1NqThyDWIwpRknK91PjiHNrlTIzMDemk-v03a2rjIi-Q9nHrG7vrq_SP0hc3z8aqUgN5VvW8WeChuIzSBJSGEoENy3HEx0XnARSKCiUbYBcrU--XghhLCocnp0a8x_sX7vMd1idTT4W7eeYfs-2v1u2ftQZz3UWxuI3bljxX0ZQ7obwL7Nyw9KbZS9wasMO5UY9kv5KyTRUc3-SQCTeEhUnCFou_VllDAaoHd90ols2Ca3lLUcCgcWEqv8HL7xiQ17MN8mm9-HFMIyYlZWwGZ1E9NuCW9M2PZ2IqYDTqGY5aVRFkJez8V3wQrqn61VwU9KCrU8GT2WOUmahNglOKLTvkdAzsKbg5Un2kUV3D2mssvsf76pw5itFaS5nSyxg3QPpBBd_gWrWCMrbuQ858X&v=404&clicktype=1&&clicktype=1');">
				<em>多功能学生挂书袋可调课桌挂袋书本收纳袋 学生挂书袋 书挂袋书桌收纳袋文件文具挂书袋课桌神器挂架 蓝色</em>
				<i class="promo-words" id="J_AD_64923971966"></i>
			</a>
		</div>
		<div class="p-commit">
			<strong><a id="J_comment_64923971966" target="_blank" href="https://item.jd.com/64923971966.html" onclick="searchlog(1,64923971966,0,3,'','adwClk=1')"></a></strong>
		</div>
		<div class="p-focus"><a class="J_focus" data-sku="64923971966" href="javascript:;" title="点击关注" onclick="searchlog(1,64923971966,0,5,'','adwClk=1')"><i></i>关注</a></div>
		<div class="p-shop" data-dongdong="" data-selfware="0" data-score="0" data-reputation="20" data-verderId="800106" data-shopid="795794">
		</div>	
		
		<div class="p-icons" id="J_pro_64923971966">
		</div>
		<span class="p-promo-flag">广告</span>
		
		<img source-data-lazy-advertisement="https://im-x.jd.com/dsp/np?log=4o6yQPJy6XmVSDUPaAlnilzQoTl0WfQq_iFkBg-nAELRr_jWgST6F3gHkDceKGeLFNVwe-soMnCpciBNs23mQ-Ilfi01tO75IDlJJX-6zhuGhAHxgFmEvKNeQT_qOIh8ZDU-NBcY8BsO9QLaz0X57aPu3e23a54_KScadwVylpARyavUgVRRZoP_thQ20x2cxcX9K-q692C4F-Ae3UlBOJQTPbwpeA47iOpQzp8MW-tnzhG4QcrgoNATCpmXhtOptt3X3m7MguGIYN2oKkU75SMlgTYm8masby6PnX7SeBx1yBcShcgL6IjCCrM_6RK9vJw8wVwmwW7VFgwAA5Ns0XspwYX1RIa8NoHIg6fzhJpq6wv56y7ePNvsosaGVfQoHjghgzr7XUaKnRhD-mRppyp0YHaLEuKPRIbqKvGO0ZTX4_iqFQyyOA24W8owSLkyKcUNiuRzv87NVKxkWEczyI_NvmrKLVtAy2pSNQKG1Q1tR84c1U_94w39kgMmZf9F0cNk-vsR2zq1DzwzJXILKv6BEWLANsPlDiKA9LkBsErwzHkoPKETW5cxZxubDxCnB9UpJcJ4GaGOrPu--5kmV2gsn1Cnj7OmpvttAZ9oRynB68bXmO5NQY3kaE2WOLOhXGG50Zx7KB1gRCCyB4zZMTr93pHBKNRK0LZCK3f4cbAOFs4uV5yL_vME_7tl_bFJmfgiBSlZZcHouiD99UcLmQ&v=404&rt=3" >
	</div>
</li>
<li class="gl-item" data-sku="5181576" data-spu="5181576" data-pid="5181576">
	<div class="gl-i-wrap">
		<div class="p-img">
			<a target="_blank" title="【稻草人爆款双肩包,15.6英寸超大容量,三大隔层,出行轻松搞掂】8-12日,每满119减20元,稻草人品质保证。快来抢购吧!" href="//item.jd.com/5181576.html" onclick="searchlog(1,5181576,1,2,'','flagsClk=1077940872')">
				<img width="220" height="220" class="err-product" data-img="1" source-data-lazy-img="//img11.360buyimg.com/n7/jfs/t1/96664/15/14541/395640/5e675971E689c5511/5b03b94f7fa247d1.jpg" />
</a>			<div data-lease="" data-catid="12071" data-venid="1000001048" data-presale=""></div>
		</div>
		<div class="p-scroll">
			<span class="ps-prev">&lt;</span>
			<span class="ps-next">&gt;</span>
			<div class="ps-wrap">
				<ul class="ps-main">
					<li class="ps-item"><a href="javascript:;" class="curr" title="主图款15.6英寸黑色款"><img  data-presale="" data-sku="5181576" data-img="1" data-lazy-img="//img11.360buyimg.com/n9/jfs/t1/96664/15/14541/395640/5e675971E689c5511/5b03b94f7fa247d1.jpg" class="err-product" width="25" height="25" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="黑色17.3英寸"><img  data-presale="" data-sku="100003909565" data-img="1" width="25" height="25" data-lazy-img="//img10.360buyimg.com/n9/jfs/t1/108779/28/8424/281341/5e675356E20f0c196/12e26e33f67909a0.jpg" class="err-product" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="主图款15.6英寸灰色"><img  data-presale="" data-sku="4242121" data-img="1" width="25" height="25" data-lazy-img="//img11.360buyimg.com/n9/jfs/t1/97571/6/14547/438010/5e675538Ee03aedd1/3f6a133e5ee04207.jpg" class="err-product" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="黑色款"><img  data-presale="" data-sku="100002467473" data-img="1" width="25" height="25" data-lazy-img="//img13.360buyimg.com/n9/jfs/t1/85679/31/14640/129467/5e675a9aE87526b01/f41bf024388f7456.jpg" class="err-product" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="主图款15.6英寸蓝色"><img  data-presale="" data-sku="4242123" data-img="1" width="25" height="25" data-lazy-img="//img13.360buyimg.com/n9/jfs/t1/105047/20/14381/391870/5e675500E822f2475/8a835c1bce455204.jpg" class="err-product" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="灰色款"><img  data-presale="" data-sku="100002467469" data-img="1" width="25" height="25" data-lazy-img="//img14.360buyimg.com/n9/jfs/t1/86089/15/14660/208728/5e675b0fEbda83d7c/ab11a079415adc5f.jpg" class="err-product" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="深灰色款"><img  data-presale="" data-sku="100003060945" data-img="1" width="25" height="25" data-lazy-img-slave="//img10.360buyimg.com/n9/jfs/t1/108751/29/8393/194375/5e675317Ed80a440f/69f564da9cf0c212.jpg" class="err-product" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="黑色A款"><img  data-presale="" data-sku="100005730491" data-img="1" width="25" height="25" data-lazy-img-slave="//img11.360buyimg.com/n9/jfs/t1/104931/2/14580/631645/5e675cf5E81bae1c9/5c3346eabfbe5b30.jpg" class="err-product" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="黑色B款"><img  data-presale="" data-sku="100010262064" data-img="1" width="25" height="25" data-lazy-img-slave="//img14.360buyimg.com/n9/jfs/t1/89293/5/14530/560061/5e675c7fEe5c723c8/2ca305d297ae82c3.jpg" class="err-product" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="黑色C款"><img  data-presale="" data-sku="100010262050" data-img="1" width="25" height="25" data-lazy-img-slave="//img10.360buyimg.com/n9/jfs/t1/98662/13/14678/545819/5e675cc2Ee6f433a2/076556df49cd2936.jpg" class="err-product" /></a></li>
									</ul>
			</div>
		</div>
		<div class="p-price">
<strong class="J_5181576" data-done="1"><em>¥</em><i>99.00</i></strong>		</div>
		<div class="p-name p-name-type-2">
			<a target="_blank" title="【稻草人爆款双肩包,15.6英寸超大容量,三大隔层,出行轻松搞掂】8-12日,每满119减20元,稻草人品质保证。快来抢购吧!" href="//item.jd.com/5181576.html" onclick="searchlog(1,5181576,1,1,'','flagsClk=1077940872')">
				<em>稻草人双肩包男女14/15.6英寸大容量笔记本电脑包多功能旅行出差背包防泼水商务休闲学生<font class="skcolor_ljg">书包</font>50470黑色</em>
				<i class="promo-words" id="J_AD_5181576">【稻草人爆款双肩包,15.6英寸超大容量,三大隔层,出行轻松搞掂】8-12日,每满119减20元,稻草人品质保证。快来抢购吧!</i>
			</a>
		</div>
		<div class="p-commit">
			<strong><a id="J_comment_5181576" target="_blank" href="//item.jd.com/5181576.html#comment" onclick="searchlog(1,5181576,1,3,'','flagsClk=1077940872')"></a></strong>
		</div>
		<div class="p-focus"><a class="J_focus" data-sku="5181576" href="javascript:;" title="点击关注" onclick="searchlog(1,5181576,1,5,'','flagsClk=1077940872')"><i></i>关注</a></div>
		<div class="p-shop" data-dongdong="" data-selfware="1" data-score="5" data-reputation="98">
<span class="J_im_icon"><a target="_blank" class="curr-shop hd-shopname" onclick="searchlog(1,1000001048,0,58)" href="//mall.jd.com/index-1000001048.html" title="稻草人京东自营旗舰店">稻草人京东自营旗舰店</a></span>		</div>	
		
		<div class="p-icons" id="J_pro_5181576" data-done="1">
			<i class="goods-icons J-picon-tips J-picon-fix" data-idx="1" data-tips="京东自营,品质保障">自营</i>
    		<i class="goods-icons4 J-picon-tips" style="border-color:#4d88ff;color:#4d88ff;" data-idx="1" data-tips="品质服务,放心购物" >放心购</i>
<i class="goods-icons4 J-picon-tips" data-tips="本商品参与满减促销">每满119-20</i>		</div>
	</div>
</li>
<li class="gl-item" data-sku="59975470952" data-spu="13810867851" data-pid="59975470952">
	<div class="gl-i-wrap">
		<div class="p-img">
			<a target="_blank" title="【好店认证】【买一送“一”送钥匙包】【支持7天无理由退换货,赠送运费险,售后无忧】【支持货到付款】" href="//item.jd.com/59975470952.html" onclick="searchlog(1,59975470952,2,2,'','flagsClk=1094713996')">
				<img width="220" height="220" class="err-product" data-img="1" source-data-lazy-img="//img12.360buyimg.com/n7/jfs/t1/100706/25/17185/130140/5e8459f0Efbd3fdcf/379d9e03eea2a5d7.jpg" />
</a>			<div data-lease="" data-catid="12071" data-venid="84618" data-presale=""></div>
		</div>
		<div class="p-scroll">
			<span class="ps-prev">&lt;</span>
			<span class="ps-next">&gt;</span>
			<div class="ps-wrap">
				<ul class="ps-main">
					<li class="ps-item"><a href="javascript:;" class="curr" title="黑色"><img  data-presale="" data-sku="59975470952" data-img="1" data-lazy-img="//img12.360buyimg.com/n9/jfs/t1/100706/25/17185/130140/5e8459f0Efbd3fdcf/379d9e03eea2a5d7.jpg" class="err-product" width="25" height="25" /></a></li>
									</ul>
			</div>
		</div>
		<div class="p-price">
<strong class="J_59975470952" data-done="1"><em>¥</em><i>69.00</i></strong>		</div>
		<div class="p-name p-name-type-2">
			<a target="_blank" title="【好店认证】【买一送“一”送钥匙包】【支持7天无理由退换货,赠送运费险,售后无忧】【支持货到付款】" href="//item.jd.com/59975470952.html" onclick="searchlog(1,59975470952,2,1,'','flagsClk=1094713996')">
				<em>双肩包男士背包大容量时尚休闲商务旅行笔记本电脑包高中大学生<font class="skcolor_ljg">书包</font>男潮流USb充电包包65199 黑色</em>
				<i class="promo-words" id="J_AD_59975470952">【好店认证】【买一送“一”送钥匙包】【支持7天无理由退换货,赠送运费险,售后无忧】【支持货到付款】</i>
			</a>
		</div>
		<div class="p-commit">
			<strong><a id="J_comment_59975470952" target="_blank" href="//item.jd.com/59975470952.html#comment" onclick="searchlog(1,59975470952,2,3,'','flagsClk=1094713996')"></a></strong>
		</div>
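
In the source above, each product sits in its own `<li class="gl-item" ...>` element, with the price block before the name block. One way to keep prices and names aligned is to parse item by item, rather than building two separate lists for the whole page. This is a sketch on a hypothetical, heavily shortened `page` string; `parse_items` is an illustrative name, and the patterns assume the 2020 layout shown above.

```python
import re

# Hypothetical stand-in for two list items; the second one has no price
page = (
    '<li class="gl-item" data-sku="1"><div class="p-price"><strong>'
    '<em>¥</em><i>99.00</i></strong></div>'
    '<div class="p-name"><a> <em>Bag A</em></a></div></li>'
    '<li class="gl-item" data-sku="2">'
    '<div class="p-name"><a> <em>Bag B</em></a></div></li>'
)


def parse_items(html):
    """Return [price, name] pairs, one per list item that has both fields."""
    rows = []
    # Every chunk after the first begins inside one <li class="gl-item"> element
    for chunk in html.split('<li class="gl-item"')[1:]:
        price = re.search(r'<em>¥</em><i>(\d+\.\d\d)</i>', chunk)
        name = re.search(r'<em>(?!¥)(.*?)</em>', chunk)
        if price and name:  # skip items that lack either field
            rows.append([price.group(1), name.group(1)])
    return rows


print(parse_items(page))
```

With two page-wide lists, a single item without a price would shift every later price onto the wrong name; per-item parsing avoids that.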

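The most important thing in this section is learning regular expressions; a brief explanation of regular expressions is attached here.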
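
As a short, runnable summary of the regular expression pieces used in this post (a sketch on a made-up string `s`, not a complete reference):

```python
import re

s = '<em>¥</em><i>49.00</i> <em>A</em> <em>B</em>'

# .* is greedy: it runs to the LAST </em> on the line
greedy = re.findall(r'<em>.*</em>', s)

# .*? is non-greedy: it stops at the FIRST </em> after each <em>
lazy = re.findall(r'<em>.*?</em>', s)

# \d matches a digit, \. matches a literal dot
decimals = re.findall(r'\d+\.\d\d', s)

# [^...] is a negated character class: ONE character that is not in the set
not_in_set = re.findall(r'[^<em>]', '<me>x')

# [\u4e00-\u9fa5] covers the common range of Chinese characters
chinese = re.findall(r'[\u4e00-\u9fa5]+', 'abc书包12')

for result in (greedy, lazy, decimals, not_in_set, chinese):
    print(result)
```

The fourth pattern shows why `[^(<em>¥</em>)]` in the scraper behaves as a set of single characters and not as "anything other than the string `<em>¥</em>`".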


Reposted from blog.csdn.net/Everly_/article/details/133139074