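About the target site
快科技: 快科技 (formerly 驱动之家) provides first-hand tech news, product reviews, driver downloads, and other services. Its long-established driver download channel offers convenient driver categories and search to help you quickly find the drivers you need.…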
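Getting started with Scrapy
Preparing for data collection
1. If you are not yet familiar with the five-minute approach to scraping a site, read this first:
[Scrapy: Scrape a Site in Five Minutes] Essential Basics for Full-Site Data
2. If you are not familiar with how scraping tasks are organized and managed, read this first:
[Scrapy: Scrape a Site in Five Minutes] Organizing Crawl Targets and Preparing Data
3. If you are not familiar with mass-producing spiders from the Scrapy template, read this first (required reading):
[Scrapy: Scrape a Site in Five Minutes] A General-Purpose Project Template for Data Scraping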
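Results of data organization
1. Where to get the URLs for all channels
The wrong URL list: this one is the site's tag URL list, which changes frequently.
The correct URL list is here; it needs some manual cleanup.
2. Screenshot of the data saved in Excel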
Applying the template
The <project>.py file in the spiders directory
1. Create the spider
scrapy genspider www_mydrivers_com " "
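2. Work out the site-wide CSS styles
First look at the page's CSS styles; the whole site uses one consistent style.
3. Edit www_mydrivers_com.py
Only the parts that need changing are explained here; everything else follows the template and needs no modification.
- Scope and custom settings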
allowed_domains = []
web_name = "快科技"
- Add the channels to scrape
start_menu = [
# All channels
[
{
"channel_name": "资讯中心", "url": "https://news.mydrivers.com/", },
{
"channel_name": "资讯中心-电脑办公", "url": "https://news.mydrivers.com/class/801/", },
{
"channel_name": "资讯中心-手机平板", "url": "https://news.mydrivers.com/class/802/", },
{
"channel_name": "资讯中心-IT业界", "url": "https://news.mydrivers.com/class/803/", },
{
"channel_name": "资讯中心-爱车一族", "url": "https://news.mydrivers.com/class/807/", },
{
"channel_name": "资讯中心-游戏世界", "url": "https://news.mydrivers.com/class/806/", },
{
"channel_name": "资讯中心-家电数码", "url": "https://news.mydrivers.com/class/804/", },
{
"channel_name": "资讯中心-软件之家", "url": "https://news.mydrivers.com/class/805/", },
{
"channel_name": "资讯中心-科学动态", "url": "https://news.mydrivers.com/class/808/", },
{
"channel_name": "资讯中心-影音达人", "url": "https://news.mydrivers.com/class/809/", },
{
"channel_name": "资讯中心-便携机", "url": "https://news.mydrivers.com/class/69/", },
{
"channel_name": "资讯中心-服务器", "url": "https://news.mydrivers.com/class/68/", },
{
"channel_name": "资讯中心-台式机", "url": "https://news.mydrivers.com/class/67/", },
{
"channel_name": "资讯中心-笔记本", "url": "https://news.mydrivers.com/class/66/", },
{
"channel_name": "资讯中心-科技前沿", "url": "https://news.mydrivers.com/class/65/", },
{
"channel_name": "资讯中心-视点人物", "url": "https://news.mydrivers.com/class/62/", },
{
"channel_name": "资讯中心-操作系统", "url": "https://news.mydrivers.com/class/58/", },
{
"channel_name": "资讯中心-电脑驱动", "url": "https://news.mydrivers.com/class/57/", },
{
"channel_name": "资讯中心-电脑软件", "url": "https://news.mydrivers.com/class/56/", },
{
"channel_name": "资讯中心-掌机游戏", "url": "https://news.mydrivers.com/class/55/", },
{
"channel_name": "资讯中心-游戏主机", "url": "https://news.mydrivers.com/class/54/", },
{
"channel_name": "资讯中心-主机游戏", "url": "https://news.mydrivers.com/class/53/", },
{
"channel_name": "资讯中心-电脑游戏", "url": "https://news.mydrivers.com/class/52/", },
{
"channel_name": "资讯中心-传真机", "url": "https://news.mydrivers.com/class/51/", },
{
"channel_name": "资讯中心-扫描仪", "url": "https://news.mydrivers.com/class/49/", },
{
"channel_name": "资讯中心-投影机", "url": "https://news.mydrivers.com/class/48/", },
{
"channel_name": "资讯中心-一体机", "url": "https://news.mydrivers.com/class/47/", },
{
"channel_name": "资讯中心-复印机", "url": "https://news.mydrivers.com/class/46/", },
{
"channel_name": "资讯中心-打印机", "url": "https://news.mydrivers.com/class/45/", },
{
"channel_name": "资讯中心-网络存储", "url": "https://news.mydrivers.com/class/43/", },
{
"channel_name": "资讯中心-网卡", "url": "https://news.mydrivers.com/class/41/", },
{
"channel_name": "资讯中心-路由器", "url": "https://news.mydrivers.com/class/38/", },
{
"channel_name": "资讯中心-交换机", "url": "https://news.mydrivers.com/class/37/", },
{
"channel_name": "资讯中心-电子书", "url": "https://news.mydrivers.com/class/33/", },
{
"channel_name": "资讯中心-科技资讯", "url": "https://news.mydrivers.com/class/329/", },
{
"channel_name": "资讯中心-快递物流", "url": "https://news.mydrivers.com/class/328/", },
{
"channel_name": "资讯中心-其他网络", "url": "https://news.mydrivers.com/class/327/", },
{
"channel_name": "资讯中心-机器人", "url": "https://news.mydrivers.com/class/326/", },
{
"channel_name": "资讯中心-火车高铁", "url": "https://news.mydrivers.com/class/325/", },
{
"channel_name": "资讯中心-网络红人", "url": "https://news.mydrivers.com/class/324/", },
{
"channel_name": "资讯中心-考勤机", "url": "https://news.mydrivers.com/class/323/", },
{
"channel_name": "资讯中心-网络安全", "url": "https://news.mydrivers.com/class/322/", },
{
"channel_name": "资讯中心-生活周边", "url": "https://news.mydrivers.com/class/321/", },
{
"channel_name": "资讯中心-共享经济", "url": "https://news.mydrivers.com/class/320/", },
{
"channel_name": "资讯中心-U盘存储卡", "url": "https://news.mydrivers.com/class/32/", },
{
"channel_name": "资讯中心-自行车", "url": "https://news.mydrivers.com/class/317/", },
{
"channel_name": "资讯中心-摩托车", "url": "https://news.mydrivers.com/class/316/", },
{
"channel_name": "资讯中心-多轴无人机", "url": "https://news.mydrivers.com/class/314/", },
{
"channel_name": "资讯中心-电动车", "url": "https://news.mydrivers.com/class/310/", },
{
"channel_name": "资讯中心-摄像头", "url": "https://news.mydrivers.com/class/31/", },
{
"channel_name": "资讯中心-智能家居", "url": "https://news.mydrivers.com/class/302/", },
{
"channel_name": "资讯中心-生活百科", "url": "https://news.mydrivers.com/class/301/", },
{
"channel_name": "资讯中心-数码相机", "url": "https://news.mydrivers.com/class/30/", },
{
"channel_name": "资讯中心-电子竞技", "url": "https://news.mydrivers.com/class/297/", },
{
"channel_name": "资讯中心-移动应用", "url": "https://news.mydrivers.com/class/292/", },
{
"channel_name": "资讯中心-智能穿戴", "url": "https://news.mydrivers.com/class/290/", },
{
"channel_name": "资讯中心-摄像机", "url": "https://news.mydrivers.com/class/29/", },
{
"channel_name": "资讯中心-安卓手机", "url": "https://news.mydrivers.com/class/288/", },
{
"channel_name": "资讯中心-其他智能", "url": "https://news.mydrivers.com/class/287/", },
{
"channel_name": "资讯中心-教育未来", "url": "https://news.mydrivers.com/class/285/", },
{
"channel_name": "资讯中心-超极本", "url": "https://news.mydrivers.com/class/278/", },
{
"channel_name": "资讯中心-创意摄影", "url": "https://news.mydrivers.com/class/274/", },
{
"channel_name": "资讯中心-样张赏析", "url": "https://news.mydrivers.com/class/273/", },
{
"channel_name": "资讯中心-镜头", "url": "https://news.mydrivers.com/class/271/", },
{
"channel_name": "资讯中心-MP3/MP4", "url": "https://news.mydrivers.com/class/27/", },
{
"channel_name": "资讯中心-艺术设计", "url": "https://news.mydrivers.com/class/269/", },
{
"channel_name": "资讯中心-电影动画", "url": "https://news.mydrivers.com/class/267/", },
{
"channel_name": "资讯中心-精彩影视", "url": "https://news.mydrivers.com/class/266/", },
{
"channel_name": "资讯中心-汽车厂商", "url": "https://news.mydrivers.com/class/264/", },
{
"channel_name": "资讯中心-车载配件", "url": "https://news.mydrivers.com/class/263/", },
{
"channel_name": "资讯中心-车载系统", "url": "https://news.mydrivers.com/class/262/", },
{
"channel_name": "资讯中心-无人驾驶汽车", "url": "https://news.mydrivers.com/class/261/", },
{
"channel_name": "资讯中心-其他汽车", "url": "https://news.mydrivers.com/class/260/", },
{
"channel_name": "资讯中心-PDA相关", "url": "https://news.mydrivers.com/class/26/", },
{
"channel_name": "资讯中心-电动汽车", "url": "https://news.mydrivers.com/class/259/", },
{
"channel_name": "资讯中心-普通汽车", "url": "https://news.mydrivers.com/class/258/", },
{
"channel_name": "资讯中心-奇趣探险", "url": "https://news.mydrivers.com/class/256/", },
{
"channel_name": "资讯中心-科普知识", "url": "https://news.mydrivers.com/class/255/", },
{
"channel_name": "资讯中心-数理化学", "url": "https://news.mydrivers.com/class/254/", },
{
"channel_name": "资讯中心-游戏厂商", "url": "https://news.mydrivers.com/class/253/", },
{
"channel_name": "资讯中心-壁纸主题", "url": "https://news.mydrivers.com/class/252/", },
{
"channel_name": "资讯中心-手机配件", "url": "https://news.mydrivers.com/class/25/", },
{
"channel_name": "资讯中心-Windows平板", "url": "https://news.mydrivers.com/class/242/", },
{
"channel_name": "资讯中心-安卓平板", "url": "https://news.mydrivers.com/class/241/", },
{
"channel_name": "资讯中心-苹果iPad", "url": "https://news.mydrivers.com/class/240/", },
{
"channel_name": "资讯中心-手机厂商", "url": "https://news.mydrivers.com/class/24/", },
{
"channel_name": "资讯中心-飞机航空", "url": "https://news.mydrivers.com/class/236/", },
{
"channel_name": "资讯中心-生活电器", "url": "https://news.mydrivers.com/class/234/", },
{
"channel_name": "资讯中心-手机系统", "url": "https://news.mydrivers.com/class/232/", },
{
"channel_name": "资讯中心-音箱", "url": "https://news.mydrivers.com/class/23/", },
{
"channel_name": "资讯中心-键鼠", "url": "https://news.mydrivers.com/class/22/", },
{
"channel_name": "资讯中心-其他手机", "url": "https://news.mydrivers.com/class/211/", },
{
"channel_name": "资讯中心-声卡", "url": "https://news.mydrivers.com/class/21/", },
{
"channel_name": "资讯中心-手机游戏", "url": "https://news.mydrivers.com/class/209/", },
{
"channel_name": "资讯中心-山寨机", "url": "https://news.mydrivers.com/class/208/", },
{
"channel_name": "资讯中心-移动处理器", "url": "https://news.mydrivers.com/class/206/", },
{
"channel_name": "资讯中心-微软手机", "url": "https://news.mydrivers.com/class/205/", },
{
"channel_name": "资讯中心-黑莓手机", "url": "https://news.mydrivers.com/class/204/", },
{
"channel_name": "资讯中心-塞班手机", "url": "https://news.mydrivers.com/class/203/", },
{
"channel_name": "资讯中心-苹果手机", "url": "https://news.mydrivers.com/class/201/", },
{
"channel_name": "资讯中心-光驱", "url": "https://news.mydrivers.com/class/20/", },
{
"channel_name": "资讯中心-工程建筑", "url": "https://news.mydrivers.com/class/197/", },
{
"channel_name": "资讯中心-地理自然", "url": "https://news.mydrivers.com/class/196/", },
{
"channel_name": "资讯中心-生科医学", "url": "https://news.mydrivers.com/class/195/", },
{
"channel_name": "资讯中心-历史考古", "url": "https://news.mydrivers.com/class/194/", },
{
"channel_name": "资讯中心-生物世界", "url": "https://news.mydrivers.com/class/193/", },
{
"channel_name": "资讯中心-散热器", "url": "https://news.mydrivers.com/class/19/", },
{
"channel_name": "资讯中心-耳塞耳机", "url": "https://news.mydrivers.com/class/185/", },
{
"channel_name": "资讯中心-小家电", "url": "https://news.mydrivers.com/class/184/", },
{
"channel_name": "资讯中心-线材线缆", "url": "https://news.mydrivers.com/class/183/", },
{
"channel_name": "资讯中心-网络运营商", "url": "https://news.mydrivers.com/class/180/", },
{
"channel_name": "资讯中心-电源", "url": "https://news.mydrivers.com/class/18/", },
{
"channel_name": "资讯中心-天文航天", "url": "https://news.mydrivers.com/class/175/", },
{
"channel_name": "资讯中心-企业动态", "url": "https://news.mydrivers.com/class/174/", },
{
"channel_name": "资讯中心-平板电视", "url": "https://news.mydrivers.com/class/173/", },
{
"channel_name": "资讯中心-机箱", "url": "https://news.mydrivers.com/class/17/", },
{
"channel_name": "资讯中心-显示器", "url": "https://news.mydrivers.com/class/168/", },
{
"channel_name": "资讯中心-其他数码", "url": "https://news.mydrivers.com/class/167/", },
{
"channel_name": "资讯中心-其他硬件", "url": "https://news.mydrivers.com/class/166/", },
{
"channel_name": "资讯中心-硬盘", "url": "https://news.mydrivers.com/class/16/", },
{
"channel_name": "资讯中心-内存", "url": "https://news.mydrivers.com/class/15/", },
{
"channel_name": "资讯中心-主板", "url": "https://news.mydrivers.com/class/14/", },
{
"channel_name": "资讯中心-CPU", "url": "https://news.mydrivers.com/class/13/", },
{
"channel_name": "资讯中心-显卡", "url": "https://news.mydrivers.com/class/12/", },
]
]
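As a rough sketch of how a template might consume start_menu, the snippet below flattens the nested list into one record per channel and remembers which group each entry came from. The helper name iter_channels and the shortened two-entry sample_menu are illustrative assumptions, not part of the original template.

```python
from typing import Dict, Iterator, List, Tuple

# Shortened sample with the same shape as start_menu above:
# an outer list of groups, each group a list of channel dicts.
sample_menu: List[List[Dict[str, str]]] = [
    [
        {"channel_name": "资讯中心", "url": "https://news.mydrivers.com/"},
        {"channel_name": "资讯中心-电脑办公", "url": "https://news.mydrivers.com/class/801/"},
    ]
]


def iter_channels(menu: List[List[Dict[str, str]]]) -> Iterator[Tuple[int, str, str]]:
    """Yield (group_index, channel_name, url) for every channel (hypothetical helper)."""
    for group_index, group in enumerate(menu):
        for channel in group:
            yield group_index, channel["channel_name"], channel["url"]


for group_index, name, url in iter_channels(sample_menu):
    print(group_index, name, url)
```

Keeping the group index alongside each URL makes it possible to choose a different list parser per group later on.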
- Organizing list styles
Write one parseX method for each distinct list style on the site, and add each of them to parse_list:
parse_list = [
self.parse1, # All channels
]
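One plausible reading of parse_list is that its entries line up with the groups in start_menu, so group 0 ("All channels") is handled by parse1. The sketch below shows that dispatch in isolation; the class SpiderSketch and the method callback_for are hypothetical stand-ins, and the real template may wire this up differently.

```python
class SpiderSketch:
    """Stand-in for the spider class; only the dispatch logic is shown."""

    def parse1(self, page: str) -> str:
        # In a real spider this would extract titles and links
        # for the "All channels" list style.
        return f"parse1 handled {page}"

    def callback_for(self, group_index: int):
        # One parser per list style, assumed to be in the same
        # order as the groups in start_menu.
        parse_list = [
            self.parse1,  # All channels
        ]
        if not 0 <= group_index < len(parse_list):
            raise IndexError(f"no parser registered for group {group_index}")
        return parse_list[group_index]


spider = SpiderSketch()
print(spider.callback_for(0)("https://news.mydrivers.com/"))
```

Failing loudly on an unregistered group makes a missing parseX obvious as soon as a new list style is added to start_menu.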
- Title, link, and cover image
Item_title = response.xpath('//ul[@class="news_lb"]/li/h3/a/text()').extract() # list of article titles
Item_url = response.xpath('//ul[@class="news_lb"]/li/h3/a/@href').extract() # list of article links
Item_thumbImg = response.xpath('//ul[@class="news_lb"]/li/div[@class="news_left photo"]/a/img/@src').extract() # list of article cover images
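Extracting titles, links, and covers as three separate lists works only while every list item has all three; an item without a cover image would shift the cover list out of step. A minimal sketch of the per-item alternative follows. It uses xml.etree.ElementTree on a small, well-formed, made-up sample so that it runs without Scrapy; in a spider the same loop would iterate over response.xpath('//ul[@class="news_lb"]/li') and use relative XPaths. The sample markup, the img.example.com host, and the function name extract_items are assumptions.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical, well-formed sample of the list markup targeted by the XPaths above.
SAMPLE = """
<html><body>
<ul class="news_lb">
  <li>
    <div class="news_left photo"><a href="/1/1.htm"><img src="//img.example.com/1.jpg"/></a></div>
    <h3><a href="/1/1.htm">First title</a></h3>
  </li>
  <li>
    <h3><a href="/1/2.htm">Second title</a></h3>
  </li>
</ul>
</body></html>
"""


def extract_items(html: str, base_url: str) -> list:
    """Build one record per <li> so title, link, and cover stay aligned."""
    root = ET.fromstring(html.strip())
    items = []
    for li in root.findall(".//ul[@class='news_lb']/li"):
        link = li.find("h3/a")
        if link is None:
            continue  # not an article entry
        img = li.find("div[@class='news_left photo']/a/img")
        src = img.get("src") if img is not None else None
        items.append({
            "title": (link.text or "").strip(),
            "url": urljoin(base_url, link.get("href", "")),
            # Empty string when the entry has no cover image.
            "thumbImg": urljoin(base_url, src) if src else "",
        })
    return items


for item in extract_items(SAMPLE, "https://news.mydrivers.com/"):
    print(item)
```

urljoin also resolves relative and protocol-relative addresses against the page URL, which separate list extraction leaves untouched.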
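The parse_detail.py file in the spiders directory
1. Scrape the detail page content
Change the CSS class used to extract the detail page of each list item.
# Detail pages keep their formatting: the entire content container is captured as HTML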
item['content'] = ""
if 'class="news_info"' in response.text and len(None2Str(item['content'])) < 5:
item['content'] = response.xpath('//div[@class="news_info"]').extract_first()
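The check above generalizes to a chain: try each known content container in turn, and keep the first one that yields something. The sketch below illustrates that chain with plain strings. none_to_str is a hypothetical stand-in for the template's None2Str helper, extract_div is a deliberately naive regex extractor used only so the example runs without Scrapy, and the final whole-page fallback is an assumption based on the behavior described in the special note.

```python
import re


def none_to_str(value) -> str:
    """Hypothetical stand-in for the template's None2Str helper."""
    return "" if value is None else str(value)


def extract_div(html: str, class_name: str):
    """Naive extraction of the first <div class="..."> block.

    Good enough for flat sample markup; nested divs would need a real parser.
    """
    pattern = r'<div class="%s">.*?</div>' % re.escape(class_name)
    match = re.search(pattern, html, re.S)
    return match.group(0) if match else None


def pick_content(html: str, class_names: list) -> str:
    """Try each known content container; fall back to the whole page."""
    content = ""
    for class_name in class_names:
        # Same guard as above: only try a style while content is still empty.
        if f'class="{class_name}"' in html and len(none_to_str(content)) < 5:
            content = extract_div(html, class_name)
    if len(none_to_str(content)) < 5:
        content = html  # assumed fallback: store the entire page
    return content


page = '<html><body><div class="news_info"><p>Hello</p></div></body></html>'
print(pick_content(page, ["news_info"]))
```

Supporting a new detail page layout then means appending one more class name to the list rather than writing another branch.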
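2. A special note
Some sites' developers go so far as to use nine different layouts across ten pages. Since we cannot open every page to check the CSS of its detail view, there is a general workaround.
- After the first crawl finishes, open MongoDB and run the command below. It selects the documents whose content contains "body": these are pages that matched none of the specified styles, so the entire page was stored instead.
db.your_collection_name.find({content:/body/})
- Open any one of the returned links, add handling for that detail page's layout, and repeat until the mongo command returns no results.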
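The same check can be scripted. The sketch below assumes each stored document keeps its page address in a field named link (an assumption; use whatever field your items actually have). The function takes any object with a pymongo-style find(filter, projection) method, and FakeCollection is a tiny in-memory stand-in so the example runs without a database.

```python
import re


def unmatched_links(collection, link_field: str = "link") -> list:
    """Return links of documents whose content still contains 'body'."""
    # Equivalent of db.<collection>.find({content: /body/}) in the mongo shell.
    cursor = collection.find({"content": {"$regex": "body"}}, {link_field: 1})
    return [doc[link_field] for doc in cursor if link_field in doc]


class FakeCollection:
    """In-memory stand-in that supports only the one query used above."""

    def __init__(self, docs):
        self.docs = docs

    def find(self, query, projection=None):
        pattern = re.compile(query["content"]["$regex"])
        return [d for d in self.docs if pattern.search(d.get("content", ""))]


# With pymongo, a real collection object would be passed instead, e.g.
# MongoClient()["your_db"]["your_collection_name"] (names are placeholders).
docs = [
    {"link": "https://news.mydrivers.com/a", "content": '<div class="news_info">ok</div>'},
    {"link": "https://news.mydrivers.com/b", "content": "<html><body>raw page</body></html>"},
]
print(unmatched_links(FakeCollection(docs)))
```

An empty result means every stored page matched one of the configured styles.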