[Scrapy: Scrape a Site in Five Minutes] [Tech Industry News] Scrapy in Practice: Scraping the Entire 快科技 Site

About the Target Site

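快科技: 快科技 (formerly 驱动之家) offers first-hand tech news, product reviews, driver downloads, and related services. Its long-running driver download channel provides convenient driver categories and search to help you quickly find the driver you need.…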

Getting Started with Scrapy

Preparing for Data Collection

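1. If you are not familiar with the five-minute approach to scraping a site, read this first:
[Scrapy: Scrape a Site in Five Minutes] Essential Basics for Full-Site Scraping

2. If you are not familiar with how scraping targets are organized and managed, read this first:
[Scrapy: Scrape a Site in Five Minutes] Organizing Crawl Targets and Preparing Data

3. If you are not familiar with mass-producing spiders from a Scrapy template, read this first (required reading):
[Scrapy: Scrape a Site in Five Minutes] A General-Purpose Project Template for Data Scraping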

Results of Data Preparation

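1. Where to get the URLs for all channels
The wrong URL list: this is the site's tag URL list, which changes frequently.
[Screenshot: the tag URL list]
The correct URL list is this one; it needs some manual cleanup.
[Screenshot: the channel URL list]
2. Screenshot of the list saved in Excel
[Screenshot: the Excel sheet]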
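
Since the channel list has to be cleaned up by hand, it can be less error-prone to keep only the (channel name, class id) pairs from the spreadsheet and generate the URLs from them. The following is a minimal sketch, assuming the `https://news.mydrivers.com/class/<id>/` pattern seen in the channel URLs holds for every channel. `build_channel_entries` and `CLASS_URL` are hypothetical names and are not part of the template; the output uses the same `channel_name`/`url` dict shape as `start_menu` later in this post.

```python
# Hypothetical helper: turn hand-collected (name, class id) pairs into
# {"channel_name": ..., "url": ...} dicts like the ones used in start_menu.
CLASS_URL = "https://news.mydrivers.com/class/{class_id}/"  # assumed URL pattern


def build_channel_entries(pairs, prefix="资讯中心"):
    """Return one dict per channel, skipping duplicate class ids."""
    seen = set()
    entries = []
    for name, class_id in pairs:
        if class_id in seen:
            continue  # the same channel was copied twice from the spreadsheet
        seen.add(class_id)
        entries.append({
            "channel_name": f"{prefix}-{name}",
            "url": CLASS_URL.format(class_id=class_id),
        })
    return entries


# The third pair repeats class 801 on purpose to show the de-duplication.
channels = build_channel_entries([("电脑办公", 801), ("手机平板", 802), ("电脑办公", 801)])
for entry in channels:
    print(entry["channel_name"], entry["url"])
```

The site root (`https://news.mydrivers.com/`) does not follow the class pattern, so it would still be added by hand.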

Applying the Template

The <project>.py file under Spider

1. Create the spider

scrapy genspider www_mydrivers_com " "

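2. Survey the site's CSS structure
Start by looking at the list page's CSS structure; the whole site uses one uniform layout.
[Screenshot: list-page markup]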

3. Edit www_mydrivers_com.py

Only the parts that need changing are described here; leave everything else as it is in the template.

  • Scope and custom settings
    allowed_domains = []
    web_name = "快科技"
  • Add the channels to scrape
    start_menu = [
        # All channels
        [
            {"channel_name": "资讯中心", "url": "https://news.mydrivers.com/"},
            {"channel_name": "资讯中心-电脑办公", "url": "https://news.mydrivers.com/class/801/"},
            {"channel_name": "资讯中心-手机平板", "url": "https://news.mydrivers.com/class/802/"},
            {"channel_name": "资讯中心-IT业界", "url": "https://news.mydrivers.com/class/803/"},
            {"channel_name": "资讯中心-爱车一族", "url": "https://news.mydrivers.com/class/807/"},
            {"channel_name": "资讯中心-游戏世界", "url": "https://news.mydrivers.com/class/806/"},
            {"channel_name": "资讯中心-家电数码", "url": "https://news.mydrivers.com/class/804/"},
            {"channel_name": "资讯中心-软件之家", "url": "https://news.mydrivers.com/class/805/"},
            {"channel_name": "资讯中心-科学动态", "url": "https://news.mydrivers.com/class/808/"},
            {"channel_name": "资讯中心-影音达人", "url": "https://news.mydrivers.com/class/809/"},
            {"channel_name": "资讯中心-便携机", "url": "https://news.mydrivers.com/class/69/"},
            {"channel_name": "资讯中心-服务器", "url": "https://news.mydrivers.com/class/68/"},
            {"channel_name": "资讯中心-台式机", "url": "https://news.mydrivers.com/class/67/"},
            {"channel_name": "资讯中心-笔记本", "url": "https://news.mydrivers.com/class/66/"},
            {"channel_name": "资讯中心-科技前沿", "url": "https://news.mydrivers.com/class/65/"},
            {"channel_name": "资讯中心-视点人物", "url": "https://news.mydrivers.com/class/62/"},
            {"channel_name": "资讯中心-操作系统", "url": "https://news.mydrivers.com/class/58/"},
            {"channel_name": "资讯中心-电脑驱动", "url": "https://news.mydrivers.com/class/57/"},
            {"channel_name": "资讯中心-电脑软件", "url": "https://news.mydrivers.com/class/56/"},
            {"channel_name": "资讯中心-掌机游戏", "url": "https://news.mydrivers.com/class/55/"},
            {"channel_name": "资讯中心-游戏主机", "url": "https://news.mydrivers.com/class/54/"},
            {"channel_name": "资讯中心-主机游戏", "url": "https://news.mydrivers.com/class/53/"},
            {"channel_name": "资讯中心-电脑游戏", "url": "https://news.mydrivers.com/class/52/"},
            {"channel_name": "资讯中心-传真机", "url": "https://news.mydrivers.com/class/51/"},
            {"channel_name": "资讯中心-扫描仪", "url": "https://news.mydrivers.com/class/49/"},
            {"channel_name": "资讯中心-投影机", "url": "https://news.mydrivers.com/class/48/"},
            {"channel_name": "资讯中心-一体机", "url": "https://news.mydrivers.com/class/47/"},
            {"channel_name": "资讯中心-复印机", "url": "https://news.mydrivers.com/class/46/"},
            {"channel_name": "资讯中心-打印机", "url": "https://news.mydrivers.com/class/45/"},
            {"channel_name": "资讯中心-网络存储", "url": "https://news.mydrivers.com/class/43/"},
            {"channel_name": "资讯中心-网卡", "url": "https://news.mydrivers.com/class/41/"},
            {"channel_name": "资讯中心-路由器", "url": "https://news.mydrivers.com/class/38/"},
            {"channel_name": "资讯中心-交换机", "url": "https://news.mydrivers.com/class/37/"},
            {"channel_name": "资讯中心-电子书", "url": "https://news.mydrivers.com/class/33/"},
            {"channel_name": "资讯中心-科技资讯", "url": "https://news.mydrivers.com/class/329/"},
            {"channel_name": "资讯中心-快递物流", "url": "https://news.mydrivers.com/class/328/"},
            {"channel_name": "资讯中心-其他网络", "url": "https://news.mydrivers.com/class/327/"},
            {"channel_name": "资讯中心-机器人", "url": "https://news.mydrivers.com/class/326/"},
            {"channel_name": "资讯中心-火车高铁", "url": "https://news.mydrivers.com/class/325/"},
            {"channel_name": "资讯中心-网络红人", "url": "https://news.mydrivers.com/class/324/"},
            {"channel_name": "资讯中心-考勤机", "url": "https://news.mydrivers.com/class/323/"},
            {"channel_name": "资讯中心-网络安全", "url": "https://news.mydrivers.com/class/322/"},
            {"channel_name": "资讯中心-生活周边", "url": "https://news.mydrivers.com/class/321/"},
            {"channel_name": "资讯中心-共享经济", "url": "https://news.mydrivers.com/class/320/"},
            {"channel_name": "资讯中心-U盘存储卡", "url": "https://news.mydrivers.com/class/32/"},
            {"channel_name": "资讯中心-自行车", "url": "https://news.mydrivers.com/class/317/"},
            {"channel_name": "资讯中心-摩托车", "url": "https://news.mydrivers.com/class/316/"},
            {"channel_name": "资讯中心-多轴无人机", "url": "https://news.mydrivers.com/class/314/"},
            {"channel_name": "资讯中心-电动车", "url": "https://news.mydrivers.com/class/310/"},
            {"channel_name": "资讯中心-摄像头", "url": "https://news.mydrivers.com/class/31/"},
            {"channel_name": "资讯中心-智能家居", "url": "https://news.mydrivers.com/class/302/"},
            {"channel_name": "资讯中心-生活百科", "url": "https://news.mydrivers.com/class/301/"},
            {"channel_name": "资讯中心-数码相机", "url": "https://news.mydrivers.com/class/30/"},
            {"channel_name": "资讯中心-电子竞技", "url": "https://news.mydrivers.com/class/297/"},
            {"channel_name": "资讯中心-移动应用", "url": "https://news.mydrivers.com/class/292/"},
            {"channel_name": "资讯中心-智能穿戴", "url": "https://news.mydrivers.com/class/290/"},
            {"channel_name": "资讯中心-摄像机", "url": "https://news.mydrivers.com/class/29/"},
            {"channel_name": "资讯中心-安卓手机", "url": "https://news.mydrivers.com/class/288/"},
            {"channel_name": "资讯中心-其他智能", "url": "https://news.mydrivers.com/class/287/"},
            {"channel_name": "资讯中心-教育未来", "url": "https://news.mydrivers.com/class/285/"},
            {"channel_name": "资讯中心-超极本", "url": "https://news.mydrivers.com/class/278/"},
            {"channel_name": "资讯中心-创意摄影", "url": "https://news.mydrivers.com/class/274/"},
            {"channel_name": "资讯中心-样张赏析", "url": "https://news.mydrivers.com/class/273/"},
            {"channel_name": "资讯中心-镜头", "url": "https://news.mydrivers.com/class/271/"},
            {"channel_name": "资讯中心-MP3/MP4", "url": "https://news.mydrivers.com/class/27/"},
            {"channel_name": "资讯中心-艺术设计", "url": "https://news.mydrivers.com/class/269/"},
            {"channel_name": "资讯中心-电影动画", "url": "https://news.mydrivers.com/class/267/"},
            {"channel_name": "资讯中心-精彩影视", "url": "https://news.mydrivers.com/class/266/"},
            {"channel_name": "资讯中心-汽车厂商", "url": "https://news.mydrivers.com/class/264/"},
            {"channel_name": "资讯中心-车载配件", "url": "https://news.mydrivers.com/class/263/"},
            {"channel_name": "资讯中心-车载系统", "url": "https://news.mydrivers.com/class/262/"},
            {"channel_name": "资讯中心-无人驾驶汽车", "url": "https://news.mydrivers.com/class/261/"},
            {"channel_name": "资讯中心-其他汽车", "url": "https://news.mydrivers.com/class/260/"},
            {"channel_name": "资讯中心-PDA相关", "url": "https://news.mydrivers.com/class/26/"},
            {"channel_name": "资讯中心-电动汽车", "url": "https://news.mydrivers.com/class/259/"},
            {"channel_name": "资讯中心-普通汽车", "url": "https://news.mydrivers.com/class/258/"},
            {"channel_name": "资讯中心-奇趣探险", "url": "https://news.mydrivers.com/class/256/"},
            {"channel_name": "资讯中心-科普知识", "url": "https://news.mydrivers.com/class/255/"},
            {"channel_name": "资讯中心-数理化学", "url": "https://news.mydrivers.com/class/254/"},
            {"channel_name": "资讯中心-游戏厂商", "url": "https://news.mydrivers.com/class/253/"},
            {"channel_name": "资讯中心-壁纸主题", "url": "https://news.mydrivers.com/class/252/"},
            {"channel_name": "资讯中心-手机配件", "url": "https://news.mydrivers.com/class/25/"},
            {"channel_name": "资讯中心-Windows平板", "url": "https://news.mydrivers.com/class/242/"},
            {"channel_name": "资讯中心-安卓平板", "url": "https://news.mydrivers.com/class/241/"},
            {"channel_name": "资讯中心-苹果iPad", "url": "https://news.mydrivers.com/class/240/"},
            {"channel_name": "资讯中心-手机厂商", "url": "https://news.mydrivers.com/class/24/"},
            {"channel_name": "资讯中心-飞机航空", "url": "https://news.mydrivers.com/class/236/"},
            {"channel_name": "资讯中心-生活电器", "url": "https://news.mydrivers.com/class/234/"},
            {"channel_name": "资讯中心-手机系统", "url": "https://news.mydrivers.com/class/232/"},
            {"channel_name": "资讯中心-音箱", "url": "https://news.mydrivers.com/class/23/"},
            {"channel_name": "资讯中心-键鼠", "url": "https://news.mydrivers.com/class/22/"},
            {"channel_name": "资讯中心-其他手机", "url": "https://news.mydrivers.com/class/211/"},
            {"channel_name": "资讯中心-声卡", "url": "https://news.mydrivers.com/class/21/"},
            {"channel_name": "资讯中心-手机游戏", "url": "https://news.mydrivers.com/class/209/"},
            {"channel_name": "资讯中心-山寨机", "url": "https://news.mydrivers.com/class/208/"},
            {"channel_name": "资讯中心-移动处理器", "url": "https://news.mydrivers.com/class/206/"},
            {"channel_name": "资讯中心-微软手机", "url": "https://news.mydrivers.com/class/205/"},
            {"channel_name": "资讯中心-黑莓手机", "url": "https://news.mydrivers.com/class/204/"},
            {"channel_name": "资讯中心-塞班手机", "url": "https://news.mydrivers.com/class/203/"},
            {"channel_name": "资讯中心-苹果手机", "url": "https://news.mydrivers.com/class/201/"},
            {"channel_name": "资讯中心-光驱", "url": "https://news.mydrivers.com/class/20/"},
            {"channel_name": "资讯中心-工程建筑", "url": "https://news.mydrivers.com/class/197/"},
            {"channel_name": "资讯中心-地理自然", "url": "https://news.mydrivers.com/class/196/"},
            {"channel_name": "资讯中心-生科医学", "url": "https://news.mydrivers.com/class/195/"},
            {"channel_name": "资讯中心-历史考古", "url": "https://news.mydrivers.com/class/194/"},
            {"channel_name": "资讯中心-生物世界", "url": "https://news.mydrivers.com/class/193/"},
            {"channel_name": "资讯中心-散热器", "url": "https://news.mydrivers.com/class/19/"},
            {"channel_name": "资讯中心-耳塞耳机", "url": "https://news.mydrivers.com/class/185/"},
            {"channel_name": "资讯中心-小家电", "url": "https://news.mydrivers.com/class/184/"},
            {"channel_name": "资讯中心-线材线缆", "url": "https://news.mydrivers.com/class/183/"},
            {"channel_name": "资讯中心-网络运营商", "url": "https://news.mydrivers.com/class/180/"},
            {"channel_name": "资讯中心-电源", "url": "https://news.mydrivers.com/class/18/"},
            {"channel_name": "资讯中心-天文航天", "url": "https://news.mydrivers.com/class/175/"},
            {"channel_name": "资讯中心-企业动态", "url": "https://news.mydrivers.com/class/174/"},
            {"channel_name": "资讯中心-平板电视", "url": "https://news.mydrivers.com/class/173/"},
            {"channel_name": "资讯中心-机箱", "url": "https://news.mydrivers.com/class/17/"},
            {"channel_name": "资讯中心-显示器", "url": "https://news.mydrivers.com/class/168/"},
            {"channel_name": "资讯中心-其他数码", "url": "https://news.mydrivers.com/class/167/"},
            {"channel_name": "资讯中心-其他硬件", "url": "https://news.mydrivers.com/class/166/"},
            {"channel_name": "资讯中心-硬盘", "url": "https://news.mydrivers.com/class/16/"},
            {"channel_name": "资讯中心-内存", "url": "https://news.mydrivers.com/class/15/"},
            {"channel_name": "资讯中心-主板", "url": "https://news.mydrivers.com/class/14/"},
            {"channel_name": "资讯中心-CPU", "url": "https://news.mydrivers.com/class/13/"},
            {"channel_name": "资讯中心-显卡", "url": "https://news.mydrivers.com/class/12/"},
        ]
    ]
  • Organize the list styles

Write one parseX for each distinct list layout on the site, and add each of them to parse_list:

        parse_list = [
            self.parse1,  # All channels
        ]
  • Title, link, and cover image
        Item_title = response.xpath('//ul[@class="news_lb"]/li/h3/a/text()').extract()  # list of article titles
        Item_url = response.xpath('//ul[@class="news_lb"]/li/h3/a/@href').extract()  # list of article links
        Item_thumbImg = response.xpath('//ul[@class="news_lb"]/li/div[@class="news_left  photo"]/a/img/@src').extract()  # list of article cover images
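
The pieces in the list above can be tied together roughly as follows. This is only a sketch: the template's own dispatch code is not shown in this post, so pairing group i of `start_menu` with `parse_list[i]` is an assumption (suggested by the matching "All channels" comments), and `build_items` and `pair_groups_with_parsers` are hypothetical helper names. The sample article and image URLs are made up for illustration.

```python
from itertools import zip_longest
from urllib.parse import urljoin


def build_items(page_url, titles, urls, thumbs):
    """Zip the three parallel XPath results into one dict per article."""
    items = []
    for title, href, thumb in zip_longest(titles, urls, thumbs):
        if not title or not href:
            continue  # a title without a link (or the reverse) is unusable
        items.append({
            "title": title.strip(),
            "url": urljoin(page_url, href),  # resolves relative and //host links
            "thumbImg": urljoin(page_url, thumb) if thumb else "",
        })
    return items


def pair_groups_with_parsers(start_menu, parse_list):
    """Yield (channel_name, url, parser); group i is assumed to use parse_list[i]."""
    if len(start_menu) != len(parse_list):
        raise ValueError("start_menu and parse_list must have the same length")
    for group, parser in zip(start_menu, parse_list):
        for channel in group:
            yield channel["channel_name"], channel["url"], parser


def parse1(response):  # placeholder standing in for the spider's self.parse1
    return []


items = build_items(
    "https://news.mydrivers.com/class/801/",
    titles=[" Title A ", "Title B"],
    urls=["//news.mydrivers.com/1/1/1.htm", "/1/2/2.htm"],
    thumbs=["//img.example.com/a.jpg"],
)
start_menu = [[{"channel_name": "资讯中心", "url": "https://news.mydrivers.com/"}]]
jobs = list(pair_groups_with_parsers(start_menu, [parse1]))
```

One caveat with parallel lists: they only line up when every `li` contains a title, a link, and a cover image. If an article in the middle of the list has no cover, later covers shift onto the wrong articles, so looping over each `li` and extracting relative to it is the more robust option.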

The parse_detail.py file under Spider

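1. Scrape the detail page content

Adjust the CSS selector used to scrape the detail page of each list entry.
[Screenshot: detail-page markup]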

    # Process the detail page with its formatting kept; the whole content block is captured here
    item['content'] = ""
    if 'class="news_info"' in response.text and len(None2Str(item['content'])) < 5:
        item['content'] = response.xpath('//div[@class="news_info"]').extract_first()
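
When a site has more than one detail-page layout, the same guard can be repeated once per layout. The sketch below shows that pattern under a few assumptions: `None2Str` is a template helper whose code is not shown here, so `none_to_str` is a stand-in that is assumed to turn `None` into an empty string; `pick_content` and `RULES` are hypothetical names; and the fallback to the whole page mirrors what the special note in this post says the template stores when no style matches.

```python
def none_to_str(value):
    """Stand-in for the template's None2Str helper (assumed: None -> "")."""
    return "" if value is None else str(value)


def pick_content(page_html, rules, extract):
    """Try each (marker, xpath) rule in order; fall back to the whole page.

    extract(xpath) should return the matched HTML or None, for example
    lambda xp: response.xpath(xp).extract_first() inside a Scrapy callback.
    """
    content = ""
    for marker, xpath in rules:
        # Same guard as in the post: only try a rule while content is still empty.
        if marker in page_html and len(none_to_str(content)) < 5:
            content = extract(xpath)
    if len(none_to_str(content)) < 5:
        content = page_html  # no rule matched: keep the full page for later review
    return content


RULES = [('class="news_info"', '//div[@class="news_info"]')]


def fake_extract(xpath):
    # Test double that plays the role of response.xpath(...).extract_first().
    return '<div class="news_info"><p>text</p></div>'


page = '<html><body><div class="news_info"><p>text</p></div></body></html>'
other = "<html><body><p>other layout</p></body></html>"
matched = pick_content(page, RULES, fake_extract)
unmatched = pick_content(other, RULES, fake_extract)
```

Supporting a newly discovered layout then means appending one more `(marker, xpath)` pair to `RULES`.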

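2. Special note

Some sites' developers go so far as to use 9 different layouts across 10 pages. Since we cannot open every page to check the CSS structure of its detail page, there is a general workaround.

  • After the first crawl finishes, open MongoDB and run the command below. It filters out the documents whose content contains body: these were not scraped with a specified style, but captured as the entire page instead.
db.your_collection_name.find({content:/body/})

[Screenshot: query results]

  • Open any of the returned links and add handling for that detail-page layout; repeat until the mongo command returns nothing.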
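
The same check can be expressed in Python. The sketch below writes the filter as a plain dict of the kind pymongo's `collection.find()` accepts, then applies the equivalent regular expression to a few in-memory documents so it runs without a database. Matching `<body` rather than `body` is a deliberate tightening (an assumption, not the original command) so that articles that merely contain the word "body" are not flagged; the `url` field name and the sample documents are also assumptions.

```python
import re

# Filter dict in MongoDB query syntax; "<body" is stricter than the /body/
# pattern used in the shell command above.
FALLBACK_FILTER = {"content": {"$regex": "<body"}}
FALLBACK_PATTERN = re.compile(FALLBACK_FILTER["content"]["$regex"])


def needs_new_rule(doc):
    """True if the stored content still looks like an entire HTML page."""
    return bool(FALLBACK_PATTERN.search(doc.get("content") or ""))


def urls_to_inspect(docs):
    """Return the URLs of documents that were saved as whole pages."""
    return [doc["url"] for doc in docs if needs_new_rule(doc)]


# Made-up sample documents standing in for rows of the crawl collection.
docs = [
    {"url": "https://example.com/a", "content": '<div class="news_info">body text</div>'},
    {"url": "https://example.com/b", "content": "<html><body><p>x</p></body></html>"},
    {"url": "https://example.com/c", "content": None},
]
pending = urls_to_inspect(docs)
print(pending)
```

Each URL in `pending` is a candidate to open, inspect, and cover with a new extraction rule before re-running the crawl.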


Reposted from blog.csdn.net/qq_20288327/article/details/114009634