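[Scrapy: Scrape a Website in Five Minutes] [Provincial and Municipal News] Scrapy in Practice: Full-Site Data Scraping of China Gansu Net (中国甘肃网)

About the target site

China Gansu Net (中国甘肃网) is Gansu Province's key local news portal, the province's authoritative news release platform and its window for external communication. It currently has more than 40 channels covering news, government affairs, culture and more, plus the "two micros, two clients" (两微两端), 甘肃头条, 飞天手机台 and public…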

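Getting started with Scrapy

Preparing for data collection

1. If you are not familiar with the five-minute approach to scraping a site, first read
[Scrapy: Scrape a Website in Five Minutes] Essential Basics for Full-Site Data

2. If you are not familiar with how scraping targets are organized and managed, first read
[Scrapy: Scrape a Website in Five Minutes] Organizing Crawl Targets and Preparing Data

3. If you are not familiar with mass-producing spiders from a Scrapy template, first read (required)
[Scrapy: Scrape a Website in Five Minutes] A General-Purpose Project Template for Data Scraping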

Result of organizing the data

1. Screenshot of the data saved in Excel

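Applying the template

The <project>.py file under Spider

1. Create the spider

scrapy genspider www_gscn_com_cn " "

2. Survey the site-wide CSS layouts
First look at the CSS layouts of the list pages. The whole site shares one base layout; the remaining special layouts are numerous, so they are all handed to extract_list from gerapy_auto_extractor.extractors.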

3. Edit www_gscn_com_cn.py

Only the parts that need to change are described here. Everything else follows the template and needs no modification.

  • Scope & custom settings
    allowed_domains = []
    web_name = "中国甘肃网"
  • Add the pages to scrape
    start_menu = [
        # 财经频道
        [
            {"channel_name": "财经频道-财经锐评", "url": "http://finance.gscn.com.cn/cjpl/index.html"},
            {"channel_name": "财经频道-产经", "url": "http://finance.gscn.com.cn/tzbd/index.html"},
            {"channel_name": "财经频道-股市热点", "url": "http://finance.gscn.com.cn/rdsj/index.html"},
            {"channel_name": "财经频道-国际财经", "url": "http://finance.gscn.com.cn/qqzx/index.html"},
            {"channel_name": "财经频道-国内财经", "url": "http://finance.gscn.com.cn/cjjd/index.html"},
            {"channel_name": "财经频道-金融", "url": "http://finance.gscn.com.cn/sxy/index.html"},
            {"channel_name": "财经频道-省内经济", "url": "http://finance.gscn.com.cn/sljj/index.html"},
        ],
        # 大学生
        [
            {"channel_name": "大学生-大学生论坛", "url": "http://dxs.gscn.com.cn/lt/"},
            {"channel_name": "大学生-考试大全", "url": "http://dxs.gscn.com.cn/ksdq/"},
            {"channel_name": "大学生-校园新鲜事", "url": "http://dxs.gscn.com.cn/xxs/"},
            {"channel_name": "大学生-要闻扫描", "url": "http://dxs.gscn.com.cn/ywsm/"},
            {"channel_name": "大学生-招聘·就业", "url": "http://dxs.gscn.com.cn/zpjy/index.shtml"},
        ],
        # 法治甘肃
        [
            {"channel_name": "法治甘肃-H5", "url": "http://gscn.com.cn/fzgs/H5/index.shtml"},
            {"channel_name": "法治甘肃-案件速递", "url": "http://gscn.com.cn/fzgs/ajkd/index.shtml"},
            {"channel_name": "法治甘肃-本网关注", "url": "http://gscn.com.cn/fzgs/bwgz/index.shtml"},
            {"channel_name": "法治甘肃-大案特写", "url": "http://gscn.com.cn/fzgs/datx/index.shtml"},
            {"channel_name": "法治甘肃-法治视听", "url": "http://gscn.com.cn/fzgs/fzst/index.shtml"},
            {"channel_name": "法治甘肃-精图荟萃", "url": "http://gscn.com.cn/fzgs/jthc/index.shtml"},
            {"channel_name": "法治甘肃-普法教育", "url": "http://gscn.com.cn/fzgs/pfjy/index.shtml"},
            {"channel_name": "法治甘肃-图解", "url": "http://gscn.com.cn/fzgs/tj/index.shtml"},
            {"channel_name": "法治甘肃-要闻", "url": "http://gscn.com.cn/fzgs/yw/index.shtml"},
            {"channel_name": "法治甘肃-以案说法", "url": "http://gscn.com.cn/fzgs/yasf/index.shtml"},
            {"channel_name": "法治甘肃-资讯动态", "url": "http://gscn.com.cn/fzgs/zxdt/index.shtml"},
        ],
        # 飞天评论
        [
            {"channel_name": "飞天评论-编辑推荐", "url": "http://opinion.gscn.com.cn/bjtj/"},
            {"channel_name": "飞天评论-飞天锐评", "url": "http://opinion.gscn.com.cn/ftrp/"},
            {"channel_name": "飞天评论-陇风", "url": "http://opinion.gscn.com.cn/xfl/"},
            {"channel_name": "飞天评论-媒体观点", "url": "http://opinion.gscn.com.cn/mtgd/"},
            {"channel_name": "飞天评论-民生杂谈", "url": "http://opinion.gscn.com.cn/mszt/"},
            {"channel_name": "飞天评论-图中有话", "url": "http://opinion.gscn.com.cn/tzyh/"},
            {"channel_name": "飞天评论-网评甘肃", "url": "http://opinion.gscn.com.cn/wyzs/"},
            {"channel_name": "飞天评论-微评甘肃", "url": "http://opinion.gscn.com.cn/wpgs/"},
            {"channel_name": "飞天评论-文教观察", "url": "http://opinion.gscn.com.cn/wjgc/"},
            {"channel_name": "飞天评论-杂文随笔", "url": "http://opinion.gscn.com.cn/zwsb/"},
            {"channel_name": "飞天评论-政经评论", "url": "http://opinion.gscn.com.cn/zjpl/"},
        ],
        # 甘肃地理
        [
            {"channel_name": "甘肃地理-出行路线", "url": "http://www.gscn.com.cn/geography/cxlx/"},
            {"channel_name": "甘肃地理-地理资讯", "url": "http://www.gscn.com.cn/geography/dlzx/"},
            {"channel_name": "甘肃地理-甘肃考古", "url": "http://www.gscn.com.cn/geography/gskg/"},
            {"channel_name": "甘肃地理-甘肃探险", "url": "http://www.gscn.com.cn/geography/gstx/"},
            {"channel_name": "甘肃地理-古道遗址", "url": "http://www.gscn.com.cn/geography/gdyz/"},
            {"channel_name": "甘肃地理-户外宝典", "url": "http://www.gscn.com.cn/geography/hwbd/"},
            {"channel_name": "甘肃地理-民俗风情", "url": "http://www.gscn.com.cn/geography/msfq/"},
            {"channel_name": "甘肃地理-山岳地质", "url": "http://www.gscn.com.cn/geography/sydz/"},
            {"channel_name": "甘肃地理-视频专区", "url": "http://www.gscn.com.cn/geography/spzq/"},
            {"channel_name": "甘肃地理-特色地貌", "url": "http://www.gscn.com.cn/geography/tsdm/"},
            {"channel_name": "甘肃地理-峡谷河流", "url": "http://www.gscn.com.cn/geography/xghl/"},
        ],
        # 甘肃宽频
        [
            {"channel_name": "甘肃宽频-本网视点", "url": "http://video.gscn.com.cn/bwsd/"},
            {"channel_name": "甘肃宽频-甘肃要闻", "url": "http://video.gscn.com.cn/gansu/"},
            {"channel_name": "甘肃宽频-微记录", "url": "http://video.gscn.com.cn/wjl/"},
            {"channel_name": "甘肃宽频-现场直播", "url": "http://video.gscn.com.cn/wlzb/"},
            {"channel_name": "甘肃宽频-小陇随便侃", "url": "http://video.gscn.com.cn/xl/"},
            {"channel_name": "甘肃宽频-洋芋蛋视频", "url": "http://video.gscn.com.cn/yyd/"},
        ],
        # 甘肃美食
        [
            {"channel_name": "甘肃美食-陇上风味", "url": "http://www.gscn.com.cn/food/lsfw/index.shtml"},
            {"channel_name": "甘肃美食-陇上美食", "url": "http://www.gscn.com.cn/food/lsms/index.html"},
            {"channel_name": "甘肃美食-美食文化", "url": "http://www.gscn.com.cn/food/mswh/index.shtml"},
            {"channel_name": "甘肃美食-美食养生", "url": "http://www.gscn.com.cn/food/msys/index.shtml"},
        ],
        # 甘肃能源
        [
            {"channel_name": "甘肃能源-传统能源-电力", "url": "http://energy.gscn.com.cn/ctny/dl/"},
            {"channel_name": "甘肃能源-传统能源-煤炭", "url": "http://energy.gscn.com.cn/ctny/mt/"},
            {"channel_name": "甘肃能源-传统能源-石油化工", "url": "http://energy.gscn.com.cn/xxny/hn/"},
            {"channel_name": "甘肃能源-传统能源-天然气", "url": "http://energy.gscn.com.cn/ctny/trq/"},
            {"channel_name": "甘肃能源-低碳生活", "url": "http://energy.gscn.com.cn/dtsh/"},
            {"channel_name": "甘肃能源-环境保护", "url": "http://energy.gscn.com.cn/hjbh/"},
            {"channel_name": "甘肃能源-建筑节能", "url": "http://energy.gscn.com.cn/jzjn/"},
            {"channel_name": "甘肃能源-交通节能", "url": "http://energy.gscn.com.cn/jtjn/"},
            {"channel_name": "甘肃能源-清洁能源-风电", "url": "http://energy.gscn.com.cn/xxny/fd/"},
            {"channel_name": "甘肃能源-清洁能源-核能", "url": "http://energy.gscn.com.cn/xxny/hn/"},
            {"channel_name": "甘肃能源-清洁能源-生物质能", "url": "http://energy.gscn.com.cn/xxny/swzn/"},
            {"channel_name": "甘肃能源-清洁能源-太阳能", "url": "http://energy.gscn.com.cn/xxny/tyn/"},
            {"channel_name": "甘肃能源-专家解读", "url": "http://energy.gscn.com.cn/zjjd/"},
        ],
        # 甘肃人物
        [
            {"channel_name": "甘肃人物-道德模范", "url": "http://www.gscn.com.cn/figure/ddmf/"},
            {"channel_name": "甘肃人物-教育社科界", "url": "http://www.gscn.com.cn/figure/jysk/"},
            {"channel_name": "甘肃人物-金融科技界", "url": "http://www.gscn.com.cn/figure/kjj/"},
            {"channel_name": "甘肃人物-陇人骄子", "url": "http://www.gscn.com.cn/figure/lrjz/"},
            {"channel_name": "甘肃人物-陇上好人", "url": "http://www.gscn.com.cn/figure/lshr/"},
            {"channel_name": "甘肃人物-民族宗教界", "url": "http://www.gscn.com.cn/figure/mzzj/"},
            {"channel_name": "甘肃人物-农业产业界", "url": "http://www.gscn.com.cn/figure/nycy/"},
            {"channel_name": "甘肃人物-体育娱乐界", "url": "http://www.gscn.com.cn/figure/tyyl/"},
            {"channel_name": "甘肃人物-文化艺术界", "url": "http://www.gscn.com.cn/figure/whys/"},
            {"channel_name": "甘肃人物-先锋引领", "url": "http://www.gscn.com.cn/figure/sdxf/"},
            {"channel_name": "甘肃人物-新闻出版界", "url": "http://www.gscn.com.cn/figure/xwcb/"},
            {"channel_name": "甘肃人物-医药卫生界", "url": "http://www.gscn.com.cn/figure/yyws/"},
        ],
        # 甘肃省情
        [
            {"channel_name": "甘肃省情-伏羲", "url": "http://www.gscn.com.cn/province/fx/"},
            {"channel_name": "甘肃省情-黄河", "url": "http://www.gscn.com.cn/province/hh/"},
            {"channel_name": "甘肃省情-经济社会发展-对外开放", "url": "http://www.gscn.com.cn/province/jjsh/dwkf/"},
            {"channel_name": "甘肃省情-经济社会发展-宏观经济", "url": "http://www.gscn.com.cn/province/jjsh/hgjj/"},
            {"channel_name": "甘肃省情-经济社会发展-惠民富民", "url": "http://www.gscn.com.cn/province/jjsh/hmfm/"},
            {"channel_name": "甘肃省情-经济社会发展-社会事业", "url": "http://www.gscn.com.cn/province/jjsh/shsy/"},
            {"channel_name": "甘肃省情-经济社会发展-数字甘肃", "url": "http://www.gscn.com.cn/province/jjsh/szgs/"},
            {"channel_name": "甘肃省情-历史", "url": "http://www.gscn.com.cn/province/ls/"},
            {"channel_name": "甘肃省情-陇原风光", "url": "http://www.gscn.com.cn/province/lyfg/"},
            {"channel_name": "甘肃省情-民俗", "url": "http://www.gscn.com.cn/province/ms/"},
            {"channel_name": "甘肃省情-人物", "url": "http://www.gscn.com.cn/province/rw/"},
            {"channel_name": "甘肃省情-丝绸之路", "url": "http://www.gscn.com.cn/province/sczl/"},
            {"channel_name": "甘肃省情-文化教育", "url": "http://www.gscn.com.cn/province/whjy/"},
            {"channel_name": "甘肃省情-艺术", "url": "http://www.gscn.com.cn/province/ys/"},
            {"channel_name": "甘肃省情-自然资源", "url": "http://www.gscn.com.cn/province/zrzy/"},
            {"channel_name": "甘肃省情-自然资源-风能", "url": "http://www.gscn.com.cn/province/zrzy/fn/"},
            {"channel_name": "甘肃省情-自然资源-矿产", "url": "http://www.gscn.com.cn/province/zrzy/kc/"},
            {"channel_name": "甘肃省情-自然资源-水利", "url": "http://www.gscn.com.cn/province/zrzy/sl/"},
            {"channel_name": "甘肃省情-自然资源-太阳能", "url": "http://www.gscn.com.cn/province/zrzy/tyn/"},
        ],
        # 甘肃书画
        [
            {"channel_name": "甘肃书画-甘肃书画", "url": "http://shuhua.gscn.com.cn/gssh/"},
            {"channel_name": "甘肃书画-美术展厅", "url": "http://shuhua.gscn.com.cn/mszt/"},
            {"channel_name": "甘肃书画-艺海钩沉", "url": "http://shuhua.gscn.com.cn/yhgc/"},
            {"channel_name": "甘肃书画-艺术资讯", "url": "http://shuhua.gscn.com.cn/yszx/"},
        ],
        # 甘肃特产
        [
            {"channel_name": "甘肃特产-本地商讯", "url": "http://www.gscn.com.cn/specialties/bdsx/"},
            {"channel_name": "甘肃特产-地县分布", "url": "http://www.gscn.com.cn/specialties/dxfb/"},
            {"channel_name": "甘肃特产-名品展示", "url": "http://www.gscn.com.cn/specialties/mpzs/"},
            {"channel_name": "甘肃特产-特产论坛", "url": "http://www.gscn.com.cn/specialties/tclt/"},
            {"channel_name": "甘肃特产-特产品评", "url": "http://www.gscn.com.cn/specialties/tcpp/"},
            {"channel_name": "甘肃特产-特产企业", "url": "http://www.gscn.com.cn/specialties/tcqy/"},
            {"channel_name": "甘肃特产-特产新闻", "url": "http://www.gscn.com.cn/specialties/tcxw/"},
        ],
        # 甘肃文化
        [
            {"channel_name": "甘肃文化-独家", "url": "http://www.gscn.com.cn/culture/dj/index.shtml"},
            {"channel_name": "甘肃文化-发现之旅", "url": "http://www.gscn.com.cn/culture/fxzl/index.shtml"},
            {"channel_name": "甘肃文化-历史珍闻", "url": "http://www.gscn.com.cn/culture/lszw/index.shtml"},
            {"channel_name": "甘肃文化-民俗风情", "url": "http://www.gscn.com.cn/culture/msfq/index.shtml"},
            {"channel_name": "甘肃文化-人文之旅", "url": "http://www.gscn.com.cn/culture/rwzl/index.shtml"},
            {"channel_name": "甘肃文化-收藏考古", "url": "http://www.gscn.com.cn/culture/sckg/index.shtml"},
            {"channel_name": "甘肃文化-文化名人", "url": "http://www.gscn.com.cn/culture/whmr/index.shtml"},
            {"channel_name": "甘肃文化-文化热点", "url": "http://www.gscn.com.cn/culture/whkd/index.shtml"},
            {"channel_name": "甘肃文化-文化时评", "url": "http://www.gscn.com.cn/culture/whsp/index.shtml"},
            {"channel_name": "甘肃文化-文化书架", "url": "http://www.gscn.com.cn/culture/whsj/index.shtml"},
            {"channel_name": "甘肃文化-文史博览", "url": "http://www.gscn.com.cn/culture/wsbl/index.shtml"},
            {"channel_name": "甘肃文化-影视演出", "url": "http://www.gscn.com.cn/culture/ysyc/index.shtml"},
        ],
        # 甘肃新闻
        [
            {"channel_name": "甘肃新闻-本网原创", "url": "http://gansu.gscn.com.cn/bwyc/"},
            {"channel_name": "甘肃新闻-甘肃教育", "url": "http://gansu.gscn.com.cn/gsjy/index.html?spm=zm5104-001.0.0.3.mNkOdS"},
            {"channel_name": "甘肃新闻-甘肃廉政", "url": "http://gansu.gscn.com.cn/gsyq/index.html"},
            {"channel_name": "甘肃新闻-甘肃美食", "url": "http://gansu.gscn.com.cn/gsms/"},
            {"channel_name": "甘肃新闻-甘肃美食", "url": "http://gansu.gscn.com.cn/gstc/"},
            {"channel_name": "甘肃新闻-甘肃人物", "url": "http://gansu.gscn.com.cn/gsrenwu/"},
            {"channel_name": "甘肃新闻-甘肃特产", "url": "http://gansu.gscn.com.cn/gstc/"},
            {"channel_name": "甘肃新闻-甘肃文化", "url": "http://gansu.gscn.com.cn/gswh/"},
            {"channel_name": "甘肃新闻-社会综合", "url": "http://gansu.gscn.com.cn/gssh/"},
            {"channel_name": "甘肃新闻-省内人事变动", "url": "http://gansu.gscn.com.cn/slrs/"},
            {"channel_name": "甘肃新闻-省外媒体刊甘肃", "url": "http://gansu.gscn.com.cn/swkgs/"},
            {"channel_name": "甘肃新闻-市州播报", "url": "http://gansu.gscn.com.cn/gsjsbb/"},
            {"channel_name": "甘肃新闻-市州播报-白银", "url": "http://gansu.gscn.com.cn/gsjsbb/by/"},
            {"channel_name": "甘肃新闻-市州播报-定西", "url": "http://gansu.gscn.com.cn/gsjsbb/dx/"},
            {"channel_name": "甘肃新闻-市州播报-甘南", "url": "http://gansu.gscn.com.cn/gsjsbb/gn/"},
            {"channel_name": "甘肃新闻-市州播报-嘉峪关", "url": "http://gansu.gscn.com.cn/gsjsbb/jyg/"},
            {"channel_name": "甘肃新闻-市州播报-酒泉", "url": "http://gansu.gscn.com.cn/gsjsbb/jq/"},
            {"channel_name": "甘肃新闻-市州播报-兰州", "url": "http://gansu.gscn.com.cn/gsjsbb/lz/index.html?spm=zm5104-001.0.0.4.5HyTzv"},
            {"channel_name": "甘肃新闻-市州播报-临夏", "url": "http://gansu.gscn.com.cn/gsjsbb/lx/"},
            {"channel_name": "甘肃新闻-市州播报-陇南", "url": "http://gansu.gscn.com.cn/gsjsbb/ln/"},
            {"channel_name": "甘肃新闻-市州播报-平凉", "url": "http://gansu.gscn.com.cn/gsjsbb/pl/"},
            {"channel_name": "甘肃新闻-市州播报-庆阳", "url": "http://gansu.gscn.com.cn/gsjsbb/qy/"},
            {"channel_name": "甘肃新闻-市州播报-天水", "url": "http://gansu.gscn.com.cn/gsjsbb/ts/"},
            {"channel_name": "甘肃新闻-市州播报-武威", "url": "http://gansu.gscn.com.cn/gsjsbb/ww/"},
            {"channel_name": "甘肃新闻-市州播报-张掖", "url": "http://gansu.gscn.com.cn/gsjsbb/zy/"},
            {"channel_name": "甘肃新闻-小陇画报-往期回顾", "url": "http://gansu.gscn.com.cn/xlhb/renwen/index.shtml"},
            {"channel_name": "甘肃新闻-小陇热线-便民服务", "url": "http://gansu.gscn.com.cn/msrx/bmfw/index.shtml"},
            {"channel_name": "甘肃新闻-小陇热线-便民服务", "url": "http://gansu.gscn.com.cn/msrx/bmfw/index.shtml?spm=zm5104-001.0.0.4.ZhM0a5&file=index.shtml"},
            {"channel_name": "甘肃新闻-小陇热线-今日聚焦", "url": "http://gansu.gscn.com.cn/msrx/jrjj/index.shtml"},
            {"channel_name": "甘肃新闻-小陇热线-热词看民生", "url": "http://gansu.gscn.com.cn/msrx/rckms/index.shtml"},
            {"channel_name": "甘肃新闻-小陇热线-热点专题", "url": "http://gansu.gscn.com.cn/msrx/rdzt/index.shtml"},
            {"channel_name": "甘肃新闻-小陇热线-深度观察", "url": "http://gansu.gscn.com.cn/msrx/sdgc/index.shtml"},
            {"channel_name": "甘肃新闻-小陇热线-天天315", "url": "http://gansu.gscn.com.cn/msrx/tt315/index.shtml"},
            {"channel_name": "甘肃新闻-小陇热线-问政陇原", "url": "http://gansu.gscn.com.cn/msrx/zxhf/index.shtml"},
            {"channel_name": "甘肃新闻-小陇热线-政策解读", "url": "http://gansu.gscn.com.cn/msrx/zcjd/index.shtml"},
            {"channel_name": "甘肃新闻-要闻", "url": "http://gansu.gscn.com.cn/gsyw/"},
        ],
        # 甘肃政务
        [
            {"channel_name": "甘肃政务-甘肃人事变动", "url": "http://gov.gscn.com.cn/gsrw/"},
            {"channel_name": "甘肃政务-权威公告", "url": "http://www.gscn.com.cn/government/qwgg/"},
            {"channel_name": "甘肃政务-权威公告", "url": "http://gov.gscn.com.cn/qwgg/"},
            {"channel_name": "甘肃政务-省上领导活动报道集", "url": "http://gov.gscn.com.cn/sld/"},
            {"channel_name": "甘肃政务-政策法规", "url": "http://gov.gscn.com.cn/zcfg/"},
            {"channel_name": "甘肃政务-政府文件", "url": "http://gov.gscn.com.cn/zf/"},
            {"channel_name": "甘肃政务-政务动态", "url": "http://gov.gscn.com.cn/zwdt/"},
        ],
        # 甘肃公益
        [
            {"channel_name": "甘肃政务-甘肃人事变动", "url": "http://gov.gscn.com.cn/gsrw/"},
            {"channel_name": "甘肃政务-权威公告", "url": "http://www.gscn.com.cn/government/qwgg/"},
            {"channel_name": "甘肃政务-权威公告", "url": "http://gov.gscn.com.cn/qwgg/"},
            {"channel_name": "甘肃政务-省上领导活动报道集", "url": "http://gov.gscn.com.cn/sld/"},
            {"channel_name": "甘肃政务-政策法规", "url": "http://gov.gscn.com.cn/zcfg/"},
            {"channel_name": "甘肃政务-政府文件", "url": "http://gov.gscn.com.cn/zf/"},
            {"channel_name": "甘肃政务-政务动态", "url": "http://gov.gscn.com.cn/zwdt/"},
        ],
        # 科教频道
        [
            {"channel_name": "科教频道-甘肃教育", "url": "http://science.gscn.com.cn/gsjy/index.html?spm=zm5104-001.0.0.3.3z3MIk"},
            {"channel_name": "科教频道-海外留学", "url": "http://science.gscn.com.cn/hwlx/index.shtml?spm=zm5104-001.0.0.3.1UEAET&file=index.shtml"},
            {"channel_name": "科教频道-教育精图", "url": "http://science.gscn.com.cn/jyjt/index.html?spm=zm5104-001.0.0.3.7EcjAg"},
            {"channel_name": "科教频道-今日聚焦", "url": "http://science.gscn.com.cn/jrjj/index.shtml?spm=zm5104-001.0.0.3.Xf1bpJ&file=index.shtml"},
            {"channel_name": "科教频道-考生必备", "url": "http://science.gscn.com.cn/ksbb/index.html?spm=zm5104-001.0.0.3.yIQSS1"},
            {"channel_name": "科教频道-箐箐校园", "url": "http://science.gscn.com.cn/qqxy/index.html?spm=zm5104-001.0.0.3.JtKCBf"},
            {"channel_name": "科教频道-文教评论", "url": "http://science.gscn.com.cn/wjpl/index.html?spm=zm5104-001.0.0.3.axsR6F"},
        ],
        # 兰州都市
        [
            {"channel_name": "兰州都市", "url": "http://gansu.gscn.com.cn/cms_udf/2020/lanzhou/index.shtml"},
        ],
        # 理论频道
        [
            {"channel_name": "理论频道-党建政治", "url": "http://theory.gscn.com.cn/dsdj/index.shtml"},
            {"channel_name": "理论频道-甘肃评论", "url": "http://theory.gscn.com.cn/gspl/index.shtml"},
            {"channel_name": "理论频道-理论动态", "url": "http://theory.gscn.com.cn/lldt/index.shtml"},
            {"channel_name": "理论频道-理论前沿", "url": "http://theory.gscn.com.cn/llqy/index.shtml"},
            {"channel_name": "理论频道-学习路上", "url": "http://theory.gscn.com.cn/xxls/index.shtml"},
            {"channel_name": "理论频道-最新热评", "url": "http://theory.gscn.com.cn/zxrp/index.shtml"},
        ],
        # 廉政频道
        [
            {"channel_name": "廉政频道-理论探讨", "url": "http://www.gscn.com.cn/gslz/lltt/index.shtml"},
            {"channel_name": "廉政频道-廉政文化", "url": "http://www.gscn.com.cn/gslz/lzwh/index.shtml"},
            {"channel_name": "廉政频道-廉政要闻", "url": "http://www.gscn.com.cn/gslz/lzyw/index.shtml"},
            {"channel_name": "廉政频道-曝光台", "url": "http://www.gscn.com.cn/gslz/bgt/index.shtml"},
            {"channel_name": "廉政频道-时代风采", "url": "http://www.gscn.com.cn/gslz/sdfc/index.shtml"},
            {"channel_name": "廉政频道-市州动态", "url": "http://www.gscn.com.cn/gslz/szdt/index.shtml"},
            {"channel_name": "廉政频道-政策法规", "url": "http://www.gscn.com.cn/gslz/zcfg/index.shtml"},
        ],
        # 陇上有名
        [
            {"channel_name": "陇上有名-风景名胜-网红景点", "url": "http://www.gscn.com.cn/lsym/fjms/whjd/index.shtml"},
            {"channel_name": "陇上有名-历史文化-陇原故事", "url": "http://www.gscn.com.cn/lsym/lswh/lygs/index.shtml"},
            {"channel_name": "陇上有名-历史文化-溯源甘肃", "url": "http://www.gscn.com.cn/lsym/lswh/sygs/index.shtml"},
            {"channel_name": "陇上有名-历史文化-探史揭秘", "url": "http://www.gscn.com.cn/lsym/lswh/tsjm/index.shtml"},
            {"channel_name": "陇上有名-历史文化-文化考古", "url": "http://www.gscn.com.cn/lsym/lswh/whkg/index.shtml"},
            {"channel_name": "陇上有名-名家访谈", "url": "http://www.gscn.com.cn/lsym/mjft/index.shtml"},
            {"channel_name": "陇上有名-名人名家-非遗", "url": "http://www.gscn.com.cn/lsym/mrmj/fy/index.shtml"},
            {"channel_name": "陇上有名-名人名家-书法", "url": "http://www.gscn.com.cn/lsym/mrmj/sf/index.shtml"},
            {"channel_name": "陇上有名-名人名家-文学", "url": "http://www.gscn.com.cn/lsym/mrmj/wx/index.shtml"},
            {"channel_name": "陇上有名-名人名家-影视", "url": "http://www.gscn.com.cn/lsym/mrmj/ys/index.shtml"},
            {"channel_name": "陇上有名-名师名校", "url": "http://www.gscn.com.cn/lsym/msmx/index.shtml"},
            {"channel_name": "陇上有名-作品展示", "url": "http://www.gscn.com.cn/lsym/zpzs/index.shtml"},
        ],
        # 陇原养生
        [
            {"channel_name": "陇原养生-食疗养生", "url": "http://lyys.gscn.com.cn/sbsl/index.shtml"},
            {"channel_name": "陇原养生-养生名人", "url": "http://lyys.gscn.com.cn/lymy/index.shtml"},
            {"channel_name": "陇原养生-养生名医馆", "url": "http://lyys.gscn.com.cn/mqmy/index.shtml"},
            {"channel_name": "陇原养生-养生杂谈", "url": "http://lyys.gscn.com.cn/yscs/index.shtml"},
            {"channel_name": "陇原养生-养生资讯", "url": "http://lyys.gscn.com.cn/yszx/index.shtml"},
            {"channel_name": "陇原养生-运动养生", "url": "http://lyys.gscn.com.cn/ydjs/index.shtml"},
        ],
        # 汽车频道
        [
            {"type_id": "3", "province_name": "甘肃省", "city_name": "全地区", "channel_name": "中国甘肃网-汽车频道-曝光台", "url": "http://auto.gscn.com.cn/bgt/"},
        ],
        # 三农频道
        [
            {"channel_name": "三农频道-奋斗新农人", "url": "http://gscn.com.cn/snpd/fdxnr/index.shtml"},
            {"channel_name": "三农频道-美丽乡村", "url": "http://gscn.com.cn/snpd/mlxc/index.shtml"},
            {"channel_name": "三农频道-农桑论语", "url": "http://gscn.com.cn/snpd/nsly/index.shtml"},
            {"channel_name": "三农频道-三农讲堂", "url": "http://gscn.com.cn/snpd/snjt/index.shtml"},
            {"channel_name": "三农频道-三农要闻", "url": "http://gscn.com.cn/snpd/snyw/index.shtml"},
            {"channel_name": "三农频道-市州动态", "url": "http://gscn.com.cn/snpd/szdt/index.shtml"},
            {"channel_name": "三农频道-乡村振兴", "url": "http://gscn.com.cn/snpd/xczx/index.shtml"},
            {"channel_name": "三农频道-一县一品", "url": "http://gscn.com.cn/snpd/yxyp/index.shtml"},
        ],
        # 书香陇原
        [
            {"channel_name": "书香陇原-畅销排行", "url": "http://www.gscn.com.cn/sxly/cxpx/index.shtml"},
            {"channel_name": "书香陇原-读书心得", "url": "http://www.gscn.com.cn/sxly/dsxd/index.shtml"},
            {"channel_name": "书香陇原-甘版图书", "url": "http://www.gscn.com.cn/sxly/gbts/index.shtml"},
            {"channel_name": "书香陇原-聆听书香", "url": "http://www.gscn.com.cn/sxly/ltsx/index.shtml"},
            {"channel_name": "书香陇原-陇原新书", "url": "http://www.gscn.com.cn/sxly/lyxs/index.shtml"},
            {"channel_name": "书香陇原-美文摘编", "url": "http://www.gscn.com.cn/sxly/mwzb/index.shtml"},
            {"channel_name": "书香陇原-热点资讯", "url": "http://www.gscn.com.cn/sxly/rdzx/index.shtml"},
        ],
        # 图解新闻
        [
            {"channel_name": "图解新闻", "url": "http://gansu.gscn.com.cn/tj/"},
        ],
        # 图片频道
        [
            {"channel_name": "图片频道-甘肃视界", "url": "http://photo.gscn.com.cn/gssj/"},
            {"channel_name": "图片频道-军事", "url": "http://photo.gscn.com.cn/js/"},
            {"channel_name": "图片频道-趣图荟萃", "url": "http://photo.gscn.com.cn/cthc/"},
            {"channel_name": "图片频道-社会万象", "url": "http://photo.gscn.com.cn/shwx/"},
            {"channel_name": "图片频道-时尚", "url": "http://photo.gscn.com.cn/ss/"},
            {"channel_name": "图片频道-时政", "url": "http://photo.gscn.com.cn/sz/"},
            {"channel_name": "图片频道-娱乐", "url": "http://photo.gscn.com.cn/yl/"},
        ],
        # 脱贫攻坚频道
        [
            {"channel_name": "脱贫攻坚频道-扶贫影像", "url": "http://fpgj.gscn.com.cn/tsfp/"},
            {"channel_name": "脱贫攻坚频道-媒体矩阵", "url": "http://fpgj.gscn.com.cn/tj/"},
            {"channel_name": "脱贫攻坚频道-脱贫致富", "url": "http://fpgj.gscn.com.cn/fpxd/"},
            {"channel_name": "脱贫攻坚频道-要闻聚焦", "url": "http://fpgj.gscn.com.cn/fpdt/"},
            {"channel_name": "脱贫攻坚频道-要闻聚焦-甘肃要闻", "url": "http://fpgj.gscn.com.cn/fpdt/gsyw/"},
            {"channel_name": "脱贫攻坚频道-要闻聚焦-市州动态", "url": "http://fpgj.gscn.com.cn/fpdt/szsy/"},
            {"channel_name": "脱贫攻坚频道-政策文件", "url": "http://fpgj.gscn.com.cn/zcwj/"},
            {"channel_name": "脱贫攻坚频道-专题专栏", "url": "http://fpgj.gscn.com.cn/ztzl/"},
        ],
        # 文化旅游
        [
            {"channel_name": "文化旅游-风味小吃", "url": "http://www.gscn.com.cn/tourism/fwxc/"},
            {"channel_name": "文化旅游-甘肃地理", "url": "http://www.gscn.com.cn/tourism/dl/"},
            {"channel_name": "文化旅游-甘肃旅游精品线路", "url": "http://www.gscn.com.cn/tourism/lyxl/"},
            {"channel_name": "文化旅游-国内游", "url": "http://www.gscn.com.cn/tourism/gny/index.shtml"},
            {"channel_name": "文化旅游-行摄之旅", "url": "http://www.gscn.com.cn/tourism/xs/index.html"},
            {"channel_name": "文化旅游-境外游", "url": "http://www.gscn.com.cn/tourism/jwy/index.shtml"},
            {"channel_name": "文化旅游-旅游活动", "url": "http://www.gscn.com.cn/tourism/lyhd/index.shtml"},
            {"channel_name": "文化旅游-旅游贴士", "url": "http://www.gscn.com.cn/tourism/ts/index.html"},
            {"channel_name": "文化旅游-省内游", "url": "http://www.gscn.com.cn/tourism/sny/index.shtml"},
        ],
        # 新闻发布会
        [
            {"channel_name": "新闻发布会-历年新闻发布会-2019年新闻发布会", "url": "http://fbh.gscn.com.cn/lnfbh/2019nxwfbh/"},
            {"channel_name": "新闻发布会-历年新闻发布会-2019年新闻发布会-党委发布会", "url": "http://fbh.gscn.com.cn/lnfbh/2019nxwfbh/dwfbh/"},
            {"channel_name": "新闻发布会-历年新闻发布会-2019年新闻发布会-企事业新闻", "url": "http://fbh.gscn.com.cn/lnfbh/2019nxwfbh/qsyxwfbh/"},
            {"channel_name": "新闻发布会-历年新闻发布会-2019年新闻发布会-政府新闻发", "url": "http://fbh.gscn.com.cn/lnfbh/2019nxwfbh/zfxwfbh/index.shtml"},
        ],
        # 新闻中心
        [
            {"channel_name": "新闻中心-国际要闻", "url": "http://news.gscn.com.cn/gjyw/"},
            {"channel_name": "新闻中心-国际要闻", "url": "http://energy.gscn.com.cn/nyxw/"},
            {"channel_name": "新闻中心-国内要闻", "url": "http://news.gscn.com.cn/glyw/"},
            {"channel_name": "新闻中心-军事要闻", "url": "http://news.gscn.com.cn/js/"},
            {"channel_name": "新闻中心-名刊精选", "url": "http://news.gscn.com.cn/mkjx/"},
            {"channel_name": "新闻中心-社会综合", "url": "http://news.gscn.com.cn/sh/"},
            {"channel_name": "新闻中心-体育新闻", "url": "http://news.gscn.com.cn/ty/"},
            {"channel_name": "新闻中心-新闻人物", "url": "http://news.gscn.com.cn/xwrw/index.html"},
            {"channel_name": "新闻中心-新闻语录", "url": "http://news.gscn.com.cn/xwyl/index.html"},
            {"channel_name": "新闻中心-最新任免信息", "url": "http://news.gscn.com.cn/zxrm/"},
        ],
        # 娱乐频道
        [
            {"channel_name": "娱乐频道-明星", "url": "http://www.gscn.com.cn/ent/mx/index.shtml"},
            {"channel_name": "娱乐频道-时尚", "url": "http://www.gscn.com.cn/ent/ss/index.shtml"},
            {"channel_name": "娱乐频道-音乐", "url": "http://www.gscn.com.cn/ent/yy/index.shtml"},
            {"channel_name": "娱乐频道-影视", "url": "http://www.gscn.com.cn/ent/ys/index.shtml"},
        ],
        # 舆情频道
        [
            {"channel_name": "舆情频道-甘肃舆情", "url": "http://yqpd.gscn.com.cn/gsyq/"},
            {"channel_name": "舆情频道-焦点舆评", "url": "http://yqpd.gscn.com.cn/jdyp/"},
            {"channel_name": "舆情频道-企业舆情", "url": "http://yqpd.gscn.com.cn/qyyq/"},
            {"channel_name": "舆情频道-市州舆情", "url": "http://yqpd.gscn.com.cn/szyq/"},
            {"channel_name": "舆情频道-舆情观察", "url": "http://yqpd.gscn.com.cn/yqgc/"},
            {"channel_name": "舆情频道-舆情聚焦", "url": "http://yqpd.gscn.com.cn/yqjj/"},
            {"channel_name": "舆情频道-政务舆情", "url": "http://yqpd.gscn.com.cn/zwyq/"},
        ],
        # 专题
        [
            {"channel_name": "专题-2013专题", "url": "http://special.gscn.com.cn/2013zt/"},
            {"channel_name": "专题-2014专题", "url": "http://special.gscn.com.cn/2014zt/"},
            {"channel_name": "专题-2015专题", "url": "http://special.gscn.com.cn/2015zt/"},
            {"channel_name": "专题-2016专题", "url": "http://special.gscn.com.cn/2016zt/"},
            {"channel_name": "专题-2017专题", "url": "http://special.gscn.com.cn/2017zt/"},
            {"channel_name": "专题-2018专题", "url": "http://special.gscn.com.cn/2018zt/"},
            {"channel_name": "专题-2019专题", "url": "http://special.gscn.com.cn/2019zt/"},
            {"channel_name": "专题-2020年专题", "url": "http://special.gscn.com.cn/2020nzt/"},
            {"channel_name": "专题-2020专题", "url": "http://special.gscn.com.cn/2020zt/"},
        ],
    ]
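
Several URLs in the list above carry an `spm` tracking parameter (sometimes with `file=index.shtml`), and a few URLs appear more than once, so it can be worth normalizing the list before crawling. The following is a minimal sketch and not part of the template: `clean_url`, `dedupe_menu` and `collect_domains` are hypothetical helper names, and treating `spm` and `file` as safe to drop is an assumption that should be checked against the site.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Query parameters assumed (not confirmed) to have no effect on page content.
DROP_PARAMS = {"spm", "file"}


def clean_url(url):
    """Return url without the query parameters listed in DROP_PARAMS."""
    parts = urlsplit(url)
    kept = [(k, v) for k, v in parse_qsl(parts.query) if k not in DROP_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def dedupe_menu(start_menu):
    """Clean every URL and drop entries whose cleaned URL was already seen."""
    seen = set()
    result = []
    for group in start_menu:
        kept_group = []
        for entry in group:
            url = clean_url(entry["url"])
            if url in seen:
                continue
            seen.add(url)
            kept_group.append({**entry, "url": url})
        # Keep empty groups too, so group positions do not shift.
        result.append(kept_group)
    return result


def collect_domains(start_menu):
    """Collect the distinct host names, e.g. to review before filling allowed_domains."""
    return sorted({urlsplit(e["url"]).netloc for g in start_menu for e in g})


sample_menu = [
    [
        {"channel_name": "甘肃新闻-小陇热线-便民服务",
         "url": "http://gansu.gscn.com.cn/msrx/bmfw/index.shtml"},
        {"channel_name": "甘肃新闻-小陇热线-便民服务",
         "url": "http://gansu.gscn.com.cn/msrx/bmfw/index.shtml?spm=zm5104-001.0.0.4.ZhM0a5&file=index.shtml"},
    ],
    [
        {"channel_name": "图解新闻", "url": "http://gansu.gscn.com.cn/tj/"},
    ],
]

cleaned = dedupe_menu(sample_menu)
domains = collect_domains(cleaned)
```

Groups are kept even when they end up empty, so the number and order of groups stay the same as in the original list.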
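  • Organizing the layouts

Write one parseX method for each distinct list layout found on the site, and add them to parse_list (the comments show which start_menu group each entry serves):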

        parse_list = [
            self.parse1,  # 财经频道
            self.parse1,  # 大学生
            self.parse1,  # 法治甘肃
            self.parse1,  # 飞天评论
            self.parse1,  # 甘肃地理
            self.parse1,  # 甘肃宽频
            self.parse1,  # 甘肃美食
            self.parse1,  # 甘肃能源
            self.parse1,  # 甘肃人物
            self.parse1,  # 甘肃省情
            self.parse1,  # 甘肃书画
            self.parse1,  # 甘肃特产
            self.parse1,  # 甘肃文化
            self.parse1,  # 甘肃新闻
            self.parse1,  # 甘肃政务
            self.parse1,  # 公益频道
            self.parse1,  # 科教频道
            self.parse1,  # 兰州都市
            self.parse1,  # 理论频道
            self.parse1,  # 廉政频道
            self.parse1,  # 陇上有名
            self.parse1,  # 陇原养生
            self.parse1,  # 汽车频道
            self.parse1,  # 三农频道
            self.parse1,  # 书香陇原
            self.parse1,  # 图解新闻
            self.parse1,  # 图片频道
            self.parse1,  # 脱贫攻坚频道
            self.parse1,  # 文化旅游
            self.parse1,  # 新闻发布会
            self.parse1,  # 新闻中心
            self.parse1,  # 娱乐频道
            self.parse1,  # 舆情频道
            self.parse1,  # 专题
        ]
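
The template itself is not reproduced in this post, so how it consumes these two lists is not shown. One plausible reading, sketched below purely as an assumption, is that group i of start_menu is handled by callback i of parse_list. `pair_menu_with_callbacks` is a hypothetical helper, and plain functions stand in for the spider's parseX methods.

```python
def pair_menu_with_callbacks(start_menu, parse_list):
    """Yield (entry, callback) pairs, matching groups and callbacks by position."""
    if len(start_menu) != len(parse_list):
        raise ValueError(
            f"start_menu has {len(start_menu)} groups but "
            f"parse_list has {len(parse_list)} callbacks"
        )
    for group, callback in zip(start_menu, parse_list):
        for entry in group:
            yield entry, callback


def parse1(response):  # stand-in for the spider's self.parse1
    return "parse1"


def parse2(response):  # stand-in for a hypothetical second list layout
    return "parse2"


demo_menu = [
    [{"channel_name": "图解新闻", "url": "http://gansu.gscn.com.cn/tj/"}],
    [
        {"channel_name": "图片频道-军事", "url": "http://photo.gscn.com.cn/js/"},
        {"channel_name": "图片频道-时尚", "url": "http://photo.gscn.com.cn/ss/"},
    ],
]
pairs = list(pair_menu_with_callbacks(demo_menu, [parse1, parse2]))
```

The length check is the useful part: with 34 groups and 34 callbacks, a missing or extra entry in either list would otherwise silently send channels to the wrong parser. In a spider, each pair would typically become a `scrapy.Request(entry["url"], callback=callback)`.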
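  • Title, link & cover image
    The list pages on this site carry no images, so Item_thumbImg is not used.
        # Needs at the top of the file:
        #   from urllib import parse
        #   from gerapy_auto_extractor.extractors import extract_list

        # Style 1: the site-wide base list layout
        Item_title = response.xpath('//div[@id="content"]/ul/li/a/text()').extract()  # list of article titles
        Item_url = response.xpath('//div[@id="content"]/ul/li/a/@href').extract()  # list of article links

        # Style 2 (generic): let extract_list detect the list
        data = extract_list(response.text)
        for entry in data:
            item['title'] = entry["title"].strip()  # article title
            item['url'] = parse.urljoin(response.url, entry["url"])  # build the absolute article URL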
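
Style 1 yields two parallel lists, which still have to be combined into one record per article. Here is a minimal sketch of that step, assuming the two XPath results line up one-to-one; `build_items` is a hypothetical helper, and the sample titles and hrefs are made up for illustration.

```python
from urllib.parse import urljoin


def build_items(page_url, titles, hrefs):
    """Pair each title with its link and resolve relative links against page_url."""
    items = []
    for title, href in zip(titles, hrefs):
        title = title.strip()
        if not title or not href:
            continue  # skip empty anchors instead of emitting half-filled items
        items.append({"title": title, "url": urljoin(page_url, href)})
    return items


items = build_items(
    "http://finance.gscn.com.cn/cjpl/index.html",
    ["  Example headline A ", "Example headline B"],
    ["/system/2021/02/26/000000001.shtml", "000000002.shtml"],
)
```

If some `<a>` element has no text node, the two lists fall out of step. Iterating over the `li/a` nodes and reading the text and `@href` from each node avoids that.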


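The parse_detail.py file under Spider

1. Scrape the detail-page content

Adjust the CSS rules used to scrape the detail pages linked from the lists. Two layouts were identified.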

    # Handle formatted detail pages: the whole matching block is captured, markup included
    item['content'] = ""
    if 'class="artical"' in response.text and len(None2Str(item['content'])) < 5:
        item['content'] = response.xpath('//div[@class="artical"]').extract_first()
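
The guard `len(None2Str(item['content'])) < 5` only fills the content when nothing useful has been extracted yet, which lets several layout rules be stacked one after another. `None2Str` comes from the template and is not shown in this post; the sketch below assumes it turns None into an empty string. `none_to_str`, `first_match` and the rule list are hypothetical, and a simple string-slicing function stands in for `response.xpath(...).extract_first()` so the example runs without Scrapy.

```python
def none_to_str(value):
    """Assumed behaviour of the template's None2Str: None becomes ''."""
    return "" if value is None else str(value)


def first_match(html, rules, min_len=5):
    """Try (marker, extractor) rules in order; keep the first non-trivial result.

    Falls back to the whole page when no rule produces content, so unmatched
    pages can be found later by searching the stored content for "body".
    """
    content = ""
    for marker, extractor in rules:
        if marker in html and len(none_to_str(content)) < min_len:
            content = extractor(html)
    if len(none_to_str(content)) < min_len:
        content = html
    return content


def cut_artical(html):
    """Toy extractor: slice out the div with class "artical" (no nested divs)."""
    start = html.find('<div class="artical"')
    end = html.find("</div>", start)
    if start == -1 or end == -1:
        return None
    return html[start:end + len("</div>")]


rules = [('class="artical"', cut_artical)]
page_a = '<html><body><div class="artical"><p>text</p></div></body></html>'
page_b = "<html><body><div class='other'>text</div></body></html>"
```

Supporting a new detail-page layout then means appending one more `(marker, extractor)` pair to `rules`, without touching the existing ones.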

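2. A special note

The developers of some sites go to absurd lengths, with something like 9 different layouts across 10 pages. Since we cannot open every page to inspect the CSS of its detail page, there is a general workaround.

  • After the first crawl finishes, open MongoDB and run the command below. It selects the records whose content contains "body": these were not scraped with a specific layout rule; the whole page was stored instead.
db.your_collection_name.find({content:/body/})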


  • Open any of the returned links, add handling for that detail-page layout, and repeat until the Mongo command returns nothing.
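
The same check can be run in Python over records that have already been loaded from the collection. This is a sketch under assumptions: the field names `content` and `url` follow the item fields used earlier, `find_unparsed` is a hypothetical helper, and the sample records are made up.

```python
import re

# Same pattern as the /body/ regular expression in the Mongo query.
BODY_PATTERN = re.compile(r"body")


def find_unparsed(records):
    """Return the URLs of records whose content still contains 'body'."""
    return [
        rec["url"]
        for rec in records
        if BODY_PATTERN.search(rec.get("content") or "")
    ]


records = [
    {"url": "http://example.com/a.shtml", "content": '<div class="artical">ok</div>'},
    {"url": "http://example.com/b.shtml", "content": "<html><body>raw page</body></html>"},
    {"url": "http://example.com/c.shtml", "content": None},
]
todo = find_unparsed(records)
```

With pymongo, the equivalent filter would be `collection.find({"content": {"$regex": "body"}})`. Either way, an article whose own text happens to contain the word "body" will also be matched, so a hit is a prompt to look at the page, not proof that a rule is missing.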


Reposted from blog.csdn.net/qq_20288327/article/details/114128546