Automatic Qiushibaike crawler with the Scrapy framework

Qiushibaike automatic crawler:

1. In cmd, go to your working folder and create the Scrapy project:

>scrapy startproject qsauto
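For reference, startproject generates a project skeleton roughly like the following (the exact file list varies a little between Scrapy versions):

qsauto/
    scrapy.cfg
    qsauto/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py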

2. Go into the project folder and generate the crawl spider:

>scrapy genspider -t crawl cw qiushibaike.com  (cw is the name of my spider file; qiushibaike.com is the Qiushibaike domain)
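The -t crawl option builds the spider from Scrapy's crawl template, so before editing, qsauto/spiders/cw.py should look roughly like this (the placeholder rule and URLs differ slightly between versions):

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CwSpider(CrawlSpider):
    name = 'cw'
    allowed_domains = ['qiushibaike.com']
    start_urls = ['http://qiushibaike.com/']

    rules = (
        Rule(LinkExtractor(allow=r'Items/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        i = {}
        return i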

3. Open the project in PyCharm.

Edit items.py:

# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class QsautoItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    content = scrapy.Field()  # container for the scraped joke text
    link = scrapy.Field()     # container for the page's canonical link
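Fields declared with scrapy.Field() make the item behave like a dictionary. A small standalone sketch (mine, not from the original post; the values are made up for illustration):

from qsauto.items import QsautoItem

item = QsautoItem()
item['content'] = ['some joke text']   # assign a list, since extract() returns a list
item['link'] = ['http://www.qiushibaike.com/article/123']  # hypothetical URL, illustration only
print(item['content'], item['link'])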

Edit the spider file cw.py:

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from qsauto.items import QsautoItem


class CwSpider(CrawlSpider):
    name = 'cw'
    allowed_domains = ['qiushibaike.com']  # only follow links on this domain
    # start_urls = ['http://qiushibaike.com/']

    # follow every link whose URL contains 'article' and hand the response to parse_item
    rules = (
        Rule(LinkExtractor(allow='article'), callback='parse_item', follow=True),
    )

    def start_requests(self):
        # send the first request with a browser User-Agent header instead of Scrapy's default
        ua = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0'}
        yield Request('http://www.qiushibaike.com/', headers=ua)

    def parse_item(self, response):
        i = QsautoItem()
        i['content'] = response.xpath('//div[@class="content"]/text()').extract()
        i['link'] = response.xpath('//link[@rel="canonical"]/@href').extract()
        print(i['content'])
        print(i['link'])
        return i
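Before running the full crawl, the two XPath expressions can be checked interactively with scrapy shell. A possible session (results depend on the live page, and the site may also require the browser user-agent configured below):

>scrapy shell http://www.qiushibaike.com/
>>> response.xpath('//div[@class="content"]/text()').extract()
>>> response.xpath('//link[@rel="canonical"]/@href').extract()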

Because the spider follows links automatically, every request it sends (not only the first one) must identify itself as a normal browser.

So change USER_AGENT in settings.py to your browser's user-agent string:

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0'
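With the user-agent in place, start the spider from the project's root folder; the -o option is optional and writes the collected items to a file (the filename here is just an example):

>scrapy crawl cw
>scrapy crawl cw -o items.json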


Reposted from blog.csdn.net/xx20cw/article/details/84313771