Web scraping - a problem and solution encountered with the Scrapy framework

1. Code

import re

import scrapy

from Fang.items import esf_FangItem


class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['www.fang.com']
    start_urls = ['https://www.fang.com/SoufunFamily.htm']

    def parse(self, response):
        trs = response.xpath('//div[@id="c02"]//tr')
        province = None
        for tr in trs:
            # The province cell is only filled in on the first row of each
            # province block; reuse the last seen value on the other rows.
            province_f = tr.xpath('./td[2]//text()').get()
            if province_f:
                province_f = re.sub(r"\s", "", province_f)
            if province_f:
                province = province_f
            cities = tr.xpath('./td[3]/a')
            for i in cities:
                city = i.xpath('./text()').get()
                city_url = i.xpath('./@href').get()
                yield scrapy.Request(url=city_url, callback=self.parse_url,
                                     meta={'info': (province, city)})

    def parse_url(self, response):
        print(2)

2. Problem: when the project runs, parse_url is never executed, i.e. it never prints 2

  The log output is as follows:

2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'bj.fang.com': <GET http://bj.fang.com/>
2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'sh.fang.com': <GET http://sh.fang.com/>
2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'tj.fang.com': <GET http://tj.fang.com/>
2020-03-20 17:08:02 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'cq.fang.com': <GET http://cq.fang.com/>

3. Solution

  Searching for the error shows that the follow-up requests are being dropped by Scrapy's OffsiteMiddleware: the city pages live on subdomains such as bj.fang.com and sh.fang.com, which do not match allowed_domains = ['www.fang.com'], so the requests are filtered as offsite before they are ever downloaded.

  Fixes:

Option 1:
  Remove the line allowed_domains = ['www.fang.com'] entirely,
  or relax it so the subdomains match: allowed_domains = ['fang.com']

Option 2:
  Add dont_filter=True to the request (not recommended, since it also bypasses
  the duplicate-request filter):
  yield scrapy.Request(url=city_url, callback=self.parse_url, meta={'info': (province, city)}, dont_filter=True)
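To see why Option 1 works, here is a minimal pure-Python sketch of the domain check (an illustrative re-implementation, not Scrapy's actual OffsiteMiddleware code): a host is considered on-site if it equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse


def is_offsite(url, allowed_domains):
    """Simplified sketch of the offsite check: a host passes if it equals
    an allowed domain or ends with '.' + an allowed domain."""
    host = urlparse(url).netloc
    return not any(
        host == domain or host.endswith("." + domain)
        for domain in allowed_domains
    )


# With the original setting, the city subdomains are dropped:
print(is_offsite("http://bj.fang.com/", ["www.fang.com"]))  # True  -> filtered
# After relaxing allowed_domains, they pass through:
print(is_offsite("http://bj.fang.com/", ["fang.com"]))      # False -> allowed
```

'bj.fang.com' does not equal and is not a subdomain of 'www.fang.com' (it is a sibling), which is exactly why the log shows "Filtered offsite request".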
 
 


Reposted from www.cnblogs.com/JackShi/p/12532987.html