A quick first try at a Scrapy crawler

  • I think the official Scrapy documentation is well written; if you want to pick up the Scrapy framework quickly, go read it:
    Scrapy official documentation
  • The framework's data flow works like this: the spider issues a request → scheduler → downloader middleware → downloader → (response) → downloader middleware → spider middleware → spider → pipeline (which saves the output). All components are connected through the engine. A minimal middleware sketch follows the project-creation script below.
  • 1. Create a project: scrapy startproject xxxx
  • 2. Create a spider: cd xxxx, then scrapy genspider spider_name domain
  • Or do both from code:
import os
import subprocess

from scrapy import cmdline


def makeproject():
    """Create a Scrapy project, then a spider inside it."""
    name = input('Project name: ')
    # cmdline.execute() calls sys.exit() when the command finishes, so run
    # the first command in a subprocess to keep this script alive.
    subprocess.run(['scrapy', 'startproject', name], check=True)
    os.chdir(name)
    spider_name = input('Spider name: ')
    domain_name = input('Domain to crawl: ')
    cmdline.execute(['scrapy', 'genspider', spider_name, domain_name])


def run_spider():
    """Start a spider by name."""
    name = input('Name of the spider to start: ')
    os.chdir('./qiushi')
    cmdline.execute(['scrapy', 'crawl', name])


if __name__ == '__main__':
    run_spider()
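
To make the data-flow bullet above concrete: a downloader middleware sits between the engine and the downloader, seeing every request on the way out and every response on the way back. Below is a minimal sketch of my own (not from the original post; the class name is made up, but process_request and process_response are Scrapy's real hooks). You would enable it via the DOWNLOADER_MIDDLEWARES setting, e.g. {'qiushi.middlewares.LogTimingMiddleware': 543}.

# Illustrative downloader middleware (not part of the original post).
class LogTimingMiddleware:

    def process_request(self, request, spider):
        # Called for each request on its way engine -> downloader.
        spider.logger.debug(f'requesting {request.url}')
        return None  # None = keep going down the middleware chain

    def process_response(self, request, response, spider):
        # Called for each response on its way downloader -> engine -> spider.
        spider.logger.debug(f'{response.status} from {response.url}')
        return response  # must return a Response (or a new Request)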
  • The spider. Scrapy's Selector supports XPath, CSS, and regular expressions; I went with CSS here (even though I hadn't used CSS in a long while).
import scrapy

from qiushi.items import QiushiItem  # assuming the project is named "qiushi", as above


class QsSpider(scrapy.Spider):
    name = 'qs'
    allowed_domains = ['lovehhy.net']
    start_urls = ['http://www.lovehhy.net/Joke/Detail/QSBK/']

    def parse(self, response):
        h3_list = response.css('#footzoon .red')
        h_list = list()
        for h3 in h3_list:
            title = h3.css('a::text').get()  # text of the <a> tag; get() == extract_first()
            detail_url = 'http://www.lovehhy.net' + h3.css('a::attr(href)').get()
            h_list.append((title, detail_url))

        div_list = response.xpath("//div[@id='endtext']")
        # div_list = response.css('#endtext')  # equivalent CSS version
        con_list = list()
        for con in div_list:
            content = con.css('div::text').getall()[0].strip()  # getall() == extract()
            con_list.append(content)

        for (title, url), content in zip(h_list, con_list):
            # A plain dict with the same keys would also work instead of an Item:
            # item = {'title': title, 'url': url, 'content': content}
            yield QiushiItem(title=title, url=url, content=content)
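
The spider yields QiushiItem, but the post never shows items.py; reconstructed from the three fields used above, it would be just:

# items.py -- reconstructed from the fields the spider uses (not shown in the post)
import scrapy


class QiushiItem(scrapy.Item):
    title = scrapy.Field()
    url = scrapy.Field()
    content = scrapy.Field()

As for the third selector flavor mentioned above, regular expressions hang off any selector, e.g. response.css('a::text').re_first(r'\d+') returns the first match as a string instead of a selector.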
  • pipelines (excerpt)
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        """This method is called when the spider is opened."""
        self.file = open('joek.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        """This method is called when the spider is closed."""
        self.file.close()

    def process_item(self, item, spider):
        """Called for every item; a pipeline should return the item
        so later pipeline components can see it (or raise DropItem)."""
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item

Note: be sure to enable your pipeline in the settings file.

Pipelines are ordered by priority: the smaller the number, the earlier the pipeline runs.
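
A minimal sketch of that settings entry (the module path 'qiushi.pipelines' is my assumption, based on the project name used earlier):

# settings.py (module path assumed from the project name above)
ITEM_PIPELINES = {
    'qiushi.pipelines.JsonWriterPipeline': 300,  # lower number = runs earlier (0-1000 by convention)
}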

Reposted from blog.csdn.net/weixin_44224529/article/details/103871749