- The official Scrapy documentation is well written; if you want to learn the Scrapy framework quickly, go read the official docs.
- The framework's data flow works like this: the spider issues a request → scheduler → downloader middleware → downloader → (response) → downloader middleware → spider middleware → spider → item pipeline (persists the output). All components are connected through the engine.
- 1. Create a project: `scrapy startproject xxxx`
- 2. Create a spider: `cd xxxx`, then `scrapy genspider spider_name domain`
- Creating them from code:

```python
import os
import subprocess

from scrapy import cmdline


def makeproject():
    """Create a Scrapy project and a spider inside it."""
    name = input('Enter the project name: ')
    # scrapy.cmdline.execute() calls sys.exit() when the command finishes,
    # so use subprocess for the intermediate steps instead.
    subprocess.run(['scrapy', 'startproject', name], check=True)
    os.chdir(name)
    spider_name = input('Enter the spider name: ')
    domain_name = input('Enter the domain to crawl: ')
    subprocess.run(['scrapy', 'genspider', spider_name, domain_name], check=True)


def run_spider():
    """Run a spider by name."""
    name = input('Enter the name of the spider to run: ')
    os.chdir('./qiushi')  # the project directory
    cmdline.execute(['scrapy', 'crawl', name])


if __name__ == '__main__':
    run_spider()
```
- The spider. `Selector` supports XPath, CSS, and regular expressions; I went with CSS (though I hadn't used CSS in a while).
```python
import scrapy

from qiushi.items import QiushiItem  # adjust to your project's items module


class QsSpider(scrapy.Spider):
    name = 'qs'
    allowed_domains = ['lovehhy.net']
    start_urls = ['http://www.lovehhy.net/Joke/Detail/QSBK/']

    def parse(self, response):
        h3_list = response.css('#footzoon .red')
        h_list = list()
        for h3 in h3_list:
            title = h3.css('a::text').get()  # text of the <a> tag; get() == extract_first()
            detail_url = 'http://www.lovehhy.net' + h3.css('a::attr(href)').extract_first()
            h_list.append((title, detail_url))
        div_list = response.xpath("//div[@id='endtext']")
        # div_list = response.css('#endtext')
        con_list = list()
        for con in div_list:
            content = con.css('div::text').getall()[0].strip()  # getall() == extract()
            con_list.append(content)
        for i in zip(h_list, con_list):
            item = QiushiItem(title=i[0][0], url=i[0][1], content=i[1])
            # item = dict()
            # item['title'] = i[0][0]
            # item['url'] = i[0][1]
            # item['content'] = i[1]
            yield item
```
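The `zip()` pairing above is worth a second look. A minimal sketch with plain lists in place of real scraped data (the sample values are made up) shows how the title/url tuples and the content strings are combined into items:

```python
# Stand-ins for the scraped h_list and con_list (sample data, not real output)
h_list = [('joke one', 'http://www.lovehhy.net/a'),
          ('joke two', 'http://www.lovehhy.net/b')]
con_list = ['first body text', 'second body text']

items = []
for (title, url), content in zip(h_list, con_list):
    # mirrors QiushiItem(title=..., url=..., content=...) as a plain dict
    items.append({'title': title, 'url': url, 'content': content})

print(items[0])
```

Note that `zip()` stops at the shorter list, so if a page ever yields unequal numbers of headings and bodies, the extra entries are silently dropped.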
- The pipelines (excerpt):

```python
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json


class JsonWriterPipeline(object):

    def open_spider(self, spider):
        """Called when the spider is opened."""
        self.file = open('joek.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        """Called when the spider is closed."""
        self.file.close()

    def process_item(self, item, spider):
        """Called for every item that passes through the pipeline."""
        line = json.dumps(dict(item), ensure_ascii=False) + '\n'
        self.file.write(line)
        return item  # pass the item on to any later pipelines
```
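The `.jl` file the pipeline produces is JSON Lines: one JSON object per line, with `ensure_ascii=False` keeping Chinese text readable. A small stdlib-only sketch (with a made-up item) of writing and reading that format:

```python
import json

# A made-up item, standing in for what the pipeline receives
items = [{'title': 'joke', 'url': 'http://www.lovehhy.net/a', 'content': '内容'}]

# Writing: one JSON object per line, non-ASCII left as-is
lines = [json.dumps(item, ensure_ascii=False) + '\n' for item in items]

# Reading it back is one json.loads() per line
parsed = [json.loads(line) for line in lines]
print(parsed[0]['content'])
```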
Note: be sure to enable your pipeline in the settings file.
Pipelines have priorities; the smaller the number, the earlier the pipeline runs.
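As a sketch, enabling the pipeline in `settings.py` might look like this (the `qiushi` module path is an assumption based on the project name used earlier):

```python
# settings.py (excerpt) — the "qiushi" path assumes the project is named qiushi
ITEM_PIPELINES = {
    'qiushi.pipelines.JsonWriterPipeline': 300,  # priority 0-1000; lower runs first
}
```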