Creating and using Scrapy in Python: running Scrapy code, fetching data with the Scrapy framework, and a tutorial on Scrapy's XPath.

To install Scrapy, open cmd and run pip install scrapy.

Create a project
scrapy startproject tencent
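
Running this generates a project skeleton roughly like the following (the later listings in this article use a project named pachong, but the layout is the same):

tencent/
    scrapy.cfg            # deployment config
    tencent/              # the project's Python package
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py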

Create a spider
scrapy genspider hr tencent.com
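
This writes spiders/hr.py with a skeleton roughly like this (the exact template varies a little between Scrapy versions):

import scrapy


class HrSpider(scrapy.Spider):
    name = 'hr'                           # the name used with scrapy crawl
    allowed_domains = ['tencent.com']     # requests outside this domain are filtered
    start_urls = ['http://tencent.com/']

    def parse(self, response):
        # default callback for each downloaded response
        pass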

items.py explained

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html

import scrapy


class PachongItem(scrapy.Item):
    # define the fields for your item here, like:
    # name = scrapy.Field()
    pass

Field() is essentially a subclass of dict, so an item naturally stores its data as {key: value} pairs.
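
As a quick hedged sketch (the name field here is only an example, not from the original project), defining a field and using the item looks like this:

import scrapy


class PachongItem(scrapy.Item):
    name = scrapy.Field()


item = PachongItem()
item['name'] = '52倍人生——戴锦华大师电影课'   # assign like a dict
print(item['name'])                             # read like a dict
print(dict(item))                               # convert to a plain dict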

middlewares.py

This file contains the downloader middleware and the spider middleware. It defines two classes:

class PachongSpiderMiddleware:        # spider middleware
	pass

class PachongDownloaderMiddleware:    # downloader middleware
	pass
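
As a hedged sketch of what the downloader middleware is for: its process_request() hook sees every outgoing request and can modify it, for example to inject a header (the header value below is only an illustration):

class PachongDownloaderMiddleware:
    def process_request(self, request, spider):
        # Called for each request that passes through the downloader.
        # Returning None tells Scrapy to keep processing the request normally.
        request.headers.setdefault(b'User-Agent', b'Mozilla/5.0')
        return None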


pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html


# useful for handling different item types with a single interface
from itemadapter import ItemAdapter


class PachongPipeline:
    def process_item(self, item, spider):
        return item
	

If you want your spider to use a pipeline, don't forget to enable it in the ITEM_PIPELINES setting (shown in settings.py below).
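
For example, a hedged sketch of a pipeline that appends every item to a JSON-lines file (the filename is an assumption; the pattern mirrors the one in the Scrapy docs):

import json


class JsonWriterPipeline:
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open('items.jl', 'w', encoding='utf-8')

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), ensure_ascii=False) + '\n')
        return item    # pass the item on to any later pipeline

It would be enabled in settings.py with something like:

ITEM_PIPELINES = {'pachong.pipelines.JsonWriterPipeline': 300}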

settings.py

Now let's look at the last file, with a quick walkthrough.

# Scrapy project name
BOT_NAME = 'pachong'    # the name of the project I created

SPIDER_MODULES = ['pachong.spiders']    # my spiders live in the pachong/spiders directory
NEWSPIDER_MODULE = 'pachong.spiders'

# Whether to obey robots.txt
ROBOTSTXT_OBEY = True    # the robots.txt rule; this can be turned off

# Maximum concurrency. The default is 16; uncomment and adjust the number to change it.
# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 32

# Download delay. It is commented out by default (no delay); if you crawl too fast, enable it to throttle requests.
#DOWNLOAD_DELAY = 3


# The next setting is fairly important.

# Override the default request headers:
# This is the request-header block; add the headers the target site expects.
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
}


# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
SPIDER_MIDDLEWARES = {
    'pachong.middlewares.PachongSpiderMiddleware': 543,
}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'pachong.middlewares.PachongDownloaderMiddleware': 543,
}



# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'pachong.pipelines.PachongPipeline': 300,
#}

Now let's run the spider. In the terminal, enter:

scrapy crawl hr	

We get output like the following:

2021-03-24 09:51:05 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: pachong)

09:51:05 ------------ start time
Scrapy 2.4.1 -------- Scrapy version
(bot: pachong) ------ project name

2021-03-24 09:51:05 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.2.1, Platform Windows-10-10.0.18362-SP0

lxml 4.4.1.0 ------ library version
Twisted 20.3.0 ---- library version
Python 3.7.4 ------ Python version
Windows-10-10.0.18362-SP0 ---------- OS version
 
2021-03-24 09:51:05 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-03-24 09:51:05 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'pachong',
 'NEWSPIDER_MODULE': 'pachong.spiders',
 'ROBOTSTXT_OBEY': True,
 'SPIDER_MODULES': ['pachong.spiders']}

'ROBOTSTXT_OBEY': True ----- note that the robots.txt setting is still True at this point
 
 
2021-03-24 09:51:05 [scrapy.extensions.telnet] INFO: Telnet Password: 28a8213a0b08cff4
2021-03-24 09:51:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2021-03-24 09:51:05 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'pachong.middlewares.PachongDownloaderMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-03-24 09:51:05 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'pachong.middlewares.PachongSpiderMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-03-24 09:51:05 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-03-24 09:51:05 [scrapy.core.engine] INFO: Spider opened

Spider opened --- the spider has been opened; Scrapy now loads it and starts crawling.


2021-03-24 09:51:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-03-24 09:51:05 [hr] INFO: Spider opened: hr
2021-03-24 09:51:05 [hr] INFO: Spider opened: hr
2021-03-24 09:51:05 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-03-24 09:51:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://tencent.com/robots.txt> from <GET http://tencent.com/robots.txt>

Redirecting (302) ---- not a real failure: the robots.txt request was redirected from http to https

2021-03-24 09:51:06 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://tencent.com/robots.txt> (referer: None)

Crawled (404) ---- this request failed: the site has no robots.txt at that URL

2021-03-24 09:51:06 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://tencent.com/> from <GET http://tencent.com/>

DEBUG: Redirecting (302) ---- the main-page request was redirected the same way

2021-03-24 09:51:06 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://tencent.com/> (referer: None)

DEBUG: Crawled (200) ---- request succeeded

2021-03-24 09:51:07 [scrapy.core.engine] INFO: Closing spider (finished)


We can see the request for the site's robots.txt failed (404), while the request to the main site succeeded. We need to adjust the spider's settings.


Closing spider (finished) ------ the crawl ends here. Skim past the stats dump below and pick up the text after it.


2021-03-24 09:51:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 864,
 'downloader/request_count': 4,
 'downloader/request_method_count/GET': 4,
 'downloader/response_bytes': 1185,
 'downloader/response_count': 4,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/302': 2,
 'downloader/response_status_count/404': 1,
 'elapsed_time_seconds': 1.168874,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 3, 24, 1, 51, 7, 78523),
 'log_count/DEBUG': 4,
 'log_count/INFO': 12,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/404': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2021, 3, 24, 1, 51, 5, 909649)}
2021-03-24 09:51:07 [scrapy.core.engine] INFO: Spider closed (finished)

Let's modify the settings.py file:

ROBOTSTXT_OBEY = False    # change it to False

Then add a fuller set of request headers:

DEFAULT_REQUEST_HEADERS = {
    'Connection': 'keep-alive',
    'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.90 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'same-origin',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Referer': 'https://www.douban.com/',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}

Note that the Referer header points at https://www.douban.com/; as the log below shows, the spider's target was switched to Douban as well. Run the spider again and we get:
2021-03-24 10:28:12 [scrapy.utils.log] INFO: Scrapy 2.4.1 started (bot: pachong)
2021-03-24 10:28:12 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.4 (tags/v3.7.4:e09359112e, Jul  8 2019, 20:34:20) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h  22 Sep 2020), cryptography 3.2.1, Platform Windows-10-10.0.18362-SP0
2021-03-24 10:28:12 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2021-03-24 10:28:12 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'pachong',
 'NEWSPIDER_MODULE': 'pachong.spiders',
 'SPIDER_MODULES': ['pachong.spiders']}
2021-03-24 10:28:12 [scrapy.extensions.telnet] INFO: Telnet Password: fc28b3411aa3a1b4
2021-03-24 10:28:12 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2021-03-24 10:28:12 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',   
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',     
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',   
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-03-24 10:28:12 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-03-24 10:28:12 [scrapy.middleware] INFO: Enabled item pipelines:
[]


2021-03-24 10:28:12 [scrapy.core.engine] INFO: Spider opened
2021-03-24 10:28:12 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-03-24 10:28:12 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2021-03-24 10:28:13 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.douban.com/> (referer: https://www.douban.com/)

DEBUG: Crawled (200) --- with ROBOTSTXT_OBEY turned off, the robots.txt request is gone, and with the request headers added we can now access the page.

****************************************************************************************************
<200 https://www.douban.com/>
****************************************************************************************************
2021-03-24 10:28:13 [scrapy.core.engine] INFO: Closing spider (finished)




2021-03-24 10:28:13 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 666,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 18450,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'elapsed_time_seconds': 0.872716,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2021, 3, 24, 2, 28, 13, 404170),
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2021, 3, 24, 2, 28, 12, 531454)}
2021-03-24 10:28:13 [scrapy.core.engine] INFO: Spider closed (finished)

The request succeeded. To make the results easier to read, let's silence the extra logging by adding one line to settings.py:

LOG_LEVEL = 'WARNING'


Run the code again and the log noise is gone; we get the data directly, like this:

****************************************************************************************************
<200 https://www.douban.com/>
<class 'scrapy.http.response.html.HtmlResponse'>
{'name': '了不起的文明现场——一线考古队长带你探秘历史'}
<class 'str'>
{'name': '人人听得懂用得上的法律课'}
<class 'str'>
{'name': '如何读透一本书——12堂阅读写作训练课'}
<class 'str'>
{'name': '52倍人生——戴锦华大师电影课'}
<class 'str'>
{'name': '我们的女性400年——文学里的女性主义简史'}
<class 'str'>
{'name': '用性别之尺丈量世界——18堂思想课解读女性问题'}
<class 'str'>
{'name': '哲学闪耀时——不一样的西方哲学史'}
<class 'str'>
{'name': '读梦——村上春树长篇小说指南'}
<class 'str'>
{'name': '拍张好照片——跟七七学生活摄影'}
<class 'str'>
{'name': '白先勇细说红楼梦——从小说角度重解“红楼”'}
<class 'str'>
****************************************************************************************************

Now let's get to how the data is actually extracted. You may have noticed that I already scraped some of it above.

Yes, and the parser used is Scrapy's built-in response.xpath() method, which locates and extracts values just like lxml does.

For example:

html = response.xpath('//ul[@class="time-list"]/li/a[2]/text()')

But the value we get back looks like this:

[<Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='了不起的文明现场——一线考古队长带你探秘历史'>,
 <Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='52倍人生——戴锦华大师电影课'>,
 <Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='我们的女性400年——文学里的女性主义简史'>,
 <Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='用性别之尺丈量世界——18堂思想课解读女性问题'>,
 <Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='哲学闪耀时——不一样的西方哲学史'>,
 <Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='读梦——村上春树长篇小说指南'>,
 <Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='拍张好照片——跟七七学生活摄影'>,
 <Selector xpath='//ul[@class="time-list"]/li/a[2]/text()' data='白先勇细说红楼梦——从小说角度重解“红楼”'>]

With lxml, the extracted value is a plain list. Here we also get a list, but its elements are not plain strings; each one is a Selector object made up of three parts:

<Selector xpath='XXX' data='xxx'>

The first part is the object type, the xpath part is the path we queried with, and data is the value we actually want.

Scrapy wraps several methods of its own inside these Selector objects, such as xpath() and css(). To pull the data out of a Selector, there are a few methods:

# old API
extract_first()  --> returns a single value
extract()        --> returns a list
# new API
get()            --> returns a single value
getall()         --> returns a list

Both the old and new methods still work. When nothing matches, get() and extract_first() return None, while getall() and extract() return an empty list.
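
A short hedged illustration using the XPath from above:

sel = response.xpath('//ul[@class="time-list"]/li/a[2]/text()')

sel.get()              # first match as a str, or None if nothing matched
sel.getall()           # all matches as a list of str, [] if nothing matched
sel.extract_first()    # old-style equivalent of get()
sel.extract()          # old-style equivalent of getall()

sel.get(default='not found')    # get()/extract_first() also accept a fallback default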

Let's try the new get():

html = response.xpath('//ul[@class="time-list"]/li/a[2]/text()').get()

This returns:

了不起的文明现场——一线考古队长带你探秘历史

That's the right value. Now that extraction works, let's talk about how to run the spider.

Method one: create a .py file in the same directory as scrapy.cfg, with the following code:

from scrapy import cmdline

cmdline.execute('scrapy crawl hr'.split())

Running that file launches our Scrapy spider, which is convenient when working from an IDE.
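
For completeness, here is a hypothetical reconstruction of what the hr spider might look like to produce the {'name': ...} output shown earlier; the start URL, XPath, and print statements are inferred from the log, not taken from the article's actual file:

import scrapy


class HrSpider(scrapy.Spider):
    name = 'hr'
    start_urls = ['https://www.douban.com/']

    def parse(self, response):
        print('*' * 100)
        print(response)           # <200 https://www.douban.com/>
        print(type(response))     # <class 'scrapy.http.response.html.HtmlResponse'>
        for title in response.xpath('//ul[@class="time-list"]/li/a[2]/text()').getall():
            print({'name': title})
            print(type(title))    # <class 'str'>
            yield {'name': title}
        print('*' * 100)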


Reposted from blog.csdn.net/weixin_50123771/article/details/115180880