Scrapy Crawlers: Fixing Garbled Chinese Output (Mojibake)


Problem description:

I.

# The .csv feed file: the Chinese text is garbled.


[root@Uu jianshu]# cat jianshu.csv 
url,title,author
http://www.jianshu.com/p/2a7a594816e1,彖浣犳                   村?鏍?
[root@Uu jianshu]#                            璋㈣传绌凤兼娉绗锛?

II.

# The .json feed file: the Chinese text also displays incorrectly (as \uXXXX escapes).


[root@Uu jianshu]# cat jianshu.json 
[
{"url": "http://www.jianshu.com/p/2a7a594816e1", "title": ["\u542c\u8bf4\u4f60\u611f\u8c22\u8d2b\u7a77\uff0c\u6211\u60f3\u7b11\uff0c\u5374\u54ed\u4e86"], "author": ["\u65e0\u6212"]}
]
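
The \uXXXX sequences in the JSON feed are not corrupted data; they are the ASCII escapes that Python's json module emits for non-ASCII characters by default, and Scrapy's JSON exporter behaves the same way when no export encoding is configured. A minimal sketch of my own (not the project's code) showing the same behaviour:

# -*- coding: utf-8 -*-
# json.dumps escapes non-ASCII by default (ensure_ascii=True), which is
# why the feed shows \u65e0\u6212 instead of the Chinese characters.
import json

item = {"author": [u"\u65e0\u6212"]}           # the item scraped above

print(json.dumps(item))                        # {"author": ["\u65e0\u6212"]}
print(json.dumps(item, ensure_ascii=False))    # {"author": ["无戒"]}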

Troubleshooting:

I. First guess: UTF-8. The attempt went as follows:

# Attempt with UTF-8:


[root@Uu jianshu]# vi settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for jianshu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jianshu'

SPIDER_MODULES = ['jianshu.spiders']
NEWSPIDER_MODULE = 'jianshu.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jianshu (+http://www.yourdomain.com)'

USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

#FEED_URL = u'/home/BS/jianshu.json'
#FEED_FORMAT = 'json'
FEED_EXPORT_ENCODING = 'UTF-8'
#FEED_EXPORT_ENCODING = 'GBK'
#FEED_EXPORT_ENCODING = 'GB2312'
"settings.py" 98L, 3371C written
[root@Uu jianshu]# cd ..
[root@Uu jianshu]# ll
total 8
drwxr-xr-x. 3 root root 174 Aug 28 22:35 jianshu
-rw-r--r--. 1 root root 117 Aug 28 22:34 jianshu.json
-rw-r--r--. 1 root root 257 Aug 28 14:44 scrapy.cfg
[root@Uu jianshu]# rm -f jianshu.json 
[root@Uu jianshu]# scrapy crawl jianshu -o jianshu.json
2018-08-28 22:35:51 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: jianshu)
2018-08-28 22:35:51 [scrapy.utils.log] INFO: Versions: lxml 3.2.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.5 (default, Jul 13 2018, 13:06:57) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core
2018-08-28 22:35:51 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jianshu.spiders', 'FEED_URI': 'jianshu.json', 'CONCURRENT_REQUESTS': 1, 'SPIDER_MODULES': ['jianshu.spiders'], 'BOT_NAME': 'jianshu', 'COOKIES_ENABLED': False, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)', 'FEED_FORMAT': 'json', 'FEED_EXPORT_ENCODING': 'UTF-8', 'DOWNLOAD_DELAY': 5}
2018-08-28 22:35:51 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-08-28 22:35:51 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-28 22:35:51 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-28 22:35:51 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-08-28 22:35:51 [scrapy.core.engine] INFO: Spider opened
2018-08-28 22:35:51 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-28 22:35:51 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-28 22:35:51 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.jianshu.com/trending/monthly> from <GET http://www.jianshu.com/trending/monthly>
2018-08-28 22:35:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jianshu.com/trending/monthly> (referer: None)
2018-08-28 22:35:57 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jianshu.com/trending/monthly>
{'author': [u'\u65e0\u6212'],
 'title': [u'\u542c\u8bf4\u4f60\u611f\u8c22\u8d2b\u7a77\uff0c\u6211\u60f3\u7b11\uff0c\u5374\u54ed\u4e86'],
 'url': u'http://www.jianshu.com/p/2a7a594816e1'}
2018-08-28 22:35:57 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-28 22:35:57 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: jianshu.json
2018-08-28 22:35:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 606,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 10881,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 8, 28, 14, 35, 57, 854597),
 'item_scraped_count': 1,
 'log_count/DEBUG': 4,
 'log_count/INFO': 8,
 'memusage/max': 42971136,
 'memusage/startup': 42971136,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2018, 8, 28, 14, 35, 51, 387501)}
2018-08-28 22:35:57 [scrapy.core.engine] INFO: Spider closed (finished)
[root@Uu jianshu]# cat jianshu.json 
[
{"url": "http://www.jianshu.com/p/2a7a594816e1", "title": ["彖浣犳                   村?], "author": ["鏍?]}                                          璋㈣传绌凤兼娉绗锛?
[root@Uu jianshu]# 

As you can see, setting FEED_EXPORT_ENCODING = 'UTF-8' in settings.py does not solve the problem.

II. Next, try GBK (i.e. set FEED_EXPORT_ENCODING = 'GBK'). The process is as follows:

# Attempt with GBK:
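
For this run only the feed-export line in settings.py is changed (the rest of the file is identical to the listing above); the relevant excerpt:

# settings.py (excerpt): switch the feed export encoding to GBK
#FEED_EXPORT_ENCODING = 'UTF-8'
FEED_EXPORT_ENCODING = 'GBK'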



[root@Uu jianshu]# scrapy crawl jianshu -o jianshu.json
2018-08-28 22:32:40 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: jianshu)
2018-08-28 22:32:40 [scrapy.utils.log] INFO: Versions: lxml 3.2.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.5 (default, Jul 13 2018, 13:06:57) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core
2018-08-28 22:32:40 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jianshu.spiders', 'FEED_URI': 'jianshu.json', 'CONCURRENT_REQUESTS': 1, 'SPIDER_MODULES': ['jianshu.spiders'], 'BOT_NAME': 'jianshu', 'COOKIES_ENABLED': False, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)', 'FEED_FORMAT': 'json', 'FEED_EXPORT_ENCODING': 'GBK', 'DOWNLOAD_DELAY': 5}
2018-08-28 22:32:40 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-08-28 22:32:40 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-28 22:32:40 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-28 22:32:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-08-28 22:32:40 [scrapy.core.engine] INFO: Spider opened
2018-08-28 22:32:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-28 22:32:40 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-28 22:32:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.jianshu.com/trending/monthly> from <GET http://www.jianshu.com/trending/monthly>
2018-08-28 22:32:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jianshu.com/trending/monthly> (referer: None)
2018-08-28 22:32:46 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jianshu.com/trending/monthly>
{'author': [u'\u65e0\u6212'],
 'title': [u'\u542c\u8bf4\u4f60\u611f\u8c22\u8d2b\u7a77\uff0c\u6211\u60f3\u7b11\uff0c\u5374\u54ed\u4e86'],
 'url': u'http://www.jianshu.com/p/2a7a594816e1'}
2018-08-28 22:32:46 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-28 22:32:46 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: jianshu.json
2018-08-28 22:32:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 606,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 10879,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 8, 28, 14, 32, 46, 587323),
 'item_scraped_count': 1,
 'log_count/DEBUG': 4,
 'log_count/INFO': 8,
 'memusage/max': 42975232,
 'memusage/startup': 42975232,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2018, 8, 28, 14, 32, 40, 291948)}
2018-08-28 22:32:46 [scrapy.core.engine] INFO: Spider closed (finished)
[root@Uu jianshu]# cat jianshu.json 
[
{"url": "http://www.jianshu.com/p/2a7a594816e1", "title": ["听说你感谢贫穷,我想笑,却哭了"], "author": ["无戒"]}

Clearly the problem is solved: the Chinese text is now displayed correctly.
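
As a quick sanity check (my own sketch, not part of the original run), the feed file can be decoded explicitly as GBK; if the decode succeeds and the titles print cleanly, the file really was written in that encoding:

# -*- coding: utf-8 -*-
# Decode the exported feed as GBK and print it; a UnicodeDecodeError here
# would mean the file was written in some other encoding.
import io

with io.open('jianshu.json', 'r', encoding='gbk') as f:
    print(f.read())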

III. Now try GB2312 (i.e. set FEED_EXPORT_ENCODING = 'GB2312'). The process is as follows:

# Attempt with GB2312:


[root@Uu jianshu]# vi settings.py
# -*- coding: utf-8 -*-

# Scrapy settings for jianshu project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     https://doc.scrapy.org/en/latest/topics/settings.html
#     https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#     https://doc.scrapy.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'jianshu'

SPIDER_MODULES = ['jianshu.spiders']
NEWSPIDER_MODULE = 'jianshu.spiders'


# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'jianshu (+http://www.yourdomain.com)'

USER_AGENT = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)"

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
CONCURRENT_REQUESTS = 1

# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 5
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
#    'jianshu.pipelines.JianshuPipeline': 300,
#}

# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False

# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

#FEED_URL = u'/home/BS/jianshu.json'
#FEED_FORMAT = 'json'
#FEED_EXPORT_ENCODING = 'UTF-8'
#FEED_EXPORT_ENCODING = 'GBK'
FEED_EXPORT_ENCODING = 'GB2312'
"settings.py" 98L, 3371C written
[root@Uu jianshu]# cd ..
[root@Uu jianshu]# rm -f jianshu.json 
[root@Uu jianshu]# ll
total 4
drwxr-xr-x. 3 root root 174 Aug 28 22:44 jianshu
-rw-r--r--. 1 root root 257 Aug 28 14:44 scrapy.cfg
[root@Uu jianshu]# scrapy crawl jianshu -o jianshu.json
2018-08-28 22:45:25 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: jianshu)
2018-08-28 22:45:25 [scrapy.utils.log] INFO: Versions: lxml 3.2.1.0, libxml2 2.9.1, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.5 (default, Jul 13 2018, 13:06:57) - [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Linux-3.10.0-693.el7.x86_64-x86_64-with-centos-7.4.1708-Core
2018-08-28 22:45:25 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'jianshu.spiders', 'FEED_URI': 'jianshu.json', 'CONCURRENT_REQUESTS': 1, 'SPIDER_MODULES': ['jianshu.spiders'], 'BOT_NAME': 'jianshu', 'COOKIES_ENABLED': False, 'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; AcooBrowser; .NET CLR 1.1.4322; .NET CLR 2.0.50727)', 'FEED_FORMAT': 'json', 'FEED_EXPORT_ENCODING': 'GB2312', 'DOWNLOAD_DELAY': 5}
2018-08-28 22:45:25 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-08-28 22:45:26 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-08-28 22:45:26 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-08-28 22:45:26 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-08-28 22:45:26 [scrapy.core.engine] INFO: Spider opened
2018-08-28 22:45:26 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-08-28 22:45:26 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-08-28 22:45:26 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.jianshu.com/trending/monthly> from <GET http://www.jianshu.com/trending/monthly>
2018-08-28 22:45:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.jianshu.com/trending/monthly> (referer: None)
2018-08-28 22:45:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.jianshu.com/trending/monthly>
{'author': [u'\u65e0\u6212'],
 'title': [u'\u542c\u8bf4\u4f60\u611f\u8c22\u8d2b\u7a77\uff0c\u6211\u60f3\u7b11\uff0c\u5374\u54ed\u4e86'],
 'url': u'http://www.jianshu.com/p/2a7a594816e1'}
2018-08-28 22:45:32 [scrapy.core.engine] INFO: Closing spider (finished)
2018-08-28 22:45:32 [scrapy.extensions.feedexport] INFO: Stored json feed (1 items) in: jianshu.json
2018-08-28 22:45:32 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 606,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 10873,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 1,
 'downloader/response_status_count/301': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 8, 28, 14, 45, 32, 543578),
 'item_scraped_count': 1,
 'log_count/DEBUG': 4,
 'log_count/INFO': 8,
 'memusage/max': 42971136,
 'memusage/startup': 42971136,
 'response_received_count': 1,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2018, 8, 28, 14, 45, 26, 27174)}
2018-08-28 22:45:32 [scrapy.core.engine] INFO: Spider closed (finished)
[root@Uu jianshu]# cat jianshu.json 
[
{"url": "http://www.jianshu.com/p/2a7a594816e1", "title": ["听说你感谢贫穷,我想笑,却哭了"], "author": ["无戒"]}
][root@Uu jianshu]# 

Again the problem is solved: GB2312 also displays the Chinese text correctly (unsurprising, since GB2312 is the older character set that GBK extends).

Summary:

These experiments show that the garbled Chinese output from this Scrapy crawler can be fixed simply by setting FEED_EXPORT_ENCODING in settings.py.

However, in this environment only the values 'GBK' and 'GB2312' worked; setting it to 'UTF-8' did not solve the problem.
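
In other words, the only change needed in this project was one line in settings.py. The same setting can also be passed on the command line via Scrapy's -s option (for example, scrapy crawl jianshu -o jianshu.json -s FEED_EXPORT_ENCODING=GBK) if you prefer not to edit the file. A minimal excerpt of the working configuration:

# settings.py (excerpt): the one setting that removed the mojibake here
#FEED_EXPORT_ENCODING = 'UTF-8'   # did not help in this environment
FEED_EXPORT_ENCODING = 'GBK'      # 'GB2312' worked equally well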
