Scrapy shell
The Scrapy shell is an interactive shell where you can try out and debug your scraping code quickly, without having to run the spider. It is intended for testing data extraction code, but you can actually use it to test any kind of code, as it is also a regular Python shell.
The shell is used for testing XPath or CSS expressions and seeing how they work and what data they extract from the web pages you are trying to scrape. It allows you to interactively test your expressions while writing your spider, without having to run the spider to test every change.
Once you get familiar with the Scrapy shell, you will find it an invaluable tool for developing and debugging your spiders.
Configuring the shell
If you have IPython installed, the Scrapy shell will use it (instead of the standard Python console). The IPython console is much more powerful, providing smart auto-completion and colorized output, among other things.
Through Scrapy's settings you can configure the shell to use ipython, bpython or the standard python console, regardless of which are installed. This is done by setting the SCRAPY_PYTHON_SHELL environment variable, or by defining it in your scrapy.cfg:
[settings]
shell = bpython
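Equivalently, the console flavor can be chosen per invocation through the SCRAPY_PYTHON_SHELL environment variable, without editing scrapy.cfg. A minimal sketch (the URL here is just a placeholder, and bpython is assumed to be installed):

```shell
# Choose bpython for this invocation only, via the environment variable.
export SCRAPY_PYTHON_SHELL=bpython
scrapy shell https://scrapy.org
```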
Launching the shell
To launch the Scrapy shell, use the shell command like this:
scrapy shell <url>  # <url> is the URL of the page you want to scrape
The shell also works for local files. This can be handy if you want to play around with a local copy of a web page. The shell understands the following syntaxes for local files:
# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html

# File URI
scrapy shell file:///absolute/path/to/file.html
When using relative file paths, prefix them explicitly with ./ (or ../) to indicate that they refer to files:
scrapy shell ./index.html
Using the shell
The Scrapy shell is just a regular Python console (or IPython console, if you have it available) which provides some additional shortcut functions for convenience.
Available shortcuts
- shelp() - print a help listing of the available objects and shortcuts.
- fetch(url[, redirect=True]) - fetch a new response from the given URL and update all related objects accordingly. You can optionally ask for HTTP 3xx redirects not to be followed by passing redirect=False.
- fetch(request) - fetch a new response from the given request and update all related objects accordingly.
- view(response) - open the given response in your local web browser for inspection. This will add a <base> tag to the response body so that external links (such as stylesheets and images) display correctly. Note, however, that this creates a temporary file on your computer, which is not removed automatically.
Available Scrapy objects
The Scrapy shell automatically creates some convenient objects from the downloaded page, such as the Response object and Selector objects (for both HTML and XML content).
These objects are:
- crawler - the current Crawler object.
- spider - the Spider which is known to handle the URL, or a Spider object if there is no spider found for the current URL.
- request - a Request object of the last fetched page. You can modify this request using replace() or fetch a new request using the fetch shortcut.
- response - a Response object containing the last fetched page.
- settings - the current Scrapy settings.
Example of a shell session
Here is an example of a typical shell session, where we start by scraping the https://scrapy.org page and then proceed to scrape the https://reddit.com page. Finally, we modify the (reddit) request method to POST and re-fetch it, getting an error. We end the session by typing Ctrl-D (on Unix systems) or Ctrl-Z on Windows.
Keep in mind that the data extracted here may not be the same when you try it, as those pages are not static and could have changed by the time you test this. The only purpose of this example is to familiarize you with how the Scrapy shell works.
First, we launch the shell:
C:\Users\admin>scrapy shell www.scrapy.org --nolog
Then, the shell fetches the URL (using the Scrapy downloader) and prints the list of available objects and useful shortcuts (you'll notice that these lines all start with the [s] prefix).
After that, we can start playing with the objects:
In [1]: response.xpath('//title/text()').get()   # get the page title
Out[1]: 'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

In [5]: fetch('https://www.osgeo.cn/scrapy/topics/shell.html')  # switch to another URL

In [6]: response.xpath('//title//text()').get()  # get the page title
Out[6]: 'Scrapy shell — Scrapy 1.7.0 documentation'

In [7]: request = request.replace(method='POST')  # change the request method to POST

In [8]: fetch(request)      # fetch the modified request above

In [9]: response.status     # response status code
Out[9]: 405

In [10]: from pprint import pprint  # import the pprint module

In [11]: pprint(response.headers)   # print the response headers
{b'Content-Type': [b'text/html'],
 b'Date': [b'Sat, 10 Aug 2019 04:08:19 GMT'],
 b'Server': [b'nginx/1.10.3']}

In [12]: pprint(type(response.headers))  # print the type
<class 'scrapy.http.headers.Headers'>
Invoking the shell from spiders to inspect responses
Sometimes you want to inspect the responses that are being processed at a certain point in your spider, if only to check that the response you expect is actually getting there.
This can be achieved by using the scrapy.shell.inspect_response function.
Here is an example of how you would call it from your spider:
import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

        # Rest of parsing code.
When you run the spider, you will get something similar to this:
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
...
>>> response.url
'http://example.org'
Then, you can check whether your extraction code is working:
>>> response.xpath('//h1[@class="fn"]') []
Nope, it isn't. So you can open the response in your web browser and see if it is the response you were expecting:
>>> view(response)
True
Finally, you hit Ctrl-D (or Ctrl-Z in Windows) to exit the shell and resume the crawling:
>>> ^D
2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...
Note that you cannot use the fetch shortcut here, since the Scrapy engine is blocked by the shell. However, after you leave the shell, the spider will continue crawling where it stopped, as shown above.
Example:
# -*- coding: utf-8 -*-
import scrapy


class DowloadTaobaoSpider(scrapy.Spider):
    name = 'dowload_taobao'
    allowed_domains = ['www.taobao.com']
    start_urls = ['http://www.taobao.com/']

    def parse(self, response):
        if '.com' in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)