Scrapy framework: the Scrapy shell

scrapy shell

  The Scrapy shell is an interactive shell where you can quickly try out and debug your scraping code without having to run a spider. It was originally intended for testing data-extraction code, but you can actually use it to test any kind of code, because it is also a regular Python shell.

  The shell is used for testing XPath or CSS expressions, seeing how they work, and checking what data they extract from the pages you are trying to crawl. It lets you interactively test your expressions while writing a spider, without having to run the spider to test every change.

Once you are familiar with the Scrapy shell, you will find it an invaluable tool for developing and debugging spiders.

Configuring the shell

If you have  IPython  installed, the Scrapy shell will use it (instead of the standard Python console). The  IPython  console is more powerful and provides smart auto-completion and colorized output.

Through Scrapy's settings you can configure it to use  ipython ,  bpython , or the standard  python shell, regardless of which are installed. This is done by setting the  SCRAPY_PYTHON_SHELL environment variable, or by defining it in your  scrapy.cfg :

[settings]
shell = bpython

Starting the shell

Start the shell with the  shell command, as follows:

scrapy shell <url>   # <url> is the URL to crawl

The shell also works with local files, which is handy if you want to play with a local copy of a web page. The shell understands the following syntaxes for local files:

# UNIX-style
scrapy shell ./path/to/file.html
scrapy shell ../other/path/to/file.html
scrapy shell /absolute/path/to/file.html

# File URI
scrapy shell file:///absolute/path/to/file.html

When using a relative file path, make it explicit by prefixing it with  ./ (or  ../ ):

scrapy shell ./index.html

Using the shell

The Scrapy shell is just an ordinary Python console (or an  IPython  console, if you have it available) that provides some additional convenience shortcuts.

Available shortcuts

  • shelp() - print a help message with the list of available objects and shortcuts.

  • fetch(url[, redirect=True]) - fetch a new response from the given URL and update all related objects accordingly. You can optionally ask for HTTP 3xx redirects not to be followed by passing redirect=False.

  • fetch(request) - fetch a new response from the given request and update all related objects accordingly.

  • view(response) - open the given response in your local web browser for inspection. This adds a <base> tag to the response body so that external links (such as stylesheets and images) display correctly. Note, however, that this creates a temporary file on your computer, and the file is not removed automatically.

Available Scrapy objects

The Scrapy shell automatically creates some convenient objects from the downloaded page, such as the  Response object and  Selector objects (for both HTML and XML content).

These objects are:

  • crawler - the current  Crawler object.
  • spider - the Spider which is known to handle the URL, or a  Spider object if there is no spider found for the current URL.
  • request - a  Request object of the last fetched page. You can modify this request using  replace() or fetch a new request (without leaving the shell) using the  fetch shortcut.
  • response - a  Response object containing the last fetched page.
  • settings - the current Scrapy settings.

Shell Example Session

Here is an example of a typical shell session where we start by scraping the https://scrapy.org page and then proceed to scrape the https://reddit.com page. Finally, we modify the (reddit) request method to POST and re-fetch it, getting an error. We end the session by typing ctrl-d (on Unix systems) or ctrl-z on Windows.

Keep in mind that the data extracted here may not be the same when you try it, as these pages are not static and could have changed by the time you test this. The only purpose of this example is to familiarize you with how the Scrapy shell works.

First, we launch the shell:

C:\Users\admin>scrapy shell www.scrapy.org --nolog

Then, the shell fetches the URL (using the Scrapy downloader) and prints the list of available objects and useful shortcuts (you'll notice that these lines all start with the  [s] prefix):

After that, we can start playing with the objects:

In [1]: response.xpath('//title/text()').get()   # get the page title
Out[1]: 'Scrapy | A Fast and Powerful Scraping and Web Crawling Framework'

In [5]: fetch('https://www.osgeo.cn/scrapy/topics/shell.html')  # switch to another URL
   
In [6]: response.xpath('//title//text()').get()   # get the page title
Out[6]: 'Scrapy shell — Scrapy 1.7.0 文档'

In [7]: request = request.replace(method='POST')   # change the request method to POST

In [8]: fetch(request)     # re-fetch using the modified request

In [9]: response.status    # response status code
Out[9]: 405

In [10]: from pprint import pprint    # import the pprint module

In [11]: pprint(response.headers)    # print the response headers
{b'Content-Type': [b'text/html'],
 b'Date': [b'Sat, 10 Aug 2019 04:08:19 GMT'],
 b'Server': [b'nginx/1.10.3']}

In [12]: pprint(type(response.headers))   # print the type
<class 'scrapy.http.headers.Headers'>

Calling the shell from spiders to inspect responses

Sometimes you want to inspect the responses that are being processed at a certain point in your spider, if only to check that the response you expect is actually getting there.

This can be achieved by using the  scrapy.shell.inspect_response function.

Here is an example of how to call it from your spider:

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    start_urls = [
        "http://example.com",
        "http://example.org",
        "http://example.net",
    ]

    def parse(self, response):
        # We want to inspect one specific response.
        if ".org" in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

        # Rest of parsing code.

When you run the spider, you will get something like this:

2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com> (referer: None)
2014-01-23 17:48:31-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.org> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0x1e16b50>
...

>>> response.url
'http://example.org'

Then, you can check if the extraction code is working:

>>> response.xpath('//h1[@class="fn"]')
[]

Nope, it isn't. So you can open the response in your web browser and see if it's the response you were expecting:

>>> view(response)
True

Finally, you press ctrl-d (or ctrl-z in Windows) to exit the shell and resume the crawling:

>>> ^D
2014-01-23 17:50:03-0400 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.net> (referer: None)
...

Note that you cannot use the  fetch shortcut here, since the Scrapy engine is blocked by the shell. However, after you leave the shell, the spider will continue crawling where it stopped, as shown above.

Example:

# -*- coding: utf-8 -*-
import scrapy


class DowloadTaobaoSpider(scrapy.Spider):
    name = 'dowload_taobao'
    allowed_domains = ['www.taobao.com']
    start_urls = ['http://www.taobao.com/']

    def parse(self, response):
        if '.com' in response.url:
            from scrapy.shell import inspect_response
            inspect_response(response, self)

 

Origin: www.cnblogs.com/songzhixue/p/11331141.html