Scrapy Advanced Knowledge Summary (1): Basic Commands and Basic Classes (Spider, Request, Response)

 

I. Common commands

Scrapy commands fall into two groups: global commands, which can be used anywhere, and project commands, which can only be used inside a project directory.

Global commands: startproject, genspider, settings, runspider, shell, fetch, view, version
Project commands: crawl, check, list, edit, parse, bench

1. Create a project

scrapy startproject <project_name> [project_dir]

Example: scrapy startproject douban

2. Create a spider in the project

scrapy genspider [-t template] <name> <domain>

Example: scrapy genspider douban www.douban.com 

<name> is the spider's name 
<domain> is the domain to crawl; allowed_domains and start_urls are generated from it 
[-t template] creates the spider from a template; available templates include basic, crawl, csvfeed, xmlfeed, etc.

3. Run a spider

scrapy crawl <spider>

-o specifies an output file for the scraped items 
--nolog disables log output
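
Example: scrapy crawl douban -o douban.csv --nolog (the export format is inferred from the .csv extension)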

4. List all spiders

scrapy list

5. Fetch and print a response

scrapy fetch <url>

--nolog do not print the log, only show the page content 
--headers print the response headers
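
Example: scrapy fetch --nolog --headers https://www.douban.com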

6. Debug in the shell

scrapy shell [url] 

This enters an interactive debugging session where you can test response.css() / response.xpath() selectors against the fetched data, print the page source, and so on.
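
For example, inside the shell opened with scrapy shell https://www.douban.com you might try the following (a sketch; the selectors depend on the page being inspected):

response.status                          # HTTP status of the fetched page
response.css('title::text').get()        # extract the page title with a CSS selector
response.xpath('//a/@href').getall()     # extract all link addresses with XPath
view(response)                           # open the downloaded page in a browser
fetch('https://www.douban.com/group/')   # fetch another URL and replace response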

7. Show project settings

scrapy settings --get BOT_NAME

BOT_NAME defaults to the project name and can be changed in settings.py.

8. Run a spider without creating a project

scrapy runspider <spider_file.py>

 

II. The Spider class

scrapy.spiders.Spider is the most basic spider class (there are other templates); spider files are mainly built around this class, by inheriting from it and overriding its methods.
 
The Spider class provides a default start_requests() implementation, which reads the start_urls attribute, sends a request for each URL, and calls the parse() method on each returned response to parse the results.
 
1. Common class attributes

name   # the spider's name, a string that identifies the Spider; must be unique within the project 

allowed_domains   # domains the spider is allowed to crawl; optional; links outside this range will not be followed 

start_urls   # list of start URLs; when start_requests() is not implemented, crawling starts from this list by default 

custom_settings   # spider-specific settings that override the global settings 

crawler   # the Crawler object this spider is bound to; it represents the crawler that corresponds to this Spider (a bit circular) 

settings   # the configuration this spider runs with; an instance of Settings 

logger   # a Python logger created with the Spider's name; you can use it to send log messages

2. Class methods

from_crawler(cls, crawler, *args, **kwargs)
# class method used to instantiate the spider and bind the Crawler object to it (used by middleware and other components) 

start_requests(self) 
# a generator that yields Requests built from the configured URLs; it runs automatically when the spider starts and serves as the entry point. If you implement this method, start_urls is ignored (you can, of course, still use start_urls inside it). 

parse(self, response) 
# the default callback for parsing a Response (the default Request callback); this method, like any other Request callback, must return an iterable of Requests and/or dicts or Item objects 

closed(self, reason) 
# called automatically when the spider closes; it is a shortcut for binding to the spider_closed signal
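
Putting the common attributes and methods together, a minimal spider might look like the sketch below (the URLs, setting and selector are placeholders):

import scrapy

class DoubanSpider(scrapy.Spider):
    name = 'douban'                           # unique name within the project
    allowed_domains = ['www.douban.com']      # links outside this domain are not followed
    start_urls = ['https://www.douban.com/']  # used by the default start_requests()
    custom_settings = {'DOWNLOAD_DELAY': 1}   # overrides the global settings for this spider only

    def start_requests(self):
        # optional: overriding this replaces the default behaviour based on start_urls
        for url in self.start_urls:
            yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # default callback: must yield/return Requests, dicts or Item objects
        self.logger.info('Parsing %s', response.url)
        yield {'title': response.css('title::text').get()}

    def closed(self, reason):
        # called automatically when the spider closes (spider_closed signal)
        self.logger.info('Spider closed: %s', reason)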

 

III. Request objects

Scrapy uses the built-in scrapy.http.Request and Response classes to handle requests for network resources and the responses to them.

PS: when writing a spider we usually just import scrapy and refer to the Spider and Request classes as scrapy.Spider and scrapy.Request. This differs from the paths shown in the official documentation because scrapy/__init__.py in the source adds top-level shortcuts:

scrapy/__init__.py

# Declare top-level shortcuts
from scrapy.spiders import Spider
from scrapy.http import Request, FormRequest
from scrapy.selector import Selector
from scrapy.item import Item, Field

 

Request object

scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags, cb_kwargs])

Parameters

url (string)   # the URL of the request 

callback   # the callback function; called once the request's download is complete, with the response object as its first argument 

method (string)   # the HTTP request method; defaults to 'GET' 

meta (dict)   # metadata carried with the request; user-defined parameters passed from the Request to the Response; this parameter is also commonly used by middlewares 

body (str or unicode)   # the request body, i.e. the content that follows the headers in the HTTP request message 

headers (dict)   # the request headers, in dictionary format 

cookies (dict or list)   # cookies, as a dict or a list of dicts 

encoding (string)   # the encoding; defaults to utf-8 

priority (int)   # the priority of the request (defaults to 0); the scheduler uses it to define the order in which requests are processed; requests with a higher priority value are executed earlier; negative values are allowed to indicate relatively low priority 

dont_filter (boolean)   # if you need to submit the same URL multiple times (for example, repeated form submissions), you must set dont_filter=True to keep it from being filtered out as a duplicate page; defaults to False 

errback (callable)   # a function called when any exception is raised while processing the request 

flags (list)   # flags sent with the request; can be used for logging or similar purposes 

cb_kwargs (dict)   # a dict of arbitrary data that will be passed to the Request's callback as keyword arguments.
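
As an illustration, a Request combining several of these parameters might look like this sketch (the URLs and callback names are placeholders):

import scrapy

class GroupSpider(scrapy.Spider):
    name = 'group'
    start_urls = ['https://www.douban.com/']

    def parse(self, response):
        yield scrapy.Request(
            url='https://www.douban.com/group/',
            callback=self.parse_detail,       # called with the Response once downloaded
            headers={'Referer': response.url},
            meta={'page': 1},                 # available later as response.meta['page']
            cb_kwargs={'category': 'group'},  # passed to the callback as keyword arguments
            priority=10,                      # higher values are scheduled earlier
            dont_filter=False,                # keep duplicate filtering enabled
            errback=self.handle_error,        # called if an exception occurs while processing
        )

    def parse_detail(self, response, category):
        yield {'category': category, 'page': response.meta['page']}

    def handle_error(self, failure):
        self.logger.error(repr(failure))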

Properties and Methods

url   # the request URL 

method   # the request method 

headers   # the request headers, dictionary-like 

body   # the request body, str/bytes 

meta   # user-defined parameters passed from the Request to the Response; also commonly used by middlewares 

cb_kwargs   # same as the cb_kwargs parameter above 

copy()   # returns a copy of this Request 

replace(parameters)   # returns a new Request with the given parameters replaced
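
copy() and replace() do not modify the original Request; a small sketch:

import scrapy

req = scrapy.Request('https://www.douban.com/', meta={'page': 1})
req2 = req.copy()                                                     # an independent copy of req
req3 = req.replace(url='https://www.douban.com/group/', priority=5)  # only the given attributes change
# req3 keeps req's other attributes (meta, headers, ...) with the replaced values applied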

 

FormRequest objects

FormRequest extends the base Request class and adds support for handling HTML forms; on top of the Request parameters it adds a formdata parameter.

1. scrapy.http.FormRequest(url [, formdata, ... ])

formdata (dict or iterable of tuples)   # a dict (or iterable of (key, value) tuples) containing HTML form data; it will be URL-encoded and assigned to the body of the request.

When simulating a form POST, the values in formdata must be unicode, str, or bytes objects; they cannot be integers.

return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]

2. The from_response() method

Websites usually provide pre-filled form fields through <input type="hidden"> elements, such as session-related data or authentication tokens (on login pages). When scraping, you often want these fields to be pre-filled automatically and to override only some of them, for example the username and password. The FormRequest.from_response() method can be used for this, as in the following example:

import scrapy

def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
    pass

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return

        # continue scraping with authenticated session...

If the requested page contains an HTML form, from_response() can automatically identify the form elements and merge in the form parameters you pass.

In practice, however, anything from_response() can do can also be done with FormRequest directly, so FormRequest is recommended.
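
As an illustration, the login above could also be written with a plain FormRequest, provided all required form fields (including hidden ones such as a token) are filled in by hand. A minimal sketch (the token field name csrf_token is a placeholder):

import scrapy

class ManualLoginSpider(scrapy.Spider):
    name = 'manual_login'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        # read the hidden token ourselves instead of letting from_response() fill it in
        token = response.css('input[name="csrf_token"]::attr(value)').get()
        return scrapy.FormRequest(
            url='http://www.example.com/users/login.php',
            formdata={'csrf_token': token, 'username': 'john', 'password': 'secret'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # continue scraping with the authenticated session...
        pass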

 

JSONRequest objects

The JSONRequest class adds two new constructor parameters; the remaining parameters are the same as in the Request class. Using JSONRequest sets the Content-Type header to application/json and the Accept header to application/json, text/javascript, */*; q=0.01.

scrapy.http.JSONRequest(url [, ... data, dumps_kwargs ])

data (any JSON-serializable object)   # an object that will be JSON-encoded and assigned to the request body. If the Request.body parameter is provided, this parameter is ignored. If Request.body is not provided and the data parameter is, Request.method is automatically set to 'POST'. 

dumps_kwargs (dict)   # keyword arguments passed to the underlying json.dumps() call used to serialize the data into JSON.

Examples

data = {
    'name1': 'value1',
    'name2': 'value2',
}
yield JSONRequest(url='http://www.example.com/post/action', data=data)

 

IV. Response objects

The Response class holds the information of a downloaded HTTP response. It has several subclasses: TextResponse, HtmlResponse, and XmlResponse. The relationship is as follows:

Response
  └── TextResponse
        ├── HtmlResponse
        └── XmlResponse

Normally, when a page download completes, the downloader creates an object of the appropriate Response subclass based on the HTTP Content-Type response header (usually HtmlResponse) and passes it to our request's callback as the response parameter, from which we extract the data.

Response base class

scrapy.http.Response(url [, status=200, headers=None, body=b'', flags=None, request=None ])

The parameters are similar to those of the Request object; in practice the Response properties and methods are used far more often than the constructor parameters, so the latter are omitted here. See the official documentation for details: https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request

Object Properties

url   # URL string of the response (not necessarily the requested address; redirects must be taken into account) 

status   # integer HTTP status of the response, e.g. 200, 404 

headers   # response headers, dictionary-like; to get a specific value use get('keyname') or getlist('keyname'); get('keyname') returns the first value for the given key, getlist('keyname') returns all values for the key as a list 

body   # the response body, as bytes 

request   # the Request object that produced this response 

meta   # metadata; returns Request.meta, so meta can be thought of as a relay between the request and the response 

flags   # list of flags attached to this response

Object Methods

copy()   # returns a new Response that is a copy of this Response 

replace([url, status, headers, body, request, flags, cls])   # returns a new Response with the given parameters replaced 

urljoin(url)   # constructs an absolute URL by combining the Response's url with a possibly relative URL 

follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None)   

# returns a Request for the given url; unlike the Request constructor, the url may be relative to the response URL. Commonly used to issue the next request.
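
A short sketch of these properties and methods used inside a callback (the spider, URLs and selectors are placeholders):

import scrapy

class PagingSpider(scrapy.Spider):
    name = 'paging'
    start_urls = ['https://www.douban.com/group/']

    def parse(self, response):
        self.logger.info('%s returned %s', response.url, response.status)
        content_type = response.headers.get('Content-Type')   # first value of the header
        page = response.meta.get('page', 1)                    # data relayed from Request.meta

        # urljoin() turns a possibly relative href into an absolute URL
        href = response.css('a.next::attr(href)').get()
        if href:
            next_url = response.urljoin(href)
            # follow() also accepts relative URLs directly and returns a Request
            yield response.follow(next_url, callback=self.parse,
                                  meta={'page': page + 1})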

 

TextResponse objects

TextResponse inherits from the Response base class and adds new methods and new object attributes.

class scrapy.http.TextResponse(url [, encoding [, ... ]])

Attributes

text   # the response text, similar to response.body.decode(response.encoding) but more convenient 

encoding   # the encoding of the HTTP response text; its value may be parsed from the headers or the body of the HTTP response 

selector   # a Selector object for extracting data from the Response

New methods over the base class

xpath(query)   # extracts data from the Response with an XPath selector; a shortcut for response.selector.xpath 

css(query)   # extracts data from the Response with a CSS selector; a shortcut for response.selector.css.
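
A small sketch of these shortcuts, assuming response is an HtmlResponse received in a callback (the selectors are placeholders):

title = response.css('title::text').get()      # shortcut for response.selector.css(...)
links = response.xpath('//a/@href').getall()   # shortcut for response.selector.xpath(...)
html = response.text                           # decoded text, roughly response.body.decode(response.encoding)
print(response.encoding, title, len(links))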

 

HtmlResponse and XmlResponse

In the source code, HtmlResponse and XmlResponse simply inherit from TextResponse; at present there is no practical difference from TextResponse.

 

 

 
