I. Common commands
Scrapy's global commands can be used anywhere; project commands can only be used inside a project directory.
Global commands:    Project commands:
startproject        crawl
genspider           check
settings            list
runspider           edit
shell               parse
fetch               bench
view
version
1. Create a project
scrapy startproject <project_name> [project_dir]
Example: scrapy startproject douban
2. Create a spider in the project
scrapy genspider [-t template] <name> <domain>
Example: scrapy genspider douban www.douban.com
<name> is the name of the spider
<domain> is the domain to crawl; it is used to generate allowed_domains and start_urls
[-t template] creates the spider from a template; available templates include basic, crawl, csvfeed, and xmlfeed (a sketch of the generated file follows)
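For reference, here is a sketch of what the default basic template generates for the example above (details vary slightly across Scrapy versions):

import scrapy

class DoubanSpider(scrapy.Spider):
    name = 'douban'
    allowed_domains = ['www.douban.com']
    start_urls = ['http://www.douban.com/']

    def parse(self, response):
        pass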
3. Run a spider
scrapy crawl <spider>
-o <file> exports the scraped items to a file; the format is inferred from the file extension
--nolog disables log output
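For example, with the douban spider created above (the output file name is arbitrary):

scrapy crawl douban -o items.json    # export scraped items to JSON
scrapy crawl douban --nolog          # run without log output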
4. List all spiders
scrapy list
5. Fetch and print a response
scrapy fetch <url>
--nolog suppresses the log and prints only the response body
--headers prints the response headers instead
6. Debug with the shell
scrapy shell [url]
This opens an interactive command-line session where you can test response.css() / response.xpath() extraction, print the page source, and so on; a sample session follows.
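A sketch of a typical session (the target site and selectors are illustrative):

scrapy shell http://quotes.toscrape.com
>>> response.status
>>> response.css('title::text').get()
>>> response.xpath('//a/@href').getall()
>>> view(response)   # open the downloaded page in your browser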
7. Display a project setting
scrapy settings --get BOT_NAME
BOT_NAME defaults to the project name and can be changed in settings.py
8. Run a spider without creating a project
scrapy runspider <spider_file.py>
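A minimal self-contained spider file that works with runspider, as a sketch (quotes.toscrape.com is a public practice site):

# standalone.py, run with: scrapy runspider standalone.py -o quotes.json
import scrapy

class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # each quote block carries its text and author
        for quote in response.css('div.quote'):
            yield {
                'text': quote.css('span.text::text').get(),
                'author': quote.css('small.author::text').get(),
            }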
II. The Spider class
1. Class attributes
name            # spider name; the string that identifies the Spider, unique within the project
allowed_domains # optional list of domains allowed to crawl; links outside this range are not followed
start_urls      # list of start URLs; when start_requests() is not implemented, crawling starts from this list by default
custom_settings # spider-specific settings that override the global settings
crawler         # the Crawler object the spider is bound to; it represents the crawler that corresponds to this Spider instance (a bit circular)
settings        # the configuration this spider runs with; an instance of Settings
logger          # a Python logger created with the Spider's name; you can use it to send log messages
2. Class Methods
from_crawler(cls, crawler, *args, **kwargs) # class method used to instantiate the spider and bind the Crawler object to it
start_requests(self) # generator that yields Requests built from the configured URLs; it runs automatically at startup as the spider's entry point. Implementing this method means start_urls is ignored (though you can still refer to start_urls inside it)
parse(self, response) # default response parsing function (the default Request callback); this method, like any other Request callback, must return an iterable of Request and/or dicts or Item objects
closed(self, reason) # runs automatically when the spider closes; a shortcut for binding to the spider_closed signal
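A sketch tying these attributes and methods together (URLs and the setting value are placeholders):

import scrapy

class ExampleSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    custom_settings = {'DOWNLOAD_DELAY': 1}  # overrides the global setting

    def start_requests(self):
        # implementing this method means start_urls is not needed
        yield scrapy.Request('http://example.com/page/1', callback=self.parse)

    def parse(self, response):
        self.logger.info('Parsing %s', response.url)  # the logger attribute
        yield {'title': response.css('title::text').get()}

    def closed(self, reason):
        # shortcut for the spider_closed signal
        self.logger.info('Spider closed: %s', reason)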
III. Request objects
Scrapy uses the built-in scrapy.http.Request and Response classes to handle requests for network resources and their responses.
PS: when writing spiders we usually just import scrapy and refer to the Spider and Request classes as scrapy.Spider and scrapy.Request. This differs from the paths shown in the official documentation because scrapy/__init__.py declares top-level shortcuts:
# scrapy/__init__.py
# Declare top-level shortcuts
from scrapy.spiders import Spider
from scrapy.http import Request, FormRequest
from scrapy.selector import Selector
from scrapy.item import Item, Field
Request object
scrapy.http.Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback, flags, cb_kwargs])
Parameters
url (string)          # the URL of the request
callback              # callback function, called with the response object as its first parameter once the request has been downloaded
method (string)       # HTTP request method; defaults to 'GET'
meta (dict)           # metadata; user-defined values passed from the Request to the Response, also commonly used by middlewares
body (str or unicode) # request body, the content that follows the HTTP request headers
headers (dict)        # request headers in dictionary format
cookies (dict or list) # cookies, as a dict or a list of dicts
encoding (string)     # encoding; defaults to utf-8
priority (int)        # priority of the request (default 0); the scheduler uses it to define the order in which requests are processed, so requests with a higher priority value execute earlier; negative values are allowed to indicate relatively low priority
dont_filter (boolean) # set dont_filter=True when you need to request the same URL multiple times (e.g. submitting a form repeatedly) so it is not filtered out as a duplicate; defaults to False
errback (callable)    # function invoked when any exception is raised while processing the request
flags (list)          # flags sent with the request; can be used for logging or similar purposes
cb_kwargs (dict)      # dict of arbitrary data that will be passed to the Request's callback as keyword arguments
Properties and Methods
url        # request URL
method     # request method
headers    # request headers, dictionary type
body       # request body, str
meta       # user-defined values passed from the Request to the Response, also commonly handled in middlewares
cb_kwargs  # callback keyword arguments, dict
copy()     # returns a copy of this Request
replace(parameters) # returns a new Request with the given parameters replaced (see the sketch below)
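A sketch of these parameters and methods in use (the URLs, proxy address, and parse_item callback are illustrative):

import scrapy

class ItemSpider(scrapy.Spider):
    name = 'items'
    start_urls = ['http://example.com/list']

    def parse(self, response):
        req = scrapy.Request(
            'http://example.com/item/1',
            callback=self.parse_item,
            meta={'proxy': 'http://127.0.0.1:8080'},  # typically consumed by middlewares
            cb_kwargs={'page': 1},  # delivered to parse_item as a keyword argument
            priority=10,
            dont_filter=True,  # do not drop this URL as a duplicate
        )
        yield req
        # replace() returns a new Request with the given attributes swapped out
        yield req.replace(url='http://example.com/item/2', cb_kwargs={'page': 2})

    def parse_item(self, response, page):
        # meta travels from the Request to the Response
        self.logger.info('page=%s proxy=%s', page, response.meta.get('proxy'))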
FormRequest objects
FormRequest extends the base Request class with support for handling HTML forms; compared with Request it adds the formdata parameter.
1. scrapy.http.FormRequest(url[, formdata, ...])
formdata (dict or iterable of tuples) # a dict (or iterable of (key, value) tuples) containing HTML form data; it is URL-encoded and assigned to the body of the request
To simulate a form POST, the values in formdata must be unicode, str, or bytes objects; they cannot be integers:
return [FormRequest(url="http://www.example.com/post/action",
                    formdata={'name': 'John Doe', 'age': '27'},
                    callback=self.after_post)]
2. The from_response() method
Websites usually pre-fill form fields through <input type="hidden"> elements, for example session-related data or authentication tokens (on login pages). When scraping, you often want these fields to be pre-filled automatically and to override only some of them, such as the username and password. The FormRequest.from_response() method does exactly that, as in the following example:
import scrapy

def authentication_failed(response):
    # TODO: Check the contents of the response and return True if it failed
    # or False if it succeeded.
    pass

class LoginSpider(scrapy.Spider):
    name = 'example.com'
    start_urls = ['http://www.example.com/users/login.php']

    def parse(self, response):
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},
            callback=self.after_login
        )

    def after_login(self, response):
        if authentication_failed(response):
            self.logger.error("Login failed")
            return
        # continue scraping with authenticated session...
If the requested page contains an HTML form, from_response() can identify the form automatically and merge the given form parameters into its fields. That said, anything from_response() can do can also be done with FormRequest directly, and FormRequest is recommended.
JSONRequest objects
The JSONRequest class adds two new constructor parameters; the remaining parameters are the same as in the Request class. Using JSONRequest sets the Content-Type header to application/json and the Accept header to application/json, text/javascript, */*; q=0.01
scrapy.http.JSONRequest(url[, ..., data, dumps_kwargs])
data (JSON-serializable object) # any JSON-serializable object that should be JSON-encoded and assigned to the body. If the Request.body argument is provided, this parameter is ignored. If Request.body is not provided and data is, Request.method is automatically set to 'POST'
dumps_kwargs (dict) # keyword arguments passed to the underlying json.dumps() call used to serialize the data to JSON
Examples
# inside a spider callback
from scrapy.http import JSONRequest  # named JsonRequest in newer Scrapy versions

data = {
    'name1': 'value1',
    'name2': 'value2',
}
yield JSONRequest(url='http://www.example.com/post/action', data=data)
IV. Response objects
The Response class carries the downloaded HTTP information. It has several subclasses: TextResponse, HtmlResponse, XmlResponse. The hierarchy is as follows:
Response
└── TextResponse
    ├── HtmlResponse
    └── XmlResponse
Normally, when a page finishes downloading, the downloader creates an object of the appropriate Response subclass based on the HTTP Content-Type response header (usually HtmlResponse) and passes it as the response parameter to the request's callback function, where we extract the data.
Response base class
scrapy.http.Response(url[, status=200, headers=None, body=b'', flags=None, request=None])
The parameters are similar to those of the Request object, and in practice a Response is used mostly through its properties and methods rather than its constructor parameters, so the parameters are omitted here; see the official docs at https://docs.scrapy.org/en/latest/topics/request-response.html#scrapy.http.Request
Object Properties
url      # string with the URL of the response (not necessarily the requested address; consider redirections)
status   # integer HTTP status of the response, e.g. 200, 404
headers  # response headers, dictionary-like; to obtain a specific value use get('keyname') or getlist('keyname'): get('keyname') returns the first value for the given key as a str; getlist('keyname') returns all values for the given key as a list
body     # response body, bytes
request  # the Request object that produced this response
meta     # metadata; the same data as Request.meta, so meta can be understood as a relay between request and response
flags    # list containing the flags of this response
Object Methods
copy()   # returns a new Response that is a copy of this one
replace([url, status, headers, body, request, flags, cls]) # returns a new object with the given parameters replaced
urljoin(url) # constructs an absolute URL by combining the Response's url with a possibly relative URL
follow(url, callback=None, method='GET', headers=None, body=None, cookies=None, meta=None, encoding='utf-8', priority=0, dont_filter=False, errback=None, cb_kwargs=None)
         # returns a Request for the given url; the url may be relative to the response url. Commonly used to issue the next request.
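A sketch of urljoin() and follow() used for pagination (the selector is illustrative):

import scrapy

class PagingSpider(scrapy.Spider):
    name = 'paging'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # get() returns the first header value; getlist() would return all of them
        self.logger.info('%s %s', response.status, response.headers.get('Content-Type'))
        next_href = response.css('li.next a::attr(href)').get()
        if next_href:
            # urljoin() resolves the relative href against response.url
            yield scrapy.Request(response.urljoin(next_href), callback=self.parse)
            # equivalent and shorter, since follow() accepts relative URLs:
            # yield response.follow(next_href, callback=self.parse)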
TextResponse objects
TextResponse inherits from the base Response class and adds new methods and attributes.
class scrapy.http.TextResponse(url[, encoding[, ...]])
Attributes
text     # response text, similar to response.body.decode(response.encoding) but more convenient
encoding # encoding of the HTTP response text; its value may be parsed from the headers or the body of the HTTP response
selector # a Selector object for extracting data from the Response
Methods added over the base class
xpath(query) # extracts data from the Response with XPath selectors; a shortcut for response.selector.xpath()
css(query)   # extracts data from the Response with CSS selectors; a shortcut for response.selector.css()
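A sketch showing that the two shortcuts extract the same data (inside a spider callback; selectors are illustrative):

def parse(self, response):
    # both shortcuts delegate to response.selector
    title_css = response.css('title::text').get()
    title_xpath = response.xpath('//title/text()').get()
    assert title_css == title_xpath
    yield {'title': title_css, 'links': response.css('a::attr(href)').getall()}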
HtmlResponse and XmlResponse
In the source code, HtmlResponse and XmlResponse simply inherit from TextResponse; at present there is no difference from TextResponse.