A Must-Read for Beginners: Getting Started with Web Crawlers and the HTTP Protocol

1. Introduction

I often see people on Zhihu asking questions like "How do I get started with Python crawlers?" or "How should a beginner learn Python crawlers?" I am writing this article today to tell you why I wanted to learn crawlers and what a crawler really is.

2. Why learn crawlers

Let me start with my own story: why did I want to learn crawlers?

Two years ago I was still a naive kid. Back then I went to the Old Driver forum almost every night to look for a movie. I don't know whether you know the Old Driver forum: you can browse movies by category, but it does not support multi-select (you cannot pick two or more categories at once). For example, when I wanted something with an "xxx" plot plus Chinese subtitles, this is what I did: first select the "xxx" category, then go through it page by page, pressing Ctrl+F and searching for "Chinese" on each one. After a few days of this I realized it was simply too silly, so I searched Baidu and came across "crawlers" for the first time. Driven by strong interest, I got through the basics in about a week. That is how I came to learn crawlers.

I think crawlers exist to let us be lazy. As in the story above, once I had crawled the whole Old Driver forum, I could search by as many conditions as I liked and no longer had to flip through page after page. Crawlers save us from a whole series of tedious, time-consuming chores. For example, if I want to download the pictures from a site I like, I don't have to click them one by one; I can write a crawler that downloads them all for me.

3. What a crawler really is

In one sentence, I think the essence of a crawler is imitating a browser opening a web page.

Let's look at an example (the "Let the red envelopes fly" page).

After opening the page, press F12 to open the Developer Tools, then press F5 to refresh the page (I am using Google Chrome).

Click "Network" at the top, then click "Doc". You should see the document request for this page in the list.

Let's look at the General section first.

Request URL: the address of the web page we opened, which is the address above.

Request Method: how the request was made; here we can see it is GET.

There are several request methods (method names are always uppercase). Each is explained below:
GET: request the resource identified by the Request-URI
POST: append new data to the resource identified by the Request-URI
HEAD: request only the response message headers for the resource identified by the Request-URI
PUT: ask the server to store a resource, using the Request-URI as its identifier
DELETE: ask the server to delete the resource identified by the Request-URI
TRACE: ask the server to echo back the request it received, mainly for testing or diagnostics
CONNECT: ask a proxy to open a tunnel to the target server (older references list it as reserved for future use)
OPTIONS: ask about the server's capabilities, or about the options and requirements associated with a resource
Application examples:
GET method: when you visit a page by typing a URL into the browser's address bar, the browser uses GET to fetch the resource from the server, e.g.: GET /form.html HTTP/1.1 (CRLF)

POST method: asks the server to accept the data attached after the request headers; it is commonly used to submit forms.
e.g.: POST /reg.jsp HTTP/ (CRLF)
Accept: image/gif, image/x-xbit, ... (CRLF)
...
Host: www.guet.edu.cn (CRLF)
Content-Length: 22 (CRLF)
Connection: Keep-Alive (CRLF)
Cache-Control: no-cache (CRLF)
(CRLF) // this CRLF marks the end of the message headers; everything before it is headers
user=jeffrey&pwd=1234 // this line is the submitted data
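
To make the layout of a request message more concrete, here is a minimal Python sketch that assembles one by hand: a request line, the header lines, an empty line, then the body, all joined with CRLF. The helper name build_request, the host example.com, and the HTTP/1.1 version are my own assumptions for illustration. Content-Length is computed from the body rather than typed in.

```python
CRLF = "\r\n"


def build_request(method, path, headers, body="", version="HTTP/1.1"):
    """Assemble a raw HTTP request message as text (hypothetical helper)."""
    # Request line: METHOD, space, Request-URI, space, HTTP version
    lines = [f"{method} {path} {version}"]
    # One "Name: value" line per header
    lines += [f"{name}: {value}" for name, value in headers.items()]
    # An empty line (a bare CRLF) ends the headers; the body follows it
    return CRLF.join(lines) + CRLF + CRLF + body


body = "user=jeffrey&pwd=1234"
message = build_request(
    "POST",
    "/reg.jsp",
    {
        "Host": "example.com",  # assumed host, for illustration only
        "Content-Length": str(len(body.encode("ascii"))),
        "Connection": "Keep-Alive",
        "Cache-Control": "no-cache",
    },
    body,
)
print(message)
```

A GET request is built the same way, just without a body: build_request("GET", "/form.html", {"Host": "example.com"}).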

HEAD method: almost the same as GET. The HTTP headers in the response to a HEAD request are the same as those you would get from the corresponding GET request. With this method you can get information about the resource identified by the Request-URI without transferring the resource content itself. It is often used to test whether a hyperlink is valid, whether it is reachable, and whether it has been updated recently.
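
If you want to see the difference between GET and HEAD for yourself, the sketch below may help. It assumes loopback networking is available: it starts a tiny throwaway server on 127.0.0.1 using Python's standard library and sends both methods to it. DemoHandler, fetch, compare_get_and_head, and the PAGE content are names I made up for this demo.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = b"<html><body>hello</body></html>"  # made-up page content


class DemoHandler(BaseHTTPRequestHandler):
    """Tiny local server used only for this demonstration."""

    def _send_headers(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(PAGE)))
        self.end_headers()

    def do_GET(self):
        self._send_headers()
        self.wfile.write(PAGE)  # GET: headers plus the body

    def do_HEAD(self):
        self._send_headers()  # HEAD: same headers, no body

    def log_message(self, format, *args):
        pass  # keep the console quiet


def fetch(port, method):
    """Send one request and return (status, Content-Length header, body)."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request(method, "/")
    resp = conn.getresponse()
    body = resp.read()
    result = (resp.status, resp.getheader("Content-Length"), body)
    conn.close()
    return result


def compare_get_and_head():
    """Start the server on a free port, then issue a GET and a HEAD."""
    server = HTTPServer(("127.0.0.1", 0), DemoHandler)
    port = server.server_address[1]
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return fetch(port, "GET"), fetch(port, "HEAD")
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    for result in compare_get_and_head():
        print(result)
```

Both responses carry the same status and Content-Length header, but only the GET response has a body.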

Status Code: the status returned by the server; here it is 200, which means OK.

A status code has three digits. The first digit defines the class of the response, and there are five possible values:
1xx: Informational - the request has been received and processing continues
2xx: Success - the request has been successfully received, understood, and accepted
3xx: Redirection - further action is needed to complete the request
4xx: Client error - the request has a syntax error or cannot be fulfilled
5xx: Server error - the server failed to fulfill a valid request
Common status codes, with their reason phrases and meanings:
200 OK // the client request succeeded
400 Bad Request // the client request has a syntax error and cannot be understood by the server
401 Unauthorized // the request is not authorized; this status code must be used together with the WWW-Authenticate header field
403 Forbidden // the server received the request but refuses to serve it
404 Not Found // the requested resource does not exist, e.g. a wrong URL was entered
500 Internal Server Error // the server encountered an unexpected error
503 Service Unavailable // the server currently cannot handle the client's request; it may recover after a while
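
The "first digit decides the class" rule is easy to express in code. Here is a small sketch of it. The CATEGORIES table and the describe_status function are my own names for illustration; the reason phrases come from Python's standard http.HTTPStatus enum.

```python
from http import HTTPStatus

# Class of a response, keyed by the first digit of the status code
CATEGORIES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return 'code phrase (category)' for a three-digit status code."""
    category = CATEGORIES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase  # standard reason phrase, if known
    except ValueError:
        phrase = "Unknown"  # a code the standard library does not list
    return f"{code} {phrase} ({category})"


for code in (200, 400, 401, 403, 404, 500, 503):
    print(describe_status(code))
```

A crawler can use the same check to decide what to do next, for example retrying later on a 5xx and giving up on a 4xx.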
Next, let's look at the Request Headers (the message headers sent with the request).

Accept
The Accept request header specifies which types of content the client will accept. e.g.: Accept: image/gif means the client wants resources in GIF image format; Accept: text/html means the client wants HTML text.

Accept-Encoding
The Accept-Encoding request header is similar to Accept, but it specifies the acceptable content encodings. e.g.: Accept-Encoding: gzip, deflate. If the request does not set this header, the server assumes the client accepts any content encoding.

Accept-Language
The Accept-Language request header is similar to Accept, but it specifies a natural language. e.g.: Accept-Language: zh-cn. If the request does not set this header, the server assumes the client accepts any language.

Cache-Control: controls how the page is cached; for details, see the Cache-Control entry on Baidu Encyclopedia.

Cookie (sometimes written in the plural, Cookies): data (usually encrypted) that some websites store on the user's local machine to identify the user and track the session. RFC 2109 and RFC 2965, which defined it, have been obsoleted; the current specification that replaces them is RFC 6265 [1]. (Loosely speaking, you can think of it as data the browser keeps for the site.)
Host: the host name from the URL you are requesting
User-Agent: the name and version of the current browser

Referer: tells the server which page's link you came from (it does not appear in this example request).
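
Putting the headers above together, here is a rough sketch of how a crawler might prepare a browser-like request with Python's standard urllib. The function name make_browser_like_request, the example.com URLs, and the User-Agent string are placeholders I made up. The request is only built and printed here, not sent.

```python
from urllib.request import Request


def make_browser_like_request(url, referer=None):
    """Build (but do not send) a GET request carrying browser-style headers."""
    headers = {
        "Accept": "text/html",
        "Accept-Encoding": "gzip, deflate",
        "Accept-Language": "zh-cn",
        "Cache-Control": "no-cache",
        # Placeholder value; a real browser sends a much longer string
        "User-Agent": "Mozilla/5.0 (compatible; demo-crawler)",
    }
    if referer:
        headers["Referer"] = referer  # the page we claim to have come from
    return Request(url, headers=headers, method="GET")


req = make_browser_like_request(
    "http://example.com/", referer="http://example.com/index.html"
)
print(req.get_method(), req.full_url)
for name, value in req.header_items():
    print(f"{name}: {value}")
```

Passing the result to urllib.request.urlopen would actually send it, and urllib adds the Host header from the URL at that point. Note that urllib does not decompress gzip responses for you, so if you read the body directly it is simpler to leave out Accept-Encoding.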

I won't go through the Response Headers here.

What a crawler does is simulate sending the content described above (stay tuned for the next article).