To learn about web crawlers, you first need the basics. What is a crawler? What can a crawler do? How does a crawler work? Let's study together; hopefully this helps beginners get started.
- Crawler Definition, Classification and Process
- HTTP and HTTPS
Crawler definition
A web crawler (also known as a web spider or web robot) is a program that simulates a browser: it sends web requests and receives the responses, automatically collecting information from the Internet according to certain rules. The more closely a crawler imitates browser behavior, the less likely it is to be detected. In principle, a crawler can do anything a browser (client) can do.
Classification of crawlers
- General crawler: usually refers to a search engine's crawler
- Focused crawler: a crawler that targets specific sites
Uses of crawlers
- Downloading VIP music
- Downloading VIP movies
- Grabbing train tickets on 12306
- Automated voting on websites
- SMS bombing
etc.
The crawler workflow
- Send a request to the starting URL and get a response
- Extract content from the response
- If a URL is extracted, send a request to it and get its response
- If data is extracted, save the data
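
The four steps above can be sketched as a small loop. This is only a minimal illustration, not a production crawler: the `crawl` function, the `LinkAndTitleParser` class, and the in-memory `FAKE_SITE` are all made up for this example, and the "data" being saved is assumed to be each page's title. The `fetch` argument stands in for whatever actually sends the request.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects href values from <a> tags and the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl. `fetch(url)` must return HTML text or None."""
    queue = deque([start_url])
    seen = {start_url}
    saved = {}  # url -> extracted data (here: the page title)
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        html = fetch(url)                  # step 1: request -> response
        if html is None:
            continue
        parser = LinkAndTitleParser()
        parser.feed(html)                  # step 2: extract from the response
        saved[url] = parser.title.strip()  # step 4: data is saved
        for href in parser.links:          # step 3: URLs are queued for requests
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return saved


# Hypothetical in-memory "site" so the sketch runs without network access.
FAKE_SITE = {
    "http://example.test/": '<title>Home</title><a href="/a">A</a><a href="/b">B</a>',
    "http://example.test/a": '<title>Page A</title><a href="/">home</a>',
    "http://example.test/b": "<title>Page B</title>",
}

result = crawl("http://example.test/", FAKE_SITE.get)
print(result)
```

The `seen` set is what keeps the loop from requesting the same URL twice when pages link back to each other.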
Robots protocol
Robots protocol: through its robots.txt file, a website tells search engines which pages may be crawled and which may not. It is only a moral constraint, not a technical one. For example: Taobao's robots.txt
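
A crawler that wants to respect these rules can check them with Python's standard `urllib.robotparser` module. The sketch below parses a made-up robots.txt from a string so it runs offline; the rules, the `my-crawler` user agent, and the `example.test` URLs are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. A real file would be downloaded from
# the site's /robots.txt path instead of being written inline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

can_public = parser.can_fetch("my-crawler", "http://example.test/index.html")
can_private = parser.can_fetch("my-crawler", "http://example.test/private/data.html")
print(can_public, can_private)  # True False
```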
HTTP concept
HTTP (Hypertext Transfer Protocol) is a client/server model communication protocol on the application layer, which consists of requests and responses and is stateless.
Protocol: The protocol stipulates the data transmission format that both communication parties must abide by, so that the communication parties can communicate accurately according to the agreed format.
Stateless: there is no link between two successive communications. Each request is handled as a new one, and the server does not keep a record of earlier or later requests.
HTTP request flow
- The browser resolves the domain name to an IP address through a DNS server
- The browser sends the first request to that IP and gets the corresponding response
- The returned content (HTML) contains URLs for CSS, JS, images, and Ajax code; the browser sends further requests in the order they appear and gets the corresponding responses
- Each time the browser gets a response, it adds to (loads) the displayed result. JS, CSS, and other content modify the page, and JS can also send new requests and get their responses
- The whole process, from displaying the first response until all responses have been received and the displayed result has been added to or modified, is called browser rendering
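
To see why a crawler's first response differs from what the browser finally renders, it can help to list the follow-up URLs hidden in the HTML. The sketch below is a simplified assumption of that step: the `SubresourceParser` class and the sample `PAGE` are invented here, and only `link`, `script`, and `img` tags are considered.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SubresourceParser(HTMLParser):
    """Records, in document order, URLs a browser would request next."""

    # tag name -> attribute that carries the URL
    URL_ATTRS = {"link": "href", "script": "src", "img": "src"}

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attr_name = self.URL_ATTRS.get(tag)
        if attr_name:
            value = dict(attrs).get(attr_name)
            if value:
                # Resolve relative URLs against the page's own URL.
                self.urls.append(urljoin(self.base_url, value))


# Hypothetical page content standing in for the first response.
PAGE = """
<html><head>
<link rel="stylesheet" href="/static/site.css">
<script src="app.js"></script>
</head><body><img src="img/logo.png"></body></html>
"""

parser = SubresourceParser("http://example.test/index.html")
parser.feed(PAGE)
for url in parser.urls:
    print(url)
```

Anything that JS adds to the page after these requests will not appear in the first response, which is why a crawler sometimes has to request the Ajax URLs directly.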
Five-Layer Network Model
HTTP protocol structure diagram
Network Model Correspondence
- HTTP, RTSP, FTP -------> application layer
- TCP, UDP -------> transport layer
- IP -------> network layer
- Data link -------> data link layer
- Physical medium -------> physical layer
URL format
Format specification: scheme://host[:port]/path/…/[?query-string][#anchor]
- scheme: protocol (e.g. http, https, ftp)
- host: IP address or domain name of the server
- port: the server's port (can be omitted when it is the protocol's default, e.g. 80 for HTTP or 443 for HTTPS)
- path: the path to access the resource
- query-string: parameter, the data sent to the http server
- anchor: anchor (jump to the specified anchor position of the web page)
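
The parts listed above can be pulled out of a URL with the standard `urllib.parse` module. This is a small sketch; the sample URL, the `DEFAULT_PORTS` table, and the `effective_port` helper are assumptions added for illustration.

```python
from urllib.parse import urlsplit, parse_qs

url = "https://example.test:8443/docs/page.html?lang=en&page=2#install"
parts = urlsplit(url)

print(parts.scheme)           # https
print(parts.hostname)         # example.test
print(parts.port)             # 8443
print(parts.path)             # /docs/page.html
print(parse_qs(parts.query))  # {'lang': ['en'], 'page': ['2']}
print(parts.fragment)         # install

# Hypothetical helper: fall back to the scheme's default port when the
# URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def effective_port(url):
    parts = urlsplit(url)
    return parts.port or DEFAULT_PORTS.get(parts.scheme)


print(effective_port("http://example.test/"))   # 80
print(effective_port("https://example.test/"))  # 443
```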
HTTP request
request format
request method
- According to the HTTP standard, an HTTP request can use one of several request methods.
- HTTP/1.0 defines three request methods: GET, POST, and HEAD.
- HTTP/1.1 adds five new request methods: OPTIONS, PUT, DELETE, TRACE, and CONNECT.
Request method | Description |
---|---|
GET | Requests the specified page and returns the entity body. |
HEAD | Like a GET request, but the response has no body; used to obtain the headers. |
POST | Submits data to the specified resource for processing (such as submitting a form or uploading a file). The data is included in the request body. A POST request may create new resources and/or modify existing ones. |
PUT | Data sent from the client replaces the content of the specified document. |
DELETE | Requests that the server delete the specified page. |
CONNECT | Reserved in HTTP/1.1 for proxy servers that can switch the connection to a tunnel. |
OPTIONS | Lets the client query which capabilities (such as supported methods) the server offers. |
TRACE | Echoes back the request the server received; mainly used for testing or diagnosis. |
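
As a rough illustration of how the method changes where the data goes, the sketch below builds (but never sends) three requests with the standard `urllib.request.Request` class. The `example.test` URL and the `params` values are placeholders.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "crawler", "page": 1}

# GET: the parameters travel in the URL's query string.
get_req = Request("http://example.test/search?" + urlencode(params))

# POST: the same parameters travel in the request body instead.
post_req = Request(
    "http://example.test/search",
    data=urlencode(params).encode("ascii"),
)

# HEAD: asks for the headers only; the method must be named explicitly.
head_req = Request("http://example.test/search", method="HEAD")

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
print(head_req.get_method())
```

With `urllib`, supplying `data` is what switches the default method from GET to POST; other methods are chosen with the `method` argument.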
Common request headers
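Request header | Purpose |
---|---|
Cookie | Cookies previously set by the server, sent back with the request |
User-Agent | Identifies the client (browser name and version) |
Referer | URL of the page the request was made from |
Host | Host and port number |
Connection | Connection type (e.g. keep-alive) |
Upgrade-Insecure-Requests | Tells the server the client prefers to be upgraded to HTTPS |
Accept | Media types the client can accept |
Accept-Encoding | Content encodings (compression formats) the client can decode |
X-Requested-With: XMLHttpRequest | Marks the request as an Ajax request |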
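
Since a crawler looks more like a browser when it sends the same headers, here is a minimal sketch of attaching some of them to a request with `urllib.request.Request`. All header values below (the User-Agent string, the cookie, the URLs) are invented placeholders, and the request is only built, not sent.

```python
from urllib.request import Request

# Placeholder values; a real crawler would copy these from an actual
# browser session for the site it targets.
headers = {
    "User-Agent": "Mozilla/5.0 (compatible; example-crawler/0.1)",
    "Referer": "http://example.test/list",
    "Accept": "text/html,application/xhtml+xml",
    "Cookie": "sessionid=abc123",
    "X-Requested-With": "XMLHttpRequest",
}

req = Request("http://example.test/detail/1", headers=headers)

# urllib stores header names with only the first letter capitalized,
# so lookups use "User-agent" rather than "User-Agent". HTTP header
# names are case-insensitive, so the server treats them the same.
print(req.get_header("User-agent"))
print(req.get_header("Referer"))
print(sorted(name for name, _ in req.header_items()))
```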
HTTP response format
The HTTP response also consists of four parts, namely: status line, message header, blank line (carriage return + line feed) and response body.
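
The four parts can be seen by splitting a raw response by hand. The sketch below uses a made-up response string and a hypothetical `parse_response` helper; real code would normally let an HTTP library do this.

```python
# A hand-written sample response; "\r\n" is carriage return + line feed.
RAW_RESPONSE = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=utf-8\r\n"
    "Content-Length: 14\r\n"
    "Connection: close\r\n"
    "\r\n"
    "<h1>Hello</h1>"
)


def parse_response(raw):
    """Split a raw HTTP response into status line fields, headers, and body."""
    # The blank line separates the head (status line + headers) from the body.
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    version, status_code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return version, int(status_code), reason, headers, body


version, status_code, reason, headers, body = parse_response(RAW_RESPONSE)
print(version, status_code, reason)
print(headers)
print(body)
```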
Response headers
Response header | Purpose |
---|---|
Location | Used with a redirect status code (such as 302) to tell the client which URL to request next |
Set-Cookie | Sets a cookie associated with the page |
Content-Type | Tells the client the type of the returned data |
Server | Tells the client the type of server software |
Content-Length | Tells the client the length of the returned data |
Connection | Tells the client whether the connection will be kept alive or closed |
HTTP status code
When a user visits a web page, the browser sends a request to the server hosting that page. Before the browser receives and displays the page, the server returns a response header containing the HTTP status code. An HTTP status code consists of three decimal digits. The first digit defines the class of the status code; the last two digits have no classification role. There are 5 classes of HTTP status codes:
Classification | Category description |
---|---|
1** | Informational, the server has received the request and the requester should continue the operation |
2** | Success, the operation was successfully received and processed |
3** | Redirection, further action is required to complete the request |
4** | Client error, the request contained syntax errors or could not be completed |
5** | Server error, the server encountered an error while processing the request |
Common HTTP Status Codes
- 200 - Request succeeded
- 301 - The resource (webpage, etc.) has been permanently moved to another URL
- 404 - The requested resource (page, etc.) does not exist
- 500 - Internal Server Error
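
Because the class is decided by the first digit alone, a crawler can classify any status code with integer division. The sketch below combines that with the standard `http.HTTPStatus` enum for the reason phrase; the `CATEGORIES` table and the `describe` function are names invented for this example.

```python
from http import HTTPStatus

# First digit of the status code -> class, as in the table above.
CATEGORIES = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client error",
    5: "Server error",
}


def describe(code):
    """Return the code, its standard phrase, and its class."""
    category = CATEGORIES.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        # Not every three-digit number is a registered status code.
        phrase = "(non-standard code)"
    return f"{code} {phrase}: {category}"


for code in (200, 301, 404, 500):
    print(describe(code))
```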
HTTPS
- HTTP + SSL/TLS (Secure Sockets Layer, since succeeded by Transport Layer Security), i.e. HTTP carried over an encrypted layer
- Default port number: 443
- The role of HTTPS: encrypts data in transit so that intermediate devices such as routers and switches cannot read or tamper with it.
Current situation
Note: HTTPS is becoming the mainstream, and the server interfaces used by iOS and Android clients are required to support HTTPS.
If one day, when you encounter a problem, you can come up with multiple solutions, and quickly and accurately select the most efficient one, it proves that you are already proficient in this language.