If you want to learn Python crawlers, do you really understand the basics of how crawlers work?

If you want to learn about crawlers, you first need the basics. What is a crawler? What can crawlers do? How does a crawler work? Let's study together; I hope this helps beginners get started.

  • Crawler Definition, Classification and Process
  • HTTP and HTTPS

Crawler definition

A web crawler (also known as a web spider or web robot) is a program that simulates a browser: it sends web requests, receives the responses, and automatically collects information from the Internet according to certain rules. The more closely a crawler imitates a browser's behavior, the less likely it is to be detected. In principle, a crawler can do anything a browser (client) can do.

Classification of crawlers

  • General crawler: usually refers to a search engine's crawler
  • Focused crawler: a crawler that targets specific sites

Uses of crawlers

  • Downloading VIP music
  • Downloading VIP movies
  • Grabbing train tickets on 12306
  • Automatic voting on websites
  • SMS bombing
    etc.

The crawler workflow


  1. Send a request to the starting URL and get a response
  2. Extract content from the response
  3. If a URL is extracted, send a request to it and get its response
  4. If data is extracted, save the data
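
The four steps above can be sketched as a small loop. This is only a minimal illustration using the standard library: the names `LinkAndTitleParser` and `crawl` are made up for this example, the "data" saved is just the page title, and `fetch` is a hypothetical callable that you supply (it takes a URL and returns the HTML text, or None on failure).

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTitleParser(HTMLParser):
    """Collects the href of every <a> tag and the text inside <title>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; fetch(url) is a hypothetical callable returning HTML or None."""
    queue = deque([start_url])
    seen = {start_url}
    saved = {}
    while queue and len(saved) < max_pages:
        url = queue.popleft()
        html = fetch(url)                  # step 1: send a request, get a response
        if html is None:
            continue
        parser = LinkAndTitleParser()
        parser.feed(html)                  # step 2: extract content from the response
        saved[url] = parser.title.strip()  # step 4: extracted data is saved
        for href in parser.links:          # step 3: extracted URLs are queued for requests
            absolute = urljoin(url, href)
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return saved
```

In a real crawler, `fetch` would typically wrap something like `urllib.request.urlopen`; passing it in as a parameter keeps the loop itself easy to try out with canned pages.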

Robots protocol

Robots protocol: through the robots protocol (a robots.txt file), a website tells search engines which pages may be crawled and which may not. It is only a moral constraint, not a technical one. For example: Taobao's robots protocol
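
As a rough illustration of how a crawler can respect these rules, Python's standard `urllib.robotparser` module can evaluate robots.txt rules. The rules, the crawler name `my-crawler` and the `example.test` URLs below are invented for this sketch; they are not Taobao's actual rules.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_robot_rules(robots_txt):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


rules = build_robot_rules(ROBOTS_TXT)
print(rules.can_fetch("my-crawler", "https://example.test/index.html"))
print(rules.can_fetch("my-crawler", "https://example.test/private/data.html"))
```

Against a live site, `RobotFileParser` can also download the file itself via `set_url()` and `read()`.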

HTTP concept

HTTP (Hypertext Transfer Protocol) is an application-layer communication protocol based on the client/server model. It consists of requests and responses and is stateless.

Protocol: The protocol stipulates the data transmission format that both communication parties must abide by, so that the communication parties can communicate accurately according to the agreed format.

Stateless: there is no link between two successive communications. Each request is handled independently, and the server does not record information about earlier or later requests.

HTTP request flow


  1. The browser obtains the IP address from a domain name server (DNS)
  2. The browser sends the first request to that IP address and obtains the corresponding response
  3. The returned response content (HTML) contains the URLs of CSS, JS, images and Ajax code; the browser sends further requests in the order they appear in the response content and gets the corresponding responses
  4. Every time the browser gets a response, it adds (loads) it to the displayed result; JS, CSS and other content modify the page, and JS can also send new requests and get responses
  5. The whole process, from displaying the first response until all responses have arrived and have been added to or have modified the displayed result, is called browser rendering
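
Step 3 is worth seeing concretely: one HTML response leads to several follow-up requests. The sketch below lists the CSS, JS and image URLs found in a page, in document order. The class and function names are made up for this example, it only looks at `<link>`, `<script>` and `<img>` tags, and a real browser considers many more cases.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class SubresourceParser(HTMLParser):
    """Collects URLs from <link href>, <script src> and <img src>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("href"):
            target = attrs["href"]       # usually a stylesheet, sometimes an icon
        elif tag in ("script", "img") and attrs.get("src"):
            target = attrs["src"]
        else:
            return
        # Relative addresses are resolved against the URL of the page itself.
        self.urls.append(urljoin(self.base_url, target))


def find_subresources(base_url, html):
    parser = SubresourceParser(base_url)
    parser.feed(html)
    return parser.urls
```

A crawler that only fetches the first HTML response sees none of the content that JS loads later, which is why the same page can look different in a crawler and in a browser.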

Five-Layer Network Model


HTTP protocol structure diagram


Network Model Correspondence

  1. HTTP, RTSP, FTP -------> Application layer
  2. TCP, UDP -------> Transport layer
  3. IP -------> Network layer
  4. Data link -------> Data link layer
  5. Physical medium -------> Physical layer

URL address format


Format specification: scheme://host[:port]/path/…/[?query-string][#anchor]

  1. scheme: the protocol (e.g. http, https, ftp)
  2. host: the IP address or domain name of the server
  3. port: the port of the server (may be omitted when it is the protocol's default port: 80 for http, 443 for https)
  4. path: the path of the resource to access
  5. query-string: parameters, the data sent to the HTTP server
  6. anchor: the anchor (jump to the specified anchor position in the web page)
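
To see these parts on a concrete URL, the standard `urllib.parse` module can split one. A minimal sketch: `describe_url` and the `DEFAULT_PORTS` table are names invented for this example, and the URLs in the usage note are placeholders.

```python
from urllib.parse import parse_qs, urlsplit

# Default ports used when the URL does not name one explicitly.
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}


def describe_url(url):
    """Split a URL into the six parts listed above."""
    parts = urlsplit(url)
    return {
        "scheme": parts.scheme,
        "host": parts.hostname,
        "port": parts.port or DEFAULT_PORTS.get(parts.scheme),
        "path": parts.path,
        "query": parse_qs(parts.query),  # each parameter maps to a list of values
        "anchor": parts.fragment,
    }
```

For example, `describe_url("https://example.test/search/index.html?q=python&page=2#results")` reports port 443 even though the URL does not spell it out, while `http://example.test:8080/` reports 8080.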

HTTP request


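Request format

An HTTP request consists of four parts: request line, request headers, blank line (carriage return + line feed) and request body.

Request method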

  • According to the HTTP standard, HTTP requests can use several request methods.
  • HTTP/1.0 defines three request methods: GET, POST and HEAD.
  • HTTP/1.1 adds five new request methods: OPTIONS, PUT, DELETE, TRACE and CONNECT.

Request method   Description
GET              Requests the specified page information and returns the entity body.
HEAD             Similar to a GET request, except that the response has no body; used to obtain only the headers.
POST             Submits data to the specified resource for processing (such as submitting a form or uploading a file). The data is included in the request body. A POST request may create new resources and/or modify existing ones.
PUT              The data sent from the client to the server replaces the content of the specified document.
DELETE           Requests the server to delete the specified page.
CONNECT          Reserved in HTTP/1.1 for proxy servers that can switch the connection into a tunnel.
OPTIONS          Allows the client to query which methods and options the server supports.
TRACE            Echoes the request received by the server; mainly used for testing or diagnosis.
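
In Python, the request method is something you choose when building the request. The sketch below uses the standard `urllib.request.Request` class; the helper names are invented for this example and nothing is sent over the network until the request object is passed to `urllib.request.urlopen`.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_get(url):
    """GET: ask for the resource; no request body."""
    return Request(url, method="GET")


def build_head(url):
    """HEAD: same as GET, but the server replies with headers only."""
    return Request(url, method="HEAD")


def build_form_post(url, fields):
    """POST: form fields are URL-encoded and carried in the request body."""
    body = urlencode(fields).encode("utf-8")
    return Request(url, data=body, method="POST")
```

For crawlers, GET and POST cover the vast majority of cases; HEAD is handy for checking a resource's headers, for example its size, without downloading it.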

Common request headers

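Request header                     Purpose
Cookie                             Cookies previously set by the server
User-Agent                         Name of the browser (client) making the request
Referer                            The page the request jumped from
Host                               Host name and port number
Connection                         Connection type
Upgrade-Insecure-Requests          Tells the server the client prefers to be upgraded to HTTPS
Accept                             Content types the client can accept
Accept-Encoding                    Encoding (compression) formats the client can decode
X-Requested-With: XMLHttpRequest   Marks the request as an Ajax request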
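
Since a crawler is less likely to be detected the more it resembles a browser, these headers are what you usually set first. Here is a minimal sketch with the standard library; the function name, the User-Agent string and the header values are illustrative assumptions, not values any particular site requires.

```python
from urllib.request import Request


def build_browser_like_request(url, referer=None, cookie=None):
    """Build a request that carries a few of the common headers listed above."""
    headers = {
        # Illustrative value; real browsers send their own longer strings.
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ExampleCrawler/0.1",
        "Accept": "text/html,application/xhtml+xml;q=0.9,*/*;q=0.8",
    }
    if referer:
        headers["Referer"] = referer
    if cookie:
        headers["Cookie"] = cookie
    return Request(url, headers=headers)
```

The Host header is filled in by `urllib` automatically when the request is sent. Accept-Encoding is left out on purpose: `urllib` does not decompress response bodies for you, so asking for gzip would mean handling the decompression yourself.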

HTTP response format

The HTTP response also consists of four parts, namely: status line, message header, blank line (carriage return + line feed) and response body.
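
To make the four parts concrete, the sketch below splits a raw response into them by hand. It is for illustration only: the function name is invented, the sample bytes in the usage note are made up, and in practice `http.client` or `urllib` does this parsing for you.

```python
def parse_http_response(raw):
    """Split raw response bytes into status line, headers and body."""
    # The blank line (CRLF CRLF) separates the head from the response body.
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")

    # Status line: protocol version, status code, optional reason phrase.
    status_parts = lines[0].split(" ", 2)
    version, code = status_parts[0], int(status_parts[1])
    reason = status_parts[2] if len(status_parts) > 2 else ""

    # Header lines: "Name: value"; names are case-insensitive, so lowercase them.
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()

    return {"version": version, "code": code, "reason": reason,
            "headers": headers, "body": body}
```

Feeding it `b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\nContent-Length: 5\r\n\r\nhello"` yields the version, the code 200, the two headers and the five-byte body as separate fields.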

Response headers

Response header   Purpose
Location          Used with redirect status codes such as 302 to tell the client which address to request next
Set-Cookie        Sets a cookie associated with the page
Content-Type      Tells the client the type of the data sent back
Server            Tells the browser the type of the server
Content-Length    Tells the browser the length of the returned data
Connection        Tells the client whether the server will keep the connection open or close it

HTTP status code

When a user visits a web page, the browser sends a request to the server hosting the page. Before the browser receives and displays the page, the server returns a response header (server header) containing the HTTP status code. An HTTP status code consists of three decimal digits: the first digit defines the class of the status code, and the last two digits have no classification role. There are five classes of HTTP status codes:

Class   Description
1**     Informational: the server has received the request and the requester needs to continue the operation
2**     Success: the operation was successfully received and processed
3**     Redirection: further action is required to complete the request
4**     Client error: the request contained a syntax error or could not be completed
5**     Server error: the server encountered an error while processing the request

Common HTTP Status Codes

  • 200 - Request succeeded
  • 301 - The resource (webpage, etc.) has been permanently moved to another URL
  • 404 - The requested resource (page, etc.) does not exist
  • 500 - Internal Server Error
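
Because the class depends only on the first digit, a crawler can decide how to react with simple integer division. The sketch below also looks up the standard phrase through `http.HTTPStatus` from the standard library; `describe_status` and `STATUS_CLASSES` are names invented for this example.

```python
from http import HTTPStatus

STATUS_CLASSES = {
    1: "informational",
    2: "success",
    3: "redirection",
    4: "client error",
    5: "server error",
}


def describe_status(code):
    """Return (class name, standard phrase) for an HTTP status code."""
    category = STATUS_CLASSES.get(code // 100, "unknown")
    try:
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = ""  # not one of the codes known to the standard library
    return category, phrase
```

A typical use is to save data only for the success class, follow the Location header for the redirection class, and log or retry the rest.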

HTTPS

  • HTTP + SSL/TLS (Secure Sockets Layer, since superseded by Transport Layer Security), i.e. the Hypertext Transfer Protocol over a secure socket layer
  • Default port number: 443
  • The role of HTTPS: encrypt data in transit so that intermediate devices such as routers and switches cannot read or tamper with it.
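
From the crawler's side, the difference is mostly the default port and certificate checking. A minimal sketch with the standard library follows; the function names are invented for this example, the host in the usage note is a placeholder, and no network connection is opened until a request is actually made on the returned object.

```python
import http.client
import ssl


def make_verified_context():
    """TLS settings that check the server certificate and host name."""
    return ssl.create_default_context()


def open_https_connection(host):
    """Prepare an HTTPS connection; the port defaults to 443."""
    return http.client.HTTPSConnection(
        host, context=make_verified_context(), timeout=10
    )
```

Calling `open_https_connection("example.test")` only prepares the connection object; the TLS handshake happens on the first `request()` call.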

Current situation
Note: HTTPS is already the mainstream and will remain so; the interfaces provided to iOS and Android clients are required to support HTTPS.

If one day, when you encounter a problem, you can come up with multiple solutions, and quickly and accurately select the most efficient one, it proves that you are already proficient in this language.


Origin blog.csdn.net/weixin_57577264/article/details/121033127