A record of basic knowledge about crawlers

HTTP principle

1. URI and URL

URI, the full name Uniform Resource Identifier; URL, the full name Uniform Resource Locator. For example, the URL https://github.com/favicon.ico is both a URI and a URL: it identifies an icon resource, favicon.ico, and at the same time specifies the only way to access it, namely the access protocol https, the access path /, and the resource name favicon.ico, so through this URL we can find the specified resource on the network. URL is a subset of URI. URI also contains URN, the Uniform Resource Name, which only names the resource and does not specify how to locate it. Currently almost all URIs are URLs, because URNs are rarely used. What you need to know is that the format of a URL is as follows:

protocol://[username:password@]hostname[:port][/path][;parameters][?query][#fragment]

The parts in square brackets are optional. The www.baidu.com we commonly use contains only the protocol and hostname parts; everything else is omitted. Note that protocol is sometimes also called scheme; the commonly used ones are http, https and ftp.
username and password represent the credentials required for access. Some URLs require them and some do not. For example, in Python 3 Web Crawler Development in Practice there is the URL https://admin:admin@ssr3.scrape.center, which means that https://ssr3.scrape.center requires the username admin and the password admin to be accessed;
hostname can be a domain name or an IP address, indicating the host where the resource is located;
port is the port number, identifying the port on that host through which the resource or service is reached;
path is the location of the resource under the host just described;
parameters carry additional information used when accessing certain resources; I have not distinguished them carefully, and in practice they are rarely seen;
query usually represents query conditions, such as specifying the encoding with ie=utf-8; multiple conditions are separated by &;
fragment is a supplement to the resource description, equivalent to a bookmark; the common case is an anchor. Those who have been exposed to HTML5 should know that an element's id attribute can mark a position as an anchor, and appending #name to the URL jumps straight to that anchor in the page (the Runoob tutorial has examples of this).
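To make these pieces concrete, here is a minimal sketch using Python's standard-library urllib.parse; the path, parameters, query and fragment values in the example URL are made up for illustration:

```python
from urllib.parse import urlparse

# Decompose a URL into the components described above.
result = urlparse('https://admin:admin@ssr3.scrape.center:443/page/1;params?ie=utf-8#anchor')

print(result.scheme)    # 'https'              -> protocol/scheme
print(result.username)  # 'admin'
print(result.password)  # 'admin'
print(result.hostname)  # 'ssr3.scrape.center'
print(result.port)      # 443
print(result.path)      # '/page/1'
print(result.params)    # 'params'
print(result.query)     # 'ie=utf-8'
print(result.fragment)  # 'anchor'
```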

2. A brief introduction to HTTP and HTTPS

URLs support many protocols, the most common being http and https. HTTP is the Hypertext Transfer Protocol, and HTTPS is its secure version, Hypertext Transfer Protocol over Secure Socket Layer. Simply put, an SSL layer of security verification is added on top of HTTP, which establishes an information-security channel and ensures the security of data in transit; the security of a website can be confirmed through the lock mark in the browser's address bar, or by checking the security certificate issued by a CA organization.

3. HTTP requests

When we visit a website, we can paste a URL into the browser's address bar or type it manually, press Enter, and the page appears almost instantly (given a decent network speed and a nearby server). What actually happens is that the browser sends a request to the server where the website lives, and the page we see is the server's response to that request.
[Screenshot: the Network panel in the browser's developer tools, showing the list of requests sent when opening Baidu]
As shown above, this is Baidu's request list: these requests are sent as soon as we open the page. The first entry is the request for Baidu's own URL, and clicking on it displays the following information:
[Screenshot: details of the www.baidu.com request: URL, request method, status code, request headers and response headers]
It shows that the requested URL is www.baidu.com, the request method is GET, and the response status code is 200; if the resource could not be found, the server would respond with 404 instead. Below that are the request header and response header information: the server inspects what is in the request headers and responds accordingly, so that the page is displayed correctly.
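As a sketch of the same round trip in code, the requests library (the one the book itself uses) sends a GET request and lets us inspect exactly the pieces shown in the screenshot:

```python
import requests

# Send the same kind of GET request the browser sends when we press Enter.
response = requests.get('https://www.baidu.com')

print(response.status_code)      # 200 if found; a missing resource would give 404
print(response.request.headers)  # the request headers that were actually sent
print(response.headers)          # the response headers returned by the server
```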

Request method

For requests, an important concept is the request method. The most common methods are GET and POST. Searching on Baidu is a GET request; for example, the link I got when searching for "URL" looks like this: https://www.baidu.com/s?wd=URL&rsv_spt=1&rsv_iqid=0x99541d3100016383&issp=1&f=8&rsv_bp=1&rsv_idx=2&ie=utf-8&rqlang=cn&tn=baiduhome_pg&rsv_enter=1&rsv_dl=tb&oq=WSL%25E7%25BD%2591%25E7%25BB%259C%25E7%25BC%2596%25E7%25A8%258B&rsv_btype=t&inputT=873&rsv_t=499egBufa%2F3R6vAC2x%2Fh8OKsLJ84Cx9C28aV%2B%2FgmS1xyKe0HdJZgqy6l3TlM%2FEuCZ0w5&rsv_pq=d7e0de8000015926&rsv_sug3=41&rsv_sug1=29&rsv_sug7=100&rsv_sug2=0&rsv_sug4=2655. The piece worth paying attention to is s?wd=URL, which carries the query information and shows that the keyword we searched for is "URL". By contrast, when we click "Login" on some login interface, the browser usually initiates a POST request, which is triggered when the form is submitted; the data is sent as form content in the request body and leaves no trace in the URL.
This is the difference between the two: a GET request fetches ordinary information and generally involves nothing sensitive, whereas submitting a username and password via GET would be too dangerous, so a POST request is used instead. POST is also used when uploading files, because file content is large. Besides these two methods, there are also HEAD, PUT, DELETE, OPTIONS, TRACE, and others.
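A small sketch of both methods with requests; httpbin.org is a public echo service used here just to show where the data ends up, and the admin/admin credentials are placeholders:

```python
import requests

# GET: the query travels in the URL itself, e.g. https://www.baidu.com/s?wd=URL
get_resp = requests.get('https://www.baidu.com/s', params={'wd': 'URL'})
print(get_resp.url)  # wd=URL is visible in the final URL

# POST: the data travels in the request body and never appears in the URL.
post_resp = requests.post('https://httpbin.org/post',
                          data={'username': 'admin', 'password': 'admin'})
print(post_resp.json()['form'])  # httpbin echoes back the submitted form
```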

Important information in request headers

Cookies, also written as Cookie, are data that websites store locally on the user's side in order to identify the user and track the session. Their main function is to maintain the current access session, for example the session after logging in: the cookie identifies the session we hold on the server, the browser attaches it to the request headers on every subsequent visit to the site, and the server recognizes our identity, confirms the login state, and returns the corresponding response.
Referer, used to identify the page from which the request originated, to facilitate the server's processing;
User-Agent, UA for short, a special string header that allows the server to identify the operating system and browser used by the client. For a crawler, adding this header disguises it as an ordinary browser; without it, the crawler is easily identified.
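The three headers above can all be set explicitly when crawling. A minimal sketch follows; the UA string is just one plausible browser string, and the cookie value is a placeholder that would really be copied from the browser's developer tools after logging in:

```python
import requests

headers = {
    # Disguise the crawler as an ordinary browser instead of the default
    # 'python-requests/x.y.z' User-Agent, which is easy to detect.
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0 Safari/537.36',
    'Referer': 'https://www.baidu.com/',
    'Cookie': 'sessionid=xxxx',  # placeholder: copy the real value from the browser
}

response = requests.get('https://httpbin.org/headers', headers=headers)
print(response.json())  # httpbin echoes back the headers it received
```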

Request body

For GET requests, the request body is empty; for POST requests, the form data is carried in the body, so you need to pay attention to the Content-Type specified in the request headers: set it to application/json to submit JSON data, to multipart/form-data to upload files, to application/x-www-form-urlencoded to submit form data such as a username and password, and to text/xml to upload XML data.
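With requests, each of these Content-Type values corresponds to a different way of passing the data. A sketch against the httpbin.org echo service; favicon.ico here is just an assumed local file:

```python
import requests

# json= serializes the dict and sets Content-Type: application/json
requests.post('https://httpbin.org/post', json={'name': 'admin'})

# data= with a dict sets Content-Type: application/x-www-form-urlencoded
requests.post('https://httpbin.org/post',
              data={'username': 'admin', 'password': 'admin'})

# files= builds a multipart/form-data body, as used for file uploads
with open('favicon.ico', 'rb') as f:
    requests.post('https://httpbin.org/post', files={'file': f})
```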

The above content comes from the second edition of Python 3 Web Crawler Development in Practice by Cui Qingcai (whose photo is very handsome). Consider this a learning record of mine; there is still a long way to go.
