Spider data mining (1): crawler concepts

1. Concept
A crawler is a program that automatically downloads network resources in batches. It disguises itself as a normal client in the client-server data exchange (the browser is the usual client corresponding to the server; the crawler impersonates one).
Functions:
1. Data collection: a crawler used for data collection is called a "focused crawler", e.g. one that gathers the public data of a particular app, webpage, or piece of software.
2. Search engines: these collect far more broadly than a focused crawler. Baidu, for example, crawls a huge number of pages, stores them on its own servers, and then locates pages by keyword.
3. Simulated operation: disguised as a client, widely used to simulate user actions, e.g. testing bots and spam bots. Because all such requests come from the same IP address, the server can limit the crawler simply by blocking that IP in the back end.

Key difficulties of crawler development:
1. Data acquisition: to keep network resources from being harvested, servers set up many Turing tests to block malicious crawling. A real user interacts with the server slowly and infrequently, while a crawler is fast and frequent; as the number of crawlers grows, the server receives a flood of requests, comes under heavy load, and may crash. Servers therefore generally deploy anti-crawling measures.

Most of today's crawler development work goes into defeating anti-crawling measures, while also throttling the crawler's speed to avoid wasting public resources.

2. Collection speed

In the big-data era huge volumes of data are needed, so high collection speed matters too. Concurrency and distribution are the usual ways to solve the speed problem; this is the other focus of crawler development.
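The concurrency approach mentioned above can be sketched with Python's standard thread pool. The `fetch` function and URLs below are stand-ins for a real downloader, not part of the original text:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a real download; a crawler would issue an HTTP request here.
    return f"<html>content of {url}</html>"

urls = [f"http://example.com/page/{i}" for i in range(8)]

# Download several pages concurrently instead of one at a time.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(fetch, urls))

print(len(pages))  # 8
```

`pool.map` preserves the input order, so results line up with the URL list even though the downloads overlap in time.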
Two, HTTP and HTTPS (about 90% of network requests go through these two protocols)
Request: what the client sends to the server
Response: the result of the server's processing
Network architectures:
1. C/S: client (e.g. desktop WeChat, QQ Music, online games) and server
2. B/S: browser (e.g. a music site opened in the browser) and server
3. M/S: mobile terminal (a phone client, e.g. mobile WeChat and QQ) and server
HTTP protocol:
1. Origin: just as humans need a shared grammar to communicate, computers need an agreement (also called rules or conventions) between them so that each side is guaranteed to understand the other's messages.
2. Concept and characteristics:
HTTP is short for HyperText Transfer Protocol ("hypertext" covers media types beyond plain text, such as music); its role is to transfer hypertext from a web server to the local browser.
HTTP transfers data on top of the TCP/IP communication protocols (protocols at a lower level than HTTP). TCP is used chiefly because it is connection-oriented.
The OSI seven layers (from top to bottom):
Application layer: the interface between network services and end users. Protocols: HTTP, FTP, TFTP, SMTP, SNMP, DNS, TELNET, HTTPS, POP3, DHCP.
Presentation layer: data representation, security, and compression (merged into the application layer in the five-layer model). Formats: JPEG, ASCII, EBCDIC, encryption formats, etc.
Session layer: establishes, manages, and terminates sessions (merged into the application layer in the five-layer model). A session here is the ongoing conversation between a process on the local host and one on the remote host.
Transport layer: defines the protocol port numbers for data transmission, plus flow control and error checking. Protocols: TCP, UDP. Once a data packet leaves the network card, it enters the transport layer.
Network layer: performs logical addressing to select paths between different networks. Protocols: ICMP, IGMP, IP (IPv4, IPv6).
Data link layer: establishes logical connections, performs hardware (MAC) addressing and error checking (the protocols are defined by the underlying network). Bits are combined into bytes and bytes into frames; the medium is accessed by MAC address; errors are detected but cannot be corrected.
Physical layer: establishes, maintains, and tears down physical connections (defined by the underlying network).
In practice computers often use a four-layer model, from high to low: application layer (where HTTP is implemented), transport layer (where TCP is implemented; "connection-oriented", preserving data integrity), network layer (where IP is implemented), and link layer.
Because the transport and network layers are used so extensively, the stack is sometimes described with the two combined.
(The application layer only cares about the application's logic, not the data-transmission activity on the network; the three layers below it handle the actual communication details.)
Crawler development happens at the application layer.

Use of the HTTP protocol (a request is also called locating a resource):
1. The basic flow of an HTTP exchange consists of a request and a response.
Before the request, the client and server must establish a TCP connection. The TCP handshake has three steps: the client sends a connection request, the server acknowledges it, and the client confirms receipt of the acknowledgment (confirmation could loop forever, so TCP caps it at 3 steps).
A two-step handshake would waste resources when the server is congested; the three-way handshake avoids this, but leaves the server open to SYN attacks: a hacker forges a large number of nonexistent IP addresses (nonexistent hosts) and sends SYN requests from them, making the server allocate resources and burn CPU on half-open connections; in severe cases the server is paralyzed. Defense against SYN attacks: the server discards or blocks requests that occupy resources for too long.
The three-way handshake begins the exchange; the four-way wave ends it.
Four-way wave: whichever side no longer wants the connection sends the close request.
URL (Uniform Resource Locator, globally unique): when an HTTP request is sent, the network resource is located by its URL. Each URL is unique and consists of "protocol + domain name (or IP address) + port (default 80, may be omitted) + path + parameters".
A domain name is converted to an IP address through DNS; the path identifies the file to fetch once the server is reached through the port.
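The URL structure described above can be inspected with Python's standard `urllib.parse`; the sample URL here is illustrative:

```python
from urllib.parse import urlparse

# Split a URL into the parts described above:
# protocol (scheme), domain name, port, path, parameters (query).
u = urlparse("http://ss0.bdstatic.com:80/path/to/cat.jpg?size=large")
print(u.scheme, u.hostname, u.port, u.path, u.query)
```

When the port is omitted from the URL, `u.port` is `None` and the protocol's default (80 for HTTP, 443 for HTTPS) applies.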

HTTP request format: request line, request headers, blank line, request body.
In the request message, fields are separated by carriage return + line feed (\r\n).
The request body usually carries data sent with the POST method; the GET method has no request body (GET fetches data rather than submitting it).
1. Request line: method + space + URL + space + protocol version (e.g. HTTP/1.0) + carriage return (\r) + line feed (\n)
2. Request headers (there can be several): header field name + ": " + value + \r\n (with an extra \r\n pair before any request data). Most of a crawler's disguise lives in the request headers; the stricter the anti-crawling, the more headers you need.
3. Request data (separated from the headers by a blank \r\n line)
Request methods: GET (fetch data), POST (submit a form), HEAD (these three exist in HTTP/1.0); OPTIONS, PUT, DELETE, TRACE, CONNECT (five more added in HTTP/1.1). Some of these are not enabled on every server.
GET: less secure, but convenient and efficient; the query string is limited to 1024 bytes.
POST: no size limit, higher security.
Request headers: Referer (tells the server which link the request came from). With the Host header, the domain name moves from the request line into the header (Host: ss0.bdstatic.com), leaving only the path and parameters in the request line; this is conventional, not mandatory.
Connection: the connection mode, either close (wave four times and close once the data transfer finishes) or keep-alive (a long-lived connection that stays open after the transfer, suited to sending multiple files).
Cookie: an extension field stored on the client.
Blank line: after the last request header comes a blank line (an extra \r\n pair) telling the server there are no more headers below.
Request body: unused by GET but used by POST (suited to occasions where the user fills in a form). It is most commonly accompanied by headers giving the body's content type and length.
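As a sketch of the format above, the raw bytes of a minimal GET message (request line, Host and Connection headers, blank line, no body) can be assembled like this; the host is the one used later in this article:

```python
# Assemble a raw GET request: request line, headers, blank line (no body for GET).
host = "ss0.bdstatic.com"
path = "/"
request = (
    f"GET {path} HTTP/1.0\r\n"   # request line: method + URL + version
    f"Host: {host}\r\n"          # domain name moved out of the request line
    f"Connection: close\r\n"     # close after the transfer (four-way wave)
    f"\r\n"                      # blank line: no more headers follow
).encode()                       # HTTP messages go on the wire as bytes
print(request)
```

Every field ends in `\r\n`, and the extra `\r\n` at the end is the blank line that separates the headers from the (empty) body.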

HTTP response format: status line, response headers, blank line, response body.
1. Status line: protocol version + space + status code + space + status description (usually OK) + carriage return + line feed
2. Response headers: header field name + ": " + value + \r\n (with an extra \r\n pair before the response data)
3. Response body
Status codes consist of three digits; the first digit gives the response type (1 means the server received the request; 2 means received and processed; 3 means the client must redirect; 4 means the request is invalid; 5 means the server failed to process it normally).
307 redirect: when a website is upgraded, so that users of the old address are not left stranded and lost, the server automatically redirects them to the upgraded URL.
When the resource targeted by a request does not support the request method, 405 is returned; when the server does not recognize the method at all, 501 is returned. A given HTTP server may support extended custom methods.
Status descriptions: OK means the request succeeded; Bad Request means a syntax error; Unauthorized means not authorized; Forbidden means the server refuses to provide service; Not Found means the requested resource does not exist; Internal Server Error means a server error occurred; Service Unavailable means the server temporarily cannot handle the request.
Response headers: Allow (which methods the server supports), Set-Cookie (sends a cookie to the client browser).
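A minimal sketch of parsing a response in the format above (status line, headers, blank line, body); the raw bytes are a made-up example:

```python
# Split a raw response at the blank line, then pick apart the status line
# and the "Name: value" header fields.
raw = b"HTTP/1.0 200 OK\r\nContent-Length: 5\r\nSet-Cookie: sid=abc\r\n\r\nhello"
head, _, body = raw.partition(b"\r\n\r\n")       # blank line separates head/body
lines = head.decode().split("\r\n")
version, code, reason = lines[0].split(" ", 2)   # status line: version, code, description
headers = dict(line.split(": ", 1) for line in lines[1:])
print(code, reason, headers["Content-Length"], body)
```

Note that `Content-Length` counts the body bytes only; this is what lets a client know when it has received the whole response.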

HTTPS (HyperText Transfer Protocol Secure): an HTTP channel with security as its goal, the secure version of HTTP. HTTP runs on TCP/IP; HTTPS adds an SSL/TLS layer on top of that for encrypted transmission.
HTTPS defaults to port 443; HTTP defaults to port 80.

Third, session technology
HTTP is stateless: it remembers nothing about previous transactions. To keep a user logged in, session technology is used to tie consecutive requests to the same user. There are two mechanisms: cookies and sessions.
Cookie (essentially a credential): data stored on the user's local machine for session tracking, so the website can recognize the same user. The current specification is RFC 6265.
1. In effect a special piece of authentication information issued by the server to the client
2. Stored in a text file on the client
3. Sent back to the server with every request, so the server knows which user it is
Once the cookie has been issued, the server only needs to verify the cookie to keep the user logged in.
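Python's standard `http.cookies` module can illustrate both sides of the credential exchange described above; the cookie name and value are made up:

```python
from http.cookies import SimpleCookie

# Server side: a Set-Cookie header issues the credential to the client.
jar = SimpleCookie()
jar.load("sessionid=3f9a1c; Path=/")

# Client side: every later request carries the credential back
# to the server in a Cookie header.
cookie_header = "Cookie: " + "; ".join(
    f"{name}={morsel.value}" for name, morsel in jar.items()
)
print(cookie_header)  # Cookie: sessionid=3f9a1c
```

The server then matches `sessionid` against its own records instead of asking for the password again.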

Session: for example, the span from opening a browser window to closing it is one session. The difference from a cookie lies in its purpose: during that span, all requests initiated can be recognized as coming from the same user.
By contrast, a session's data is stored on the server, which judges identity only by the sessionID passed in from the client, so sessions are more secure; a cookie could be tampered with.
The sessionID is generally discarded when the browser closes, or the server checks the session's activity and invalidates it after a period of inactivity. Sessions are built on cookies: the sessionID is carried in a temporary cookie.
For example: when a user enters a password and the server verifies it successfully, the server stores a key: hash entry in its session table and gives the user a temporary cookie with sessionID=hash (kept in the browser). When the user then opens other pages (say, visiting QQ Zone and then QQ Mail), only the sessionID is verified again (provided the browser has not been closed), instead of the password.
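The password-then-sessionID flow above can be sketched with a plain dictionary standing in for the server's session table; the names and values here are illustrative, not a real authentication scheme:

```python
import hashlib
import secrets

session_table = {}  # server-side storage: sessionID -> user data

def login(username, password):
    # After the password check succeeds (elided here), store a record
    # server-side and hand back only the sessionID, which travels to
    # the browser in a temporary cookie.
    sid = hashlib.sha256(secrets.token_bytes(16)).hexdigest()
    session_table[sid] = {"user": username}
    return sid

def authenticate(sid):
    # Later requests are recognized by sessionID alone; no password needed.
    return session_table.get(sid)

sid = login("alice", "s3cret")
print(authenticate(sid))      # {'user': 'alice'}
print(authenticate("bogus"))  # None
```

Because the table lives on the server, a client cannot forge a valid session by editing its cookie: an unknown sessionID simply finds no entry.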
Fourth, proxies
Concept: a proxy is a proxy server whose job is to fetch network information on the user's behalf, like a relay station on the network. With a proxy server, the web server sees only the proxy's IP rather than the client's, so our machine's real IP is disguised.

Functions:
1. Break through the limits on your own IP and visit sites it cannot reach (e.g. if a foreign server blocks domestic users, a foreign proxy server lets domestic users through).
2. Access the internal resources of certain organizations: e.g. with a proxy server installed inside an IP range allowed by the education network, those resources become reachable through the proxy.
3. Speed up access: inserting a proxy adds a hop and would by itself slow things down, but the proxy keeps a large disk cache; when the same information is requested again it is served straight from the proxy's cache instead of the web server, which is faster.
4. Hide the real IP: avoid attacks and keep your own IP from being banned.
The proxy's role for crawlers: using a proxy hides the real IP and makes the server believe the request comes from the proxy server; by constantly switching proxies while crawling, the crawler avoids being blocked.
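With the standard `urllib`, routing through a proxy is a configuration sketch like the following; the proxy address is hypothetical and no request is actually sent here:

```python
import urllib.request

# Route requests through a (hypothetical) proxy so the target server
# only ever sees the proxy's IP, not ours.
proxies = {"http": "http://127.0.0.1:8888", "https": "http://127.0.0.1:8888"}
handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(handler)

# opener.open("http://example.com") would now go via the proxy;
# rotating the proxies dict between requests changes the apparent IP.
print(handler.proxies)
```

Swapping in a fresh `proxies` dict (and rebuilding the opener) is the "constantly changing the proxy" step described above.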

Socket: an object-oriented abstraction layer between the application layer and the transport layer that makes calls between them easier (it "plugs" the two layers together) and provides the programming interface for network communication.
Building a server:
1. Create the socket, then server.bind to attach the address
2. Listen: server.listen(5) allows a backlog of 5 pending connections (watching for who will connect)
3. Wait for a connection: server.accept blocks until a client arrives
4. Receive data: conn.recv(1024), where 1024 caps the number of bytes read
5. Send data: conn.send(response.encode())
6. Close: server.close()
The server sends after receiving a request, then receives again before the next send.
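Steps 1-6 above can be assembled into a minimal runnable server; here a throwaway client on localhost exercises it so the sketch runs offline. The port and response bytes are made up:

```python
import socket
import threading

def serve_once(server):
    # Steps 3-6: accept, recv, send, close.
    conn, addr = server.accept()            # blocks until a client connects
    request = conn.recv(1024)               # read at most 1024 bytes
    conn.send(b"HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok")
    conn.close()
    server.close()

# Steps 1-2: bind to a free local port and listen with a backlog of 5.
server = socket.socket()
server.bind(("127.0.0.1", 0))               # port 0: let the OS pick a free port
server.listen(5)
port = server.getsockname()[1]
threading.Thread(target=serve_once, args=(server,)).start()

# A throwaway client to exercise the server.
client = socket.socket()
client.connect(("127.0.0.1", port))
client.send(b"GET / HTTP/1.0\r\n\r\n")
reply = client.recv(4096)
client.close()
print(reply)
```

Because `listen` is called before the thread starts, the client's connection is queued in the backlog even if `accept` has not run yet.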

Building a client (downloading a picture at the following domain through a raw client request):

If you are visiting over HTTP:
1. img_url = "http://ss0.bdstatic.com~~~~~" (the picture's URL)
2. client = socket.socket()
   client.connect(("ss0.bdstatic.com", 80)) to establish the connection

If you are visiting over HTTPS (points 1 and 2 must change):
1. import ssl (the encryption layer for HTTPS)
   img_url = "https://ss0.bdstatic.com~~~~~" (the picture's URL)
2. client = ssl.wrap_socket(socket.socket())
   client.connect(("ss0.bdstatic.com", 443)) to establish the connection; note that connect takes a tuple.

3. Construct the request message (the content must be bytes, hence the b prefix):
   data = b"GET / HTTP/1.0\r\nHost~~~~~" (where / is the root path of the connection above; in full: request line, request headers, blank line, request body)
4. Send the request:
   client.send(data)
   image_data = b"" (a variable to accumulate the image bytes the server returns)

5. Receive the data (a receiving loop with control flow can be set up):
   first_data = client.recv(4096) (4096 caps the bytes received per call)
   length = int(re.findall(rb"Content-Length: (.*?)\r\n", first_data)[0]) (findall returns a list, so element 0 is taken; the byte length is extracted with a regular expression so we know later when to stop)
   image_data = re.findall(rb"\r\n\r\n(.*)", first_data, re.S) (extract the start of the request body from first_data; re.S lets (.*) match any character, including the line breaks it would otherwise not match)
   To guard against the server returning no body yet, add a check so the later code does not error:
   if image_data:
       image_data = image_data[0]
   else:
       image_data = b"" (if nothing came back, assign the empty byte string)
   while len(image_data) < length:
       temp = client.recv(4096)
       image_data += temp
   (when the number of bytes in image_data reaches length, the download is complete and the loop ends)
6. Disconnect:
   client.close()
   with open("socket_cat.jpg", "wb") as f: (write the received bytes into a file; if socket_cat.jpg does not exist it is created in the same directory)
       f.write(image_data)
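The fragments above can be assembled into one runnable sketch. To keep it self-contained it talks to a local stand-in server instead of ss0.bdstatic.com (the fake image bytes and port are made up), but the receiving logic, extracting Content-Length and looping until the body is complete, is the same as in the walkthrough:

```python
import re
import socket
import threading

# A local stand-in for the image server so the sketch runs offline.
IMAGE = b"\x89PNG fake image bytes"

def fake_server(server):
    conn, _ = server.accept()
    conn.recv(4096)
    conn.send(b"HTTP/1.0 200 OK\r\nContent-Length: " +
              str(len(IMAGE)).encode() + b"\r\n\r\n" + IMAGE)
    conn.close()
    server.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))      # port 0: OS picks a free port
server.listen(5)
port = server.getsockname()[1]
threading.Thread(target=fake_server, args=(server,)).start()

# The client from the walkthrough, pointed at the local server.
client = socket.socket()
client.connect(("127.0.0.1", port))
client.send(b"GET / HTTP/1.0\r\nHost: 127.0.0.1\r\n\r\n")

first_data = client.recv(4096)
# Pull the body length out of the headers so we know when to stop.
length = int(re.findall(rb"Content-Length: (\d+)\r\n", first_data)[0])
# The bytes after the blank line are the start of the body.
found = re.findall(rb"\r\n\r\n(.*)", first_data, re.S)
image_data = found[0] if found else b""
while len(image_data) < length:
    image_data += client.recv(4096)
client.close()
print(len(image_data) == length)  # True
```

For a real HTTPS host the only changes are the ones listed in points 1 and 2: wrap the socket with ssl and connect to port 443.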

A cookie can also be added to the request headers when initiating the request.


Origin: blog.csdn.net/qwe863226687/article/details/114116702