[Reproduced] Common ideas for crawlers to get around anti-crawling measures

Original link: https://blog.51cto.com/14237227/2362691

From the crawler's standpoint

The purpose of a crawler is to obtain data at scale over a long period. However, if one IP is always used to crawl a website, the large-scale, concentrated access may eventually be rejected by the server, or the crawler may be asked to solve a verification code; even rotating through multiple accounts does not always avoid the verification code.

The following tips are commonly used by crawlers:

Tip 1: Set a download wait time / download frequency
Large-scale, concentrated access puts heavy load on the server and is easy for the server to block. The crawler can increase the interval between requests, which is less likely to attract the server's attention.
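For instance, a minimal sketch using the `requests` library (the URLs are placeholders), sleeping a random 1-3 seconds between requests:

```python
import random
import time

import requests

# Placeholder list of pages to crawl; replace with the real targets.
urls = ["https://example.com/page/%d" % i for i in range(1, 6)]

for url in urls:
    resp = requests.get(url, timeout=10)
    print(url, resp.status_code)
    # Wait a random 1-3 seconds before the next request so the access
    # pattern is spread out instead of hitting the server in a burst.
    time.sleep(random.uniform(1, 3))
```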

Tip 2: Modify the User-Agent
The most common way to disguise the crawler as a browser is to modify the User-Agent.
Concretely, change the User-Agent value to one a browser would send, and even set up a User-Agent pool (a list, array, or dictionary will do) that stores multiple "browsers"; randomly pick one User-Agent from the pool for each request, so the User-Agent keeps changing and the crawler is less likely to be blocked.
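A minimal sketch of a User-Agent pool with the `requests` library (the User-Agent strings are just illustrative browser values, and the URL is a placeholder):

```python
import random

import requests

# A small pool of "browsers"; real crawls usually collect many more strings.
USER_AGENT_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def fetch(url):
    # Randomly pick one User-Agent per request so it keeps changing.
    headers = {"User-Agent": random.choice(USER_AGENT_POOL)}
    return requests.get(url, headers=headers, timeout=10)

resp = fetch("https://example.com")  # placeholder URL
print(resp.request.headers["User-Agent"], resp.status_code)
```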

Tip 3: Set cookies.
Cookies are encrypted data stored on the client side, and some websites use them to identify users. If a given visitor keeps sending requests very frequently, the website is likely to notice and suspect a crawler; it can then identify that visitor through the cookie and refuse further access. Two common countermeasures:
1. Customize the cookie policy (to handle "Cookie rejected" problems where cookies are refused). The first article in this series shows custom cookie policy settings, but the examples in the official documentation are the better reference; the setup is essentially the same, although the HttpClient 4.3.1 component is written differently from earlier versions. See the official documentation: http://hc.apache.org/httpcomponents-client-4.3.x/tutorial/html/statemgmt.html#d5e553
2. Disable cookies. Here the client actively prevents the server from writing cookies, which stops websites that may use cookies to identify crawlers from banning us. In a Scrapy crawler, set COOKIES_ENABLED = False, i.e. do not enable the cookies middleware and do not send cookies to the web server (see the settings sketch below).
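For Scrapy, a minimal sketch of the relevant lines in the project's settings.py:

```python
# settings.py of the Scrapy project (sketch).
# Disable the cookies middleware: no cookies are stored or sent back
# to the web server.
COOKIES_ENABLED = False

# Optional: while debugging, log the cookies seen in requests/responses.
COOKIES_DEBUG = True
```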

Tip 4: Distributed crawling
There are many GitHub repos for distributed crawling. The principle is to maintain a distributed queue that all machines in the cluster can share.
Distributed crawling serves another purpose as well: for large-scale crawling, a single machine is heavily loaded and slow, so multiple machines can be organized with one master managing several slaves that crawl at the same time.
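A minimal sketch of the shared-queue idea, assuming a Redis server on localhost and the redis-py and requests packages (projects such as scrapy-redis implement the same idea more completely):

```python
import redis
import requests

# Shared queue that every machine in the cluster connects to.
r = redis.Redis(host="localhost", port=6379)

def seed(urls):
    # The master pushes the start URLs into the shared queue.
    for url in urls:
        r.lpush("crawl:queue", url)

def worker():
    # Each slave blocks until a URL is available, then fetches it.
    # New links discovered on the page could be pushed back with lpush.
    while True:
        _, url = r.brpop("crawl:queue")
        resp = requests.get(url.decode(), timeout=10)
        print(url.decode(), resp.status_code)

if __name__ == "__main__":
    seed(["https://example.com"])  # placeholder seed URL
    worker()
```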


Source: blog.csdn.net/u010472858/article/details/104290453