Several approaches to collecting WeChat official account articles

Option One: Through the Sogou search entrance

Most of the articles you can find online about collecting official account content take this approach. It is the most common, most direct, and simplest option.
The general process is:

  1. Search for the official account through the Sogou WeChat search portal

  2. Select the account to open its list of past articles

  3. Get the article links from the list, then fetch each article's content through its link

  4. Parse the article content and store it
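The first three steps can be sketched roughly as follows. This is only a minimal illustration, not a working crawler: the search URL pattern (`weixin.sogou.com/weixin` with `type=1`), the `mp.weixin.qq.com/s` article link pattern, and the helper names `build_search_url` and `extract_article_links` are all assumptions made for the example, and the real pages may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Assumed search endpoint; type=1 is assumed to mean "search accounts".
SOGOU_SEARCH = "https://weixin.sogou.com/weixin"


def build_search_url(account_name):
    """Step 1: build the search URL for an official account name."""
    return SOGOU_SEARCH + "?" + urlencode({"type": 1, "query": account_name})


class _LinkCollector(HTMLParser):
    """Collects <a href> targets that look like article links."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        full = urljoin(self.base_url, href)
        # Assumed article URL pattern; skip duplicates, keep page order.
        if "mp.weixin.qq.com/s" in full and full not in self.links:
            self.links.append(full)


def extract_article_links(list_html, base_url="https://mp.weixin.qq.com"):
    """Step 3: pull article links out of an article-list page."""
    collector = _LinkCollector(base_url)
    collector.feed(list_html)
    return collector.links
```

Fetching each link and parsing the article body (step 4) would follow the same pattern, with whatever storage you already use.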

If you collect too frequently, both the Sogou search and the account's article history list will start showing a CAPTCHA. An ordinary collection script cannot get at the CAPTCHA directly. Instead, you can access the pages with a headless browser and hand the CAPTCHA to a CAPTCHA-solving platform for recognition. Selenium can be used to drive the headless browser.
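A rough sketch of the headless-browser idea with Selenium might look like this. The CAPTCHA markers and the `solve_captcha` hook (standing in for whatever CAPTCHA-solving platform you connect to) are hypothetical; only the Selenium calls themselves are real API.

```python
# Assumed signs that we were redirected to a CAPTCHA page; adjust to
# whatever the real pages contain.
CAPTCHA_MARKERS = ("antispider", "请输入验证码")


def looks_like_captcha(url, html):
    """Heuristic check for a CAPTCHA page (markers are assumptions)."""
    text = (url + " " + html).lower()
    return any(marker.lower() in text for marker in CAPTCHA_MARKERS)


def fetch_with_headless_browser(url, solve_captcha=None):
    """Load a page in headless Chrome and return its HTML.

    solve_captcha is a hypothetical callback that receives the driver,
    sends the CAPTCHA to a solving platform, and submits the answer.
    Returns None if a CAPTCHA appears and no solver was given.
    """
    from selenium import webdriver  # third-party; imported lazily

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        if looks_like_captcha(driver.current_url, driver.page_source):
            if solve_captcha is None:
                return None
            solve_captcha(driver)
        return driver.page_source
    finally:
        driver.quit()
```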

Even a headless browser has problems of its own:

  • It is inefficient (you are in effect running a full browser to simulate a human operator)

  • The resources the browser loads are hard to control, and so is the script that drives the browser

  • CAPTCHA recognition never reaches 100%, so the crawl is likely to be interrupted midway

If you insist on using the Sogou entrance and want complete coverage, the only way is to add proxy IPs. Incidentally, forget about free public proxy IPs: they are very unstable, and almost all of them have already been blocked by WeChat.
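As one possible way to manage proxy IPs, the sketch below rotates through a list and drops a proxy after repeated failures. The `ProxyPool` class and its `max_failures` threshold are illustrative assumptions, not something the original setup prescribes.

```python
import urllib.request


class ProxyPool:
    """Round-robin pool that skips proxies after too many failures."""

    def __init__(self, proxies, max_failures=3):
        self._order = list(proxies)
        self.failures = {p: 0 for p in self._order}
        self.max_failures = max_failures
        self._next = 0

    def get(self):
        """Return the next usable proxy, or raise if none are left."""
        alive = [p for p in self._order
                 if self.failures[p] < self.max_failures]
        if not alive:
            raise RuntimeError("no usable proxies left")
        proxy = alive[self._next % len(alive)]
        self._next += 1
        return proxy

    def report_failure(self, proxy):
        self.failures[proxy] += 1

    def report_success(self, proxy):
        self.failures[proxy] = 0


def build_opener_for(proxy):
    """Create a urllib opener that sends requests through the proxy."""
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)
```

A crawl loop would call `pool.get()`, fetch through `build_opener_for(proxy)`, and report success or failure (for example, when a CAPTCHA page comes back) so that bad proxies drop out of rotation.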

Besides having to face the anti-crawler mechanisms of Sogou and WeChat, this option has other disadvantages:

  • You cannot get read counts, like counts, and other key information used to assess article quality

  • You cannot learn promptly that an account has published something new; all you can do is re-crawl on a schedule

  • You can only get the ten most recent mass-sent articles

Option Two: Man-in-the-middle interception of mobile WeChat

A man-in-the-middle attack is a hacking technique for intercepting the information exchanged between a client and a server. The idea of this option is to set up an "HTTPS proxy" between mobile WeChat and the WeChat servers, and use it to intercept the official account article data that mobile WeChat fetches. The general steps are:

  • Search for an official account in mobile WeChat

  • Tap through to the account's article history page

  • The proxy recognizes that the list page has been opened and intercepts its content; depending on the situation, it returns JS code that either keeps scrolling down or moves on to crawl a new account

This option can be automated because:

  • WeChat official accounts use the HTTPS protocol, and the content itself is not encrypted beyond that

  • The article list and article detail pages are essentially web pages, so JS code can be injected to control them
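The proxy side can be sketched as a mitmproxy addon. This is a hedged illustration only: the history-page URL pattern (`/mp/profile_ext` with `action=home`), the injected scrolling script, and the class name `HistoryPageInterceptor` are assumptions. The phone also needs to trust the proxy's CA certificate before HTTPS traffic can be read.

```python
# Injected into the list page so it keeps loading older articles.
SCROLL_JS = """<script>
setInterval(function () {
  window.scrollTo(0, document.body.scrollHeight);
}, 2000);
</script>"""


def is_history_page(host, path):
    """Assumed URL pattern of an account's article history page."""
    return (host == "mp.weixin.qq.com"
            and path.startswith("/mp/profile_ext")
            and "action=home" in path)


def inject_script(html, script=SCROLL_JS):
    """Insert the script just before </body>, or append if absent."""
    idx = html.rfind("</body>")
    if idx == -1:
        return html + script
    return html[:idx] + script + html[idx:]


class HistoryPageInterceptor:
    """mitmproxy addon: capture list pages and inject the scroll script."""

    def __init__(self):
        self.captured = []  # raw list-page HTML, to be parsed and stored

    def response(self, flow):
        request = flow.request
        if not is_history_page(request.pretty_host, request.path):
            return
        html = flow.response.text
        self.captured.append(html)
        flow.response.text = inject_script(html)


# mitmproxy picks up addons from this module-level list.
addons = [HistoryPageInterceptor()]
```

Saved as a script, this could be loaded with `mitmdump -s`, with the phone's Wi-Fi proxy pointed at the machine running it. Switching to a new account would need a different injected script that navigates to another account's history page.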

The advantages of this option:

  • Under normal circumstances it does not get blocked

  • You can get like counts, read counts, and other information for assessing articles

  • You can get all of an account's past articles

Of course, there are also many disadvantages:

  • It requires a physical phone that stays online long-term

  • The proxy has to be set up in advance, which is a fair amount of work

  • It is essentially a polling process rather than a real-time push

  • There is again the risk of uncontrolled web page loading, and the local network environment has a very large impact

  • Whenever the WeChat interface changes, the code no longer fits and has to be adapted

This option also has some variants, such as:

  • Controlling the official account search with a Lua script instead of JS code returned by the proxy

  • Driving the WeChat PC client through GUI automation scripts

These variants, however, share the drawback that control is neither precise nor stable.

Option Three: Packet-capture analysis of web WeChat

After being tormented by WeChat's anti-crawler measures for a long time, my colleagues and I brainstormed to find a new way of crawling official account articles, looking at which entrances data could be obtained from. I vaguely remembered that web WeChat has a feature for reading official account articles, and I happened to have played with personal WeChat bots for a while, mainly using the Python package ItChat. It works by packet-capture analysis of web WeChat, consolidating the results into individual WeChat interfaces; its goal is to implement everything web WeChat can do. So a preliminary plan emerged: use ItChat to have the official accounts push their articles to us. I mentioned it to a colleague after work; he was very interested too, and the next day he produced code that validated the idea (the corresponding ItChat code is very short, and the content parsing had been done before, so it could be reused directly).

The main flow of this option is:

  • Follow the target official accounts in mobile WeChat

  • Log in to web WeChat on the server through ItChat

  • When an official account pushes a new article, the server intercepts it, then parses and stores it
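The flow above might be sketched with ItChat as follows. Treat it as an assumption-laden outline rather than the code we actually ran: the XML layout expected inside `msg["Content"]` (a set of `<item>` elements with `<title>` and `<url>` children) and the `on_articles` callback are illustrative guesses.

```python
import xml.etree.ElementTree as ET


def extract_articles(content_xml):
    """Pull (title, url) pairs out of a push message body.

    Assumes each pushed article is an <item> element with <title>
    and <url> children; the real message layout may differ.
    """
    root = ET.fromstring(content_xml)
    articles = []
    for item in root.iter("item"):
        title = (item.findtext("title") or "").strip()
        url = (item.findtext("url") or "").strip()
        if url:
            articles.append({"title": title, "url": url})
    return articles


def run_listener(on_articles):
    """Log in to web WeChat and forward official account pushes.

    on_articles is a hypothetical callback that parses and stores
    the list of articles it receives.
    """
    import itchat  # third-party; imported lazily
    from itchat.content import SHARING

    @itchat.msg_register(SHARING, isMpChat=True)
    def handle_push(msg):
        try:
            articles = extract_articles(msg.get("Content", ""))
        except ET.ParseError:
            # Fall back to the single link carried by the message.
            articles = [{"title": msg.get("FileName", ""),
                         "url": msg.get("Url", "")}]
        on_articles(articles)

    itchat.auto_login(hotReload=True)  # scan the QR code once
    itchat.run()
```

Calling `run_listener(print)` would simply print each batch of pushed articles; a real deployment would pass a function that fetches and stores them.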

The advantages of this option are:

  • You learn about newly published articles with essentially zero delay

  • You can get like counts and read counts

  • You only need to keep mobile WeChat logged in; no other operations are required

Of course, there are also disadvantages:

  • It requires a phone that stays online long-term

  • Mobile WeChat must not be logged out deliberately or stay offline for a long time

  • A single WeChat account can follow only a limited number of official accounts per day

  • Newly registered WeChat accounts cannot log in to web WeChat, so they cannot be used for this option

  • You only get newly published articles; past articles cannot be retrieved

Collecting official account articles is basically a battle of wits with Tencent, and it takes a lot of effort. So far we have not found a perfect solution; you can only pick the best fit for your actual collection goal. If you want everything to run on the server side without relying on mobile WeChat, do not need like and read counts, and have a large pool of proxy IPs, use Option One. If your local network is stable and you have plenty of spare phones, use Option Two. If you need timely access to the latest articles from official accounts, use Option Three.

Origin blog.csdn.net/qq_40125653/article/details/96100899