Tool: cloud mining crawler
Goal: Collect all the Weibo posts of a single blogger
Analyze web page structure:
Our approach is to crawl pages by simulating the way a browser visits them automatically.
Let's take a look at the page structure. Each microblog list is lazy-loaded: it takes three or four loads before the whole list is shown. Once a page-turning button appears at the bottom, the page can be considered fully loaded.
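The "keep loading until the page-turning button appears" rule can be sketched as a small loop. This is only an illustration under assumptions, not how the crawler tool implements it: `scroll_down` and `pager_visible` are hypothetical callbacks that you would wire to your own browser automation (for example, a scroll-to-bottom script and a check for the pagination element).

```python
import time
from typing import Callable


def scroll_until_loaded(
    scroll_down: Callable[[], None],
    pager_visible: Callable[[], bool],
    max_scrolls: int = 10,
    wait_seconds: float = 1.0,
) -> bool:
    """Scroll until the page-turning button shows up.

    Both callbacks are hypothetical hooks supplied by the caller.
    Returns True if the pager appeared, False if we gave up.
    """
    for _ in range(max_scrolls):
        if pager_visible():
            return True          # list is fully loaded
        scroll_down()            # trigger the next lazy-loaded batch
        time.sleep(wait_seconds) # give the page time to render
    return pager_visible()
```

The `max_scrolls` limit is a safety net so that a page which never shows the button does not block the crawl forever.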
Login problem:
Crawling requires a logged-in account, so how do we log in?
A normal login does not ask for a verification code. The code is only requested after a failed attempt, so there is no technical difficulty in logging in.
We can create a [login module]: first log in once with a browser, and then crawl every page using the cookies shared by that browser.
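Outside the tool, the same "log in once, reuse the cookies" idea could look roughly like the sketch below. It assumes you have already copied the cookies out of the logged-in browser into a dict; the cookie names, values, and URL are placeholders, not real Weibo values. No request is actually sent here.

```python
import urllib.request
from typing import Dict


def build_cookie_header(cookies: Dict[str, str]) -> str:
    """Join name/value pairs into a single Cookie header value."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


def make_request(url: str, cookies: Dict[str, str]) -> urllib.request.Request:
    """Prepare a request that carries the browser's login cookies."""
    headers = {
        "Cookie": build_cookie_header(cookies),
        # A browser-like User-Agent; the exact string is an assumption.
        "User-Agent": "Mozilla/5.0",
    }
    return urllib.request.Request(url, headers=headers)


# Placeholder cookies exported from the logged-in browser (hypothetical names).
browser_cookies = {"session_id": "abc123", "user_token": "xyz789"}
request = make_request("https://example.com/blogger/page/1", browser_cookies)
```

Every page request built this way shares the same login state, so the login itself only has to happen once.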
Flow chart design:
We don't need the detail page of each Weibo post. The crawler process therefore has no detail-page step, and all data is extracted directly from the list pages.
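Extracting straight from the list means each list page is parsed once and every post on it is collected. Here is a minimal sketch using only the standard library. The markup is an assumption: the class name `post-text` is hypothetical and must be replaced with whatever the real list page uses.

```python
from html.parser import HTMLParser
from typing import List

# Tags that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class PostListParser(HTMLParser):
    """Collect the text of every <div> carrying the (hypothetical) post class."""

    def __init__(self, post_class: str = "post-text") -> None:
        super().__init__()
        self.post_class = post_class
        self.posts: List[str] = []
        self._depth = 0            # nesting depth inside the current post
        self._buffer: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth:
            self._depth += 1       # nested tag inside a post
        elif tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            if self.post_class in classes:
                self._depth = 1    # entering a new post
                self._buffer = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or not self._depth:
            return
        self._depth -= 1
        if self._depth == 0:       # post closed: store its text
            self.posts.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self._depth:
            self._buffer.append(data)


def extract_posts(html: str) -> List[str]:
    """Return the text of all posts found on one list page."""
    parser = PostListParser()
    parser.feed(html)
    parser.close()
    return parser.posts
```

Calling `extract_posts` on each list page and concatenating the results gives the full data set without visiting any detail page.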
Crawling results:
It took a total of 5 minutes to crawl 10 pages containing 400 microblogs. The number is small because I don't post very frequently on Weibo.
The data is as follows:
Make a simple word cloud:
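A word cloud is driven by word frequencies, so the first step is to count words across the crawled posts. The sketch below is an assumption about how this could be done, not the method used to produce the picture. It splits on word characters, which works for space-separated text; Chinese posts would normally need a word segmenter first.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple


def word_frequencies(
    posts: Iterable[str], min_length: int = 2, top_n: int = 50
) -> List[Tuple[str, int]]:
    """Count words over all posts and return the most frequent ones."""
    counts: Counter = Counter()
    for text in posts:
        # Lowercase so "Crawler" and "crawler" count as the same word.
        for word in re.findall(r"\w+", text.lower()):
            if len(word) >= min_length:   # skip very short tokens
                counts[word] += 1
    return counts.most_common(top_n)


# Tiny made-up sample, only to show the shape of the result.
sample_posts = ["Crawler test today", "The crawler works", "Test again"]
top_words = word_frequencies(sample_posts, top_n=2)
```

The resulting (word, count) pairs can then be passed to whichever word cloud library you prefer, where larger counts are drawn in larger type.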