Implementation notes on 33 Python crawler projects

WechatSogou [1] - WeChat public account crawler. A crawler interface for WeChat public accounts based on Sogou's WeChat search, which can be extended into a general Sogou-search crawler. Results are returned as a list in which each item is a dictionary of one public account's details.

 

DouBanSpider [2] - Douban Reading crawler. It can crawl all books under a Douban reading tag and store them in an Excel file ranked by rating, which makes filtering and collecting easy, for example selecting highly rated books with more than 1,000 ratings; books on different topics can be saved to different Excel sheets. It disguises itself as a browser via the User-Agent header and adds random delays to better mimic browser behavior and avoid getting the crawler banned.
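
A minimal sketch of the User-Agent disguise and random delay it describes, using requests (the agent strings, URL, and delay range are illustrative assumptions, not taken from the project):

    import random
    import time

    import requests

    # A few browser User-Agent strings to masquerade as (example values).
    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Safari/605.1.15",
    ]

    def polite_get(url):
        # Pick a random User-Agent so successive requests look less uniform.
        headers = {"User-Agent": random.choice(USER_AGENTS)}
        # Sleep a random interval to mimic a human reader and avoid a ban.
        time.sleep(random.uniform(1.0, 3.0))
        return requests.get(url, headers=headers, timeout=10)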

 

zhihu_spider [3] - Zhihu crawler. This project crawls Zhihu user information and the follow relationships between users; it uses the Scrapy crawler framework and stores data in MongoDB.
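
Scrapy plus MongoDB usually amounts to an item pipeline; a minimal sketch with pymongo (database and collection names are assumptions, not necessarily what zhihu_spider uses):

    import pymongo

    class MongoPipeline:
        """Scrapy item pipeline that writes every scraped item to MongoDB."""

        def open_spider(self, spider):
            self.client = pymongo.MongoClient("mongodb://localhost:27017")
            self.db = self.client["zhihu"]  # database name is illustrative

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            self.db["users"].insert_one(dict(item))  # collection name is illustrative
            return item

Such a pipeline would then be switched on through the ITEM_PIPELINES setting in the project's settings.py.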

 

bilibili-user [4] - Bilibili user crawler. Total records: 20,119,918. Crawled fields: user id, nickname, gender, avatar, level, experience points, follower count, birthday, location, registration date, signature, and so on. After crawling it generates a report on Bilibili's users.

 

SinaSpider [5] - Sina Weibo crawler. Mainly crawls Weibo users' personal information, posts, followers, and followees. The code logs in by obtaining Sina Weibo cookies, and logging in with multiple accounts helps avoid Sina's anti-crawling measures. Mainly uses the Scrapy crawler framework.
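
Multi-account cookie login typically means rotating a pool of captured cookies across requests; a minimal sketch (cookie names and values are placeholders, and obtaining the cookies in the first place is out of scope here):

    import itertools

    import requests

    # Cookies captured from several logged-in Weibo accounts (placeholder values).
    COOKIE_POOL = [
        {"SUB": "cookie-for-account-1"},
        {"SUB": "cookie-for-account-2"},
    ]
    _next_cookie = itertools.cycle(COOKIE_POOL)

    def fetch(url):
        # Rotate accounts per request so no single login draws attention.
        return requests.get(url, cookies=next(_next_cookie), timeout=10)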

 

distribute_crawler [6] - Distributed novel-download crawler. A distributed web crawler built with Scrapy, Redis, MongoDB, and Graphite: the underlying storage is a MongoDB cluster, distribution is implemented with Redis, and crawler status is displayed with Graphite. It targets a single novel site.
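
Using Redis for the distributed part generally means a shared request queue plus a deduplication set that every worker consults; a minimal sketch with redis-py (key names are illustrative):

    import redis

    r = redis.Redis(host="localhost", port=6379)

    QUEUE_KEY = "crawler:requests"  # shared work queue
    SEEN_KEY = "crawler:seen"       # shared dedup set

    def push_url(url):
        # sadd returns 1 only if the URL was not already in the set,
        # so each URL is queued exactly once across all workers.
        if r.sadd(SEEN_KEY, url):
            r.rpush(QUEUE_KEY, url)

    def pop_url(timeout=5):
        # Blocking pop; any idle worker picks up the next URL.
        item = r.blpop(QUEUE_KEY, timeout=timeout)
        return item[1].decode() if item else None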

 

CnkiSpider [7] - CNKI (China National Knowledge Infrastructure) crawler. After setting the search conditions, run src/CnkiSpider.py to fetch the data; results are stored under the /data directory, and the first line of each data file contains the field names.

 

scrapy_jingdong [9] - Jingdong (JD.com) crawler. A Scrapy-based crawler for the JD.com website; output is saved in CSV format.

 

wooyun_public [11] - Wooyun crawler. Crawler and search for Wooyun's public vulnerabilities and knowledge base. The list of all disclosed vulnerabilities and the text of each one are stored in MongoDB, roughly 2 GB of content; crawling the whole site's text and images for offline querying takes about 10 GB of space and two hours (on a 10 Mb telecom line); crawling the entire knowledge base takes about 500 MB in total. The vulnerability search uses Flask as the web server and Bootstrap for the front end.
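
The Flask-backed vulnerability search could look roughly like this; a minimal sketch with Flask and pymongo (route, field, and collection names are assumptions):

    from flask import Flask, jsonify, request
    import pymongo

    app = Flask(__name__)
    bugs = pymongo.MongoClient()["wooyun"]["bugs"]  # names are illustrative

    @app.route("/search")
    def search():
        # Naive substring match over vulnerability titles.
        q = request.args.get("q", "")
        docs = bugs.find({"title": {"$regex": q}}, {"_id": 0}).limit(20)
        return jsonify(list(docs))

    if __name__ == "__main__":
        app.run()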

 

Spider [12] - hao123 website crawler. Starting from the hao123 entry page, it crawls outward along external links, collecting URLs and recording each address's internal and external link counts along with title information. Tested on 32-bit Windows 7, it currently collects roughly 100,000 records every 24 hours.
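
Collecting URLs while counting each page's internal and external links can be done with requests and BeautifulSoup; a minimal sketch of one step of such a crawl (not the project's actual code):

    from urllib.parse import urljoin, urlparse

    import requests
    from bs4 import BeautifulSoup

    def page_links(url):
        """Return (title, internal count, external count, outbound URLs) for one page."""
        host = urlparse(url).netloc
        soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")
        internal, external, outbound = 0, 0, []
        for a in soup.find_all("a", href=True):
            link = urljoin(url, a["href"])
            if urlparse(link).netloc == host:
                internal += 1
            else:
                external += 1
                outbound.append(link)  # follow these to keep expanding the crawl
        title = soup.title.string if soup.title else ""
        return title, internal, external, outbound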

 

findtrip [13] - Flight ticket crawler (Qunar and Ctrip). Findtrip is a Scrapy-based ticket crawler that currently integrates the two major domestic ticket sites (Qunar + Ctrip).

 

163spider [14] - NetEase client content crawler based on requests, MySQLdb, and torndb.

 

doubanspiders [15] - A collection of crawlers for Douban movies, books, groups, albums, and other content.

 

QQSpider [16] - QQ Zone (Qzone) crawler, covering blog posts, personal information, and more; it can fetch about 4 million records per day.

 

baidu-music-spider [17] - Baidu mp3 full-site crawler; uses Redis to support resuming interrupted crawls.
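
Resumable crawling with Redis usually means remembering which items are already finished; a minimal sketch (the key name and the fetch_song stub are hypothetical):

    import redis

    r = redis.Redis()
    DONE_KEY = "baidu_mp3:done"  # set of ids finished in earlier runs

    def fetch_song(song_id):
        print("downloading", song_id)  # stand-in for the real download logic

    def crawl(song_ids):
        for sid in song_ids:
            # Skip anything completed before, so an interrupted crawl
            # resumes where it stopped instead of starting over.
            if r.sismember(DONE_KEY, sid):
                continue
            fetch_song(sid)
            r.sadd(DONE_KEY, sid)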

 

tbcrawler [18] - Taobao and Tmall crawler. Fetches page information by search keyword or item id; data is stored in MongoDB.

 

stockholm [19] - A stock data (Shanghai/Shenzhen) crawler and stock-picking strategy testing framework. It fetches quote data for all Shanghai and Shenzhen stocks over a selected date range, supports defining stock-picking strategies with expressions, supports multithreading, and saves data to JSON and CSV files.
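
An expression-defined strategy can be as simple as evaluating a boolean expression against each row of quote data; a minimal sketch (the field names and the use of eval are assumptions about how stockholm does it):

    # One row of quote data per stock (fields are illustrative).
    stocks = [
        {"code": "600000", "pe": 8.2, "change": 1.5},
        {"code": "000001", "pe": 25.0, "change": -0.3},
    ]

    # The strategy is just a boolean expression over a row's fields.
    strategy = "pe < 10 and change > 0"

    picked = [s for s in stocks if eval(strategy, {}, s)]
    print(picked)  # only the 600000 row passes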

 

BaiduyunSpider [20] - Baidu cloud drive (Baidu Pan) crawler.

 

Spider [21] - Social data crawler. Supports Weibo, Zhihu, and Douban.

 

proxy_pool [22] - Proxy IP pool for Python crawlers.

 

music-163 [23] - Crawls the comments of all songs on NetEase Cloud Music.

 

jandan_spider [24] - Crawls girl pictures from jandan.net.

 

CnblogsSpider [25] - cnblogs list-page crawler.

 

spider_smooc [26] - Crawls videos from imooc.

 

CnkiSpider [27] - CNKI crawler.

 

knowsecSpider2 [28] - Knownsec crawler challenge.

 

aiss-spider [29] - Image crawler for the Aiss app.

 

SinaSpider [30] - Uses a dynamic IP mechanism to defeat Sina's anti-crawler measures and fetch content quickly.

 

csdn-spider [31] - Crawls blog articles on CSDN.

 

ProxySpider [32] - Crawls proxy IPs from Xici (xicidaili.com) and verifies that the proxies actually work.
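
Verifying a proxy usually comes down to a timed test request through it; a minimal sketch with requests (the test URL, scheme, and timeout are illustrative):

    import requests

    def is_alive(proxy, test_url="https://httpbin.org/ip", timeout=5):
        """Return True if the proxy completes a request within the timeout."""
        proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
        try:
            return requests.get(test_url, proxies=proxies, timeout=timeout).ok
        except requests.RequestException:
            return False

    usable = [p for p in ["1.2.3.4:8080", "5.6.7.8:3128"] if is_alive(p)]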

 

webspider [33] - A job-posting crawler built mainly with python3, celery, and requests. It implements scheduled tasks, retry on error, logging, automatic cookie rotation, and more, and uses ECharts + Bootstrap to build a front-end page that displays the crawled data.
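
Scheduled crawling with retry-on-error maps naturally onto a Celery periodic task; a minimal sketch (broker URL, schedule, and the task body are assumptions, and the task name presumes this module is called jobspider.py):

    from celery import Celery
    from celery.schedules import crontab

    app = Celery("jobspider", broker="redis://localhost:6379/0")

    @app.task(bind=True, max_retries=3, default_retry_delay=60)
    def crawl_positions(self):
        try:
            pass  # fetch and store one batch of job postings here
        except Exception as exc:
            # Retry on failure, up to max_retries, after default_retry_delay seconds.
            raise self.retry(exc=exc)

    # Celery beat triggers the crawl on a schedule, e.g. daily at 03:00.
    app.conf.beat_schedule = {
        "daily-crawl": {
            "task": "jobspider.crawl_positions",
            "schedule": crontab(hour=3, minute=0),
        },
    }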

 
