Implementing a Small Web Crawler with NodeJS: Crawling the cnblogs Home Page Article List

Foreword

  A web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically grabs information from the World Wide Web according to certain rules. Other, less frequently used names include ant, automatic indexer, emulator, and worm.

  We can use web crawlers to collect data automatically: search engines use them to crawl and index websites, data analysis and mining projects use them to gather data, and financial analysis relies on them to collect financial figures. Beyond that, web crawlers can be applied to public opinion monitoring and analysis, collecting data on target customers, and many other fields.

1. Web crawler classification

  Based on system structure and implementation technique, web crawlers can be roughly divided into the following types: general purpose web crawlers (General Purpose Web Crawler), focused web crawlers (Focused Web Crawler), incremental web crawlers (Incremental Web Crawler), and deep web crawlers (Deep Web Crawler). A real crawler system is usually a combination of several of these techniques; each type is briefly described below.

1.1 General purpose web crawler

Also known as a whole-web or scalable crawler (Scalable Web Crawler), it expands its crawl from a set of seed URLs to the entire Web, collecting data mainly for search engines and large Web service providers such as portal sites.

1.2 Focused web crawler

Also known as a topical crawler (Topical Crawler), it selectively crawls only those pages that are relevant to a predefined topic. Compared with a general purpose crawler, a focused crawler fetches only the pages related to its topic, which greatly saves hardware and network resources; because the number of saved pages is small, they can also be refreshed quickly, satisfying the information needs of specific groups of people in specific domains.

1.3 Incremental web crawler

An incremental crawler only fetches newly generated pages or pages that have changed since they were last downloaded, which guarantees, to a certain extent, that the pages it crawls are as fresh as possible.

1.4 Deep Web Crawler

Web pages can be divided into surface pages (Surface Web) and deep web pages (Deep Web, also known as the Invisible Web or Hidden Web). Surface pages are those that traditional search engines can index: mainly static pages reachable through hyperlinks. Deep web pages are those whose content cannot be reached through static links; they are hidden behind search forms and can only be obtained after a user submits certain keywords.

2. Creating a simple crawler application

  Now that we have a basic understanding of the different kinds of crawlers, let's implement a simple crawler application.

2.1 Goal

When crawlers are mentioned, most people think of big data, which in turn brings Python to mind, and a quick search will confirm that many crawlers are indeed written in Python. Since I mainly do front-end development, JavaScript is what I am more comfortable with, and it is simpler for me. So here is a small goal: use NodeJS to crawl the article list on the cnblogs home page (a site familiar to most developers) and write it to a local JSON file.

2.2 Environment setup

  • NodeJS: install NodeJS on your computer; if it is not installed yet, download it from the official website and install it.
  • npm: the NodeJS package management tool, installed together with NodeJS.

After NodeJS is installed, open a command line and run node -v to check whether NodeJS was installed successfully, and npm -v to check whether npm was installed successfully. If the installation succeeded, each command prints its version number (the exact value depends on which version you installed).

2.3 Implementation

2.3.1 Installing dependencies

In the project directory, run npm install superagent cheerio --save-dev to install the two dependencies, superagent and cheerio, then create a crawler.js file.

  • SuperAgent: a lightweight, flexible, readable client-side HTTP request library with a low learning curve, which also runs in the NodeJS environment.

  • Cheerio: a fast, flexible, and lean implementation of core jQuery designed specifically for the server side. It lets you manipulate an HTML string much like you would with jQuery.
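
To get a feel for how the two libraries fit together before writing the crawler itself, here is a minimal, self-contained sketch; the URL https://example.com is only a placeholder and not part of this project. superagent fetches the page, and cheerio parses the returned HTML so it can be queried with jQuery-style selectors.

const superagent = require("superagent");
const cheerio    = require("cheerio");

// Fetch a page and print its <title>; swap in any URL you want to inspect
superagent.get("https://example.com").end((err, res) => {
    if (err) {
        console.error("Request failed:", err.message);
        return;
    }
    const $ = cheerio.load(res.text); // parse the HTML string, jQuery-style
    console.log($("title").text());   // query it with familiar selectors
});

The crawler below follows the same request-then-parse pattern.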

// Import dependencies (only fs, superagent and cheerio are actually used below)
const fs         = require("fs");
const superagent = require("superagent");
const cheerio    = require("cheerio");

2.3.2 Crawling the data

Next, request the page. Once the page content is returned, parse the DOM for the values we want, then process the result into a JSON string and store it locally.

// URL of the page to crawl
const pageUrl = "https://www.cnblogs.com/";

// Decode HTML hex entities (e.g. "&#x4E2D;") back into readable characters
function unescapeString(str){
    if(!str){
        return '';
    }else{
        return unescape(str.replace(/&#x/g,'%u').replace(/;/g,''));
    }
}

// Fetch and parse the data
function fetchData(){
    console.log('Crawl time:', new Date());
    superagent.get(pageUrl).end((error, response)=>{
        if(error){
            console.error('Request failed:', error.message);
            return;
        }
        // Raw HTML of the page
        let content = response.text;
        if(content){
            console.log('Page fetched successfully');
        }
        // Empty array to collect the results
        let result = [];
        let $ = cheerio.load(content);
        let postList = $("#main #post_list .post_item");
        postList.each((index, value)=>{
            let titleLnk = $(value).find('a.titlelnk');
            let itemFoot = $(value).find('.post_item_foot');

            let title = titleLnk.html(); // title
            let href = titleLnk.attr('href'); // link
            let author = itemFoot.find('a.lightblue').html(); // author
            let headLogo = $(value).find('.post_item_summary a img').attr('src'); // avatar
            let summary = $(value).find('.post_item_summary').text(); // summary

            // '发布于 ' ("published at") and '阅读' ("reads") are the Chinese labels
            // in the item footer on the page, so the split strings must stay in Chinese
            let footText = itemFoot.text();
            let postedTime = (footText.split('发布于 ')[1] || '').substr(0, 16); // publish time, e.g. "2019-11-27 10:00"
            let readNum = footText.split('阅读')[1] || ''; // read count, e.g. "(123)"
            readNum = readNum.replace(/[^\d]/g, ''); // keep the digits only

            title = unescapeString(title);
            href = unescapeString(href);
            author = unescapeString(author);
            headLogo = unescapeString(headLogo);
            summary = unescapeString(summary);
            postedTime = unescapeString(postedTime);
            readNum = unescapeString(readNum);

            result.push({
                index,
                title,
                href,
                author,
                headLogo,
                summary,
                postedTime,
                readNum
            });
        });

        // Convert the array to a JSON string
        result = JSON.stringify(result);

        // Write it to the local cnblogs.json file
        fs.writeFile("cnblogs.json", result, "utf-8", (err)=>{
            // err is null when the write succeeds
            if(!err){
                console.log('Data written successfully');
            }else{
                console.error('Write failed:', err);
            }
        });
    });
}

fetchData();

3. Running and optimization

3.1 Generating the result

Open a command line in the project directory and run node crawler.js.

You will find that a cnblogs.json file has been created in the directory; open it to inspect the result.
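
The screenshot of the generated file is not reproduced here, but based on the fields pushed into result above, each entry in cnblogs.json has roughly the following shape (the values below are illustrative placeholders, not real scraped data):

[
    {
        "index": 0,
        "title": "Sample article title",
        "href": "https://www.cnblogs.com/someuser/p/12345678.html",
        "author": "someuser",
        "headLogo": "https://pic.cnblogs.com/face/sample.png",
        "summary": "The first few sentences of the article...",
        "postedTime": "2019-11-27 10:00",
        "readNum": "123"
    }
]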

Comparing the file against the cnblogs home page confirms that the home page article list we wanted has been crawled successfully.

3.2 Crawling on a schedule

So far the data is only fetched once, each time the script is run. Add a timer so that it crawls automatically every five minutes; the code is as follows:

// Crawl once every five minutes
setInterval(()=>{
    fetchData();
},5*60*1000);
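
One design note, offered only as a sketch: setInterval fires every five minutes regardless of whether the previous crawl has finished, because fetchData kicks off an asynchronous request and returns immediately. If fetchData were reworked to signal completion (for example by accepting a done callback, which the code above does not do), a re-armed setTimeout would guarantee that runs never overlap:

// Sketch only: assumes fetchData(done) is modified to call done() once the file write finishes
function crawlLoop() {
    fetchData(() => {
        setTimeout(crawlLoop, 5 * 60 * 1000); // wait five minutes after the previous run completes
    });
}
crawlLoop();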

4. Summary

  Web crawlers can be applied far more widely than this; the above is only a brief introduction to web crawlers along with a small demo. If anything is inadequate, please correct me.

