Easy to learn: write a web crawler with Node.js by following this step-by-step tutorial!


A crawler is a program that automatically fetches data from web pages, and it can help us collect and analyze all kinds of useful information. In this article, I will show you how to write a simple crawler in node.js in just a few steps.

1. Install node.js and npm

node.js is a JavaScript runtime environment built on the Chrome V8 engine, which allows us to run JavaScript code on the server side. npm is the package manager for node.js, which allows us to easily install and manage node.js modules.

To install node.js and npm, you can visit https://nodejs.org/ to download and install the latest version of node.js, which automatically includes npm. You can also install them in other ways; see https://nodejs.org/en/download/package-manager/ for details.
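For example, on macOS with Homebrew or on Debian/Ubuntu with apt, installation is a single command (a sketch; the version shipped by your system's package repository may lag behind the official installer):

# macOS (Homebrew)
brew install node

# Debian/Ubuntu (apt); installs node.js together with npm
sudo apt-get install nodejs npm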

After the installation is complete, you can enter the following command on the command line to check whether the installation is successful:

node -v
npm -v

If you can see the corresponding version number, you have successfully installed node.js and npm.


2. Create project folders and files

Next, we need to create a project folder to hold our crawler code. You can create it anywhere, for example a folder called crawler on your desktop.

Create a file called index.js in this folder; this file will be our crawler's main program. You can write it with any text editor, such as VS Code, Sublime Text, or Notepad++.

Then initialize the project by running npm init in this folder. Follow the prompts to enter some project information (you can just press Enter all the way through), and the project will be set up.
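For reference, npm init produces a package.json that looks roughly like this (the exact fields and values depend on your npm version and on what you enter at the prompts):

{
  "name": "crawler",
  "version": "1.0.0",
  "description": "",
  "main": "index.js",
  "scripts": {
    "test": "echo \"Error: no test specified\" && exit 1"
  },
  "author": "",
  "license": "ISC"
}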

3. Install request and cheerio modules

In order for our crawler to be able to send HTTP requests and parse HTML documents, we need to use two very useful node.js modules: request and cheerio.

request is a simple HTTP client that allows us to easily send all kinds of HTTP requests and get the response data.
cheerio is a lightweight, server-side implementation of core jQuery that allows us to easily traverse and extract HTML elements.

To install these two modules, we need to go to our project folder at the command line and enter the following command:

npm install request cheerio --save

This downloads the request and cheerio modules into the node_modules folder inside our project folder and records them as dependencies in the package.json file.
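Before we write the crawler itself, here is a minimal sketch of how cheerio works (the HTML string and class names below are made up purely for illustration): load an HTML fragment and query it with jQuery-style selectors.

const cheerio = require('cheerio');

// Load an HTML fragment into cheerio
const $ = cheerio.load('<ul><li class="item"><a href="/a">First</a></li><li class="item"><a href="/b">Second</a></li></ul>');

// Iterate over the matched elements, jQuery-style
$('.item').each(function (index, element) {
  console.log($(element).find('a').text());       // "First", then "Second"
  console.log($(element).find('a').attr('href')); // "/a", then "/b"
});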

4. Write crawler code

Now we can start writing our crawler code. First, we need to import the request and cheerio modules in the index.js file:

const request = require('request');
const cheerio = require('cheerio');

Then, we need to define a target URL, which is the address of the web page we want to crawl data from. For example, suppose we want to crawl the Baidu News homepage:

const url = 'https://news.baidu.com/';

Next, we use the request module to send a GET request to this URL and get the response data. For example, suppose we want to extract the title and link of each item in the hot news list and print them out:

request(url, function (error, response, body) {
  // If the request succeeded and the status code is 200
  if (!error && response.statusCode === 200) {
    // Load the HTML document with cheerio
    const $ = cheerio.load(body);

    // Array for the extracted data
    const totalData = []

    // Get all the li elements under .hotnews
    $('.hotnews').find('ul').find('li').each(function (index, value) {
      // Push each item's title and link into the array
      totalData.push({
        title: $(value).find('strong').find('a').text(),
        href: $(value).find('strong').find('a').attr('href')
      })
    })

    // Print the result
    console.log(totalData)
  }
});

Whichever element's content we want to grab, we can select it with the $ function, just as we would in jQuery. The part of the page we are targeting has roughly the following structure.

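The original screenshot of the DOM structure is not reproduced here; based on the selectors used above, the relevant markup is assumed to look roughly like this (the actual page may differ):

<div class="hotnews">
  <ul>
    <li>
      <strong><a href="https://example.com/news-item">News title</a></strong>
    </li>
    <!-- more li elements like the one above -->
  </ul>
</div>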

Next, we write the data to a JSON file.

First, import the fs module that ships with node:

const fs = require('fs')

Then create a data.json file in the project folder (fs.writeFile will also create it automatically if it does not exist) and define a function that stores the data in it:

// Define a function that saves the data; call it where totalData is printed
function writeFs(totalData) {
  fs.writeFile('./data.json', JSON.stringify(totalData), function (err) {
    if (err) {
      throw err
    }
    console.log('Data saved successfully');
  })
}
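As an optional tweak that is not in the original code, you can pass the extra indentation argument to JSON.stringify so that data.json is pretty-printed and easier to read:

// Indent the saved JSON with two spaces for readability
fs.writeFile('./data.json', JSON.stringify(totalData, null, 2), function (err) {
  if (err) throw err
  console.log('Data saved successfully');
})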


Run this code and you will see the titles and links of the news items saved in data.json. With that, we have successfully written a simple crawler in node.js and fetched data from a web page.
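To run the crawler, execute the script with node from inside the project folder:

node index.js

If the request succeeds, data.json will contain an array of objects of the form { title, href }; the actual titles and links depend on what the page shows when you run it.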

Full code

const request = require('request');
const cheerio = require('cheerio');
const fs = require('fs')

const url = 'https://news.baidu.com/';

request(url, function (error, response, body) {
  // If the request succeeded and the status code is 200
  if (!error && response.statusCode === 200) {
    // Load the HTML document with cheerio
    const $ = cheerio.load(body);

    // Array for the extracted data
    const totalData = []

    // Get all the li elements under .hotnews
    $('.hotnews').find('ul').find('li').each(function (index, value) {
      // Push each item's title and link into the array
      totalData.push({
        title: $(value).find('strong').find('a').text(),
        href: $(value).find('strong').find('a').attr('href')
      })
    })

    // Save the data to data.json
    writeFs(totalData)

    // Print the result
    console.log(totalData)
  }
});

// Write the collected data to a JSON file
function writeFs(totalData) {
  fs.writeFile('./data.json', JSON.stringify(totalData), function (err) {
    if (err) {
      throw err
    }
    console.log('Data saved successfully');
  })
}

Of course, this is just a very basic example. Real crawlers involve many more advanced techniques, such as setting request headers, handling redirects, handling exceptions, setting proxies, adding delays, simulating logins, handling captchas, parsing JSON, XML, CSV and other formats, and storing data in a database or a file. If you want to learn more about crawlers, please keep following my blog!
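As one small sketch of these techniques (not covered further in this tutorial), the request module accepts an options object instead of a bare URL, which lets you set custom request headers such as a User-Agent; the header value below is just an example:

// Pass an options object to customize the request
request({
  url: 'https://news.baidu.com/',
  headers: {
    // Some sites block or alter responses based on the User-Agent
    'User-Agent': 'Mozilla/5.0 (compatible; my-crawler/1.0)'
  }
}, function (error, response, body) {
  if (!error && response.statusCode === 200) {
    console.log('Fetched the page with custom headers, length:', body.length);
  }
});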

Previous articles:

1. Full analysis of interface status codes: What is your API talking about?

2. Master the reduce and reduceAll methods in the JS array in one article

3. JS Array Method Encyclopedia: Allows you to easily master array operations


That's all for this article, thank you for reading! I am still learning, so please correct me if I got anything wrong. If you found the article meaningful, helpful, or inspiring, please like and bookmark it as encouragement. Also follow me, and I will share more useful front-end content and skills. I am Walking on the Waves, and I hope we can grow together~


Original article: blog.csdn.net/weixin_45849072/article/details/130984085