Background: my dad asked me to download a few thousand songs for him to play in the car. I didn't feel like downloading them manually, batch by batch, so I simply wrote a crawler to do it automatically.
For this small crawler project I chose Node + Koa2. Initialize the project with koa2 projectName (install koa-generator globally first), then enter the project directory and run npm install && npm start. The dependencies used are superagent, cheerio, async, fs, and path.
Open the NetEase Cloud Music web player and go to the playlist page (I picked the Chinese category). Right-click to view the frame's source code to get the real URL, and find the HTML element with the id m-pl-container: this holds the list of playlists to crawl. A single superagent request to that URL only fetches the first page, so async is used to crawl the remaining pages with limited concurrency.
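The getPageUrl helper referenced below is not shown in the original. A plausible sketch that builds the paginated playlist URLs follows; the base URL, the cat/limit parameters, and the 35-items-per-page offset step are assumptions based on how the web UI paginates:

```javascript
// Hypothetical getPageUrl: build the paginated playlist URLs.
// Base URL, category and the 35-per-page offset step are assumptions.
function getPageUrl(pageCount = 3) {
  const base = 'https://music.163.com/discover/playlist/?cat=%E5%8D%8E%E8%AF%AD';
  const urls = [];
  for (let i = 0; i < pageCount; i += 1) {
    urls.push(base + '&limit=35&offset=' + i * 35);
  }
  return urls;
}

console.log(getPageUrl(2));
```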
static getPlayList(){
    const pageUrlList = this.getPageUrl();
    return new Promise((resolve, reject) => {
        // crawl the page URLs one at a time through the async library
        asy.mapLimit(pageUrlList, 1, (url, callback) => {
            this.requestPlayList(url, callback);
        }, (err, result) => {
            if(err){
                return reject(err);
            }
            resolve(result);
        });
    });
}
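The mapLimit call above comes from the async library: it runs an error-first-callback iteratee over a list with a concurrency cap, then hands the collected results to a final callback. To make that contract concrete, here is a minimal stdlib-only illustration of the same idea (a sketch, not a replacement for require('async')):

```javascript
// Minimal illustration of mapLimit semantics: run `iteratee` over `items`
// with at most `limit` tasks in flight, preserving result order, and call
// `done(err, results)` once at the end. Sketch only, not the async library.
function mapLimit(items, limit, iteratee, done) {
  const results = new Array(items.length);
  let inFlight = 0, nextIndex = 0, finished = 0, failed = false;
  function launch() {
    while (inFlight < limit && nextIndex < items.length) {
      const i = nextIndex++;
      inFlight += 1;
      iteratee(items[i], (err, res) => {
        if (failed) return;                    // already reported an error
        if (err) { failed = true; return done(err); }
        results[i] = res;                      // keep the original order
        inFlight -= 1;
        finished += 1;
        if (finished === items.length) return done(null, results);
        launch();                              // refill the in-flight slot
      });
    }
  }
  launch();
}

// Usage: double each number asynchronously, at most 2 at a time.
mapLimit([1, 2, 3], 2, (n, cb) => setImmediate(() => cb(null, n * 2)),
  (err, result) => console.log(result));
```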
Note the const asy = require('async'): the library is imported under the alias asy because async/await is also used in this project, and the alias keeps the two distinct. requestPlayList is the request issued through superagent:
static requestPlayList(url, callback){
    superagent.get(url).set({
        'Connection': 'keep-alive'
    }).end((err, res) => {
        if(err){
            // log and move on so one failed page does not abort the batch
            console.info(err);
            callback(null, null);
            return;
        }
        const $ = cheerio.load(res.text);
        let curList = this.getCurPalyList($);
        callback(null, curList);
    });
}
getCurPalyList extracts the playlist information from the page; the loaded $ is passed in for DOM-style operations with cheerio:
static getCurPalyList($){
    let list = [];
    $('#m-pl-container li').each(function(i, elem){
        let _this = $(elem);
        list.push({
            name: _this.find('.dec a').text(),
            href: _this.find('.dec a').attr('href'),
            number: _this.find('.nb').text()
        });
    });
    return list;
}
That completes crawling the list of playlists; the next step is crawling each playlist's songs.
static async getSongList(){
    const urlCollection = await playList.getPlayList();
    let urlList = [];
    for(let item of urlCollection){
        for(let subItem of item){
            urlList.push(baseUrl + subItem.href);
        }
    }
    return new Promise((resolve, reject) => {
        asy.mapLimit(urlList, 1, (url, callback) => {
            this.requestSongList(url, callback);
        }, (err, result) => {
            if(err){
                return reject(err);
            }
            resolve(result);
        });
    });
}
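The nested loops in getSongList simply flatten the per-page results into one list and prefix each relative href with the site root. The same step as a small standalone sketch (the baseUrl value here is an assumption for illustration):

```javascript
// Flatten the per-page playlist arrays and build absolute URLs.
// `baseUrl` is an assumed site root for illustration.
const baseUrl = 'https://music.163.com';

function collectPlaylistUrls(urlCollection) {
  // urlCollection: one array of {name, href, number} objects per crawled page
  return urlCollection.flat().map(item => baseUrl + item.href);
}

const pages = [
  [{ name: 'list A', href: '/playlist?id=111', number: '100000' }],
  [{ name: 'list B', href: '/playlist?id=222', number: '80000' }],
];
console.log(collectPlaylistUrls(pages));
// → ['https://music.163.com/playlist?id=111', 'https://music.163.com/playlist?id=222']
```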
requestSongList works much like requestPlayList above, so it is not repeated here. Once the code above has fetched the song list, the songs need to be downloaded locally.
static async downloadSongList(){
    const songList = await this.getSongList();
    let songUrlList = [];
    for(let item of songList){
        for(let subItem of item){
            let id = subItem.url.split('=')[1];
            songUrlList.push({
                name: subItem.name,
                downloadUrl: downloadUrl + '?id=' + id + '.mp3'
            });
        }
    }
    if(!fs.existsSync(dirname)){
        fs.mkdirSync(dirname);
    }
    return new Promise((resolve, reject) => {
        asy.mapSeries(songUrlList, (item, callback) => {
            // wait 5s between downloads to avoid hammering the server;
            // the download itself is fire-and-forget, so the series
            // callback is invoked here rather than inside requestDownload
            setTimeout(() => {
                this.requestDownload(item);
                callback(null, item);
            }, 5e3);
        }, (err, result) => {
            if(err){
                return reject(err);
            }
            resolve(result);
        });
    });
}
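The id-and-URL step above can be isolated: the song id is taken from the query string of the song page URL, and the download URL follows the downloadUrl + '?id=' + id + '.mp3' pattern in the code. A small sketch (the host below is an assumption for illustration):

```javascript
// Build a download entry from a song item, mirroring downloadSongList above.
// `downloadUrl` is an assumed host/path for illustration.
const downloadUrl = 'https://music.163.com/song/media/outer/url';

function toDownloadEntry(subItem) {
  const id = subItem.url.split('=')[1];  // '/song?id=123' → '123'
  return {
    name: subItem.name,
    downloadUrl: downloadUrl + '?id=' + id + '.mp3'
  };
}

console.log(toDownloadEntry({ name: 'demo', url: '/song?id=123' }));
// → { name: 'demo', downloadUrl: 'https://music.163.com/song/media/outer/url?id=123.mp3' }
```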
Finally, requestDownload requests the downloadUrl and saves the file locally:
static requestDownload(item){
    let stream = fs.createWriteStream(path.join(dirname, item.name + '.mp3'));
    superagent.get(item.downloadUrl).set({
        'Connection': 'keep-alive'
    }).pipe(stream).on('error', (err) => {
        // on a failed download, log the error and keep going
        console.info(err);
    });
}
At this point the little crawler is complete. The project crawls the playlist list --> song list --> downloads to local storage. Of course, you could also point it straight at a singer's homepage, change the url passed into songList, and download that singer's popular songs directly.