Boosting Nowcoder Blog Read Counts with Jsoup

Explanation

The good news is that Nowcoder has launched its own blogging platform, and it is clean and tidy with no ads (unlike CSDN). Naturally I welcome it: as a freshman I joined CSDN, later bought my own server to build a personal blog, and now that Nowcoder blogs are online I will move mine there too, if only to show support (I hear the first hundred bloggers also get a free cup). However, since the project has only just gone live, there are inevitably plenty of bugs. The most important one: simply refreshing an article over and over makes its read count climb steadily, which is hard to believe. So I wrote a simple crawler, only a few dozen lines of code. The programming language used here is Java, together with the jsoup library.

This bug may have been fixed since, in which case the content of this article no longer works.

Analysis

First, import the jsoup Maven dependency:

    <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.11.3</version>
    </dependency>
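As a quick sanity check that the dependency resolved, jsoup can parse an HTML string entirely in memory (this check is my own addition, not part of the original crawler; the class and method names are made up for illustration):

```java
import org.jsoup.Jsoup;

public class JsoupCheck {
    // Parse a tiny HTML string and extract its text content.
    static String extractText(String html) {
        return Jsoup.parse(html).text();
    }

    public static void main(String[] args) {
        System.out.println(extractText("<p>Hello, jsoup</p>")); // prints "Hello, jsoup"
    }
}
```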

The approach breaks down into roughly two steps: first, obtain the URLs of all articles; second, create threads that request those URLs over and over. OK, let's start writing code.

  • Get the article URLs

    Looking at the Nowcoder blog home page, I noticed that the article list is not paginated, so the URLs of all articles can be grabbed directly from it. Let's see where those URLs live.

It turns out they sit inside a ul tag with class blog-list; each li tag represents one article.

Each li tag contains an a tag, and the article's link address is in that a tag's href attribute.
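To illustrate, here is a sketch of how jsoup's CSS selectors pull the hrefs out of such a structure; the sample markup below is a guess at the page's shape, not the real Nowcoder HTML:

```java
import java.util.LinkedList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BlogListSketch {
    // Extract the href of every <a> inside <li> tags under <ul class="blog-list">.
    static List<String> extractHrefs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> hrefs = new LinkedList<>();
        for (Element a : doc.select("ul.blog-list li a")) {
            hrefs.add(a.attr("href"));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Hypothetical markup mimicking the structure described above
        String sample = "<ul class=\"blog-list\">"
                + "<li><a href=\"/zengxianhui/1\">Post 1</a></li>"
                + "<li><a href=\"/zengxianhui/2\">Post 2</a></li>"
                + "</ul>";
        System.out.println(extractHrefs(sample)); // prints "[/zengxianhui/1, /zengxianhui/2]"
    }
}
```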

Now let's write the code:

    List<String> getAllArticleURLList(String blogUrl) throws IOException {
        // Fetch the blog home page with a browser-like User-Agent
        Document doc = Jsoup.connect(blogUrl)
                .userAgent("Mozilla")
                .get();
        // The article list lives in <ul class="blog-list">
        Element element = doc.select("ul.blog-list").first();
        Elements li = element.getElementsByTag("li");
        List<String> articleList = new LinkedList<String>();
        for (Element e : li) {
            // Each <li> holds an <a> whose href is the relative article link
            Elements elements = e.getElementsByTag("a");
            articleList.add("https://blog.nowcoder.net" + elements.first().attr("href"));
        }
        return articleList;
    }

With this we can get the links to all of the articles.

  • Create the threads

    Once we have all the article links, we simply keep issuing GET requests against them. But because network requests are resource-intensive, we spread the articles across multiple threads; here at most 20 threads are used.

    First, write a class that handles the read-count boosting task:

    class DealThread implements Runnable {
        int num;

        DealThread(int num) {
            this.num = num;
        }

        @Override
        public void run() {
            while (true) {
                // Get the index of the next URL this thread should handle
                int index = getNowDealIndex();
                // Exit once every URL has been processed
                if (index >= list.size()) {
                    return;
                }
                String url = list.get(index);
                // Request each URL num times
                for (int i = 0; i < num; i++) {
                    try {
                        Jsoup.connect(url).userAgent("Mozilla").get();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }

    The final step is to create the threads that issue these requests. For convenience, this article does not use a thread pool.
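If you did want a thread pool, an ExecutorService version might look like the sketch below. To keep the sketch runnable without touching the network, the real HTTP call is replaced by a counter (the class and method names here are mine, not from the original article):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class PoolSketch {
    // Submit one task per URL; each task "requests" the URL perUrl times.
    // Returns the total number of simulated requests.
    static int refresh(List<String> urls, int perUrl, int threads) throws InterruptedException {
        AtomicInteger requests = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (String url : urls) {
            pool.submit(() -> {
                for (int i = 0; i < perUrl; i++) {
                    // A real version would call Jsoup.connect(url).userAgent("Mozilla").get()
                    requests.incrementAndGet();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return requests.get();
    }

    public static void main(String[] args) throws InterruptedException {
        int total = refresh(Arrays.asList("/a", "/b", "/c"), 10, 4);
        System.out.println(total); // prints "30"
    }
}
```

The pool caps concurrency at a fixed number of threads, so no hand-rolled index sharing is needed.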

Code

package main;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.Scanner;

/**
 * @author zeng
 * @Classname NewCodeReader
 * @Description TODO
 * @Date 2019/7/28 15:35
 */
public class NewCodeReader {
    private static final int THREAD_NUM=20;

    private static int nowDealIndex=-1;

    private static List<String> list;

    synchronized static int getNowDealIndex(){
        nowDealIndex+=1;
        return nowDealIndex;
    }

    private static List<String> getAllArticleURLList(String blogUrl) throws IOException {
        Document doc = Jsoup.connect(blogUrl)
                .userAgent("Mozilla")
                .get();
        Element element=doc.select("ul.blog-list").first();
        Elements li=element.getElementsByTag("li");
        List<String> articleList=new LinkedList<String>();
        for (Element e:li){
            Elements elements=e.getElementsByTag("a");
            articleList.add("https://blog.nowcoder.net"+elements.first().attr("href"));
        }
        return articleList;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        System.out.println("Enter the number of refreshes per article:");
        Scanner scanner = new Scanner(System.in);
        int num = scanner.nextInt();
        System.out.println("Fetching the blog list...");
        String blogUrl = "https://blog.nowcoder.net/zengxianhui";
        list = getAllArticleURLList(blogUrl);
        System.out.println("Found " + list.size() + " blog posts");

        System.out.println("Starting to boost read counts...");
        Thread[] threads = new Thread[THREAD_NUM];
        for (int i = 0; i < THREAD_NUM; i++) {
            Thread thread = new Thread(new DealThread(num));
            thread.start();
            threads[i] = thread;
        }
        for (Thread t : threads) {
            t.join();
        }
        System.out.println("All done...");

    }

    static class DealThread implements Runnable{
        int num;
        DealThread(int num){
            this.num=num;
        }
        @Override
        public void run() {
            while (true){
                int index=getNowDealIndex();
                if (index>=list.size()){
                    return;
                }
                String url=list.get(index);
                for (int i=0;i<num;i++){
                    try {
                        Jsoup.connect(url).userAgent("Mozilla").get();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
            }
        }
    }
}

Finally

Crawling is fun, but don't be unethical. After reading this article you can try it yourself, but please don't put too much pressure on Nowcoder's back-end servers. I believe the Nowcoder team will be able to fix this problem soon.

Origin www.cnblogs.com/zeng-xian-hui/p/11263626.html