GitHub source | PhantomJS: taking screenshots and crawling text


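Preface

What is PhantomJS? It is a headless browser. What is it good for? Some pages are loaded dynamically and rely on JavaScript to build their HTML tags, so fetching the page directly does not return the text. Other scenarios call for a screenshot of a page, for example for image review. PhantomJS can handle both.

Official download link: the Windows and Linux versions are separate packages, so check which one you are getting before you download.

Source code link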


Crawling text

The crawlTextCommand parameter is the command to run. On Windows it looks like this:

F:\phantomjs-2.1.1-windows\bin\phantomjs.exe F:\phantomjs-2.1.1-windows\bin\crawlText.js
    // Requires: java.io.BufferedReader, java.io.IOException,
    //           java.io.InputStream, java.io.InputStreamReader

    /**
     * Crawls the text of a page by running a PhantomJS script.
     *
     * @param url   URL of the site to crawl
     * @param crawlTextCommand  command used to crawl the text
     * eg. F:\phantomjs-2.1.1-windows\bin\phantomjs.exe F:\phantomjs-2.1.1-windows\bin\crawlText.js
     * @return      the crawled text
     * @throws IOException if the process cannot be started or its output cannot be read
     */
    public static String crawlText(String url, String crawlTextCommand) throws IOException {
        InputStream inputStream = null;
        Process process = null;
        try {
            Runtime runtime = Runtime.getRuntime();
            // A space must separate the script path from the URL argument.
            String command = crawlTextCommand + " " + url;
            process = runtime.exec(command);
            inputStream = process.getInputStream();
            BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
            StringBuilder builder = new StringBuilder();
            String content;
            // Read everything the script prints; line breaks are dropped.
            while ((content = reader.readLine()) != null) {
                builder.append(content);
            }
            return builder.toString();
        } finally {
            if (inputStream != null) {
                inputStream.close();
            }
            if (process != null) {
                process.destroy();
            }
        }
    }
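
Because Runtime.exec(String) splits the command on whitespace, a path or URL that contains a space would be broken into several arguments. One way around this is to pass the arguments as a list through ProcessBuilder. The following is only a sketch under assumptions: the class name PhantomCommand and its methods are hypothetical and not part of the original source, and decoding the output as UTF-8 is an assumption about how the script prints.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the original source.
final class PhantomCommand {

    /** Builds the argument list: executable, script, then the script's arguments. */
    static List<String> build(String phantomExe, String script, String... scriptArgs) {
        List<String> command = new ArrayList<>();
        command.add(phantomExe);
        command.add(script);
        command.addAll(Arrays.asList(scriptArgs));
        return command;
    }

    /** Runs the command and returns everything it printed, keeping line breaks. */
    static String run(List<String> command) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout so neither pipe fills up unread
        Process process = pb.start();
        StringBuilder output = new StringBuilder();
        // Assumption: the script prints UTF-8 text.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                output.append(line).append('\n');
            }
            process.waitFor(); // output is drained, so this returns once the process exits
        } finally {
            process.destroy();
        }
        return output.toString();
    }
}
```

With this helper, the call would look like PhantomCommand.run(PhantomCommand.build(phantomExe, crawlTextScript, url)), where each argument is passed through untouched, so no quoting or separating space is needed.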


Taking screenshots

The screensHotCommand parameter is the command to run. On Windows it looks like this:

F:\phantomjs-2.1.1-windows\bin\phantomjs.exe F:\phantomjs-2.1.1-windows\bin\screensHot.js
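    /**
     * Takes a screenshot of a page by running a PhantomJS script.
     *
     * @param url   URL of the site to capture
     * @param path  image path plus file name, eg. F:\\pic\\9.png
     * @param screensHotCommand command used to take the screenshot
     * eg. F:\phantomjs-2.1.1-windows\bin\phantomjs.exe F:\phantomjs-2.1.1-windows\bin\screensHot.js
     * @return      the path the image was saved to
     * @description Note: whenever an external tool is invoked as a command, concurrency
     * has to be considered; otherwise some threads may capture successfully while others fail.
     * @throws IOException if the process cannot be started or its output cannot be read
     * @throws InterruptedException if the thread is interrupted while waiting for the process
     */
    public static String screenshot(String url, String path, String screensHotCommand) throws IOException, InterruptedException {
        InputStream inputStream = null;
        Process process = null;
        try {
            Runtime runtime = Runtime.getRuntime();
            // Spaces must separate the script path, the URL and the output path.
            String command = screensHotCommand + " " + url + " " + path;
            process = runtime.exec(command);
            inputStream = process.getInputStream();
            BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
            // Drain the output so the process cannot block on a full pipe.
            while (reader.readLine() != null) ;
            reader.close();
            // Wait for the process to exit before reporting the path.
            process.waitFor();
            return path;
        } finally {
            if (inputStream != null) {
                inputStream.close();
            }
            if (process != null) {
                process.destroy();
            }
        }
    }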
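
The loop that drains the output blocks for as long as the external process keeps running, so a page that never finishes loading would hang the calling thread. A possible safeguard is to bound the wait with Process.waitFor(timeout, unit). The sketch below is an illustration under assumptions, not the author's code: PhantomTimeout and runWithTimeout are hypothetical names, ProcessBuilder.Redirect.DISCARD requires Java 9 or later, and treating exit status 0 as success assumes the script exits with 0 when the capture worked.

```java
import java.io.IOException;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the original source.
final class PhantomTimeout {

    /**
     * Runs the command, discarding its output, and reports whether it
     * exited with status 0 within the given number of seconds.
     */
    static boolean runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true);
        // Java 9+: discard output so the child never blocks on a full pipe.
        pb.redirectOutput(ProcessBuilder.Redirect.DISCARD);
        Process process = pb.start();
        try {
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                return false; // still running after the timeout: treat as a failed capture
            }
            return process.exitValue() == 0;
        } finally {
            process.destroyForcibly(); // has no effect if the process already exited
        }
    }
}
```

A caller would pass the executable, the script, the URL and the output path as separate list elements, then check the returned flag before looking for the image file.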

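Problems encountered in use

An earlier multi-threaded crawling test showed that some threads produced no screenshot, so synchronization was later added at the point where the service makes the call.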

        synchronized (this) {
            String path = SystemCallUtil.screenshot(
                    url, name + suffix, phantom.getScreensHotCommand()
            );
            File file;
            int tmpTry = retry;
            do {
                Thread.sleep(sleep);  // pause briefly before checking
                file = new File(path);
            } while (!file.exists() && (--tmpTry) > 0); // check again until the file exists or the retries run out

            if (!file.exists()) {
                // log the failure
                return false;
            }
        }
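
Note that synchronized (this) only serializes callers that share the same service instance. If more than one instance can exist, a single static lock is one way to serialize every call in the JVM. The sketch below combines such a lock with the same check-and-retry idea. It is a minimal illustration under assumptions: ScreenshotGuard, captureAndVerify and waitForFile are hypothetical names, and the Supplier stands in for the call to SystemCallUtil.screenshot (which throws checked exceptions, so a real caller would have to wrap it).

```java
import java.io.File;
import java.util.concurrent.locks.ReentrantLock;
import java.util.function.Supplier;

// Hypothetical helper, not part of the original source.
final class ScreenshotGuard {

    // One lock for the whole JVM, so callers on different instances are serialized too.
    private static final ReentrantLock LOCK = new ReentrantLock();

    /** Runs the capture under the lock, then checks that the returned path exists. */
    static boolean captureAndVerify(Supplier<String> capture, int retry, long sleepMillis) {
        LOCK.lock();
        try {
            String path = capture.get();
            return waitForFile(new File(path), retry, sleepMillis);
        } finally {
            LOCK.unlock();
        }
    }

    /** Polls for the file up to retry times, sleeping between checks. */
    static boolean waitForFile(File file, int retry, long sleepMillis) {
        for (int attempt = 0; attempt < retry; attempt++) {
            if (file.exists()) {
                return true;
            }
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt for the caller
                return false;
            }
        }
        return file.exists(); // one last check after the final sleep
    }
}
```

Unlike the do-while loop above, this version checks for the file before sleeping, so a capture that has already finished returns without any delay.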


Reposted from blog.csdn.net/legendaryhaha/article/details/106331204