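Introduction
What is PhantomJS? A headless browser. What is it good for? Some pages are loaded dynamically and rely on JS to build their HTML tags, so fetching the page directly does not return the text. Other scenarios require taking a screenshot of a page, for example for image review. PhantomJS can handle both.
Download it from the official website. The Windows and Linux builds are separate packages, so check which one you are getting before you download.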
Crawling text
In the method below, the crawlTextCommand parameter passed in on Windows looks like this:
F:\phantomjs-2.1.1-windows\bin\phantomjs.exe F:\phantomjs-2.1.1-windows\bin\crawlText.js
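// Imports needed by this method
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

/**
 * Crawls the text of a page by running PhantomJS as an external process.
 *
 * @param url link of the site to crawl
 * @param crawlTextCommand command used to crawl the text
 * e.g. F:\phantomjs-2.1.1-windows\bin\phantomjs.exe F:\phantomjs-2.1.1-windows\bin\crawlText.js
 * @return the crawled text content
 * @throws IOException if the process cannot be started or its output cannot be read
 */
public static String crawlText(String url, String crawlTextCommand) throws IOException {
    InputStream inputStream = null;
    Process process = null;
    try {
        Runtime runtime = Runtime.getRuntime();
        // A space must separate the script path from the URL argument.
        String command = crawlTextCommand + " " + url;
        process = runtime.exec(command);
        inputStream = process.getInputStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
        StringBuilder builder = new StringBuilder();
        String content;
        // Collect everything the script prints to stdout; line breaks are not kept.
        while ((content = reader.readLine()) != null) {
            builder.append(content);
        }
        return builder.toString();
    } finally {
        if (inputStream != null) {
            inputStream.close();
        }
        if (process != null) {
            process.destroy();
        }
    }
}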
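Building the command by string concatenation is fragile: a missing space glues the URL onto the script path, and Runtime.exec(String) splits the whole string on whitespace. As a sketch of one alternative, the helper below passes the arguments as a list through ProcessBuilder and enforces a timeout. PhantomRunner, buildCommand, run, and the temporary output file are names and choices made up for this illustration; they are not part of the original utility.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Hypothetical helper; class and method names are illustrative only. */
final class PhantomRunner {

    /** Builds the argument list: executable, script, then the script's arguments. */
    static List<String> buildCommand(String phantomExe, String script, String... scriptArgs) {
        List<String> command = new ArrayList<>();
        command.add(phantomExe);
        command.add(script);
        for (String arg : scriptArgs) {
            command.add(arg); // each argument stays one token, even if it contains spaces
        }
        return command;
    }

    /** Runs the command and returns stdout and stderr combined, or fails on timeout. */
    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Output goes to a temp file so a hung process cannot block a reader thread.
        Path output = Files.createTempFile("phantom-", ".out");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(output.toFile())
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly().waitFor();
                throw new IOException("Timed out after " + timeoutSeconds + "s: " + command);
            }
            return new String(Files.readAllBytes(output), StandardCharsets.UTF_8);
        } finally {
            Files.deleteIfExists(output);
        }
    }
}
```

A call might look like PhantomRunner.run(PhantomRunner.buildCommand(exe, script, url), 30), where exe and script are the two paths from the command shown earlier and 30 seconds is an arbitrary limit.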
Screenshots
In the method below, the screensHotCommand parameter passed in on Windows looks like this:
F:\phantomjs-2.1.1-windows\bin\phantomjs.exe F:\phantomjs-2.1.1-windows\bin\screensHot.js
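/**
 * Takes a screenshot of a page by running PhantomJS as an external process.
 * Note: whenever an external tool is invoked as a command, concurrency has to
 * be considered; otherwise some threads may capture the screenshot while
 * others fail.
 *
 * @param url link of the site to capture
 * @param path image path plus file name, e.g. F:\\pic\\9.png
 * @param screensHotCommand command used to take the screenshot
 * e.g. F:\phantomjs-2.1.1-windows\bin\phantomjs.exe F:\phantomjs-2.1.1-windows\bin\screensHot.js
 * @return the path where the image is saved
 * @throws IOException if the process cannot be started or its output cannot be read
 * @throws InterruptedException if the thread is interrupted while waiting for the process
 */
public static String screenshot(String url, String path, String screensHotCommand) throws IOException, InterruptedException {
    InputStream inputStream = null;
    Process process = null;
    try {
        Runtime runtime = Runtime.getRuntime();
        // Arguments handed to the script: the URL first, then the output path.
        String command = screensHotCommand + " " + url + " " + path;
        process = runtime.exec(command);
        inputStream = process.getInputStream();
        BufferedReader reader = new BufferedReader(new InputStreamReader(inputStream));
        // Drain stdout so the process cannot block on a full pipe.
        while (reader.readLine() != null) ;
        reader.close();
        // Wait for PhantomJS to exit before reporting the path.
        process.waitFor();
        return path;
    } finally {
        if (inputStream != null) {
            inputStream.close();
        }
        if (process != null) {
            process.destroy();
        }
    }
}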
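The post does not include screensHot.js itself. As a rough guess at what such a script could look like, the sketch below keeps a minimal version as a Java string and writes it to disk. The script uses PhantomJS's webpage and system modules; the viewport size, the one-second delay, and the ScreenshotScript class are assumptions for illustration, not the author's actual script.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that writes a minimal screenshot script for PhantomJS. */
final class ScreenshotScript {

    /** Assumed script body; the real screensHot.js is not shown in the post. */
    static final String SOURCE = String.join("\n",
            "var page = require('webpage').create();",
            "var args = require('system').args; // args[0] is the script name",
            "var url = args[1];",
            "var path = args[2];",
            "page.viewportSize = { width: 1280, height: 800 }; // assumed size",
            "page.open(url, function (status) {",
            "    if (status !== 'success') {",
            "        console.log('failed to load ' + url);",
            "        phantom.exit(1);",
            "        return;",
            "    }",
            "    // Give scripts on the page a moment to finish before rendering.",
            "    window.setTimeout(function () {",
            "        page.render(path);",
            "        phantom.exit(0);",
            "    }, 1000);",
            "});",
            "");

    /** Writes the script to the given file, replacing any existing content. */
    static Path writeTo(Path target) throws IOException {
        return Files.write(target, SOURCE.getBytes(StandardCharsets.UTF_8));
    }
}
```

The argument order (URL, then output path) matches the way the screenshot method assembles its command. A fixed delay is the simplest way to wait for dynamic content, and pages that load slowly may need a longer one.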
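Problems encountered in use
When I simulated multi-threaded crawling, I found that some threads produced no screenshot. I then added synchronization at the point where the service calls the utility.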
synchronized (this) {
    String path = SystemCallUtil.screenshot(
            url, name + suffix, phantom.getScreensHotCommand()
    );
    File file;
    int tmpTry = retry;
    do {
        Thread.sleep(sleep); // pause briefly so the file has time to appear
        file = new File(path);
    } while (!file.exists() && (--tmpTry) > 0); // check up to `retry` times
    if (!file.exists()) {
        // log the failure here
        return false;
    }
}
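The synchronized block above lets only one screenshot run at a time and then polls for the file. The same idea can be sketched as a small reusable helper in which a Semaphore caps the number of concurrent calls (a cap of 1 behaves like the synchronized block) and the polling lives in its own method. ScreenshotGate, capture, and waitForFile are hypothetical names, and whether more than one concurrent PhantomJS call works reliably in a given setup is something to measure rather than assume.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

/** Hypothetical helper: limits concurrent screenshot calls and waits for the file. */
final class ScreenshotGate {

    private final Semaphore permits;

    ScreenshotGate(int maxConcurrent) {
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Checks for the file up to `attempts` times, sleeping between checks. */
    static boolean waitForFile(Path file, int attempts, long sleepMillis)
            throws InterruptedException {
        for (int i = 0; i < attempts; i++) {
            if (Files.exists(file)) {
                return true;
            }
            Thread.sleep(sleepMillis);
        }
        return Files.exists(file); // one last look after the final sleep
    }

    /** Runs the screenshot call under a permit and reports whether the file appeared. */
    boolean capture(Callable<Path> screenshotCall, int attempts, long sleepMillis)
            throws Exception {
        permits.acquire();
        try {
            Path file = screenshotCall.call();
            return waitForFile(file, attempts, sleepMillis);
        } finally {
            permits.release(); // always free the permit, even if the call throws
        }
    }
}
```

Unlike the loop above, this version checks for the file before the first sleep, so a screenshot that is already on disk returns immediately.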