Java Web Crawler: Scraping Dynamic AJAX-Loaded Pages with PhantomJS and jsoup

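Scraping AJAX-loaded dynamic pages from Java on Windows requires a helper tool that can execute the page's JavaScript. The tool used in this article is PhantomJS, a headless browser (introductions to it are easy to find online).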

First, download PhantomJS from https://phantomjs.org/download.html

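After downloading, extract the archive. To make it easier to use later, put it in a folder of its own; here it sits in a separate folder named phantomjs on the F: drive. Then open phantomjs\bin and run phantomjs.exe. The following window appears: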

[Figure: PhantomJS running in its console window]

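This means PhantomJS is ready to run JavaScript code. (If you plan to use it often, add its bin directory to the PATH environment variable.)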
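One way to confirm the PATH change took effect from Java is to look for the executable in each PATH directory. The sketch below is not from the original article: the class name PathLookup and the method find are hypothetical, and it only checks that a file with the given name exists, not that it can actually be launched.

```java
import java.io.File;
import java.nio.file.Files;
import java.nio.file.InvalidPathException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

// Hypothetical helper, not part of the original article.
class PathLookup {

    /**
     * Returns the first regular file called fileName found in the
     * directories listed in pathValue (a PATH-style string), if any.
     */
    static Optional<Path> find(String fileName, String pathValue) {
        if (pathValue == null) {
            return Optional.empty();
        }
        // File.pathSeparator is ";" on Windows and ":" on Unix-like systems.
        for (String dir : pathValue.split(File.pathSeparator)) {
            if (dir.isEmpty()) {
                continue;
            }
            try {
                Path candidate = Paths.get(dir, fileName);
                if (Files.isRegularFile(candidate)) {
                    return Optional.of(candidate);
                }
            } catch (InvalidPathException e) {
                // Skip PATH entries that are not valid paths on this platform.
            }
        }
        return Optional.empty();
    }
}
```

On Windows, a call such as `PathLookup.find("phantomjs.exe", System.getenv("PATH"))` returns the resolved location when the bin directory is on PATH and an empty Optional otherwise.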

Next comes scraping the page itself.

First, write a JavaScript file (for example, parser.js):

var system = require('system');
var page = require('webpage').create();

// The target URL is the first argument after the script name.
var url = system.args[1];

// Give up on any single resource after 10 seconds
// and print whatever has been loaded so far.
page.settings.resourceTimeout = 1000 * 10;
page.onResourceTimeout = function (e) {
    console.log(page.content);
    phantom.exit(1);
};

page.open(url, function (status) {
    if (status !== 'success') {
        console.log('Unable to load the page!');
        phantom.exit(1);
    } else {
        // page.content holds the current DOM serialized as HTML.
        console.log(page.content);
        phantom.exit();
    }
});

Then comes the Java code (my parser.js is in the root of the F: drive):

// Imports needed at the top of the class file
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

// Render a dynamic page with PhantomJS and return the resulting HTML
public static String dynamicHtml(String url) {
    StringBuilder html = new StringBuilder();
    try {
        // Pass each argument separately so the URL is never split on whitespace
        Process process = new ProcessBuilder(
                "F:\\phantomjs\\bin\\phantomjs.exe", "F:/parser.js", url).start();
        // try-with-resources closes the reader and the underlying stream
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(process.getInputStream(), "UTF-8"))) {
            String line;
            while ((line = br.readLine()) != null) {
                html.append(line).append('\n');
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
    return html.toString();
}
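The reading loop above blocks for as long as PhantomJS keeps running, so a page that never finishes loading would hang the crawler. As a rough sketch of one way to bound the wait, the code below sends the process output to a temporary file and waits with a timeout. It is not the original author's method: the class name PhantomRunner, the methods buildCommand and run, and the timeout value are all hypothetical choices.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of the original article.
class PhantomRunner {

    /** Builds the argument list: executable, script, then the target URL. */
    static List<String> buildCommand(String phantomExe, String scriptPath, String url) {
        return Arrays.asList(phantomExe, scriptPath, url);
    }

    /**
     * Runs the command and returns its standard output as UTF-8 text.
     * Throws IOException if the process is still running after the timeout.
     */
    static String run(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        // Writing stdout to a file avoids blocking on a full pipe buffer.
        File out = File.createTempFile("phantom-", ".html");
        try {
            Process process = new ProcessBuilder(command)
                    .redirectOutput(out)
                    .redirectError(ProcessBuilder.Redirect.INHERIT)
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("Timed out after " + timeoutSeconds + " s");
            }
            return new String(Files.readAllBytes(out.toPath()), StandardCharsets.UTF_8);
        } finally {
            out.delete();
        }
    }
}
```

With the paths used in this article, a call might look like `PhantomRunner.run(PhantomRunner.buildCommand("F:\\phantomjs\\bin\\phantomjs.exe", "F:/parser.js", url), 30)`, where the 30-second limit is an arbitrary example.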

Processing logic (parsing the HTML with jsoup):

// Imports needed at the top of the class file
// (add org.jsoup.select.Elements if you enable the selectors below)
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public static void ReadAjaxDynamicHtml(String htmlUrl) {
    String imageHtml = dynamicHtml(htmlUrl);
    Document imageDoc = Jsoup.parse(imageHtml);
    // To select a subset of elements by class:
    // Elements childrenImg = imageDoc.select(".class");
    // System.err.println(childrenImg.html());
    // System.err.println(childrenImg.text());
    // To select by tag instead, for example img:
    // Elements childrenImg = imageDoc.select("img");
    System.err.println(imageDoc);
    /* further processing goes here */
    // ...
}
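The commented-out selectors above can be turned into a small helper. The following is only a sketch and not part of the original article: the class name ImageLinks and the method imageUrls are hypothetical, and it assumes the jsoup jar is on the classpath. It collects the absolute URL of every img element that has a src attribute.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper, not part of the original article.
class ImageLinks {

    /** Returns the absolute URL of every <img> element that has a src attribute. */
    static List<String> imageUrls(String html, String baseUri) {
        // The base URI lets jsoup resolve relative src values such as "/a.png".
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element img : doc.select("img[src]")) {
            urls.add(img.absUrl("src"));
        }
        return urls;
    }
}
```

Combined with the method above, a call might look like `ImageLinks.imageUrls(dynamicHtml(htmlUrl), htmlUrl)`, passing the page's own URL as the base URI.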

Example call from the main method:

public static void main(String[] args) {
    String htmlUrl = "http://www.baidu.com";
    ReadAjaxDynamicHtml(htmlUrl);
}

[Screenshot: part of the output]

Maven dependency for the jsoup jar, for reference:

<!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.3</version>
</dependency>

This completes the test.

