Scraping web page source with HttpClient

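My company recently had a new requirement: the backend has to fetch the content of a page at a given URL and generate an HTML document locally. So I started reading up on the topic. What I learned is fairly basic, and this post is just a summary of what I picked up along the way.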

HttpClient

The first tool I tried was Apache HttpClient. It implements all of the HTTP methods, so it naturally covers both GET and POST.

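GET method

A GET request involves six main steps:

1. Create an HttpClient instance.
2. Create an instance of the request method, here HttpGet, passing the target address to its constructor.
3. Call the execute method of the client created in step 1 to run the request created in step 2.
4. Read the response.
5. Release the connection. This must happen whether or not the request succeeded.
6. Process the content you received.

Sample code: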

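    import java.io.IOException;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.util.EntityUtils;

    // The method name is illustrative; the original snippet omitted its enclosing method.
    public static String doGet(String url) {
        // Create the GET request for the target URL
        HttpGet httpget = new HttpGet(url);
        String content = "";
        // try-with-resources closes the response and the client,
        // releasing the connection whether or not the request succeeds
        try (CloseableHttpClient httpclient = HttpClients.createDefault();
             CloseableHttpResponse response = httpclient.execute(httpget)) {
            if (response.getStatusLine().getStatusCode() == 200) {
                // Read the response body as a UTF-8 string
                content = EntityUtils.toString(response.getEntity(), "utf-8");
                System.out.println(content);
            }
        } catch (IOException e) {
            // ClientProtocolException is a subclass of IOException
            e.printStackTrace();
        }
        return content;
    }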
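Step 6 (processing the content) in my case means writing the page to a local HTML file. The following is a minimal sketch of that step using only the JDK. The class name `HtmlSaver`, the method names, and the `output/page.html` path are made up for illustration, and the hard-coded string stands in for whatever the GET request returned.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class HtmlSaver {

    // Write the page source to the target file as UTF-8, creating parent
    // directories if needed. An existing file is overwritten.
    public static Path saveHtml(String content, Path target) {
        try {
            Path parent = target.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            Files.write(target, content.getBytes(StandardCharsets.UTF_8));
            return target;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Read a saved file back as a UTF-8 string (only used to check the round trip).
    public static String readHtml(Path source) {
        try {
            return new String(Files.readAllBytes(source), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the string returned by the GET request.
        String content = "<html><body><h1>demo</h1></body></html>";
        Path out = saveHtml(content, Paths.get("output", "page.html"));
        System.out.println("Saved " + readHtml(out).length() + " characters to " + out);
    }
}
```

Writing the bytes with an explicit UTF-8 charset keeps the file consistent with the encoding used when reading the response, rather than depending on the platform default.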

POST method (with parameters)

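    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.apache.http.NameValuePair;
    import org.apache.http.client.entity.UrlEncodedFormEntity;
    import org.apache.http.client.methods.CloseableHttpResponse;
    import org.apache.http.client.methods.HttpPost;
    import org.apache.http.impl.client.CloseableHttpClient;
    import org.apache.http.impl.client.HttpClients;
    import org.apache.http.message.BasicNameValuePair;
    import org.apache.http.util.EntityUtils;

    // The method name is illustrative; the original snippet omitted its enclosing method.
    // params is assumed to be a Map<String, String> of form fields.
    public static String doPost(String url, Map<String, String> params) {
        // Create the POST request for the target URL
        HttpPost httpPost = new HttpPost(url);
        // Convert the parameter map into name/value pairs
        List<NameValuePair> nvps = new ArrayList<>();
        for (Map.Entry<String, String> entry : params.entrySet()) {
            nvps.add(new BasicNameValuePair(entry.getKey(), entry.getValue()));
        }
        String content = "";
        // try-with-resources closes the client and the response even on failure
        try (CloseableHttpClient httpclient = HttpClients.createDefault()) {
            // Attach the parameters to the request as a URL-encoded form body
            httpPost.setEntity(new UrlEncodedFormEntity(nvps, "UTF-8"));
            // Execute the POST request
            try (CloseableHttpResponse response = httpclient.execute(httpPost)) {
                if (response.getStatusLine().getStatusCode() == 200) {
                    content = EntityUtils.toString(response.getEntity(), "utf-8");
                    System.out.println(content);
                }
            }
        } catch (IOException e) {
            // Covers ClientProtocolException and UnsupportedEncodingException
            e.printStackTrace();
        }
        return content;
    }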
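To get a feel for what the form entity sends over the wire, here is a small JDK-only sketch that builds an application/x-www-form-urlencoded body from a parameter map. It is an approximation for illustration, not HttpClient's actual implementation, and the class and method names (`FormBodySketch`, `encodeForm`) are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class FormBodySketch {

    // Join the URL-encoded key=value pairs with '&', in map iteration order.
    public static String encodeForm(Map<String, String> params) {
        StringJoiner body = new StringJoiner("&");
        for (Map.Entry<String, String> entry : params.entrySet()) {
            body.add(encode(entry.getKey()) + "=" + encode(entry.getValue()));
        }
        return body.toString();
    }

    // URL-encode one key or value as UTF-8 (spaces become '+').
    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is required to be supported by every JVM
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical form fields
        Map<String, String> params = new LinkedHashMap<>();
        params.put("q", "http client");
        params.put("x", "a&b=c");
        System.out.println(encodeForm(params)); // prints q=http+client&x=a%26b%3Dc
    }
}
```

Characters such as `&` and `=` inside a value are escaped, which is why the parameters should go through an encoder instead of being concatenated into the body by hand.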

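The code does return the page source for the given URL. Unfortunately, what comes back is the source before any JavaScript has run, which is not what I needed.

So how do you get the page as it looks after JavaScript has generated its dynamic content? That is where HtmlUnit comes in.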
