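Preface:
This article is an exercise I wrote after reading one of Tuanzhang's (团长) articles on Java web crawlers. The code fetches search-result data from the JD.com (京东) website and parses it.
The program structure is as follows:
Note: the comments in the code already explain it clearly, so I won't go into much detail here.
JdongMain is the program entry point; JdongBook models a book sold on JD.com; URLHandle takes the URL and the client and returns the processed data; HTTPUtils sends the actual HTTP request and returns the response message; JdParse parses the entity content of the response message.
Code:
1. JdongMain.java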
package main;
import java.io.IOException;
import java.util.List;
import org.apache.http.ParseException;
import org.apache.http.client.HttpClient;
import org.apache.http.impl.client.DefaultHttpClient;
import model.JdongBook;
import util.URLHandle;
/**
* Program entry point: creates the client here and sends the request to the server
* @author 康茜
*
*/
public class JdongMain {
public static void main(String[] args) {
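//Create a client; it sends the request for the URL to the server and receives the response
HttpClient client = new DefaultHttpClient();
String url = "http://search.jd.com/Search?keyword=Python&enc=utf-8&book=y&wq=Python&pvid=33xo9lni.p4a1qb";
//Start with an empty list so the loop below is safe even if the request throws
List<JdongBook> bookList = java.util.Collections.emptyList();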
try {
bookList = URLHandle.urlParser(client, url);
} catch (ParseException | IOException e) {
e.printStackTrace();
}
for(JdongBook book : bookList) {
System.out.println(book);
}
}
}
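The search URL in JdongMain is hard-coded for the keyword Python. As a minimal sketch of how the same URL could be built for other keywords, the helper below URL-encodes the keyword with the JDK's URLEncoder. The class name SearchUrlSketch and the method buildSearchUrl are hypothetical names of mine; the parameter names (keyword, enc, book, wq) are simply copied from the URL above, and I am assuming the pvid parameter can be left out, which I have not verified against the live site.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the original program.
final class SearchUrlSketch {
    // Base address copied from the URL used in JdongMain.
    private static final String BASE = "http://search.jd.com/Search";

    // Builds a book-search URL for the given keyword.
    // Assumption: pvid from the original URL is optional and is omitted here.
    static String buildSearchUrl(String keyword) {
        // Encode the keyword so spaces, '+' and non-ASCII characters are safe in a query string.
        String encoded = URLEncoder.encode(keyword, StandardCharsets.UTF_8);
        return BASE + "?keyword=" + encoded + "&enc=utf-8&book=y&wq=" + encoded;
    }

    public static void main(String[] args) {
        System.out.println(buildSearchUrl("Python"));
        System.out.println(buildSearchUrl("C++"));
    }
}
```

The returned string could then be passed to URLHandle.urlParser in place of the fixed url variable.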
2. URLHandle.java
package util;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.http.HttpResponse;
import org.apache.http.ParseException;
import org.apache.http.client.HttpClient;
import org.apache.http.util.EntityUtils;
import model.JdongBook;
import parse.JdParse;
/**
* Uses the URL and the client to fetch and process the data returned by the request
* @author 康茜
*
*/
public class URLHandle {
/**
*
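* @param client the HTTP client
* @param url the request URL
* @return the parsed data as a List<JdongBook>; empty if the status code is not 200
* @throws ParseException if the response entity cannot be parsed
* @throws IOException if reading the response entity fails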
*/
public static List<JdongBook> urlParser(HttpClient client, String url) throws ParseException, IOException {
List<JdongBook> data = new ArrayList<>();
//Get the response
HttpResponse response = HTTPUtils.getHtml(client, url);
//Get the status code of the response
int statusCode = response.getStatusLine().getStatusCode();
if(statusCode == 200) {//200 means success
//Read the response entity and decode it as a UTF-8 string
String entity = EntityUtils.toString(response.getEntity(), "utf-8");
data = JdParse.getData(entity);
} else {
EntityUtils.consume(response.getEntity());//Release the entity's resources
}
return data;
}
}
3. HTTPUtils.java
package util;
import java.io.IOException;
import org.apache.http.HttpResponse;
import org.apache.http.HttpStatus;
import org.apache.http.HttpVersion;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.message.BasicHttpResponse;
public class HTTPUtils {
public static HttpResponse getHtml(HttpClient client, String url) {
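//Fetch the response (the HTML page) using the GET method
HttpGet getMethod = new HttpGet(url);
//Placeholder returned if the request fails; it uses a non-200 status so callers do not try to read a missing entity
HttpResponse response = new BasicHttpResponse(HttpVersion.HTTP_1_1, HttpStatus.SC_SERVICE_UNAVAILABLE, "Service Unavailable");
try {
//Execute the GET request through the client
response = client.execute(getMethod);
} catch (IOException e) {
e.printStackTrace();
}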
return response;
}
}
4. JdParse.java
package parse;
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import model.JdongBook;
public class JdParse {
/**
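* Extracts the data the program needs from the response entity
* @param entity the entity content of the HTTP response
* @return the list of books found in the page; empty if none match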
*/
public static List<JdongBook> getData(String entity) {
List<JdongBook> data = new ArrayList<>();
//Parse with jsoup; for jsoup usage, see the summary at the end
Document doc = Jsoup.parse(entity);
//Select the elements we need, based on an inspection of the page structure
Elements elements = doc.select("ul[class=gl-warp clearfix]").select("li[class=gl-item]");
for(Element element : elements) {
JdongBook book = new JdongBook();
book.setBookId(element.attr("data-sku"));
book.setBookName(element.select("div[class=p-name p-name-type-2]").select("em").text());
book.setBookPrice(element.select("div[class=p-price]").select("strong").select("i").text());
data.add(book);
}
return data;
}
}
5. JdongBook.java
package model;
public class JdongBook {
private String bookId;
private String bookName;
private String bookPrice;
public JdongBook() {
}
public String getBookId() {
return bookId;
}
public void setBookId(String bookId) {
this.bookId = bookId;
}
public String getBookName() {
return bookName;
}
public void setBookName(String bookName) {
this.bookName = bookName;
}
public String getBookPrice() {
return bookPrice;
}
public void setBookPrice(String bookPrice) {
this.bookPrice = bookPrice;
}
@Override
public String toString() {
return "Book [bookId=" + bookId + ", bookName=" + bookName + ", bookPrice=" + bookPrice + "]";
}
}
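JdongBook keeps bookPrice as a String, exactly as it is scraped. For any later analysis the price has to become a number, and the scraped text can be empty when the selector matches nothing. Here is a small sketch of one way to convert it, assuming the text is a plain decimal that may carry a currency sign or thousands separators. PriceSketch and parsePrice are hypothetical names that are not part of the original program.

```java
import java.math.BigDecimal;
import java.util.Optional;

// Hypothetical helper, not part of the original program.
final class PriceSketch {
    // Converts scraped price text into a BigDecimal.
    // Returns Optional.empty() for null, blank, or non-numeric input.
    static Optional<BigDecimal> parsePrice(String raw) {
        if (raw == null) {
            return Optional.empty();
        }
        // Assumption: strip the yen/yuan sign (half-width and full-width) and thousands separators.
        String cleaned = raw.replace("\u00A5", "")
                .replace("\uFFE5", "")
                .replace(",", "")
                .trim();
        if (cleaned.isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(new BigDecimal(cleaned));
        } catch (NumberFormatException e) {
            // Text such as "N/A" is treated as "no price" rather than an error.
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(parsePrice("89.00"));
        System.out.println(parsePrice(""));
    }
}
```

BigDecimal is used instead of double so that prices such as 89.00 keep their exact decimal value.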
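Summary:
1. Through this exercise I learned how HttpClient, HttpResponse, and HttpGet relate to one another and how to use them together.
2. Basic usage of jsoup for parsing HTML data: http://www.open-open.com/jsoup/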
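On point 1: the pattern is that a client executes a request object and hands back a response object. DefaultHttpClient has been deprecated since Apache HttpClient 4.3, so as a rough comparison, here is the same three-part relationship sketched with the JDK's built-in java.net.http package (Java 11+). This is a different API from the one used in the code above and is not a drop-in replacement. JdkFetchSketch, buildGet, and fetch are hypothetical names, and the timeout and User-Agent values are my own assumptions.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Hypothetical sketch using the JDK HTTP client instead of Apache HttpClient.
final class JdkFetchSketch {
    // Plays the role of HttpGet: describes the request without sending it.
    static HttpRequest buildGet(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofSeconds(10))      // assumed value
                .header("User-Agent", "Mozilla/5.0")  // assumed value
                .GET()
                .build();
    }

    // The client executes the request and returns the response,
    // mirroring client.execute(getMethod) in HTTPUtils.
    static String fetch(HttpClient client, String url) throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildGet(url), HttpResponse.BodyHandlers.ofString());
        // Same rule as URLHandle: only a 200 response yields a body to parse.
        return response.statusCode() == 200 ? response.body() : "";
    }
}
```

A caller would create the client once with HttpClient.newHttpClient() and pass the string returned by fetch to JdParse.getData.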