Parsing HTML Files in Java with Jsoup

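I. Introduction to Jsoup

Jsoup is an HTML parser for Java that can parse HTML directly from a URL or from a string of HTML text. It provides a very convenient API for extracting and manipulating data using DOM traversal, CSS selectors, and jQuery-like methods.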

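II. Main features of Jsoup

1. Parse HTML from a URL, a file, or a string

2. Find and extract data using DOM methods or CSS selectors

3. Manipulate HTML elements, attributes, and text

Note: jsoup is released under the MIT license, so it can safely be used in commercial projects.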
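
To make the three features above concrete, here is a minimal sketch that parses an HTML string, finds a link, and rewrites its attribute, text, and class. The class name `JsoupFeatureSketch`, the helper `retarget`, and the sample markup are made up for illustration; only the jsoup calls themselves come from the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class JsoupFeatureSketch {
    // Hypothetical helper: parse the markup, find the first <a>, and rewrite it.
    static Element retarget(String html, String newHref, String newText) {
        Document doc = Jsoup.parse(html);       // 1. parse HTML from a string
        Element link = doc.selectFirst("a");    // 2. find an element with a CSS selector
        if (link == null) {
            throw new IllegalArgumentException("no <a> element in input");
        }
        link.attr("href", newHref);             // 3. change an attribute,
        link.text(newText);                     //    the text,
        link.addClass("external");              //    and the class list
        return link;
    }

    public static void main(String[] args) {
        Element link = retarget("<p><a href=\"/old\">old</a></p>",
                "https://example.org/", "Example");
        System.out.println(link.outerHtml());
    }
}
```

The returned `Element` is still attached to its parsed `Document`, so the changes are also visible when the whole document is serialized.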

III. Basic usage of Jsoup

1. Obtaining a Document object

Document document = Jsoup.parse(new File("D:\\information\\test.html"), "utf-8");
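
The line above reads from a local file. As a rough sketch, the other two sources mentioned in the introduction (a string and a URL) can be handled the same way. The class `DocumentSources` and its method names are hypothetical wrappers; `fromUrl` performs a real HTTP GET, so it is defined here but not called.

```java
import java.io.File;
import java.io.IOException;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class DocumentSources {
    // From an HTML string that is already in memory
    static Document fromString(String html) {
        return Jsoup.parse(html);
    }

    // From a local file, decoded with the given charset (for example "UTF-8")
    static Document fromFile(File file, String charsetName) throws IOException {
        return Jsoup.parse(file, charsetName);
    }

    // From a URL; this issues an HTTP GET and therefore needs network access
    static Document fromUrl(String url) throws IOException {
        return Jsoup.connect(url).get();
    }

    public static void main(String[] args) {
        Document doc = fromString(
                "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>");
        System.out.println(doc.title());        // prints "Demo"
        System.out.println(doc.body().text());  // prints "Hello"
    }
}
```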

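2. Retrieving elements the DOM way

Once you have a Document object, the next step is to traverse it and pull out the elements you need.

Document provides a rich set of methods for retrieving specific elements.

getElementById(String id): find an element by its id
getElementsByTag(String tagName): find elements by tag name
getElementsByClass(String className): find elements by class name
getElementsByAttribute(String key): find elements that have the given attribute
getElementsByAttributeValue(String key, String value): find elements whose attribute has the given value
getAllElements(): get all elements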
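
The lookup methods listed above can be tried out on a small in-memory document. This is only a sketch: the class `DomLookupSketch` and the sample markup are invented for the demonstration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class DomLookupSketch {
    // Made-up sample markup
    static final String HTML =
          "<div id=\"main\">"
        + "<p class=\"note\">first</p>"
        + "<p class=\"note\" data-kind=\"tip\">second</p>"
        + "<a href=\"/a\">link</a>"
        + "</div>";

    static Document parseSample() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document doc = parseSample();
        Element mainDiv = doc.getElementById("main");   // one element, or null if absent
        Elements paragraphs = doc.getElementsByTag("p");
        Elements notes = doc.getElementsByClass("note");
        Elements withKind = doc.getElementsByAttribute("data-kind");
        Elements linkToA = doc.getElementsByAttributeValue("href", "/a");

        System.out.println(mainDiv.tagName());        // prints "div"
        System.out.println(paragraphs.size());        // prints "2"
        System.out.println(notes.first().text());     // prints "first"
        System.out.println(withKind.text());          // prints "second"
        System.out.println(linkToA.attr("href"));     // prints "/a"
    }
}
```

Note that `getElementById` returns a single `Element`, while the other methods return an `Elements` collection that may be empty.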

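3. Finding elements with selectors

Elements can also be found with selectors similar to those of CSS or jQuery.

This is done with the following method of the Element class: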

public Elements select(String cssQuery)

Pass in a CSS- or jQuery-like selector string, and it returns the matching elements.
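
A few selector strings in action, as a hedged sketch. The class `SelectorSketch`, the helper `textsOf`, and the sample table are assumptions made for this example; the class names `left`, `class1`, and `class2` simply mirror the ones used later in this post.

```java
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class SelectorSketch {
    // Made-up sample table
    static final String HTML =
          "<table class=\"left\">"
        + "<tr><td class=\"head\" colspan=\"2\">Header</td></tr>"
        + "<tr><td class=\"class2\">A</td><td class=\"class1\">a1</td><td>10</td></tr>"
        + "</table>";

    // Hypothetical helper: run a selector and return the text of every match
    static List<String> textsOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        System.out.println(textsOf(HTML, "td[colspan]"));                // prints "[Header]"
        System.out.println(textsOf(HTML, "td.class1"));                  // prints "[a1]"
        System.out.println(textsOf(HTML, "table.left td:not([class])")); // prints "[10]"
        System.out.println(textsOf(HTML, "tr > td"));                    // prints "[Header, A, a1, 10]"
    }
}
```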

IV. Jsoup code example

This post was originally written to parse a table in an HTML file and convert it into a Bean.

1. Adding the dependency

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.12.1</version>
</dependency>

2. Code example

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Assumption: shiftInformationHashMap is a static field of the enclosing class;
// the original post does not show its declaration.
private static final Map<String, String> shiftInformationHashMap = new HashMap<>();

// Use Jsoup to read the cells of each table and collect them into the map
private static void HTMLParserMapInit() throws IOException {
    Document document = Jsoup.parse(new File("D:\\information\\test.html"), "utf-8");
    Elements tableTitles = document.select(".title");
    Elements tables = document.select(".left");
    for (int i = 0; i < tableTitles.size(); i++) {
        // Caption of the i-th table (not used further below)
        String tableTitle = tableTitles.get(i).text();
        String keyLevel1 = "";
        String keyLevel2 = "";
        String value = "";
        // Column headings, collected from the cells that carry a colspan attribute
        String title = "";
        Element table = tables.get(i);
        for (Element eTr : table.select("tr")) {
            for (Element eTd : eTr.select("td")) {
                String tagColspan = eTd.attr("colspan");
                String tagClass = eTd.attr("class");
                String tagText = eTd.text();
                if (!tagColspan.isEmpty()) {
                    title += tagText + ",";
                }
                if (tagClass.equals("class2")) {
                    keyLevel1 = tagText;        // first-level key
                } else if (tagClass.equals("class1")) {
                    keyLevel2 = tagText;        // second-level key
                } else if (tagClass.isEmpty()) {
                    value += tagText + ",";     // plain data cell
                }
            }
            // Store the row once at least one key level is known
            if (!(keyLevel1.isEmpty() && keyLevel2.isEmpty())) {
                if (!value.isEmpty()) {
                    value = value.substring(0, value.length() - 1);
                    shiftInformationHashMap.put(keyLevel1 + "," + keyLevel2, value);
                }
                value = "";
            }
        }
        // Drop the trailing comma, guarding against tables without colspan cells
        if (!title.isEmpty()) {
            title = title.substring(0, title.length() - 1);
        }
        System.out.println("title," + title);
        System.out.println("hashMap," + shiftInformationHashMap);
    }
}

Once the data in the HTML has been parsed into a HashMap, everything becomes clear at a glance.
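
To show what such a map might look like, here is a simplified, self-contained variant of the parsing logic run against an inline table. It is a sketch, not the code above: the class `TableToMapSketch`, the sample markup, and the `head` class on the heading cells are assumptions, and the second-level key is reset for every row.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class TableToMapSketch {
    // Made-up table: "class2" marks the first-level key, "class1" the second-level key
    static final String HTML =
          "<table class=\"left\">"
        + "<tr><td class=\"head\" colspan=\"2\">Name</td>"
        + "<td class=\"head\">Mon</td><td class=\"head\">Tue</td></tr>"
        + "<tr><td class=\"class2\" rowspan=\"2\">TeamA</td>"
        + "<td class=\"class1\">Alice</td><td>day</td><td>night</td></tr>"
        + "<tr><td class=\"class1\">Bob</td><td>off</td><td>day</td></tr>"
        + "</table>";

    static Map<String, String> parse(String html) {
        Document doc = Jsoup.parse(html);
        Map<String, String> result = new LinkedHashMap<>();
        String keyLevel1 = "";                 // kept across rows because of rowspan
        for (Element row : doc.select("table.left tr")) {
            String keyLevel2 = "";
            List<String> values = new ArrayList<>();
            for (Element cell : row.select("td")) {
                if (cell.hasClass("class2")) {
                    keyLevel1 = cell.text();
                } else if (cell.hasClass("class1")) {
                    keyLevel2 = cell.text();
                } else if (cell.className().isEmpty()) {
                    values.add(cell.text());   // plain data cell
                }
            }
            if (!keyLevel2.isEmpty() && !values.isEmpty()) {
                result.put(keyLevel1 + "," + keyLevel2, String.join(",", values));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        parse(HTML).forEach((key, value) -> System.out.println(key + " => " + value));
    }
}
```

With this sample, the map holds two entries: `TeamA,Alice` maps to `day,night` and `TeamA,Bob` maps to `off,day`.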

V. Converting a Map to a Bean

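import java.lang.reflect.Field;
import java.util.Map;

// Copy each map entry into the field of the same name on a new instance of clz
public static <T, V> T map2Bean(Map<String, V> map, Class<T> clz) throws Exception {
    // Class.newInstance() is deprecated since Java 9; use the no-arg constructor instead
    T obj = clz.getDeclaredConstructor().newInstance();
    for (Map.Entry<String, V> entry : map.entrySet()) {
        // Throws NoSuchFieldException if the bean has no field named after the key
        Field field = clz.getDeclaredField(entry.getKey());
        field.setAccessible(true);
        field.set(obj, entry.getValue());
    }
    return obj;
}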
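
A small usage sketch of this reflection approach follows. The nested bean `Score`, the wrapper `demo`, and the sample values are hypothetical; the conversion method is repeated so the example compiles on its own.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

class Map2BeanDemo {
    // Hypothetical bean; the fields are Object so that any map value can be assigned
    static class Score {
        private Object id;
        private Object score;
        Object getId() { return id; }
        Object getScore() { return score; }
    }

    static <T, V> T map2Bean(Map<String, V> map, Class<T> clz)
            throws ReflectiveOperationException {
        T obj = clz.getDeclaredConstructor().newInstance();
        for (Map.Entry<String, V> entry : map.entrySet()) {
            Field field = clz.getDeclaredField(entry.getKey());
            field.setAccessible(true);
            field.set(obj, entry.getValue());
        }
        return obj;
    }

    // Builds a sample map and converts it, wrapping the checked exception
    static Score demo() {
        Map<String, String> row = new HashMap<>();
        row.put("id", "1001");
        row.put("score", "95");
        try {
            return map2Bean(row, Score.class);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Score s = demo();
        System.out.println(s.getId() + " -> " + s.getScore()); // prints "1001 -> 95"
    }
}
```

One design point to keep in mind: `Field.set` does not convert types, so putting a String value into an `int` field would throw an `IllegalArgumentException`. Declaring the bean fields as `Object` (or as `String`) avoids that.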

VI. Parsing a CSV file

1. The CSV file

2. The Bean class

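import lombok.Data;

// @Data is Lombok's annotation; it generates getters, setters, equals/hashCode and toString
@Data
public class ScoreBean {
    private Object id;
    private Object score;
}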

3. Method for reading the CSV file

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

public static List<HashMap<String, Object>> readCSVToList(String filePath) throws Exception {
    List<HashMap<String, Object>> list = new ArrayList<>();
    // try-with-resources closes the reader even if reading fails;
    // I/O errors propagate to the caller instead of being swallowed
    try (BufferedReader reader = new BufferedReader(new FileReader(filePath))) {
        // The first line holds the column names
        String headerLine = reader.readLine();
        if (headerLine == null) {
            return list; // empty file
        }
        String[] headTitle = headerLine.split(",");
        String line;
        while ((line = reader.readLine()) != null) {
            HashMap<String, Object> hashMap = new HashMap<>();
            String[] itemArray = line.split(",");
            // Ignore cells beyond the last header column
            for (int i = 0; i < itemArray.length && i < headTitle.length; i++) {
                hashMap.put(headTitle[i], itemArray[i]);
            }
            list.add(hashMap);
        }
    }
    return list;
}
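
The same header-then-rows approach can be tried without touching the file system. The sketch below reads from a `StringReader`; the class `CsvSketch`, the method names, and the sample data are assumptions (the `id` and `score` columns are chosen to match the `ScoreBean` fields, not taken from the author's actual file).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class CsvSketch {
    // Hypothetical sample data; the header names match the ScoreBean fields
    static final String SAMPLE = "id,score\n1001,95\n1002,88\n";

    // The first line gives the keys, every following line becomes one map
    static List<Map<String, String>> read(Reader source) throws IOException {
        List<Map<String, String>> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String headerLine = reader.readLine();
            if (headerLine == null) {
                return rows; // empty input
            }
            String[] header = headerLine.split(",");
            String line;
            while ((line = reader.readLine()) != null) {
                String[] cells = line.split(",");
                Map<String, String> row = new LinkedHashMap<>();
                for (int i = 0; i < header.length && i < cells.length; i++) {
                    row.put(header[i], cells[i]);
                }
                rows.add(row);
            }
        }
        return rows;
    }

    static List<Map<String, String>> readSample() {
        try {
            return read(new StringReader(SAMPLE));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readSample()); // prints "[{id=1001, score=95}, {id=1002, score=88}]"
    }
}
```

Splitting on `,` is only adequate for simple files. Quoted fields that contain commas or line breaks need a real CSV parser.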

4. Test class

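public static void main(String[] args) throws Exception {
    String path = "D:\\scoreInfo.csv";
    List<HashMap<String, Object>> list = readCSVToList(path);
    for (HashMap<String, Object> hashMap : list) {
        // BeanUtil.HashMapToBeanUtil is the author's own helper and is not shown in this post;
        // it presumably performs the same Map-to-Bean conversion as map2Bean in section V
        BeanUtil.HashMapToBeanUtil(hashMap, ScoreBean.class);
    }
}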