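The Jsoup Class

An earlier article, "Summary of loading HTML source code in an Android client", briefly introduced how to use Jsoup but did not study it in detail. This article takes a closer look at its structure and methods.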

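I. Introduction

Jsoup is an HTML parser. It can parse a URL directly, and it can also parse HTML text. Data can then be extracted and manipulated through DOM methods, CSS selectors, and jQuery-like operations. Its main features are: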

1. Parse HTML from a URL, a string, or a file

2. Find and extract data

3. Manipulate HTML elements, attributes, and text.

Jsoup directly extends Object and is declared as: public class Jsoup extends Object

It is the core public entry point to the jsoup library.

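II. Method details

1. public static Document parse(String html, String baseUri) parses HTML into a Document. The parser can build a document tree from any HTML.

About baseUri: URLs in HTML are often written as relative paths, and baseUri specifies the base they are resolved against. It matters whenever a relative URL in the HTML has to be converted to an absolute URL.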
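
As a minimal sketch of what baseUri does, the example below parses the same snippet with and without a base and reads the link back through Element.absUrl. The class name BaseUriDemo, the helper firstLinkAbsUrl, and the example.com URLs are illustrative assumptions, not part of the library.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BaseUriDemo {
    // Hypothetical helper: parse the HTML against baseUri and return
    // the absolute form of the first link's href attribute.
    static String firstLinkAbsUrl(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a").first();
        return link.absUrl("href");
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"/docs/index.html\">Docs</a></p>";

        // With a base URI, the relative href can be resolved.
        System.out.println(firstLinkAbsUrl(html, "http://example.com/blog/"));

        // With an empty base URI there is nothing to resolve against,
        // so absUrl returns an empty string.
        System.out.println(firstLinkAbsUrl(html, "").isEmpty());
    }
}
```

Reading the attribute with attr("href") instead of absUrl("href") would return the relative value exactly as written in the HTML.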

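2. public static Document parse(String html, String baseUri, Parser parser) parses an HTML string using the specified parser.

3. public static Document parse(String html) parses an HTML string into a Document. No baseUri is given here, so resolving relative URLs depends on a <base href> tag in the HTML.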
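
The dependence on <base href> can be sketched as follows. The class name BaseHrefDemo, its helper, and the example.com URL are assumptions made for illustration only.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BaseHrefDemo {
    // Hypothetical helper: parse without a baseUri and return the
    // absolute URL of the first link, or "" if it cannot be resolved.
    static String firstLinkAbsUrl(String html) {
        Document doc = Jsoup.parse(html);
        return doc.select("a").first().absUrl("href");
    }

    public static void main(String[] args) {
        String withBase = "<html><head><base href=\"http://example.com/dir/\"></head>"
                + "<body><a href=\"page.html\">x</a></body></html>";
        String withoutBase = "<html><head></head>"
                + "<body><a href=\"page.html\">x</a></body></html>";

        // The <base href> tag supplies the base for relative links.
        System.out.println(firstLinkAbsUrl(withBase));

        // No <base href> and no baseUri: the link stays unresolved.
        System.out.println(firstLinkAbsUrl(withoutBase).isEmpty());
    }
}
```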

4. public static Connection connect(String url) creates a Connection object for the given URL. It is commonly used to fetch and parse an HTML page.

For example: Document doc = Jsoup.connect("http://example.com").userAgent("Mozilla").data("name", "jsoup").get();

Document doc = Jsoup.connect("http://example.com").cookie("auth", "token").post();

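5. public static Document parse(File in, String charsetName, String baseUri) throws IOException parses an HTML file.

charsetName is the character encoding; setting it to UTF-8 is usually safe. An IOException is thrown if the file cannot be found, cannot be read, or the encoding is invalid.

6. public static Document parse(File in, String charsetName) throws IOException parses an HTML file, using the file's location as the baseUri. Everything else is the same as item 5.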
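
A minimal sketch of parsing a file: the example writes a small page to a temporary file and reads its title back. The class name ParseFileDemo and the helper titleOfTempFile are hypothetical; the helper wraps IOException in UncheckedIOException only to keep the call site short.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseFileDemo {
    // Hypothetical helper: write html to a temp file, parse that file
    // as UTF-8, and return the document title.
    static String titleOfTempFile(String html) {
        Path tmp = null;
        try {
            tmp = Files.createTempFile("jsoup-demo", ".html");
            Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
            // Two-argument form: the file location serves as the base URI.
            Document doc = Jsoup.parse(tmp.toFile(), "UTF-8");
            return doc.title();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (tmp != null) {
                try {
                    Files.deleteIfExists(tmp);
                } catch (IOException ignored) {
                    // Best-effort cleanup of the temp file.
                }
            }
        }
    }

    public static void main(String[] args) {
        String html = "<html><head><title>Hello</title></head><body></body></html>";
        System.out.println(titleOfTempFile(html));
    }
}
```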

7. public static Document parse(InputStream in, String charsetName, String baseUri) throws IOException reads an input stream and parses it into a Document.

8. public static Document parse(InputStream in, String charsetName, String baseUri, Parser parser) throws IOException reads an input stream and parses it using the specified parser.
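
The stream variant can be sketched with an in-memory stream. Here the bytes are deliberately encoded as ISO-8859-1 to show that charsetName has to match the actual encoding of the stream. The class name ParseStreamDemo and the helper firstParagraph are illustrative assumptions.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseStreamDemo {
    // Hypothetical helper: parse the bytes with the given charset and
    // return the text of the first <p> element.
    static String firstParagraph(byte[] bytes, String charsetName) {
        try (InputStream in = new ByteArrayInputStream(bytes)) {
            // The empty baseUri is fine here because no links are resolved.
            Document doc = Jsoup.parse(in, charsetName, "");
            return doc.select("p").first().text();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // "caf\u00e9" contains a non-ASCII character, so the encoding matters.
        byte[] latin1 = "<p>caf\u00e9</p>".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(firstParagraph(latin1, "ISO-8859-1"));
    }
}
```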

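9. public static Document parseBodyFragment(String bodyHtml, String baseUri) parses an HTML fragment that contains only body content, with a baseUri specified.

10. public static Document parseBodyFragment(String bodyHtml) parses an HTML fragment that contains only body content, without a baseUri.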
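
A short sketch of fragment parsing: the fragment below has an unclosed <p>, and the result is still a full Document whose body holds the parsed content. The class name FragmentDemo and the helper bodyOf are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class FragmentDemo {
    // Hypothetical helper: parse a body fragment and return the <body>
    // element of the resulting document.
    static Element bodyOf(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        return doc.body();
    }

    public static void main(String[] args) {
        // The <p> is never closed; the parser still builds a tree for it.
        Element body = bodyOf("<p>Hello <b>world</b>");
        System.out.println(body.text());
        System.out.println(body.select("p").size());
    }
}
```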

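11. public static Document parse(URL url, int timeoutMillis) throws IOException fetches the given URL and parses its HTML into a Document. For compatibility, connect(String url) is commonly used instead.

An IOException is thrown if the response code is not 200 or if reading the response stream fails.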

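12. public static String clean(String bodyHtml, String baseUri, Whitelist whitelist) filters untrusted input HTML against a whitelist of permitted tags and attributes and returns safe HTML, with a baseUri specified.

13. public static String clean(String bodyHtml, Whitelist whitelist) filters untrusted input HTML against a whitelist of permitted tags and attributes and returns safe HTML, without a baseUri.

14. public static boolean isValid(String bodyHtml, Whitelist whitelist) tests whether the input HTML contains only tags and attributes permitted by the whitelist.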
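
The cleaning and validation methods can be sketched as below. One assumption to note: the example targets jsoup 1.14.2 or later, where the Whitelist class was renamed Safelist (Whitelist was deprecated and later removed); on older releases, substitute Whitelist.basic(). The class name CleanDemo and the sample input are illustrative.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

public class CleanDemo {
    // Hypothetical helper: strip everything not permitted by the
    // "basic" list (simple text formatting tags and links).
    static String cleanBasic(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    // Hypothetical helper: true only if nothing would be removed.
    static boolean isValidBasic(String untrustedHtml) {
        return Jsoup.isValid(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        String dirty = "<p>Hi<script>alert(1)</script> <b>there</b></p>";

        // The <script> element is dropped; <p> and <b> are kept.
        System.out.println(cleanBasic(dirty));

        System.out.println(isValidBasic("<b>ok</b>"));
        System.out.println(isValidBasic(dirty));
    }
}
```

Safelist.none(), Safelist.simpleText(), and Safelist.relaxed() are other predefined lists with stricter or looser rules.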