Parsing HTML Documents with jsoup

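This article explains how to use jsoup to parse an HTML document, load a Document from a file or a URL, traverse a document with DOM methods, extract attributes, text, and HTML from elements, and list every link on a page. Each commonly used method is described in detail, and a complete example program is included for reference.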

Parsing and traversing an HTML document

How to parse an HTML document:

The code is as follows:

String html = "<html><head><title>First parse</title></head>"
    + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

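The parser makes every attempt to produce a clean parse from the HTML you provide, whether or not that HTML is well formed. For example, it handles:

1. Unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
2. Implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...)
3. Reliable document structure (the html element contains a head and a body, and only appropriate elements appear within the head)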
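
To make the list above concrete, here is a minimal sketch, assuming jsoup is on the classpath. The class name LenientParseSketch and the input string are made up for illustration. It feeds the parser two unclosed p tags and a title with no html, head, or body around them, then inspects the result.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

class LenientParseSketch {
    // Parses deliberately sloppy HTML: unclosed <p> tags, no <html>/<head>/<body>.
    static Document parseSloppy() {
        return Jsoup.parse("<title>First parse</title><p>Lorem <p>Ipsum");
    }

    public static void main(String[] args) {
        Document doc = parseSloppy();
        Elements paragraphs = doc.select("p");
        System.out.println(paragraphs.size());   // number of p elements the parser built
        System.out.println(doc.title());         // the title ends up inside the head
        System.out.println(doc.body().html());   // the body, with both paragraphs closed
    }
}
```

Each unclosed p becomes its own element, and the html, head, and body elements are created even though the input never mentions them.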

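The object model of a document

1. Documents consist of Elements and TextNodes (and a few other miscellaneous nodes).
2. The inheritance chain is: Document extends Element extends Node. TextNode extends Node.
3. An Element contains a list of child nodes and has one parent Element. Elements also provide a filtered list of child elements only.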

Loading a Document from a URL

Problem
You need to fetch and parse an HTML document from the web, and find data within it.

Solution
Use the Jsoup.connect(String url) method:

The code is as follows:

Document doc = Jsoup.connect("http://www.jb51.net/").get();
String title = doc.title();

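Description
connect(String url) creates a new Connection, and get() fetches and parses the HTML file. If an error occurs while fetching the URL, an IOException is thrown, which you should handle appropriately.

The Connection interface is designed for method chaining, so you can build more specific requests: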

The code is as follows:

Document doc = Jsoup.connect("http://www.jb51.net")
    .data("query", "Java")
    .userAgent("Mozilla")
    .cookie("auth", "token")
    .timeout(3000)
    .post();

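This method only supports web URLs (the http and https protocols); to load a document from a file, use parse(File in, String charsetName) instead.

Loading a Document from a file

Problem
You have an HTML file on your local disk that you want to parse, either to extract data from it or to modify it.

Solution
Use the static Jsoup.parse(File in, String charsetName, String baseUri) method: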

The code is as follows:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.jb51.net/");

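Description
parse(File in, String charsetName, String baseUri) loads and parses an HTML file. If an error occurs while loading the file, an IOException is thrown, which you should handle appropriately.
The baseUri parameter is used to resolve relative URLs in the file. If you do not need that, pass an empty string.
There is also a sister method, parse(File in, String charsetName), which uses the file's location as the baseUri. It is useful when the file being parsed belongs to a site stored on the local filesystem and its relative links also point into that filesystem.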
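
What "resolving a relative URL against baseUri" means can be sketched with the JDK alone. The snippet below uses java.net.URI as a stand-in and is not jsoup's own code; the class name BaseUriSketch and the sample paths are hypothetical. With jsoup itself, a lookup such as attr("abs:href"), which the example program later in this article uses, returns the resolved form.

```java
import java.net.URI;

class BaseUriSketch {
    // Resolves a link found in a document against the document's base URI.
    // Absolute links come back unchanged; relative ones are completed.
    static String resolve(String baseUri, String href) {
        return URI.create(baseUri).resolve(href).toString();
    }

    public static void main(String[] args) {
        String base = "http://www.jb51.net/";
        System.out.println(resolve(base, "images/logo.png"));      // relative path
        System.out.println(resolve(base, "/about.html"));          // root-relative path
        System.out.println(resolve(base, "http://example.com/x")); // already absolute
    }
}
```

Passing an empty baseUri to jsoup simply means there is nothing to resolve against, so relative links stay relative.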

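Using DOM methods to navigate a document

Problem
You have an HTML document that you want to extract data from, and you know its structure.

Solution
Once the HTML has been parsed into a Document, you can use DOM-like methods on it. Sample code: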

The code is as follows:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.jb51.net/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
    String linkHref = link.attr("href");
    String linkText = link.text();
}

Description
Element objects provide a range of DOM-like methods to find elements and to extract and manipulate their data:

Finding elements
getElementById(String id)
getElementsByTag(String tag)
getElementsByClass(String className)
getElementsByAttribute(String key) (and related methods)
Element siblings: siblingElements(), firstElementSibling(), lastElementSibling(), nextElementSibling(), previousElementSibling()
Graph: parent(), children(), child(int index)

Element data
attr(String key) to get and attr(String key, String value) to set attributes
attributes() to get all attributes
id(), className() and classNames()
text() to get and text(String value) to set the text content
html() to get and html(String value) to set the inner HTML content
outerHtml() to get the outer HTML value
data() to get data content (e.g. of script and style tags)
tag() and tagName()

Manipulating HTML and text
append(String html), prepend(String html)
appendText(String text), prependText(String text)
appendElement(String tagName), prependElement(String tagName)
html(String value)

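Using selector syntax to find elements
Problem
You want to find or manipulate elements using a CSS-like or jQuery-like selector syntax.

Solution
Use the Element.select(String selector) and Elements.select(String selector) methods: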

The code is as follows:

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://www.jb51.net/");

Elements links = doc.select("a[href]"); // a elements with an href attribute
Elements pngs = doc.select("img[src$=.png]"); // img elements whose src ends with .png

Element masthead = doc.select("div.masthead").first(); // first div with class="masthead"

Elements resultLinks = doc.select("h3.r > a"); // a elements that are direct children of an h3 with class="r"

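Description
jsoup elements support a CSS-like (or jQuery-like) selector syntax, which makes very powerful and flexible lookups possible.
The select method is available on Document, Element, and Elements objects. It is contextual, so you can filter by selecting from a specific element or by chaining select calls.
select returns an Elements collection, which provides a range of methods to extract and manipulate the results.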

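Selector overview
tagname: find elements by tag, e.g. a
ns|tag: find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id: find elements by ID, e.g. #logo
.class: find elements by class name, e.g. .masthead
[attribute]: elements with the attribute, e.g. [href]
[^attr]: elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value]: elements with the attribute value, e.g. [width=500]
[attr^=value], [attr$=value], [attr*=value]: elements whose attribute value starts with, ends with, or contains the value, e.g. [href*=/path/]
[attr~=regex]: elements whose attribute value matches the regular expression, e.g. img[src~=(?i)\.(png|jpe?g)]
*: all elements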

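Selector combinations
el#id: element with ID, e.g. div#logo
el.class: element with class, e.g. div.masthead
el[attr]: element with attribute, e.g. a[href]
Any combination, e.g. a[href].highlight
ancestor child: child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class "body"
parent > child: child elements that descend directly from parent, e.g. div.content > p finds p elements, and body > * finds the direct children of the body tag
siblingA + siblingB: a sibling B element immediately preceded by sibling A, e.g. div.head + div
siblingA ~ siblingX: a sibling X element preceded by sibling A, e.g. h1 ~ p
el, el, el: group multiple selectors to find the unique elements that match any of them, e.g. div.masthead, div.logo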

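Pseudo selectors
:lt(n): elements whose sibling index (their position in the DOM tree relative to their parent) is less than n, e.g. td:lt(3) finds the first three cells of each row
:gt(n): elements whose sibling index is greater than n, e.g. div p:gt(2) finds p elements inside a div whose sibling index is greater than 2
:eq(n): elements whose sibling index is equal to n, e.g. form input:eq(1) finds input elements at sibling index 1 inside a form
:has(selector): elements that contain elements matching the selector, e.g. div:has(p) finds divs that contain p elements
:not(selector): elements that do not match the selector, e.g. div:not(.logo) finds all divs that do not have the class "logo"
:contains(text): elements that contain the given text; the search is case-insensitive, e.g. p:contains(jsoup)
:containsOwn(text): elements that directly contain the given text
:matches(regex): elements whose text matches the given regular expression, e.g. div:matches((?i)login)
:matchesOwn(regex): elements whose own text matches the given regular expression
Note: the pseudo-selector indexes above are 0-based, so the first element is at index 0, the second at index 1, and so on.
See the Selector API reference for full details.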
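
The 0-based indexing of :lt, :eq, and :gt is easy to get wrong, so here is a small sketch, assuming jsoup is on the classpath. The class name PseudoSelectorSketch, the helper textOf, and the HTML fragment are invented for this illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PseudoSelectorSketch {
    // One table row with four cells, plus two divs, one of them with class "logo".
    static final String HTML =
        "<table><tr><td>a</td><td>b</td><td>c</td><td>d</td></tr></table>"
      + "<div class=logo><p>one</p></div><div><span>two</span></div>";

    // Runs a selector and joins the text of all matches with single spaces.
    static String textOf(String selector) {
        Document doc = Jsoup.parse(HTML);
        return doc.select(selector).text();
    }

    public static void main(String[] args) {
        System.out.println(textOf("td:lt(2)"));       // cells at sibling index 0 and 1
        System.out.println(textOf("td:eq(1)"));       // the cell at sibling index 1
        System.out.println(textOf("td:gt(2)"));       // cells at sibling index 3 and up
        System.out.println(textOf("div:has(p)"));     // divs that contain a p element
        System.out.println(textOf("div:not(.logo)")); // divs without the class "logo"
    }
}
```

Note that td:eq(1) picks the second cell, not the first, because counting starts at 0.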

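Extracting attributes, text, and HTML from elements

Problem
After parsing a document and finding some elements, you want to get at the data inside those elements.

Solution
To get the value of an attribute, use the Node.attr(String key) method
For the text of an element, use the Element.text() method
For HTML, use Element.html() or Node.outerHtml() as appropriate
For example: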

The code is as follows:

String html = "An <a href='http://www.jb51.net/'>www.jb51.net</a> link.";
Document doc = Jsoup.parse(html); // parse the HTML string into a Document
Element link = doc.select("a").first(); // find the first a element

String text = doc.body().text(); // "An www.jb51.net link." (text of the whole body)
String linkHref = link.attr("href"); // "http://www.jb51.net/" (the link target)
String linkText = link.text(); // "www.jb51.net" (the link text)

String linkOuterH = link.outerHtml(); // "<a href="http://www.jb51.net/">www.jb51.net</a>"
String linkInnerH = link.html(); // "www.jb51.net" (the HTML inside the link)

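Description
The methods above are the core of element data access. A few others are also available:

Element.id()
Element.tagName()
Element.className() and Element.hasClass(String className)
All of these accessor methods have corresponding setter methods to change the data.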

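Example program: list links
This example program shows how to fetch a page from a URL, then extract its links, images, and other pointers, and examine their URLs and text.
Specify the URL to fetch as the program's sole argument.

The code is as follows: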

package org.jsoup.examples;

import org.jsoup.Jsoup;
import org.jsoup.helper.Validate;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import java.io.IOException;

/**
 * Example program to list links from a URL.
 */
public class ListLinks {
    public static void main(String[] args) throws IOException {
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");
        String url = args[0];
        print("Fetching %s...", url);

        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        Elements media = doc.select("[src]");
        Elements imports = doc.select("link[href]");

        print("\nMedia: (%d)", media.size());
        for (Element src : media) {
            if (src.tagName().equals("img"))
                print(" * %s: <%s> %sx%s (%s)",
                        src.tagName(), src.attr("abs:src"), src.attr("width"), src.attr("height"),
                        trim(src.attr("alt"), 20));
            else
                print(" * %s: <%s>", src.tagName(), src.attr("abs:src"));
        }

        print("\nImports: (%d)", imports.size());
        for (Element link : imports) {
            print(" * %s <%s> (%s)", link.tagName(), link.attr("abs:href"), link.attr("rel"));
        }

        print("\nLinks: (%d)", links.size());
        for (Element link : links) {
            print(" * a: <%s> (%s)", link.attr("abs:href"), trim(link.text(), 35));
        }
    }

    private static void print(String msg, Object... args) {
        System.out.println(String.format(msg, args));
    }

    // Shortens s to at most width characters, ending a shortened string with "."
    private static String trim(String s, int width) {
        if (s.length() > width)
            return s.substring(0, width-1) + ".";
        else
            return s;
    }
}
org/jsoup/examples/ListLinks.java

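Generating a Word document in Java with jsoup

First use jsoup to "normalize" the HTML you obtained (with Jsoup.parse(String html)), then use FileWriter to write that HTML to a local template.doc file. If the article contains images, template.doc now depends on the paths of your local image files: rename or move an image, reopen template.doc, and the image no longer displays (a cross appears in its place). To solve this, use jsoup to walk the HTML document, replace each img element with a ${image_N} marker (N is an auto-incrementing number), read the src attribute of that img element, and store the two as a key-value pair, for example: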

The code is as follows:

Map<Integer, String> imgMap = new HashMap<Integer, String>();
imgMap.put(1, "D:\\lucene.png");
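
The marker substitution described above can be sketched as follows. Note the swap: the article does this by walking the document with jsoup, whereas this sketch uses a plain regular expression so that it runs with only the JDK. A regex is not a robust way to process arbitrary HTML, and the class name ImagePlaceholderSketch and method replaceImages are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ImagePlaceholderSketch {
    // Matches an <img> tag with a quoted src attribute; deliberately simplistic.
    private static final Pattern IMG = Pattern.compile(
        "<img\\b[^>]*?\\bsrc\\s*=\\s*([\"'])(.*?)\\1[^>]*>", Pattern.CASE_INSENSITIVE);

    // Replaces each <img> with ${image_N} and records N -> src in imgMap.
    static String replaceImages(String html, Map<Integer, String> imgMap) {
        Matcher m = IMG.matcher(html);
        StringBuffer out = new StringBuffer();
        int index = 0;
        while (m.find()) {
            index++;
            imgMap.put(index, m.group(2));
            // quoteReplacement keeps the "$" in the marker from being read as a group reference
            m.appendReplacement(out, Matcher.quoteReplacement("${image_" + index + "}"));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> imgMap = new LinkedHashMap<Integer, String>();
        String html = "<p>a</p><img src=\"D:/lucene.png\"><p>b</p><img alt='x' src='b.jpg'/>";
        System.out.println(replaceImages(html, imgMap));
        System.out.println(imgMap);
    }
}
```

With jsoup, the same loop would iterate over doc.select("img"), read attr("src") from each element, and swap the element for a text node holding the marker.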

Your HTML content then looks like this (an example):

The code is as follows:

<html>
 <head></head>
 <body>
  <p>Test message 1</p>
  <p>${image_1}</p>
  <table>
   <tr>
    <td></td>
   </tr>
  </table>
  <p>Test message 2</p>
  <a href="http://www.jb51.net"><p>${image_2}</p></a>
  <p>Test message 3</p>
 </body>
</html>

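After saving it to a local file, use the MSOfficeGeneratorUtils class (listed below; it is built on the open-source JACOB component) to open the saved template.doc and call replaceText2Image to replace the image markers above with the actual images, which removes the dependency on local image paths. Then call copy to copy the whole document, close template.doc, create a new doc file (createNewDocument), call paste to paste the content copied from template.doc, and save. That is basically all there is to it.
Copying the content of a whole Word document has one hidden problem: when a lot of content has been copied, closing Word brings up a dialog asking whether the copied data should remain available to other applications. The fix is simple: before calling quit (the method that exits Word), create a new document, type one line of text, and call copy. With so little data copied, Word does not prompt when it closes. See the following code:

The code is as follows:

// Copy a nearly empty *.doc document so that Word does not ask, on exit, whether the large amount of copied content should remain available to other applications
msOfficeUtils.createNewDocument();
msOfficeUtils.insertText("Test message");
msOfficeUtils.copy();
msOfficeUtils.close();
msOfficeUtils.quit();

JACOB on SourceForge
jsoup official website

MSOfficeGeneratorUtils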
package com.topstar.test;
import java.io.File;
import java.io.IOException;
import java.util.List;
import com.jacob.activeX.ActiveXComponent;
import com.jacob.com.ComThread;
import com.jacob.com.Dispatch;
import com.jacob.com.Variant;
/**
* Performs common Microsoft Office Word operations through JACOB
*
* @author xiaowu
* @category topstar
* @version 1.0
* @since 2011-12-5
*/
public class MSOfficeGeneratorUtils {
/**
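* The Microsoft Office Word application object
*/
private ActiveXComponent word = null;
/**
* The active Word document
*/
private Dispatch document = null;
/**
* The collection of all Word documents
*/
private Dispatch documents = null;
/**
* selection represents the selected content in the active document window. If nothing is selected, it represents the insertion point (the cursor position).
* Each document window has exactly one selection object, and only one selection object is active in the whole application at a time
*/
private Dispatch selection = null;
/**
* A range object represents a contiguous area of the document, defined by a start and an end character position.
* Ranges are independent of the selection: you can define and work with a range without changing the selection, and a document can hold several ranges but only one selection
*/
private Dispatch range = null;
/**
* The PageSetup object holds the page settings of the whole document (paper size, left margin, bottom margin, and so on)
*/
private Dispatch pageSetup = null;
/**
* All tables in the document
*/
private Dispatch tables = null;
/** A single table */
private Dispatch table = null;
/** All rows of the table */
private Dispatch rows = null;
/** All columns of the table */
private Dispatch cols = null;
/** A given row of the table */
private Dispatch row = null;
/** A given column of the table */
private Dispatch col = null;
/** A given cell of the table */
private Dispatch cell = null;
/** Font */
private Dispatch font = null;
/** Alignment */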
private Dispatch alignment = null;
/**
* Constructor
*
* @param visible
* whether the Word application window is visible while the document is generated
*/
public MSOfficeGeneratorUtils(boolean visible) {
if (this.word == null) {
 // Initialize the Microsoft Office Word instance
 this.word = new ActiveXComponent("Word.Application");
 this.word.setProperty("Visible", new Variant(visible));
 // Disable macros
 this.word.setProperty("AutomationSecurity", new Variant(3));
}
if (this.documents == null)
 this.documents = word.getProperty("Documents").toDispatch();
}
/**
* Sets the page orientation and margins
*
* @param orientation
* page orientation
* <ul>
* <li>0 portrait</li>
* <li>1 landscape</li>
* </ul>
* @param leftMargin
* left margin
* @param rightMargin
* right margin
* @param topMargin
* top margin
* @param bottomMargin
* bottom margin
*/
public void setPageSetup(int orientation, int leftMargin, int rightMargin,
 int topMargin, int bottomMargin) {
if (this.pageSetup == null)
 this.getPageSetup();
Dispatch.put(pageSetup, "Orientation", orientation);
Dispatch.put(pageSetup, "LeftMargin", leftMargin);
Dispatch.put(pageSetup, "RightMargin", rightMargin);
Dispatch.put(pageSetup, "TopMargin", topMargin);
Dispatch.put(pageSetup, "BottomMargin", bottomMargin);
}
/**
* Opens a Word document
*
* @param docPath
* path of the Word document
* @return the opened document object
*/
public Dispatch openDocument(String docPath) {
this.document = Dispatch.call(documents, "Open", docPath).toDispatch();
this.getSelection();
this.getRange();
this.getAlignment();
this.getFont();
this.getPageSetup();
return this.document;
}
/**
* Creates a new document
*
* @return the document object
*/
public Dispatch createNewDocument() {
this.document = Dispatch.call(documents, "Add").toDispatch();
this.getSelection();
this.getRange();
this.getPageSetup();
this.getAlignment();
this.getFont();
return this.document;
}
/**
* Gets the selected content or the insertion point
*
* @return selection
*/
public Dispatch getSelection() {
this.selection = word.getProperty("Selection").toDispatch();
return this.selection;
}
/**
* Gets the range of the current selection; a selection must already exist
*
* @return range
*/
public Dispatch getRange() {
this.range = Dispatch.get(this.selection, "Range").toDispatch();
return this.range;
}
/**
* Gets the page setup of the current document
*/
public Dispatch getPageSetup() {
if (this.document == null)
 return this.pageSetup;
this.pageSetup = Dispatch.get(this.document, "PageSetup").toDispatch();
return this.pageSetup;
}
/**
* Moves the selection or insertion point up
*
* @param count
* number of steps to move
*/
public void moveUp(int count) {
for (int i = 0; i < count; i++)
 Dispatch.call(this.selection, "MoveUp");
}
/**
* Moves the selection or insertion point down
*
* @param count
* number of steps to move
*/
public void moveDown(int count) {
for (int i = 0; i < count; i++)
 Dispatch.call(this.selection, "MoveDown");
}
/**
* Moves the selection or insertion point left
*
* @param count
* number of steps to move
*/
public void moveLeft(int count) {
for (int i = 0; i < count; i++)
 Dispatch.call(this.selection, "MoveLeft");
}
/**
* Moves the selection or insertion point right
*
* @param count
* number of steps to move
*/
public void moveRight(int count) {
for (int i = 0; i < count; i++)
 Dispatch.call(this.selection, "MoveRight");
}
/**
* Inserts hard line breaks (the Enter key)
*
* @param count
* number of line breaks
*/
public void enterDown(int count) {
for (int i = 0; i < count; i++)
 Dispatch.call(this.selection, "TypeParagraph");
}
/**
* Moves the insertion point to the start of the document
*/
public void moveStart() {
Dispatch.call(this.selection, "HomeKey", new Variant(6));
}
/**
* Moves the insertion point to the end of the document
*/
public void moveEnd() {
Dispatch.call(selection, "EndKey", new Variant(6));
}

/**
* Searches for text starting from the selection or insertion point
*
* @param toFindText
* the text to find
* @return true if the text was found, in which case the match becomes the selection
*/
public boolean find(String toFindText) {
// Start searching from the position of the selection
Dispatch find = Dispatch.call(this.selection, "Find").toDispatch();
// Set the text to find
Dispatch.put(find, "Text", toFindText);
// Search forward
Dispatch.put(find, "Forward", "True");
// Match formatting
Dispatch.put(find, "Format", "True");
// Match case
Dispatch.put(find, "MatchCase", "True");
// Match whole words only
Dispatch.put(find, "MatchWholeWord", "True");
// Find and select
return Dispatch.call(find, "Execute").getBoolean();
}
/**
* Replaces the selected content
*
* @param newText
* the replacement text
*/
public void replace(String newText) {
// Set the replacement text
Dispatch.put(this.selection, "Text", newText);
}
/**
* Replaces every occurrence in the document
*
* @param oldText
* the text to be replaced
* @param replaceObj
* the replacement: plain text, or the path of an image
*/
public void replaceAll(String oldText, Object replaceObj) {
// Move the insertion point to the start of the document
moveStart();
// The replacement value (text or image path)
String newText = (String) replaceObj;
// Image replacement
if (oldText.indexOf("image") != -1 || newText.lastIndexOf(".bmp") != -1 || newText.lastIndexOf(".jpg") != -1 || newText.lastIndexOf(".gif") != -1) {
 while (find(oldText)) {
 insertImage(newText);
 Dispatch.call(this.selection, "MoveRight");
 }
 // Text replacement
} else {
 while (find(oldText)) {
 replace(newText);
 Dispatch.call(this.selection, "MoveRight");
 }
}
}

/**
* Replaces the given text with an image
* @param replaceText the text to replace
* @param imgPath path of the image
*/
public void replaceText2Image(String replaceText,String imgPath){
moveStart();
while(find(replaceText)){
 insertImage(imgPath);
 moveEnd();
 enterDown(1);
}
}
/**
* Inserts an image at the current insertion point
*
* @param imagePath
* path of the image
*/
public void insertImage(String imagePath) {
Dispatch.call(Dispatch.get(selection, "InLineShapes").toDispatch(), "AddPicture", imagePath);
}
/**
* Merges cells
*
* @param tableIndex
* index of the table, starting at 1
* @param fstCellRowIdx
* start row
* @param fstCellColIdx
* start column
* @param secCellRowIdx
* end row
* @param secCellColIdx
* end column
*/
public void mergeCell(int tableIndex, int fstCellRowIdx, int fstCellColIdx,
 int secCellRowIdx, int secCellColIdx) {
getTable(tableIndex);
Dispatch fstCell = Dispatch.call(table, "Cell",
 new Variant(fstCellRowIdx), new Variant(fstCellColIdx))
 .toDispatch();
Dispatch secCell = Dispatch.call(table, "Cell",
 new Variant(secCellRowIdx), new Variant(secCellColIdx))
 .toDispatch();
Dispatch.call(fstCell, "Merge", secCell);
}
/**
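* Splits the current cell
*
* @param numRows
* number of rows to split into; pass 1 to leave the rows unsplit
* @param numColumns
* number of columns to split into; pass 1 to leave the columns unsplit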
*/
public void splitCell(int numRows, int numColumns) {
Dispatch.call(this.cell, "Split", new Variant(numRows), new Variant(
 numColumns));
}
/**
* Writes content into the table
*
* @param list
* the content to write.
* Note: list.size() should match the number of rows in the table, and the length of each String array should match the number of columns
*/
public void insertToTable(List<String[]> list) {
if (list == null || list.size() <= 0)
 return;
if (this.table == null)
 return;
for (int i = 0; i < list.size(); i++) {
 String[] strs = list.get(i);
 for (int j = 0; j < strs.length; j++) {
 // Visit each cell of the table; the number of iterations equals the number of values to fill in
 Dispatch cell = this.getCell(i + 1, j + 1);
 // Select this cell
 Dispatch.call(cell, "Select");
 // Write the content into this cell
 Dispatch.put(this.selection, "Text", strs[j]);
 // Move the insertion point to the next position
 }
 this.moveDown(1);
}
// Line break
this.enterDown(1);
}
/**
* Inserts text content at the current insertion point
*
* @param list
* the content to insert; list.size() is the number of lines
*/
public void insertToDocument(List<String> list) {
if (list == null || list.size() <= 0)
 return;
if (this.document == null)
 return;
for (String str : list) {
 Dispatch.put(this.selection, "Text", str);
 this.moveDown(1);
 this.enterDown(1);
}
}
/**
* Inserts text at the current insertion point
*
* @param insertText
* the text to insert
*/
public void insertToText(String insertText) {
Dispatch.put(this.selection, "Text", insertText);
}
/**
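* Inserts a string at the current insertion point. After a line of text is inserted this way, Word leaves it selected, so calling this method again overwrites it; remember to call moveRight afterwards to move the insertion point one position to the right.
* @param newText the new string to insert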
*/
public void insertText(String newText) {
Dispatch.put(selection, "Text", newText);
}
/**
* Creates a new table
*
* @param rowCount
* number of rows
* @param colCount
* number of columns
* @param width
* table border
* <ul>
* <li>0 no border</li>
* <li>1 with border</li>
* </ul>
* @return the table object
*/
public Dispatch createNewTable(int rowCount, int colCount, int width) {
if (this.tables == null)
 this.getTables();
this.getRange();
if (rowCount > 0 && colCount > 0)
 this.table = Dispatch.call(this.tables, "Add", this.range,
 new Variant(rowCount), new Variant(colCount),
 new Variant(width)).toDispatch();
return this.table;
}
/**
* Gets all tables of the current document
*
* @return tables
*/
public Dispatch getTables() {
if (this.document == null)
 return this.tables;
this.tables = Dispatch.get(this.document, "Tables").toDispatch();
return this.tables;
}
/**
* Gets the number of tables in the current document
*
* @return the number of tables
*/
public int getTablesCount() {
if (this.tables == null)
 this.getTables();
return Dispatch.get(tables, "Count").getInt();
}
/**
* Gets a table by index
*
* @param tableIndex
* index of the table
* @return table
*/
public Dispatch getTable(int tableIndex) {
if (this.tables == null)
 this.getTables();
if (tableIndex >= 0)
 this.table = Dispatch.call(this.tables, "Item", new Variant(tableIndex)).toDispatch();
return this.table;
}
/**
* Writes data into the given cell
*
* @param tableIndex
* table index
* @param cellRowIdx
* row index
* @param cellColIdx
* column index
* @param txt
* the text
*/
public void putTxtToCell(int tableIndex, int cellRowIdx, int cellColIdx, String txt) {
getTable(tableIndex);
getCell(cellRowIdx, cellColIdx);
Dispatch.call(this.cell, "Select");
Dispatch.put(this.selection, "Text", txt);
}
/**
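* Copies a paragraph from another document to the end of the current document
*
* @param anotherDocPath
* path of the other document on disk
* @param paragraphIndex
* index of the paragraph to copy in the other document (starting at 1)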
*/
public void copyParagraphFromAnotherDoc(String anotherDocPath, int paragraphIndex) {
Dispatch wordContent = Dispatch.get(this.document, "Content").toDispatch(); // 取得当前文档的内容
Dispatch.call(wordContent, "InsertAfter", "$selection$");// 插入特殊符定位插入点
copyParagraphFromAnotherDoc(anotherDocPath, paragraphIndex, "$selection$");
}
/**
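* Copies a paragraph from another document to the given position in the current document
*
* @param anotherDocPath
* path of the other document on disk
* @param paragraphIndex
* index of the paragraph to copy in the other document (starting at 1)
* @param pos
* the position in the current document (a marker text located with find)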
*/
public void copyParagraphFromAnotherDoc(String anotherDocPath, int paragraphIndex, String pos) {
Dispatch doc2 = null;
try {
 doc2 = Dispatch.call(documents, "Open", anotherDocPath).toDispatch();
 Dispatch paragraphs = Dispatch.get(doc2, "Paragraphs").toDispatch();
 Dispatch paragraph = Dispatch.call(paragraphs, "Item", new Variant(paragraphIndex)).toDispatch();
 Dispatch range = Dispatch.get(paragraph, "Range").toDispatch();
 Dispatch.call(range, "Copy");
 if (this.find(pos)) {
 getRange();
 Dispatch.call(this.range, "Paste");
 }
} catch (Exception e) {
 e.printStackTrace();
} finally {
 if (doc2 != null) {
 Dispatch.call(doc2, "Close", new Variant(true));
 doc2 = null;
 }
}
}
/**
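* Copies a table from another document to the given position in the current document
*
* @param anotherDocPath
* path of the other document on disk
* @param tableIndex
* index of the table to copy in the other document (starting at 1)
* @param pos
* the position in the current document (a marker text located with find)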
*/
public void copyTableFromAnotherDoc(String anotherDocPath, int tableIndex,
 String pos) {
Dispatch doc2 = null;
try {
 doc2 = Dispatch.call(documents, "Open", anotherDocPath)
 .toDispatch();
 Dispatch tables = Dispatch.get(doc2, "Tables").toDispatch();
 Dispatch table = Dispatch.call(tables, "Item",
 new Variant(tableIndex)).toDispatch();
 Dispatch range = Dispatch.get(table, "Range").toDispatch();
 Dispatch.call(range, "Copy");
 if (this.find(pos)) {
 getRange();
 Dispatch.call(this.range, "Paste");
 }
} catch (Exception e) {
 e.printStackTrace();
} finally {
 if (doc2 != null) {
 Dispatch.call(doc2, "Close", new Variant(true));
 doc2 = null;
 }
}
}
/**
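* Copies an image from another document to the given position in the current document
*
* @param anotherDocPath
* path of the other document on disk
* @param shapeIndex
* position of the image to copy in the other document
* @param pos
* the position in the current document (a marker text located with find)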
*/
public void copyImageFromAnotherDoc(String anotherDocPath, int shapeIndex,
 String pos) {
Dispatch doc2 = null;
try {
 doc2 = Dispatch.call(documents, "Open", anotherDocPath)
 .toDispatch();
 Dispatch shapes = Dispatch.get(doc2, "InLineShapes").toDispatch();
 Dispatch shape = Dispatch.call(shapes, "Item",
 new Variant(shapeIndex)).toDispatch();
 Dispatch imageRange = Dispatch.get(shape, "Range").toDispatch();
 Dispatch.call(imageRange, "Copy");
 if (this.find(pos)) {
 getRange();
 Dispatch.call(this.range, "Paste");
 }
} catch (Exception e) {
 e.printStackTrace();
} finally {
 if (doc2 != null) {
 Dispatch.call(doc2, "Close", new Variant(true));
 doc2 = null;
 }
}
}
/**
* Adds a row before the given row of the given table
*
* @param tableIndex
* the Nth table in the Word file (starting at 1)
* @param rowIndex
* index of the given row (starting at 1)
*/
public void addTableRow(int tableIndex, int rowIndex) {
getTable(tableIndex);
getTableRows();
getTableRow(rowIndex);
Dispatch.call(this.rows, "Add", new Variant(this.row));
}
/**
* Adds a row before the first row
*
* @param tableIndex
* the Nth table in the Word document (starting at 1)
*/
public void addFirstTableRow(int tableIndex) {
getTable(tableIndex);
getTableRows();
Dispatch row = Dispatch.get(rows, "First").toDispatch();
Dispatch.call(this.rows, "Add", new Variant(row));
}
/**
* Adds a row before the last row
*
* @param tableIndex
* the Nth table in the Word document (starting at 1)
*/
public void addLastTableRow(int tableIndex) {
getTable(tableIndex);
getTableRows();
Dispatch row = Dispatch.get(this.rows, "Last").toDispatch();
Dispatch.call(this.rows, "Add", new Variant(row));
}
/**
* Adds a row
*
* @param tableIndex
* the Nth table in the Word document (starting at 1)
*/
public void addRow(int tableIndex) {
getTable(tableIndex);
getTableRows();
Dispatch.call(this.rows, "Add");
}
/**
* Adds a column
*
* @param tableIndex
* the Nth table in the Word document (starting at 1)
*/
public void addCol(int tableIndex) {
getTable(tableIndex);
getTableColumns();
Dispatch.call(this.cols, "Add").toDispatch();
Dispatch.call(this.cols, "AutoFit");
}
/**
* Adds a column before the given column of a table
*
* @param tableIndex
* the Nth table in the Word document (starting at 1)
* @param colIndex
* index of the given column (starting at 1)
*/
public void addTableCol(int tableIndex, int colIndex) {
getTable(tableIndex);
getTableColumns();
getTableColumn(colIndex);
Dispatch.call(this.cols, "Add", this.col).toDispatch();
Dispatch.call(this.cols, "AutoFit");
}
/**
* Adds a column before the first column
*
* @param tableIndex
* the Nth table in the Word document (starting at 1)
*/
public void addFirstTableCol(int tableIndex) {
getTable(tableIndex);
Dispatch cols = getTableColumns();
Dispatch col = Dispatch.get(cols, "First").toDispatch();
Dispatch.call(cols, "Add", col).toDispatch();
Dispatch.call(cols, "AutoFit");
}
/**
* Adds a column before the last column
*
* @param tableIndex
* the Nth table in the Word document (starting at 1)
*/
public void addLastTableCol(int tableIndex) {
getTable(tableIndex);
Dispatch cols = getTableColumns();
Dispatch col = Dispatch.get(cols, "Last").toDispatch();
Dispatch.call(cols, "Add", col).toDispatch();
Dispatch.call(cols, "AutoFit");
}
/**
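* Gets the number of columns in the current table
*
* @return the column count
*/
public int getTableColumnsCount() {
if (this.table == null)
 return 0;
// Refresh the columns collection first so that it is never null here
return Dispatch.get(getTableColumns(), "Count").getInt();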
}
/**
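* Gets the number of rows in the current table
*
* @return the row count
*/
public int getTableRowsCount() {
if (this.table == null)
 return 0;
// Refresh the rows collection first so that it is never null here
return Dispatch.get(getTableRows(), "Count").getInt();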
}
/**
* Gets all columns of the current table
*
* @return cols
*/
public Dispatch getTableColumns() {
if (this.table == null)
 return this.cols;
this.cols = Dispatch.get(this.table, "Columns").toDispatch();
return this.cols;
}
/**
* Gets all rows of the current table
*
* @return rows
*/
public Dispatch getTableRows() {
if (this.table == null)
 return this.rows;
this.rows = Dispatch.get(this.table, "Rows").toDispatch();
return this.rows;
}
/**
* Gets a column of the current table by index
*
* @param columnIndex
* column index
* @return col
*/
public Dispatch getTableColumn(int columnIndex) {
if (this.cols == null)
 this.getTableColumns();
if (columnIndex >= 0)
 this.col = Dispatch.call(this.cols, "Item",
 new Variant(columnIndex)).toDispatch();
return this.col;
}
/**
* Gets a row of the current table by index
*
* @param rowIndex
* row index
* @return row
*/
public Dispatch getTableRow(int rowIndex) {
if (this.rows == null)
 this.getTableRows();
if (rowIndex >= 0)
 this.row = Dispatch.call(this.rows, "Item", new Variant(rowIndex))
 .toDispatch();
return this.row;
}
/**
* Auto-fits all tables in the current document
*/
public void autoFitTable() {
int count = this.getTablesCount();
for (int i = 0; i < count; i++) {
 Dispatch table = Dispatch.call(tables, "Item", new Variant(i + 1))
 .toDispatch();
 Dispatch cols = Dispatch.get(table, "Columns").toDispatch();
 Dispatch.call(cols, "AutoFit");
}
}
/**
* Gets a cell of the current table by row and column index
*
* @param cellRowIdx
* row index
* @param cellColIdx
* column index
* @return the cell object
*/
public Dispatch getCell(int cellRowIdx, int cellColIdx) {
if (this.table == null)
 return this.cell;
if (cellRowIdx >= 0 && cellColIdx >= 0)
 this.cell = Dispatch.call(this.table, "Cell",
 new Variant(cellRowIdx), new Variant(cellColIdx))
 .toDispatch();
return this.cell;
}
public void selectCell(int cellRowIdx, int cellColIdx) {
if (this.table == null)
 return;
getCell(cellRowIdx, cellColIdx);
if (cellRowIdx >= 0 && cellColIdx >= 0)
 Dispatch.call(this.cell, "select");
}
/**
* Sets the title of the current document
*
* @param title the title
* @param alignmentType the alignment
* @see #setAlignment(int)
*/
public void setTitle(String title, int alignmentType) {
if (title == null || "".equals(title))
 return;
if (this.alignment == null)
 this.getAlignment();
if(alignmentType != 0 && alignmentType != 1 && alignmentType != 2)
 alignmentType = 0;
Dispatch.put(this.alignment, "Alignment", alignmentType);
Dispatch.call(this.selection, "TypeText", title);
}
/**
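* Sets the border width of the current table
*
* @param width
* range: 1 < w < 13; 0 means no border
*/
public void setTableBorderWidth(int width) {
if (this.table == null)
 return;
/*
 * Border line indexes. 1: the top line, 2: the left line, 3: the bottom line, 4: the right line, 5: all horizontal lines except the top and bottom ones,
 * 6: all vertical lines except the left and right ones, 7: the diagonal from top left to bottom right, 8: the diagonal from bottom left to top right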
 */
Dispatch borders = Dispatch.get(table, "Borders").toDispatch();
Dispatch border = null;
for (int i = 1; i < 7; i++) {
 border = Dispatch.call(borders, "Item", new Variant(i))
 .toDispatch();
 if (width != 0) {
 Dispatch.put(border, "LineWidth", new Variant(width));
 Dispatch.put(border, "Visible", new Variant(true));
 } else if (width == 0) {
 Dispatch.put(border, "Visible", new Variant(false));
 }
}
}
/**
* Gets the value of the given cell in the given table
*
* @param tableIndex
* table index (starting at 1)
* @param rowIndex
* row index (starting at 1)
* @param colIndex
* column index (starting at 1)
* @return the text of the cell
*/
public String getTxtFromCell(int tableIndex, int rowIndex, int colIndex) {
String value = "";
// Make this the current table
getTable(tableIndex);
getCell(rowIndex, colIndex);
if (cell != null) {
 Dispatch.call(cell, "Select");
 value = Dispatch.get(selection, "Text").toString();
 value = value.substring(0, value.length() - 2); // strip the trailing carriage return
}
return value;
}
/**
* Applies bullets or numbering to the current selection
*
* @param tabIndex
* <ul>
* <li>1. Bullets</li>
* <li>2. Numbering</li>
* <li>3. Multilevel numbering</li>
* <li>4. List styles</li>
* </ul>
* @param index
* 0 means none; any other number is the position of the item on that tab
*/
public void applyListTemplate(int tabIndex, int index) {
// Get the ListGalleries collection
Dispatch listGalleries = Dispatch.get(this.word, "ListGalleries")
 .toDispatch();
// Get one gallery from the collection
Dispatch listGallery = Dispatch.call(listGalleries, "Item",
 new Variant(tabIndex)).toDispatch();
Dispatch listTemplates = Dispatch.get(listGallery, "ListTemplates")
 .toDispatch();
if (this.range == null)
 this.getRange();
Dispatch listFormat = Dispatch.get(this.range, "ListFormat")
 .toDispatch();
Dispatch.call(listFormat, "ApplyListTemplate",
 Dispatch.call(listTemplates, "Item", new Variant(index)),
 new Variant(true), new Variant(1), new Variant(0));
}
/**
* Adds a table of contents to the document
*/
public void addTablesOfContents() {
// Get the ActiveDocument, TablesOfContents, and range objects
Dispatch ActiveDocument = word.getProperty("ActiveDocument")
 .toDispatch();
Dispatch TablesOfContents = Dispatch.get(ActiveDocument,
 "TablesOfContents").toDispatch();
Dispatch range = Dispatch.get(this.selection, "Range").toDispatch();
// Add the table of contents
Dispatch.call(TablesOfContents, "Add", range, new Variant(true),
 new Variant(1), new Variant(3), new Variant(true), new Variant(
 ""), new Variant(true), new Variant(true));
}
/**
* Sets the alignment of the current selection
*
* @param alignmentType
* <ul>
* <li>0. Left</li>
* <li>1. Center</li>
* <li>2. Right</li>
* </ul>
*/
public void setAlignment(int alignmentType) {
if (this.alignment == null)
 this.getAlignment();
Dispatch.put(this.alignment, "Alignment", alignmentType);
}
/**
* Gets the alignment (paragraph format) of the current selection
*
* @return alignment
*/
public Dispatch getAlignment() {
if (this.selection == null)
 this.getSelection();
this.alignment = Dispatch.get(this.selection, "ParagraphFormat")
 .toDispatch();
return this.alignment;
}
/**
* Gets the font object
*
* @return font
*/
public Dispatch getFont() {
if (this.selection == null)
 this.getSelection();
this.font = Dispatch.get(this.selection, "Font").toDispatch();
return this.font;
}
/**
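* Sets the font of the current selection
*
* @param fontName
* font name, e.g. "微软雅黑" (Microsoft YaHei)
* @param isBold
* whether the text is bold
* @param isItalic
* whether the text is italic
* @param isUnderline
* whether the text is underlined
* @param rgbColor
* color value, e.g. "1,1,1,1"
* @param Scale
* character spacing (written to the Scaling property)
* @param fontSize
* font size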
*/
@Deprecated
public void setFontScale(String fontName, boolean isBold, boolean isItalic,
 boolean isUnderline, String rgbColor, int Scale, int fontSize) {
Dispatch.put(this.font, "Name", fontName);
Dispatch.put(this.font, "Bold", isBold);
Dispatch.put(this.font, "Italic", isItalic);
Dispatch.put(this.font, "Underline", isUnderline);
Dispatch.put(this.font, "Color", rgbColor);
Dispatch.put(this.font, "Scaling", Scale);
Dispatch.put(this.font, "Size", fontSize);
}

/**
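* Sets the font of the currently selected content
* @param isBold whether the text is bold
* @param isItalic whether the text is italic
* @param isUnderLine whether the text is underlined
* @param color font color as RGB, e.g. red is 255,0,0
* @param size font size, e.g. 12 (Chinese size "小四") or 16 (Chinese size "三号")
* @param name font name, e.g. 宋体 (SimSun), 新宋体 (NSimSun), 楷体 (KaiTi), 隶书 (LiSu)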
*/
public void setFont(boolean isBold,boolean isItalic,boolean isUnderLine,String color,String size,String name) {
 Dispatch font = Dispatch.get(getSelection(), "Font").toDispatch();
 Dispatch.put(font, "Name", new Variant(name));
 Dispatch.put(font, "Bold", new Variant(isBold));
 Dispatch.put(font, "Italic", new Variant(isItalic));
 Dispatch.put(font, "Underline", new Variant(isUnderLine));
 if(!"".equals(color))
 Dispatch.put(font, "Color", color);
 Dispatch.put(font, "Size", size);
}

/**
* Saves the document
*
* @param outputPath
* path to save to
*/
public void saveAs(String outputPath) {
if (this.document == null)
 return;
if (outputPath == null || "".equals(outputPath))
 return;
Dispatch.call(this.document, "SaveAs", outputPath);
}
/**
* Saves the document as HTML
*
* @param htmlFile
* path of the HTML file
*/
public void saveAsHtml(String htmlFile) {
Dispatch.invoke(this.document, "SaveAs", Dispatch.Method, new Object[] {
 htmlFile, new Variant(8) }, new int[1]);
}
/**
* saveFormat values (value | member name | description):
* 0 | wdFormatDocument | Microsoft Word format.
* 1 | wdFormatTemplate | Microsoft Word template format.
* 2 | wdFormatText | Microsoft Windows text format.
* 3 | wdFormatTextLineBreaks | Microsoft Windows text format with line breaks preserved.
* 4 | wdFormatDOSText | Microsoft DOS text format.
* 5 | wdFormatDOSTextLineBreaks | Microsoft DOS text with line breaks preserved.
* 6 | wdFormatRTF | Rich text format (RTF).
* 7 | wdFormatEncodedText | Encoded text format.
* 7 | wdFormatUnicodeText | Unicode text format.
* 8 | wdFormatHTML | Standard HTML format.
* 9 | wdFormatWebArchive | Web archive format.
* 10 | wdFormatFilteredHTML | Filtered HTML format.
* 11 | wdFormatXML | Extensible Markup Language (XML) format.
*/
/**
* Closes the current Word document.
*/
public void close() {
if (document == null)
 return;
Dispatch.call(document, "Close", new Variant(0));
}
/**
* Prints the current document.
*/
public void printFile() {
if (document == null)
 return;
Dispatch.call(document, "PrintOut");
}
/**
* Quits Microsoft Office Word.
*/
public void quit() {
word.invoke("Quit", new Variant[0]);
ComThread.Release();
}

/**
* Selects the entire document.
*/
public void selectAllContent(){
Dispatch.call(this.document,"select");
}

/**
* Selects the entire document and copies it to the clipboard.
*/
public void copy(){
Dispatch.call(this.document,"select");
Dispatch.call(this.selection,"copy");
}

/**
* Pastes the clipboard content at the current insertion point.
*/
public void paste(){
Dispatch.call(this.selection,"paste");
}

public static void main(String[] args) throws IOException {
MSOfficeGeneratorUtils officeUtils = new MSOfficeGeneratorUtils(true);
// officeUtils.openDocument("D:\TRS\TRSWCMV65HBTCIS\Tomcat\webapps\wcm\eipv65\briefreport\templates\zhengfa\头部.doc");
// officeUtils.replaceAll("${briefreport_year}", "2011");
// officeUtils.replaceAll("${briefreport_issue}", "3");
// File file = File.createTempFile("test", ".tmp");
// System.out.println(file.getAbsolutePath());
// file.delete();
// File file = new File("C:\DOCUME~1\ADMINI~1\LOCALS~1\Temp\test5411720146039914615.tmp");
// System.out.println(file.exists());

officeUtils.createNewDocument();
// officeUtils.createNewTable(1, 1, 1);
// officeUtils.insertText("发表时间:2011-11-11");
// officeUtils.moveRight(1);
// officeUtils.insertText("t");
// officeUtils.moveRight(1);
// officeUtils.insertText("所在频道:宏观环境/社会环境");
// officeUtils.moveRight(1);
// officeUtils.insertText("t");
// officeUtils.moveRight(1);
// officeUtils.insertText("文章作者:杨叶茂");
// officeUtils.moveRight(1);
officeUtils.insertText("I'm Chinese");
officeUtils.moveRight(1);
officeUtils.enterDown(1);
officeUtils.insertText("I'm not Chinese");
officeUtils.moveRight(1);


/* doc2 = Dispatch.call(documents, "Open", anotherDocPath).toDispatch();
Dispatch paragraphs = Dispatch.get(doc2, "Paragraphs").toDispatch();
Dispatch paragraph = Dispatch.call(paragraphs, "Item", new Variant(paragraphIndex)).toDispatch();*/

// officeUtils.setFontScale("微软雅黑", true, true, true, "1,1,1,1", 100,
// 18);
// officeUtils.setAlignment(1);
// officeUtils.insertToText("这是一个测试");
// officeUtils.moveEnd();
// officeUtils.setFontScale("微软雅黑", false, false, false, "1,1,1,1", 100,
// 18);
// officeUtils.insertImage("d:\11.jpg");
// officeUtils.enterDown(1);
// officeUtils.insertToText("这是我的照片");
// officeUtils.enterDown(1);
// officeUtils.createNewTable(3, 5, 1);
// List<String[]> list = new ArrayList<String[]>();
// for (int i = 0; i < 3; i++) {
// String[] strs = new String[5];
// for (int j = 0; j < 5; j++) {
// strs[j] = j + i + "";
// }
// list.add(strs);
// }
// officeUtils.insertToTable(list);
// officeUtils.createNewTable(10, 10, 1);
// officeUtils.moveEnd();
// officeUtils.enterDown(1);
// officeUtils.createNewTable(3,2,1);
// officeUtils.mergeCell(1, 1, 7, 1, 9);
// officeUtils.mergeCell(1, 2, 2, 3, 7);
// officeUtils.mergeCell(1, 3, 4, 9, 10);
// officeUtils.insertText("123");
// officeUtils.getCell(1, 2);
// officeUtils.splitCell(2 , 4);
// officeUtils.selectCell(1, 2);
// officeUtils.insertText("split");
// officeUtils.selectCell(1, 5);
// officeUtils.insertText("split1");
// officeUtils.selectCell(1, 6);
// officeUtils.insertText("yy");
// officeUtils.selectCell(2, 4);
// officeUtils.insertText("ltg");
// officeUtils.saveAs("D:\" + System.currentTimeMillis() + ".doc");
// officeUtils.close();
// officeUtils.quit();
}
}
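The setFont method above takes its color as a string such as "255,0,0" and hands it to the Word Font.Color property unchanged. Word represents colors as a single integer in which red is the lowest byte (R + G*256 + B*65536), so a conversion step may be needed before the value is written. The following is a minimal sketch of such a conversion; the class WordColor and the method toWordColor are hypothetical names and are not part of the original MSOfficeGeneratorUtils class.

```java
/** Hypothetical helper; not part of the original MSOfficeGeneratorUtils class. */
final class WordColor {
    private WordColor() {
    }

    /**
     * Converts an "R,G,B" string into the integer layout Word uses for colors:
     * red in the lowest byte, then green, then blue.
     */
    static int toWordColor(String rgb) {
        String[] parts = rgb.trim().split("\\s*,\\s*");
        if (parts.length != 3) {
            throw new IllegalArgumentException("Expected R,G,B but got: " + rgb);
        }
        int[] c = new int[3];
        for (int i = 0; i < 3; i++) {
            c[i] = Integer.parseInt(parts[i]);
            if (c[i] < 0 || c[i] > 255) {
                throw new IllegalArgumentException("Component out of range: " + c[i]);
            }
        }
        return c[0] + (c[1] << 8) + (c[2] << 16);
    }

    public static void main(String[] args) {
        System.out.println(toWordColor("255,0,0")); // red  -> 255
        System.out.println(toWordColor("0,0,255")); // blue -> 16711680
    }
}
```

With a helper like this, the color could be written as Dispatch.put(font, "Color", new Variant(WordColor.toWordColor(color))), assuming the caller passes exactly three components.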
TestJsoupComponent
package com.topstar.test;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.UUID;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import com.eprobiti.trs.TRSException;
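/**
* Basic approach: fetch the HTML content. Because it is not well-formed HTML, use Jsoup to normalize it into
* a standard HTML document, then walk every node looking for img tags. For each one, record its index, build the
* image's physical path from the file-naming rule, and replace the tag with a ${image_index} marker. Each
* {marker, path} pair is put into a Map, e.g. "${image_1}" -> "d:\lucene.png". Then use Jacob to open
* template.doc, select and copy the whole document, create a new document, paste the copied content,
* find each image marker, and replace it with the image.
*
* @since 2011-12-09
* @author xioawu
* @category topstar
* @version 1.0
*/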
public class TestJsoupComponent {
private static Document document;
private static Map<String, String> imgMap = new HashMap<String, String>(); // image marker -> physical path, e.g. {"${image_1}", "D:\lucene.png"}
private static List<String> files = new ArrayList<String>(); // names of the per-article .doc files generated locally
private static Integer imgIndex = 1; // index used in the next image marker
public static void main(String[] args) throws TRSException, IOException {
MSOfficeGeneratorUtils officeUtils = new MSOfficeGeneratorUtils(true); // run the generation step with Word hidden

String html = "<html>.....</html>";// article body; fill in your own HTML here
String header = "Test title"; // article title
document = Jsoup.parse(html);
// System.out.println(document.html());
for (Element element : document.body().select("body > *"))
 // recursively walk each direct child of body to find img tags; see sysElementText
 sysElementText(element);
File file = new File("D:" + File.separator + "template.doc");
file.createNewFile(); // create the template file (normalized HTML saved with a .doc extension)
FileWriter fw = new FileWriter(file);
fw.write(document.html(), 0, document.html().length());// write the normalized HTML to the file
fw.flush(); // flush the FileWriter buffer
fw.close();
officeUtils.openDocument("D:" + File.separator + "template.doc"); // open template.doc, the file generated from dochtmlcon in the trsserver eipdocument database
officeUtils.copy(); // copy the whole document
officeUtils.close();
officeUtils.createNewDocument();
officeUtils.paste(); // paste the whole document
for (Entry<String, String> entry : imgMap.entrySet()) // replace each image marker with its image
 officeUtils.replaceText2Image(entry.getKey(), entry.getValue());
officeUtils.moveStart(); // move the insertion point to the very top of the Word document
officeUtils.setFont(true, false, false, "0,0,0", "20", "宋体"); // set the font; see the setFont Javadoc for the parameters
officeUtils.setTitle(header, 1); // set the title
officeUtils.enterDown(1); // insert one line break
String filename = UUID.randomUUID().toString();
files.add(filename); // remember the file name
officeUtils.saveAs("D:" + File.separator + filename + ".doc"); // write D:\<UUID>.doc; the UUID avoids file name clashes
officeUtils.close(); // close the document created in Word
officeUtils.quit(); // quit Word
MSOfficeGeneratorUtils msOfficeUtils = new MSOfficeGeneratorUtils(false); // run the merge step with Word visible
msOfficeUtils.createNewDocument();
msOfficeUtils.saveAs("D:" + File.separator + "complete.doc");
msOfficeUtils.close();
for (String fileName : files) {
 msOfficeUtils.openDocument("D:" + File.separator + fileName + ".doc");
 msOfficeUtils.copy();
 msOfficeUtils.close();
 msOfficeUtils.openDocument("D:" + File.separator + "complete.doc");
 msOfficeUtils.moveEnd();
 msOfficeUtils.enterDown(1);
 msOfficeUtils.paste();
 msOfficeUtils.saveAs("D:" + File.separator + "complete.doc");
 msOfficeUtils.close();
}
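// Copy a small document to the clipboard so that, on exit, Word does not show the dialog asking whether the large amount of copied content should stay available to other applications.
msOfficeUtils.createNewDocument();
msOfficeUtils.insertText("test message");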
msOfficeUtils.copy();
msOfficeUtils.close();
msOfficeUtils.quit();
imgIndex = 1;
imgMap.clear();
}
public static void sysElementText(Node node) {
if (node.childNodes().size() == 0) {
 if (node.nodeName().equals("img")) { // resolve the image path
 node.after("${image_" + imgIndex + "}"); // insert the ${image_N} marker as a sibling right after the img tag
 String src = node.attr("src");
 node.remove(); // remove the img tag
 StringBuffer imgUrl = new StringBuffer("D:\\TRS\\TRSWCMV65HBTCIS\\WCMData\\webpic\\"); // path hard-coded for now; in production read it from the WebPic configuration entry
 imgUrl.append(src.substring(0, 8)).append("\\").append(src.subSequence(0, 10)).append("\\").append(src);
 // node.attr("src", imgUrl.toString()); // not needed: this img tag has already been removed
 imgMap.put("${image_" + imgIndex++ + "}", imgUrl.toString());
 }
} else {
 for (Node rNode : node.childNodes()) {
 sysElementText(rNode);
 }
}
}
}
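The marker and path bookkeeping in sysElementText can be looked at separately from Jsoup and Jacob. The following is a minimal stand-alone sketch of that step under stated assumptions: the class ImagePlaceholders, the base directory D:\webpic\ and the sample file names are all illustrative and do not come from the original code. It follows the same rule as the listing above, where the first 8 and first 10 characters of the file name become two nested directory names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical stand-alone sketch; names and sample values are illustrative only. */
final class ImagePlaceholders {
    private final String baseDir;
    private final Map<String, String> markerToPath = new LinkedHashMap<>();
    private int nextIndex = 1;

    ImagePlaceholders(String baseDir) {
        this.baseDir = baseDir;
    }

    /** Registers an image file name and returns the marker that replaces its img tag. */
    String register(String src) {
        if (src.length() < 10) {
            throw new IllegalArgumentException("File name too short for the naming rule: " + src);
        }
        String marker = "${image_" + nextIndex++ + "}";
        // baseDir\<first 8 chars>\<first 10 chars>\<file name>
        String path = baseDir + src.substring(0, 8) + "\\" + src.substring(0, 10) + "\\" + src;
        markerToPath.put(marker, path);
        return marker;
    }

    /** Returns the marker-to-path pairs in registration order. */
    Map<String, String> markers() {
        return markerToPath;
    }

    public static void main(String[] args) {
        ImagePlaceholders placeholders = new ImagePlaceholders("D:\\webpic\\");
        System.out.println(placeholders.register("W020111209123456.png"));
        System.out.println(placeholders.markers());
    }
}
```

Unlike the original method, this sketch rejects file names shorter than 10 characters instead of letting substring throw, and it keeps the counter and the map as instance state so that nothing has to be reset between runs.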
