Java: converting docx to HTML

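Source code: https://github.com/sunmaolin/wordToHtml
The repository contains working examples for doc-to-HTML and docx-to-HTML conversion. They are easy to follow and should work as written.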

Note: this article uses the POI library for the conversion.

Requirements: the company's reports used to be published as Word files. They now need to be edited as rich text and embedded in the page (so that Baidu can index them). Previously published reports also need to be converted and saved. The problem is how to convert Word to HTML, with the images uploaded to cloud storage.
Problem 1: POI on its own can only convert doc files to HTML; it cannot do this for docx.
Problem 2: To work around problem 1, I saved the docx as a doc file. After conversion, the styles were written into a style tag, but I need inline styles. In other words, producing a complete HTML file works, but producing an HTML fragment is not practical. The output could be post-processed with jsoup, but that is too much effort.
Problem 3: Following on from problem 2, I found that the xdocreport library can convert POI's XWPF documents to HTML and can emit inline styles. However, it saves the images to local disk, and I need them uploaded to cloud storage.
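Because old reports may be a mix of doc and docx files, it can help to check which container a file actually uses before choosing a conversion path. The sketch below is not part of the original project: the class name `WordFormatSniffer` and its enum are hypothetical, and it relies only on the well-known file signatures (OLE2 for legacy Office files, ZIP for OOXML files). A ZIP signature only shows that the file is a ZIP container, not that it is definitely a docx.

```java
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: guesses the container format from the first bytes of a file. */
class WordFormatSniffer {

    enum Format { LEGACY_OLE2, ZIP_OOXML, UNKNOWN }

    // OLE2 compound file signature (legacy Office files such as .doc).
    private static final int[] OLE2 = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    // ZIP local file header signature (OOXML files such as .docx, but also any other zip).
    private static final int[] ZIP = {0x50, 0x4B, 0x03, 0x04};

    /** Classifies a header; pass at least the first 8 bytes of the file. */
    static Format detect(byte[] header) {
        if (startsWith(header, OLE2)) {
            return Format.LEGACY_OLE2;
        }
        if (startsWith(header, ZIP)) {
            return Format.ZIP_OOXML;
        }
        return Format.UNKNOWN;
    }

    /** Reads up to 8 bytes from the stream; the result is shorter if the stream ends early. */
    static byte[] readHeader(InputStream in) throws IOException {
        byte[] buffer = new byte[8];
        int read = 0;
        while (read < buffer.length) {
            int n = in.read(buffer, read, buffer.length - read);
            if (n < 0) {
                break;
            }
            read += n;
        }
        byte[] header = new byte[read];
        System.arraycopy(buffer, 0, header, 0, read);
        return header;
    }

    private static boolean startsWith(byte[] data, int[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if ((data[i] & 0xFF) != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would pass the result of `readHeader` on a freshly opened stream to `detect`, then reopen the file for the actual conversion.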

The solution to problem 3 follows.

XDocReport means XML Document reporting. It is a Java API for merging XML documents created with MS Office (docx), OpenOffice (odt), or LibreOffice (odt) with a Java model to generate reports and, if needed, convert them to another format (PDF, XHTML…).

The old and new versions are used differently, but the principle is the same. The XDocReport converter already depends on POI, so POI does not need to be added separately. If the project already declares a POI dependency, watch out for version conflicts.

2.1 Old version


<dependency>
    <groupId>fr.opensagres.xdocreport</groupId>
    <artifactId>org.apache.poi.xwpf.converter.xhtml</artifactId>
    <version>1.0.5</version>
</dependency>

2.2 New version

In the new version the artifact has been renamed and no longer uses the org.apache.poi prefix.


<dependency>
    <groupId>fr.opensagres.xdocreport</groupId>
    <artifactId>fr.opensagres.poi.xwpf.converter.xhtml</artifactId>
    <version>2.0.3</version>
</dependency>

The code below covers only the new version, 2.0.3, which is the recommended one.

3.1 Custom ImageManager

import fr.opensagres.poi.xwpf.converter.core.ImageManager;

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

public class CustomImageManager extends ImageManager {

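    // Maps each image path inside the docx to the URL it was uploaded to.
    private final Map<String, String> imageUrlCache = new HashMap<>();

    public CustomImageManager() {
        // No local base directory or image sub-directory: images are not written to disk.
        super(null, null);
    }

    // Returns the URL that the converter should use for the given image.
    @Override
    public String resolve(String uri) {
        return imageUrlCache.get(uri);
    }

    // Receives the bytes of each image found in the document.
    @Override
    public void extract(String imagePath, byte[] imageData) throws IOException {
        // Upload imageData to cloud storage here and use the URL that the upload returns.
        // The value below is only a placeholder.
        String serverImageUrl = "https://aliyun-xxx.fervwswv.image";

        imageUrlCache.put(imagePath, serverImageUrl);
    }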
}
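The class above leaves the actual upload as a placeholder. The sketch below shows one way the extract-then-resolve bookkeeping could be organised. It is an illustration only: `ImageUploader` and `UploadedImageRegistry` are hypothetical names that are not part of XDocReport or of the original project, and naming objects by the SHA-256 hash of their content is an assumption made here so that identical images map to the same key. The real upload call depends on the cloud storage SDK in use.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical upload hook; implement it with your cloud storage SDK. */
interface ImageUploader {
    /** Stores the bytes under the given object key and returns the public URL. */
    String upload(String objectKey, byte[] data);
}

/** Remembers which URL each image path of the document was uploaded to. */
class UploadedImageRegistry {

    private final Map<String, String> urlByPath = new HashMap<>();
    private final ImageUploader uploader;

    UploadedImageRegistry(ImageUploader uploader) {
        this.uploader = uploader;
    }

    /** Intended to be called from extract(imagePath, imageData). */
    void register(String imagePath, byte[] imageData) {
        // Content hash plus the original extension, e.g. "<64 hex chars>.png".
        String objectKey = sha256Hex(imageData) + extensionOf(imagePath);
        urlByPath.put(imagePath, uploader.upload(objectKey, imageData));
    }

    /** Intended to be called from resolve(uri); returns null for unknown paths. */
    String lookup(String imagePath) {
        return urlByPath.get(imagePath);
    }

    /** Returns the extension including the dot, or an empty string if there is none. */
    static String extensionOf(String path) {
        int dot = path.lastIndexOf('.');
        int slash = path.lastIndexOf('/');
        return dot > slash ? path.substring(dot) : "";
    }

    static String sha256Hex(byte[] data) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            StringBuilder hex = new StringBuilder();
            for (byte b : digest.digest(data)) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }
}
```

With something like this in place, CustomImageManager could hold a registry and delegate to `register` in extract and to `lookup` in resolve, which keeps the storage-specific code out of the converter callbacks.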

3.2 Convert to HTML

import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.junit.Test;

import java.io.FileInputStream;
import java.io.FileOutputStream;

public class CustomTestHtml {

    private static final String sourceFile = "C:\\Users\\QM\\Desktop\\a.docx";
    private static final String targetFile = "C:\\Users\\QM\\Desktop\\a.html";

    @Test
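    public void poiDocxNew() throws Exception {

        // Use the custom ImageManager so that images are not saved to local disk.
        XHTMLOptions xhtmlOptions = XHTMLOptions.create().setImageManager(new CustomImageManager());
        // Emit an HTML fragment rather than a complete HTML document.
        xhtmlOptions.setFragment(true);
        // Leave out page headers and footers.
        xhtmlOptions.setOmitHeaderFooterPages(true);
        // Skip styles that the document does not use.
        xhtmlOptions.setIgnoreStylesIfUnused(true);

        // try-with-resources closes the streams and the document even if the conversion fails.
        try (FileInputStream fileInputStream = new FileInputStream(sourceFile);
             XWPFDocument wordDocument = new XWPFDocument(fileInputStream);
             FileOutputStream fileOutputStream = new FileOutputStream(targetFile)) {
            XHTMLConverter.getInstance().convert(wordDocument, fileOutputStream, xhtmlOptions);
        }

    }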

}

Original: https://blog.csdn.net/weixin_45056780/article/details/125218226
Author: 世代农民
Title: java docx转html

