I need to convert Apache POI pictures from a Word document to an HTML file

凌朗

2023-03-14

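Question:

I have some code that uses the Java Apache POI library to open a Microsoft Word document and convert it to HTML. It also gets the byte array data of the images in the document. However, I need to get this image data into the HTML so that I can write everything out as an HTML file. Any hints or suggestions would be greatly appreciated. Please keep in mind when making suggestions that I am a desktop developer, not a web programmer. The code below gets the images.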

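 import java.io.File;
 import java.io.FileInputStream;
 import java.io.IOException;
 import java.util.ArrayList;
 import org.apache.poi.hwpf.HWPFDocument;
 import org.apache.poi.hwpf.model.PicturesTable;
 import org.apache.poi.hwpf.usermodel.Picture;
 import org.apache.poi.hwpf.usermodel.PictureType;

 // doc and picList are fields declared elsewhere in the class
 private void parseWordText(File file) throws IOException {
      // try-with-resources closes the stream even if parsing fails
      try (FileInputStream fs = new FileInputStream(file)) {
           doc = new HWPFDocument(fs);
           PicturesTable picTable = doc.getPicturesTable();
           if (picTable != null) {
                picList = new ArrayList<Picture>(picTable.getAllPictures());
                for (Picture pic : picList) {
                     byte[] byteArray = pic.getContent();           // raw image bytes
                     String extension = pic.suggestFileExtension();
                     String fileName = pic.suggestFullFileName();
                     PictureType type = pic.suggestPictureType();
                     int startOffset = pic.getStartOffset();
                }
           }
      }
 }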

Then the code below converts the document to HTML. Is there a way to add the byteArray to the ByteArrayOutputStream in the code below?

private void convertWordDoctoHTML(File file) throws ParserConfigurationException, TransformerConfigurationException, TransformerException, IOException {
    // Let an IOException propagate: swallowing it would leave wordDocument
    // null and fail later with a NullPointerException.
    HWPFDocumentCore wordDocument;
    try (FileInputStream in = new FileInputStream(file)) {
        wordDocument = WordToHtmlUtils.loadDoc(in);
    }

    WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
    wordToHtmlConverter.processDocument(wordDocument);
    org.w3c.dom.Document htmlDocument = wordToHtmlConverter.getDocument();

    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DOMSource domSource = new DOMSource(htmlDocument);
    StreamResult streamResult = new StreamResult(out);

    TransformerFactory tf = TransformerFactory.newInstance();
    Transformer serializer = tf.newTransformer();
    serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    serializer.setOutputProperty(OutputKeys.INDENT, "yes");
    serializer.setOutputProperty(OutputKeys.METHOD, "html");
    serializer.transform(domSource, streamResult);

    // Decode with the same charset the serializer was told to write
    String result = out.toString("UTF-8");
    // newDocText is not defined in this snippet; it is presumably set elsewhere in the class
    acDocTextArea.setText(newDocText);

    htmlText = result;

}

Answer:

Looking at the source code for org.apache.poi.hwpf.converter.WordToHtmlConverter at

http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/WordToHtmlConverter.java?view=markup&pathrev=1180740

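its JavaDoc states:

This implementation doesn't create images or links to them. This can be changed by overriding the {@link #processImage(Element, boolean, Picture)} method.

If you look at that processImage(...) method in AbstractWordConverter.java at line 790, it appears to call another method named processImageWithoutPicturesManager(...).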

http://svn.apache.org/viewvc/poi/trunk/src/scratchpad/src/org/apache/poi/hwpf/converter/AbstractWordConverter.java?view=markup&pathrev=1180740

This method is defined again in WordToHtmlConverter and looks suspiciously like exactly the place where you want to add your code (line 317):

@Override
protected void processImageWithoutPicturesManager(Element currentBlock,
    boolean inlined, Picture picture)
{
    // no default implementation -- skip
    currentBlock.appendChild(htmlDocumentFacade.document
    .createComment("Image link to '"
    + picture.suggestFullFileName() + "' can be here"));
}

I think this is the point where you can start inserting the images inline.

Create a subclass of the converter, for example

    public class InlineImageWordToHtmlConverter extends WordToHtmlConverter

Then override the method and put whatever code you need into it.
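
One possible body for that override, as a sketch and not the library's own approach: embed each picture in the page as a base64 data URI, so the HTML file needs no separate image files. The helper below uses only the JDK. The names InlineImageSketch, toDataUri and appendInlineImage are made up for illustration. Inside processImageWithoutPicturesManager you would presumably call appendInlineImage(currentBlock, picture.getContent(), picture.getMimeType(), picture.suggestFullFileName()); this assumes that getMimeType() is available on Picture in your POI version.

```java
import java.util.Base64;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of Apache POI.
class InlineImageSketch {

    /** Builds a data URI of the form "data:<mimeType>;base64,<encoded bytes>". */
    public static String toDataUri(byte[] content, String mimeType) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(content);
    }

    /** Appends an img element carrying the image as a data URI and returns it. */
    public static Element appendInlineImage(Element currentBlock, byte[] content,
                                            String mimeType, String altText) {
        Element img = currentBlock.getOwnerDocument().createElement("img");
        img.setAttribute("src", toDataUri(content, mimeType));
        img.setAttribute("alt", altText);
        currentBlock.appendChild(img);
        return img;
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for the converter's DOM and for picture.getContent()
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element body = doc.createElement("body");
        doc.appendChild(body);
        byte[] fakeImage = {1, 2, 3};

        Element img = appendInlineImage(body, fakeImage, "image/png", "demo.png");
        System.out.println(img.getAttribute("src")); // prints "data:image/png;base64,AQID"
    }
}
```

Browsers only render the formats they understand, so pictures stored as WMF or EMF will not display this way unless they are converted first.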

I have not tested this, but from what I can see in the source it should be the right approach.
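
Since the stated goal is an HTML file, the DOM returned by getDocument() can also be serialized straight to disk instead of going through a ByteArrayOutputStream and a String. The following is a minimal JDK-only sketch with the same transformer settings as in the question. HtmlFileWriterSketch and writeHtml are hypothetical names, and the small hand-built document in main merely stands in for the converter's output.

```java
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, not part of Apache POI.
class HtmlFileWriterSketch {

    /** Serializes the DOM as UTF-8 HTML into the given file. */
    public static void writeHtml(Document htmlDocument, Path target) throws Exception {
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        // try-with-resources flushes and closes the file when done
        try (OutputStream out = Files.newOutputStream(target)) {
            serializer.transform(new DOMSource(htmlDocument), new StreamResult(out));
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for wordToHtmlConverter.getDocument()
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element p = doc.createElement("p");
        p.setTextContent("Hello from the converter");
        body.appendChild(p);
        html.appendChild(body);
        doc.appendChild(html);

        Path target = Files.createTempFile("converted", ".html");
        writeHtml(doc, target);
        System.out.println(new String(Files.readAllBytes(target), StandardCharsets.UTF_8));
    }
}
```

Writing bytes directly also sidesteps the platform default charset, which is what new String(byte[]) would otherwise use when decoding the result.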
