Question:

Converting part of a docx to HTML using docx4j

吕皓
2023-03-14

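I have an application that pulls some data from a database and saves it to a docx file. Part of that data is HTML code, so I used docx4j to convert the HTML into docx format. Here is a related post.

Now I want to use docx4j to convert that piece of text (which sits in a table cell of the docx file) back into HTML and save the HTML code to the database.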

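import java.io.FileOutputStream;
import java.io.OutputStream;
import java.util.ArrayList;
import java.util.List;

// Assumption: older docx4j releases use javax.xml.bind; newer ones use jakarta.xml.bind
import javax.xml.bind.JAXBElement;

import org.docx4j.Docx4J;
import org.docx4j.convert.out.HTMLSettings;
import org.docx4j.jaxb.Context;
import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.AltChunkType;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.wml.ContentAccessor;
import org.docx4j.wml.ObjectFactory;
import org.docx4j.wml.P;
import org.docx4j.wml.Tbl;
import org.docx4j.wml.Tc;
import org.docx4j.wml.Tr;

public class AltChunkAddOfTypeHtml {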

   private static ObjectFactory factory;
   private final static String inputfilepath = System.getProperty("user.dir")
         + "/test.docx";

   public static void main(String[] args) throws Exception {

      WordprocessingMLPackage wordMLPackage = WordprocessingMLPackage
            .createPackage();
      MainDocumentPart mdp = wordMLPackage.getMainDocumentPart();
      factory = Context.getWmlObjectFactory();
      Tbl table = factory.createTbl();
      Tr tableRow = factory.createTr();

      Tc tableCell = factory.createTc();

      wordMLPackage.getMainDocumentPart().addObject(table);

      String xhtml = "<html><head><title>Import me</title></head><body><p>Hello World!This is the html code converted into docx!!!</p><b>tested by david</b></body></html>";

      mdp.addAltChunk(AltChunkType.Xhtml, xhtml.getBytes(), tableCell);

      tableRow.getContent().add(tableCell);
      table.getContent().add(tableRow);
      // Round trip
      wordMLPackage = mdp.convertAltChunks();

      wordMLPackage.save(new java.io.File(inputfilepath));

      List<Object> tableCells = getAllElementFromObject(
            wordMLPackage.getMainDocumentPart(), Tc.class);
      System.out.println(tableCells.size());

      /* only one tc in wordMLPackage */
      List<Object> paragraphsInTc = getAllElementFromObject(
            tableCells.get(0), P.class);
      System.out.println(paragraphsInTc.size());
      System.out.println("Ready to create html.");

      WordprocessingMLPackage wordMLPackage2 = WordprocessingMLPackage
            .createPackage();
      for (Object o : paragraphsInTc) {

         wordMLPackage2.getMainDocumentPart().addObject(o);
      }

      HTMLSettings htmlSettings = Docx4J.createHTMLSettings();

      htmlSettings.setWmlPackage(wordMLPackage2);

      System.out.println("Creating html.");
      // try-with-resources closes the stream even if the export fails
      try (OutputStream os = new FileOutputStream(new java.io.File(
            System.getProperty("user.dir") + "/sample.html"))) {
         Docx4J.toHTML(htmlSettings, os, Docx4J.FLAG_EXPORT_PREFER_XSL);
      }

   }

   private static List<Object> getAllElementFromObject(Object obj,
         Class<?> toSearch) {
      List<Object> result = new ArrayList<Object>();
      if (obj instanceof JAXBElement)
         obj = ((JAXBElement<?>) obj).getValue();

      if (obj.getClass().equals(toSearch))
         result.add(obj);
      else if (obj instanceof ContentAccessor) {
         List<?> children = ((ContentAccessor) obj).getContent();
         for (Object child : children) {
            result.addAll(getAllElementFromObject(child, toSearch));
         }

      }
      return result;
   }
}

The code above generates the following HTML:

<?xml version="1.0" encoding="utf-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships"><head><meta content="text/html; charset=utf-8" http-equiv="Content-Type" /><style><!--/*paged media */ div.header {display: none }div.footer {display: none } /*@media print { */@page { size: A4; margin: 10%; @top-center {content: element(header) } @bottom-center {content: element(footer) } }/*element styles*/ .del  {text-decoration:line-through;color:red;} .ins {text-decoration:none;background:#c0ffc0;padding:1px;}
/* TABLE STYLES */

/* PARAGRAPH STYLES */
.DocDefaults {display:block;margin-bottom: 4mm;line-height: 115%;font-size: 11.0pt;}
.Normal {display:block;}

/* CHARACTER STYLES */ span.DefaultParagraphFont {display:inline;}
--></style><script type="text/javascript"><!--function toggleDiv(divid){if(document.getElementById(divid).style.display == 'none'){document.getElementById(divid).style.display = 'block';}else{document.getElementById(divid).style.display = 'none';}}
--></script></head><body>

  <!-- userBodyTop goes here -->

  <div style="color:red">TO HIDE THESE MESSAGES, TURN OFF debug level logging for org.docx4j.convert.out.common.writer.AbstractMessageWriter </div>

  <div class="document">

  <p class="Normal DocDefaults " style="text-align: left;position: relative; margin-left: 0in;margin-bottom: 0in;"><span class="DefaultParagraphFont " style="font-weight: normal;color: #000000;font-style: normal;font-size: 11.0pt;;font-family: Calibri;">Hello World!This is the html code converted into docx!!!</span></p>

  <p class="Normal DocDefaults " style="text-align: left;position: relative; margin-left: 0in;margin-bottom: 0in;"><span class="DefaultParagraphFont " style="font-weight: bold;color: #000000;font-style: normal;font-size: 11.0pt;;font-family: Calibri;">tested by david</span></p></div>

  <!-- userBodyTail goes here -->

  </body></html>

What I would like to get back is the original HTML:

<html><head><title>Import me</title></head><body><p>Hello World!This is the html code converted into docx!!!</p><b>tested by david</b></body></html>

Or maybe there is a better solution for converting docx to HTML? I hope I have made myself clear. Any hints are appreciated. Thanks in advance.

1 answer

魏威
2023-03-14

Solved by reading the paragraphs and runs from the Word document and then adding the HTML tags myself.

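    /**
     * Convert the description in a table cell back into HTML code to be saved to the database.
     * 
     * @param tc the table cell that holds the description
     * @return the HTML string, or null if the cell contains no paragraphs
     */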
    private String convertTcToHtml(Tc tc) {
        StringBuilder sb = new StringBuilder();
        sb.append("<html><body>");

        List<Object> paragraphs = getAllElementFromObject(tc, P.class);
        if (paragraphs == null || paragraphs.size() == 0) {
            return null;
        }

        /* A description exported from alm has only one paragraph in Word. */
        List<Object> runs = getAllElementFromObject(paragraphs.get(0), R.class);
        addRunsToHtmlStringBuffer(sb, runs);

        /* If the user modifies the description in Word, it may end up with more paragraphs. */
        if (paragraphs.size() > 1) {
            sb.append("<br />");
            for (int i = 1; i < paragraphs.size(); i++) {
                List<Object> moreRuns = getAllElementFromObject(paragraphs.get(i), R.class);
                addRunsToHtmlStringBuffer(sb, moreRuns);
                /* Every paragraph should be starting from a new line */
                sb.append("<br />");
            }
        }

        sb.append("</body></html>");
        return sb.toString();
    }

    /**
     * Append the texts of a list of runs to the HTML string builder.
     * 
     * @param sb   the builder that collects the HTML output
     * @param runs the runs (R objects) of one paragraph
     */
    private void addRunsToHtmlStringBuffer(StringBuilder sb, List<Object> runs) {
        if (runs != null && runs.size() > 0) {
            for (Object r : runs) {
                R run = (R) r;

                List<Object> brs = getAllElementFromObject(run, Br.class);
                if (brs != null && brs.size() > 0) {
                    LOGGER.info("BR:");
                    sb.append("<br/>");
                }

                /* One run usually has one text */
                List<Object> texts = getAllElementFromObject(run, Text.class);
                if (texts != null && texts.size() > 0) {
                    StringBuilder text_sb = new StringBuilder();
                    for (Object t : texts) {
                        Text text = (Text) t;
                        text_sb.append(text.getValue());
                    }

                    String htmlText = replaceWithHtmlCharacters(text_sb.toString());

                    if (run.getRPr() != null && run.getRPr().getB() != null && (run.getRPr().getB().isVal())) {
                        LOGGER.info("Bold Text:");
                        sb.append("<b>");
                        sb.append(htmlText);
                        sb.append("</b>");
                    } else {
                        LOGGER.info("Normal Text:");
                        sb.append(htmlText);
                    }
                }
            }
        }
    }

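    /**
     * Replace &, ", < and > with their HTML character entities.
     * 
     * @param text plain text taken from the Word runs
     * @return the text with HTML special characters escaped
     */
    private String replaceWithHtmlCharacters(String text) {
        // Escape & first so that the entities added below are not escaped a second time
        text = text.replace("&", "&amp;");
        text = text.replace("\"", "&quot;");
        text = text.replace("<", "&lt;");
        text = text.replace(">", "&gt;");

        return text;
    }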
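
To try the idea without a docx file at hand, here is a minimal, self-contained sketch of the same run-to-HTML logic. It does not use docx4j: `SimpleRun` is a hypothetical stand-in for a Word run (its text, a bold flag, and whether a line break precedes it), and `RunsToHtmlSketch`, `escapeHtml` and `toHtml` are names made up for illustration. Unlike the code above, it wraps each paragraph in `<p>` instead of separating paragraphs with `<br />`; that is a deliberate swap to get closer to the original markup, not something the answer does.

```java
import java.util.Arrays;
import java.util.List;

public class RunsToHtmlSketch {

    /** Hypothetical stand-in for a Word run; not a docx4j class. */
    static final class SimpleRun {
        final String text;
        final boolean bold;
        final boolean lineBreakBefore;

        SimpleRun(String text, boolean bold, boolean lineBreakBefore) {
            this.text = text;
            this.bold = bold;
            this.lineBreakBefore = lineBreakBefore;
        }
    }

    /** Escape HTML special characters; & goes first so later entities are not re-escaped. */
    static String escapeHtml(String text) {
        return text.replace("&", "&amp;")
                   .replace("\"", "&quot;")
                   .replace("<", "&lt;")
                   .replace(">", "&gt;");
    }

    /** Turn a list of paragraphs (each a list of runs) into a small HTML document. */
    static String toHtml(List<List<SimpleRun>> paragraphs) {
        StringBuilder sb = new StringBuilder("<html><body>");
        for (List<SimpleRun> paragraph : paragraphs) {
            sb.append("<p>");
            for (SimpleRun run : paragraph) {
                if (run.lineBreakBefore) {
                    sb.append("<br />");
                }
                String text = escapeHtml(run.text);
                // Bold runs are wrapped in <b>, everything else is emitted as plain text
                sb.append(run.bold ? "<b>" + text + "</b>" : text);
            }
            sb.append("</p>");
        }
        return sb.append("</body></html>").toString();
    }

    public static void main(String[] args) {
        List<List<SimpleRun>> paragraphs = Arrays.asList(
                Arrays.asList(new SimpleRun("Hello World! 1 < 2 & 3", false, false)),
                Arrays.asList(new SimpleRun("tested by david", true, false)));
        System.out.println(toHtml(paragraphs));
    }
}
```

In a real conversion the `SimpleRun` values would be filled from the docx4j `R` objects in the same way `addRunsToHtmlStringBuffer` reads them (text from `Text`, the bold flag from `getRPr().getB()`, the break from `Br`).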
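 Related material:
  • I have been trying to use their library to convert HTML content to docx. A docx file does get created after I run my app, but it is blank, even though the HTML does have content. Please check the code below; I have included all the necessary libraries from the AndroidDocxtoHTML sample on git. Code: I don't understand what is missing in my code that leaves me with a blank document. I found this code for Java and adapted it for Android. Some people suggest using the nightly build jar for xhtml …

  • I ran into a new problem while converting HTML to docx; it throws an exception: org.xml.sax.SAXParseException; lineNumber: 4; columnNumber: 73; The entity "nbsp" was referenced, but not declared. As I understand it, this happens because docx4j treats my file as XML and wants to convert that to docx, but XML has only 5 predefined entities, and entities such as nbsp are not defined in XML. How can I get docx4j to convert HTML to docx without declaring the nbsp entity in the doctype?

  • I am trying to convert a DOCX file to PDF with Docx4J and I get two different exceptions for two different documents. 1) For document 1, a NullPointerException occurs at org.docx4j.utils.SingleTraversalUtilVisitorCallback.apply(SingleTraversalUtilVisitorCallback.java:27). Is the code included below …

  • Sorry, I can't post anything I have tried, because I haven't tried anything on this task yet, although I have used […] to convert the […] obtained from […] into […] so that it can be output in the app's […]. Please enlighten me, I am lost in stress and confusion...!

  • My goal is to take an existing .docx file and convert it to PDF from the Linux command line using docx4j (http://www.docx4java.org). The Getting Started guide (http://www.docx4java.org/svn/docx4j/trunk/docx4j/docs/Docx4j_GettingStarted.html) refers to the latest (2.8.1) package …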