Question:

Apache POI Word to HTML conversion: word boundaries

司信厚

2023-03-14

I am using the following code to convert a Word document to an HTML file:

    // XHTMLConverter, XHTMLOptions, FileURIResolver and FileImageExtractor
    // are provided by the XDocReport converter for POI, not by POI itself.
    public Map convert(String wordDocPath, String htmlPath,
            Map conversionParams)
    {
        log.info("Converting word file "+wordDocPath)
        try
        {
            // The backslash must be escaped, otherwise "\t" is read as a tab
            String workingFolder = "C:\\temp"
            File workingFolderFile = new File(workingFolder)

            FileInputStream fis = new FileInputStream(wordDocPath);
            XWPFDocument document = new XWPFDocument(fis);
            XHTMLOptions options = XHTMLOptions.create().URIResolver(new FileURIResolver(workingFolderFile));
            options.setExtractor(new FileImageExtractor(workingFolderFile))
            File htmlFile = new File(htmlPath);
            OutputStream out = new FileOutputStream(htmlFile)
            XHTMLConverter.getInstance().convert(document, out, options);

            log.info("Converted to HTML file "+htmlPath)

        }
        catch(Exception e)
        {
            log.error("Exception :"+e.getMessage(),e)
        }
    }

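The code generates the HTML output correctly.

I need to put some parameters in the document, such as [[AGENT_NAME]], which I will later replace in code using a regular expression. However, Apache POI does not treat this pattern as a single word and sometimes splits it into "[[" and "AGENT_NAME".

How does Apache POI decide word boundaries? Is there a way to control it?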
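To make the replacement step concrete, here is a minimal sketch of the kind of regex substitution described above, run over the generated HTML as a string. The class name `PlaceholderReplacer`, the `replace` helper and the sample values are hypothetical and only for illustration; the placeholder syntax `[[NAME]]` is taken from the question.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Apache POI.
final class PlaceholderReplacer {
    // Matches [[NAME]] where NAME consists of letters, digits and underscores.
    private static final Pattern PLACEHOLDER = Pattern.compile("\\[\\[(\\w+)\\]\\]");

    // Replaces every placeholder that has a value; unknown ones are left untouched.
    static String replace(String html, Map<String, String> values) {
        Matcher m = PLACEHOLDER.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String value = values.getOrDefault(m.group(1), m.group());
            // quoteReplacement keeps '$' and '\' in the value literal
            m.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<p>Dear [[AGENT_NAME]], your id is [[AGENT_ID]].</p>";
        System.out.println(replace(html, Map.of("AGENT_NAME", "Alice")));
    }
}
```

If the placeholder text is stored as several runs in the .docx and each run ends up in its own HTML element (an assumption about the converter output), the pattern no longer matches: a string such as `<span>[[</span><span>AGENT_NAME]]</span>` passes through `replace` unchanged. That is the failure being asked about.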

1 answer

吕承望

2023-03-14

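After all my efforts, I finally decided to write code that parses the Word document and merges the split runs. Here is the code; I hope it helps someone else.

Note: the pattern I use is ${pattern}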

    // Merges runs so that every ${...} placeholder ends up inside a single run.
    // The merged text keeps the formatting of the run that contains the '$'.
    void mergeSplittedPatterns(XWPFDocument document)
    {
        List<XWPFParagraph> paragraphs = document.paragraphs

        for(XWPFParagraph paragraph : paragraphs)
        {
            List<XWPFRun> runs = paragraph.getRuns()

            int firstCharRun = 0      // index of the run containing '$'
            int closingCharRun = 0    // index of the run containing '}'
            boolean firstCharFound = false              // '$' seen
            boolean secondCharFoundImmediately = false  // '{' seen right after '$'
            boolean closingCharFound = false            // '}' seen
            boolean gotoNextRun = true

            boolean scan = (runs!=null && runs.size()>0)
            int index = 0

            while(scan)
            {
                gotoNextRun = true
                XWPFRun run = runs.get(index)
                String runText = run.getText(0)
                if(runText!=null)
                {
                    for (int i = 0; i < runText.length(); i++)
                    {
                        char character = runText.charAt(i)

                        if(secondCharFoundImmediately)
                        {
                            closingCharFound = (character=="}")
                            if(closingCharFound)
                            {
                                closingCharRun = index

                                if(firstCharRun==closingCharRun)
                                {
                                    // Placeholder already sits in one run: nothing to merge
                                    firstCharFound = secondCharFoundImmediately = closingCharFound = false
                                    continue
                                }
                                else
                                {
                                    // Concatenate the text of all runs the placeholder spans
                                    String mergedText = ""
                                    for(int j=firstCharRun;j<=closingCharRun;j++)
                                    {
                                        // Runs without text return null; treat them as empty
                                        mergedText += (runs.get(j).getText(0) ?: "")
                                    }
                                    runs.get(firstCharRun).setText(mergedText,0)

                                    // Remove the now redundant runs, last one first
                                    for(int j=closingCharRun;j>firstCharRun;j--)
                                    {
                                        paragraph.removeRun(j)
                                    }
                                    firstCharFound = secondCharFoundImmediately = closingCharFound = gotoNextRun = false
                                    // Rescan the merged run; it may hold further placeholders
                                    index = firstCharRun
                                    break
                                }
                            }
                        }
                        else if(firstCharFound)
                        {
                            secondCharFoundImmediately = (character=="{")
                            if(!secondCharFoundImmediately)
                            {
                                firstCharFound = secondCharFoundImmediately = closingCharFound = false
                            }
                        }
                        else if(character=="\$")
                        {
                            firstCharFound = true
                            firstCharRun = index
                        }
                    }
                }

                if(gotoNextRun)
                {
                    index++
                }

                if(index>=runs.size())
                {
                    scan = false
                }
            }
        }
    }
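The merging idea can also be tried out without a Word document. The following is a simplified sketch, not the code above: plain strings stand in for the text of each run (what `XWPFRun.getText(0)` would return), and `RunMerger`, `shouldMerge` and `merge` are hypothetical names. One deliberate difference is that an unterminated `${` absorbs all remaining runs of the paragraph, whereas the code above leaves them alone.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; strings simulate the text of XWPFRun objects.
final class RunMerger {
    // True if 'current' ends inside an unfinished ${...} placeholder,
    // or ends with '$' while 'next' continues with '{'.
    static boolean shouldMerge(String current, String next) {
        int open = current.lastIndexOf("${");
        if (open >= 0 && current.indexOf('}', open) < 0) {
            return true;
        }
        return current.endsWith("$") && next.startsWith("{");
    }

    // Returns the run texts with every split ${...} placeholder joined into one entry.
    static List<String> merge(List<String> runTexts) {
        List<String> merged = new ArrayList<>();
        StringBuilder current = null;
        for (String raw : runTexts) {
            String text = (raw == null) ? "" : raw;  // runs without text
            if (current != null && shouldMerge(current.toString(), text)) {
                current.append(text);                // still inside a placeholder
            } else {
                if (current != null) {
                    merged.add(current.toString());
                }
                current = new StringBuilder(text);
            }
        }
        if (current != null) {
            merged.add(current.toString());
        }
        return merged;
    }

    public static void main(String[] args) {
        // A placeholder split over three runs, as Word sometimes stores it
        System.out.println(merge(List.of("Hello ", "$", "{na", "me}", "!")));
    }
}
```

With the sample input the three middle entries are joined into a single `${name}` entry, while "Hello " and "!" stay separate. In a real document the merged text would be written back with `setText(mergedText, 0)` and the surplus runs removed with `removeRun`, as the code above does.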
