Question:

Extracting content inside an embedded paragraph of a docx file with POI

秦俊友

2023-03-14

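I am using POI to extract content from docx files. While processing one file, all of its pictures were lost. I inspected the file and found an abnormal structure: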

<w:r>
<w:p xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing">
<w:r>
<w:drawing>
<wp:anchor distT="0" distB="0" distL="114300" distR="114300" simplePos="0" relativeHeight="251658240" behindDoc="0" locked="0" layoutInCell="1" allowOverlap="1">
<wp:simplePos x="0" y="0"/>
<wp:positionH relativeFrom="column">
<wp:align>center</wp:align>
</wp:positionH>
<wp:positionV relativeFrom="paragraph">
<wp:posOffset>2540</wp:posOffset>
</wp:positionV>
<wp:extent cx="5352176" cy="1837188"/>
<wp:wrapTopAndBottom/>
<wp:docPr id="9" name="media/GIUACAFYtDB.png"/>
<a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main">
<a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture">
<pic:nvPicPr>
<pic:cNvPr id="0" name="media/GIUACAFYtDB.png"/>
<pic:cNvPicPr/>
</pic:nvPicPr>
<pic:blipFill>
<a:blip r:embed="rId9"/>
<a:stretch>
<a:fillRect/>
</a:stretch>
</pic:blipFill>
<pic:spPr>
<a:xfrm>
<a:off x="0" y="0"/>
<a:ext cx="5352176" cy="1837188"/>
</a:xfrm>
<a:prstGeom prst="rect"/>
</pic:spPr>
</pic:pic>
</a:graphicData>
</a:graphic>
</wp:anchor>
</w:drawing>
</w:r>
</w:p>
</w:r>

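The paragraph element (w:p) sits inside a run element (w:r). I call this an embedded paragraph, but I cannot find a way to parse embedded paragraphs with POI. How can I handle this data?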
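To check whether a given document.xml has this shape, the following is a minimal sketch that uses only the JDK's DOM API rather than POI. It counts w:p elements whose direct parent is a w:r. The class and method names (NestedParagraphCheck, countNestedParagraphs) are hypothetical, and the XML is assumed to be already available as a string, for example read from the word/document.xml entry of the docx archive.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of POI.
public class NestedParagraphCheck {
    private static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Counts w:p elements whose direct parent is a w:r element. */
    public static int countNestedParagraphs(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for namespace-based lookups
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        int nested = 0;
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Node parent = paragraphs.item(i).getParentNode();
            if (parent != null
                    && W_NS.equals(parent.getNamespaceURI())
                    && "r".equals(parent.getLocalName())) {
                nested++;
            }
        }
        return nested;
    }

    public static void main(String[] args) throws Exception {
        // Reduced version of the structure shown in the question.
        String xml = "<w:p xmlns:w='" + W_NS + "'>"
                + "<w:r><w:p><w:r/></w:p></w:r>"
                + "</w:p>";
        System.out.println(countNestedParagraphs(xml)); // prints 1
    }
}
```

A non-zero count suggests the file contains paragraphs that the regular paragraph/run object model will not expose.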

1 answer

左丘阳晖

2023-03-14

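import java.util.ArrayList;
import java.util.List;

import org.apache.poi.ooxml.POIXMLDocumentPart; // org.apache.poi.POIXMLDocumentPart before POI 4.0
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFPicture;
import org.apache.poi.xwpf.usermodel.XWPFPictureData;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.xmlbeans.SimpleValue;
import org.apache.xmlbeans.XmlObject;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTR;

public static List<XWPFPictureData> extractPictureData(XWPFRun wrun) {
    // Normal case: POI already exposes the pictures of a well-formed run.
    List<XWPFPicture> pictures = wrun.getEmbeddedPictures();
    List<XWPFPictureData> result = new ArrayList<>();
    if (pictures != null && !pictures.isEmpty()) {
        for (XWPFPicture picture : pictures) {
            XWPFPictureData data = picture.getPictureData();
            if (data != null) {
                result.add(data);
            }
        }
        return result;
    }
    // A schema-valid run with no pictures really has none.
    CTR ctr = wrun.getCTR();
    if (ctr.validate()) {
        return result;
    }
    // This run does not obey the OpenXML schema (e.g. a w:p nested in a w:r),
    // so search its raw XML for drawings at any depth.
    XWPFDocument document = wrun.getDocument();
    String xpath = "declare namespace w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' " +
          ".//w:drawing";
    XmlObject[] drawings = ctr.selectPath(xpath);
    for (XmlObject drawing : drawings) {
        String blipPath = "declare namespace a='http://schemas.openxmlformats.org/drawingml/2006/main' " +
                ".//a:blip";
        XmlObject[] blips = drawing.selectPath(blipPath);
        if (blips.length == 0) {
            continue;
        }
        // Only the first blip of each drawing is used; read its r:embed attribute.
        XmlObject blip = blips[0];
        XmlObject blipId =
                blip.selectAttribute("http://schemas.openxmlformats.org/officeDocument/2006/relationships",
                        "embed");
        if (blipId == null) {
            continue;
        }
        // Resolve the relationship id to the picture part of the document.
        String id = ((SimpleValue) blipId).getStringValue();
        POIXMLDocumentPart relatedPart = document.getRelationById(id);
        if (relatedPart instanceof XWPFPictureData) {
            XWPFPictureData pictureData = (XWPFPictureData) relatedPart;
            result.add(pictureData);
        }
    }
    return result;
}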

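This does not solve every problem, but it solves mine for now. I tried to access the low-level XmlObject and construct an XWPFParagraph object for the embedded paragraph, but that failed, so I use only low-level XML processing code.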
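For readers who want to see what resolving a relationship id amounts to at the package level, here is a rough sketch using only the JDK's DOM API. It reads a relationships part (in a docx, typically word/_rels/document.xml.rels) and maps each Id to its Target. The class and method names (RelationshipLookup, readTargets) are hypothetical, and the sample Target value is an assumption made up for illustration; it is not taken from the file in the question.

```java
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of POI.
public class RelationshipLookup {
    private static final String REL_NS =
            "http://schemas.openxmlformats.org/package/2006/relationships";

    /** Maps each Relationship Id to its Target, as declared in a .rels part. */
    public static Map<String, String> readTargets(String relsXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(relsXml)));

        Map<String, String> targets = new HashMap<>();
        NodeList rels = doc.getElementsByTagNameNS(REL_NS, "Relationship");
        for (int i = 0; i < rels.getLength(); i++) {
            Element rel = (Element) rels.item(i);
            // Id and Target are plain (un-namespaced) attributes.
            targets.put(rel.getAttribute("Id"), rel.getAttribute("Target"));
        }
        return targets;
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample content; the real Target depends on the file.
        String relsXml = "<Relationships xmlns='" + REL_NS + "'>"
                + "<Relationship Id='rId9' "
                + "Type='http://schemas.openxmlformats.org/officeDocument/2006/relationships/image' "
                + "Target='media/GIUACAFYtDB.png'/>"
                + "</Relationships>";
        System.out.println(readTargets(relsXml).get("rId9")); // prints media/GIUACAFYtDB.png
    }
}
```

Inside POI the lookup by id is already handled by getRelationById, so a helper like this is only of interest when working on the raw archive.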

