Question:

How do I extract images from a table in an MS Word document using the docx library?

蔡默

2023-03-14

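I am developing a program that needs to extract two images from an MS Word document so they can be used in another document. I know where the images are (the first table in the document), but when I try to extract any information from the table, even plain text, I get empty cells.

This is the Word document I want to extract the images from. I want to extract the "Rentel" images from the first page (first table, rows 0 and 1, column 2).

I tried the following code: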

from docxtpl import DocxTemplate

source_document = DocxTemplate("Source document.docx")

# It doesn't really matter which rows or columns I use for the cells, everything is empty
print(source_document.tables[0].cell(0,0).text)

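This only gives me empty strings...

I have read this discussion and this article, which say the problem may be that the content is "contained in a wrapper element that python-docx can't read". They suggest changing the source document, but I want to be able to pick any document that was previously created from the same template as the source document (so those documents have the same problem, and I cannot change each one individually). A Python-only solution is therefore the only approach I can think of.

Since I only want these two specific images, extracting arbitrary images from the XML by unzipping the Word file does not really fit my case, unless I know which image name to pull out of the unzipped Word folder.

I would really like to get this working because it is part of my thesis (I am only a mechatronics engineer, so I don't know much about software).

[Edit]: This is the XML of the first image (source_document.tables[0].cell(0,2)._tc.xml), and this is the second image (source_document.tables[0].cell(1,2)._tc.xml). However, I noticed that using (0,2) as the row and column values returns all rows in column 2 of the first "visible" table, and cell(1,2) returns all rows in column 2 of the second "visible" table.

If this cannot be solved directly with python-docx, is it possible to search the XML for the image name, ID, or something similar, and then add the image with python-docx using that ID or name?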

2 answers

陈开宇

2023-03-14

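For anyone with the same problem, this is the code that solved it for me:

First, I extract the nested cell from the table with the following method: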

@staticmethod
def get_nested_cell(table, outer_row, outer_column, inner_row, inner_column):
    """
        Returns the nested cell (table inside a table) of the *document*

        :argument
            table: [docx.Table] outer table from which to get the nested table
            outer_row: [int] row of the outer table in which the nested table is
            outer_column: [int] column of the outer table in which the nested table is
            inner_row: [int] row in the nested table from which to get the nested cell
            inner_column: [int] column in the nested table from which to get the nested cell
        :return
            inner_cell: [docx.Cell] nested cell
    """
    # Get the global first cell
    outer_cell = table.cell(outer_row, outer_column)
    nested_table = outer_cell.tables[0]
    inner_cell = nested_table.cell(inner_row, inner_column)

    return inner_cell

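With this cell, I can get the XML and extract the image from it. Note:

I did not set the width and height of the image, because I want it to keep the same size.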

def replace_logos_from_source(self, source_document, target_document, inner_row, inner_column):
    """
        Replace the employer and client logo from the *source_document* to the *target_document*. Since the table
        in which the logos are placed are nested tables, the source and target cells with *inner_row* and
        *inner_column* are first extracted from the nested table.

        :argument
            source_document: [DocxTemplate] document from which to extract the image
            target_document: [DocxTemplate] document to which to add the extracted image
            inner_row: [int] row in the nested table from which to get the image
            inner_column: [int] column in the nested table from which to get the image
        :return
            Nothing
    """
    # Get the target and source cell (I know that the table where I want to get the logos from is 'tables[0]' and that the nested table is in outer_row and outer_column '0', so I just filled it in without adding extra arguments to the method)
    target_cell = self.get_nested_cell(target_document.tables[0], 0, 0, inner_row, inner_column)
    source_cell = self.get_nested_cell(source_document.tables[0], 0, 0, inner_row, inner_column)

    # Get the xml code of the inner cell
    inner_cell_xml = source_cell._tc.xml

    # Get the image from the xml code
    image_stream = self.get_image_from_xml(source_document, inner_cell_xml)

    # Add the image to the target cell
    paragraph = target_cell.paragraphs[0]
    if image_stream:  # If not None (image exists)
        run = paragraph.add_run()
        run.add_picture(image_stream)
    else:
        # Set the target cell text equal to the source cell text
        paragraph.add_run(source_cell.text)

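# Needs these module-level imports:
#   from io import BytesIO
#   from xml.dom import minidom
@staticmethod
def get_image_from_xml(source_document, xml_code):
    """
        Returns the first image referenced in the *xml_code* as a stream

        :argument
            source_document: [DocxTemplate] document that owns the image
            xml_code: [string] xml code from which to extract the image
        :return
            image_stream: [BytesIO stream] the image to find
            None if no image exists in the xml_code

    """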
    # Parse the xml code for the blip
    xml_parser = minidom.parseString(xml_code)

    items = xml_parser.getElementsByTagName('a:blip')

    # Check if an image exists
    if items:
        # Extract the rId of the image
        rId = items[0].attributes['r:embed'].value

        # Get the blob of the image
        source_document_part = source_document.part
        image_part = source_document_part.related_parts[rId]
        image_bytes = image_part.blob

        # Wrap the image bytes in a BytesIO stream so they can be passed to add_picture()
        image_stream = BytesIO(image_bytes)

        return image_stream
    # If no image exists
    else:
        return None
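
The method above takes only the first a:blip element and matches it by its prefixed tag name. If a cell might hold more than one picture, a namespace-aware variant could collect every relationship id. This is only a sketch: `find_embedded_image_ids` is a hypothetical helper name, and the sample XML is made up and heavily trimmed for illustration.

```python
from xml.dom import minidom

A_NS = "http://schemas.openxmlformats.org/drawingml/2006/main"
R_NS = "http://schemas.openxmlformats.org/officeDocument/2006/relationships"

# Made-up, trimmed cell XML with two embedded pictures (illustration only).
SAMPLE_CELL_XML = (
    '<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" '
    'xmlns:a="' + A_NS + '" xmlns:r="' + R_NS + '">'
    '<w:p><w:r><w:drawing><a:blip r:embed="rId7"/></w:drawing></w:r></w:p>'
    '<w:p><w:r><w:drawing><a:blip r:embed="rId9"/></w:drawing></w:r></w:p>'
    '</w:tc>'
)


def find_embedded_image_ids(xml_code):
    """Return the r:embed ids of all a:blip elements, in document order."""
    dom = minidom.parseString(xml_code)
    # Match by namespace URI so the result does not depend on the 'a:' prefix.
    blips = dom.getElementsByTagNameNS(A_NS, "blip")
    ids = [blip.getAttributeNS(R_NS, "embed") for blip in blips]
    # getAttributeNS returns "" when the attribute is missing; drop those.
    return [rid for rid in ids if rid]


print(find_embedded_image_ids(SAMPLE_CELL_XML))
```

Each id returned this way could then be looked up in `related_parts`, as the method above does for the first one.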

To call the method, I used:

# Replace the employer and client logos
self.replace_logos_from_source(self.source_document, self.template_doc, 0, 2)  # Employer logo
self.replace_logos_from_source(self.source_document, self.template_doc, 1, 2)  # Client logo
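
On the alternative raised in the question (picking files out of the unzipped package), an rId found in the cell XML can be tied to a concrete file name through the package's relationships part. Below is a rough stdlib-only sketch. It assumes the main document part sits at word/document.xml, which is the usual layout but is not guaranteed, and `image_targets_by_rid` is a hypothetical helper name.

```python
import posixpath
import zipfile
from xml.etree import ElementTree

REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"


def image_targets_by_rid(docx_file):
    """Map image relationship ids of the main document part to package paths."""
    with zipfile.ZipFile(docx_file) as package:
        # Assumption: relationships of word/document.xml live in this member.
        rels_xml = package.read("word/_rels/document.xml.rels")

    targets = {}
    root = ElementTree.fromstring(rels_xml)
    for rel in root.findall(f"{{{REL_NS}}}Relationship"):
        is_image = rel.get("Type", "").endswith("/image")
        is_internal = rel.get("TargetMode") != "External"
        if is_image and is_internal:
            # Targets are normally relative to the 'word' folder.
            path = posixpath.normpath(posixpath.join("word", rel.get("Target")))
            targets[rel.get("Id")] = path.lstrip("/")
    return targets
```

With that mapping, an id such as the one read from `r:embed` would point at a member like word/media/image1.png, which could be read with `zipfile.ZipFile.read`.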

丁星火

2023-03-14

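The first thing that jumps out is that both of the cells you posted (w:tc elements) contain a nested table. That is perhaps unusual, but it is certainly a valid composition. Maybe it was done so that a caption or something similar could go in a cell below the picture.

To access the nested table, you have to do something like this: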

outer_cell = source_document.tables[0].cell(0,2)
nested_table = outer_cell.tables[0]
inner_cell_1 = nested_table.cell(0, 0)
print(inner_cell_1.text)
# ---etc....---

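I am not sure this solves your whole problem, but it strikes me that this is ultimately two or more questions. The first is "Why aren't my table cells showing up?" and the second is probably "How do I get an image out of a table cell?" (once you have actually found the cell in question).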
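
On the first of those questions: as far as I understand it, the text of a cell is built only from paragraphs that are direct children of that cell, so content sitting inside a nested table does not show up in the outer cell's text. The stdlib-only sketch below imitates that behavior on a made-up, simplified cell; `direct_text` and `nested_text` are hypothetical helper names, not python-docx functions.

```python
from xml.etree import ElementTree

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Made-up outer cell: a nested table holding the text, plus one empty paragraph.
OUTER_CELL_XML = (
    '<w:tc xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:tbl><w:tr><w:tc><w:p><w:r><w:t>Rentel</w:t></w:r></w:p></w:tc></w:tr></w:tbl>'
    '<w:p/>'
    '</w:tc>'
)


def paragraph_text(paragraph):
    """Concatenate all w:t runs inside one paragraph."""
    return "".join(t.text or "" for t in paragraph.iter(W + "t"))


def direct_text(cell):
    """Text of the paragraphs that are direct children of the cell."""
    return "\n".join(paragraph_text(p) for p in cell.findall(W + "p"))


def nested_text(cell):
    """Text of each cell of the tables nested directly inside the cell."""
    return [direct_text(tc) for tc in cell.findall(f"{W}tbl/{W}tr/{W}tc")]


outer = ElementTree.fromstring(OUTER_CELL_XML)
print(repr(direct_text(outer)))  # the outer cell itself has no text
print(nested_text(outer))        # the text lives one table deeper
```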
