问题：

从DOCX中提取表

那开济

2023-03-14

使用OpenXML（C#）解析*. docx文档有一个问题。

下面是我的步骤：
1。加载*。docx文档
2。接收段落列表
3。在每个段落中查找文本、图像和表格元素
4。为每个文本和图像元素创建html标记
5。将输出另存为*。html文件

我已经了解了如何在文档中定位图像文件并将其解压缩。现在有一个步骤要做——找到表格在文本（段落）中的位置。

如果有人知道如何在*中定位表。docx文档使用OpenXML，请帮助。谢谢

附加：好吧，也许我不清楚，解释一下我的意思。若我们得到段落的内容，你们可以找到文本块、图片等chield对象。所以，若段落中包含了包含图片的运行，那个就意味着在Word文档的这个位置放置了图片。

我的函数示例：

public static string ParseDocxDocument(string pathToFile)
    {
        StringBuilder result = new StringBuilder();
        WordprocessingDocument wordProcessingDoc = WordprocessingDocument.Open(pathToFile, true);
        List<ImagePart> imgPart = wordProcessingDoc.MainDocumentPart.ImageParts.ToList();
        IEnumerable<Paragraph> paragraphElement = wordProcessingDoc.MainDocumentPart.Document.Descendants<Paragraph>();
        int imgCounter = 0;


        foreach (Paragraph par in paragraphElement)
        {

                //Add new paragraph tag
                result.Append("<div style=\"width:100%; text-align:");

                //Append anchor style
                if (par.ParagraphProperties != null && par.ParagraphProperties.Justification != null)
                    switch (par.ParagraphProperties.Justification.Val.Value)
                    {
                        case JustificationValues.Left:
                            result.Append("left;");
                            break;
                        case JustificationValues.Center:
                            result.Append("center;");
                            break;
                        case JustificationValues.Both:
                            result.Append("justify;");
                            break;
                        case JustificationValues.Right:
                        default:
                            result.Append("right;");
                            break;
                    }
                else
                    result.Append("left;");

                //Append text decoration style
                if (par.ParagraphProperties != null && par.ParagraphProperties.ParagraphMarkRunProperties != null && par.ParagraphProperties.ParagraphMarkRunProperties.HasChildren)
                    foreach (OpenXmlElement chield in par.ParagraphProperties.ParagraphMarkRunProperties.ChildElements)
                    {
                        switch (chield.GetType().Name)
                        {
                            case "Bold":
                                result.Append("font-weight:bold;");
                                break;
                            case "Underline":
                                result.Append("text-decoration:underline;");
                                break;
                            case "Italic":
                                result.Append("font-style:italic;");
                                break;
                            case "FontSize":
                                result.Append("font-size:" + ((FontSize)chield).Val.Value + "px;");
                                break;
                            default: break;
                        }
                    }

                result.Append("\">");

                //Add image tag
                IEnumerable<Run> runs = par.Descendants<Run>();
                foreach (Run run in runs)
                {
                    if (run.HasChildren)
                    {
                        foreach (OpenXmlElement chield in run.ChildElements.Where(o => o.GetType().Name == "Picture"))
                        {
                            result.Append(string.Format("<img style=\"{1}\" src=\"data:image/jpeg;base64,{0}\" />", GetBase64Image(imgPart[imgCounter].GetStream()),
                                           ((DocumentFormat.OpenXml.Vml.Shape)chield.ChildElements.Where(o => o.GetType().Name == "Shape").FirstOrDefault()).Style
                                ));
                            imgCounter++;
                        }
                    }
                }

                //Append inner text
                IEnumerable<Text> textElement = par.Descendants<Text>();
                if (par.Descendants<Text>().Count() == 0)
                    result.Append("<br />");

                foreach (Text t in textElement)
                {
                    result.Append(t.Text);
                }


                result.Append("</div>");
                result.Append(Environment.NewLine);

        }

        wordProcessingDoc.Close();

        return result.ToString();
    }

现在我不需要在文本中指定表格位置（如Word中所示）。

最终：

好了，各位，我知道了。在我的示例函数中，有一个很大的错误。我列举了文档正文的段落元素。表格和段落处于同一级别，所以函数忽略表格。所以我们需要枚举文档体的元素。

这里是我的测试函数，用于从docx生成正确的HTML（它只是测试代码，所以不干净）

public static string ParseDocxDocument(string pathToFile)
    {
        StringBuilder result = new StringBuilder();
        WordprocessingDocument wordProcessingDoc = WordprocessingDocument.Open(pathToFile, true);
        List<ImagePart> imgPart = wordProcessingDoc.MainDocumentPart.ImageParts.ToList();
        List<string> tableCellContent = new List<string>();
        IEnumerable<Paragraph> paragraphElement = wordProcessingDoc.MainDocumentPart.Document.Descendants<Paragraph>();
        int imgCounter = 0;

        foreach (OpenXmlElement section in wordProcessingDoc.MainDocumentPart.Document.Body.Elements<OpenXmlElement>())
        {
            if(section.GetType().Name == "Paragraph")
            {
              Paragraph par = (Paragraph)section;
                //Add new paragraph tag
                result.Append("<div style=\"width:100%; text-align:");

                //Append anchor style
                if (par.ParagraphProperties != null && par.ParagraphProperties.Justification != null)
                    switch (par.ParagraphProperties.Justification.Val.Value)
                    {
                        case JustificationValues.Left:
                            result.Append("left;");
                            break;
                        case JustificationValues.Center:
                            result.Append("center;");
                            break;
                        case JustificationValues.Both:
                            result.Append("justify;");
                            break;
                        case JustificationValues.Right:
                        default:
                            result.Append("right;");
                            break;
                    }
                else
                    result.Append("left;");

                //Append text decoration style
                if (par.ParagraphProperties != null && par.ParagraphProperties.ParagraphMarkRunProperties != null && par.ParagraphProperties.ParagraphMarkRunProperties.HasChildren)
                    foreach (OpenXmlElement chield in par.ParagraphProperties.ParagraphMarkRunProperties.ChildElements)
                    {
                        switch (chield.GetType().Name)
                        {
                            case "Bold":
                                result.Append("font-weight:bold;");
                                break;
                            case "Underline":
                                result.Append("text-decoration:underline;");
                                break;
                            case "Italic":
                                result.Append("font-style:italic;");
                                break;
                            case "FontSize":
                                result.Append("font-size:" + ((FontSize)chield).Val.Value + "px;");
                                break;
                            default: break;
                        }
                    }

                result.Append("\">");

                //Add image tag
                IEnumerable<Run> runs = par.Descendants<Run>();
                foreach (Run run in runs)
                {
                    if (run.HasChildren)
                    {
                        foreach (OpenXmlElement chield in run.ChildElements.Where(o => o.GetType().Name == "Picture"))
                        {
                            result.Append(string.Format("<img style=\"{1}\" src=\"data:image/jpeg;base64,{0}\" />", GetBase64Image(imgPart[imgCounter].GetStream()),
                                           ((DocumentFormat.OpenXml.Vml.Shape)chield.ChildElements.Where(o => o.GetType().Name == "Shape").FirstOrDefault()).Style
                                ));
                            imgCounter++;
                        }
                        foreach (OpenXmlElement table in run.ChildElements.Where(o => o.GetType().Name == "Table"))
                        {
                            result.Append("<strong>HERE'S TABLE</strong>");
                        }
                    }
                }

                //Append inner text
                IEnumerable<Text> textElement = par.Descendants<Text>();
                if (par.Descendants<Text>().Count() == 0)
                    result.Append("<br />");

                foreach (Text t in textElement.Where(o=>!tableCellContent.Contains(o.Text.Trim())))
                {
                    result.Append(t.Text);
                }


                result.Append("</div>");
                result.Append(Environment.NewLine);

            }
            else if (section.GetType().Name=="Table")
            {
                result.Append("<table>");
                Table tab = (Table)section;
                foreach (TableRow row in tab.Descendants<TableRow>())
                {
                    result.Append("<tr>");
                    foreach (TableCell cell in row.Descendants<TableCell>())
                    {
                        result.Append("<td>");
                        result.Append(cell.InnerText);
                        tableCellContent.Add(cell.InnerText.Trim());
                        result.Append("</td>");
                    }
                    result.Append("</tr>");
                }
                result.Append("</table>");
            }                
        }


        wordProcessingDoc.Close();

        return result.ToString();
    }

    private static string GetBase64Image(Stream inputData)
    {
        byte[] data = new byte[inputData.Length];
        inputData.Read(data, 0, data.Length);
        return Convert.ToBase64String(data);
    }

张勇

2023-03-14

尝试按以下步骤查找文档中的第一个表。

Table table = doc.MainDocumentPart.Document.Body.Elements<Table>().First();

从DOCX中提取表

共有1个答案

相关问答

相关文章

相关阅读

相关工具

相关文档