Question:

Read .doc file content and write it to a PDF file with Java

张华池

2023-03-14

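I am writing Java code that uses Apache POI to read an MS Office .doc file and the iText jar API to create and write a PDF file. I can already read the text and tables in the .doc file. Now I am looking for a way to read the images embedded in the document. I wrote the code below to read the images from the document file. Why does this code not work?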

public static void main(String[] args) {
    POIFSFileSystem fs = null;  
    Document document = new Document();
    WordExtractor extractor = null ;
    try {
        fs = new POIFSFileSystem(new FileInputStream("C:\\DATASTORE\\tableandImage.doc"));
        HWPFDocument hdocument=new HWPFDocument(fs);
        extractor = new WordExtractor(hdocument);
        OutputStream fileOutput = new FileOutputStream(new File("C:/DATASTORE/tableandImage.pdf"));
        PdfWriter.getInstance(document, fileOutput);
        document.open();
        Range range=hdocument.getRange();
        String readText=null;
        PdfPTable createTable;
        CharacterRun run;
        PicturesTable picture;

        for(int i=0;i<range.numParagraphs();i++) {
            Paragraph par = range.getParagraph(i);
            readText=par.text();
            if(!par.isInTable()) {
                if(readText.endsWith("\n")) {
                    readText=readText+"\n";
                    document.add(new com.itextpdf.text.Paragraph(readText));
                } if(readText.endsWith("\r")) {
                      readText += "\n";
                      document.add(new com.itextpdf.text.Paragraph(readText));
                  }
                run =range.getCharacterRun(i);
                picture=hdocument.getPicturesTable();
                if(picture.hasPicture(run)) {
                //if(run.isSpecialCharacter()) {  
                    Picture pic=picture.extractPicture(run, true);
                    byte[] picturearray=pic.getContent();
                    com.itextpdf.text.Image image=com.itextpdf.text.Image.getInstance(picturearray);
                    document.add(image);
                }
            } else if (par.isInTable()) { 
                  Table table = range.getTable(par);
                  TableRow tRow1= table.getRow(0);
                  int numColumns=tRow1.numCells();
                  createTable=new PdfPTable(numColumns);
                  for (int rowId=0;rowId<table.numRows();rowId++) {
                      TableRow tRow = table.getRow(rowId);
                      for (int cellId=0;cellId<tRow.numCells();cellId++) {
                          TableCell tCell = tRow.getCell(cellId);
                          PdfPCell c1 = new PdfPCell(new Phrase(tCell.text()));
                          createTable.addCell(c1);
                      }
                  }
                  document.add(createTable);
              } 
        }
    }catch(IOException e) {
        System.out.println("IO Exception");
        e.printStackTrace();
    }
    catch(Exception exep) {
        exep.printStackTrace();
    }finally {  
        document.close();  
    }  
}

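The problems are:

1. The condition if (picture.hasPicture(run)) is never satisfied, although the document contains a JPEG image.

2. I get the following exception when reading the table:

java.lang.IllegalArgumentException: This paragraph is not the first one in the table
    at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:876)
    at pagecode.ReadDocxOrDocFile.main(ReadDocxOrDocFile.java:113)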

Can anyone help me solve this? Thank you.

1 answer

薛浩言

2023-03-14

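Regarding your exception:

Your code iterates over all paragraphs and calls isInTable() on each of them. Because a table usually consists of several such paragraphs, getTable() is also called several times for a single table, and every call made from a paragraph that is not the table's first one throws the IllegalArgumentException.

What your code should do instead is find the first paragraph of a table, process all paragraphs inside it (via getRow(m).getCell(n)), and then continue the outer loop at the first paragraph after the table. In code, this could look roughly as follows (assuming no merged cells, no nested tables, and no other funny edge cases):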

if (par.isInTable()) {
    Table table = range.getTable(par);
    for (int rn=0; rn<table.numRows(); rn++) {
        TableRow row = table.getRow(rn);
        for (int cn=0; cn<row.numCells(); cn++) {
            TableCell cell = row.getCell(cn);
            for (int pn=0; pn<cell.numParagraphs(); pn++) {
                Paragraph cellParagraph = cell.getParagraph(pn);
                // your PDF conversion code goes here
            }
        }
    }
    i += table.numParagraphs()-1; // skip the already processed (table-)paragraphs in the outer loop
}
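
To make the skip-ahead step easier to see in isolation, here is a minimal plain-Java model of the outer loop. It does not use POI at all: Para, tableLength and convert are hypothetical stand-ins invented for this sketch, and the table id is only an assumption that lets the model tell which paragraphs belong together. The point is just that the table is handled once, at its first paragraph, and the loop index then jumps past the remaining table paragraphs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TableSkipDemo {

    /** Hypothetical stand-in for a Word paragraph: its text and the id of its table (-1 = no table). */
    public static final class Para {
        final String text;
        final int tableId;

        public Para(String text, int tableId) {
            this.text = text;
            this.tableId = tableId;
        }

        boolean isInTable() {
            return tableId >= 0;
        }
    }

    /** Counts the consecutive paragraphs, starting at 'start', that belong to the same table. */
    static int tableLength(List<Para> paras, int start) {
        int id = paras.get(start).tableId;
        int n = 0;
        while (start + n < paras.size() && paras.get(start + n).tableId == id) {
            n++;
        }
        return n;
    }

    /** Walks the paragraphs once and emits every table exactly one time. */
    public static List<String> convert(List<Para> paras) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < paras.size(); i++) {
            Para p = paras.get(i);
            if (!p.isInTable()) {
                out.add("text:" + p.text);
                continue;
            }
            // i is the first paragraph of a table: handle the whole table here ...
            int len = tableLength(paras, i);
            out.add("table:" + len + " paragraphs");
            // ... then skip the table paragraphs that were just processed.
            i += len - 1;
        }
        return out;
    }

    /** One text paragraph, a 2x2 table (four cell paragraphs), one text paragraph. */
    public static List<Para> sampleDocument() {
        return Arrays.asList(
            new Para("intro", -1),
            new Para("a1", 0), new Para("b1", 0), new Para("a2", 0), new Para("b2", 0),
            new Para("outro", -1));
    }

    public static void main(String[] args) {
        // prints [text:intro, table:4 paragraphs, text:outro]
        System.out.println(convert(sampleDocument()));
    }
}
```

Without the final index adjustment the loop would enter the table branch again at "b1", which in the real code is the point where getTable() refuses a paragraph that is not the first one of its table.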

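As for the images: the code in the question calls range.getCharacterRun(i) with the paragraph index i, so each iteration inspects a single, unrelated character run. Walk the character runs of every paragraph instead. The following is only a rough outline: PictureStore is a helper class whose definition is not shown here, and the extracted pic still has to be handed to the PDF code.

PictureStore pictureStore = new PictureStore(hdocument);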
// bla bla ...
for (int cr=0; cr < par.numCharacterRuns(); cr++) {
    CharacterRun characterRun = par.getCharacterRun(cr);
    Field field = hdocument.getFields().getFieldByStartOffset(FieldsDocumentPart.MAIN, characterRun.getStartOffset());
    if (field != null && field.getType() == 0x3A) { // 0x3A is type "EMBED"   
        Picture pic = pictureStore.getPicture(field.secondSubrange(characterRun));
    }
}
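
Why looping over the character runs of each paragraph matters can be shown with a small plain-Java model. Nothing here is POI: Run, countByParagraphIndex and countByParagraphRuns are hypothetical names made up for this sketch, and the sample document is an assumption. The first method mirrors the indexing used in the question (paragraph index reused as a document-wide run index), the second visits every run of every paragraph. This may not be the only reason hasPicture() returns false for a given file, but it is enough to miss a picture entirely.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class RunIndexDemo {

    /** Hypothetical stand-in for a character run; 'picture' marks a run that holds an image. */
    public static final class Run {
        final String text;
        final boolean picture;

        public Run(String text, boolean picture) {
            this.text = text;
            this.picture = picture;
        }
    }

    /** Mirrors the question's loop: paragraph index i is reused as an index into all runs. */
    public static int countByParagraphIndex(List<List<Run>> paragraphs) {
        List<Run> allRuns = new ArrayList<>();
        for (List<Run> paragraph : paragraphs) {
            allRuns.addAll(paragraph);
        }
        int found = 0;
        for (int i = 0; i < paragraphs.size(); i++) {
            // Only runs 0 .. (number of paragraphs - 1) are ever inspected.
            if (i < allRuns.size() && allRuns.get(i).picture) {
                found++;
            }
        }
        return found;
    }

    /** Visits every run of every paragraph, as the inner loop above does. */
    public static int countByParagraphRuns(List<List<Run>> paragraphs) {
        int found = 0;
        for (List<Run> paragraph : paragraphs) {
            for (Run run : paragraph) {
                if (run.picture) {
                    found++;
                }
            }
        }
        return found;
    }

    /** Three paragraphs with five runs in total; the picture sits in the last run. */
    public static List<List<Run>> sampleDocument() {
        return Arrays.asList(
            Arrays.asList(new Run("Heading", false)),
            Arrays.asList(new Run("See ", false), new Run("figure", false), new Run(" below", false)),
            Arrays.asList(new Run("", true)));
    }

    public static void main(String[] args) {
        List<List<Run>> doc = sampleDocument();
        System.out.println("by paragraph index: " + countByParagraphIndex(doc)); // prints 0
        System.out.println("by paragraph runs: " + countByParagraphRuns(doc));   // prints 1
    }
}
```

In the model the picture is run number 4 of the document, but with three paragraphs the index-based loop only ever looks at runs 0 to 2.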
