Reading DOC files and generating PDF with POI
One of my customers has an enormous number of PDF and Microsoft Word DOC files on their website. The files are core to their online services, so it's not as though they're garbage sitting on the server. My customer wanted their website's search engine (Sphider) to read the text of these PDF and DOC files so that their clients could find the documents they need without clicking through a bunch of summary pages. I was successful in the task, so let me show you how to read PDF and DOC files using PHP.
Reading PDF Files
To read PDF files, you will need to install the XPDF package, which includes the "pdftotext" command-line tool. Once XPDF/pdftotext is installed, run the following PHP statement to get the PDF text:
$content = shell_exec('/usr/local/bin/pdftotext ' . escapeshellarg($filename) . ' -'); // escapeshellarg() quotes the path safely; the trailing dash sends the text to stdout
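The one-liner above can be wrapped in a small helper so the search indexer gets a clear "no text" result instead of a warning when a file is missing or the tool fails. This is only a sketch: the names `PDFTOTEXT_BIN`, `buildPdfToTextCommand()` and `readPdfText()` are hypothetical, and the binary path is simply the one used in this article, so adjust it to wherever pdftotext lives on your server.

```php
<?php
// Assumption: pdftotext is installed at the path used in the article.
const PDFTOTEXT_BIN = '/usr/local/bin/pdftotext';

/**
 * Build the shell command that prints a PDF's text to stdout.
 * (Hypothetical helper, not part of XPDF or PHP.)
 */
function buildPdfToTextCommand(string $filename, string $binary = PDFTOTEXT_BIN): string
{
    // escapeshellarg() quotes each value so spaces or shell
    // metacharacters in a path are not interpreted by the shell.
    return escapeshellarg($binary) . ' ' . escapeshellarg($filename) . ' -';
}

/**
 * Return the text of a PDF, or null if the file cannot be read
 * or the command produced no output.
 */
function readPdfText(string $filename): ?string
{
    if (!is_readable($filename)) {
        return null;
    }

    $output = shell_exec(buildPdfToTextCommand($filename));

    // shell_exec() returns a string on success and null or false
    // when the command fails or prints nothing.
    return is_string($output) ? $output : null;
}
```

A caller would then do something like `$content = readPdfText($filename);` and skip the file when the result is `null`.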
Reading DOC Files
Like the PDF example above, you'll need to install another package, this one called Antiword. Here's the code to grab the Word DOC content:
$content = shell_exec('/usr/local/bin/antiword ' . escapeshellarg($filename)); // antiword prints the document text to stdout
The above code does NOT read DOCX files and, purposely, does not preserve formatting. There are other libraries that will preserve formatting, but in our case we just want to get at the text.
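Since a search indexer will meet both file types, the two commands can sit behind one function that picks the tool by file extension. The sketch below assumes the same binary paths as the snippets above; `buildExtractCommand()` and `extractText()` are hypothetical names, and anything that is not a .pdf or .doc file (including .docx) is simply skipped.

```php
<?php
/**
 * Pick the extraction command for a file based on its extension.
 * Returns null for unsupported types such as DOCX.
 * (Hypothetical helper; paths are the ones used in the article.)
 */
function buildExtractCommand(string $filename): ?string
{
    $extension = strtolower(pathinfo($filename, PATHINFO_EXTENSION));
    $file = escapeshellarg($filename); // quote the path for the shell

    switch ($extension) {
        case 'pdf':
            // The trailing dash makes pdftotext write to stdout.
            return '/usr/local/bin/pdftotext ' . $file . ' -';
        case 'doc':
            return '/usr/local/bin/antiword ' . $file;
        default:
            return null;
    }
}

/**
 * Return the plain text of a PDF or DOC file, or null if the type
 * is unsupported, the file is unreadable, or the tool printed nothing.
 */
function extractText(string $filename): ?string
{
    $command = buildExtractCommand($filename);
    if ($command === null || !is_readable($filename)) {
        return null;
    }

    $output = shell_exec($command);

    return is_string($output) ? $output : null;
}
```

Keeping the command building separate from the `shell_exec()` call makes it easy to check which command would run for a given file without having either tool installed.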
A special thank you to Jeremy Parrish for his help and insight with this task.