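For work I need to parse all kinds of documents with Python, and my esteemed manager, AKA Byrd, recommended textract to me.
"Textract is the most ridiculous library that I've ever used." In fairness it is quite powerful; it is just not very friendly to PDFs.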
-----------------------------------------------------------------------------------------------------------------
Pitfall 1:
After installing the library with pip install textract, I ran:
import textract
textract.process('a.pdf', method='pdfminer')
Error: textract.exceptions.ShellError: The command `pdf2txt.py a.pdf` failed with exit code 127
So I ran pip install pdfminer.six
That takes care of the first pitfall.
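Exit code 127 is the shell's "command not found" status, so one quick way to confirm this pitfall is to check whether pdf2txt.py is visible on PATH before and after installing pdfminer.six. The helper below is only a sketch: the name describe_command is made up for illustration, and it relies solely on the standard-library shutil.which (on Windows the result also depends on PATHEXT).

```python
import shutil


def describe_command(name):
    """Return a short message saying whether `name` can be found on PATH."""
    path = shutil.which(name)
    if path is None:
        return name + " was not found on PATH"
    return name + " resolves to " + path


if __name__ == "__main__":
    # Hypothetical check for the script that textract tries to launch.
    print(describe_command("pdf2txt.py"))
```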
Pitfall 2:
Running the code again, this time I got the following error:
UnboundLocalError: local variable 'pipe' referenced before assignment
Looking at the source, utils.py:
def run(self, args):
    """Run ``command`` and return the subsequent ``stdout`` and ``stderr``
    as a tuple. If the command is not successful, this raises a
    :exc:`textract.exceptions.ShellError`.
    """
    # run a subprocess and put the stdout and stderr on the pipe object
    try:
        pipe = subprocess.Popen(
            args,
            stdout=subprocess.PIPE, stderr=subprocess.PIPE,
        )
    except OSError as e:
        if e.errno == errno.ENOENT:
            # File not found.
            # This is equivalent to getting exitcode 127 from sh
            raise exceptions.ShellError(
                ' '.join(args), 127, '', '',
            )
    # pipe.wait() ends up hanging on large files. using
    # pipe.communicate appears to avoid this issue
    stdout, stderr = pipe.communicate()
    # if pipe is busted, raise an error (unlike Fabric)
    if pipe.returncode != 0:
        raise exceptions.ShellError(
            ' '.join(args), pipe.returncode, stdout, stderr,
        )
    return stdout, stderr
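The error comes from the line that calls pipe.communicate(). My reaction was "WaduHek ?!" I had written a single line of code, so what was this error about?
More Googling turned up this thread: https://github.com/deanmalmgren/textract/issues/154 . Reading it carefully revealed the cause:
In the source file pdf_parser.py: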
def extract_pdfminer(self, filename, **kwargs):
    """Extract text from pdfs using pdfminer."""
    stdout, _ = self.run(['pdf2txt.py', filename])
    return stdout
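This pdf2txt.py cannot be found and launched as a command. Looking back at run in utils.py: when subprocess.Popen raises an OSError whose errno is not ENOENT, the except block re-raises nothing, pipe is never assigned, and the following pipe.communicate() line fails with the UnboundLocalError.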
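The control flow that leads to the UnboundLocalError can be reproduced without textract at all. The sketch below is a simplified imitation of run, not the library's code: run_like_textract is a hypothetical name, the launch failure is simulated by raising OSError by hand, and errno.ENOEXEC is just an example of "some errno other than ENOENT" (which errno the real Popen call produced on my machine is not verified here).

```python
import errno


def run_like_textract(simulated_errno):
    """Mimic the control flow of run(): only ENOENT is re-raised."""
    try:
        if simulated_errno is not None:
            # Stand-in for subprocess.Popen failing to start the command.
            raise OSError(simulated_errno, "simulated launch failure")
        pipe = "process handle"
    except OSError as e:
        if e.errno == errno.ENOENT:
            raise RuntimeError("command not found (exit code 127)")
        # Any other OSError is swallowed here, so `pipe` stays unassigned.
    return pipe  # raises UnboundLocalError when the except branch fell through


if __name__ == "__main__":
    for code in (None, errno.ENOENT, errno.ENOEXEC):
        try:
            print(code, "->", run_like_textract(code))
        except (RuntimeError, UnboundLocalError) as exc:
            print(code, "->", type(exc).__name__)
```

A bare raise at the end of the except block would surface the real OSError instead of the confusing UnboundLocalError.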
Here are two ways to fix it:
Method 1:
Modify the source so that it reads:
def extract_pdfminer(self, filename, **kwargs):
    """Extract text from pdfs using pdfminer."""
    stdout, _ = self.run(['python','path/to/pdf2txt.py', filename])
    return stdout
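The second element of the list passed to run, 'path/to/pdf2txt.py', must be changed to the absolute path of pdf2txt.py on your system (I have not tried a relative path, so I do not know whether it works).
For example, I develop inside a virtualenv, so my path is c:\Users\....\venv\Scripts\pdf2txt.py
With that change, the code runs.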
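Instead of typing the absolute path by hand, the same argument list could be computed at run time. This is only a sketch under one assumption: that pip put pdf2txt.py in the scripts directory of the interpreter that is currently running, which is what the virtualenv path above suggests. build_pdf2txt_command is a hypothetical helper name, and it does not check that the file actually exists.

```python
import os
import sys
import sysconfig


def build_pdf2txt_command(filename):
    """Build an argument list shaped like ['python', 'path/to/pdf2txt.py', filename]."""
    # Assumption: pip placed pdf2txt.py in this interpreter's scripts directory
    # (venv\Scripts on Windows, venv/bin elsewhere).
    scripts_dir = sysconfig.get_path("scripts")
    script = os.path.join(scripts_dir, "pdf2txt.py")
    # sys.executable is the interpreter currently running, so the script is
    # started with the same virtualenv's Python.
    return [sys.executable, script, filename]


if __name__ == "__main__":
    print(build_pdf2txt_command("a.pdf"))
```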
(Afterword: with this method in place, suppose I change my code to this:
import textract
textract.process('a.pdf')
that is, I drop method='pdfminer'.
According to the source, pdf_parser.py:
def extract(self, filename, method='', **kwargs):
    if method == '' or method == 'pdftotext':
        try:
            return self.extract_pdftotext(filename, **kwargs)
        except ShellError as ex:
            # If pdftotext isn't installed and the pdftotext method
            # wasn't specified, then gracefully fallback to using
            # pdfminer instead.
            if method == '' and ex.is_not_installed():
                return self.extract_pdfminer(filename, **kwargs)
            else:
                raise ex
    elif method == 'pdfminer':
        return self.extract_pdfminer(filename, **kwargs)
    elif method == 'tesseract':
        return self.extract_tesseract(filename, **kwargs)
    else:
        raise UnknownMethod(method)

def extract_pdftotext(self, filename, **kwargs):
    """Extract text from pdfs using the pdftotext command line utility."""
    if 'layout' in kwargs:
        args = ['pdftotext', '-layout', filename, '-']
    else:
        args = ['pdftotext', filename, '-']
    stdout, _ = self.run(args)
    return stdout

def extract_pdfminer(self, filename, **kwargs):
    """Extract text from pdfs using pdfminer."""
    stdout, _ = self.run(['python','path/to/pdf2txt.py', filename])
    return stdout
When method == '', it should first enter
try:
    return self.extract_pdftotext(filename, **kwargs)
then notice that I do not have pdftotext installed, and move on to
except ShellError as ex:
    # If pdftotext isn't installed and the pdftotext method
    # wasn't specified, then gracefully fallback to using
    # pdfminer instead.
    if method == '' and ex.is_not_installed():
        return self.extract_pdfminer(filename, **kwargs)
and in the end it should still execute
def extract_pdfminer(self, filename, **kwargs):
    """Extract text from pdfs using pdfminer."""
    stdout, _ = self.run(['python','path/to/pdf2txt.py', filename])
    return stdout
By that reasoning, my modified code should still run. Yet it raised an error again:
.............................
.............................
textract.exceptions.ShellError: The command `pdftotext a.pdf -` failed with exit code 127
------------- stdout -------------------------- stderr -------------
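For lack of time I did not dig any deeper. If anyone has an idea, please tell me why; I would be very grateful.
Finally, because this method requires editing the library's source, any textract I later install via pip freeze > requirements.txt plus pip install -r requirements.txt would not carry the patch, so I did not go with this method ...... I wonder whether anyone has read all the way down here, found out the method is no good after all, and closed the page, haha)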
Method 2:
According to Irq3000 in the thread:
".... so unfortunately I will have to package a copy of pdf2txt.py within my own package, as there is no reliable way to know where the "Scripts" is and add it dynamically to the path that is both crossplatform ...."
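Copy the pdf2txt.py file into your project folder, then, through inheritance, add to your program a pdfminer parser class that you wrote yourself (well, not "yourself": Irq3000 wrote it):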
import os
from tempfile import mkstemp

import pdf2txt  # the copy of pdf2txt.py placed in the project folder
from textract.parsers.utils import ShellParser


class MyPdfMinerParser(ShellParser):
    """Extract text from pdf files using the native python PdfMiner library"""

    def extract(self, filename, **kwargs):
        """Extract text from pdfs using pdfminer and pdf2txt.py wrapper."""
        # Create a temporary output file
        tempfilefh, tempfilepath = mkstemp(suffix='.txt')
        os.close(tempfilefh)  # close to allow writing to tesseract
        # Extract text from pdf using the entry script pdf2txt (part of PdfMiner)
        pdf2txt.main(['', '-o', tempfilepath, filename])
        # Read the results of extraction
        with open(tempfilepath, 'rb') as f:
            res = f.read()
        # Remove temporary output file
        os.remove(tempfilepath)
        return res


pdfminerparser = MyPdfMinerParser()
result = pdfminerparser.process('a.pdf', 'utf8')
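One detail of the class above is that the temporary file is only removed if the extraction succeeds. A small variation, sketched below with the standard library only, wraps the work in try/finally so the file is cleaned up even when the converter raises. The names read_via_tempfile and fake_convert are hypothetical; fake_convert merely stands in for the pdf2txt.main(...) call so that the snippet runs without pdfminer installed.

```python
import os
from tempfile import mkstemp


def read_via_tempfile(convert, filename):
    """Call convert(output_path, filename), return the bytes it wrote, then clean up."""
    fd, temp_path = mkstemp(suffix=".txt")
    os.close(fd)  # close the handle so the converter can reopen the file by path
    try:
        convert(temp_path, filename)
        with open(temp_path, "rb") as f:
            return f.read()
    finally:
        os.remove(temp_path)  # runs even if convert() raised


def fake_convert(output_path, filename):
    """Hypothetical stand-in for the pdf2txt.main(...) call; writes a marker."""
    with open(output_path, "wb") as f:
        f.write(b"text extracted from " + filename.encode("utf8"))


if __name__ == "__main__":
    print(read_via_tempfile(fake_convert, "a.pdf"))
```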
Pitfall 3:
Just when I thought it was finally over, I ran the code and got this error:
usage: dpMain.py [-h] [-d] [-p PAGENOS]
[--page-numbers PAGE_NUMBERS [PAGE_NUMBERS ...]]
[-m MAXPAGES] [-P PASSWORD] [-o OUTFILE] [-t OUTPUT_TYPE]
[-c CODEC] [-s SCALE] [-A] [-V] [-W WORD_MARGIN]
[-M CHAR_MARGIN] [-L LINE_MARGIN] [-F BOXES_FLOW]
[-Y LAYOUTMODE] [-n] [-R ROTATION] [-O OUTPUT_DIR] [-C] [-S]
files [files ...]
dpMain.py: error: unrecognized arguments: a.pdf
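........ fine.
I looked at the source, analyzed it a bit, and found that
the first element of the list in the line pdf2txt.main(['', '-o', tempfilepath, filename]) is superfluous (or perhaps my skills are still too low to see what it is for). Removing it gives: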
pdf2txt.main(['-o', tempfilepath, filename])
Done!
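As for why the leading '' caused trouble: the usage text in the error has the shape of argparse output, so a plausible explanation (a guess, not verified against the pdf2txt.py source) is that the '' was meant as an argv[0] placeholder, while an argparse-based main treats the list as the arguments proper. The toy parser below is not pdf2txt's real parser; make_demo_parser and parse are made-up names, and only the files positional and the -o option are imitated from the usage text.

```python
import argparse


def make_demo_parser():
    """A toy parser with one option and a `files` positional, not pdf2txt's real one."""
    parser = argparse.ArgumentParser(prog="demo")
    parser.add_argument("files", nargs="+")
    parser.add_argument("-o", "--outfile", default="-")
    return parser


def parse(argv):
    """Return (namespace, leftovers) instead of exiting on leftover arguments."""
    return make_demo_parser().parse_known_args(argv)


if __name__ == "__main__":
    # With the leading '', the empty string is taken as the only file and
    # 'a.pdf' is left over, which parse_args() would report as unrecognized.
    print(parse(["", "-o", "out.txt", "a.pdf"]))
    # Without it, 'a.pdf' is parsed as the file.
    print(parse(["-o", "out.txt", "a.pdf"]))
```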
References: https://github.com/euske/pdfminer
https://github.com/deanmalmgren/textract/issues/154