text = textract.process(file_path, method='pdfminer', encoding='utf-8')
This raises the following error:
  File "D:\anaconda3\lib\site-packages\textract\parsers\__init__.py", line 77, in process
    return parser.process(filename, encoding, **kwargs)
  File "D:\anaconda3\lib\site-packages\textract\parsers\utils.py", line 46, in process
    byte_string = self.extract(filename, **kwargs)
  File "D:\anaconda3\lib\site-packages\textract\parsers\pdf_parser.py", line 31, in extract
    return self.extract_pdfminer(filename, **kwargs)
  File "D:\anaconda3\lib\site-packages\textract\parsers\pdf_parser.py", line 48, in extract_pdfminer
    stdout, _ = self.run(['pdf2txt.py', filename])
  File "D:\anaconda3\lib\site-packages\textract\parsers\utils.py", line 96, in run
    stdout, stderr = pipe.communicate()
UnboundLocalError: local variable 'pipe' referenced before assignment
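Root cause analysis:
The problem is in this code in Lib/site-packages/textract/parsers/pdf_parser.py (line 48):
    def extract_pdfminer(self, filename, **kwargs):
        """Extract text from pdfs using pdfminer."""
        stdout, _ = self.run(['pdf2txt.py', filename])
        return stdout
The command does not say how the file should be executed. pdf2txt.py is a Python script, not an executable, so on Windows the subprocess cannot be started from the script name alone. Because the subprocess never starts, the variable pipe is never assigned, and the later call to pipe.communicate() fails with UnboundLocalError.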
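The UnboundLocalError can be reproduced in isolation. The sketch below is a simplified, hypothetical model: run_and_collect and failing_launcher are made-up names, and it is an assumption that textract's run method has this shape. It only shows that when a launch error is swallowed, pipe is never bound and the next line fails in the same way as the traceback.

```python
def run_and_collect(launch):
    """Hypothetical model of the failing pattern; not textract's actual code."""
    try:
        pipe = launch()          # stands in for subprocess.Popen(args, ...)
    except OSError:
        pass                     # error swallowed, so pipe is never assigned
    return pipe.communicate()    # raises UnboundLocalError when launch failed


def failing_launcher():
    """Simulates the OS refusing to execute a .py file directly."""
    raise OSError("cannot execute pdf2txt.py directly")


try:
    run_and_collect(failing_launcher)
    error_name = None
except UnboundLocalError as exc:
    error_name = type(exc).__name__

print(error_name)
```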
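Solution:
Change the code so that the script is passed to the python interpreter:
    def extract_pdfminer(self, filename, **kwargs):
        """Extract text from pdfs using pdfminer."""
        stdout, _ = self.run(['python', 'pdf2txt.py', filename])
        return stdout
With this change the problem is solved.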
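A patch made inside site-packages is lost whenever textract is reinstalled or upgraded. As an alternative sketch (not part of textract; build_pdf2txt_command and extract_pdf_text are hypothetical helper names), the same command can be issued from your own code. It uses sys.executable so that pdf2txt.py runs under the same interpreter as your script, and it assumes pdf2txt.py is either on PATH or in the current working directory.

```python
import shutil
import subprocess
import sys


def build_pdf2txt_command(pdf_path, script="pdf2txt.py"):
    """Build an argument list that runs pdf2txt.py through the current interpreter."""
    # Prefer a copy found on PATH; otherwise fall back to the bare name,
    # which only works if the script sits in the current working directory.
    script_path = shutil.which(script) or script
    return [sys.executable, script_path, pdf_path]


def extract_pdf_text(pdf_path, encoding="utf-8"):
    """Run pdf2txt.py on pdf_path and return its output as str (assumes UTF-8 output)."""
    result = subprocess.run(
        build_pdf2txt_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError instead of failing silently
    )
    return result.stdout.decode(encoding)
```

Calling extract_pdf_text("report.pdf") would then return the extracted text, where report.pdf is a placeholder file name. Using sys.executable instead of the literal 'python' avoids picking up a different interpreter from PATH.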
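If the extracted text looks garbled, note that textract.process returns a byte string. After the line text = textract.process(file_path, method='pdfminer', encoding='utf-8'), add text = text.decode('utf-8') to decode it into a regular string.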
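The decoding step can be wrapped in a small helper. This is a minimal sketch: to_text is a hypothetical name, and the choice of errors="replace" is an assumption made here so that a few invalid bytes do not abort the whole extraction.

```python
def to_text(raw, encoding="utf-8"):
    """Return str for either bytes or str input."""
    if isinstance(raw, bytes):
        # Invalid byte sequences become U+FFFD instead of raising UnicodeDecodeError.
        return raw.decode(encoding, errors="replace")
    return raw


sample = "中文 PDF".encode("utf-8")  # stands in for the bytes returned by textract.process
print(to_text(sample))
```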
Additional notes:
Installing textract does not automatically install pdfminer; you need to install it manually:
pip install pdfminer.six
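Before running, also remember to copy pdf2txt.py into the project directory; otherwise other errors will occur.
See the official documentation: https://textract.readthedocs.io/en/latest/installation.html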
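Both prerequisites can be checked before calling textract. The helper below is a sketch with a hypothetical name, check_pdf_prerequisites. It relies on pdfminer.six being importable as pdfminer, and it only looks for pdf2txt.py on PATH, so a copy placed in the project directory will not be detected by this check.

```python
import importlib.util
import shutil


def check_pdf_prerequisites():
    """Report whether pdfminer is importable and pdf2txt.py is on PATH."""
    return {
        "pdfminer": importlib.util.find_spec("pdfminer") is not None,
        "pdf2txt.py": shutil.which("pdf2txt.py") is not None,
    }


for name, available in check_pdf_prerequisites().items():
    print(name, "found" if available else "missing")
```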
On Ubuntu / Debian, install the prerequisite packages before installing textract for PDF parsing:
apt-get install python-dev libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig libpulse-dev
pip install textract
If PDF parsing still does not work after installation, you also need to run pip install pdfminer.six.
Reference documentation: https://textract.readthedocs.io/en/stable/
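Since this is Python, running the script file directly with python should then work.
If output still fails because of the encoding used for standard output, the fix is simple once the problem has been located. There are two approaches:
1. Use PYTHONIOENCODING: set PYTHONIOENCODING=utf-8 when running python, i.e.
PYTHONIOENCODING=utf-8 python XXXX.py
(This prefix form is POSIX shell syntax. In the Windows cmd prompt, run set PYTHONIOENCODING=utf-8 first, then run python XXXX.py.)
2. Redefine standard output
Standard output is redefined as follows: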
import sys
import codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
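Both approaches can be tried without touching the real console. The sketch below uses two hypothetical helpers: utf8_writer applies the same codecs.getwriter wrapping to an in-memory binary stream, and utf8_child_env builds an environment for a child process with PYTHONIOENCODING set.

```python
import codecs
import io
import os


def utf8_writer(binary_stream):
    """Wrap a binary stream so that written str is encoded as UTF-8."""
    return codecs.getwriter("utf-8")(binary_stream)


def utf8_child_env():
    """Copy of the current environment with PYTHONIOENCODING forced to utf-8."""
    return dict(os.environ, PYTHONIOENCODING="utf-8")


# Demonstrate approach 2 on an in-memory stream instead of the real stdout.
buffer = io.BytesIO()
utf8_writer(buffer).write("中文 PDF")
encoded = buffer.getvalue()
```

For approach 1, passing env=utf8_child_env() to subprocess.run has the same effect as the shell prefix. On Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is another way to get the effect of approach 2.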
Original links: https://blog.csdn.net/AckClinkz/article/details/78538462
https://blog.csdn.net/u011415481/article/details/80794567