最近接到一个需求,要将所有不同格式的文档(包括.doc/.docx/.wps)转成统一格式,如都转为.docx,或直接转为.html 或.txt。经调研后,发现有这样几款工具:
win32com
)。当前大家的生产环境一般都是Linux环境,更换win服务器会造成一系列的连带问题,比如其他库是否兼容等等,非常麻烦,所以找到.doc/.wps在Linux下的处理方式非常重要。还好,最后被我找到了,那就是LibreOffice
libreoffice --version
,看看你是否已经安装此款工具。如果还没有安装,参考下文安装LibreOfficelibreoffice --headless --convert-to txt path-to-your-doc.doc
你同样可以指定转换后的文件输出路径,也可以批量地将doc/docx/wps文件传给LibreOffice接口:
libreoffice --headless --convert-to html --outdir /your/output/dir /your/doc_docx_wps/files/*.{dosx,doc,wps}
import os
os.system("libreoffice --headless --convert-to txt path-to-your-doc.doc")
当然,如果你嫌这个接口的单进程速度太慢,你也可以用Python执行多进程启动转换:
import subprocess
import os, glob
from multiprocessing.dummy import Pool
def worker(fname, dstdir=os.path.expanduser("~")):
subprocess.call(["libreoffice", "--headless", "--convert-to", "pdf", fname], cwd=dstdir)
pool = Pool()
pool.map(worker, glob.iglob(
os.path.join(os.path.expanduser("~"), "*.doc")
))
其实LibreOffice功能很强大,它还可以对xhtml、pdf、jpeg、png等等多种格式进行转换。具体支持的格式如下
The following list of document formats are currently available:
bib - BibTeX [.bib]
doc - Microsoft Word 97/2000/XP [.doc]
doc6 - Microsoft Word 6.0 [.doc]
doc95 - Microsoft Word 95 [.doc]
docbook - DocBook [.xml]
docx - Microsoft Office Open XML [.docx]
docx7 - Microsoft Office Open XML [.docx]
fodt - OpenDocument Text (Flat XML) [.fodt]
html - HTML Document (OpenOffice.org Writer) [.html]
latex - LaTeX 2e [.ltx]
mediawiki - MediaWiki [.txt]
odt - ODF Text Document [.odt]
ooxml - Microsoft Office Open XML [.xml]
ott - Open Document Text [.ott]
pdb - AportisDoc (Palm) [.pdb]
pdf - Portable Document Format [.pdf]
psw - Pocket Word [.psw]
rtf - Rich Text Format [.rtf]
sdw - StarWriter 5.0 [.sdw]
sdw4 - StarWriter 4.0 [.sdw]
sdw3 - StarWriter 3.0 [.sdw]
stw - Open Office.org 1.0 Text Document Template [.stw]
sxw - Open Office.org 1.0 Text Document [.sxw]
text - Text Encoded [.txt]
txt - Text [.txt]
uot - Unified Office Format text [.uot]
vor - StarWriter 5.0 Template [.vor]
vor4 - StarWriter 4.0 Template [.vor]
vor3 - StarWriter 3.0 Template [.vor]
wps - Microsoft Works [.wps]
xhtml - XHTML Document [.html]
The following list of graphics formats are currently available:
bmp - Windows Bitmap [.bmp]
emf - Enhanced Metafile [.emf]
eps - Encapsulated PostScript [.eps]
fodg - OpenDocument Drawing (Flat XML) [.fodg]
gif - Graphics Interchange Format [.gif]
html - HTML Document (OpenOffice.org Draw) [.html]
jpg - Joint Photographic Experts Group [.jpg]
met - OS/2 Metafile [.met]
odd - OpenDocument Drawing [.odd]
otg - OpenDocument Drawing Template [.otg]
pbm - Portable Bitmap [.pbm]
pct - Mac Pict [.pct]
pdf - Portable Document Format [.pdf]
pgm - Portable Graymap [.pgm]
png - Portable Network Graphic [.png]
ppm - Portable Pixelmap [.ppm]
ras - Sun Raster Image [.ras]
std - OpenOffice.org 1.0 Drawing Template [.std]
svg - Scalable Vector Graphics [.svg]
svm - StarView Metafile [.svm]
swf - Macromedia Flash (SWF) [.swf]
sxd - OpenOffice.org 1.0 Drawing [.sxd]
sxd3 - StarDraw 3.0 [.sxd]
sxd5 - StarDraw 5.0 [.sxd]
sxw - StarOffice XML (Draw) [.sxw]
tiff - Tagged Image File Format [.tiff]
vor - StarDraw 5.0 Template [.vor]
vor3 - StarDraw 3.0 Template [.vor]
wmf - Windows Metafile [.wmf]
xhtml - XHTML [.xhtml]
xpm - X PixMap [.xpm]
The following list of presentation formats are currently available:
bmp - Windows Bitmap [.bmp]
emf - Enhanced Metafile [.emf]
eps - Encapsulated PostScript [.eps]
fodp - OpenDocument Presentation (Flat XML) [.fodp]
gif - Graphics Interchange Format [.gif]
html - HTML Document (OpenOffice.org Impress) [.html]
jpg - Joint Photographic Experts Group [.jpg]
met - OS/2 Metafile [.met]
odg - ODF Drawing (Impress) [.odg]
odp - ODF Presentation [.odp]
otp - ODF Presentation Template [.otp]
pbm - Portable Bitmap [.pbm]
pct - Mac Pict [.pct]
pdf - Portable Document Format [.pdf]
pgm - Portable Graymap [.pgm]
png - Portable Network Graphic [.png]
potm - Microsoft PowerPoint 2007/2010 XML Template [.potm]
pot - Microsoft PowerPoint 97/2000/XP Template [.pot]
ppm - Portable Pixelmap [.ppm]
pptx - Microsoft PowerPoint 2007/2010 XML [.pptx]
pps - Microsoft PowerPoint 97/2000/XP (Autoplay) [.pps]
ppt - Microsoft PowerPoint 97/2000/XP [.ppt]
pwp - PlaceWare [.pwp]
ras - Sun Raster Image [.ras]
sda - StarDraw 5.0 (OpenOffice.org Impress) [.sda]
sdd - StarImpress 5.0 [.sdd]
sdd3 - StarDraw 3.0 (OpenOffice.org Impress) [.sdd]
sdd4 - StarImpress 4.0 [.sdd]
sxd - OpenOffice.org 1.0 Drawing (OpenOffice.org Impress) [.sxd]
sti - OpenOffice.org 1.0 Presentation Template [.sti]
svg - Scalable Vector Graphics [.svg]
svm - StarView Metafile [.svm]
swf - Macromedia Flash (SWF) [.swf]
sxi - OpenOffice.org 1.0 Presentation [.sxi]
tiff - Tagged Image File Format [.tiff]
uop - Unified Office Format presentation [.uop]
vor - StarImpress 5.0 Template [.vor]
vor3 - StarDraw 3.0 Template (OpenOffice.org Impress) [.vor]
vor4 - StarImpress 4.0 Template [.vor]
vor5 - StarDraw 5.0 Template (OpenOffice.org Impress) [.vor]
wmf - Windows Metafile [.wmf]
xhtml - XHTML [.xml]
xpm - X PixMap [.xpm]
The following list of spreadsheet formats are currently available:
csv - Text CSV [.csv]
dbf - dBASE [.dbf]
dif - Data Interchange Format [.dif]
fods - OpenDocument Spreadsheet (Flat XML) [.fods]
html - HTML Document (OpenOffice.org Calc) [.html]
ods - ODF Spreadsheet [.ods]
ooxml - Microsoft Excel 2003 XML [.xml]
ots - ODF Spreadsheet Template [.ots]
pdf - Portable Document Format [.pdf]
pxl - Pocket Excel [.pxl]
sdc - StarCalc 5.0 [.sdc]
sdc4 - StarCalc 4.0 [.sdc]
sdc3 - StarCalc 3.0 [.sdc]
slk - SYLK [.slk]
stc - OpenOffice.org 1.0 Spreadsheet Template [.stc]
sxc - OpenOffice.org 1.0 Spreadsheet [.sxc]
uos - Unified Office Format spreadsheet [.uos]
vor3 - StarCalc 3.0 Template [.vor]
vor4 - StarCalc 4.0 Template [.vor]
vor - StarCalc 5.0 Template [.vor]
xhtml - XHTML [.xhtml]
xls - Microsoft Excel 97/2000/XP [.xls]
xls5 - Microsoft Excel 5.0 [.xls]
xls95 - Microsoft Excel 95 [.xls]
xlt - Microsoft Excel 97/2000/XP Template [.xlt]
xlt5 - Microsoft Excel 5.0 Template [.xlt]
xlt95 - Microsoft Excel 95 Template [.xlt]
xlsx - Microsoft Excel 2007/2010 XML [.xlsx]