tika-python

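License: Apache-2.0 License
Development language: Python
Category: Neural networks/artificial intelligence, natural language processing
Software type: Open source software
Region: Unknown
Submitted by: 秦毅
Operating system: Cross-platform
Open source organization:
Intended audience: Unknown
Software overview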

tika-python

A Python port of the Apache Tika library that makes Tika available using the Tika REST Server.

This makes Apache Tika available as a Python library, installable via Setuptools, Pip and Easy Install.

To use this library, you need to have Java 7+ installed on your system, as tika-python starts up the Tika REST server in the background.

Inspired by Aptivate Tika.

Installation (with pip)

  1. pip install tika

Installation (without pip)

  1. python setup.py build
  2. python setup.py install

Airgap Environment Setup

To get this working in a disconnected environment, download a Tika server file (both tika-server.jar and tika-server.jar.md5, which can be found here) and set the TIKA_SERVER_JAR environment variable to TIKA_SERVER_JAR="file:////tika-server.jar". This tells tika-python to "download" the file from that location, move it to /tmp/tika-server.jar, and run it as a background process.

This is the only way to run tika-python without internet access. Without this variable set, the default is to check the Tika version and pull the latest jar from Apache every time.
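As a minimal sketch of the setup above, the variable can also be set from Python before tika is imported. The helper name and the directory argument are hypothetical, and Path.as_uri() is used here only as one convenient way to build a file URL for a jar you have already copied locally.

```python
import os
from pathlib import Path


def point_tika_at_local_jar(jar_dir):
    """Set TIKA_SERVER_JAR to a file URL for a locally stored jar.

    Call this before `import tika`, because the environment variables
    are read once when tika/tika.py is first loaded.
    """
    jar = Path(jar_dir).resolve() / "tika-server.jar"
    md5 = jar.with_name("tika-server.jar.md5")
    # Both files are needed for the airgap setup described above.
    missing = [p.name for p in (jar, md5) if not p.is_file()]
    if missing:
        raise FileNotFoundError(
            "missing in %s: %s" % (jar_dir, ", ".join(missing))
        )
    os.environ["TIKA_SERVER_JAR"] = jar.as_uri()
    return os.environ["TIKA_SERVER_JAR"]
```

A typical call would be point_tika_at_local_jar('/opt/tika') followed by `from tika import parser`, where /opt/tika stands in for your own directory.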

Environment Variables

These are read once, when tika/tika.py is initially loaded and used throughout after that.

  1. TIKA_VERSION - set to the version string, e.g., 1.12 or default to current Tika version.
  2. TIKA_SERVER_JAR - set to the full URL to the remote Tika server jar to download and cache.
  3. TIKA_SERVER_ENDPOINT - set to the host (local or remote) for the running Tika server jar.
  4. TIKA_CLIENT_ONLY - if set to True, then TIKA_SERVER_JAR is ignored, and relies on the value for TIKA_SERVER_ENDPOINT and treats Tika like a REST client.
  5. TIKA_TRANSLATOR - set to the fully qualified class name (defaults to Lingo24) for the Tika translator implementation.
  6. TIKA_SERVER_CLASSPATH - set to a string (delimited by ':' for each additional path) to prepend to the Tika server jar path.
  7. TIKA_LOG_PATH - set to a directory with write permissions and the tika.log and tika-server.log files will be placed in this directory.
  8. TIKA_PATH - set to a directory with write permissions and the tika_server.jar file will be placed in this directory.
  9. TIKA_JAVA - set the Java runtime name, e.g., java or java9
  10. TIKA_STARTUP_SLEEP - number of seconds (float) to wait per check if Tika server is launched at runtime
  11. TIKA_STARTUP_MAX_RETRY - number of checks (int) to attempt for Tika server startup if launched at runtime
  12. TIKA_JAVA_ARGS - set java runtime arguments, e.g., -Xmx4g
  13. TIKA_LOG_FILE - set the filename for the log file. Default: tika.log. If it is an empty string (''), no log file is created.
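Because these variables are read once when tika/tika.py is loaded, they have to be in place before the first import. The sketch below shows one way to do that from Python; the helper name and the values in TIKA_SETTINGS are hypothetical and only illustrate the variables listed above.

```python
import os

# Hypothetical values for illustration; adjust them to your deployment.
TIKA_SETTINGS = {
    "TIKA_CLIENT_ONLY": "True",
    "TIKA_SERVER_ENDPOINT": "http://localhost:9998",
    "TIKA_LOG_PATH": "/tmp",
    "TIKA_STARTUP_MAX_RETRY": "5",
}


def apply_tika_settings(settings, environ=os.environ):
    """Copy settings into the environment, keeping values that are already set."""
    applied = {}
    for name, value in settings.items():
        if not name.startswith("TIKA_"):
            raise ValueError("not a Tika variable: %r" % name)
        # setdefault leaves an existing value (e.g. exported in the shell) alone.
        applied[name] = environ.setdefault(name, str(value))
    return applied
```

Calling apply_tika_settings(TIKA_SETTINGS) and only then importing tika keeps values exported in the shell in control, since existing entries are not overwritten.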

Testing it out

Parser Interface (backwards compat prior to REST)

#!/usr/bin/env python
import tika
tika.initVM()
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])
print(parsed["content"])

Parser Interface

The parser interface extracts text and metadata using the /rmeta interface. This is one of the better ways to get the internal XHTML content extracted.

Note: The parser interface needs the following environment variable set on the console for printing of the extracted content.

export PYTHONIOENCODING=utf8

#!/usr/bin/env python
import tika
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])
print(parsed["content"])

Optionally, you can pass the Tika server URL along with the call, which is useful for multi-instance execution or when Tika is dockerized/linked.

parsed = parser.from_file('/path/to/file', 'http://tika:9998/tika')
string_parsed = parser.from_buffer('Good evening, Dave', 'http://tika:9998/tika')
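The examples above read the "metadata" and "content" keys of the parsed result. As a small sketch of handling that result defensively, the helper below (a hypothetical name) treats a missing or None value as empty; that content can be None and that metadata carries a "Content-Type" key are assumptions for illustration. The sample dictionary is made up and does not come from a real Tika call.

```python
def summarize_parsed(parsed, preview_chars=80):
    """Return a short summary of a parsed result dictionary."""
    # Treat missing or None values as empty so callers need no extra checks.
    metadata = parsed.get("metadata") or {}
    content = (parsed.get("content") or "").strip()
    return {
        "content_type": metadata.get("Content-Type"),
        "n_chars": len(content),
        "preview": content[:preview_chars],
    }


# Made-up sample with the same two keys used in the examples above.
sample = {
    "metadata": {"Content-Type": "text/plain"},
    "content": "\n\nGood evening, Dave\n",
}
summary = summarize_parsed(sample)
print(summary["content_type"], summary["n_chars"])
```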

Specify Output Format To XHTML

The parser interface is optionally able to output the content as XHTML rather than plain text.

Note: The parser interface needs the following environment variable set on the console for printing of the extracted content.

export PYTHONIOENCODING=utf8

#!/usr/bin/env python
import tika
from tika import parser
parsed = parser.from_file('/path/to/file', xmlContent=True)
print(parsed["metadata"])
print(parsed["content"])

# Note: This is also available when parsing from the buffer.
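Once the content comes back as XHTML, it can be post-processed with ordinary markup tools. The sketch below collects paragraph text with the standard library's html.parser; the class and function names are hypothetical, and the sample string is a hand-written stand-in whose shape is only assumed to resemble what parsed["content"] holds when xmlContent=True.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of every non-empty <p> element."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while outside a <p> element

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def xhtml_paragraphs(xhtml):
    """Return the paragraph texts found in an XHTML string (or [] for None)."""
    collector = ParagraphCollector()
    collector.feed(xhtml or "")
    collector.close()
    return collector.paragraphs


# Hand-written sample, not real Tika output.
sample_xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    "<p>Good evening, Dave</p><p/><p>I am sorry.</p>"
    "</body></html>"
)
print(xhtml_paragraphs(sample_xhtml))
```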

Unpack Interface

The unpack interface handles both metadata and text extraction in a single call. The server returns a tarball of metadata and text entries, which is unpacked internally, reducing the wire load for extraction.

#!/usr/bin/env python
import tika
from tika import unpack
parsed = unpack.from_file('/path/to/file')

Detect Interface

The detect interface provides an IANA MIME type classification for the provided file.

#!/usr/bin/env python
import tika
from tika import detector
print(detector.from_file('/path/to/file'))
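One common use of the detected type is an allow-list check. The sketch below is a hypothetical helper, not part of tika-python; it takes the detection function as a parameter so it can be exercised without a running server, and it strips parameters such as a charset in case the returned string carries them (an assumption).

```python
def is_allowed_type(path, allowed, detect=None):
    """Return True if the detected MIME type of path is in allowed."""
    if detect is None:
        # Imported lazily so the helper can be tested with a stand-in function.
        from tika import detector
        detect = detector.from_file
    mime = detect(path) or ""
    # Keep only the type/subtype part, e.g. "text/plain; charset=UTF-8".
    base = mime.split(";", 1)[0].strip().lower()
    return base in allowed


ALLOWED = {"application/pdf", "text/plain"}
```

With a Tika server available, is_allowed_type('/path/to/file', ALLOWED) would use detector.from_file as in the example above.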

Config Interface

The config interface allows you to inspect the Tika Server environment's configuration, including what parsers, mime types, and detectors the server has been configured with.

#!/usr/bin/env python
import tika
from tika import config
print(config.getParsers())
print(config.getMimeTypes())
print(config.getDetectors())

Language Detection Interface

The language detection interface provides a 2-character language code based on the text in the provided file.

#!/usr/bin/env python
from tika import language
print(language.from_file('/path/to/file'))

Translate Interface

The translate interface translates the text automatically extracted by Tika from the source language to the destination language.

#!/usr/bin/env python
from tika import translate
print(translate.from_file('/path/to/spanish', 'es', 'en'))

Using a Buffer

Note that you can also use the from_buffer(string) method of Parser and Detector to dynamically parse a string buffer in Python and/or detect its MIME type. This is useful if you've already loaded the content into memory.
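A minimal sketch of the buffer workflow follows. The helper name is hypothetical; by default it calls parser.from_buffer and detector.from_buffer as named above, and it accepts stand-in functions so the flow can be tried without a running Tika server.

```python
def inspect_buffer(data, parse=None, detect=None):
    """Detect the MIME type of an in-memory buffer and parse it."""
    if parse is None or detect is None:
        # Imported lazily; only needed when no stand-ins are supplied.
        from tika import parser, detector
        parse = parse or parser.from_buffer
        detect = detect or detector.from_buffer
    return {"mime": detect(data), "parsed": parse(data)}
```

For example, after reading a file into memory with open(path, 'rb').read(), the resulting buffer can be passed to inspect_buffer instead of handing Tika the path again.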

Using Client Only Mode

You can set Tika to use Client only mode by setting

import tika
tika.TikaClientOnly = True

Then you can run any of the methods and it will fully omit the check to see if the service on localhost is running and omit printing the check messages.

Changing the Tika Classpath

You can update the classpath that Tika server uses by setting the classpath as a set of ':' delimited strings. For example, if you want to get Tika-Python working with GeoTopicParsing, you can do the following. Replace the paths below with your own paths, as identified here, and make sure that you have done this:

Kill Tika server (if already running):

ps aux | grep java | grep Tika
kill -9 PID

Then set the classpath in Python before parsing:

import os
import tika.tika
from tika import parser
home = os.getenv('HOME')
# Paths are joined with ':' as described above
tika.tika.TikaServerClasspath = home + '/git/geotopicparser-utils/mime:' + home + '/git/geotopicparser-utils/models/polar'
parsed = parser.from_file(home + '/git/geotopicparser-utils/geotopics/polar.geot')
print(parsed["metadata"])

Customizing the Tika Server Request

You may customize the outgoing HTTP request to Tika server by setting requestOptions on the .from_file and .from_buffer methods (Parser, Unpack, Detect, Config, Language, Translate). It should be a dictionary of arguments that will be passed to the request method. The request method documentation specifies valid arguments. This will override any defaults except for url and params/data.

from tika import parser
parsed = parser.from_file('/path/to/file', requestOptions={'timeout': 120})
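Building on the timeout example above, the sketch below retries a parse with progressively longer timeouts. The helper name and the timeout values are hypothetical, and catching Exception broadly is a simplification, since the exact exception raised on a timeout is not stated here.

```python
def parse_with_retries(path, timeouts=(30, 120), parse=None):
    """Try to parse path once per timeout value, returning the first success."""
    if not timeouts:
        raise ValueError("at least one timeout is required")
    if parse is None:
        # Imported lazily so the helper can be tested with a stand-in function.
        from tika import parser
        parse = parser.from_file
    last_error = None
    for timeout in timeouts:
        try:
            return parse(path, requestOptions={"timeout": timeout})
        except Exception as error:  # broad on purpose, see the note above
            last_error = error
    # Every attempt failed; surface the most recent error to the caller.
    raise last_error
```

Passing the parse function as a parameter keeps the retry logic independent of a running Tika server, which makes it easy to check in isolation.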

New Command Line Client Tool

When you install Tika-Python you also get a new command-line client tool, tika-python, installed in your /path/to/python/bin directory.

The options and help for the command line tool can be seen by typing tika-python without any arguments. This will also download a copy of the tika-server jar and start it if you haven't done so already.

tika.py [-v] [-o <outputDir>] [--server <TikaServerEndpoint>] [--install <UrlToTikaServerJar>] [--port <portNumber>] <command> <option> <urlOrPathToFile>

tika.py parse all test.pdf test2.pdf                   (write output JSON metadata files test.pdf_meta.json and test2.pdf_meta.json)
tika.py detect type test.pdf                           (returns mime-type as text/plain)
tika.py language file french.txt                       (returns language e.g., fr as text/plain)
tika.py translate fr:en french.txt                     (translates the file french.txt from french to english)
tika.py config mime-types                              (see what mime-types the Tika Server can handle)

A simple python and command-line client for Tika using the standalone Tika server (JAR file).
All commands return results in JSON format by default (except text in text/plain).

To parse docs, use:
tika.py parse <meta | text | all> <path>

To check the configuration of the Tika server, use:
tika.py config <mime-types | detectors | parsers>

Commands:
  parse  = parse the input file and write a JSON doc file.ext_meta.json containing the extracted metadata, text, or both
  detect type = parse the stream and 'detect' the MIME/media type, return in text/plain
  language file = parse the file stream and identify the language of the text, return its 2 character code in text/plain
  translate src:dest = parse and extract text and then translate the text from source language to destination language
  config = return a JSON doc describing the configuration of the Tika server (i.e. mime-types it
             can handle, or installed detectors or parsers)

Arguments:
  urlOrPathToFile = file to be parsed, if URL it will first be retrieved and then passed to Tika

Switches:
  --verbose, -v                  = verbose mode
  --encode, -e                   = encode response in UTF-8
  --csv, -c                      = report detect output in comma-delimited format
  --server <TikaServerEndpoint>  = use a remote Tika Server at this endpoint, otherwise use local server
  --install <UrlToTikaServerJar> = download and exec Tika Server (JAR file), starting server on default port 9998

Example usage as python client:
-- from tika import runCommand, parse1
-- jsonOutput = runCommand('parse', 'all', filename)
 or
-- jsonOutput = parse1('all', filename)

Questions, comments?

Send them to Chris A. Mattmann.

Contributors

  • Chris A. Mattmann, JPL
  • Brian D. Wilson, JPL
  • Dongni Zhao, USC
  • Kenneth Durri, University of Maryland
  • Tyler Palsulich, New York University & Google
  • Joe Germuska, Northwestern University
  • Vlad Shvedov, Profinda.com
  • Diogo Vieira, Globo.com
  • Aron Ahmadia, Continuum Analytics
  • Karanjeet Singh, USC
  • Renat Nasyrov, Yandex
  • James Brooking, Blackbeard
  • Yash Tanna, USC
  • Igor Tokarev, Freelance
  • Imraan Parker, Freelance
  • Annie K. Didier, JPL
  • Juan Elosua, TEGRA Cybersecurity Center
  • Carina de Oliveira Antunes, CERN

Thanks

Thanks to the DARPA MEMEX program for funding most of the original portions of this work.

License

Apache License, version 2
