
tika-python

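License: Apache-2.0 License

Language: Python

Category: Neural networks / artificial intelligence, natural language processing

Software type: Open source

Region: Unknown

Submitted by: 秦毅

Operating system: Cross-platform

Open source organization: None

Intended audience: Unknown

Software overview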

tika-python

A Python port of the Apache Tika library that makes Tika available using the Tika REST Server.

This makes Apache Tika available as a Python library, installable via Setuptools, Pip and Easy Install.

To use this library, you need to have Java 7+ installed on your system, as tika-python starts up the Tika REST server in the background.

Inspired by Aptivate Tika.

Installation (with pip)

pip install tika

Installation (without pip)

python setup.py build
python setup.py install

Airgap Environment Setup

To get this working in a disconnected environment, download a Tika server file (both tika-server.jar and tika-server.jar.md5) and set the TIKA_SERVER_JAR environment variable to TIKA_SERVER_JAR="file:////tika-server.jar". This tells tika-python to "download" the file from that location, move it to /tmp/tika-server.jar, and run it as a background process.

This is the only way to run tika-python without internet access. Without this variable set, the default is to check the Tika version and pull the latest jar from Apache every time.
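
The airgap setup above can be sketched with the standard library alone. The helper names `md5_matches` and `airgap_env` are hypothetical and not part of tika-python; the sketch also assumes the .md5 file begins with the hex digest, and that a `file://` URL built by `Path.as_uri()` is an acceptable value for TIKA_SERVER_JAR.

```python
import hashlib
from pathlib import Path


def md5_matches(jar_path, md5_path):
    """Return True if the jar's MD5 digest equals the digest stored in the .md5 file."""
    digest = hashlib.md5(Path(jar_path).read_bytes()).hexdigest()
    # Assumption: the .md5 file starts with the hex digest; anything after it is ignored.
    expected = Path(md5_path).read_text().split()[0].lower()
    return digest == expected


def airgap_env(jar_path):
    """Build the environment setting that points at a local copy of the jar."""
    # as_uri() turns an absolute path into a file:// URL.
    return {"TIKA_SERVER_JAR": Path(jar_path).resolve().as_uri()}
```

A possible way to use it is to call `os.environ.update(airgap_env("/path/to/tika-server.jar"))` before the first `import tika`, after checking the downloaded pair with `md5_matches`.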

Environment Variables

These are read once, when tika/tika.py is initially loaded, and are used throughout after that.

TIKA_VERSION - set to the version string, e.g., 1.12 or default to current Tika version.
TIKA_SERVER_JAR - set to the full URL to the remote Tika server jar to download and cache.
TIKA_SERVER_ENDPOINT - set to the host (local or remote) for the running Tika server jar.
TIKA_CLIENT_ONLY - if set to True, TIKA_SERVER_JAR is ignored and the library relies on the value of TIKA_SERVER_ENDPOINT, treating Tika like a REST client.
TIKA_TRANSLATOR - set to the fully qualified class name (defaults to Lingo24) for the Tika translator implementation.
TIKA_SERVER_CLASSPATH - set to a string (delimited by ':' for each additional path) to prepend to the Tika server jar path.
TIKA_LOG_PATH - set to a directory with write permissions and the tika.log and tika-server.log files will be placed in this directory.
TIKA_PATH - set to a directory with write permissions and the tika_server.jar file will be placed in this directory.
TIKA_JAVA - set the Java runtime name, e.g., java or java9.
TIKA_STARTUP_SLEEP - number of seconds (float) to wait per check when the Tika server is launched at runtime.
TIKA_STARTUP_MAX_RETRY - number of checks (int) to attempt for Tika server startup when it is launched at runtime.
TIKA_JAVA_ARGS - set Java runtime arguments, e.g., -Xmx4g.
TIKA_LOG_FILE - set the filename for the log file. Default: tika.log. If it is an empty string (''), no log file is created.
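
Because these variables are read once at load time, they need to be in place before the library is first imported. A minimal sketch of that ordering follows; `configure_tika_env` is a hypothetical helper, and the endpoint and retry values are placeholders rather than recommended settings.

```python
import os


def configure_tika_env(**settings):
    """Export TIKA_* settings; call this before the first `import tika`."""
    for name, value in settings.items():
        if not name.startswith("TIKA_"):
            raise ValueError(f"not a Tika setting: {name}")
        # Environment values must be strings.
        os.environ[name] = str(value)


configure_tika_env(
    TIKA_SERVER_ENDPOINT="http://localhost:9998",  # placeholder endpoint
    TIKA_CLIENT_ONLY=True,
    TIKA_STARTUP_MAX_RETRY=5,
)

# Only now import the library, so the settings above are picked up:
# from tika import parser
```

Setting the same variables in the shell before starting Python has the same effect; the helper only makes the ordering explicit inside a script.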

Testing it out

Parser Interface (backwards compat prior to REST)

#!/usr/bin/env python
import tika
tika.initVM()
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])
print(parsed["content"])

Parser Interface

The parser interface extracts text and metadata using the /rmeta interface. This is one of the better ways to get the internal XHTML content extracted.

Note: The parser interface needs the following environment variable set on the console for printing of the extracted content.

export PYTHONIOENCODING=utf8

#!/usr/bin/env python
import tika
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"])
print(parsed["content"])
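
The result used above is a dictionary with "metadata" and "content" keys. The sketch below shows one defensive way to read it. It assumes that "content" may be None when no text was extracted and that the metadata carries a "Content-Type" entry; `summarize` is a hypothetical helper, and the sample dictionary is hand-written rather than real Tika output.

```python
def summarize(parsed, preview_chars=80):
    """Return (content_type, preview) from a parser result dictionary."""
    metadata = parsed.get("metadata") or {}
    # Assumption: "content" can be None when no text was extracted.
    content = (parsed.get("content") or "").strip()
    content_type = metadata.get("Content-Type", "unknown")
    return content_type, content[:preview_chars]


# Hand-written stand-in for a parser result, not real Tika output.
sample = {"metadata": {"Content-Type": "application/pdf"}, "content": "\n\nHello, Tika\n"}
print(summarize(sample))
```

With a real file, the call would be `summarize(parser.from_file('/path/to/file'))`.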

Optionally, you can pass the Tika server URL along with the call, which is useful for multi-instance execution or when Tika is dockerized/linked.

parsed = parser.from_file('/path/to/file', 'http://tika:9998/tika')
string_parsed = parser.from_buffer('Good evening, Dave', 'http://tika:9998/tika')

Specify Output Format To XHTML

The parser interface is optionally able to output the content as XHTML rather than plain text.

Note: The parser interface needs the following environment variable set on the console for printing of the extracted content.

export PYTHONIOENCODING=utf8

#!/usr/bin/env python
import tika
from tika import parser
parsed = parser.from_file('/path/to/file', xmlContent=True)
print(parsed["metadata"])
print(parsed["content"])

# Note: This is also available when parsing from the buffer.
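
One reason to ask for XHTML is that the structure of the document survives, so it can be walked with an ordinary markup parser. The sketch below collects paragraph text using only the standard library. `ParagraphCollector` and `paragraphs` are hypothetical names, and the sample string is hand-written; the markup Tika actually returns will differ from it.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element in an XHTML string."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._inside_p = False

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._inside_p = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == "p":
            self._inside_p = False

    def handle_data(self, data):
        if self._inside_p:
            self.paragraphs[-1] += data


def paragraphs(xhtml):
    """Return the non-empty paragraph texts found in the XHTML string."""
    collector = ParagraphCollector()
    collector.feed(xhtml)
    collector.close()
    return [p.strip() for p in collector.paragraphs if p.strip()]


# Hand-written sample; real output from Tika will differ.
sample = "<html><body><p>First paragraph.</p><p>Second &amp; last.</p></body></html>"
print(paragraphs(sample))
```

With a real file, the input would be `parsed["content"]` from a call made with `xmlContent=True`.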

Unpack Interface

The unpack interface handles both metadata and text extraction in a single call. The server returns a tarball of metadata and text entries, which is unpacked internally, reducing the wire load for extraction.

#!/usr/bin/env python
import tika
from tika import unpack
parsed = unpack.from_file('/path/to/file')

Detect Interface

The detect interface provides an IANA MIME type classification for the provided file.

#!/usr/bin/env python
import tika
from tika import detector
print(detector.from_file('/path/to/file'))

Config Interface

The config interface allows you to inspect the Tika Server environment's configuration, including what parsers, mime types, and detectors the server has been configured with.

#!/usr/bin/env python
import tika
from tika import config
print(config.getParsers())
print(config.getMimeTypes())
print(config.getDetectors())

Language Detection Interface

The language detection interface provides a 2-character language code based on the text in the provided file.

#!/usr/bin/env python
from tika import language
print(language.from_file('/path/to/file'))

Translate Interface

The translate interface translates the text automatically extracted by Tika from the source language to the destination language.

#!/usr/bin/env python
from tika import translate
print(translate.from_file('/path/to/spanish', 'es', 'en'))

Using a Buffer

Note that you can also use the Parser and Detector .from_buffer(string) methods to dynamically parse a string buffer in Python and/or detect its MIME type. This is useful if you've already loaded the content into memory.
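
A minimal sketch of the buffer workflow follows. `inspect_buffer` is a hypothetical wrapper, not part of tika-python; it calls `detector.from_buffer` and `parser.from_buffer` as described above, and imports the modules lazily so that stand-in objects can be passed for testing without a running Tika server.

```python
def inspect_buffer(text, parser=None, detector=None):
    """Detect the MIME type of an in-memory string and extract its content."""
    if parser is None:
        # Imported lazily; requires tika-python to be installed.
        from tika import parser
    if detector is None:
        from tika import detector
    mime_type = detector.from_buffer(text)
    parsed = parser.from_buffer(text)
    return mime_type, parsed["content"]
```

With the library installed, `inspect_buffer('Good evening, Dave')` would contact the Tika server twice, once for detection and once for parsing.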

Using Client Only Mode

You can set Tika to use client-only mode by setting:

import tika
tika.TikaClientOnly = True

Then you can run any of the methods and it will fully omit the check to see if the service on localhost is running and omit printing the check messages.

Changing the Tika Classpath

You can update the classpath that Tika server uses by setting the classpath as a set of ':' delimited strings. For example, if you want to get Tika-Python working with GeoTopicParsing, you can do the following. Replace the paths below with your own paths, as identified here, and make sure that you have done this:

Kill the Tika server (if already running):

ps aux | grep java | grep Tika
kill -9 PID

import tika.tika
import os
from tika import parser
home = os.getenv('HOME')
tika.tika.TikaServerClasspath = home + '/git/geotopicparser-utils/mime:'+home+'/git/geotopicparser-utils/models/polar'
parsed = parser.from_file(home + '/git/geotopicparser-utils/geotopics/polar.geot')
print(parsed["metadata"])
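
The ':' delimited string can also be assembled from a list of directories instead of by manual concatenation. This is a sketch only; `build_classpath` is a hypothetical helper, and the two directories are the example paths from above written with `~` in place of the home directory.

```python
import os


def build_classpath(*directories):
    """Join directories into the ':' delimited string described above."""
    return ":".join(os.path.expanduser(d) for d in directories)


classpath = build_classpath(
    "~/git/geotopicparser-utils/mime",
    "~/git/geotopicparser-utils/models/polar",
)
print(classpath)

# Then, before parsing (requires tika-python to be installed):
# import tika.tika
# tika.tika.TikaServerClasspath = classpath
```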

Customizing the Tika Server Request

You may customize the outgoing HTTP request to Tika server by setting requestOptions on the .from_file and .from_buffer methods (Parser, Unpack, Detect, Config, Language, Translate). It should be a dictionary of arguments that will be passed to the request method. The request method documentation specifies valid arguments. This will override any defaults except for url and params/data.

from tika import parser
parsed = parser.from_file('/path/to/file', requestOptions={'timeout': 120})

New Command Line Client Tool

When you install Tika-Python you also get a new command line client tool, tika-python, installed in your /path/to/python/bin directory.

The options and help for the command line tool can be seen by typing tika-python without any arguments. This will also download a copy of the tika-server jar and start it if you haven't done so already.

tika.py [-v] [-o <outputDir>] [--server <TikaServerEndpoint>] [--install <UrlToTikaServerJar>] [--port <portNumber>] <command> <option> <urlOrPathToFile>

tika.py parse all test.pdf test2.pdf                   (write output JSON metadata files test.pdf_meta.json and test2.pdf_meta.json)
tika.py detect type test.pdf                           (returns mime-type as text/plain)
tika.py language file french.txt                       (returns language e.g., fr as text/plain)
tika.py translate fr:en french.txt                     (translates the file french.txt from french to english)
tika.py config mime-types                              (see what mime-types the Tika Server can handle)

A simple python and command-line client for Tika using the standalone Tika server (JAR file).
All commands return results in JSON format by default (except text in text/plain).

To parse docs, use:
tika.py parse <meta | text | all> <path>

To check the configuration of the Tika server, use:
tika.py config <mime-types | detectors | parsers>

Commands:
  parse  = parse the input file and write a JSON doc file.ext_meta.json containing the extracted metadata, text, or both
  detect type = parse the stream and 'detect' the MIME/media type, return in text/plain
  language file = parse the file stream and identify the language of the text, return its 2 character code in text/plain
  translate src:dest = parse and extract text and then translate the text from source language to destination language
  config = return a JSON doc describing the configuration of the Tika server (i.e. mime-types it
             can handle, or installed detectors or parsers)

Arguments:
  urlOrPathToFile = file to be parsed, if URL it will first be retrieved and then passed to Tika

Switches:
  --verbose, -v                  = verbose mode
  --encode, -e                   = encode response in UTF-8
  --csv, -c                      = report detect output in comma-delimited format
  --server <TikaServerEndpoint>  = use a remote Tika Server at this endpoint, otherwise use local server
  --install <UrlToTikaServerJar> = download and exec Tika Server (JAR file), starting server on default port 9998

Example usage as python client:
-- from tika import runCommand, parse1
-- jsonOutput = runCommand('parse', 'all', filename)
 or
-- jsonOutput = parse1('all', filename)

Questions, comments?

Send them to Chris A. Mattmann.

Contributors

Chris A. Mattmann, JPL
Brian D. Wilson, JPL
Dongni Zhao, USC
Kenneth Durri, University of Maryland
Tyler Palsulich, New York University & Google
Joe Germuska, Northwestern University
Vlad Shvedov, Profinda.com
Diogo Vieira, Globo.com
Aron Ahmadia, Continuum Analytics
Karanjeet Singh, USC
Renat Nasyrov, Yandex
James Brooking, Blackbeard
Yash Tanna, USC
Igor Tokarev, Freelance
Imraan Parker, Freelance
Annie K. Didier, JPL
Juan Elosua, TEGRA Cybersecurity Center
Carina de Oliveira Antunes, CERN

Thanks

Thanks to the DARPA MEMEX program for funding most of the original portions of this work.

License

Apache License, version 2

