IMPORTANT NOTE: Thanks to Chris Wilson's work, it seems that a single command now does the job:
pip install git+git://github.com/aptivate/python-tika.git
Much better, isn't it? See http://blog.aptivate.org/2012/02/01/content-indexing-in-django-using-apache-tika/ for more info. The rest of this document is now clearly deprecated; I keep it here just in case.
This document is a very short guide to building and using Tika (an all-purpose library for extracting document content and metadata) through a Python wrapper. The wrapper is built using JCC.
http://lucene.apache.org/tika/
http://lucene.apache.org/pylucene/jcc/index.html
So far I have only tested the few features I am interested in.
Install JCC: http://lucene.apache.org/pylucene/jcc/documentation/install.html
Install Tika: http://lucene.apache.org/tika/0.7/gettingstarted.html
Don't forget to run
> mvn install
in the Tika directory.
You will need the jar files from tika-parsers/target, tika-core/target and tika-app/target.
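The jars listed above can be gathered into one directory with a few lines of Python. This is only a sketch: the function names are hypothetical, the file names assume the `<module>-<version>.jar` pattern seen in the build command of this guide, and the destination is assumed to be the jar/ directory that command reads from.

```python
import os
import shutil

# Tika modules whose target/ directories hold the jars named in this guide.
MODULES = ("tika-parsers", "tika-core", "tika-app")


def find_tika_jars(tika_root, version="0.7"):
    """Return the expected jar path for each module, failing if one is missing."""
    jars = []
    for module in MODULES:
        name = "%s-%s.jar" % (module, version)
        path = os.path.join(tika_root, module, "target", name)
        if not os.path.isfile(path):
            raise IOError("missing %s (did 'mvn install' succeed?)" % path)
        jars.append(path)
    return jars


def copy_jars(tika_root, dest_dir, version="0.7"):
    """Copy the jars into dest_dir (assumed layout: the jar/ directory used by JCC)."""
    if not os.path.isdir(dest_dir):
        os.makedirs(dest_dir)
    copied = []
    for jar in find_tika_jars(tika_root, version):
        shutil.copy(jar, dest_dir)
        copied.append(os.path.join(dest_dir, os.path.basename(jar)))
    return copied
```

Calling copy_jars("/path/to/tika", "jcc/jcc/jar") would then fail early with a clear message if the Maven build did not produce one of the three files.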
Build the Tika Python wrapper with JCC:
> cd jcc/jcc
> sudo python __main__.py --jar jar/tika-parsers-0.7.jar --jar jar/tika-core-0.7.jar \
    java.io.File java.io.FileInputStream java.io.StringBufferInputStream \
    --package org.xml.sax.ContentHandler --package org.xml.sax.SAXException \
    --include jar/tika-app-0.7.jar --python tika --reserved asm --build --install
I have been told that the package option should be "--package org.xml.sax". I don't know whether this is due to a version change, and I haven't tested it, but try it if the command above gives you errors.
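To switch between the two package variants without retyping the whole command, the argument list can be assembled in Python. This is a rough sketch: build_jcc_args is a hypothetical helper, it uses only the flags that appear in the command above, and it assumes the jars sit in a jar/ directory next to __main__.py.

```python
def build_jcc_args(version="0.7", packages=("org.xml.sax",), jar_dir="jar"):
    """Assemble the JCC argument list used in this guide.

    Only flags that already appear in this guide's command are emitted.
    """
    args = ["python", "__main__.py"]
    for module in ("tika-parsers", "tika-core"):
        args += ["--jar", "%s/%s-%s.jar" % (jar_dir, module, version)]
    # Extra Java classes to wrap alongside the Tika ones.
    args += ["java.io.File", "java.io.FileInputStream",
             "java.io.StringBufferInputStream"]
    for package in packages:
        args += ["--package", package]
    args += ["--include", "%s/tika-app-%s.jar" % (jar_dir, version)]
    args += ["--python", "tika", "--reserved", "asm", "--build", "--install"]
    return args


# The variant from the command above, and the suggested alternative.
original = build_jcc_args(
    packages=("org.xml.sax.ContentHandler", "org.xml.sax.SAXException"))
suggested = build_jcc_args(packages=("org.xml.sax",))
print(" ".join(suggested))
```

Either list can be passed to subprocess.call from inside jcc/jcc (prefixed with sudo if the install step needs it).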
1 Feb 2012: thanks to another fellow Tika user for this input:
I concur with the need to change the package to "--package org.xml.sax". Without this, I do not get "errors" during the compilation process, but jcc silently ignores the all-important AutoDetectParser.parse() method, and produces a wrapper with no such method in it, because it doesn't recognise the return type. This causes the example code that you gave to fail because of the missing method.
I also needed to add an OSGI library for Tika 1.0, which I happened to find on my system, so my final command was:
python ../jcc/jcc/__main__.py \
    --include /usr/share/java/org.eclipse.osgi.jar --jar tika-parsers-1.0.jar \
    --jar tika-core-1.0.jar \
    java.io.File java.io.FileInputStream \
    java.io.StringBufferInputStream \
    --package org.xml.sax \
    --include tika-app-1.0.jar \
    --python tika --version 1.0 --reserved asm
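Because the build reports no error when a method is skipped, it may be worth checking the generated wrapper before using it. The helper below is a hypothetical sketch, not part of JCC or Tika; it assumes that wrapped Java methods show up as ordinary attributes on the generated Python classes.

```python
def missing_methods(wrapper, required):
    """Return 'Class.method' names that the wrapper module does not expose.

    wrapper:  the imported module (here, the JCC-built tika module)
    required: mapping of class name to the method names expected on it
    """
    missing = []
    for class_name, methods in sorted(required.items()):
        cls = getattr(wrapper, class_name, None)
        for method in methods:
            if cls is None or not hasattr(cls, method):
                missing.append("%s.%s" % (class_name, method))
    return missing


# Typical use, assuming the wrapper was built and installed as "tika":
#   import tika
#   tika.initVM()
#   print(missing_methods(tika, {"AutoDetectParser": ["parse"]}))
```

An empty list means every listed method was found; anything else points at a class or method the build left out.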
In a python console:
# Setup module and virtual machine
import tika
tika.initVM()
# The all purpose parser from Tika (html, pdf, open documents, etc...)
parser = tika.AutoDetectParser()
# Create input from a small fake html code
# Alternatively you can use: input = tika.FileInputStream(tika.File("/path/to/example"))
input = tika.StringBufferInputStream("<html><title>My title</title><body>My body</body></html>")
# Create handler for content, metadata and context
content = tika.BodyContentHandler()
metadata = tika.Metadata()
context = tika.ParseContext()
# Parse the data and display result
parser.parse(input,content,metadata,context)
content.toString()
> u'My body'
metadata.toString()
> u'title=My title Content-Encoding=UTF-8 Content-Type=text/html '
metadata.get('title')
> u'My title'
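The string returned by metadata.toString() is awkward to split, since values may contain spaces. A plain dict is easier to work with. The sketch below assumes the wrapper also exposes Tika's Metadata.names() method (untested here with JCC); metadata_to_dict and FakeMetadata are hypothetical names, and the stand-in class only exists so the example runs without Tika installed.

```python
def metadata_to_dict(metadata):
    """Copy a Tika Metadata object into a plain dict.

    Assumes the wrapper exposes Metadata.names() in addition to the
    Metadata.get(name) call used in the console session.
    """
    return dict((name, metadata.get(name)) for name in metadata.names())


class FakeMetadata(object):
    """Stand-in with the same two methods, so the sketch runs without Tika."""

    def __init__(self, fields):
        self._fields = fields

    def names(self):
        return list(self._fields)

    def get(self, name):
        return self._fields.get(name)


fields = metadata_to_dict(
    FakeMetadata({"title": "My title", "Content-Type": "text/html"}))
print(fields["title"])  # My title
```

With the real wrapper, the call would simply be metadata_to_dict(metadata) after parser.parse(...) has run.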