Tika now has a module for deep learning powered by the DL4J toolkit. The initial included model is InceptionV3, so with this module Tika can use deep learning, natively in Java, for metadata/text extraction from images using the Inception model (Github-165).
A new parser for sentiment analysis supporting categorical (multi-class: angry, sad, neutral, like, love) and binary (positive/negative) classification was added, leveraging the USC data science work (TIKA-2016).
Tika now has the ability to automatically detect objects in videos, using OpenCV and TensorFlow (TIKA-2322).
Change default behavior to parse embedded documents even if the user forgets to specify a Parser.class in the ParseContext (TIKA-2096). Users who wish to parse only the container document should set an EmptyParser as the Parser.class in the ParseContext.
Change default behavior of Office Parsers to _not_ extract macros. Users who want macros extracted must setExtractMacros to "true" (TIKA-2302).
Added tika-eval module (TIKA-1332).
Unified logging across Tika: SLF4J as logging API, Apache Log4j as implementation with JCL and JUL bridges in standalone tools like tika-app, tika-batch and tika-server (TIKA-2245).
Add parser for XLSB files (TIKA-1195).
Add parsers for EMF/WMF files (TIKA-2246/TIKA-2247).
Add parsers for WordPerfect and QuattroPro (.qpw) files. Contributed by Pascal Essiembre (TIKA-1946 and TIKA-2228).
Add experimental SAX parser for .pptx files. To select this parser, set useSAXPptxExtractor(true) on OfficeParserConfig (TIKA-2210).
Add experimental SAX parser for .docx files. To select this parser, set useSAXDocxExtractor(true) on OfficeParserConfig (TIKA-1321, TIKA-2191).
Add mime detection and parser for Word 2006ML format (TIKA-2179).
Bug fix for WordPerfect via Pascal Essiembre (TIKA-2352).
Added "text-main" equivalent option to tika-server via /tika/main (TIKA-2343).
Enabled configuration of the EncodingDetector used by parsers that extend AbstractEncodingDetectorParser (TIKA-2273).
Prevent easily preventable OOMs for both detection and parsing of some compression formats (TIKA-2330).
Extract images and thumbnails from ODT via Sam Bayer (TIKA-2295).
Fix potential NPE in FeedParser via Julien Nioche (TIKA-2269).
Official mime types for BMP, EMF and WMF have been registered with IANA, so switch to these (image/bmp, image/emf, image/wmf) (TIKA-2250).
Be more parsimonious with BufferedInputStreams via Josh Hight (TIKA-2244).
Enable handling of hyphenated language codes in TesseractOCRParser via Graham Russell (TIKA-2231).
Improve style tags in ODT (TIKA-2242).
Add container detection for embedded MSEquation files (TIKA-2238).
Add parsing of JBIG2 and extraction of JBIG2 from PDFs when required dependencies are added to class path by user. Contributed by Pascal Essiembre (TIKA-2232).
Add mime magic for the OneNote family (.one / .onetoc / .onepkg); no parser (TIKA-2224).
Add configurability of "preserve-interword-spacing" to TesseractOCRParser (TIKA-2190).
Upgrade PDFBox to 2.0.6 and JempBox to 1.8.13 (TIKA-2361).
Refactor MockParser to consolidate service loading and mime types into tika-core/src/test (TIKA-2195).
Enabled extraction of embedded objects from headers, footers, footnotes, endnotes and comments in legacy .docx parser (TIKA-2192).
Allow extraction of PDActions (including JavaScript) from PDFs (TIKA-2090). This is turned off by default. Users must setExtractActions(true) on the PDFParserConfig.
Change default behavior in experimental .docx parser to ignore deleted text to align with .doc (TIKA-2187).
Upgrade to Apache POI 3.16 (TIKA-2116, TIKA-2181, TIKA-2329).
Allow configuration of timeout for ForkParser (TIKA-2170).
Add extraction of .jpx inline images from PDFs when required dependencies are added by user to class path (TIKA-2175).
Add .jpx, .jp2, .ppm to formats handled by Tesseract (TIKA-2174).
Upgrade "provided" Sqlite to 3.16.1 (TIKA-2334).
Upgrade CXF version to 3.0.12 (TIKA-2292).
Add Lingo24 Language Detector (TIKA-2297).
Further mime magic for WebVTT (TIKA-1772).
Extend support for increased PSM options up to 13 for modern versions of Tesseract (TIKA-2357).