
Parsing encrypted Office files with Tika

权玉泽
2023-12-01

Use Tika (https://tika.apache.org) to detect a file's MIME type and check whether that type is the correct one for the file's extension.

For internal MIME types or file extensions that Tika does not cover, we can declare them in a custom MIME type configuration file like the one below:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<mime-info>
  <mime-type type="application/octet-stream">
    <glob pattern="*.unl"/>
  </mime-type>
</mime-info>

 

As mentioned in the Tika documentation (https://tika.apache.org/1.13/detection.html), for container-based formats, magic detection alone may not be enough.

Password-protected OOXML files are actually stored in an OLE2 (application/x-tika-msoffice) container. (I tried this with tika-parsers: encrypted Microsoft Office OOXML files all return the same media type, 'application/x-tika-ooxml-protected'. See the functions testDetectProtectedOOXML() and testDetectProtectedOLE2() in https://github.com/apache/tika/blob/master/tika-parsers/src/test/java/org/apache/tika/detect/TestContainerAwareDetector.java)

tika-mimetypes.xml, which defines the valid MIME types used by Tika, contains the following entries:

<mime-type type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet">
  <_comment>Office Open XML Workbook</_comment>
  <glob pattern="*.xlsx"/>
  <sub-class-of type="application/x-tika-ooxml"/>
</mime-type>

...

<mime-type type="application/vnd.ms-excel">
  <!-- Use DefaultDetector / org.apache.tika.parser.microsoft.POIFSContainerDetector for more reliable detection of OLE2 documents -->
  ...
  <glob pattern="*.xls"/>
  ...
  <sub-class-of type="application/x-tika-msoffice"/>
</mime-type>

So the check passes if you change the file extension from '.xlsx' to '.xls', because the input stream and the file name then both map to the media type 'application/x-tika-msoffice' (the '.xls' type reaches it through its sub-class-of entry).
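The reasoning above can be sketched in plain Java. This is only an illustration and does not call Tika: the two lookup tables are hand-copied from the tika-mimetypes.xml fragments quoted earlier, and the names ExtensionCheck, BY_EXTENSION, PARENT and isCompatible are hypothetical. It assumes the check accepts a file when the type detected from the stream is the type implied by the file name or one of its sub-class-of ancestors.

```java
import java.util.Locale;
import java.util.Map;

// Illustrative only: a hand-built stand-in for the two registry entries
// quoted above, not Tika's real MIME type registry.
final class ExtensionCheck {
    // Extension -> MIME type (from the <glob> elements).
    static final Map<String, String> BY_EXTENSION = Map.of(
        "xlsx", "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "xls", "application/vnd.ms-excel");

    // MIME type -> parent type (from the <sub-class-of> elements).
    static final Map<String, String> PARENT = Map.of(
        "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
        "application/x-tika-ooxml",
        "application/vnd.ms-excel",
        "application/x-tika-msoffice");

    // True if detectedType equals the type implied by the file name
    // or any of that type's ancestors.
    static boolean isCompatible(String fileName, String detectedType) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return false;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        for (String t = BY_EXTENSION.get(ext); t != null; t = PARENT.get(t)) {
            if (t.equals(detectedType)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String detected = "application/x-tika-msoffice"; // OLE2 container
        System.out.println(isCompatible("report.xlsx", detected));
        System.out.println(isCompatible("report.xls", detected));
    }
}
```

With an OLE2 stream, the '.xlsx' name fails because its ancestor chain only reaches application/x-tika-ooxml, while the '.xls' name passes through application/x-tika-msoffice.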

Note:

Using magic detection, it is easy to spot that a given file is an OLE2 document, or a Zip file. Using magic detection alone, it is very difficult (and often impossible) to tell what kind of file lives inside the container.

For some use cases, speed is important, so having a quick way to know the container type is sufficient. For other cases however, you don't mind spending a bit of time (and memory!) processing the container to get a more accurate answer on its contents. For these cases, the additional container aware detectors contained in the Tika Parsers jar should be used.
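To make the point about magic detection concrete, here is a minimal sketch that looks only at the first bytes of a file. It is not Tika's implementation; the class name ContainerSniffer and the method sniff are hypothetical. It assumes the well-known OLE2 compound file signature (D0 CF 11 E0 A1 B1 1A E1) and the ZIP local file header signature (PK 03 04).

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

// Illustrative only: identifies the container, not what is inside it.
final class ContainerSniffer {
    // OLE2 compound file signature.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    // Returns a container-level media type based on the leading bytes.
    static String sniff(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return "application/x-tika-msoffice";
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return "application/zip";
        }
        return "application/octet-stream";
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        return data.length >= prefix.length
            && Arrays.equals(Arrays.copyOf(data, prefix.length), prefix);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ContainerSniffer <file>");
            return;
        }
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            System.out.println(sniff(in.readNBytes(8)));
        }
    }
}
```

A sniffer like this reports the OLE2 container type for an encrypted .xlsx file as well as for an ordinary .xls file. It cannot tell the two apart, which is the gap the container-aware detectors are meant to fill.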

 

 

 

 

Reposted from: https://my.oschina.net/cdt/blog/1837606
