Using Tika for File Type Validation

呼延俊风

2023-12-01

What is Tika

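As we all know, checking a file's extension does not tell you what the file really is. Most file type checks therefore read the file's magic number, because for most formats the magic number is fixed (a class file, for example, always begins with CA FE BA BE). The solution most often found online is to put the magic numbers of the relevant formats into a Map, read the magic number of the incoming file, and look up the type in that Map. But is the magic number really fixed for every file of the same type? It is not: mp4 files, for example, have no fixed magic number. So if you register the magic number of one mp4 file, there is no guarantee that the next mp4 file will pass the check!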
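To make the Map approach concrete, here is a minimal sketch of it using only the JDK. The class name `NaiveMagicLookup`, the table contents, and the sample byte arrays are all illustrative assumptions, not code from any library. The two mp4-style headers differ in their leading bytes because an MP4 file starts with a 4-byte box size rather than a fixed signature.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Naive magic-number lookup (hypothetical helper, for illustration only). */
final class NaiveMagicLookup {

    // Hypothetical table: hex prefix of the file -> type label.
    private static final Map<String, String> MAGIC_TO_TYPE = new LinkedHashMap<>();
    static {
        MAGIC_TO_TYPE.put("CAFEBABE", "class");
        MAGIC_TO_TYPE.put("504B0304", "zip"); // "PK\003\004"
        MAGIC_TO_TYPE.put("89504E47", "png");
    }

    /** Returns the label whose magic prefix matches the header, or "unknown". */
    static String detect(byte[] header) {
        StringBuilder hex = new StringBuilder();
        for (byte b : header) {
            hex.append(String.format("%02X", b));
        }
        for (Map.Entry<String, String> entry : MAGIC_TO_TYPE.entrySet()) {
            if (hex.toString().startsWith(entry.getKey())) {
                return entry.getValue();
            }
        }
        return "unknown";
    }

    public static void main(String[] args) {
        byte[] classHeader = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0, 0, 0, 52};
        // Two made-up mp4-style headers: the first four bytes are a box size and differ.
        byte[] mp4A = {0, 0, 0, 0x18, 'f', 't', 'y', 'p'};
        byte[] mp4B = {0, 0, 0, 0x20, 'f', 't', 'y', 'p'};

        System.out.println(detect(classHeader)); // class
        System.out.println(detect(mp4A));        // unknown
        System.out.println(detect(mp4B));        // unknown
    }
}
```

Registering the first four bytes of `mp4A` in the table would still leave `mp4B` undetected, which is exactly the weakness described above.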

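I therefore recommend Tika, a parsing library from Apache, for file type validation. Tika can not only detect file types but also parse file contents, so it is quite powerful; this article covers only its file type detection.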

How to validate file types with Tika

So how do you validate a file type with Tika? It is very simple.

Add the dependency

<dependency>
    <groupId>org.apache.tika</groupId>
    <artifactId>tika-core</artifactId>
    <version>1.28.1</version>
</dependency>

The code

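import java.io.File;
import java.io.IOException;

import org.apache.tika.Tika;

public class Test {
  public static void main(String[] args) throws IOException {
    Tika tika = new Tika();
    // Detect the media type of the given file
    String mimeType = tika.detect(new File("G:\\test.zip"));
    System.out.println(mimeType);
  }
}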

Output: application/zip

As you can see, usage is very simple. In fact, that turned out to be too optimistic: my own testing uncovered quite a few problems.

Problems with Tika file type validation

How the problem arose

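While building a zip import/export feature, I used Tika to validate the type of uploaded files. During my own testing I found that running the code above on an .xlsx file also returned application/zip. What was going wrong? I opened a zip file and an xlsx file in EditPlus to look at their magic numbers and found that their first four bytes were identical! I then looked into Tika's detect API (part of it shown here):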

    public String detect(InputStream stream) throws IOException {
        return detect(stream, new Metadata());
    }

    /**
     * Detects the media type of the given document. The type detection is
     * based on the content of the given document stream and the name of the
     * document.
     * <p>
     * If the document stream supports the
     * {@link InputStream#markSupported() mark feature}, then the stream is
     * marked and reset to the original position before this method returns.
     * Only a limited number of bytes are read from the stream.
     * <p>
     * The given document stream is <em>not</em> closed by this method.
     *
     * @since Apache Tika 0.9
     * @param stream the document stream
     * @param name document name
     * @return detected media type
     * @throws IOException if the stream can not be read
     */
    public String detect(InputStream stream, String name) throws IOException {
        Metadata metadata = new Metadata();
        metadata.set(Metadata.RESOURCE_NAME_KEY, name);
        return detect(stream, metadata);
    }

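From the code above we can see that Tika also accepts a file name. Why offer an API method that takes the file name? Does that mean Tika knows about this situation? So I tried again with the second method, and this time the type was detected correctly: application/vnd.openxmlformats-officedocument.spreadsheetml.sheet. Why does this work? Also, the format of the returned type is clearly not what I was hoping for: with type names this complex, how am I supposed to compare against other formats? Do I have to test file after file and build a mapping by hand? An Apache parsing library would hardly be designed that way. Material online is scarce, so to find out how it is meant to be used I had to read the source.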

Source code analysis

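A reasonable first guess is that, to handle so many file types, Tika must ship its own type registry. So I did a global search for the complex type name returned for the xlsx file above and found the file tika-mimetypes.xml inside its package. This file is, in effect, Tika's file type registry (excerpt):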

  <mime-type type="application/vnd.openxmlformats-officedocument.spreadsheetml.sheet">
    <_comment>Office Open XML Workbook</_comment>
    <glob pattern="*.xlsx"/>
    <sub-class-of type="application/x-tika-ooxml"/>
  </mime-type>
	
  <mime-type type="application/x-tika-ooxml">
    <sub-class-of type="application/zip"/>
    <!-- Only works if the Content Types or rels file is the first zip entry -->
    <magic priority="50">
      <match value="PK\003\004" type="string" offset="0">
        <match value="[Content_Types].xml" type="string" offset="30"/>
        <match value="_rels/.rels" type="string" offset="30"/>
      </match>
    </magic>
  </mime-type>

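From the structure of this XML file alone we can guess that types in Tika form a parent/child hierarchy. There is also a glob pattern, so I was confident that Tika maintains a mapping between types and file extensions!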
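The magic rule in the excerpt can be checked without Tika. The following sketch, which uses only the JDK's `java.util.zip`, builds two small archives in memory and inspects them the way the rule does: the `PK\003\004` signature at offset 0, then the first entry's name at offset 30. The class and method names (`ZipHeaderPeek`, `zipWithFirstEntry`, `hasZipMagic`, `hasTextAt`) are illustrative assumptions.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

/** Inspects the start of a ZIP archive by hand (hypothetical helper). */
final class ZipHeaderPeek {

    /** Builds a one-entry ZIP archive in memory. */
    static byte[] zipWithFirstEntry(String entryName) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(out)) {
            zip.putNextEntry(new ZipEntry(entryName));
            zip.write("demo".getBytes(StandardCharsets.US_ASCII));
            zip.closeEntry();
        }
        return out.toByteArray();
    }

    /** True if the data starts with the local file header signature "PK\003\004". */
    static boolean hasZipMagic(byte[] data) {
        return data.length >= 4
                && data[0] == 'P' && data[1] == 'K' && data[2] == 3 && data[3] == 4;
    }

    /** True if the given ASCII text is found at the given offset. */
    static boolean hasTextAt(byte[] data, int offset, String text) {
        byte[] expected = text.getBytes(StandardCharsets.US_ASCII);
        if (data.length < offset + expected.length) {
            return false;
        }
        for (int i = 0; i < expected.length; i++) {
            if (data[offset + i] != expected[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        byte[] plainZip = zipWithFirstEntry("readme.txt");
        byte[] ooxmlLike = zipWithFirstEntry("[Content_Types].xml");

        // Both archives share the same four leading bytes.
        System.out.println(hasZipMagic(plainZip));   // true
        System.out.println(hasZipMagic(ooxmlLike));  // true

        // Only the second one has the OOXML marker as its first entry name.
        System.out.println(hasTextAt(plainZip, 30, "[Content_Types].xml"));  // false
        System.out.println(hasTextAt(ooxmlLike, 30, "[Content_Types].xml")); // true
    }
}
```

As the XML comment notes, the nested match only works when that entry is the first one in the archive. An xlsx file whose entries are ordered differently would match nothing more specific than the plain zip signature, which is one plausible reason the content-only check returned application/zip.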

Next, searching for where Tika loads this XML file shows that it is used in the method org.apache.tika.mime.MimeTypes#getDefaultMimeTypes(java.lang.ClassLoader). Looking through the methods of the MimeTypes class, I found this one:

    /**
     * Returns the registered media type with the given name (or alias).
     * The named media type is automatically registered (and returned) if
     * it doesn't already exist.
     *
     * @param name media type name (case-insensitive)
     * @return the registered media type with the given name or alias
     * @throws MimeTypeException if the given media type name is invalid
     */
    public MimeType forName(String name) throws MimeTypeException {
        ... ...
    }

From the Javadoc I guessed that this is the mapping method, so I wrote some code to verify it:

import org.apache.tika.mime.MimeType;
import org.apache.tika.mime.MimeTypeException;
import org.apache.tika.mime.MimeTypes;

public class Test {

  public static void main(String[] args) throws MimeTypeException {
    MimeTypes defaultMimeTypes = MimeTypes.getDefaultMimeTypes();
    // Look up the registered type by name, then ask for its extension
    MimeType mimeType = defaultMimeTypes.forName("application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
    System.out.println(mimeType.getExtension());
  }
}

Output: .xlsx

The result confirms my guess!

Next, let's look at how the detect method is implemented. The source is fairly long, so only the key parts are shown:

public MediaType detect(InputStream input, Metadata metadata)
            throws IOException {
        List<MimeType> possibleTypes = null;

        // Get type based on magic prefix
        if (input != null) {
            input.mark(getMinLength());
            try {
                byte[] prefix = readMagicHeader(input);
                possibleTypes = getMimeType(prefix);
            } finally {
                input.reset();
            }
        }
    
        // Get type based on resourceName hint (if available)
        String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY);
        if (resourceName != null) {
            ... ... (omitted)
            if (name != null) {
                MimeType hint = getMimeType(name);

                // For server-side scripting languages, we cannot rely on the filename to detect the mime type
                if (!(isHttp && hint.isInterpreted())) {
                    // If we have some types based on mime magic, try to specialise
                    //  and/or select the type based on that
                    // Otherwise, use the type identified from the name
                    possibleTypes = applyHint(possibleTypes, hint);
                }
            }
        }

        ... ... (omitted)

        if (possibleTypes == null || possibleTypes.isEmpty()) {
            // Report that we don't know what it is
            return MediaType.OCTET_STREAM;
        } else {
            return possibleTypes.get(0).getType();
        }
    }

Two lines in the code above show that Tika, too, determines the file type by checking the file's magic number:

byte[] prefix = readMagicHeader(input);
possibleTypes = getMimeType(prefix);

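So how does Tika tell files apart when their magic numbers are the same? The code below gives the answer. As mentioned above, passing in the file name (with its .xxx extension) is enough to get the right result. This is the code that makes it work: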

    String resourceName = metadata.get(Metadata.RESOURCE_NAME_KEY);
    if (resourceName != null) { 
        if (name != null) {
            MimeType hint = getMimeType(name);
            if (!(isHttp && hint.isInterpreted())) {
             
                possibleTypes = applyHint(possibleTypes, hint);
            }
        }
    }
    
   private List<MimeType> applyHint(List<MimeType> possibleTypes, MimeType hint) {
        if (possibleTypes == null || possibleTypes.isEmpty()) {
            return Collections.singletonList(hint);
        } else {
            for (int i=0; i<possibleTypes.size(); i++) {
                final MimeType type = possibleTypes.get(i);
                if (hint.equals(type) ||
                    registry.isSpecializationOf(hint.getType(), type.getType())) {
                    return Collections.singletonList(hint);
                }
            }
        }
        return possibleTypes;
    }

    public boolean isSpecializationOf(MediaType a, MediaType b) {
        return isInstanceOf(getSupertype(a), b);
    }

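In short:

First, check whether the type Tika detected from the content is the same as the type implied by the file name you passed in.
If they differ, check whether the type implied by the file name is a subtype of the type Tika detected (the sub-class-of element in the XML file).
If it is a subtype, return the type implied by the file name.
Otherwise, keep the type detected from the content.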

For .xlsx, the hierarchy in the XML file is: application/zip > application/x-tika-ooxml > application/vnd.openxmlformats-officedocument.spreadsheetml.sheet. That is why passing a file name with the extension lets Tika detect the file type correctly.
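The hint logic can be modelled in a few lines of plain Java. This is a simplified sketch, not Tika's implementation: it uses strings instead of `MimeType` objects, and the class `HintModel` and its hard-coded parent table are assumptions that mirror only the three types in the chain above.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Simplified model of the name-hint logic (hypothetical, strings instead of MimeType). */
final class HintModel {

    static final String ZIP = "application/zip";
    static final String OOXML = "application/x-tika-ooxml";
    static final String XLSX = "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet";

    // Child type -> parent type, mirroring the sub-class-of elements in the XML excerpt.
    private static final Map<String, String> SUPERTYPE = new HashMap<>();
    static {
        SUPERTYPE.put(XLSX, OOXML);
        SUPERTYPE.put(OOXML, ZIP);
    }

    /** True if type a is a strict descendant of type b in the hierarchy. */
    static boolean isSpecializationOf(String a, String b) {
        for (String t = SUPERTYPE.get(a); t != null; t = SUPERTYPE.get(t)) {
            if (t.equals(b)) {
                return true;
            }
        }
        return false;
    }

    /** Prefers the name hint when it equals or specializes a content-detected type. */
    static List<String> applyHint(List<String> possibleTypes, String hint) {
        if (possibleTypes == null || possibleTypes.isEmpty()) {
            return Collections.singletonList(hint);
        }
        for (String type : possibleTypes) {
            if (hint.equals(type) || isSpecializationOf(hint, type)) {
                return Collections.singletonList(hint);
            }
        }
        return possibleTypes;
    }

    public static void main(String[] args) {
        // Content says zip, name says xlsx: xlsx descends from zip, so the hint wins.
        System.out.println(applyHint(Collections.singletonList(ZIP), XLSX));
        // Content says png, name says xlsx: unrelated types, so the content result stays.
        System.out.println(applyHint(Collections.singletonList("image/png"), XLSX));
    }
}
```

The second call is the case that matters for validation: renaming a png to .xlsx does not change the detected type, because the hint is only accepted when it refines what the content already suggests.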

Improving the code

Now that we understand how Tika's file type detection works, we know how to use it properly. Here is an improved version of the original code:

private static final MimeTypes DEFAULT_MIME_TYPES = MimeTypes.getDefaultMimeTypes();

/**
 * Checks whether the file content matches the expected file type.
 *
 * @param bytes file data
 * @param expectFileType expected file type
 * @return true if the detected type maps to the expected extension
 */
public static boolean fileTypeDetect(byte[] bytes, FileTypeEnum expectFileType) {
    String extension = "." + expectFileType.getMsg();
    try {
        Tika tika = new Tika();
        String detectedMediaType = tika.detect(bytes, extension);
        MimeType mimeType = DEFAULT_MIME_TYPES.forName(detectedMediaType);

        return CollectionUtils.isNotEmpty(mimeType.getExtensions())
                    && mimeType.getExtensions().stream().anyMatch(ext -> ext.equals(extension));
    } catch (Exception e) {
        // do something, e.g. log the failure
    }
    // Detection failed, so do not report a match
    return false;
}

Things to note when using it

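Be aware that the stream is read when the magic number is fetched! As the Javadoc quoted earlier says, the stream is reset to its original position only if it supports the mark feature.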

    byte[] readMagicHeader(InputStream stream) throws IOException {
        if (stream == null) {
            throw new IllegalArgumentException("InputStream is missing");
        }

        byte[] bytes = new byte[getMinLength()];
        int totalRead = 0;

        int lastRead = stream.read(bytes);
        while (lastRead != -1) {
            totalRead += lastRead;
            if (totalRead == bytes.length) {
                return bytes;
            }
            lastRead = stream.read(bytes, totalRead, bytes.length - totalRead);
        }

        byte[] shorter = new byte[totalRead];
        System.arraycopy(bytes, 0, shorter, 0, totalRead);
        return shorter;
    }
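One way to keep a stream usable after peeking at its header is to make sure it supports mark/reset before reading, for example by wrapping it in a `BufferedInputStream`. The sketch below shows that pattern with the JDK only; `PeekHeader` and `peek` are hypothetical names, and the non-markable stream in `main` is simulated for the demonstration.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;

/** Reads a header from a stream and rewinds it (hypothetical helper). */
final class PeekHeader {

    /** Reads up to n bytes, then resets the stream to where it started. */
    static byte[] peek(InputStream in, int n) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        in.mark(n);
        try {
            byte[] buffer = new byte[n];
            int total = 0;
            int read;
            while (total < n && (read = in.read(buffer, total, n - total)) != -1) {
                total += read;
            }
            return Arrays.copyOf(buffer, total);
        } finally {
            in.reset();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {'P', 'K', 3, 4, 10, 20, 30};

        // Simulate a stream without mark support, as some real sources behave.
        InputStream raw = new FilterInputStream(new ByteArrayInputStream(data)) {
            @Override
            public boolean markSupported() {
                return false;
            }
        };

        // Wrap it so that the header bytes can be replayed after peeking.
        InputStream in = raw.markSupported() ? raw : new BufferedInputStream(raw);

        byte[] header = peek(in, 4);
        System.out.println(header.length);  // 4
        System.out.println((char) in.read()); // P (the stream is back at the start)
    }
}
```

After wrapping, continue to read from the wrapper rather than from the original stream, since the wrapper may already have buffered bytes that the original stream will not return again.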

Summary

The source code always gives you the best answer.
