Converting the Encoding of All Files in a Directory with the Open-Source cpdetector

微生啸

2023-12-01

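I recently wanted to tidy up some code I had written in the past and found that many projects used inconsistent encodings, sometimes with different encodings for different files within the same project. I looked through related blog posts, and most of the approaches require passing the source file's encoding as a method parameter. That is barely workable when only a few files need converting, and clearly impractical when there are many. What is needed is a tool that detects each file's encoding, reads the file using that encoding, and then writes the content back out in the target encoding, which completes the conversion.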

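This article uses the open-source project cpdetector to detect file encodings; its home page is http://cpdetector.sourceforge.net/. cpdetector is based on statistical principles and is not guaranteed to be correct, so I do not overwrite the originals: a convert folder is created under the source file's directory to hold the converted files. You can adapt this to your own needs.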
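Because the detection is only a statistical guess, it can help to check that the guessed charset actually decodes the file before trusting it. Below is a minimal sketch of such a check using only the JDK. The class name `CharsetSanityCheck` and the method `decodesCleanly` are hypothetical names introduced for illustration; they are not part of cpdetector.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of cpdetector
final class CharsetSanityCheck {

    /** Returns true if every byte sequence in the file is valid in the given charset. */
    static boolean decodesCleanly(Path file, Charset charset) throws IOException {
        // Reads the whole file into memory, which is fine for source files
        byte[] bytes = Files.readAllBytes(file);
        try {
            // REPORT makes the decoder throw instead of substituting replacement characters
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

A passing check does not prove the guess is right, since single-byte charsets such as ISO-8859-1 accept any byte sequence. It only catches clear mismatches, for example GBK bytes reported as UTF-8.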

There are several ways to get a file's encoding; see the post "How to Get a File's Encoding in Java" (Java如何获取文件编码格式).
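One of the simplest of those ways is to look for a byte order mark (BOM) at the start of the file. The sketch below uses only the JDK and assumes the file either starts with a UTF-8 or UTF-16 BOM or has none; UTF-32 is deliberately ignored. `BomSniffer` and `detectBom` are hypothetical names used for illustration. Files without a BOM still need a statistical detector such as cpdetector.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only
final class BomSniffer {

    /** Returns the charset name implied by a leading BOM, or null if no BOM is found. */
    static String detectBom(Path file) throws IOException {
        byte[] head = new byte[3];
        int n = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int r;
            // Keep reading until 3 bytes are available or the file ends
            while (n < head.length && (r = in.read(head, n, head.length - n)) != -1) {
                n += r;
            }
        }
        if (n >= 3 && (head[0] & 0xFF) == 0xEF && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF) {
            return "UTF-8";
        }
        if (n >= 2 && (head[0] & 0xFF) == 0xFE && (head[1] & 0xFF) == 0xFF) {
            return "UTF-16BE";
        }
        // Note: a UTF-32LE BOM (FF FE 00 00) would also match here; this sketch does not handle UTF-32
        if (n >= 2 && (head[0] & 0xFF) == 0xFF && (head[1] & 0xFF) == 0xFE) {
            return "UTF-16LE";
        }
        return null;
    }
}
```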

The implementation is as follows:

package com.zhouj.endless.utils;

import info.monitorenter.cpdetector.io.ASCIIDetector;
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.JChardetFacade;
import info.monitorenter.cpdetector.io.UnicodeDetector;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.Set;

public class ConvertFileCharset {

    public static final String PREFIX_NAME = "convert";

    /**
     * @param srcFile   source path, either a single file or a directory
     * @param toCharset target encoding
     * @param typeSet   file extensions to convert, including the leading dot (e.g. ".java")
     */
    public static void convert(File srcFile, String toCharset, Set<String> typeSet) throws IOException {
        if (srcFile.isFile()) {
            if (typeSet.contains(getFileSuffixName(srcFile))) {
                _execute(srcFile.getPath(), toCharset);
            }
        } else {
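            // listFiles() returns null if the path is not a directory or an I/O error occurs
            File[] children = srcFile.listFiles();
            if (children == null) {
                return;
            }
            for (File file : children) {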
                if (file.isFile()) {
                    if (typeSet.contains(getFileSuffixName(file))) {
                        _execute(file.getPath(), toCharset);
                    }
                } else {
                    convert(file, toCharset, typeSet);
                }
            }
        }
    }

    private static void _execute(String srcFilePath, String destCharset) throws IOException {
        // Skip files that have already been converted (paths containing the convert folder name)
        if (!srcFilePath.contains(PREFIX_NAME)) {
            File srcFile = new File(srcFilePath);
            String charsetName = getFileCharset(srcFile);
            // Skip the file if its encoding already matches the target encoding
            if (!charsetName.equalsIgnoreCase(destCharset)) {
                String destPathPrefix = srcFile.getParent() + File.separator + PREFIX_NAME;
                String destFileName = srcFilePath.substring(srcFilePath.lastIndexOf(File.separator));
                String destPath = destPathPrefix + destFileName;
                File destFile = new File(destPath);
                if (!destFile.getParentFile().exists()) {
                    destFile.getParentFile().mkdirs();
                }
                // try-with-resources closes both streams even if an exception is thrown;
                // FileOutputStream in non-append mode creates the file or truncates an existing one
                try (BufferedReader bufferedReader = new BufferedReader(
                             new InputStreamReader(new FileInputStream(srcFile), charsetName));
                     BufferedWriter bufferedWriter = new BufferedWriter(
                             new OutputStreamWriter(new FileOutputStream(destFile), destCharset))) {
                    String temp;
                    // readLine() strips the line terminator, so every line is written back with \r\n
                    while ((temp = bufferedReader.readLine()) != null) {
                        bufferedWriter.write(temp + "\r\n");
                    }
                }
                System.out.println("Source file encoding: " + charsetName + " Converted file path: " + destFile);
            }
        }
    }


    private static String getFileCharset(File srcFile) {
        CodepageDetectorProxy detector = CodepageDetectorProxy.getInstance();
        // JChardetFacade requires antlr.jar and chardet.jar
        detector.add(JChardetFacade.getInstance());
        // ASCIIDetector detects ASCII
        detector.add(ASCIIDetector.getInstance());
        // UnicodeDetector detects the Unicode family of encodings
        detector.add(UnicodeDetector.getInstance());
        java.nio.charset.Charset charset = null;
        try {
            charset = detector.detectCodepage(srcFile.toURI().toURL());
        } catch (Exception ex) {
            ex.printStackTrace();
        }
        if (charset != null) {
            // Fall back to GBK when the detector returns "void", a value I ran into during real conversions
            return "void".equals(charset.name()) ? "GBK" : charset.name();
        } else {
            return "UTF-8";
        }
    }

    private static String getFileSuffixName(File file) {
        String fileName = file.getName();
        int dotIndex = fileName.lastIndexOf('.');
        // A file name without a dot has no extension
        return dotIndex < 0 ? "" : fileName.substring(dotIndex);
    }
}
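Note that the conversion loop above reads with `readLine()` and writes every line back with `\r\n`, so the original line endings are normalized to CRLF. If the line endings should be preserved, one alternative is to copy characters through a buffer instead of going line by line. The following is a minimal sketch using only the JDK; `Transcoder` and `transcode` are hypothetical names, and the source charset is assumed to be already known (for example, from the detection step).

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical alternative to the line-based copy, for illustration only
final class Transcoder {

    /** Re-encodes src (read as 'from') into dest (written as 'to'), keeping line endings unchanged. */
    static void transcode(Path src, Charset from, Path dest, Charset to) throws IOException {
        // newBufferedWriter creates dest or truncates it if it already exists
        try (Reader reader = Files.newBufferedReader(src, from);
             Writer writer = Files.newBufferedWriter(dest, to)) {
            char[] buffer = new char[8192];
            int read;
            // Copy raw characters, so \n and \r\n pass through untouched
            while ((read = reader.read(buffer)) != -1) {
                writer.write(buffer, 0, read);
            }
        }
    }
}
```

Unlike `InputStreamReader`, the reader returned by `Files.newBufferedReader` throws an exception on malformed input rather than substituting replacement characters, so a wrong charset guess is more likely to surface as an error than as silently corrupted text.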

Test code:

package com.zhouj.endless.temp;

import com.zhouj.endless.utils.ConvertFileCharset;

import java.io.File;
import java.io.IOException;
import java.util.Set;
import java.util.TreeSet;

public class FileEncodingDetect {

    public static void main(String[] args) throws IOException {
        String srcFilePath = "F:\\test";
        File srcFile = new File(srcFilePath);
        // Use a parameterized type instead of a raw Set
        Set<String> typeSet = new TreeSet<>();
        typeSet.add(".java");
        ConvertFileCharset.convert(srcFile, "utf-8", typeSet);
    }
}

Output:

Connected to the target VM, address: '127.0.0.1:55754', transport: 'socket'
Disconnected from the target VM, address: '127.0.0.1:55754', transport: 'socket'
Source file encoding: GB2312 Converted file path: F:\test\convert\ObjectMethodDemo.java

Process finished with exit code 0

It has been a long time since I last wrote a blog post; I hope that, having started again, I won't stop. Onward, without fear!
