How File.listFiles() handles Unicode names with JDK 6 (Unicode normalization issue)

熊烨

2023-03-14

Question:

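I am running into a strange file name encoding problem when listing directory contents in Java 6 on both OS X and Linux: File.listFiles() and related methods seem to return file names in a different encoding from the rest of the system.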

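Note that it is not merely the display of these file names that causes me problems. I am mainly interested in comparing file names with a remote file storage system, so I care more about the content of the name strings than about the character encoding used to print them.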

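Here is a demo program. It creates a file with a Unicode name, then prints URL-encoded versions of the file name as obtained from the directly created File and from the same file when listed under its parent directory (you should run this code in an empty directory). The results show that File.listFiles() returns a different encoding.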

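    import java.io.File;
    import java.net.URLEncoder;

    public class ListFilesDemo {  // class name is arbitrary
        public static void main(String[] args) throws Exception {
            // "Trîcky Nåme" in decomposed form (i + U+0302, a + U+030A),
            // matching the "File name" output reported below
            String fileName = "Tri\u0302cky Na\u030Ame";
            File file = new File(fileName);
            file.createNewFile();
            System.out.println("File name: " + URLEncoder.encode(file.getName(), "UTF-8"));

            // Get parent (current) dir and list file contents
            File parentDir = file.getAbsoluteFile().getParentFile();
            File[] children = parentDir.listFiles();
            for (File child : children) {
                System.out.println("Listed name: " + URLEncoder.encode(child.getName(), "UTF-8"));
            }
        }
    }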

Here is what I get when I run this test code on my systems. Note the %CC vs. %C3 character representations.

OS X Snow Leopard:

File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me

$ java -version
java version "1.6.0_20"
Java(TM) SE Runtime Environment (build 1.6.0_20-b02-279-10M3065)
Java HotSpot(TM) 64-Bit Server VM (build 16.3-b01-279, mixed mode)

KUbuntu Linux (running in a VM on the same OS X system):

File name: Tri%CC%82cky+Na%CC%8Ame
Listed name: Tr%C3%AEcky+N%C3%A5me

$ java -version
java version "1.6.0_18"
OpenJDK Runtime Environment (IcedTea6 1.8.1) (6b18-1.8.1-0ubuntu1)
OpenJDK Client VM (build 16.0-b13, mixed mode, sharing)

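I have tried various hacks to get the strings to agree, including setting the file.encoding system property and various LC_CTYPE and LANG environment variables. Nothing helps, and I would rather not resort to such hacks anyway.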

Unlike this (somewhat related?) question, I am able to read data from the listed files despite the odd names.

Answer:

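In Unicode there is more than one valid way to represent the same letter. The characters you are using in your Tricky Name are “latin small letter i with circumflex” and “latin small letter a with ring above”.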

You say “Note the %CC vs. %C3 character representations”, but looking closer, what you see are the sequences

i 0xCC 0x82 vs. 0xC3 0xAE
a 0xCC 0x8A vs. 0xC3 0xA5

That is, the first is the letter i followed by the bytes 0xCC 0x82, which are the UTF-8 encoding of the Unicode \u0302 “combining circumflex accent” character, while the second is the UTF-8 encoding of \u00EE, “latin small letter i with circumflex”. Similarly for the other pair: the first is the letter a followed by 0xCC 0x8A, the “combining ring above” character (\u030A), and the second is “latin small letter a with ring above” (\u00E5). Both are valid UTF-8 encodings of valid Unicode character strings, but the first of each pair is in “decomposed” form and the second is in “composed” form.
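To see the two byte sequences side by side, here is a small sketch using java.text.Normalizer, which is available from Java 6 onwards. The class name FormBytes and the helper hexUtf8 are made up for this illustration; the sample character is the i with circumflex from the question.

```java
import java.nio.charset.Charset;
import java.text.Normalizer;

// Illustrative only: FormBytes and hexUtf8 are hypothetical names.
class FormBytes {

    private static final Charset UTF8 = Charset.forName("UTF-8");

    // Returns the UTF-8 bytes of s as space-separated hex, e.g. "C3 AE".
    static String hexUtf8(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(UTF8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Composed: a single code point, latin small letter i with circumflex.
        String composed = "\u00EE";
        // Decomposed: plain i followed by U+0302 combining circumflex accent.
        String decomposed = Normalizer.normalize(composed, Normalizer.Form.NFD);

        System.out.println("composed:   " + hexUtf8(composed));            // C3 AE
        System.out.println("decomposed: " + hexUtf8(decomposed));          // 69 CC 82
        System.out.println("equals():   " + composed.equals(decomposed));  // false
    }
}
```

The two strings render identically, yet String.equals() treats them as different because it compares code units, not canonical equivalence.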

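OS X HFS Plus volumes store strings (such as file names) as “fully decomposed”. On Unix file systems, the stored form is whatever the file system driver chooses to store. You cannot make any blanket statements across different types of file systems.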
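Since the stored form depends on the file system, it can help to inspect what a directory listing actually returns on a given machine. The sketch below is one possible way to do that with Normalizer.isNormalized; the class name FormProbe and the method describeForm are hypothetical, and listing the current directory is only an example.

```java
import java.io.File;
import java.text.Normalizer;

// Illustrative only: FormProbe and describeForm are hypothetical names.
class FormProbe {

    // Reports which normalization form(s) a name already satisfies.
    // Pure-ASCII names satisfy both, so they come back as "NFC+NFD".
    static String describeForm(String name) {
        boolean nfc = Normalizer.isNormalized(name, Normalizer.Form.NFC);
        boolean nfd = Normalizer.isNormalized(name, Normalizer.Form.NFD);
        if (nfc && nfd) {
            return "NFC+NFD";
        }
        if (nfc) {
            return "NFC";
        }
        if (nfd) {
            return "NFD";
        }
        return "neither";
    }

    public static void main(String[] args) {
        // list() returns null if the path is not a readable directory.
        String[] names = new File(".").list();
        if (names == null) {
            System.err.println("Could not list the current directory");
            return;
        }
        for (String name : names) {
            System.out.println(describeForm(name) + "  " + name);
        }
    }
}
```

Running this in a directory that contains accented file names gives a quick indication of which form that particular combination of JVM and file system hands back.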

For a general discussion of composed vs. decomposed forms, see the Wikipedia article on Unicode equivalence, which specifically mentions OS X.

For information on converting between forms, see Apple's Technical Q&A QA1235 (in Objective-C, unfortunately).

A recent email thread on Apple's java-dev mailing list may be of some help to you.

Basically, you need to normalize the decomposed form into the composed form before you can compare the strings.
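In Java 6 and later, this normalization is available through java.text.Normalizer. The following is a minimal sketch, assuming names are to be compared in composed form (NFC); the class NameMatcher and its methods sameName and findByName are hypothetical names for this example, not part of any library.

```java
import java.io.File;
import java.text.Normalizer;

// Illustrative only: NameMatcher, sameName and findByName are hypothetical names.
class NameMatcher {

    // Compares two names after bringing both to composed form (NFC).
    static boolean sameName(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    // Returns the first entry of dir whose name matches wanted, or null if none does.
    static File findByName(File dir, String wanted) {
        File[] children = dir.listFiles();
        if (children == null) {
            return null; // dir is not a directory or could not be read
        }
        for (File child : children) {
            if (sameName(child.getName(), wanted)) {
                return child;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String decomposed = "Tri\u0302cky Na\u030Ame"; // i + U+0302, a + U+030A
        String composed = "Tr\u00EEcky N\u00E5me";     // single code points U+00EE, U+00E5
        System.out.println(decomposed.equals(composed));    // false
        System.out.println(sameName(decomposed, composed)); // true
    }
}
```

Normalizing both sides of the comparison means it does not matter which form the local file system or the remote store happens to use.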
