Java: Readers and encodings

朱高丽

2023-03-14

Question:

Java's default encoding is ASCII. Is that right? (See my edit below.)

When a text file is encoded in UTF-8, how does a Reader know that it has to use UTF-8?

The Readers I am talking about are:

FileReaders
BufferedReaders created from Sockets
A Scanner reading from System.in
…

Edit: It turns out that the encoding depends on the operating system, which would mean that the following is not true on every OS:

'a' == 97

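Answer:

How does a Reader know that it has to use UTF-8?

You normally specify it yourself in an InputStreamReader. It has a constructor that takes the character encoding. For example: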

Reader reader = new InputStreamReader(new FileInputStream("c:/foo.txt"), "UTF-8");

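All other readers (as far as I know) use the platform default character encoding, which may not actually be the correct encoding (for example, -cough- CP-1252).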
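To make that concrete for the readers listed in the question, here is a minimal sketch of passing the charset explicitly instead of relying on the platform default. The class and method names (ExplicitCharsetDemo, utf8Reader, firstLine, firstToken) are made up for illustration. A ByteArrayInputStream stands in for a file, a socket stream, or System.in. On Java 11 and later, FileReader also has constructors that take a Charset.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Scanner;

// Hypothetical demo class; the names are illustrative, not from the answer.
class ExplicitCharsetDemo {
    // Wraps any byte stream (FileInputStream, socket.getInputStream(),
    // System.in, ...) in a reader that always decodes as UTF-8.
    static BufferedReader utf8Reader(InputStream in) {
        return new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    // Reads the first line of the stream, decoding the bytes as UTF-8.
    static String firstLine(InputStream in) {
        try (BufferedReader reader = utf8Reader(in)) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Scanner accepts a charset name too, so it need not use the default.
    static String firstToken(InputStream in) {
        try (Scanner scanner = new Scanner(in, StandardCharsets.UTF_8.name())) {
            return scanner.next();
        }
    }

    public static void main(String[] args) {
        // Unicode escapes keep this source file itself encoding-independent.
        byte[] utf8 = "h\u00e9llo w\u00f6rld\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(firstLine(new ByteArrayInputStream(utf8)));
        System.out.println(firstToken(new ByteArrayInputStream(utf8)));
    }
}
```

Using the StandardCharsets constants rather than a string such as "UTF-8" avoids the checked UnsupportedEncodingException and catches typos at compile time.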

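In theory you can also detect the character encoding automatically from the byte order mark (BOM). It distinguishes the various Unicode encodings from other encodings. Unfortunately Java SE has no API for this, but you can write your own reader and use it in place of the InputStreamReader in the example above: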

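import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.io.Reader;

public class UnicodeReader extends Reader {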
    private static final int BOM_SIZE = 4;
    private final InputStreamReader reader;

    /**
     * Construct UnicodeReader
     * @param in Input stream.
     * @param defaultEncoding Default encoding to be used if BOM is not found,
     * or <code>null</code> to use system default encoding.
     * @throws IOException If an I/O error occurs.
     */
    public UnicodeReader(InputStream in, String defaultEncoding) throws IOException {
        byte bom[] = new byte[BOM_SIZE];
        String encoding;
        int unread;
        PushbackInputStream pushbackStream = new PushbackInputStream(in, BOM_SIZE);
        int n = pushbackStream.read(bom, 0, bom.length);

        // Check the bytes read so far for a BOM. The four-byte UTF-32 marks are
        // tested before UTF-16, because the UTF-32LE BOM (FF FE 00 00) begins
        // with the UTF-16LE BOM (FF FE). The checks on n avoid matching against
        // array slots that were never filled by the read.
        if (n >= 4 && (bom[0] == (byte) 0x00) && (bom[1] == (byte) 0x00) && (bom[2] == (byte) 0xFE) && (bom[3] == (byte) 0xFF)) {
            encoding = "UTF-32BE";
            unread = n - 4;
        } else if (n >= 4 && (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE) && (bom[2] == (byte) 0x00) && (bom[3] == (byte) 0x00)) {
            encoding = "UTF-32LE";
            unread = n - 4;
        } else if (n >= 3 && (bom[0] == (byte) 0xEF) && (bom[1] == (byte) 0xBB) && (bom[2] == (byte) 0xBF)) {
            encoding = "UTF-8";
            unread = n - 3;
        } else if (n >= 2 && (bom[0] == (byte) 0xFE) && (bom[1] == (byte) 0xFF)) {
            encoding = "UTF-16BE";
            unread = n - 2;
        } else if (n >= 2 && (bom[0] == (byte) 0xFF) && (bom[1] == (byte) 0xFE)) {
            encoding = "UTF-16LE";
            unread = n - 2;
        } else {
            // No BOM found: fall back to the caller's default encoding.
            encoding = defaultEncoding;
            unread = n;
        }

        // Unread bytes if necessary and skip BOM marks.
        if (unread > 0) {
            pushbackStream.unread(bom, (n - unread), unread);
        } else if (unread < -1) {
            pushbackStream.unread(bom, 0, 0);
        }

        // Use given encoding.
        if (encoding == null) {
            reader = new InputStreamReader(pushbackStream);
        } else {
            reader = new InputStreamReader(pushbackStream, encoding);
        }
    }

    public String getEncoding() {
        return reader.getEncoding();
    }

    public int read(char[] cbuf, int off, int len) throws IOException {
        return reader.read(cbuf, off, len);
    }

    public void close() throws IOException {
        reader.close();
    }
}
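One reason a BOM-aware reader like this is useful: Java's UTF-8 decoder does not remove a leading BOM. It decodes it as the character U+FEFF at the start of the text. The sketch below shows that behaviour on a small byte array. BomDemo, decodeNaively and stripBom are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of the UnicodeReader above.
class BomDemo {
    // UTF-8 BOM (EF BB BF) followed by the ASCII bytes for "hi".
    static final byte[] WITH_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};

    // Decodes as plain UTF-8; the BOM survives as a leading U+FEFF character.
    static String decodeNaively(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Drops one leading U+FEFF if present, otherwise returns the input as-is.
    static String stripBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        String naive = decodeNaively(WITH_BOM);
        System.out.println(naive.length());            // prints 3, not 2
        System.out.println((int) naive.charAt(0));     // prints 65279 (0xFEFF)
        System.out.println(stripBom(naive));           // prints hi
    }
}
```

Stripping U+FEFF after decoding only helps once the encoding is already known to be UTF-8. Inspecting the raw bytes before decoding, as UnicodeReader does, also lets the BOM select the encoding.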

Edit, in reply to your edit:

So the encoding depends on the OS. That means the following is not true on every OS:
'a' == 97

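No, that is not true. The ASCII encoding (which contains 128 characters, 0x00 up to and including 0x7F) is the basis of practically all other character encodings in common use. Only characters outside the ASCII character set risk being displayed differently in another encoding. The ISO-8859 encodings cover the characters in the ASCII range with the same code points. The Unicode encodings cover the characters in the ISO-8859-1 range with the same code points.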
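A small sketch of this point, using only charsets from StandardCharsets (AsciiCompatDemo and encode are illustrative names). In Java specifically, a char is a UTF-16 code unit, so 'a' == 97 holds regardless of the platform's default encoding. The encoding only matters when characters are converted to or from bytes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class comparing how a few charsets encode the same text.
class AsciiCompatDemo {
    // Encodes the string with the given charset and returns the raw bytes.
    static byte[] encode(String s, Charset cs) {
        return s.getBytes(cs);
    }

    public static void main(String[] args) {
        // A char literal compares against its Unicode code point.
        System.out.println('a' == 97); // prints true

        Charset[] charsets = {
            StandardCharsets.US_ASCII,
            StandardCharsets.ISO_8859_1,
            StandardCharsets.UTF_8
        };

        // The ASCII letter 'a' is the single byte 97 in all three charsets.
        for (Charset cs : charsets) {
            System.out.println(cs + " a -> " + Arrays.toString(encode("a", cs)));
        }

        // U+00E9 (e with acute accent) is outside ASCII, so the bytes differ:
        // one byte in ISO-8859-1, two bytes in UTF-8, and a '?' replacement
        // byte in US-ASCII because the character cannot be represented there.
        for (Charset cs : charsets) {
            System.out.println(cs + " U+00E9 -> " + Arrays.toString(encode("\u00e9", cs)));
        }
    }
}
```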

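You may find these blog articles interesting:

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) (the more theoretical of the two)
Unicode - How to get the characters right? (the more practical of the two)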
