Reading UTF-8 with RandomAccessFile in Java 8

2019-11-14 quaeast

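Problem and conclusion

Why does RandomAccessFile's readLine() return garbage when reading a UTF-8 file?

readLine() effectively decodes the file as ISO-8859-1, so any non-ASCII text in a UTF-8 file comes out garbled. The fix is to encode the returned string with ISO-8859-1 again, which recovers the original byte[] array, and then construct a new String from that array, decoding it as UTF-8.

However, the documentation never mentions ISO-8859-1 by name. So where does this hidden behavior come from?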

With the source in hand, there are no secrets

First, look at the source of readLine():

public final String readLine() throws IOException {
        StringBuffer input = new StringBuffer();
        int c = -1;
        boolean eol = false;

        while (!eol) {
            switch (c = read()) {
            case -1:
            case '\n':
                eol = true;
                break;
            case '\r':
                eol = true;
                long cur = getFilePointer();
                if ((read()) != '\n') {
                    seek(cur);
                }
                break;
            default:
                input.append((char)c);
                break;
            }
        }

        if ((c == -1) && (input.length() == 0)) {
            return null;
        }
        return input.toString();
}

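readLine() simply calls read() in a loop to fetch one byte at a time, casts each byte to char, and appends it to input, stopping once it reaches a line terminator. One interesting detail is that it handles the different line-ending conventions: '\n', '\r', and '\r\n'.

So let's keep digging, into read().

Source of read()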
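To make the effect concrete, here is a minimal sketch that writes one Chinese character as UTF-8 to a temporary file and reads it back with readLine(). The class name ReadLineMojibakeDemo and the helper readFirstLine are made up for this illustration.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadLineMojibakeDemo {
    // Hypothetical helper: writes text as UTF-8 to a temporary file,
    // then reads the first line back with RandomAccessFile.readLine().
    static String readFirstLine(String text) throws IOException {
        Path tmp = Files.createTempFile("readline-demo", ".txt");
        try {
            Files.write(tmp, text.getBytes(StandardCharsets.UTF_8));
            try (RandomAccessFile raf = new RandomAccessFile(tmp.toFile(), "r")) {
                return raf.readLine();
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        // "\u4e2d" is the character 中; its UTF-8 encoding is the three bytes E4 B8 AD.
        String line = readFirstLine("\u4e2d\r\nsecond line");
        System.out.println(line.length()); // 3: one char per byte, not one char per character
        for (char ch : line.toCharArray()) {
            System.out.printf("U+%04X%n", (int) ch); // U+00E4, U+00B8, U+00AD
        }
    }
}
```

The single character comes back as three chars whose values are exactly the three byte values, which is the garbling described above.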

public int read() throws IOException {
        return read0();
}

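read() is very simple: in Java 8 it just delegates to the native method read0(). According to its doc comment, it reads one byte and returns it as an int in the range 0 to 255, or -1 at end of file.

Thinking it through

Look back at readLine(). Both functions are very simple, and neither the code nor the comments mention ISO-8859-1 at all. So how does the line end up being decoded as ISO-8859-1?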

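The reason is that Java's char type uses Unicode. ISO-8859-1 is a fixed-width, single-byte character set, and the first 256 Unicode code points (U+0000 to U+00FF) coincide with it. In other words, the first 256 code points of Unicode are ISO-8859-1. So reading each byte and immediately casting it to char, as readLine() does, amounts to ISO-8859-1 decoding.

How do we get the original bytes back, then? There are two ways. The first is to encode the String returned by readLine() with ISO-8859-1, which yields the byte[] array as it was before decoding.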
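This equivalence is easy to check. The sketch below compares a byte-by-byte cast against Java's own ISO-8859-1 decoder for all 256 byte values; the class name Latin1CastDemo and the method castDecode are names invented for this example.

```java
import java.nio.charset.StandardCharsets;

class Latin1CastDemo {
    // Mimics what readLine() does: widen each unsigned byte value (0..255) to a char.
    static String castDecode(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length);
        for (byte b : bytes) {
            // Java bytes are signed, so mask to get the 0..255 value that read() would return.
            sb.append((char) (b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        String viaCast = castDecode(all);
        String viaCharset = new String(all, StandardCharsets.ISO_8859_1);
        System.out.println(viaCast.equals(viaCharset)); // true
    }
}
```

Because the mapping is one-to-one for every byte value, no information is lost, which is why the original bytes can always be recovered from the garbled string.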

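import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;


public class LearnBytes {
    public static void main(String[] args) throws IOException {
        String path = "files/t.txt";
        // Open read-only ("rw" would create the file if it did not exist); closed automatically.
        try (RandomAccessFile rf = new RandomAccessFile(path, "r")) {
            String buffer = rf.readLine(); // null if the file is empty
            if (buffer != null) {
                // Re-encode as ISO-8859-1 to recover the original bytes from the file.
                byte[] originalBytes = buffer.getBytes(StandardCharsets.ISO_8859_1);
                // Decode explicitly as UTF-8; new String(byte[]) alone uses the platform default charset.
                String utf8 = new String(originalBytes, StandardCharsets.UTF_8);
                System.out.println(utf8);
            }
        }
    }
}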

The other way is to write your own readLine() on top of read(), one that returns a byte[].

Summary
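A minimal sketch of that second approach follows. It copies the line-terminator handling from the readLine() source quoted earlier but collects raw bytes instead of chars. The names ByteLineReader, readLineBytes, and readUtf8Line are hypothetical, not part of the JDK.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

class ByteLineReader {
    // Hypothetical helper: same terminator handling as readLine(), but returns raw bytes.
    // Returns null at end of file when no bytes were read.
    static byte[] readLineBytes(RandomAccessFile raf) throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        int c = -1;
        boolean eol = false;
        while (!eol) {
            switch (c = raf.read()) {
                case -1:
                case '\n':
                    eol = true;
                    break;
                case '\r':
                    eol = true;
                    long cur = raf.getFilePointer();
                    if (raf.read() != '\n') {
                        raf.seek(cur); // lone '\r': step back so the next byte is not lost
                    }
                    break;
                default:
                    line.write(c); // keep the byte as-is, no char conversion
                    break;
            }
        }
        if (c == -1 && line.size() == 0) {
            return null;
        }
        return line.toByteArray();
    }

    // Convenience wrapper: decode the raw bytes of the next line as UTF-8.
    static String readUtf8Line(RandomAccessFile raf) throws IOException {
        byte[] bytes = readLineBytes(raf);
        return bytes == null ? null : new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Usage: pass the path of a UTF-8 text file as the first argument.
        try (RandomAccessFile raf = new RandomAccessFile(args[0], "r")) {
            String line;
            while ((line = readUtf8Line(raf)) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Like readLine() itself, this calls read() once per byte, so it is likely to be slow on large files; it is meant to show the idea rather than to be a tuned implementation.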

It’s not a bug. It’s an undocumented feature.

References and notes

Strictly speaking, ISO-8859-1 and the Unicode used by Java's char are character sets, whereas UTF-8 is an encoding based on Unicode.

ISO-8859-1 wiki
Unicode wiki
UTF-8 wiki