Article link: https://blog.csdn.net/m0_38060977/article/details/102754838

I. Memory-mapped files (map) combined with Charset

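import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// The class name is arbitrary; the original post shows only the main method.
public class CharsetCopy {
    public static void main(String[] args) throws Exception {
        String inputFile = "input_1.txt";
        String outputFile = "output_1.txt";

        // try-with-resources closes both files (and their channels), even on failure
        try (RandomAccessFile inputRandomAccessFile = new RandomAccessFile(inputFile, "r");
             RandomAccessFile outputRandomAccessFile = new RandomAccessFile(outputFile, "rw")) {

            FileChannel inputFileChannel = inputRandomAccessFile.getChannel();
            FileChannel outputFileChannel = outputRandomAccessFile.getChannel();

            // Map the whole input file into memory, read-only
            long inputFileLength = inputFileChannel.size();
            MappedByteBuffer inputData =
                    inputFileChannel.map(FileChannel.MapMode.READ_ONLY, 0, inputFileLength);

            // US-ASCII does not work here for non-ASCII input: ASCII uses only 7 bits,
            // so any byte with the 8th (high) bit set cannot be decoded.
            Charset charset = StandardCharsets.UTF_8;

            CharsetDecoder charsetDecoder = charset.newDecoder();
            CharsetEncoder charsetEncoder = charset.newEncoder();

            // bytes -> chars -> bytes
            CharBuffer charBuffer = charsetDecoder.decode(inputData);
            ByteBuffer byteBuffer = charsetEncoder.encode(charBuffer);

            // A single write() may not drain the buffer, so loop until it is empty
            while (byteBuffer.hasRemaining()) {
                outputFileChannel.write(byteBuffer);
            }
//            outputFileChannel.write(inputData);
        }
    }
}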

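Testing shows that the copy also works without decoding and re-encoding through Charset: calling the commented-out line directly copies the file as well. MappedByteBuffer is itself a ByteBuffer, so it can be written to the channel as-is.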
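The ASCII remark in the code comment above can be checked in isolation. The following is a minimal sketch, not part of the original post: the class name AsciiDecodeDemo and the helper decodes are made up for illustration. It relies on the fact that a decoder obtained from newDecoder() reports malformed input by default, so decode() throws a CharacterCodingException instead of silently substituting characters.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, named here only for illustration.
class AsciiDecodeDemo {
    /** Returns true if the bytes decode cleanly with a fresh decoder of the given charset. */
    static boolean decodes(byte[] bytes, Charset charset) {
        try {
            // A new decoder reports malformed input rather than replacing it
            charset.newDecoder().decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Two Chinese characters followed by plain ASCII text
        byte[] utf8Bytes = "\u4e2d\u6587 abc".getBytes(StandardCharsets.UTF_8);

        System.out.println(decodes(utf8Bytes, StandardCharsets.UTF_8));     // true
        System.out.println(decodes(utf8Bytes, StandardCharsets.US_ASCII));  // false
        // Pure ASCII input is fine with either charset
        System.out.println(decodes("abc".getBytes(StandardCharsets.US_ASCII),
                StandardCharsets.US_ASCII));                                // true
    }
}
```

Every byte of a multi-byte UTF-8 sequence has its high bit set, which is exactly what a 7-bit charset cannot represent.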

II. Overview of common encodings

ASCII

American Standard Code for Information Interchange.
It uses 7 bits per character, so it can represent 128 characters in total.

ISO-8859-1

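It uses one byte (8 bits) per character, so it can represent 256 characters in total. It extends ASCII and is backward compatible with it.
It uses every bit of the byte, unlike ASCII, which leaves one bit unused.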
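Because ISO-8859-1 assigns a character to each of the 256 byte values, any byte sequence survives a decode and re-encode unchanged, which is not true for 7-bit ASCII. A small sketch of that property follows; the class and method names are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, named here only for illustration.
class Latin1RoundTrip {
    /** Builds an array containing every possible byte value once. */
    static byte[] allByteValues() {
        byte[] all = new byte[256];
        for (int i = 0; i < 256; i++) {
            all[i] = (byte) i;
        }
        return all;
    }

    /** Decodes the bytes to a String and encodes them back; true if nothing changed. */
    static boolean roundTrips(byte[] bytes, Charset charset) {
        String decoded = new String(bytes, charset);
        return Arrays.equals(bytes, decoded.getBytes(charset));
    }

    public static void main(String[] args) {
        byte[] all = allByteValues();
        System.out.println(roundTrips(all, StandardCharsets.ISO_8859_1)); // true
        // With US-ASCII the bytes 0x80..0xFF are replaced during decoding
        System.out.println(roundTrips(all, StandardCharsets.US_ASCII));   // false
    }
}
```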

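GB2312

GB2312 uses two bytes per Chinese character and does not include rare characters. GBK extends GB2312, and GB18030 in turn extends GBK (covering all Simplified Chinese characters).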
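The two-byte claim and the missing rare characters can be probed from Java, as in the sketch below. It assumes the running JDK ships the GB2312 and GBK charsets (they are extended charsets, not guaranteed on every runtime), so the helpers check Charset.isSupported first. The class name and helper names are invented for illustration, and the character U+9555 is used as an often-cited example of a character that GBK added; what is actually printed for it depends on the JDK's mapping tables.

```java
import java.nio.charset.Charset;

// Hypothetical demo class, named here only for illustration.
class GbEncodingDemo {
    /** Encoded length in bytes, or -1 if this runtime lacks the charset. */
    static int encodedLength(String text, String charsetName) {
        if (!Charset.isSupported(charsetName)) {
            return -1;
        }
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    /** True only if the charset is available and has a mapping for the character. */
    static boolean canEncode(char c, String charsetName) {
        return Charset.isSupported(charsetName)
                && Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d"; // a very common Chinese character
        System.out.println("UTF-8 bytes:  " + encodedLength(zhong, "UTF-8"));
        System.out.println("GB2312 bytes: " + encodedLength(zhong, "GB2312"));
        System.out.println("GBK bytes:    " + encodedLength(zhong, "GBK"));

        char rare = '\u9555'; // assumed to be absent from GB2312 but present in GBK
        System.out.println("GB2312 can encode rare char: " + canEncode(rare, "GB2312"));
        System.out.println("GBK can encode rare char:    " + canEncode(rare, "GBK"));
    }
}
```

On a runtime that has these charsets, the common character should take two bytes in GB2312 and GBK versus three in UTF-8.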

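Unicode

The universal character set, covering the scripts of essentially all of the world's languages. In Java source, a character can be written as an escape of the form \uXXXX.
Unicode is a character set. Its ISO counterpart, ISO/IEC 10646, was originally titled "Universal Multiple-Octet Coded Character Set", abbreviated UCS (Universal Coded Character Set).
The 2-byte form of this standard is usually called UCS-2. Limited to 2 bytes, UCS-2 can represent at most 65,536 characters. The 4-byte form is called UCS-4 or UTF-32; it can address the full Unicode range, which is more than one million distinct code points.
Unicode 9.0, released on 2016-06-21, contains 128,237 characters.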
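The 2-byte limit is visible in Java itself, because a Java char is a 16-bit unit: characters beyond U+FFFF need two chars (a surrogate pair). The sketch below illustrates this with made-up class and method names; U+1F600 is picked only as a convenient code point above U+FFFF.

```java
// Hypothetical demo class, named here only for illustration.
class CodePointDemo {
    /** Number of 16-bit char units in the string. */
    static int charCount(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePointCount(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String zhong = "\u4e2d";                               // fits in one 16-bit unit
        String emoji = new String(Character.toChars(0x1F600)); // needs a surrogate pair

        System.out.println(Integer.toHexString(zhong.codePointAt(0))); // 4e2d
        System.out.println(charCount(zhong) + " / " + codePointCount(zhong)); // 1 / 1
        System.out.println(charCount(emoji) + " / " + codePointCount(emoji)); // 2 / 1
        System.out.println(Integer.toHexString(emoji.codePointAt(0))); // 1f600
    }
}
```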

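UTF-8

UTF-8 is a storage format, whereas Unicode is a numbering scheme; UTF-8 is one of the ways to implement Unicode. (Unicode itself only specifies which number is assigned to each character; it does not specify how that number is stored.)

UTF-8 is a variable-length byte encoding.
Characters in the Basic Multilingual Plane take at most 3 bytes in UTF-8 and all other characters take 4. The original design allowed up to 6 bytes, which were rarely used and are no longer permitted by RFC 3629.
UTF-16LE (little endian) and UTF-16BE (big endian) are the little-endian and big-endian byte orders of UTF-16.
A BOM (byte order mark) placed at the very beginning of the file indicates which byte order is used.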
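Both points, the variable length of UTF-8 and the byte order mark of UTF-16, can be observed by dumping encoded bytes. This is an illustrative sketch with invented class and helper names; it relies on the documented behavior that Java's "UTF-16" charset writes a big-endian BOM when encoding, while "UTF-16LE" and "UTF-16BE" write none.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, named here only for illustration.
class Utf8LengthDemo {
    /** Number of bytes UTF-8 needs for a single code point. */
    static int utf8Length(int codePoint) {
        String s = new String(Character.toChars(codePoint));
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Formats bytes as space-separated uppercase hex pairs. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(utf8Length(0x41));    // 1 (Latin letter A)
        System.out.println(utf8Length(0xE9));    // 2 (e with acute accent)
        System.out.println(utf8Length(0x4E2D));  // 3 (a common Chinese character)
        System.out.println(utf8Length(0x1F600)); // 4 (outside the Basic Multilingual Plane)

        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16)));   // FE FF 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16BE))); // 00 41
        System.out.println(hex("A".getBytes(StandardCharsets.UTF_16LE))); // 41 00
    }
}
```

The leading FE FF in the first UTF-16 line is the BOM; the remaining two bytes are the character itself in big-endian order.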

References

https://segmentfault.com/q/1010000009652523

nio (4): Character set encoding and decoding

