Java String.getBytes() Explained & Garbled Chinese Text with Properties.load

Java String.getBytes() Explained

Basic Concepts

  • In JVM memory, a String is represented as Unicode text (a sequence of UTF-16 code units)
  • UTF-8 is one of the encodings that implement Unicode
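
To make the two points above concrete, here is a minimal sketch that encodes the same two-character string with two different Unicode encodings and prints the resulting bytes. The class name `UnicodeEncodings` and the `hex` helper are made up for this illustration; the string literal uses Unicode escapes for "你好" so the source file stays pure ASCII.

```java
import java.nio.charset.StandardCharsets;

class UnicodeEncodings {
    // Hypothetical helper: formats bytes as space-separated uppercase hex.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // U+4F60 U+597D, the greeting used throughout this post
        String s = "\u4f60\u597d";
        System.out.println(s.length());                                   // 2
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_8)));     // E4 BD A0 E5 A5 BD
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE)));  // 4F 60 59 7D
    }
}
```

The characters are the same in both cases; only the byte representation chosen at encoding time differs.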

JDK

    /**
     * Encodes this {@code String} into a sequence of bytes using the named
     * charset, storing the result into a new byte array.
     */
    public byte[] getBytes(String charsetName) throws UnsupportedEncodingException {
        // ...
    }

    /**
     * Constructs a new {@code String} by decoding the specified array of
     * bytes using the specified {@linkplain java.nio.charset.Charset charset}.
     * The length of the new {@code String} is a function of the charset, and
     * hence may not be equal to the length of the byte array.
     */
    public String(byte bytes[], Charset charset) {
        // ...
    }

getBytes(String charsetName)

Encodes the string with charsetName (Unicode → charsetName) and returns the encoded bytes.
getBytes() with no argument encodes with the platform's default charset.

String(byte bytes[], Charset charset)

Decodes the bytes with charset (charset → Unicode) and returns the decoded string.
String(byte bytes[]) with no charset decodes with the platform's default charset.
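
As a small illustration of the two directions, the sketch below round-trips a string through bytes twice: once with the charset given by name (which forces handling of the checked `UnsupportedEncodingException`) and once with a `StandardCharsets` constant (which does not). The class and method names are invented for this example.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

class CharsetRoundTrip {
    // Encode and decode with a charset looked up by name.
    static String roundTripByName(String s, String charsetName) {
        try {
            byte[] encoded = s.getBytes(charsetName);   // Unicode -> bytes
            return new String(encoded, charsetName);    // bytes -> Unicode
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    // Same round trip with a Charset constant: no checked exception involved.
    static String roundTripUtf8(String s) {
        byte[] encoded = s.getBytes(StandardCharsets.UTF_8);
        return new String(encoded, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String s = "\u4f60\u597d"; // U+4F60 U+597D
        System.out.println(roundTripByName(s, "UTF-8").equals(s)); // true
        System.out.println(roundTripUtf8(s).equals(s));            // true
    }
}
```

Passing the charset explicitly in both directions also avoids depending on whatever the platform default happens to be.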

Examples

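Correct usage

    String str = "你好";
    // "浣犲ソ" is what the UTF-8 bytes of "你好" look like when wrongly decoded as GBK
    String s = "浣犲ソ";
    // Re-encoding with GBK recovers the original UTF-8 bytes, which are then decoded as UTF-8
    String ss = new String(s.getBytes("GBK"), "UTF-8");
    System.out.println(ss); // 你好
    // Encoding and decoding with the same charset round-trips cleanly
    System.out.println(new String(str.getBytes("UTF-8"), "UTF-8")); // 你好

Incorrect usage

    // Encoding as UTF-8 but decoding as GBK produces garbled text
    System.out.println(new String(str.getBytes("UTF-8"), "GBK")); // 浣犲ソ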
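
The incorrect usage above fails silently: `new String(bytes, charset)` never throws on a mismatch, it just produces garbage or U+FFFD replacement characters. If you want a mismatch to be detected instead, one option is a strict `CharsetDecoder`. The sketch below is only an illustration; `StrictDecodeDemo` and `isValidUtf8` are invented names, and the four hard-coded bytes are assumed to be the GBK encoding of "你好".

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictDecodeDemo {
    // Returns true only if the bytes are well-formed UTF-8.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8Bytes = "\u4f60\u597d".getBytes(StandardCharsets.UTF_8);
        // Assumed to be the GBK bytes of the same two characters.
        byte[] gbkBytes = {(byte) 0xC4, (byte) 0xE3, (byte) 0xBA, (byte) 0xC3};

        System.out.println(isValidUtf8(utf8Bytes)); // true
        System.out.println(isValidUtf8(gbkBytes));  // false

        // The lenient constructor does not throw; it substitutes U+FFFD instead.
        String lenient = new String(gbkBytes, StandardCharsets.UTF_8);
        System.out.println(lenient.contains("\uFFFD")); // true
    }
}
```

Once bytes have been replaced by U+FFFD the original data is gone, so the re-encode trick from the correct-usage example cannot be relied on in general.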



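Garbled Chinese Text with Properties.load

Preface

On a recent project, a colleague on the business side asked why the Chinese values in a UTF-8 properties file came out garbled after load, and what the cause was. After reading the source code I found the answer, and also learned why properties files need Unicode escapes.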

Properties encoding

Start by creating a properties file. On the Java side, the Properties class is really just a Hashtable.

    public class Properties extends Hashtable<Object,Object> {
        // ...
    }

properties load

Properties has three load methods: load(Reader), load(InputStream) and loadFromXML(InputStream).

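Setting the XML variant aside, stream input comes as either a Reader or an InputStream. Loading through a Reader generally does not garble text, because the Reader already delivers decoded characters; StringReader is one example.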

An InputStream, on the other hand, garbles easily, as happens with the UTF-8 encoded properties file from the example above.
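
The difference between the two paths can be reproduced without a file on disk. The sketch below is a minimal illustration, not the original project's code: it builds the UTF-8 bytes of a one-line properties "file" in memory (the key `greeting` and the class name are invented), then loads it once through an InputStream and once through a Reader that decodes UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class PropertiesLoadDemo {
    // In-memory stand-in for a UTF-8 encoded properties file.
    static byte[] utf8File() {
        return "greeting=\u4f60\u597d\n".getBytes(StandardCharsets.UTF_8);
    }

    // load(InputStream): bytes are interpreted as ISO 8859-1.
    static String viaStream(byte[] data) {
        Properties p = new Properties();
        try {
            p.load(new ByteArrayInputStream(data));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty("greeting");
    }

    // load(Reader): the caller decides how the bytes are decoded.
    static String viaReader(byte[] data) {
        Properties p = new Properties();
        try {
            p.load(new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty("greeting");
    }

    public static void main(String[] args) {
        byte[] data = utf8File();
        System.out.println(viaStream(data).length()); // 6: one char per UTF-8 byte
        System.out.println(viaReader(data).length()); // 2: the two original characters
    }
}
```

Wrapping the stream in an `InputStreamReader` with an explicit charset is usually the simplest fix when the file really is UTF-8.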

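Root Cause Analysis

First, why is the Chinese text garbled? The properties file is clearly UTF-8, so the garbling must come from its bytes being decoded with some other charset.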

    public synchronized void load(InputStream inStream) throws IOException {
        // JDK null check on the argument
        Objects.requireNonNull(inStream, "inStream parameter is null");
        // 1. wrap the stream in a LineReader; 2. hand it to load0
        load0(new LineReader(inStream));
    }

LineReader still reads the input one line at a time, and it supports both InputStream and Reader.

Next, load0:

    private void load0(LineReader lr) throws IOException {
        StringBuilder outBuffer = new StringBuilder();
        int limit;
        int keyLen;
        int valueStart;
        boolean hasSep;
        boolean precedingBackslash;

        // read logical lines in a loop
        while ((limit = lr.readLine()) >= 0) {
            keyLen = 0;
            valueStart = limit;
            hasSep = false;

            // a debugging println left commented out in the JDK source
            //System.out.println("line=<" + new String(lineBuf, 0, limit) + ">");
            precedingBackslash = false;
            // scan the chars of the line to find where the key ends
            while (keyLen < limit) {
                // lr.lineBuf is the char[] holding the current line
                char c = lr.lineBuf[keyLen];
                //need check if escaped.
                // '=' or ':' ends the key, so ':' also works as a key/value separator,
                // unless it is escaped by a preceding backslash
                if ((c == '=' ||  c == ':') && !precedingBackslash) {
                    valueStart = keyLen + 1;
                    hasSep = true;
                    break;
                // unescaped whitespace also ends the key; the value starts after it
                // (\t is a tab, \f is a form feed)
                } else if ((c == ' ' || c == '\t' ||  c == '\f') && !precedingBackslash) {
                    valueStart = keyLen + 1;
                    break;
                }
                // track backslashes: an odd run of backslashes escapes the next char
                if (c == '\\') {
                    precedingBackslash = !precedingBackslash;
                } else {
                    precedingBackslash = false;
                }
                // advance until '=', ':' or whitespace is found
                keyLen++;
            }
            while (valueStart < limit) {
                // skip whitespace (and at most one '=' or ':') before the value
                char c = lr.lineBuf[valueStart];
                if (c != ' ' && c != '\t' &&  c != '\f') {
                    if (!hasSep && (c == '=' ||  c == ':')) {
                        hasSep = true;
                    } else {
                        break;
                    }
                }
                valueStart++;
            }
            // the indexes found above split the line's char[] into key and value
            String key = loadConvert(lr.lineBuf, 0, keyLen, outBuffer);
            String value = loadConvert(lr.lineBuf, valueStart, limit - valueStart, outBuffer);
            put(key, value);
        }
    }
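
The separator rules in load0 are easy to check from the outside. The following sketch (class name and sample keys are made up) loads a few lines through a StringReader and prints what ends up as key and value: `=`, `:` and plain whitespace all separate key from value, whitespace around the separator is skipped, and a backslash-escaped `=` stays part of the key.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class SeparatorDemo {
    // Parses properties text held in a String.
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        String text =
                "a=1\n"          // '=' as separator
              + "b:2\n"          // ':' as separator
              + "c 3\n"          // whitespace as separator
              + "d   =   4\n"    // whitespace around the separator is skipped
              + "e\\=f=5\n";     // escaped '=' belongs to the key "e=f"
        Properties p = parse(text);
        System.out.println(p.getProperty("a"));   // 1
        System.out.println(p.getProperty("b"));   // 2
        System.out.println(p.getProperty("c"));   // 3
        System.out.println(p.getProperty("d"));   // 4
        System.out.println(p.getProperty("e=f")); // 5
    }
}
```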

Next, loadConvert:

    private String loadConvert(char[] in, int off, int len, StringBuilder out) {
        char aChar;
        int end = off + len;
        int start = off;
        while (off < end) {
            aChar = in[off++];
            // look for a backslash; note the off++ above
            if (aChar == '\\') {
                break;
            }
        }
        // nothing to unescape: return the chars as they are
        if (off == end) { // No backslash
            return new String(in, start, len);
        }

        // backslash found at off - 1, reset the shared buffer, rewind offset
        out.setLength(0);
        off--;
        out.append(in, start, off - start); // copy everything before the first backslash

        while (off < end) {
            // on the first pass this is the backslash itself
            aChar = in[off++];
            if (aChar == '\\') { // handle the escape
                aChar = in[off++]; // the char right after the backslash
                if(aChar == 'u') { // a Unicode escape
                    // Read the xxxx
                    int value=0;
                    // four hex digits are combined into a single char
                    for (int i=0; i<4; i++) {
                        aChar = in[off++];
                        // fold each hex digit into the value
                        switch (aChar) {
                          case '0': case '1': case '2': case '3': case '4':
                          case '5': case '6': case '7': case '8': case '9':
                             value = (value << 4) + aChar - '0';
                             break;
                          case 'a': case 'b': case 'c':
                          case 'd': case 'e': case 'f':
                             value = (value << 4) + 10 + aChar - 'a';
                             break;
                          case 'A': case 'B': case 'C':
                          case 'D': case 'E': case 'F':
                             value = (value << 4) + 10 + aChar - 'A';
                             break;
                          default:
                              throw new IllegalArgumentException(
                                           "Malformed \\uxxxx encoding.");
                        }
                    }
                    out.append((char)value);
                } else {
                    // special escapes \t \r \n \f; any other escaped char is kept as-is
                    if (aChar == 't') aChar = '\t';
                    else if (aChar == 'r') aChar = '\r';
                    else if (aChar == 'n') aChar = '\n';
                    else if (aChar == 'f') aChar = '\f';
                    out.append(aChar);
                }
            } else {
                // not escaped: append the char directly
                out.append(aChar);
            }
        }
        return out.toString();
    }
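
Because loadConvert turns every \uXXXX sequence back into a char, a properties file that contains only ASCII plus such escapes loads correctly even through load(InputStream). The sketch below shows that round trip. `toEscaped` is a hypothetical helper written for this post (a rough stand-in for what tools such as native2ascii did), and it only handles chars above 0x7E.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

class UnicodeEscapeDemo {
    // Hypothetical helper: replaces every char above 0x7E with a
    // backslash-u escape made of four lowercase hex digits.
    static String toEscaped(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c > 0x7E) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Loads a single ASCII-only line through the InputStream path.
    static String loadValue(String asciiLine, String key) {
        Properties p = new Properties();
        try {
            p.load(new ByteArrayInputStream(asciiLine.getBytes(StandardCharsets.ISO_8859_1)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p.getProperty(key);
    }

    public static void main(String[] args) {
        String greeting = "\u4f60\u597d"; // U+4F60 U+597D
        String line = "greeting=" + toEscaped(greeting);
        System.out.println(line); // greeting= followed by two backslash-u escapes (4f60, 597d)
        System.out.println(loadValue(line, "greeting").equals(greeting)); // true
    }
}
```

The JDK's own `Properties.store(OutputStream, String)` applies the same kind of escaping when writing, so files produced that way are safe to read back with `load(InputStream)`.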

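This finally explains why \uXXXX Unicode escapes are natively supported by properties files. So where does the garbling come from? Text only garbles when UTF-8 bytes are decoded with some other charset; characters themselves cannot be garbled. For that we need to look at readLine in LineReader.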

    private static class LineReader {
        LineReader(InputStream inStream) {
            this.inStream = inStream;
            inByteBuf = new byte[8192]; // 8K buffer
        }

        LineReader(Reader reader) {
            this.reader = reader;
            inCharBuf = new char[8192]; // 8K buffer
        }

        // working state
        char[] lineBuf = new char[1024]; // the current line, read by the caller together with the returned length
        private byte[] inByteBuf;
        private char[] inCharBuf;
        private int inLimit = 0;
        private int inOff = 0;
        private InputStream inStream;
        private Reader reader;

        int readLine() throws IOException {
            // use locals to optimize for interpreted performance
            int len = 0; // key variable: number of useful chars in lineBuf for the caller to read
            int off = inOff;
            int limit = inLimit;

            boolean skipWhiteSpace = true;
            boolean appendedLineBegin = false;
            boolean precedingBackslash = false;
            // key point: are we reading a byte stream or a character stream?
            boolean fromStream = inStream != null;
            byte[] byteBuf = inByteBuf;
            char[] charBuf = inCharBuf;
            char[] lineBuf = this.lineBuf;
            char c;

            // loop until a full logical line or EOF is reached
            while (true) {
                if (off >= limit) { // refill only once the buffer has been fully consumed
                    // read up to 8K of bytes or chars; limit is the count actually read
                    inLimit = limit = fromStream ? inStream.read(byteBuf)
                                                 : reader.read(charBuf);
                    if (limit <= 0) { // all input consumed
                        if (len == 0) { // nothing pending
                            return -1;
                        }
                        return precedingBackslash ? len - 1 : len; // a trailing backslash is not counted
                    }
                    off = 0; // restart at the beginning of the buffer
                }

                // (char)(byte & 0xFF) is equivalent to calling a ISO8859-1 decoder.
                // This is the source of the garbling: bytes are widened as ISO8859-1,
                // while chars coming from a Reader are used as-is
                c = (fromStream) ? (char)(byteBuf[off++] & 0xFF) : charBuf[off++];

                if (skipWhiteSpace) { // leading space, tab or form feed
                    if (c == ' ' || c == '\t' || c == '\f') {
                        continue;
                    }
                    // line break at the very start
                    if (!appendedLineBegin && (c == '\r' || c == '\n')) {
                        continue;
                    }
                    skipWhiteSpace = false; // leading whitespace and line breaks are done
                    appendedLineBegin = false;

                }
                // at the start of a logical line
                if (len == 0) { // Still on a new logical line
                    if (c == '#' || c == '!') { // comment line: discard it
                        // Comment, quickly consume the rest of the line

                        // When checking for new line characters a range check,
                        // starting with the higher bound ('\r') means one less
                        // branch in the common case.
                        commentLoop: while (true) {
                            if (fromStream) {
                                byte b;
                                while (off < limit) { // consume to the end of the line
                                    b = byteBuf[off++]; // off is the index that matters
                                    if (b <= '\r' && (b == '\r' || b == '\n'))
                                        break commentLoop;
                                }
                                if (off == limit) { // buffer exhausted: refill (8K)
                                    inLimit = limit = inStream.read(byteBuf);
                                    if (limit <= 0) { // EOF
                                        return -1;
                                    }
                                    off = 0;
                                }
                            } else { // character stream
                                while (off < limit) {
                                    c = charBuf[off++]; // same idea for the char buffer
                                    if (c <= '\r' && (c == '\r' || c == '\n'))
                                        break commentLoop;
                                }
                                if (off == limit) {
                                    inLimit = limit = reader.read(charBuf);
                                    if (limit <= 0) { // EOF
                                        return -1;
                                    }
                                    off = 0;
                                }
                            }
                        }
                        skipWhiteSpace = true; // starting a new line, so leading whitespace may be skipped again
                        continue;
                    }
                }

                if (c != '\n' && c != '\r') { // not a line terminator
                    lineBuf[len++] = c; // append to the line buffer
                    if (len == lineBuf.length) { // buffer full: grow it
                        int maxLen = Integer.MAX_VALUE - 8; // VM allocation limit
                        int newLen = len * 2; // try doubling
                        if (newLen < 0 || newLen > maxLen) { // check for under/overflow
                            newLen = maxLen;
                        }
                        if (newLen <= len) { // still not good? last-ditch attempt then
                           if (len != Integer.MAX_VALUE) {
                               newLen = len + 1;
                           } else {
                               throw new OutOfMemoryError("Required array length too large");
                           }
                        }
                        lineBuf = new char[newLen];
                        // copy the data into the grown buffer
                        System.arraycopy(this.lineBuf, 0, lineBuf, 0, len);
                        this.lineBuf = lineBuf;
                    }
                    // flip the preceding backslash flag
                    precedingBackslash = (c == '\\') ? !precedingBackslash : false;
                } else {
                    // reached EOL
                    if (len == 0) {
                        skipWhiteSpace = true;
                        continue;
                    }
                    // try to read more input
                    if (off >= limit) {
                        inLimit = limit = fromStream ? inStream.read(byteBuf)
                                                     : reader.read(charBuf);
                        off = 0;
                        if (limit <= 0) { // EOF
                            return precedingBackslash ? len - 1 : len;
                        }
                    }
                    // the line ends with a backslash: drop it and continue on the next line
                    if (precedingBackslash) {
                        // backslash at EOL is not part of the line
                        len -= 1;
                        // skip leading whitespace characters in the following line
                        skipWhiteSpace = true;
                        appendedLineBegin = true;
                        precedingBackslash = false;
                        // take care not to include any subsequent \n
                        if (c == '\r') {
                            if (fromStream) {
                                if (byteBuf[off] == '\n') {
                                    off++;
                                }
                            } else {
                                if (charBuf[off] == '\n') {
                                    off++;
                                }
                            }
                        }
                    } else {
                        inOff = off;
                        return len; // return the line length
                    }
                }
            }
        }
    }
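
The key line is the `(char)(byteBuf[off++] & 0xFF)` cast. The sketch below imitates that widening outside the JDK to show that it gives the same result as decoding with ISO 8859-1, and that a value garbled this way can be repaired after the fact, since ISO 8859-1 maps every byte to exactly one char. `Latin1Repair`, `widen` and `repair` are names invented for this illustration, and `repair` assumes the file really was UTF-8.

```java
import java.nio.charset.StandardCharsets;

class Latin1Repair {
    // Imitates LineReader: each byte becomes one char with the same numeric value.
    static String widen(byte[] bytes) {
        char[] chars = new char[bytes.length];
        for (int i = 0; i < bytes.length; i++) {
            chars[i] = (char) (bytes[i] & 0xFF);
        }
        return new String(chars);
    }

    // Undoes the widening and decodes the recovered bytes as UTF-8.
    static String repair(String garbled) {
        return new String(garbled.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4f60\u597d"; // U+4F60 U+597D
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);

        String garbled = widen(utf8);
        System.out.println(garbled.equals(new String(utf8, StandardCharsets.ISO_8859_1))); // true
        System.out.println(repair(garbled).equals(original));                              // true
    }
}
```

Repairing values after loading works, but passing a correctly configured Reader to load avoids the problem in the first place.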

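Summary

When a UTF-8 encoded properties file is loaded into Properties through an InputStream, Chinese text is garbled because each byte is converted to a char as ISO8859-1. load accepts either a byte stream or a character stream, and the character-stream path does no decoding of its own.

In Java, properties are stored in a Hashtable. Loading iterates over the input one byte or one char at a time and assembles it line by line, skipping comment lines and blank lines and dropping a trailing backslash at the end of a line. Each line is buffered as a char[] plus a length, from which the key and the value are extracted. By this logic XML loading would be considerably less efficient, though efficiency is not really a concern for properties.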
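
The line-handling rules summarised above can be observed with a short test input. This is a sketch with invented sample content: two comment lines, a blank line, a value continued onto the next line with a trailing backslash, and a key with an empty value.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class LogicalLineDemo {
    // Parses properties text held in a String.
    static Properties parse(String text) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return p;
    }

    public static void main(String[] args) {
        String text =
                "# a comment line\n"
              + "! another comment line\n"
              + "\n"                        // blank line, ignored
              + "fruits = apple, \\\n"      // trailing backslash: continues on the next line
              + "         banana\n"         // leading whitespace of the continuation is skipped
              + "empty=\n";
        Properties p = parse(text);
        System.out.println(p.size());                 // 2
        System.out.println(p.getProperty("fruits"));  // apple, banana
        System.out.println(p.getProperty("empty").isEmpty()); // true
    }
}
```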


Reposted from blog.csdn.net/qq_43842093/article/details/132792620