Unicode and unicode-escape in Python

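In Python, text is held in memory as Unicode. To store data in a file, we generally first encode it with a specific encoding such as UTF-8 or GBK.

When reading the data back, we decode it with the same encoding.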

 
    # python3
    >>> s = '中国'
    >>> a = s.encode()
    >>> a
    b'\xe4\xb8\xad\xe5\x9b\xbd'
    >>> b = a.decode()
    >>> b
    '中国'
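The file round trip described above can be sketched as follows. This is a minimal example, assuming a temporary file and GBK as the on-disk encoding; both choices are only for illustration. Note that `str.encode()` and `bytes.decode()` with no argument default to UTF-8 in Python 3.

```python
import os
import tempfile

text = '中国'

# Create a temporary file path (illustrative; any writable path would do).
fd, path = tempfile.mkstemp(suffix='.txt')
os.close(fd)
try:
    # Encode to GBK while writing.
    with open(path, 'w', encoding='gbk') as f:
        f.write(text)

    # Look at the raw bytes that actually ended up on disk.
    with open(path, 'rb') as f:
        raw = f.read()

    # Decode with the same encoding while reading.
    with open(path, 'r', encoding='gbk') as f:
        restored = f.read()
finally:
    os.remove(path)

print(raw)               # b'\xd6\xd0\xb9\xfa'
print(restored == text)  # True
```

Opening the file with a different `encoding` than the one used for writing would either raise `UnicodeDecodeError` or silently produce the wrong characters.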

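However, there is also a unicode-escape codec. It stores the Unicode code point values directly, writing each non-ASCII character as an ASCII escape sequence such as \u4e2d: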

 
    # python3
    >>> s = '中国'
    >>> b = s.encode('unicode-escape')
    >>> b
    b'\\u4e2d\\u56fd'
    >>> c = b.decode('unicode-escape')
    >>> c
    '中国'
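To make the link between the escaped bytes and the code points explicit, here is a small sketch that rebuilds the same byte string by hand from `ord()`. The variable names are only illustrative.

```python
s = '中国'

escaped = s.encode('unicode-escape')

# Hex code point of each character, zero-padded to four digits.
code_points = [format(ord(ch), '04x') for ch in s]

# Rebuild the escape sequences by hand: a backslash, 'u', then the code point.
rebuilt = ''.join('\\u' + cp for cp in code_points).encode('ascii')

print(code_points)                            # ['4e2d', '56fd']
print(escaped == rebuilt)                     # True
print(all(byte < 128 for byte in escaped))    # True: the result is pure ASCII
print(len(escaped), len(s.encode('utf-8')))   # 12 6
print(escaped.decode('unicode-escape') == s)  # True
```

The escaped form is pure ASCII but twice as long as the UTF-8 encoding for this string, so it is more useful for inspecting or transporting text through ASCII-only channels than for compact storage.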

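Extension: there is also a string-escape codec. In Python 2 it can be applied to byte strings (str). In the session below, s holds GBK bytes, which is why decoding it as UTF-8 fails: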

 
    # python2
    >>> s = '中国'
    >>> a = s.decode('gbk')
    >>> print a
    中国
    >>> b = s.decode('utf-8')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "D:\python\python2.7\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xd6 in position 0: invalid continuation byte
    >>> c = s.decode('string-escape')
    >>> print c
    中国
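The same `UnicodeDecodeError` occurs in Python 3 when GBK bytes are decoded as UTF-8. One way to cope with it is to try several candidate encodings in order. The sketch below assumes UTF-8 and GBK as candidates; `decode_with_fallback` is a hypothetical helper, not a standard library function.

```python
def decode_with_fallback(data, encodings=('utf-8', 'gbk')):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            # This codec rejected the bytes; try the next candidate.
            continue
    raise ValueError('none of the candidate encodings could decode the data')


gbk_bytes = '中国'.encode('gbk')  # starts with 0xd6, which UTF-8 rejects here
text, used = decode_with_fallback(gbk_bytes)
print(text, used)  # 中国 gbk
```

UTF-8 is tried first because it is strict and rejects most byte sequences that were not produced by a UTF-8 encoder, whereas a more permissive codec placed first could succeed with the wrong characters.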

chardet.detect()

Encoding detection with chardet.detect() is often inaccurate. For example, when the input contains too little Chinese text, it may be identified as IBM855:

 
    # python3
    >>> import chardet
    >>> s = '中国'
    >>> c = s.encode('gbk')
    >>> chardet.detect(c)
    {'encoding': 'IBM855', 'confidence': 0.7679697235616183, 'language': 'Russian'}

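Note: IBM855 is code page 855, the OEM Cyrillic code page.

With more Chinese text, detection is accurate (chardet reports GB2312, the character set that GBK extends):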

 
    >>> s = '中国范文芳威风威风'
    >>> c = s.encode('gbk')
    >>> chardet.detect(c)
    {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
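Since the confidence value differs so much between the two results, one possible safeguard is to accept a detection only above some threshold and otherwise fall back to a known default. The sketch below is an assumption-laden illustration: `choose_encoding`, the 0.9 threshold, and the GBK fallback are all hypothetical choices, and the function only inspects a result dictionary of the shape shown above, so it runs without chardet installed.

```python
def choose_encoding(result, min_confidence=0.9, fallback='gbk'):
    """Return the detected encoding only if its confidence is high enough.

    `result` is a dict shaped like the output of chardet.detect().
    The threshold and fallback are illustrative assumptions.
    """
    encoding = result.get('encoding')
    confidence = result.get('confidence') or 0.0
    if encoding and confidence >= min_confidence:
        return encoding
    return fallback


# The two detection results from the sessions above.
short_result = {'encoding': 'IBM855', 'confidence': 0.7679697235616183,
                'language': 'Russian'}
long_result = {'encoding': 'GB2312', 'confidence': 0.99,
               'language': 'Chinese'}

print(choose_encoding(short_result))  # gbk
print(choose_encoding(long_result))   # GB2312
```

With chardet installed, the helper would be called as `choose_encoding(chardet.detect(data))`. A threshold does not make detection correct; it only avoids trusting low-confidence guesses on short inputs.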
      
