KKFileView (Part 6)

2021SC@SDUSC

File Encoding

1. Text file encodings

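ANSI: The system's default text storage format. On Windows, "ANSI" refers to the active legacy code page, so the actual bytes depend on the system locale (for example, GBK, code page 936, on Simplified Chinese Windows).

UTF-8: UTF stands for Universal Character Set Transformation Format. UTF-8 is the 8-bit encoding form of Unicode: each code point is stored as one to four bytes, and ASCII characters keep their single-byte values. Choose UTF-8 when the data has to pass through legacy transmission media that support only 8 significant bits per byte.

Unicode: The union of all the world's major scripts, including the common character sets used on commercial and personal computers. When a file is saved in Unicode format, Unicode control characters can be used to help indicate the language coverage of the text. In older versions of Windows Notepad, where these four option names come from, "Unicode" means UTF-16 little-endian.

Unicode big endian: In a Unicode file created on a big-endian processor (such as older Apple Macintosh computers), the bytes of each character are stored in the opposite order from a file created on an Intel processor.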
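To make the differences between these formats concrete, the following minimal sketch prints the bytes that Java produces for the single character U+4E2D ("中") under several encodings. The class name EncodingBytesDemo and the helper toHex are illustrative names, not part of KKFileView.

```java
import java.nio.charset.StandardCharsets;

public class EncodingBytesDemo {
    // Illustrative helper: formats bytes as space-separated uppercase hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "\u4E2D"; // the character "中"

        // UTF-8 stores this code point in three bytes.
        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_8)));    // E4 B8 AD

        // UTF-16 little-endian: low byte first (Notepad's "Unicode").
        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_16LE))); // 2D 4E

        // UTF-16 big-endian: high byte first ("Unicode big endian").
        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_16BE))); // 4E 2D

        // Java's plain UTF-16 encoder writes a big-endian byte-order mark first.
        System.out.println(toHex(text.getBytes(StandardCharsets.UTF_16)));   // FE FF 4E 2D
    }
}
```

The same character therefore yields different byte sequences depending on the format, which is why a viewer has to guess the encoding before it can decode a text file correctly.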

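2. Detecting the encoding of a text file

The detectEncoding function takes a file and returns one of the enumerated encoding constants (GB2312, HZ, BIG5, EUC_TW, ASCII, or OTHER). It examines the file, assigns a probability score to each encoding type, and returns the type with the highest score. The method first reads the file into a byte array and then passes it to the overloaded detectEncoding function. The code is as follows: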

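// Requires java.io.File and java.io.FileInputStream.
public int detectEncoding(File testfile) {
    FileInputStream chinesefile;
    byte[] rawtext;
    // Allocate a buffer as large as the file.
    rawtext = new byte[(int) testfile.length()];
    try {
        chinesefile = new FileInputStream(testfile);
        // A single read() call is not guaranteed to fill the buffer.
        chinesefile.read(rawtext);
        chinesefile.close();
    } catch (Exception e) {
        // The error is only logged; detection then runs on whatever was read.
        System.err.println("Error: " + e);
    }
    // Delegate to the byte-array overload.
    return detectEncoding(rawtext);
}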
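The listing above relies on one read() call and continues after an I/O error, so a short read or a failure would leave the scoring to run on a partly zero-filled buffer. A more defensive way to obtain the bytes could look like the sketch below. It is an assumption-laden alternative, not the KKFileView code: the class ReadAllBytesSketch and the helpers readFully and roundTrip are hypothetical names introduced for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class ReadAllBytesSketch {
    // Hypothetical helper: returns the complete file content or fails loudly.
    static byte[] readFully(Path file) {
        try {
            // Files.readAllBytes reads until end of file and closes the file.
            return Files.readAllBytes(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical demo helper: writes data to a temp file and reads it back.
    static byte[] roundTrip(byte[] data) {
        try {
            Path tmp = Files.createTempFile("encoding-demo", ".txt");
            try {
                Files.write(tmp, data);
                return readFully(tmp);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The three UTF-8 bytes of U+4E2D.
        byte[] sample = {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD};
        System.out.println(roundTrip(sample).length); // 3
    }
}
```

With a helper of this kind, the byte array passed to the scoring step is either the full file content or the call fails, so the detector is never asked to score a buffer that was not actually read.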

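detectEncoding (byte-array overload): takes a byte array and returns one of the enumerated encoding constants. The original comment lists GB2312, HZ, BIG5, EUC_TW, ASCII, and OTHER, although the code scores many more types. The function examines the byte array, assigns a probability score to each encoding type, and returns the type with the highest score.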

public int detectEncoding(byte[] rawtext) {
    int[] scores;
    int index, maxscore = 0;
    int encoding_guess = OTHER;
    scores = new int[TOTALTYPES];
    // Assign Scores
    scores[GB2312] = gb2312_probability(rawtext);
    scores[GBK] = gbk_probability(rawtext);
    scores[GB18030] = gb18030_probability(rawtext);
    scores[HZ] = hz_probability(rawtext);
    scores[BIG5] = big5_probability(rawtext);
    scores[CNS11643] = euc_tw_probability(rawtext);
    scores[ISO2022CN] = iso_2022_cn_probability(rawtext);
    scores[UTF8] = utf8_probability(rawtext);
    scores[UNICODE] = utf16_probability(rawtext);
    scores[EUC_KR] = euc_kr_probability(rawtext);
    scores[CP949] = cp949_probability(rawtext);
    // Types fixed at 0 can never be selected by the loop below.
    scores[JOHAB] = 0;
    scores[ISO2022KR] = iso_2022_kr_probability(rawtext);
    scores[ASCII] = ascii_probability(rawtext);
    scores[SJIS] = sjis_probability(rawtext);
    scores[EUC_JP] = euc_jp_probability(rawtext);
    scores[ISO2022JP] = iso_2022_jp_probability(rawtext);
    scores[UNICODET] = 0;
    scores[UNICODES] = 0;
    scores[ISO2022CN_GB] = 0;
    scores[ISO2022CN_CNS] = 0;
    scores[OTHER] = 0;
    // Tabulate Scores: keep the first index with the strictly highest score
    for (index = 0; index < TOTALTYPES; index++) {
        if (debug)
            System.err.println("Encoding " + nicename[index] + " score " + scores[index]);
        if (scores[index] > maxscore) {
            encoding_guess = index;
            maxscore = scores[index];
        }
    }
    // Return OTHER if nothing scored above 50
    if (maxscore <= 50) {
        encoding_guess = OTHER;
    }
    return encoding_guess;
}
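The selection logic (score every candidate, keep the highest, fall back to OTHER at 50 or below) can be tried in isolation with a reduced sketch. Everything here is an assumption made for illustration: the class ScoreSelectionSketch, the three constants, and the scorers asciiScore and utf8Score are simplified stand-ins, not the gb2312_probability, utf8_probability, or other functions used by KKFileView.

```java
import java.nio.charset.StandardCharsets;

public class ScoreSelectionSketch {
    // Illustrative constants; the real class defines many more types.
    static final int ASCII = 0, UTF8 = 1, OTHER = 2, TOTAL = 3;

    // Hypothetical scorer: 100 if the input is non-empty and purely 7-bit.
    static int asciiScore(byte[] raw) {
        for (byte b : raw) {
            if (b < 0) {
                return 0; // high bit set, so not ASCII
            }
        }
        return raw.length > 0 ? 100 : 0;
    }

    // Hypothetical scorer: percentage of non-ASCII bytes that belong to
    // multi-byte sequences with a valid lead byte and continuation bytes.
    static int utf8Score(byte[] raw) {
        int nonAscii = 0, good = 0, i = 0;
        while (i < raw.length) {
            int b = raw[i] & 0xFF;
            if (b < 0x80) {
                i++;
                continue;
            }
            int extra = (b >= 0xC2 && b <= 0xDF) ? 1
                      : (b >= 0xE0 && b <= 0xEF) ? 2
                      : (b >= 0xF0 && b <= 0xF4) ? 3 : 0;
            boolean ok = extra > 0 && i + extra < raw.length;
            for (int k = 1; ok && k <= extra; k++) {
                ok = (raw[i + k] & 0xC0) == 0x80; // continuation byte 10xxxxxx
            }
            if (ok) {
                good += 1 + extra;
                nonAscii += 1 + extra;
                i += 1 + extra;
            } else {
                nonAscii++;
                i++;
            }
        }
        return nonAscii == 0 ? 0 : 100 * good / nonAscii;
    }

    // Same selection pattern as the listing: strict maximum, then a threshold.
    static int guess(byte[] raw) {
        int[] scores = new int[TOTAL];
        scores[ASCII] = asciiScore(raw);
        scores[UTF8] = utf8Score(raw);
        scores[OTHER] = 0;
        int best = OTHER, max = 0;
        for (int i = 0; i < TOTAL; i++) {
            if (scores[i] > max) {
                best = i;
                max = scores[i];
            }
        }
        return max <= 50 ? OTHER : best;
    }

    public static void main(String[] args) {
        System.out.println(guess("hello".getBytes(StandardCharsets.US_ASCII))); // 0 (ASCII)
        System.out.println(guess("\u4E2D".getBytes(StandardCharsets.UTF_8)));   // 1 (UTF8)
        System.out.println(guess(new byte[] {(byte) 0xD6, (byte) 0xD0}));       // 2 (OTHER)
    }
}
```

Two properties of the pattern are worth noting. Because the comparison is a strict greater-than, a tie is resolved in favour of the encoding with the lower index. And because of the final threshold, an input whose best score is 50 or less is reported as OTHER even if one candidate scored higher than the rest.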

Reposted from blog.csdn.net/eldrida1/article/details/121620263