2021SC@SDUSC

File Encoding

1. Text file encodings

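ANSI: the system's default text storage format. On Windows this means the legacy code page of the current locale, so the same "ANSI" file can decode differently on machines with different regional settings.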

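UTF-8: UTF stands for Universal Character Set Transformation Format (also commonly expanded as Unicode Transformation Format). UTF-8 is the 8-bit encoding form of Unicode: each code point is stored as one to four bytes, and ASCII characters keep their single-byte values. It is the format to choose when the text must pass through older transmission media or tools that handle data one 8-bit byte at a time.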
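To make the variable-length property concrete, here is a minimal sketch that counts how many bytes UTF-8 needs for a few characters. The class and method names (`Utf8Lengths`, `utf8Length`) are made up for illustration and are not part of the code discussed in this post.

```java
import java.nio.charset.StandardCharsets;

public class Utf8Lengths {
    // Illustrative helper: number of bytes UTF-8 uses to encode the string.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        System.out.println(utf8Length("A"));            // 1 (ASCII letter)
        System.out.println(utf8Length("\u00e9"));       // 2 (Latin small e with acute)
        System.out.println(utf8Length("\u4e2d"));       // 3 (a CJK ideograph)
        System.out.println(utf8Length("\uD83D\uDE00")); // 4 (an emoji outside the BMP)
    }
}
```

The characters are written as Unicode escapes so that the example does not depend on the encoding of the source file itself.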

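Unicode: a universal character set that unifies all of the world's major writing systems, including the character sets commonly used on business and personal computers. When a file is saved in Unicode format, Unicode control characters can be used to help describe the language coverage of the text. In Windows Notepad, the option labeled "Unicode" stores the text as little-endian UTF-16.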

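Unicode big endian: Unicode (UTF-16) files created on big-endian processors (such as older PowerPC-based Apple Macintosh computers) store the bytes of each character in the opposite order from files created on Intel processors.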
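The difference in byte order is easy to see by encoding one character both ways. The following is a small sketch; `ByteOrderDemo` and `hex` are hypothetical names used only for this illustration.

```java
import java.nio.charset.StandardCharsets;

public class ByteOrderDemo {
    // Illustrative helper: formats bytes as space-separated uppercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X ", b & 0xFF));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // code point U+4E2D
        // Big endian: most significant byte first.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16BE))); // 4E 2D
        // Little endian (the order used on Intel processors): least significant byte first.
        System.out.println(hex(s.getBytes(StandardCharsets.UTF_16LE))); // 2D 4E
    }
}
```

Both charsets used here write no byte order mark, so the output shows only the two bytes of the character itself.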

2. Detecting the encoding of a text file

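The detectEncoding function (file overload): takes a file as its parameter and returns one of the integer encoding constants (GB2312, HZ, BIG5, EUC_TW, ASCII, or OTHER). It examines the file, assigns a probability score to each encoding type, and returns the type with the highest score. The method first reads the file into a byte array and then hands that array to the overloaded detectEncoding function. The code: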

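// File overload: load the raw bytes, then delegate to detectEncoding(byte[]).
public int detectEncoding(File testfile) {
        FileInputStream chinesefile;
        byte[] rawtext;
        // Allocate a buffer as large as the file (the length is cast to int).
        rawtext = new byte[(int) testfile.length()];
        try {
            chinesefile = new FileInputStream(testfile);
            // Note: read() may return before the buffer is full; its return value is ignored here.
            chinesefile.read(rawtext);
            chinesefile.close();
        } catch (Exception e) {
            // The error is only logged; detection then runs on whatever the buffer holds.
            System.err.println("Error: " + e);
        }
        return detectEncoding(rawtext);
    }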
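The method above ignores the return value of `read`, which is allowed to fill only part of the buffer, and it keeps going after an I/O error. As a hedged alternative, and not what the quoted source does, the file could be loaded with `Files.readAllBytes`, which either returns the complete contents or throws. `ReadAllBytesSketch`, `readFully`, and `roundTrip` are hypothetical names for this sketch.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.util.Arrays;

public class ReadAllBytesSketch {
    // Hypothetical helper: returns the complete file contents or fails loudly.
    static byte[] readFully(File file) {
        try {
            return Files.readAllBytes(file.toPath());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes data to a temporary file and checks that readFully returns it unchanged.
    static boolean roundTrip(byte[] data) {
        try {
            File tmp = File.createTempFile("encoding-demo", ".txt");
            try {
                Files.write(tmp.toPath(), data);
                return Arrays.equals(data, readFully(tmp));
            } finally {
                tmp.delete();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] sample = {(byte) 0xE4, (byte) 0xB8, (byte) 0xAD}; // UTF-8 bytes of U+4E2D
        System.out.println(roundTrip(sample)); // true
    }
}
```

The byte array returned by such a helper could then be passed to the byte-array overload of detectEncoding in the same way as `rawtext`.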

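detectEncoding (byte-array overload): takes a byte array and returns one of the integer encoding constants. Its description names only GB2312, HZ, BIG5, EUC_TW, ASCII, and OTHER, but the code that follows also scores GBK, GB18030, ISO-2022-CN, UTF-8, UTF-16, and several Korean and Japanese encodings. The function examines the byte array, assigns a probability score to each encoding type, and returns the type with the highest score; if no score is above 50, it returns OTHER.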

public int detectEncoding(byte[] rawtext) {
        int[] scores;
        int index, maxscore = 0;
        int encoding_guess = OTHER;
        scores = new int[TOTALTYPES];
        // Assign Scores
        scores[GB2312] = gb2312_probability(rawtext);
        scores[GBK] = gbk_probability(rawtext);
        scores[GB18030] = gb18030_probability(rawtext);
        scores[HZ] = hz_probability(rawtext);
        scores[BIG5] = big5_probability(rawtext);
        scores[CNS11643] = euc_tw_probability(rawtext);
        scores[ISO2022CN] = iso_2022_cn_probability(rawtext);
        scores[UTF8] = utf8_probability(rawtext);
        scores[UNICODE] = utf16_probability(rawtext);
        scores[EUC_KR] = euc_kr_probability(rawtext);
        scores[CP949] = cp949_probability(rawtext);
        scores[JOHAB] = 0;
        scores[ISO2022KR] = iso_2022_kr_probability(rawtext);
        scores[ASCII] = ascii_probability(rawtext);
        scores[SJIS] = sjis_probability(rawtext);
        scores[EUC_JP] = euc_jp_probability(rawtext);
        scores[ISO2022JP] = iso_2022_jp_probability(rawtext);
        scores[UNICODET] = 0;
        scores[UNICODES] = 0;
        scores[ISO2022CN_GB] = 0;
        scores[ISO2022CN_CNS] = 0;
        scores[OTHER] = 0;
        // Tabulate Scores
        for (index = 0; index < TOTALTYPES; index++) {
            if (debug)
                System.err.println("Encoding " + nicename[index] + " score " + scores[index]);
            if (scores[index] > maxscore) {
                encoding_guess = index;
                maxscore = scores[index];
            }
        }
        // Return OTHER if nothing scored above 50
        if (maxscore <= 50) {
            encoding_guess = OTHER;
        }
        return encoding_guess;
    }
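The selection step at the end of the method can be isolated into a few lines. The sketch below reproduces only that logic with three made-up constants and hand-written scores; `ScoreSelection` and `pickEncoding` are hypothetical names, and the real probability functions are not modeled.

```java
public class ScoreSelection {
    // Made-up constants for this sketch; the real class defines many more.
    static final int ASCII = 0;
    static final int UTF8 = 1;
    static final int OTHER = 2;

    // Picks the index with the highest score, or OTHER if nothing scores above 50.
    static int pickEncoding(int[] scores) {
        int guess = OTHER;
        int maxScore = 0;
        for (int index = 0; index < scores.length; index++) {
            // Strict comparison: on a tie, the earlier index keeps the lead.
            if (scores[index] > maxScore) {
                guess = index;
                maxScore = scores[index];
            }
        }
        return maxScore <= 50 ? OTHER : guess;
    }

    public static void main(String[] args) {
        System.out.println(pickEncoding(new int[] {10, 95, 0})); // 1 (UTF8 wins)
        System.out.println(pickEncoding(new int[] {40, 50, 0})); // 2 (best score is not above 50)
        System.out.println(pickEncoding(new int[] {80, 80, 0})); // 0 (tie goes to the earlier index)
    }
}
```

Two consequences follow from this structure: the order of the constants decides ties, and a weak best guess is reported as OTHER instead of being returned with low confidence.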

KKFileView (Part 6)
