Java: get the character encoding of a file

When reading a character file, you must supply the file's encoding in order to get the expected text content; if the encoding used for reading does not match the file's actual encoding, the result is garbled text. This article presents several ways to detect a file's encoding. The discussion is limited to Chinese encodings in the narrow sense, namely UTF-8 and GB2312 (GBK); other, non-Chinese encodings are not considered.

This article therefore starts from a simple setup: two sample files, GB2312.txt and UTF-8.txt, whose actual encodings we then try to detect in code. Three methods are covered in total, detailed below.
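For reproducibility, here is a minimal sketch that creates the two test files; the directory c:\test and the sample sentence are my own illustrative assumptions, not from the original:

import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class MakeTestFiles {
    public static void main(String[] args) throws Exception {
        String text = "中文编码测试"; // sample Chinese text (assumed)
        // Write the same text twice, once per encoding under test.
        Files.write(Paths.get("c:\\test\\GB2312.txt"), text.getBytes(Charset.forName("GB2312")));
        Files.write(Paths.get("c:\\test\\UTF-8.txt"), text.getBytes(StandardCharsets.UTF_8));
    }
}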

JDK built-in Charset

I have not actually put this method into practice; I only heard it mentioned in passing. The idea is to use the JDK's built-in Charset: canEncode checks whether content can be encoded with a given charset, and I have also seen the approach of reading the file content, converting it to a String, and comparing new String(content.getBytes(encoding), encoding) with the original content. If the two match, the content survives a round trip through the candidate encoding, which suggests that encoding is the file's encoding. I have no particular mastery of this method, so it is only mentioned here.
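Since only the idea is described above, a minimal sketch of that round-trip check might look like this (the file path and the candidate charsets, UTF-8 and GBK, are illustrative assumptions):

import java.io.File;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.util.Arrays;

public class RoundTripProbe {

    // Decode the bytes with the candidate charset, re-encode the result,
    // and check whether the original bytes come back. Malformed input
    // turns into replacement characters during decoding, so a lossy
    // round trip will not reproduce the original bytes.
    static boolean roundTrips(byte[] bytes, Charset cs) {
        String decoded = new String(bytes, cs);
        return Arrays.equals(bytes, decoded.getBytes(cs));
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = Files.readAllBytes(new File("c:\\test\\GB2312.txt").toPath());
        for (Charset cs : new Charset[] {StandardCharsets.UTF_8, Charset.forName("GBK")}) {
            System.out.println(cs.name() + " round-trips: " + roundTrips(bytes, cs));
        }
    }
}

Be aware that GBK's decoder accepts most byte sequences, so a successful round trip is necessary but not sufficient evidence; in practice this check is mainly useful for ruling an encoding out (for example, GB2312-encoded Chinese text will fail the UTF-8 round trip).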

Guava's Utf8 utility class

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.junit.Test;

import com.google.common.base.Utf8;

@Test
public void testGuava() throws Exception {
    byte[] gb2312Bytes = FileUtils.readFileToByteArray(new File("c:\\test\\GB2312.txt"));
    System.out.println("Is GB2312.txt well-formed UTF-8: " + Utf8.isWellFormed(gb2312Bytes));
    byte[] utf8Bytes = FileUtils.readFileToByteArray(new File("c:\\test\\UTF-8.txt"));
    System.out.println("Is UTF-8.txt well-formed UTF-8: " + Utf8.isWellFormed(utf8Bytes));
}

The output for GB2312.txt is false and for UTF-8.txt is true. Note that this example only answers whether the bytes of the file content are well-formed UTF-8; it cannot tell you what the encoding actually is.

GitHub open source project cpdetector

After experimenting with the GitHub open source project cpdetector for a while, I found that it can report the encoding of a file given either a file stream or a file address, and the implementation works reasonably well. I have only spent a short time analyzing it, but the output looks promising; before using it in a real project it would need a further round of detailed exploration, for example registering more detector instances on detectorProxy. The reference code is as follows:

import java.io.File;

import org.junit.Test;

import info.monitorenter.cpdetector.io.ASCIIDetector;
import info.monitorenter.cpdetector.io.CodepageDetectorProxy;
import info.monitorenter.cpdetector.io.JChardetFacade;
import info.monitorenter.cpdetector.io.UnicodeDetector;

@Test
public void testCpdetector() throws Exception {
    File file1 = new File("c:\\test\\GB2312.txt");
    File file2 = new File("c:\\test\\UTF-8.txt");
    CodepageDetectorProxy detectorProxy = CodepageDetectorProxy.getInstance();
    // Register the individual detectors the proxy tries in order.
    detectorProxy.add(ASCIIDetector.getInstance());
    detectorProxy.add(UnicodeDetector.getInstance());
    detectorProxy.add(JChardetFacade.getInstance());
    System.out.println(detectorProxy.detectCodepage(file1.toURI().toURL()) + " ---- expected GB2312");
    System.out.println(detectorProxy.detectCodepage(file2.toURI().toURL()) + " ---- expected UTF-8");
}

Apache Tika

The Apache Tika open source project can detect and extract metadata and structured content from documents in many different formats (HTML, PDF, Office, jar, zip, mp3 and so on, over a thousand file types). Tika is used for search engine indexing, content analysis, translation, and similar tasks, so merely detecting a file's encoding is certainly within its reach. I only skimmed the official website, and the few examples I wrote and ran did not immediately produce the result I wanted (though I am sure it can be done), so I will leave it at that and study it further when a similar need arises.
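Although my own attempts did not pan out, for reference a sketch along these lines using the CharsetDetector class shipped in Tika's tika-parsers module (ported from ICU4J) would look roughly like this; the file path reuses the test file above, and I have not verified the output for these files:

import java.io.File;

import org.apache.commons.io.FileUtils;
import org.apache.tika.parser.txt.CharsetDetector;
import org.apache.tika.parser.txt.CharsetMatch;

public class TikaCharsetProbe {

    public static void main(String[] args) throws Exception {
        byte[] bytes = FileUtils.readFileToByteArray(new File("c:\\test\\GB2312.txt"));
        CharsetDetector detector = new CharsetDetector();
        detector.setText(bytes);
        // detect() returns the best-guess match; detectAll() would return
        // every candidate ranked by confidence.
        CharsetMatch match = detector.detect();
        System.out.println(match.getName() + " (confidence " + match.getConfidence() + ")");
    }
}

One caveat: the detector may report a superset charset such as GB18030 rather than GB2312 itself, since GB2312 text is also valid GB18030.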

More details: https://www.chendd.cn/blog/article/1550863959014342657.html

Origin: blog.csdn.net/haiyangyiba/article/details/129087357