Playing with code | Base64-encoding Chinese (UTF-8) strings in JavaScript

Table of contents

Base64 codec

UTF-8 string codec

Solution

Analysis

utf8_to_b64

b64_to_utf8

The unescape and escape methods are deprecated

Reason

Solution

Base64 codec under Node.js


Base64 codec

Base64 is a positional notation with base 64. 64 is the largest power of 2 whose digits can all be represented with printable ASCII characters, which makes Base64 useful as a transfer encoding for email. The digits use the characters A-Z, a-z, and 0-9 for the first 62 values; the two symbols used for the last two values differ between systems. Some other encodings, such as uuencode and later versions of BinHex, use a different set of 64 characters to represent 6 binary digits, but they are not called Base64.
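To make the 6-bit grouping concrete, the sketch below encodes one group of three bytes into four Base64 characters by hand. It assumes the common alphabet ending in "+" and "/" (as noted above, other systems choose different final symbols), and encodeTriple is a hypothetical helper name, not a built-in.

```javascript
// Common Base64 alphabet; the last two symbols are an assumption.
const ALPHABET =
  "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

// Pack three 8-bit bytes into 24 bits, then read them back as four 6-bit digits.
function encodeTriple(b0, b1, b2) {
  const n = (b0 << 16) | (b1 << 8) | b2;
  return (
    ALPHABET[(n >> 18) & 63] +
    ALPHABET[(n >> 12) & 63] +
    ALPHABET[(n >> 6) & 63] +
    ALPHABET[n & 63]
  );
}

const man = encodeTriple(0x4d, 0x61, 0x6e); // bytes of "Man" -> "TWFu"
const ce = encodeTriple(0xe6, 0xb5, 0x8b); // UTF-8 bytes of "测" -> "5rWL"
```

Input whose length is not a multiple of three needs "=" padding, which this sketch leaves out.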

 

JavaScript provides two native functions for encoding and decoding Base64 strings:

  • btoa(): Creates a Base64-encoded ASCII string from a "string" of binary data ("btoa" means "binary to ASCII").
  • atob(): Decodes a Base64-encoded string ("atob" means "ASCII to binary").

They can be called as window.btoa(string) and window.atob(base64string), which is very convenient.
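A minimal round trip with a plain ASCII string looks like this. The example calls the global btoa and atob directly, which is equivalent to going through window in a browser.

```javascript
// Encode an ASCII string to Base64, then decode it again.
const encoded = btoa("Hello"); // "SGVsbG8="
const decoded = atob(encoded); // "Hello"
```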

UTF-8 string codec

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode. It can represent any character in the Unicode standard, and its single-byte encodings are identical to ASCII, so software written to process ASCII text can continue to be used with little or no modification. It has therefore gradually become the preferred encoding for email, web pages, and other applications that store or transmit text.

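As mentioned above, btoa and atob work on "binary strings", in which every character must have a code in the range 0x00 to 0xFF (often loosely described as "ASCII only"). They do not support arbitrary Unicode characters: calling btoa on a string that contains characters outside that range throws an error in most browsers (reported as Character Out Of Range in some of them).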
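The sketch below shows where the limit sits. The exact error name and message vary by environment (current browsers typically report an InvalidCharacterError), so the example only records that an error was thrown.

```javascript
// "é" is U+00E9, which fits in a single byte, so btoa accepts it.
const latin1Result = btoa("é"); // "6Q=="

// "测试" contains characters above 0xFF, so btoa throws.
let errorName = null;
try {
  btoa("测试");
} catch (err) {
  errorName = err.name; // typically "InvalidCharacterError"
}
```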

Solution

We can convert the string into a sequence of UTF-8 bytes before encoding it, and convert the decoded bytes back into a Unicode string when decoding.

function utf8_to_b64(str) {
  return btoa(unescape(encodeURIComponent(str)));
}

function b64_to_utf8(str) {
  return decodeURIComponent(escape(atob(str)));
}

// Usage:
utf8_to_b64("测试"); // "5rWL6K+V"
b64_to_utf8("5rWL6K+V"); // "测试"

Analysis

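The nesting in the middle looks strange. What is happening?

The key is that encodeURIComponent and decodeURIComponent treat their string arguments as UTF-8: the former percent-encodes the UTF-8 bytes of a string, and the latter decodes percent-encoded UTF-8 bytes back into a string.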

utf8_to_b64

Look at the utf8_to_b64 method first.

encodeURIComponent first converts the string into its UTF-8 bytes, written in %XX%XX hexadecimal notation. unescape then turns each %XX sequence into the single character with that code, producing a string in which every character lies in the range 0x00 to 0xFF, which btoa can accept. Finally, btoa encodes that string to Base64.

encodeURIComponent("测试"); // "%E6%B5%8B%E8%AF%95"
unescape("%E6%B5%8B%E8%AF%95"); // "æµ\x8Bè¯\x95"
btoa("æµ\x8Bè¯\x95"); // "5rWL6K+V"

In short, the process converts the string into a binary string of its UTF-8 bytes and then Base64-encodes it.
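One way to confirm that the intermediate string really holds UTF-8 bytes is to print the character codes in hexadecimal. This is only an inspection sketch; the variable names are illustrative.

```javascript
// Each character of the intermediate string stands for one UTF-8 byte.
const binary = unescape(encodeURIComponent("测试"));
const hexBytes = Array.from(binary, (c) => c.charCodeAt(0).toString(16));
// hexBytes: ["e6", "b5", "8b", "e8", "af", "95"]
// binary.length is 6, although the original string has only 2 characters.
```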

b64_to_utf8

Now look at the b64_to_utf8 method.

It is the same process in reverse. First, atob decodes the Base64 string into a binary string; then escape converts that string into percent-encoded hexadecimal notation; finally, decodeURIComponent parses the percent-encoded bytes as UTF-8 and returns the original string.

atob("5rWL6K+V"); // "æµ\x8Bè¯\x95"
escape("æµ\x8Bè¯\x95"); // "%E6%B5%8B%E8%AF%95"
decodeURIComponent("%E6%B5%8B%E8%AF%95"); // "测试"
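One consequence of the last step is that decodeURIComponent throws a URIError when the decoded bytes are not valid UTF-8. The sketch below wraps the decoder to make that visible; tryDecode is a hypothetical helper added for illustration, not part of the original solution.

```javascript
function b64_to_utf8(str) {
  return decodeURIComponent(escape(atob(str)));
}

// Returns the decoded text, or null when the bytes are not valid UTF-8.
function tryDecode(b64) {
  try {
    return b64_to_utf8(b64);
  } catch (err) {
    if (err instanceof URIError) return null;
    throw err; // anything else (e.g. malformed Base64) is passed through
  }
}

const ok = tryDecode("5rWL6K+V"); // "测试"
const bad = tryDecode("/w=="); // null: the single byte 0xFF is not valid UTF-8
```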

The unescape and escape methods are deprecated

Reason

This feature has been removed from the web standards. Although some browsers still support it, they may stop supporting it in the future, so avoid using it where possible.

The unescape and escape methods have been marked as obsolete. The recommendation is to use decodeURI or decodeURIComponent in place of unescape, and encodeURI or encodeURIComponent in place of escape.

According to the Wikipedia article on percent-encoding, when escape handles a character above 0xFF it uses the character's Unicode code directly, prefixed with "%u", whereas encodeURI first converts the character to UTF-8 and then prefixes each UTF-8 byte with "%":

RFC 3986, published in January 2005, recommends that new URIs must not percent-encode unreserved characters; other characters should be converted to UTF-8 byte sequences, and each byte value then percent-encoded. Earlier URIs are not affected by this standard.
There is also a non-standard representation of Unicode characters in URIs: %uxxxx, where xxxx is the Unicode code point value written as 4 hexadecimal digits. No RFC defines this representation, and it has been rejected by the W3C. The third edition of ECMA-262 still contains the function escape(string), which uses this syntax, but it also has the function encodeURI(uri), which converts characters to UTF-8 byte sequences and percent-encodes each byte.

So escape is a non-standard implementation of percent-encoding, and it is no surprise that it has been deprecated.
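The difference is easy to see by running both functions on the same string. The variable names here are only for illustration.

```javascript
// escape emits %uXXXX (UTF-16 code units) for characters above 0xFF.
const viaEscape = escape("测试"); // "%u6D4B%u8BD5"

// encodeURI and encodeURIComponent percent-encode the UTF-8 bytes instead.
const viaEncodeURI = encodeURI("测试"); // "%E6%B5%8B%E8%AF%95"
const viaComponent = encodeURIComponent("测试"); // "%E6%B5%8B%E8%AF%95"
```

For characters in the range 0x80 to 0xFF, escape emits a plain %XX sequence, which is why it can be used on the output of atob in the earlier solution.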

Solution

The methods above rely on escape and unescape despite their non-standard behavior. Here is an implementation that uses neither.

function utf8_to_b64(str) {
  return btoa(
    // Turn each %XX sequence into the single character with that byte value.
    encodeURIComponent(str).replace(/%([0-9A-F]{2})/g, function (match, p1) {
      return String.fromCharCode(parseInt(p1, 16));
    })
  );
}

function b64_to_utf8(str) {
  return decodeURIComponent(
    atob(str)
      .split("")
      .map(function (c) {
        return "%" + ("00" + c.charCodeAt(0).toString(16)).slice(-2);
      })
      .join("")
  );
}

// Usage:
utf8_to_b64("测试"); // "5rWL6K+V"
b64_to_utf8("5rWL6K+V"); // "测试"
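As an alternative that the original article does not cover, the same conversion can be sketched with the standard TextEncoder and TextDecoder APIs, which produce and consume UTF-8 bytes directly. The function names utf8ToB64Modern and b64ToUtf8Modern are hypothetical, and the sketch assumes an environment where TextEncoder, TextDecoder, btoa, and atob are all available as globals.

```javascript
function utf8ToB64Modern(str) {
  // TextEncoder always produces UTF-8 bytes.
  const bytes = new TextEncoder().encode(str);
  let binary = "";
  for (const b of bytes) {
    binary += String.fromCharCode(b); // one character per byte
  }
  return btoa(binary);
}

function b64ToUtf8Modern(b64) {
  const binary = atob(b64);
  // Map each character of the binary string back to its byte value.
  const bytes = Uint8Array.from(binary, (c) => c.charCodeAt(0));
  return new TextDecoder().decode(bytes);
}

const modernEncoded = utf8ToB64Modern("测试"); // "5rWL6K+V"
const modernDecoded = b64ToUtf8Modern(modernEncoded); // "测试"
```

Unlike decodeURIComponent, TextDecoder by default replaces invalid byte sequences with U+FFFD rather than throwing.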

Base64 codec under Node.js

If you use the methods above in Node.js, you may find that btoa and atob, which only accept binary strings, are themselves marked as deprecated (legacy) there. So what should be used in Node.js?

Node.js provides a more convenient option, Buffer, which supports other kinds of data in addition to strings.

function utf8_to_b64(str) {
  return Buffer.from(str).toString("base64");
}

function b64_to_utf8(str) {
  return Buffer.from(str, "base64").toString("utf8");
}
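A quick check that the Buffer approach agrees with the browser-style approach, including for a character outside the Basic Multilingual Plane. This sketch assumes a Node.js version in which btoa is available as a global, which is needed only for the comparison line.

```javascript
// Buffer encodes the UTF-8 bytes of the string directly.
const b64 = Buffer.from("测试", "utf8").toString("base64"); // "5rWL6K+V"

// Same result as the btoa-based version shown earlier.
const viaBtoa = btoa(unescape(encodeURIComponent("测试")));
const sameResult = b64 === viaBtoa; // true

// Characters that need four UTF-8 bytes also round-trip.
const emoji = Buffer.from("😀", "utf8").toString("base64"); // "8J+YgA=="
const roundTrip = Buffer.from(emoji, "base64").toString("utf8"); // "😀"
```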

 


Origin blog.csdn.net/qq_22903531/article/details/131934492