Chapter 16: Internationalization

HTTP messages can carry content in any language. To HTTP, the entity body is just a container of binary data.


To support international content, a server that returns a document also needs to tell the client which alphabet and language the content uses, so that the client can correctly parse the information and display the characters. The server does this through the charset parameter of the Content-Type header and through the Content-Language header.


At the same time, not every client can handle every character set and language, so when making a request the client can send the Accept-Charset and Accept-Language headers to tell the server which encodings and languages it can handle. Both headers support the q parameter for setting priorities.
Internationalization in HTTP, as covered in this chapter, is therefore mainly about character set encodings and language tags.

1 Character sets and HTTP

1.1 The concept of the character set

A character set here means the rules for encoding characters into binary codes. The HTTP charset value tells the receiver how to convert the bits of the entity content into characters of a particular alphabet. Each charset tag names an algorithm for converting bits into characters (and vice versa). Charset tags are standardized in the MIME character set registry, which is maintained by IANA. The charset is typically specified by the charset parameter of Content-Type, as follows:

Content-Type: text/html; charset=utf-8
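
As a rough sketch of how a receiver might read this parameter, the Python snippet below uses the standard library's email.message.Message to extract the charset from a Content-Type value. The helper name charset_from_content_type is hypothetical and exists only for illustration.

```python
from email.message import Message
from typing import Optional


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the lowercased charset parameter, or None if it is absent."""
    msg = Message()
    msg["Content-Type"] = value
    # get_content_charset() reads the charset parameter and lowercases it
    return msg.get_content_charset()


print(charset_from_content_type("text/html; charset=utf-8"))
print(charset_from_content_type("text/html"))
```

When the parameter is missing the helper returns None, which is the case where the receiver has to fall back to other hints.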

1.2 Character set encoding

As mentioned above, the purpose of a character set and its encoding is to convert between binary information and characters. The conversion is generally done in two steps (taking the conversion from binary information to characters as an example):

  • First, the binary information is converted into a character code (the number of a particular character within a specific character set);
  • Then, using this code, the corresponding character is looked up in the character set.

After these two steps we have the correct character. Both steps depend on a correct charset value. If the value is wrong, the same sequence of bits may yield the wrong character code in the first step. The second step also depends on the charset value, because the mapping between character codes and characters differs from one character set to another. This is why garbled text is sometimes caused by an incorrect charset value.
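
The effect of a wrong charset value can be illustrated with a small Python sketch. The character and the two charset names are chosen only for the example.

```python
# The character maps to a number (its code) in the character set.
char = "中"
code = ord(char)                    # Unicode code point 0x4E2D

# The encoding scheme turns that code into bytes.
data = char.encode("utf-8")         # three bytes: e4 b8 ad

# Decoding reverses both steps and depends on the charset value.
right = data.decode("utf-8")        # correct charset: the original character
wrong = data.decode("iso-8859-1")   # wrong charset: three unrelated characters

print(hex(code), data.hex(), right, repr(wrong))
```

The same three bytes produce one Chinese character under utf-8 and three Latin-1 characters under iso-8859-1, which is the garbling described above.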

The values used for charset are standardized MIME charset tags, for example utf-8, iso-8859-1, and gbk.

Note:

These values are generally case-insensitive; keep this in mind when processing them.

HTTP is concerned only with transporting the character data and the associated language and charset labels. Rendering the shapes of the characters is handled by the user's graphics display software (browser, operating system, fonts, and so on).

 

1.3 Client Accept-Charset

Normally, the server uses the charset parameter of the Content-Type header to tell the client which MIME character set is in use.

If the server's response does not include it, the client has to infer the character set from the document content as best it can. An HTML document typically declares its character set with <META HTTP-EQUIV="Content-Type">, as follows:

<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=gbk">
<META LANG="jp">
<TITLE>A Japanese Document</TITLE>
</HEAD>
<BODY>
...

On receiving this document, the client learns that the HTML uses the gbk encoding.
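
A minimal sketch of how a client could look for such a declaration, assuming a simple regular-expression scan of the first bytes of the document is good enough. Real browsers use a more elaborate detection algorithm; the function name sniff_meta_charset and the 1024-byte limit are illustrative assumptions.

```python
import re
from typing import Optional

# Looks for charset=NAME near the start of the document; names are ASCII.
_META_CHARSET = re.compile(
    rb"""charset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)""", re.IGNORECASE
)


def sniff_meta_charset(html: bytes, scan_limit: int = 1024) -> Optional[str]:
    """Return the charset named in the document head, lowercased, or None."""
    match = _META_CHARSET.search(html[:scan_limit])
    return match.group(1).decode("ascii").lower() if match else None


page = b"""<HEAD>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=gbk">
<TITLE>A Japanese Document</TITLE>
</HEAD>"""
print(sniff_meta_charset(page))
```

The scan works on raw bytes because the charset has to be known before the document can be decoded.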

There are a great many character set encodings in the world, and the client cannot necessarily handle the character set the server returns. To keep the server from returning a character set it cannot handle, the client should, where possible, use the Accept-Charset header in its request to tell the server which character sets it supports.
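
On the server side, honoring Accept-Charset might look roughly like the following Python sketch. It parses the q values and picks the most preferred charset the server can produce. The function names are hypothetical, and the sketch ignores the "*" wildcard for brevity.

```python
from typing import List, Optional, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Parse 'a;q=0.8, b' into (name, q) pairs sorted by descending q."""
    items = []
    for part in header.split(","):
        fields = [f.strip() for f in part.split(";")]
        name = fields[0].lower()
        if not name:
            continue
        q = 1.0  # a missing q parameter means full preference
        for field in fields[1:]:
            key, _, val = field.partition("=")
            if key.strip().lower() == "q":
                try:
                    q = float(val)
                except ValueError:
                    q = 0.0  # treat a malformed q as "not acceptable"
        items.append((name, q))
    # sorted() is stable, so entries with equal q keep their header order
    return sorted(items, key=lambda item: item[1], reverse=True)


def choose_charset(header: str, supported: List[str]) -> Optional[str]:
    """Pick the supported charset the client prefers most, if any."""
    available = {name.lower() for name in supported}
    for name, q in parse_accept(header):
        if q > 0 and name in available:
            return name
    return None


print(choose_charset("iso-8859-1;q=0.5, utf-8, gbk;q=0.8", ["gbk", "iso-8859-1"]))
```

In this example the client prefers utf-8, but the assumed server only offers gbk and iso-8859-1, so gbk (q=0.8) wins over iso-8859-1 (q=0.5).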

 

2 Introduction to Character Encoding

2.1 Character set terminology

Here, a set of rules for converting a series of 8-bit bytes into a series of characters is called a character set. It consists of a coded character set and a character encoding scheme. The following terms are used when discussing character sets:

  • Character: a letter, number, punctuation mark, ideograph (such as a Chinese character), symbol, or other textual "atom" of writing.
  • Glyph: the graphical shape of a character. The same character can have multiple glyphs; for example, the same Chinese character looks different when shown in Arial and in Times New Roman.
  • Coded character: the unique number assigned to a character.
  • Coding space: the range of integers planned for use as character code values.
  • Code width: the number of bits used for each character code.
  • Character repertoire: a particular working set of characters.
  • Coded character set: a character repertoire whose characters have been coded, with each character assigned a code from the coding space.
  • Character encoding scheme: an algorithm for encoding numeric character codes into binary.

2.2 Character encoding schemes

There are currently three main kinds of character encoding scheme:

a) Fixed width
  A fixed-width encoding represents each coded character with a fixed number of bits. It is fast to process but may waste space.
b) Variable width (nonmodal)
  As the name implies, a variable-width encoding stores each character code with as many bits as it needs. This saves space, but some characters then take several bytes; for example, a Chinese character may need two or more bytes.
c) Variable width (modal)
  A modal encoding uses special "escape" patterns to switch between different modes (I did not fully understand what this means).
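
The three kinds can be compared with a short Python sketch. The choice of iso-8859-1, utf-8, and iso-2022-jp as representatives is an assumption made for this example; in particular, iso-2022-jp is not named in the text and is used here only to show what a mode switch looks like.

```python
# Fixed width: iso-8859-1 always uses one byte per character.
assert len("A".encode("iso-8859-1")) == 1
assert len("é".encode("iso-8859-1")) == 1

# Variable width, nonmodal: utf-8 uses one byte for "A" and two for "é".
assert len("A".encode("utf-8")) == 1
assert len("é".encode("utf-8")) == 2

# Variable width, modal: iso-2022-jp switches modes with escape sequences.
data = "日本".encode("iso-2022-jp")
assert data.startswith(b"\x1b$B")  # ESC $ B: switch to a two-byte Japanese set
assert data.endswith(b"\x1b(B")    # ESC ( B: switch back to ASCII
print(data)
```

In the modal case, the bytes between the two escape sequences only make sense if the decoder remembers which mode it is in, which is what distinguishes it from a nonmodal encoding.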

Common encoding examples:

a) 8-bit fixed width
  An 8-bit fixed-width identity encoding encodes each character code as its corresponding 8-bit binary value. It can only support character sets with a code range of 256 characters. The iso-8859 family of character sets uses the 8-bit identity encoding.
b) UTF-8
  This is probably the most popular universal encoding, and it can handle many languages. It is a nonmodal, variable-length encoding. The high bits of the first byte indicate how many bytes the encoded character uses, and each subsequent byte carries 6 bits of the code value. If the high bit of the first byte is 0, the character takes a single byte whose remaining 7 bits hold the character code, which makes the encoding compatible with ASCII.
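
The UTF-8 bit layout can be sketched in Python as follows. This is a minimal illustration that assumes a valid code point of at most 0x10FFFF and does not reject surrogates; in practice the built-in str.encode("utf-8") should be used, and the function name utf8_encode is hypothetical.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one code point by hand to show the UTF-8 bit layout."""
    if code_point < 0x80:
        # 0xxxxxxx: one byte, identical to ASCII
        return bytes([code_point])
    if code_point < 0x800:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


for ch in ["A", "é", "中"]:
    print(ch, utf8_encode(ord(ch)).hex())
```

Each continuation byte starts with the bits 10 and carries 6 bits of the code value, while the number of leading 1 bits in the first byte gives the total length.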

3 Language tags and HTTP

Language tags are short, standardized strings that name spoken languages. The names need to be standardized; otherwise everyone would follow their own naming conventions and string formats, and it would be impossible to extract language information from a tag.

3.1 Content-Language and Accept-Language headers

The Content-Language entity header field describes the target audience languages of the entity, and it is not limited to text documents. Audio clips, movies, and applications may all be intended for audiences of a particular language. Any media type aimed at a particular language audience can carry a Content-Language header. The header can also list multiple languages, for example: Content-Language: en, fr.

Correspondingly, the client can use Accept-Language to tell the server which languages it can accept. Languages are listed from left to right in order of priority, and the q parameter can be used to make the priorities explicit.
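
A server could then match the client's languages against the languages it has available. The Python sketch below assumes the preferences have already been put in priority order and uses simple prefix matching (so "en" matches "en-GB"); the function name pick_language is hypothetical.

```python
from typing import List, Optional


def pick_language(preferred: List[str], available: List[str]) -> Optional[str]:
    """Return the first available tag that matches a preferred language.

    A preferred language matches a tag if the two are equal or if it is a
    prefix of the tag followed by "-", compared case-insensitively.
    """
    for wanted in preferred:
        w = wanted.lower()
        for tag in available:
            t = tag.lower()
            if t == w or t.startswith(w + "-"):
                return tag
    return None


print(pick_language(["fr", "en"], ["en-GB", "es"]))
```

Here no French variant is available, so the second preference "en" selects en-GB. Requiring the "-" after the prefix keeps "en" from matching an unrelated tag that merely starts with the same letters.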

3.2 Types of language tag

RFC 3066, "Tags for the Identification of Languages", documents the standardized syntax of language tags. Language tags can be used to represent:

  • General language classes (for example, es for Spanish);
  • Country-specific languages (such as en-GB for British English);
  • Dialects of a language (such as no-bok for the written language of Norway);
  • Regional languages (such as sgn-US-MA for the sign language of Martha's Vineyard island in the United States);
  • IANA-registered languages (such as i-navajo);
  • Nonstandard languages (such as x-snowboarder-slang).

A tag typically consists of one or more parts separated by "-", for example sgn-US-MA.

[Figure: Language tag format]

In this format:

  • The first subtag is called the primary subtag; its values are standardized.
  • The second subtag is optional and follows its own naming rules.
  • Any trailing subtags are unregistered.

The first and second subtags are each named and maintained by specific documents and organizations, some of which are introduced below. Tags that are not maintained in this way generally need to be registered with IANA.

Note: All tags are case-insensitive. Keep this in mind when parsing them.

3.2.1 The first subtag

The first subtag is usually a standardized language code chosen from the ISO 639 set of language standards. It can also be the letter i, identifying a name registered with IANA, or the letter x, identifying a private or extension name. The rules are as follows:

  • If it has two characters, it is a language code from the ISO 639 and 639-1 standards, such as ar or en.
  • If it has three characters, it is a language code from the ISO 639-2 standard and its extensions, such as ara or eng.
  • If it is the letter i, the language tag is explicitly registered with IANA.
  • If it is the letter x, the language tag is a private, nonstandard, or extension subtag.
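
These rules can be sketched as a small Python classifier. It only checks the shape of the first subtag and does not verify that a code is actually listed in ISO 639; the function name and the returned labels are illustrative.

```python
def classify_primary_subtag(tag: str) -> str:
    """Classify a language tag by the shape of its first subtag."""
    primary = tag.split("-")[0].lower()  # tags are case-insensitive
    if primary == "i":
        return "IANA-registered"
    if primary == "x":
        return "private or extension"
    if primary.isascii() and primary.isalpha():
        if len(primary) == 2:
            return "ISO 639 two-letter code"
        if len(primary) == 3:
            return "ISO 639-2 three-letter code"
    return "unrecognized"


for example in ["en-GB", "ara", "i-navajo", "x-snowboarder-slang"]:
    print(example, "->", classify_primary_subtag(example))
```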

3.2.2 The second subtag

The second subtag is usually a standardized country code chosen from the ISO 3166 set of country and region code standards. It can also be another string registered with IANA. The rules are as follows:

  • If it has two characters, it is a country/region defined by ISO 3166, such as CN or FR;
  • If it has 3 to 8 characters, the value may be registered with IANA;
  • A single character is illegal.
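
The second-subtag rules can be sketched the same way. This is a shape check only, under the assumption that the tag does not start with i or x (those tags follow their own rules); the function name and labels are hypothetical.

```python
def classify_second_subtag(tag: str) -> str:
    """Classify a language tag by the length of its second subtag."""
    parts = tag.split("-")
    if len(parts) < 2:
        return "absent"  # the second subtag is optional
    second = parts[1]
    if len(second) == 1:
        return "illegal"
    if len(second) == 2:
        return "ISO 3166 country code"
    if 3 <= len(second) <= 8:
        return "may be registered with IANA"
    return "unrecognized"


for example in ["es", "en-GB", "sgn-US-MA", "no-bok", "en-a"]:
    print(example, "->", classify_second_subtag(example))
```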

4 Internationalized URIs

4.1 URI character set

  1. URIs allow only a subset of the US-ASCII character set.
  2. URI characters are divided into three classes: escaped, reserved, and unreserved.

 

URI escaping provides a safe way to insert reserved characters and other unsupported characters (such as various kinds of whitespace) into a URI.

Each escape is a three-character sequence: a percent sign (%) followed by two hexadecimal digits that represent the code of a US-ASCII character.
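
The escape format can be sketched in Python as follows. The hand-written percent_escape helper is hypothetical and only shows the three-character format; urllib.parse.quote from the standard library is the usual tool.

```python
from urllib.parse import quote


def percent_escape(char: str) -> str:
    """Return the three-character escape for one US-ASCII character."""
    code = ord(char)
    if code > 127:
        raise ValueError("this sketch only handles US-ASCII characters")
    return "%{:02X}".format(code)  # "%" followed by two hex digits


print(percent_escape(" "))
print(quote("a b/c", safe=""))
```

The space (code 32, hexadecimal 20) becomes %20, and quote applies the same rule to every character that is not safe to leave as-is.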

 

Internally, HTTP applications should transmit and forward URIs with their escapes left in place. An HTTP application should unescape a URI only when the data is needed.

Applications should also make sure that no URI is ever unescaped twice: a percent sign that was itself encoded in an escape would be unescaped on the second pass, leading to loss of data.
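
The double-unescape problem is easy to reproduce with urllib.parse.unquote from the Python standard library. The sample string is made up for this illustration.

```python
from urllib.parse import unquote

escaped = "100%2541"       # "%25" is the escape for a literal "%"
once = unquote(escaped)    # "100%41": the intended data
twice = unquote(once)      # "100A": the literal "%41" was wrongly decoded

print(once, twice)
```

After the second pass the original text "%41" can no longer be recovered, which is the data loss the rule is meant to prevent.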

Note: the escaped values themselves should be in the range of US-ASCII code values (0 to 127).

 

Origin www.cnblogs.com/liuzhiyun/p/11517757.html