HTTP Lecture 13 - HTTP Entity Data

Data Types and Encodings

In the TCP/IP protocol stack, transmitted data basically takes the form "header + body". Because TCP and UDP are transport layer protocols, they do not care what the body contains: once the data reaches the other party, their job is done.
HTTP is different. As an application layer protocol, its work is only half done when the data arrives. It must also tell the upper-layer application what kind of data it is; otherwise the application will be "at a loss".
Imagine that HTTP had no way to announce the data type: the server sends a "big lump" of data and the browser sees only a "black box". What should it do?
Of course, it can "guess". Many data formats have a fixed layout, so checking the first few bytes may reveal that the data is a GIF picture or an MP3 music file. But this approach is inefficient, and there is a high probability that it fails to identify the file type.
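To make the "guessing" idea concrete, here is a minimal sketch of checking the first few bytes against known signatures. The table and the function name sniff_type are hypothetical and cover only a handful of formats; real browsers use far more elaborate sniffing rules.

```python
from typing import Optional

# Hypothetical signature table: a few well-known leading byte sequences.
MAGIC_SIGNATURES = [
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"ID3", "audio/mpeg"),  # MP3 files that begin with an ID3v2 tag
]

def sniff_type(data: bytes) -> Optional[str]:
    """Guess a MIME type from the leading bytes, or return None if unknown."""
    for signature, mime_type in MAGIC_SIGNATURES:
        if data.startswith(signature):
            return mime_type
    return None  # the "black box" case: guessing failed
```

Anything that is not in the table, such as plain text, comes back as None, which is exactly the failure mode described above.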
Fortunately, a solution to this problem already existed outside HTTP: it was used in the email system, where it allows email to carry arbitrary data rather than only ASCII text. It is called "Multipurpose Internet Mail Extensions", or MIME for short.
MIME is a large specification, but HTTP borrows only the part of it that labels the data type of the body. This is the "MIME type" we usually hear about.
MIME divides data into several broad categories, each subdivided into subcategories and written as a "type/subtype" string. Because this is plain text, like HTTP itself, it fits easily into an HTTP header field.
Here are a few categories often encountered in HTTP:
1. text: Readable data in text format. The most familiar is text/html, which means a hypertext document. There are also plain text text/plain, style sheet text/css, etc.
2. image: Image files, including image/gif, image/jpeg, image/png, etc.
3. audio/video: Audio and video data, such as audio/mpeg, video/mp4, etc.
4. application: The data format is not fixed; it may be text or binary, and must be interpreted by the upper-layer application. Common ones are application/json, application/javascript, application/pdf, etc. In addition, if the type of the data is truly unknown, like the "black box" just mentioned, it is application/octet-stream, that is, opaque binary data.
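As an illustration of the "type/subtype" form, the sketch below uses Python's standard mimetypes module to look up a MIME type from a file name and falls back to application/octet-stream when nothing is known. The helper names mime_type_for and split_mime_type are made up for this example.

```python
import mimetypes
from typing import Tuple

def mime_type_for(filename: str) -> str:
    """Look up a MIME type by file extension; unknown data is opaque binary."""
    guessed, _encoding = mimetypes.guess_type(filename)
    return guessed or "application/octet-stream"

def split_mime_type(mime_type: str) -> Tuple[str, str]:
    """Split a "type/subtype" string into its two parts."""
    main_type, _, subtype = mime_type.strip().lower().partition("/")
    return main_type, subtype
```

For instance, split_mime_type(mime_type_for("index.html")) separates the category from the subcategory, while a file with an unrecognized extension ends up as application/octet-stream.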
But the MIME type alone is not enough, because HTTP sometimes compresses data to save bandwidth during transmission. So that the browser does not have to "guess" again, an "Encoding type" is needed to say which encoding format the data uses, so that the other party can correctly decompress it and restore the original data.
There are far fewer Encoding types than MIME types, and only the following three are commonly used:

  1. gzip: GNU zip compression format, also the most popular compression format on the Internet;
  2. deflate: zlib (deflate) compression format, second only to gzip in popularity;
  3. br: A new compression algorithm optimized specifically for HTTP (Brotli).
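The first two formats can be tried with Python's standard library alone. The following is a minimal sketch, with hypothetical helper names encode_body and decode_body; br is left out because Brotli is not part of the standard library.

```python
import gzip
import zlib

def encode_body(body: bytes, coding: str) -> bytes:
    """Compress a body with the named encoding type."""
    if coding == "gzip":
        return gzip.compress(body)
    if coding == "deflate":
        return zlib.compress(body)  # zlib-wrapped deflate stream
    raise ValueError("unsupported encoding type: " + coding)

def decode_body(data: bytes, coding: str) -> bytes:
    """Restore the original body from compressed data."""
    if coding == "gzip":
        return gzip.decompress(data)
    if coding == "deflate":
        return zlib.decompress(data)
    raise ValueError("unsupported encoding type: " + coding)
```

The receiver can only call the right decompressor if it is told which encoding type was used, which is the point of labelling the encoding.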

Header fields used by the data type

With the MIME type and Encoding type, both the browser and the server can easily identify the type of the body and process the data correctly.
For this purpose, the HTTP protocol defines two Accept request header fields and two Content entity header fields, which are used for "content negotiation" between the client and the server. That is to say, the client uses the Accept header to tell the server what kind of data it wants to receive, and the server uses the Content header to tell the client what kind of data it actually sent.
The Accept field marks the MIME type that the client can understand, and you can use "," as a separator to list multiple types, so that the server has more options, such as the following header:

Accept: text/html,application/xml,image/webp,image/png

This tells the server: "I can understand HTML, XML text, and webp and png images; please send me data in one of these four formats".
Correspondingly, the server will use the header field Content-Type in the response message to tell the real type of the entity data:

Content-Type: text/html
Content-Type: image/png

In this way, when the browser sees that the type in the message is "text/html", it knows that the body is an HTML file and calls the typesetting engine to render the page. When it sees "image/png", it knows that the body is a PNG file and displays the image on the page.
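The exchange of Accept and Content-Type can be sketched from the server's side as follows. This is a simplified illustration, not how any particular server works: the function names are hypothetical, list order is treated as preference, and wildcards and weights are ignored.

```python
from typing import Iterable, List, Optional

def parse_accept(header: str) -> List[str]:
    """Split an Accept header into lowercase MIME types, dropping parameters."""
    types = []
    for item in header.split(","):
        mime_type = item.split(";")[0].strip().lower()
        if mime_type:
            types.append(mime_type)
    return types

def choose_content_type(accept_header: str, available: Iterable[str]) -> Optional[str]:
    """Return the first requested type the server can actually provide."""
    offered = set(available)
    for mime_type in parse_accept(accept_header):
        if mime_type in offered:
            return mime_type
    return None  # nothing the client listed is available
```

With the Accept header shown earlier and a server that only has PNG and JSON versions of a resource, the chosen Content-Type would be image/png.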
The Accept-Encoding field marks the compression formats supported by the client, such as the gzip and deflate mentioned above. You can also use "," to list multiple formats. The server chooses one of them to compress the data and puts the format it actually used in the Content-Encoding response header field.

Accept-Encoding: gzip, deflate, br
Content-Encoding: gzip

However, both fields can be omitted. If the request message has no Accept-Encoding field, servers typically treat the client as not supporting compression and send the data uncompressed; if the response message has no Content-Encoding field, the response data is not compressed.
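A rough sketch of this negotiation on the server side is shown below. The name compress_response and the SUPPORTED table are assumptions made for the example; it picks the first listed format the server supports and sends the body uncompressed when the request carries no usable Accept-Encoding.

```python
import gzip
import zlib
from typing import Dict, Optional, Tuple

# Hypothetical table of encodings this example server can produce.
SUPPORTED = {"gzip": gzip.compress, "deflate": zlib.compress}

def compress_response(body: bytes, accept_encoding: Optional[str]) -> Tuple[bytes, Dict[str, str]]:
    """Compress the body if the client listed a supported format."""
    headers: Dict[str, str] = {}
    if accept_encoding:
        for token in accept_encoding.split(","):
            coding = token.split(";")[0].strip().lower()
            if coding in SUPPORTED:
                headers["Content-Encoding"] = coding
                return SUPPORTED[coding](body), headers
    # No Accept-Encoding, or nothing in common: send the data as-is,
    # without a Content-Encoding header.
    return body, headers
```

The client then inspects Content-Encoding in the returned headers to decide whether, and how, to decompress the body.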

Language type and encoding

MIME type and Encoding type solve the problem of computers understanding body data, but the Internet spans the whole world, and people in different countries and regions use many different languages. Even when the data is all text/html, how can the browser display it in a language that each reader can understand?
This is actually a question of "internationalization". HTTP adopts a solution similar to data types, and introduces two more concepts: language type and character set.
The so-called "language type" is the natural language used by humans, such as English, Chinese, or Japanese. These natural languages may have regional variants, so a "type-subtype" form is also used when a clear distinction is required. The format differs from the data type, however: the delimiter is "-" rather than "/".
To give a few examples: en means any English, en-US means American English, en-GB means British English, and zh-CN means Chinese as used in mainland China.
A more troublesome issue when computers process natural language is the "character set".
In the early days of computing, people in various countries and regions "acted independently" and invented many character encodings to process text, such as ASCII in the English-speaking world, GBK and BIG5 in the Chinese-speaking world, and Shift_JIS in Japan. The same piece of text that displays normally in one encoding may turn into garbage in another.
So Unicode and UTF-8 appeared later, bringing all the world's languages into a single encoding scheme, and Unicode text encoded as UTF-8 has become the standard on the Internet.
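The garbling problem is easy to reproduce. In this small sketch the same two Chinese characters are encoded as GBK and as UTF-8, and the GBK bytes are then decoded with the wrong character set; the variable names are only for illustration.

```python
text = "\u4e2d\u6587"  # the two characters of the word "Chinese", written as escapes

gbk_bytes = text.encode("gbk")     # 2 bytes per character
utf8_bytes = text.encode("utf-8")  # 3 bytes per character

# Decoding with the matching character set restores the text.
restored = gbk_bytes.decode("gbk")

# Decoding GBK bytes as UTF-8 does not: the invalid sequences are
# swapped for replacement characters, so the result is garbled.
garbled = gbk_bytes.decode("utf-8", errors="replace")
```

The byte sequences differ even though the text is identical, which is why the receiver has to be told which character set was used.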

Header fields used by the language type

The HTTP protocol also uses the Accept request header field and the Content entity header field for client and server "content negotiation" on language and encoding.
The Accept-Language field marks the natural language that the client can understand, and also allows multiple types to be listed with "," as a separator, for example:

Accept-Language: zh-CN, zh, en

This request header tells the server: "Preferably give me zh-CN Chinese; if that is unavailable, use another Chinese variant; failing that, give me English."
Correspondingly, the server should use the header field Content-Language in the response message to tell the client the actual language type used by the entity data:

Content-Language: zh-CN
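The fallback behaviour described for Accept-Language can be sketched like this. It is a simplified illustration with a hypothetical function name: list order is treated as preference, a bare language such as zh also matches regional variants such as zh-TW, and weights are ignored.

```python
from typing import Iterable, Optional

def choose_language(accept_language: str, available: Iterable[str]) -> Optional[str]:
    """Pick the Content-Language for a response from the languages on offer."""
    offered = list(available)
    for item in accept_language.split(","):
        wanted = item.split(";")[0].strip().lower()
        if not wanted:
            continue
        for tag in offered:
            candidate = tag.lower()
            # Exact match, or a regional variant of the requested language.
            if candidate == wanted or candidate.startswith(wanted + "-"):
                return tag
    return None
```

With the header above, a server that only has zh-TW and en-US pages would answer with zh-TW, and one that only has en-GB would answer with en-GB.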

The request header field for the character set in HTTP is Accept-Charset, but there is no corresponding Content-Charset response header. Instead, the character set is given as "charset=xxx" after the data type in the Content-Type field. This deserves attention.
For example, the browser requests the GBK or UTF-8 character set, and the server returns UTF-8-encoded data:

Accept-Charset: gbk, utf-8
Content-Type: text/html; charset=utf-8
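On the receiving side, the charset parameter has to be separated from the data type before the body can be decoded. The sketch below leans on Python's standard email.message.Message class, which understands this header syntax; the helper names and the UTF-8 default are assumptions for the example.

```python
from email.message import Message
from typing import Optional, Tuple

def parse_content_type(value: str) -> Tuple[str, Optional[str]]:
    """Split a Content-Type value into (MIME type, charset or None)."""
    message = Message()
    message["Content-Type"] = value
    return message.get_content_type(), message.get_content_charset()

def decode_text(body: bytes, content_type: str, default: str = "utf-8") -> str:
    """Decode a text body using the charset named in Content-Type."""
    _mime_type, charset = parse_content_type(content_type)
    return body.decode(charset or default)
```

A Content-Type without a charset parameter, such as image/png, simply yields None for the character set.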

However, current browsers support many character sets and usually do not send Accept-Charset, and servers usually do not send Content-Language, because the language can be inferred from the character set. As a result, the request header generally carries only the Accept-Language field, and the response header only the Content-Type field.

Quality value for content negotiation

When using Accept, Accept-Encoding, Accept-Language and other request header fields for content negotiation, you can also attach a special "q" parameter to express a weight and thereby set priorities; "q" stands for "quality factor".
The weight ranges from 0 to 1 with up to three decimal places, so the smallest nonzero value is 0.001. The default value is 1, and a value of 0 means rejection. The form is a ";" after the data type or language code, followed by "q=value".
One thing to watch is the use of ";". In most programming languages ";" is a stronger break than ",", but in HTTP content negotiation it is the opposite: ";" binds more tightly than ",", so a q value applies only to the item immediately before it.
Example:

Accept: text/html,application/xml;q=0.9,*/*;q=0.8

It indicates that the browser expects HTML files most, with a weight of 1, followed by XML files, with a weight of 0.9, and finally any data type, with a weight of 0.8. After the server receives the request header, it will calculate the weight, and then output HTML or XML first according to its actual situation.
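The weight calculation can be sketched as a small parser that splits on "," first and ";" second, then sorts by q. The function name parse_quality_list is hypothetical, and the sketch assumes a reasonably well-formed header; a q value that cannot be parsed is simply left at the default of 1.

```python
from typing import List, Tuple

def parse_quality_list(header: str) -> List[Tuple[str, float]]:
    """Return (value, q) pairs sorted from most to least preferred."""
    entries = []
    for item in header.split(","):           # "," separates the choices
        parts = [part.strip() for part in item.split(";")]
        value = parts[0]
        if not value:
            continue
        q = 1.0                              # default weight
        for param in parts[1:]:              # ";" attaches parameters to one choice
            name, _, raw = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(raw)
                except ValueError:
                    pass                     # keep the default on a malformed value
        if q > 0:                            # q=0 means "not acceptable"
            entries.append((value, q))
    # sorted() is stable, so equal weights keep their original order.
    return sorted(entries, key=lambda entry: -entry[1])
```

Applied to the example header, this yields text/html first, then application/xml, then */*, matching the order described above.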

The result of content negotiation

The process of content negotiation is opaque, and each Web server uses its own algorithm. Sometimes, however, the server adds a Vary field to the response header to record which request header fields it consulted during content negotiation, for example:

Vary: Accept-Encoding,User-Agent,Accept

The Vary field indicates that the server determines the response message sent back based on the three header fields Accept-Encoding, User-Agent and Accept.
The Vary field can be thought of as a special "version mark" for the response message. Whenever a request header such as Accept changes, the response chosen for it may change as well, and Vary records that dependency. In other words, the same URI may have several different "versions"; this is mainly used by proxy servers in the middle of the transmission link to implement caching.
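One way to picture how a caching proxy might use Vary is to build the cache key from the URI plus the values of the request headers that Vary names. This is only a sketch of the idea with a hypothetical cache_key function, not the behaviour of any specific proxy.

```python
from typing import Dict, Tuple

def cache_key(uri: str, request_headers: Dict[str, str], vary: str) -> Tuple:
    """Build a cache key from the URI and the headers listed in Vary."""
    names = sorted(name.strip().lower() for name in vary.split(",") if name.strip())
    # Header names are case-insensitive, so compare them in lowercase.
    lowered = {key.lower(): value for key, value in request_headers.items()}
    return (uri,) + tuple((name, lowered.get(name, "")) for name in names)
```

Two requests for the same URI that differ in Accept-Encoding then map to different cached "versions", while a difference in a header that Vary does not list has no effect on the key.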

Summary

1. The data type indicates what the content of the entity data is, using the MIME type; the relevant header fields are Accept and Content-Type;
2. The data encoding indicates the compression method of the entity data; the relevant header fields are Accept-Encoding and Content-Encoding;
3. The language type indicates the natural language of the entity data; the relevant header fields are Accept-Language and Content-Language;
4. The character set indicates the encoding method of the entity data; the relevant header fields are Accept-Charset and Content-Type;
5. The client uses header fields such as Accept in the request header to perform "content negotiation" with the server, asking the server to return the most appropriate data;
6. Header fields such as Accept can list multiple possible options separated by ",", and the ";q=" parameter can be used to specify the weights precisely.

PS: This article is a note after watching Geek.


Origin blog.csdn.net/Elon15/article/details/130708880