Analysis of the bytes and str types in Python 3 (repost)

Original source: https://www.cnblogs.com/chownjy/p/6625299.html#undefined

One of the most important new features of Python 3 is a clear distinction between text strings and binary data. Text is always Unicode and is represented by the str type, while binary data is represented by the bytes type. Python 3 does not mix str and bytes in arbitrary implicit ways: you can't concatenate a string and a byte sequence, you can't search for a string inside a byte sequence (or vice versa), and you can't pass a string to a function that expects bytes (or vice versa).
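The following is a minimal sketch of that separation, not part of the original article. The helper name try_concat and the variable names are made up for illustration.

```python
text = "abc"          # str: Unicode text
data = b"abc"         # bytes: raw binary data


def try_concat(a, b):
    """Return a + b, or the name of the exception that was raised."""
    try:
        return a + b
    except TypeError as exc:
        return type(exc).__name__


mixed = try_concat(text, data)        # str + bytes is rejected
same_type = try_concat(text, "def")   # str + str works as usual

# Searching for a str inside a bytes object is rejected as well.
try:
    found = text in data
except TypeError:
    found = None

# Equality does not raise, but a str never compares equal to bytes.
equal = (text == data)

print(mixed, same_type, found, equal)
```

Python raises a TypeError rather than guessing an encoding, so the mismatch surfaces where the two types meet.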

Let's take a closer look at the differences and connections between the two.

History of Character Encoding

Before we talk about bytes and str, we need to talk about how character encodings have evolved.

In the early days of computing, English-speaking countries, led by the United States, dominated the computer industry, and the 26 English letters were enough to form every English word, sentence, and article. The earliest character encoding standard was therefore ASCII, which stores each character in 8 bits (1 byte) and covers the encoding needs of English text. (Strictly speaking, ASCII defines only 128 characters and uses 7 of the 8 bits.)

What is an encoding? An encoding is the representation of a character in binary. Everything, whether English, Chinese, or symbols, is ultimately stored on disk as a sequence like 01010101. Inside a computer, reading and storing data boils down to a bit stream of 0s and 1s. The problem is that humans can't read these bit streams, so how do we make them readable? That is the job of a character encoding: it acts as a translator inside the computer, transparently turning the bit stream into text that humans can understand directly. Ordinary users don't need to know how this process works, but programmers must understand it.

Take ASCII as an example. It stipulates that 1 byte (8 bits) represents 1 character, so each unit is as wide as "00000000" and data can be interpreted one byte at a time. For example, 01000001 represents the capital letter A; for convenience we often write it as the decimal number 65 instead. 8 bits can represent at most 2 to the 8th power (256) distinct values.
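As a small illustration (the variable names are chosen here, not taken from the original), the built-in ord(), chr(), and format() functions show the relation between the letter A, the number 65, and the bit pattern 01000001:

```python
code = ord("A")                 # the number assigned to the character
bits = format(code, "08b")      # the same number written as 8 binary digits
back = chr(0b01000001)          # from the bit pattern back to the character
raw = "A".encode("ascii")       # the single byte actually stored for "A"
distinct_values = 2 ** 8        # how many values fit in 8 bits

print(code, bits, back, raw, distinct_values)
```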

Later, as computers spread, the characters of Chinese, Japanese, Korean, and other languages also had to be represented. The 256 values available in one byte were far from enough, so a standards organization developed a universal character set called Unicode, which assigns a number to every character regardless of language. Early encodings of Unicode, such as UCS-2, stored every character in 2 bytes (UTF-16 and UTF-32 later extended this to as many as 4 bytes). This scheme met everyone's requirements, but it is not byte-compatible with ASCII and it wastes space and memory: most characters in the computer world are English letters, which could be represented by 1 byte but now had to take 2.

So the UTF-8 encoding came into being. It is a variable-length encoding of Unicode: English letters (and all other ASCII characters) take 1 byte, common Chinese characters take 3 bytes, and so on, up to 4 bytes per character. It is therefore compatible with ASCII and can decode older ASCII documents. UTF-8 quickly became widely used.

As encodings developed, Chinese-speaking regions also created their own encodings, for example GB2312 and GBK in mainland China and Big5 in Taiwan and Hong Kong. They are used mainly in those regions and are rarely used elsewhere. In GBK, a Chinese character occupies 2 bytes.
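The byte counts can be checked directly. This sketch uses Python's "utf-16-le" codec (UTF-16 without a byte order mark) as a stand-in for the 2-byte scheme described above; that choice is an assumption made here, not something the original specifies.

```python
samples = ("A", "中")

# Bytes needed per character under each encoding.
utf8_len = {ch: len(ch.encode("utf-8")) for ch in samples}
utf16_len = {ch: len(ch.encode("utf-16-le")) for ch in samples}

# UTF-8 stores ASCII characters as the very same single byte ASCII uses.
ascii_compatible = "A".encode("utf-8") == "A".encode("ascii")

print(utf8_len, utf16_len, ascii_compatible)
```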
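A brief sketch of why the choice of encoding matters: the same two characters produce different bytes under GBK and UTF-8, and bytes written with one cannot be read reliably with the other. The variable names are illustrative only.

```python
s = "中文"

gbk_bytes = s.encode("gbk")        # 2 bytes per Chinese character
utf8_bytes = s.encode("utf-8")     # 3 bytes per Chinese character

# Decoding GBK bytes as UTF-8 fails for this input.
try:
    wrong = gbk_bytes.decode("utf-8")
except UnicodeDecodeError:
    wrong = None

# Decoding with the encoding that produced the bytes recovers the text.
right = gbk_bytes.decode("gbk")

print(len(gbk_bytes), len(utf8_bytes), wrong, right)
```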

Similarities and differences between bytes and str

Back to bytes and str. A bytes object is a stream of bits of the form 01010001110. Whether we are writing code or reading an article, nobody reads such a bit stream directly. It needs an encoding that turns it into something meaningful rather than an obscure jumble of 0s and 1s. Because encodings differ, the same bit stream can be interpreted in different ways, which causes a lot of trouble in practice. Let's see how Python handles these encoding problems:

>>> s = "中文"
>>> s
'中文'
>>> type(s)
<class 'str'>
>>> b = bytes(s, encoding='utf-8')
>>> b
b'\xe4\xb8\xad\xe6\x96\x87'
>>> type(b)
<class 'bytes'>

 

As the example shows, s is a string (str). The built-in bytes() function converts a str into a bytes object. b is really just a sequence of 0s and 1s, but so that we can inspect it conveniently in the interactive interpreter, Python displays it in the form b'\xe4\xb8\xad\xe6\x96\x87'. The leading b marks the value as bytes, and each \xe4-style escape is the hexadecimal representation of 1 byte. After "中文" is encoded as utf-8 we can therefore count 6 bytes in total, 3 per Chinese character, which confirms the discussion above. When a str is passed to bytes(), the encoding argument must be given and cannot be omitted.

The str class has an encode() method, which performs the encoding from a string to a byte stream. The bytes type in turn has a decode() method, which performs the decoding from a byte stream back to a string. Apart from that, a look at the Python source shows that bytes and str offer almost the same list of methods; the biggest difference is encode versus decode.

In essence, a string is also stored on disk as a sequence of 0s and 1s, so it too must be encoded and decoded.
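To make the byte count concrete, here is a short sketch (variable names chosen for illustration) that inspects the same bytes object in a few ways and shows what happens when the encoding is left out:

```python
s = "中文"
b = bytes(s, encoding="utf-8")

length = len(b)       # number of bytes, not number of characters
values = list(b)      # each element is an integer from 0 to 255
hex_text = b.hex()    # the same bytes written as hexadecimal digits

# Passing a str to bytes() without an encoding raises TypeError.
try:
    bytes(s)
    raised = False
except TypeError:
    raised = True

print(length, values, hex_text, raised)
```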
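As a rough check of both claims, the sketch below round-trips a string through encode() and decode() and then compares the public method names of the two types with dir(). The set names are invented here, and the exact method lists may vary slightly between Python versions.

```python
s = "中文"

b = s.encode("utf-8")                       # str -> bytes
same_as_bytes_call = (b == bytes(s, encoding="utf-8"))
round_trip = b.decode("utf-8")              # bytes -> str

# Compare the public method names of the two types.
str_methods = {name for name in dir(str) if not name.startswith("_")}
bytes_methods = {name for name in dir(bytes) if not name.startswith("_")}

shared = str_methods & bytes_methods
only_str = str_methods - bytes_methods
only_bytes = bytes_methods - str_methods

print(len(shared), sorted(only_str), sorted(only_bytes))
```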

If the explanation above still doesn't make the difference clear, remember the following three points:

  1. When you save strings to disk and read them back in text mode, Python does the encoding and decoding for you automatically, so you don't need to handle that step yourself.

  2. Using the bytes type essentially tells Python that you don't want it to encode and decode automatically; you will do it manually and specify the encoding yourself.

  3. Python draws a strict line between the bytes and str data types: you can't pass bytes where a str argument is required, or vice versa. This comes up easily when reading and writing disk files.
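The three points above can be sketched with a temporary file. The file name demo.txt and the variable names are made up for this illustration, and the encoding is passed to open() explicitly so the result does not depend on the platform default.

```python
import os
import tempfile

text = "中文"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")

    # Text mode: pass a str and Python encodes it on write.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: read the raw bytes back; nothing is decoded.
    with open(path, "rb") as f:
        raw = f.read()

    # Binary mode accepts only bytes-like objects, so a str is rejected.
    try:
        with open(path, "wb") as f:
            f.write(text)
        binary_rejects_str = False
    except TypeError:
        binary_rejects_str = True

    # Manual route: encode yourself, write bytes, then let text mode decode.
    with open(path, "wb") as f:
        f.write(text.encode("gbk"))
    with open(path, "r", encoding="gbk") as f:
        decoded = f.read()

print(raw, binary_rejects_str, decoded)
```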

Converting between bytes and str is simply encoding and decoding, and the encoding should be specified explicitly. Note in the session below that str(b) without an encoding does not decode anything; it returns the printable representation of the bytes object:

>>> b
b'\xe4\xb8\xad\xe6\x96\x87'
>>> type(b)
<class 'bytes'>
>>> s1 = str(b)
>>> s1
"b'\\xe4\\xb8\\xad\\xe6\\x96\\x87'"
>>> type(s1)
<class 'str'>
>>> s1 = str(b, encoding='utf-8')
>>> s1
'中文'
>>> type(s1)
<class 'str'>

 

Now let's convert the string s1 to bytes encoded as gbk:

>>> s1
'中文'
>>> type(s1)
<class 'str'>
>>> b = bytes(s1, encoding='gbk')
>>> b
b'\xd6\xd0\xce\xc4'
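Putting the two conversions together, bytes in one encoding can be turned into another by decoding to str first and then encoding again. A minimal sketch, with variable names chosen here for illustration:

```python
gbk_bytes = b"\xd6\xd0\xce\xc4"          # "中文" encoded as GBK

text = gbk_bytes.decode("gbk")           # bytes -> str (decode)
utf8_bytes = text.encode("utf-8")        # str -> bytes (encode)

print(text, utf8_bytes)
```

The str value in the middle carries no encoding of its own, which is why it can serve as the common form between the two byte representations.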

 

