Original source: https://www.cnblogs.com/chownjy/p/6625299.html#undefined
One of the most important new features of Python 3 is a clear distinction between text strings and binary data. Text is always Unicode and is represented by the str type, while binary data is represented by the bytes type. Python 3 does not mix str and bytes in arbitrary implicit ways: you cannot concatenate a string and a byte sequence, you cannot search for a string inside a byte sequence (or vice versa), and you cannot pass a string to a function that expects bytes (or vice versa).
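The rules above can be checked directly. The following is a minimal sketch, not part of the original article; the helper name try_concat and the variable names are made up for illustration.

```python
text = "abc"    # str: Unicode text
data = b"abc"   # bytes: raw binary data


def try_concat(s: str, b: bytes) -> str:
    """Return 'TypeError' if s + b is rejected, otherwise 'ok'."""
    try:
        s + b
    except TypeError:
        return "TypeError"
    return "ok"


concat_result = try_concat(text, data)

# Searching for a str inside a bytes object is rejected as well.
try:
    text in data
    search_result = "ok"
except TypeError:
    search_result = "TypeError"

# Comparing the two does not raise, but they are never equal.
equal = (text == data)

print(concat_result, search_result, equal)
```

Both the concatenation and the search fail with a TypeError, and the comparison is simply False, even though the two values look alike on screen.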
Let's take a closer look at the differences and connections between the two.
History of Character Encoding
Before we talk about bytes and str, we need to look at how character encoding has evolved.
In the early days of computing, English-speaking countries, led by the United States, dominated the computer industry, and the 26 letters of the English alphabet were enough to write every word, sentence, and article that mattered there. The earliest widely used character encoding was therefore ASCII, a 7-bit code that is stored in one 8-bit byte per character and covers everything needed for English text.
What is an encoding? An encoding is a way of representing characters in binary. Everything, whether English, Chinese, or symbols, is ultimately stored on disk as something like 01010101. Inside a computer, reading and storing data boils down to a bit stream of 0s and 1s. The problem is that humans cannot read these bit streams, so how do we make them readable? This is the job of a character encoding. It acts as a translator somewhere inside the computer, transparently turning the bit stream into text that humans can understand directly. Ordinary users do not need to know how this works, but programmers must understand it clearly.
Take ASCII as an example. Each character is stored in one byte of 8 bits (a slot as wide as "00000000"), so the data can be interpreted one byte at a time. For example, 01000001 represents the capital letter A; for convenience we often write it as the decimal number 65 instead. Eight bits can distinguish 2 to the 8th power, that is 256, different values (0 to 255); ASCII itself defines only the first 128 of them.
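The ASCII example can be reproduced with Python's built-in ord(), chr(), and format() functions. This is a small illustrative sketch; the variable names are not from the original article.

```python
code = ord("A")                     # numeric code of 'A'
bits = format(code, "08b")          # the same number as 8 binary digits
back = chr(0b01000001)              # from the bit pattern back to the character
ascii_bytes = "A".encode("ascii")   # one byte whose value is 65

# A character outside the ASCII range cannot be encoded as ASCII.
try:
    "中".encode("ascii")
    non_ascii = "ok"
except UnicodeEncodeError:
    non_ascii = "UnicodeEncodeError"

print(code, bits, back, list(ascii_bytes), non_ascii)
```

The last step shows why ASCII alone is not enough for Chinese text: the codec has no value to assign to the character.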
Later, as computers spread around the world, Chinese, Japanese, Korean, and many other scripts had to be represented as well. The few hundred values available in one byte were far from enough, so a standards body developed a universal character set called Unicode, which assigns every character, whatever its language, a unique number (its code point). The early encodings of Unicode used at least 2 bytes for every character: UCS-2 and UTF-16 use 2 bytes for English letters and for most Chinese characters alike, and UTF-16 uses 4 bytes for less common characters. This meets everyone's needs, but these encodings are not compatible with ASCII and waste space and memory, because much of the text in the computer world consists of English letters, which fit in 1 byte but now take 2.
So the UTF-8 encoding came into being. It is a variable-length encoding that uses 1 to 4 bytes per character: characters in the ASCII range, including the English alphabet, take 1 byte, common Chinese characters take 3 bytes, and so on. Because its 1-byte values are identical to ASCII, UTF-8 is compatible with ASCII and can decode older ASCII documents. UTF-8 soon came into wide use.
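The size difference between a 2-byte Unicode encoding and UTF-8 can be measured by encoding single characters. This is a sketch under one assumption: "utf-16-le" is used to stand in for the 2-byte encoding, because it does not prepend a byte order mark. The helper encoded_length is hypothetical.

```python
def encoded_length(ch: str, encoding: str) -> int:
    """Number of bytes needed to store ch in the given encoding."""
    return len(ch.encode(encoding))


utf8_letter = encoded_length("A", "utf-8")        # English letter in UTF-8
utf8_hanzi = encoded_length("中", "utf-8")        # Chinese character in UTF-8
utf16_letter = encoded_length("A", "utf-16-le")   # English letter in UTF-16
utf16_hanzi = encoded_length("中", "utf-16-le")   # Chinese character in UTF-16

# For ASCII characters, UTF-8 produces exactly the ASCII bytes.
ascii_compatible = "A".encode("utf-8") == "A".encode("ascii")

print(utf8_letter, utf8_hanzi, utf16_letter, utf16_hanzi, ascii_compatible)
```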
Along the way, Chinese-speaking regions also created their own encodings, for example GB2312, GBK, and Big5. They are used mainly within those regions and are uncommon elsewhere. In GBK, a Chinese character occupies 2 bytes.
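The 2-byte claim for GBK is easy to confirm with Python's built-in "gbk" codec. A minimal sketch with illustrative variable names:

```python
gbk_hanzi = "中".encode("gbk")    # one Chinese character in GBK
gbk_letter = "A".encode("gbk")    # an ASCII letter in GBK
utf8_hanzi = "中".encode("utf-8") # the same character in UTF-8, for comparison

print(len(gbk_hanzi), len(gbk_letter), len(utf8_hanzi))
```

The same character therefore has different byte sequences, and different lengths, depending on the encoding chosen.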
Similarities and differences between bytes and str
Back to bytes and str. A bytes object is a sequence of bytes, which at the lowest level is a bit stream such as 01010001110. Whether we are writing code or reading an article, nobody reads that bit stream directly. It needs an encoding to turn it into something meaningful instead of an obscure pile of 0s and 1s. Because there are many encodings, the same bit stream can be interpreted in different ways, which causes a lot of trouble in practice. Let's see how Python handles these encoding problems:
>>> s = "中文"
>>> s
'中文'
>>> type(s)
<class 'str'>
>>> b = bytes(s, encoding='utf-8')
>>> b
b'\xe4\xb8\xad\xe6\x96\x87'
>>> type(b)
<class 'bytes'>
As the example shows, s is a string. Python's built-in bytes() function converts a str into a bytes object. The result b is really just a run of 0s and 1s, but so that we can inspect it conveniently in the interactive interpreter it is displayed in the form b'\xe4\xb8\xad\xe6\x96\x87'. The leading b means that this is a bytes object, and \xe4 is a hexadecimal escape that stands for exactly 1 byte. After "中文" is encoded as UTF-8 we can therefore count 6 bytes in total, 3 for each Chinese character, which confirms the discussion above. When a str is passed to bytes(), the encoding argument is required and cannot be omitted.
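The points in this paragraph can be verified as follows. This is an illustrative sketch; list() and bytes.hex() are used here only as two alternative views of the same six bytes.

```python
s = "中文"

# Passing a str to bytes() without an encoding is rejected.
try:
    bytes(s)
    outcome = "ok"
except TypeError:
    outcome = "TypeError"

b = bytes(s, encoding="utf-8")

byte_values = list(b)   # each byte as an integer from 0 to 255
hex_view = b.hex()      # the same bytes as a hexadecimal string

print(outcome, len(b), byte_values, hex_view)
```

Each \x escape in the interpreter's display corresponds to one entry in byte_values and to two hexadecimal digits in hex_view.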
The str class has an encode() method, which performs the encoding step from a string to a byte sequence. The bytes type, in turn, has a decode() method, which performs the decoding step from a byte sequence to a string. Apart from that, if we look at the Python source code we find that bytes and str have almost the same list of methods; the biggest difference is encode and decode.
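A short round trip illustrates the two methods and the otherwise similar interfaces. This is a sketch with illustrative names; hasattr() and dir() are used only to inspect which methods each type offers.

```python
s = "中文"

b = s.encode("utf-8")           # str -> bytes
round_trip = b.decode("utf-8")  # bytes -> str

# encode() gives the same result as the bytes() constructor.
same_as_constructor = (b == bytes(s, encoding="utf-8"))

# Each direction exists on only one of the two types.
str_has_decode = hasattr(str, "decode")
bytes_has_encode = hasattr(bytes, "encode")

# Many other methods are available on both types.
shared_methods = set(dir(str)) & set(dir(bytes))
upper_bytes = b"hello".upper()

print(round_trip, same_as_constructor, str_has_decode, bytes_has_encode)
```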
In essence, a string stored on disk is also a combination of 0s and 1s, so it too has to be encoded when it is written and decoded when it is read.
If the explanation above still does not make the difference between the two clear, remember the following points:
- When strings are saved to and read from disk, Python does the encoding and decoding for you automatically, and you do not need to care about the process.
- Using the bytes type essentially tells Python that it should not do the encoding and decoding for you automatically; you do it yourself and specify the encoding.
- Python strictly distinguishes between the two data types bytes and str: you cannot pass a str where a bytes argument is required, and vice versa. This is easy to run into when reading and writing files on disk.
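These points map onto the text and binary modes of the built-in open() function. The sketch below is an illustration rather than the author's own example: the file name demo.txt is made up, a temporary directory is used so that nothing is left behind, and the encoding is passed explicitly because the default used by text mode depends on the platform.

```python
import os
import tempfile

text = "中文"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.txt")   # hypothetical file name

    # Text mode: Python encodes on write and decodes on read.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    with open(path, "r", encoding="utf-8") as f:
        as_text = f.read()

    # Binary mode: we get the raw bytes and must decode them ourselves.
    with open(path, "rb") as f:
        as_bytes = f.read()
    decoded_manually = as_bytes.decode("utf-8")

    # Writing a str to a file opened in binary mode is rejected.
    try:
        with open(path, "wb") as f:
            f.write(text)
        binary_write = "ok"
    except TypeError:
        binary_write = "TypeError"

print(as_text, as_bytes, decoded_manually, binary_write)
```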
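Converting between bytes and str is exactly the process of encoding and decoding, and the encoding should be specified explicitly. Note that str(b) without an encoding does not decode anything: it returns the printable representation of the bytes object as a string.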
>>> b
b'\xe4\xb8\xad\xe6\x96\x87'
>>> type(b)
<class 'bytes'>
>>> s1 = str(b)
>>> s1
"b'\\xe4\\xb8\\xad\\xe6\\x96\\x87'"
>>> type(s1)
<class 'str'>
>>> s1 = str(b, encoding='utf-8')
>>> s1
'中文'
>>> type(s1)
<class 'str'>
Now let's convert the string s1 to bytes using the gbk encoding:
>>> s1
'中文'
>>> type(s1)
<class 'str'>
>>> b = bytes(s1, encoding='gbk')
>>> b
b'\xd6\xd0\xce\xc4'
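Because the same text yields different bytes under gbk and utf-8, bytes must be decoded with the encoding that produced them. The following sketch, which is an addition and not part of the original article, shows what happens when the gbk bytes are decoded as UTF-8.

```python
s1 = "中文"
gbk_bytes = s1.encode("gbk")

# Decoding with the matching codec restores the original text.
correct = gbk_bytes.decode("gbk")

# Decoding with the wrong codec fails, because these bytes are not valid UTF-8.
try:
    gbk_bytes.decode("utf-8")
    wrong = "ok"
except UnicodeDecodeError:
    wrong = "UnicodeDecodeError"

# errors="replace" avoids the exception but substitutes U+FFFD
# for the undecodable bytes, so the original text is lost.
lossy = gbk_bytes.decode("utf-8", errors="replace")

print(correct, wrong, "\ufffd" in lossy)
```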