Best way to convert string to bytes in Python 3?

Translated from: Best way to convert string to bytes in Python 3?

There appear to be two different ways to convert a string to bytes, as seen in the answers to "TypeError: 'str' does not support the buffer interface".

Which of these methods would be better or more Pythonic? Or is it just a matter of personal preference?

b = bytes(mystring, 'utf-8')

b = mystring.encode('utf-8')
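
For reference, a quick check suggests the two spellings produce the same bytes object. The sample text below is an assumption chosen only to include some non-ASCII characters:

```python
# Hypothetical sample text; any str should behave the same way.
mystring = "naïve café"

via_constructor = bytes(mystring, "utf-8")  # constructor form
via_method = mystring.encode("utf-8")       # method form

print(via_constructor == via_method)                              # True
print(type(via_constructor).__name__, type(via_method).__name__)  # bytes bytes
```

So the choice between them is about readability rather than about the result.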

Reply #1

Reference: https://stackoom.com/question/VpJj/在Python-中将字符串转换为字节的最佳方法


Reply #2

It's easier than you might think:

my_str = "hello world"
my_str_as_bytes = str.encode(my_str)
type(my_str_as_bytes) # ensure it is byte representation
my_decoded_str = my_str_as_bytes.decode()
type(my_decoded_str) # ensure it is string representation

Reply #3

so_string = 'stackoverflow'
so_bytes = so_string.encode()

Reply #4

The absolutely best way is neither of the two, but a third. The first parameter to encode has defaulted to 'utf-8' ever since Python 3.0. Thus the best way is

b = mystring.encode()

This will also be faster, because the default argument results not in the string "utf-8" in the C code, but NULL, which is much faster to check!

Here are some timings:

In [1]: %timeit -r 10 'abc'.encode('utf-8')
The slowest run took 38.07 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 183 ns per loop

In [2]: %timeit -r 10 'abc'.encode()
The slowest run took 27.34 times longer than the fastest. 
This could mean that an intermediate result is being cached.
10000000 loops, best of 10: 137 ns per loop

Despite the warning, the times were very stable after repeated runs; the deviation was just ~2 per cent.
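
To repeat the comparison outside IPython, here is a minimal sketch using the standard timeit module. The loop count is an arbitrary assumption, and the absolute numbers (and even the size of the gap) will likely differ between machines and Python versions, so treat the output as indicative only:

```python
import timeit

# Arbitrary loop count, chosen only to keep the run short.
number = 100_000

# Total seconds for `number` calls of each spelling.
explicit = timeit.timeit("'abc'.encode('utf-8')", number=number)
default = timeit.timeit("'abc'.encode()", number=number)

print(f"encode('utf-8'): {explicit / number * 1e9:.0f} ns per call")
print(f"encode():        {default / number * 1e9:.0f} ns per call")
```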


Using encode() without an argument is not Python 2 compatible, as in Python 2 the default character encoding is ASCII.

>>> 'äöä'.encode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
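
By contrast, the same call succeeds under Python 3, where str is Unicode text and the default encoding is UTF-8. A small sketch (the variable names are only for illustration):

```python
text = "äöä"

encoded = text.encode()          # defaults to UTF-8 in Python 3
explicit = text.encode("utf-8")  # same result, spelled out

print(encoded)              # b'\xc3\xa4\xc3\xb6\xc3\xa4'
print(encoded == explicit)  # True
print(len(text), len(encoded))  # 3 6 (each character takes two bytes here)
```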

Reply #5

You can simply convert a string to bytes using:

a_string.encode()

and you can simply convert bytes to a string using:

some_bytes.decode()

bytes.decode and str.encode have encoding='utf-8' as the default value.

The following functions (taken from Effective Python) might be useful for converting str to bytes and bytes to str:

def to_bytes(bytes_or_str):
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode() # uses 'utf-8' for encoding
    else:
        value = bytes_or_str
    return value # Instance of bytes


def to_str(bytes_or_str):
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode() # uses 'utf-8' for decoding
    else:
        value = bytes_or_str
    return value # Instance of str
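
A short usage sketch for these helpers, with the two functions restated in condensed form so the snippet runs on its own. The sample values are assumptions for illustration:

```python
def to_bytes(bytes_or_str):
    # Encode str with the default 'utf-8'; pass bytes through unchanged.
    if isinstance(bytes_or_str, str):
        return bytes_or_str.encode()
    return bytes_or_str


def to_str(bytes_or_str):
    # Decode bytes with the default 'utf-8'; pass str through unchanged.
    if isinstance(bytes_or_str, bytes):
        return bytes_or_str.decode()
    return bytes_or_str


print(to_bytes("héllo"))            # b'h\xc3\xa9llo'
print(to_bytes(b"already bytes"))   # b'already bytes'
print(to_str(b"h\xc3\xa9llo"))      # héllo
print(to_str("already str"))        # already str
```

The point of the helpers is that calling code can accept either type at its boundary and work with a single type internally.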

Reply #6

If you look at the docs for bytes, they point you to bytearray:

bytearray([source[, encoding[, errors]]])

Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256. It has most of the usual methods of mutable sequences, described in Mutable Sequence Types, as well as most methods that the bytes type has, see Bytes and Byte Array Methods.

The optional source parameter can be used to initialize the array in a few different ways:

If it is a string, you must also give the encoding (and optionally, errors) parameters; bytearray() then converts the string to bytes using str.encode().

If it is an integer, the array will have that size and will be initialized with null bytes.

If it is an object conforming to the buffer interface, a read-only buffer of the object will be used to initialize the bytes array.

If it is an iterable, it must be an iterable of integers in the range 0 <= x < 256, which are used as the initial contents of the array.

Without an argument, an array of size 0 is created.
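
The cases listed in that excerpt can be sketched as follows. The variable names are illustrative only, and the same source forms are accepted by the bytes constructor:

```python
from_text = bytearray("hi", "utf-8")        # string: encoding is required
from_size = bytearray(3)                    # integer: that many null bytes
from_buffer = bytearray(memoryview(b"hi"))  # object supporting the buffer interface
from_ints = bytearray([104, 105])           # iterable of ints in range(256)
empty = bytearray()                         # no argument: size 0

print(from_text, from_size, from_buffer, from_ints, empty)

# A string without an encoding is rejected rather than guessed at.
try:
    bytearray("hi")
except TypeError as exc:
    missing_encoding_error = exc
    print("TypeError:", exc)
```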

So bytes can do much more than just encode a string. It's Pythonic that it lets you call the constructor with any type of source parameter that makes sense.

For encoding a string, I think that some_string.encode(encoding) is more Pythonic than using the constructor, because it is the most self-documenting: "take this string and encode it with this encoding" is clearer than bytes(some_string, encoding), where there is no explicit verb.

Edit: I checked the Python source. If you pass a unicode string to bytes using CPython, it calls PyUnicode_AsEncodedString, which is the implementation of encode; so you're just skipping a level of indirection if you call encode yourself.

Also, see Serdalis' comment: unicode_string.encode(encoding) is also more Pythonic because its inverse is byte_string.decode(encoding), and symmetry is nice.
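
A small round-trip sketch of that symmetry. The sample text and the deliberately mismatched codec are assumptions added for illustration:

```python
unicode_string = "café"

byte_string = unicode_string.encode("utf-8")  # str -> bytes
round_tripped = byte_string.decode("utf-8")   # bytes -> str, the inverse call

print(round_tripped == unicode_string)  # True

# Decoding with a different codec does not raise here, but yields other text.
mismatched = byte_string.decode("latin-1")
print(mismatched == unicode_string)     # False
```

Naming the encoding on both sides keeps the pair visibly matched, which is harder to see when one direction uses a constructor.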

Reposted from blog.csdn.net/xfxf996/article/details/105466908