Effective Python (3): Understanding bytes and str

2019-09-26 warmsirius

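I. Character sequence types in Python 3 and Python 2

1. Python3

Python 3 has two types that represent sequences of characters: bytes and str.

'hello' # str, i.e. Unicode text
u'hello' # also str, i.e. Unicode text (the u prefix is accepted but redundant)
b'hello' # bytes

Instances of bytes contain raw 8-bit values.
Instances of str contain Unicode characters.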

# PY3
>>> type('hello')
<class 'str'>
>>> type(b'hello')
<class 'bytes'>
>>> type(u'hello')
<class 'str'>
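To make the difference between 8-bit values and Unicode characters concrete, here is a minimal Python 3 sketch. The sample word and the variable names are illustrative choices, not taken from the book.

```python
# Python 3: the same text viewed as str (characters) and as bytes (8-bit values).
text = 'caf\u00e9'            # 4 Unicode characters; \u00e9 is e with an acute accent
data = text.encode('utf-8')   # 5 bytes, because that character needs 2 bytes in UTF-8

char_count = len(text)        # len() on str counts characters
byte_count = len(data)        # len() on bytes counts bytes

first_char = text[0]          # indexing a str yields a 1-character str
first_byte = data[0]          # indexing bytes yields an int in range(256)

print(char_count, byte_count)   # 4 5
print(first_char, first_byte)   # c 99
print(list(data))               # [99, 97, 102, 195, 169]
```

The length mismatch is the usual sign that a value has been encoded: one character can occupy more than one byte.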

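Note: when writing Python programs, do the encoding and decoding at the outermost boundary of your interfaces. The core of the program should work with Unicode character types and should make no assumptions about character encodings.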
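One way to picture this advice in Python 3 is sketched below. The function names shout and handle_request are hypothetical, and the sketch assumes the bytes arriving at the boundary are UTF-8 unless the caller says otherwise.

```python
def shout(text):
    """Core logic (hypothetical): works on str only, knows nothing about encodings."""
    return text.upper() + '!'


def handle_request(raw, encoding='utf-8'):
    """Boundary (hypothetical): decode on the way in, encode on the way out."""
    text = raw.decode(encoding)       # bytes -> str at the edge
    result = shout(text)              # pure str processing in the core
    return result.encode(encoding)    # str -> bytes at the edge


incoming = 'h\u00e9llo'.encode('utf-8')   # pretend these bytes came from a socket or file
reply = handle_request(incoming)
print(reply)
```

Only handle_request mentions an encoding, so switching to a different one would touch a single function.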

2. Python2

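Python 2 also has two types that represent sequences of characters: str and unicode.

'hello' # str, i.e. bytes
b'hello' # str, i.e. bytes
u'hello' # unicode

Instances of str contain raw 8-bit values.
Instances of unicode contain Unicode characters.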

For example:

# PY2
>>> type('hello')
<type 'str'>
>>> type(b'hello')
<type 'str'>
>>> type(u'hello')
<type 'unicode'>

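II. Unicode characters ⇔ bytes

1. Unicode characters ⇒ bytes

The most common encoding is UTF-8.
UTF-8 can represent every Unicode character, including special symbols, and is used worldwide. Some data is not text at all, however: images, video and similar content can only be handled as bytes (binary data) and should not be treated as UTF-8.

Python 3 str instances and Python 2 unicode instances hold Unicode characters, not bytes. To convert Unicode characters to bytes, you must use the encode method.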

For example:

Py2

>>> u'hello'.encode('utf8')
'hello'

Py3

>>> 'hello'.encode('utf8')
b'hello'
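For pure ASCII text such as 'hello', the encoded bytes look the same under many encodings. The Python 3 sketch below uses a non-ASCII sample string (an illustrative choice) to show that the encoding passed to encode does matter.

```python
text = 'h\u00e9llo'                      # 'hello' with an accented e

utf8_bytes = text.encode('utf-8')        # the accented character becomes 2 bytes
latin1_bytes = text.encode('latin-1')    # the same character becomes 1 byte

try:
    text.encode('ascii')                 # ASCII cannot represent the accented character
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(utf8_bytes)     # b'h\xc3\xa9llo'
print(latin1_bytes)   # b'h\xe9llo'
print(ascii_ok)       # False
```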

2. Bytes ⇒ Unicode characters

To convert bytes to a Unicode string, you must use the decode method.

Python 3 bytes and Python 2 str are both byte types, and both can be converted to the Unicode type with decode.

Py2

>>> 'hello'.decode('utf8')
u'hello'

Py3

>>> b'hello'.decode('utf8')
'hello'
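Decoding only works when the bytes really were produced with the encoding you name. A short Python 3 sketch, assuming some Latin-1 bytes are mistakenly read as UTF-8:

```python
data = 'h\u00e9llo'.encode('latin-1')    # b'h\xe9llo', which is not valid UTF-8

try:
    data.decode('utf-8')                 # wrong encoding for these bytes
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False

right = data.decode('latin-1')                    # the correct encoding recovers the text
lossy = data.decode('utf-8', errors='replace')    # bad byte becomes U+FFFD

print(decoded_ok)      # False
print(right == 'h\u00e9llo')   # True
print(lossy == 'h\ufffdllo')   # True
```

errors='replace' keeps the program running but loses information, so it is best reserved for cases where the original encoding truly cannot be determined.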

III. Helper functions for converting between bytes and unicode in Py2 and Py3

1. Python3

This function accepts str (Unicode) or bytes and always returns str (Unicode):

def to_str(bytes_or_str):
    # Decode only when given bytes; a str is returned unchanged.
    if isinstance(bytes_or_str, bytes):
        value = bytes_or_str.decode("utf-8")
    else:
        value = bytes_or_str
    return value # Instance of str

This function accepts str (Unicode) or bytes and always returns bytes:

def to_bytes(bytes_or_str):
    # Encode only when given str; bytes are returned unchanged.
    if isinstance(bytes_or_str, str):
        value = bytes_or_str.encode("utf-8")
    else:
        value = bytes_or_str
    return value # Instance of bytes
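A quick way to check the two Python 3 helpers is to feed each of them both input types. The sketch below restates them under the names to_str and to_bytes so that it runs on its own; the sample values are illustrative.

```python
def to_str(bytes_or_str):
    """Return str, decoding from UTF-8 when given bytes."""
    if isinstance(bytes_or_str, bytes):
        return bytes_or_str.decode('utf-8')
    return bytes_or_str


def to_bytes(bytes_or_str):
    """Return bytes, encoding to UTF-8 when given str."""
    if isinstance(bytes_or_str, str):
        return bytes_or_str.encode('utf-8')
    return bytes_or_str


samples = ['caf\u00e9', b'caf\xc3\xa9']       # the same word as str and as UTF-8 bytes
as_str = [to_str(s) for s in samples]         # both entries become str
as_bytes = [to_bytes(s) for s in samples]     # both entries become bytes

print(as_str[0] == as_str[1])       # True
print(as_bytes[0] == as_bytes[1])   # True
```

Whatever the caller passes in, code after the helper call sees a single, predictable type.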

2. Python2

This function accepts str (bytes) or unicode and always returns str (bytes):

# Python 2
def to_str(unicode_or_str):
    if isinstance(unicode_or_str, unicode):
        value = unicode_or_str.encode("utf-8")
    else:
        value = unicode_or_str
    return value # Instance of str

This function accepts str (bytes) or unicode and always returns unicode:

# Python 2
def to_unicode(unicode_or_str):
    if isinstance(unicode_or_str, str):
        value = unicode_or_str.decode("utf-8")
    else:
        value = unicode_or_str
    return value # Instance of unicode

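IV. Things to watch for when using Unicode and bytes

1. Python2

a. In Python 2, if a str (bytes) contains only 7-bit ASCII characters, unicode and str instances appear to be the same type. In that case:

You can join a str and a unicode with the + operator.
You can compare such str and unicode instances with the equality and inequality operators.
You can use a unicode instance for format specifiers such as "%s" in a format string.

These behaviors mean that, as long as you only handle 7-bit ASCII, you can pass a unicode to a function that expects a str, and a str to a function that expects a unicode.

b. In Python 2, a file handle obtained from the built-in open function defaults to binary encoding: read and write operations work with bytes (str).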

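2. Python3

a. In Python 3, bytes and str instances are never equivalent, not even when both are empty. You must therefore pay attention to the type of every character sequence you pass around.
b. In Python 3, a file handle obtained from the built-in open function defaults to text mode: read and write operations expect str, encoded as UTF-8 on most systems (strictly, with the platform's default encoding unless encoding is given).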
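The strictness described in point a can be seen directly in a Python 3 session. This is a minimal sketch; the sample values are illustrative.

```python
empty_equal = (b'' == '')            # False, even though both are empty
same_text_equal = (b'foo' == 'foo')  # False, even though both are ASCII

try:
    b'one' + 'two'                   # mixing bytes and str with + is an error
    concat_ok = True
except TypeError:
    concat_ok = False

# Formatting bytes with %s does not decode them; it inserts their repr.
formatted = '%s' % b'foo'

print(empty_equal, same_text_equal, concat_ok)   # False False False
print(formatted)                                 # b'foo'
```

The last line is an easy mistake to miss, because nothing raises: the output silently contains the characters b'foo' instead of foo.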

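V. Key points

In Python 3, bytes contains sequences of 8-bit values and str contains sequences of Unicode characters.
In Python 2, str contains sequences of 8-bit values and unicode contains sequences of Unicode characters.

In Python 2, if a str contains only 7-bit ASCII characters, str and unicode can be used together with the relevant operators. In Python 3, bytes and str cannot be mixed this way.

Before operating on input data, use helper functions to make sure the character sequence has the type you expect. Sometimes you want to work with 8-bit values that hold UTF-8 encoded text, and sometimes with Unicode characters.

When reading binary data from a file, or writing binary data to it, always open the file in a binary mode such as 'rb' or 'wb'.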
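The last point can be illustrated with a small Python 3 sketch. The file name random.bin and the payload are arbitrary, and a temporary directory is used so that nothing is left on disk.

```python
import os
import tempfile

payload = b'\xf1\xf2\xf3\xf4\xf5'    # arbitrary binary data, not valid UTF-8

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'random.bin')   # hypothetical file name

    # Text mode ('w') expects str, so writing bytes fails.
    try:
        with open(path, 'w') as f:
            f.write(payload)
        text_mode_ok = True
    except TypeError:
        text_mode_ok = False

    # Binary modes ('wb' / 'rb') accept and return bytes unchanged.
    with open(path, 'wb') as f:
        f.write(payload)
    with open(path, 'rb') as f:
        round_trip = f.read()

print(text_mode_ok)            # False
print(round_trip == payload)   # True
```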
