Python Encoding Problems

2019-05-17, by 堆雪人的小朋友

UnicodeDecodeError when reading a file in Python

Scenario: text processing

Platform: Linux (Red Hat), Python 3.5

Case 1:

with open(filename, 'r') as f: 
  for line in f: 
    print(line)

This raises an error:

Traceback (most recent call last):
  File "count_bleu.py", line 16, in <module>
    for line in f:
  File "/nfs/project/tools/env/tf1.8_py3.5_all_luban/lib/python3.5/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe5 in position 1: ordinal not in range(128)

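Cause:
When open is called without the encoding parameter, it falls back to the platform's default encoding, that is, whatever the platform uses when no encoding is specified. The defaults can be inspected in the following two ways. Note that open follows locale.getpreferredencoding(), whereas sys.getdefaultencoding() is the default for str.encode() and bytes.decode() and is always 'utf-8' on Python 3.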

import locale 
print(locale.getpreferredencoding())

import sys 
print(sys.platform) 
print(sys.getdefaultencoding())
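
As a quick check of which encoding open actually picks, the sketch below opens a file without an encoding argument and reads back the encoding attribute of the resulting file object. The helper name default_open_encoding is made up for this illustration, and os.devnull is used only as a file that is assumed to exist.

```python
import locale
import os
import sys


def default_open_encoding():
    """Return the encoding open() chooses when none is given (hypothetical helper)."""
    # os.devnull is just a convenient file that is assumed to exist.
    with open(os.devnull, 'r') as f:
        return f.encoding


print('open() default      :', default_open_encoding())
print('locale preference   :', locale.getpreferredencoding(False))
print('str.encode() default:', sys.getdefaultencoding())
```

On a machine showing the error above, the first two lines would be expected to report an ASCII variant while the third still reports utf-8.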

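The file is encoded in UTF-8, while the system default encoding is ASCII (reported as ANSI_X3.4-1968). UTF-8 stores Chinese characters as multi-byte sequences whose bytes all lie outside the 0-127 ASCII range, so decoding them as ASCII fails when the file is read. The fix is to pass the encoding explicitly: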

with open(filename, encoding='utf-8') as f: 
  for line in f: 
    print(line)
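
To see the difference in isolation, the following sketch writes a small UTF-8 file and reads it back twice: once with encoding='ascii', which stands in for a system whose default is ASCII, and once with encoding='utf-8'. The temporary file and the sample text are assumptions for the demo, not part of the original script.

```python
import os
import tempfile

sample = '唱歌\n'

# Create a throwaway UTF-8 file (a stand-in for the real input file).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'w', encoding='utf-8') as f:
    f.write(sample)

try:
    # Forcing ASCII mimics open() on a system whose default encoding is ASCII.
    try:
        with open(path, encoding='ascii') as f:
            f.read()
        ascii_error = None
    except UnicodeDecodeError as exc:
        ascii_error = exc

    # Naming the real encoding makes the read succeed regardless of the locale.
    with open(path, encoding='utf-8') as f:
        utf8_text = f.read()
finally:
    os.remove(path)

print(type(ascii_error).__name__)  # UnicodeDecodeError
print(utf8_text == sample)         # True
```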

Case 3:

a = b'\xe5\x94\xb1\xe6\xad\x8c' 
a = a.decode("utf-8") 
print(a)

Solution:
If print fails here because standard output is using ASCII, use PYTHONIOENCODING.
Set PYTHONIOENCODING=utf-8 when launching Python, i.e.

PYTHONIOENCODING=utf-8 python your_script.py
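
The effect of the variable can be checked from Python itself. The sketch below starts a child interpreter with its output piped and with PYTHONIOENCODING=utf-8 in its environment, then asks the child to report sys.stdout.encoding. The helper name child_stdout_encoding is hypothetical.

```python
import os
import subprocess
import sys


def child_stdout_encoding(extra_env):
    """Run a child Python with piped stdout and return its sys.stdout.encoding."""
    env = dict(os.environ)
    env.update(extra_env)
    result = subprocess.run(
        [sys.executable, '-c', 'import sys; print(sys.stdout.encoding)'],
        stdout=subprocess.PIPE,
        env=env,
        universal_newlines=True,  # decode the child's output as text
    )
    return result.stdout.strip()


print(child_stdout_encoding({'PYTHONIOENCODING': 'utf-8'}))
```

Because the child writes to a pipe rather than a terminal, this also covers the case where a script is run as a subprocess.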

Alternatively, redefine standard output as follows:

import codecs, sys
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())

To print log output:

sys.stdout.write("Your content....")
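
Why the choice of output encoding matters can be shown without touching the real sys.stdout. In this sketch an in-memory io.BytesIO wrapped in io.TextIOWrapper plays the role of standard output; write_text is a hypothetical helper introduced only for the demonstration.

```python
import io


def write_text(text, encoding):
    """Write text through a text stream using `encoding`; return the bytes produced."""
    raw = io.BytesIO()
    stream = io.TextIOWrapper(raw, encoding=encoding)
    stream.write(text)
    stream.flush()
    return raw.getvalue()


text = b'\xe5\x94\xb1\xe6\xad\x8c'.decode('utf-8')

# A UTF-8 stream reproduces the original six bytes.
print(write_text(text, 'utf-8'))

# An ASCII stream cannot represent the characters and raises.
try:
    write_text(text, 'ascii')
except UnicodeEncodeError as exc:
    print('ascii stream failed:', exc.reason)
```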

Platform: Linux (Red Hat), Python 2.7

Case 1:

with open(filename , 'r') as f: 
  for line in f: 
    print(line)

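Using open with its defaults raises no error here, because in Python 2 open returns byte strings and does not decode them.
Case 2: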

with open(filename, 'r', encoding='utf-8') as f: 
  for line in f: 
    print(line)

This raises the following error:

Traceback (most recent call last):
  File "count_bleu.py.1", line 13, in <module>
    with open(filename, encoding='utf-8') as f:
TypeError: 'encoding' is an invalid keyword argument for this function

Cause:
The built-in open in Python 2 has no encoding parameter.
Solution:

import io
with io.open(filename, 'r', encoding='utf-8') as f:
    for line in f:
        print(line)

Case 3:

s = 'π排球の' 
b = s.encode('ascii')

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xcf in position 0: ordinal not in range(128)

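Cause:
Python 2 uses ASCII as its default codec, so an error is raised as soon as a character outside ASCII appears.
Non-ASCII characters cannot be converted to a byte string with the ASCII codec, and the Latin-1 and UTF-8 encodings are not compatible with each other.
Only for characters in the ASCII set do the three encodings give exactly the same result.
As soon as your Python code runs behind a pipe or as a subprocess, sys.stdout.encoding stops working (the ASCII character set cannot represent Chinese characters), and you run into UnicodeEncodeError again.
Two things happen:
First, sys.stdout.encoding becomes None;
Second, Python tries to encode unicode with ascii when printing.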
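
The traceback reports a UnicodeDecodeError even though the code calls encode. That is because a Python 2 str is a byte string, so encode has to decode it with the default codec first. The sketch below imitates those two steps in Python 3 syntax; py2_style_encode is a hypothetical helper, and the bytes are assumed to come from a UTF-8 source file or terminal.

```python
# What the Python 2 literal 'π排球の' holds when the source is UTF-8 (assumption).
raw = 'π排球の'.encode('utf-8')


def py2_style_encode(byte_string, target='ascii', default='ascii'):
    """Imitate Python 2's str.encode: implicit decode, then encode."""
    text = byte_string.decode(default)  # step 1: implicit decode with the default codec
    return text.encode(target)          # step 2: the encode the caller asked for


try:
    py2_style_encode(raw)
except UnicodeDecodeError as exc:
    # Fails in step 1, on the first byte of 'π'.
    print(hex(raw[exc.start]), 'at position', exc.start)  # 0xcf at position 0

print(py2_style_encode(b'abc'))  # b'abc'
```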

>>> s = 'abc' 
>>> print(s.encode()) 
b'abc' 
>>> print(s.encode('latin')) 
b'abc' 
>>> print(s.encode('utf-8')) 
b'abc'
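
For contrast, here is a small sketch of what happens once the characters leave the ASCII range. The sample characters and the helper try_encode are chosen only for illustration.

```python
def try_encode(text, encoding):
    """Return the encoded bytes, or the exception class name if encoding fails."""
    try:
        return text.encode(encoding)
    except UnicodeEncodeError as exc:
        return type(exc).__name__


for ch in ('a', 'é', 'π'):
    for enc in ('ascii', 'latin-1', 'utf-8'):
        print(repr(ch), enc, try_encode(ch, enc))
```

Only 'a' encodes identically everywhere. 'é' fails in ASCII and yields different bytes in Latin-1 (b'\xe9') and UTF-8 (b'\xc3\xa9'), and 'π' can be encoded by UTF-8 alone.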

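In Python 2, when an operation mixes str and unicode, Python always converts the str to unicode first (decoding it with the default codec) and then performs the operation, so the result is also unicode.
A workaround for Python 2, which changes that default codec for the whole process: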

import sys 
reload(sys) 
sys.setdefaultencoding('utf8')
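
An alternative that avoids changing the interpreter-wide default is to decode byte strings explicitly where they enter the program and to work with text everywhere else. The sketch is written in Python 3 syntax; to_text is a hypothetical helper and UTF-8 is an assumed input encoding.

```python
def to_text(value, encoding='utf-8'):
    """Return value as text, decoding it first if it is a byte string."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Mixing a byte string and a text string after explicit conversion.
greeting = to_text(b'\xe5\x94\xb1\xe6\xad\x8c') + to_text('!')
print(len(greeting))  # 3
```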

Background

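Python has two kinds of strings: text strings and byte strings. The text string type is simply named str and uses the Unicode character set internally (which is compatible with ASCII). A byte string represents a raw sequence of bytes; when it is printed, bytes in the ASCII range are shown as the corresponding ASCII characters and the rest are shown as hexadecimal escapes. This type is named bytes.
Encoding and decoding are the conversions between these two string types, str and bytes.
str has an encode method that converts it to bytes using a given encoding; this is called encoding. bytes has a decode method that likewise takes an encoding (optional in Python 3, defaulting to UTF-8) and returns a str; this is called decoding.
To address encoding problems, Python 3 makes every string a Unicode string stored in the str type, and str has no decode method.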
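
A minimal round trip in Python 3 illustrates these points; the sample text is an arbitrary choice.

```python
text = '唱歌'
data = text.encode('utf-8')      # str -> bytes (encoding)
restored = data.decode('utf-8')  # bytes -> str (decoding)

print(type(data).__name__, len(data))  # bytes 6
print(restored == text)                # True
print(hasattr(text, 'decode'))         # False: str has no decode method
print(b'a\xe5')                        # b'a\xe5': ASCII byte shown as a character, the other as hex
```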
Checking the output encoding:

>>> import sys
>>> sys.stdout.encoding
'ANSI_X3.4-1968'
To make sure output is not garbled on a Linux terminal, set the Linux environment variable accordingly: export LANG=en_US.UTF-8
References
http://in355hz.iteye.com/blog/1860787
http://monsterhuan.iteye.com/blog/1948945
http://perfyy.blog.sohu.com/145845129.html