UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in positio...

令狐少侠 · posted 2018-9-18 14:54:03

Last edited by 令狐少侠 on 2018-9-18 17:50

UnicodeEncodeError: 'ascii' codec can't encode character u'\ufeff' in position 155: ordinal not in range(128)

I'm not sure what u'\ufeff' is; it shows up when I'm web scraping. How can I fix this? The string's .replace() method doesn't do the job.

强人锁男 · posted 2018-9-18 14:55:17

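The Unicode character U+FEFF is the byte order mark, or BOM, and is used to tell big-endian from little-endian UTF-16 encoding. If you decode the web page using the right codec, Python will remove it for you. Example: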
#!python2
#coding: utf8
u = u'ABC'
e8 = u.encode('utf-8')       # encode without BOM
e8s = u.encode('utf-8-sig') # encode with BOM
e16 = u.encode('utf-16')    # encode with BOM
e16le = u.encode('utf-16le')  # encode without BOM
e16be = u.encode('utf-16be')  # encode without BOM
print 'utf-8    %r' % e8
print 'utf-8-sig %r' % e8s
print 'utf-16 %r' % e16
print 'utf-16le  %r' % e16le
print 'utf-16be  %r' % e16be
print
print 'utf-8  w/ BOM decoded with utf-8    %r' % e8s.decode('utf-8')
print 'utf-8  w/ BOM decoded with utf-8-sig %r' % e8s.decode('utf-8-sig')
print 'utf-16 w/ BOM decoded with utf-16 %r' % e16.decode('utf-16')
print 'utf-16 w/ BOM decoded with utf-16le  %r' % e16.decode('utf-16le')
Note that EF BB BF is the UTF-8-encoded BOM. It is not required for UTF-8 and serves only as a signature (usually on Windows).
Output:
utf-8    'ABC'
utf-8-sig '\xef\xbb\xbfABC'
utf-16 '\xff\xfeA\x00B\x00C\x00' # Adds BOM and encodes using native processor endian-ness.
utf-16le  'A\x00B\x00C\x00'
utf-16be  '\x00A\x00B\x00C'

utf-8  w/ BOM decoded with utf-8    u'\ufeffABC' # doesn't remove BOM if present.
utf-8  w/ BOM decoded with utf-8-sig u'ABC'       # removes BOM if present.
utf-16 w/ BOM decoded with utf-16 u'ABC'       # uses the BOM to pick the byte order, then removes it.
utf-16 w/ BOM decoded with utf-16le  u'\ufeffABC' # doesn't remove BOM if present.
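Note that the utf-16 codec relies on the BOM to determine the byte order. Without one, Python cannot know whether the data is big- or little-endian and falls back to the platform's native byte order.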
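
For readers on Python 3, the same idea can be sketched as a small helper that looks at the leading bytes of the scraped data and picks a codec accordingly. The name decode_with_bom and the UTF-8 fallback are assumptions made for illustration; they are not part of the example above, and UTF-32 BOMs are deliberately not handled.

```python
import codecs


def decode_with_bom(raw: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: decode bytes, dropping a leading BOM if present.

    Assumption: data without a BOM is in `default` (UTF-8 here).
    UTF-32 is not handled in this sketch.
    """
    if raw.startswith(codecs.BOM_UTF8):
        # 'utf-8-sig' strips the EF BB BF signature while decoding.
        return raw.decode("utf-8-sig")
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # 'utf-16' reads the BOM to pick the byte order, then discards it.
        return raw.decode("utf-16")
    return raw.decode(default)


print(repr(decode_with_bom("ABC".encode("utf-8-sig"))))
print(repr(decode_with_bom("ABC".encode("utf-16"))))
print(repr(decode_with_bom(b"ABC")))
```

All three calls should print 'ABC', with no stray '\ufeff' at the front. When scraping, the encoding declared in the HTTP headers or the page itself is usually a better guide than the BOM alone.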

天使与魔鬼 · posted 2018-9-18 14:56:16

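I ran into this on Python 3 and found a solution. When opening a file, Python 3 supports the encoding keyword to handle the encoding automatically.
Without it, the BOM is included in the read result: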
>>> f = open('file', mode='r')
>>> f.read()
'\ufefftest'
With the correct encoding given, the BOM is omitted from the result:
>>> f = open('file', mode='r', encoding='utf-8-sig')
>>> f.read()
'test'
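
To try this without an existing file, here is a minimal self-contained sketch that writes a temporary UTF-8 file with a BOM and reads it back both ways. The function name read_both_ways is made up for this illustration, and 'utf-8' is passed explicitly in the first read because the default encoding depends on the platform.

```python
import os
import tempfile


def read_both_ways():
    """Hypothetical demo: return (text read as utf-8, text read as utf-8-sig)."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    try:
        # Writing with 'utf-8-sig' puts the EF BB BF signature at the start.
        with open(path, "w", encoding="utf-8-sig") as f:
            f.write("test")
        # Plain 'utf-8' keeps the BOM as a U+FEFF character.
        with open(path, "r", encoding="utf-8") as f:
            with_bom = f.read()
        # 'utf-8-sig' drops the BOM while decoding.
        with open(path, "r", encoding="utf-8-sig") as f:
            without_bom = f.read()
    finally:
        os.remove(path)
    return with_bom, without_bom


with_bom, without_bom = read_both_ways()
print(repr(with_bom))
print(repr(without_bom))
```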

社会诚哥 · posted 2018-9-18 15:00:18

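The content you are scraping is Unicode rather than ASCII text, and the character you are getting has no ASCII representation. Python's Unicode documentation explains how this works.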
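
A rough Python 3 sketch of what goes wrong: encoding text that contains U+FEFF to ASCII raises the error from the question, and removing the leading BOM first avoids it. The helper names strip_bom and fails_ascii are hypothetical, chosen only for this illustration.

```python
def strip_bom(text: str) -> str:
    """Hypothetical helper: remove one leading U+FEFF, if present."""
    return text[1:] if text.startswith("\ufeff") else text


def fails_ascii(text: str) -> bool:
    """Return True if encoding `text` to ASCII raises UnicodeEncodeError."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError:
        return True
    return False


scraped = "\ufeffABC"  # assumed sample: text decoded without removing the BOM
print(fails_ascii(scraped))             # the BOM cannot be encoded as ASCII
print(fails_ascii(strip_bom(scraped)))  # the remaining text is plain ASCII
```

This only deals with a BOM at the very start of the text. Other non-ASCII characters in the page would still need a real output encoding such as UTF-8.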