python 3.x - Python 3: interpret an ASCII string as a Unicode string
I have a text file that, when opened, looks like this:
\xf0\x9f\x98\x81
\xf0\x9f\x98\x82
\xf0\x9f\x98\x83
\xf0\x9f\x98\x84
\xf0\x9f\x98\x85
The hexdump looks like this:
0000000 5c 78 46 30 5c 78 39 46 5c 78 39 38 5c 78 38 31
0000010 0a 5c 78 46 30 5c 78 39 46 5c 78 39 38 5c 78 38
0000020 32 0a 5c 78 46 30 5c 78 39 46 5c 78 39 38 5c 78
0000030 38 33 0a 5c 78 46 30 5c 78 39 46 5c 78 39 38 5c
0000040 78 38 34 0a 5c 78 46 30 5c 78 39 46 5c 78 39 38
I am trying to print these strings in Python as though they were Unicode strings. The following attempts fail:
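with open("file") as f:
    for row in f:
        x = row.split()
        for i in x:
            # The raw text read from the file
            print(i)
            # Encoding only escapes the backslashes; it does not interpret them
            print(bytes(i, encoding='utf-8'))
            # Yields code points U+00F0, U+009F, ... rather than the emoji
            print(bytes(i, encoding='utf-8').decode('unicode-escape'))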
This prints:
\xf0\x9f\x98\x81
b'\\xf0\\x9f\\x98\\x81'
ð
\xf0\x9f\x98\x82
b'\\xf0\\x9f\\x98\\x82'
ð
\xf0\x9f\x98\x83
b'\\xf0\\x9f\\x98\\x83'
ð
\xf0\x9f\x98\x84
b'\\xf0\\x9f\\x98\\x84'
ð
\xf0\x9f\x98\x85
b'\\xf0\\x9f\\x98\\x85'
ð
What I am trying to achieve is the same result as if I had typed the following directly:
print(b'\xf0\x9f\x98\x81'.decode('utf-8'))
print(b'\xf0\x9f\x98\x82'.decode('utf-8'))
print(b'\xf0\x9f\x98\x83'.decode('utf-8'))
print(b'\xf0\x9f\x98\x84'.decode('utf-8'))
print(b'\xf0\x9f\x98\x85'.decode('utf-8'))
😁
😂
😃
😄
😅
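Decoding with unicode-escape gives a Unicode string whose code points are the values specified by the escapes (U+00F0, U+009F, and so on). Encoding that string with latin1 converts it directly back to a byte string, because there is a 1:1 mapping between latin1 and the first 256 Unicode code points. Finally, decode those bytes to Unicode using utf-8.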
data = rb'''\xf0\x9f\x98\x81
\xf0\x9f\x98\x82
\xf0\x9f\x98\x83
\xf0\x9f\x98\x84
\xf0\x9f\x98\x85'''
data = data.decode('unicode-escape').encode('latin-1').decode('utf8')
print(ascii(data))
print(data)
Output:
'\U0001f601\n\U0001f602\n\U0001f603\n\U0001f604\n\U0001f605'
😁
😂
😃
😄
😅
Note: my font didn't support these characters.
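To make the three stages easier to follow, here is a minimal sketch that shows the intermediate value after each step for a single line. The helper name decode_escaped_utf8 is hypothetical (it is not from the original answer), and the sketch assumes the input is pure ASCII text containing \xNN escapes, as in the file from the question.

```python
def decode_escaped_utf8(text: str) -> str:
    """Turn literal \\xNN escapes that spell UTF-8 bytes into real characters.

    Hypothetical helper; assumes `text` is pure ASCII.
    """
    # Step 1: interpret the escapes. Each \xNN becomes the code point U+00NN.
    code_points = text.encode('ascii').decode('unicode-escape')
    # Step 2: latin-1 maps U+0000..U+00FF 1:1 onto the bytes 0x00..0xFF.
    raw_bytes = code_points.encode('latin-1')
    # Step 3: those bytes are the real UTF-8 encoding, so decode them.
    return raw_bytes.decode('utf-8')


line = r'\xf0\x9f\x98\x81'  # 16 ASCII characters, as read from the file

step1 = line.encode('ascii').decode('unicode-escape')
step2 = step1.encode('latin-1')

print(ascii(step1))                       # '\xf0\x9f\x98\x81' (4 code points)
print(step2)                              # b'\xf0\x9f\x98\x81' (4 bytes)
print(ascii(decode_escaped_utf8(line)))   # '\U0001f601'
```

With the file from the question, the same helper could be applied to each item produced by row.split() inside the loop.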