HTML entity handling in Python 3/BeautifulSoup on Windows
I'm having trouble handling HTML containing escaped Unicode characters (in the Chinese range) in Python 3/BeautifulSoup on Windows. BeautifulSoup seems to function correctly until I try to print the extracted tag or write it out to a file. I have my default encoding set to UTF-8, yet the cp1252 codec seems to be getting selected...
To reproduce:
    from bs4 import BeautifulSoup

    soup = BeautifulSoup("隱")
    f = open("out.html", "w")
    f.write(soup.text)
    f.close()
Stack trace attached.
    Traceback (most recent call last):
      File "scrape.py", line 143, in <module>
        test_uni()
      File "scrape.py", line 126, in test_uni
        f.write(soup.text)
      File "c:\venv\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u96b1' in position 0: character maps to <undefined>
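You are trying to write a non-English (Unicode) string to a file opened in text mode. In Python 3, open() without an encoding argument does not use UTF-8 automatically; it falls back to the locale's preferred encoding, which is cp1252 on your Windows setup, and cp1252 has no mapping for Chinese characters.
Encoding the text yourself before writing it to the file should work, and UTF-8
should be fine for Chinese characters. Note that the file then has to be opened in binary mode, because encode() returns bytes:
    f = open("out.html", "wb")  # binary mode, so write() accepts bytes
    f.write(soup.text.encode('utf-8'))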
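An alternative that may be simpler is to keep the file in text mode and tell open() which encoding to use. The following is only a minimal sketch: the plain string `text` stands in for `soup.text`, and the temp-directory path is a hypothetical location chosen so the snippet does not depend on the current directory.

```python
import locale
import os
import tempfile

# Stand-in for soup.text (assumption): the character from the traceback, U+96B1.
text = "\u96b1"

# The encoding open() falls back to when none is given.
# On many Windows setups this reports cp1252.
print(locale.getpreferredencoding(False))

# cp1252 has no mapping for U+96B1, which is what the traceback shows.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# Hypothetical output path; "out.html" in the working directory works the same way.
path = os.path.join(tempfile.gettempdir(), "out.html")

# With encoding= set, the text-mode file does the UTF-8 encoding itself,
# so write() receives a str and no manual encode() is needed.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back with the same encoding to confirm the round trip.
with open(path, encoding="utf-8") as f:
    round_trip = f.read()
```

If print fails the same way, the cause is probably the same fallback applied to standard output; calling sys.stdout.reconfigure(encoding="utf-8") (Python 3.7+) or setting the PYTHONIOENCODING environment variable to utf-8 may help, depending on the console in use.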