HTML entity handling in Python 3/BeautifulSoup on Windows


I'm having trouble handling HTML containing escaped Unicode characters (in the Chinese range) in Python 3/BeautifulSoup on Windows. BeautifulSoup seems to function correctly until I try to print the extracted tag or write it out to a file. I have my default encoding set to UTF-8, yet the cp1252 codec seems to be getting selected...

To reproduce:

    from bs4 import BeautifulSoup

    soup = BeautifulSoup("隱")
    f = open("out.html", "w")
    f.write(soup.text)
    f.close()

Stack trace attached.

    Traceback (most recent call last):
      File "scrape.py", line 143, in <module>
        test_uni()
      File "scrape.py", line 126, in test_uni
        f.write(soup.text)
      File "c:\venv\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\u96b1' in position 0: character maps to <undefined>

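You are trying to write a non-English (Unicode) string to a file opened in text mode without an explicit encoding. In Python 3, open() then encodes the text with the platform's default codec, which the traceback shows is cp1252 here, and cp1252 has no mapping for this character. I am not on a Windows environment.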
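To see which codec is actually in play, a quick check along these lines may help. It is only a sketch: the variable names are illustrative, and the two encodings printed first depend on the machine and on how Python was started.

```python
import locale
import sys

# Encoding that open() normally falls back to in text mode when no
# encoding= argument is given. Often cp1252 on Windows, UTF-8 elsewhere.
print(locale.getpreferredencoding(False))

# Encoding used by print(); it is chosen separately from the file default.
print(sys.stdout.encoding)

text = "\u96b1"  # the character from the traceback

# cp1252 has no mapping for this character, so encoding it fails...
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# ...while UTF-8 can represent it.
utf8_bytes = text.encode("utf-8")
print(cp1252_ok, utf8_bytes)
```

If the first line prints cp1252, that would explain why a plain open("out.html", "w") picks that codec regardless of other UTF-8 settings.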

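Encoding the text before writing it to the file should work, and UTF-8 should be fine for Chinese characters. Note that in Python 3 the file has to be opened in binary mode ("wb") for write() to accept the encoded bytes:

    f = open("out.html", "wb")
    f.write(soup.text.encode('utf-8'))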
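An alternative sketch, assuming Python 3: pass the encoding to open() directly, so the file stays in text mode and no manual encode() call is needed. The literal string below stands in for soup.text so the snippet runs without BeautifulSoup, and the temporary directory is only there to keep the example self-contained.

```python
import os
import tempfile

text = "\u96b1"  # stand-in for soup.text

# Write into a temporary directory so the example does not touch
# the current working directory.
out_path = os.path.join(tempfile.mkdtemp(), "out.html")

# An explicit encoding overrides the platform default (cp1252 in the question).
with open(out_path, "w", encoding="utf-8") as f:
    f.write(text)

# Read the file back with the same encoding to confirm the round trip.
with open(out_path, "r", encoding="utf-8") as f:
    round_trip = f.read()

print(round_trip == text)
```

This only addresses the file write. print() uses the console's encoding instead, so it can still fail separately; on Python 3.7 and later, sys.stdout.reconfigure(encoding="utf-8") is one option to try there.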
