Twig's Tech Tips: Python: lxml "Unicode strings with encoding declaration are not supported" error

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

Ran into this interesting quirk when fixing some UTF-8 issues. Actually, it's not really a quirk but a deliberate design decision, since encoding gives most people a headache.

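The reason is that lxml doesn't trust people to hand it properly encoded strings, and rightly so. A Unicode string has already been decoded, so an encoding declaration left inside it no longer reliably describes the data, and lxml refuses to guess.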

So simply give it the raw bytes or a file handle and it'll handle the encoding itself.

Don't do things like:

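 # Python 2: forces the raw bytes into a unicode string
 text = u"%s" % file_content
 # or, in any Python version: decodes the bytes into a Unicode string
 text = file_content.decode("utf-8")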
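
To see the difference for yourself, here is a minimal sketch that feeds the same document to lxml twice, once as bytes and once as a decoded string. The sample document and the `try_parse` helper are made up for illustration and are not part of lxml.

```python
from lxml import etree

# Hypothetical sample document: raw bytes that carry an encoding declaration.
raw = b'<?xml version="1.0" encoding="UTF-8"?><root>caf\xc3\xa9</root>'


def try_parse(data):
    """Return the root element's text, or the error message if lxml refuses the input."""
    try:
        return etree.fromstring(data).text
    except ValueError as exc:
        return "ValueError: %s" % exc


# Bytes input: lxml reads the declaration and decodes the document itself.
bytes_result = try_parse(raw)

# Decoded input that still carries the declaration: lxml raises ValueError.
text_result = try_parse(raw.decode("utf-8"))

print(bytes_result)
print(text_result)
```

The second call should report the same error message quoted at the top of this post.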

Things like this will work much better:

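 from urllib.request import urlopen  # Python 3; Python 2 used urllib2.urlopen
 from lxml import etree
 
 # link is the URL of the XML document; read() returns the raw bytes
 file_content = urlopen(link).read()
 
 parser = etree.XMLParser(recover=True)
 xml = etree.fromstring(file_content, parser)  # returns the root Element
 
 # or
 
 from io import BytesIO  # wraps the bytes in a file-like object
 parser = etree.XMLParser(recover=True, encoding='utf-8')
 xml = etree.parse(BytesIO(file_content), parser)  # returns an ElementTree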
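
Sometimes the decoding happens upstream and all you get is a Unicode string. One way out, sketched below under the assumption that the declaration says UTF-8, is to encode the string back to bytes using the encoding the declaration names, so the bytes and the declaration agree again. The sample string is hypothetical.

```python
from lxml import etree

# Hypothetical input: text that was decoded upstream but still carries its declaration.
text = '<?xml version="1.0" encoding="UTF-8"?><root>caf\u00e9</root>'

# Encode back to bytes with the same encoding the declaration names (UTF-8 here),
# so that what lxml reads matches what the document claims about itself.
root = etree.fromstring(text.encode("utf-8"))

print(root.text)
```

If the declaration names a different encoding, encode with that one instead; mismatched bytes and declaration is exactly the situation lxml is trying to protect you from.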

If the XML document already declares its encoding, you don't need to pass one to the parser; lxml reads the declaration itself. It's smart like that.
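
As a quick check of that claim, the sketch below parses a document that is encoded as ISO-8859-1 and says so in its declaration, without passing any encoding to lxml. The sample bytes are made up for illustration.

```python
from lxml import etree

# Hypothetical sample: bytes encoded as ISO-8859-1, with a matching declaration.
latin1_doc = (
    '<?xml version="1.0" encoding="ISO-8859-1"?><root>caf\u00e9</root>'
).encode("iso-8859-1")

# No encoding argument anywhere: lxml picks it up from the declaration.
root = etree.fromstring(latin1_doc)

decoded_text = root.text                       # already a proper Unicode string
declared = root.getroottree().docinfo.encoding  # the encoding lxml detected

print(decoded_text, declared)
```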

So kick back and relax, lxml's got this.
