Python: lxml "Unicode strings with encoding declaration are not supported" error

ValueError: Unicode strings with encoding declaration are not supported. Please use bytes input or XML fragments without declaration.

Ran into this interesting quirk while fixing some UTF-8 issues. Actually it's not really a quirk, it's by design, since encoding gives most people a headache.

The reason is that lxml doesn't trust people to hand it properly encoded strings, and rightly so: a unicode string has already been decoded, so an encoding declaration inside it can no longer be relied on.
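To see the behaviour in isolation, here is a minimal sketch. The sample document and the variable names are made up for illustration; only etree.fromstring comes from lxml.

```python
from lxml import etree

# Hypothetical sample document, as raw UTF-8 bytes.
raw = b'<?xml version="1.0" encoding="UTF-8"?><root>caf\xc3\xa9</root>'

# Decoding first gives a unicode string that still carries the declaration.
decoded = raw.decode("utf-8")

try:
    etree.fromstring(decoded)
    error = None
except ValueError as exc:
    error = str(exc)  # the message quoted at the top of this post

# The untouched bytes parse fine; lxml reads the declaration itself.
root = etree.fromstring(raw)

# A unicode string with no declaration is accepted as well.
fragment = etree.fromstring(u"<root>caf\u00e9</root>")
```

So the problem isn't unicode strings as such, it's a unicode string combined with an encoding declaration.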

So just give it the raw bytes or a file handle and it'll sort out the encoding itself.

Don't do things like:

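# Coerces the raw bytes into a unicode string, which lxml rejects if it carries an encoding declaration
xml_text = u"%s" % file_content
# or
# Re-encodes by hand; on a Python 2 byte string this implicitly decodes as ASCII first and can fail
xml_bytes = file_content.encode("utf-8")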

Things like this will work much better:

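from lxml import etree
from urllib.request import urlopen  # Python 2: from urllib2 import urlopen

# link is the URL of the XML document; read() returns the raw bytes
file_content = urlopen(link).read()

parser = etree.XMLParser(recover=True)
xml = etree.fromstring(file_content, parser)

# or

from io import BytesIO  # file_content is bytes, so wrap it in BytesIO rather than StringIO
# encoding= overrides whatever the document declares; leave it out if the declaration is correct
parser = etree.XMLParser(recover=True, encoding='utf-8')
xml = etree.parse(BytesIO(file_content), parser)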
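The file handle route works the same way as long as the file is opened in binary mode. A small sketch, assuming a throwaway temp file and a made-up feed document so it runs on its own:

```python
import os
import tempfile

from lxml import etree

# Hypothetical sample document, as raw UTF-8 bytes.
raw = b'<?xml version="1.0" encoding="UTF-8"?><feed><title>caf\xc3\xa9</title></feed>'

# Write a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(raw)
    path = tmp.name

try:
    # "rb" keeps the content as bytes, so lxml does the decoding.
    with open(path, "rb") as handle:
        tree = etree.parse(handle)
    title = tree.getroot().findtext("title")
finally:
    os.remove(path)
```

Note that etree.parse returns an ElementTree, so getroot() is needed to reach the root element, whereas etree.fromstring returns the root element directly.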

If the XML already declares an encoding then you don't need to pass one to the parser. lxml reads it from the declaration. It's smart like that.
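As a quick check of that, here is a sketch with a made-up ISO-8859-1 document. No encoding is passed anywhere, and the accented character still comes out right:

```python
from lxml import etree

# Hypothetical sample: "é" is the single byte 0xE9 in ISO-8859-1.
raw = b'<?xml version="1.0" encoding="ISO-8859-1"?><root>caf\xe9</root>'

# No encoding argument: the parser picks it up from the declaration.
root = etree.fromstring(raw)
text = root.text
```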

So kick back and relax, lxml's got this.

