Python/Django: Dealing with UTF8 in URLs/URIs

UTF8 is a nice encoding to work with, but one must remember that the URL standard only allows ASCII characters. You've probably run into this before when the Python urllib and urllib2 libraries raise encoding errors. These are valid exceptions!

Yeah that's right bitch, urllib ain't taking your UTF8 shit.

"But it just works in my browser!" you may be thinking. Yes, most modern browsers will automatically convert the UTF8 URL into an ASCII request behind the scenes without showing it to you.

So as a developer, if you have a URL containing UTF8 characters then you'll have to convert it to ASCII before you can send a request.
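To make the problem concrete, here is a minimal sketch using only the standard library. It assumes Python 3 (where the module is urllib.parse rather than urllib), and the helper name is_ascii() is made up for illustration. It shows that the raw path can't be represented in ASCII, while the percent-encoded version can.

```python
from urllib.parse import quote

def is_ascii(text):
    """Return True if `text` can be encoded as plain ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

path = "/seârch/labél/pythön"

# quote() encodes non-ASCII characters as UTF-8 bytes, then percent-escapes
# each byte. "/" is left alone by default.
encoded_path = quote(path)

print(is_ascii(path))          # False
print(is_ascii(encoded_path))  # True
print(encoded_path)            # /se%C3%A2rch/lab%C3%A9l/pyth%C3%B6n
```

This is essentially what the browser does for you behind the scenes before it sends the request.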

If you're using Django, then you have some nice helper functions to deal with this. By using django.utils.encoding.iri_to_uri(), you can simply convert the UTF8 portions of the URL into ASCII while keeping everything else unmodified.

If you're not using Django... I guess you can try to incorporate their madness from their encoding module source file.
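If pulling in Django isn't an option, something along these lines might be enough. This is a rough approximation rather than Django's actual code: it assumes Python 3, and the names iri_to_uri_sketch() and SAFE_CHARS are invented for this example. The idea is to percent-encode everything that isn't ASCII while leaving the URL's own punctuation (and any existing % escapes) alone.

```python
from urllib.parse import quote

# Characters to leave untouched: URL delimiters, plus "%" so that parts of the
# URL that are already percent-encoded don't get encoded a second time.
SAFE_CHARS = "/#%[]=:;$&()+,!?*@'~"

def iri_to_uri_sketch(iri):
    """Percent-encode the non-ASCII parts of `iri` (as UTF-8), keeping URL syntax intact."""
    return quote(iri, safe=SAFE_CHARS)

url = "http://twigstechtips.blogspot.com/seârch/labél/pythön?query=djångõ"
print(iri_to_uri_sketch(url))
# http://twigstechtips.blogspot.com/se%C3%A2rch/lab%C3%A9l/pyth%C3%B6n?query=dj%C3%A5ng%C3%B5
```

Because "%" is in the safe list, running the function on its own output gives the same result, so it's harmless to call it on a URL that may already be encoded.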

Examples

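First make sure your URLs are defined as Unicode strings (note the little "u" in front of the string when defining "url"). Without it, Python 2 treats the string as raw bytes, which can print as garbage depending on your console's encoding.

Take for example this URL, which contains non-ASCII characters: http://twigstechtips.blogspot.com/seârch/labél/pythön

A quick test shows the difference: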

>>> url = 'http://twigstechtips.blogspot.com/seârch/labél/pythön'
>>> print url
# http://twigstechtips.blogspot.com/se├órch/lab├®l/pyth├Ân


>>> url = u'http://twigstechtips.blogspot.com/seârch/labél/pythön'
>>> print url
# http://twigstechtips.blogspot.com/seârch/labél/pythön

But what about GET args? Don't worry, iri_to_uri() deals with them too. For example: http://twigstechtips.blogspot.com/seârch/labél/pythön?query=djångõ

>>> from django.utils.encoding import iri_to_uri
>>> url = u'http://twigstechtips.blogspot.com/seârch/labél/pythön?query=djångõ'
>>> print iri_to_uri(url)
# http://twigstechtips.blogspot.com/se%C3%A2rch/lab%C3%A9l/pyth%C3%B6n?query=dj%C3%A5ng%C3%B5

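It also handles UTF8 domains (yes, they exist!), at least in the sense that the output is pure ASCII. Note that iri_to_uri() percent-encodes the hostname rather than converting it to Punycode, so whether the result actually resolves depends on the HTTP client you hand it to. In this instance, we'll use http://camtasia教程网.com: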

>>> url = u'http://camtasia教程网.com'
>>> print url
# http://camtasia教程网.com


>>> print iri_to_uri(url)
# http://camtasia%E6%95%99%E7%A8%8B%E7%BD%91.com
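If your HTTP client turns out to be picky about percent-encoded hostnames, one alternative is to convert just the host to its Punycode ("xn--...") form. The sketch below is an assumption on my part rather than anything from Django: it uses Python 3's built-in "idna" codec, the function name idna_url() is hypothetical, and it drops any username/password in the URL to keep things short.

```python
from urllib.parse import urlsplit, urlunsplit

def idna_url(url):
    """Return `url` with its hostname converted to ASCII via the built-in idna codec."""
    parts = urlsplit(url)
    # Encode each label of the hostname to its ASCII-compatible form.
    host = parts.hostname.encode("idna").decode("ascii")
    # Re-attach the port if there was one (userinfo is deliberately ignored here).
    netloc = host if parts.port is None else "%s:%d" % (host, parts.port)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

print(idna_url("http://camtasia教程网.com/"))
```

The path and query string are left untouched here, so you would still want to percent-encode those separately.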

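I'll be honest, I wasn't quite sure what the heck an IRI was. For the record, it stands for Internationalized Resource Identifier (RFC 3987), which is basically a URI that's allowed to contain Unicode characters. Either way, this magic function works as advertised.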

BOOM! It just works.

On a final note, if you're after UTF8 friendly versions of urllib.quote() and urllib.quote_plus(), Django has those too. From the docs:

The functions django.utils.http.urlquote() and django.utils.http.urlquote_plus() are versions of Python’s standard urllib.quote() and urllib.quote_plus() that work with non-ASCII characters. (The data is converted to UTF-8 prior to encoding.)
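For comparison, here is a small sketch of the same idea without Django, assuming Python 3. There, urllib.parse.quote() and quote_plus() accept Unicode strings directly and encode them as UTF-8 by default, so the wrappers are less necessary. The sample strings are just for illustration.

```python
from urllib.parse import quote, quote_plus

term = "djångõ tips"

# quote() percent-encodes spaces as %20.
print(quote(term))       # dj%C3%A5ng%C3%B5%20tips

# quote_plus() uses "+" for spaces, which is the usual form for query strings.
print(quote_plus(term))  # dj%C3%A5ng%C3%B5+tips
```

Use quote() for path segments and quote_plus() for query string values.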

Source

 
Copyright © Twig's Tech Tips