Python: How to parse XML/RSS feeds with namespaces using lxml.etree

Alright, this one had me stumped for a good hour or two.

Take for example a Flickr RSS feed.


Those namespaces are a pain, but it's not too bad if you map each prefix to its URI before you query the feed.

# Some basic setup
from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen
from lxml import etree

# Namespaces copied straight out of the feed source
namespaces = {
    'media': "http://search.yahoo.com/mrss/",
    'dc': "http://purl.org/dc/elements/1.1/",
    'creativeCommons': "http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html",
}

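# Untested fetching code, just for understanding
# feed_url is a placeholder for the address of the Flickr RSS feed
file = urlopen(feed_url)
xml = etree.parse(file)
# Grab the first <item> inside <channel>
item = xml.getroot().find('channel').find('item')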
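
If you want to poke at this without hitting the network, here's a rough sketch that parses a cut-down feed from a string instead. SAMPLE_FEED and first_item are names I made up for illustration, and the fallback to the standard library's ElementTree is my own assumption so the snippet still runs where lxml isn't installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the stdlib parser so the sketch runs without lxml
    import xml.etree.ElementTree as etree

# Hypothetical, cut-down feed for illustration only
SAMPLE_FEED = b"""<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/">
  <channel>
    <title>Uploads</title>
    <item>
      <title>For the lazy</title>
      <media:title>For the lazy</media:title>
    </item>
  </channel>
</rss>"""


def first_item(feed_bytes):
    """Return the first <item> element inside <channel>, or None if there is none."""
    root = etree.fromstring(feed_bytes)
    return root.find('channel/item')


item = first_item(SAMPLE_FEED)

# Namespaced children report their tag as "{namespace-uri}localname"
tags = [child.tag for child in item]
print(tags)
```

The plain title keeps its bare tag name, while the media one comes back with the full namespace URI in curly braces.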

Now here's the basic structure of an RSS feed item.

<item>
    <title>For the lazy</title>
    <link>http://www.flickr.com/photos/handles/7429952158/</link>
    <description>blah blah blah</description>
    <pubDate>Sat, 23 Jun 2012 21:59:07 -0700</pubDate>
    <dc:date.Taken>2012-06-24T14:58:54-08:00</dc:date.Taken>
    <author flickr:profile="http://www.flickr.com/people/handles/">Handles</author>
    <guid isPermaLink="false">tag:flickr.com,2004:/photo/7429952158</guid>
    <media:content url="http://farm6.staticflickr.com/5347/7429952158_962a849b30_b.jpg" type="image/jpeg" height="1024" width="768"/>
    <media:title>For the lazy</media:title>
    <media:thumbnail url="http://farm6.staticflickr.com/5347/7429952158_962a849b30_s.jpg" height="75" width="75" />
    <media:credit role="photographer">Handles</media:credit>
</item>

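To get information off the namespaced elements (the ones with a media: or dc: prefix), you'll need slightly different syntax. A bare media:title won't work on its own; the tag has to be written as {namespace-uri}tagname.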

# Now to fetch the data from the namespaced elements
media_title = item.find("{%s}title" % namespaces['media']).text

# Attributes on a namespaced element are read with .get() as usual
media_thumbnail = item.find("{%s}thumbnail" % namespaces['media'])
thumbnail = {
    'url': media_thumbnail.get('url'),
    'width': media_thumbnail.get('width'),
    'height': media_thumbnail.get('height'),
}
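
As an alternative to building the {uri}tag string by hand, find() and findtext() also accept a namespaces mapping, which lets you keep the prefix form. This is just a sketch, not what I used above: SAMPLE_ITEM is a made-up, trimmed copy of the item, and the stdlib fallback import is an assumption so it runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: fall back to the stdlib parser so the sketch runs without lxml
    import xml.etree.ElementTree as etree

namespaces = {'media': "http://search.yahoo.com/mrss/"}

# Hypothetical, trimmed copy of the feed item for illustration only
SAMPLE_ITEM = b"""<item xmlns:media="http://search.yahoo.com/mrss/">
  <title>For the lazy</title>
  <media:title>For the lazy</media:title>
  <media:thumbnail url="http://farm6.staticflickr.com/5347/7429952158_962a849b30_s.jpg" height="75" width="75" />
</item>"""

item = etree.fromstring(SAMPLE_ITEM)

# The "media:" prefix is resolved through the mapping passed as namespaces=
media_title = item.find('media:title', namespaces=namespaces).text

media_thumbnail = item.find('media:thumbnail', namespaces=namespaces)
thumbnail = {key: media_thumbnail.get(key) for key in ('url', 'width', 'height')}

# findtext() returns the default instead of failing when the element is missing
credit = item.findtext('media:credit', default='', namespaces=namespaces)

print(media_title, thumbnail, repr(credit))
```

The prefixes in the dict are your own labels; only the URIs have to match what the feed declares.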

Problem solved, like a boss!

Copyright © Twig's Tech Tips