Python: How to parse XML/RSS feeds with namespaces using lxml.etree


Alright, this one had me stumped for a good hour or two.

Take for example a Flickr RSS feed.


Those namespaces are a pain, but it's not too bad once you've mapped each prefix to its URI before you start querying.

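# Some basic setup
# Python 3 import; on Python 2 this was "from urllib2 import urlopen"
from urllib.request import urlopen
from lxml import etree

# Namespaces copied straight out of the feed source
# (the xmlns:* attributes on the root <rss> element).
# These two are the standard Media RSS and Dublin Core URIs; confirm them
# against your feed, and copy the flickr prefix's URI from there as well.
namespaces = {
    'media': 'http://search.yahoo.com/mrss/',
    'dc': 'http://purl.org/dc/elements/1.1/',
}

# Untested fetching code, just for understanding
# feed_url is the address of the feed you want to read
response = urlopen(feed_url)
xml = etree.parse(response)
# Grab the first <item> inside <channel>
item = xml.getroot().find('channel').find('item')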
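
If you'd rather not copy the namespaces by hand, lxml can hand them to you. Here's a minimal sketch that reads the prefix map off the root element through its `nsmap` attribute. The inline `SAMPLE` feed is a made-up stand-in so the snippet runs without a network call, and the `flickr` URI in it is a placeholder, not the real one.

```python
from lxml import etree

# A trimmed-down stand-in for a real feed (hypothetical sample data).
# The flickr URI is a placeholder; the media and dc URIs are the standard ones.
SAMPLE = b"""<rss version="2.0"
    xmlns:media="http://search.yahoo.com/mrss/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:flickr="urn:example:flickr">
  <channel>
    <title>Sample feed</title>
    <item>
      <title>For the lazy</title>
      <media:title>For the lazy</media:title>
    </item>
  </channel>
</rss>"""

root = etree.fromstring(SAMPLE)

# nsmap maps every prefix in scope on this element to its URI.
# A default namespace (plain xmlns="...") would show up under the key None,
# so it is filtered out here because None is no use as a prefix.
namespaces = {prefix: uri
              for prefix, uri in root.nsmap.items()
              if prefix is not None}

# Same lookup as in the post, using the discovered mapping
item = root.find('channel').find('item')
media_title = item.find('{%s}title' % namespaces['media']).text
print(media_title)
```

Note that `nsmap` only reports what is declared on that element or its ancestors, so this works best when the feed declares everything on the root, which is the usual layout for RSS.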

Now here's the basic structure of an RSS feed item.

<item>
  <title>For the lazy</title>
  <description>blah blah blah</description>
  <pubDate>Sat, 23 Jun 2012 21:59:07 -0700</pubDate>
  <dc:date.Taken>2012-06-24T14:58:54-08:00</dc:date.Taken>
  <author flickr:profile="http://www.flickr.com/people/handles/">Handles</author>
  <guid isPermaLink="false">tag:flickr.com,2004:/photo/7429952158</guid>
  <media:content url="http://farm6.staticflickr.com/5347/7429952158_962a849b30_b.jpg" type="image/jpeg" height="1024" width="768"/>
  <media:title>For the lazy</media:title>
  <media:thumbnail url="http://farm6.staticflickr.com/5347/7429952158_962a849b30_s.jpg" height="75" width="75" />
  <media:credit role="photographer">Handles</media:credit>
</item>
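
It helps to see how lxml actually names these elements once they're parsed. The sketch below is only an illustration: it parses a cut-down copy of the item (with the `media` declaration repeated on `<item>` so the fragment stands alone, which a real feed wouldn't do) and lists each child's tag.

```python
from lxml import etree

# In a real feed the xmlns declarations sit on the root element; the media
# one is repeated here only so this fragment parses on its own.
ITEM = b"""<item xmlns:media="http://search.yahoo.com/mrss/">
  <title>For the lazy</title>
  <media:title>For the lazy</media:title>
  <media:credit role="photographer">Handles</media:credit>
</item>"""

item = etree.fromstring(ITEM)

# Each child's .tag is either a bare name or "{namespace-uri}localname";
# the prefix used in the source text is not part of the tag.
tags = [child.tag for child in item]

# QName splits a tag into its namespace and local name
parts = [(etree.QName(child).namespace, etree.QName(child).localname)
         for child in item]

# A plain name only matches the element with no namespace, so the
# <media:title> element is not picked up by this search.
plain_matches = item.findall('title')

for tag in tags:
    print(tag)
```

So `<title>` and `<media:title>` are two unrelated tags as far as lxml is concerned, which is why a plain `find('title')` never returns the media one.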

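To get information off the prefixed elements, you'll need slightly different syntax: lxml matches them by the full namespace URI wrapped in curly braces, as in {uri}tag, not by the media: prefix you see in the feed.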

# Now to fetch the data from the namespaced elements
media_title = item.find("{%s}title" % namespaces['media']).text

# Keep the element itself here so its attributes can be read
media_thumbnail = item.find("{%s}thumbnail" % namespaces['media'])
thumbnail = {
    'url': media_thumbnail.get('url'),
    'width': media_thumbnail.get('width'),
    'height': media_thumbnail.get('height'),
}
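
As an alternative to building the `{uri}tag` strings yourself, `find`, `findtext` and `xpath` all accept a `namespaces` mapping and let you write `prefix:name` paths. This is a rough sketch, not the approach used above; the sample item is declared inline so it runs standalone, and the `flickr` URI in `NS` is a placeholder you'd swap for the one in your feed.

```python
from lxml import etree

# Prefix -> URI mapping. The media URI is the standard Media RSS one;
# the flickr URI is a placeholder for illustration only.
NS = {
    'media': 'http://search.yahoo.com/mrss/',
    'flickr': 'urn:example:flickr',
}

# Cut-down item with its own xmlns declarations so it parses standalone
ITEM = b"""<item xmlns:media="http://search.yahoo.com/mrss/"
      xmlns:flickr="urn:example:flickr">
  <author flickr:profile="http://www.flickr.com/people/handles/">Handles</author>
  <media:title>For the lazy</media:title>
  <media:thumbnail url="http://farm6.staticflickr.com/5347/7429952158_962a849b30_s.jpg" height="75" width="75" />
</item>"""

item = etree.fromstring(ITEM)

# prefix:name paths are resolved through the namespaces mapping
media_title = item.findtext('media:title', namespaces=NS)
thumb = item.find('media:thumbnail', namespaces=NS)
thumbnail = {key: thumb.get(key) for key in ('url', 'width', 'height')}

# .get() has no prefix shortcut, so a namespaced attribute still needs
# the {uri}name form
profile = item.find('author').get('{%s}profile' % NS['flickr'])

# XPath takes the same mapping and can return attribute values directly
urls = item.xpath('media:thumbnail/@url', namespaces=NS)

print(media_title, thumbnail['width'], profile)
```

The prefixes in `NS` are your own labels: they don't have to match the ones in the feed, only the URIs do.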

Problem solved, like a boss!

Copyright © Twig's Tech Tips