Twig's Tech Tips: Python: How to parse XML/RSS feeds with namespaces using lxml.etree

Alright, this one had me stumped for a good hour or two.

Namespaces are a pain, but they're not too bad once you map each prefix to its URI before you start searching.

 # Some basic setup
 from urllib.request import urlopen  # Python 3; on Python 2 use: from urllib2 import urlopen
 from lxml import etree
 
 # Namespaces copied straight out of the feed source
 namespaces = {
   'media': "http://search.yahoo.com/mrss/",
   'dc': "http://purl.org/dc/elements/1.1/",
   'creativeCommons': "http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html",
 }
 
 # Fetch and parse the feed (feed_url is the address of your RSS feed)
 response = urlopen(feed_url)
 xml = etree.parse(response)
 # The root is <rss>; grab the first <item> inside <channel>
 item = xml.getroot().find('channel').find('item')
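
If you want every item rather than just the first, findall() with a path does the job. Here's a minimal sketch that parses an inline sample instead of a live feed. SAMPLE_FEED and item_titles are names made up for illustration, and the fallback import is only there so the snippet still runs without lxml installed.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree handles the same calls used below
    import xml.etree.ElementTree as etree

# Hypothetical two-item feed standing in for the real download
SAMPLE_FEED = b"""<rss version="2.0">
  <channel>
    <title>Sample feed</title>
    <item><title>For the lazy</title></item>
    <item><title>Another photo</title></item>
  </channel>
</rss>"""

root = etree.fromstring(SAMPLE_FEED)

# 'channel/item' is relative to the <rss> root and matches every <item>
items = root.findall('channel/item')
item_titles = [entry.find('title').text for entry in items]
```

Un-namespaced tags like title and link can be looked up by their plain name, as above.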

Now here's the basic structure of an RSS feed item.

 <item>
   <title>For the lazy</title> 
   <link>http://www.flickr.com/photos/handles/7429952158/</link> 
   <description>blah blah blah</description> 
   <pubDate>Sat, 23 Jun 2012 21:59:07 -0700</pubDate> 
   <dc:date.Taken>2012-06-24T14:58:54-08:00</dc:date.Taken> 
   <author flickr:profile="http://www.flickr.com/people/handles/">Handles</author> 
   <guid isPermaLink="false">tag:flickr.com,2004:/photo/7429952158</guid> 
   <media:content url="http://farm6.staticflickr.com/5347/7429952158_962a849b30_b.jpg" type="image/jpeg" height="1024" width="768"/> 
   <media:title>For the lazy</media:title> 
   <media:thumbnail url="http://farm6.staticflickr.com/5347/7429952158_962a849b30_s.jpg" height="75" width="75" /> 
   <media:credit role="photographer">Handles</media:credit> 
 </item>

To get data out of the namespaced elements, you have to spell the tag out as {namespace-uri}name, because find() won't resolve a prefix like media: on its own.

 # Now to fetch the data from the namespaced elements
 media_title = item.find("{%s}title" % namespaces['media']).text
 
 media_thumbnail = item.find("{%s}thumbnail" % namespaces['media'])
 thumbnail = {
   'url': media_thumbnail.get('url'),
   'width': media_thumbnail.get('width'),
   'height': media_thumbnail.get('height'),
 }
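
If the string formatting gets tedious, find() also accepts a namespaces argument, so you can write the prefix directly in the path. A rough sketch, assuming a cut-down inline copy of the item above: SAMPLE_ITEM is made up for illustration, and the fallback import only exists so the snippet runs without lxml.

```python
try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib ElementTree handles the same calls used below
    import xml.etree.ElementTree as etree

namespaces = {
    'media': "http://search.yahoo.com/mrss/",
    'dc': "http://purl.org/dc/elements/1.1/",
}

# Hypothetical trimmed item; the xmlns declarations normally sit on the <rss> root
SAMPLE_ITEM = b"""<item xmlns:media="http://search.yahoo.com/mrss/"
      xmlns:dc="http://purl.org/dc/elements/1.1/">
  <title>For the lazy</title>
  <dc:date.Taken>2012-06-24T14:58:54-08:00</dc:date.Taken>
  <media:title>For the lazy</media:title>
  <media:thumbnail url="http://farm6.staticflickr.com/5347/7429952158_962a849b30_s.jpg" height="75" width="75" />
</item>"""

item = etree.fromstring(SAMPLE_ITEM)

# Each prefix is looked up in the dict, so no manual "{uri}tag" string is needed
media_title = item.find('media:title', namespaces=namespaces).text
date_taken = item.find('dc:date.Taken', namespaces=namespaces).text

media_thumbnail = item.find('media:thumbnail', namespaces=namespaces)
thumbnail = {
    'url': media_thumbnail.get('url'),
    'width': media_thumbnail.get('width'),
    'height': media_thumbnail.get('height'),
}
```

Attribute values come back as strings, so wrap width and height in int() if you need numbers.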

Problem solved, like a boss!

Source

Simple XML parsing with Python and LXML
