Python: How to parse XML

Take a sample XML string such as:

<?xml version="1.0" encoding="utf-8" ?>
<YourRoot>
    <Group Found="1000">
        <Item ID="1">
            <Title>Something silly</Title>
            <Summary><![CDATA[Blah blah blah...]]></Summary>
            <Location>
                <Country>Australia</Country>
                <State>NSW</State>
                <City>Sydney</City>
                <PostalCode>2002</PostalCode>
            </Location>
        </Item>
        <Item ID="2">
            ...
        </Item>
        <Item ID="3">
            ...
        </Item>
    </Group>
</YourRoot>

Reading the data

Using ElementTree, you can either read directly from the file or load it into a string first. Include the following import. Parse failures are raised as ParseError on Python 2.7 and Python 3; older releases raised xml.parsers.expat.ExpatError instead.

from xml.etree.ElementTree import ElementTree, ParseError

If you are using a string:

from xml.etree.ElementTree import fromstring

try:
    # fromstring() returns the root Element rather than an ElementTree
    tree = fromstring(xml_data)
except ParseError:
    print("Unable to parse XML data from string")

Otherwise, to load it directly:

try:
    tree = ElementTree(file="filename")
except ParseError:
    print("Unable to parse XML from file")

Once you have the tree initialised, you can begin parsing the information.
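
As a self-contained sketch of the string route, the snippet below parses a small inline document and shows what happens when the data is not well-formed. The sample strings and the helper name load_root are made up for illustration. It catches ParseError, which is what fromstring() raises for malformed input on Python 2.7 and Python 3.

```python
from xml.etree.ElementTree import fromstring, ParseError

# Both strings are illustrative; the second has a mismatched closing tag.
GOOD_XML = '<YourRoot><Group Found="1000"><Item ID="1"/></Group></YourRoot>'
BAD_XML = '<YourRoot><Group></YourRoot>'


def load_root(xml_data):
    """Return the root Element, or None if the data is not well-formed."""
    try:
        return fromstring(xml_data)
    except ParseError:
        return None


root = load_root(GOOD_XML)
print(root.tag)            # tag name of the root element
print(load_root(BAD_XML))  # None, because parsing failed
```

Returning None keeps the example short; real code may prefer to log the error or re-raise it.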

As ElementTree is quite versatile, there are a few options for parsing the data. You can choose between being lazy or specific.

Being lazy and getting all "Item" elements from the XML

For simple XML data like RSS feeds, this is usually enough to get by.

def parse_results(self, tree):
    results = []

    # iter() replaces getiterator(), which was removed in Python 3.9
    for element in tree.iter('Item'):
        location = element.find('Location')

        results.append({
            'id': element.get('ID'),
            'title': element.find('Title').text,
            'summary': element.find('Summary').text,
            'location': {
                'country': location.find('Country').text if location.find('Country') is not None else '',
                'state': location.find('State').text if location.find('State') is not None else '',
                'city': location.find('City').text if location.find('City') is not None else '',
                'postcode': location.find('PostalCode').text if location.find('PostalCode') is not None else '',
            },
        })

    return results

From the example, element.get('ID') reads the element attribute and element.find('Title').text returns the element value.
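
To make the difference between attributes and element values concrete, here is a small sketch run against a trimmed-down copy of one Item from the sample. The string ITEM_XML and the 'Missing' attribute name are assumptions made for the example.

```python
from xml.etree.ElementTree import fromstring

# Trimmed-down copy of one Item from the sample document.
ITEM_XML = (
    '<Item ID="1">'
    '<Title>Something silly</Title>'
    '<Summary><![CDATA[Blah blah blah...]]></Summary>'
    '</Item>'
)

element = fromstring(ITEM_XML)

item_id = element.get('ID')             # attribute value, returned as a string
title = element.find('Title').text      # text inside <Title>
summary = element.find('Summary').text  # CDATA content arrives as plain text
missing = element.get('Missing')        # absent attributes give None

print(item_id, title, summary, missing)
```

Note that get() hands back attribute values as strings, so an ID used as a number needs an explicit int() call.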

The code checks that each child of Location exists before reading its text; if a child is missing, the value defaults to an empty string.
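
A shorter way to express the same fallback, offered as an alternative rather than what the post uses, is findtext(), which accepts a default for missing elements. The sketch below also guards against the Location element itself being absent. The names FIELDS and read_location are hypothetical.

```python
from xml.etree.ElementTree import fromstring

# Maps the dictionary keys used in the post to the XML tag names.
FIELDS = {
    'country': 'Country',
    'state': 'State',
    'city': 'City',
    'postcode': 'PostalCode',
}


def read_location(element):
    """Build the location dict, using '' for anything that is missing."""
    location = element.find('Location')
    if location is None:
        return {key: '' for key in FIELDS}
    # findtext() returns the default when the child element does not exist
    return {key: location.findtext(tag, '') for key, tag in FIELDS.items()}


partial = fromstring(
    '<Item ID="1"><Location>'
    '<Country>Australia</Country><City>Sydney</City>'
    '</Location></Item>'
)
bare = fromstring('<Item ID="2"/>')

print(read_location(partial))
print(read_location(bare))
```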

Being picky and navigating the XML paths manually

Depending on how complex the XML structure is, you may have to navigate some of it manually.

def parse_results(self, tree):
    results = []
    # Paths are relative to the root element (YourRoot), so it is not named here
    group = tree.find("Group")

    for element in group.iter('Item'):
        location = element.find('Location')

        results.append({
            # Exactly the same as above...
        })

    return results

This time we navigate the tree a little by using tree.find("Group") to tell ElementTree which specific element we want. The path is resolved relative to the root element, so YourRoot itself is not part of it.

Then we iterate over all "Item" elements in "Group" as per usual.
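
The effect of narrowing the search can be seen in a small runnable sketch. The Other element and the extra Item values are invented for the example and are not part of the sample document. The path given to find() is written relative to the root element, which is how the standard library's ElementTree resolves it.

```python
from xml.etree.ElementTree import fromstring

# Illustrative document: <Other> is a made-up sibling of <Group>.
SAMPLE = """<YourRoot>
  <Group Found="1000">
    <Item ID="1"><Title>Something silly</Title></Item>
    <Item ID="2"><Title>Something else</Title></Item>
  </Group>
  <Other>
    <Item ID="99"><Title>Not in the group</Title></Item>
  </Other>
</YourRoot>"""

root = fromstring(SAMPLE)
group = root.find('Group')  # direct child of the root element

# Searching from the group only sees the items inside it.
ids_in_group = [item.get('ID') for item in group.iter('Item')]

# Searching from the root sees every Item in the document.
ids_anywhere = [item.get('ID') for item in root.iter('Item')]

print(ids_in_group)
print(ids_anywhere)
```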

[ ElementTree Documentation, Source ]

2 comments:

  1. I could be wrong (I mean, I probably wouldn't be reading this if I weren't new to ElementTree), but I think you might have made a bit of a mistake in your for loops.

    e.g.

    04. for item in tree.getiterator('Item'):
    05. location = element.find('Location')

    line 4, the name item is used, but line 5, the name element is used.

    Whether I am wrong or not, thanks for the introduction to ElementTree. Just the quick and simple kind of intro I was after.

  2. Ahh good call, thanks for that!

Copyright © Twig's Tech Tips