Twig's Tech Tips: Python: How to parse XML

Take a sample XML string such as:

 <?xml version="1.0" encoding="utf-8" ?>
 <YourRoot>
   <Group Found="1000">
     <Item ID="1">
       <Title>Something silly</Title>
       <Summary><![CDATA[Blah blah blah...]]></Summary>
       <Location>
         <Country>Australia</Country>
         <State>NSW</State>
         <City>Sydney</City>
         <PostalCode>2002</PostalCode>
       </Location>
     </Item>
     <Item ID="2">
       ...
     </Item>
     <Item ID="3">
       ...
     </Item>
   </Group>
 </YourRoot>

Reading the data

Using ElementTree, you can either read the XML directly from a file or parse a string you have already loaded. Start with the following import.

 # ParseError is raised when the XML is not well-formed
 from xml.etree.ElementTree import ElementTree, ParseError

If you are using a string:

 from xml.etree.ElementTree import fromstring
 
 try:
   tree = fromstring(xml_data)
 except ParseError:
   print("Unable to parse XML data from string")

Otherwise, to load it directly from a file:

 try:
   tree = ElementTree(file="filename")
 except ParseError:
   print("Unable to parse XML from file")
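
The two loading routes return slightly different objects, which is easy to trip over. The sketch below is only an illustration: SAMPLE_XML is a trimmed-down stand-in for the sample document, and the temporary file exists purely so the example can run on its own. It shows that fromstring() gives you the root element directly, while ElementTree(file=...) gives you a tree whose getroot() returns the same kind of element.

```python
import os
import tempfile
from xml.etree.ElementTree import ElementTree, ParseError, fromstring

# Hypothetical, trimmed-down version of the sample document.
SAMPLE_XML = """<YourRoot>
  <Group Found="1000">
    <Item ID="1"><Title>Something silly</Title></Item>
  </Group>
</YourRoot>"""

# fromstring() returns the root Element itself.
root_from_string = fromstring(SAMPLE_XML)

# ElementTree(file=...) returns a tree wrapper; getroot() yields the root Element.
with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as handle:
    handle.write(SAMPLE_XML)
    path = handle.name
try:
    root_from_file = ElementTree(file=path).getroot()
finally:
    os.remove(path)

# Mismatched tags make the parser raise ParseError.
try:
    fromstring("<YourRoot><Group></YourRoot>")
    parse_failed = False
except ParseError:
    parse_failed = True
```

Both find() and iter() are available on the tree and on the element, so the parsing code works with either result.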

Once you have the tree initialised, you can begin parsing the information.

As ElementTree is quite versatile, there are a few options for parsing the data. You can choose between being lazy or specific.

Being lazy and getting all "Item" elements from the XML

For simple XML data like RSS feeds, this is usually enough to get by.

 def parse_results(self, tree):
   results = []
 
   # iter() walks the whole tree and yields every "Item" element
   for element in tree.iter('Item'):
     location = element.find('Location')
 
     results.append({
       'id': element.get('ID'),
       'title': element.find('Title').text,
       'summary': element.find('Summary').text,
       'location': {
         'country': location.find('Country').text if location.find('Country') is not None else '',
         'state': location.find('State').text if location.find('State') is not None else '',
         'city': location.find('City').text if location.find('City') is not None else '',
         'postcode': location.find('PostalCode').text if location.find('PostalCode') is not None else '',
       },
     })
 
   return results

In the example, element.get('ID') reads an attribute of the element and element.find('Title').text returns the element's text content.

The code checks that each child of Location exists before reading its text. If a child is missing, the value defaults to an empty string.
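
The "check, then read, else empty string" pattern can also be written with findtext(), which accepts a default. The following is a minimal sketch rather than a replacement for the function above: text_or_empty is a hypothetical helper introduced here, and SAMPLE_XML is a shortened copy of the sample with PostalCode deliberately left out so the default is exercised.

```python
from xml.etree.ElementTree import fromstring

# Shortened copy of the sample; PostalCode is intentionally missing.
SAMPLE_XML = """<YourRoot>
  <Group Found="1000">
    <Item ID="1">
      <Title>Something silly</Title>
      <Summary><![CDATA[Blah blah blah...]]></Summary>
      <Location>
        <Country>Australia</Country>
        <State>NSW</State>
        <City>Sydney</City>
      </Location>
    </Item>
  </Group>
</YourRoot>"""


def text_or_empty(parent, tag):
    """Hypothetical helper: text of parent's child `tag`, or '' if absent."""
    if parent is None:
        return ''
    # findtext() returns the default when no matching child exists.
    return parent.findtext(tag, default='')


root = fromstring(SAMPLE_XML)
items = []
for element in root.iter('Item'):
    location = element.find('Location')
    items.append({
        'id': element.get('ID'),            # attributes are always strings
        'title': text_or_empty(element, 'Title'),
        'summary': text_or_empty(element, 'Summary'),
        'city': text_or_empty(location, 'City'),
        'postcode': text_or_empty(location, 'PostalCode'),
    })

print(items)
```

The helper also covers the case where an Item has no Location at all, since it returns an empty string when the parent is None.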

Being picky and navigating the XML paths manually

Depending on how complex the XML structure is, you may have to navigate some of it manually.

 def parse_results(self, tree):
   results = []
   # Paths are relative to the root element (YourRoot), so it is not part of the path
   group = tree.find("Group")
 
   for element in group.iter('Item'):
     location = element.find('Location')
 
     results.append({
       # Exactly the same as above...
     })
 
   return results

This time we navigate the tree a little by using tree.find("Group") to tell ElementTree that we want that specific element. The path is relative to the root element, YourRoot, so the root's own tag is left out.

Then we iterate over all "Item" elements in "Group" as per usual.
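
Path navigation can go a bit further than a single find(). The sketch below assumes a shortened sample in which the second Item's title ("Something else") is made up for illustration. It shows a multi-step path with findall(), an attribute predicate to pick one Item, and how paths are resolved relative to the element that find() is called on.

```python
from xml.etree.ElementTree import fromstring

# Shortened sample; the second title is invented for this example.
SAMPLE_XML = """<YourRoot>
  <Group Found="1000">
    <Item ID="1"><Title>Something silly</Title></Item>
    <Item ID="2"><Title>Something else</Title></Item>
  </Group>
</YourRoot>"""

root = fromstring(SAMPLE_XML)  # root is the <YourRoot> element

# A path is resolved relative to the element it is called on.
group = root.find('Group')
found = int(group.get('Found'))

# findall() with a path returns only the elements at that exact location.
titles = [item.findtext('Title') for item in root.findall('Group/Item')]

# An attribute predicate narrows the match to one specific Item.
second = root.find("Group/Item[@ID='2']")

# Including the root's own tag looks for a <YourRoot> child inside <YourRoot>.
missing = root.find('YourRoot/Group')
```

Unlike iter(), which descends through every level, findall('Group/Item') only matches Item elements that sit directly under Group.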

[ ElementTree Documentation, Source ]
