Extract file list from SWAC collection index

The Shtooka Project offers audio material for language learning in its SWAC collections. The data is provided under free licences and can be downloaded from the site. Each data set comes with an XML index, for which I wrote a simple Python extractor:

>>> import xml.sax
>>> import bz2
>>>
>>> class SwacXMLIndexHandler(xml.sax.ContentHandler):
...     """Extracts a list of pronunciation and file name pairs."""
...     def __init__(self, fileList):
...         self.fileList = fileList
...     def startDocument(self):
...         self.currentFilePath = None
...     def startElement(self, name, attrs):
...         if name == 'file':
...             # remember the path of the file entry we are in (None if absent)
...             self.currentFilePath = attrs.get('path')
...         elif name == 'tag' and self.currentFilePath:
...             # only tags carrying a pronunciation are collected
...             pronunciation = attrs.get('swac_pron_phon')
...             if pronunciation:
...                 self.fileList.append(
...                     (pronunciation, self.currentFilePath))
...
>>>
>>> xmlFile = bz2.BZ2File('index.xml.bz2')
>>> fileList = []
>>> indexHandler = SwacXMLIndexHandler(fileList)
>>>
>>> saxparser = xml.sax.make_parser()
>>> saxparser.setContentHandler(indexHandler)
>>> # don't check DTD as this raises an exception
... saxparser.setFeature(xml.sax.handler.feature_external_ges, False)
>>> saxparser.parse(xmlFile)
>>>
>>> print(len(fileList))
1000


Here I loaded the index of the "Free audio base of Chinese words". I used "swac_pron_phon" as the key, which is not distinct in the Chinese collection, since different words can share a pronunciation; "swac_text" can be used as the key instead.
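
To make the choice of key easier to switch, here is a minimal sketch of the same handler with the attribute name passed in as a parameter. It is written for a current Python 3 rather than the 2.5 session above. The names `SwacIndexHandler`, `extract_pairs`, `group_paths` and the two-entry `SAMPLE` index are made up for illustration; only the `file`/`path` and `tag` attribute names come from the extractor above.

```python
import collections
import io
import xml.sax
import xml.sax.handler


class SwacIndexHandler(xml.sax.handler.ContentHandler):
    """Collects (key, file path) pairs; the key attribute is configurable."""

    def __init__(self, file_list, key_attr='swac_pron_phon'):
        xml.sax.handler.ContentHandler.__init__(self)
        self.file_list = file_list
        self.key_attr = key_attr
        self.current_path = None

    def startElement(self, name, attrs):
        if name == 'file':
            self.current_path = attrs.get('path')
        elif name == 'tag' and self.current_path:
            key = attrs.get(self.key_attr)
            if key:
                self.file_list.append((key, self.current_path))


def extract_pairs(stream, key_attr='swac_pron_phon'):
    """Parse an index from a binary stream and return the collected pairs."""
    pairs = []
    parser = xml.sax.make_parser()
    parser.setContentHandler(SwacIndexHandler(pairs, key_attr))
    # as in the session above: do not fetch external entities such as the DTD
    parser.setFeature(xml.sax.handler.feature_external_ges, False)
    parser.parse(stream)
    return pairs


def group_paths(pairs):
    """Group file paths by key, since keys need not be distinct."""
    grouped = collections.defaultdict(list)
    for key, path in pairs:
        grouped[key].append(path)
    return dict(grouped)


# Hypothetical two-entry index, not taken from the real collection.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<index>
  <file path="a.flac"><tag swac_text="一" swac_pron_phon="yi1"/></file>
  <file path="b.flac"><tag swac_text="衣" swac_pron_phon="yi1"/></file>
</index>
""".encode('utf-8')

by_pron = extract_pairs(io.BytesIO(SAMPLE))
by_text = extract_pairs(io.BytesIO(SAMPLE), key_attr='swac_text')
grouped_by_pron = group_paths(by_pron)
```

For the real index, `extract_pairs(bz2.BZ2File('index.xml.bz2'), key_attr='swac_text')` should work the same way. Grouping the paths per key keeps all recordings that share a pronunciation instead of letting a plain dict overwrite them.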

It is important not to load the referenced DTD; when the parser tried to fetch it, I received the following error during parsing:

Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: http://shtooka.net/project/swac/index.dtd:1:0:
error in processing external entity reference
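
As a small check of the workaround, the following sketch parses an in-memory document that references the same DTD URL while external general entities are switched off, so the parser should never try to fetch it. It assumes a current Python 3. The `SAMPLE` document, its `index` root element, and the names `ElementCounter` and `parse_without_dtd` are assumptions made for this example, not part of the real index.

```python
import io
import xml.sax
import xml.sax.handler

# Hypothetical minimal index that, like the real one, points at the remote DTD.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE index SYSTEM "http://shtooka.net/project/swac/index.dtd">
<index><file path="a.flac"><tag swac_text="a"/></file></index>
"""


class ElementCounter(xml.sax.handler.ContentHandler):
    """Counts start tags, just to show that parsing ran to the end."""

    def __init__(self):
        xml.sax.handler.ContentHandler.__init__(self)
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def parse_without_dtd(data):
    """Parse bytes with external general entities disabled."""
    parser = xml.sax.make_parser()
    handler = ElementCounter()
    parser.setContentHandler(handler)
    # the feature has to be set before parse() is called
    parser.setFeature(xml.sax.handler.feature_external_ges, False)
    feature_is_off = not parser.getFeature(
        xml.sax.handler.feature_external_ges)
    parser.parse(io.BytesIO(data))
    return handler.count, feature_is_off


element_count, feature_is_off = parse_without_dtd(SAMPLE)
```

Without the DTD no validation or default attribute values are available, which does not matter here because the extractor only reads attributes that are written out in the index itself.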