Extract file list from SWAC collection index
Submitted by Christoph on 27 November, 2008 - 10:25
The Shtooka Project offers audio material for learning languages in its SWAC collections. The data is provided under free licences and can be downloaded from the site. Each data set comes with an XML index, for which I wrote a simple Python extractor:
>>> import xml.sax
>>> import bz2
>>>
>>> class SwacXMLIndexHandler(xml.sax.ContentHandler):
...     """Extracts a list of pronunciation and file name pairs."""
...     def __init__(self, fileList):
...         self.fileList = fileList
...     def startDocument(self):
...         self.currentFilePath = None
...     def startElement(self, name, attrs):
...         if name == 'file':
...             self.currentFilePath = None
...             for key, value in attrs.items():
...                 if key == 'path':
...                     self.currentFilePath = value
...         elif name == 'tag':
...             if self.currentFilePath:
...                 pronunciation = None
...                 for key, value in attrs.items():
...                     if key == 'swac_pron_phon':
...                         pronunciation = value
...                 if pronunciation:
...                     self.fileList.append(
...                         (pronunciation, self.currentFilePath))
...
>>>
>>> xmlFile = bz2.BZ2File('index.xml.bz2')
>>> fileList = []
>>> indexHandler = SwacXMLIndexHandler(fileList)
>>>
>>> saxparser = xml.sax.make_parser()
>>> saxparser.setContentHandler(indexHandler)
>>> # don't check DTD as this raises an exception
... saxparser.setFeature(xml.sax.handler.feature_external_ges, False)
>>> saxparser.parse(xmlFile)
>>>
>>> print(len(fileList))
1000
Here I loaded the index of the "Free audio base of Chinese words". I used "swac_pron_phon" as the key, which is not unique within the Chinese collection; "swac_text" can be used instead.
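Since a key such as the pronunciation can occur more than once, it may help to group the paths per key rather than keep a flat list. The following is only a minimal sketch, assuming the list holds `(key, path)` tuples like `fileList` does; the helper name `group_by_key` and the sample entries are made up for illustration:

```python
from collections import defaultdict


def group_by_key(pairs):
    """Collect all file paths that share the same key."""
    grouped = defaultdict(list)
    for key, path in pairs:
        grouped[key].append(path)
    return dict(grouped)


# Hypothetical sample data; real entries come from the parsed index.
sample_pairs = [
    ('ma1', 'first.ogg'),
    ('shu1', 'second.ogg'),
    ('ma1', 'third.ogg'),
]

grouped = group_by_key(sample_pairs)
```

With this sample, `grouped['ma1']` holds both `first.ogg` and `third.ogg`, so no recording is lost when keys collide.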
It is important not to load the DTD referenced by the index, because with it the parser fails with the following error:
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: http://shtooka.net/project/swac/index.dtd:1:0: error in processing external entity reference
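The same idea can be wrapped in a small function that always switches off external entities before parsing. This is a rough sketch rather than a definitive version: the names `PairHandler` and `parse_index` are my own, and the in-memory sample only guesses at the element nesting (a `tag` element inside a `file` element), so the real index may look different.

```python
import io
import xml.sax
import xml.sax.handler


class PairHandler(xml.sax.ContentHandler):
    """Collects (key, path) pairs from file/tag elements."""

    def __init__(self, key_attr='swac_text'):
        xml.sax.ContentHandler.__init__(self)
        self.key_attr = key_attr
        self.pairs = []
        self.current_path = None

    def startElement(self, name, attrs):
        if name == 'file':
            # Remember the path until the next file element starts.
            self.current_path = attrs.get('path')
        elif name == 'tag' and self.current_path:
            key = attrs.get(self.key_attr)
            if key:
                self.pairs.append((key, self.current_path))


def parse_index(stream, key_attr='swac_text'):
    """Parse an index stream without loading the external DTD."""
    handler = PairHandler(key_attr)
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    # Skip external entities so the DTD URL in the DOCTYPE is never fetched.
    parser.setFeature(xml.sax.handler.feature_external_ges, False)
    parser.parse(stream)
    return handler.pairs


# Made-up sample; the nesting is an assumption, not the real schema.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE index SYSTEM "http://shtooka.net/project/swac/index.dtd">
<index>
  <file path="a.ogg"><tag swac_text="A" swac_pron_phon="a1"/></file>
  <file path="b.ogg"><tag swac_text="B"/></file>
</index>
"""

pairs = parse_index(io.BytesIO(SAMPLE))
```

For the real data, the stream would be the `bz2.BZ2File('index.xml.bz2')` object from the session above, and `key_attr` can be set to `'swac_pron_phon'` to get the pronunciation instead of the text.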