Extract file list from SWAC collection index
Submitted by Christoph on 27 November, 2008 - 10:25
The Shtooka Project offers audio material for learning languages in its SWAC collections. The data is provided under free licences and can be downloaded from the site. Each data set comes with an XML index, for which I wrote a simple Python extractor:
>>> import xml.sax
>>> import bz2
>>>
>>> class SwacXMLIndexHandler(xml.sax.ContentHandler):
...     """Extracts a list of pronunciation and file name pairs."""
...     def __init__(self, fileList):
...         self.fileList = fileList
...     def startDocument(self):
...         self.currentFilePath = None
...     def startElement(self, name, attrs):
...         if name == 'file':
...             self.currentFilePath = None
...             for key, value in attrs.items():
...                 if key == 'path':
...                     self.currentFilePath = value
...         elif name == 'tag':
...             if self.currentFilePath:
...                 pronunciation = None
...                 for key, value in attrs.items():
...                     if key == 'swac_pron_phon':
...                         pronunciation = value
...                 if pronunciation:
...                     self.fileList.append(
...                         (pronunciation, self.currentFilePath))
...
>>>
>>> xmlFile = bz2.BZ2File('index.xml.bz2')
>>> fileList = []
>>> indexHandler = SwacXMLIndexHandler(fileList)
>>>
>>> saxparser = xml.sax.make_parser()
>>> saxparser.setContentHandler(indexHandler)
>>> # don't check DTD as this raises an exception
... saxparser.setFeature(xml.sax.handler.feature_external_ges, False)
>>> saxparser.parse(xmlFile)
>>>
>>> print len(fileList)
1000
Here I loaded the index of the "Free audio base of Chinese words". I used "swac_pron_phon" as the key, which is not distinct within the Chinese collection; it can be changed to "swac_text" instead.
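Because the key may not be distinct, it can help to group file paths by key instead of storing flat pairs. The following is only a sketch in Python 3 (the session above is Python 2). The names SwacKeyHandler, group_files and SAMPLE_INDEX are made up for illustration, and the tiny sample index is hypothetical: it merely follows the file/tag structure that the extractor above expects and contains no DOCTYPE.

```python
import xml.sax
import xml.sax.handler
from collections import defaultdict


class SwacKeyHandler(xml.sax.handler.ContentHandler):
    """Groups file paths by the value of one attribute of <tag> (hypothetical helper)."""

    def __init__(self, key_attribute):
        super().__init__()
        self.key_attribute = key_attribute
        self.files_by_key = defaultdict(list)
        self.current_path = None

    def startElement(self, name, attrs):
        if name == "file":
            # Remember the path of the file entry we are currently inside.
            self.current_path = attrs.get("path")
        elif name == "tag" and self.current_path:
            value = attrs.get(self.key_attribute)
            if value:
                self.files_by_key[value].append(self.current_path)


def group_files(xml_bytes, key_attribute):
    """Returns a dict mapping each key value to the list of files carrying it."""
    handler = SwacKeyHandler(key_attribute)
    xml.sax.parseString(xml_bytes, handler)
    return dict(handler.files_by_key)


# Hypothetical sample data, not taken from the real collection.
SAMPLE_INDEX = """<?xml version="1.0" encoding="UTF-8"?>
<index>
  <file path="horse.ogg"><tag swac_text="马" swac_pron_phon="ma3"/></file>
  <file path="code.ogg"><tag swac_text="码" swac_pron_phon="ma3"/></file>
</index>
""".encode("utf-8")

by_pronunciation = group_files(SAMPLE_INDEX, "swac_pron_phon")
by_text = group_files(SAMPLE_INDEX, "swac_text")
print(len(by_pronunciation), len(by_text))
```

In this sample both files share one pronunciation key, while "swac_text" keeps them apart.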
It is important not to load the DTD referenced by the index, as doing so gave me the following error during parsing:
Traceback (most recent call last):
  File "", line 2, in
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: http://shtooka.net/project/swac/index.dtd:1:0: error in processing external entity reference
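The failure comes from the parser trying to process the external DTD named in the document. As a rough Python 3 sketch of the workaround, the snippet below parses a small document whose DOCTYPE points at that DTD while external general entities are switched off, so the DTD should not be fetched. The names ElementCounter, parse_without_external_dtd and SAMPLE_WITH_DOCTYPE are hypothetical, and the DOCTYPE root name "index" is an assumption. Recent Python 3 releases (3.7.1 and later) already leave this feature off by default, so the explicit call mainly matters on older versions.

```python
import io
import xml.sax
import xml.sax.handler


class ElementCounter(xml.sax.handler.ContentHandler):
    """Counts how often each element name occurs (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


def parse_without_external_dtd(stream, handler):
    """Parses the stream with external general entities disabled."""
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    # Do not resolve external general entities, which includes the external DTD.
    parser.setFeature(xml.sax.handler.feature_external_ges, False)
    parser.parse(stream)


# Hypothetical sample; only the DTD URL is taken from the traceback above.
SAMPLE_WITH_DOCTYPE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<!DOCTYPE index SYSTEM "http://shtooka.net/project/swac/index.dtd">\n'
    b'<index><file path="a.ogg"><tag swac_text="a"/></file></index>\n'
)

counter = ElementCounter()
parse_without_external_dtd(io.BytesIO(SAMPLE_WITH_DOCTYPE), counter)
print(counter.counts)
```

For the real index, the same function could be given the object returned by bz2.open('index.xml.bz2') in place of the in-memory stream.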