Tomoe for Debian
Submitted by Christoph on 11 February, 2009 - 04:47

Tomoe is a handwriting recognition engine for Japanese Kanji and Chinese Hanzi. It is written in C for the Gtk library and includes Ruby and Python bindings.
Major Linux distributions currently do not seem to ship any packages, so building your own is the only option. Debian Etch Installation & Configuration is a nice article about how to install Tomoe on Debian. I changed the last step, though, to use checkinstall, which produces a Debian package and thus allows for easy uninstallation. I included the Python bindings (but not the Ruby ones), excluded the Unihan database and configured Tomoe to build the HTML documentation, although the install process currently does not include it.
./autogen.sh
./configure --enable-gtk-doc --disable-unihan
checkinstall -D make install
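To quickly check that the Python bindings ended up in the installed package, a minimal import test can be run. This is only a sketch: the module name tomoe is an assumption based on the upstream Python bindings and may differ on your system.

# minimal check that the Tomoe Python bindings are importable
# (module name 'tomoe' is an assumption)
try:
    import tomoe
    print "Tomoe Python bindings found"
except ImportError:
    print "Tomoe Python bindings missing"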
I'll upload the .deb package here, but note that it does not declare its dependencies; you are therefore advised to build the package yourself using the steps mentioned in the link and above.
Update: You can now use Tegaki, the successor of Tomoe, which was introduced into Debian recently.
| Attachment | Size |
| --- | --- |
| tomoe_0.6.0.svn20090210-1_i386.deb | 1.93 MB |
Batch-Downloading from Wikimedia servers (2)
Submitted by Christoph on 17 December, 2008 - 08:15

Some time ago I wrote how to download a category of files from Wikipedia. As the API has been updated and my program can now download more than the maximum page size of 500 entries, I'll repost the script:
#!/usr/bin/python
# -*- coding: utf8 -*-
#
# Christoph Burgmer, 2008
# Released under the MIT License.
#
# Download all files of a given category from Wikimedia Commons using the
# MediaWiki API, following the continue parameter for categories with more
# entries than the maximum page size.
import urllib
import sys
import re
import os

prependURL = "http://commons.wikimedia.org/w/api.php" \
    + "?action=query&prop=imageinfo&iiprop=url&format=xml&titles="
maxFiles = 500

# Wikimedia blocks the default urllib user agent, so pretend to be a browser.
class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

urllib._urlopener = AppURLopener()

cat = urllib.quote(sys.argv[1].replace('Category:', ''))
baseUrl = "http://commons.wikimedia.org/w/api.php" \
    + "?action=query&list=categorymembers&cmtitle=Category:" \
    + cat + "&cmnamespace=6&format=xml&cmlimit=" + str(maxFiles)

print "getting cat", cat, "(maximum " + str(maxFiles) + ")"

continueRegex = re.compile('<query-continue>' \
    + '<categorymembers cmcontinue="([^\>"]+)" />' \
    + '</query-continue>')

continueParam = None
while True:
    if continueParam:
        url = baseUrl + '&cmcontinue=' + urllib.quote(continueParam)
    else:
        url = baseUrl
    print "retrieving category page url", url
    f = urllib.urlopen(url)
    content = f.read()

    for imageName in re.findall(r'<cm[^>]+title="([^\>"]+)" />', content):
        # quote the title so spaces survive in the query URL
        imageDescriptionUrl = prependURL + urllib.quote(imageName)
        matchObj = re.search("File:([^/]+)$", imageName)
        if matchObj:
            fileName = matchObj.group(1).strip("\n")
            if os.path.exists(fileName):
                print "skipping", fileName
            else:
                print "getting file description page", imageName
                d = urllib.urlopen(imageDescriptionUrl)
                matchObj = re.search('<ii[^>]*?url="([^\>"]+)[^>]*>', d.read())
                if matchObj:
                    fileUrl = matchObj.group(1)
                    print "getting", fileName, fileUrl
                    urllib.urlretrieve(fileUrl, fileName)

    # continue with the next batch if the API indicates there are more entries
    matchObj = continueRegex.search(content)
    if matchObj:
        continueParam = matchObj.group(1)
    else:
        break
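The script expects the category name (with or without the "Category:" prefix) as its first command-line argument; it stores every file of the category in the current directory and skips files that are already present.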
Automating the Update Process of ISO 639 Language Tables
Submitted by Christoph on 30 November, 2008 - 19:26

After not having updated the tables manually for some months, I finally wrote a Makefile and a short Python program that help with updating the ISO 639-1/-2/-3 tables provided by the LoC and SIL. See the attached archive for the Makefile, the download script and the patch files. Check the README file and hit "make" to get started.
As the LoC has now started to deprecate ISO 639-2(B) codes, the columns Part2B and Part2T, which previously served as foreign keys, are no longer directly usable: the ISO_639_3 table continues to contain the old Part2B entries, which makes JOINs a hassle. I therefore changed the ISO_639_2 table to include its own "Id" column serving as a key. A JOIN can now simply be done on this new column, which both tables share.
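For illustration, here is a minimal sketch of such a JOIN in Python with MySQLdb. The connection parameters are placeholders, and I am assuming the new Id column of ISO_639_2 can be matched directly against the Id column of the ISO_639_3 table; the actual schema in the attached archive may differ.

# a minimal sketch, not the attached script: join both tables on the new key column
import MySQLdb

conn = MySQLdb.connect(user='user', passwd='password', db='iso639codes')
cursor = conn.cursor()
cursor.execute("""
    SELECT ISO_639_3.Id, ISO_639_3.Part2T, ISO_639_2.Id
    FROM ISO_639_3
    JOIN ISO_639_2 ON ISO_639_3.Id = ISO_639_2.Id
""")
for row in cursor.fetchall():
    print row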
On a side note: MySQL must recently have changed something, as I now have to filter out the character set statements when exporting:
cat iso639codes.sql | grep -v "LOCK TABLES" | grep -v "UNLOCK TABLES" | grep -v "character_set_client" > iso639codes_clean.sql
Furthermore, the ISO 639-3 tables don't reflect the deprecation of the ISO 639-2(B) codes for hrv and srp, which forces me to change my statistics a bit.
| Attachment | Size |
| --- | --- |
| makeISO639Tables.tar.gz | 4.12 KB |
Extract file list from SWAC collection index
Submitted by Christoph on 27 November, 2008 - 10:25

The Shtooka Project offers audio material for learning languages with their swac collections. The data is provided under free licences and can be downloaded from the site. The data sets come with an XML index, for which I created a simple Python extractor:
>>> import xml.sax
>>> import bz2
>>>
>>> class SwacXMLIndexHandler(xml.sax.ContentHandler):
...     """Extracts a list of pronunciation and file name pairs."""
...     def __init__(self, fileList):
...         self.fileList = fileList
...     def startDocument(self):
...         self.currentFilePath = None
...     def startElement(self, name, attrs):
...         if name == 'file':
...             self.currentFilePath = None
...             for key, value in attrs.items():
...                 if key == 'path':
...                     self.currentFilePath = value
...         elif name == 'tag':
...             if self.currentFilePath:
...                 pronunciation = None
...                 for key, value in attrs.items():
...                     if key == 'swac_pron_phon':
...                         pronunciation = value
...                 if pronunciation:
...                     self.fileList.append(
...                         (pronunciation, self.currentFilePath))
...
>>>
>>> xmlFile = bz2.BZ2File('index.xml.bz2')
>>> fileList = []
>>> indexHandler = SwacXMLIndexHandler(fileList)
>>>
>>> saxparser = xml.sax.make_parser()
>>> saxparser.setContentHandler(indexHandler)
>>> # don't check DTD as this raises an exception
... saxparser.setFeature(xml.sax.handler.feature_external_ges, False)
>>> saxparser.parse(xmlFile)
>>>
>>> print len(fileList)
1000
Here I loaded the index of the "Free audio base of Chinese words". I used "swac_pron_phon" as the key, which is not distinct for the Chinese collection; it can alternatively be changed to "swac_text", as sketched below.
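For example, to key the list on the text form instead, the tag branch of startElement() can be changed along these lines (a sketch based on the handler above; "swac_text" is the alternative tag name mentioned in the text):

# inside SwacXMLIndexHandler.startElement(), replacing the 'swac_pron_phon' handling
elif name == 'tag':
    if self.currentFilePath and attrs.get('swac_text'):
        self.fileList.append(
            (attrs.get('swac_text'), self.currentFilePath))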
It is important not to load the given DTD, as otherwise I received an error during parsing:
Traceback (most recent call last):
  File "", line 2, in
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/expatreader.py", line 109, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/xmlreader.py", line 123, in parse
    self.feed(buffer)
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/expatreader.py", line 220, in feed
    self._err_handler.fatalError(exc)
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/handler.py", line 38, in fatalError
    raise exception
xml.sax._exceptions.SAXParseException: http://shtooka.net/project/swac/index.dtd:1:0: error in processing external entity reference
Random Shots of Hong Kong
Submitted by Christoph on 2 November, 2008 - 10:52

I will upload random pictures of Hong Kong to Flickr, just to give you an impression of the city.
