Tomoe for Debian

Tomoe is a handwriting recognition engine for Japanese Kanji and Chinese Hanzi. It is written in C for the Gtk library and includes Ruby and Python bindings.

Major Linux distributions currently don't seem to ship any packages, so building your own package is the only way to get it. Debian Etch Installation & Configuration is a nice article on how to install Tomoe on Debian. I changed the last step, though, to use checkinstall, which gives you a Debian package and thus allows for easy removal. I included the Python bindings (no Ruby though), excluded the Unihan database and configured Tomoe to generate its documentation in HTML, although the install process currently does not include it.

./configure --enable-gtk-doc --disable-unihan
checkinstall -D make install

I'll upload the .deb package here, but firstly it doesn't state its dependencies, and secondly you are advised to build the package yourself using the steps from the linked article and above.

Update: You can now use Tegaki, the successor of Tomoe, which was introduced into Debian recently.

tomoe_0.6.0.svn20090210-1_i386.deb (1.93 MB)

Batch-Downloading from Wikimedia servers (2)

Some time ago I wrote about how to download a category of files from Wikipedia. As the API got updated and my script can now download more than the maximum page size of 500 entries, here is the script again:

# -*- coding: utf8 -*-
# Christoph Burgmer, 2008
# Released under the MIT License.

import urllib
import sys
import re
import os

# MediaWiki API endpoint (adjust for the wiki in question)
apiUrl = "http://commons.wikimedia.org/w/api.php"

prependURL = apiUrl \
    + "?action=query&prop=imageinfo&iiprop=url&format=xml&titles="
maxFiles = 500

class AppURLopener(urllib.FancyURLopener):
    version = "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)"

urllib._urlopener = AppURLopener()

cat = urllib.quote(sys.argv[1].replace('Category:', ''))
baseUrl = apiUrl \
    + "?action=query&list=categorymembers&cmtitle=Category:" \
    + cat + "&cmnamespace=6&format=xml&cmlimit=" + str(maxFiles)

print "getting cat", cat, "(maximum " + str(maxFiles) + ")"

continueRegex = re.compile('<query-continue>' \
    + '<categorymembers cmcontinue="([^\>"]+)" />' + '</query-continue>')

continueParam = None

while True:
    if continueParam:
        url = baseUrl + '&cmcontinue=' + urllib.quote(continueParam)
    else:
        url = baseUrl
    print "retrieving category page url", url
    f = urllib.urlopen(url)
    content = f.read()

    for imageName in re.findall(r'<cm[^>]+title="([^\>"]+)" />', content):
        imageDescriptionUrl = prependURL + urllib.quote(imageName)
        matchObj = re.search("File:([^/]+)$", imageName)
        if matchObj:
            fileName = matchObj.group(1)
            if os.path.exists(fileName):
                print "skipping", fileName
            else:
                print "getting file description page", imageName
                d = urllib.urlopen(imageDescriptionUrl)
                matchObj = re.search('<ii[^>]*?url="([^\>"]+)"[^>]*>',
                    d.read())
                if matchObj:
                    fileUrl = matchObj.group(1)
                    print "getting", fileName, fileUrl
                    urllib.urlretrieve(fileUrl, fileName)

    matchObj = continueRegex.search(content)
    if matchObj:
        continueParam = matchObj.group(1)
    else:
        break
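The continuation logic can be sketched independently of the network: each request returns at most maxFiles entries plus a continue token, and the client repeats the request with that token until none is returned. A minimal stand-in (function names and the token format are mine, not part of the MediaWiki API):

```python
def fetch_page(items, limit, token=None):
    """Stand-in for one API request: return one page and a continue token."""
    start = int(token) if token else 0
    page = items[start:start + limit]
    # No token means this was the last page
    next_token = str(start + limit) if start + limit < len(items) else None
    return page, next_token

def fetch_all(items, limit):
    """Repeat requests, passing the continue token back, until exhausted."""
    result, token = [], None
    while True:
        page, token = fetch_page(items, limit, token)
        result.extend(page)
        if token is None:
            break
    return result
```

The real script does the same, except that the token comes from the `<query-continue>` element and goes back as the `cmcontinue` URL parameter.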

Automating the Update Process for the ISO 639 Language Tables

After not updating the tables manually for some months, I finally wrote a Makefile and a short Python program to help update the ISO 639-1/-2/-3 tables provided by the LoC and SIL. See the attached archive for the Makefile, the download script and the patch files. Check the README file and hit "make" to get started.

As the LoC has now started to deprecate ISO 639-2(B) codes, the columns Part2B and Part2T, which previously served as foreign keys, are no longer directly usable: the ISO_639_3 table continues to carry the old Part2B entries, which makes JOINs a hassle. I therefore changed the ISO_639_2 table to include its own "Id" column serving as a key. A JOIN can now simply be done on this new column, which both tables include.
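The effect of the shared Id column can be illustrated with SQLite; the table layout below is a simplified stand-in for the real schema, and hrv (whose old 639-2(B) code scr was deprecated) serves as the example:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
c = conn.cursor()
# Simplified stand-ins for the ISO 639-2 and ISO 639-3 tables
c.execute("CREATE TABLE ISO_639_2 (Id TEXT, Part2B TEXT, Part2T TEXT)")
c.execute("CREATE TABLE ISO_639_3 (Id TEXT, Part2B TEXT, RefName TEXT)")
# The 639-2 list now carries 'hrv', while the 639-3 download still lists
# the deprecated 'scr' as Part2B, so a JOIN on Part2B finds no match.
c.execute("INSERT INTO ISO_639_2 VALUES ('hrv', 'hrv', 'hrv')")
c.execute("INSERT INTO ISO_639_3 VALUES ('hrv', 'scr', 'Croatian')")
rows = c.execute("""
    SELECT t3.RefName FROM ISO_639_2 t2
    JOIN ISO_639_3 t3 ON t2.Id = t3.Id""").fetchall()
print(rows)  # the JOIN on Id succeeds even though the Part2B values differ
```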

On a side note: MySQL must have changed something recently, as I now have to filter out the character set settings when exporting:

cat iso639codes.sql | grep -v "LOCK TABLES" | grep -v "UNLOCK TABLES" | grep -v "character_set_client" > iso639codes_clean.sql
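The same clean-up can be done in Python where grep is not available; this is a sketch matching the patterns in the pipeline above (note that, as with grep, the "LOCK TABLES" pattern also catches the "UNLOCK TABLES" lines):

```python
def clean_dump(sql):
    """Drop LOCK/UNLOCK TABLES and character_set_client lines from a dump."""
    drop = ("LOCK TABLES", "character_set_client")
    return "\n".join(line for line in sql.splitlines()
                     if not any(pattern in line for pattern in drop))
```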

Furthermore, the ISO 639-3 tables don't reflect the deprecation of the ISO 639-2(B) codes for hrv and srp, which forces me to adjust my statistics a bit.

makeISO639Tables.tar.gz (4.12 KB)

Extract file list from SWAC collection index

The Shtooka Project offers audio material for learning languages with its swac collections. The data is provided under free licences and can be downloaded from the site. The data sets come with an XML index, for which I created a simple Python extractor:

>>> import xml.sax
>>> import bz2
>>> class SwacXMLIndexHandler(xml.sax.ContentHandler):
...     """Extracts a list of pronunciation and file name pairs."""
...     def __init__(self, fileList):
...         self.fileList = fileList
...     def startDocument(self):
...         self.currentFilePath = None
...     def startElement(self, name, attrs):
...         if name == 'file':
...             self.currentFilePath = None
...             for key, value in attrs.items():
...                 if key == 'path':
...                     self.currentFilePath = value
...         elif name == 'tag':
...             if self.currentFilePath:
...                 pronunciation = None
...                 for key, value in attrs.items():
...                     if key == 'swac_pron_phon':
...                         pronunciation = value
...                 if pronunciation:
...                     self.fileList.append(
...                         (pronunciation, self.currentFilePath))
>>> xmlFile = bz2.BZ2File('index.xml.bz2')
>>> fileList = []
>>> indexHandler = SwacXMLIndexHandler(fileList)
>>> saxparser = xml.sax.make_parser()
>>> saxparser.setContentHandler(indexHandler)
>>> # don't check DTD as this raises an exception
... saxparser.setFeature(xml.sax.handler.feature_external_ges, False)
>>> saxparser.parse(xmlFile)
>>> print len(fileList)

Here I loaded the index of the "Free audio base of Chinese words". I used "swac_pron_phon" as the key, which is not distinct for the Chinese collection; alternatively "swac_text" can be used.
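Switching to the written form as key follows the same handler pattern, only reading the "swac_text" attribute. A self-contained sketch with a made-up miniature index (the real indices are much larger and carry more attributes):

```python
import xml.sax

class SwacTextIndexHandler(xml.sax.ContentHandler):
    """Extracts a list of written form and file name pairs."""
    def __init__(self, fileList):
        xml.sax.ContentHandler.__init__(self)
        self.fileList = fileList
        self.currentFilePath = None
    def startElement(self, name, attrs):
        if name == 'file':
            self.currentFilePath = attrs.get('path')
        elif name == 'tag' and self.currentFilePath:
            text = attrs.get('swac_text')
            if text:
                self.fileList.append((text, self.currentFilePath))

# Made-up two-entry index for illustration
index = (b'<index>'
    b'<file path="cmn-ni_hao.ogg">'
    b'<tag swac_text="\xe4\xbd\xa0\xe5\xa5\xbd"/></file>'
    b'</index>')
fileList = []
xml.sax.parseString(index, SwacTextIndexHandler(fileList))
```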

It is important not to load the given DTD, as otherwise parsing fails with an error:

Traceback (most recent call last):
  File "", line 2, in
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/", line 109, in parse
    xmlreader.IncrementalParser.parse(self, source)
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/", line 123, in parse
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/", line 220, in feed
  File "/usr/lib/python2.5/site-packages/_xmlplus/sax/", line 38, in fatalError
    raise exception
error in processing external entity reference

Random Shots of Hong Kong

I will put up random pictures of Hong Kong on Flickr, just to give you an impression of the city.
