A technical post on transliterations

This blog entry provides a nice mix of transliterations, C++, cjklib, ICU and language bindings in Python.

ICU (International Components for Unicode) is an open source Unicode support library from IBM. I would call it the Unicode implementation. It is used by IBM, Google, Apple and others. SQLite, which only offers narrow script support, points to ICU for full Unicode support; the Python team decided to implement only basic Unicode support and leave locale-based handling to ICU; and PHP's intl library builds on top of ICU. ICU offers many functions aside from what one would expect from a library for an "encoding". Unicode, though, is far from being only an encoding: it offers multilingual solutions together with the encoding of the world's scripts.

ICU is written in C/C++ and Java. A Python binding of the C++ implementation is offered via PyICU.

So I have already covered three of the five words above. Let's see what's missing.

One feature I set my eyes upon recently is the Transliteration module of ICU. While the name is a bit of a misnomer, as it supports a wide range of text transformations, its initial task was to transliterate one script into another. As this is one of cjklib's strong points, I took a deeper look at it.

The Transliteration module implements an interface to a wide range of transformations and also offers on-the-fly transformations for keyboard input. I was twittering (actually I was denting) about a similar Google project recently and I'd be surprised if Google doesn't build on the support offered by ICU here.

You can try out a nice demo. For example, use my backwards transformation called "Back". It's modeled after a script I wrote some time ago. Sadly it seems there's no transformation to reverse a string (which wouldn't make much sense for interactive mode, as a processed string cannot be touched again), so "Back" only flips each character where it stands. Just enter "Back" into the small text area above "Output 1" and enter, let's say, "no devil lived on" into the text area below "Input". When you click on "Transform" you should see the characters flipped: uo pǝʌıl lıʌǝp ou.
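To make the idea concrete, here is a minimal sketch of such a character-flipping transform in plain Python. It is not the "Back" rule set from the demo, only an illustration: the mapping table is hypothetical and covers just the letters needed for the example phrase. As in the demo, the string is not reversed; every character is simply replaced by an upside-down look-alike.

```python
# Hypothetical mapping, limited to the letters of the example phrase.
# A complete table would cover the whole alphabet.
FLIP = str.maketrans({
    "n": "u",
    "u": "n",
    "d": "p",
    "p": "d",
    "e": "ǝ",
    "v": "ʌ",
    "i": "ı",
})


def flip(text: str) -> str:
    """Replace each character by its upside-down look-alike.

    The order of the characters is left untouched; letters without an
    entry in the table (such as "o" and "l") are passed through as-is.
    """
    return text.translate(FLIP)


print(flip("no devil lived on"))
```

Because the input phrase is a palindrome, the flipped output reads correctly when turned upside down even though nothing was reversed.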

The beauty of the Transliterator interface is that combinations of transforms are made very easy. Transforms can be limited to certain scripts and the inverse form can be provided automatically. For example, NFD; [:Nonspacing Mark:] Remove; NFC decomposes each character into its base letter and diacritical marks, applies the Remove transform to the marks only and then recomposes whatever is left: Bēijǐng transforms to Beijing.
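The three steps of that compound transform can be sketched with nothing but Python's standard unicodedata module. This is not how ICU implements it, just an equivalent under the assumption that [:Nonspacing Mark:] corresponds to the Unicode general category "Mn"; the function name strip_marks is made up for this example.

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Mimic 'NFD; [:Nonspacing Mark:] Remove; NFC' with the stdlib."""
    # Step 1 (NFD): split characters into base letters and combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2 (Remove): drop every nonspacing mark (general category Mn).
    kept = "".join(
        ch for ch in decomposed if unicodedata.category(ch) != "Mn"
    )
    # Step 3 (NFC): recompose whatever is left.
    return unicodedata.normalize("NFC", kept)


print(strip_marks("Bēijǐng"))
```

Note that this removes all nonspacing marks, so the diaeresis of ü is lost as well, which may or may not be what you want for Pinyin.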

You can even find transliterations for Japanese, Korean and Chinese (Pinyin). The latter produces standard Pinyin from tones given by numbers (setting aside the wrong transformation of the infrequent forms ng and hng). This is clearly a service that intersects with cjklib. But as ICU provides a standard interface for transliterations, why not register conversions implemented by cjklib with ICU?
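For readers unfamiliar with this conversion, here is a rough sketch of turning numbered tones into tone marks in plain Python. It is neither ICU's rule set nor cjklib's implementation; the function names are invented for this example, and it assumes the common placement rule (a or e takes the mark, otherwise the o of "ou", otherwise the last vowel). Vowel-less forms such as ng and hng are deliberately left untouched rather than guessed at.

```python
import re
import unicodedata

# Combining marks for tones 1-4; tone 5 (neutral) carries no mark.
TONE_MARKS = {"1": "\u0304", "2": "\u0301", "3": "\u030c", "4": "\u0300"}


def mark_index(syllable: str) -> int:
    """Return the index of the vowel carrying the mark, or -1 if none."""
    for vowel in "ae":
        if vowel in syllable:
            return syllable.index(vowel)
    if "ou" in syllable:
        return syllable.index("o")
    positions = [i for i, ch in enumerate(syllable) if ch in "aeiouü"]
    return positions[-1] if positions else -1


def numbered_to_marked(syllable: str) -> str:
    """Convert one syllable such as 'bei3' to its tone-mark form."""
    base, tone = syllable[:-1], syllable[-1:]
    if tone == "5":
        return base
    if tone not in TONE_MARKS:
        return syllable
    i = mark_index(base.lower())
    if i < 0:
        # No vowel (e.g. 'ng2'): leave the syllable unchanged.
        return syllable
    marked = base[: i + 1] + TONE_MARKS[tone] + base[i + 1:]
    # Compose base letter and combining mark into a single character.
    return unicodedata.normalize("NFC", marked)


def to_tone_marks(text: str) -> str:
    """Convert every letters-plus-digit run in a text."""
    return re.sub(
        r"[A-Za-züÜ]+[1-5]",
        lambda match: numbered_to_marked(match.group(0)),
        text,
    )


print(to_tone_marks("Bei3jing1"))
```

The regular expression treats each run of letters followed by a tone digit as one syllable, so this only works when every syllable carries its own digit.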

No sooner said than done. PyICU just recently got support for the Transliterator interface, and as implementing the class in Python was not yet supported, I got out my C++ foo (didn't really know I had any) and got the C++ layer to call a Python implementation. My changes (branch PythonTransliterator on github) already went into the main version halfway, making it available in standard installations.

There's a short example that registers any cjklib conversion with ICU and makes it available through the standard methods. I don't know if there is a reasonable combination of existing transforms from ICU and transliterations currently offered by cjklib. That could be the icing on the cake.


I am looking at Eclectus, and if I find time I might translate it into French. My wife is Chinese and my son will learn it (he is three and a half). Chinese is too hard for me, but I used to speak German: I worked two years in Germany, in Landau and in Lindau (lol), but a long time ago (1967-1968), as chef konditor.
Where can I send you the fr.po file?
I only use email for correspondence, so if you can, send me one of your addresses.
Thanks for your good work :)

Eclectus translations

Hi Aishen, I'm very happy you are considering translating Eclectus into French. The preferred way would be to use the Launchpad site for translations (https://translations.launchpad.net/eclectus). But if you prefer to send in a .po file, that should be fine, too. Use the mailing list address for now: eclectus-dict@googlegroups.com. Hope your son can also benefit from Eclectus!