A technical post on transliterations

This blog entry provides a nice mix of transliterations, C++, cjklib, ICU and language bindings in Python.

ICU (International Components for Unicode) is an open source Unicode support library from IBM. I would call it the Unicode implementation. It is used by IBM, Google, Apple and others. SQLite, which only offers narrow script support, points to ICU for full Unicode support; the Python team decided to implement only basic Unicode support and leave locale-based handling to ICU; and PHP's intl library builds on top of ICU. ICU offers many functions beyond what one would expect from a library for an "encoding". Unicode, though, is far from being only an encoding: it offers multilingual solutions together with the encoding of the world's scripts.

ICU is written in C/C++ and Java. A Python binding of the C++ implementation is offered via PyICU.

So, I already covered three of the five words above, let's see what's missing.

One feature I set my eyes upon recently is the Transliterations module of ICU. While the name is a bit of a misnomer, as it supports a wide range of text transformations, its initial task was to translate one script into another. As this is one of cjklib's strong points, I took a deeper look at it.

The Transliteration module implements an interface to a wide range of transformations and also offers on-the-fly transformations for keyboard input. I was twittering (actually I was denting) about a similar Google project recently, and I'd be surprised if Google didn't build on the support offered by ICU here.

You can try out a nice demo. For example, use my backwards transformation called "Back". It's modeled after a script I wrote some time ago. Sadly it seems there's no transformation to reverse a string (which wouldn't make much sense for interactive mode, as a processed string cannot be touched again). Just enter "Back" into the small text area above "Output 1" and enter, let's say, "no devil lived on" into the text area below "Input". When you click on "Transform" you should see the characters flipped: uo pǝʌıl lıʌǝp ou.
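To show what such a per-character transform does, here is a minimal sketch in plain Python. It is not the ICU rule set behind "Back"; the names FLIP_TABLE and flip are made up for this illustration, and the table only covers the letters that occur in the example above (plus the obvious reverse pairs).

```python
# Partial lookup table: each letter maps to an upside-down lookalike.
# Only the letters from the "no devil lived on" example are covered.
FLIP_TABLE = str.maketrans({
    "n": "u", "u": "n",
    "d": "p", "p": "d",
    "e": "ǝ",
    "v": "ʌ",
    "i": "ı",
})


def flip(text):
    """Replace each known letter by its flipped lookalike, keeping the order."""
    return text.translate(FLIP_TABLE)


print(flip("no devil lived on"))
```

Note that nothing is reversed here. The result only reads correctly when turned upside down because the example phrase is a palindrome.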

The beauty of the Transliterator interface is that combinations of transforms are made very easy. Transforms can be limited to certain scripts, and the inverse form can be provided automatically. For example, NFD; [:Nonspacing Mark:] Remove; NFC "extracts" diacritical marks, applies the Remove transform to them and builds the remaining marks back onto the characters: Bēijǐng transforms to Beijing.
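The same three steps can be sketched with nothing but Python's standard unicodedata module. This swaps in the standard library for ICU, so it only mirrors the idea of the compound transform; strip_marks is a hypothetical helper name, and "Nonspacing Mark" is taken to mean the Unicode general category Mn.

```python
import unicodedata


def strip_marks(text):
    """Decompose, drop nonspacing marks, recompose what is left."""
    # Step 1 (NFD): split characters like "ē" into "e" + combining macron.
    decomposed = unicodedata.normalize("NFD", text)
    # Step 2 (Remove): drop everything in general category Mn.
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Step 3 (NFC): recompose whatever combining sequences remain.
    return unicodedata.normalize("NFC", kept)


print(strip_marks("Bēijǐng"))
```

With ICU the filter could be narrowed to a single script, which this sketch does not attempt.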

You can even find transliterations for Japanese, Korean and Chinese (Pinyin). The latter produces standard Pinyin from tones given by numbers (set aside wrong transformation for infrequent forms ng and hng). This is clearly a service that intersects with cjklib. But as ICU provides a standard interface for transliterations, why not register conversions implemented by cjklib to ICU?
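For readers who haven't seen the numbers-to-diacritics conversion, here is a rough sketch in plain Python. It is neither ICU's nor cjklib's implementation; the function names are hypothetical, the placement rule is the common textbook one (a or e first, then the o in ou, otherwise the last vowel), and putting the mark on n or m for vowelless forms like ng is my assumption for this sketch.

```python
import re
import unicodedata

# Combining marks for tones 1-4; tone 5 (neutral) gets no mark.
TONE_MARKS = {"1": "\u0304", "2": "\u0301", "3": "\u030c", "4": "\u0300", "5": ""}


def mark_syllable(letters, tone):
    """Place the tone mark for one syllable given without its tone digit."""
    letters = letters.replace("v", "ü").replace("V", "Ü")
    lower = letters.lower()
    if "a" in lower:
        idx = lower.index("a")
    elif "e" in lower:
        idx = lower.index("e")
    elif "ou" in lower:
        idx = lower.index("ou")
    else:
        vowels = [i for i, ch in enumerate(lower) if ch in "aeiouü"]
        if vowels:
            idx = vowels[-1]
        else:
            # Assumption: vowelless forms (ng, hng, m) carry the mark on n or m.
            nasals = [i for i, ch in enumerate(lower) if ch in "nm"]
            if not nasals:
                return letters
            idx = nasals[0]
    marked = letters[: idx + 1] + TONE_MARKS[tone] + letters[idx + 1:]
    # NFC folds base letter + combining mark into one precomposed character.
    return unicodedata.normalize("NFC", marked)


def numbered_to_marked(text):
    """Convert every 'letters + digit 1-5' run, e.g. Bei3jing1."""
    return re.sub(r"([A-Za-züÜ]+)([1-5])",
                  lambda m: mark_syllable(m.group(1), m.group(2)), text)


print(numbered_to_marked("Bei3jing1"))
```

The sketch relies on every syllable carrying a tone digit; proper syllable segmentation and apostrophe handling are exactly the kind of detail a library like cjklib takes care of.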

No sooner said than done. PyICU just recently got support for the Transliterations interface, and as instantiation of the class in Python was not yet supported I got out my C++ foo (didn't really know I had any) and got the C++ layer to call a Python implementation. My changes (branch PythonTransliterator on github) already went into the main version halfway, making it available in standard installations.

There's a short example that registers any cjklib conversion with ICU and makes it available through the standard methods. I don't know if there is a reasonable combination of existing transforms from ICU and transliterations currently offered by cjklib. That could be the icing on the cake.

Eclectus screenshot (made with screenie)

New screenshot for the google code page for Eclectus. Made using screenie.

Collaborative Work and Openness

I hope you will forgive me for starting this blog's new year with a rant. But this topic has bothered me for some time now, so I hope you will bear with me.

Wikis are well known today, even though most people only know them from Wikipedia, the biggest wiki there is. Most will know that Wikipedia is community driven and a "collaboratively edited encyclopedia to which you can contribute". Most will agree that this concept is, or at least was, radical at the time. But most will also agree that it (somehow) works.

Wikipedia is not the only online collaborative project; in fact, many other projects work that way. To those I would add the Japanese and Chinese dictionaries EDICT, CEDICT and HanDeDict, to name a few. There are, though, various degrees to which the collaborative concept is employed or enforced. And one particular implementation has me up in arms: the Chinese-German dictionary HanDeDict.

The HanDeDict team actually did a very good job in the past, creating a dictionary under a Creative Commons license out of nothing. The license was well chosen; too many other projects try to come up with their own wording, which leads to nothing. A bootstrapping process made sure you would find good words right from the beginning. Their concept of having the online folks help out was pretty future-oriented, considering how many people in some parts of academia view this thing "Internet".

I am not sure where the project stands today, though. The early discussion board was moved to Google Groups because of spam. The group is moderated and posts seem to go through only seldom; it is practically dead. Large deletions of entries that were copyright infringements added by a single user have left many basic entries missing. Communication to the outside is basically nonexistent, and criticism, from my point of view, is only marginally considered.

In particular I remember adding some special entries whose readings have very peculiar forms. I am no longer sure whether it was ê/ei or n/ng, but I remember adding a bunch of entries for one of these forms, if not all. Today a short SQL query comes up with these sad remains:

sqlite> select * from HanDeDict where Reading like 'ê%' limit 30 offset 0;
sqlite> select * from HanDeDict where Reading like 'ei_' limit 30 offset 0;
誒|诶|ei1|/He! Hey! (u.E.) (Int)/|
誒|诶|ei1|/Hey! Hi! He! (u.E.) (Int)/|
sqlite> select * from HanDeDict where Reading like 'n_' limit 30 offset 0;
sqlite> select * from HanDeDict where Reading like 'ng_' limit 30 offset 0;
嗯|嗯|ng2|/Interjektion: erstaunt fragend - Hä? (u.E.) (Int)/|1
哼|哼|ng5|/(drückt Unzufriedenheit oder Zweifel aus) (u.E.) (Int)/|2
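In case the LIKE patterns look cryptic: "_" matches exactly one character (here the tone digit) and "%" matches any run of characters. A small sketch with Python's sqlite3 module and an in-memory table shows the effect. Only the table name and the Reading column come from the queries above; the other column names are my guesses, and the last row is made-up filler rather than real HanDeDict data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema; only "HanDeDict" and "Reading" appear in the real queries.
conn.execute(
    "CREATE TABLE HanDeDict "
    "(Traditional TEXT, Simplified TEXT, Reading TEXT, Translation TEXT)"
)
conn.executemany(
    "INSERT INTO HanDeDict VALUES (?, ?, ?, ?)",
    [
        ("誒", "诶", "ei1", "/He! Hey! (u.E.) (Int)/"),
        ("嗯", "嗯", "ng2", "/Interjektion: erstaunt fragend - Hä? (u.E.) (Int)/"),
        ("哼", "哼", "ng5", "/(drückt Unzufriedenheit oder Zweifel aus) (u.E.) (Int)/"),
        # Illustrative filler row, not taken from the dictionary.
        ("你好", "你好", "ni3 hao3", "/Hallo/"),
    ],
)


def readings_like(pattern):
    """Return the readings that match a LIKE pattern, sorted for stable output."""
    rows = conn.execute(
        "SELECT Reading FROM HanDeDict WHERE Reading LIKE ? "
        "ORDER BY Reading LIMIT 30 OFFSET 0",
        (pattern,),
    )
    return [row[0] for row in rows]


for pattern in ("ê%", "ei_", "n_", "ng_"):
    print(pattern, readings_like(pattern))
```

So 'n_' only finds a bare n plus one more character, which is why it comes back empty even though readings starting with n exist.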

For me it is sad to see my contribution lost, with no easy way to get it back. I took the time to consult two dictionaries to fill in the missing information. And I am particularly sad because I lack the tools to research this removal of content: my past proposal to use a wiki-like editing tool was turned down with the words "the Wikipedia principle does not work for a dictionary". Well, Wikipedia would at least allow me to find out why my contribution was lost.

The fact that contribution to HanDeDict is not open makes it impossible for outsiders to judge and control content, coordinate their own work, or develop a responsibility for the project's content.

I already considered forking this project by moving stuff to a MediaWiki installation, but I have to admit that this task moved pretty far behind other more urgent things on my list. I can currently only hope that either the HanDeDict people come back to revive the project or somebody else is willing to fork the project. If you decide to, I'll be happy to help.

Graduation

So it is official. I graduated and can now call myself a "Dipl. Inform.", the German equivalent of a Master in Computer Science (and the minimum degree admitting you to a Ph.D. in CS). While my university (that is, Karlsruhe) still keeps me busy with slight corrections to my final thesis, I can at least slowly set my mind on new things. I will be on the lookout for a job, but who knows, I might even consider continuing with a Ph.D. I just hope people won't be confused by my uni's recent name change to KIT, the Karlsruhe Institute of Technology (not the Kaustubh Institute of Training).

Oh, and once my Thesis is final, I'll post an update on my publications page.

Update: Corrected my "title", three additional characters got lost on the way.

Beta out for Eclectus

So a release is only out once the developer blogs about it, they say.

This afternoon I tagged and packaged what is now 0.2beta of Eclectus. I also created a page on the KDE application site, as I believe it is now ready for wider consumption. Packages for Linux distributions should now make it easy for people to install it.

While I have many ideas for Eclectus, even some I haven't seen in any electronic dictionary so far, I am currently left with only little time. I'd also like to cater to Windows and Mac users in the future, but porting Eclectus to a Qt-only version will surely need a week of work, something I currently cannot afford.

Goals in the near future will be stabilizing the application and creating a nice dictionary abstraction layer to offer better integration for the different dictionaries around.

So far I am maybe my own happiest user. Eclectus helps me in my daily learning routine, and I never look back to the tools I used (or tried to use) before. I hope others have similar experiences.
