New release of cjklib with dictionary support
We would like to announce the new 0.3 release of cjklib, a Python-based programming library providing higher-level support of Chinese characters, also called Han characters.
After adding lots of new features and fixing some bugs we believe that what is now version 0.3 is fit for public consumption. To name a few additions and changes:
- easy and powerful dictionary access to EDICT and CEDICT style dictionaries,
- support for MS Windows including exe installer,
- new documentation under cjklib.org ,
- a new wiki for contributing language data,
- pronunciation data for Shanghainese, and finally
- broad coverage of stroke order data and corrections to component data.
See  for an extensive list of changes.
Several people contributed to the newest release:
- Uriah Eisenstein
- Hugo Lopez
- Hans-Jörg Happel
- Kellen Parker
- Allan Simon
- Christoph Burgmer
New dictionary support
An important resource for CJK languages is provided by bilingual dictionaries. With the newest release cjklib can now also offer easy access to EDICT, CEDICT and other compatible dictionaries. Exact searches in translations and wildcard searches in all columns provide a flexible and powerful way to query those sources. Mixing of characters and readings in a single request are also supported.
A new project called CharacterDB  was started to gather data on characters for cjklib. It is based on MediaWiki, known from Wikipedia, and will offer easy access to the data employed in cjklib. While still in beta stage we plan to manage all glyph information through the wiki by the next release.
Cjklib tries to fill a current void in supporting Chinese characters by focusing on visual appearance and reading-based data. While many lexical sources already exists, there is no layer which provides the data in an accessible and consistent way, burdening the developer with reinventing many basic functions. This project wants to channel different efforts in order to provide the developer with a consistent view, independent of the chosen language. This library directly targets developers and experienced users, its overall goal being to improve the coverage of applications for the end user.
Cjklib is open source, released under the GNU LGPL. You are free to use this software and invited to take part in its further development.
If you wish to know more about cjklib then its website  is a good starting point. To have a quick overview on some of the functions offered you might want to look at .
Packages are readily available. See  on how to install.
The cjklib developers
My thesis is finally printed and now also available as PDF. It is called Detecting Code-Switch Events Based on Textual Features and focuses on the detection of Code-Switching between English and Mandarin Chinese. I already found a typo, but I'd keep it with the Latin proverb, errare humanum est :)
I've already done a table on Pinyin for Mandarin,and Jyutping and Cantonese Yale for Cantonese. Here's now a table for Shanghainese. As there seems to be no accepted or standard romanisation for the Shanghai Wu dialect, IPA seems to be system at hand.
This table is based on the Wu Phonetics Corpus by Kellen Parker with contributions from Allan Simon.
There are surely some errors yet to be fixed. I'm not too happy with the ordering of initials and finals, but I can't work out a good order based on articulation characteristics.
Update:I removed some "duplicate" forms and regrouped initials and finals.
This blog entry provides a nice mix of transliterations, C++, cjklib, ICU and language bindings in Python.
ICU (International Components for Unicode) is an Unicode support Open Source library from IBM. I would call it the Unicode implementation. It is used by IBM, Google, Apple and others. SQLite which only offers narrow script support points to ICU for full Unicode support, the Python team decided to only implement basic Unicode support and leave locale based handling to ICU and PHP's intl library builds on top of ICU. ICU offers many functions aside from what one would expect from a library for an "encoding". Unicode though is far from only being an encoding, offering multilingual solutions together with the encoding of the world's scripts.
ICU is written in C/C++ and Java. A Python binding of the C++ implementation is offered via PyICU.
So, I already covered three of the five words above, let's see what's missing.
One feature I set my eyes upon recently is the Transliterations module of ICU. While the name is a bit of a misnomer, as it supports a wide range of text transformations, its initial task was to translate one script into another. As this is one of cjklib's strong points, I took a deeper look at it.
The Transliteration module implements an interface to a wide range of transformations and also offers on-the-fly transformations for keyboard input. I was twittering (actually i was denting) about a similar Google project recently and I'd be surprised if Google doesn't build on the support offered by ICU here.
You can try out a nice demo. For example use my backwards transformation called "Back". It's modeled after a script I wrote some time ago. Sadly it seems there's no transformation to reverse a string (which wouldn't make much sense for interactive mode as a processed string cannot be touched again). Just enter "Back" into the small text area above "Output 1" and enter lets say "no devil lived on" into the text area below "Input". When you click on "Transform" you should see the characters flipped: uo pǝʌıl lıʌǝp ou.
The beauty of the Transliterator interface is that combinations of transforms are made very easy. Transforms can be limited to certain scripts and the invers form can be automatically provided. For example
NFD; [:Nonspacing Mark:] Remove; NFC "extracts" diacritical marks, applies the
Remove transform to them and builds the remaining marks back onto the characters:
Bēijǐng transforms to
You can even find transliterations for Japanese, Korean and Chinese (Pinyin). The latter produces standard Pinyin from tones given by numbers (set aside wrong transformation for infrequent forms ng and hng). This is clearly a service that intersects with cjklib. But as ICU provides a standard interface for transliterations, why not register conversions implemented by cjklib to ICU?
No sooner said than done. PyICU just recently got support for the Transliterations interface, and as instantiation of the class in Python was not yet supported I got out my C++ foo (didn't really know I had any) and got the C++ layer to call a Python implementation. My changes (branch PythonTransliterator on github) already went into the main version halfway, making it available in standard installations.
There's a short example that registers any cjklib conversion with ICU and makes it available through the standard methods. I don't know if there is a reasonable combination of existing transforms from ICU and transliterations currently offered by cjklib. That could be the icing on the cake.