Moving from Tomoe to Tegaki

This morning I taught Eclectus how to use Tegaki, the successor of Tomoe for handwriting recognition of Chinese characters (including Kanji). Tegaki moved into Debian in the past few days, while Tomoe is only available in openSUSE and not in Debian or Ubuntu, so I changed Eclectus' handwriting widget to support either of them, now preferring Tegaki. The diff is relatively small (1); hopefully we can drop Tomoe entirely in the future, making the code cleaner.
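
Supporting either engine while preferring one boils down to probing for the installed module at start-up. Here is a minimal sketch of that idea, not the actual Eclectus code: the helper name pick_backend is made up for illustration, and the module names "tegaki" and "tomoe" are assumptions about how the Python bindings are installed.

```python
import importlib.util


def pick_backend(candidates=("tegaki", "tomoe")):
    """Return the first importable module name from candidates, else None.

    The default names are assumptions about the bindings' module names;
    the order of the tuple encodes the preference (Tegaki first).
    """
    for name in candidates:
        try:
            if importlib.util.find_spec(name) is not None:
                return name
        except (ImportError, ValueError):
            # A broken or half-installed package counts as "not available".
            continue
    return None


if __name__ == "__main__":
    backend = pick_backend()
    if backend is None:
        print("No handwriting engine found, disabling the handwriting box")
    else:
        print("Using handwriting engine:", backend)
```

Keeping the preference in a single ordered tuple means that dropping Tomoe later would only require removing one entry.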

For Eclectus these changes bring better recognition results and, hopefully, more features from an actively developed project. You can read more about Tegaki on the developer's blog at http://www.mblondel.org/journal/category/handwriting-recognition/.

My original widget will be maintained in the Eclectus project under (2). We'll see what else will change. Support for traditional Chinese is still missing, anyone?

Announcing Eclectus, a Han character dictionary

May 27th, 2009

I would like to announce Eclectus, a Han character dictionary especially
suited for learners [1] (see screencast [2]).

Motivation

The lack of a good and user-friendly learner's dictionary for Chinese
motivated the work in the past months that went into Eclectus. With more than
40,000 mostly complexly shaped Chinese characters the Chinese script is rather
difficult to learn. This dictionary tries to acknowledge that and offers a wide
set of features including quick ways to find characters you don't know how to
input.

Implementation

Eclectus builds heavily on cjklib, a Python-based library for handling Chinese
characters that I announced only last week [3]. As the library matures, Eclectus
will grow in functions and data along with it.

While for now mostly Chinese features are being implemented, the goal
is to provide the same level of features for Japanese and for other languages
relying on Chinese characters.

The dictionary currently relies on the KDE bindings, which for now limits its use
to Linux. One future goal is to provide the same functionality on a pure
Qt foundation, and then make Eclectus available on Windows and Mac OS X.

Try it out

You are welcome to give Eclectus a try and join as a user or even a developer
to make this dictionary the best out there. Check out the code from svn or use
the snapshot packages available for download. Please note the list of current
shortcomings [4], which you may take as a starting point to get involved yourself :)

Now the mandatory warning for early adopters: Eclectus is still in an early
development stage, so expect errors and sparse data.

Tell me what you think

[1] http://code.google.com/p/eclectus/
[2] http://www.youtube.com/watch?v=pwDeUSkQugU
[3] http://www.stud.uni-karlsruhe.de/~uyhc/en/content/announcing-cjklib
[4] http://code.google.com/p/eclectus/issues/list


Announcing cjklib

Announcing cjklib, a library for higher-level support of Chinese characters.

May 19th, 2009

(Hong Kong) We would like to announce the availability of cjklib, a new Python-based programming library providing higher-level support of Chinese characters, also called Han characters.

Chinese characters, in comparison to other scripts, have several distinctive features: more than 40,000 characters exist, they have a complex visual appearance, they to some extent carry meaning in their structure (ideographic characters), and they carry almost no information about pronunciation. Chinese characters are used in writing Chinese, Japanese, Korean and, formerly, Vietnamese, abbreviated as CJK or CJKV.

Cjklib tries to fill a current void in supporting Chinese characters by focusing on visual appearance and reading-based data. While many lexical sources already exist, there is no layer which provides the data in an accessible and consistent way, which burdens the developer with reinventing many basic functions. This project wants to channel different efforts in order to provide the developer with a consistent view independent of the chosen language. The library directly targets developers and experienced users, its overall goal being to improve the coverage of applications for the end user.

Some features of cjklib:

  • Glyph-based functions
    • Radical index, residual stroke count
    • 'Breaking down' a character into a tree of its components
    • Stroke order
    • Locale based glyph layout
  • Reading-based functions
    • Character to reading mapping
    • Conversion between readings (Mandarin Chinese: Pinyin, Gwoyeu Romatzyh, Wade-Giles, IPA; Cantonese: Jyutping, Cantonese Yale; Japanese: Kana; Korean: Hangul)
    • Translation between realizations of a reading, e.g. numbers to diacritics
  • Database back-end with powerful build system providing access for Unihan, Kanjidict, EDICT, CEDICT, HanDeDict
  • Command line tool to access the library's functions
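
To give an idea of what "translation between realizations of a reading" means, here is a small stand-alone sketch that turns numbered Pinyin syllables into their diacritic form. It deliberately does not use cjklib; the function name numbered_to_diacritic is invented for this example, and it only applies the common placement rules (a or e takes the mark, otherwise the o of "ou", otherwise the last vowel).

```python
import unicodedata

# Combining marks for tones 1-4: macron, acute, caron, grave.
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}
VOWELS = "aeiou\u00fc"


def numbered_to_diacritic(syllable):
    """Convert one numbered Pinyin syllable, e.g. 'hao3', to diacritic form."""
    # Accept the common ASCII spellings of u-umlaut ("u:" and "v").
    syllable = syllable.replace("u:", "\u00fc").replace("v", "\u00fc")
    if not syllable or not syllable[-1].isdigit():
        return syllable
    body, tone = syllable[:-1], int(syllable[-1])
    if tone not in TONE_MARKS:
        return body  # neutral tone (written as 0 or 5) carries no mark
    lower = body.lower()
    if "a" in lower:
        pos = lower.index("a")
    elif "e" in lower:
        pos = lower.index("e")
    elif "ou" in lower:
        pos = lower.index("o")
    else:
        positions = [i for i, ch in enumerate(lower) if ch in VOWELS]
        if not positions:
            return body
        pos = positions[-1]
    marked = body[:pos + 1] + TONE_MARKS[tone] + body[pos + 1:]
    # Compose base letter and combining mark into a single code point.
    return unicodedata.normalize("NFC", marked)


if __name__ == "__main__":
    print(" ".join(numbered_to_diacritic(s) for s in ["Bei3", "jing1", "lv4", "ma5"]))
```

The library covers far more than this (other romanisations, syllable segmentation, error handling); the sketch only illustrates the kind of mapping involved.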

The project was released recently and is still under heavy development. Although API changes might occur in the near future, the library is usable and already being employed in other software. Cjklib is released under the LGPL.

If you wish to know more about cjklib then its website [1] is a good starting point. Much documentation already exists and more is being added. To have a quick overview of some functions offered you might want to look at [2].
Download here [3].

The cjklib developers
cjklib-devel@googlegroups.com

[1] http://code.google.com/p/cjklib
[2] http://code.google.com/p/cjklib/wiki/Screenshots
[3] http://code.google.com/p/cjklib/downloads/list

Simple image segmenter in Python

口-bw.1.png, 口-bw.2.png, 口-bw.3.png
So I was looking for a simple segmenter to break down images containing several tiles into single pieces. I decided to write one myself, so here it is.

segmenttiles.py comes with a help page (python segmenttiles.py --help) which briefly explains the parameters. Most importantly, segmentation can be done either by using a window or by looking for whitespace in the image. Giving the width/height ratio, or more specifically the tilesize, makes guessing more accurate by discarding solutions that don't fit the given sizes. Furthermore, with equaltiles selected the segmenter will try to find a solution in which all tiles have exactly the same size, which can improve the segmentation results even further.
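
The whitespace strategy can be illustrated with a few lines of Python. This is only a sketch of the general idea (look for runs of completely white columns and cut in their middle), not the code from segmenttiles.py; the function names and the representation of the image as a list of rows of grey values are assumptions made for the example.

```python
def blank_runs(image, white=255):
    """Return (start, end) index pairs of consecutive all-white columns.

    image is a list of rows, each row a list of grey values (0-255).
    """
    width = len(image[0])
    blank = [all(row[x] >= white for row in image) for x in range(width)]
    runs, start = [], None
    for x, is_blank in enumerate(blank):
        if is_blank and start is None:
            start = x
        elif not is_blank and start is not None:
            runs.append((start, x - 1))
            start = None
    if start is not None:
        runs.append((start, width - 1))
    return runs


def cut_positions(image, white=255):
    """Propose one vertical cut in the middle of every blank column run."""
    return [(start + end) // 2 for start, end in blank_runs(image, white)]


if __name__ == "__main__":
    W, B = 255, 0
    image = [
        [W, B, B, W, W, W, B, B, W],
        [W, W, B, W, W, W, B, W, W],
    ]
    print(cut_positions(image))
```

Horizontal cuts follow from the same function applied to the transposed image, e.g. cut_positions([list(col) for col in zip(*image)]). Options like tilesize or equaltiles would then filter or adjust these candidate positions.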

So, as an example, you can download stroke order images from Wikimedia Commons (Category:Bw.png_stroke_order_images is well suited for this task) and start the segmentation on more than 1000 files:

$ python segmenttiles.py --segmentation=whitespace --equaltiles --tilesize=120x110 \
    --test bw.png/?-bw.png > bw.png.segmentations

These images have perfect white tile borders and most of them can be separated into equally sized tiles. The average tile size is 120 (width) x 110 (height). By specifying --test no actual work on the files is done, but the proposed segmentation positions are printed to stdout.

bw.png/干-bw.png        [-5, 116, 237, 358, 479]        [0, 115]
bw.png/平-bw.png        [-6, 115, 236, 357, 478, 599, 720]      [0, 126]
bw.png/对-bw.png        [-2, 119, 240, 361, 482, 603, 724, 845] [0, 117]
bw.png/容-bw.png        [-5, 116, 237, 358, 479, 600, 721, 842, 963, 1084] [-2, 120, 242]
bw.png/她-bw.png        [-1, 120, 241, 362, 483, 604, 725, 846] [0, 110]
bw.png/凹-bw.png        [14, 110, 206, 302, 398, 494, 590]      [0, 86]
[...]
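
Each line of the listing above holds a file name followed by the proposed x and y cut positions. A small sketch of how such a line could be turned into tile rectangles follows; the parsing is based only on the lines shown here, the function names are made up, and clamping negative positions to 0 is my assumption, not necessarily what segmenttiles.py does.

```python
import ast
import re

# file name, then two bracketed integer lists separated by whitespace
LINE_RE = re.compile(r"^(.*?)\s+(\[[^\]]*\])\s+(\[[^\]]*\])\s*$")


def parse_line(line):
    """Return (filename, xs, ys) or None if the line does not fit the format."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    name, xs, ys = match.groups()
    try:
        return name, ast.literal_eval(xs), ast.literal_eval(ys)
    except (ValueError, SyntaxError):
        return None


def tile_boxes(xs, ys):
    """List (left, top, right, bottom) boxes, row by row.

    Positions left of or above the image (negative values) are clamped
    to 0, which is an assumption of this sketch.
    """
    boxes = []
    for top, bottom in zip(ys, ys[1:]):
        for left, right in zip(xs, xs[1:]):
            boxes.append((max(left, 0), max(top, 0), right, bottom))
    return boxes


if __name__ == "__main__":
    line = "bw.png/干-bw.png        [-5, 116, 237, 358, 479]        [0, 115]"
    name, xs, ys = parse_line(line)
    for box in tile_boxes(xs, ys):
        print(name, box)
```

With n x positions and m y positions this gives (n-1) times (m-1) tiles, so the ten x values and three y values listed for 容 describe a grid of 9 by 2 tiles.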

Finally, using this list, the actual splitting can be started:

$ python segmenttiles.py --readfrom=bw.png.segmentations -v bw.png/?-bw.png
Read 995 entries, ignored 5 lines
Processing bw.png/对-bw.png...
Grid 7x1, (-2,119,240,361,482,603,724,845), (0,117)
Writing tiles... 0 1 2 3 4 5 6 finished
Processing bw.png/容-bw.png...
Grid 9x2, (-5,116,237,358,479,600,721,842,963,1084), (-2,120,242)
Writing tiles... 0 1 2 3 4 5 6 7 8 9 10 empty empty empty empty empty empty empty finished
[...]

Overall, 5 files cannot be segmented. A few more do not result in equally sized tiles, and some of those might be segmented wrongly. Playing around with the parameters helps.

Attachment: segmenttiles.py.txt (32.54 KB)

Tomoe handwriting widget for PyQt

Having built Tomoe for Debian, I was ready to develop a nice widget implementing the basic features of Tomoe: the TomoeHandwritingWidget.

The widget allows for drawing with the mouse and shows a typical character grid to assist the input. Included is a demo application that is fully usable to do handwriting recognition for Chinese characters. The widget is released under the LGPL (the license of Tomoe and the future license of Qt); remember, though, that PyQt is currently GPL only.
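
At its core, drawing with the mouse means collecting strokes: every press starts a new list of points, every move extends it, and every release closes it. The following toolkit-independent sketch shows that bookkeeping; the class and method names are invented for this example and are not taken from tomoewidget.py, where Qt's mouse event handlers would play the role of press, move and release.

```python
class StrokeRecorder:
    """Collect pen strokes as lists of (x, y) points."""

    def __init__(self):
        self.strokes = []      # finished strokes, in drawing order
        self._current = None   # stroke being drawn, or None

    def press(self, x, y):
        """Button pressed: start a new stroke."""
        self._current = [(x, y)]

    def move(self, x, y):
        """Pointer moved: extend the stroke only while the button is down."""
        if self._current is not None:
            self._current.append((x, y))

    def release(self, x, y):
        """Button released: finish the stroke and store it."""
        if self._current is not None:
            self._current.append((x, y))
            self.strokes.append(self._current)
            self._current = None

    def clear(self):
        """Forget everything drawn so far."""
        self.strokes = []
        self._current = None


if __name__ == "__main__":
    recorder = StrokeRecorder()
    recorder.press(10, 10)
    recorder.move(50, 12)
    recorder.release(90, 10)
    print(recorder.strokes)
```

The finished list of strokes is what would be handed to the recognition engine once the user stops drawing.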

Attachment: tomoewidget.py.txt (8.81 KB)