Collaborative Work and Openness

I hope you will forgive me for starting the new year on this blog with a rant. But this topic has actually bothered me for some time now, so please bear with me.

Wikis are well known today, even though most people know the concept only from Wikipedia, the biggest wiki there is. Most will know that Wikipedia is community driven, a "collaboratively edited encyclopedia to which you can contribute". Most will agree that this concept is, or at least was at the time, radical. But most will also agree that it (somehow) works.

Wikipedia is not the only collaborative online project; in fact many other projects work that way, among them the Japanese and Chinese dictionaries EDICT, CEDICT and HanDeDict, to name a few. There are, though, various degrees to which the collaborative concept is employed or enforced, and one particular implementation has me up in arms: the Chinese-German dictionary HanDeDict.

The HanDeDict team actually did a very good job in the past, creating a dictionary under a Creative Commons license out of nothing. The license was well chosen; too many other projects try to come up with their own wording, which leads nowhere. A bootstrapping process made sure you would actually find good entries from the beginning. Their concept of having people online help out was pretty forward-looking, considering how many people in some parts of academia view this thing called the "Internet".

I am not sure where the project stands today, though. The early discussion board was moved to Google Groups because of spam. The group is moderated and posts only seldom seem to go through; it is practically dead. Large deletions of entries, copyright infringements added by a single user, leave many basic entries missing. Communication to the outside is basically nonexistent, and criticism, from my point of view, is only marginally considered.

In particular I remember adding some special entries whose readings have very peculiar forms. I am not sure today whether it was ê/ei or n/ng, but I remember adding a bunch of entries for one of these forms, if not all. Today a short SQL query comes up with these sad remains:

sqlite> select * from HanDeDict where Reading like 'ê%' limit 30 offset 0;
sqlite> select * from HanDeDict where Reading like 'ei_' limit 30 offset 0;
誒|诶|ei1|/He! Hey! (u.E.) (Int)/|
誒|诶|ei1|/Hey! Hi! He! (u.E.) (Int)/|
sqlite> select * from HanDeDict where Reading like 'n_' limit 30 offset 0;
sqlite> select * from HanDeDict where Reading like 'ng_' limit 30 offset 0;
嗯|嗯|ng2|/Interjektion: erstaunt fragend - Hä? (u.E.) (Int)/|1
哼|哼|ng5|/(drückt Unzufriedenheit oder Zweifel aus) (u.E.) (Int)/|2
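Checks like the above can also be scripted. Here is a minimal sketch using Python's sqlite3 module; the table layout (columns Traditional, Simplified, Reading, Translation) and the sample rows are assumptions modelled on the CEDICT-style lines in the dump above, not the actual HanDeDict schema:

```python
import sqlite3

# Hypothetical stand-in for the HanDeDict dump: the column names are
# an assumption based on the pipe-separated entry format shown above.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE HanDeDict (Traditional TEXT, Simplified TEXT, "
             "Reading TEXT, Translation TEXT)")
conn.executemany(
    "INSERT INTO HanDeDict VALUES (?, ?, ?, ?)",
    [("誒", "诶", "ei1", "/He! Hey! (u.E.) (Int)/"),
     ("嗯", "嗯", "ng2", "/Interjektion: erstaunt fragend - Hä? (u.E.) (Int)/"),
     ("好", "好", "hao3", "/gut (u.E.) (Adj)/")])

# Count surviving entries for each peculiar reading form; in SQL LIKE,
# '_' matches exactly one character, so 'ei_' finds readings like 'ei1'.
counts = {}
for prefix in ("ê", "ei", "n", "ng"):
    rows = conn.execute(
        "SELECT * FROM HanDeDict WHERE Reading LIKE ?",
        (prefix + "_",)).fetchall()
    counts[prefix] = len(rows)
    print(prefix, counts[prefix])
```

With a full dump loaded, an empty count for a whole reading class is a quick indicator that contributed entries were silently removed.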

For me it is sad to see my contribution lost, with no easy way to get it back. I took the time to consult two dictionaries to fill in the missing information. And I am particularly sad that I lack the tools to investigate this removal of content: my proposal in the past to use a wiki-like editing tool was turned down with the words "the Wikipedia principle does not work for a dictionary". Wikipedia, though, would at least allow me to find out why my contribution was lost.

The fact that contribution to HanDeDict is not open makes it impossible for outsiders to judge and control content, coordinate their own work, or develop a sense of responsibility for the project's content.

I have already considered forking this project by moving the data to a MediaWiki installation, but I have to admit that this task has moved pretty far behind other, more urgent things on my list. I can currently only hope that either the HanDeDict people come back to revive the project or somebody else is willing to fork it. If you decide to, I'll be happy to help.


So it is official. I graduated and can now call myself a "Dipl.-Inform.", the German equivalent of a Master's in Computer Science (and the minimum grade admitting you to a Ph.D. in CS). While my university (that is, Karlsruhe) still keeps me busy with slight corrections to my final thesis, I can at least slowly set my mind on new things. I will be on the lookout for a job, but who knows, I might even consider continuing with a Ph.D. I just hope people aren't confused by my university's recent name change to KIT, the Karlsruhe Institute of Technology.

Oh, and once my Thesis is final, I'll post an update on my publications page.

Update: Corrected my "title", three additional characters got lost on the way.

Beta out for Eclectus

So a release is only out once the developer blogs about it, they say.

This afternoon I tagged and packaged what is now 0.2beta of Eclectus. I also created a page on the KDE application site, as I believe it is now ready for wider consumption. Packages for Linux distributions should make it easy for people to install it.

While I have many ideas for Eclectus, some of which I haven't seen in any electronic dictionary so far, I am currently left with only little time. I'd also like to cater to Windows and Mac users in the future, but porting Eclectus to a Qt-only version will surely need a week of work, something I currently cannot afford.

Goals in the near future will be stabilizing the application and creating a nice dictionary abstraction layer to offer better integration for the different dictionaries around.

So far I am perhaps my own happiest user. Eclectus helps me in my daily learning routine, and I never look back to the tools I used (or tried to use) before. I hope others have similar experiences.

Improving recognition of handwriting with component samples

Yet another post on handwriting recognition of Japanese and Chinese characters with Tegaki. This time I want to improve recognition rates of existing models.

Some weeks ago I proposed using component data for bootstrapping a completely new model: characters are broken down into their components, for which handwriting models already exist, and the character's model is then built from those of the components.

This time I want to improve an existing model using the same approach. Many characters occur multiple times in the existing models, mostly as components of other characters. These occurrences provide additional instances of handwriting that can be used to increase the number of training samples. More samples add greater variation, in the hope of improving recognition accuracy.
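The harvesting idea can be sketched in a few lines of Python. The decomposition table and the "stroke" data below are made-up stand-ins for what cjklib and the Tegaki character collections actually provide; a real implementation would also have to map stroke ranges and rescale coordinates:

```python
# Which components make up which character (flat, in writing order).
# Hypothetical data standing in for cjklib's decomposition tables.
decompositions = {
    "好": ["女", "子"],
    "媽": ["女", "馬"],
}

# Handwriting samples per character: a list of instances, each a list
# of per-component stroke groups (labels instead of point sequences).
samples = {
    "好": [["女-strokes-a", "子-strokes-a"], ["女-strokes-b", "子-strokes-b"]],
    "媽": [["女-strokes-c", "馬-strokes-a"]],
}

def extract_component_samples(target, max_instances=5):
    """Collect up to max_instances samples of `target` hidden inside
    other characters that contain it as a component."""
    found = []
    for char, components in decompositions.items():
        if target not in components:
            continue
        index = components.index(target)
        for instance in samples.get(char, []):
            # The component's strokes occupy one slot of the instance.
            found.append(instance[index])
            if len(found) == max_instances:
                return found
    return found

print(extract_component_samples("女"))
```

Even in this toy setup, a component like 女 picks up one extra sample per occurrence in a containing character, which is exactly the effect exploited below.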

First of all we need to extract the handwriting data "hidden" in components. Let's use the Japanese model here:

$ python hwr/tegaki-tools/src/tegaki-extractcomponents \
    -t hwr/tegaki-models/data/train/japanese/handwriting-ja.xml -m 5 \
    components.xml
A maximum of 5 occurrences of each character will be stored in components.xml. You can easily view those with tegaki-train. Some extracted characters will be wrong; in those cases either the handwriting samples were incorrect, or cjklib has false information. I didn't correct or delete any of them, leaving that for later.

While we now have 5 instances for basic characters, complex characters with components still need data. We can generate it using the bootstrapping process. We add full-character versions to the component set, so that we avoid breaking a character down further than necessary; as our building from components isn't perfect, we minimize unnecessary build steps by using a parent component where available. We then bootstrap a character collection for the characters from JIS X 0208 with a maximum of 5 instances per character. Option "-x" makes sure only component transformations are used, so that exact matches are bypassed.

$ python hwr/tegaki-tools/src/tegaki-convert -m 5 \
    -t hwr/tegaki-models/data/train/japanese/handwriting-ja.xml \
    -c components.xml full_components.xml
$ python hwr/tegaki-tools/src/tegaki-bootstrap -x -l J -m 5 --domain=JISX0208 \
    -c full_components.xml handwriting_complex.xml

Results show that 81% of characters can be built from components. The others are either basic characters, cannot be composed due to their structure, or simply lack component data. On average 36 instances could be provided per character (using the cross product), of which we only use 5.
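The cross-product figure is simple combinatorics: a character built from two components with 6 extracted instances each yields 6 * 6 = 36 candidate combinations, of which only 5 are kept (option -m 5). A toy illustration with made-up instance names:

```python
from itertools import islice, product

# Two components, 6 extracted handwriting instances each (made-up labels).
component_instances = [
    ["a%d" % i for i in range(6)],  # instances of the first component
    ["b%d" % i for i in range(6)],  # instances of the second component
]

# Every pairing of a first-component instance with a second-component
# instance is a candidate synthetic sample for the composed character.
combinations = list(product(*component_instances))
kept = list(islice(combinations, 5))  # -m 5 keeps only 5 of them
print(len(combinations), len(kept))
```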

We now add the handwriting samples built in this way to the existing Japanese model, extending the instance count. Finally, we train the model:

$ python hwr/tegaki-tools/src/tegaki-convert -m 5 \
    -c full_components.xml -c handwriting_complex.xml \
    handwriting_enhanced-ja.xml
$ python hwr/tegaki-tools/src/tegaki-build -c handwriting_enhanced-ja.xml \
    zinnia handwriting_enhanced-ja.meta

We now have a new "enhanced" Japanese model that we want to evaluate. I decided to use the KanjiVG data, which does not share a common source with the Tegaki data. A character collection can be built using Roger Braun's KVG-Tools; integrated support in Tegaki is currently being worked on. To get meaningful results we should limit the testing set to the same character domain, which can be done by (ab)using tegaki-bootstrap:

$ python hwr/tegaki-tools/src/tegaki-bootstrap -l J --domain=JISX0208 \
    -c kanjivg.xml kanjivg_jis.xml

First we run the evaluation on the old model:

$ python hwr/tegaki-tools/src/tegaki-eval zinnia Japanese \
    -c kanjivg_jis.xml
Overall results
        Recognizer: zinnia
        Number of characters evaluated: 6377

        Total time: 118.09 sec
        Average time per character: 0.02 sec
        Recognition speed: 53.82 char/sec

        match1:
                Accuracy/Recall: 86.28
                Precision: 80.92
                F1 score: 83.51

        match5:
                Accuracy/Recall: 93.84

        match10:
                Accuracy/Recall: 94.92

We get an F1 score of 83.51 and a recall of 86.28, which both go up when considering the first 5 or first 10 characters in the result set.
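For reference, the match5/match10 numbers are recall over the top-k candidates returned by the recognizer: a sample counts as recognized at k if the expected character appears among the first k candidates. A minimal sketch with made-up candidate lists:

```python
# Toy evaluation data: (expected character, ranked candidate list).
results = [
    ("日", ["日", "曰", "目"]),  # correct at rank 1
    ("未", ["末", "未", "木"]),  # correct at rank 2
    ("土", ["士", "工", "干"]),  # not found at all
]

def recall_at(k, results):
    """Percentage of samples whose expected character appears among
    the first k recognizer candidates."""
    hits = sum(1 for expected, candidates in results
               if expected in candidates[:k])
    return 100.0 * hits / len(results)

print(recall_at(1, results))  # only exact first hits count
print(recall_at(5, results))  # near misses within the top 5 count too
```

Recall at larger k can only grow, which is why match5 and match10 always sit above the match1 figure.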

Now the moment of truth:

$ python hwr/tegaki-tools/src/tegaki-eval zinnia JapaneseEnhanced \
    -c kanjivg_jis.xml
Overall results
        Recognizer: zinnia
        Number of characters evaluated: 6356

        Total time: 114.82 sec
        Average time per character: 0.02 sec
        Recognition speed: 55.36 char/sec

        match1:
                Accuracy/Recall: 91.96
                Precision: 88.96
                F1 score: 90.44

        match5:
                Accuracy/Recall: 96.27

        match10:
                Accuracy/Recall: 96.93

With delight we see the F1 score go up to 90.44 and recall to 91.96. Even match5 and match10 improve, moving towards 100%. It seems that reusing handwriting data found inside the model itself can help increase recognition rates. I think this eases the pressure to get more data by manually drawing characters. The sources can be found in my branch on GitHub.

The Revolution of Electronic Content

I was looking for an academic paper from 2008 that was referenced somewhere. The title sounded interesting, though the abstract made me doubt whether the work would really be relevant for me. Anyway, I checked where the paper was published and noted down the name and reference.

I then logged on to my university library's website to check whether the work was available from the "digital library". It seemed no license was owned, and no direct access was possible. So I filled out the form provided and opted to actually pay money to have a copy sent.

About a week later an email in my mailbox sadly stated that the university had no access to this journal, but that the state library in the same town had a license that would allow me to get hold of the paper.

Walking into the library, I realised they had changed their membership regulations: an annual fee of 30 Euros now has to be paid to get the library pass renewed. Not wanting to pay that much, I asked a friend to help me download the paper. After some hassle we finally got through to the publisher's website, which then showed an administrative status text saying something about a config.txt that needed to be set up.

Inquiring at the front desk confirmed my fears: there was no quick solution to my problem. I was then told that if they had a license, my library should have one too, so I set off and asked there again, this time in person. They actually have an institute, only recently integrated into the university, which holds a license. I was told I could take the bus out of town to enquire there. Not wanting to invest another two hours in this quest, and assuming they would have the same technical access issues, I gave up. I did ask the lady, though, whether they couldn't just order a copy and send it to me. She said that while it was technically possible, license restrictions rendered it impossible.

Now, I'm a frequent user of interlibrary loan, as my university's library has only few books on CJK and Chinese. I go online, search the inter-library catalogue, and order the book, which a week later shows up at my local library. Renewing is actually the only thing that works differently from ordinary local books, and that really bugs me sometimes. But then, they can simply send me the physical copy. Sadly, the same does not apply to electronic copies, which do not even suffer the physical restrictions of ordinary books. But thanks, I'll stick to books if possible.
