Xiao'erjing in cjklib?

Xiao'erjing is a way of writing Chinese in Arabic script. Basically it is a transcription similar to Pinyin used by people with knowledge of the Arabic script to denote the sounds of Mandarin or another "dialect". It is written from right to left (RTL).

Universal Declaration of Human Rights in Xiao'erjing (public domain, taken from http://commons.wikimedia.org/wiki/Image:Xiaoerjing-Ekzemplafrazo.svg): 人人生而自由…

So why not write a conversion from Pinyin to Xiao'erjing, using cjklib's ReadingConverter paradigm, and outdo the individualist named "PinyinBrailleConverter"? Well, it seems somebody has already gone halfway: converting pinyin to xiaoerjin.

Now, where do we get enough test cases to secure its correctness?


Hey. Great site. That's my converter you're referencing. Thanks for that.

Re accuracy, there are a few problems. First, Xiao'erjing was never really standardised. You'll find multiple variations between different texts. Second, there is no proper way to write -ng in Arabic. In the case of Uyghur or Persian, the letter ڭ is used. When transliterating foreign words into Arabic, a combination of ن and غ, i.e. نغ, is used. Xiao'erjing, however, uses the final mark ٍ (two small dashes below the baseline, pronounced "in") for both -in and -ing. Thus 零 and 林 would both be written لٍ. The -in/-ing endings are not the only case of ambiguity imposed by the Arabic script.
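The -in/-ing collision can be made concrete with a minimal sketch. The tables below are hypothetical and deliberately tiny: they contain only the entries named in the paragraph above, are not the data of the actual converter, and the function names are made up for illustration.

```python
# Toy tables, assumed for illustration only; a real converter needs the
# full set of initials and finals and has to pick among variant spellings.
INITIALS = {"l": "\u0644"}  # ل (lam)
FINALS = {
    "in": "\u064D",   # the final mark with two small strokes below the baseline
    "ing": "\u064D",  # same mark: the -ng is simply not written
}


def to_xiaoerjing(syllable):
    """Convert one toneless Pinyin syllable using the toy tables above."""
    for initial, letter in INITIALS.items():
        if syllable.startswith(initial):
            final = syllable[len(initial):]
            if final in FINALS:
                return letter + FINALS[final]
    raise ValueError("syllable not covered by the toy tables: %r" % syllable)


def collisions(syllables):
    """Group syllables by their output and keep only the ambiguous groups."""
    groups = {}
    for syllable in syllables:
        groups.setdefault(to_xiaoerjing(syllable), []).append(syllable)
    return {out: group for out, group in groups.items() if len(group) > 1}


# "lin" (林) and "ling" (零) end up with the same spelling.
ambiguous = collisions(["lin", "ling"])
```

A check like `collisions` run over a full syllable table would list every pair that cannot be told apart after conversion, which is also why a reverse conversion back to Pinyin cannot be unique.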

Much of the data used for the conversion came from the Wikipedia article and was then checked against whatever examples I could find elsewhere. I've also taken some liberties when multiple options were available, in order to clear up some ambiguity in the results, but not so far as including -ng, which was never part of it in any version I've ever seen.

Basically, it's accurate in the sense that it's a phonetic transcription that has never been standardised: anyone who knows the Arabic script can work out what is being said, just as someone who knows the Latin alphabet can make sense of Pinyin.

I've got a newer version of the conversion script that has greater support for tones and variant pinyin. It should be up there in about a week or two for anyone interested.

re: correctness

Your multitude of blogs makes me all mixed up :)

This seems like an even more interesting transcription with all the ambiguity and nonstandard usage. I guess Xiao'erjing was not only used for standard Mandarin. Are there any other major dialects? Guanhua seems to be different already, as historical Romanizations treat for example 京 specially. With dialects it should be much more difficult to find a unique mapping.

It seems that a Japanese group has some digitized data, but it is not openly available. That would be a good testing corpus.

I'll be interested in your updated support, especially tones. The only reference I found was on Wikipedia about (formerly) entering tones. I'll take a look into your Wu posts, too.