Gwoyeu Romatzyh and abbreviated spellings

Gwoyeu Romatzyh (GR) is a fairly complex Romanisation. Instead of using diacritic marks or appended digits, its creators decided to give each syllable-tone combination a distinctive shape. So the syllable guo (e.g. 国, 果) becomes guo, gwo, guoo, guoh for tones one to four.

This is actually the feature GR is best known for. By the way, I believe the name is mostly abbreviated because the correct spelling Gwoyeu Romatzyh is difficult to remember. It comes from Guóyǔ Luómǎzì, "National Language Romanization", and would strictly be rendered as Gwoyeu Luomaatzyh in its own system.

While this kind of spelling seems highly complex, Yuen Ren Chao, one of its creators, argues that with the tone included in the syllable itself, the learner is forced to also learn the tone at the same time - stressing its importance for pronunciation.
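To make the tonal spelling concrete, here is a minimal sketch that only knows the four forms of guo quoted above. The function names are made up for illustration and have nothing to do with cjklib's API; a real converter needs the full GR spelling rules.

```python
# The four tonal spellings of the syllable "guo", indexed by tone number.
GUO_FORMS = {1: "guo", 2: "gwo", 3: "guoo", 4: "guoh"}


def pinyin_guo_to_gr(numbered):
    """Map 'guo1'..'guo4' (Pinyin with a tone digit) to its GR spelling."""
    base, tone = numbered[:-1], int(numbered[-1])
    if base != "guo" or tone not in GUO_FORMS:
        raise ValueError("only guo1..guo4 are covered: %r" % numbered)
    return GUO_FORMS[tone]


def gr_tone(spelling):
    """Read the tone number back from a GR spelling of 'guo'."""
    for tone, form in GUO_FORMS.items():
        if form == spelling:
            return tone
    raise ValueError("not a known spelling of guo: %r" % spelling)
```

The point of the second function is that the tone can be read off the spelling alone, with no digit or diacritic involved.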

There is another factor, though, that is special to GR. Chao uses a lot of abbreviations in his books on GR, replacing for example i.geh (yī ge, 一个, "one") with ig, or chi.tzyy (qīzǐ, 妻子, "wife") with chitz. This is convenient for the writer, but a major obstacle for conversion to other Romanisations. Below is a list of the abbreviated forms I came across in Chao's books.

In some special cases it is even unclear whether the spellings he used are merely ad-hoc forms, for example j-h-eh, a form of jeh said with laughter.

List of abbreviated spellings

Yuen Ren Chao: A Grammar of Spoken Chinese. University of California Press, Berkeley, 1968, ISBN 0-520-00219-9, pp. xxx, xxxi.

  • a for .a (啊)
  • ba for .ba (吧 and 罢)
  • bu for bu, bwu, buh (不)
  • de for .de (的)
  • g for ₒgeh (个)
  • i for i, yi, yih (一)
  • ia for .ia (呀)
  • j for -.jy, -.je (着)
  • le for .le (了)
  • ma for .ma (嗎)
  • me for -.me (么) and .me (嚜)
  • men for -.men (们)
  • ne for .ne (呐)
  • sh for ₒshyh (是)
  • tz for -.tzy (子)

Yuen Ren Chao: Mandarin Primer: an intensive course in spoken Chinese. Harvard University Press, Cambridge, 1948.

  • -tz for .tzy (子)
  • -j for -.jy and .je (著)
  • g for -.geh (個)
  • de for .de (的)
  • sherm(.me) for shern.me (甚麼) (p. 123)
  • tzeem(.me) (怎麼) (p. 123)
  • tzemm(.me) (p. 123)
  • nemm(.me), also .ne.me (那麼) (p. 123, 138)
  • jemm.me (similar to tzemm.me) (這麼) (p. 137)

The following forms have yet to be evaluated:

  • V bu V for V .bu ₒV (p. xxxi, first book)
  • -.men as in 我們 and 你們 etc. turns to -m before labials (p. 123, second book)
  • .èh (p. 22, 124), .oh (p. 22, 130), è (p. 153, second book) interjections
  • j-h-eh (jeh), tz-h-uoh (tzuoh) meng, (p. 162, second book) marking laughter
  • ss (p. 190, second book)

For an up-to-date list and license see http://code.google.com/p/cjklib/source/browse/trunk/cjklib/data/grabbreviation.csv.
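How such a list could be used before conversion can be sketched as a simple word lookup. The table below is hypothetical: it holds only three forms quoted in this post and does not reflect the actual layout of the CSV file. Bare suffix rules (such as a trailing g for .geh) are deliberately left out, as they would clash with ordinary syllables ending in -ng.

```python
# Hypothetical lookup table, filled with forms quoted in this post.
ABBREVIATIONS = {
    "ig": "i.geh",
    "chitz": "chi.tzyy",
    "sherm": "shern.me",
}


def expand(text):
    """Replace abbreviated GR words in a space-separated string."""
    words = text.split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)
```

Words that are not in the table pass through unchanged, so the function can be run over text that mixes full and abbreviated spellings.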

Xiao'erjing in cjklib?

Xiao'erjing is a way of writing Chinese in Arabic script. Basically it is a transcription similar to Pinyin used by people with knowledge of the Arabic script to denote the sounds of Mandarin or another "dialect". It is written from right to left (RTL).

Universal Declaration of Human Rights in Xiao'erjing: 人人生而自由… (image in the public domain, taken from http://commons.wikimedia.org/wiki/Image:Xiaoerjing-Ekzemplafrazo.svg)

So, why not make a conversion from Pinyin to Xiao'erjing, using cjklib's ReadingConverter paradigm, and outdo the individualist named "PinyinBrailleConverter"? Well, it seems somebody already went halfway: converting pinyin to xiaoerjin.

Now, where do we get enough test cases to ensure its correctness?

Cantonese Yale syllable table

Similar to the Jyutping syllable table, here is a table of the syllables of the Cantonese language written in the Cantonese Yale romanisation.

There are two sources: the 粵音節表 (Table of Cantonese Syllables) of the Research Centre for Humanities Computing of the Research Institute for the Humanities (RIH), Faculty of Arts, The Chinese University of Hong Kong, and the Unihan table. Both are in Jyutping. I used cjklib to convert those into Cantonese Yale.

Sources for the mapping are:

  • Stephen Matthews, Virginia Yip: Cantonese: A Comprehensive Grammar. Routledge, 1994, ISBN 0-415-08945-X.
  • Parker Po-fei Huang, Gerard P. Kok: Speak Cantonese (Book I). Revised Edition, Yale University, 1999, ISBN 0-88710-094-5.

The following Jyutping syllables are missing due to the lack of proper sources for a mapping between the two romanisations: lem, deu, gep, kep, loei, loet, pet, om (all in Jyutping). The table below is therefore missing the Jyutping finals -oei, -oet, -om, -em, -ep, -et and -eu. Syllables found in the Unihan database are emphasised (italic); syllables from the table of the Centre for Humanities Computing are marked with a 1.

Initials (columns): b p m f d t n l g k ng h gw kw w j ch s y

i: mi1 di ti ni1 li1 wi ji1 chi1 si1 yi1
ip: dip1 tip1 nip1 lip1 gip1 kip hip1 jip1 chip1 sip1 yip1
it: bit1 pit1 mit1 dit1 tit1 nit lit1 git1 kit1 ngit1 hit1 jit1 chit1 sit1 yit1
ik: bik1 pik1 mik1 dik1 tik1 nik1 lik1 gik1 gwik1 kwik wik1 jik1 chik1 sik1 yik1
im: dim1 tim1 nim1 lim1 gim1 kim1 him1 jim1 chim1 sim1 yim1
in: bin1 pin1 min1 din1 tin1 nin1 lin1 gin1 kin1 hin1 jin1 chin1 sin1 yin1
ing: bing1 ping1 ming1 fing ding1 ting1 ning1 ling1 ging1 king1 hing1 gwing1 wing1 jing1 ching1 sing1 ying1
iu: biu1 piu1 miu1 fiu diu1 tiu1 niu1 liu1 giu1 kiu1 hiu1 jiu1 chiu1 siu1 yiu1
yu: jyu1 chyu1 syu1 yu1
yut: dyut1 tyut1 lyut1 gyut1 kyut1 hyut1 jyut1 chyut1 syut1 yut1
yun: dyun1 tyun1 nyun1 lyun1 gyun1 kyun1 hyun1 jyun1 chyun1 syun1 yun1
u: bu fu1 gu1 ku1 wu1
ut: but1 put1 mut1 fut1 gut kut1 wut1
uk: uk1 buk1 puk1 muk1 fuk1 duk1 tuk1 nuk1 luk1 guk1 kuk1 nguk1 huk1 juk1 chuk1 suk1 yuk1
un: bun1 pun1 mun1 fun1 gun1 kwun wun1 chun
ung: ung1 bung1 pung1 mung1 fung1 dung1 tung1 nung1 lung1 gung1 kung1 ngung1 hung1 jung1 chung1 sung1 yung1
ui: bui1 pui1 mui1 fui1 gui1 kui1 kwui wui1 jui
e: e1 be1 pe me1 fe de1 ne1 le1 ge1 ke1 he we je1 che1 se1 ye1
ek: bek1 pek1 dek1 tek1 lek1 kek1 hek1 jek1 chek1 sek1
eng: beng1 peng1 meng1 deng1 teng1 leng1 geng1 heng1 jeng1 cheng1 seng1 yeng1
ei: ei1 bei1 pei1 mei1 fei1 dei1 nei1 lei1 gei1 kei1 hei1 sei1
eut: deut1 neut1 leut1 jeut1 cheut1 seut1
eun: deun1 teun1 leun1 jeun1 cheun1 seun1 yeun1
eui: deui1 teui1 neui1 leui1 geui1 keui1 heui1 jeui1 cheui1 seui1 yeui1
eu: eu deu1 teu1 geu1 keu heu1 jeu
euk: deuk1 leuk1 geuk1 keuk1 jeuk1 cheuk1 seuk1 yeuk1
eung: deung neung1 leung1 geung1 keung1 heung1 jeung1 cheung1 seung1 yeung1
o: o1 bo1 po1 mo1 fo1 do1 to1 no1 lo1 go1 ko1 ngo1 ho1 gwo1 wo1 jo1 cho1 so1 yo1
ot: got1 hot1
ok: ok1 bok1 pok1 mok1 fok1 dok1 tok1 nok1 lok1 gok1 kok1 ngok1 hok1 gwok1 kwok1 wok1 jok1 chok1 sok1
on: on1 gon1 ngon1 hon1
ong: ong1 bong1 pong1 mong1 fong1 dong1 tong1 nong1 long1 gong1 kong1 ngong1 hong1 gwong1 kwong1 wong1 jong1 chong1 song1
oi: oi1 moi doi1 toi1 noi1 loi1 goi1 koi1 ngoi1 hoi1 joi1 choi1 soi1
ou: ou1 bou1 pou1 mou1 dou1 tou1 nou1 lou1 gou1 ngou1 hou1 jou1 chou1 sou1
ap: ap dap tap nap1 lap1 gap1 kap1 ngap hap1 jap1 chap1 sap1 yap1
at: at bat1 pat1 mat1 fat1 dat1 tat nat1 lat1 gat1 kat1 ngat1 hat1 gwat1 wat1 jat1 chat1 sat1 yat1
ak: ak1 bak1 pak mak1 dak1 lak1 gak kak ngak1 hak1 wak jak1 chak1 sak1
am: am1 bam1 dam1 tam nam1 lam1 gam1 kam1 ngam1 ham1 jam1 cham1 sam1 yam1
an: an1 ban1 pan1 man1 fan1 dan1 tan1 nan1 lan gan1 kan1 ngan1 han1 gwan1 kwan1 wan1 jan1 chan1 san1 yan1
ang: ang1 bang1 pang1 mang1 fang1 dang1 tang1 nang1 lang gang1 kang1 ngang hang1 gwang1 wang1 jang1 chang1 sang1
ai: ai1 bai1 pai1 mai1 fai1 dai1 tai1 nai1 lai1 gai1 kai1 ngai1 hai1 gwai1 kwai1 wai1 jai1 chai1 sai1 yai1
au: au1 bau pau1 mau1 fau1 dau1 tau1 nau1 lau1 gau1 kau1 ngau1 hau1 wau jau1 chau1 sau1 yau1
a: a1 ba1 pa1 ma1 fa1 da1 ta1 na1 la1 ga1 ka1 nga1 ha1 gwa1 kwa1 wa1 ja1 cha1 sa1 ya1
aap: aap1 daap1 taap1 naap1 laap1 gaap1 kaap ngaap haap1 jaap1 chaap1 saap1
aat: aat1 baat1 paat maat1 faat1 daat1 taat1 naat1 laat1 gaat1 kaat1 ngaat1 haat gwaat1 waat1 jaat1 chaat1 saat1
aak: aak1 baak1 paak1 maak1 faak daak1 laak1 gaak1 kaak1 ngaak1 haak1 gwaak1 waak1 jaak1 chaak1 saak1 yaak1
aam: aam1 daam1 taam1 naam1 laam1 gaam1 kaam ngaam1 haam1 jaam1 chaam1 saam1
aan: aan1 baan1 paan1 maan1 faan1 daan1 taan1 naan1 laan1 gaan1 kaan ngaan1 haan1 gwaan1 kwaan waan1 jaan1 chaan1 saan1
aang: aang1 baang1 paang1 maang1 daang taang naang laang1 gaang1 ngaang1 haang1 gwaang1 kwaang1 waang1 jaang1 chaang1 saang1 yaang
aai: aai1 baai1 paai1 maai1 faai1 daai1 taai1 naai1 laai1 gaai1 kaai1 ngaai1 haai1 gwaai1 kwaai1 waai1 jaai1 chaai1 saai1 yaai1
aau: aau1 baau1 paau1 maau1 faau daau taau naau1 laau gaau1 kaau1 ngaau1 haau1 jaau1 chaau1 saau1 yaau
m: m1 hm1
ng: ng1 hng1
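The conversion step behind this table can be sketched roughly as follows. This is not cjklib's implementation: the rules are the commonly documented correspondences between the two romanisations as I understand them (z to j, c to ch, j to y, open aa to a, oe and eo to eu), the function name is made up, and tones are ignored altogether. Finals listed above as having no sourced mapping are rejected.

```python
import re

# Initials spelled differently in Yale; all others are written the same.
INITIALS = {"z": "j", "c": "ch", "j": "y"}

# Jyutping finals listed above as lacking a sourced mapping.
UNMAPPED_FINALS = {"oei", "oet", "om", "em", "ep", "et", "eu"}

# Optional initial followed by the final; no tone digit is accepted.
SYLLABLE = re.compile(r"^(ng|gw|kw|[bpmfdtnlgkhwzcsj])?([a-z]+)$")


def jyutping_to_yale(syllable):
    """Convert one toneless Jyutping syllable to Cantonese Yale."""
    match = SYLLABLE.match(syllable)
    if match is None:
        raise ValueError("not a toneless syllable: %r" % syllable)
    initial, final = match.group(1) or "", match.group(2)
    if final in UNMAPPED_FINALS:
        raise ValueError("no mapping for final -%s" % final)

    if final == "aa":
        final = "a"                  # open aa is written a in Yale
    elif final[:2] in ("oe", "eo"):
        final = "eu" + final[2:]     # oe and eo both become eu

    if initial == "j" and final.startswith("yu"):
        initial = ""                 # jyu- is written yu-, not yyu-
    return INITIALS.get(initial, initial) + final
```

For instance, Jyutping zoeng comes out as jeung and jyut as yut, both of which appear in the table, while lem raises an error because -em is one of the unmapped finals.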

cjklib comes with rewritten and more extensive unit tests

I finally got around to one particular weakness of cjklib: motivated enough to tackle the problem of weak unit tests, I rewrote the current tests and added some more.

Test cases are now much clearer, which should encourage the addition of further ones. Some previously hidden flaws surfaced as well: for example, the CantoneseIPAOperator class was never well tested, as its corresponding JyutpingIPAConverter class is still not implemented, which makes the operator not very useful so far. I cannot stress enough how important unit tests are; some of the bugs I fixed yesterday were small corner cases that are really hard to find.

Test cases so far cover consistency tests for the ReadingOperator and ReadingConverter classes, and for most of them additional reference values are given, some more extensive than others. The CharacterLookup class still needs many more tests, something that should be easier now after the rewrite.

A tool new to the development chain now eases the whole testing task: nosetests can easily select tests from regular expressions and at the same time create coverage and profiling information. Happy bug squashing!
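To give an idea of what a consistency test paired with a reference value can look like, here is a minimal sketch using the standard unittest module. ToyOperator is a made-up stand-in and not a cjklib class; only the idea of a decompose/compose round trip is taken from the reading operators.

```python
import re
import unittest


class ToyOperator(object):
    """Hypothetical stand-in for a reading operator (not part of cjklib)."""

    def decompose(self, string):
        # Split into syllables and the whitespace runs between them.
        return re.findall(r"\s+|\S+", string)

    def compose(self, entities):
        return "".join(entities)


class ConsistencyTest(unittest.TestCase):
    CASES = ["gwo yeu", "guo  gwo guoo guoh", ""]

    def test_roundtrip(self):
        """Composing a decomposition must give back the input."""
        operator = ToyOperator()
        for string in self.CASES:
            entities = operator.decompose(string)
            self.assertEqual(operator.compose(entities), string)

    def test_reference(self):
        """A hand-written reference value pins down one decomposition."""
        operator = ToyOperator()
        self.assertEqual(operator.decompose("gwo yeu"), ["gwo", " ", "yeu"])
```

The round-trip test catches inconsistencies between the two methods, while the reference test catches the case where both are wrong in the same way.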

(Natural) language in the world of programming

When it comes to writing code, directives and commands are dictated by the programming language (e.g. if ... then ... else), and for most programming languages these are in English [1]. When it comes to writing comments, however, the programmer is free to choose which language to use.

Well, it seems that English unites the programming world, and famous Hackers like Eric S. Raymond advise every beginning Hacker to first gain a good command of English [2]. Learning a programming language whose keywords are taken from English, or any other language, is nevertheless entirely possible without knowing that language.

Now, I've read Raymond's view on learning English [2] and I just re-read the "Style Guide for Python Code" [3]. The latter states:

Python coders from non-English speaking countries: please write
your comments in English, unless you are 120% sure that the code
will never be read by people who don't speak your language.

I have to say I honestly disagree with both. I don't doubt the importance of speaking the same language for communication, and I agree that English is most likely the language to choose. But the fact that both want to tell the programmer which language to use is a sign of ignorance towards speakers of other languages: programming is in no way different from any other area where people individually decide which language is appropriate, and have been doing so for ages.

Don't tell us which language we should use; we should know best.

You might argue "what's the point anyway" and "we all know it finally boils down to English", but I believe at this level you should accept that people want to use the language they think is most appropriate, and either way choosing one language means excluding others, no matter which one it finally is.

[1] Python actually has a "translation" to Chinese which translates reserved keywords and built in types and allows Chinese variable names: http://sourceforge.net/projects/chinesepython
[2] http://www.catb.org/~esr/faqs/hacker-howto.html#skills4
[3] http://www.python.org/dev/peps/pep-0008/
