Gwoyeu Romatzyh and abbreviated spellings
Submitted by Christoph on 12 July, 2009 - 17:50
Gwoyeu Romatzyh is a fairly complex Romanisation. Instead of using diacritic marks or appended digits, its creators decided to give each syllable-tone combination a distinctive spelling. So the syllable guo (e.g. 国, 果) becomes guo, gwo, guoo, guoh for tones one to four.
This is actually the feature GR is best known for. By the way, I believe the name is mostly abbreviated to GR because the correct spelling Gwoyeu Romatzyh is difficult to remember. The name itself comes from Guóyǔ Luómǎzì, "National Language Romanization", and would strictly be rendered as Gwoyeu Luomaatzyh in its own system.
While this kind of spelling seems highly complex, Yuen Ren Chao, one of its creators, argues that with the tone included in the syllable itself, the learner is forced to also learn the tone at the same time - stressing its importance for pronunciation.
There is another factor, though, that is special to GR. Chao uses a lot of abbreviations in his books on GR, replacing for example i.geh (yī ge, 一个, "one") with ig, or chi.tzyy (qīzǐ, 妻子, "wife") with chitz. This is convenient for the writer, but it is a major obstacle for conversion to other Romanisations. Below I want to give a list of abbreviated forms I came across in Chao's books.
In some special cases it is even unclear whether the spellings he used are merely ad-hoc forms, for example j-h-eh, a form of jeh said with laughter.
List of abbreviated spellings
Yuen Ren Chao: A Grammar of Spoken Chinese. University of California Press, Berkeley, 1968, ISBN 0-520-00219-9, pp. xxx, xxxi.
- a for .a (啊)
- ba for .ba (吧 and 罢)
- bu for bu, bwu, buh (不)
- de for .de (的)
- g for ₒgeh (个)
- i for i, yi, yih (一)
- ia for .ia (呀)
- j for -.jy, -.je (着)
- le for .le (了)
- ma for .ma (嗎)
- me for -.me (么) and .me (嚜)
- men for -.men (们)
- ne for .ne (呐)
- sh for ₒshyh (是)
- tz for -.tzy (子)
Yuen Ren Chao: Mandarin Primer: an intensive course in spoken Chinese. Harvard University Press, Cambridge, 1948.
- -tz for .tzy (子)
- -j for -.jy and .je (著)
- g for -.geh (個)
- de for .de (的)
- sherm(.me) for shern.me (甚麼) (p. 123)
- tzeem(.me) (怎麼) (p. 123)
- tzemm(.me) (p. 123)
- nemm(.me), also .ne.me (那麼) (p. 123, 138)
- jemm.me (similar to tzemm.me) (這麼) (p. 137)
The following forms have yet to be evaluated:
- V bu V for V .bu ₒV (p. xxxi, first book)
- -.men as in 我們 and 你們 etc. turns to -m before labials (p. 123, second book)
- .èh (p. 22, 124), .oh (p. 22, 130), è (p. 153, second book) interjections
- j-h-eh (jeh), tz-h-uoh (tzuoh) meng, (p. 162, second book) marking laughter
- ss (p. 190, second book)
For an up-to-date list and license see http://code.google.com/p/cjklib/source/browse/trunk/cjklib/data/grabbreviation.csv.
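To illustrate why these abbreviations get in the way of conversion, here is a minimal sketch that expands an abbreviated token back into its possible full spellings. The table is copied from the Grammar of Spoken Chinese list above; the names `ABBREVIATIONS` and `expansions` are made up for this example and are not part of cjklib. Note that several forms have more than one candidate, so a converter would still need context to pick one.

```python
# Abbreviated form -> possible full GR spellings, taken from the list for
# "A Grammar of Spoken Chinese" above. Names here are illustrative only.
ABBREVIATIONS = {
    "a": (".a",),
    "ba": (".ba",),
    "bu": ("bu", "bwu", "buh"),
    "de": (".de",),
    "g": ("ₒgeh",),
    "i": ("i", "yi", "yih"),
    "ia": (".ia",),
    "j": ("-.jy", "-.je"),
    "le": (".le",),
    "ma": (".ma",),
    "me": ("-.me", ".me"),
    "men": ("-.men",),
    "ne": (".ne",),
    "sh": ("ₒshyh",),
    "tz": ("-.tzy",),
}


def expansions(token):
    """Return all candidate full spellings for a token.

    Tokens that are not in the table are returned unchanged, wrapped in a
    one-element tuple so callers can always iterate over the result.
    """
    return ABBREVIATIONS.get(token, (token,))


def is_ambiguous(token):
    """True if the abbreviated form cannot be expanded without context."""
    return len(expansions(token)) > 1
```

For example, `expansions("tz")` yields a single candidate, while `expansions("bu")` yields three tonal variants, which is exactly the case a plain lookup table cannot resolve.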
Xiao'erjing in cjklib?
Submitted by Christoph on 10 July, 2009 - 09:41
Xiao'erjing is a way of writing Chinese in Arabic script. Basically it is a transcription similar to Pinyin, used by people with knowledge of the Arabic script to denote the sounds of Mandarin or another "dialect". It is written from right to left (RTL).
Universal Declaration of Human Rights in Xiao'erjing: 人人生而自由…
So, why not make a conversion from Pinyin to Xiao'erjing, using cjklib's ReadingConverter paradigm, and outdo the individualist named "PinyinBrailleConverter"? Well, it seems somebody already went halfway: converting pinyin to xiaoerjin.
Now, where do we get enough test cases to secure its correctness?
Cantonese Yale syllable table
Submitted by Christoph on 4 July, 2009 - 11:54
Similar to the Jyutping syllable table, here is a table of syllables of the Cantonese language written in the Cantonese Yale romanisation.
There are two sources: the 粵音節表 (Table of Cantonese Syllables) of the Research Centre for Humanities Computing of the Research Institute for the Humanities (RIH), Faculty of Arts, The Chinese University of Hong Kong, and the Unihan table. Both are in Jyutping, so I used cjklib to convert them into Cantonese Yale.
Sources for the mapping are:
- Stephen Matthews, Virginia Yip: Cantonese: A Comprehensive Grammar. Routledge, 1994, ISBN 0-415-08945-X.
- Parker Po-fei Huang, Gerard P. Kok: Speak Cantonese (Book I). Revised Edition, Yale University, 1999, ISBN 0-88710-094-5.
The following Jyutping syllables are missing due to the lack of proper sources for a mapping between the two romanisations: lem, deu, gep, kep, loei, loet, pet, om (all in Jyutping). The table below is therefore missing the Jyutping finals -oei, -oet, -om, -em, -ep, -et and -eu. Syllables found in the Unihan database are emphasised (italic); syllables from the table of the Centre for Humanities Computing are marked with a 1.
final | (none) | b | p | m | f | d | t | n | l | g | k | ng | h | gw | kw | w | j | ch | s | y |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
i | | | | mi1 | | di | ti | ni1 | li1 | | | | | | | wi | ji1 | chi1 | si1 | yi1 |
ip | | | | | | dip1 | tip1 | nip1 | lip1 | gip1 | kip | | hip1 | | | | jip1 | chip1 | sip1 | yip1 |
it | | bit1 | pit1 | mit1 | | dit1 | tit1 | nit | lit1 | git1 | kit1 | ngit1 | hit1 | | | | jit1 | chit1 | sit1 | yit1 |
ik | | bik1 | pik1 | mik1 | | dik1 | tik1 | nik1 | lik1 | gik1 | | | | gwik1 | kwik | wik1 | jik1 | chik1 | sik1 | yik1 |
im | | | | | | dim1 | tim1 | nim1 | lim1 | gim1 | kim1 | | him1 | | | | jim1 | chim1 | sim1 | yim1 |
in | | bin1 | pin1 | min1 | | din1 | tin1 | nin1 | lin1 | gin1 | kin1 | | hin1 | | | | jin1 | chin1 | sin1 | yin1 |
ing | | bing1 | ping1 | ming1 | fing | ding1 | ting1 | ning1 | ling1 | ging1 | king1 | | hing1 | gwing1 | | wing1 | jing1 | ching1 | sing1 | ying1 |
iu | | biu1 | piu1 | miu1 | fiu | diu1 | tiu1 | niu1 | liu1 | giu1 | kiu1 | | hiu1 | | | | jiu1 | chiu1 | siu1 | yiu1 |
yu | | | | | | | | | | | | | | | | | jyu1 | chyu1 | syu1 | yu1 |
yut | | | | | | dyut1 | tyut1 | | lyut1 | gyut1 | kyut1 | | hyut1 | | | | jyut1 | chyut1 | syut1 | yut1 |
yun | | | | | | dyun1 | tyun1 | nyun1 | lyun1 | gyun1 | kyun1 | | hyun1 | | | | jyun1 | chyun1 | syun1 | yun1 |
u | | bu | | | fu1 | | | | | gu1 | ku1 | | | | | wu1 | | | | |
ut | | but1 | put1 | mut1 | fut1 | | | | | gut | kut1 | | | | | wut1 | | | | |
uk | uk1 | buk1 | puk1 | muk1 | fuk1 | duk1 | tuk1 | nuk1 | luk1 | guk1 | kuk1 | nguk1 | huk1 | | | | juk1 | chuk1 | suk1 | yuk1 |
un | | bun1 | pun1 | mun1 | fun1 | | | | | gun1 | | | | | kwun | wun1 | | chun | | |
ung | ung1 | bung1 | pung1 | mung1 | fung1 | dung1 | tung1 | nung1 | lung1 | gung1 | kung1 | ngung1 | hung1 | | | | jung1 | chung1 | sung1 | yung1 |
ui | | bui1 | pui1 | mui1 | fui1 | | | | | gui1 | kui1 | | | | kwui | wui1 | jui | | | |
e | e1 | be1 | pe | me1 | fe | de1 | | ne1 | le1 | ge1 | ke1 | | he | | | we | je1 | che1 | se1 | ye1 |
ek | | bek1 | pek1 | | | dek1 | tek1 | | lek1 | | kek1 | | hek1 | | | | jek1 | chek1 | sek1 | |
eng | | beng1 | peng1 | meng1 | | deng1 | teng1 | | leng1 | geng1 | | | heng1 | | | | jeng1 | cheng1 | seng1 | yeng1 |
ei | ei1 | bei1 | pei1 | mei1 | fei1 | dei1 | | nei1 | lei1 | gei1 | kei1 | | hei1 | | | | | | sei1 | |
eut | | | | | | deut1 | | neut1 | leut1 | | | | | | | | jeut1 | cheut1 | seut1 | |
eun | | | | | | deun1 | teun1 | | leun1 | | | | | | | | jeun1 | cheun1 | seun1 | yeun1 |
eui | | | | | | deui1 | teui1 | neui1 | leui1 | geui1 | keui1 | | heui1 | | | | jeui1 | cheui1 | seui1 | yeui1 |
eu | eu | | | | | deu1 | teu1 | | | geu1 | keu | | heu1 | | | | jeu | | | |
euk | | | | | | deuk1 | | | leuk1 | geuk1 | keuk1 | | | | | | jeuk1 | cheuk1 | seuk1 | yeuk1 |
eung | | | | | | deung | | neung1 | leung1 | geung1 | keung1 | | heung1 | | | | jeung1 | cheung1 | seung1 | yeung1 |
o | o1 | bo1 | po1 | mo1 | fo1 | do1 | to1 | no1 | lo1 | go1 | ko1 | ngo1 | ho1 | gwo1 | | wo1 | jo1 | cho1 | so1 | yo1 |
ot | | | | | | | | | | got1 | | | hot1 | | | | | | | |
ok | ok1 | bok1 | pok1 | mok1 | fok1 | dok1 | tok1 | nok1 | lok1 | gok1 | kok1 | ngok1 | hok1 | gwok1 | kwok1 | wok1 | jok1 | chok1 | sok1 | |
on | on1 | | | | | | | | | gon1 | | ngon1 | hon1 | | | | | | | |
ong | ong1 | bong1 | pong1 | mong1 | fong1 | dong1 | tong1 | nong1 | long1 | gong1 | kong1 | ngong1 | hong1 | gwong1 | kwong1 | wong1 | jong1 | chong1 | song1 | |
oi | oi1 | | | moi | | doi1 | toi1 | noi1 | loi1 | goi1 | koi1 | ngoi1 | hoi1 | | | | joi1 | choi1 | soi1 | |
ou | ou1 | bou1 | pou1 | mou1 | | dou1 | tou1 | nou1 | lou1 | gou1 | | ngou1 | hou1 | | | | jou1 | chou1 | sou1 | |
ap | ap | | | | | dap | tap | nap1 | lap1 | gap1 | kap1 | ngap | hap1 | | | | jap1 | chap1 | sap1 | yap1 |
at | at | bat1 | pat1 | mat1 | fat1 | dat1 | tat | nat1 | lat1 | gat1 | kat1 | ngat1 | hat1 | gwat1 | | wat1 | jat1 | chat1 | sat1 | yat1 |
ak | ak1 | bak1 | pak | mak1 | | dak1 | | | lak1 | gak | kak | ngak1 | hak1 | | | wak | jak1 | chak1 | sak1 | |
am | am1 | bam1 | | | | dam1 | tam | nam1 | lam1 | gam1 | kam1 | ngam1 | ham1 | | | | jam1 | cham1 | sam1 | yam1 |
an | an1 | ban1 | pan1 | man1 | fan1 | dan1 | tan1 | nan1 | lan | gan1 | kan1 | ngan1 | han1 | gwan1 | kwan1 | wan1 | jan1 | chan1 | san1 | yan1 |
ang | ang1 | bang1 | pang1 | mang1 | fang1 | dang1 | tang1 | nang1 | lang | gang1 | kang1 | ngang | hang1 | gwang1 | | wang1 | jang1 | chang1 | sang1 | |
ai | ai1 | bai1 | pai1 | mai1 | fai1 | dai1 | tai1 | nai1 | lai1 | gai1 | kai1 | ngai1 | hai1 | gwai1 | kwai1 | wai1 | jai1 | chai1 | sai1 | yai1 |
au | au1 | bau | pau1 | mau1 | fau1 | dau1 | tau1 | nau1 | lau1 | gau1 | kau1 | ngau1 | hau1 | | | wau | jau1 | chau1 | sau1 | yau1 |
a | a1 | ba1 | pa1 | ma1 | fa1 | da1 | ta1 | na1 | la1 | ga1 | ka1 | nga1 | ha1 | gwa1 | kwa1 | wa1 | ja1 | cha1 | sa1 | ya1 |
aap | aap1 | | | | | daap1 | taap1 | naap1 | laap1 | gaap1 | kaap | ngaap | haap1 | | | | jaap1 | chaap1 | saap1 | |
aat | aat1 | baat1 | paat | maat1 | faat1 | daat1 | taat1 | naat1 | laat1 | gaat1 | kaat1 | ngaat1 | haat | gwaat1 | | waat1 | jaat1 | chaat1 | saat1 | |
aak | aak1 | baak1 | paak1 | maak1 | faak | daak1 | | | laak1 | gaak1 | kaak1 | ngaak1 | haak1 | gwaak1 | | waak1 | jaak1 | chaak1 | saak1 | yaak1 |
aam | aam1 | | | | | daam1 | taam1 | naam1 | laam1 | gaam1 | kaam | ngaam1 | haam1 | | | | jaam1 | chaam1 | saam1 | |
aan | aan1 | baan1 | paan1 | maan1 | faan1 | daan1 | taan1 | naan1 | laan1 | gaan1 | kaan | ngaan1 | haan1 | gwaan1 | kwaan | waan1 | jaan1 | chaan1 | saan1 | |
aang | aang1 | baang1 | paang1 | maang1 | | daang | taang | naang | laang1 | gaang1 | | ngaang1 | haang1 | gwaang1 | kwaang1 | waang1 | jaang1 | chaang1 | saang1 | yaang |
aai | aai1 | baai1 | paai1 | maai1 | faai1 | daai1 | taai1 | naai1 | laai1 | gaai1 | kaai1 | ngaai1 | haai1 | gwaai1 | kwaai1 | waai1 | jaai1 | chaai1 | saai1 | yaai1 |
aau | aau1 | baau1 | paau1 | maau1 | faau | daau | taau | naau1 | laau | gaau1 | kaau1 | ngaau1 | haau1 | | | | jaau1 | chaau1 | saau1 | yaau |
m | m1 | | | | | | | | | | | | hm1 | | | | | | | |
ng | ng1 | | | | | | | | | | | | hng1 | | | | | | | |
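The Jyutping to Cantonese Yale conversion behind this table can be sketched roughly as follows. This is not cjklib's implementation, only a minimal illustration of the commonly described letter correspondences (z/c/j become j/ch/y, oe and eo become eu, open aa becomes a). Tones are left out, the input is not validated, and the function name `jyutping_to_yale` is made up for this example. The finals listed above as lacking a sourced mapping are rejected explicitly.

```python
# Jyutping initials; the two-letter ones come first so that "ng", "gw"
# and "kw" are preferred over "n", "g" and "k".
INITIALS = ("ng", "gw", "kw", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "w", "z", "c", "s", "j")

# Initials that are spelled differently in Cantonese Yale.
INITIAL_MAP = {"z": "j", "c": "ch", "j": "y"}

# Syllabic nasals are written the same way in both romanisations.
SYLLABIC = {"m", "ng", "hm", "hng"}

# Jyutping finals for which the post above had no sourced Yale mapping.
UNMAPPED_FINALS = {"oei", "oet", "om", "em", "ep", "et", "eu"}


def jyutping_to_yale(syllable):
    """Convert one toneless Jyutping syllable to toneless Cantonese Yale."""
    if syllable in SYLLABIC:
        return syllable

    # Split off the initial (may be empty for vowel-initial syllables).
    initial = next((i for i in INITIALS
                    if syllable.startswith(i) and len(syllable) > len(i)), "")
    final = syllable[len(initial):]

    if final in UNMAPPED_FINALS:
        raise ValueError("no Yale mapping for final -%s" % final)

    if final == "aa":
        # Long a in an open syllable is written with a single letter.
        final = "a"
    elif final.startswith(("oe", "eo")):
        # Both rounded front vowels are written "eu" in Yale.
        final = "eu" + final[2:]

    yale_initial = INITIAL_MAP.get(initial, initial)
    if yale_initial == "y" and final.startswith("yu"):
        # Jyutping jyu, jyun, jyut are written yu, yun, yut.
        yale_initial = ""
    return yale_initial + final
```

With these rules, Jyutping zoeng comes out as jeung and jyut as yut, matching the j and y columns of the table, while deu or lem raise a `ValueError` instead of producing a guessed spelling.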
cjklib comes with rewritten and more extensive unit tests
Submitted by Christoph on 28 June, 2009 - 13:11
I finally got around to one particular weakness of cjklib: motivated enough to tackle the problem of weak unit tests, I rewrote the current tests and added some more.
Test cases are now much clearer and should motivate the addition of further test cases. In fact, some previously hidden flaws have now come up. For example, the CantoneseIPAOperator class was never well tested, as its corresponding JyutpingIPAConverter class is still not implemented and it is thus not very useful so far. I cannot stress enough how important unit tests are, as some bugs I fixed yesterday were really small corner cases that are hard to find.
Test cases so far cover consistency tests for the ReadingOperator and ReadingConverter classes, and for most of them additional reference values are given, some more extensive than others. The CharacterLookup class still needs many more tests, something that should be easier now after the rewrite.
A tool new to the development chain now eases the whole testing task: nosetests can easily select tests by regular expression and at the same time create coverage and profiling information. Happy bug squashing!
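To give an idea of what a consistency test plus reference values can look like, here is a minimal sketch using the standard unittest module. `ToySyllableOperator` is a hypothetical stand-in written for this example and is not a cjklib class; it only borrows the idea of a decompose/compose pair. The test class checks that composing a decomposed string gives back the original, and compares a few decompositions against hand-written references.

```python
import re
import unittest


class ToySyllableOperator:
    """Hypothetical operator for Pinyin with tone digits (illustration only)."""

    SYLLABLE = re.compile(r"[a-z]+[1-5]")

    def decompose(self, text):
        """Split a string like 'lao3shi1' into its syllables."""
        return self.SYLLABLE.findall(text)

    def compose(self, syllables):
        """Join syllables back into one string."""
        return "".join(syllables)


class OperatorConsistencyTest(unittest.TestCase):
    # Inputs for the round-trip check.
    SAMPLES = ["lao3shi1", "guo2yu3", "ma5"]
    # Hand-checked reference decompositions.
    REFERENCES = {
        "lao3shi1": ["lao3", "shi1"],
        "guo2yu3": ["guo2", "yu3"],
    }

    def setUp(self):
        self.operator = ToySyllableOperator()

    def test_compose_inverts_decompose(self):
        for text in self.SAMPLES:
            syllables = self.operator.decompose(text)
            self.assertEqual(self.operator.compose(syllables), text)

    def test_decompose_matches_references(self):
        for text, expected in self.REFERENCES.items():
            self.assertEqual(self.operator.decompose(text), expected)
```

Saved in a file whose name matches the usual test pattern, a class like this should be picked up by `python -m unittest` as well as by nosetests, where options such as `-m` (match by regular expression) and `--with-coverage` can then be used to select tests and collect coverage.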
(Natural) language in the world of programming
Submitted by Christoph on 20 June, 2009 - 18:15
When it comes to writing code, directives and commands are dictated by the programming language (e.g. if ... then ... else), which for most programming languages means English [1], but when it comes to writing comments the programmer is free to choose which language to use.
Well, it seems that English unites the programming world, and famous Hackers like Eric S. Raymond advise every beginning Hacker to first gain a good command of English [2]. Learning a programming language which employs keywords taken from English or any other language, though, is entirely possible without knowing that language itself.
Now, I've read Raymond's view on learning English [2] and I just re-read the "Style Guide for Python Code" [3]. The latter states:
"Python coders from non-English speaking countries: please write your comments in English, unless you are 120% sure that the code will never be read by people who don't speak your language."
I have to say I honestly disagree with both. I don't doubt the importance of speaking the same language for communication, and I agree that English is most likely the language to choose, but the fact that both want to tell the programmer which language to use is a sign of ignorance towards speakers of other languages: programming is in no way different from any other area where people individually decide which language is appropriate, and have been doing so for ages.
Don't tell us which language we should use; we should know best.
You might argue "what's the point anyway" and "we all know it finally boils down to English", but I believe at this level you should accept that people want to use the language they think is most appropriate, and either way choosing one language means excluding others, no matter which one it finally is.
[1] Python actually has a "translation" to Chinese which translates reserved keywords and built in types and allows Chinese variable names: http://sourceforge.net/projects/chinesepython
[2] http://www.catb.org/~esr/faqs/hacker-howto.html#skills4
[3] http://www.python.org/dev/peps/pep-0008/
