How to transcribe Chinese characters into Pinyin, or: The Mayor of Nanjing

Often one would like to transcribe Chinese characters into Pinyin automatically.

How to do it, or rather: how not to do it?

Take the following Chinese joke:

A问:南京市市长是不是叫江大桥?
B答:不是。
A说:那我坐火车在南京过江的时候怎么看到一个广告牌上写着:南京市长江大桥欢迎您!

Translation

A asks: Isn't the mayor of Nanjing called Jiang Daqiao?
B answers: No.
A says: Then how come, when I crossed the Yangtze River at Nanjing by train, I saw a billboard saying "The mayor of Nanjing, Jiang Daqiao, welcomes you!"?

This joke is in fact a common example of the ambiguity of Chinese word segmentation. Chinese doesn't use whitespace to mark word breaks (the term "word" itself being difficult to define in Chinese, if not in any language), so the boundaries have to be deduced from context.

In this joke the play is on 南京市长江大桥, which can mean 1) the mayor of Nanjing, "Jiang Daqiao" (南京 市长 江大桥), or 2) Nanjing Yangtze River Bridge (南京市 长江大桥). The terms in parentheses are separated by spaces to show the word boundaries. So if you still didn't get the joke: the sign says "The Nanjing Yangtze River Bridge welcomes you".
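To make the ambiguity concrete, here is a minimal sketch in Python that lists every way to cut the string into known words. The word list LEXICON is a hypothetical, hand-made one containing just enough entries for this sentence; a real segmenter would use a full dictionary and some way of ranking the results.

```python
# Hypothetical mini-lexicon, assumed for illustration only.
LEXICON = {"南京", "南京市", "市长", "长江", "大桥", "江大桥", "长江大桥"}

def segmentations(text, lexicon):
    """Return every way to split text into words found in lexicon."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            # Keep this word and segment the remainder recursively.
            for rest in segmentations(text[end:], lexicon):
                results.append([word] + rest)
    return results

for words in segmentations("南京市长江大桥", LEXICON):
    print(" ".join(words))
# 南京 市长 江大桥
# 南京市 长江 大桥
# 南京市 长江大桥
```

Both the mayor and the bridge come out as valid segmentations; nothing in the string itself says which one is meant.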

Well, this joke confuses people, and it will definitely confuse computers.

A search across different annotation websites (search results from del.icio.us) turns up different solutions:

  1. Nánjīngshì (南京市) chángjiāng (长江) dà (大) qiáo (桥)
  2. [南na1;nan2] 京jing1 市shi4 [长chang2;zhang3] 江jiang1 [大da4;dai4] 桥qiao2

Neither is satisfying. The first gives a nice segmentation but omits the second possibility; the second gives a character-by-character transcription that lists every reading there is.
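A rough Python sketch shows why the character-by-character approach helps so little. The table READINGS is simply copied from the second result above (tone numbers included), and the function name is made up for illustration; this is not how either website actually works.

```python
from itertools import product

# Per-character readings, taken from the second result above.
READINGS = {
    "南": ["na1", "nan2"],
    "京": ["jing1"],
    "市": ["shi4"],
    "长": ["chang2", "zhang3"],
    "江": ["jiang1"],
    "大": ["da4", "dai4"],
    "桥": ["qiao2"],
}

def all_transcriptions(text):
    """Return every combination of per-character readings for text."""
    choices = [READINGS[ch] for ch in text]
    return [" ".join(combo) for combo in product(*choices)]

candidates = all_transcriptions("南京市长江大桥")
print(len(candidates))  # 8
```

Seven characters already yield eight candidate transcriptions, and the reader still has to pick the right one by hand.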

Someone do this better please!

Addendum: I should have pointed out that the ambiguity in this case lies not only in the segmentation but also in the character 长, which in the first case (the mayor) is pronounced zhǎng (chief, head) and in the second case (the bridge) is pronounced cháng (long, length). Chinese characters can have not only multiple meanings but also multiple pronunciations.
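One way to picture this: once the word boundaries are fixed, the reading of 长 follows from the word it belongs to. The Python sketch below assumes a hypothetical word-level table WORD_PINYIN with only the entries needed here, and takes the two segmentations as given rather than computing them.

```python
# Hypothetical word-to-Pinyin table, assumed for illustration only.
WORD_PINYIN = {
    "南京": "Nánjīng",
    "南京市": "Nánjīngshì",
    "市长": "shìzhǎng",               # 长 read as zhǎng (chief, head)
    "江大桥": "Jiāng Dàqiáo",
    "长江大桥": "Chángjiāng Dàqiáo",  # 长 read as cháng (long)
}

def transcribe(words):
    """Join the word-level Pinyin of an already segmented sentence."""
    return " ".join(WORD_PINYIN[w] for w in words)

mayor = transcribe(["南京", "市长", "江大桥"])
bridge = transcribe(["南京市", "长江大桥"])
print(mayor)   # Nánjīng shìzhǎng Jiāng Dàqiáo
print(bridge)  # Nánjīngshì Chángjiāng Dàqiáo
```

A tool that wanted to be honest about the ambiguity could print both lines instead of silently choosing one.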