Segmenting Pinyin through regular expressions
While experimenting with segmenting strings written in Pinyin, I came up with a regular expression (regex) that does the job.
It's important to respect the vowels a, e, o, which can start a syllable without an initial consonant and may then be preceded by an apostrophe. Furthermore, it is important to know which vowel combinations count as one syllable and which as more. For example, aa corresponds to two syllables (two characters), whereas ai corresponds to only one.
There are two final sounds, n and ng, and in both cases the n or the g may instead be the initial sound of the following syllable.
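To see why this matters, here is a minimal sketch on a reduced pattern (a bare a plus an optional nasal final, chosen only for illustration; the names greedy, guarded and first_syllable are hypothetical). A greedy pattern swallows the n or ng, while a negative look-ahead like the one used in the full regex leaves it for the next syllable when a, e or o follows:

```python
import re

# Reduced patterns for illustration only: a bare "a" plus an optional nasal final.
greedy = re.compile(r"a(?:ng|n)?")
guarded = re.compile(r"a(?:(?:ng|n)(?![aeo]))?")

def first_syllable(pattern, text):
    """Return what the pattern consumes at the start of text."""
    return pattern.match(text).group()

# "anan": the greedy pattern takes "an"; the guarded one stops after "a"
# because the n is followed by a vowel and may start the next syllable.
print(first_syllable(greedy, "anan"), first_syllable(guarded, "anan"))

# "angan": "ang" versus "an", since the g can start the next syllable.
print(first_syllable(greedy, "angan"), first_syllable(guarded, "angan"))
```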
The whole complex regex, which ignores tone marks:
(?:(?:(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)(?:i(?:ao|[uae])?|u(?:ai|[iaeo])?|üe?))|
(?:(?:(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)|')?(?:a[oi]?|ei?|ou?)))(?:(?:ng|n)(?![aeo]))?
We can break it down into major parts first:
(?:
- First, match syllables whose vowel group starts with i, u or ü. Any consonant except v can appear; sh, ch and zh are initial sounds made of two consonants. Make sure that vowel combinations like ii or üa cannot occur:
(?:(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)(?:i(?:ao|[uae])?|u(?:ai|[iaeo])?|üe?))
|
- Then match syllables whose vowel group starts with a, e or o. Consonants and vowel combinations work as above, but an apostrophe may take the place of the initial consonant:
(?:(?:(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)|')?(?:a[oi]?|ei?|ou?))
)
- Finally, match the finals n and ng, but only if none of the vowels a, e, o follows:
(?:(?:ng|n)(?![aeo]))?
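The breakdown above can also be written as named parts that are joined back into the full pattern. This is only a sketch in Python 3; the names INITIAL, I_U_GROUP, A_E_O_GROUP, NASAL_FINAL and syllable_re are made up for illustration, and the resulting pattern is intended to be equivalent to the regex shown above:

```python
import re

# Initials: any single consonant except v, or the digraphs zh, ch, sh.
INITIAL = r"(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)"

# Part 1: an initial followed by a vowel group starting with i, u or ü.
I_U_GROUP = INITIAL + r"(?:i(?:ao|[uae])?|u(?:ai|[iaeo])?|üe?)"

# Part 2: an optional initial or apostrophe, then a vowel group starting with a, e or o.
A_E_O_GROUP = r"(?:" + INITIAL + r"|')?(?:a[oi]?|ei?|ou?)"

# Part 3: an optional final n or ng, unless a, e or o follows.
NASAL_FINAL = r"(?:(?:ng|n)(?![aeo]))?"

syllable_re = re.compile(
    r"(?:" + I_U_GROUP + r"|" + A_E_O_GROUP + r")" + NASAL_FINAL
)

# findall returns whole matches here because the pattern has no capturing group.
print(syllable_re.findall("tian'anmen"))
```

Keeping the parts separate makes it easier to change one rule, for example the set of initials, without touching the rest.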
This regex works on the "garbage in, garbage out" principle: syllables that can't occur in Pinyin will not necessarily be rejected. Furthermore, I can't guarantee that this regex is free of errors, so you might want to test it yourself before using it.
Here's the code in Python:
>>> import re
>>> decomp = re.compile(u"((?:(?:(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)" \
... + u"(?:i(?:ao|[uae])?|u(?:ai|[iaeo])?|üe?))" \
... + u"|(?:(?:(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)|')?(?:a[oi]?|ei?|ou?)))" \
... + u"(?:(?:ng|n)(?![aeo]))?)")
>>> decomp.split(u"changan")
[u'', u'chan', u'', u'gan', u'']
>>> decomp.split(u"chang'an")
[u'', u'chang', u'', u"'an", u'']
>>> decomp.split(u"tiananmen")
[u'', u'tia', u'', u'nan', u'', u'men', u'']
>>> decomp.split(u"tian'anmen")
[u'', u'tian', u'', u"'an", u'', u'men', u'']
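The empty strings in these results come from split() returning the text between matches as well as the captured syllables. A small wrapper can tidy this up. The following is a sketch in Python 3, not part of the original code; the function name segment and the choice to strip apostrophes and raise ValueError on leftover text are assumptions:

```python
import re

decomp = re.compile(
    "((?:(?:(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)"
    "(?:i(?:ao|[uae])?|u(?:ai|[iaeo])?|üe?))"
    "|(?:(?:(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)|')?(?:a[oi]?|ei?|ou?)))"
    "(?:(?:ng|n)(?![aeo]))?)"
)

def segment(text):
    """Return the syllables found in text, without apostrophes."""
    parts = decomp.split(text)
    # Odd indices hold the captured syllables, even indices the text in between.
    leftovers = [p for p in parts[0::2] if p]
    if leftovers:
        # Hypothetical error handling: anything the regex skipped is reported.
        raise ValueError("could not segment: %r" % leftovers)
    return [p.lstrip("'") for p in parts[1::2]]

print(segment("tian'anmen"))
print(segment("tiananmen"))
```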
Chomsky Type-3
A regular grammar is equivalent to a finite state automaton, so applying a pure regular expression to a Pinyin string won't do the trick. In the example above I already used a look-ahead for the final sounds 'n' and 'ng'. 'er' will need a vowel look-ahead as well: erao (a rather unlikely example) should be split into e'rao rather than er'ao according to the apostrophe rules.
It is questionable whether all possible Pinyin strings can be detected simply by a regular expression with look-ahead assertions.
One more thing: syllable er
I forgot to include the one special syllable er, the only one ending in a consonant other than n or ng. Characters like 而,儿,二,尔,耳 are pronounced this way. But if you understood the regex above, it should be easy for you to add this syllable ^_^
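One possible way to add er is sketched below in Python 3. It is an assumption, not a tested extension of the regex: er gets its own alternative, tried before the others, and a negative look-ahead refuses the match when a vowel follows, so that the r can start the next syllable. The names INITIAL, BASE, ER and with_er are hypothetical.

```python
import re

INITIAL = r"(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)"

# The regex from above, rebuilt from its parts.
BASE = (
    r"(?:" + INITIAL + r"(?:i(?:ao|[uae])?|u(?:ai|[iaeo])?|üe?)"
    r"|(?:" + INITIAL + r"|')?(?:a[oi]?|ei?|ou?))"
    r"(?:(?:ng|n)(?![aeo]))?"
)

# Assumed extra alternative: "er", optionally after an apostrophe, counts as a
# syllable only when no vowel follows; otherwise the r belongs to the next one.
ER = r"'?er(?![aeiouü])"

# ER comes first so that it is tried before the plain "e" of the base pattern.
with_er = re.compile(ER + r"|" + BASE)

print(with_er.findall("erzi"))
print(with_er.findall("erao"))
```

The er alternative is deliberately kept outside the part that accepts a trailing n or ng, so that a following n is not attached to er.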
Further improvements
The regex can be improved further so that it accepts less invalid Pinyin input.
E.g. the vowels a, e, o never directly follow x, j or q, so the second part (the a, e, o alternative) can exclude these three consonants.
ü and üe can only follow l and n, so these can be given a separate alternative and be removed from the first part.
In fact, many more alternatives joined by | could be added to deal with the different patterns. The more rules there are, though, the more the regex will look like a big database of all possible syllables.
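The two improvements mentioned above could look roughly like this. It is a sketch in Python 3 under the stated rules only; the names INITIAL_NO_JQX, NASAL, original and stricter are hypothetical, and the result still accepts many invalid syllables:

```python
import re

INITIAL = r"(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)"
# The same initials without j, q, x, which a, e, o never follow directly.
INITIAL_NO_JQX = r"(?:[bcdfghklmnprstwyz]|[zcs]h)"
NASAL = r"(?:(?:ng|n)(?![aeo]))?"

# The regex from above, rebuilt from its parts for comparison.
original = re.compile(
    r"(?:" + INITIAL + r"(?:i(?:ao|[uae])?|u(?:ai|[iaeo])?|üe?)"
    r"|(?:" + INITIAL + r"|')?(?:a[oi]?|ei?|ou?))" + NASAL
)

stricter = re.compile(
    # ü and üe get their own alternative and are allowed only after l and n.
    r"(?:[ln]üe?"
    # i and u groups as before, with the ü branch removed.
    r"|" + INITIAL + r"(?:i(?:ao|[uae])?|u(?:ai|[iaeo])?)"
    # a, e, o groups, no longer reachable from j, q, x.
    r"|(?:" + INITIAL_NO_JQX + r"|')?(?:a[oi]?|ei?|ou?))" + NASAL
)

# fullmatch checks a single candidate syllable from start to end.
for candidate in ["xa", "bü", "lüe", "xia"]:
    print(candidate, bool(original.fullmatch(candidate)), bool(stricter.fullmatch(candidate)))
```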