cjklib.reading.operator.PinyinOperator is a complete implementation of the standard Chinese Pinyin romanisation (Hanyu Pinyin Fang’an, 汉语拼音方案, standardised in ISO 7098).
Features:
Pinyin syllables need to be separated by an apostrophe where their decomposition would otherwise be ambiguous. A well-known example is the city Xi’an, which written as xian would be read as one syllable, meaning e.g. ‘fresh’. Another example is Chang’an, which without a delimiter could also be read as chan’gan.
Different rules exist for where to place apostrophes. A simple yet sufficient rule is implemented in aeoApostropheRule(), which is used by default in this class: syllables starting with one of the three vowels a, e, o are separated from a preceding syllable. Remember that the vowels [i], [u], [y] are written yi, wu, yu respectively at the start of a syllable, which keeps those syllable boundaries clear. compose() places apostrophes where required when composing the reading string.
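The rule described above can be sketched in a few lines of plain Python. This is a simplified illustration and not cjklib's implementation: the names needs_apostrophe and compose_syllables are hypothetical, and the special cases the library handles in addition (n, ng and e’r) are left out.

```python
def needs_apostrophe(preceding, following):
    """Simplified a/e/o rule (hypothetical helper, not the cjklib API):
    a syllable starting with a, e or o is separated from the one before it."""
    return bool(preceding) and following[:1].lower() in ('a', 'e', 'o')


def compose_syllables(syllables):
    """Join syllables, inserting an apostrophe wherever the rule asks for one."""
    parts = []
    previous = None
    for syllable in syllables:
        if previous is not None and needs_apostrophe(previous, syllable):
            parts.append("'")
        parts.append(syllable)
        previous = syllable
    return ''.join(parts)


print(compose_syllables(['xi', 'an']))     # xi'an
print(compose_syllables(['chang', 'an']))  # chang'an
print(compose_syllables(['bei', 'jing']))  # beijing
```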
An alternative rule can be passed to the constructor as a function via the option pinyinApostropheFunction. One possible function is a rule that separates all syllables by an apostrophe, which simplifies reading for beginners.
On decomposition of strings it is important to check which of the possibly several segmentations is the one actually meant. E.g. the syllable xian given above should always be segmented as one syllable; the solution xi’an is not an option in this case. An alternative to aeoApostropheRule() should therefore guarantee proper decomposition, which is tested through isStrictDecomposition().
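To see why several segmentations can compete, the following sketch enumerates every way of splitting an undelimited string into known syllables. The syllable set is a tiny hypothetical subset chosen for the two examples, and segmentations is an illustrative helper, not part of cjklib.

```python
# Hypothetical, deliberately tiny syllable table; real Pinyin has ~400 entries.
SYLLABLES = {'xi', 'an', 'xian', 'chang', 'chan', 'gan'}


def segmentations(text, syllables=SYLLABLES):
    """Return all ways to split text into a sequence of known syllables."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            for tail in segmentations(text[end:], syllables):
                results.append([head] + tail)
    return results


print(segmentations('xian'))     # [['xi', 'an'], ['xian']]
print(segmentations('changan'))  # [['chan', 'gan'], ['chang', 'an']]
```

Under a strict reading, only the candidates that would not have required an apostrophe remain valid, which leaves xian and chan gan respectively.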
Last but not least, compose(decompose(string)) will only be the identity if apostrophes are applied properly according to the rule, as wrongly placed apostrophes are kept when composing. Use removeApostrophes() to remove separating apostrophes.
>>> def noToneApostropheRule(opInst, precedingEntity, followingEntity):
...     return precedingEntity and precedingEntity[0].isalpha() \
...         and not precedingEntity[-1].isdigit() \
...         and followingEntity[0].isalpha()
...
>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> f.convert('an3ma5mi5ba5ni2mou1', 'Pinyin', 'Pinyin',
...     sourceOptions={'toneMarkType': 'numbers'},
...     targetOptions={'toneMarkType': 'numbers',
...         'missingToneMark': 'fifth',
...         'pinyinApostropheFunction': noToneApostropheRule})
u"an3ma'mi'ba'ni2mou1"
The phenomenon Erhua (兒化音/儿化音, Erhua yin), i.e. the r-colouring of syllables, is found in the northern Chinese dialects and results from merging the formerly independent sound er with the preceding syllable. In written form a word is followed by the character 兒/儿, e.g. 頭兒/头儿.
In Pinyin the Erhua sound is quite often expressed by appending a single r to the syllable of the character preceding 兒/儿, e.g. tóur for 頭兒/头儿, to stress the monosyllabic nature and in contrast to words like 兒子/儿子 ér’zi where 兒/儿 ér constitutes a single syllable.
When decomposing Pinyin it is thus important to decide whether the r marking r-colouring should be an entity of its own, reflecting the separate character in the character string, or part of the syllable of the preceding character, reflecting the monosyllabic nature. This can be configured at instantiation time. By default the two-syllable form is chosen, which is more general as both spellings are allowed: banr and ban r (one without delimiter, one with; both are two entities in this representation).
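The two-entity view can be illustrated with a small sketch that splits a trailing r off a plain (toneless) syllable while leaving the genuine syllable er intact. The function split_erhua is a hypothetical helper for illustration only; how cjklib performs this internally is not shown here.

```python
def split_erhua(plain_entity):
    """Split an Erhua form such as 'banr' into ['ban', 'r'].

    Assumes plain, toneless input. The syllable 'er' and a lone 'r' are
    returned unchanged, since their r is not an appended r-colouring.
    """
    lowered = plain_entity.lower()
    if len(lowered) > 1 and lowered.endswith('r') and lowered != 'er':
        return [plain_entity[:-1], plain_entity[-1]]
    return [plain_entity]


print(split_erhua('banr'))  # ['ban', 'r']
print(split_erhua('tour'))  # ['tou', 'r']
print(split_erhua('er'))    # ['er']
```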
Tone marks, if using the standard form with diacritics, are placed according to official Pinyin rules. By default the PinyinOperator nevertheless tries to work around misplaced tone marks, e.g. *tīan’ānmén (correct: tiān’ānmén), to ease handling of malformed input. There are cases, though, where this generous behaviour leads to a different segmentation compared to the strict interpretation, as for *hónglùo, which can be segmented into hóng *lùo (correct: hóng luò) or hóng lù o (also, using the first example, tī an ān mén). As the latter result also stems from a wrong transcription, no means are implemented to disambiguate between both solutions. The general behaviour is controlled with the option 'strictDiacriticPlacement'.
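As a reminder of what correct placement means, here is a minimal sketch of the commonly taught placement rule: a or e takes the mark if present, otherwise the o of ou, otherwise the last vowel. The helper place_tone_mark is hypothetical and written from that rule, not taken from cjklib; it assumes lowercase plain syllables and leaves vowel-less syllables unsupported.

```python
import unicodedata

# Combining diacritics for tones 1-4: macron, acute, caron, grave.
TONE_DIACRITICS = {1: '\u0304', 2: '\u0301', 3: '\u030c', 4: '\u0300'}
VOWELS = 'aeiou\u00fc'  # \u00fc is the letter u with diaeresis


def place_tone_mark(plain_syllable, tone):
    """Return the syllable with the tone diacritic on the expected vowel."""
    if tone not in TONE_DIACRITICS:
        return plain_syllable  # e.g. neutral tone: no diacritic
    if 'a' in plain_syllable:
        position = plain_syllable.index('a')
    elif 'e' in plain_syllable:
        position = plain_syllable.index('e')
    elif 'ou' in plain_syllable:
        position = plain_syllable.index('ou')
    else:
        positions = [i for i, ch in enumerate(plain_syllable) if ch in VOWELS]
        if not positions:
            raise ValueError('no vowel in %r' % plain_syllable)
        position = positions[-1]
    marked = (plain_syllable[:position + 1] + TONE_DIACRITICS[tone]
              + plain_syllable[position + 1:])
    # Compose base letter and diacritic into a single code point where possible.
    return unicodedata.normalize('NFC', marked)


print(place_tone_mark('tian', 1))
print(place_tone_mark('luo', 4))
```

The two calls print the correctly marked forms of the examples above, with the mark on a in tian and on o in luo.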
Pinyin allows the letter pairs ng, zh, ch and sh to be shortened to ŋ, ẑ, ĉ and ŝ. This behaviour is controlled by the option 'shortenedLetters'.
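The substitution itself is a plain letter-pair mapping. The sketch below applies it to one lowercase syllable at a time, which avoids accidentally merging an n and a g that belong to different syllables. Both helper names are hypothetical and do not reflect how cjklib implements the option.

```python
# Letter pairs and their shortened single-letter forms.
SHORT_FORMS = (
    ('zh', '\u1e91'),  # z with circumflex
    ('ch', '\u0109'),  # c with circumflex
    ('sh', '\u015d'),  # s with circumflex
    ('ng', '\u014b'),  # eng
)


def shorten_syllable(syllable):
    """Replace zh, ch, sh and ng in a single lowercase syllable."""
    for pair, short in SHORT_FORMS:
        syllable = syllable.replace(pair, short)
    return syllable


def expand_syllable(syllable):
    """Inverse of shorten_syllable."""
    for pair, short in SHORT_FORMS:
        syllable = syllable.replace(short, pair)
    return syllable


print(expand_syllable(shorten_syllable('zhong')))  # zhong
```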
Bases: cjklib.reading.operator.TonalRomanisationOperator
Provides an operator for the Mandarin romanisation Hanyu Pinyin. It can be configured to cope with different representations (“dialects”) of Pinyin. For conversion between different representations the PinyinDialectConverter can be used.
Parameters: |
|
---|
Checks if the given entities need to be separated by an apostrophe.
Returns true for syllables starting with one of the three vowels a, e, o that have a preceding syllable. Additionally, the forms n and ng are separated from preceding syllables. Furthermore, the corner case e’r is handled to distinguish it from er.
This function serves as the default apostrophe rule.
Parameters: |
|
---|---|
Return type: | bool |
Returns: | true if the syllables need to be separated, false otherwise |
Composes the given list of basic entities to a string. Applies an apostrophe between syllables if needed using default implementation aeoApostropheRule().
Parameter: | readingEntities (list of str) – list of basic syllables or other content |
---|---|
Return type: | str |
Returns: | composed entities |
Converts the alternative syllable representation from the current dialect to the given target, or by default to the standard representation. Erhua forms will not be converted.
Use the PinyinDialectConverter for conversions in general.
Parameters: |
|
---|---|
Return type: | str |
Returns: | converted entity |
Splits the given plain syllable into onset (initial) and rhyme (final).
Pinyin can’t be separated into onset and rhyme clearly within its own system. There are syllables with the same final written differently (e.g. wei and dui, both ending in a final that can be described by uei) and there is reduction of vowels (same example: dui, which is pronounced with the vowels uei). This method uses three forms not found as substrings in Pinyin (uei, uen and iou) and substitutes the (pseudo) initials w and y with their vowel equivalents.
Furthermore final i will be distinguished in three forms given by the following three examples: yi, zhi and zi to express phonological difference.
Returned strings will be lowercase.
Parameter: | plainSyllable (str) – syllable without tone marks |
---|---|
Return type: | tuple of str |
Returns: | tuple of entity onset and rhyme |
Raises InvalidEntityError: | if the entity is invalid. |
Raises UnsupportedError: | for entity r when Erhua is handled as separate entity. |
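The substitutions described above can be sketched as follows. This is an approximation written from the description, not cjklib's code: split_onset_rhyme is a hypothetical name, input validation is omitted, and the three-way distinction of final i (yi, zhi, zi) is not modelled.

```python
# Two-letter initials come first so that 'zh' wins over 'z', and so on.
INITIALS = ('zh', 'ch', 'sh', 'b', 'p', 'm', 'f', 'd', 't', 'n', 'l',
            'g', 'k', 'h', 'j', 'q', 'x', 'r', 'z', 'c', 's')
# Written finals that stand for a longer underlying form.
EXPANDED_FINALS = {'ui': 'uei', 'un': 'uen', 'iu': 'iou'}
U_UMLAUT = '\u00fc'


def split_onset_rhyme(plain_syllable):
    """Split a toneless syllable into (onset, rhyme), returned in lowercase."""
    syllable = plain_syllable.lower()
    # Pseudo initials y and w are replaced by their vowel equivalents.
    if syllable.startswith('yu'):
        return '', U_UMLAUT + syllable[2:]
    if syllable.startswith('yi'):
        return '', syllable[1:]
    if syllable.startswith('y'):
        return '', 'i' + syllable[1:]
    if syllable.startswith('wu'):
        return '', syllable[1:]
    if syllable.startswith('w'):
        return '', 'u' + syllable[1:]
    onset, rhyme = '', syllable
    for initial in INITIALS:
        if syllable.startswith(initial):
            onset, rhyme = initial, syllable[len(initial):]
            break
    # After j, q and x a written u stands for the vowel of yu.
    if onset in ('j', 'q', 'x') and rhyme.startswith('u'):
        rhyme = U_UMLAUT + rhyme[1:]
    return onset, EXPANDED_FINALS.get(rhyme, rhyme)


print(split_onset_rhyme('dui'))  # ('d', 'uei')
print(split_onset_rhyme('wei'))  # ('', 'uei')
print(split_onset_rhyme('you'))  # ('', 'iou')
```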
Gets the list of plain entities supported by this reading. Unlike those returned by getReadingEntities(), the entities carry no tone mark.
Depending on the type of Erhua support either additional syllables with an ending -r are added, or a single r is included. The user specified character for vowel ü will be used.
Return type: | set of str |
---|---|
Returns: | set of supported syllables |
Takes a string written in Pinyin and guesses the reading dialect.
The basic options 'toneMarkType', 'pinyinDiacritics', 'yVowel', 'erhua', 'pinyinApostrophe' and 'shortenedLetters' are guessed. Unless 'includeToneless' is set to True only the tone mark types 'diacritics' and 'numbers' are considered as the latter one can also represent the state of missing tones. Strings tested for 'yVowel' are ü, v and u:. 'erhua' is set to 'twoSyllables' by default and only tested when 'toneMarkType' is assumed to be set to 'numbers'.
Parameters: |
|
---|---|
Return type: | dict |
Returns: | dictionary of basic keyword settings |
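A rough idea of how two of these options might be guessed from a string is sketched below. The heuristics and function names are assumptions made for illustration; they are much cruder than whatever the library does and cover only 'toneMarkType' and 'yVowel'.

```python
import unicodedata

# Combining marks used for tones 1-4: macron, acute, caron, grave.
TONE_COMBINING_MARKS = {'\u0304', '\u0301', '\u030c', '\u0300'}
DIGITS = '12345'
Y_VOWEL_CANDIDATES = ('\u00fc', 'u:', 'v')


def guess_tone_mark_type(text):
    """Return 'diacritics', 'numbers', or None if neither is evident."""
    decomposed = unicodedata.normalize('NFD', text)
    if any(ch in TONE_COMBINING_MARKS for ch in decomposed):
        return 'diacritics'
    # A tone digit directly after a letter (or after the colon of "u:").
    for previous, current in zip(text, text[1:]):
        if current in DIGITS and (previous.isalpha() or previous == ':'):
            return 'numbers'
    return None


def guess_y_vowel(text):
    """Return the representation of the y vowel found in text, if any."""
    decomposed = unicodedata.normalize('NFD', text)
    without_tones = ''.join(
        ch for ch in decomposed if ch not in TONE_COMBINING_MARKS)
    lowered = unicodedata.normalize('NFC', without_tones).lower()
    for candidate in Y_VOWEL_CANDIDATES:
        if candidate in lowered:
            return candidate
    return None


print(guess_tone_mark_type('an3ma5mi5'))  # numbers
print(guess_y_vowel('nu:3'))              # u:
```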
Checks if the given decomposition follows the Pinyin format strictly for unambiguous decomposition: syllables have to be preceded by an apostrophe if the decomposition would be ambiguous otherwise.
The function given as option 'pinyinApostropheFunction' is used to check whether an apostrophe should have been placed.
Parameter: | readingEntities (list of str) – decomposed reading string |
---|---|
Return type: | bool |
Returns: | true if decomposition is strict, false otherwise |
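The check can be pictured as a scan over neighbouring entities: whenever two syllables sit directly next to each other and the rule says they need separating, the decomposition is not strict. The sketch below uses a simplified a/e/o rule and hypothetical names, and it assumes entities without tone numbers.

```python
def aeo_rule(preceding, following):
    """Simplified stand-in for the default rule (illustration only)."""
    return following[:1].lower() in ('a', 'e', 'o')


def is_strict(entities, rule=aeo_rule):
    """True if no apostrophe is missing between adjacent syllables."""
    for previous, current in zip(entities, entities[1:]):
        both_syllables = previous.isalpha() and current.isalpha()
        if both_syllables and rule(previous, current):
            return False
    return True


print(is_strict(['xi', 'an']))       # False
print(is_strict(['xi', "'", 'an']))  # True
print(is_strict(['xian']))           # True
```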
Removes apostrophes between two syllables for a given decomposition.
Parameter: | readingEntities (list of str) – list of basic syllables or other content |
---|---|
Return type: | list of str |
Returns: | the given entity list without separating apostrophes |
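In list form the operation amounts to dropping every apostrophe entity that has a syllable on both sides. The following sketch shows that idea with a hypothetical helper; treating both the ASCII and the typographic apostrophe as separators is an assumption of this example.

```python
APOSTROPHES = ("'", '\u2019')


def remove_separating_apostrophes(entities):
    """Drop apostrophes that stand between two syllable-like entities."""
    cleaned = []
    for index, entity in enumerate(entities):
        between_syllables = (
            0 < index < len(entities) - 1
            and entities[index - 1][-1:].isalnum()   # may end in a tone digit
            and entities[index + 1][:1].isalpha()
        )
        if entity in APOSTROPHES and between_syllables:
            continue
        cleaned.append(entity)
    return cleaned


print(remove_separating_apostrophes(['tian', "'", 'an', 'men']))
# ['tian', 'an', 'men']
```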
Splits the entity into an entity without tone mark and the entity’s tone index.
The plain entity returned will always be in Unicode’s Normalization Form C (NFC, see http://www.unicode.org/reports/tr15/).
Parameter: | entity (str) – entity with tonal information |
---|---|
Return type: | tuple |
Returns: | plain entity without tone mark and entity’s tone index (starting with 1) |
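The role of NFC here becomes clearer with a small sketch: a tone diacritic is easiest to remove after decomposing the entity (NFD), and the remainder is then recomposed so that a letter such as ü comes back as a single code point. The helper below is hypothetical, handles both tone numbers and diacritics, and reports tone 5 when no mark is found, which is an assumption of this example.

```python
import unicodedata

# Combining marks mapped to the tone index they denote.
TONE_OF_MARK = {'\u0304': 1, '\u0301': 2, '\u030c': 3, '\u0300': 4}


def split_entity_tone(entity, missing_tone=5):
    """Return (plain entity in NFC, tone index)."""
    if entity[-1:] in ('1', '2', '3', '4', '5'):
        return unicodedata.normalize('NFC', entity[:-1]), int(entity[-1])
    tone = missing_tone
    kept = []
    for ch in unicodedata.normalize('NFD', entity):
        if ch in TONE_OF_MARK:
            tone = TONE_OF_MARK[ch]
        else:
            kept.append(ch)
    return unicodedata.normalize('NFC', ''.join(kept)), tone


print(split_entity_tone('ni2'))  # ('ni', 2)
print(split_entity_tone('ma'))   # ('ma', 5)
```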