PinyinOperator — Hanyu Pinyin

cjklib.reading.operator.PinyinOperator is a complete implementation of the standard Chinese Pinyin romanisation (Hanyu Pinyin Fang’an, 汉语拼音方案, standardised in ISO 7098).

Features:

  • tones marked by either diacritics or numbers,
  • flexible handling of misplaced tone marks on input,
  • flexible handling of wrong diacritics (e.g. breve instead of caron),
  • correct placement of apostrophes to separate syllables,
  • alternative representation of ü-character,
  • alternative shortened letters ŋ, ẑ, ĉ, ŝ,
  • guessing of input form (reading dialect),
  • support for Erhua and
  • splitting of syllables into onset and rhyme.

Specifics

Apostrophes

Pinyin syllables need to be separated by an apostrophe where their decomposition would otherwise be ambiguous. A well-known example is the city Xi’an, which if written xian would be read as one syllable, meaning e.g. ‘fresh’. Another example is Chang’an, which could be read as chan’gan if no delimiter were used in at least one of the two cases.

Different rules exist for where to place apostrophes. A simple yet sufficient rule is implemented in aeoApostropheRule(), which is used as the default in this class. Syllables starting with one of the three vowels a, e, o will be separated from a preceding syllable. Remember that the vowels [i], [u], [y] are written as yi, wu, yu respectively when they start a syllable, which makes those syllable boundaries clear. compose() will place apostrophes where required when composing the reading string.
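The a/e/o rule described above can be sketched without cjklib as follows. This is only an illustration of the rule as stated here, not the library's code: the names needs_apostrophe and compose_syllables are hypothetical, and the additional cases the default rule handles (forms n and ng, the corner case e’r) are left out.

```python
import unicodedata


def needs_apostrophe(preceding, following):
    """Hypothetical helper: True if `following` starts with a, e or o
    and there is a preceding syllable."""
    if not preceding or not following:
        return False
    # Decompose so that a vowel carrying a tone diacritic is reduced to
    # its base letter before the comparison.
    first = unicodedata.normalize("NFD", following)[0].lower()
    return first in ("a", "e", "o")


def compose_syllables(syllables):
    """Join syllables, inserting an apostrophe where the rule asks for one."""
    parts = []
    for syllable in syllables:
        if parts and needs_apostrophe(parts[-1], syllable):
            parts.append("'")
        parts.append(syllable)
    return "".join(parts)


print(compose_syllables(["xi", "an"]))     # xi'an
print(compose_syllables(["chang", "an"]))  # chang'an
print(compose_syllables(["bei", "jing"]))  # beijing
```

A rule with the same two-argument shape could separate every pair of syllables instead, which is the beginner-friendly variant mentioned in the text.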

An alternative rule can be given to the constructor by passing a function as option pinyinApostropheFunction. One possible function is a rule that separates all syllables by an apostrophe, which simplifies reading for beginners.

On decomposition of strings it is important to check which of the possibly several choices is the one actually meant. For example, xian given above should always be segmented into one syllable; the solution xi’an is not an option in this case. An alternative to aeoApostropheRule() should therefore guarantee proper decomposition, which is tested through isStrictDecomposition().

Last but not least, compose(decompose(string)) will only be the identity if apostrophes are applied properly according to the rule, as wrongly placed apostrophes will be kept when composing. Use removeApostrophes() to remove separating apostrophes.

Example

>>> def noToneApostropheRule(opInst, precedingEntity, followingEntity):
...     return precedingEntity and precedingEntity[0].isalpha() \
...         and not precedingEntity[-1].isdigit() \
...         and followingEntity[0].isalpha()
...
>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> f.convert('an3ma5mi5ba5ni2mou1', 'Pinyin', 'Pinyin',
...     sourceOptions={'toneMarkType': 'numbers'},
...     targetOptions={'toneMarkType': 'numbers',
...         'missingToneMark': 'fifth',
...         'pinyinApostropheFunction': noToneApostropheRule})
u"an3ma'mi'ba'ni2mou1"

R-colouring

The phenomenon Erhua (兒化音/儿化音, Erhua yin), i.e. the r-colouring of syllables, is found in the northern Chinese dialects and results from merging the formerly independent sound er with the preceding syllable. In written form a word is followed by the character 兒/儿, e.g. 頭兒/头儿.

In Pinyin the Erhua sound is quite often expressed by appending a single r to the syllable of the character preceding 兒/儿, e.g. tóur for 頭兒/头儿, to stress the monosyllabic nature and in contrast to words like 兒子/儿子 ér’zi where 兒/儿 ér constitutes a single syllable.

For decomposing syllables in Pinyin it is thus important to decide whether the r marking r-colouring should be an entity of its own, reflecting the separate character in the character string, or part of the syllable of the preceding character, reflecting the monosyllabic pronunciation. This can be configured at instantiation time. By default the two-syllable form is chosen, which is more general as both forms are allowed: banr and ban r (i.e. one without delimiter, one with; both being two entities in this representation).
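The difference between the two representations can be illustrated with a small standalone sketch. The function split_erhua is hypothetical and not part of cjklib. It borrows the mode names 'twoSyllables', 'oneSyllable' and 'ignore' from option 'erhua', and it assumes plain syllables without tone marks.

```python
def split_erhua(entity, erhua="twoSyllables"):
    """Hypothetical helper: split a plain syllable carrying an Erhua -r."""
    lowered = entity.lower()
    # 'er' is a full syllable of its own and 'r' is already the bare
    # suffix, so neither counts as a syllable with an appended r.
    has_suffix = lowered.endswith("r") and lowered not in ("r", "er")
    if not has_suffix:
        return [entity]
    if erhua == "twoSyllables":
        return [entity[:-1], entity[-1]]
    if erhua == "oneSyllable":
        return [entity]
    if erhua == "ignore":
        raise ValueError("Erhua form %r is not supported" % entity)
    raise ValueError("unknown erhua mode %r" % erhua)


print(split_erhua("banr"))                 # ['ban', 'r']
print(split_erhua("banr", "oneSyllable"))  # ['banr']
print(split_erhua("er"))                   # ['er']
```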

Placement of tones

Tone marks, if using the standard form with diacritics, are placed according to official Pinyin rules. The PinyinOperator by default tries to work around misplaced tone marks though, e.g. *tīan’ānmén (correct: tiān’ānmén), to ease handling of malformed input. There are cases though, where this generous behaviour leads to a different segmentation compared to the strict interpretation, as for *hónglùo which can fall into hóng *lùo (correct: hóng luò) or hóng lù o (also, using the first example, tī an ān mén). As the latter result also stems from a wrong transcription, no means are implemented to disambiguate between both solutions. The general behaviour is controlled with option 'strictDiacriticPlacement'.
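As an illustration of the placement rule commonly given for Pinyin (a or e takes the mark, in ou the o takes it, otherwise the last vowel), here is a minimal sketch. It is independent of cjklib: place_tone_mark is a hypothetical name, the input is assumed to be a lowercase or mixed-case plain syllable in NFC form, and the syllabic nasals n and ng are not handled.

```python
import unicodedata

# Combining diacritics for tones 1 to 4: macron, acute, caron, grave.
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}
VOWELS = "aeiou\u00fc"  # \u00fc is the letter u with diaeresis


def place_tone_mark(plain_syllable, tone):
    """Hypothetical helper: add the diacritic for `tone` to a plain syllable."""
    if tone not in TONE_MARKS:
        # Assumption: any other tone (e.g. the fifth) is left unmarked.
        return plain_syllable
    lowered = plain_syllable.lower()
    if "a" in lowered:
        index = lowered.index("a")
    elif "e" in lowered:
        index = lowered.index("e")
    elif "ou" in lowered:
        index = lowered.index("ou")
    else:
        positions = [i for i, ch in enumerate(lowered) if ch in VOWELS]
        if not positions:
            raise ValueError("no vowel found in %r" % plain_syllable)
        index = positions[-1]
    marked = (plain_syllable[:index + 1] + TONE_MARKS[tone]
              + plain_syllable[index + 1:])
    # Compose base letter and diacritic into one character where possible.
    return unicodedata.normalize("NFC", marked)
```

With this sketch, tian with tone 1 gives tiān and luo with tone 4 gives luò, which matches the corrected forms quoted above.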

Shortened letters

Pinyin allows the letter pairs ng, zh, ch and sh to be shortened to ŋ, ẑ, ĉ and ŝ. This behaviour can be controlled by option 'shortenedLetters'.
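A minimal sketch of this substitution on a single plain syllable is given below. It only illustrates the mapping and is not taken from cjklib: shorten_letters is a hypothetical name, and lowercase input without tone marks is assumed.

```python
# Shortened forms of the initial digraphs zh, ch, sh.
INITIAL_SHORT = {
    "zh": "\u1e91",  # z with circumflex
    "ch": "\u0109",  # c with circumflex
    "sh": "\u015d",  # s with circumflex
}
FINAL_NG_SHORT = "\u014b"  # eng, the shortened form of final ng


def shorten_letters(plain_syllable):
    """Hypothetical helper: apply the shortened letters to one syllable."""
    result = plain_syllable
    if result[:2] in INITIAL_SHORT:
        result = INITIAL_SHORT[result[:2]] + result[2:]
    if result.endswith("ng"):
        result = result[:-2] + FINAL_NG_SHORT
    return result
```

For example, zhang becomes ẑaŋ and shi becomes ŝi, while ban is returned unchanged.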

Source

  • Yǐn Bīnyōng (尹斌庸), Mary Felley (傅曼丽): Chinese romanization: Pronunciation and Orthography (汉语拼音和正词法). Sinolingua, Beijing, 1990, ISBN 7-80052-148-6, ISBN 0-8351-1930-0.
  • Ireneus László Legeza: Guide to transliterated Chinese in the modern Peking dialect. Conversion tables of the currently used international and European systems with comparative tables of initials and finals. E. J. Brill, Leiden, 1968.

See also

  • Where do the tone marks go? – Tone mark rules on pinyin.info.
  • Pinyin apostrophes – Apostrophe rules on pinyin.info.
  • Pinyin initials/finals – Initial/finals table on pinyin.info.
  • Erhua sound – Article on Wikipedia.
  • The Unicode Consortium: The Unicode Standard, Version 5.0.0 – Chapter 7, European Alphabetic Scripts, 7.9 Combining Marks, defined by: The Unicode Standard, Version 5.0 (Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0).
  • Unicode: Combining Diacritical Marks – Range: 0300-036F.
  • Characters and Combining Marks – Unicode: FAQ.

Class

class cjklib.reading.operator.PinyinOperator(**options)

Bases: cjklib.reading.operator.TonalRomanisationOperator

Provides an operator for the Mandarin romanisation Hanyu Pinyin. It can be configured to cope with different representations (“dialects”) of Pinyin. For conversion between different representations the PinyinDialectConverter can be used.

Todo

  • Impl: ISO 7098 asks for conversion of 。、·「」 to .,-«». What about ,?《》:-? Implement a method for conversion to be optionally used.
  • Impl: Special marker for neutral tone: ‘mȧ’ (u’m\u0227’, reported by Ching-song Gene Hsiao: A Manual of Transcription Systems For Chinese, 中文拼音手册. Far Eastern Publications, Yale University, New Haven, Connecticut, 1985, ISBN 0-88710-141-0. Seems like left over from Pinjin, 1956), and ‘·ma’ (u’\xb7ma’, check!: 现代汉语词典(第5版)[Xiàndài Hànyǔ Cídiǎn 5. Edition]. 商务印书馆 [Shāngwù Yìnshūguǎn], Beijing, 2005, ISBN 7-100-04385-9.)
  • Impl: Consider handling \*nue and \*lue.
Parameters:
  • options – extra options
  • dbConnectInst – instance of a cjklib.dbconnector.DatabaseConnector, if none is given, default settings will be assumed.
  • strictSegmentation – if True segmentation (using segment()) and thus decomposition (using decompose()) will raise an exception if an alphabetic string is parsed which can not be segmented into single reading entities. If False the aforesaid string will be returned unsegmented.
  • case – if set to 'lower', only lower case will be supported, if set to 'both' a mix of upper and lower case will be supported.
  • toneMarkType – if set to 'diacritics' tones will be marked using diacritic marks, if set to 'numbers' appended numbers from 1 to 5 will be used to mark tones, if set to 'none' no tone marks will be used and no tonal information will be supplied at all.
  • missingToneMark – if set to 'fifth' no tone mark is set to indicate the fifth tone (qingsheng, e.g. 'wo3men' stands for 'wo3men5'), if set to 'noinfo', no tone information will be deduced when no tone mark is found (takes on value None), if set to 'ignore' this entity will not be valid and for segmentation the behaviour defined by 'strictSegmentation' will take effect. This option only has effect for the tone mark type 'numbers'.
  • strictDiacriticPlacement – if set to True syllables have to follow the diacritic placement rule of Pinyin strictly. Wrong placement will result in splitEntityTone() raising an InvalidEntityError. Defaults to False. In either way, diacritics must be placed on one of the vowels (nasal semi-vowels being an exception).
  • pinyinDiacritics – a 4-tuple of diacritic marks for tones one to four. If a circumflex (U+0302) is contained as diacritic mark, the special vowel ê will not be supported and the given string will be interpreted as a tonal version of vowel e.
  • yVowel – a character (or string) that is taken as alternative for ü which depicts (among others) the close front rounded vowel [y] (IPA) in Pinyin and includes an umlaut. Changes forms of syllables nü, nüe, lü, lüe. This option is not valid for the tone mark type 'diacritics'.
  • shortenedLetters – if set to True the final letters ng will be shortened to ŋ, and the initial letters zh, ch, sh will be shortened to ẑ, ĉ, ŝ.
  • pinyinApostrophe – an alternate apostrophe that is taken instead of the default one.
  • pinyinApostropheFunction – a function that indicates when a syllable combination needs to be split by an apostrophe, see aeoApostropheRule() for the default implementation.
  • erhua – if set to 'ignore' no special support will be provided for retroflex -r at syllable end (Erhua), i.e. zher will raise an exception. If set to 'twoSyllables' syllables with an appended r are given/will be segmented into two syllables, the -r suffix making up one syllable itself as 'r'. If set to 'oneSyllable' syllables with an appended r are given/will be segmented into one syllable only.
APOSTROPHE_LIST
List of apostrophes used in guessing routine.
DIACRITICS_LIST
Dictionary of diacritics per tone used in guessing routine. Only diacritics with canonical combining class 230 supported (unicodedata.combining() == 230, see Unicode 3.11, or http://unicode.org/Public/UNIDATA/UCD.html#Canonical_Combining_Class_Values), due to implementation of how ü and ê, ẑ, ĉ, ŝ are handled.
PINYIN_SOUND_REGEX
Regular Expression matching onset, nucleus and coda. Syllables ‘n’, ‘ng’, ‘r’ (for Erhua) and ‘ê’ have to be handled separately.
TONEMARK_VOWELS
List of characters of the nucleus possibly carrying the tone mark. n is included in standalone syllables n and ng. r is used for supporting Erhua in a two syllable form, ŋ is the shortened form of ng.
Y_VOWEL_LIST
List of vowels for [y] after initials n/l used in guessing routine.
static aeoApostropheRule(operatorInst, precedingEntity, followingEntity)

Checks if the given entities need to be separated by an apostrophe.

Returns true for syllables starting with one of the three vowels a, e, o having a preceding syllable. Additionally, the forms n and ng are separated from preceding syllables. Furthermore, the corner case e’r will be handled to distinguish it from er.

This function serves as the default apostrophe rule.

Parameters:
  • operatorInst (instance) – instance of the Pinyin operator
  • precedingEntity (str) – the preceding syllable or any other content
  • followingEntity (str) – the following syllable or any other content
Return type:

bool

Returns:

true if the syllables need to be separated, false otherwise

compose(readingEntities)

Composes the given list of basic entities to a string. Applies an apostrophe between syllables if needed using default implementation aeoApostropheRule().

Parameter: readingEntities (list of str) – list of basic syllables or other content
Return type: str
Returns: composed entities
convertPlainEntity(plainEntity, targetOptions=None)

Converts the alternative syllable representation from the current dialect to the given target, or by default to the standard representation. Erhua forms will not be converted.

Use the PinyinDialectConverter for conversions in general.

Parameters:
  • plainEntity (str) – plain syllable in the current reading
  • targetOptions (dict) – target reading options
Return type:

str

Returns:

converted entity

classmethod getDefaultOptions()
getFormattingEntities(*args, **kwargs)
getOnsetRhyme(plainSyllable)

Splits the given plain syllable into onset (initial) and rhyme (final).

Pinyin can’t be separated into onset and rhyme clearly within its own system. There are syllables with the same final written differently (e.g. wei and dui, both ending in a final that can be described by uei) and there is reduction of vowels (same example: dui, which is pronounced with the vowels uei). This method uses three forms not found as substrings in Pinyin (uei, uen and iou) and substitutes the (pseudo) initials w and y with their vowel equivalents.

Furthermore final i will be distinguished in three forms given by the following three examples: yi, zhi and zi to express phonological difference.

Returned strings will be lowercase.

Parameter: plainSyllable (str) – syllable without tone marks
Return type: tuple of str
Returns: tuple of entity onset and rhyme
Raises InvalidEntityError: if the entity is invalid.
Raises UnsupportedError: for entity r when Erhua is handled as separate entity.
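The substitutions described for getOnsetRhyme() can be sketched roughly as below. This is an illustrative approximation written for this page, not the library's implementation: onset_rhyme is a hypothetical name, the three-way distinction of final i is omitted, the standalone syllables n, ng and r are not handled, and the exact strings cjklib returns may differ.

```python
# Two-letter initials come first so that 'zh' is matched before 'z'.
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l",
            "g", "k", "h", "j", "q", "x", "r", "z", "c", "s")
# Written finals that abbreviate a longer underlying form.
ABBREVIATED = {"ui": "uei", "un": "uen", "iu": "iou"}
U_UMLAUT = "\u00fc"  # the letter u with diaeresis


def onset_rhyme(plain_syllable):
    """Hypothetical helper: split a plain syllable into (onset, rhyme)."""
    s = plain_syllable.lower()
    if s.startswith("y"):
        rest = s[1:]
        if rest.startswith("u"):      # yu, yue, yuan, yun
            return "", U_UMLAUT + rest[1:]
        if rest.startswith("i"):      # yi, yin, ying
            return "", rest
        return "", "i" + rest         # ya, ye, you, ...
    if s.startswith("w"):
        rest = s[1:]
        if rest.startswith("u"):      # wu
            return "", rest
        return "", "u" + rest         # wa, wei, wen, ...
    onset = next((i for i in INITIALS if s.startswith(i)), "")
    rhyme = s[len(onset):]
    if onset in ("j", "q", "x") and rhyme.startswith("u"):
        rhyme = U_UMLAUT + rhyme[1:]  # ju, que, xuan are written with u
    return onset, ABBREVIATED.get(rhyme, rhyme)


print(onset_rhyme("dui"))  # ('d', 'uei')
print(onset_rhyme("wei"))  # ('', 'uei')
print(onset_rhyme("you"))  # ('', 'iou')
```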
getPlainReadingEntities(*args, **kwargs)

Gets the list of plain entities supported by this reading. Unlike getReadingEntities(), the entities will carry no tone mark.

Depending on the type of Erhua support either additional syllables with an ending -r are added, or a single r is included. The user specified character for vowel ü will be used.

Return type: set of str
Returns: set of supported syllables

Todo

  • Fix: don’t raise a ValueError here (delayed), raise an Exception directly in the constructor. See also WadeGilesOperator.
getReadingCharacters(*args, **kwargs)
getReadingEntities(*args, **kwargs)
getTonalEntity(plainEntity, tone)
getTones(*args, **kwargs)
classmethod guessReadingDialect(readingString, includeToneless=False)

Takes a string written in Pinyin and guesses the reading dialect.

The basic options 'toneMarkType', 'pinyinDiacritics', 'yVowel', 'erhua', 'pinyinApostrophe' and 'shortenedLetters' are guessed. Unless 'includeToneless' is set to True only the tone mark types 'diacritics' and 'numbers' are considered as the latter one can also represent the state of missing tones. Strings tested for 'yVowel' are ü, v and u:. 'erhua' is set to 'twoSyllables' by default and only tested when 'toneMarkType' is assumed to be set to 'numbers'.

Parameters:
  • readingString (str) – Pinyin string
  • includeToneless (bool) – if set to True, option 'toneMarkType' can take on the value 'none'; by default (i.e. set to False) the toneless case is covered by tone mark type 'numbers'.
Return type:

dict

Returns:

dictionary of basic keyword settings
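The decision between the tone mark types can be pictured with the following standalone sketch. It covers only 'toneMarkType', uses a deliberately simple heuristic, and is not the routine cjklib implements. The name guess_tone_mark_type and the fixed set of four diacritics are assumptions for the example.

```python
import re
import unicodedata

# Combining macron, acute, caron and grave, i.e. the standard tone marks.
TONE_DIACRITICS = {"\u0304", "\u0301", "\u030c", "\u0300"}


def guess_tone_mark_type(reading_string, include_toneless=False):
    """Hypothetical helper: guess 'diacritics', 'numbers' or 'none'."""
    # Decompose so that precomposed letters expose their combining marks.
    decomposed = unicodedata.normalize("NFD", reading_string)
    if any(ch in TONE_DIACRITICS for ch in decomposed):
        return "diacritics"
    # A tone digit directly after a letter, a colon (as in u:) or a
    # combining diaeresis (from the decomposed u with diaeresis).
    if re.search("[A-Za-z:\u0308][1-5]", decomposed):
        return "numbers"
    # Toneless input: 'numbers' can also represent missing tones.
    return "none" if include_toneless else "numbers"


print(guess_tone_mark_type("tian1an1men2"))                      # numbers
print(guess_tone_mark_type("tiananmen"))                         # numbers
print(guess_tone_mark_type("tiananmen", include_toneless=True))  # none
```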

isReadingEntity(entity)
isStrictDecomposition(readingEntities)

Checks if the given decomposition follows the Pinyin format strictly for unambiguous decomposition: syllables have to be preceded by an apostrophe if the decomposition would be ambiguous otherwise.

The function given as option 'pinyinApostropheFunction' is used to check whether an apostrophe should have been placed.

Parameter: readingEntities (list of str) – decomposed reading string
Return type: bool
Returns: true if decomposition is strict, false otherwise
removeApostrophes(readingEntities)

Removes apostrophes between two syllables for a given decomposition.

Parameter: readingEntities (list of str) – list of basic syllables or other content
Return type: list of str
Returns: the given entity list without separating apostrophes
splitEntityTone(entity)

Splits the entity into an entity without tone mark and the entity’s tone index.

The plain entity returned will always be in Unicode’s Normalization Form C (NFC, see http://www.unicode.org/reports/tr15/).

Parameter: entity (str) – entity with tonal information
Return type: tuple
Returns: plain entity without tone mark and entity’s tone index (starting with 1)
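To make the splitting step concrete, here is a minimal sketch built on unicodedata. It is not the library's code and split_entity_tone is a hypothetical name. Two assumptions are made: an entity without a diacritic is read as the fifth tone, and for numbers a missing digit yields None (as with 'missingToneMark' set to 'noinfo').

```python
import unicodedata

# Combining marks for tones 1 to 4: macron, acute, caron, grave.
TONE_BY_MARK = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}


def split_entity_tone(entity, tone_mark_type="diacritics"):
    """Hypothetical helper: return (plain entity, tone index)."""
    if tone_mark_type == "numbers":
        if entity[-1:] in ("1", "2", "3", "4", "5"):
            return entity[:-1], int(entity[-1])
        return entity, None  # assumption: no tone information available
    # Decompose so that the tone mark becomes a separate code point.
    decomposed = unicodedata.normalize("NFD", entity)
    tones = [TONE_BY_MARK[ch] for ch in decomposed if ch in TONE_BY_MARK]
    if len(tones) > 1:
        raise ValueError("more than one tone mark in %r" % entity)
    plain = "".join(ch for ch in decomposed if ch not in TONE_BY_MARK)
    # Recompose to NFC so that the letter u with diaeresis stays one
    # character after its tone mark has been removed.
    plain = unicodedata.normalize("NFC", plain)
    return plain, (tones[0] if tones else 5)
```

Under these assumptions tiān splits into tian and tone 1, and wo3 with tone mark type 'numbers' splits into wo and tone 3.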