PinyinDialectConverter — Hanyu Pinyin dialects

Specifics

Examples

The following examples show how to convert between different representations of Pinyin.

  • Create the converter and convert from standard Pinyin to Pinyin with tones represented by numbers:

    >>> from cjklib.reading import *
    >>> targetOp = operator.PinyinOperator(toneMarkType='numbers')
    >>> pinyinConv = converter.PinyinDialectConverter(
    ...     targetOperators=[targetOp])
    >>> pinyinConv.convert(u'hànzì', 'Pinyin', 'Pinyin')
    u'han4zi4'
    
  • Convert Pinyin written with tone numbers, with ü (u with umlaut) replaced by the character v and the fifth tone left unmarked, to standard Pinyin:

    >>> sourceOp = operator.PinyinOperator(toneMarkType='numbers',
    ...    yVowel='v', missingToneMark='fifth')
    >>> pinyinConv = converter.PinyinDialectConverter(
    ...     sourceOperators=[sourceOp])
    >>> pinyinConv.convert('nv3hai2zi', 'Pinyin', 'Pinyin')
    u'nǚháizi'
    
  • Or more elegantly:

    >>> f = ReadingFactory()
    >>> f.convert('nv3hai2zi', 'Pinyin', 'Pinyin',
    ...     sourceOptions={'toneMarkType': 'numbers', 'yVowel': 'v',
    ...     'missingToneMark': 'fifth'})
    u'nǚháizi'
    
  • Decompose the reading of a CEDICT dictionary entry into syllables, then convert the ü vowel and the Erhua form:

    >>> pinyinFrom = operator.PinyinOperator(toneMarkType='numbers',
    ...     yVowel='u:', Erhua='oneSyllable')
    >>> syllables = pinyinFrom.decompose('sun1nu:r3')
    >>> print syllables
    ['sun1', 'nu:r3']
    >>> pinyinTo = operator.PinyinOperator(toneMarkType='numbers',
    ...     Erhua='twoSyllables')
    >>> pinyinConv = converter.PinyinDialectConverter(
    ...     sourceOperators=[pinyinFrom], targetOperators=[pinyinTo])
    >>> pinyinConv.convertEntities(syllables, 'Pinyin', 'Pinyin')
    [u'sun1', u'nü3', u'r5']
    
  • Or more elegantly with entities already decomposed:

    >>> f.convertEntities(['sun1', 'nu:r3'], 'Pinyin', 'Pinyin',
    ...     sourceOptions={'toneMarkType': 'numbers', 'yVowel': 'u:',
    ...        'Erhua': 'oneSyllable'},
    ...     targetOptions={'toneMarkType': 'numbers',
    ...        'Erhua': 'twoSyllables'})
    [u'sun1', u'nü3', u'r5']
    
  • Fix cosmetic errors in Pinyin input (note tone mark and apostrophe):

    >>> f.convert(u"Wǒ peí nǐ qù Xīān.", 'Pinyin', 'Pinyin')
    u"Wǒ péi nǐ qù Xī'ān."
    
  • Fix more errors in Pinyin input (note diacritics):

    >>> string = u"Wŏ peí nĭ qù Xīān."
    >>> dialect = operator.PinyinOperator.guessReadingDialect(string)
    >>> f.convert(string, 'Pinyin', 'Pinyin', sourceOptions=dialect)
    u"Wǒ péi nǐ qù Xī'ān."
    
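To make the numbers-to-diacritics conversion in the examples above more concrete, here is a minimal standalone sketch in Python 3 (the doctests above are Python 2). It does not use cjklib and is not how the library implements the conversion; the function name `numbered_to_diacritic` and its `y_vowel` parameter are hypothetical. It assumes the usual tone mark placement: a or e takes the mark, in ou the o takes it, otherwise the last vowel does.

```python
import re
import unicodedata

# Combining diacritics for tones 1-4; the fifth (neutral) tone carries no mark.
TONE_MARKS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300"}


def numbered_to_diacritic(syllable, y_vowel="v"):
    """Convert one numbered syllable such as 'nv3' to diacritic form.

    Hypothetical helper for illustration only. A missing tone number is
    treated as the fifth tone.
    """
    match = re.fullmatch(r"(\D+)([1-5]?)", syllable)
    if match is None:
        raise ValueError("not a numbered Pinyin syllable: %r" % syllable)
    plain = match.group(1).replace(y_vowel, "\u00fc")  # substitute -> ü
    tone = int(match.group(2) or 5)
    if tone == 5:
        return plain

    lowered = plain.lower()
    if "a" in lowered:
        index = lowered.index("a")
    elif "e" in lowered:
        index = lowered.index("e")
    elif "ou" in lowered:
        index = lowered.index("o")
    else:
        vowels = [i for i, ch in enumerate(lowered) if ch in "aeiou\u00fc"]
        if not vowels:
            raise ValueError("no vowel to carry the tone mark: %r" % syllable)
        index = vowels[-1]

    # Put the combining mark right after the chosen vowel, then compose.
    marked = plain[:index + 1] + TONE_MARKS[tone] + plain[index + 1:]
    return unicodedata.normalize("NFC", marked)


word = "".join(numbered_to_diacritic(s) for s in ["nv3", "hai2", "zi"])
```

The sketch works on already decomposed syllables only; splitting a string such as 'nv3hai2zi' into syllables is the job of the reading operator's decompose method shown earlier.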

Class

class cjklib.reading.converter.PinyinDialectConverter(*args, **options)

Bases: cjklib.reading.converter.ReadingConverter

Provides a converter for different representations of the Chinese romanisation Hanyu Pinyin.

Parameters:
  • args – optional list of ReadingOperators to use for handling source and target readings.
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector; if none is given, default settings will be assumed.
  • sourceOperators – list of ReadingOperators used for handling source readings.
  • targetOperators – list of ReadingOperators used for handling target readings.
  • keepPinyinApostrophes – if set to True, apostrophes separating two syllables in Pinyin are kept even where they are not necessary. Missing apostrophes required by the given rule are added regardless.
  • breakUpErhua – if set to 'on', Erhua forms are always broken up into separate syllables with a full er syllable, regardless of the Erhua form setting of the target reading (e.g. zher is converted to zhe, er); if set to 'auto', Erhua forms are broken up only if the given target reading operator doesn’t support Erhua forms; if set to 'off', Erhua forms are always preserved.
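As an illustration of the apostrophe handling that keepPinyinApostrophes refers to, the following Python 3 sketch joins syllables and inserts an apostrophe before any non-initial syllable that begins with a, e or o. It assumes the rule meant here is the standard Hanyu Pinyin one; the helper names are hypothetical and this is not the converter's own code.

```python
import unicodedata


def starts_with_a_e_o(syllable):
    """Return True if the syllable begins with a, e or o, with or without a tone mark."""
    # NFD splits a marked vowel such as 'ā' into base letter plus combining mark.
    base = unicodedata.normalize("NFD", syllable[:1]).lower()
    return base[:1] in ("a", "e", "o")


def join_syllables(syllables):
    """Join syllables into a word, adding the separating apostrophes (hypothetical helper)."""
    parts = []
    for index, syllable in enumerate(syllables):
        if index > 0 and starts_with_a_e_o(syllable):
            parts.append("'")
        parts.append(syllable)
    return "".join(parts)


city = join_syllables(["X\u012b", "\u0101n"])  # Xī + ān
```

The sketch only adds required apostrophes. Keeping unnecessary ones that were already present in the input, which is what the option controls, would need the original string as well.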
convertEntities(readingEntities, fromReading='Pinyin', toReading='Pinyin')

Converts a list of entities in the source reading to the given target reading.

Parameters:
  • readingEntities (list of str) – list of entities written in source reading
  • fromReading (str) – name of the source reading
  • toReading (str) – name of the target reading
Return type: list of str
Returns: list of entities written in target reading
Raises AmbiguousConversionError: if conversion for a specific entity of the source reading is ambiguous.
Raises ConversionError: on other operations specific to the conversion between the two readings (e.g. error on converting entities).
Raises UnsupportedError: if source or target reading is not supported for conversion.
Raises InvalidEntityError: if an invalid entity is given.

static convertToSingleSyllableErhua(entityTuples)

Converts the various Erhua forms in a list of reading entities to a representation with one syllable, e.g. ['tou2', 'r5'] to ['tour2'].

Parameter: entityTuples (list of tuple/str) – list of tuples with plain syllable and tone
Return type: list of tuple/str
Returns: list of tuples with plain syllable and tone
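The idea behind the one-syllable form can be sketched in Python 3 as follows. This is a simplified illustration, not the library's implementation: the function name `merge_erhua` is hypothetical, tones are assumed to be integers, and only the ('r', 5) form is handled rather than every Erhua variant.

```python
def merge_erhua(entity_tuples):
    """Merge a standalone ('r', 5) entity into the preceding syllable tuple.

    Hypothetical helper. Entities are (plain syllable, tone) tuples;
    plain strings (non-reading entities) pass through untouched.
    """
    merged = []
    for entity in entity_tuples:
        if entity == ("r", 5) and merged and isinstance(merged[-1], tuple):
            plain, tone = merged.pop()
            # The merged syllable keeps the tone of the base syllable.
            merged.append((plain + "r", tone))
        else:
            merged.append(entity)
    return merged


result = merge_erhua([("tou", 2), ("r", 5)])
```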
static convertToTwoSyllablesErhua(entityTuples)

Converts the various Erhua forms in a list of reading entities to a representation with two syllables, e.g. ['tour2'] to ['tou2', 'r5'].

Parameter: entityTuples (list of tuple/str) – list of tuples with plain syllable and tone
Return type: list of tuple/str
Returns: list of tuples with plain syllable and tone
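The reverse direction can be sketched the same way in Python 3. Again this is only an illustration under assumptions: `split_erhua` is a hypothetical name, tones are assumed to be integers, and any syllable ending in r other than er itself is treated as an Erhua form.

```python
def split_erhua(entity_tuples):
    """Split one-syllable Erhua forms into base syllable plus ('r', 5).

    Hypothetical helper. The syllable 'er' and a bare 'r' are left
    alone; plain strings pass through untouched.
    """
    result = []
    for entity in entity_tuples:
        if isinstance(entity, tuple):
            plain, tone = entity
            lowered = plain.lower()
            if len(plain) > 1 and lowered.endswith("r") and lowered != "er":
                result.append((plain[:-1], tone))
                result.append(("r", 5))  # the r suffix gets the fifth tone
                continue
        result.append(entity)
    return result


result = split_erhua([("tour", 2)])
```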
classmethod getDefaultOptions()