WadeGilesOperator – Wade-Giles¶

cjklib.reading.operator.WadeGilesOperator is an implementation of the Mandarin Chinese romanisation Wade-Giles. It was in common use before being replaced by Pinyin.

Features:

tones marked by either superscript or plain digits,
flexibility with derived writing, e.g. szu instead of ssu,
alternative representation of characters ŭ and ê,
handling of omissions of umlaut ü with resulting ambiguity,
alternative marking of neutral tone (qingsheng) with either no mark or digits zero or five,
configurable apostrophe for marking aspiration,
placement of hyphens between syllables and
guessing of input form (reading dialect).

Specifics¶

Alterations¶

While the Wade-Giles romanisation system is itself H. A. Giles's modification of Thomas Wade's system, further alterations exist, so parsing transliterated text requires an adaptable solution.

Diacritics¶

In standard Wade-Giles the non-retroflex zero final syllables tzŭ, tz’ŭ and ssŭ carry a breve over the u, but it is often left out, which creates no ambiguity. In the same fashion the finals -ê, -ên and -êng and the syllable êrh carry a circumflex over the e that is often not written; again no ambiguity arises, as no equivalent forms with a plain e exist. These forms can be handled by setting the options 'zeroFinal' to 'u' and 'diacriticE' to 'e'.

In contrast, leaving out the umlaut on the u in the finals -ü, -üan, -üeh and -ün does create forms that in some cases cannot be converted back, because an equivalent form with the plain vowel u exists. Unambiguous forms are those with initial hs- or y- (exception: yu) and/or finals -üeh or -üo, the latter being dialect forms not in use today. So while, for example, hsu can be unambiguously converted back to its correct form hsü, it is not clear whether ch’uan is the intended form or stems from ch’üan with its diacritic mangled. Such cases are reported by checkPlainEntity(). The omission of the umlaut can be controlled by setting 'umlautU' to 'u'.

Others¶

Of the non-retroflex zero final forms tzŭ, tz’ŭ and ssŭ, the last is sometimes written szŭ. The operator can be configured for this by setting the Boolean option 'useInitialSz'.

By default the neutral tone is not marked. As the digits zero or five are sometimes used instead, the mark can be set with the option 'neutralToneMark'.
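The three ways of writing the neutral tone can be illustrated with a small stand-alone sketch. It does not use cjklib; the function name mark_tone and both lookup tables are hypothetical, and tone 5 is assumed to stand for the neutral tone.

```python
# Hypothetical helper, not part of cjklib: appends a superscript tone mark.
# Tone 5 is assumed to be the neutral tone (qingsheng).
SUPERSCRIPT_TONES = {1: "¹", 2: "²", 3: "³", 4: "⁴"}
NEUTRAL_MARKS = {"none": "", "zero": "⁰", "five": "⁵"}


def mark_tone(plain_syllable, tone, neutral_tone_mark="none"):
    """Return the syllable with its tone mark appended."""
    if tone == 5:
        # The neutral tone is written according to the chosen convention.
        return plain_syllable + NEUTRAL_MARKS[neutral_tone_mark]
    return plain_syllable + SUPERSCRIPT_TONES[tone]


for option in ("none", "zero", "five"):
    print(option, mark_tone("chih", 1) + "-" + mark_tone("tao", 5, option))
```

An unknown value for neutral_tone_mark raises a KeyError in this sketch; how the operator itself reacts to invalid values is not covered here.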

The apostrophe marking aspiration can be set by 'wadeGilesApostrophe'.

Tones are by default marked with superscript characters. This can be controlled by option 'toneMarkType'.

Recovering omitted aspiration apostrophes is not possible, as every case is ambiguous. No means are provided to warn about possibly missing apostrophes. In case of uncertainty, check for the initials p-, t-, k-, ch-, ts- and tz-.
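As the operator offers no such warning, a manual check along these lines may help. This is a minimal sketch that does not use cjklib; the function name and the set of apostrophe characters are assumptions, and a True result only means the syllable deserves a second look.

```python
# Initials that also exist in aspirated form, as listed in the text above.
# Longer initials come first so that "ts" is not mistaken for "t".
ASPIRABLE_INITIALS = ("ch", "ts", "tz", "p", "t", "k")
# Assumed set of characters that may serve as the aspiration apostrophe.
APOSTROPHES = ("'", "’", "‘", "ʻ")


def may_lack_apostrophe(syllable):
    """Return True if the syllable starts with an unaspirated initial
    that has an aspirated counterpart."""
    lowered = syllable.lower()
    for initial in ASPIRABLE_INITIALS:
        if lowered.startswith(initial):
            rest = lowered[len(initial):]
            return not rest.startswith(APOSTROPHES)
    return False


for syllable in ("chien", "ch’ien", "shan", "tzu"):
    print(syllable, may_lack_apostrophe(syllable))
```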

Examples¶

The WadeGilesDialectConverter allows conversion between said forms.

Restore diacritics:

>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> f.convert(u"K’ung³-tzu³", 'WadeGiles', 'WadeGiles',
...     sourceOptions={'zeroFinal': 'u'})
u'K\u2019ung\xb3-tz\u016d\xb3'
>>> f.convert(u"k’ai¹-men²-chien⁴-shan¹", 'WadeGiles', 'WadeGiles',
...     sourceOptions={'diacriticE': 'e'})
u'k\u2019ai\xb9-m\xean\xb2-chien\u2074-shan\xb9'
>>> f.convert(u"hsueh²", 'WadeGiles', 'WadeGiles',
...     sourceOptions={'umlautU': 'u'})
u'hs\xfceh\xb2'

But:

>>> f.convert(u"hsu⁴-ch’u³", 'WadeGiles', 'WadeGiles',
...     sourceOptions={'umlautU': 'u'})
Traceback (most recent call last):
...
cjklib.exception.AmbiguousConversionError: conversion for entity 'ch’u³' is ambiguous: ch’u³, ch’ü³

Guess non-standard form:

>>> from cjklib.reading import operator
>>> operator.WadeGilesOperator.guessReadingDialect(
...     u"k'ai1-men2-chien4-shan1")
{'zeroFinal': u'\u016d', 'diacriticE': u'e', 'umlautU': u'\xfc', 'toneMarkType': 'numbers', 'useInitialSz': False, 'neutralToneMark': 'none', 'wadeGilesApostrophe': "'"}

Class¶

class cjklib.reading.operator.WadeGilesOperator(**options)¶

Bases: cjklib.reading.operator.TonalRomanisationOperator

Provides an operator for the Mandarin Wade-Giles romanisation.

Todo

Lang: Asterisk (*) marking the entering tone (入聲): e.g. chio²* and chüeh²* for 覺 used by Giles (A Chinese-English Dictionary, second edition, 1912).

Parameters:

options – extra options
dbConnectInst – instance of a DatabaseConnector, if none is given, default settings will be assumed.
strictSegmentation – if True segmentation (using segment()) and thus decomposition (using decompose()) will raise an exception if an alphabetic string is parsed which cannot be segmented into single reading entities. If False the aforesaid string will be returned unsegmented.
case – if set to 'lower', only lower case will be supported, if set to 'both' a mix of upper and lower case will be supported.
wadeGilesApostrophe – an alternate apostrophe that is taken instead of the default one.
toneMarkType – if set to 'numbers' appended numbers from 1 to 5 will be used to mark tones, if set to 'superscriptNumbers' appended superscript numbers from 1 to 5 will be used to mark tones, if set to 'none' no tone marks will be used and no tonal information will be supplied at all.
neutralToneMark – if set to 'none' no tone mark is set to indicate the fifth tone (qingsheng), e.g. 'chih¹-tao'; if set to 'zero' the number zero is used, e.g. 'chih¹-tao⁰'; and if set to 'five' the number five is used, e.g. 'chih¹-tao⁵'.
missingToneMark – if set to 'noinfo', no tone information will be deduced when no tone mark is found (takes on value None); if set to 'ignore', this entity will not be valid and for segmentation the behaviour defined by 'strictSegmentation' will take effect. This option only has an effect for tone mark types 'numbers' and 'superscriptNumbers', and is only valid if 'neutralToneMark' is set to something other than 'none'.
diacriticE – character used instead of ê. 'e' is a possible alternative, no ambiguities arise.
zeroFinal – character used instead of ŭ. 'u' is a possible alternative, no ambiguities arise.
umlautU – character used instead of ü. 'u' is allowed, but ambiguities are possible.
useInitialSz – if set to True syllable form szŭ is used instead of the standard ssŭ.

Todo

Impl: Raise value error on invalid values for diacriticE, zeroFinal, umlautU

ALLOWED_VOWEL_SUBST¶: Regular Wade-Giles vowels that the given options can be substituted with. Other regular vowels are not allowed for substitution because of ambiguity.

APOSTROPHE_LIST¶: List of apostrophes used in guessing routine.

DB_ASPIRATION_APOSTROPHE¶: Apostrophe used by Wade-Giles syllable data in database.

DIACRICTIC_E_LIST¶: List of characters for diacritic e used in guessing routine. Except ‘e’ no other values are allowed that intersect with WG vowels as they can cause ambiguous forms.

FROM_SUPERSCRIPT¶: Mapping of superscript numbers to tone numbers.

TO_SUPERSCRIPT¶: Mapping of tone numbers to superscript numbers.
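The contents of these two mappings are not reproduced on this page. The following stand-alone sketch shows how such a pair of mappings could be built and used to switch between plain and superscript digits; the names prefixed with SKETCH_ and both helper functions are hypothetical, and the digit range 0 to 5 is an assumption based on the tone options described above.

```python
# Hypothetical stand-ins for the two class attributes; the real contents
# are an assumption here.
SKETCH_TO_SUPERSCRIPT = {0: "⁰", 1: "¹", 2: "²", 3: "³", 4: "⁴", 5: "⁵"}
SKETCH_FROM_SUPERSCRIPT = {
    mark: number for number, mark in SKETCH_TO_SUPERSCRIPT.items()
}


def to_plain_digits(text):
    """Replace superscript tone digits with plain digits."""
    return "".join(
        str(SKETCH_FROM_SUPERSCRIPT[char])
        if char in SKETCH_FROM_SUPERSCRIPT else char
        for char in text
    )


def to_superscript_digits(text):
    """Replace plain tone digits 0-5 with superscript digits."""
    return "".join(
        SKETCH_TO_SUPERSCRIPT[int(char)] if char in "012345" else char
        for char in text
    )


print(to_plain_digits("k’ai¹-mên²-chien⁴-shan¹"))
print(to_superscript_digits("k’ai1-mên2-chien4-shan1"))
```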

UMLAUT_U_LIST¶: List of characters used for u-umlaut in guessing routine. Except ‘u’ no other values are allowed that intersect with WG vowels. Vowel ‘u’ will generate ambiguous forms, so the guessing routine has to take care to choose it only for forms that have no “natural” ‘u’ counterpart. For all other vowels this is not guaranteed, so they are not allowed as values.

ZERO_FINAL_LIST¶: List of characters for zero final used in guessing routine. Except ‘u’ no other values are allowed that intersect with WG vowels as they can cause ambiguous forms.

checkPlainEntity(plainEntity, option)¶

Checks if the given plain entity is a form with lost diacritics or an ambiguous case.

Examples: While form *erh can be clearly traced to êrh, form kuei has no equivalent form with diacritics. The former is a case of a 'lost' vowel, the latter of a 'strict' form. Syllable ch’u, though, is an 'ambiguous' case, as both ch’u and ch’ü are valid.

Parameters:	plainEntity (str) – entity without tonal information
	option (str) – one option out of 'diacriticE', 'zeroFinal' or 'umlautU'
Return type:	str
Returns:	'strict' if the given form is a strict Wade-Giles form with vowel u, 'lost' if the given form is a mangled vowel form, 'ambiguous' if two forms exist with vowels (i.e. u and ü) each
Raises ValueError:	if plain entity doesn’t include the ambiguous vowel in question
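The three outcomes for the 'umlautU' case can be sketched without cjklib as a lookup against a syllable table. Everything below is hypothetical: the function name, the tiny sample table (the operator reads its syllables from the database) and the second ValueError for unknown syllables.

```python
# Hypothetical excerpt of standard Wade-Giles syllables, for illustration only.
SAMPLE_SYLLABLES = {"hsü", "ch’u", "ch’ü", "kuei", "yu", "yü"}


def classify_missing_umlaut(plain_entity, syllables=SAMPLE_SYLLABLES):
    """Classify a lower-case syllable that is written with a plain 'u'."""
    if "u" not in plain_entity:
        raise ValueError("entity %r contains no vowel 'u'" % plain_entity)
    plain_exists = plain_entity in syllables
    umlaut_exists = plain_entity.replace("u", "ü") in syllables
    if plain_exists and umlaut_exists:
        return "ambiguous"   # both readings are valid syllables
    if umlaut_exists:
        return "lost"        # only the ü form exists, the umlaut was dropped
    if plain_exists:
        return "strict"      # a regular form that never had an umlaut
    raise ValueError("entity %r is not in the sample table" % plain_entity)


for entity in ("hsu", "ch’u", "kuei"):
    print(entity, classify_missing_umlaut(entity))
```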

compose(readingEntities)¶

Composes the given list of basic entities to a string by applying a hyphen between syllables.

Parameter:	readingEntities (list of str) – list of basic syllables or other content
Return type:	str
Returns:	composed entities
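The idea of hyphenating only between neighbouring syllables can be sketched as follows. This is not the cjklib implementation; the function names are hypothetical, and the test for what counts as a syllable (anything starting with a letter) is a deliberately crude assumption.

```python
def looks_like_syllable(entity):
    # Crude, assumed stand-in for a real syllable check.
    return entity[:1].isalpha()


def compose_with_hyphens(reading_entities):
    """Join entities, placing a hyphen only between two adjacent syllables."""
    parts = []
    for index, entity in enumerate(reading_entities):
        if (index > 0 and looks_like_syllable(entity)
                and looks_like_syllable(reading_entities[index - 1])):
            parts.append("-")
        parts.append(entity)
    return "".join(parts)


print(compose_with_hyphens(["k’ai¹", "mên²", "chien⁴", "shan¹"]))
print(compose_with_hyphens(["K’ung³", "tzŭ³", " ", "shuo¹"]))
```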

convertPlainEntity(plainEntity, targetOptions=None)¶

Converts the alternative syllable representation from the current dialect to the given target, or by default to the standard representation.

Use the WadeGilesDialectConverter for conversions in general.

Parameters:	plainEntity (str) – plain syllable in the current reading in lower case letters
	targetOptions (dict) – target reading options
Return type:	str
Returns:	converted entity
Raises AmbiguousConversionError:	if conversion is ambiguous.

classmethod getDefaultOptions()¶

getFormattingEntities(*args, **kwargs)¶

getOnsetRhyme(plainSyllable)¶

Splits the given plain syllable into onset (initial) and rhyme (final).

Semivowels w- and y- will be treated specially and an empty initial will be returned, while the final will be extended with vowel i or u.

Old forms are not supported and will raise an UnsupportedError. For the dialect with missing diacritics on the ü an UnsupportedError is also raised, as it is unclear which syllable is meant.

Returned strings will be lowercase.

Parameter:	plainSyllable (str) – syllable without tone marks
Return type:	tuple of str
Returns:	tuple of entity onset and rhyme
Raises InvalidEntityError:	if the entity is invalid.
Raises UnsupportedError:	if the given entity is not supported

getPlainReadingEntities(*args, **kwargs)¶

Gets the list of plain entities supported by this reading. Unlike getReadingEntities(), the entities carry no tone mark.

Syllables will use the user specified apostrophe to mark aspiration.

Return type:	set of str
Returns:	set of supported syllables

getReadingCharacters(*args, **kwargs)¶

getTonalEntity(plainEntity, tone)¶

getTones(*args, **kwargs)¶

classmethod guessReadingDialect(readingString)¶

Takes a string written in Wade-Giles and guesses the reading dialect.

The following options are tested:

'toneMarkType'
'wadeGilesApostrophe'
'neutralToneMark'
'diacriticE'
'zeroFinal'
'umlautU'
'useInitialSz'

Parameter:	readingString (str) – Wade-Giles string
Return type:	dict
Returns:	dictionary of basic keyword settings
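How two of these options might be read off a string can be shown with a much simplified, stand-alone sketch. It is not the library's guessing routine: the function name, the candidate apostrophes and the fallback to ’ when no apostrophe occurs are assumptions, and the remaining options are not covered.

```python
# Assumed candidates for the aspiration apostrophe, in order of preference.
APOSTROPHE_CANDIDATES = ("’", "'", "‘", "ʻ")
SUPERSCRIPT_DIGITS = "⁰¹²³⁴⁵"


def guess_surface_options(reading_string):
    """Guess the tone mark type and apostrophe from the characters used."""
    # Superscripts are tested first, since str.isdigit() is also True for them.
    if any(char in SUPERSCRIPT_DIGITS for char in reading_string):
        tone_mark_type = "superscriptNumbers"
    elif any(char in "012345" for char in reading_string):
        tone_mark_type = "numbers"
    else:
        tone_mark_type = "none"

    apostrophe = APOSTROPHE_CANDIDATES[0]  # assumed fallback
    for candidate in APOSTROPHE_CANDIDATES:
        if candidate in reading_string:
            apostrophe = candidate
            break

    return {"toneMarkType": tone_mark_type,
            "wadeGilesApostrophe": apostrophe}


print(guess_surface_options("k'ai1-men2-chien4-shan1"))
```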

removeHyphens(readingEntities)¶

Removes hyphens between two syllables for a given decomposition.

Parameter:	readingEntities (list of str) – list of basic syllables or other content
Return type:	list of str
Returns:	the given entity list without separating hyphens
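A stand-alone sketch of the same idea, assuming that a decomposition is a list in which separating hyphens appear as their own "-" entries. The function names and the crude syllable test are hypothetical and are not taken from cjklib.

```python
def looks_like_syllable(entity):
    # Crude, assumed stand-in for a real syllable check.
    return entity[:1].isalpha()


def remove_separating_hyphens(reading_entities):
    """Drop every "-" entry that sits directly between two syllables."""
    result = []
    last = len(reading_entities) - 1
    for index, entity in enumerate(reading_entities):
        separates = (
            entity == "-"
            and 0 < index < last
            and looks_like_syllable(reading_entities[index - 1])
            and looks_like_syllable(reading_entities[index + 1])
        )
        if not separates:
            result.append(entity)
    return result


print(remove_separating_hyphens(["k’ai¹", "-", "mên²", " ", "-"]))
```

Hyphens that do not stand between two syllables, such as the trailing one above, are kept.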

splitEntityTone(entity)¶

syllableRegex¶

Regex to split a string into several syllables in a crude way. It consists of:

Initial consonants,
apostrophe for aspiration,
vowels,
final consonants n/ng and rh (for êrh), h (for -ih, -üeh),
tone numbers.
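The parts listed above can be put together in a regular expression roughly like the following. This pattern is an illustrative assumption, not the value of syllableRegex; the sets of initials, vowels and apostrophes in it are simplified.

```python
import re

# Assumed, simplified pattern built from the five parts listed above.
SYLLABLE_SKETCH = re.compile(
    r"(?:ch|hs|sh|ss|sz|ts|tz|[pmftnlkhjswy])?"  # initial consonants
    r"[’'‘ʻ]?"                                   # apostrophe for aspiration
    r"[aeiouüŭê]+"                               # vowels
    r"(?:ng|n|rh|h)?"                            # final consonants
    r"[0-5⁰¹²³⁴⁵]?",                             # tone numbers
    re.IGNORECASE | re.UNICODE,
)

print(SYLLABLE_SKETCH.findall("k'ai1-men2-chien4-shan1"))
print(SYLLABLE_SKETCH.findall("K’ung³-tzŭ³"))
```

Like the original, the split is crude: the pattern accepts letter sequences that are not valid Wade-Giles syllables, so the matches still have to be checked against the syllable table.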
