WadeGilesOperator – Wade-Giles

cjklib.reading.operator.WadeGilesOperator is an implementation of the Mandarin Chinese romanisation Wade-Giles. It was in common use before being replaced by Pinyin.

Features:

  • tones marked by either superscript or plain digits,
  • flexibility with derived writing, e.g. szu instead of ssu,
  • alternative representation of characters ŭ and ê,
  • handling of omissions of umlaut ü with resulting ambiguity,
  • alternative marking of neutral tone (qingsheng) with either no mark or digits zero or five,
  • configurable apostrophe for marking aspiration,
  • placement of hyphens between syllables and
  • guessing of input form (reading dialect).

Specifics

Alterations

The Wade-Giles romanisation system is itself a modification of Thomas Wade's system by H. A. Giles, and further alterations exist, so parsing transliterated text requires an adaptable solution.

Diacritics

The non-retroflex zero final syllables tzŭ, tz’ŭ and ssŭ carry a breve on top of the u in the standard realization of Wade-Giles, but the breve is often left out, which creates no ambiguity. In the same fashion the finals -ê, -ên and -êng, and the syllable êrh, carry a circumflex over the e which is often not written; no ambiguity arises, as no equivalent forms with a plain e exist. These forms can be handled by setting the options 'zeroFinal' to 'u' and 'diacriticE' to 'e'.

In contrast, leaving out the umlaut on the u for finals -ü, -üan, -üeh and -ün does create forms for which back-conversion is not always possible, as an equivalent form with plain vowel u may exist. Unambiguous forms consist of initials hs- and y- (exception yu) and/or finals -üeh and -üo, the latter being dialect forms not in use today. So while for example hsu can be unambiguously converted back to its correct form hsü, it is not clear whether ch’uan is the wanted form or whether it stems from ch’üan with its diacritics mangled. checkPlainEntity() reports which of these cases applies. The omission of the umlaut can be controlled by setting 'umlautU' to 'u'.
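
The rule of thumb above can be sketched in a few lines of Python 3. This is only an illustration, not cjklib code: restore_umlaut and UMLAUT_FINALS are hypothetical names, the dialect final -üo is left out, and the function returns None whenever the rule alone cannot decide (the form may then be strict or ambiguous).

```python
# Hypothetical helper, not part of cjklib: applies the rule of thumb from the
# text to a plain, lower-case syllable that was written without the umlaut.
UMLAUT_FINALS = ("u", "uan", "un")


def restore_umlaut(plain):
    """Return the syllable with ü restored, or None if the rule cannot decide."""
    # Final -ueh has no plain-u counterpart, whatever the initial is.
    if plain.endswith("ueh"):
        return plain[:-3] + "üeh"
    # yu is the stated exception: both yu and yü exist.
    if plain == "yu":
        return None
    for initial in ("hs", "y"):
        if plain.startswith(initial) and plain[len(initial):] in UMLAUT_FINALS:
            # Replace the first letter of the final (the u) with ü.
            return initial + "ü" + plain[len(initial) + 1:]
    # Strict or ambiguous form: a full syllable table would be needed.
    return None
```

Under these assumptions hsu and yuan are restored, while ch’u and yu are left undecided.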

Others

Of the non-retroflex zero final forms tzŭ, tz’ŭ and ssŭ, the last is sometimes written szŭ. This variant can be selected by setting the Boolean option 'useInitialSz'.

The neutral tone by default is not marked. As sometimes the digits zero or five are used, they can be set by option 'neutralToneMark'.

The apostrophe marking aspiration can be set by 'wadeGilesApostrophe'.

Tones are by default marked with superscript characters. This can be controlled by option 'toneMarkType'.
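
How the tone options combine when a tone mark is appended can be sketched as follows. This is a simplified Python 3 illustration, not the library's implementation: get_tonal_entity and SUPERSCRIPT_DIGITS are hypothetical names, and the option values are the ones listed for 'toneMarkType' and 'neutralToneMark' in the class documentation.

```python
# Hypothetical stand-in for a tone number to superscript mapping.
SUPERSCRIPT_DIGITS = {0: "\u2070", 1: "\u00b9", 2: "\u00b2",
                      3: "\u00b3", 4: "\u2074", 5: "\u2075"}


def get_tonal_entity(plain, tone, tone_mark_type="superscriptNumbers",
                     neutral_tone_mark="none"):
    """Append a tone mark to a plain syllable (tone is 1 to 5)."""
    if tone_mark_type == "none":
        return plain
    if tone == 5:
        # The neutral tone is unmarked by default, or marked with 0 or 5.
        if neutral_tone_mark == "none":
            return plain
        digit = 0 if neutral_tone_mark == "zero" else 5
    else:
        digit = tone
    if tone_mark_type == "numbers":
        return plain + str(digit)
    return plain + SUPERSCRIPT_DIGITS[digit]
```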

Recovering omitted apostrophes for aspiration is not possible, as ambiguity exists in all cases. No means are provided to warn about possibly missing apostrophes. In case of uncertainty, check for initials p-, t-, k-, ch-, ts- and tz-.

Examples

The WadeGilesDialectConverter allows conversion between said forms.

Restore diacritics:
>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> f.convert(u"K’ung³-tzu³", 'WadeGiles', 'WadeGiles',
...     sourceOptions={'zeroFinal': 'u'})
u'K\u2019ung\xb3-tz\u016d\xb3'
>>> f.convert(u"k’ai¹-men²-chien⁴-shan¹", 'WadeGiles', 'WadeGiles',
...     sourceOptions={'diacriticE': 'e'})
u'k\u2019ai\xb9-m\xean\xb2-chien\u2074-shan\xb9'
>>> f.convert(u"hsueh²", 'WadeGiles', 'WadeGiles',
...     sourceOptions={'umlautU': 'u'})
u'hs\xfceh\xb2'
But an ambiguous form raises an error:
>>> f.convert(u"hsu⁴-ch’u³", 'WadeGiles', 'WadeGiles',
...     sourceOptions={'umlautU': 'u'})
Traceback (most recent call last):
...
cjklib.exception.AmbiguousConversionError: conversion for entity 'ch’u³' is ambiguous: ch’u³, ch’ü³
Guess non-standard form:
>>> from cjklib.reading import operator
>>> operator.WadeGilesOperator.guessReadingDialect(
...     u"k'ai1-men2-chien4-shan1")
{'zeroFinal': u'\u016d', 'diacriticE': u'e', 'umlautU': u'\xfc', 'toneMarkType': 'numbers', 'useInitialSz': False, 'neutralToneMark': 'none', 'wadeGilesApostrophe': "'"}

Class

class cjklib.reading.operator.WadeGilesOperator(**options)

Bases: cjklib.reading.operator.TonalRomanisationOperator

Provides an operator for the Mandarin Wade-Giles romanisation.

Todo

  • Lang: Asterisk (*) marking the entering tone (入聲): e.g. chio²* and chüeh²* for 覺 used by Giles (A Chinese-English Dictionary, second edition, 1912).
Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector, if none is given, default settings will be assumed.
  • strictSegmentation – if True segmentation (using segment()) and thus decomposition (using decompose()) will raise an exception if an alphabetic string is parsed which can not be segmented into single reading entities. If False the aforesaid string will be returned unsegmented.
  • case – if set to 'lower', only lower case will be supported, if set to 'both' a mix of upper and lower case will be supported.
  • wadeGilesApostrophe – an alternate apostrophe that is taken instead of the default one.
  • toneMarkType – if set to 'numbers' appended numbers from 1 to 5 will be used to mark tones, if set to 'superscriptNumbers' appended superscript numbers from 1 to 5 will be used to mark tones, if set to 'none' no tone marks will be used and no tonal information will be supplied at all.
  • neutralToneMark – if set to 'none' no tone mark is set to indicate the fifth tone (qingsheng), e.g. 'chih¹tao'; if set to 'zero' the number zero is used, e.g. 'chih¹tao⁰'; and if set to 'five' the number five is used, e.g. 'chih¹-tao⁵'.
  • missingToneMark – if set to 'noinfo', no tone information will be deduced when no tone mark is found (the tone takes on value None); if set to 'ignore', this entity will not be valid and for segmentation the behaviour defined by 'strictSegmentation' will take effect. This option only has an effect for tone mark types 'numbers' and 'superscriptNumbers', and is only valid if 'neutralToneMark' is set to something other than 'none'.
  • diacriticE – character used instead of ê. 'e' is a possible alternative, no ambiguities arise.
  • zeroFinal – character used instead of ŭ. 'u' is a possible alternative, no ambiguities arise.
  • umlautU – character used instead of ü. 'u' is allowed, but ambiguities are possible.
  • useInitialSz – if set to True syllable form szŭ is used instead of the standard ssŭ.
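
The interplay of 'neutralToneMark' and 'missingToneMark' when reading a syllable can be sketched like this. It is a minimal Python 3 illustration under the semantics described in the parameter list, not cjklib's code; split_entity_tone and TONE_DIGITS are hypothetical names, and a ValueError stands in for whatever the library raises on an invalid entity.

```python
# Hypothetical mapping of plain and superscript digits to numbers.
TONE_DIGITS = {"0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5,
               "\u2070": 0, "\u00b9": 1, "\u00b2": 2,
               "\u00b3": 3, "\u2074": 4, "\u2075": 5}


def split_entity_tone(entity, neutral_tone_mark="none",
                      missing_tone_mark="noinfo"):
    """Split a syllable into (plain syllable, tone); tone may be None."""
    neutral_digit = {"none": None, "zero": 0, "five": 5}[neutral_tone_mark]
    last = entity[-1:]
    if last in TONE_DIGITS:
        digit = TONE_DIGITS[last]
        if 1 <= digit <= 4:
            return entity[:-1], digit
        if digit == neutral_digit:
            return entity[:-1], 5
        raise ValueError("unexpected tone mark in %r" % entity)
    # No tone mark found.
    if neutral_tone_mark == "none":
        # An unmarked syllable carries the neutral tone.
        return entity, 5
    if missing_tone_mark == "noinfo":
        return entity, None
    raise ValueError("tone mark missing in %r" % entity)
```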

Todo

  • Impl: Raise value error on invalid values for diacriticE, zeroFinal, umlautU
ALLOWED_VOWEL_SUBST
Regular Wade-Giles vowels that the given options can be substituted with. Other regular vowels are not allowed for substitution because of ambiguity.
APOSTROPHE_LIST
List of apostrophes used in guessing routine.
DB_ASPIRATION_APOSTROPHE
Apostrophe used by Wade-Giles syllable data in database.
DIACRICTIC_E_LIST
List of characters for diacritic e used in guessing routine. Except ‘e’ no other values are allowed that intersect with WG vowels as they can cause ambiguous forms.
FROM_SUPERSCRIPT
Mapping of superscript numbers to tone numbers.
TO_SUPERSCRIPT
Mapping of tone numbers to superscript numbers.
UMLAUT_U_LIST
List of characters used for u-umlaut in guessing routine. Except ‘u’ no other values are allowed that intersect with WG vowels. Vowel ‘u’ will generate ambiguous forms, so the guessing routine has to take care to choose it only for forms that have no “natural” ‘u’ counterpart. For all other vowels this is not guaranteed, so they are not allowed as values.
ZERO_FINAL_LIST
List of characters for zero final used in guessing routine. Except ‘u’ no other values are allowed that intersect with WG vowels as they can cause ambiguous forms.
checkPlainEntity(plainEntity, option)

Checks if the given plain entity is a form with lost diacritics or an ambiguous case.

Examples: While form *erh can be clearly traced to êrh, form kuei has no equivalent counterpart with diacritics. The former is a case of a 'lost' vowel, the latter of a 'strict' form. Syllable ch’u though is an 'ambiguous' case, as both ch’u and ch’ü are valid.

Parameters:
  • plainEntity (str) – entity without tonal information
  • option (str) – one option out of 'diacriticE', 'zeroFinal' or 'umlautU'
Return type:

str

Returns:

'strict' if the given form is a strict Wade-Giles form with the plain vowel (e.g. u), 'lost' if the given form is a mangled vowel form, 'ambiguous' if both the plain and the diacritic form (e.g. u and ü) exist

Raises ValueError:

if plain entity doesn’t include the ambiguous vowel in question
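
The three return values can be illustrated with a table lookup. The sketch below is Python 3 and purely illustrative: check_plain_entity, KNOWN_SYLLABLES and DIACRITIC are hypothetical names, and the tiny syllable set stands in for the full syllable data the library would consult.

```python
# Hypothetical, deliberately tiny set of standard Wade-Giles syllables.
KNOWN_SYLLABLES = {"êrh", "kuei", "ch’u", "ch’ü", "hsü", "tzŭ"}

# Plain vowel and its diacritic counterpart for each option.
DIACRITIC = {"diacriticE": ("e", "ê"),
             "zeroFinal": ("u", "ŭ"),
             "umlautU": ("u", "ü")}


def check_plain_entity(plain, option, syllables=KNOWN_SYLLABLES):
    """Classify a syllable as 'strict', 'lost' or 'ambiguous'."""
    plain_vowel, marked_vowel = DIACRITIC[option]
    if plain_vowel not in plain:
        raise ValueError("%r does not contain vowel %r" % (plain, plain_vowel))
    marked = plain.replace(plain_vowel, marked_vowel)
    has_plain = plain in syllables
    has_marked = marked in syllables
    if has_plain and has_marked:
        return "ambiguous"
    if has_marked:
        return "lost"
    if has_plain:
        return "strict"
    raise ValueError("%r is not a known syllable" % plain)
```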

compose(readingEntities)

Composes the given list of basic entities to a string by applying a hyphen between syllables.

Parameter:readingEntities (list of str) – list of basic syllables or other content
Return type:str
Returns:composed entities
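
The hyphenation step can be sketched as follows. This is an assumption-laden Python 3 illustration rather than the library's code: compose_sketch and the crude SYLLABLE_LIKE pattern are hypothetical, and the pattern merely approximates what counts as a syllable.

```python
import re

# Crude, hypothetical test for "looks like a Wade-Giles syllable":
# letters and apostrophes, optionally followed by one tone digit.
SYLLABLE_LIKE = re.compile(r"[a-zA-ZüŭêÜŬÊ’']+[0-5⁰¹²³⁴⁵]?")


def compose_sketch(reading_entities):
    """Join entities, placing a hyphen only between two adjacent syllables."""
    parts = []
    previous_was_syllable = False
    for entity in reading_entities:
        is_syllable = SYLLABLE_LIKE.fullmatch(entity) is not None
        if is_syllable and previous_was_syllable:
            parts.append("-")
        parts.append(entity)
        previous_was_syllable = is_syllable
    return "".join(parts)
```

Other content such as whitespace or punctuation is passed through unchanged and interrupts the hyphenation.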
convertPlainEntity(plainEntity, targetOptions=None)

Converts the alternative syllable representation from the current dialect to the given target, or by default to the standard representation.

Use the WadeGilesDialectConverter for conversions in general.

Parameters:
  • plainEntity (str) – plain syllable in the current reading in lower case letters
  • targetOptions (dict) – target reading options
Return type:

str

Returns:

converted entity

Raises AmbiguousConversionError:

if conversion is ambiguous.

classmethod getDefaultOptions()
getFormattingEntities(*args, **kwargs)
getOnsetRhyme(plainSyllable)

Splits the given plain syllable into onset (initial) and rhyme (final).

Semivowels w- and y- will be treated specially and an empty initial will be returned, while the final will be extended with vowel i or u.

Old forms are not supported and will raise an UnsupportedError. For the dialect with missing diacritics on the ü an UnsupportedError is also raised, as it is unclear which syllable is meant.

Returned strings will be lowercase.

Parameter:plainSyllable (str) – syllable without tone marks
Return type:tuple of str
Returns:tuple of entity onset and rhyme
Raises InvalidEntityError:
 if the entity is invalid.
Raises UnsupportedError:
 if the given entity is not supported
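
The semivowel treatment can be illustrated with a rough Python 3 sketch. It is not the library's implementation and may differ from its actual results: get_onset_rhyme_sketch and INITIALS are hypothetical, the aspiration apostrophe is assumed to belong to the onset, and no validity checking or error raising is attempted.

```python
# Hypothetical list of initials; two-letter initials are tried first.
INITIALS = ("ch", "hs", "sh", "ss", "sz", "ts", "tz",
            "p", "m", "f", "t", "n", "l", "k", "h", "j", "s")
APOSTROPHES = ("’", "'")


def get_onset_rhyme_sketch(plain_syllable):
    """Split a plain syllable into (onset, rhyme), both lower case."""
    syllable = plain_syllable.lower()
    # Semivowels give an empty onset; the rhyme is extended with u or i
    # unless it already starts with a matching vowel.
    if syllable.startswith("w"):
        rest = syllable[1:]
        return "", (rest if rest.startswith("u") else "u" + rest)
    if syllable.startswith("y"):
        rest = syllable[1:]
        return "", (rest if rest[:1] in ("i", "ü") else "i" + rest)
    for initial in INITIALS:
        if syllable.startswith(initial):
            onset = initial
            # Assumption: the aspiration apostrophe is part of the onset.
            if syllable[len(initial):len(initial) + 1] in APOSTROPHES:
                onset += syllable[len(initial)]
            return onset, syllable[len(onset):]
    # No initial consonant, e.g. êrh.
    return "", syllable
```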
getPlainReadingEntities(*args, **kwargs)

Gets the list of plain entities supported by this reading. Different to getReadingEntities() the entities will carry no tone mark.

Syllables will use the user specified apostrophe to mark aspiration.

Return type:set of str
Returns:set of supported syllables
getReadingCharacters(*args, **kwargs)
getTonalEntity(plainEntity, tone)
getTones(*args, **kwargs)
classmethod guessReadingDialect(readingString)

Takes a string written in Wade-Giles and guesses the reading dialect.

The following options are tested:

  • 'toneMarkType'
  • 'wadeGilesApostrophe'
  • 'neutralToneMark'
  • 'diacriticE'
  • 'zeroFinal'
  • 'umlautU'
  • 'useInitialSz'
Parameter:readingString (str) – Wade-Giles string
Return type:dict
Returns:dictionary of basic keyword settings
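
To give an idea of how such guessing might work, the following Python 3 sketch covers only two of the options, 'toneMarkType' and 'wadeGilesApostrophe'. It is a simplified illustration and not the library's routine: guess_dialect_subset and APOSTROPHE_CANDIDATES are hypothetical, and the fallback apostrophe ’ is an assumption.

```python
SUPERSCRIPTS = "⁰¹²³⁴⁵"
# Hypothetical candidate list; the first entry doubles as the fallback.
APOSTROPHE_CANDIDATES = ("’", "'", "‘", "ʼ", "ʻ")


def guess_dialect_subset(reading_string):
    """Guess tone mark type and apostrophe from a Wade-Giles string."""
    if any(char in SUPERSCRIPTS for char in reading_string):
        tone_mark_type = "superscriptNumbers"
    elif any(char in "012345" for char in reading_string):
        tone_mark_type = "numbers"
    else:
        tone_mark_type = "none"
    # Take the first candidate apostrophe that occurs in the string.
    apostrophe = next(
        (a for a in APOSTROPHE_CANDIDATES if a in reading_string),
        APOSTROPHE_CANDIDATES[0])
    return {"toneMarkType": tone_mark_type,
            "wadeGilesApostrophe": apostrophe}
```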
removeHyphens(readingEntities)

Removes hyphens between two syllables for a given decomposition.

Parameter:readingEntities (list of str) – list of basic syllables or other content
Return type:list of str
Returns:the given entity list without separating hyphens
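
A possible reading of this step, sketched in Python 3: a hyphen is dropped only when it stands directly between two syllables. The names remove_hyphens_sketch and SYLLABLE_LIKE are hypothetical, and the pattern is only a crude approximation of a syllable test.

```python
import re

# Crude, hypothetical syllable test: letters and apostrophes plus an
# optional tone digit.
SYLLABLE_LIKE = re.compile(r"[a-zA-ZüŭêÜŬÊ’']+[0-5⁰¹²³⁴⁵]?")


def remove_hyphens_sketch(reading_entities):
    """Drop hyphens that separate two syllables; keep everything else."""
    result = []
    last_index = len(reading_entities) - 1
    for i, entity in enumerate(reading_entities):
        between_syllables = (
            entity == "-"
            and 0 < i < last_index
            and SYLLABLE_LIKE.fullmatch(reading_entities[i - 1]) is not None
            and SYLLABLE_LIKE.fullmatch(reading_entities[i + 1]) is not None)
        if not between_syllables:
            result.append(entity)
    return result
```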
splitEntityTone(entity)
syllableRegex

Regex to split a string into several syllables in a crude way. It consists of:

  • Initial consonants,
  • apostrophe for aspiration,
  • vowels,
  • final consonants n/ng and rh (for êrh), h (for -ih, -üeh),
  • tone numbers.
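
A pattern built from these five parts might look as follows. This is a hypothetical Python 3 approximation for illustration, not the actual value of syllableRegex; the list of initials and the accepted apostrophes are assumptions.

```python
import re

# Hypothetical approximation of a crude syllable pattern.
CRUDE_SYLLABLE = re.compile(
    r"(?:ch|hs|sh|ss|sz|ts|tz|[pmftnlkhjswy])?"  # initial consonants
    r"[’']?"                                     # apostrophe for aspiration
    r"[aeiouüŭê]+"                               # vowels
    r"(?:ng|n|rh|h)?"                            # final consonants
    r"[0-5⁰¹²³⁴⁵]?",                             # tone number
    re.IGNORECASE,
)


def crude_syllables(text):
    """Return all substrings of text that look like syllables."""
    return CRUDE_SYLLABLE.findall(text)
```

Because the pattern is crude, it accepts strings that are not valid Wade-Giles syllables; it is only suitable as a first splitting pass.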
