cjklib.reading.operator — Operation on character readings

Operation on character readings.

Architecture

A ReadingOperator supports basic operations on strings written in a character reading:

  • decompose() breaks down a text into the basic entities of that reading (additional non-reading substrings are also accepted).
  • compose() joins these entities together and might apply formatting rules needed by the reading.
  • isReadingEntity() and isFormattingEntity() are provided to check which of the strings returned by decompose() are supported entities for the given reading. While a reading entity expresses an entity of the language (in most cases a syllable), a formatting entity merely exists for the convenience of the written form, e.g. punctuation marks or syllable separators.
  • getDefaultOptions() will return the default reading dialect.

Many child classes add further reading-specific methods.
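The four operations can be pictured with a small stand-alone class. This is only a sketch: ToyOperator, its entity sets and its regular expression are hypothetical and not part of cjklib, whose real operators derive from ReadingOperator and usually take their entities from a database.

```python
import re


class ToyOperator:
    """Hypothetical stand-in for a reading operator (not a cjklib class)."""

    READING_ENTITIES = {"ni", "hao", "ma"}   # toy set of syllables
    FORMATTING_ENTITIES = {" ", "?", "!"}    # toy set of formatting entities

    def decompose(self, readingString):
        # Runs of lower case letters become one entity each,
        # every other character becomes an entity of its own.
        return re.findall(r"[a-z]+|[^a-z]", readingString)

    def compose(self, readingEntities):
        # This toy reading needs no formatting rules when joining.
        return "".join(readingEntities)

    def isReadingEntity(self, entity):
        return entity in self.READING_ENTITIES

    def isFormattingEntity(self, entity):
        return entity in self.FORMATTING_ENTITIES


op = ToyOperator()
entities = op.decompose("ni hao ma?")
readings = [e for e in entities if op.isReadingEntity(e)]
```

Here entities holds both syllables and formatting entities, while readings keeps only the strings for which isReadingEntity() returns True.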

Romanisation

In addition to decompose() provided by the class ReadingOperator, a RomanisationOperator offers a method getDecompositions() that returns all possible decompositions in an ambiguous case. Also, as romanisations have a fixed set of entities, a method getReadingEntities() offers access to the set of all accepted reading entities.

Decomposition

Transcriptions into the Latin (or Cyrillic) alphabet generate the problem that syllable boundaries or boundaries of entities belonging to single Chinese characters are no longer clear once entities are grouped together.

It is therefore important to have methods at hand to separate strings and to split them into single entities. This cannot always be done unambiguously, though, as several different decompositions might be possible, which leads to the general case of ambiguous decompositions.
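The ambiguity can be reproduced with a few lines of code. The following sketch enumerates every way to split a string into members of a given entity set; the function name and the tiny syllable set are assumptions made for illustration and do not reflect how cjklib implements segmentation.

```python
def all_segmentations(text, entities):
    """Return every split of text into strings that are all in entities."""
    if not text:
        return [[]]          # exactly one way to split the empty string
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in entities:
            for tail in all_segmentations(text[end:], entities):
                results.append([head] + tail)
    return results


# Toy subset of Pinyin syllables without tones (hypothetical data).
TOY_ENTITIES = {"xi", "an", "xian", "fan", "gan", "fang"}

xian = all_segmentations("xian", TOY_ENTITIES)
fangan = all_segmentations("fangan", TOY_ENTITIES)
```

Both inputs yield two candidate decompositions: "xian" can be read as one syllable or as "xi" plus "an", and "fangan" as "fan" plus "gan" or "fang" plus "an".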

Many romanisations do provide a way to tackle this problem. Pinyin for example requires the use of an apostrophe (') when the reverse process of splitting the string into syllables gets ambiguous. The Wade-Giles romanisation in its strict implementation asks for a hyphen used between all syllables. The LSHK’s Jyutping when written with tone marks will always be clearly decomposable as the digits mark syllable borders.

The method isStrictDecomposition() can be implemented to check if one possible decomposition is the strict decomposition offered by the romanisation’s protocol. This method should guarantee that under all circumstances only one decomposed version will be regarded as strict.

If no strict version is yielded and different decompositions exist, an unambiguous decomposition cannot be made. All decompositions can be accessed through the method getDecompositions(), even in cases where a strict decomposition exists.
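A strictness check might look like the following sketch, which encodes a simplified version of the Pinyin apostrophe rule. The function name and the rule as written here are assumptions for illustration; the check actually performed by a cjklib operator may differ.

```python
def is_strict_toy(decomposition):
    """Toy rule: a syllable starting with a, e or o that is not the first
    entity must directly follow an apostrophe entity."""
    previous = None
    for entity in decomposition:
        needs_separator = previous is not None and entity[0] in "aeo"
        if needs_separator and previous != "'":
            return False
        previous = entity
    return True


candidates = [["xi", "an"], ["xian"]]
strict = [d for d in candidates if is_strict_toy(d)]
```

Of the two candidates for "xian" only the single-syllable reading passes, so exactly one decomposition is regarded as strict. The two-syllable reading becomes strict once it is written with the separator, as in ['xi', "'", 'an'].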

Letter case

Romanisations differ from other readings in that their entities can be written in upper case, lower case, or a mix of both. By default operators recognise both; this behaviour can be changed by setting the option 'case' to 'lower'. Writing in upper case only is not explicitly supported. If such a writing is needed, it can be implemented by choosing lower case and converting strings to and from the operator manually. The method getReadingEntities() returns lower case entities by default.
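The effect of the 'case' option and the manual workaround for upper case text can be sketched as follows. All names and the entity set are hypothetical and only mimic the behaviour described above; they are not cjklib functions.

```python
import re

TOY_ENTITIES = {"bei", "jing"}   # stored in lower case


def is_reading_entity_toy(entity, case="both"):
    """Accept mixed case unless case is 'lower'."""
    if case == "lower":
        return entity in TOY_ENTITIES
    return entity.lower() in TOY_ENTITIES


def decompose_lower_toy(text):
    # Toy decomposition that only knows lower case entities.
    return re.findall(r"bei|jing|.", text)


def decompose_upper_toy(text):
    """Workaround for upper case input: lower the string, decompose it,
    then convert every returned entity back to upper case."""
    return [entity.upper() for entity in decompose_lower_toy(text.lower())]


upper_result = decompose_upper_toy("BEIJING")
```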

Tonal readings

Tonal readings are supported by the class TonalFixedEntityOperator. It provides two methods, getTonalEntity() and splitEntityTone(), to cope with tonal information in text.

Tones

Operators are free to handle tones according to their needs. No data type constraint is given so that some will handle tones as integers, while others will handle strings. Even the count of tones between different operators for the same language may vary as one system might be more specific about tonal features.

Plain entities

While some operators have a fixed set of accepted entities, the more specific subgroup for tonal languages has a set of basic entities, each called a plain entity here, which can be annotated with tonal information to yield a regular reading entity. Some plain entities might themselves be normal reading entities, while others might not. There is no requirement that the cross product of the set of plain entities and the set of tones fully spans the set of reading entities.
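The relation between plain entities, tones and reading entities can be illustrated with a toy reading that marks tones with trailing digits. Everything in this sketch (names, data, the error class) is hypothetical; it only shows that combining plain entities with tones does not always give a valid reading entity.

```python
class ToyInvalidEntityError(Exception):
    """Hypothetical stand-in for cjklib's InvalidEntityError."""


TOY_PLAIN = {"ma", "de"}
TOY_TONES = (1, 2, 3, 4, None)            # None stands for "no tone mark"
TOY_READING_ENTITIES = {"ma1", "ma2", "ma3", "ma4", "ma", "de"}


def tonal_form(plain, tone):
    return plain if tone is None else "%s%d" % (plain, tone)


def get_tonal_entity_toy(plain, tone):
    entity = tonal_form(plain, tone)
    if entity not in TOY_READING_ENTITIES:
        raise ToyInvalidEntityError(entity)
    return entity


def split_entity_tone_toy(entity):
    if entity not in TOY_READING_ENTITIES:
        raise ToyInvalidEntityError(entity)
    if entity[-1].isdigit():
        return entity[:-1], int(entity[-1])
    return entity, None


# Combinations of plain entity and tone that are not reading entities.
cross_product = {tonal_form(p, t) for p in TOY_PLAIN for t in TOY_TONES}
not_entities = cross_product - TOY_READING_ENTITIES
```

In this toy data "de" is both a plain entity and a reading entity, while combinations such as "de1" are not accepted.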

Examples

Decompose a reading string in Gwoyeu Romatzyh into single entities:

>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> f.decompose('"Hannshyue" .de mingcheng duey Jonggwo [...]', 'GR')
['"', 'Hann', 'shyue', '" ', '.de', ' ', 'ming', 'cheng', ' ', 'duey', ' ', 'Jong', 'gwo', ' [...]']

The same can be done by directly using the operator’s instance:

>>> from cjklib.reading import operator
>>> cy = operator.CantoneseYaleOperator()
>>> cy.decompose(u'gwóngjàuwá')
[u'gwóng', u'jàu', u'wá']

Composing will reverse the process, using a Pinyin string:

>>> f.compose([u'xī', u'ān'], 'Pinyin')
u"xī'ān"

For more complex operators, see PinyinOperator or MandarinIPAOperator.

Base classes

class cjklib.reading.operator.ReadingOperator(**options)

Defines an abstract operator on text written in a character reading.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector, if none is given, default settings will be assumed.
READING_NAME
Unique name of reading
compose(readingEntities)

Composes the given list of basic entities to a string.

Composing entities can raise a CompositionError if a non-reading entity is about to be joined with a reading entity and will result in a string that is impossible to decompose.

The base class’ implementation will raise a NotImplementedError.

Parameter:readingEntities (list of str) – list of basic entities or other content
Return type:str
Returns:composed entities
Raises CompositionError:
 if the given entities can not be composed.
decompose(readingString)

Decomposes the given string into basic entities that can be mapped to one Chinese character each (exceptions possible).

The given input string can contain other non-reading characters, e.g. punctuation marks.

The returned list contains a mix of basic reading entities and other characters e.g. spaces and punctuation marks.

The base class’ implementation will raise a NotImplementedError.

Parameter:readingString (str) – reading string
Return type:list of str
Returns:a list of basic entities of the input string
Raises DecompositionError:
 if the string can not be decomposed.
classmethod getDefaultOptions()

Returns the reading operator’s default options.

The base class’ implementation returns an empty dictionary. The keyword ‘dbConnectInst’ is not regarded as a configuration option of the operator and is thus not included in the dict returned.

Return type:dict
Returns:the reading operator’s default options.
isFormattingEntity(entity)

Returns True if the given entity is a valid formatting entity recognised by the reading operator.

The base class’ implementation will always return False.

Parameter:entity (str) – entity to check
Return type:bool
Returns:True if string is a formatting entity of the reading.
isReadingEntity(entity)

Returns True if the given entity is a valid reading entity recognised by the reading operator, i.e. it will be returned by decompose().

The base class’ implementation will raise a NotImplementedError.

Parameter:entity (str) – entity to check
Return type:bool
Returns:True if string is an entity of the reading, False otherwise.
class cjklib.reading.operator.RomanisationOperator(**options)

Bases: cjklib.reading.operator.ReadingOperator

Defines an abstract ReadingOperator on text written in a romanisation, i.e. text written in the Latin alphabet or written in the Cyrillic alphabet.

Todo

  • Impl: Optimise decompose() as to incorporate segment() and prune the tree while it is created. Does this though yield significant improvement? Would at least be O(n).
Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector, if none is given, default settings will be assumed.
  • strictSegmentation – if True segmentation (using segment()) and thus decomposition (using decompose()) will raise an exception if an alphabetic string is parsed which can not be segmented into single reading entities. If False the aforesaid string will be returned unsegmented.
  • case – if set to 'lower', only lower case will be supported, if set to 'both' a mix of upper and lower case will be supported.
decompose(readingString)

Decomposes the given string into basic entities on a one-to-one mapping level to Chinese characters. Decomposing can be ambiguous, and two assumptions are made to solve this problem: If two subsequent entities together make up a longer valid entity, then the decomposition with the shorter entities can be disregarded. Furthermore, it is assumed that the reading provides rules to mark entity borders and that these rules can be checked, so that the decomposition that abides by these rules will be preferred. This check is done by calling isStrictDecomposition().

The given input string can contain other characters not supported by the reading, e.g. punctuation marks. The returned list then contains a mix of basic reading entities and other characters e.g. spaces and punctuation marks.

Parameter:readingString (str) – reading string
Return type:list of str
Returns:a list of basic entities of the input string
Raises AmbiguousDecompositionError:
 if decomposition is ambiguous.
Raises DecompositionError:
 if the given string has a wrong format.
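The first assumption made by decompose(), that a longer valid entity wins over two shorter ones, can be sketched as a filter over candidate decompositions. The function name and data are hypothetical, and the sketch makes no claim about how cjklib implements the rule internally.

```python
def drop_shadowed(decompositions, entities):
    """Discard decompositions in which two neighbouring entities
    together form a longer valid entity."""
    kept = []
    for decomposition in decompositions:
        pairs = zip(decomposition, decomposition[1:])
        if not any(a + b in entities for a, b in pairs):
            kept.append(decomposition)
    return kept


TOY_ENTITIES = {"xi", "an", "xian", "fan", "gan", "fang"}

resolved = drop_shadowed([["xi", "an"], ["xian"]], TOY_ENTITIES)
unresolved = drop_shadowed([["fan", "gan"], ["fang", "an"]], TOY_ENTITIES)
```

The first case is resolved to a single decomposition. In the second case neither candidate contains a pair that forms a longer entity, so both remain; this is the situation in which the strictness check, or failing that an AmbiguousDecompositionError, comes into play.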
getDecompositionTree(readingString)

Decomposes the given string into basic entities that can be mapped to one Chinese character each for all possible decompositions and returns the possible decompositions as a lattice.

Parameter:readingString (str) – reading string
Return type:list
Returns:a list of all possible decompositions consisting of basic entities as a lattice construct.
Raises DecompositionError:
 if the given string has a wrong format.
getDecompositions(readingString)

Decomposes the given string into basic entities that can be mapped to one Chinese character each for all possible decompositions. This method is a more general version of decompose().

The returned list construction consists of two entity types: entities of the romanisation and other strings.

Parameter:readingString (str) – reading string
Return type:list of list of str
Returns:a list of all possible decompositions consisting of basic entities.
Raises DecompositionError:
 if the given string has a wrong format.
classmethod getDefaultOptions()
getFormattingEntities(*args, **kwargs)

Gets a set of entities used by the reading to format reading entities.

The base class’ implementation will return an empty set.

Return type:set of str
Returns:set of supported formatting entities
getReadingCharacters(*args, **kwargs)

Gets a list of characters parsed by this reading operator as reading entities. For alphabetic characters, lower case is returned.

Separators like the apostrophe (') in Pinyin are not part of reading entities and as such not included.

Return type:set
Returns:set of characters parsed by the reading operator
getReadingEntities(*args, **kwargs)

Gets a set of all entities supported by the reading.

The list is used in the segmentation process to find entity boundaries. The base class’ implementation will raise a NotImplementedError.

Returned entities are in lowercase.

Return type:set of str
Returns:set of supported reading entities
isFormattingEntity(entity)

Returns True if the given entity is a valid formatting entity recognised by the romanisation operator.

Letter case of characters will be handled depending on the setting for option 'case'.

Parameter:entity (str) – entity to check
Return type:bool
Returns:True if string is a formatting entity of the reading.
isReadingEntity(entity)

Returns true if the given entity is recognised by the romanisation operator, i.e. it is a valid entity of the reading returned by the segmentation method.

Letter case of characters will be handled depending on the setting for option 'case'.

Parameter:entity (str) – entity to check
Return type:bool
Returns:True if string is an entity of the reading, False otherwise.
isStrictDecomposition(decomposition)

Checks if the given decomposition follows the romanisation format strictly to allow unambiguous decomposition.

The romanisation should offer a way/protocol to make an unambiguous decomposition into its basic syllables possible, so as to make the process of appending syllables to a string reversible. Testing for compliance with this protocol has to be implemented here. Thus this method can return True for one and only one possible decomposition of any string.

Parameter:decomposition (list of str) – decomposed reading string
Return type:bool
Returns:False, as this method needs to be implemented by the subclass
segment(readingString)

Takes a string written in the romanisation and returns the possible segmentations as a list of syllables.

In contrast to decompose(), this method merely segments continuous entities of the romanisation. Characters not part of the romanisation will not be dealt with; this is the task of the more general decompose() method.

Option 'strictSegmentation' controls the behaviour of this method for strings that cannot be parsed. If set to True segmentation will raise an exception, if set to False the given string will be returned unsegmented.

Parameter:readingString (str) – reading string
Return type:list of list of str
Returns:a list of possible segmentations (several if ambiguous) into single syllables
Raises DecompositionError:
 if the given string has an invalid format.
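The two behaviours selected by 'strictSegmentation' can be sketched as follows. The helper names, the error class and the shape of the fallback value (a single segmentation containing the unsegmented string) are assumptions for illustration only.

```python
class ToyDecompositionError(Exception):
    """Hypothetical stand-in for cjklib's DecompositionError."""


def _segmentations(text, entities):
    # Enumerate all splits of text into members of entities.
    if not text:
        return [[]]
    found = []
    for end in range(1, len(text) + 1):
        if text[:end] in entities:
            found.extend([text[:end]] + tail
                         for tail in _segmentations(text[end:], entities))
    return found


def segment_toy(text, entities, strictSegmentation=False):
    found = _segmentations(text, entities)
    if found:
        return found
    if strictSegmentation:
        raise ToyDecompositionError("cannot segment %r" % text)
    return [[text]]          # hand the string back unsegmented


TOY_ENTITIES = {"xi", "an", "xian"}
lenient = segment_toy("xyz", TOY_ENTITIES)
```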
class cjklib.reading.operator.SimpleEntityOperator(**options)

Bases: cjklib.reading.operator.ReadingOperator

Provides an operator on readings with a single character per entity.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector, if none is given, default settings will be assumed.
compose(readingEntities)
decompose(readingString)
class cjklib.reading.operator.TonalFixedEntityOperator(**options)

Bases: cjklib.reading.operator.ReadingOperator

Provides an abstract ReadingOperator for tonal languages for a reading based on a fixed set of reading entities.

Parameter:options – extra options
getPlainReadingEntities(*args, **kwargs)

Gets the list of plain entities supported by this reading. Unlike the entities returned by getReadingEntities(), these entities carry no tone mark.

The base class’ implementation will raise a NotImplementedError.

Return type:set of str
Returns:set of supported syllables
getReadingEntities(*args, **kwargs)

Gets a set of all entities supported by the reading.

The list is used in the segmentation process to find entity boundaries.

Return type:list of str
Returns:list of supported syllables
getTonalEntity(plainEntity, tone)

Gets the entity with tone mark for the given plain entity and tone. The letter case of the given plain entity might not be fully conserved for mixed case strings.

The base class’ implementation will raise a NotImplementedError.

Parameters:
  • plainEntity (str) – entity without tonal information
  • tone – tone
Return type:str
Returns:entity with appropriate tone
Raises InvalidEntityError:
 if the entity is invalid.
Raises UnsupportedError:
 if the operation is not supported for the given form.
getTones(*args, **kwargs)

Returns a set of tones supported by the reading. These tones don’t necessarily reflect the tones of the underlying language but may differ to reflect notational or other features.

The base class’ implementation will raise a NotImplementedError.

Return type:list
Returns:list of supported tone marks.
isPlainReadingEntity(entity)

Returns true if the given plain entity (without any tone mark) is recognised by the romanisation operator, i.e. it is a valid entity of the reading returned by the segmentation method.

Parameter:entity (str) – entity to check
Return type:bool
Returns:True if string is an entity of the reading, False otherwise.
isReadingEntity(entity)
splitEntityTone(entity)

Splits the entity into an entity without tone mark (plain entity) and the entity’s tone. The letter case of the given entity might not be fully conserved for mixed case strings.

The base class’ implementation will raise a NotImplementedError.

Parameter:entity (str) – entity with tonal information
Return type:tuple
Returns:plain entity without tone mark and entity’s tone
Raises InvalidEntityError:
 if the entity is invalid.
Raises UnsupportedError:
 if the operation is not supported for the given form.
class cjklib.reading.operator.TonalIPAOperator(**options)

Bases: cjklib.reading.operator.TonalFixedEntityOperator

Defines an operator on strings of a tonal language written in the International Phonetic Alphabet (IPA).

TonalIPAOperator does not supply the same closed set of syllables as other ReadingOperators, as IPA provides different ways to represent pronunciation. Because of that, a user-defined IPA syllable will not easily map to another transcription system, and thus only basic support is provided for this direction.

Tones in IPA can be expressed using different schemes. The following schemes are implemented here:

  • Numbers, tone numbers,
  • ChaoDigits, numbers displaying the levels of Chao tone contours,
  • IPAToneBar, IPA modifying tone bar characters, e.g. ɛw˥˧,
  • Diacritics, diacritical marks and finally
  • None, no support for tone marks

Todo

  • Lang: Shed more light on representations of tones in IPA.
  • Impl: Get all diacritics used in IPA as tones for TONE_MARK_REGEX.
  • Fix: What about CompositionError? All romanisations raise it, but they have a distinct set of characters that belong to the reading.
Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector, if none is given, default settings will be assumed.
  • toneMarkType – type of tone marks, one out of 'numbers', 'superscriptNumbers', 'chaoDigits', 'superscriptChaoDigits', 'ipaToneBar', 'diacritics', 'none'
  • missingToneMark – if set to 'noinfo' no tone information will be deduced when no tone mark is found (takes on value None), if set to 'ignore' this entity will not be valid. Either behaviour only becomes effective if the chosen 'toneMarkType' makes no use of empty tone marks.
DEFAULT_TONE_MARK_TYPE
Tone mark type to select by default.
TONES
List of tone names. Needs to be implemented in child class.
TONE_MARK_MAPPING
Mapping of tone names to tone mark for each tone mark type. Needs to be implemented in child classes.
TONE_MARK_PREFER
Mapping of tone marks to tone name which will be preferred on ambiguous mappings. Needs to be implemented in child classes.
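How a child class might fill these attributes, and how a preference resolves an ambiguous tone mark, is sketched below. The nesting of the dictionaries (tone mark type first, then tone name or tone mark), the tone names and the marks are assumptions chosen for illustration; the actual layout in cjklib child classes may differ.

```python
TOY_TONES = ["HighLevel", "HighFalling"]

# Assumed layout: tone mark type -> tone name -> tone mark.
TOY_TONE_MARK_MAPPING = {
    "numbers": {"HighLevel": "1", "HighFalling": "1"},
    "chaoDigits": {"HighLevel": "55", "HighFalling": "53"},
    "ipaToneBar": {"HighLevel": "\u02e5", "HighFalling": "\u02e5\u02e7"},
}

# Assumed layout: tone mark type -> tone mark -> preferred tone name.
TOY_TONE_MARK_PREFER = {"numbers": {"1": "HighLevel"}}


def tone_for_mark_toy(tone_mark, tone_mark_type):
    """Look up the tone name for a mark, honouring preferences first."""
    preferred = TOY_TONE_MARK_PREFER.get(tone_mark_type, {})
    if tone_mark in preferred:
        return preferred[tone_mark]
    for tone, mark in TOY_TONE_MARK_MAPPING[tone_mark_type].items():
        if mark == tone_mark:
            return tone
    raise ValueError("unknown tone mark %r" % tone_mark)
```

With tone numbers both toy tones share the mark "1", so the preference decides; with Chao digits or tone bars the mapping is already unambiguous.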
compose(readingEntities)

Composes the given list of basic entities to a string. IPA syllables are separated by a period.

Parameter:readingEntities (list of str) – list of basic entities or other content
Return type:str
Returns:composed entities
decompose(readingString)

Decomposes the given string into basic entities that can be mapped to one Chinese character each (exceptions possible).

The returned list contains a mix of basic reading entities and other characters e.g. spaces and punctuation marks.

Single syllables can only be found if distinguished by a period or whitespace, such as compose() would return.

Parameter:readingString (str) – reading string
Return type:list of str
Returns:a list of basic entities of the input string
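The period convention used by compose() and relied on by decompose() can be sketched with two small functions. They are hypothetical simplifications: unlike the real decompose(), this toy version drops the separators instead of returning them among the entities.

```python
import re


def compose_ipa_toy(entities):
    # Join syllables with a period so that the borders stay visible.
    return ".".join(entities)


def decompose_ipa_toy(text):
    # Syllables are only found where a period or whitespace separates them.
    return [part for part in re.split(r"[.\s]+", text) if part]


syllables = ["\u025bw\u02e5\u02e7", "ma\u02e5"]
composed = compose_ipa_toy(syllables)
```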
classmethod getDefaultOptions()
getTonalEntity(plainEntity, tone)

Gets the entity with tone mark for the given plain entity and tone.

The entity returned will always be in Unicode’s Normalization Form C (NFC, see http://www.unicode.org/reports/tr15/).

Parameters:
  • plainEntity (str) – entity without tonal information
  • tone (str) – tone
Return type:str
Returns:entity with appropriate tone
Raises InvalidEntityError:
 if the entity is invalid.

Todo

  • Impl: Place diacritics on main vowel, derive from IPA representation.
getToneForToneMark(toneMark)

Gets the tone for the given tone mark.

Parameter:toneMark (str) – tone mark representation of the tone
Return type:str
Returns:tone
Raises InvalidEntityError:
 if the toneMark does not exist.
getTones(*args, **kwargs)
classmethod guessReadingDialect(readingString, includeToneless=False)

Takes a string written in IPA and guesses the reading dialect.

Supports option 'toneMarkType'.

Parameter:readingString (str) – IPA string
Return type:dict
Returns:dictionary of basic keyword settings
splitEntityTone(entity)

Splits the entity into an entity without tone mark and the name of the entity’s tone.

The plain entity returned will always be in Unicode’s Normalization Form C (NFC, see http://www.unicode.org/reports/tr15/).

Parameter:entity (str) – entity with tonal information
Return type:tuple
Returns:plain entity without tone mark and additionally the tone
Raises InvalidEntityError:
 if the entity is invalid.
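The NFC guarantee and the splitting of a trailing tone mark can be illustrated for the tone bar scheme. This sketch uses Python's standard unicodedata module; the function name is hypothetical, and it returns the raw tone mark rather than a tone name, which the real method would look up.

```python
import re
import unicodedata

# IPA tone letters from extra-high (U+02E5) to extra-low (U+02E9).
TONE_BARS = "\u02e5\u02e6\u02e7\u02e8\u02e9"
_SPLIT = re.compile(r"(.*?)([%s]*)" % TONE_BARS, re.DOTALL)


def split_tone_bar_toy(entity):
    """Return (plain entity in NFC, trailing tone bars or None)."""
    entity = unicodedata.normalize("NFC", entity)
    plain, mark = _SPLIT.fullmatch(entity).groups()
    return plain, (mark or None)


# "e" followed by a combining acute accent is composed to one character.
plain, mark = split_tone_bar_toy("e\u0301w\u02e5\u02e7")
```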
class cjklib.reading.operator.TonalRomanisationOperator(**options)

Bases: cjklib.reading.operator.RomanisationOperator, cjklib.reading.operator.TonalFixedEntityOperator

Provides an abstract RomanisationOperator for tonal languages incorporating methods from TonalFixedEntityOperator.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector, if none is given, default settings will be assumed.
  • strictSegmentation – if True segmentation (using segment()) and thus decomposition (using decompose()) will raise an exception if an alphabetic string is parsed which can not be segmented into single reading entities. If False the aforesaid string will be returned unsegmented.
  • case – if set to 'lower', only lower case will be supported, if set to 'both' a mix of upper and lower case will be supported.
getReadingEntities(*args, **kwargs)

Gets a set of all entities supported by the reading.

The list is used in the segmentation process to find entity boundaries.

Returned entities are in lowercase.

Return type:list of str
Returns:list of supported syllables
isPlainReadingEntity(entity)

Returns true if the given plain entity (without any tone mark) is recognised by the romanisation operator, i.e. it is a valid entity of the reading returned by the segmentation method.

Case of characters will be handled depending on the setting for option 'case'.

Parameter:entity (str) – entity to check
Return type:bool
Returns:True if string is an entity of the reading, False otherwise.
isReadingEntity(entity)