cjklib.reading.converter — Conversion between character readings

Conversion between character readings.

Architecture

The basic method is convert() which converts one input string from one reading to another.

The method getDefaultOptions() will return the conversion default settings.

What gets converted

The conversion process uses the ReadingOperator for the source reading to decompose the given string into single entities. The decomposition contains reading entities and entities that don’t represent any pronunciation. While the goal is to convert the included reading entities to the target reading, some converters might decide to also convert non-reading entities. Examples are delimiters such as apostrophes, which differ between romanisations, or punctuation marks that have a defined representation in the target system, e.g. Braille.
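The decompose, convert, and recompose flow described above can be sketched in a few lines. This is a toy illustration, not cjklib's implementation: the `decompose` function, the `SYLLABLES` table, and the apostrophe-to-hyphen rule in `DELIMITERS` are all hypothetical and only stand in for a real ReadingOperator and converter.

```python
import re

# Hypothetical toy mapping from source syllables to target syllables.
SYLLABLES = {"xi": "hsi", "an": "an"}
# Hypothetical rule: the source delimiter "'" is written "-" in the target.
DELIMITERS = {"'": "-"}


def decompose(string):
    """Split into runs of lowercase letters and runs of everything else."""
    return [part for part in re.split(r"([a-z]+)", string) if part]


def convert(string):
    result = []
    for entity in decompose(string):
        if entity in SYLLABLES:
            # Reading entity: convert it to the target reading.
            result.append(SYLLABLES[entity])
        elif entity in DELIMITERS:
            # Non-reading entity with a defined target representation.
            result.append(DELIMITERS[entity])
        else:
            # Any other entity passes through unchanged.
            result.append(entity)
    return "".join(result)


converted = convert("xi'an, xi")
```

In this sketch the comma and space are carried over untouched, while the apostrophe is the kind of non-reading entity a converter might choose to translate.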

Errors

By default, conversion won’t stop on entities that closely resemble reading entities but are not themselves valid. These turn up unchanged in the result and can cause a CompositionError when the target operator decides that it is impossible to link a converted entity with a non-converted one, as doing so would make it impossible to determine the entity boundaries later. Most of these errors will probably result from bad input that fails on conversion. This can be solved by telling the source operator to be strict on decomposition (where supported) so that the error is reported beforehand. The following example tries to convert xiǎo tōu (“thief”), misspelled as *xiǎo tō:

>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> print f.convert(u'xiao3to1', 'Pinyin', 'GR',
...     sourceOptions={'toneMarkType': 'numbers'})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
cjklib.exception.CompositionError: Unable to delimit non-reading entity 'to1'
>>> print f.convert(u'xiao3to1', 'Pinyin', 'GR',
...     sourceOptions={'toneMarkType': 'numbers',
...         'strictSegmentation': True})
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
...
cjklib.exception.DecompositionError: Segmentation of 'to1' not possible or invalid syllable

Not being strict results in a lazy conversion, which might fail in some cases as shown above. The string u'xiao3 to1' (with a space in between), however, will work with the lazy conversion ('to1' is left unconverted), while the strict version will still report the invalid *to1.
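The difference between lazy and strict handling can be modelled with a small stand-alone sketch. It does not use cjklib: the exception classes are local stand-ins, the `VALID` syllable table is hypothetical, and the actual conversion step is left out so that only decomposition and composition are shown.

```python
import re


class DecompositionError(Exception):
    """Local stand-in for cjklib.exception.DecompositionError."""


class CompositionError(Exception):
    """Local stand-in for cjklib.exception.CompositionError."""


VALID = {"xiao3", "tou1"}  # hypothetical table of valid source syllables


def decompose(string, strict=False):
    """Return (entity, is_reading) pairs; strict mode rejects look-alikes."""
    pairs = []
    for token in re.findall(r"[a-z]+[1-5]?|[^a-z]+", string):
        is_reading = token in VALID
        if strict and not is_reading and token[0].isalpha():
            raise DecompositionError("invalid syllable %r" % token)
        pairs.append((token, is_reading))
    return pairs


def compose(pairs):
    """Join entities, refusing to glue a look-alike directly onto a syllable."""
    for (left, left_ok), (right, right_ok) in zip(pairs, pairs[1:]):
        if left_ok != right_ok and left[0].isalpha() and right[0].isalpha():
            raise CompositionError(
                "unable to delimit %r" % (right if left_ok else left))
    return "".join(token for token, _ in pairs)
```

With these toy rules, `compose(decompose("xiao3 to1"))` succeeds, `compose(decompose("xiao3to1"))` raises the stand-in CompositionError, and `decompose("xiao3 to1", strict=True)` raises the stand-in DecompositionError before composition is attempted.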

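Other errors that can arise are listed in the Raises entries of the individual methods, for example ConversionError, AmbiguousConversionError and UnsupportedError.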

Bridge

Conversions between two Readings can be made using a third reading if no direct conversion is defined. This reading is called a bridge reading and is implemented in BridgeConverter. Using the routines from the ReadingFactory will automatically employ bridges if needed.
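The idea of a bridge can be sketched as a two-hop lookup. Everything in this example is hypothetical: the reading names "A", "B" and "C", the toy string transformations, and the `DIRECT` and `BRIDGES` tables merely illustrate the principle and are not taken from cjklib.

```python
# Hypothetical direct converters, keyed by (fromReading, toReading).
DIRECT = {
    ("A", "B"): lambda s: s.upper(),
    ("B", "C"): lambda s: s + "!",
}

# Hypothetical bridge table in the form (fromReading, bridgeReading, toReading).
BRIDGES = [("A", "B", "C")]


def convert(string, from_reading, to_reading):
    """Use a direct converter if one exists, otherwise go over a bridge."""
    direct = DIRECT.get((from_reading, to_reading))
    if direct is not None:
        return direct(string)
    for source, bridge, target in BRIDGES:
        if (source, target) == (from_reading, to_reading):
            # Two hops: source -> bridge, then bridge -> target.
            intermediate = convert(string, source, bridge)
            return convert(intermediate, bridge, target)
    raise ValueError(
        "no conversion from %s to %s" % (from_reading, to_reading))
```

Because each hop may lose information, the choice of bridge reading affects the result, which is why a table of explicit paths is used here rather than a search over all possible routes.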

Examples

Convert a string from Jyutping to Cantonese Yale:

>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> f.convert('gwong2jau1waa2', 'Jyutping', 'CantoneseYale')
u'gwóngyāuwá'

A converter instance can also be created explicitly using the factory, shown here for a conversion from GR to Pinyin:

>>> jyc = f.createReadingConverter('GR', 'Pinyin')
>>> jyc.convert('Woo.men tingshuo yeou "Yinnduhshyue", "Aijyishyue"')
u'Wǒmen tīngshuō yǒu "Yìndùxué", "Āijíxué"'

Convert between different dialects of the same reading, here Wade-Giles:

>>> f.convert(u'kuo3-yü2', 'WadeGiles', 'WadeGiles',
...     sourceOptions={'toneMarkType': 'numbers'},
...     targetOptions={'toneMarkType': 'superscriptNumbers'})
u'kuo³-yü²'

See PinyinDialectConverter for more examples.

Base classes

class cjklib.reading.converter.ReadingConverter(*args, **options)

Bases: object

Defines an abstract converter between two or more character readings.

Parameters:
  • args – optional list of ReadingOperators to use for handling source and target readings.
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector, if none is given, default settings will be assumed.
  • sourceOperators – list of ReadingOperators used for handling source readings.
  • targetOperators – list of ReadingOperators used for handling target readings.
CONVERSION_DIRECTIONS
List of tuples for specifying supported conversion directions from reading A to reading B. If both directions are supported, two tuples (A, B) and (B, A) are given.
convert(string, fromReading, toReading)

Converts a string in the source reading to the given target reading.

Parameters:
  • string (str) – string written in the source reading
  • fromReading (str) – name of the source reading
  • toReading (str) – name of the target reading
Return type: str
Returns: the input string converted to the toReading
Raises DecompositionError: if the string can not be decomposed into basic entities with regards to the source reading or the given information is insufficient.
Raises CompositionError: if the target reading’s entities can not be composed.
Raises ConversionError: on operations specific to the conversion between the two readings (e.g. error on converting entities).
Raises UnsupportedError: if source or target reading is not supported for conversion.

Todo

  • Impl: Make parameters fromReading, toReading optional if only one conversion direction is given. Same for convertEntities().
convertEntities(readingEntities, fromReading, toReading)

Converts a list of entities in the source reading to the given target reading.

The default implementation will raise a NotImplementedError.

Parameters:
  • readingEntities (list of str) – list of entities written in source reading
  • fromReading (str) – name of the source reading
  • toReading (str) – name of the target reading
Return type: list of str
Returns: list of entities written in target reading
Raises ConversionError: on operations specific to the conversion between the two readings (e.g. error on converting entities).
Raises UnsupportedError: if source or target reading is not supported for conversion.
Raises InvalidEntityError: if an invalid entity is given.

classmethod getDefaultOptions()

Returns the reading converter’s default options.

The keyword ‘dbConnectInst’ is not regarded as a configuration option of the converter and is thus not included in the dict returned.

Return type: dict
Returns: the reading converter’s default options.
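One way a converter could combine its defaults with user-supplied options is sketched below. `ToyConverter` and the option name `keepCase` are hypothetical and not part of cjklib; the sketch only illustrates that dbConnectInst is handled separately from the configuration options.

```python
class ToyConverter(object):
    """Hypothetical converter used for illustration; not a cjklib class."""

    @classmethod
    def getDefaultOptions(cls):
        # "keepCase" is an invented option name for this sketch.
        return {"keepCase": True}

    def __init__(self, **options):
        # dbConnectInst is taken out first: it is not a configuration option.
        self.db = options.pop("dbConnectInst", None)
        defaults = self.getDefaultOptions()
        unknown = set(options) - set(defaults)
        if unknown:
            raise ValueError(
                "unknown options: %s" % ", ".join(sorted(unknown)))
        # User-supplied values override the defaults.
        self.options = dict(defaults, **options)
```

Calling `ToyConverter(keepCase=False, dbConnectInst=connection)` would then store the connection separately and leave only `keepCase` in the options dict.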
class cjklib.reading.converter.EntityWiseReadingConverter(*args, **options)

Bases: cjklib.reading.converter.ReadingConverter

Defines an abstract ReadingConverter between two or more readings for doing entity-wise conversion.

Converters that simply convert one syllable at a time can subclass this class and merely need to override convertBasicEntity().

Parameters:
  • args – optional list of ReadingOperators to use for handling source and target readings.
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector, if none is given, default settings will be assumed.
  • sourceOperators – list of ReadingOperators used for handling source readings.
  • targetOperators – list of ReadingOperators used for handling target readings.
convertBasicEntity(entity, fromReading, toReading)

Converts a basic entity (e.g. a syllable) in the source reading to the given target reading.

This method is called by convertEntities() and a single entity is given for conversion.

The default implementation will raise a NotImplementedError.

Parameters:
  • entity (str) – string written in the source reading
  • fromReading (str) – name of the source reading
  • toReading (str) – name of the target reading
Return type: str
Returns: the entity converted to the toReading
Raises AmbiguousConversionError: if conversion for this entity of the source reading is ambiguous.
Raises ConversionError: on other operations specific to the conversion of the entity.
Raises InvalidEntityError: if the entity is invalid.

convertEntities(readingEntities, fromReading, toReading)
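The subclassing pattern for entity-wise converters can be sketched without cjklib. `ToyEntityWiseConverter` and `ToyTableConverter` are hypothetical stand-ins, and the toy `convertEntities()` treats every item as a reading entity, which a real converter need not do. The three table entries are taken from the Jyutping example earlier in this section.

```python
class ToyEntityWiseConverter(object):
    """Hypothetical stand-in for an entity-wise base class."""

    def convertEntities(self, readingEntities, fromReading, toReading):
        # Delegate each entity to convertBasicEntity(), one at a time.
        return [self.convertBasicEntity(entity, fromReading, toReading)
                for entity in readingEntities]

    def convertBasicEntity(self, entity, fromReading, toReading):
        raise NotImplementedError()


class ToyTableConverter(ToyEntityWiseConverter):
    """Subclass that only overrides convertBasicEntity()."""

    # Syllable pairs from the Jyutping to Cantonese Yale example.
    TABLE = {"gwong2": "gwóng", "jau1": "yāu", "waa2": "wá"}

    def convertBasicEntity(self, entity, fromReading, toReading):
        try:
            return self.TABLE[entity]
        except KeyError:
            # ValueError stands in for an invalid-entity error here.
            raise ValueError("invalid entity %r" % entity)


entities = ToyTableConverter().convertEntities(
    ["gwong2", "jau1", "waa2"], "Jyutping", "CantoneseYale")
```

The subclass contains no iteration logic of its own; it inherits that from the base class and supplies only the per-syllable mapping.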
class cjklib.reading.converter.DialectSupportReadingConverter(*args, **options)

Bases: cjklib.reading.converter.ReadingConverter

Defines an abstract ReadingConverter that supports non-standard reading representations (dialects) as input and output.

Input will be converted to a standard representation of the input reading before the actual conversion step is done. If needed, the converted reading will then be converted to a defined dialect.

Parameters:
  • args – optional list of ReadingOperators to use for handling source and target readings.
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector, if none is given, default settings will be assumed.
  • sourceOperators – list of ReadingOperators used for handling source readings.
  • targetOperators – list of ReadingOperators used for handling target readings.
DEFAULT_READING_OPTIONS

Defines the default reading options for the reading dialect used as a bridge in conversion between the user-specified representation and the target reading.

The most general reading dialect should be specified so as to allow for a broad range of input.

convertEntities(readingEntities, fromReading, toReading)

Converts a list of entities in the source reading to the given target reading.

Parameters:
  • readingEntities (list of str) – list of entities written in source reading
  • fromReading (str) – name of the source reading
  • toReading (str) – name of the target reading
Return type: list of str
Returns: list of entities written in target reading
Raises AmbiguousConversionError: if conversion for a specific entity of the source reading is ambiguous.
Raises ConversionError: on other operations specific to the conversion between the two readings (e.g. error on converting entities).
Raises UnsupportedError: if source or target reading is not supported for conversion.
Raises InvalidEntityError: if an invalid entity is given.

convertEntitySequence(entitySequence, fromReading, toReading)

Converts a list of reading entities in the standard representation given by DEFAULT_READING_OPTIONS, together with non-reading entities, from the source reading to the target reading.

The default implementation will raise a NotImplementedError.

Parameters:
  • entitySequence (list structure) – list of reading entities given as list and non-reading entities as single str objects
  • fromReading (str) – name of the source reading
  • toReading (str) – name of the target reading
Return type: list structure
Returns: list of converted reading entities given as list and non-reading entities as single str objects

class cjklib.reading.converter.RomanisationConverter(*args, **options)

Bases: cjklib.reading.converter.DialectSupportReadingConverter

Defines an abstract ReadingConverter between two or more romanisations.

Reading dialects can produce different entities, which have to be handled by the conversion process. This is realised by converting the given reading dialect to a default form, then converting to the default target reading, and finally converting to the specified target reading dialect. One conversion step thus involves three single conversion steps using a default form. This default form can be defined in DEFAULT_READING_OPTIONS.
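The three-step scheme can be sketched as a chain of three functions. All three are hypothetical toys: the source dialect that writes "v" for "ü", the two-entry mapping between default forms, and the superscript tone numbers of the target dialect are invented for illustration and do not describe any converter in cjklib.

```python
def to_default_source(entity):
    # Step 1: hypothetical source dialect writes "v" instead of "ü".
    return entity.replace("v", "ü")


def convert_default(entity):
    # Step 2: hypothetical core mapping between the two default forms.
    return {"nü3": "nyu3", "lü4": "lyu4"}[entity]


def to_target_dialect(entity):
    # Step 3: hypothetical target dialect uses superscript tone numbers.
    return entity.translate(str.maketrans("12345", "¹²³⁴⁵"))


def convert(entity):
    """Source dialect -> source default -> target default -> target dialect."""
    return to_target_dialect(convert_default(to_default_source(entity)))


result = convert("nv3")
```

Only the middle step needs to know how the two readings relate; the outer steps deal solely with dialect differences within a single reading.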

Letter case will be transferred between syllables; no special formatting according to any particular standard is guaranteed. Letter case is identified according to three classes: uppercase (all case-sensitive characters are uppercase), titlecase (all case-sensitive characters are lowercase except the first case-sensitive character), and lowercase (all case-sensitive characters are lowercase). For entities with a single Latin character, uppercase takes precedence over titlecase, e.g. E5 will convert to ÉH in Cantonese Yale, not to Éh. In general, letter case should be handled outside of cjklib if special formatting is required.
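The three letter-case classes and the precedence rule for single letters can be sketched as follows. The helper names `classify_case` and `transfer_case` are hypothetical and this is not cjklib's code; in particular, treating mixed case as unclassified is an assumption of the sketch.

```python
def classify_case(entity):
    """Classify an entity as 'uppercase', 'titlecase' or 'lowercase'."""
    # Only characters that have distinct upper and lower forms count.
    cased = [c for c in entity if c.lower() != c.upper()]
    if not cased:
        return None
    if all(c.isupper() for c in cased):
        # Checked first, so a single uppercase letter counts as uppercase.
        return "uppercase"
    if cased[0].isupper() and all(c.islower() for c in cased[1:]):
        return "titlecase"
    if all(c.islower() for c in cased):
        return "lowercase"
    return None  # mixed case is left unclassified in this sketch


def transfer_case(source, converted):
    """Apply the letter case of the source entity to the converted entity."""
    kind = classify_case(source)
    if kind == "uppercase":
        return converted.upper()
    if kind == "titlecase":
        return converted[:1].upper() + converted[1:]
    return converted
```

For the example from the text, `transfer_case("E5", "éh")` yields "ÉH" because the single letter E is classified as uppercase before titlecase is considered.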

The class itself can’t be used directly; it has to be subclassed and convertBasicEntity() has to be implemented so as to make the translation of a syllable from one romanisation to another possible.

Parameters:
  • args – optional list of ReadingOperators to use for handling source and target readings.
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector, if none is given, default settings will be assumed.
  • sourceOperators – list of ReadingOperators used for handling source readings.
  • targetOperators – list of ReadingOperators used for handling target readings.
convertBasicEntity(entity, fromReading, toReading)

Converts a basic entity (e.g. a syllable) in the source reading to the given target reading.

This method is called by convertEntities() and a lower case entity is given for conversion. The returned value should be in lower case characters too, as convertEntities() will take care of capitalisation.

If a single entity needs to be converted it is recommended to use convertEntities() instead. In the general case it can not be ensured that a mapping from one reading to another can be done by the simple conversion of a basic entity. One-to-many mappings are possible and there is no guarantee that any entity of a reading recognised by isReadingEntity() will be mapped here.

The default implementation will raise a NotImplementedError.

Parameters:
  • entity (str) – string written in the source reading in lower case letters
  • fromReading (str) – name of the source reading
  • toReading (str) – name of the target reading
Return type: str
Returns: the entity converted to the toReading in lower case
Raises AmbiguousConversionError: if conversion for this entity of the source reading is ambiguous.
Raises ConversionError: on other operations specific to the conversion of the entity.
Raises InvalidEntityError: if the entity is invalid.

convertEntitySequence(entitySequence, fromReading, toReading)
class cjklib.reading.converter.BridgeConverter(*args, **options)

Bases: cjklib.reading.converter.ReadingConverter

Provides a ReadingConverter that converts between readings over a third reading called bridge reading.

Parameters:
  • args – optional list of ReadingOperators to use for handling source and target readings.
  • options – extra options passed to the ReadingConverter instances
  • dbConnectInst – instance of a DatabaseConnector, if none is given, default settings will be assumed.
  • sourceOperators – list of ReadingOperators used for handling source readings.
  • targetOperators – list of ReadingOperators used for handling target readings.
CONVERSION_BRIDGE
List containing all conversion directions together with the bridge reading over which the conversion is made. Form: (fromReading, bridgeReading, toReading). As conversion may be lossy, it is important which conversion path is chosen.
convertEntities(readingEntities, fromReading, toReading)
classmethod getDefaultOptions()