cjklib.reading — Character reading based functions

Character reading based functions (transliterations, romanizations, ...).

This includes ReadingOperators used to handle basic operations like decomposing strings written in a reading into their basic entities (e.g. syllables) and for some languages getting tonal information, syllable onset and rhyme and other features. Furthermore it includes ReadingConverter classes which offer the conversion of strings from one reading to another.

All basic functionality can be accessed using the ReadingFactory which provides factory methods for creating instances of the supplied classes and also acts as a façade for the functions defined there.

Examples

The following examples should give a quick view into how to use this package.

  • Create the ReadingFactory object with default settings (read from cjklib.conf or using cjklib.db in module directory as default):

    >>> from cjklib.reading import ReadingFactory
    >>> f = ReadingFactory()
    
  • Create an operator for Mandarin romanisation Pinyin:

    >>> pinyinOp = f.createReadingOperator('Pinyin')
    
  • Construct a Pinyin syllable with second tone:

    >>> pinyinOp.getTonalEntity(u'han', 2)
    u'hán'
    
  • Segment the given Pinyin string into a list of syllables:

    >>> pinyinOp.decompose(u"tiān'ānmén")
    [u'tiān', u"'", u'ān', u'mén']
    
  • Do the same using the factory class as a façade to easily access functions provided by those classes in the background:

    >>> f.decompose(u"tiān'ānmén", 'Pinyin')
    [u'tiān', u"'", u'ān', u'mén']
    
  • Convert the given Gwoyeu Romatzyh syllables to their pronunciation in IPA:

    >>> f.convert('liow shu', 'GR', 'MandarinIPA')
    u'liəu˥˩ ʂu˥˥'
    

Readings

Han characters give only a few visual hints about how they are pronounced. The large number of homophones further increases the problem of deriving a character’s actual pronunciation from the given glyph. This module implements a framework and desirable functionality to deal with the characteristics of character readings.

From a programmatic point of view, readings in languages that use Chinese characters differ in many ways. Some use the Roman alphabet, some carry tonal information, some can be mapped character-wise, and some map one Chinese character to a sequence of characters in the target system while others map to only one character.

One major group of readings are romanisations, which are transcriptions into the Roman alphabet (or, analogously, into Cyrillic). Romanisations of tonal languages are a subgroup that asks for even more detailed functions. The interface implemented here tries to capture similar factors on different abstraction levels while maintaining flexibility.

In the context of this library the term reading refers to two things: the system used to express the pronunciation (e.g. a specific romanisation) on the one hand, and the specific reading of a given character on the other hand.

Technical implementation

While module cjklib.characterlookup includes the functions for mapping a character to its potential readings, module cjklib.reading specialises in all functionality that is primarily connected to the reading of characters.

The main functions implemented here provide ways of handling text written in a reading and converting between different readings.

Handling text written in a reading

Text written in a character reading differs from other text in that it consists of entities which map to corresponding Chinese characters. These entities can be obtained by breaking the whole string down into a sequence of single entities. This functionality is offered by all operators on readings through the interface ReadingOperator. The process of breaking input down (called decomposition) can be reversed by composing the single entities into a string.

Many ReadingOperators provide additional functions, each depending on the characteristics of the implemented reading. For readings of tonal languages, for example, they might allow querying the tone of the given reading of a character.
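The idea of decomposition and composition can be illustrated with a small, self-contained sketch. It is not cjklib's implementation: the helper names toy_decompose and toy_compose and the three-syllable inventory are assumptions made for this example, and only lower-case ASCII input is handled.

```python
import re

# Hypothetical, tiny syllable inventory; a real operator knows the full set.
SYLLABLES = {"ni", "hao", "ma"}


def toy_decompose(text):
    """Split text into known syllables and runs of other characters."""
    # Runs of lower-case letters are candidate entities; anything else
    # (spaces, punctuation) is passed through unchanged.
    parts = re.findall(r"[a-z]+|[^a-z]+", text)
    for part in parts:
        if part.isalpha() and part not in SYLLABLES:
            raise ValueError("cannot decompose %r" % part)
    return parts


def toy_compose(entities):
    """Reverse of toy_decompose: join the entities back into one string."""
    return "".join(entities)


entities = toy_decompose("ni hao ma?")
roundtrip = toy_compose(entities)
```

Here entities holds the syllables interleaved with the spaces and the question mark, and roundtrip equals the original input, which is the reversibility the paragraph above describes.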

Inheritance diagram of cjklib.reading.operator

Converting between readings

The second part provides support for conversion between different readings.

What all CJK languages seem to have in common is that the mapping from a character to its reading is not reversible, as these languages are rich in homophones. Thus the most information about a text is carried by the pair of characters and their reading (aside from the meaning).

If a text is written in reading A and is wanted in reading B instead, it is not feasible to derive the new reading from the corresponding characters, even if they are present, as many characters have several pronunciations. Instead, the reading itself is converted from A to B.

Simple means to convert between readings are provided by classes implementing ReadingConverter. This conversion might be neither surjective nor injective, and several exceptions can occur.
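What "neither surjective nor injective" means in practice can be shown with a toy table. The two readings "A" and "B", their entities, and the names ToyConversionError and toy_convert_entities are all made up for this sketch and do not correspond to real data or to cjklib's classes.

```python
# Toy mapping between two made-up readings "A" and "B" (not real data).
A_TO_B = {
    "ka": "ga",
    "kha": "ga",  # two source entities share one target: not injective
    "ta": "da",
}
# "ba" exists in the target reading but is never produced: not surjective.
B_ENTITIES = {"ga", "da", "ba"}


class ToyConversionError(Exception):
    """Raised when an entity has no counterpart in the target reading."""


def toy_convert_entities(entities):
    """Convert a list of entities of reading A to reading B."""
    converted = []
    for entity in entities:
        if entity not in A_TO_B:
            raise ToyConversionError("no mapping for %r" % entity)
        converted.append(A_TO_B[entity])
    return converted


result = toy_convert_entities(["ka", "kha", "ta"])
unreachable = B_ENTITIES - set(A_TO_B.values())
```

Because "ka" and "kha" collapse to the same target, converting back from B to A cannot recover the original text without extra information.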

Inheritance diagram of cjklib.reading.converter

Configurable reading dialect

Many readings come in several representations, even if standardised. Differences may be as simple as typesetting (e.g. punctuation) or may include special entities and derivatives.

Instead of selecting one default form as a global standard, cjklib lets the user choose the preferred dialect, while still trying to offer good default values. It does so by offering a wide range of options for handling and conversion of readings. These options can be passed in many places and are handed down by the system to the component that knows about the specific configuration option. Furthermore, each class implements a method that states which options it uses by default.

A special notion of dialect converters is used for ReadingConverter classes that convert between two different representations of the same reading. These allow flexible switching between reading dialects.
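The pattern of keyword options with per-class defaults might look roughly like the following sketch. The class ToyOperator, the method name get_default_options and the option names tone_marks and separator are invented for illustration; the options actually understood by each cjklib class are listed in that class's documentation.

```python
class ToyOperator(object):
    """Hypothetical operator that accepts dialect options as keywords."""

    @classmethod
    def get_default_options(cls):
        # Each class states which options it understands and their defaults.
        return {"tone_marks": "diacritics", "separator": "'"}

    def __init__(self, **options):
        defaults = self.get_default_options()
        unknown = set(options) - set(defaults)
        if unknown:
            raise ValueError(
                "unknown options: %s" % ", ".join(sorted(unknown)))
        # User-supplied values override the defaults.
        defaults.update(options)
        self.options = defaults


default_op = ToyOperator()
numbered_op = ToyOperator(tone_marks="numbers")
```

Two instances configured differently then represent two dialects of the same reading, which is the situation a dialect converter bridges.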

Limitations of reading conversion

While reading conversion allows for flexible handling of any reading, there are corner cases and limitations that arise from the differences in the readings’ designs. The following list names limitations for some conversions; it is not meant to be exhaustive. The best way to be sure about what can and cannot be mapped is to actually try it out. Missing mappings for some syllables are not listed here.

  • Jyutping to Cantonese Yale: Jyutping was designed for Cantonese as spoken in Hong Kong. While the high falling tone is lost there, it still exists in the area of Guangzhou. The first tone of Jyutping will either map to the high level tone (default) or the high falling tone.
  • Pinyin to Wade-Giles: Wade-Giles distinguishes between finals o and ê while Pinyin only writes e (ê for the syllable itself). A mapping is thus ambiguous.
  • GR to Pinyin: GR transcribes Erhua sound such that the etymological syllable gets lost. A mapping to Pinyin is thus ambiguous.
  • Pinyin to GR: GR transcribes the etymological tone for a fifth tone, while Pinyin does not. A mapping cannot fill in the missing information.
  • IPA: IPA for Mandarin and Cantonese needs to transcribe tonal changes and other co-articulation features, which most of the romanisations don’t cover. A mapping is often either done as approximation, or is not possible at all.

Classes

class cjklib.reading.ReadingFactory(databaseUrl=None, dbConnectInst=None)

Provides an abstract factory for creating ReadingOperators and ReadingConverters and a façade to directly access the methods offered by these classes.

Instances of other classes are cached in the background and reused on later calls for methods accessed through the façade. createReadingOperator() and createReadingConverter() can be used to create new instances for use outside of the ReadingFactory.
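The difference between cached instances used by the façade and fresh instances handed out to the caller can be sketched as follows. This is a simplified model under stated assumptions, not the library's code: ToyFactory, ToyOperator and all method names are hypothetical, and option values are assumed to be hashable so they can form part of the cache key.

```python
class ToyOperator(object):
    """Stand-in for an operator configured for one reading."""

    def __init__(self, reading, **options):
        self.reading = reading
        self.options = options


class ToyFactory(object):
    """Creates operators and caches the ones it uses internally."""

    def __init__(self):
        self._cache = {}

    def create_operator(self, reading, **options):
        # Always build a fresh instance for use outside the factory.
        return ToyOperator(reading, **options)

    def _get_cached_operator(self, reading, **options):
        # Reuse one instance per reading and option set.
        key = (reading, frozenset(options.items()))
        if key not in self._cache:
            self._cache[key] = self.create_operator(reading, **options)
        return self._cache[key]

    def describe(self, reading, **options):
        # Facade-style method that delegates to a cached operator.
        op = self._get_cached_operator(reading, **options)
        return "%s %r" % (op.reading, sorted(op.options.items()))

    def clear_cache(self):
        self._cache.clear()


factory = ToyFactory()
```

Sharing the cached instance saves memory, at the price that a change made to it is visible to every caller that goes through the façade.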

Todo

  • Impl: What about hiding of inner classes? _checkSpecialOperators() method is called for internal converters and for external ones delivered by createReadingConverter(). The latter method doesn’t return internal cached copies though, but creates new instances. ReadingOperator also gets copies from ReadingFactory objects for internal instances. Sharing saves memory but changing one object will affect all other objects using this instance.
  • Impl: General reading options given for a converter with **options need to be used when creating an operator. How to raise errors to save the user from specifying an operator twice, once per options and once per concrete instance (similar to sourceOptions and targetOptions)?

Initialises the ReadingFactory.

If no parameters are given default values are assumed for the connection to the database. The database connection parameters can be given in databaseUrl, or an instance of DatabaseConnector can be passed in dbConnectInst, the latter one being preferred if both are specified.

Parameters:
  • databaseUrl (str) – database connection setting in the format driver://user:pass@host/database.
  • dbConnectInst (instance) – instance of a DatabaseConnector
class SimpleReadingConverterAdaptor(converterInst, fromReading, toReading)

Defines a simple converter between two character readings that keeps the real converter doing the work in the background.

The basic method is convert() which converts one input string from one reading to another. In contrast to a ReadingConverter no source or target reading needs to be specified.

Creates an instance of the SimpleReadingConverterAdaptor.

Parameters:
  • converterInst (instance) – ReadingConverter instance doing the actual conversion work.
  • fromReading (str) – name of reading converted from
  • toReading (str) – name of reading converted to
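The adaptor idea, remembering a default direction and forwarding everything else to the wrapped converter, can be sketched like this. ToyConverter, ToyAdaptor, the readings "A" and "B" and the snake_case method names are assumptions for the example only.

```python
class ToyConverter(object):
    """Stand-in for a converter that supports several directions."""

    TABLES = {
        ("A", "B"): {"ka": "ga"},
        ("B", "A"): {"ga": "ka"},
    }

    def convert_entities(self, entities, from_reading, to_reading):
        table = self.TABLES[(from_reading, to_reading)]
        return [table[entity] for entity in entities]


class ToyAdaptor(object):
    """Remembers a default direction and forwards everything else."""

    def __init__(self, converter, from_reading, to_reading):
        self._converter = converter
        self._from = from_reading
        self._to = to_reading

    def convert_entities(self, entities, from_reading=None, to_reading=None):
        # Fall back to the defaults given at construction time.
        return self._converter.convert_entities(
            entities, from_reading or self._from, to_reading or self._to)

    def __getattr__(self, name):
        # Only called for attributes not found on the adaptor itself.
        return getattr(self._converter, name)


adaptor = ToyAdaptor(ToyConverter(), "A", "B")
forward = adaptor.convert_entities(["ka"])
backward = adaptor.convert_entities(["ga"], "B", "A")
```

The caller can omit the direction in the common case and still override it when needed.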
convert(string, fromReading=None, toReading=None)

Converts a string in the source reading to the target reading.

If parameters fromReading or toReading are not given the class’s default values will be applied.

Parameters:
  • string (str) – string written in the source reading
  • fromReading (str) – name of the source reading
  • toReading (str) – name of the target reading
Return type:

str

Returns:

the input string converted to the toReading

Raises DecompositionError:
 

if the string can not be decomposed into basic entities with regards to the source reading.

Raises CompositionError:
 

if the target reading’s entities can not be composed.

Raises ConversionError:
 

on operations specific to the conversion between the two readings (e.g. error on converting entities).

Raises UnsupportedError:
 

if source or target reading is not supported for conversion.

convertEntities(readingEntities, fromReading=None, toReading=None)

Converts a list of entities in the source reading to the target reading.

If parameters fromReading or toReading are not given the class’s default values will be applied.

Parameters:
  • readingEntities (list of str) – list of entities written in source reading
  • fromReading (str) – name of the source reading
  • toReading (str) – name of the target reading
Return type:

list of str

Returns:

list of entities written in target reading

Raises ConversionError:
 

on operations specific to the conversion between the two readings (e.g. error on converting entities).

Raises UnsupportedError:
 

if source or target reading is not supported for conversion.

Raises InvalidEntityError:
 

if an invalid entity is given.

ReadingFactory.clearCache()
Clears cached classes for the current database.
ReadingFactory.compose(readingEntities, readingN, **options)

Composes the given list of basic entities to a string for the given reading.

Composing entities can raise a CompositionError if a non-reading entity is about to be joined with a reading entity and the result would be a string that is impossible to decompose.

Parameters:
  • readingEntities (list of str) – list of basic syllables or other content
  • readingN (str) – name of reading
  • options – additional options for handling the input
Return type:

str

Returns:

composed entities

Raises CompositionError:
 

if the given entities can not be composed.

Raises UnsupportedError:
 

if the given reading is not supported.

ReadingFactory.convert(readingStr, fromReading, toReading, *args, **options)

Converts the given string in the source reading to the given target reading.

Parameters:
  • readingStr (str) – string that needs to be converted
  • fromReading (str) – name of the source reading
  • toReading (str) – name of the target reading
  • args – optional list of ReadingOperators to use for handling source and target readings.
  • options – additional options for handling the input
  • sourceOperators – list of ReadingOperators used for handling source readings.
  • targetOperators – list of ReadingOperators used for handling target readings.
  • sourceOptions – dictionary of options to configure the ReadingOperators used for handling source readings. If an operator for the source reading is explicitly specified, no options can be given.
  • targetOptions – dictionary of options to configure the ReadingOperators used for handling target readings. If an operator for the target reading is explicitly specified, no options can be given.
Return type:

str

Returns:

the converted string

Raises DecompositionError:
 

if the string can not be decomposed into basic entities with regards to the source reading or the given information is insufficient.

Raises CompositionError:
 

if the target reading’s entities can not be composed.

Raises ConversionError:
 

on operations specific to the conversion between the two readings (e.g. error on converting entities).

Raises UnsupportedError:
 

if source or target reading is not supported for conversion.

ReadingFactory.convertEntities(readingEntities, fromReading, toReading, *args, **options)

Converts a list of entities in the source reading to the given target reading.

Parameters:
  • readingEntities (list of str) – list of entities written in source reading
  • fromReading (str) – name of the source reading
  • toReading (str) – name of the target reading
  • args – optional list of ReadingOperators to use for handling source and target readings.
  • options – additional options for handling the input
  • sourceOperators – list of ReadingOperators used for handling source readings.
  • targetOperators – list of ReadingOperators used for handling target readings.
  • sourceOptions – dictionary of options to configure the ReadingOperators used for handling source readings. If an operator for the source reading is explicitly specified, no options can be given.
  • targetOptions – dictionary of options to configure the ReadingOperators used for handling target readings. If an operator for the target reading is explicitly specified, no options can be given.
Return type:

list of str

Returns:

list of entities written in target reading

Raises ConversionError:
 

on operations specific to the conversion between the two readings (e.g. error on converting entities).

Raises UnsupportedError:
 

if source or target reading is not supported for conversion.

Raises InvalidEntityError:
 

if an invalid entity is given.

ReadingFactory.createReadingConverter(fromReading, toReading, *args, **options)

Creates an instance of a ReadingConverter for the given source and target reading and returns it wrapped as a SimpleReadingConverterAdaptor.

As ReadingConverters generally support more than one conversion direction, the user needs to specify on a regular instance which source and target reading is needed. Wrapping the created instance in the adaptor gives simple convert() and convertEntities() routines, so that the source and target readings don’t have to be specified on conversion. Other method signatures remain unchanged.

Parameters:
  • fromReading (str) – name of the source reading
  • toReading (str) – name of the target reading
  • args – optional list of ReadingOperators to use for handling source and target readings.
  • options – options for the created instance
  • hideComplexConverter – if true the ReadingConverter is wrapped as a SimpleReadingConverterAdaptor (default).
  • sourceOperators – list of ReadingOperators used for handling source readings.
  • targetOperators – list of ReadingOperators used for handling target readings.
  • sourceOptions – dictionary of options to configure the ReadingOperators used for handling source readings. If an operator for the source reading is explicitly specified, no options can be given.
  • targetOptions – dictionary of options to configure the ReadingOperators used for handling target readings. If an operator for the target reading is explicitly specified, no options can be given.
Return type:

instance

Returns:

a SimpleReadingConverterAdaptor or ReadingConverter instance

Raises UnsupportedError:
 

if conversion for the given readings is not supported.

ReadingFactory.createReadingOperator(readingN, **options)

Creates an instance of a ReadingOperator for the given reading.

Parameters:
  • readingN (str) – name of a supported reading
  • options – options for the created instance
Return type:

instance

Returns:

a ReadingOperator instance

Raises UnsupportedError:
 

if the given reading is not supported.

ReadingFactory.decompose(string, readingN, **options)

Decomposes the given string into basic entities that can be mapped to one Chinese character each for the given reading.

The given input string can contain other non reading characters, e.g. punctuation marks.

The returned list contains a mix of basic reading entities and other characters e.g. spaces and punctuation marks.

Parameters:
  • string (str) – reading string
  • readingN (str) – name of reading
  • options – additional options for handling the input
Return type:

list of str

Returns:

a list of basic entities of the input string

Raises DecompositionError:
 

if the string can not be decomposed.

Raises UnsupportedError:
 

if the given reading is not supported.

ReadingFactory.getDecompositions(string, readingN, **options)

Decomposes the given string into basic entities that can be mapped to one Chinese character each, taking ambiguous decompositions into account. It returns all possible decompositions. This method is a more general version of decompose().

The returned list construction consists of two entity types: entities of the romanisation and other strings.

Parameters:
  • string (str) – reading string
  • readingN (str) – name of reading
  • options – additional options for handling the input
Return type:

list of list of str

Returns:

a list of all possible decompositions consisting of basic entities.

Raises DecompositionError:
 

if the given string has a wrong format.

Raises UnsupportedError:
 

if the given reading is not supported or the reading doesn’t support the specified method.

ReadingFactory.getDefaultOptions(*args)

Returns the default options for the ReadingOperator or ReadingConverter applied for the given reading name or names respectively.

The keyword ‘dbConnectInst’ is not regarded a configuration option and is thus not included in the dict returned.

Raises ValueError:
 if the number of reading names given is neither one nor two.
Raises UnsupportedError:
 if no ReadingOperator or ReadingConverter exists for the given reading or readings respectively.
ReadingFactory.getFormattingEntities(readingN, **options)

Gets a set of entities used by the reading to format reading entities.

Parameters:
  • readingN (str) – name of reading
  • options – additional options for handling the input
Return type:

set of str

Returns:

set of supported formatting entities

Raises UnsupportedError:
 

if the given reading is not supported or the reading doesn’t support the specified method.

ReadingFactory.getPlainReadingEntities(readingN, **options)

Gets the set of plain entities supported by this reading. In contrast to getReadingEntities(), the entities carry no tone mark.

Parameters:
  • readingN (str) – name of reading
  • options – additional options for handling the input
Return type:

set of str

Returns:

set of supported syllables

Raises UnsupportedError:
 

if the given reading is not supported or the reading doesn’t support the specified method.

ReadingFactory.getReadingConverterClass(fromReading, toReading)

Gets the ReadingConverter’s class for the given source and target reading.

Parameters:
  • fromReading (str) – name of the source reading
  • toReading (str) – name of the target reading
Return type:

classobj

Returns:

a ReadingConverter class

Raises UnsupportedError:
 

if conversion for the given readings is not supported.

static ReadingFactory.getReadingConverterClasses()

Gets all classes implementing ReadingConverter from module cjklib.reading.converter.

Return type: list
Returns: list of all classes inheriting from ReadingConverter
ReadingFactory.getReadingEntities(readingN, **options)

Gets a set of all entities supported by the reading.

The set is used in the segmentation process to find entity boundaries.

Parameters:
  • readingN (str) – name of reading
  • options – additional options for handling the input
Return type:

set of str

Returns:

set of supported reading entities

Raises UnsupportedError:
 

if the given reading is not supported or the reading doesn’t support the specified method.

ReadingFactory.getReadingOperatorClass(readingN)

Gets the ReadingOperator’s class for the given reading.

Parameter: readingN (str) – name of a supported reading
Return type: classobj
Returns: a ReadingOperator class
Raises UnsupportedError:
 if the given reading is not supported.
static ReadingFactory.getReadingOperatorClasses()

Gets all classes implementing ReadingOperator from module cjklib.reading.operator.

Return type: list
Returns: list of all classes inheriting from ReadingOperator
ReadingFactory.getSupportedReadings()

Gets a list of all supported readings.

Return type: list of str
Returns: a list of readings a ReadingOperator is available for
ReadingFactory.getTonalEntity(plainEntity, tone, readingN, **options)

Gets the entity with tone mark for the given plain entity and tone. The letter case of the given plain entity might not be fully conserved for mixed case strings.

Parameters:
  • plainEntity (str) – entity without tonal information
  • tone – tone
  • readingN (str) – name of reading
  • options – additional options for handling the input
Return type:

str

Returns:

entity with appropriate tone

Raises InvalidEntityError:
 

if the entity is invalid.

Raises UnsupportedError:
 

if the given reading is not supported or the reading doesn’t support the specified method.
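For Pinyin with diacritics, attaching a tone to a plain entity comes down to choosing the vowel that carries the mark. The following sketch uses the commonly taught placement rules (a or e if present, otherwise the o of "ou", otherwise the last vowel); it is an illustration only, the name toy_tonal_entity is hypothetical, and cjklib's own rules and options may differ.

```python
import unicodedata

# Combining diacritics for tones 1-4; the fifth (neutral) tone is unmarked.
TONE_DIACRITICS = {1: "\u0304", 2: "\u0301", 3: "\u030c", 4: "\u0300", 5: ""}
VOWELS = "aeiou\u00fc"


def toy_tonal_entity(plain, tone):
    """Place a Pinyin tone mark on the plain syllable."""
    if tone not in TONE_DIACRITICS:
        raise ValueError("unknown tone %r" % (tone,))
    lowered = plain.lower()
    if "a" in lowered:
        index = lowered.index("a")
    elif "e" in lowered:
        index = lowered.index("e")
    elif "ou" in lowered:
        index = lowered.index("o")
    else:
        positions = [i for i, ch in enumerate(lowered) if ch in VOWELS]
        if not positions:
            raise ValueError("no vowel in %r" % plain)
        index = positions[-1]
    # Insert the combining mark directly after the chosen vowel.
    marked = plain[:index + 1] + TONE_DIACRITICS[tone] + plain[index + 1:]
    # NFC turns letter + combining mark into one precomposed character.
    return unicodedata.normalize("NFC", marked)


second_tone = toy_tonal_entity("han", 2)
```

With these rules second_tone is the same string as in the introductory example, 'hán'.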

ReadingFactory.getTones(readingN, **options)

Returns the list of tones supported by the reading.

Parameters:
  • readingN (str) – name of reading
  • options – additional options for handling the input
Return type:

list

Returns:

list of supported tone marks.

Raises UnsupportedError:
 

if the given reading is not supported or the reading doesn’t support the specified method.

ReadingFactory.isFormattingEntity(entity, readingN, **options)

Returns True if the given entity is a valid formatting entity recognised by the reading operator.

Parameters:
  • entity (str) – entity to check
  • readingN (str) – name of reading
  • options – additional options for handling the input
Return type:

bool

Returns:

True if string is a formatting entity of the reading.

Raises UnsupportedError:
 

if the given reading is not supported.

ReadingFactory.isPlainReadingEntity(entity, readingN, **options)

Returns True if the given plain entity (without any tone mark) is recognised by the romanisation operator, i.e. it is a valid entity of the reading returned by the segmentation method.

Reading entities will be handled as being case insensitive.

Parameters:
  • entity (str) – entity to check
  • readingN (str) – name of reading
  • options – additional options for handling the input
Return type:

bool

Returns:

True if string is an entity of the reading, False otherwise.

Raises UnsupportedError:
 

if the given reading is not supported or the reading doesn’t support the specified method.

ReadingFactory.isReadingConversionSupported(fromReading, toReading)

Checks if the conversion from reading A to reading B is supported.

Return type: bool
Returns: True if conversion is supported, False otherwise
ReadingFactory.isReadingEntity(entity, readingN, **options)

Returns True if the given entity is a valid reading entity recognised by the reading operator, i.e. it will be returned by decompose().

Parameters:
  • entity (str) – entity to check
  • readingN (str) – name of reading
  • options – additional options for handling the input
Return type:

bool

Returns:

True if string is an entity of the reading, False otherwise.

Raises UnsupportedError:
 

if the given reading is not supported.

ReadingFactory.isReadingOperationSupported(operation, readingN, **options)

Returns True if the given method is supported by the reading.

Parameters:
  • operation (str) – name of method
  • readingN (str) – name of reading
  • options – additional options for handling the input
Return type:

bool

Returns:

True if method is supported, False otherwise.

Raises ValueError:
 

if the given method is not covered.

ReadingFactory.isStrictDecomposition(decomposition, readingN, **options)

Checks if the given decomposition follows the romanisation format strictly to allow unambiguous decomposition.

The romanisation should offer a way/protocol to make an unambiguous decomposition into its basic syllables possible, so that the process of appending syllables to a string is reversible. Testing for compliance with this protocol has to be implemented here. Thus, for any string, this method can return True for only one of its possible decompositions.

Parameters:
  • decomposition (list of str) – decomposed reading string
  • readingN (str) – name of reading
  • options – additional options for handling the input
Return type:

bool

Returns:

False, as this method needs to be implemented by the subclass

Raises UnsupportedError:
 

if the given reading is not supported or the reading doesn’t support the specified method.

ReadingFactory.publishReadingConverter(readingConverter)

Publishes a ReadingConverter to the list and thus makes it available for other methods in the library.

Parameter: readingConverter (classobj) – a new ReadingConverter to be published
ReadingFactory.publishReadingOperator(readingOperator)

Publishes a ReadingOperator to the list and thus makes it available for other methods in the library.

Parameter: readingOperator (classobj) – a new ReadingOperator to be published
ReadingFactory.segment(string, readingN, **options)

Takes a string written in the romanisation and returns the possible segmentations as a list of syllables.

In contrast to decompose(), this method merely segments continuous entities of the romanisation. Characters not part of the romanisation will not be dealt with; this is the task of the more general decompose() method.

Parameters:
  • string (str) – reading string
  • readingN (str) – name of reading
  • options – additional options for handling the input
Return type:

list of list of str

Returns:

a list of possible segmentations (several if ambiguous) into single syllables

Raises DecompositionError:
 

if the given string has an invalid format.

Raises UnsupportedError:
 

if the given reading is not supported or the reading doesn’t support the specified method.
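Why several segmentations can be returned is easiest to see with a small recursive sketch that tries every known syllable as a prefix. The function all_segmentations and the hand-picked subset of toneless Pinyin syllables are assumptions for this example; it does not reflect how cjklib implements segmentation, and it makes no attempt to be efficient.

```python
def all_segmentations(text, syllables):
    """Return every way to split text into entries of syllables."""
    if not text:
        # Exactly one way to segment the empty string: no syllables.
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in syllables:
            # Keep this prefix and segment the remainder recursively.
            for rest in all_segmentations(text[end:], syllables):
                results.append([head] + rest)
    return results


# A small, hand-picked subset of Pinyin syllables without tone marks.
SYLLABLES = {"xi", "an", "xian", "fang", "fan", "gan"}

xian = all_segmentations("xian", SYLLABLES)
fangan = all_segmentations("fangan", SYLLABLES)
```

Both inputs are ambiguous: xian contains the two-syllable and the one-syllable reading, and fangan contains both "fan" + "gan" and "fang" + "an". An empty result list would mean the string cannot be segmented with the given inventory.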

ReadingFactory.splitEntityTone(entity, readingN, **options)

Splits the entity into an entity without tone mark (plain entity) and the entity’s tone. The letter case of the given entity might not be fully conserved for mixed case strings.

Parameters:
  • entity (str) – entity with tonal information
  • readingN (str) – name of reading
  • options – additional options for handling the input
Return type:

tuple

Returns:

plain entity without tone mark and entity’s tone

Raises InvalidEntityError:
 

if the entity is invalid.

Raises UnsupportedError:
 

if the given reading is not supported or the reading doesn’t support the specified method.
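For Pinyin written with diacritics, splitting off the tone can be approximated with Unicode normalisation: decompose the syllable, remove the combining tone mark, and recompose. This is a sketch under assumptions, not the library's algorithm: the name toy_split_entity_tone is hypothetical, and a syllable without any tone mark is simply assumed to carry the fifth tone.

```python
import unicodedata

# Combining diacritics used as Pinyin tone marks, mapped to tone numbers.
TONE_BY_DIACRITIC = {"\u0304": 1, "\u0301": 2, "\u030c": 3, "\u0300": 4}


def toy_split_entity_tone(entity):
    """Return (plain entity, tone) for a Pinyin syllable with diacritics."""
    decomposed = unicodedata.normalize("NFD", entity)
    tone = 5  # assumption: no tone mark means the neutral tone
    kept = []
    for ch in decomposed:
        if ch in TONE_BY_DIACRITIC:
            tone = TONE_BY_DIACRITIC[ch]
        else:
            kept.append(ch)
    # Recompose so that u + diaeresis becomes a single character again.
    return unicodedata.normalize("NFC", "".join(kept)), tone


plain, tone = toy_split_entity_tone("h\u00e1n")
```

Here plain is 'han' and tone is 2, the inverse of the getTonalEntity() example at the top of this page.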