Operations on character readings.
A ReadingOperator supports basic operations on strings written in a character reading, such as decompose(), compose() and isReadingEntity().
Child classes add further reading-specific methods.
In addition to decompose(), provided by the class ReadingOperator, a RomanisationOperator offers a method getDecompositions() that returns all possible decompositions in an ambiguous case. Also, as romanisations have a fixed set of entities, a method getReadingEntities() offers access to the set of all accepted reading entities.
Transcriptions into the Latin (or Cyrillic) alphabet create the problem that syllable boundaries, or boundaries of entities belonging to single Chinese characters, are no longer clear once entities are written together.
It is therefore important to have methods at hand that separate strings and split them into single entities. This cannot always be done unambiguously, as several different decompositions might be possible, which leads to the general case of ambiguous decompositions.
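The ambiguity can be illustrated with a minimal sketch in plain Python 3 that does not use cjklib. The function name all_segmentations and the tiny syllable set are assumptions made for illustration only; they are not part of the library.

```python
def all_segmentations(text, entities):
    """Return every way to split text into entities from the given set."""
    if not text:
        # An empty string has exactly one segmentation: the empty one.
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in entities:
            # Combine this prefix with every segmentation of the remainder.
            for tail in all_segmentations(text[end:], entities):
                results.append([head] + tail)
    return results


# Hypothetical, tiny subset of toneless Pinyin syllables.
SYLLABLES = {"xi", "an", "xian", "fan", "gan", "fang"}

print(all_segmentations("xian", SYLLABLES))
print(all_segmentations("fangan", SYLLABLES))
```

Both strings yield two candidate decompositions, which is the situation that a romanisation's separator rules are meant to resolve.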
Many romanisations provide a way to tackle this problem. Pinyin, for example, requires an apostrophe (') where splitting the string back into syllables would otherwise be ambiguous. The Wade-Giles romanisation in its strict implementation asks for a hyphen between all syllables. The LSHK’s Jyutping, when written with tone marks, is always clearly decomposable, as the digits mark syllable borders.
The method isStrictDecomposition() can be implemented to check if one possible decomposition is the strict decomposition offered by the romanisation’s protocol. This method should guarantee that under all circumstances only one decomposed version will be regarded as strict.
If no strict version is yielded and different decompositions exist, an unambiguous decomposition cannot be made. All possible decompositions can be accessed through the method getDecompositions(), even in cases where a strict decomposition exists.
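As a rough illustration of such a strictness check, the following sketch applies a simplified form of the Pinyin apostrophe rule: a syllable starting with a, e or o must not directly follow another syllable. The function name, the syllable set, and the assumption that the apostrophe appears as its own list item are all hypothetical and not taken from cjklib.

```python
# Hypothetical, tiny subset of toneless Pinyin syllables.
SYLLABLES = {"xi", "an", "xian", "fan", "gan", "fang"}
VOWEL_INITIALS = ("a", "e", "o")


def is_strict_decomposition(decomposition, syllables=SYLLABLES):
    """Return True if no vowel-initial syllable directly follows a syllable."""
    previous_was_syllable = False
    for entity in decomposition:
        lowered = entity.lower()
        is_syllable = lowered in syllables
        if (is_syllable and previous_was_syllable
                and lowered.startswith(VOWEL_INITIALS)):
            # A separator such as the apostrophe would be required here.
            return False
        previous_was_syllable = is_syllable
    return True


print(is_strict_decomposition(["xian"]))           # single syllable
print(is_strict_decomposition(["xi", "an"]))       # missing apostrophe
print(is_strict_decomposition(["xi", "'", "an"]))  # apostrophe present
```

Of the two candidates for "xian", only the single-syllable reading passes, so at most one decomposition is regarded as strict.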
Romanisations differ from other readings in that their entities can be written in upper case, lower case, or a mix of both. By default operators recognise both; this behaviour can be changed with the option 'case', which can be set to 'lower'. Upper case alone is not explicitly supported. If such a writing is needed, it can be emulated by choosing lower case and converting strings to and from the operator manually. The method getReadingEntities() returns lower case entities by default.
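The effect of the case setting, and the manual workaround for upper case, might be sketched as follows. This is plain Python 3 and not cjklib code; the entity set, the function names, and the value "both" for the default setting are assumptions for illustration.

```python
# Hypothetical lower-case entity set of some romanisation.
ENTITIES = {"han", "yu"}


def is_reading_entity(entity, case="both"):
    """Check an entity, honouring a 'case' setting of 'both' or 'lower'."""
    if case == "lower":
        # Only the lower-case form is accepted.
        return entity in ENTITIES
    # Upper, lower and mixed case are all accepted.
    return entity.lower() in ENTITIES


def is_upper_case_entity(entity):
    """Emulate upper case by converting manually around a 'lower' check."""
    return entity.isupper() and is_reading_entity(entity.lower(), case="lower")


print(is_reading_entity("Han"))                # accepted with case='both'
print(is_reading_entity("Han", case="lower"))  # rejected with case='lower'
print(is_upper_case_entity("HAN"))             # manual upper-case handling
```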
Tonal readings are supported with class TonalFixedEntityOperator. It provides two methods getTonalEntity() and splitEntityTone() to cope with tonal information in text.
Operators are free to handle tones according to their needs. No data type constraint is given so that some will handle tones as integers, while others will handle strings. Even the count of tones between different operators for the same language may vary as one system might be more specific about tonal features.
While some operators have a fixed set of accepted entities, the more specific subgroup for tonal languages has a set of basic entities, here called plain entities, which can be annotated with tonal information to yield regular reading entities. Some plain entities might themselves be valid reading entities, while others might not. There is no requirement that the cross product of the set of plain entities and the set of tones fully spans the set of reading entities.
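The relation between plain entities, tones and reading entities can be sketched for a reading that marks tones with a trailing digit. The data and function names below are hypothetical, and ValueError stands in for the library's own exception types; this is an illustration, not the implementation used by any cjklib operator.

```python
# Hypothetical data for a reading that writes tones as trailing digits.
PLAIN_ENTITIES = {"si", "m"}
TONES = [1, 2, 3, 4, 5, 6]
# The valid reading entities need not equal the full cross product.
READING_ENTITIES = {"si1", "si2", "m4"}


def get_tonal_entity(plain_entity, tone):
    """Annotate a plain entity with a tone."""
    if plain_entity not in PLAIN_ENTITIES or tone not in TONES:
        raise ValueError("invalid plain entity or tone")
    return "%s%d" % (plain_entity, tone)


def split_entity_tone(entity):
    """Split an entity into its plain entity and its tone."""
    plain_entity, tone = entity[:-1], entity[-1:]
    if (plain_entity not in PLAIN_ENTITIES
            or len(tone) != 1 or tone not in "123456"):
        raise ValueError("invalid entity: %r" % entity)
    return plain_entity, int(tone)


print(split_entity_tone("si2"))
print(get_tonal_entity("m", 4))
# "m1" can be formed, yet it is not among the reading entities above.
print(get_tonal_entity("m", 1) in READING_ENTITIES)
```

Here the tones are integers; another operator could equally use strings, as no data type is prescribed.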
Decompose a reading string in Gwoyeu Romatzyh into single entities:
>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> f.decompose('"Hannshyue" .de mingcheng duey Jonggwo [...]', 'GR')
['"', 'Hann', 'shyue', '" ', '.de', ' ', 'ming', 'cheng', ' ', 'duey', ' ', 'Jong', 'gwo', ' [...]']
The same can be done by directly using the operator’s instance:
>>> from cjklib.reading import operator
>>> cy = operator.CantoneseYaleOperator()
>>> cy.decompose(u'gwóngjàuwá')
[u'gwóng', u'jàu', u'wá']
Composing will reverse the process, using a Pinyin string:
>>> f.compose([u'xī', u'ān'], 'Pinyin')
u"xī'ān"
For more complex operators, see PinyinOperator or MandarinIPAOperator.
Defines an abstract operator on text written in a character reading.
Parameters: |
|
---|
Composes the given list of basic entities to a string.
Composing entities can raise a CompositionError if a non-reading entity is about to be joined with a reading entity and will result in a string that is impossible to decompose.
The base class’ implementation will raise a NotImplementedError.
Parameter: | readingEntities (list of str) – list of basic entities or other content |
---|---|
Return type: | str |
Returns: | composed entities |
Raises CompositionError: | if the given entities cannot be composed. |
Decomposes the given string into basic entities that can be mapped to one Chinese character each (exceptions possible).
The given input string can contain other non-reading characters, e.g. punctuation marks.
The returned list contains a mix of basic reading entities and other characters e.g. spaces and punctuation marks.
The base class’ implementation will raise a NotImplementedError.
Parameter: | readingString (str) – reading string |
---|---|
Return type: | list of str |
Returns: | a list of basic entities of the input string |
Raises DecompositionError: | if the string cannot be decomposed. |
Returns the reading operator’s default options.
The base class’ implementation returns an empty dictionary. The keyword ‘dbConnectInst’ is not regarded as a configuration option of the operator and is thus not included in the returned dict.
Return type: | dict |
---|---|
Returns: | the reading operator’s default options. |
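How default options and user-supplied options might interact can be sketched as follows. The classes below are hypothetical stand-ins written in plain Python 3, not cjklib code. The option names 'case' and 'strictSegmentation' appear in this documentation, but their default values and the rejection of unknown options are assumptions.

```python
class SketchOperator:
    """Hypothetical stand-in for an operator base class."""

    @classmethod
    def get_default_options(cls):
        # The base class has no options of its own.
        return {}

    def __init__(self, **options):
        # The database connection is kept apart from the options.
        self.db_connect_inst = options.pop("dbConnectInst", None)
        defaults = self.get_default_options()
        unknown = set(options) - set(defaults)
        if unknown:
            raise ValueError("unknown options: %s" % sorted(unknown))
        # Explicitly given settings override the defaults.
        self.options = dict(defaults, **options)


class SketchRomanisationOperator(SketchOperator):
    """Hypothetical child class that adds two options."""

    @classmethod
    def get_default_options(cls):
        options = super().get_default_options()
        # Assumed default values, chosen for illustration.
        options.update({"case": "both", "strictSegmentation": False})
        return options


op = SketchRomanisationOperator(case="lower", dbConnectInst=object())
print(op.options)
```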
Returns True if the given entity is a valid formatting entity recognised by the reading operator.
The base class’ implementation will always return False.
Parameter: | entity (str) – entity to check |
---|---|
Return type: | bool |
Returns: | True if string is a formatting entity of the reading. |
Returns True if the given entity is a valid reading entity recognised by the reading operator, i.e. it will be returned by decompose().
The base class’ implementation will raise a NotImplementedError.
Parameter: | entity (str) – entity to check |
---|---|
Return type: | bool |
Returns: | True if string is an entity of the reading, False otherwise. |
Bases: cjklib.reading.operator.ReadingOperator
Defines an abstract ReadingOperator on text written in a romanisation, i.e. text written in the Latin alphabet or written in the Cyrillic alphabet.
Todo
Parameters: |
|
---|
Decomposes the given string into basic entities on a one-to-one mapping level to Chinese characters. Decomposing can be ambiguous, and two assumptions are made to solve this problem. First, if two subsequent entities together make up a longer valid entity, the decomposition with the shorter entities can be disregarded. Second, it is assumed that the reading provides rules to mark entity borders and that these rules can be checked, so that the decomposition that abides by these rules will be preferred. This check is done by calling isStrictDecomposition().
The given input string can contain other characters not supported by the reading, e.g. punctuation marks. The returned list then contains a mix of basic reading entities and other characters e.g. spaces and punctuation marks.
Parameter: | readingString (str) – reading string |
---|---|
Return type: | list of str |
Returns: | a list of basic entities of the input string |
Raises AmbiguousDecompositionError: | if decomposition is ambiguous. |
Raises DecompositionError: | if the given string has a wrong format. |
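The two assumptions described for decompose() can be sketched as a selection step over a list of candidate decompositions. This is a simplified illustration in plain Python 3, not the library's implementation. The function names, the entity set, the simplified strictness rule, and the local exception class are hypothetical stand-ins.

```python
class AmbiguousDecompositionError(Exception):
    """Hypothetical stand-in for the exception of the same name."""


def choose_decomposition(candidates, entities, is_strict):
    """Pick one decomposition using the two assumptions."""

    def has_mergeable_pair(decomposition):
        # Two subsequent entities that form a longer valid entity.
        return any(first + second in entities
                   for first, second in zip(decomposition, decomposition[1:]))

    # Assumption 1: disregard decompositions with mergeable neighbours.
    remaining = [d for d in candidates if not has_mergeable_pair(d)]
    if len(remaining) == 1:
        return remaining[0]
    # Assumption 2: prefer the decomposition that follows the strict rules.
    strict = [d for d in remaining if is_strict(d)]
    if len(strict) == 1:
        return strict[0]
    raise AmbiguousDecompositionError("decomposition is ambiguous")


ENTITIES = {"xi", "an", "xian", "fan", "gan", "fang"}


def no_vowel_initial_follower(decomposition):
    # Simplified strictness rule: later entities must not start with a, e, o.
    return not any(entity.startswith(("a", "e", "o"))
                   for entity in decomposition[1:])


print(choose_decomposition([["xi", "an"], ["xian"]],
                           ENTITIES, no_vowel_initial_follower))
print(choose_decomposition([["fan", "gan"], ["fang", "an"]],
                           ENTITIES, no_vowel_initial_follower))
```

The first call is settled by the first assumption alone; the second needs the strictness check.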
Decomposes the given string into basic entities that can be mapped to one Chinese character each for all possible decompositions and returns the possible decompositions as a lattice.
Parameter: | readingString (str) – reading string |
---|---|
Return type: | list |
Returns: | a list of all possible decompositions consisting of basic entities as a lattice construct. |
Raises DecompositionError: | if the given string has a wrong format. |
Decomposes the given string into basic entities that can be mapped to one Chinese character each for all possible decompositions. This method is a more general version of decompose().
The returned list construction consists of two entity types: entities of the romanisation and other strings.
Parameter: | readingString (str) – reading string |
---|---|
Return type: | list of list of str |
Returns: | a list of all possible decompositions consisting of basic entities. |
Raises DecompositionError: | if the given string has a wrong format. |
Gets a set of entities used by the reading to format reading entities.
The base class’ implementation will return an empty set.
Return type: | set of str |
---|---|
Returns: | set of supported formatting entities |
Gets a set of characters parsed by this reading operator as reading entities. For alphabetic characters, lower case is returned.
Separators like the apostrophe (') in Pinyin are not part of reading entities and as such not included.
Return type: | set |
---|---|
Returns: | set of characters parsed by the reading operator |
Gets a set of all entities supported by the reading.
The set is used in the segmentation process to find entity boundaries. The base class’ implementation will raise a NotImplementedError.
Returned entities are in lowercase.
Return type: | set of str |
---|---|
Returns: | set of supported reading entities |
Returns True if the given entity is a valid formatting entity recognised by the romanisation operator.
Letter case of characters will be handled depending on the setting for option 'case'.
Parameter: | entity (str) – entity to check |
---|---|
Return type: | bool |
Returns: | True if string is a formatting entity of the reading. |
Returns True if the given entity is recognised by the romanisation operator, i.e. it is a valid entity of the reading returned by the segmentation method.
Letter case of characters will be handled depending on the setting for option 'case'.
Parameter: | entity (str) – entity to check |
---|---|
Return type: | bool |
Returns: | True if string is an entity of the reading, False otherwise. |
Checks if the given decomposition follows the romanisation format strictly to allow unambiguous decomposition.
The romanisation should offer a protocol that makes an unambiguous decomposition into its basic syllables possible, so that the process of appending syllables to a string is reversible. The test for compliance with this protocol has to be implemented here. For any given string, this method may therefore return True for one and only one possible decomposition.
Parameter: | decomposition (list of str) – decomposed reading string |
---|---|
Return type: | bool |
Returns: | False, as this method needs to be implemented by the subclass |
Takes a string written in the romanisation and returns the possible segmentations as a list of syllables.
In contrast to decompose(), this method merely segments continuous entities of the romanisation. Characters that are not part of the romanisation are not dealt with; this is the task of the more general decompose() method.
Option 'strictSegmentation' controls the behaviour of this method for strings that cannot be parsed. If set to True, segmentation will raise an exception; if set to False, the given string will be returned unsegmented.
Parameter: | readingString (str) – reading string |
---|---|
Return type: | list of list of str |
Returns: | a list of possible segmentations (several if ambiguous) into single syllables |
Raises DecompositionError: | if the given string has an invalid format. |
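The effect of 'strictSegmentation' can be sketched with a small stand-alone function. It is written in plain Python 3 and is not the library's segmentation code. The function name, the keyword argument strict_segmentation, the entity set, and the local exception class are hypothetical.

```python
class DecompositionError(Exception):
    """Hypothetical stand-in for the exception of the same name."""


def segment(reading_string, entities, strict_segmentation=False):
    """Return all segmentations of a string into known entities."""

    def split(rest):
        if not rest:
            return [[]]
        # Try every known prefix and segment the remainder recursively.
        return [[rest[:end]] + tail
                for end in range(1, len(rest) + 1)
                if rest[:end] in entities
                for tail in split(rest[end:])]

    segmentations = split(reading_string)
    if segmentations:
        return segmentations
    if strict_segmentation:
        raise DecompositionError("cannot segment %r" % reading_string)
    # Non-strict mode: hand the string back unsegmented.
    return [[reading_string]]


ENTITIES = {"xi", "an", "xian"}
print(segment("xian", ENTITIES))
print(segment("xiq", ENTITIES))
```

With strict_segmentation=True the second call would raise DecompositionError instead of returning the input unchanged.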
Bases: cjklib.reading.operator.ReadingOperator
Provides an operator on readings with a single character per entity.
Parameters: |
|
---|
Bases: cjklib.reading.operator.ReadingOperator
Provides an abstract ReadingOperator for tonal languages for a reading based on a fixed set of reading entities.
Parameter: | options – extra options |
---|
Gets the set of plain entities supported by this reading. Unlike the entities returned by getReadingEntities(), these entities carry no tone mark.
The base class’ implementation will raise a NotImplementedError.
Return type: | set of str |
---|---|
Returns: | set of supported syllables |
Gets a set of all entities supported by the reading.
The list is used in the segmentation process to find entity boundaries.
Return type: | list of str |
---|---|
Returns: | list of supported syllables |
Gets the entity with tone mark for the given plain entity and tone. The letter case of the given plain entity might not be fully conserved for mixed case strings.
The base class’ implementation will raise a NotImplementedError.
Parameters: |
|
---|---|
Return type: | str |
Returns: | entity with appropriate tone |
Raises InvalidEntityError: | if the entity is invalid. |
Raises UnsupportedError: | if the operation is not supported for the given form. |
Returns the tones supported by the reading. These tones don’t necessarily reflect the tones of the underlying language but may differ to reflect notational or other features.
The base class’ implementation will raise a NotImplementedError.
Return type: | list |
---|---|
Returns: | list of supported tone marks. |
Returns True if the given plain entity (without any tone mark) is recognised by the romanisation operator, i.e. it is a valid entity of the reading returned by the segmentation method.
Parameter: | entity (str) – entity to check |
---|---|
Return type: | bool |
Returns: | True if string is an entity of the reading, False otherwise. |
Splits the entity into an entity without tone mark (plain entity) and the entity’s tone. The letter case of the given entity might not be fully conserved for mixed case strings.
The base class’ implementation will raise a NotImplementedError.
Parameter: | entity (str) – entity with tonal information |
---|---|
Return type: | tuple |
Returns: | plain entity without tone mark and entity’s tone |
Raises InvalidEntityError: | if the entity is invalid. |
Raises UnsupportedError: | if the operation is not supported for the given form. |
Bases: cjklib.reading.operator.TonalFixedEntityOperator
Defines an operator on strings of a tonal language written in the International Phonetic Alphabet (IPA).
TonalIPAOperator does not supply the same closed set of syllables as other ReadingOperators, as IPA provides different ways to represent pronunciation. Because of that, a user-defined IPA syllable will not easily map to another transcription system, and thus only basic support is provided for this direction.
Tones in IPA can be expressed using different schemes. The following schemes are implemented here:
Todo
Parameters: |
|
---|
Composes the given list of basic entities to a string. IPA syllables are separated by a period.
Parameter: | readingEntities (list of str) – list of basic entities or other content |
---|---|
Return type: | str |
Returns: | composed entities |
Decomposes the given string into basic entities that can be mapped to one Chinese character each (exceptions possible).
The returned list contains a mix of basic reading entities and other characters e.g. spaces and punctuation marks.
Single syllables can only be found if they are separated by a period or whitespace, as in the output of compose().
Parameter: | readingString (str) – reading string |
---|---|
Return type: | list of str |
Returns: | a list of basic entities of the input string |
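The period-based convention described for compose() and decompose() can be sketched as follows. This is an illustration in plain Python 3, not the operator's code. The function names and sample syllables are hypothetical, and dropping the periods while keeping whitespace in the result is an assumption.

```python
import re


def compose_ipa(syllables):
    """Join IPA syllables with a period."""
    return ".".join(syllables)


def decompose_ipa(reading_string):
    """Split an IPA string at periods and at whitespace."""
    # The capturing group keeps whitespace as separate items; periods match
    # the second alternative and are dropped. Unmatched groups give None.
    parts = re.split(r"(\s+)|\.", reading_string)
    return [part for part in parts if part]


composed = compose_ipa(["tɕi˥", "xau˨˩˦"])
print(composed)
print(decompose_ipa(composed + " ma"))
```

Without a period or whitespace between them, two syllables would be returned as a single item.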
Gets the entity with tone mark for the given plain entity and tone.
The plain entity returned will always be in Unicode’s Normalization Form C (NFC, see http://www.unicode.org/reports/tr15/).
Parameters: |
|
---|---|
Return type: | str |
Returns: | entity with appropriate tone |
Raises InvalidEntityError: | if the entity is invalid. |
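The role of Normalization Form C can be illustrated with Python's standard unicodedata module. The sketch below attaches a combining diacritic to a plain entity and removes it again. The function names, the vowel list, and the rule of placing the mark on the first vowel are hypothetical simplifications, not the operator's logic, and ValueError stands in for the library's exception.

```python
import unicodedata

VOWELS = "aeiou"  # hypothetical, simplified vowel inventory


def add_tone_diacritic(plain_entity, combining_mark):
    """Attach a combining mark to the first vowel and normalise to NFC."""
    for index, char in enumerate(plain_entity):
        if char in VOWELS:
            marked = (plain_entity[:index + 1] + combining_mark
                      + plain_entity[index + 1:])
            # NFC merges base letter and mark where a precomposed form exists.
            return unicodedata.normalize("NFC", marked)
    raise ValueError("no vowel found in %r" % plain_entity)


def strip_tone_diacritic(entity, combining_mark):
    """Remove one occurrence of the combining mark and return NFC."""
    # NFD separates precomposed letters into base letter and mark.
    decomposed = unicodedata.normalize("NFD", entity)
    if combining_mark not in decomposed:
        raise ValueError("no tone mark found in %r" % entity)
    plain = decomposed.replace(combining_mark, "", 1)
    return unicodedata.normalize("NFC", plain)


ACUTE = "\u0301"  # COMBINING ACUTE ACCENT
marked = add_tone_diacritic("ma", ACUTE)
print(marked, len(marked))
print(strip_tone_diacritic(marked, ACUTE))
```

Normalising to NFC means that an entity compares equal regardless of whether the input used precomposed or combining characters.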
Todo
Gets the tone for the given tone mark.
Parameter: | toneMark (str) – tone mark representation of the tone |
---|---|
Return type: | str |
Returns: | tone |
Raises InvalidEntityError: | if the toneMark does not exist. |
Takes a string written in IPA and guesses the reading dialect.
Supports option 'toneMarkType'.
Parameter: | readingString (str) – IPA string |
---|---|
Return type: | dict |
Returns: | dictionary of basic keyword settings |
Splits the entity into an entity without tone mark and the name of the entity’s tone.
The plain entity returned will always be in Unicode’s Normalization Form C (NFC, see http://www.unicode.org/reports/tr15/).
Parameter: | entity (str) – entity with tonal information |
---|---|
Return type: | tuple |
Returns: | plain entity without tone mark and additionally the tone |
Raises InvalidEntityError: | if the entity is invalid. |
Bases: cjklib.reading.operator.RomanisationOperator, cjklib.reading.operator.TonalFixedEntityOperator
Provides an abstract RomanisationOperator for tonal languages incorporating methods from TonalFixedEntityOperator.
Parameters: |
|
---|
Gets a set of all entities supported by the reading.
The list is used in the segmentation process to find entity boundaries.
Returned entities are in lowercase.
Return type: | list of str |
---|---|
Returns: | list of supported syllables |
Returns True if the given plain entity (without any tone mark) is recognised by the romanisation operator, i.e. it is a valid entity of the reading returned by the segmentation method.
Case of characters will be handled depending on the setting for option 'case'.
Parameter: | entity (str) – entity to check |
---|---|
Return type: | bool |
Returns: | True if string is an entity of the reading, False otherwise. |