Conversion between character readings.
The basic method is convert(), which converts one input string from one reading to another.
The method getDefaultOptions() returns the default conversion settings.
The conversion process uses the ReadingOperator of the source reading to decompose the given string into single entities. The decomposition contains reading entities and entities that don’t represent any pronunciation. While the goal is to convert the included reading entities to the target reading, some converters might also convert non-reading entities. Examples are delimiters such as apostrophes that differ between romanisations, or punctuation marks that have a defined representation in the target system, e.g. Braille.
By default, conversion won’t stop on entities that closely resemble reading entities but are not themselves valid. Those will turn up unchanged in the result and can cause a CompositionError when the target operator decides that it is impossible to join a converted entity with a non-converted one, as doing so would make it impossible to later determine the entity boundaries. Most of these errors will probably result from bad input. This can be solved by telling the source operator to be strict on decomposition (where supported) so that the error is reported beforehand. The following example tries to convert xiǎo tōu (“thief”), misspelled as *xiǎo tō:
>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> print f.convert(u'xiao3to1', 'Pinyin', 'GR',
...     sourceOptions={'toneMarkType': 'numbers'})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
cjklib.exception.CompositionError: Unable to delimit non-reading entity 'to1'
>>> print f.convert(u'xiao3to1', 'Pinyin', 'GR',
...     sourceOptions={'toneMarkType': 'numbers',
...         'strictSegmentation': True})
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  ...
cjklib.exception.DecompositionError: Segmentation of 'to1' not possible or invalid syllable
Not being strict results in a lazy conversion, which might fail in some cases, as shown above. The input u'xiao3 to1' (with a space in between), however, will work in lazy mode ('to1' is left unconverted), while the strict version will still report the invalid *to1.
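The difference between lazy and strict decomposition can be sketched with a toy model that does not use cjklib. The syllable inventory, the function decompose() and the exception ToyDecompositionError are hypothetical names chosen for illustration, and the real operators are considerably more involved:

```python
import re

# Hypothetical, deliberately tiny syllable inventory for this sketch.
VALID_SYLLABLES = {"xiao", "tou"}


class ToyDecompositionError(Exception):
    """Stand-in for a decomposition error; not cjklib's own class."""


def decompose(text, strict=False):
    """Split text into (entity, is_reading_entity) pairs."""
    entities = []
    # A letter run followed by a tone digit is a syllable candidate;
    # anything else is kept as a non-reading chunk.
    for token in re.findall(r"[a-z]+[1-5]|[a-z]+|[^a-z]+", text):
        looks_like_syllable = re.match(r"[a-z]+[1-5]$", token) is not None
        if looks_like_syllable and token[:-1] in VALID_SYLLABLES:
            entities.append((token, True))
        elif looks_like_syllable and strict:
            # Strict mode reports the problem before any conversion starts.
            raise ToyDecompositionError("invalid syllable %r" % token)
        else:
            # Lazy mode passes the entity through as a non-reading entity.
            entities.append((token, False))
    return entities
```

In this sketch, decompose("xiao3to1") marks 'to1' as a non-reading entity and leaves the problem for a later stage, whereas decompose("xiao3to1", strict=True) raises immediately.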
Other errors that can arise:
Conversions between two readings can be made using a third reading if no direct conversion is defined. This reading is called a bridge reading and is implemented in BridgeConverter. The routines of the ReadingFactory automatically employ bridges if needed.
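The idea of a bridge can be sketched independently of cjklib as the composition of two direct conversions. The registry, the reading names and get_converter() below are assumptions made for illustration only and do not mirror how BridgeConverter is actually implemented:

```python
# Hypothetical registry of direct conversions between three toy readings.
# Each converter maps a list of entities to a list of entities.
DIRECT_CONVERTERS = {
    ("ReadingA", "ReadingB"): lambda entities: [e.upper() for e in entities],
    ("ReadingB", "ReadingC"): lambda entities: [e + "*" for e in entities],
}

# Readings that may serve as an intermediate step (assumed for this sketch).
BRIDGE_READINGS = ["ReadingB"]


def get_converter(source, target):
    """Return a direct converter, or one composed over a bridge reading."""
    direct = DIRECT_CONVERTERS.get((source, target))
    if direct is not None:
        return direct
    for bridge in BRIDGE_READINGS:
        first = DIRECT_CONVERTERS.get((source, bridge))
        second = DIRECT_CONVERTERS.get((bridge, target))
        if first is not None and second is not None:
            # Chain source -> bridge -> target.
            return lambda entities: second(first(entities))
    raise ValueError("no conversion from %s to %s" % (source, target))
```

Here get_converter("ReadingA", "ReadingC") finds no direct entry and falls back to chaining the two conversions over ReadingB.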
Convert a string from Jyutping to Cantonese Yale:
>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> f.convert('gwong2jau1waa2', 'Jyutping', 'CantoneseYale')
u'gwóngyāuwá'
This is also possible by explicitly creating a converter instance using the factory:
>>> jyc = f.createReadingConverter('GR', 'Pinyin')
>>> jyc.convert('Woo.men tingshuo yeou "Yinnduhshyue", "Aijyishyue"')
u'Wǒmen tīngshuō yǒu "Yìndùxué", "Āijíxué"'
Convert between different dialects of the same reading, here Wade-Giles:
>>> f.convert(u'kuo3-yü2', 'WadeGiles', 'WadeGiles',
...     sourceOptions={'toneMarkType': 'numbers'},
...     targetOptions={'toneMarkType': 'superscriptNumbers'})
u'kuo³-yü²'
See PinyinDialectConverter for more examples.
Bases: object
Defines an abstract converter between two or more character readings.
Parameters: |
|
---|
Converts a string in the source reading to the given target reading.
Parameters: |
|
---|---|
Return type: | str |
Returns: | the input string converted to the toReading |
Raises DecompositionError: | if the string can not be decomposed into basic entities with regards to the source reading or the given information is insufficient. |
Raises CompositionError: | if the target reading’s entities can not be composed. |
Raises ConversionError: | on operations specific to the conversion between the two readings (e.g. error on converting entities). |
Raises UnsupportedError: | if source or target reading is not supported for conversion. |
Todo
Converts a list of entities in the source reading to the given target reading.
The default implementation will raise a NotImplementedError.
Parameters: |
|
---|---|
Return type: | list of str |
Returns: | list of entities written in target reading |
Raises ConversionError: | on operations specific to the conversion between the two readings (e.g. error on converting entities). |
Raises UnsupportedError: | if source or target reading is not supported for conversion. |
Raises InvalidEntityError: | if an invalid entity is given. |
Returns the reading converter’s default options.
The keyword ‘dbConnectInst’ is not regarded as a configuration option of the converter and is thus not included in the dict returned.
Return type: | dict |
---|---|
Returns: | the reading converter’s default options. |
Bases: cjklib.reading.converter.ReadingConverter
Defines an abstract ReadingConverter between two or more readings for entity-wise conversion.
Converters that simply convert one syllable at a time can subclass this class and merely need to override convertBasicEntity().
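The division of labour between the base class and a subclass can be sketched as follows. The classes are toy stand-ins that do not import cjklib, the method signatures are simplified (no source or target reading arguments), and the two-entry table is only an excerpt chosen for the example:

```python
class ToyEntityWiseConverter(object):
    """Toy base class: walks the entity list, delegating each syllable."""

    def convertEntities(self, entities):
        converted = []
        for entity in entities:
            if self.isReadingEntity(entity):
                converted.append(self.convertBasicEntity(entity))
            else:
                # Non-reading entities (spaces, punctuation) pass through.
                converted.append(entity)
        return converted

    def isReadingEntity(self, entity):
        raise NotImplementedError()

    def convertBasicEntity(self, entity):
        raise NotImplementedError()


class ToyPinyinToWadeGiles(ToyEntityWiseConverter):
    """Subclass supplying only the per-syllable mapping."""

    # Tiny illustrative excerpt, both sides written with tone numbers.
    SYLLABLES = {"xiao3": "hsiao3", "tou1": "t'ou1"}

    def isReadingEntity(self, entity):
        return entity in self.SYLLABLES

    def convertBasicEntity(self, entity):
        return self.SYLLABLES[entity]
```

With these definitions, ToyPinyinToWadeGiles().convertEntities(['xiao3', ' ', 'tou1']) converts the two syllables and leaves the space untouched.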
Parameters: |
|
---|
Converts a basic entity (e.g. a syllable) in the source reading to the given target reading.
This method is called by convertEntities() and a single entity is given for conversion.
The default implementation will raise a NotImplementedError.
Parameters: |
|
---|---|
Return type: | str |
Returns: | the entity converted to the toReading |
Raises AmbiguousConversionError: | if conversion for this entity of the source reading is ambiguous. |
Raises ConversionError: | on other operations specific to the conversion of the entity. |
Raises InvalidEntityError: | if the entity is invalid. |
Bases: cjklib.reading.converter.ReadingConverter
Defines an abstract ReadingConverter that supports non-standard reading representations (dialects) as input and output.
Input will be converted to a standard representation of the input reading before the actual conversion step is done. If needed, the result will then be converted to the specified target dialect.
Parameters: |
|
---|
Defines the default reading options for the reading dialect used as a bridge in conversion between the user specified representation and the target reading.
The most general reading dialect should be specified so as to allow for a broad range of input.
Converts a list of entities in the source reading to the given target reading.
Parameters: |
|
---|---|
Return type: | list of str |
Returns: | list of entities written in target reading |
Raises AmbiguousConversionError: | if conversion for a specific entity of the source reading is ambiguous. |
Raises ConversionError: | on other operations specific to the conversion between the two readings (e.g. error on converting entities). |
Raises UnsupportedError: | if source or target reading is not supported for conversion. |
Raises InvalidEntityError: | if an invalid entity is given. |
Converts a list of reading entities in the standard representation given by DEFAULT_READING_OPTIONS, together with non-reading entities, from the source reading to the target reading.
The default implementation will raise a NotImplementedError.
Parameters: |
|
---|---|
Return type: | list structure |
Returns: | list of converted reading entities given as list and non-reading entities as single str objects |
Bases: cjklib.reading.converter.DialectSupportReadingConverter
Defines an abstract ReadingConverter between two or more romanisations.
Reading dialects can produce different entities which have to be handled by the conversion process. This is realised by converting the given reading dialect to a default form, then converting to the default target reading and finally converting to the specified target reading dialect. One conversion step thus involves three single conversion steps using a default form. This default form can be defined in DEFAULT_READING_OPTIONS.
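The three steps can be sketched with a toy dialect option that only changes how tones are written. Everything below is an assumption made for illustration: the helper names are hypothetical, the default form is taken to be plain tone digits, and the core conversion defaults to the identity so that the example amounts to a pure dialect change:

```python
# Hypothetical default form for this sketch: tones written as plain digits.
TO_SUPERSCRIPT = {"1": "\u00b9", "2": "\u00b2", "3": "\u00b3",
                  "4": "\u2074", "5": "\u2075"}
FROM_SUPERSCRIPT = {v: k for k, v in TO_SUPERSCRIPT.items()}


def to_default_form(entity, options):
    """Step 1: bring the source dialect into the default form."""
    if options.get("toneMarkType") == "superscriptNumbers":
        return "".join(FROM_SUPERSCRIPT.get(c, c) for c in entity)
    return entity


def from_default_form(entity, options):
    """Step 3: render the default form in the requested target dialect."""
    if options.get("toneMarkType") == "superscriptNumbers":
        return "".join(TO_SUPERSCRIPT.get(c, c) for c in entity)
    return entity


def convert_via_default_form(entities, source_options, target_options,
                             core=lambda entity: entity):
    """Run all three steps; `core` is the default-to-default conversion."""
    result = []
    for entity in entities:
        standard = to_default_form(entity, source_options)
        converted = core(standard)  # Step 2: actual reading conversion.
        result.append(from_default_form(converted, target_options))
    return result
```

Because only the first and last steps know about dialect options, the core conversion in this sketch needs to handle a single, well-defined form.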
Letter case will be transferred between syllables; no special formatting according to any particular standard is guaranteed. Letter case will be identified according to three classes: uppercase (all cased characters are uppercase), titlecase (all cased characters are lowercase except the first cased character), lowercase (all cased characters are lowercase). For entities with a single Latin character, uppercase has precedence over titlecase, e.g. E5 will convert to ÉH in Cantonese Yale, not to Éh. In general, letter case should be handled outside of cjklib if special formatting is required.
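The three case classes and the precedence rule for single letters can be sketched as follows. The function names are hypothetical and not part of cjklib, and treating any other mixed-case input as lowercase is an assumption of this sketch:

```python
def classify_case(entity):
    """Return 'uppercase', 'titlecase' or 'lowercase' for an entity."""
    # Only characters that have distinct upper and lower forms count.
    cased = [c for c in entity if c.lower() != c.upper()]
    # Checked first, so a single uppercase letter counts as uppercase.
    if cased and all(c.isupper() for c in cased):
        return "uppercase"
    if cased and cased[0].isupper() and all(c.islower() for c in cased[1:]):
        return "titlecase"
    # Assumption: anything else, including mixed case, is lowercase.
    return "lowercase"


def apply_case(entity, case):
    """Apply a case class to a converted, lower case entity."""
    lowered = entity.lower()
    if case == "uppercase":
        return lowered.upper()
    if case == "titlecase":
        for index, char in enumerate(lowered):
            if char.lower() != char.upper():
                # Capitalise only the first cased character.
                return lowered[:index] + char.upper() + lowered[index + 1:]
    return lowered


def transfer_case(source_entity, converted_entity):
    """Carry the letter case of the source entity over to the result."""
    return apply_case(converted_entity, classify_case(source_entity))
```

For the example from the text, transfer_case('E5', '\u00e9h') classifies E5 as uppercase and therefore yields ÉH rather than Éh.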
The class itself can’t be used directly; it has to be subclassed and convertBasicEntity() has to be implemented to make the translation of a syllable from one romanisation to another possible.
Parameters: |
|
---|
Converts a basic entity (e.g. a syllable) in the source reading to the given target reading.
This method is called by convertEntities() and a lower case entity is given for conversion. The returned value should be in lower case characters too, as convertEntities() will take care of capitalisation.
If a single entity needs to be converted it is recommended to use convertEntities() instead. In the general case it can not be ensured that a mapping from one reading to another can be done by the simple conversion of a basic entity. One-to-many mappings are possible and there is no guarantee that any entity of a reading recognised by isReadingEntity() will be mapped here.
The default implementation will raise a NotImplementedError.
Parameters: |
|
---|---|
Return type: | str |
Returns: | the entity converted to the toReading in lower case |
Raises AmbiguousConversionError: | if conversion for this entity of the source reading is ambiguous. |
Raises ConversionError: | on other operations specific to the conversion of the entity. |
Raises InvalidEntityError: | if the entity is invalid. |
Bases: cjklib.reading.converter.ReadingConverter
Provides a ReadingConverter that converts between readings via a third reading, called the bridge reading.
Parameters: |
|
---|