cjklib.characterlookup — Chinese character based functions

Chinese character based functions.

CharacterLookup

CharacterLookup provides access to lookup methods related to Han characters.

The core of CharacterLookup is the underlying database, where all relevant data is stored, so nearly all methods of this class need database access. On initialisation the object therefore establishes a database connection, using the logic provided by the DatabaseConnector.

See the DatabaseConnector for supported database systems.

CharacterLookup will try to read the config file from the user’s home folder as cjklib.conf or .cjklib.conf or /etc/cjklib.conf (Unix), %APPDATA%/cjklib/cjklib.conf (Windows), or /Library/Application Support/cjklib/ and $HOME/Library/Application Support/cjklib/cjklib.conf (Mac OS X). If none is present it will try to open a SQLite database stored as cjklib.db in the same folder by default. You can override this behaviour by specifying additional parameters on creation of the object.

Examples

The following examples should give a quick view into how to use this package.

  • Create the CharacterLookup object with default settings (read from cjklib.conf or cjklib.db in same directory as default) and set the character locale to traditional:

    >>> from cjklib import characterlookup
    >>> cjk = characterlookup.CharacterLookup('T')
    
  • Get a list of characters that are pronounced “국” in Korean:

    >>> cjk.getCharactersForReading(u'국', 'Hangul')
    [u'匊', u'國', u'局', u'掬', u'菊', u'跼', u'鞠', u'鞫', u'麯', u'麴']
    
  • Check if a character is included in another character as a component:

    >>> cjk.isComponentInCharacter(u'玉', u'宝')
    True
    
  • Get all Kangxi radical variants for Radical 184 (⾷) (under the traditional locale):

    >>> cjk.getKangxiRadicalVariantForms(184)
    [u'⻞', u'⻟']
    

Character locale

As characters developed in different cultures, their appearance changed over time to such an extent that the handling of radicals, character components and strokes needs to be distinguished depending on the locale.

To deal with this circumstance CharacterLookup works with a character locale. Most of the methods of this class need a locale context. In these cases the output of the method depends on the specified locale.

For example in the traditional locale 这 has 8 strokes, but in simplified Chinese it has only 7, as the radical ⻌ has different stroke counts, depending on the locale.

Glyphs

One feature of Chinese characters is the glyph form, which describes the visual representation. This form need not be unique, so many characters can be found in different writing variants, e.g. the character 福 (English: luck), which has numerous forms.

The Unicode Consortium does not separately encode characters that are the same but differ in actual shape (called Z-variants), except for a few “double” entries which are included to maintain backward compatibility. In fact, a code point represents an abstract character and does not define any visual representation. Thus a distinct appearance description including strokes and stroke order cannot simply be assigned to a code point; instead one needs to deal with the notion of glyphs, each representing a distinct appearance to which a visual description can be applied.

Cjklib tries to offer a simple approach to handling different glyphs. As character components, strokes and the stroke order depend on the glyph, methods dealing with these properties will ask for a glyph value to be specified. In these cases the output of the method depends on the specified shape.

Glyphs and character locales

Varying stroke count, stroke order or decomposition into character components for different character locales is implemented using different glyphs. For the example given above the entry 这 has two glyphs, one with 8 strokes, one with 7 strokes.

In most cases one might only be interested in a single visual appearance, the “standard” one. This would be the one generally used in the specific locale.

Instead of specifying a certain glyph most functions will allow for passing of a character locale. Giving the locale will apply the default glyph given by the mapping defined in the database which can be obtained by calling getDefaultGlyph().

More complex relations, such as which of several glyphs for a given character are used in a given locale, are not covered.
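As a rough illustration of how a locale resolves to a default glyph, the sketch below models the mapping with plain dictionaries. The table contents, the glyph indices and the helper name stroke_count are assumptions made for this example; cjklib reads the real mapping from its database.

```python
# Hypothetical data: the default glyph index for a (character, locale) pair.
# The real values live in cjklib's database.
DEFAULT_GLYPH = {
    ('这', 'T'): 0,
    ('这', 'C'): 1,
}

# Hypothetical stroke counts per (character, glyph index).
STROKE_COUNT = {
    ('这', 0): 8,
    ('这', 1): 7,
}


def stroke_count(char, locale=None, glyph=None):
    """Return the stroke count, resolving the glyph from the locale if needed."""
    if glyph is None:
        if locale is None:
            raise ValueError('either a locale or a glyph is required')
        glyph = DEFAULT_GLYPH[(char, locale)]
    return STROKE_COUNT[(char, glyph)]


print(stroke_count('这', locale='T'))
print(stroke_count('这', locale='C'))
print(stroke_count('这', glyph=1))
```

An explicitly given glyph takes precedence over the locale, which mirrors the optional glyph parameter of the methods described in this module.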

Kangxi radical functions

Using the Unihan database, queries about the Kangxi radical of characters can be made. It is possible to get the Kangxi radical for a character or to look up all characters for a given radical.

Unicode has extra code points for radical forms (e.g. ⾔), here called Unicode radical forms, and radical variant forms (e.g. ⻈), here called Unicode radical variants. These characters should be used when explicitly referring to their function as radicals. For most of the radicals and variants there exist complementary character forms which have the same appearance (e.g. 言 and 讠) and which shall be called equivalent characters here.

Mapping from one side to the other is not trivial, as some forms exist only as radical forms and some only as character forms that are nevertheless used in the radical context because of their meaning (called isolated radical characters here, e.g. 訁 for Kangxi radical 149).

Additionally, a one-to-one mapping can’t be guaranteed, as some forms have two or more equivalent forms in the other domain, and the mapping is highly dependent on the locale.

CharacterLookup provides methods for dealing with these different kinds of characters and the mapping between them.
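The relation between a Unicode radical form and its equivalent character can be explored with the standard library alone. The sketch below is not how cjklib implements its mapping (cjklib uses its own database tables); it relies on the layout of the Kangxi Radicals block (U+2F00 onwards, ordered by radical index) and on Unicode's compatibility decomposition. The function names are hypothetical.

```python
import unicodedata


def kangxi_radical_form(radical_index):
    """Return the Unicode radical form for a Kangxi radical index (1-214)."""
    if not 1 <= radical_index <= 214:
        raise ValueError('invalid Kangxi radical index: %r' % radical_index)
    # The Kangxi Radicals block starts at U+2F00 and is ordered by index.
    return chr(0x2F00 + radical_index - 1)


def equivalent_character(radical_form):
    """Return the compatibility-equivalent character of a radical form."""
    return unicodedata.normalize('NFKC', radical_form)


radical = kangxi_radical_form(149)
print('U+%04X' % ord(radical), unicodedata.name(radical))

equivalent = equivalent_character(radical)
print('U+%04X' % ord(equivalent))
```

This route only yields one equivalent character per radical form and does not cover most radical variants, which is one reason a dedicated, locale-aware mapping is needed.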

Character decomposition

Many characters can be decomposed into two or more components that are themselves Chinese characters. This fact can be used in many ways, including character lookup, finding patterns for font design or studying characters. Even the stroke order and stroke count can be deduced from the stroke information of the character’s components.

A character decomposition depends on the appearance of the character, a glyph, so a glyph index needs to be given when looking at a decomposition into components (by default it will be chosen following the current character locale).

Several points make this task more complex: the decomposition into one set of components is not unique, as some characters can be broken down into different sets. Furthermore, sometimes one component can be given, but the other component is not encoded as a character in its own right.

These components might in turn be characters that contain further components (again not unique ones), so a complex decomposition in several steps is possible.

The basis for the character decomposition lies in the database, where all decompositions are stored, using Ideographic Description Sequences (IDS). These sequences consist of Unicode IDS operators and characters to describe the structure of the character. There are binary IDS operators to describe decomposition into two components (e.g. ⿰ for one component left, one right as in 好: ⿰女子) and trinary IDS operators for decomposition into three components (e.g. ⿲ for three components from left to right as in 辨: ⿲⾟刂⾟). IDS operators give basic structural information, which in many cases is sufficient, for example, to derive an overall stroke order from the separate stroke orders of the components. Furthermore, it is possible to look for redundant information in different entries, which helps to keep the definition data clean.

This class provides methods for retrieving the basic partition entries, lookup of characters by components and decomposing as a tree from the character as a root down to the minimal components as leaf nodes.
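To make the IDS structure concrete, here is a minimal sketch of a parser for plain Ideographic Description Sequences (without glyph indices), followed by a stroke count derived from the components. The function names and the small stroke table are assumptions for this example and not part of cjklib.

```python
# Binary operators U+2FF0, U+2FF1, U+2FF4..U+2FFB; trinary U+2FF2, U+2FF3.
IDS_BINARY = '\u2ff0\u2ff1\u2ff4\u2ff5\u2ff6\u2ff7\u2ff8\u2ff9\u2ffa\u2ffb'
IDS_TRINARY = '\u2ff2\u2ff3'


def parse_ids(sequence):
    """Parse an IDS string into nested lists: [operator, child, child, ...]."""
    def parse(pos):
        if pos >= len(sequence):
            raise ValueError('incomplete IDS: %r' % sequence)
        char = sequence[pos]
        if char in IDS_BINARY:
            arity = 2
        elif char in IDS_TRINARY:
            arity = 3
        else:
            return char, pos + 1  # a plain component
        node = [char]
        pos += 1
        for _ in range(arity):
            child, pos = parse(pos)
            node.append(child)
        return node, pos

    tree, end = parse(0)
    if end != len(sequence):
        raise ValueError('trailing data in IDS: %r' % sequence[end:])
    return tree


def leaf_components(tree):
    """Return the components of a parsed IDS from left to right."""
    if isinstance(tree, str):
        return [tree]
    leaves = []
    for child in tree[1:]:
        leaves.extend(leaf_components(child))
    return leaves


# Stroke counts of the two components of 好 (illustrative table).
STROKES = {'女': 3, '子': 3}

tree = parse_ids('⿰女子')
print(tree)
print(sum(STROKES[c] for c in leaf_components(tree)))
```

Summing component stroke counts only works when every component has known stroke data; the operator itself tells in which order the components' strokes are written.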

See also

Character decomposition guidelines
Discussion on the project’s wiki.

Strokes

Chinese characters consist of different strokes as basic parts. These strokes are written in a mostly distinct order called the stroke order and have a distinct stroke count.

The stroke order in the writing of Chinese characters is important, e.g. for calligraphy or for students learning new characters, and is normally fixed, as there is only one possible stroke order for each character. Furthermore, there is a fixed set of possible strokes, and these strokes carry names.

As with character decomposition, the stroke order and stroke count depend on the actual rendering of the character, the glyph. If no glyph is specified, it will be deduced from the current character locale.

The set of strokes as defined by Unicode in block 31C0-31EF is supported. Simplifying subsets might be supported in the future.

TODO: About the different classifications of strokes

Stroke names and abbreviated names

In addition to the encoded stroke forms, stroke names and abbreviated stroke names can be used to conveniently refer to strokes. Currently Mandarin names (following Unicode) are supported; abbreviated stroke names are built by taking the first letter of the Pinyin spelling of each syllable, e.g. HZZZG for 橫折折折鉤 (i.e. ㇡, U+31E1).
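The abbreviation rule can be sketched in a few lines. The helper name abbreviate_stroke_name is hypothetical; the example only assumes that the Pinyin syllables of the stroke name are already known.

```python
import unicodedata


def abbreviate_stroke_name(pinyin_syllables):
    """Build an abbreviated stroke name from the Pinyin syllables of its name."""
    letters = []
    for syllable in pinyin_syllables:
        # Decompose first so a tone mark on the initial letter is dropped.
        letters.append(unicodedata.normalize('NFD', syllable)[0].upper())
    return ''.join(letters)


# 橫折折折鉤 is read héng zhé zhé zhé gōu.
abbreviation = abbreviate_stroke_name(['héng', 'zhé', 'zhé', 'zhé', 'gōu'])
print(abbreviation)

# The Unicode character name of U+31E1 carries the same abbreviation.
print(unicodedata.name('\u31e1'))
```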

Inconsistencies

The stroke order of some characters is disputed in academic fields. A current workaround would be adding another glyph definition, showing the alternative order.

TODO: About plans of cjklib how to support different views on the stroke order

Readings

See module cjklib.reading for a detailed introduction to character readings.

CharacterLookup provides two methods for accessing character readings: CharacterLookup.getReadingForCharacter() will return all readings known for the given character. CharacterLookup.getCharactersForReading() will return all characters known to have the given reading.
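The two lookups are inverse views of the same relation. The following sketch shows this with a small hand-written table instead of the database; the table excerpt and the function name are assumptions for illustration only.

```python
from collections import defaultdict

# Hypothetical excerpt of a character -> readings table (Hangul readings).
READINGS = {
    '國': ['국'],
    '菊': ['국'],
    '金': ['금', '김'],
}


def build_reverse_index(readings):
    """Turn a character -> readings table into a reading -> characters table."""
    index = defaultdict(list)
    for char, char_readings in readings.items():
        for reading in char_readings:
            index[reading].append(char)
    return dict(index)


CHARACTERS = build_reverse_index(READINGS)

print(READINGS['金'])            # all readings of one character
print(sorted(CHARACTERS['국']))  # all characters with one reading
```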

The database offers mappings for the following readings:

Most other readings are available by using one of the above readings as bridge.

Character domains

Unicode encodes Chinese characters for all languages that make use of them, but none of those writing systems makes use of the whole spectrum encoded. While it is difficult, if not impossible, to make a clear distinction between characters that are used in one system and those that are not, there exist authoritative character sets that are widely used. Following one of those character sets can decrease the number of characters in question and focus on those actually used in the given context.

In cjklib this concept is implemented as character domain and if a CharacterLookup instance is given a character domain, then its reported results are limited to the characters therein.

For example, to limit results to the character encoding BIG5, which encodes traditional Chinese characters:

>>> from cjklib import characterlookup
>>> cjk = characterlookup.CharacterLookup('T', 'BIG5')

Available character domains can be checked via getAvailableCharacterDomains(). Special character domain Unicode represents the whole set of Chinese characters encoded in Unicode.

See also

Radicals
Wikipedia on radicals.
Z-variants
Unicode Standard Annex #38, Unicode Han Database (Unihan), 3.7 Variants

Surrogate pairs

Python supports UCS-2 and UCS-4 for Unicode strings. The former is a 2-byte implementation called a narrow build, while the latter uses 4 bytes to store Unicode characters and is called a wide build. A wide build can directly store any character encoded by Unicode, while UCS-2 only supports the 16-bit range called the Basic Multilingual Plane (BMP). By default Python is compiled with UCS-2 support only, and for some platforms, e.g. Windows, no UCS-4 build is publicly available.

To work around being able to represent only the first 65536 code points of Unicode, Python narrow builds support surrogate pairs as found in UTF-16 to represent characters above code point 0xFFFF. Here a logical character from a code point above 0xFFFF is represented by two physical characters. The most significant surrogate lies between 0xD800 and 0xDBFF, while the least significant surrogate lies between 0xDC00 and 0xDFFF. Cjklib supports surrogate pairs and will return a string of length 2 for characters outside the BMP on narrow builds. Users should note that the assertion len(char) == 1 no longer holds in this case.
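The arithmetic behind surrogate pairs follows directly from the UTF-16 definition. The sketch below is independent of cjklib and of the Python build in use; the two function names are chosen for this example.

```python
def to_surrogate_pair(codepoint):
    """Split a code point above 0xFFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= codepoint <= 0x10FFFF:
        raise ValueError('code point outside the supplementary planes')
    offset = codepoint - 0x10000
    high = 0xD800 + (offset >> 10)    # most significant surrogate
    low = 0xDC00 + (offset & 0x3FF)   # least significant surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Combine a surrogate pair back into a single code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError('not a valid surrogate pair')
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+20000 is the first code point of CJK Unified Ideographs Extension B.
high, low = to_surrogate_pair(0x20000)
print(hex(high), hex(low))
print(hex(from_surrogate_pair(high, low)))
```

Code that must run on both build types should therefore treat a "character" as a string of length 1 or 2 rather than indexing single code units.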

See also

PEP 261
Support for “wide” Unicode characters
Encoding of characters outside the BMP
Wikipedia on UTF16/UCS-2.

Classes

class cjklib.characterlookup.CharacterLookup(locale, characterDomain='Unicode', databaseUrl=None, dbConnectInst=None)

Bases: object

CharacterLookup provides access to lookup methods related to Han characters.

Todo

  • Impl: Incorporate stroke lookup (bigram) techniques.

  • Impl: How to handle character forms (either decomposition or stroke order), that can only be found as a component in other characters? We already mark them by flagging it with an ‘S’.

  • Impl: Add option to component decomposition methods to stop on Kangxi radical forms without breaking further down beyond those.

  • Impl: Further character domains for Japanese, Cantonese, Korean, Vietnamese

  • Impl: There are more than 800 characters that have compatibility mappings whose targets have the same semantics. Those characters do not need their own data for stroke order and decomposition, but can share it with their targets:

    >>> unicodedata.normalize('NFD', u'嗀')
    u'嗀'
    

If no parameters are given default values are assumed for the connection to the database. The database connection parameters can be given in databaseUrl, or an instance of DatabaseConnector can be passed in dbConnectInst, the latter one being preferred if both are specified.

Parameters:
  • locale (str) – character locale giving the context for glyph and radical based functions, one character out of TCJKV.
  • characterDomain (str) – character domain (see getAvailableCharacterDomains())
  • databaseUrl (str) – database connection setting in the format driver://user:pass@host/database.
  • dbConnectInst (instance) – instance of a DatabaseConnector
CHARARACTER_READING_MAPPING

A list of readings for which a character mapping exists including the database’s table name and the reading dialect parameters.

On conversion the first matching reading will be selected, so supplying several equivalent readings has limited use.

HAN_SCRIPT_RANGES
List of character codepoint ranges for the Han script. See Scripts.txt from Unicode.
IDS_BINARY
A list of binary IDS operators used to describe character decompositions.
IDS_TRINARY
A list of trinary IDS operators used to describe character decompositions.
characterDomain
current character domain
static decompositionFromString(decomposition)

Gets a tuple representation with character/glyph of the given character’s decomposition into components.

Example: Entry ⿱尚[1]儿 will be returned as [u'⿱', (u'尚', 1), (u'儿', 0)].

Parameter:decomposition (str) – character decomposition with IDS operator, components and optional glyph index
Return type:list
Returns:decomposition with character/glyph tuples
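The conversion described above can be approximated as follows. This is a sketch written for illustration and not cjklib's implementation; it assumes that a missing glyph index means glyph 0, as suggested by the example entry, and that glyph indices are written as decimal digits in square brackets.

```python
import re

# IDS operators occupy U+2FF0..U+2FFB.
IDS_OPERATORS = ''.join(chr(code) for code in range(0x2FF0, 0x2FFC))

# One character, optionally followed by a glyph index in square brackets.
_ENTRY = re.compile(r'(.)(?:\[(\d+)\])?', re.DOTALL)


def decomposition_from_string(decomposition):
    """Convert e.g. '⿱尚[1]儿' into ['⿱', ('尚', 1), ('儿', 0)]."""
    result = []
    pos = 0
    while pos < len(decomposition):
        match = _ENTRY.match(decomposition, pos)
        char, glyph = match.group(1), match.group(2)
        if char in IDS_OPERATORS:
            result.append(char)
            pos += 1
        else:
            result.append((char, int(glyph) if glyph else 0))
            pos = match.end()
    return result


print(decomposition_from_string('⿱尚[1]儿'))
```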
static decompositionToString(decomposition, pureIds=False)

Gets a string representation of the given character decomposition.

Example: [u'⿱', (u'尚', 1), (u'儿', 0)] will yield ⿱尚[1]儿.

Parameters:
  • decomposition (list) – decomposition with character/glyph tuples
  • pureIds (bool) – if True a pure Ideographic Description Sequence will be returned and no glyph information will be included.
Return type:str
Returns:character decomposition with IDS operator, components and optional glyph index
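The reverse direction can be sketched in the same spirit. Again this is an illustrative reimplementation under the assumption, taken from the example above, that glyph index 0 is not written out; the parameter name pure_ids mirrors pureIds.

```python
def decomposition_to_string(decomposition, pure_ids=False):
    """Convert e.g. ['⿱', ('尚', 1), ('儿', 0)] into '⿱尚[1]儿'."""
    parts = []
    for entry in decomposition:
        if isinstance(entry, tuple):
            char, glyph = entry
            parts.append(char)
            # Glyph 0 is treated as the implicit default and left out.
            if glyph and not pure_ids:
                parts.append('[%d]' % glyph)
        else:
            parts.append(entry)  # an IDS operator
    return ''.join(parts)


entry = ['⿱', ('尚', 1), ('儿', 0)]
print(decomposition_to_string(entry))
print(decomposition_to_string(entry, pure_ids=True))
```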

filterDomainCharacters(charList)

Filters a given list of characters to match only those inside the current character domain. Returns the characters in the given order.

Parameter:charList (list of str) – characters to filter
Return type:list of str
Returns:list of characters inside the current character domain
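The idea of an order-preserving domain filter can be tried out without a database by approximating a character domain with one of Python's codecs. This is an assumption made for the sketch: cjklib itself takes its domains from database tables, and a codec may not match such a table exactly.

```python
def in_encoding(char, encoding):
    """Return True if the character can be encoded in the given codec."""
    try:
        char.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


def filter_domain_characters(char_list, encoding):
    """Keep only characters covered by the codec, in the given order."""
    return [char for char in char_list if in_encoding(char, encoding)]


chars = ['門', '门', '這', '这']
print(filter_domain_characters(chars, 'big5'))
print(filter_domain_characters(chars, 'gb2312'))
```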
getAllCharacterVariants(char)

Gets all variant forms regardless of the type for the character.

A list of tuples is returned, including the character and its variant type. See getCharacterVariants() for variant types.

Variants depend on the locale, which is not taken into account here. Thus some of the returned characters might only be variants under some locales.

Parameter:char (str) – Chinese character
Return type:list of tuple
Returns:list of character variant(s) with their type
getAvailableCharacterDomains()

Gets a list of all available character domains. The domain Unicode, which represents all Chinese characters encoded in Unicode, is available by default. Further domains can be added to the database as tables whose names end in ...Set and which include a column ChineseCharacter, e.g. GB2312Set and BIG5Set.

Return type:list of str
Returns:list of supported character domains
getCharacterDomain()

Returns the current character domain.

Return type:str
Returns:the current character domain
getCharacterEquivalentRadicalForms(equivalentForm)

Gets Unicode radical forms or Unicode radical variants for the given equivalent character.

The mapping mostly follows the Han Radical folding specified in the Draft Unicode Technical Report #30 Character Foldings under http://www.unicode.org/unicode/reports/tr30/#HanRadicalFolding. Several radical forms can be mapped to the same equivalent character and thus this method in general returns several values.

Parameter:equivalentForm (str) – Equivalent character of Unicode radical form or Unicode radical variant
Return type:list of str
Returns:Unicode radical forms or Unicode radical variants for the given equivalent character
Raises ValueError:
 if an invalid equivalent character is specified
getCharacterGlyphs(char)

Gets a list of character glyph indices supported by the database.

Parameter:char (str) – Chinese character
Return type:list of int
Returns:list of supported glyphs
Raises NoInformationError:
 if no glyph information is available
getCharacterKangxiRadicalIndex(char)

Gets the Kangxi radical index for the given character as defined by the Unihan database.

Parameter:char (str) – Chinese character
Return type:int
Returns:Kangxi radical index
Raises NoInformationError:
 if no Kangxi radical index information for given character
getCharacterKangxiRadicalResidualStrokeCount(char, glyph=None)

Gets the Kangxi radical form (either a Unicode radical form or a Unicode radical variant) found as a component in the character and the stroke count of the residual character components.

The representation of the included radical or radical variant form depends on the respective character shape and thus the form’s glyph is returned. Some characters include the given radical more than once and in some cases the representation is different between those same forms thus in the general case several matches can be returned, each entry with a different radical form glyph. In these cases the entries are sorted by their glyph index.

There are characters which include both the radical form and a variant form of the radical (e.g. 伦: 人 and 亻). In these cases both are returned.

This method will return radical forms regardless of the selected locale, e.g. radical ⻔ is returned for character 间, though this variant form is not recognised under a traditional locale (like the character itself).

Parameters:
  • char (str) – Chinese character
  • glyph (int) – glyph of the character. This parameter is optional and if omitted the default glyph defined by getDefaultGlyph() will be used.
Return type:list of tuple
Returns:list of radical/variant form, its glyph, the main layout of the character (using an IDS operator), the position of the radical with respect to the layout (0, 1 or 2) and the residual stroke count.
Raises NoInformationError:
 if no stroke count information available

getCharacterKangxiResidualStrokeCount(char, glyph=None)

Gets the stroke count of the residual character components when leaving aside the radical form.

This method returns a subset of the data returned by getCharacterKangxiRadicalResidualStrokeCount(). It may nevertheless offer more entries, as there might exist information only about the residual stroke count, but not about the concrete radical form.

Parameters:
  • char (str) – Chinese character
  • glyph (int) – glyph of the character. This parameter is optional and if omitted the default glyph defined by getDefaultGlyph() will be used
Return type:int
Returns:residual stroke count
Raises NoInformationError:
 if no stroke count information available

Note

The quality of the returned data depends on the sources used when compiling the database. Unihan itself only gives very general stroke order information without being bound to a specific glyph.

getCharacterRadicalResidualStrokeCount(char, radicalIndex, glyph=None)

Gets the radical form (either a Unicode radical form or a Unicode radical variant) found as a component in the character and the stroke count of the residual character components.

This is a more general version of getCharacterKangxiRadicalResidualStrokeCount() which is not limited to the mapping of characters to a Kangxi radical as done by Unihan.

Parameters:
  • char (str) – Chinese character
  • radicalIndex (int) – radical index
  • glyph (int) – glyph of the character. This parameter is optional and if omitted the default glyph defined by getDefaultGlyph() will be used
Return type:list of tuple
Returns:list of radical/variant form, its glyph, the main layout of the character (using an IDS operator), the position of the radical with respect to the layout (0, 1 or 2) and the residual stroke count.
Raises NoInformationError:
 if no stroke count information available

Todo

  • Lang: Clarify on characters classified under a given radical but without any proper radical glyph found as component.
  • Lang: Clarify on different radical glyphs for the same radical form. At best this method should return one and only one radical form (glyph).
  • Impl: Give the Unicode radical form and not the equivalent character form in the relevant table as to always return the pure radical form (also avoids duplicates). Then state: If the included component has an appropriate Unicode radical form or Unicode radical variant, then this form is returned. In either case the radical form can be an ordinary character.
getCharacterRadicalResidualStrokeCountDict()

Gets the full table of radical forms (either a Unicode radical form or a Unicode radical variant) found as a component in the character and the stroke count of the residual character components from the database.

A typical entry looks like (u'众', 0): {9: [(u'人', 0, u'⿱', 0, 4), (u'人', 0, u'⿻', 0, 4)]}, and can be accessed as radicalDict[(u'众', 0)][9] with the Chinese character, its glyph and Kangxi radical index. The values are given in the order radical form, radical glyph, character layout, relative position of the radical and finally the residual stroke count.

Return type:dict
Returns:dictionary of radical/residual stroke count entries.
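The nested structure is easier to read when unpacked step by step. The sketch below rebuilds the single entry quoted above as a literal dictionary, so it runs without a database; the variable names are chosen for this example.

```python
# The entry quoted in the description, written as a literal dictionary.
radical_dict = {
    ('众', 0): {9: [('人', 0, '⿱', 0, 4), ('人', 0, '⿻', 0, 4)]},
}

# Outer key: (character, glyph); inner key: Kangxi radical index.
entries = radical_dict[('众', 0)][9]

for radical_form, radical_glyph, layout, position, residual in entries:
    print(radical_form, radical_glyph, layout, position, residual)

# Distinct residual stroke counts over all entries for this radical.
residual_counts = sorted({entry[4] for entry in entries})
print(residual_counts)
```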
getCharacterResidualStrokeCount(char, radicalIndex, glyph=None)

Gets the stroke count of the residual character components when leaving aside the radical form.

This is a more general version of getCharacterKangxiResidualStrokeCount() which is not limited to the mapping of characters to a Kangxi radical as done by Unihan.

Parameters:
  • char (str) – Chinese character
  • radicalIndex (int) – radical index
  • glyph (int) – glyph of the character. This parameter is optional and if omitted the default glyph defined by getDefaultGlyph() will be used
Return type:int
Returns:residual stroke count
Raises NoInformationError:
 if no stroke count information available

Note

The quality of the returned data depends on the sources used when compiling the database. Unihan itself only gives very general stroke order information without being bound to a specific glyph.

getCharacterResidualStrokeCountDict()

Gets the table of stroke counts of the residual character components from the database for all characters in the chosen character domain.

A typical entry looks like (u'众', 0): {9: [4]}, and can be accessed as residualCountDict[(u'众', 0)][9] with the Chinese character, its glyph and Kangxi radical index which then gives the residual stroke count.

Return type:dict
Returns:dictionary of radical/residual stroke count entries.
getCharacterVariants(char, variantType)

Gets the variant forms of the given type for the character.

The type can be one out of:
  • C, compatible character form (if character was added to Unicode to maintain compatibility and round-trip convertibility)
  • M, semantic variant forms, which are often used interchangeably instead of the character.
  • P, specialised semantic variant forms, which are often used interchangeably instead of the character but limited to certain contexts.
  • Z, Z-variant forms, which only differ in typeface (and would have been unified if not to maintain round trip convertibility)
  • S, simplified Chinese character forms, originating from the character simplification process of the PR China.
  • T, traditional character forms for a simplified Chinese character.

Variants depend on the locale, which is not taken into account here. Thus some of the returned characters might only be variants under some locales.

Parameters:
  • char (str) – Chinese character
  • variantType (str) – type of variant(s) to be returned
Return type:list of str
Returns:list of character variant(s) of given type

Todo

  • Docu: Write about different kinds of variants
  • Impl: Give a source on variant information as information can contradict itself (http://www.unicode.org/reports/tr38/tr38-5.html#N10211). See 呆 (U+5446) which has one form each for semantic and specialised semantic, each derived from a different source. Change also in getAllCharacterVariants().
  • Lang: What is the difference on Z-variants and compatible variants? Some links between two characters are bidirectional, some not. Is there any rule?
getCharactersForComponents(componentList, includeEquivalentRadicalForms=True, resultIncludeRadicalForms=False, includeAllGlyphs=False)

Gets all characters that contain the given components.

If option includeEquivalentRadicalForms is set, all equivalent forms will be searched for when a Kangxi radical is given.

Parameters:
  • componentList (list of str) – list of character components
  • includeEquivalentRadicalForms (bool) – if True then characters in the given component list are interpreted as representatives for their radical and all radical forms are included in the search. E.g. 肉 will include ⺼ as a possible component.
  • resultIncludeRadicalForms (bool) – if True the result will include Unicode radical forms and Unicode radical variants
  • includeAllGlyphs (bool) – if True all matches will be returned, if False only those with glyphs matching the locale’s default one will be returned
Return type:list of tuple
Returns:list of pairs of matching characters and their glyphs

Todo

  • Impl: Table of same character glyphs, including special radical forms (e.g. 言 and 訁).
  • Data: Adopt locale dependent glyph for parent characters (e.g. 鬼 in 隗 愧 嵬).
  • Data: Use radical forms and radical variant forms instead of equivalent characters in decomposition data. Mapping loses information.
  • Lang: By default we get the equivalent character for a radical form. In some cases these equivalent characters will be only abstractly related to the given radical form (e.g. being the main radical form), so that the result set will be too big and doesn’t reflect the original query. Set up a table including only strict visual relations between radical forms and equivalent characters. Alternatively restrict decomposition data to only include radical forms if appropriate, so there would be no need for conversion.
  • Fix: Radical equivalent forms should be included independent of the chosen locale. E.g. u'⻔' for u'门'.
getCharactersForEquivalentComponents(componentConstruct, resultIncludeRadicalForms=False, includeAllGlyphs=False)

Gets all characters that contain at least one component per list entry, sorted by stroke count if available.

This is the general form of getCharactersForComponents() and allows a set of characters per list entry, of which at least one must be a component of the matching character.

Parameters:
  • componentConstruct (list of list of str) – list of character components given as single characters or, for alternative characters, given as a list
  • resultIncludeRadicalForms (bool) – if True the result will include Unicode radical forms and Unicode radical variants
  • includeAllGlyphs (bool) – if True all matches will be returned, if False only those with glyphs matching the locale’s default one will be returned
Return type:list of tuple
Returns:list of pairs of matching characters and their glyphs
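The matching rule ("at least one component per list entry") can be sketched with a small in-memory table. The table, the function name and the omission of glyph indices and stroke count sorting are simplifications made for this example; cjklib evaluates the query against its decomposition data.

```python
# Hypothetical component table: character -> set of all its components.
COMPONENTS = {
    '好': {'女', '子'},
    '妈': {'女', '马'},
    '媽': {'女', '馬'},
    '李': {'木', '子'},
}


def characters_for_equivalent_components(component_construct,
                                         components=COMPONENTS):
    """Return characters containing at least one component per list entry."""
    matches = []
    for char, parts in components.items():
        if all(any(alternative in parts for alternative in alternatives)
               for alternatives in component_construct):
            matches.append(char)
    return matches


# 女 is required; 马 and 馬 are accepted as alternatives of each other.
print(characters_for_equivalent_components([['女'], ['马', '馬']]))
```

Because iterating over a one-character string yields that character, an entry may be given either as a single character or as a list of alternatives, as described for componentConstruct.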

getCharactersForKangxiRadicalIndex(radicalIndex)

Gets all characters for the given Kangxi radical index.

Parameter:radicalIndex (int) – Kangxi radical index
Return type:list of str
Returns:list of matching Chinese characters

Todo

  • Docu: Write about how Unihan maps characters to a Kangxi radical. Especially Chinese simplified characters.
  • Lang: 6954 characters have no Kangxi radical. Provide integration for these (SELECT COUNT(*) FROM Unihan WHERE kRSUnicode IS NOT NULL AND kRSKangxi IS NULL;).
getCharactersForRadicalIndex(radicalIndex)

Gets all characters for the given radical index.

This is a more general version of getCharactersForKangxiRadicalIndex() which is not limited to the mapping of characters to a Kangxi radical as done by Unihan and one character can show up under several different radical indices.

Parameter:radicalIndex (int) – Kangxi radical index
Return type:list of str
Returns:list of matching Chinese characters
getCharactersForReading(readingString, readingN, **options)

Gets all known characters for the given reading.

Cjklib uses the mappings defined in CHARARACTER_READING_MAPPING, but offers lookup for additional readings by converting those to a reading for which a mapping exists. See cjklib.reading for limitations that arise from reading conversion.

Parameters:
  • readingString (str) – reading string for lookup
  • readingN (str) – name of reading
  • options – additional options for handling the reading input
Return type:list of str
Returns:list of characters for the given reading
Raises UnsupportedError:
 if no mapping between characters and target reading exists. Either the database wasn’t built with the table needed or the given reading cannot be converted to any of the available mappings.
Raises ConversionError:
 if conversion from the internal source reading to the given target reading fails.

getDecompositionEntries(char, glyph=None)

Gets the decomposition of the given character into components from the database. The resulting decomposition is only the first layer in a tree of possible paths along the decomposition as the components can be further subdivided.

There can be several decompositions for one character, so a list of decompositions is returned.

Each entry in the result list consists of a list of characters (with its glyph) and IDS operators.

Parameters:
  • char (str) – Chinese character that is to be decomposed into components
  • glyph (int) – glyph of the character. This parameter is optional and if omitted the default glyph defined by getDefaultGlyph() will be used
Return type:list
Returns:list of first layer decompositions

getDecompositionEntriesDict()

Gets the decomposition table from the database for all characters in the chosen character domain.

Return type:dict
Returns:dictionary with key pair character, glyph and the first layer decomposition as value
getDecompositionTreeList(char, glyph=None)

Gets the decomposition of the given character into components as a list of decomposition trees.

There can be several decompositions for one character so one tree per decomposition is returned.

Each entry in the result list consists of a list of characters (each with its glyph and a list of further decompositions) and IDS operators. If a character can be further subdivided, its contained list is non-empty and includes yet another list of trees for the decomposition of the component.

Parameters:
  • char (str) – Chinese character that is to be decomposed into components
  • glyph (int) – glyph of the character. This parameter is optional and if omitted the default glyph defined by getDefaultGlyph() will be used
Return type:

list

Returns:

list of decomposition trees
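
The nested structure described here can be walked recursively. The following is a minimal sketch that collects the components of one tree that are not subdivided further. The tree literal is written by hand for illustration and is not database output. It assumes that each component is a (char, glyph, subtrees) tuple and each IDS operator a plain string, which is a guess at the exact shape. leaf_components is a hypothetical helper, not part of cjklib.

```python
def leaf_components(tree):
    """Return the characters of one decomposition tree that have no sub-trees."""
    leaves = []
    for entry in tree:
        if isinstance(entry, str):
            # IDS operators are plain strings and carry no component.
            continue
        char, _glyph, subtrees = entry
        if not subtrees:
            leaves.append(char)
        else:
            # Several alternative decompositions may exist; follow the first one.
            leaves.extend(leaf_components(subtrees[0]))
    return leaves


# Hand-written toy tree: 想 = ⿱相心, with 相 = ⿰木目 (glyph index 0 is assumed).
toy_tree = [
    '⿱',
    ('相', 0, [['⿰', ('木', 0, []), ('目', 0, [])]]),
    ('心', 0, []),
]

print(leaf_components(toy_tree))
```

Following only the first alternative keeps the sketch short. A caller that cares about every possible path would have to iterate over all sub-trees instead.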

getDefaultGlyph(char)

Gets the default glyph for the given character under the chosen character locale.

The glyph returned is an index to the internal database of different character glyphs and represents the most common glyph used under the given locale.

Parameter:char (str) – Chinese character
Return type:int
Returns:glyph index
Raises NoInformationError:
 if no glyph information is available
getDomainCharacterIterator()

Returns an iterator over the full set of domain characters.

Return type:iterator
Returns:iterator of characters inside the current character domain
getKangxiRadicalForm(radicalIdx)

Gets a Unicode radical form for the given Kangxi radical index.

This method always returns a single non-null value, even if several radical forms exist for one index.

Parameter:radicalIdx (int) – Kangxi radical index
Return type:str
Returns:Unicode radical form
Raises ValueError:
 if an invalid radical index is specified

Todo

  • Lang: Check if radicals for which multiple radical forms exist include a simplified form or other variation (e.g. ⻆, ⻝, ⺐). There are radicals for which a Chinese simplified character equivalent exists and that is mapped to a different radical under Unicode.
getKangxiRadicalIndex(radicalForm)

Gets the Kangxi radical index for the given form.

The given form may be either a Unicode radical form or an equivalent character.

Parameter:radicalForm (str) – radical form
Return type:int
Returns:Kangxi radical index
Raises ValueError:
 if an invalid radical form is specified
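
For the Unicode radical forms themselves, index and code point are related by simple arithmetic, because the Kangxi Radicals block (U+2F00 to U+2FD5) lists the 214 radicals in order. The sketch below shows only that relationship. The function names are hypothetical, and the library may return a different form depending on the locale. Unlike getKangxiRadicalIndex(), the reverse direction here does not accept equivalent characters, since that lookup needs the database.

```python
KANGXI_BLOCK_START = 0x2F00   # U+2F00 is Kangxi radical 1
KANGXI_RADICAL_COUNT = 214    # the block ends at U+2FD5


def kangxi_radical_form(radical_idx):
    """Map a Kangxi radical index (1..214) to its Kangxi Radicals block character."""
    if not 1 <= radical_idx <= KANGXI_RADICAL_COUNT:
        raise ValueError("invalid Kangxi radical index: %r" % (radical_idx,))
    return chr(KANGXI_BLOCK_START + radical_idx - 1)


def kangxi_radical_index(radical_form):
    """Map a Kangxi Radicals block character back to its index (1..214)."""
    offset = ord(radical_form) - KANGXI_BLOCK_START
    if not 0 <= offset < KANGXI_RADICAL_COUNT:
        raise ValueError("not a Kangxi radical form: %r" % (radical_form,))
    return offset + 1


print(hex(ord(kangxi_radical_form(149))))   # radical 149 is ⾔ (0x2f94)
print(kangxi_radical_index('\u2fb7'))       # ⾷ is radical 184
```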
getKangxiRadicalRepresentativeCharacters(radicalIdx)

Gets a list of characters that represent the radical for the given Kangxi radical index.

This includes the radical form(s), character equivalents and variant forms and equivalents. Results are not limited to the chosen character domain.

E.g. for the radical meaning to speak/to say/talk/word (Pinyin yán): ⾔ (0x2f94), 言 (0x8a00), ⻈ (0x2ec8), 讠 (0x8ba0), 訁 (0x8a01)

Parameter:radicalIdx (int) – Kangxi radical index
Return type:list of str
Returns:list of Chinese characters representing the radical for the given index, including Unicode radical and variant forms and their equivalent real character forms
getKangxiRadicalVariantForms(radicalIdx)

Gets a list of Unicode radical variants for the given Kangxi radical index.

This method can return an empty list if there are no Unicode radical variant forms. There might still be non-Unicode radical variants for this radical as character forms, though.

Parameter:radicalIdx (int) – Kangxi radical index
Return type:list of str
Returns:list of Unicode radical variants

Todo

  • Lang: Narrow locales, not all variant forms are valid under all locales.
getLocaleDefaultGlyph(char, locale)

Gets the default glyph for the given character under the given locale.

The glyph returned is an index to the internal database of different character glyphs and represents the most common glyph used under the given locale.

Parameters:
  • char (str) – Chinese character
  • locale (str) – character locale (one out of TCJKV)
Return type:

int

Returns:

glyph

Raises NoInformationError:
 

if no glyph information is available

Raises ValueError:
 

if an invalid character locale is specified

getRadicalFormEquivalentCharacter(radicalForm)

Gets the equivalent character of the given Unicode radical form or Unicode radical variant.

The mapping mostly follows the Han Radical folding specified in the Draft Unicode Technical Report #30 Character Foldings under http://www.unicode.org/unicode/reports/tr30/#HanRadicalFolding. All radical forms except U+2E80 (⺀) have an equivalent character. These equivalent characters are not necessarily visually identical and can be subject to major variation. Results are not limited to the chosen character domain.

This method may raise an UnsupportedError if there is no supported equivalent character form.

Parameter:radicalForm (str) – Unicode radical form
Return type:str
Returns:equivalent character form
Raises UnsupportedError:
 if there is no supported equivalent character form
Raises ValueError:
 if an invalid radical form is specified
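
Independently of cjklib, Unicode compatibility normalization gives a rough approximation of this folding for the Kangxi Radicals block, where each character carries a compatibility mapping to a unified ideograph. The sketch below uses only the standard library. fold_radical is a hypothetical helper, and because most forms in the CJK Radicals Supplement block have no such mapping, it is not a substitute for the database-backed method.

```python
import unicodedata


def fold_radical(radical_form):
    """Return the NFKC-folded character, or None if normalization changes nothing."""
    folded = unicodedata.normalize('NFKC', radical_form)
    return folded if folded != radical_form else None


# ⾔ (U+2F94) folds to 言 (U+8A00); an ordinary character is left unchanged.
print(hex(ord(fold_radical('\u2f94'))))
print(fold_radical('\u8a00'))
```

Returning None instead of raising keeps the sketch simple. The method documented here signals a missing equivalent with UnsupportedError.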
getReadingForCharacter(char, readingN, **options)

Gets all known readings for the character in the given target reading.

Cjklib uses the mappings defined in CHARARACTER_READING_MAPPING, but offers lookup for additional readings by converting those to a reading for which a mapping exists. See cjklib.reading for limitations that arise from reading conversion.

Parameters:
  • char (str) – Chinese character for lookup
  • readingN (str) – name of target reading
  • options – additional options for handling the reading output
Return type:

list of str

Returns:

list of readings for the given character

Raises UnsupportedError:
 

if no mapping between characters and target reading exists.

Raises ConversionError:
 

if conversion from the internal source reading to the given target reading fails.

Todo

  • Impl: Add option to return converted entities even if conversion fails for some entities. Represent those with None.
getResidualStrokeCountForKangxiRadicalIndex(radicalIndex)

Gets all characters and residual stroke count for the given Kangxi radical index.

This brings together methods getCharactersForKangxiRadicalIndex() and getCharacterResidualStrokeCountDict() and reports all characters including the given Kangxi radical, additionally supplying the residual stroke count.

Parameter:radicalIndex (int) – Kangxi radical index
Return type:list of tuple
Returns:list of matching Chinese characters with residual stroke count
getResidualStrokeCountForRadicalIndex(radicalIndex)

Gets all characters and residual stroke count for the given radical index.

This brings together methods getCharactersForRadicalIndex() and getCharacterResidualStrokeCountDict() and reports all characters including the given radical without being limited to the mapping of characters to a Kangxi radical as done by Unihan, additionally supplying the residual stroke count.

Parameter:radicalIndex (int) – Kangxi radical index
Return type:list of tuple
Returns:list of matching Chinese characters with residual stroke count
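
A result of this kind is often regrouped by residual stroke count, as in a radical index of a printed dictionary. The sketch below assumes each tuple has the shape (character, residual stroke count), which the description suggests but does not spell out. The sample entries are written by hand for the radical 言 and are not database output, and group_by_residual_strokes is a hypothetical helper.

```python
from collections import defaultdict


def group_by_residual_strokes(entries):
    """Group (char, residual_stroke_count) tuples by their residual stroke count."""
    groups = defaultdict(list)
    for char, residual in entries:
        groups[residual].append(char)
    # Sort by stroke count so the result reads like a radical index.
    return dict(sorted(groups.items()))


# Hand-written sample for radical 言: 計 and 訂 add 2 strokes, 記 adds 3, 語 adds 7.
toy_entries = [('計', 2), ('記', 3), ('訂', 2), ('語', 7)]

for residual, chars in group_by_residual_strokes(toy_entries).items():
    print(residual, ''.join(chars))
```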
getStrokeCount(char, glyph=None)

Gets the stroke count for the given character.

Parameters:
  • char (str) – Chinese character
  • glyph (int) – glyph of the character. This parameter is optional and if omitted the default glyph defined by getDefaultGlyph() will be used
Return type:

int

Returns:

stroke count of given character

Raises NoInformationError:
 

if no stroke count information is available

Note

The quality of the returned data depends on the sources used when compiling the database. Unihan itself only gives very general stroke order information without being bound to a specific glyph.

getStrokeCountDict()

Returns a stroke count dictionary for all characters in the chosen character domain.

Return type:dict
Returns:dictionary of key pair character, glyph and value stroke count

Note

The quality of the returned data depends on the sources used when compiling the database. Unihan itself only gives very general stroke order information without being bound to a specific glyph.

getStrokeForAbbrev(abbrev)

Gets the stroke form for the given abbreviated stroke name (e.g. 'HZ').

Parameter:abbrev (str) – abbreviated stroke name
Return type:str
Returns:Unicode stroke character
Raises ValueError:
 if an invalid stroke abbreviation is specified
getStrokeForName(name)

Gets the stroke form for the given stroke name (e.g. '横折').

Parameter:name (str) – Chinese name of stroke
Return type:str
Returns:Unicode stroke char
Raises ValueError:
 if an invalid stroke name is specified
getStrokeOrder(char, glyph=None, includePartial=False)

Gets the stroke order sequence for the given character.

The stroke order is constructed using the character decomposition into components.

Parameters:
  • char (str) – Chinese character
  • glyph (int) – glyph of the character. This parameter is optional and if omitted the default glyph defined by getDefaultGlyph() will be used
  • includePartial (bool) – if True a stroke order sequence will be returned even if only partial information is available. Unknown strokes will be replaced by None.
Return type:

list

Returns:

list of Unicode strokes

Raises NoInformationError:
 

if no stroke order information is available

getStrokeOrderAbbrev(char, glyph=None, includePartial=False)

Gets the stroke order sequence for the given character as a string of abbreviated stroke names separated by spaces and hyphens.

The stroke order is constructed using the character decomposition into components.

Parameters:
  • char (str) – Chinese character
  • glyph (int) – glyph of the character. This parameter is optional and if omitted the default glyph defined by getDefaultGlyph() will be used.
  • includePartial (bool) – if True a stroke order sequence will be returned even if only partial information is available. Unknown strokes will be replaced by a question mark (?).
Return type:

str

Returns:

string of stroke abbreviations separated by spaces and hyphens.

Raises NoInformationError:
 

if no stroke order information is available

Todo

  • Lang: Add the stroke order source to the stroke order data so that different and contradicting stroke order information can be given. The user could then name several preferred sources, which would be queried in the order given.
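
A string in this format can be split back into single stroke abbreviations with the standard library. The following sketch treats both spaces and hyphens as separators and counts the question marks produced by includePartial=True as unknown strokes. The sample string is written by hand in the described format and is not actual library output. Both function names are hypothetical.

```python
import re


def split_stroke_abbrevs(stroke_order):
    """Split an abbreviated stroke order string on spaces and hyphens."""
    return [token for token in re.split(r'[ -]+', stroke_order.strip()) if token]


def count_known_strokes(stroke_order):
    """Count the strokes that are not the unknown-stroke placeholder '?'."""
    return sum(1 for token in split_stroke_abbrevs(stroke_order) if token != '?')


# Hand-written sample: three known strokes followed by one unknown stroke.
sample = 'S-HZ-H ?'
print(split_stroke_abbrevs(sample))
print(count_known_strokes(sample))
```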
getStrokeOrderAbbrevDict()

Returns a stroke order dictionary for all characters in the chosen character domain.

Return type:dict
Returns:dictionary of key pair character, glyph and value stroke order
hasMappingForCharacterToReading(readingN)

Returns True if a mapping between Chinese characters and the given reading is supported.

Parameter:readingN (str) – name of reading
Return type:bool
Returns:True if a mapping between Chinese characters and the given reading is supported, False otherwise.
hasMappingForReadingToCharacter(readingN)

Returns True if a mapping between the given reading and Chinese characters is supported.

Parameter:readingN (str) – name of reading
Return type:bool
Returns:True if a mapping between the given reading and Chinese characters is supported, False otherwise.
classmethod isBinaryIDSOperator(char)

Checks if given character is a binary IDS operator.

Parameter:char (str) – Chinese character
Return type:bool
Returns:True if binary IDS operator, False otherwise
isCharacterInDomain(char)

Checks if the given character is inside the current character domain.

Parameter:char (str) – Chinese character for lookup
Return type:bool
Returns:True if character is inside the current character domain, False otherwise.
isComponentInCharacter(component, char, glyph=None, componentGlyph=None)

Checks if the given component is contained in the given character, either directly or as part of another component.

Parameters:
  • component (str) – character questioned to be a component
  • char (str) – Chinese character
  • glyph (int) – glyph of the character. This parameter is optional and if omitted the default glyph defined by getDefaultGlyph() will be used
  • componentGlyph (int) – glyph of the component; if left out every glyph matches for that character.
Return type:

bool

Returns:

True if component is a component of the given character, False otherwise

Todo

  • Impl: Implement means to check if the component is really not found, or if our data is just insufficient.
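
The idea behind such a check can be sketched as a recursive search over first layer decompositions. This is an illustration under assumptions, not the library's implementation. The table is written by hand and keyed by (character, glyph), each decomposition is assumed to be a list of IDS operator strings and (char, glyph) tuples, and glyph index 0 is assumed throughout. is_component_in is a hypothetical helper.

```python
def is_component_in(component, char, table, glyph=0):
    """Search the decompositions of (char, glyph) recursively for component."""
    for decomposition in table.get((char, glyph), []):
        for entry in decomposition:
            if isinstance(entry, str):
                continue  # IDS operator, not a component
            sub_char, sub_glyph = entry
            if sub_char == component:
                return True
            if is_component_in(component, sub_char, table, sub_glyph):
                return True
    return False


# Hand-written toy table: 宝 = ⿱宀玉, 想 = ⿱相心, 相 = ⿰木目.
toy_table = {
    ('宝', 0): [['⿱', ('宀', 0), ('玉', 0)]],
    ('想', 0): [['⿱', ('相', 0), ('心', 0)]],
    ('相', 0): [['⿰', ('木', 0), ('目', 0)]],
}

print(is_component_in('玉', '宝', toy_table))   # direct component
print(is_component_in('目', '想', toy_table))   # found one level down
print(is_component_in('木', '宝', toy_table))   # not in the toy data
```

A False result from such a search only says that no path was found in the table at hand. It cannot distinguish a character that truly lacks the component from one whose decomposition data is missing.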
classmethod isIDSOperator(char)

Checks if given character is an IDS operator.

Parameter:char (str) – Chinese character
Return type:bool
Returns:True if IDS operator, False otherwise
isKangxiRadicalFormOrEquivalent(form)

Checks if the given form is a Kangxi radical form or a radical equivalent. This includes Unicode radical forms, Unicode radical variants, equivalent characters and isolated radical characters.

Parameter:form (str) – Chinese character
Return type:bool
Returns:True if given form is a radical or equivalent character, False otherwise
static isRadicalChar(char)

Checks if the given character is a Unicode radical form or Unicode radical variant.

This method only performs a quick check of the Unicode code point, so there is no guarantee that the form actually has a radical entry in the database.

Parameter:char (str) – Chinese character
Return type:bool
Returns:True if given form is a radical form, False otherwise
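
A code point check of this kind might look like the following sketch. It assumes that the two relevant blocks are CJK Radicals Supplement (U+2E80 to U+2EF3) and Kangxi Radicals (U+2F00 to U+2FD5). The exact ranges used by the library are not stated here, and is_radical_char is a hypothetical stand-in.

```python
def is_radical_char(char):
    """Return True if char lies in one of the two Unicode radical blocks."""
    code = ord(char)
    in_supplement = 0x2E80 <= code <= 0x2EF3   # CJK Radicals Supplement
    in_kangxi = 0x2F00 <= code <= 0x2FD5       # Kangxi Radicals
    return in_supplement or in_kangxi


print(is_radical_char('\u2f94'))   # ⾔, Unicode radical form
print(is_radical_char('\u2ec8'))   # ⻈, Unicode radical variant
print(is_radical_char('\u8a00'))   # 言, ordinary character
```

A plain range test also accepts U+2E9A, which is unassigned. This shows why a positive result does not guarantee a matching database entry.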
classmethod isTrinaryIDSOperator(char)

Checks if given character is a trinary IDS operator.

Parameter:char (str) – Chinese character
Return type:bool
Returns:True if trinary IDS operator, False otherwise
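
The distinction between binary and trinary operators matters when reading an Ideographic Description Sequence, which is written in prefix notation. The sketch below classifies the twelve classic operators U+2FF0 to U+2FFB by operand count and checks whether a sequence is complete. It is independent of cjklib, the function names are hypothetical, and additional operators defined in later Unicode versions are ignored.

```python
# U+2FF2 (⿲) and U+2FF3 (⿳) take three operands; the other ten take two.
TRINARY_IDS_OPERATORS = frozenset(chr(c) for c in (0x2FF2, 0x2FF3))
BINARY_IDS_OPERATORS = frozenset(
    chr(c) for c in [0x2FF0, 0x2FF1] + list(range(0x2FF4, 0x2FFC)))


def ids_operand_count(char):
    """Return 2 or 3 for an IDS operator and 0 for any other character."""
    if char in BINARY_IDS_OPERATORS:
        return 2
    if char in TRINARY_IDS_OPERATORS:
        return 3
    return 0


def is_complete_ids(sequence):
    """Check that a prefix-notation sequence describes exactly one character."""
    needed = 1  # number of operands still expected
    for char in sequence:
        if needed == 0:
            return False  # trailing characters after a complete description
        # Each symbol fills one open slot; an operator opens new ones.
        needed += ids_operand_count(char) - 1
    return needed == 0


print(is_complete_ids('⿰女子'))
print(is_complete_ids('⿱⿰木目心'))
print(is_complete_ids('⿰女'))
```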
setCharacterDomain(characterDomain)

Sets the current character domain.

Parameter:characterDomain (str) – the current character domain