cjklib.build.builder — Build methods

Methods for building the library’s database.

Classes

class cjklib.build.builder.BIG5HKSCSSetBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Builds a simple list of all characters in the standard BIG5-HKSCS.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
build()
getGenerator()

New in version 0.3.

class cjklib.build.builder.BIG5SetBuilder(**options)

Bases: cjklib.build.builder.UnihanCharacterSetBuilder

Builds a simple list of all characters in the Taiwanese standard BIG5.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.CEDICTBuilder(**options)

Bases: cjklib.build.builder.CEDICTFormatBuilder

Builds the CEDICT dictionary.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • enableFTS3 – if True SQLite full text search (FTS3) will be supported, if the extension exists.
  • filePath – file path including file name, overrides dataPath
  • fileType – type of file (.zip, .tar, .tar.bz2, .tar.gz, .gz, .txt), overrides file type guessing
getArchiveContentName(nameList, filePath)
class cjklib.build.builder.CEDICTFormatBuilder(**options)

Bases: cjklib.build.builder.EDICTFormatBuilder

Provides an abstract class for loading CEDICT formatted dictionaries.

Two columns will be provided for the headword (one each for the traditional and the simplified writing), one for the reading (e.g. in CEDICT Pinyin) and one for the translation.
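
The four-column layout described above can be illustrated with a small stand-alone parser. This is only a sketch: the pattern CEDICT_LINE and the helper parse_cedict_line are hypothetical names introduced here, not the builder's own ENTRY_REGEX, and the sample line follows the commonly published CEDICT layout.

```python
import re

# Hypothetical pattern for one CEDICT line (not the library's ENTRY_REGEX):
# traditional, simplified, [reading], /translation/.../
CEDICT_LINE = re.compile(r'^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+(/.*/)\s*$')


def parse_cedict_line(line):
    """Return (traditional, simplified, reading, translation) or None."""
    if line.startswith('#'):
        # Comment lines carry no dictionary entry.
        return None
    match = CEDICT_LINE.match(line)
    return match.groups() if match else None


entry = parse_cedict_line('中國 中国 [Zhong1 guo2] /China/Middle Kingdom/')
```

Each of the four captured groups would correspond to one table column.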

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • enableFTS3 – if True SQLite full text search (FTS3) will be supported, if the extension exists.
  • filePath – file path including file name, overrides dataPath
  • fileType – type of file (.zip, .tar, .tar.bz2, .tar.gz, .gz, .txt), overrides file type guessing
class cjklib.build.builder.CEDICTGRBuilder(**options)

Bases: cjklib.build.builder.EDICTFormatBuilder

Builds the CEDICT-GR dictionary.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • enableFTS3 – if True SQLite full text search (FTS3) will be supported, if the extension exists.
  • filePath – file path including file name, overrides dataPath
  • fileType – type of file (.zip, .tar, .tar.bz2, .tar.gz, .gz, .txt), overrides file type guessing
getArchiveContentName(nameList, filePath)
class cjklib.build.builder.CEDICTGRWordIndexBuilder(**options)

Bases: cjklib.build.builder.WordIndexBuilder

Builds the word index of the CEDICT-GR dictionary.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.CEDICTWordIndexBuilder(**options)

Bases: cjklib.build.builder.WordIndexBuilder

Builds the word index of the CEDICT dictionary.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.CFDICTBuilder(**options)

Bases: cjklib.build.builder.TimestampedCEDICTFormatBuilder

Builds the CFDICT dictionary.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • enableFTS3 – if True SQLite full text search (FTS3) will be supported, if the extension exists.
  • filePath – file path including file name, overrides dataPath
  • fileType – type of file (.zip, .tar, .tar.bz2, .tar.gz, .gz, .txt), overrides file type guessing
class cjklib.build.builder.CFDICTWordIndexBuilder(**options)

Bases: cjklib.build.builder.WordIndexBuilder

Builds the word index of the CFDICT dictionary.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.CSVFileLoader(**options)

Bases: cjklib.build.builder.TableBuilder

Builds a table by loading its data from a list of comma separated values (CSV).

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
INDEX_KEYS
Index keys (not unique) of the created table
TABLE_CSV_FILE_MAPPING
csv file path
TABLE_DECLARATION_FILE_MAPPING
file path containing SQL create table code.
build()
classmethod getDefaultOptions()
classmethod getOptionMetaData(option)
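
As a rough illustration of what loading a table from a CSV file plus a separate SQL declaration involves, the following sketch uses only the standard library. The function load_csv_table, the table name and the sample data are assumptions made for this example; the real class reads both from the files named in TABLE_CSV_FILE_MAPPING and TABLE_DECLARATION_FILE_MAPPING.

```python
import csv
import io
import sqlite3

# Stand-ins for the declaration file and the CSV file; contents are made up.
CREATE_SQL = 'CREATE TABLE Syllables (Syllable TEXT, Tone INTEGER)'
CSV_DATA = '"ma",1\n"ma",3\n'


def load_csv_table(connection, create_sql, csv_text, table, index_keys=()):
    """Create the table, insert all CSV rows and add non-unique indices."""
    connection.execute(create_sql)
    rows = list(csv.reader(io.StringIO(csv_text)))
    if rows:
        placeholders = ', '.join('?' * len(rows[0]))
        connection.executemany(
            'INSERT INTO %s VALUES (%s)' % (table, placeholders), rows)
    for column in index_keys:
        # One plain (non-unique) index per key, similar in spirit to INDEX_KEYS.
        connection.execute('CREATE INDEX %s_%s_idx ON %s (%s)'
                           % (table, column, table, column))
    connection.commit()
    return len(rows)


conn = sqlite3.connect(':memory:')
count = load_csv_table(conn, CREATE_SQL, CSV_DATA, 'Syllables',
                       index_keys=('Syllable',))
```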
class cjklib.build.builder.CantoneseIPAInitialFinalBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping from Cantonese syllables in IPA to their initial/final parts.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.CantoneseYaleInitialNucleusCodaBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping of Cantonese syllables in the Yale romanisation system to the syllables’ initial, nucleus and coda.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.CantoneseYaleSyllablesBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a list of Cantonese Yale syllables.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.CharacterComponentLookupBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Builds a mapping between characters and their components.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class CharacterComponentGenerator(dbConnectInst, characterSet)

Generates the component to character mapping.

Parameters:
  • dbConnectInst (instance) – instance of a DatabaseConnector
  • characterSet (set) – set of characters to generate the table for
generator()
Provides the component entries.
getComponents(char, glyph, decompositionDict, componentDict)

Gets all character components for the given glyph.

Parameters:
  • char (str) – Chinese character
  • glyph (int) – glyph of character
Return type:

set

Returns:

all components of the character
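
Collecting all components of a character amounts to walking its decomposition recursively. The sketch below assumes a simplified data layout (a dict from character to a list of direct components, with the glyph index left out) and a hypothetical helper get_components; the actual structure of decompositionDict and componentDict is not documented here.

```python
def get_components(char, decomposition, cache):
    """Collect direct and indirect components of char.

    decomposition maps a character to its direct components (assumed layout);
    cache stores finished results so shared components are resolved once.
    """
    if char not in cache:
        found = set()
        for part in decomposition.get(char, ()):
            found.add(part)
            found |= get_components(part, decomposition, cache)
        cache[char] = found
    return cache[char]


decomposition = {'好': ['女', '子'], '孬': ['不', '好']}
cache = {}
components = get_components('孬', decomposition, cache)
```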

CharacterComponentLookupBuilder.getGenerator()
class cjklib.build.builder.CharacterDecompositionBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping between characters and their decomposition.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.CharacterDiacriticPinyinBuilder(**options)

Bases: cjklib.build.builder.CharacterReadingBuilder

Builds Pinyin mapping table using the Unihan database for syllables with diacritics.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
GENERATOR_CLASS
alias of ReadingSplitter
class ReadingSplitter(readingEntries, quiet=False)

Generates Pinyin syllables from Unihan entries in diacritic form.

Parameters:
  • readingEntries (list of tuple) – character reading entries from the Unihan database
  • quiet (bool) – if true no status information will be printed
convertTonemark(entity)

Converts the entity with diacritics into an entity with tone mark as appended number.

Parameter:entity (str) – entity with tonal information
Return type:tuple
Returns:plain entity without tone mark and entity’s tone index (starting with 1)
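
One way to picture the conversion is to decompose the syllable with Unicode normalization and strip the combining tone mark. This is a sketch under assumptions, not the library's implementation: the function name convert_tonemark, the mapping TONE_MARKS and the choice of 5 for a syllable without a tone mark are all introduced here for illustration.

```python
import unicodedata

# Combining marks used for the four Pinyin tones: macron, acute, caron, grave.
TONE_MARKS = {'\u0304': 1, '\u0301': 2, '\u030c': 3, '\u0300': 4}


def convert_tonemark(entity, neutral_tone=5):
    """Return (plain entity, tone index); neutral_tone is an assumed default."""
    decomposed = unicodedata.normalize('NFD', entity)
    tone = neutral_tone
    plain = []
    for char in decomposed:
        if char in TONE_MARKS:
            tone = TONE_MARKS[char]
        else:
            plain.append(char)
    # Recompose so that a remaining diaeresis (as in "ü") becomes one character.
    return unicodedata.normalize('NFC', ''.join(plain)), tone


result = convert_tonemark('hǎo')
```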
generator()
Provides one entry per reading entity and character.
class cjklib.build.builder.CharacterHDZReadingBuilder(**options)

Bases: cjklib.build.builder.CharacterDiacriticPinyinBuilder

Builds the Hanyu Da Zidian Pinyin mapping table using the Unihan database.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
class cjklib.build.builder.CharacterHangulBuilder(**options)

Bases: cjklib.build.builder.CharacterReadingBuilder

Builds the character Hangul mapping table from the Unihan database.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
class cjklib.build.builder.CharacterJapaneseKunBuilder(**options)

Bases: cjklib.build.builder.CharacterReadingBuilder

Builds the character Kun’yomi mapping table from the Unihan database.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
class cjklib.build.builder.CharacterJapaneseOnBuilder(**options)

Bases: cjklib.build.builder.CharacterReadingBuilder

Builds the character On’yomi mapping table from the Unihan database.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
class cjklib.build.builder.CharacterJapaneseRadicalBuilder(**options)

Bases: cjklib.build.builder.CharacterRadicalBuilder

Builds the character Japanese radical mapping table from the Unihan database.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
class cjklib.build.builder.CharacterJyutpingBuilder(**options)

Bases: cjklib.build.builder.CharacterReadingBuilder

Builds the character Jyutping mapping table from the Unihan database.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
class cjklib.build.builder.CharacterKanWaRadicalBuilder(**options)

Bases: cjklib.build.builder.CharacterRadicalBuilder

Builds the character Dai Kan-Wa jiten radical mapping table from the Unihan database.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
class cjklib.build.builder.CharacterKangxiRadicalBuilder(**options)

Bases: cjklib.build.builder.CharacterRadicalBuilder

Builds the character Kangxi radical mapping table from the Unihan database.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
class cjklib.build.builder.CharacterKoreanRadicalBuilder(**options)

Bases: cjklib.build.builder.CharacterRadicalBuilder

Builds the character Korean radical mapping table from the Unihan database.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
class cjklib.build.builder.CharacterPinyinAdditionalBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Provides a mapping of character to Pinyin with additional data not found in other sources.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
getGenerator()
class cjklib.build.builder.CharacterPinyinBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Builds the character Pinyin mapping table from several sources.

Field ‘kMandarin’ from Unihan is not used, see http://unicode.org/faq/han_cjk.html#19 and thread under http://www.unicode.org/mail-arch/unicode-ml/y2010-m01/0246.html.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
getGenerator()
class cjklib.build.builder.CharacterRadicalBuilder(**options)

Bases: cjklib.build.builder.UnihanDerivedBuilder

Provides an abstract class for building a character radical mapping table using the Unihan database.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
GENERATOR_CLASS
alias of RadicalExtractor
class RadicalExtractor(rsEntries, quiet=False)

Generates the radical to character mapping from the Unihan table.

Parameters:
  • rsEntries (list of tuple) – character radical entries from the Unihan database
  • quiet (bool) – if true no status information will be printed
generator()
Provides one entry per radical and character.
class cjklib.build.builder.CharacterRadicalStrokeCountBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Builds a mapping between characters and their radical with stroke count of residual components.

This class can be extended by inheriting CharacterRadicalStrokeCountBuilder.CharacterRadicalStrokeCountGenerator and overriding CharacterRadicalStrokeCountBuilder.CharacterRadicalStrokeCountGenerator.getFormRadicalIndex() to define which forms should be regarded as radicals, and CharacterRadicalStrokeCountBuilder.filterForms() to filter entries before creation.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class CharacterRadicalStrokeCountGenerator(dbConnectInst, characterSet, quiet=False)

Generates the character to radical/residual stroke count mapping.

Parameters:
  • dbConnectInst (instance) – instance of a DatabaseConnector
  • characterSet (set) – set of characters to generate the table for
  • quiet (bool) – if true no status information will be printed to stderr
filterForms(formSet)

Filters the set of given radical form entries to return only one single occurrence of a radical.

Parameter:formSet (set of dict) – radical/residual stroke count entries as generated by CharacterRadicalStrokeCountBuilder.CharacterRadicalStrokeCountGenerator.getEntries().
Return type:set of dict
Returns:subset of input

Todo

  • Lang: On multiple occurrences of same radical (may be in different forms): Which one to choose? Implement to turn down unwanted forms.
generator()
Provides the radical/stroke count entries.
getEntries(char, glyph, strokeCountDict, decompositionDict, entriesDict)

Gets all radical/residual stroke count combinations from the given decomposition.

Return type:list
Returns:all radical/residual stroke count combinations for the character
Raises ValueError:
 if IDS is malformed or ambiguous residual stroke count is calculated
getFormRadicalIndex(form)

Returns the Kangxi radical index for the given component.

Parameter:form (str) – component
Return type:int
Returns:radical index of the given radical form.
CharacterRadicalStrokeCountBuilder.getGenerator()
class cjklib.build.builder.CharacterReadingBuilder(**options)

Bases: cjklib.build.builder.UnihanDerivedBuilder

Provides an abstract class for building a character reading mapping table using the Unihan database.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
GENERATOR_CLASS
alias of SimpleReadingSplitter
class SimpleReadingSplitter(readingEntries, quiet=False)

Generates the reading entities from the Unihan table.

Parameters:
  • readingEntries (list of tuple) – character reading entries from the Unihan database
  • quiet (bool) – if true no status information will be printed
generator()
Provides one entry per reading entity and character.
class cjklib.build.builder.CharacterResidualStrokeCountBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Builds a mapping between characters and their residual stroke count when splitting off the radical form. This is a reduced form of the information gathered from table CharacterRadicalStrokeCount.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class ResidualStrokeCountExtractor(dbConnectInst, characterSet)

Generates the character to residual stroke count mapping from the CharacterRadicalResidualStrokeCount table.

Parameters:
  • dbConnectInst (instance) – instance of a DatabaseConnector
  • characterSet (set) – set of characters to generate the table for
generator()
Provides one entry per character, glyph and locale subset.
getEntries(char, glyph, radicalDict)

Gets a list of radical residual entries. For multiple radical occurrences (e.g. 伦) only returns the residual stroke count for the “main” radical form.

Parameters:
  • char (str) – Chinese character
  • glyph (int) – glyph of given character
Return type:

list of tuple

Returns:

list of residual stroke count entries

Todo

  • Lang: Implement, find a good algorithm to turn down unwanted forms, don’t just choose a random one. See the following list:

    >>> from cjklib import characterlookup
    >>> cjk = characterlookup.CharacterLookup('T')
    >>> for char in cjk.db.selectSoleValue('CharacterRadicalResidualStrokeCount',
    ...     'ChineseCharacter', distinctValues=True):
    ...     try:
    ...         entries = cjk.getCharacterKangxiRadicalResidualStrokeCount(char, 'C')
    ...         lastEntry = entries[0]
    ...         for entry in entries[1:]:
    ...             # print if diff. radical forms and diff. residual stroke count
    ...             if lastEntry[0] != entry[0] and lastEntry[2] != entry[2]:
    ...                 print char
    ...                 break
    ...             lastEntry = entry
    ...     except:
    ...         pass
    ...
    
    
    
    
    
    >>> cjk.getCharacterKangxiRadicalResidualStrokeCount(u'缧')
    [(u'糸', 0, u'⿻', 0, 8), (u'纟', 0, u'⿰', 0, 11)]
    
CharacterResidualStrokeCountBuilder.getGenerator()
class cjklib.build.builder.CharacterUnihanPinyinBuilder(**options)

Bases: cjklib.build.builder.CharacterReadingBuilder

Builds the character Pinyin mapping table from the Unihan database.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
class cjklib.build.builder.CharacterVariantBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Builds a character variant mapping table from the Unihan database. By default only chooses characters from the Basic Multilingual Plane (BMP) with code values between U+0000 and U+FFFF.

Windows versions of Python by default are narrow builds and don’t support characters outside the 16 bit range. MySQL < 6 doesn’t support true UTF-8 and uses a version with at most 3 bytes per character: http://dev.mysql.com/doc/refman/6.0/en/charset-unicode.html.
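
The two restrictions coincide: a code point lies in the BMP exactly when its UTF-8 encoding needs at most three bytes (surrogate code points aside, which cannot be encoded at all). A minimal sketch of such a filter follows, written for Python 3, where strings always hold full code points. The helper names and the wide_build argument are hypothetical and only mirror the wideBuild option in spirit.

```python
def in_bmp(char):
    """True if the character lies in the Basic Multilingual Plane."""
    return ord(char) <= 0xFFFF


def fits_three_byte_utf8(char):
    """True if the character's UTF-8 encoding needs at most 3 bytes."""
    return len(char.encode('utf-8')) <= 3


def filter_bmp(chars, wide_build=False):
    """Drop characters outside the BMP unless wide_build is set."""
    return [char for char in chars if wide_build or in_bmp(char)]


# U+4E2D is inside the BMP, U+20000 (CJK Extension B) is outside.
sample = ['中', '\U00020000']
kept = filter_bmp(sample)
```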

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • wideBuild – if True characters outside the BMP will be included.
COLUMN_SOURCE_ABBREV
Unihan table columns providing content for the table together with their abbreviation used in the target table.
class VariantGenerator(variantEntries, typeList, wideBuild=True, quiet=False)

Generates the character to variant mapping from the Unihan table.

Parameters:
  • variantEntries (list of tuple) – character variant entries from the Unihan database
  • typeList (list of str) – variant types in the order given in tableEntries
  • wideBuild (bool) – if True characters outside the BMP will be included.
  • quiet (bool) – if true no status information will be printed
VARIANT_REGEX_MAPPING
Mapping of entry types to regular expression describing the entry’s pattern.
generator()
Provides one entry per variant and character.
CharacterVariantBuilder.build()
classmethod CharacterVariantBuilder.getDefaultOptions()
CharacterVariantBuilder.getGenerator()
classmethod CharacterVariantBuilder.getOptionMetaData(option)
class cjklib.build.builder.CharacterVietnameseBuilder(**options)

Bases: cjklib.build.builder.CharacterReadingBuilder

Builds the character Vietnamese mapping table from the Unihan database.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
class cjklib.build.builder.CharacterXHCReadingBuilder(**options)

Bases: cjklib.build.builder.CharacterDiacriticPinyinBuilder

Builds the Xiandai Hanyu Cidian Pinyin mapping table using the Unihan database.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
class cjklib.build.builder.CharacterXHPCReadingBuilder(**options)

Bases: cjklib.build.builder.UnihanDerivedBuilder

Builds the Xiandai Hanyu Pinlu Cidian Pinyin mapping table using the Unihan database.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
GENERATOR_CLASS
alias of XHPCReadingSplitter
class XHPCReadingSplitter(readingEntries, quiet=False)

Generates the Xiandai Hanyu Pinlu Cidian Pinyin syllables from the Unihan table.

Parameters:
  • readingEntries (list of tuple) – character reading entries from the Unihan database
  • quiet (bool) – if true no status information will be printed
generator()
Provides one entry per reading entity and character.
class cjklib.build.builder.CombinedCharacterResidualStrokeCountBuilder(**options)

Bases: cjklib.build.builder.CharacterResidualStrokeCountBuilder

Builds a mapping between characters and their residual stroke count when splitting off the radical form. Includes stroke count data from the Unihan database to make up for missing data in own data files.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class CombinedResidualStrokeCountExtractor(tableEntries, preferredBuilder, quiet=False)

Generates the character to residual stroke count mapping.

Parameters:
  • tableEntries (list of list) – list of characters with glyph
  • preferredBuilder (instance) – TableBuilder whose forms are preferred over entries from the Unihan table
  • quiet (bool) – if true no status information will be printed
generator()
Provides one entry per character and glyph.
CombinedCharacterResidualStrokeCountBuilder.getGenerator()
class cjklib.build.builder.CombinedStrokeCountBuilder(**options)

Bases: cjklib.build.builder.StrokeCountBuilder

Builds a mapping between characters and their stroke count. Includes stroke count data from the Unihan database to make up for missing data in own data files.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class CombinedStrokeCountGenerator(dbConnectInst, characterSet, tableEntries, preferredBuilder, quiet=False)

Generates the character stroke count mapping.

Parameters:
  • dbConnectInst (instance) – instance of a DatabaseConnector.
  • characterSet (set) – set of characters to generate the table for
  • tableEntries (list of list) – list of characters with glyph
  • preferredBuilder (instance) – TableBuilder whose forms are preferred over entries from the Unihan table
  • quiet (bool) – if true no status information will be printed to stderr
checkAgainstUnihan(strokeCountDict, tableEntries)
Checks forms in strokeCountDict to match unihanStrokeCountDict for one entry per character.
generator()
Provides one entry per character, glyph and locale subset.
getStrokeCount(char, glyph, strokeCountDict, unihanStrokeCountDict, decompositionDict)

Gets the stroke count of the given character by summing up the stroke count of its components and using the Unihan table as fallback.

For the sake of consistency this method doesn’t take the stroke count given by Unihan directly but sums up the stroke counts of the components, so that the sum of the components’ stroke counts always gives the character’s stroke count. In many cases the result will be even more precise than the value given in Unihan, which does not depend on the actual glyph form.

Once calculated the stroke count will be cached in the given strokeCountDict object.

Parameters:
  • char (str) – Chinese character
  • glyph (int) – glyph of character
Return type:

int

Returns:

stroke count

Raises ValueError:
 if stroke count is ambiguous due to inconsistent values wrt Unihan vs. own data.
Raises NoInformationError:
 if decomposition is incomplete
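
The summing and caching behaviour described for getStrokeCount() can be sketched as follows. This is a simplified illustration, not the method itself: it assumes one decomposition per character, ignores glyph indices, omits the ambiguity check that raises ValueError, and defines a local NoInformationError as a stand-in for the library's exception.

```python
class NoInformationError(Exception):
    """Local stand-in for the library's exception of the same name."""


def get_stroke_count(char, stroke_counts, unihan_counts, decomposition):
    """Sum component stroke counts, falling back to Unihan-style data."""
    if char in stroke_counts:
        # Either given in the base data or cached by an earlier call.
        return stroke_counts[char]
    if char in decomposition:
        total = sum(
            get_stroke_count(part, stroke_counts, unihan_counts, decomposition)
            for part in decomposition[char])
    elif char in unihan_counts:
        total = unihan_counts[char]
    else:
        raise NoInformationError(char)
    stroke_counts[char] = total  # cache the result in the given dict
    return total


stroke_counts = {'女': 3, '子': 3}
unihan_counts = {'好': 6, '一': 1}
decomposition = {'好': ['女', '子']}
total = get_stroke_count('好', stroke_counts, unihan_counts, decomposition)
```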

CombinedStrokeCountBuilder.getGenerator()
class cjklib.build.builder.EDICTBuilder(**options)

Bases: cjklib.build.builder.EDICTFormatBuilder

Builds the EDICT dictionary.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • enableFTS3 – if True SQLite full text search (FTS3) will be supported, if the extension exists.
  • filePath – file path including file name, overrides dataPath
  • fileType – type of file (.zip, .tar, .tar.bz2, .tar.gz, .gz, .txt), overrides file type guessing
class cjklib.build.builder.EDICTFormatBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Provides an abstract class for loading EDICT formatted dictionaries.

One column will be provided for the headword, one for the reading (in EDICT that is the Kana) and one for the translation.

Todo

  • Fix: Optimize insert, use a transaction which disables autocommit and consider passing data all at once, requiring proper handling of row indices.
Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • enableFTS3 – if True SQLite full text search (FTS3) will be supported, if the extension exists.
  • filePath – file path including file name, overrides dataPath
  • fileType – type of file (.zip, .tar, .tar.bz2, .tar.gz, .gz, .txt), overrides file type guessing
ENCODING
Encoding of the dictionary file.
ENTRY_REGEX
Regular expression matching a dictionary entry. Needs to be overridden if the dictionary does not strictly follow the EDICT format.
FILE_NAMES
Names of files containing the EDICT formatted dictionary.
FILTER
Filter to apply to the read entry before writing to table.
FULLTEXT_COLUMNS
Column names which shall be fulltext searchable.
IGNORE_LINES
Number of starting lines to ignore.
class TableGenerator(fileHandle, quiet=False, entryRegex=None, columns=None, filterFunc=None)

Generates the dictionary entries.

Parameters:
  • fileHandle (file) – handle of file to read from
  • quiet (bool) – if true no status information will be printed
  • entryRegex (instance) – regular expression object for entry pattern
  • columns (list of str) – column names of generated data
  • filterFunc (function) – function used to filter entry content
generator()
Provides the dictionary entries.
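
How the constructor arguments might work together can be sketched with a plain generator function. Everything here is an assumption for illustration: the function generate_entries, the pattern EDICT_LINE, the column names and the filter kana_fallback are not taken from the library, and the quiet flag is left out.

```python
import io
import re


def generate_entries(file_handle, entry_regex, columns, filter_func=None):
    """Yield one dict per line matching entry_regex, optionally filtered."""
    for line in file_handle:
        match = entry_regex.match(line.rstrip('\n'))
        if not match:
            continue  # skip lines that are not dictionary entries
        entry = dict(zip(columns, match.groups()))
        if filter_func:
            entry = filter_func(entry)
        yield entry


# Assumed EDICT-like pattern: headword, optional [reading], /translation/
EDICT_LINE = re.compile(r'^(\S+)\s*(?:\[([^\]]*)\]\s*)?(/.*/)\s*$')
COLUMNS = ['Headword', 'Reading', 'Translation']


def kana_fallback(entry):
    """Hypothetical filter: kana-only headwords have no bracketed reading."""
    if entry['Reading'] is None:
        entry['Reading'] = entry['Headword']
    return entry


lines = io.StringIO('食べる [たべる] /(v1,vt) to eat/\nひらがな /hiragana/\n')
entries = list(generate_entries(lines, EDICT_LINE, COLUMNS, kana_fallback))
```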
EDICTFormatBuilder.build()

Build the table provided by the TableBuilder.

A search index is created to allow for fulltext searching.

EDICTFormatBuilder.buildFTS3CreateTableStatement(table)

Returns a SQL statement for creating a virtual table using FTS3 for SQLite.

Parameter:table (object) – SQLAlchemy table object representing the FTS3 table
Return type:str
Returns:Create table statement
EDICTFormatBuilder.buildFTS3Tables(tableName, columns, columnTypeMap=None, primaryKeys=None, fullTextColumns=None)

Builds a FTS3 table construct for supporting full text search under SQLite.

Parameters:
  • tableName (str) – name of table
  • columns (list of str) – column names
  • columnTypeMap (dict of str and object) – mapping of column name to SQLAlchemy Column
  • primaryKeys (list of str) – list of primary key columns
  • fullTextColumns (list of str) – list of fulltext columns
EDICTFormatBuilder.getArchiveContentName(nameList, filePath)

Extracts the name of the contained file from a zipped or tarred archive, based on the archive’s file name. Reimplement and adapt to your own needs.

Parameters:
  • nameList (list of str) – list of archive contents
  • filePath (str) – path of file
Return type:

str

Returns:

name of file in archive

classmethod EDICTFormatBuilder.getDefaultOptions()
EDICTFormatBuilder.getFileHandle(filePath)

Returns a handle to the given file.

The file can be plain content or a zip, tar, tar.gz, tar.bz2 or gz archive.

Parameter:filePath (str) – path of file
Return type:file
Returns:handle to file’s content
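
A minimal sketch of opening such files with the standard library is shown here. It detects the container by content rather than by the fileType option, and the name open_dictionary_file and the "first regular member" rule are assumptions for this example, not the behaviour of getFileHandle() or getArchiveContentName().

```python
import gzip
import io
import os
import tarfile
import tempfile
import zipfile


def open_dictionary_file(file_path, member=None):
    """Return a binary file-like object for plain, gz, zip or tar(.gz/.bz2) files."""
    if zipfile.is_zipfile(file_path):
        with zipfile.ZipFile(file_path) as archive:
            name = member or archive.namelist()[0]
            return io.BytesIO(archive.read(name))
    if tarfile.is_tarfile(file_path):
        # tarfile.open() detects gzip and bzip2 compression transparently.
        with tarfile.open(file_path) as archive:
            regular = [m for m in archive.getmembers() if m.isfile()]
            name = member or regular[0].name
            return io.BytesIO(archive.extractfile(name).read())
    if file_path.endswith(".gz"):
        return gzip.open(file_path, "rb")
    return open(file_path, "rb")


# Write the same payload in four containers and read each one back.
with tempfile.TemporaryDirectory() as directory:
    payload = "語 [ご] /language/\n".encode("utf-8")
    plain_path = os.path.join(directory, "dict.txt")
    gz_path = os.path.join(directory, "dict.txt.gz")
    zip_path = os.path.join(directory, "dict.zip")
    tar_path = os.path.join(directory, "dict.tar.gz")
    with open(plain_path, "wb") as handle:
        handle.write(payload)
    with gzip.open(gz_path, "wb") as handle:
        handle.write(payload)
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.write(plain_path, "dict.txt")
    with tarfile.open(tar_path, "w:gz") as archive:
        archive.add(plain_path, arcname="dict.txt")

    results = {}
    for path in (plain_path, gz_path, zip_path, tar_path):
        with open_dictionary_file(path) as handle:
            results[os.path.basename(path)] = handle.read()
```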
EDICTFormatBuilder.getGenerator()
classmethod EDICTFormatBuilder.getOptionMetaData(option)
EDICTFormatBuilder.insertFTS3Tables(tableName, generator, columns=None, fullTextColumns=None)
EDICTFormatBuilder.remove()
EDICTFormatBuilder.testFTS3()

Tests if the SQLite FTS3 extension is supported on the build system.

Return type:bool
Returns:True if the FTS3 extension exists, False otherwise.
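
One way such a check might be written with the standard sqlite3 module is sketched below: it tries to create an FTS3 virtual table in a throwaway in-memory database. The function name supports_fts3 and the probe table name are assumptions for this example; the library's own test may work differently.

```python
import sqlite3


def supports_fts3():
    """Return True if this SQLite build can create an FTS3 virtual table."""
    connection = sqlite3.connect(":memory:")
    try:
        connection.execute("CREATE VIRTUAL TABLE fts3_probe USING fts3(content)")
        return True
    except sqlite3.OperationalError:
        # Raised when the fts3 module is not compiled into SQLite.
        return False
    finally:
        connection.close()


fts3_available = supports_fts3()
```

A builder could combine the result with the enableFTS3 option and fall back to an ordinary table when the extension is missing.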
class cjklib.build.builder.EDICTWordIndexBuilder(**options)

Bases: cjklib.build.builder.WordIndexBuilder

Builds the word index of the EDICT dictionary.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.EntryGeneratorBuilder(**options)

Bases: cjklib.build.builder.TableBuilder

Implements an abstract class for building a table from a generator providing entries.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
COLUMNS
Columns that will be built
COLUMN_TYPES
Column types for created table
INDEX_KEYS
Index keys (not unique) of the created table
PRIMARY_KEYS
Primary keys of the created table
build()
getEntryDict(generator)
getGenerator()
Returns the entry generator. Needs to be implemented by child classes.
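
The relationship between COLUMNS, COLUMN_TYPES, PRIMARY_KEYS, INDEX_KEYS and getGenerator() can be sketched as follows. This is a simplified stand-in that uses the standard sqlite3 module instead of SQLAlchemy and a DatabaseConnector; the class names SketchEntryBuilder and ThreeCharacterBuilder, the table name and the rows are all hypothetical.

```python
import sqlite3


class SketchEntryBuilder:
    """Hypothetical stand-in for an entry-generator based table builder."""
    PROVIDES = "StrokeCountSketch"
    COLUMNS = ["ChineseCharacter", "StrokeCount"]
    COLUMN_TYPES = {"ChineseCharacter": "TEXT", "StrokeCount": "INTEGER"}
    PRIMARY_KEYS = ["ChineseCharacter"]
    INDEX_KEYS = [["StrokeCount"]]

    def __init__(self, connection):
        self.connection = connection

    def getGenerator(self):
        """Child classes return an iterable of row tuples here."""
        raise NotImplementedError

    def build(self):
        column_sql = ", ".join(
            "%s %s" % (name, self.COLUMN_TYPES[name]) for name in self.COLUMNS)
        key_sql = ", ".join(self.PRIMARY_KEYS)
        placeholders = ", ".join("?" for _ in self.COLUMNS)
        with self.connection:  # commit once at the end, roll back on error
            self.connection.execute(
                "CREATE TABLE %s (%s, PRIMARY KEY (%s))"
                % (self.PROVIDES, column_sql, key_sql))
            self.connection.executemany(
                "INSERT INTO %s VALUES (%s)" % (self.PROVIDES, placeholders),
                self.getGenerator())
            for number, keys in enumerate(self.INDEX_KEYS):
                self.connection.execute(
                    "CREATE INDEX %s_idx%d ON %s (%s)"
                    % (self.PROVIDES, number, self.PROVIDES, ", ".join(keys)))


class ThreeCharacterBuilder(SketchEntryBuilder):
    def getGenerator(self):
        return iter([("一", 1), ("二", 2), ("三", 3)])


connection = sqlite3.connect(":memory:")
ThreeCharacterBuilder(connection).build()
rows = connection.execute(
    "SELECT ChineseCharacter, StrokeCount FROM StrokeCountSketch "
    "ORDER BY StrokeCount").fetchall()
```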
class cjklib.build.builder.GB2312SetBuilder(**options)

Bases: cjklib.build.builder.UnihanCharacterSetBuilder

Builds a simple list of all characters in the Chinese standard GB2312-80.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.GRAbbreviationBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a list of Gwoyeu Romatzyh abbreviated spellings.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.GRRhotacisedFinalsBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a list of Gwoyeu Romatzyh rhotacised finals.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.GRSyllablesBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a list of Gwoyeu Romatzyh syllables.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.GlyphBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Builds a list of glyph indices for characters.

Todo

  • Impl: Check if all glyphs in LocaleCharacterGlyph are included.
Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
getGenerator()
class cjklib.build.builder.GlyphInformationSetBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Virtual table (view) holding all characters with information about their glyph, i.e. stroke order or character decomposition.

Todo

  • Impl: For implementation as view, we need the concept of runtime dependency. All DEPENDS are actually BUILD_DEPENDS, while the DEPENDS here will be a runtime dependency.
Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
getGenerator()
class cjklib.build.builder.HKSCSSetBuilder(**options)

Bases: cjklib.build.builder.UnihanCharacterSetBuilder

Builds a simple list of all supplementary characters in the Hong Kong standard HKSCS.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.HanDeDictBuilder(**options)

Bases: cjklib.build.builder.TimestampedCEDICTFormatBuilder

Builds the HanDeDict dictionary.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • enableFTS3 – if True SQLite full text search (FTS3) will be supported, if the extension exists.
  • filePath – file path including file name, overrides dataPath
  • fileType – type of file (.zip, .tar, .tar.bz2, .tar.gz, .gz, .txt), overrides file type guessing
class cjklib.build.builder.HanDeDictWordIndexBuilder(**options)

Bases: cjklib.build.builder.WordIndexBuilder

Builds the word index of the HanDeDict dictionary.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.IICoreSetBuilder(**options)

Bases: cjklib.build.builder.UnihanCharacterSetBuilder

Builds a simple list of all characters in IICore (Unicode International Ideograph Core).

see Chinese Wikipedia on IICore: http://zh.wikipedia.org/wiki/國際表意文字核心

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.JISX0208SetBuilder(**options)

Bases: cjklib.build.builder.UnihanCharacterSetBuilder

Builds a simple list of all characters in the Japanese standard JIS X 0208.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr

New in version 0.3.

class cjklib.build.builder.JISX0208_0213SetBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Builds a simple list of all characters in the standards JIS X 0208 and JIS X 0213.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
build()
getGenerator()

New in version 0.3.

class cjklib.build.builder.JISX0213SetBuilder(**options)

Bases: cjklib.build.builder.UnihanCharacterSetBuilder

Builds a simple list of all supplementary characters in the Japanese standard JIS X 0213.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr

New in version 0.3.

class cjklib.build.builder.JyutpingIPAMappingBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping between syllables in Jyutping and their representation in IPA.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.JyutpingInitialFinalBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping from Jyutping syllables to their initial/final parts.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.JyutpingSyllablesBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a list of Jyutping syllables.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.JyutpingYaleMappingBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping between syllables in Jyutping and the Yale romanization system.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.KangxiRadicalBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping between Kangxi radical index and radical characters.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.KangxiRadicalIsolatedCharacterBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping between Kangxi radical index and radical equivalent characters without radical form.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.Kanjidic2Builder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Builds the Kanjidic database from the Kanjidic2 XML file http://www.csse.monash.edu.au/~jwb/kanjidic2/.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • wideBuild – if True characters outside the BMP will be included.
CHARACTER_COLUMN
Name of column for Chinese character key.
KANJIDIC_TAG_MAPPING
Dictionary of tag keys mapping to a table column including a function generating a string out of a list of entries given from the KANJIDIC entry. The tag keys consist of a tuple giving the XML element hierarchy below the ‘character’ element and a set of attribute value pairs.
class KanjidicGenerator(dataPath, tagDict, wideBuild=False)

Generates the KANJIDIC table.

Parameters:
  • dataPath (list of str) – optional list of paths to the data file(s)
  • tagDict (dict) – a dictionary mapping xml tag paths and attributes to a Column and a conversion function
  • wideBuild (bool) – if True characters outside the BMP will be included.
generator()
Provides the entries of the KANJIDIC table.
getHandle()

Returns a handle of the KANJIDIC database file.

Return type:file
Returns:file handle of the KANJIDIC file
class Kanjidic2Builder.XMLHandler(entryList, tagDict)

Bases: xml.sax.handler.ContentHandler

Extracts a list of given tags.

characters(content)
endElement(name)
startElement(name, attrs)
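
The three callbacks above follow the usual SAX pattern. The sketch below shows that pattern with a much simpler tag selection than the tagDict described earlier: it only collects the text of named tags inside each ‘character’ element. The class name TagCollector and the tiny XML sample are made up for this example and do not reproduce the real Kanjidic2 structure.

```python
import xml.sax
import xml.sax.handler


class TagCollector(xml.sax.handler.ContentHandler):
    """Collect the text of selected tags for every <character> element."""

    def __init__(self, entry_list, tags):
        super().__init__()
        self.entry_list = entry_list
        self.tags = set(tags)
        self.current_entry = None
        self.current_tag = None
        self.buffer = []

    def startElement(self, name, attrs):
        if name == "character":
            self.current_entry = {}
        elif self.current_entry is not None and name in self.tags:
            self.current_tag = name
            self.buffer = []

    def characters(self, content):
        if self.current_tag is not None:
            self.buffer.append(content)  # text may arrive in several chunks

    def endElement(self, name):
        if name == "character":
            self.entry_list.append(self.current_entry)
            self.current_entry = None
        elif name == self.current_tag:
            values = self.current_entry.setdefault(name, [])
            values.append("".join(self.buffer))
            self.current_tag = None


# Simplified, made-up document; the real file nests its elements differently.
document = (
    "<kanjidic2><character><literal>日</literal>"
    "<misc><stroke_count>4</stroke_count></misc>"
    "<meaning>day</meaning><meaning>sun</meaning>"
    "</character></kanjidic2>"
)
entries = []
handler = TagCollector(entries, ["literal", "stroke_count", "meaning"])
xml.sax.parseString(document.encode("utf-8"), handler)
```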
classmethod Kanjidic2Builder.getDefaultOptions()
Kanjidic2Builder.getGenerator()

Returns the Kanjidic2Builder.KanjidicGenerator.

Return type:instance
Returns:instance of a Kanjidic2Builder.KanjidicGenerator
classmethod Kanjidic2Builder.getOptionMetaData(option)
class cjklib.build.builder.LocaleCharacterGlyphBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping between a character under a locale and its default glyph.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.MandarinBraileFinalBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping of Mandarin Chinese syllable finals in Pinyin to Braille characters.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.MandarinBraileInitialBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping of Mandarin Chinese syllable initials in Pinyin to Braille characters.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.MandarinIPAInitialFinalBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping from Mandarin syllables in IPA to their initial/final parts.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.PinyinGRMappingBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping between syllables in Pinyin and Gwoyeu Romatzyh.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.PinyinIPAMappingBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping between syllables in Pinyin and their representation in IPA.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.PinyinInitialFinalBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping from Pinyin syllables to their initial/final parts.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.PinyinSyllablesBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a list of Pinyin syllables.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.RadicalEquivalentCharacterBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping between Unicode radical forms and Unicode radical variants on one side and equivalent characters on the other side.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.SimpleWenlinFormatBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Provides a builder for loading dictionaries following the Wenlin format:

*** 928171 ***
pinyin                          gǎngbì
characters                      港币[-幣]
serial-number                   1929047
definition                      Hong Kong dollar
Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • filePath – file path including file name, overrides dataPath
  • fileType – type of file (.zip, .gz, .txt), overrides file type guessing
ENCODING
Encoding of the dictionary file.
FILE_NAMES
Names of the files containing the edict formatted dictionary.
FILTER
Filter to apply to the read entry before writing to table.
class TableGenerator(fileHandle, quiet=False, columnMap=None, filterFunc=None)

Generates the dictionary entries.

Parameters:
  • fileHandle (file) – handle of file to read from
  • quiet (bool) – if true no status information will be printed
  • columnMap (dict) – dictionary mapping keys onto table columns
  • filterFunc (function) – function used to filter entry content
generator()
Provides the dictionary entries.
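
A minimal sketch of reading the block format shown in the class description, assuming that every entry starts with a ‘*** number ***’ line and that each following line holds a key and a value separated by whitespace. The function name parse_wenlin and the column names in column_map are assumptions for this example.

```python
def parse_wenlin(lines, column_map):
    """Yield one dict per '*** n ***' block, keyed by mapped column names."""
    entry = None
    for line in lines:
        line = line.strip()
        if line.startswith("***") and line.endswith("***"):
            if entry:
                yield entry  # the previous block is complete
            entry = {}
        elif entry is not None and line:
            parts = line.split(None, 1)  # key, then the rest of the line
            if len(parts) == 2 and parts[0] in column_map:
                entry[column_map[parts[0]]] = parts[1]
    if entry:
        yield entry


sample = """*** 928171 ***
pinyin                          gǎngbì
characters                      港币[-幣]
serial-number                   1929047
definition                      Hong Kong dollar
""".splitlines()

# Hypothetical mapping of file keys onto table columns.
column_map = {
    "pinyin": "Reading",
    "characters": "Headword",
    "definition": "Translation",
}
entries = list(parse_wenlin(sample, column_map))
```

Keys that are missing from column_map, such as serial-number here, are dropped.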
SimpleWenlinFormatBuilder.filterCharacters(entry)

Generates a traditional and simplified form from ‘characters’.

Parameter:entry (tuple) – a dictionary entry
Return type:tuple
Returns:the given entry with corrected ü-vowel
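
How a traditional and a simplified form might be derived from a ‘characters’ value such as 港币[-幣] is sketched below. The sketch assumes that the text before the brackets is the simplified form and that the bracketed part lists the traditional form position by position, with ‘-’ standing for an unchanged character. This convention is inferred from the sample entry only, and split_forms is a hypothetical name.

```python
import re

# Simplified form, optionally followed by a bracketed traditional variant.
FORMS_REGEX = re.compile(r"^([^\[\]]+)(?:\[([^\]]+)\])?$")


def split_forms(characters):
    """Return (traditional, simplified) for a Wenlin-style 'characters' value."""
    match = FORMS_REGEX.match(characters)
    if match is None:
        raise ValueError("unexpected characters value: %r" % characters)
    simplified, variant = match.groups()
    if variant is None:
        return simplified, simplified  # both forms are identical
    if len(variant) != len(simplified):
        raise ValueError("length mismatch in %r" % characters)
    traditional = "".join(
        simple if other == "-" else other
        for simple, other in zip(simplified, variant))
    return traditional, simplified


forms = split_forms("港币[-幣]")
```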
SimpleWenlinFormatBuilder.filterSequence(entry)
classmethod SimpleWenlinFormatBuilder.getDefaultOptions()
SimpleWenlinFormatBuilder.getFileHandle(filePath)

Returns a handle to the given file.

The file can be plain content or a zip or gz archive.

Parameter:filePath (str) – path of file
Return type:file
Returns:handle to file’s content
SimpleWenlinFormatBuilder.getGenerator()
classmethod SimpleWenlinFormatBuilder.getOptionMetaData(option)
class cjklib.build.builder.StrokeCountBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Builds a mapping between characters and their stroke count.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class StrokeCountGenerator(dbConnectInst, quiet=False)

Generates the character stroke count mapping.

Parameters:
  • dbConnectInst (instance) – instance of a DatabaseConnector.
  • quiet (bool) – if true no status information will be printed to stderr
generator()
Provides one entry per character, glyph and locale subset.
StrokeCountBuilder.getGenerator()
class cjklib.build.builder.StrokeOrderBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping between characters and their stroke order.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.StrokesBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a list of strokes and their names.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.TableBuilder(**options)

Bases: object

TableBuilder provides the abstract layout for classes that build a distinct table.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
DEPENDS
Contains the names of the tables needed for the build process.
PROVIDES
Contains the name of the table provided by this module.
build()

Build the table provided by the TableBuilder.

Methods should raise an IOError if reading a data source fails. The DatabaseBuilder knows how to handle this case and is able to proceed.

buildIndexObjects(tableName, indexKeyList)

Returns SQLAlchemy Index objects for the given key combinations.

Parameters:
  • tableName (str) – name of table
  • indexKeyList (list of list of str) – a list of key combinations
Return type:

object

Returns:

SQLAlchemy Index

buildTableObject(tableName, columns, columnTypeMap=None, primaryKeys=None)

Returns a SQLAlchemy Table object.

Parameters:
  • tableName (str) – name of table
  • columns (list of str) – column names
  • columnTypeMap (dict of str and object) – mapping of column name to SQLAlchemy Column
  • primaryKeys (list of str) – list of primary key columns
findFile(fileNames, fileType=None)

Tries to locate a file with a given list of possible file names under the class’s default data paths.

For each file name every given path is checked and the first match is returned.

Parameters:
  • fileNames (str/list of str) – possible file names
  • fileType (str) – textual type of file used in error msg
Return type:

str

Returns:

path to file of first match in search for existing file

Raises IOError:

if no file found
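
The search order described above (for each file name, every path is checked, and the first match wins) can be sketched like this. The standalone function find_file and its data_paths argument are assumptions for this example; the method itself uses the builder's configured data paths.

```python
import os
import tempfile


def find_file(file_names, data_paths, file_type=None):
    """Return the first existing path, trying every data path per file name."""
    if isinstance(file_names, str):
        file_names = [file_names]
    for file_name in file_names:
        for path in data_paths:
            candidate = os.path.join(os.path.expanduser(path), file_name)
            if os.path.exists(candidate):
                return candidate
    raise IOError("No %s found for %r under %r"
                  % (file_type or "file", file_names, data_paths))


with tempfile.TemporaryDirectory() as first, \
        tempfile.TemporaryDirectory() as second:
    target = os.path.join(second, "edict.gz")
    open(target, "wb").close()  # create an empty placeholder file

    found = find_file(["edict", "edict.gz"], [first, second], "EDICT")
    try:
        find_file("missing.txt", [first, second])
        missing_raises = False
    except IOError:
        missing_raises = True
```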

classmethod getDefaultOptions()

Returns the table builder’s default options.

The base class’ implementation returns an empty dictionary. The keyword ‘dbConnectInst’ is not regarded as a configuration option of the builder and is thus not included in the returned dict.

Return type:dict
Returns:the table builder’s default options.
classmethod getOptionMetaData(option)

Gets metadata on a given option.

Keys can come from the subset of:
  • type: string, int, bool, ...
  • action: action as used by optparse, extended by appendResetDefault
  • choices: allowed values
  • description: short description of option

Return type:dict
Returns:dictionary of metadata
remove()
Removes the table provided by the TableBuilder from the database.
class cjklib.build.builder.TimestampedCEDICTFormatBuilder(**options)

Bases: cjklib.build.builder.CEDICTFormatBuilder

Shared functionality for dictionaries whose file names include a timestamp.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • enableFTS3 – if True SQLite full text search (FTS3) will be supported, if the extension exists.
  • filePath – file path including file name, overrides dataPath
  • fileType – type of file (.zip, .tar, .tar.bz2, .tar.gz, .gz, .txt), overrides file type guessing
ARCHIVE_CONTENT_PATTERN
Regular expression specifying file in archive.
EXTRACT_TIMESTAMP
Regular expression to extract the timestamp from the file name.
extractTimeStamp(filePath)
findFile(fileGlobs, fileType=None)

Tries to locate a file with a given list of possible file names under the class’s default data paths.

Uses the newest version of all files found.

Parameters:
  • fileGlobs (str/list of str) – possible file names
  • fileType (str) – textual type of file used in error msg
Return type:

str

Returns:

path to file of first match in search for existing file

Raises IOError:

if no file found

getArchiveContentName(nameList, filePath)
getPreferredFile(filePaths)
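
Choosing the newest of several timestamped files could look like the sketch below. The pattern assumes file names of the form handedict-YYYYMMDD followed by an extension; both the pattern and the function names are hypothetical and not taken from the library.

```python
import re

# Hypothetical pattern; a date written as YYYYMMDD sorts chronologically
# when compared as a string.
EXTRACT_TIMESTAMP = re.compile(r"handedict-(\d{8})\.")


def extract_timestamp(file_path):
    """Return the YYYYMMDD part of the file name, or None if there is none."""
    match = EXTRACT_TIMESTAMP.search(file_path)
    return match.group(1) if match else None


def get_preferred_file(file_paths):
    """Return the path carrying the newest timestamp."""
    dated = [(extract_timestamp(path), path) for path in file_paths]
    dated = [(stamp, path) for stamp, path in dated if stamp is not None]
    if not dated:
        raise IOError("no timestamped file among %r" % (file_paths,))
    return max(dated)[1]


candidates = [
    "data/handedict-20090101.tar.bz2",
    "data/handedict-20100314.tar.bz2",
    "data/readme.txt",
]
newest = get_preferred_file(candidates)
```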
class cjklib.build.builder.UnicodeCSVFileIterator(fileHandle)

Bases: object

Provides a CSV file iterator supporting Unicode.

class DefaultDialect

Bases: csv.Dialect

Defines a default dialect for the case that sniffing fails.

static UnicodeCSVFileIterator.byte_string_dialect(dialect)
UnicodeCSVFileIterator.next()
static UnicodeCSVFileIterator.utf_8_encoder(unicode_csv_data)
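
The helper methods above reflect Python 2, where the csv module needed byte strings. Under Python 3 the csv module reads Unicode text directly, so only the sniffing with a fallback dialect remains to be shown. The sketch below assumes Python 3; the attribute values of DefaultDialect and the function iter_rows are assumptions for this example, not the class's actual settings.

```python
import csv
import io


class DefaultDialect(csv.Dialect):
    """Fallback dialect used when sniffing fails (values are assumptions)."""
    delimiter = ","
    quotechar = '"'
    doublequote = True
    skipinitialspace = True
    lineterminator = "\n"
    quoting = csv.QUOTE_MINIMAL


def iter_rows(text):
    """Return a csv reader for text, sniffing the dialect where possible."""
    try:
        dialect = csv.Sniffer().sniff(text[:1024], delimiters=",;\t")
    except csv.Error:
        dialect = DefaultDialect
    return csv.reader(io.StringIO(text), dialect)


# Made-up sample rows with non-ASCII content.
rows = list(iter_rows('"ā","a","1"\n"á","a","2"\n'))
fallback_rows = list(csv.reader(io.StringIO("x, y\n"), DefaultDialect))
```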
class cjklib.build.builder.UnihanBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Builds the Unihan database from the Unihan file provided by Unicode. By default only chooses characters from the Basic Multilingual Plane (BMP) with code values between U+0000 and U+FFFF.

Windows versions of Python by default are narrow builds and don’t support characters outside the 16 bit range. MySQL < 6 doesn’t support true UTF-8 and uses a version limited to at most 3 bytes per character: http://dev.mysql.com/doc/refman/6.0/en/charset-unicode.html.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • wideBuild – if True characters outside the BMP will be included.
  • slimUnihanTable – if True a limited set of columns specified by INCLUDE_KEYS will be supported.
CHARACTER_COLUMN
Name of column for Chinese character key.
class EntryGenerator(unihanGenerator)

Generates the entries of the Unihan table.

Parameter:unihanGenerator (instance) – a UnihanGenerator instance
generator()
Provides all data of one character per entry.
UnihanBuilder.INCLUDE_KEYS
Keys included in a slim version if explicitly specified.
UnihanBuilder.build()
classmethod UnihanBuilder.getDefaultOptions()
UnihanBuilder.getGenerator()
classmethod UnihanBuilder.getOptionMetaData(option)
UnihanBuilder.getUnihanGenerator()

Returns the UnihanGenerator. Constructs it if needed.

Return type:instance
Returns:instance of a UnihanGenerator
class cjklib.build.builder.UnihanCharacterSetBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Builds a simple list of characters that belong to a specific class using the Unihan data.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
build()
getGenerator()
class cjklib.build.builder.UnihanDerivedBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Provides an abstract class for building a table with a relation between a Chinese character and another column using the Unihan database.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
COLUMN_SOURCE
Unihan table column providing content for the table. Needs to be overwritten in subclass.
COLUMN_TARGETS
Column names for new data in created table. Needs to be overwritten in subclass.
COLUMN_TARGETS_TYPES
Types of column for new data in created table.
GENERATOR_CLASS
Class defining the iterator for creating the table’s data. The constructor needs to take two parameters for the list of entries from the Unihan database and the ‘quiet’ flag. Needs to be overwritten in subclass.
build()
classmethod getDefaultOptions()
getGenerator()
classmethod getOptionMetaData(option)
class cjklib.build.builder.UnihanGenerator(fileNames, useKeys=None, wideBuild=True, quiet=False)

Regular expression matching one entry in the Unihan database (e.g. U+8682  kMandarin       MA3 MA1 MA4).

Parameters:
  • fileNames (list of str) – paths to the Unihan database files
  • useKeys (list) – if given only these keys will be read from the table, otherwise all keys will be returned
  • wideBuild (bool) – if True characters outside the BMP will be included.
  • quiet (bool) – if true no status information will be printed to stderr
generator()

Iterates over the Unihan entries.

The character definition is converted to the character’s representation, all other data is given as is. These are merged into one entry for each character.

getHandles()

Returns a list of handles of the Unihan database files.

Return type:dict
Returns:dictionary of names and handles of the Unihan files
keySet
Set of keys of the Unihan table.
keys()

Returns all keys read for the Unihan table.

If the whole table is read a seek through the file is needed first to find all keys, otherwise the predefined set is returned.

Return type:list of str
Returns:list of column names
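
The reading and merging of Unihan lines can be sketched as follows. The sketch assumes tab or space separated lines of the form shown in the class description, assumes that all lines of one character are adjacent, and targets Python 3. The regular expression, the function name unihan_entries and all sample values except the kMandarin line are illustrative assumptions.

```python
import re

# Hypothetical pattern for lines such as "U+8682  kMandarin  MA3 MA1 MA4".
ENTRY_REGEX = re.compile(r"U\+([0-9A-F]{4,6})\s+(\w+)\s+(.+)$")


def unihan_entries(lines, use_keys=None, wide_build=True):
    """Yield (character, {key: value}) pairs, one pair per character."""
    current_char, current_data = None, {}
    for line in lines:
        match = ENTRY_REGEX.match(line.strip())
        if match is None:
            continue  # comments and blank lines
        code_point = int(match.group(1), 16)
        if not wide_build and code_point > 0xFFFF:
            continue  # outside the Basic Multilingual Plane
        key, value = match.group(2), match.group(3)
        if use_keys is not None and key not in use_keys:
            continue
        character = chr(code_point)
        if character != current_char:
            if current_char is not None:
                yield current_char, current_data
            current_char, current_data = character, {}
        current_data[key] = value
    if current_char is not None:
        yield current_char, current_data


# Illustrative input; only the kMandarin line comes from the text above.
sample = [
    "# comment",
    "U+8682\tkMandarin\tMA3 MA1 MA4",
    "U+8682\tkTotalStrokes\t9",
    "U+20000\tkTotalStrokes\t2",
]
bmp_only = list(unihan_entries(sample, wide_build=False))
everything = list(unihan_entries(sample))
```

Merging data for one character across several files would need an extra step, for example collecting the entries in a dictionary keyed by character.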
class cjklib.build.builder.UnihanStrokeCountBuilder(**options)

Bases: cjklib.build.builder.UnihanDerivedBuilder

Builds a mapping between characters and their stroke count using the Unihan data.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
  • ignoreMissing – if True a missing source column will be ignored and an empty table will be built.
GENERATOR_CLASS
alias of StrokeCountExtractor
class StrokeCountExtractor(entries, quiet=False)

Extracts the character stroke count mapping.

Parameters:
  • entries (list of tuple) – character entries from the Unihan database
  • quiet (bool) – if true no status information will be printed
generator()
Provides one entry per radical and character.
class cjklib.build.builder.VersionBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Table for keeping track of version of installed dictionary.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
getGenerator()
class cjklib.build.builder.WadeGilesInitialFinalBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping from Wade-Giles syllables to their initial/final parts.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.WadeGilesPinyinMappingBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a mapping between syllables in Wade-Giles and Pinyin.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.WadeGilesSyllablesBuilder(**options)

Bases: cjklib.build.builder.CSVFileLoader

Builds a list of Wade-Giles syllables.

Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
class cjklib.build.builder.WordIndexBuilder(**options)

Bases: cjklib.build.builder.EntryGeneratorBuilder

Builds a translation word index for a given dictionary.

Searching for a word will return a headword and reading. This makes it possible to find several dictionary entries with the same headword and reading, even if only one of them includes the translation word.

Todo

  • Fix: Word regex is specialised for HanDeDict.
  • Fix: Using a row_id for joining instead of Headword(Traditional) and Reading might speed up table joins. Needs a workaround to include multiple rows for one actual headword entry though.
Parameters:
  • options – extra options
  • dbConnectInst – instance of a DatabaseConnector
  • dataPath – optional list of paths to the data file(s)
  • quiet – if True no status information will be printed to stderr
HEADWORD_SOURCE
Source of headword
TABLE_SOURCE
Dictionary source
class WordEntryGenerator(entries)

Generates words for a list of dictionary entries.

Parameter:entries (list of tuple) – a list of headword and its translation
generator()
Provides all data of one word per entry.
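
A word index of this kind can be sketched as one row per headword, reading and translation word. The sketch assumes entries given as (headword, reading, translation) tuples and a simple \w+ word pattern; the Todo above notes that the library's word regex is specialised for HanDeDict, so the pattern and the function name word_index are assumptions for this example.

```python
import re

# Hypothetical word pattern; Unicode-aware under Python 3.
WORD_REGEX = re.compile(r"\w+")


def word_index(entries):
    """Yield unique (headword, reading, word) rows for all translation words."""
    seen = set()
    for headword, reading, translation in entries:
        for word in WORD_REGEX.findall(translation.lower()):
            row = (headword, reading, word)
            if row not in seen:
                seen.add(row)
                yield row


entries = [("港幣", "gǎngbì", "/Hong Kong dollar/")]
index_rows = list(word_index(entries))
```

Looking up "dollar" in such an index returns the headword and reading, which can then be joined against the dictionary table.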
WordIndexBuilder.getGenerator(*args, **kwargs)
