Methods for building the library’s database.
Bases: cjklib.build.builder.EntryGeneratorBuilder
Builds a simple list of all characters in the standard BIG5-HKSCS.
Parameters: |
|
---|
New in version 0.3.
Bases: cjklib.build.builder.UnihanCharacterSetBuilder
Builds a simple list of all characters in the Taiwanese standard BIG5.
Parameters: |
|
---|
Bases: cjklib.build.builder.CEDICTFormatBuilder
Builds the CEDICT dictionary.
Parameters: |
|
---|
Bases: cjklib.build.builder.EDICTFormatBuilder
Provides an abstract class for loading CEDICT formatted dictionaries.
Two columns are provided for the headword (one each for the traditional and the simplified writing), one for the reading (e.g. in CEDICT Pinyin) and one for the translation.
Parameters: |
|
---|
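To illustrate the four-column layout described above, here is a minimal sketch that splits one CEDICT-style line into traditional headword, simplified headword, reading and translation. The name `parse_cedict_line` and the regular expression are assumptions made for this example; they are not taken from cjklib.

```python
import re

# Assumed line layout: TRADITIONAL SIMPLIFIED [reading] /translation/.../
CEDICT_LINE = re.compile(r'^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+(/.*/)\s*$')


def parse_cedict_line(line):
    """Return (traditional, simplified, reading, translation) or None."""
    if line.startswith('#'):  # comment lines carry no entry
        return None
    match = CEDICT_LINE.match(line)
    return match.groups() if match else None
```

Lines that do not follow the assumed layout yield `None`, so a caller can skip them instead of failing.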
Bases: cjklib.build.builder.EDICTFormatBuilder
Builds the CEDICT-GR dictionary.
Parameters: |
|
---|
Bases: cjklib.build.builder.WordIndexBuilder
Builds the word index of the CEDICT-GR dictionary.
Parameters: |
|
---|
Bases: cjklib.build.builder.WordIndexBuilder
Builds the word index of the CEDICT dictionary.
Parameters: |
|
---|
Bases: cjklib.build.builder.TimestampedCEDICTFormatBuilder
Builds the CFDICT dictionary.
Parameters: |
|
---|
Bases: cjklib.build.builder.WordIndexBuilder
Builds the word index of the CFDICT dictionary.
Parameters: |
|
---|
Bases: cjklib.build.builder.TableBuilder
Builds a table by loading its data from a list of comma separated values (CSV).
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping from Cantonese syllables in IPA to their initial/final parts.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping of Cantonese syllables in the Yale romanisation system to the syllables’ initial, nucleus and coda.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a list of Cantonese Yale syllables.
Parameters: |
|
---|
Bases: cjklib.build.builder.EntryGeneratorBuilder
Builds a mapping between characters and their components.
Parameters: |
|
---|
Generates the component to character mapping.
Parameters: |
|
---|
Gets all character components for the given glyph.
Parameters: |
|
---|---|
Return type: | set |
Returns: | all components of the character |
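Collecting all components of a character can be sketched as a recursive walk over a decomposition table. The function `get_components` and the `decompositions` dictionary (character to list of direct components) are hypothetical stand-ins for this example, not the library's data model.

```python
def get_components(char, decompositions):
    """Collect every direct and indirect component of `char` as a set."""
    components = set()
    for part in decompositions.get(char, ()):
        components.add(part)
        # Descend into the part to pick up its own components as well.
        components |= get_components(part, decompositions)
    return components
```

The sketch assumes the decomposition data contains no cycles; cyclic data would recurse without end.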
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping between characters and their decomposition.
Parameters: |
|
---|
Bases: cjklib.build.builder.CharacterReadingBuilder
Builds Pinyin mapping table using the Unihan database for syllables with diacritics.
Parameters: |
|
---|
Generates Pinyin syllables from Unihan entries in diacritic form.
Parameters: |
|
---|
Converts the entity with diacritics into an entity with tone mark as appended number.
Parameter: | entity (str) – entity with tonal information |
---|---|
Return type: | tuple |
Returns: | plain entity without tone mark and entity’s tone index (starting with 1) |
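One possible way to perform such a conversion is to decompose the entity with Unicode normalization and look for the combining tone mark. This is only a sketch: the name `split_tone` is hypothetical, and assigning tone 5 to an entity without a tone mark is an assumption of this example.

```python
import unicodedata

# Combining diacritics used for the four Mandarin tones.
TONE_MARKS = {'\u0304': 1, '\u0301': 2, '\u030c': 3, '\u0300': 4}


def split_tone(entity):
    """Return (plain_entity, tone) for a syllable written with diacritics."""
    tone = 5  # assumption: an entity without tone mark gets the fifth tone
    plain = []
    for char in unicodedata.normalize('NFD', entity):
        if char in TONE_MARKS:
            tone = TONE_MARKS[char]
        else:
            plain.append(char)
    # Recompose so that u + combining diaeresis becomes a single ü again.
    return unicodedata.normalize('NFC', ''.join(plain)), tone
```

Recomposing with NFC keeps the vowel ü intact while the tone mark is removed.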
Bases: cjklib.build.builder.CharacterDiacriticPinyinBuilder
Builds the Hanyu Da Zidian Pinyin mapping table using the Unihan database.
Parameters: |
|
---|
Bases: cjklib.build.builder.CharacterReadingBuilder
Builds the character Hangul mapping table from the Unihan database.
Parameters: |
|
---|
Bases: cjklib.build.builder.CharacterReadingBuilder
Builds the character Kun’yomi mapping table from the Unihan database.
Parameters: |
|
---|
Bases: cjklib.build.builder.CharacterReadingBuilder
Builds the character On’yomi mapping table from the Unihan database.
Parameters: |
|
---|
Bases: cjklib.build.builder.CharacterRadicalBuilder
Builds the character Japanese radical mapping table from the Unihan database.
Parameters: |
|
---|
Bases: cjklib.build.builder.CharacterReadingBuilder
Builds the character Jyutping mapping table from the Unihan database.
Parameters: |
|
---|
Bases: cjklib.build.builder.CharacterRadicalBuilder
Builds the character Dai Kan-Wa jiten radical mapping table from the Unihan database.
Parameters: |
|
---|
Bases: cjklib.build.builder.CharacterRadicalBuilder
Builds the character Kangxi radical mapping table from the Unihan database.
Parameters: |
|
---|
Bases: cjklib.build.builder.CharacterRadicalBuilder
Builds the character Korean radical mapping table from the Unihan database.
Parameters: |
|
---|
Bases: cjklib.build.builder.EntryGeneratorBuilder
Provides a mapping of character to Pinyin with additional data not found in other sources.
Parameters: |
|
---|
Bases: cjklib.build.builder.EntryGeneratorBuilder
Builds the character Pinyin mapping table from several sources.
Field ‘kMandarin’ from Unihan is not used, see http://unicode.org/faq/han_cjk.html#19 and thread under http://www.unicode.org/mail-arch/unicode-ml/y2010-m01/0246.html.
Parameters: |
|
---|
Bases: cjklib.build.builder.UnihanDerivedBuilder
Provides an abstract class for building a character radical mapping table using the Unihan database.
Parameters: |
|
---|
Generates the radical to character mapping from the Unihan table.
Parameters: |
|
---|
Bases: cjklib.build.builder.EntryGeneratorBuilder
Builds a mapping between characters and their radical with stroke count of residual components.
This class can be extended by inheriting from CharacterRadicalStrokeCountBuilder.CharacterRadicalStrokeCountGenerator and overriding CharacterRadicalStrokeCountBuilder.CharacterRadicalStrokeCountGenerator.getFormRadicalIndex() to define which forms should be regarded as radicals, and by overriding CharacterRadicalStrokeCountBuilder.filterForms() to filter entries before creation.
Parameters: |
|
---|
Generates the character to radical/residual stroke count mapping.
Parameters: |
|
---|
Filters the set of given radical form entries to return only one single occurrence of a radical.
Parameter: | formSet (set of dict) – radical/residual stroke count entries as generated by CharacterRadicalStrokeCountBuilder.CharacterRadicalStrokeCountGenerator.getEntries(). |
---|---|
Return type: | set of dict |
Returns: | subset of input |
Todo
Gets all radical/residual stroke count combinations from the given decomposition.
Return type: | list |
---|---|
Returns: | all radical/residual stroke count combinations for the character |
Raises ValueError: | |
if IDS is malformed or ambiguous residual stroke count is calculated |
Returns the Kangxi radical index for the given component.
Parameter: | form (str) – component |
---|---|
Return type: | int |
Returns: | radical index of the given radical form. |
Bases: cjklib.build.builder.UnihanDerivedBuilder
Provides an abstract class for building a character reading mapping table using the Unihan database.
Parameters: |
|
---|
Generates the reading entities from the Unihan table.
Parameters: |
|
---|
Bases: cjklib.build.builder.EntryGeneratorBuilder
Builds a mapping between characters and their residual stroke count when splitting off the radical form. The data is a stripped-down subset of the information gathered from table CharacterRadicalStrokeCount.
Parameters: |
|
---|
Generates the character to residual stroke count mapping from the CharacterRadicalResidualStrokeCount table.
Parameters: |
|
---|
Gets a list of radical residual entries. For multiple radical occurrences (e.g. 伦) only returns the residual stroke count for the “main” radical form.
Parameters: |
|
---|---|
Return type: | list of tuple |
Returns: | list of residual stroke count entries |
Todo
Lang: Implement; find a good algorithm for rejecting unwanted forms instead of just choosing one at random. See the following list:
>>> from cjklib import characterlookup
>>> cjk = characterlookup.CharacterLookup('T')
>>> for char in cjk.db.selectSoleValue('CharacterRadicalResidualStrokeCount',
...         'ChineseCharacter', distinctValues=True):
...     try:
...         entries = cjk.getCharacterKangxiRadicalResidualStrokeCount(char, 'C')
...         lastEntry = entries[0]
...         for entry in entries[1:]:
...             # print if diff. radical forms and diff. residual stroke count
...             if lastEntry[0] != entry[0] and lastEntry[2] != entry[2]:
...                 print char
...                 break
...             lastEntry = entry
...     except:
...         pass
...
渌
犾
玺
珏
缧
>>> cjk.getCharacterKangxiRadicalResidualStrokeCount(u'缧')
[(u'糸', 0, u'⿻', 0, 8), (u'纟', 0, u'⿰', 0, 11)]
Bases: cjklib.build.builder.CharacterReadingBuilder
Builds the character Pinyin mapping table from the Unihan database.
Parameters: |
|
---|
Bases: cjklib.build.builder.EntryGeneratorBuilder
Builds a character variant mapping table from the Unihan database. By default only chooses characters from the Basic Multilingual Plane (BMP) with code values between U+0000 and U+FFFF.
Windows versions of Python are narrow builds by default and don’t support characters outside the 16-bit range. MySQL < 6 doesn’t support true UTF-8 and uses a variant limited to at most 3 bytes per character: http://dev.mysql.com/doc/refman/6.0/en/charset-unicode.html.
Parameters: |
|
---|
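The BMP restriction and the 3-byte UTF-8 limit mentioned above are two views of the same boundary. The following sketch, written for Python 3 with hypothetical helper names, shows the check on a single character; it is not the builder's own filter.

```python
def is_bmp(char):
    """True if the single character lies in the Basic Multilingual Plane."""
    return ord(char) <= 0xFFFF


def utf8_length(char):
    """Number of bytes the character occupies in UTF-8."""
    return len(char.encode('utf-8'))
```

Characters inside the BMP need at most 3 bytes in UTF-8, while characters beyond U+FFFF need 4 bytes.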
Generates the character to variant mapping from the Unihan table.
Parameters: |
|
---|
Bases: cjklib.build.builder.CharacterReadingBuilder
Builds the character Vietnamese mapping table from the Unihan database.
Parameters: |
|
---|
Bases: cjklib.build.builder.CharacterDiacriticPinyinBuilder
Builds the Xiandai Hanyu Cidian Pinyin mapping table using the Unihan database.
Parameters: |
|
---|
Bases: cjklib.build.builder.UnihanDerivedBuilder
Builds the Xiandai Hanyu Pinlu Cidian Pinyin mapping table using the Unihan database.
Parameters: |
|
---|
Generates the Xiandai Hanyu Pinlu Cidian Pinyin syllables from the Unihan table.
Parameters: |
|
---|
Bases: cjklib.build.builder.CharacterResidualStrokeCountBuilder
Builds a mapping between characters and their residual stroke count when splitting off the radical form. Includes stroke count data from the Unihan database to make up for data missing in the library’s own data files.
Parameters: |
|
---|
Generates the character to residual stroke count mapping.
Parameters: |
|
---|
Bases: cjklib.build.builder.StrokeCountBuilder
Builds a mapping between characters and their stroke count. Includes stroke count data from the Unihan database to make up for data missing in the library’s own data files.
Parameters: |
|
---|
Generates the character stroke count mapping.
Parameters: |
|
---|
Gets the stroke count of the given character by summing up the stroke count of its components and using the Unihan table as fallback.
For the sake of consistency, this method doesn’t take the stroke count given by Unihan directly but sums up the stroke counts of the components, so that the sum of the components’ stroke counts always gives the character’s stroke count. In many cases the result will be even more precise than the value given in Unihan (not depending on the actual glyph form).
Once calculated the stroke count will be cached in the given strokeCountDict object.
Parameters: |
|
---|---|
Return type: | int |
Returns: | stroke count |
Raises ValueError: | |
if the stroke count is ambiguous due to inconsistent values between Unihan and the library’s own data. |
|
Raises NoInformationError: | |
if decomposition is incomplete |
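The summing strategy with fallback and caching can be sketched as follows. All names here are hypothetical: `known` holds stroke counts of basic components, `decompositions` maps a character to its components, and `fallback` plays the role of the Unihan values. The consistency check that leads to a ValueError is left out to keep the example short.

```python
class NoInformationError(Exception):
    """Raised when neither components nor fallback give a stroke count."""


def get_stroke_count(char, known, decompositions, fallback, cache):
    """Sum component stroke counts, using `fallback` only when needed."""
    if char in cache:
        return cache[char]
    if char in known:
        count = known[char]
    else:
        try:
            parts = decompositions[char]
            count = sum(get_stroke_count(part, known, decompositions,
                                         fallback, cache) for part in parts)
        except (KeyError, NoInformationError):
            # Decomposition missing or incomplete: try the fallback table.
            if char not in fallback:
                raise NoInformationError(char)
            count = fallback[char]
    cache[char] = count  # remember the result for later lookups
    return count
```

Passing the same `cache` dictionary to repeated calls avoids recomputing shared components.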
Bases: cjklib.build.builder.EDICTFormatBuilder
Builds the EDICT dictionary.
Parameters: |
|
---|
Bases: cjklib.build.builder.EntryGeneratorBuilder
Provides an abstract class for loading EDICT formatted dictionaries.
One column will be provided for the headword, one for the reading (in EDICT that is the Kana) and one for the translation.
Todo
Parameters: |
|
---|
Generates the dictionary entries.
Parameters: |
|
---|
Build the table provided by the TableBuilder.
A search index is created to allow for fulltext searching.
Returns a SQL statement for creating a virtual table using FTS3 for SQLite.
Parameter: | table (object) – SQLAlchemy table object representing the FTS3 table |
---|---|
Return type: | str |
Returns: | Create table statement |
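A statement of this kind might look like the output of the sketch below. It takes a plain table name and a list of column names instead of a SQLAlchemy table object, which is a simplification made for this example; the function name is hypothetical.

```python
def create_fts3_statement(table_name, column_names):
    """Build a CREATE VIRTUAL TABLE statement for an FTS3 table."""
    # Names are quoted but not escaped; the sketch assumes simple identifiers.
    columns = ', '.join('"%s"' % name for name in column_names)
    return 'CREATE VIRTUAL TABLE "%s" USING FTS3(%s);' % (table_name, columns)
```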
Builds a FTS3 table construct for supporting full text search under SQLite.
Parameters: |
|
---|
Function that extracts the name of the contained file from the zipped/tarred archive using the archive’s file name. Reimplement and adapt to your own needs.
Parameters: |
|
---|---|
Return type: | str |
Returns: | name of file in archive |
Returns a handle to the given file.
The file can be plain content or be packed as zip, tar, tar.gz, tar.bz2 or gz.
Parameter: | filePath (str) – path of file |
---|---|
Return type: | file |
Returns: | handle to file’s content |
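Opening the different container types can be sketched with the standard library alone. The function `open_content` is hypothetical; it assumes the wanted file is the first member of an archive and, for brevity, reads archive members fully into memory.

```python
import gzip
import io
import tarfile
import zipfile


def open_content(file_path):
    """Return a binary file-like object with the file's content."""
    if zipfile.is_zipfile(file_path):
        with zipfile.ZipFile(file_path) as archive:
            # Assumption: the first archive member is the wanted file.
            return io.BytesIO(archive.read(archive.namelist()[0]))
    if tarfile.is_tarfile(file_path):
        # Mode 'r' detects gzip and bzip2 compression transparently.
        with tarfile.open(file_path, 'r') as archive:
            member = next(m for m in archive.getmembers() if m.isfile())
            return io.BytesIO(archive.extractfile(member).read())
    if file_path.endswith('.gz'):
        return gzip.open(file_path, 'rb')
    return open(file_path, 'rb')
```

Archive detection relies on the file content for zip and tar, and on the file name ending for plain gz files.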
Tests if the SQLite FTS3 extension is supported on the build system.
Return type: | bool |
---|---|
Returns: | True if the FTS3 extension exists, False otherwise. |
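A common way to probe for FTS3 is to try creating a virtual table in an in-memory database. The sketch below uses Python's `sqlite3` module and a hypothetical function name; how the builder itself performs the check is not shown in this document.

```python
import sqlite3


def has_fts3():
    """Return True if this SQLite build accepts FTS3 virtual tables."""
    connection = sqlite3.connect(':memory:')
    try:
        connection.execute(
            'CREATE VIRTUAL TABLE "fts3_probe" USING FTS3(content);')
        return True
    except sqlite3.OperationalError:
        # Raised when the fts3 module is not compiled in.
        return False
    finally:
        connection.close()
```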
Bases: cjklib.build.builder.WordIndexBuilder
Builds the word index of the EDICT dictionary.
Parameters: |
|
---|
Bases: cjklib.build.builder.TableBuilder
Implements an abstract class for building a table from a generator providing entries.
Parameters: |
|
---|
Bases: cjklib.build.builder.UnihanCharacterSetBuilder
Builds a simple list of all characters in the Chinese standard GB2312-80.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a list of Gwoyeu Romatzyh abbreviated spellings.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a list of Gwoyeu Romatzyh rhotacised finals.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a list of Gwoyeu Romatzyh syllables.
Parameters: |
|
---|
Bases: cjklib.build.builder.EntryGeneratorBuilder
Builds a list of glyph indices for characters.
Todo
Parameters: |
|
---|
Bases: cjklib.build.builder.EntryGeneratorBuilder
Virtual table (view) holding all characters with information about their glyph, i.e. stroke order or character decomposition.
Todo
Parameters: |
|
---|
Bases: cjklib.build.builder.UnihanCharacterSetBuilder
Builds a simple list of all supplementary characters in the Hong Kong standard HKSCS.
Parameters: |
|
---|
Bases: cjklib.build.builder.TimestampedCEDICTFormatBuilder
Builds the HanDeDict dictionary.
Parameters: |
|
---|
Bases: cjklib.build.builder.WordIndexBuilder
Builds the word index of the HanDeDict dictionary.
Parameters: |
|
---|
Bases: cjklib.build.builder.UnihanCharacterSetBuilder
Builds a simple list of all characters in IICore (Unicode International Ideograph Core).
See Chinese Wikipedia on IICore: http://zh.wikipedia.org/wiki/國際表意文字核心
Parameters: |
|
---|
Bases: cjklib.build.builder.UnihanCharacterSetBuilder
Builds a simple list of all characters in the Japanese standard JIS X 0208.
Parameters: |
|
---|
New in version 0.3.
Bases: cjklib.build.builder.EntryGeneratorBuilder
Builds a simple list of all characters in the standards JIS X 0208 and JIS X 0213.
Parameters: |
|
---|
New in version 0.3.
Bases: cjklib.build.builder.UnihanCharacterSetBuilder
Builds a simple list of all supplementary characters in the Japanese standard JIS X 0213.
Parameters: |
|
---|
New in version 0.3.
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping between syllables in Jyutping and their representation in IPA.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping from Jyutping syllables to their initial/final parts.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a list of Jyutping syllables.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping between syllables in Jyutping and the Yale romanization system.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping between Kangxi radical index and radical characters.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping between Kangxi radical index and radical equivalent characters without radical form.
Parameters: |
|
---|
Bases: cjklib.build.builder.EntryGeneratorBuilder
Builds the Kanjidic database from the Kanjidic2 XML file http://www.csse.monash.edu.au/~jwb/kanjidic2/.
Parameters: |
|
---|
Generates the KANJIDIC table.
Parameters: |
|
---|
Returns a handle of the KANJIDIC database file.
Return type: | file |
---|---|
Returns: | file handle of the KANJIDIC file |
Bases: xml.sax.handler.ContentHandler
Extracts a list of given tags.
Returns the Kanjidic2Builder.KanjidicGenerator.
Return type: | instance |
---|---|
Returns: | instance of a Kanjidic2Builder.KanjidicGenerator |
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping between a character under a locale and its default glyph.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping of Mandarin Chinese syllable finals in Pinyin to Braille characters.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping of Mandarin Chinese syllable initials in Pinyin to Braille characters.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping from Mandarin syllables in IPA to their initial/final parts.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping between syllables in Pinyin and Gwoyeu Romatzyh.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping between syllables in Pinyin and their representation in IPA.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping from Pinyin syllables to their initial/final parts.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a list of Pinyin syllables.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping between Unicode radical forms and Unicode radical variants on one side and equivalent characters on the other side.
Parameters: |
|
---|
Bases: cjklib.build.builder.EntryGeneratorBuilder
Provides a builder for loading dictionaries following the Wenlin format:
*** 928171 ***
pinyin gǎngbì
characters 港币[-幣]
serial-number 1929047
definition Hong Kong dollar
Parameters: |
|
---|
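A record in the format shown above could be read with a small line-based parser. This is a sketch under the assumption that each record starts with a `*** number ***` line and that every following line holds a key and a value separated by whitespace; the names are hypothetical.

```python
import re

RECORD_HEADER = re.compile(r'^\*\*\* (\d+) \*\*\*$')


def parse_records(lines):
    """Return one dict of key/value pairs per record."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if RECORD_HEADER.match(line):
            records.append({})  # a header line opens a new record
        elif records:
            parts = line.split(None, 1)
            records[-1][parts[0]] = parts[1] if len(parts) > 1 else ''
    return records
```

Lines that appear before the first record header are ignored.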
Generates the dictionary entries.
Parameters: |
|
---|
Generates a traditional and simplified form from ‘characters’.
Parameter: | entry (tuple) – a dictionary entry |
---|---|
Return type: | tuple |
Returns: | the given entry with corrected ü-vowel |
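Deriving both forms from a ‘characters’ value such as 港币[-幣] might work as in the sketch below. It assumes that the part before the brackets is the simplified form and that the bracketed part gives the traditional form, with a hyphen standing for an unchanged character. Both the interpretation and the name `split_forms` are assumptions of this example.

```python
import re


def split_forms(characters):
    """Return (traditional, simplified) for a value like '港币[-幣]'."""
    match = re.match(r'^([^\[]+)\[([^\]]+)\]$', characters)
    if not match:
        # No bracketed variant: both forms are taken to be identical.
        return characters, characters
    simplified, variant = match.groups()
    if len(simplified) != len(variant):
        raise ValueError('forms differ in length: %r' % characters)
    # A hyphen means the character is the same as in the simplified form.
    traditional = ''.join(s if v == '-' else v
                          for s, v in zip(simplified, variant))
    return traditional, simplified
```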
Returns a handle to the given file.
The file can be plain content or be packed as zip or gz.
Parameter: | filePath (str) – path of file |
---|---|
Return type: | file |
Returns: | handle to file’s content |
Bases: cjklib.build.builder.EntryGeneratorBuilder
Builds a mapping between characters and their stroke count.
Parameters: |
|
---|
Generates the character stroke count mapping.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping between characters and their stroke order.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a list of strokes and their names.
Parameters: |
|
---|
Bases: object
TableBuilder provides the abstract layout for classes that build a distinct table.
Parameters: |
|
---|
Build the table provided by the TableBuilder.
Methods should raise an IOError if reading a data source fails. The DatabaseBuilder knows how to handle this case and is able to proceed.
Returns a SQLAlchemy Index object.
Parameters: |
|
---|---|
Return type: | object |
Returns: | SQLAlchemy Index |
Returns a SQLAlchemy Table object.
Parameters: |
|
---|
Tries to locate a file with a given list of possible file names under the class’s default data paths.
For each file name every given path is checked and the first match is returned.
Parameters: |
|
---|---|
Return type: | str |
Returns: | path to file of first match in search for existing file |
Raises IOError: | if no file found |
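The search order described above (file names in the outer loop, paths in the inner loop) can be sketched like this. The function name and its arguments are hypothetical and only mirror the documented behaviour.

```python
import os


def find_file(file_names, paths):
    """Return the first existing path, checking every path per file name."""
    for name in file_names:
        for path in paths:
            candidate = os.path.join(os.path.expanduser(path), name)
            if os.path.exists(candidate):
                return candidate
    raise IOError('No file found for %r under %r' % (file_names, paths))
```

Because file names form the outer loop, an earlier name wins over a later one even if the later one sits in an earlier path.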
Returns the table builder’s default options.
The base class’ implementation returns an empty dictionary. The keyword ‘dbConnectInst’ is not regarded as a configuration option of the builder and is thus not included in the dict returned.
Return type: | dict |
---|---|
Returns: | the table builder’s default options. |
Gets metadata on a given option.
type: string, int, bool, ...
appendResetDefault
choices: allowed values
description: short description of option
Return type: | dict |
---|---|
Returns: | dictionary of metadata |
Bases: cjklib.build.builder.CEDICTFormatBuilder
Shared functionality for dictionaries whose file names include a timestamp.
Parameters: |
|
---|
Tries to locate a file with a given list of possible file names under the class’s default data paths.
Uses the newest version of all files found.
Parameters: |
|
---|---|
Return type: | str |
Returns: | path to file of first match in search for existing file |
Raises IOError: | if no file found |
Bases: object
Provides a CSV file iterator supporting Unicode.
Bases: csv.Dialect
Defines a default dialect for the case that sniffing fails.
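Falling back to a fixed dialect when sniffing fails can be sketched with Python's `csv` module. The dialect settings and the helper `read_rows` below are assumptions for illustration, not the values used by the library.

```python
import csv
import io


class DefaultDialect(csv.Dialect):
    """Fallback dialect: comma separated, double-quoted fields."""
    delimiter = ','
    quotechar = '"'
    doublequote = True
    skipinitialspace = True
    lineterminator = '\n'
    quoting = csv.QUOTE_MINIMAL


def read_rows(text):
    """Parse CSV text, sniffing the dialect and falling back if needed."""
    try:
        dialect = csv.Sniffer().sniff(text[:1024])
    except csv.Error:
        dialect = DefaultDialect  # sniffing failed, use the fixed dialect
    return list(csv.reader(io.StringIO(text), dialect))
```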
Bases: cjklib.build.builder.EntryGeneratorBuilder
Builds the Unihan database from the Unihan file provided by Unicode. By default only chooses characters from the Basic Multilingual Plane (BMP) with code values between U+0000 and U+FFFF.
Windows versions of Python are narrow builds by default and don’t support characters outside the 16-bit range. MySQL < 6 doesn’t support true UTF-8 and uses a variant limited to at most 3 bytes per character: http://dev.mysql.com/doc/refman/6.0/en/charset-unicode.html.
Parameters: |
|
---|
Generates the entries of the Unihan table.
Parameter: | unihanGenerator (instance) – a UnihanGenerator instance |
---|
Returns the UnihanGenerator. Constructs it if needed.
Return type: | instance |
---|---|
Returns: | instance of a UnihanGenerator |
Bases: cjklib.build.builder.EntryGeneratorBuilder
Builds a simple list of characters that belong to a specific class using the Unihan data.
Parameters: |
|
---|
Bases: cjklib.build.builder.EntryGeneratorBuilder
Provides an abstract class for building a table with a relation between a Chinese character and another column using the Unihan database.
Parameters: |
|
---|
Regular expression matching one entry in the Unihan database (e.g. U+8682 kMandarin MA3 MA1 MA4).
Parameters: |
|
---|
Iterates over the Unihan entries.
The character definition is converted to the character’s representation, all other data is given as is. These are merged into one entry for each character.
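Matching one entry, converting the code point to the character and merging all keys per character might look like the following Python 3 sketch. The regular expression and the function names are assumptions made for this example; the library's own expression is not reproduced in this document.

```python
import re

# Assumed layout: U+XXXX, a key name and the value, separated by whitespace.
ENTRY_REGEX = re.compile(r'^U\+([0-9A-F]{4,6})\s+(\w+)\s+(.+?)\s*$')


def parse_entry(line):
    """Return (character, key, value) or None for non-entry lines."""
    match = ENTRY_REGEX.match(line)
    if not match:
        return None
    code_point, key, value = match.groups()
    return chr(int(code_point, 16)), key, value


def merge_entries(lines):
    """Merge all key/value pairs of one character into a single dict."""
    merged = {}
    for line in lines:
        parsed = parse_entry(line)
        if parsed:
            char, key, value = parsed
            merged.setdefault(char, {})[key] = value
    return merged
```

Comment lines and blank lines do not match the expression and are skipped during merging.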
Returns handles of the Unihan database files, keyed by file name.
Return type: | dict |
---|---|
Returns: | dictionary of names and handles of the Unihan files |
Returns all keys read for the Unihan table.
If the whole table is read, a pass through the file is needed first to find all keys; otherwise the predefined set is returned.
Return type: | list of str |
---|---|
Returns: | list of column names |
Bases: cjklib.build.builder.UnihanDerivedBuilder
Builds a mapping between characters and their stroke count using the Unihan data.
Parameters: |
|
---|
Bases: cjklib.build.builder.EntryGeneratorBuilder
Table for keeping track of version of installed dictionary.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping from Wade-Giles syllables to their initial/final parts.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a mapping between syllables in Wade-Giles and Pinyin.
Parameters: |
|
---|
Bases: cjklib.build.builder.CSVFileLoader
Builds a list of Wade-Giles syllables.
Parameters: |
|
---|
Bases: cjklib.build.builder.EntryGeneratorBuilder
Builds a translation word index for a given dictionary.
Searching for a word will return a headword and reading. This makes it possible to find several dictionary entries with the same headword and reading, of which only one includes the translation word.
Todo
Parameters: |
|
---|
Generates words for a list of dictionary entries.
Parameter: | entries (list of tuple) – a list of headword and its translation |
---|
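Generating index rows from dictionary entries can be sketched as below. The example assumes entries are (headword, reading, translation) tuples and that words are simply runs of word characters in the lowercased translation; both points are assumptions of this sketch, as is the name `generate_words`.

```python
import re


def generate_words(entries):
    """Yield unique (headword, reading, word) rows for a word index."""
    seen = set()
    for headword, reading, translation in entries:
        for word in re.findall(r'\w+', translation.lower()):
            row = (headword, reading, word)
            if row not in seen:  # emit every combination only once
                seen.add(row)
                yield row
```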