cjklib.dictionary — High level dictionary access

New in version 0.3.

This module provides classes for easy access to well known CJK dictionaries. Queries can be done using a headword, reading or translation.

Dictionary sources yield less structured information compared to other data sources exposed in this library. Owing to this fact, a flexible system is provided to the user.

Inheritance diagram of cjklib.dictionary

Examples

Examples of how to use this module:

  • Create a dictionary instance:

    >>> from cjklib.dictionary import CEDICT
    >>> d = CEDICT()
    
  • Get dictionary entries by reading:

    >>> [e.HeadwordSimplified for e in
    ...     d.getForReading('zhi dao', reading='Pinyin', toneMarkType='numbers')]
    [u'制导', u'执导', u'指导', u'直到', u'直捣', u'知道']
    
  • Change a search strategy (here search for a reading without tones):

    >>> from cjklib.dictionary import search
    >>> d = CEDICT(readingSearchStrategy=search.SimpleWildcardReading())
    >>> d.getForReading('nihao', reading='Pinyin', toneMarkType='numbers')
    []
    >>> d = CEDICT(readingSearchStrategy=search.TonelessWildcardReading())
    >>> d.getForReading('nihao', reading='Pinyin', toneMarkType='numbers')
    [EntryTuple(HeadwordTraditional=u'你好', HeadwordSimplified=u'你好', Reading=u'nǐ hǎo', Translation=u'/hello/hi/how are you?/')]
    
  • Apply a formatting strategy to remove all initial and final slashes on CEDICT translations:

    >>> from cjklib.dictionary import *
    >>> class TranslationFormatStrategy(format.Base):
    ...     def format(self, string):
    ...         return string.strip('/')
    ...
    >>> d = CEDICT(
    ...     columnFormatStrategies={'Translation': TranslationFormatStrategy()})
    >>> d.getFor(u'东京')
    [EntryTuple(HeadwordTraditional=u'東京', HeadwordSimplified=u'东京', Reading=u'Dōng jīng', Translation=u'Tōkyō, capital of Japan')]
    
  • A simple dictionary lookup tool:

    >>> from cjklib.dictionary import *
    >>> from cjklib.reading import ReadingFactory
    >>> def search(string, reading=None, dictionary='CEDICT'):
    ...     # guess reading dialect
    ...     options = {}
    ...     if reading:
    ...         f = ReadingFactory()
    ...         opClass = f.getReadingOperatorClass(reading)
    ...         if hasattr(opClass, 'guessReadingDialect'):
    ...             options = opClass.guessReadingDialect(string)
    ...     # search
    ...     d = getDictionary(dictionary, entryFactory=entry.UnifiedHeadword())
    ...     result = d.getFor(string, reading=reading, **options)
    ...     # print
    ...     for e in result:
    ...         print e.Headword, e.Reading, e.Translation
    ...
    >>> search('_taijiu', 'Pinyin')
    茅台酒(茅臺酒) máo tái jiǔ /maotai (a Chinese liquor)/CL:杯[bei1],瓶[ping2]/
    

Entry factories

Similar to SQL interfaces, entries can be returned in different forms. An entry factory takes care of preparing the output. Predefined factories exist for this: cjklib.dictionary.entry.Tuple, which is very basic, returns each entry as a tuple of its columns, while the most commonly used cjklib.dictionary.entry.NamedTuple returns tuple objects whose columns are also accessible by attribute.
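The difference between the two return styles can be illustrated with a minimal standalone sketch. It does not import cjklib: the function names tuple_factory and named_tuple_factory and the COLUMNS list are hypothetical stand-ins, and the real factory classes may work differently.

```python
from collections import namedtuple

# Hypothetical column names, modelled on the CEDICT entries shown in the examples.
COLUMNS = ['HeadwordTraditional', 'HeadwordSimplified', 'Reading', 'Translation']


def tuple_factory(rows):
    """Return each row as a plain tuple of its columns."""
    return [tuple(row) for row in rows]


def named_tuple_factory(rows, columns=COLUMNS):
    """Return each row as a named tuple, so cells are also reachable by attribute."""
    EntryTuple = namedtuple('EntryTuple', columns)
    return [EntryTuple(*row) for row in rows]


rows = [('你好', '你好', 'nǐ hǎo', '/hello/hi/how are you?/')]
plain = tuple_factory(rows)
named = named_tuple_factory(rows)

print(plain[0][2])       # access by position only
print(named[0].Reading)  # access by attribute
print(named[0][2])       # positional access still works
```

A named tuple keeps the cheap, immutable tuple representation while sparing callers from remembering column positions.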

Formatting strategies

As reading formats vary and many readings can be converted into each other, a formatting strategy can be applied to return the expected format. cjklib.dictionary.format.ReadingConversion provides an easy way to convert the reading given by the dictionary into a user-defined reading. Other columns can also be formatted by applying a strategy, see the example above.

A hybrid approach makes it possible to apply strategies on single cells, giving a mapping from the cell name to the strategy, or a strategy that operates on the entire result entry, by giving a mapping from None to the strategy. In the latter case the formatting strategy needs to deal with the dictionary specific entry structure:

>>> from cjklib.dictionary import *
>>> d = CEDICT(columnFormatStrategies={
...     'Translation': format.TranslationFormatStrategy()})
>>> d = CEDICT(columnFormatStrategies={
...     None: format.NonReadingEntityWhitespace()})

Formatting strategies can be chained together using the cjklib.dictionary.format.Chain class.
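How chaining and per-column strategies fit together can be sketched without cjklib. Everything below is an illustrative assumption: the classes StripSlashes, SlashesToSemicolons and this Chain are stand-ins that only share the format() method seen in the example above, and format_entry is a hypothetical helper.

```python
class StripSlashes(object):
    """Hypothetical strategy: remove leading and trailing slashes."""

    def format(self, string):
        return string.strip('/')


class SlashesToSemicolons(object):
    """Hypothetical strategy: turn the remaining slashes into '; '."""

    def format(self, string):
        return string.replace('/', '; ')


class Chain(object):
    """Apply the given strategies one after another, left to right."""

    def __init__(self, *strategies):
        self.strategies = strategies

    def format(self, string):
        for strategy in self.strategies:
            string = strategy.format(string)
        return string


def format_entry(entry, column_strategies):
    """Apply each strategy to the cell named by its key; other cells stay as-is."""
    formatted = dict(entry)
    for column, strategy in column_strategies.items():
        formatted[column] = strategy.format(formatted[column])
    return formatted


entry = {'Reading': 'Dōng jīng', 'Translation': '/Tōkyō/capital of Japan/'}
chain = Chain(StripSlashes(), SlashesToSemicolons())
result = format_entry(entry, {'Translation': chain})
print(result['Translation'])
```

Order matters in a chain: here the outer slashes must be stripped before the inner ones are replaced, otherwise the result would start and end with a stray separator.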

Search strategies

Searching in natural language data is a difficult process and depends heavily on the use case at hand. This task is handled by search strategies, which account for the more complex parts of this module. Strategies exist for the three main parts of dictionary entries: headword, reading and translation. Additionally, mixed searching for a headword partially expressed by reading information is supported and can augment the basic reading search. Several search strategies exist, offering basic or more sophisticated routines. For example, wildcard searching is offered on top of many basic strategies, by default using the placeholders '_' for a single character and '%' for zero or more characters.
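The meaning of the two default placeholders can be illustrated with a small standalone sketch that translates a wildcard pattern into a regular expression. This is only one possible way to model the semantics; cjklib itself queries a database, and the helper names below are hypothetical.

```python
import re


def wildcard_to_regex(pattern, single='_', multiple='%'):
    """Translate '_' (one character) and '%' (zero or more) into a regex."""
    parts = []
    for char in pattern:
        if char == single:
            parts.append('.')
        elif char == multiple:
            parts.append('.*')
        else:
            # Escape everything else so it is matched literally.
            parts.append(re.escape(char))
    return re.compile(''.join(parts), re.DOTALL)


def wildcard_search(pattern, candidates):
    """Return the candidates that match the whole wildcard pattern."""
    regex = wildcard_to_regex(pattern)
    return [c for c in candidates if regex.fullmatch(c)]


headwords = ['知道', '知道了', '不知道', '指导']
print(wildcard_search('知_', headwords))    # exactly one character after 知
print(wildcard_search('知%', headwords))    # anything (or nothing) after 知
print(wildcard_search('%知道', headwords))  # anything (or nothing) before 知道
```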

Inheritance diagram of cjklib.dictionary.search

Headword search strategies

Searching for headwords is the simplest of the three. Exact searches are provided by class cjklib.dictionary.search.Exact. By default, class cjklib.dictionary.search.Wildcard is employed, which offers wildcard searches.

Reading search strategies

Readings have more complex and unique representations. Several classes are provided here: cjklib.dictionary.search.Exact again can be used for exact matches, and cjklib.dictionary.search.Wildcard for wildcard searches. cjklib.dictionary.search.SimpleReading and cjklib.dictionary.search.SimpleWildcardReading provide similar searching for transcriptions as found e.g. in CEDICT. A more complex search is provided by cjklib.dictionary.search.TonelessWildcardReading which offers search for readings missing tonal information.
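The idea behind searching for readings that lack tonal information can be sketched as comparing readings after all tone marks are removed. The sketch below is a deliberate simplification under stated assumptions: it only handles tone numbers 1 to 5, uses made-up sample readings, and ignores tones entirely, whereas the actual strategy may respect tones where the query gives them. The helper names are hypothetical.

```python
import re


def strip_tones(reading):
    """Remove tone numbers 1-5 and whitespace, and lower-case the rest."""
    return re.sub(r'[1-5\s]', '', reading).lower()


def toneless_search(query, readings):
    """Return the readings that equal the query once tones are ignored."""
    wanted = strip_tones(query)
    return [r for r in readings if strip_tones(r) == wanted]


# Sample data for illustration only, not taken from a dictionary.
readings = ['ni3 hao3', 'ni3 hao4', 'zhi1 dao4']
matches = toneless_search('nihao', readings)
print(matches)
```

Dropping whitespace as well as tones makes syllable boundaries ambiguous (for example 'xi an' and 'xian' compare equal), which is one reason a real implementation needs proper syllable segmentation.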

Translation search strategies

A basic search is provided by cjklib.dictionary.search.SingleEntryTranslation which finds an exact entry in a list of entries separated by slashes ('/'). More flexible searching is provided by cjklib.dictionary.search.SimpleTranslation and cjklib.dictionary.search.SimpleWildcardTranslation which take into account additional information placed in parentheses. These classes have further specialised implementations adapted to the formats found in the dictionaries CEDICT and HanDeDict.
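The difference between matching a whole slash-separated entry and tolerating parenthesised additions can be sketched as follows. The functions are hypothetical and standalone; how the real strategies treat parentheses in detail is not specified here, so the second function simply discards them as an assumption.

```python
import re


def single_entry_match(search, translation):
    """True if `search` equals one of the slash-separated entries exactly."""
    entries = [e for e in translation.split('/') if e]
    return search in entries


def simple_match(search, translation):
    """Like single_entry_match, but ignores text placed in parentheses."""
    for entry in translation.split('/'):
        without_parens = re.sub(r'\([^)]*\)', '', entry)
        normalized = ' '.join(without_parens.split())  # collapse whitespace
        if normalized and normalized == search:
            return True
    return False


translation = '/maotai (a Chinese liquor)/CL:杯[bei1],瓶[ping2]/'
print(single_entry_match('maotai', translation))                     # False
print(single_entry_match('maotai (a Chinese liquor)', translation))  # True
print(simple_match('maotai', translation))                           # True
```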

More complex strategies can be implemented by extending the underlying table in the database, e.g. using full text search capabilities of the database server. One popular approach is to use stemming algorithms to cope with inflections by reducing a word to its root form.

Mixed reading search strategies

Special support for a string with mixed reading and headword entities is provided by mixed reading search strategies. For example, 'dui4 不 qi3' will find all entries with three-character headwords whose middle character is '不', whose left character is read 'dui4' and whose right character is read 'qi3'.
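A mixed search of this kind can be sketched as a position-by-position comparison in which each query token may match either the headword character or its reading. The sketch assumes whitespace-separated tokens and one syllable per character, uses made-up sample entries, and the function mixed_match is hypothetical; the actual strategies have to segment the query themselves.

```python
def mixed_match(query, headword, reading):
    """Match tokens against headword characters or their readings, by position."""
    tokens = query.split()
    syllables = reading.split()
    if not (len(tokens) == len(headword) == len(syllables)):
        return False
    for token, char, syllable in zip(tokens, headword, syllables):
        # A token is accepted if it is the character itself or its reading.
        if token != char and token != syllable:
            return False
    return True


# Sample entries for illustration only.
entries = [
    ('对不起', 'dui4 bu4 qi3'),
    ('对得起', 'dui4 de5 qi3'),
    ('看不起', 'kan4 bu4 qi3'),
]
found = [h for h, r in entries if mixed_match('dui4 不 qi3', h, r)]
print(found)
```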

Case insensitivity & Collations

Case insensitive searching is done through collations in the underlying database system and, for databases without collation support, by employing the function lower(). A default case independent collation is chosen in the appropriate build method in cjklib.build.builder.

SQLite by default has no Unicode support for string operations. Optionally the ICU library can be compiled in for handling alphabetic non-ASCII characters. The DatabaseConnector can register its own Unicode functions if ICU support is missing. Queries with LIKE will then use the function lower(). This compatibility mode has a negative impact on performance, and as it is not needed for dictionaries like EDICT or CEDICT it is disabled by default.
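The general technique of registering a Unicode-aware lower() with SQLite can be sketched with Python's standard sqlite3 module. This is not cjklib's DatabaseConnector code; the table name Dict, the sample row and the helper unicode_lower are assumptions made for the illustration.

```python
import sqlite3

connection = sqlite3.connect(':memory:')


def unicode_lower(value):
    """Lower-case using Python's Unicode rules; pass NULL through unchanged."""
    return value.lower() if value is not None else None


# Register a one-argument lower() that takes precedence over the built-in one.
connection.create_function('lower', 1, unicode_lower)

connection.execute('CREATE TABLE Dict (Reading TEXT)')
connection.execute("INSERT INTO Dict VALUES ('Ōsaka')")

# Lower-casing both sides makes the comparison case insensitive for non-ASCII
# letters too, at the cost of calling back into Python for every row.
rows = connection.execute(
    'SELECT Reading FROM Dict WHERE lower(Reading) LIKE lower(?)',
    ('ōSAKA',)).fetchall()
print(rows)
```

The per-row callback into Python is where the performance penalty mentioned above comes from.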

Functions

cjklib.dictionary.getAvailableDictionaries(dbConnectInst=None)

Returns a list of available dictionaries for the given database connection.

Parameter:dbConnectInst (instance) – optional instance of a DatabaseConnector
Return type:list of class
Returns:list of dictionary class objects
cjklib.dictionary.getDictionary(dictionaryName, **options)

Get a dictionary instance by dictionary name.

Parameter:dictionaryName (str) – dictionary name
Return type:instance
Returns:dictionary instance
cjklib.dictionary.getDictionaryClass(dictionaryName)

Get a dictionary class by dictionary name.

Parameter:dictionaryName (str) – dictionary name
Return type:type
Returns:dictionary class
cjklib.dictionary.getDictionaryClasses()

Gets all classes in module that implement BaseDictionary.

Return type:set
Returns:set of all classes inheriting from BaseDictionary
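
How lookup by dictionary name could work is sketched below with stand-in classes. This is not the module's implementation: the classes are minimal placeholders carrying only a PROVIDES attribute, the sketch walks subclasses rather than scanning a module, and the ValueError raised for unknown names is an assumption.

```python
class BaseDictionary(object):
    """Stand-in base class; PROVIDES names the dictionary a class serves."""
    PROVIDES = None

    def __init__(self, **options):
        self.options = options


class EDICT(BaseDictionary):
    PROVIDES = 'EDICT'


class CEDICT(BaseDictionary):
    PROVIDES = 'CEDICT'


def get_dictionary_classes(base=BaseDictionary):
    """Collect every direct or indirect subclass that names a dictionary."""
    found = set()
    for cls in base.__subclasses__():
        if cls.PROVIDES:
            found.add(cls)
        found |= get_dictionary_classes(cls)
    return found


def get_dictionary_class(name):
    """Return the class whose PROVIDES equals the given name."""
    for cls in get_dictionary_classes():
        if cls.PROVIDES == name:
            return cls
    # Assumption: the error type is chosen for this sketch only.
    raise ValueError('Not a supported dictionary: %r' % name)


def get_dictionary(name, **options):
    """Instantiate the matching class, passing the options through."""
    return get_dictionary_class(name)(**options)


d = get_dictionary('CEDICT', headword='s')
print(type(d).__name__, d.options)
```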

Classes

class cjklib.dictionary.BaseDictionary(**options)

Bases: object

Base dictionary access class. Needs to be implemented by child classes.

Initialises the BaseDictionary instance.

Parameters:
  • entryFactory – entry factory instance
  • columnFormatStrategies – column formatting strategy instances
  • headwordSearchStrategy – headword search strategy instance
  • readingSearchStrategy – reading search strategy instance
  • translationSearchStrategy – translation search strategy instance
  • mixedReadingSearchStrategy – mixed reading search strategy instance
  • databaseUrl – database connection setting in the format driver://user:pass@host/database.
  • dbConnectInst – instance of a DatabaseConnector
COLUMNS
Columns of the dictionary. Can be assigned a format strategy.
PROVIDES
Name of dictionary that is provided by this class.
classmethod available(dbConnectInst)

Returns True if the dictionary is available for the given database connection.

Parameter:dbConnectInst (instance) – instance of a DatabaseConnector
Return type:bool
Returns:True if the database exists, False otherwise.
columnFormatStrategies
Strategies for formatting columns.
getColumnFormatStrategies()
Strategies for formatting columns.
setColumnFormatStrategies(columnFormatStrategies)
class cjklib.dictionary.CEDICT(**options)

Bases: cjklib.dictionary.EDICTStyleEnhancedReadingDictionary

CEDICT dictionary access.

See also

CEDICTBuilder

Initialises the CEDICT instance. By default, both the simplified and the traditional headword forms are used for lookup.

Parameters:
  • entryFactory – entry factory instance
  • columnFormatStrategies – column formatting strategy instances
  • headwordSearchStrategy – headword search strategy instance
  • readingSearchStrategy – reading search strategy instance
  • translationSearchStrategy – translation search strategy instance
  • mixedReadingSearchStrategy – mixed reading search strategy instance
  • databaseUrl – database connection setting in the format driver://user:pass@host/database.
  • dbConnectInst – instance of a DatabaseConnector
  • headword – 's' if the simplified headword is used as default, 't' if the traditional headword is used as default, 'b' if both are tried.

Get dictionary entries with readings converted to IPA:

>>> from cjklib.dictionary import *
>>> d = CEDICT(columnFormatStrategies={
...     'Reading': format.ReadingConversion('MandarinIPA')})
>>> print ', '.join([e.Reading for e in d.getForHeadword(u'行')])
xaŋ˧˥, ɕiŋ˧˥, ɕiŋ˥˩
class cjklib.dictionary.CEDICTGR(**options)

Bases: cjklib.dictionary.EDICTStyleEnhancedReadingDictionary

CEDICT-GR dictionary access.

See also

CEDICTGRBuilder

class cjklib.dictionary.CFDICT(**options)

Bases: cjklib.dictionary.HanDeDict

CFDICT dictionary access.

See also

CFDICTBuilder

class cjklib.dictionary.EDICT(**options)

Bases: cjklib.dictionary.EDICTStyleDictionary

EDICT dictionary access.

See also

EDICTBuilder

class cjklib.dictionary.EDICTStyleDictionary(**options)

Bases: cjklib.dictionary.BaseDictionary

Access for EDICT-style dictionaries.

COLUMNS
Columns of dictionary table.
DICTIONARY_TABLE
Name of dictionary table.
READING
Reading.
READING_OPTIONS
Options for reading of dictionary entries.
classmethod available(dbConnectInst)
getAll(limit=None, orderBy=None)

Get all dictionary entries.

Parameters:
  • limit (int) – limiting number of returned entries
  • orderBy (list) – list of column names or SQLAlchemy column objects giving the order of returned entries
getFor(searchStr, limit=None, orderBy=None, **options)

Get dictionary entries whose headword, reading or translation matches the given string.

Parameters:
  • limit (int) – limiting number of returned entries
  • orderBy (list) – list of column names or SQLAlchemy column objects giving the order of returned entries

Todo

  • bug: Specifying a limit might yield fewer results than possible.
getForHeadword(headwordStr, limit=None, orderBy=None, **options)

Get dictionary entries whose headword matches the given string.

Parameters:
  • limit (int) – limiting number of returned entries
  • orderBy (list) – list of column names or SQLAlchemy column objects giving the order of returned entries

Todo

  • bug: Specifying a limit might yield fewer results than possible.
getForReading(readingStr, limit=None, orderBy=None, **options)

Get dictionary entries whose reading matches the given string.

Parameters:
  • limit (int) – limiting number of returned entries
  • orderBy (list) – list of column names or SQLAlchemy column objects giving the order of returned entries
Raises ConversionError: if the search string cannot be converted to the dictionary’s reading.

Todo

  • bug: Specifying a limit might yield fewer results than possible.
getForTranslation(translationStr, limit=None, orderBy=None, **options)

Get dictionary entries whose translation matches the given string.

Parameters:
  • limit (int) – limiting number of returned entries
  • orderBy (list) – list of column names or SQLAlchemy column objects giving the order of returned entries

Todo

  • bug: Specifying a limit might yield fewer results than possible.
version
Version (date) of the dictionary. None if not available.
class cjklib.dictionary.EDICTStyleEnhancedReadingDictionary(**options)

Bases: cjklib.dictionary.EDICTStyleDictionary

Access for EDICT-style dictionaries with enhanced reading support.

The EDICTStyleEnhancedReadingDictionary dictionary class extends cjklib.dictionary.EDICT by:

class cjklib.dictionary.HanDeDict(**options)

Bases: cjklib.dictionary.CEDICT

HanDeDict dictionary access.

See also

HanDeDictBuilder