:mod:`cjklib.dictionary` --- High level dictionary access
=========================================================

.. versionadded:: 0.3

.. automodule:: cjklib.dictionary

This module provides classes for easy access to well-known CJK dictionaries.
Entries can be looked up by headword, reading or translation.

Dictionary sources yield less structured information than the other data
sources exposed in this library. To accommodate this, the module offers a
flexible system of exchangeable entry factories, formatting strategies and
search strategies.

.. inheritance-diagram:: cjklib.dictionary

Examples
--------
The following examples show how to use this module:

- Create a dictionary instance:

    >>> from cjklib.dictionary import CEDICT
    >>> d = CEDICT()

- Get dictionary entries by reading:

    >>> [e.HeadwordSimplified for e in
    ...     d.getForReading('zhi dao', reading='Pinyin', toneMarkType='numbers')]
    [u'制导', u'执导', u'指导', u'直到', u'直捣', u'知道']

- Change a search strategy (here search for a reading without tones):

    >>> from cjklib.dictionary import search
    >>> d = CEDICT(readingSearchStrategy=search.SimpleWildcardReading())
    >>> d.getForReading('nihao', reading='Pinyin', toneMarkType='numbers')
    []
    >>> d = CEDICT(readingSearchStrategy=search.TonelessWildcardReading())
    >>> d.getForReading('nihao', reading='Pinyin', toneMarkType='numbers')
    [EntryTuple(HeadwordTraditional=u'你好', HeadwordSimplified=u'你好', Reading=u'nǐ hǎo', Translation=u'/hello/hi/how are you?/')]

- Apply a formatting strategy to remove all initial and final slashes on
  CEDICT translations:

    >>> from cjklib.dictionary import *
    >>> class TranslationFormatStrategy(format.Base):
    ...     def format(self, string):
    ...         return string.strip('/')
    ...
    >>> d = CEDICT(
    ...     columnFormatStrategies={'Translation': TranslationFormatStrategy()})
    >>> d.getFor(u'东京')
    [EntryTuple(HeadwordTraditional=u'東京', HeadwordSimplified=u'东京', Reading=u'Dōng jīng', Translation=u'Tōkyō, capital of Japan')]

- A simple dictionary lookup tool:

    >>> from cjklib.dictionary import *
    >>> from cjklib.reading import ReadingFactory
    >>> def search(string, reading=None, dictionary='CEDICT'):
    ...     # guess reading dialect
    ...     options = {}
    ...     if reading:
    ...         f = ReadingFactory()
    ...         opClass = f.getReadingOperatorClass(reading)
    ...         if hasattr(opClass, 'guessReadingDialect'):
    ...             options = opClass.guessReadingDialect(string)
    ...     # search
    ...     d = getDictionary(dictionary, entryFactory=entry.UnifiedHeadword())
    ...     result = d.getFor(string, reading=reading, **options)
    ...     # print
    ...     for e in result:
    ...         print e.Headword, e.Reading, e.Translation
    ...
    >>> search('_taijiu', 'Pinyin')
    茅台酒（茅臺酒） máo tái jiǔ /maotai (a Chinese liquor)/CL:杯[bei1],瓶[ping2]/

.. index::
   pair: entry; factory

Entry factories
---------------
Similar to SQL interfaces, entries can be returned in different forms. An
*entry factory* takes care of preparing the output. Predefined factories exist
for this: the very basic :class:`cjklib.dictionary.entry.Tuple` returns each
entry as a plain tuple of its columns, while the more commonly used
:class:`cjklib.dictionary.entry.NamedTuple` returns tuple objects whose
columns are also accessible by attribute.
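
The difference between the two forms can be pictured with the standard library
alone. The following is a minimal sketch, not cjklib's implementation:
``EntryTuple`` is built here with ``collections.namedtuple`` as a stand-in, and
the row is copied from the CEDICT example above.

.. code-block:: python

    from collections import namedtuple

    # Column names as they appear in the CEDICT examples of this document.
    COLUMNS = ('HeadwordTraditional', 'HeadwordSimplified', 'Reading',
               'Translation')

    # A plain tuple: columns are only reachable by position.
    row = (u'你好', u'你好', u'nǐ hǎo', u'/hello/hi/how are you?/')

    # Stand-in for a named-tuple factory (assumption: built with namedtuple).
    EntryTuple = namedtuple('EntryTuple', COLUMNS)
    entry = EntryTuple(*row)

    # The same column, once by position and once by attribute.
    reading_by_index = entry[2]
    reading_by_name = entry.Reading

A named tuple remains an ordinary tuple, so code written against positional
access keeps working when the factory is switched.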

.. index::
   pair: formatting; strategy

Formatting strategies
---------------------
As reading formats vary and many readings can be converted into each other,
a *formatting strategy* can be applied to return the expected format.
:class:`cjklib.dictionary.format.ReadingConversion` provides an easy way
to convert the reading given by the dictionary into a user-defined reading.
Other columns can also be formatted by applying a strategy;
see the translation formatting example in the Examples section above.

Strategies can be applied at two levels. A strategy for a single cell is given
as a mapping from the cell name to the strategy, while a strategy that operates
on the entire result entry is given as a mapping from ``None`` to the strategy.
In the latter case the formatting strategy needs to deal with the
dictionary-specific entry structure:

    >>> from cjklib.dictionary import *
    >>> # TranslationFormatStrategy is the class defined in the Examples section
    >>> d = CEDICT(columnFormatStrategies={
    ...     'Translation': TranslationFormatStrategy()})
    >>> d = CEDICT(columnFormatStrategies={
    ...     None: format.NonReadingEntityWhitespace()})

Formatting strategies can be chained together using the
:class:`cjklib.dictionary.format.Chain` class.
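
How cell-level strategies, an entry-level strategy and chaining fit together
can be sketched without cjklib. Everything below is hypothetical and for
illustration only: the class names, the ``apply_strategies`` helper, the order
in which the two levels are applied, and the sample rows are assumptions, not
the library's code.

.. code-block:: python

    from collections import namedtuple

    Entry = namedtuple('Entry', ['Headword', 'Reading', 'Translation'])


    class StripSlashes(object):
        """Cell-level strategy: remove leading and trailing slashes."""
        def format(self, string):
            return string.strip('/')


    class SlashesToSemicolons(object):
        """Cell-level strategy: turn remaining slashes into '; '."""
        def format(self, string):
            return string.replace('/', '; ')


    class ChainSketch(object):
        """Apply several strategies in order, passing each result on."""
        def __init__(self, *strategies):
            self.strategies = strategies

        def format(self, value):
            for strategy in self.strategies:
                value = strategy.format(value)
            return value


    class DropRedundantReading(object):
        """Entry-level strategy: blank a reading that repeats the headword."""
        def format(self, entry):
            if entry.Reading == entry.Headword:
                return entry._replace(Reading=u'')
            return entry


    def apply_strategies(entry, strategies):
        """Run column strategies, then the one keyed by None (assumed order)."""
        for column, strategy in strategies.items():
            if column is not None:
                value = strategy.format(getattr(entry, column))
                entry = entry._replace(**{column: value})
        if None in strategies:
            entry = strategies[None].format(entry)
        return entry


    strategies = {
        'Translation': ChainSketch(StripSlashes(), SlashesToSemicolons()),
        None: DropRedundantReading(),
    }
    formatted = apply_strategies(
        Entry(u'你好', u'nǐ hǎo', u'/hello/hi/how are you?/'), strategies)

The cell-level classes only ever see a string, whereas the entry-level class
must know which columns the entry has, which is why it is tied to a particular
dictionary's structure.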

.. index::
   pair: search; strategy

Search strategies
-----------------
Searching in natural language data is a difficult process and depends heavily
on the use case at hand. This task is handled by *search strategies*, which
account for the more complex parts of this module. Strategies exist for the
three main parts of dictionary entries: headword, reading and translation.
Additionally, mixed searching for a headword partially expressed by reading
information is supported and can augment the basic reading search. Several
search strategies exist, offering basic or more sophisticated routines. For
example, wildcard searching is offered on top of many basic strategies; by
default the placeholder ``'_'`` matches a single character and ``'%'`` matches
zero or more characters.
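
The meaning of the two default placeholders can be illustrated by translating a
wildcard pattern into a regular expression. This is a sketch under the
assumption that matching is done in Python; the function name
``wildcard_to_regex`` is hypothetical and the library itself may well delegate
the matching to the database instead.

.. code-block:: python

    import re


    def wildcard_to_regex(pattern, single='_', multiple='%'):
        """Compile a wildcard pattern into an anchored regular expression."""
        parts = []
        for char in pattern:
            if char == single:
                parts.append('.')       # exactly one character
            elif char == multiple:
                parts.append('.*')      # zero or more characters
            else:
                parts.append(re.escape(char))
        return re.compile(''.join(parts) + r'\Z', re.UNICODE)


    def filter_headwords(pattern, headwords):
        """Return the headwords that match the wildcard pattern as a whole."""
        regex = wildcard_to_regex(pattern)
        return [word for word in headwords if regex.match(word)]


    headwords = [u'知道', u'知道了', u'不知道', u'指导']
    one_char = filter_headwords(u'知_', headwords)
    any_prefix = filter_headwords(u'%知道', headwords)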

.. inheritance-diagram:: cjklib.dictionary.search

.. index::
   triple: headword; search; strategy

Headword search strategies
^^^^^^^^^^^^^^^^^^^^^^^^^^
Searching for headwords is the simplest of the three. Exact searches are
provided by class :class:`cjklib.dictionary.search.Exact`. By default, class
:class:`cjklib.dictionary.search.Wildcard` is employed, which offers
wildcard searches.

.. index::
   triple: reading; search; strategy

Reading search strategies
^^^^^^^^^^^^^^^^^^^^^^^^^
Readings have more complex and unique representations. Several classes are
provided here: :class:`cjklib.dictionary.search.Exact` can again be used
for exact matches, and :class:`cjklib.dictionary.search.Wildcard`
for wildcard searches. :class:`cjklib.dictionary.search.SimpleReading`
and :class:`cjklib.dictionary.search.SimpleWildcardReading` provide
similar searching for transcriptions as found, for example, in CEDICT.
A more complex search is provided by
:class:`cjklib.dictionary.search.TonelessWildcardReading`,
which supports searching for readings given without tonal information.
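
The idea behind a toneless search can be sketched as a normalisation step
applied to both the query and the stored reading. The sketch assumes Pinyin
with tone numbers, as in the CEDICT source data, and discards every tone,
including any given in the query; the helper names are hypothetical and the
actual strategy is more elaborate.

.. code-block:: python

    import re


    def strip_tones(reading):
        """Drop tone digits and whitespace: 'ni3 hao3' becomes 'nihao'."""
        return re.sub(r'[1-5\s]', '', reading).lower()


    def search_toneless(query, entries):
        """Return all (headword, reading) pairs matching regardless of tone."""
        key = strip_tones(query)
        return [entry for entry in entries if strip_tones(entry[1]) == key]


    entries = [
        (u'你好', 'ni3 hao3'),
        (u'制导', 'zhi4 dao3'),
        (u'执导', 'zhi2 dao3'),
        (u'指导', 'zhi3 dao3'),
        (u'直到', 'zhi2 dao4'),
        (u'知道', 'zhi1 dao4'),
    ]
    hello = search_toneless('nihao', entries)
    zhidao = search_toneless('zhi dao', entries)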

.. index::
   triple: translation; search; strategy

Translation search strategies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
A basic search is provided by
:class:`cjklib.dictionary.search.SingleEntryTranslation`, which
finds an exact entry in a list of entries separated by slashes (``/``). More
flexible searching is provided by
:class:`cjklib.dictionary.search.SimpleTranslation` and
:class:`cjklib.dictionary.search.SimpleWildcardTranslation`, which take
into account additional information placed in parentheses.
More specialized implementations of these classes exist, adapted to the
formats found in the dictionaries *CEDICT* and *HanDeDict*.
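
The difference between the two kinds of matching can be sketched on a
slash-separated translation string. The functions below are hypothetical
simplifications written for this page: they match in Python rather than in the
database and treat anything in parentheses as optional additional information.

.. code-block:: python

    import re


    def translation_items(translation):
        """Split '/hello/hi/' into ['hello', 'hi']."""
        return [item for item in translation.split('/') if item]


    def matches_single_entry(query, translation):
        """True if the query equals one slash-separated item exactly."""
        return query in translation_items(translation)


    def matches_simple(query, translation):
        """Like matches_single_entry, but also ignore parenthesised text."""
        for item in translation_items(translation):
            bare = re.sub(r'\s*\([^)]*\)', '', item).strip()
            if query in (item, bare):
                return True
        return False


    maotai = u'/maotai (a Chinese liquor)/CL:杯[bei1],瓶[ping2]/'
    exact_hit = matches_single_entry(u'maotai', maotai)
    simple_hit = matches_simple(u'maotai', maotai)

With the exact variant the query ``maotai`` misses the entry because the item
also carries the parenthesised explanation; the simple variant finds it.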

More complex strategies can be implemented by extending the underlying
table in the database, e.g. using the *full text search* capabilities of the
database server. One popular approach is to use stemming algorithms to cope
with inflections by reducing a word to its root form.

.. index::
   triple: mixed; reading; search
   triple: mixed; search; strategy

Mixed reading search strategies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Special support for a string with mixed reading and headword entities is
provided by *mixed reading search strategies*. For example, ``'dui4 不 qi3'``
finds all entries with three-character headwords whose middle character is
``'不'``, whose first character is read ``'dui4'`` and whose last character is
read ``'qi3'``.
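
The matching rule from this example can be sketched position by position. The
sketch makes simplifying assumptions that the library does not need: query
tokens are separated by whitespace, each character has exactly one syllable,
and only the basic CJK Unified Ideographs block counts as a headword
character. All names are hypothetical.

.. code-block:: python

    def is_cjk(token):
        """Rough test for a single character of the basic CJK block."""
        return len(token) == 1 and u'\u4e00' <= token <= u'\u9fff'


    def matches_mixed(query, headword, reading):
        """Compare each query token with the character or syllable at its
        position, depending on whether the token is a character."""
        tokens = query.split()
        syllables = reading.split()
        if not (len(tokens) == len(headword) == len(syllables)):
            return False
        for token, char, syllable in zip(tokens, headword, syllables):
            expected = char if is_cjk(token) else syllable
            if token != expected:
                return False
        return True


    entries = [
        (u'对不起', 'dui4 bu5 qi3'),
        (u'对得起', 'dui4 de5 qi3'),
        (u'了不起', 'liao3 bu5 qi3'),
    ]
    found = [headword for headword, reading in entries
             if matches_mixed(u'dui4 不 qi3', headword, reading)]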

Case insensitivity & Collations
-------------------------------
Case-insensitive searching is done through collations in the underlying
database system or, for databases without collation support, by employing the
function ``lower()``. A default case-independent collation is chosen in the
appropriate build method in :mod:`cjklib.build.builder`.

By default, *SQLite* has no Unicode support for string operations. Optionally,
the *ICU* library can be compiled in for handling alphabetic non-ASCII
characters. The *DatabaseConnector* can register its own Unicode functions if
ICU support is missing. Queries with ``LIKE`` will then use the function
``lower()``. This compatibility mode has a negative impact on performance and,
as it is not needed for dictionaries like EDICT or CEDICT, it is disabled by
default.
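
What registering a Unicode-aware function amounts to can be sketched with
Python's ``sqlite3`` module. This is an illustration only and not the
*DatabaseConnector* code: the function name ``unicode_lower``, the table layout
and the rows are assumptions made for this example.

.. code-block:: python

    import sqlite3


    def unicode_lower(value):
        """Lower-case with Python's Unicode rules; pass NULL through."""
        return value.lower() if value is not None else None


    connection = sqlite3.connect(':memory:')
    # Make the Python function callable from SQL under a hypothetical name.
    connection.create_function('unicode_lower', 1, unicode_lower)
    connection.execute('CREATE TABLE Entries (Headword TEXT, Reading TEXT)')
    connection.executemany(
        'INSERT INTO Entries VALUES (?, ?)',
        [(u'東京', u'Dōng jīng'), (u'奥地利', u'Ào dì lì')])


    def search_reading(pattern):
        """LIKE search that folds case on both sides before comparing."""
        cursor = connection.execute(
            'SELECT Headword FROM Entries'
            ' WHERE unicode_lower(Reading) LIKE unicode_lower(?)',
            (pattern,))
        return [row[0] for row in cursor]


    austria = search_reading(u'ào%')
    tokyo = search_reading(u'DŌNG%')

Because the function is called for every candidate row, a query written this
way cannot benefit from an ordinary index on the column, which is one way to
picture the performance cost mentioned above.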


Functions
----------

.. autofunction:: getAvailableDictionaries

.. autofunction:: getDictionary

.. autofunction:: getDictionaryClass

.. autofunction:: getDictionaryClasses


Classes
--------

.. autoclass:: BaseDictionary
   :show-inheritance:
   :members:
   :undoc-members:
   

.. autoclass:: CEDICT
   :show-inheritance:
   :members:
   :undoc-members:
   

   Get dictionary entries with the reading given in IPA:

        >>> from cjklib.dictionary import *
        >>> d = CEDICT(columnFormatStrategies={
        ...     'Reading': format.ReadingConversion('MandarinIPA')})
        >>> print ', '.join([l.Reading for l in d.getForHeadword(u'行')])
        xaŋ˧˥, ɕiŋ˧˥, ɕiŋ˥˩


.. autoclass:: CEDICTGR
   :show-inheritance:
   :members:
   :undoc-members:
   

.. autoclass:: CFDICT
   :show-inheritance:
   :members:
   :undoc-members:
   

.. autoclass:: EDICT
   :show-inheritance:
   :members:
   :undoc-members:
   

.. autoclass:: EDICTStyleDictionary
   :show-inheritance:
   :members:
   :undoc-members:
   

.. autoclass:: EDICTStyleEnhancedReadingDictionary
   :show-inheritance:
   :members:
   :undoc-members:
   

.. autoclass:: HanDeDict
   :show-inheritance:
   :members:
   :undoc-members: