To do
Todo
- Lang: On multiple occurrences of the same radical (possibly in
different forms): which one should be chosen? Implement a way to reject
unwanted forms.
(The original entry is located in library/cjklib.build.builder.rst, line 55 and can be found here.)
Todo
Lang: Find and implement a good algorithm to reject unwanted
forms instead of choosing one at random. See the following list:
>>> from cjklib import characterlookup
>>> cjk = characterlookup.CharacterLookup('T')
>>> for char in cjk.db.selectSoleValue('CharacterRadicalResidualStrokeCount',
...         'ChineseCharacter', distinctValues=True):
...     try:
...         entries = cjk.getCharacterKangxiRadicalResidualStrokeCount(char, 'C')
...         lastEntry = entries[0]
...         for entry in entries[1:]:
...             # print if diff. radical forms and diff. residual stroke count
...             if lastEntry[0] != entry[0] and lastEntry[2] != entry[2]:
...                 print char
...                 break
...             lastEntry = entry
...     except:
...         pass
...
渌
犾
玺
珏
缧
>>> cjk.getCharacterKangxiRadicalResidualStrokeCount(u'缧')
[(u'糸', 0, u'⿻', 0, 8), (u'纟', 0, u'⿰', 0, 11)]
(The original entry is located in library/cjklib.build.builder.rst, line 53 and can be found here.)
Todo
- Fix: Optimize insert: use a transaction that disables autocommit and
consider passing data all at once, which requires proper handling of row
indices.
(The original entry is located in library/cjklib.build.builder.rst, line 12 and can be found here.)
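The batching idea in this entry can be sketched with Python's standard sqlite3 module. This is only an illustration under assumptions: cjklib's builders use their own database layer, and the table name Demo, its columns, and the helper insert_rows are hypothetical. The handling of row indices mentioned in the entry is not covered here.

```python
import sqlite3


def insert_rows(connection, rows):
    """Insert all rows in one transaction instead of committing per row."""
    # Using the connection as a context manager commits once on success
    # and rolls back if any insert fails.
    with connection:
        connection.executemany(
            "INSERT INTO Demo (ChineseCharacter, StrokeCount) VALUES (?, ?)",
            rows)


# Hypothetical demo table in an in-memory database.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE Demo (ChineseCharacter TEXT, StrokeCount INTEGER)")
insert_rows(connection, [("一", 1), ("二", 2), ("三", 3)])
row_count = connection.execute("SELECT COUNT(*) FROM Demo").fetchone()[0]
```

Passing the whole list to executemany() keeps the per-row Python overhead low, and the single surrounding transaction avoids one commit per inserted row.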
Todo
- Impl: Check if all glyphs in LocaleCharacterGlyph are included.
(The original entry is located in library/cjklib.build.builder.rst, line 9 and can be found here.)
Todo
- Impl: For implementation as view, we need the concept of runtime
dependency. All DEPENDS are actually BUILD_DEPENDS, while the
DEPENDS here will be a runtime dependency.
(The original entry is located in library/cjklib.build.builder.rst, line 10 and can be found here.)
Todo
- Fix: Word regex is specialised for HanDeDict.
- Fix: Using a row_id for joining instead of Headword(Traditional) and
Reading would maybe speed up table joins. Needs a workaround to
include multiple rows for one actual headword entry though.
(The original entry is located in library/cjklib.build.builder.rst, line 13 and can be found here.)
Todo
- bug: “Prefer” system does not work for additional builders
(The original entry is located in library/cjklib.build.cli.rst, line 6 and can be found here.)
Todo
Impl: Incorporate stroke lookup (bigram) techniques.
Impl: How to handle character forms (either decomposition or stroke
order) that can only be found as a component in other characters?
We already mark them by flagging them with an ‘S’.
Impl: Add option to component decomposition methods to stop on Kangxi
radical forms without breaking further down beyond those.
Impl: Further character domains for Japanese, Cantonese, Korean,
Vietnamese
Impl: There are more than 800 characters that have compatibility
mappings whose targets have the same semantics. Those characters
do not need their own data for stroke order and decomposition, but can
share it with their targets:
>>> import unicodedata
>>> unicodedata.normalize('NFD', u'嗀')
u'嗀'
(The original entry is located in library/cjklib.characterlookup.rst, line 9 and can be found here.)
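As a rough sketch of the compatibility-mapping item above, the following Python 3 snippet collects characters whose canonical decomposition is a single different character, so that they could share stroke order and decomposition data with that target. The scanned ranges (the two CJK Compatibility Ideographs blocks) and the names canonical_target and shared are assumptions made for illustration, not part of cjklib, and the exact count depends on the Unicode version shipped with the interpreter.

```python
import unicodedata


def canonical_target(char):
    """Return the single character `char` canonically decomposes to, or None."""
    decomposed = unicodedata.normalize("NFD", char)
    if len(decomposed) == 1 and decomposed != char:
        return decomposed
    return None


# Assumed scan ranges: CJK Compatibility Ideographs and its Supplement.
COMPATIBILITY_RANGES = [(0xF900, 0xFAFF), (0x2F800, 0x2FA1F)]

# Maps each compatibility character to the target it could share data with.
shared = {}
for start, end in COMPATIBILITY_RANGES:
    for code_point in range(start, end + 1):
        target = canonical_target(chr(code_point))
        if target is not None:
            shared[chr(code_point)] = target
```

For example, U+FA0D is mapped to U+55C0 by this table, while unassigned code points and the few unified ideographs inside these blocks are skipped because they normalize to themselves.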
Todo
- Lang: Clarify on characters classified under a given radical
but without any proper radical glyph found as component.
- Lang: Clarify on different radical glyphs for the same radical
form. At best this method should return one and only one radical
form (glyph).
- Impl: Give the Unicode radical form and not the equivalent
character form in the relevant table as to always return the pure
radical form (also avoids duplicates). Then state:
If the included component has an appropriate
Unicode radical form or Unicode radical variant, then this
form is returned. In either case the radical form can be an
ordinary character.
(The original entry is located in library/cjklib.characterlookup.rst, line 309 and can be found here.)
Todo
- Docu: Write about different kinds of variants
- Impl: Give a source on variant information as information can
contradict itself
(http://www.unicode.org/reports/tr38/tr38-5.html#N10211). See
呆 (U+5446) which has one form each for semantic and specialised
semantic, each derived from a different source. Change also in
getAllCharacterVariants().
- Lang: What is the difference between Z-variants and compatible
variants? Some links between two characters are bidirectional,
some are not. Is there any rule?
(The original entry is located in library/cjklib.characterlookup.rst, line 421 and can be found here.)
Todo
- Impl: Table of same character glyphs, including special radical
forms (e.g. 言 and 訁).
- Data: Adopt locale dependent glyph for parent characters
(e.g. 鬼 in 隗 愧 嵬).
- Data: Use radical forms and radical variant forms instead of
equivalent characters in decomposition data. Mapping loses
information.
- Lang: By default we get the equivalent character for a radical
form. In some cases these equivalent characters will be only
abstractly related to the given radical form (e.g. being the main
radical form), so that the result set will be too big and doesn’t
reflect the original query. Set up a table including only strict
visual relations between radical forms and equivalent characters.
Alternatively restrict decomposition data to only include radical
forms if appropriate, so there would be no need for conversion.
- Fix: Radical equivalent forms should be included independent of
the chosen locale. E.g. u'⻔' for u'门'.
(The original entry is located in library/cjklib.characterlookup.rst, line 459 and can be found here.)
Todo
- Docu: Write about how Unihan maps characters to a Kangxi radical.
Especially Chinese simplified characters.
- Lang: 6954 characters have no Kangxi radical. Provide integration
for these (SELECT COUNT(*) FROM Unihan
WHERE kRSUnicode IS NOT NULL AND kRSKangxi IS NULL;).
(The original entry is located in library/cjklib.characterlookup.rst, line 515 and can be found here.)
Todo
- Lang: Check if radicals for which multiple radical forms exist
include a simplified form or other variation (e.g. ⻆, ⻝, ⺐).
There are radicals for which a Chinese simplified character
equivalent exists and that is mapped to a different radical under
Unicode.
(The original entry is located in library/cjklib.characterlookup.rst, line 668 and can be found here.)
Todo
- Lang: Narrow locales, not all variant forms are valid under all
locales.
(The original entry is located in library/cjklib.characterlookup.rst, line 727 and can be found here.)
Todo
- Impl: Add option to return converted entities even if conversion
fails for some entities. Represent those with None.
(The original entry is located in library/cjklib.characterlookup.rst, line 801 and can be found here.)
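One possible shape for such an option is sketched below. It is not cjklib code: the helper convert_entities_lenient, the toy converter, and the assumption that a failed conversion raises ValueError are all hypothetical.

```python
def convert_entities_lenient(entities, convert_entity):
    """Convert each entity, substituting None where conversion fails."""
    converted = []
    for entity in entities:
        try:
            converted.append(convert_entity(entity))
        except ValueError:
            # Assumption: the converter signals failure with ValueError.
            converted.append(None)
    return converted


# Toy lookup table standing in for a real reading conversion.
TOY_TABLE = {"ni3": "nǐ", "hao3": "hǎo"}


def toy_convert(entity):
    """Convert one entity or raise ValueError if it is unknown."""
    try:
        return TOY_TABLE[entity]
    except KeyError:
        raise ValueError("cannot convert %r" % entity)


result = convert_entities_lenient(["ni3", "xyz", "hao3"], toy_convert)
```

Keeping None at the position of the failed entity preserves the alignment between input and output lists, which a caller needs in order to report which entity could not be converted.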
Todo
- Lang: Add stroke order source to stroke order data so that in
general different and contradicting stroke order information
can be given. The user then could prefer several sources
that in the order given would be queried.
(The original entry is located in library/cjklib.characterlookup.rst, line 958 and can be found here.)
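The preference order described above could work roughly as in the following sketch. The function lookup_stroke_order, the source names, and the stroke order strings are made-up placeholders for illustration; they do not reflect cjklib's actual data or API.

```python
def lookup_stroke_order(char, data_by_source, preferred_sources):
    """Return (source, stroke order) from the first preferred source
    that has an entry for `char`, or None if no source has one."""
    for source in preferred_sources:
        stroke_order = data_by_source.get(source, {}).get(char)
        if stroke_order is not None:
            return source, stroke_order
    return None


# Placeholder data: two sources that disagree on one character.
SAMPLE_DATA = {
    "sourceA": {"十": "H-S"},
    "sourceB": {"十": "S-H", "人": "P-N"},
}

first = lookup_stroke_order("十", SAMPLE_DATA, ["sourceA", "sourceB"])
fallback = lookup_stroke_order("人", SAMPLE_DATA, ["sourceA", "sourceB"])
missing = lookup_stroke_order("木", SAMPLE_DATA, ["sourceA", "sourceB"])
```

Returning the source together with the stroke order lets the caller see which of the contradicting entries was used.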
Todo
- Impl: Implement means to check if the component is really not
found, or if our data is just insufficient.
(The original entry is located in library/cjklib.characterlookup.rst, line 1047 and can be found here.)
Todo
- Fix: Conversion without tones will mostly break as the target
reading doesn’t support missing tone information. Preferring
‘diacritic’ version (Pinyin/CantoneseYale) over ‘numbers’ as
tone marks in the absence of any marks would solve this issue
(forcing fifth tone), but would mean we prefer possible false
information over the less specific estimation of the given
entities as missing tonal information.
(The original entry is located in library/cjklib.cjknife.rst, line 83 and can be found here.)
Todo
- Impl: Once a mapping of similar radical forms exists (e.g. 言 and 訁),
include it here.
(The original entry is located in library/cjklib.cjknife.rst, line 150 and can be found here.)
Todo
- Impl: Once a mapping of similar radical forms exists (e.g. 言 and 訁),
include it here.
(The original entry is located in library/cjklib.cjknife.rst, line 203 and can be found here.)
Todo
- Lang: Implementation is too simple to cover all aspects.
(The original entry is located in library/cjklib.cjknife.rst, line 260 and can be found here.)
Todo
- bug: Specifying a limit might yield fewer results than
possible.
(The original entry is located in library/cjklib.dictionary.rst, line 62 and can be found here.)
Todo
- bug: Specifying a limit might yield fewer results than
possible.
(The original entry is located in library/cjklib.dictionary.rst, line 78 and can be found here.)
Todo
- bug: Specifying a limit might yield fewer results than
possible.
(The original entry is located in library/cjklib.dictionary.rst, line 96 and can be found here.)
Todo
- bug: Specifying a limit might yield fewer results than
possible.
(The original entry is located in library/cjklib.dictionary.rst, line 112 and can be found here.)
(The original entry is located in library/cjklib.dictionary.install.rst, line 25 and can be found here.)
Todo
- Impl: Allow simple FTS3 searching as build support is already provided.
(The original entry is located in library/cjklib.dictionary.search.rst, line 6 and can be found here.)
Todo
- Fix: How to handle non-reading entities?
(The original entry is located in library/cjklib.dictionary.search.rst, line 10 and can be found here.)
Todo
- Impl: Support readings with toneless base forms but without support
for missing tone
(The original entry is located in library/cjklib.dictionary.search.rst, line 17 and can be found here.)
Todo
- Impl: What about hiding of inner classes?
_checkSpecialOperators()
method is called for internal converters and for external ones
delivered by
createReadingConverter().
Latter method doesn’t return internal cached copies though, but
creates new instances.
ReadingOperator also gets
copies from ReadingFactory objects for internal instances.
Sharing saves memory but changing one object
will affect all other objects using this instance.
- Impl: General reading options given for a converter with **options
need to be used when creating an operator. How should errors be raised to
keep the user from specifying an operator twice, once via options and once
as a concrete instance (similar to sourceOptions and targetOptions)?
(The original entry is located in library/cjklib.reading.rst, line 16 and can be found here.)
Todo
- Impl: Make parameters fromReading, toReading optional if only
one conversion direction is given. Same for
convertEntities().
(The original entry is located in library/cjklib.reading.converter.rst, line 55 and can be found here.)
Todo
- Impl: Strict mode for tone abbreviating spellings. Raise
AmbiguousConversionError, e.g. raise on a which could be
.a or a.
- Impl: Add option to remove hyphens, “A Grammar of Spoken Chinese,
p. xxii”, Conversion to Pinyin can use that.
(The original entry is located in library/cjklib.reading.converter.GRDialectConverter.rst, line 33 and can be found here.)
Todo
- Impl: Two different methods for tone sandhi and coarticulation
effects?
- Lang: Support for Erhua in mapping.
(The original entry is located in library/cjklib.reading.converter.PinyinIPAConverter.rst, line 13 and can be found here.)
Todo
- Lang: What to do on several following neutral tones?
(The original entry is located in library/cjklib.reading.converter.PinyinIPAConverter.rst, line 100 and can be found here.)
Todo
- Impl: Optimise decompose() so that it incorporates segment() and prunes
the tree while it is created. Would this yield a significant
improvement, though? It would at least be O(n).
(The original entry is located in library/cjklib.reading.operator.rst, line 11 and can be found here.)
Todo
- Lang: Shed more light on representations of tones in IPA.
- Impl: Get all diacritics used in IPA as tones for
TONE_MARK_REGEX.
- Fix: What about CompositionError? All romanisations raise it, but
they have a distinct set of characters that belong to the reading.
(The original entry is located in library/cjklib.reading.operator.rst, line 26 and can be found here.)
Todo
- Impl: Place diacritics on main vowel, derive from IPA
representation.
(The original entry is located in library/cjklib.reading.operator.rst, line 125 and can be found here.)
Todo
- Lang: Shed more light on tone sandhi in Cantonese language.
- Impl: Implement diacritics for Cantonese Tones. On which part of the
syllable should they be placed. Document.
- Lang: Binyām 變音
- Impl: What are the semantics of non-level tones given for unreleased
stop finals? Take high rising Binyam into account.
(The original entry is located in library/cjklib.reading.operator.CantoneseIPAOperator.rst, line 10 and can be found here.)
Todo
- Impl: Finals ing, ik, ung, uk, eun, eut, a differ from other
finals with same vowels. What semantics/view do we want to
provide on the syllable parts?
(The original entry is located in library/cjklib.reading.operator.CantoneseYaleOperator.rst, line 99 and can be found here.)
Todo
- Lang: Place the tone mark on the first character of the nucleus?
(The original entry is located in library/cjklib.reading.operator.CantoneseYaleOperator.rst, line 133 and can be found here.)
Todo
- Impl: Initial, medial, head, ending (ending1, ending2=l?)
- Lang: Y.R. Chao uses particle and interjection ㄝ è. For more see
‘Mandarin Primer’, Vocabulary and Index, pp. 301.
- Impl: Implement Erhua forms as stated in W. Simon: A Beginner’s
Chinese-English Dictionary.
- Impl: Implement a GRIPAConverter once IPA values are obtained for
the PinyinIPAConverter. GRIPAConverter can work around missing Erhua
conversion to Pinyin.
- Lang: Special rule for non-Chinese names with initial r- to be
transcribed with an r- cited by Ching-song Gene Hsiao: A Manual of
Transcription Systems For Chinese, 中文拼音手册. Far Eastern
Publications, Yale University, New Haven, Connecticut, 1985,
ISBN 0-88710-141-0.
(The original entry is located in library/cjklib.reading.operator.GROperator.rst, line 9 and can be found here.)
Todo
- Lang: tz is currently mapped to .tzy. Character 子 though
generally has 3rd tone, which then should be tzyy or
.tzyy. See ‘A Grammar of Spoken Chinese’, p. 36
(“-.tzy (which we abbreviate as -tz)”) and p. 55
(“suffix -tz (<tzyy)”)
(The original entry is located in library/cjklib.reading.operator.GROperator.rst, line 143 and can be found here.)
Todo
- Impl: Both options 'grRhotacisedFinalApostrophe' and
'grSyllableSeparatorApostrophe' can be set independently as
the former one should only be found before an l and the
latter mostly before vowels.
(The original entry is located in library/cjklib.reading.operator.GROperator.rst, line 289 and can be found here.)
Todo
- Impl: Finals ing, ik, ung, uk differ from other finals with
same vowels. What semantics/view do we want to provide on the
syllable parts?
(The original entry is located in library/cjklib.reading.operator.JyutpingOperator.rst, line 57 and can be found here.)
Todo
- Impl: Punctuation marks in isFormattingEntity() and
getFormattingEntities(). Then change
PinyinBrailleConverter.convertEntitySequence() to use these methods.
(The original entry is located in library/cjklib.reading.operator.MandarinBrailleOperator.rst, line 10 and can be found here.)
Todo
- Impl: ISO 7098 asks for conversion of 。、·「」 to .,-«». What
about ,?《》:-? Implement a method for conversion to be
optionally used.
- Impl: Special marker for neutral tone: ‘mȧ’ (u'm\u0227', reported by
Ching-song Gene Hsiao: A Manual of Transcription Systems For
Chinese, 中文拼音手册. Far Eastern Publications, Yale University,
New Haven, Connecticut, 1985, ISBN 0-88710-141-0. Seems like
left over from Pinjin, 1956), and ‘·ma’ (u'\xb7ma', check!:
现代汉语词典(第5版)[Xiàndài Hànyǔ Cídiǎn 5. Edition].
商务印书馆 [Shāngwù Yìnshūguǎn], Beijing, 2005, ISBN 7-100-04385-9.)
- Impl: Consider handling *nue and *lue.
(The original entry is located in library/cjklib.reading.operator.PinyinOperator.rst, line 12 and can be found here.)
Todo
- Fix: Don’t raise a ValueError here (delayed), raise an Exception
directly in the constructor. See also WadeGilesOperator.
(The original entry is located in library/cjklib.reading.operator.PinyinOperator.rst, line 223 and can be found here.)
Todo
- Lang: Asterisk (*) marking the entering tone (入聲): e.g. chio²*
and chüeh²* for 覺 used by Giles (A Chinese-English Dictionary,
second edition, 1912).
(The original entry is located in library/cjklib.reading.operator.WadeGilesOperator.rst, line 9 and can be found here.)
Todo
- Impl: Raise value error on invalid values for diacriticE,
zeroFinal, umlautU
(The original entry is located in library/cjklib.reading.operator.WadeGilesOperator.rst, line 56 and can be found here.)
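A minimal sketch of validating these options at construction time follows. The class OptionCheckingOperator is hypothetical, and the allowed values listed for diacriticE, zeroFinal and umlautU are assumptions that should be checked against the real WadeGilesOperator before reuse.

```python
class OptionCheckingOperator(object):
    """Hypothetical operator that validates its options when constructed."""

    # Assumed allowed values; the first one of each tuple is the default.
    ALLOWED = {
        "diacriticE": (u"ê", u"e"),
        "zeroFinal": (u"ŭ", u"u"),
        "umlautU": (u"ü", u"u"),
    }

    def __init__(self, **options):
        for name, allowed in self.ALLOWED.items():
            value = options.get(name, allowed[0])
            if value not in allowed:
                # Fail immediately instead of on first use of the option.
                raise ValueError(
                    "Invalid value %r for option %r" % (value, name))
            setattr(self, name, value)


operator = OptionCheckingOperator(umlautU=u"u")
```

Raising in the constructor reports a bad option where it was passed, rather than later when the option is first consulted.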
Todo
- Impl: include script table from Unicode 5.2.0 to get character ranges
for Hangul and Kana
(The original entry is located in library/cjklib.test.characterlookup.rst, line 10 and can be found here.)
Todo
- Impl: Add second dimension to consistency check for converting between
dialect forms for all entities. Use cartesian product
option_list x dialects
(The original entry is located in library/cjklib.test.readingconverter.rst, line 6 and can be found here.)
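The cartesian product mentioned above can be generated with itertools.product. In this sketch the option dictionaries, the dialect names, and the helper iter_test_cases are hypothetical stand-ins for the values a real test case would supply.

```python
import itertools

# Hypothetical option sets and dialect names, for illustration only.
option_list = [{"toneMarkType": "numbers"}, {"toneMarkType": "diacritics"}]
dialects = ["dialectA", "dialectB", "dialectC"]


def iter_test_cases(option_list, dialects):
    """Yield one (options, dialect) pair for every combination."""
    return itertools.product(option_list, dialects)


cases = list(iter_test_cases(option_list, dialects))
```

With two option sets and three dialects this yields six combinations, so the consistency check would run once per pair instead of once per option set.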
Todo
- Impl: While this function is only needed as long as Python doesn’t
ship with a proper title casing algorithm as defined by Unicode, we
need a proper handling for Wade-Giles, as Pinyin Erhua forms
will convert to two entities being separated by a hyphen, which does
not fall into the Unicode title casing algorithm’s definition of a
case-ignorable character.
(The original entry is located in library/cjklib.util.rst, line 18 and can be found here.)
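The hyphen problem can be illustrated as follows, assuming the desired result keeps the part after the hyphen in lowercase. The helper title_first_only and the sample string are illustrative only and not taken from cjklib. Python's built-in str.title() is used for comparison; it starts a new word after every non-letter.

```python
def title_first_only(text):
    """Uppercase the first cased character and lowercase the rest,
    so that a hyphen does not start a new title-cased word."""
    for index, char in enumerate(text):
        if char.upper() != char.lower():  # first character that has case
            return text[:index] + char.upper() + text[index + 1:].lower()
    return text


# Illustrative hyphenated string made of two entities.
builtin_result = "hua-erh".title()
custom_result = title_first_only("hua-erh")
```

Here builtin_result capitalizes the letter after the hyphen as well, while custom_result only capitalizes the first letter of the whole string.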