Locale magic (literally)

Another programming-centric post and follow-up on Thursday's post about locale issues with Turkish.

So, I showed some Problems with locale-dependant mappings using the case of Turkish, that mapps small Latin character i to uppercase İ, which also has a dot on top. Now join me on some more magic. On Unix you need to have the proper locale generated, which under debian works with dpkg-reconfigure locales.

>>> import locale                                                     
>>> locale.setlocale(locale.LC_ALL, 'tr_TR.ISO-8859-9')               
>>> from cjklib.reading import ReadingFactory
>>> f = ReadingFactory()
>>> 'LIO' == u'LIO'
>>> f.isReadingEntity('LIO', 'WadeGiles')
>>> f.isReadingEntity(u'LIO', 'WadeGiles')

So what are we doing? We set the locale to Turkish, which will interpret strings using rules of the Turkish script. Then we make sure that the plain Python string 'LIO' equals the Unicode string u'LIO' (we can do this under Python 2.x, but Python 3 will lose the original string class). Now comes the fun part. We load cjklib and check if LIO is a valid syllable in the Wade-Giles romanization system. We yield False for the plain string and True for the Unicode string. Wow. What happend?

In case this reminds you of an earlier trick of mine, Unicode - Don't trust your eyes, then don't be mistaken. This time I'm not playing tricks on you.

The WadeGilesOperator that actually handles this request needs to normalize the string first and converts it to lowercase before it will lookup the form in its table. And here is the issue: LIO will be converted to lıo (without dot) under the current locale and this, of course, is not a valid syllable. So I learnt something today: I have to make sure we only work on Unicode objects in cjklib. Hope you, too, learnt something new.

Update: It seems Python in general has a problem with the Turkish locale. cjklib now breaks with Python2.6 under said locale.