Python, Unicode and the digital divide

Submitted by Christoph on 6 August, 2009 - 14:12

One could say that Unicode is the reflection of globalization in computing. As a computer scientist, I find that this huge project very much holds my attention and fascinates me on a daily basis. And Unicode is not just a feature; it is a foundation that builds bridges between languages and cultures in the digital world.

You might have heard of the digital gap that divides groups with good access to digital solutions from those without. One factor that drives this gap is that technological solutions cannot easily be transferred and implemented throughout the global community. Take speech technology, for example. Speech recognition started out for English, was then extended to French, German and Japanese, and now needs to take the next step. This technology comes at great cost and currently has to be reassessed for every language in question, with little synergy between them.

Now, very often these problems are quite practical. Let me introduce you to one I have at hand. First, a short aside on Unicode. Known to most as the universal character encoding, it is more than that: it comes with a wide range of language-related algorithms and solutions, including basic operations such as conversion to uppercase, lowercase or titlecase, which I mentioned before.

SS, ß

Upper-/lowercase conversion is an algorithmic problem that was pretty easy for the original ASCII (the mother of encodings): add 32 to the code point of 'A' to get 'a', and likewise for B through Z. But it is not always that easy. German adds umlauts and a sharp S: ß. Converting Fuß ("foot") to uppercase yields FUSS, which most likely will not change in the near future, even though an uppercase sharp S was added recently.
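As a small illustration of the two cases, here is a sketch written for Python 3 (the snippets in this post are Python 2). The helper name ascii_upper is made up for this example; it implements only the "shift by 32" rule, and is compared with the built-in str.upper(), which applies Unicode's full case mappings.

```python
def ascii_upper(text):
    """Uppercase only the letters a-z by subtracting 32 from the code point.

    Hypothetical helper for illustration: everything outside a-z is
    passed through unchanged.
    """
    return ''.join(
        chr(ord(ch) - 32) if 'a' <= ch <= 'z' else ch
        for ch in text
    )


SHARP_S = '\u00df'  # ß

ascii_result = ascii_upper('fuss')           # plain ASCII input
ascii_sharp_s = ascii_upper('Fu' + SHARP_S)  # ß is outside a-z and stays as it is
full_result = ('Fu' + SHARP_S).upper()       # Unicode-aware: ß expands to SS
round_trip = full_result.lower()             # the ß cannot be recovered

print(ascii(ascii_result), ascii(ascii_sharp_s), ascii(full_result), ascii(round_trip))
```

Note that the uppercase form is one character longer than the original, and that lowercasing it again does not give back the ß.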

İ, I, i, ı

A more complicated case is Turkish, which, in addition to other extra characters, has "two Is": one without a dot (ı) and one with (i). The distinction carries over consistently to uppercase, so the former maps to I and the latter to İ. While each pair looks as straightforward as any other letter, the mapping conflicts with everyone else's: in most languages I is the uppercase of i, but in Turkish it is the uppercase of ı.

How is this fact reflected in Python?
>>> u'I'.lower()
u'i'
This is correct for most languages but, as said before, wrong for Turkish. People have come across this problem and were told that Python probably won't ship with a native solution, but will most likely rely on a binding to IBM's ICU library.
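To see the default behaviour for all four letters at once, the following sketch (Python 3, not the Python 2 used in this post) collects what the built-in, locale-independent str.lower() and str.upper() return. The names LETTERS and default_mappings are only for this example.

```python
import unicodedata

# The four "I" letters from the heading above, written as escapes so the
# dotted and dotless forms cannot be confused in the source.
LETTERS = ['I', 'i', '\u0130', '\u0131']  # I, i, İ, ı

# (lowercase, uppercase) as produced by Python's built-in mappings,
# which do not take any locale into account.
default_mappings = {ch: (ch.lower(), ch.upper()) for ch in LETTERS}

for ch, (lower, upper) in default_mappings.items():
    print(unicodedata.name(ch), ascii(lower), ascii(upper))
```

Two things stand out on Python 3: ı uppercases to the plain I, so the Turkish pairs collapse into the English ones, and İ lowercases to i followed by a combining dot above (U+0307), a two-code-point string.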

Using the PyICU module, you can already solve the issue at hand:
>>> from PyICU import UnicodeString, Locale
>>> trLocale = Locale('tr_TR')
>>> trLocale.getDisplayName(Locale('en'))
u'Turkish (Turkey)'
>>> print unicode(UnicodeString('i').toUpper(trLocale))
İ

Another problem specific to Turkish, still involving the uppercase "Is", arises when English and Turkish are mixed, since the two have different upper-/lowercase mappings. A tool, even one with a translated user interface, might rely on an internal mapping of English commands and fail horrendously. For example, the command "QUIT" will terminate the program, but "quit" entered under a Turkish locale is uppercased to "QUİT", which is a different string and so is not recognized.
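The failure can be simulated without ICU. The sketch below is Python 3 and rests on assumptions: turkish_upper is a hand-written stand-in for a Turkish-locale uppercase that handles only the two i/I pairs, and COMMANDS and is_command are invented for the example. It is not how any particular tool works.

```python
DOTTED_UPPER = '\u0130'   # İ
DOTLESS_LOWER = '\u0131'  # ı

# Internal command table of an imaginary tool, written by English speakers.
COMMANDS = {'QUIT'}


def turkish_upper(text):
    """Simplified stand-in for uppercasing under a Turkish locale.

    Only the special pairs are handled by hand (i -> İ, ı -> I);
    everything else is left to the default mapping.
    """
    return text.replace('i', DOTTED_UPPER).replace(DOTLESS_LOWER, 'I').upper()


def is_command(user_input, upper):
    """Normalize the input with the given uppercase function and look it up."""
    return upper(user_input) in COMMANDS


# With Turkish rules "quit" becomes "QUİT" and no longer matches.
found_with_turkish_rules = is_command('quit', turkish_upper)

# With a locale-independent mapping (Python 3's str.upper) it matches.
found_with_fixed_rules = is_command('quit', str.upper)

print(ascii(turkish_upper('quit')), found_with_turkish_rules, found_with_fixed_rules)
```

One way out, under these assumptions, is to normalize internal identifiers with a fixed, locale-independent mapping and to reserve locale-aware casing for text that is shown to the user.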

Ever thought that basic computing problems were solved in the 1980s?

Now I wonder, how does the Turkish community program in Python nowadays? 8-bit string classes act in a locale-dependent way, while the future Unicode-only strings will lose this behavior.

Another thought: Wouldn't it have been possible to keep the i to I mapping for the Turkish locale and create a "really-no-dot-I" as the uppercase equivalent of ı, then let the font take care of rendering the "big-i-I" with a dot? For the sake of backwards compatibility, it surely was too late to think about this when Unicode was born.

Update: It seems that, following PEP 358, the bytes class which substitutes for the plain string class in Python 3 ships with isupper() and friends, which only work for Latin characters as in plain ASCII, A-Z, and will lose all locale-dependent behaviour.
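A quick check of that behaviour on Python 3, as a sketch: the sample text and variable names are chosen just for this example, and Latin-1 is used only because it stores ß and ü in a single byte each.

```python
# Plain ASCII bytes are converted as expected.
ascii_bytes_upper = b'quit'.upper()

# 'fuß über' encoded as Latin-1: ß is byte 0xDF and ü is byte 0xFC.
latin1_text = 'fu\u00df \u00fcber'.encode('latin-1')

# Only the bytes for a-z change; the two non-ASCII bytes pass through.
latin1_upper = latin1_text.upper()

# A lone non-ASCII byte is not considered a cased letter at all.
u_umlaut_is_lower = b'\xfc'.islower()

print(ascii_bytes_upper, latin1_upper, u_umlaut_is_lower)
```

So bytes objects behave the same everywhere, at the price of ignoring every letter outside A-Z.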
