Unicode - Don't trust your eyes

Submitted by Christoph on 14 June, 2008 - 13:46


It should be nothing new to you when I say "Don't trust your eyes".
But when it comes to Unicode in particular, I feel like saying it again: "Really, don't".

This short piece of Python code tries to make the point:
The two strings "Unicode" compare equal, but the two strings 口 and ⼝ do not, even though they look alike.

Actually, the first, 口 (U+53E3), is the ordinary Chinese character meaning "mouth"; the second, ⼝ (U+2F1D), is its radical form.

Unicode contains many characters that look alike: several kinds of dots, look-alikes of Latin letters used for IPA, and especially many in the CJK blocks. The CJK look-alikes are not only radicals; many come from the so-called "source separation rule".

Here's the full code:

>>> u'Unicode' == u'Unicode'
True
>>> u'口' == u'⼝'
False
>>> ord(u'口')
21475
>>> ord(u'⼝')
12061
>>>
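
If you want to look behind the glyphs, something like the following may help. It is only a minimal sketch, assuming Python 3 and the standard unicodedata module; the helper name describe is made up for illustration. It prints each character's code point and official name, and then checks whether NFKC compatibility normalization folds the radical form onto the ordinary character.

```python
import unicodedata


def describe(ch):
    """Return (code point, official Unicode name) for a single character.

    `describe` is a hypothetical helper, not part of any library.
    """
    return "U+%04X" % ord(ch), unicodedata.name(ch)


ideograph = "\u53e3"  # 口, the ordinary character for "mouth"
radical = "\u2f1d"    # ⼝, the radical form that looks the same

print(describe(ideograph))  # ('U+53E3', 'CJK UNIFIED IDEOGRAPH-53E3')
print(describe(radical))    # ('U+2F1D', 'KANGXI RADICAL MOUTH')

# NFKC replaces compatibility characters with their plain equivalents.
# For this pair, the radical is mapped onto the ordinary ideograph.
folded = unicodedata.normalize("NFKC", radical)
print(folded == ideograph)  # True
```

Normalizing both sides before comparing is one way to catch this particular pair. It will not catch every look-alike, since many visually similar characters have no compatibility mapping to each other.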
