Christoph's CJK-centered concerns

id3encodingconverter - A simple encoding converter for MP3 tags

Submitted by Christoph on 14 April, 2008 - 02:11

id3encodingconverter

I just checked in my small project id3encodingconverter which is designed to convert ID3 tags found in MP3 files from different encodings to Unicode with ID3v2.

Until now I hadn't found an easy tool for this frequent conversion task, which becomes necessary when text in an encoding other than Latin1 is saved in a tag format that supports nothing other than Latin1.

As long as you listen to music that only makes use of one character set you shouldn't have the problems I have: if your music is English, then you're a happy person, as ASCII is found in (nearly?) all encodings. If you need something other than Latin1, you can normally tell your music player to assume a different encoding. But if you have music in several different encodings, the only option is to convert all your music tags to Unicode, and that's what my tool is for.
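
The core of such a conversion can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code of id3encodingconverter: the function name `recover_text` and the GB2312 example are made up here, the sketch uses Python 3 string semantics, and it assumes the tag reader has already (wrongly) decoded the raw tag bytes as Latin1.

```python
def recover_text(misread: str, real_encoding: str) -> str:
    """Undo a wrong Latin1 decode and decode again with the real encoding.

    Latin1 maps every byte 0x00-0xFF to exactly one code point, so encoding
    the misread string back to Latin1 restores the original bytes.
    """
    raw = misread.encode("latin-1")
    return raw.decode(real_encoding)


# Hypothetical example: a GB2312-encoded title that was read as Latin1.
original = "中文"
garbled = original.encode("gb2312").decode("latin-1")
fixed = recover_text(garbled, "gb2312")
assert fixed == original
```

The hard part in practice is knowing `real_encoding` for each file; the sketch simply takes it as a parameter.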

It's hosted on Google Code, which seems perfect for small projects like this: a wiki, SVN, downloads, and not as bloated as SourceForge and its clones.

Well, until now, only alpha. A lot of work still needs to be done.

Christoph's blog

Unicode finally takes the lead...

Submitted by Christoph on 31 March, 2008 - 19:12

...at least for Python 3.0.

Still hoping that in 10 years we can only wonder about sentences like the following:
Some languages use special characters (Chinese, Japanese, Arabic, Klingon, etc.) that are difficult to handle with traditional software. Although there are standards for using and displaying them, these standards are not widely used, and make life a lot more complicated than necessary. Since virtually all software (and even hardware) is made to be used with the Roman alphabet (possibly with minor language-dependent modifications) [...] (quoted from [1])

For Python this seems to become reality soon:

What is quite an old fact (from 2007) is new to me: Python 3.0, which is currently available as an alpha(?), will merge the str and unicode types, and the stupid coexistence of two string classes will finally be overcome.

Quoting [2]:

There is only one string type; its name is str but its behavior and implementation are like unicode in 2.x.
Yay!
PEP 3137: There is a new type, bytes, to represent binary data (and encoded text, which is treated as binary data until you decide to decode it). The str and bytes types cannot be mixed; you must always explicitly convert between them, using the str.encode() (str -> bytes) or bytes.decode() (bytes -> str) methods.
Sounds like Java!
PEP 3120: UTF-8 default source encoding.
No stupid warnings anymore.
PEP 3131: Non-ASCII identifiers. [...]
Nice.
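
A minimal sketch of what the quoted str/bytes rule means in practice, assuming Python 3 semantics as described in [2]. The variable names are only illustrative.

```python
text = "Grüße"                  # str: a sequence of Unicode code points
data = text.encode("utf-8")     # str -> bytes, the encoding is explicit
assert isinstance(data, bytes)
assert data.decode("utf-8") == text   # bytes -> str

# Mixing the two types raises TypeError instead of guessing an encoding.
try:
    text + data
except TypeError as exc:
    mixed_error = type(exc).__name__
```

The point of the design is visible in the last step: there is no implicit ASCII conversion left that could fail at some later, unrelated place.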

Feels like Christmas (replace that with your preferred holiday).

P.S.: Still hoping that the bug report I filed yesterday will be accepted as such. It's not a feature to me!

Update: To understand my happiness consider reading my Python Unicode rant.

I love Unicode

Submitted by Christoph on 31 March, 2008 - 00:29

Unicode

Stole this somewhere, I'll provide a link once I find it again.
See Unicode Specials on the English Wikipedia for the "replacement character".
Update: The character you see below may differ from the one in the image. Rendered forms include a question mark in a black rhombus, a white block with a black border, or just a plain question mark.
Update 2: the link

Python and Exceptions containing Unicode messages

Submitted by Christoph on 30 March, 2008 - 23:40

Python (tested with 2.4 and 2.5) seems to have problems when an exception is raised whose message contains non-ASCII text.

In my case I constructed an exception like this:

>>> try:
...     raise Exception(u'Error when printing ü')
... except Exception, e:
...     print e
...
Traceback (most recent call last):
  File "<stdin>", line 4, in ?
UnicodeEncodeError: 'ascii' codec can't encode character u'\xfc' in position 20: ordinal not in range(128)

At least that is the approach the tutorial advocates [1].
The same happens with unicode(e) instead of the print statement.

Using unicode(e, 'utf8') doesn't do the job:
TypeError: coercing to Unicode: need string or buffer, instance found

A method __unicode__() doesn't seem to exist [2], and thus I'm only left with:

>>> try:
...     raise Exception(u'Error when printing ü'.encode('utf8'))
... except Exception, e:
...     print e
...
Error when printing ü

UTF-8 is my system's default encoding.
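
For comparison, here is a sketch of the same experiment under Python 3 semantics, where str is Unicode throughout. It is not a fix for the 2.4/2.5 behavior shown above, and the variable name `message` is only illustrative.

```python
try:
    raise Exception("Error when printing ü")
except Exception as e:
    message = str(e)   # no implicit ASCII encoding step is involved

# The message comes back unchanged, including the non-ASCII character.
assert message == "Error when printing ü"
# Index 20 is the position the 2.x traceback complained about.
assert message[20] == "ü"
```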

It seems I'm not the first one to stumble upon this behavior [3]; I just wonder why an error like this can still exist in 2008. Maybe I should file a bug report.
Update: the bug report

New ISO 639 codes in February/March

Submitted by Christoph on 16 March, 2008 - 14:45

ISO 639

Small changes were made on February 5th (ISO 639-2) and on February 18th, February 28th and March 5th (all ISO 639-3). gsw now includes Alsatian as a name in ISO 639-2 and ISO 639-3, and mly, muw and xst were split into new codes. I hope I didn't forget any changes.

The file iso-639-3_Retirements_20080228.tab has an erroneous line break in line 69/70 that needs to be fixed, and it is actually encoded in Latin1, which I missed before.
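
Re-encoding such a file is straightforward. This is a sketch, assuming the whole file is Latin1 and that UTF-8 is the desired target; the function name `latin1_to_utf8` and the sample columns are made up for illustration, and the broken line 69/70 still has to be repaired by hand.

```python
def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode Latin1 bytes as UTF-8 without touching line breaks."""
    return data.decode("latin-1").encode("utf-8")


# In-memory sample with invented content; a real run would read and write
# the .tab file in binary mode so that no newline translation happens.
sample = "Id\tRef_Name\nxyz\tSüd\n".encode("latin-1")
converted = latin1_to_utf8(sample)
assert converted.decode("utf-8") == "Id\tRef_Name\nxyz\tSüd\n"
```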

These are now the local versions:

+--------------------------+----------+
| Entity                   | Version  |
+--------------------------+----------+
| ISO 639-2                | 20080205 |
| ISO 639-3                | 20080218 |
| ISO 639-3 Name Index     | 20080305 |
| ISO 639-3 Macrolanguages | 20080228 |
| ISO 639-3 Retirements    | 20080228 |
+--------------------------+----------+
