Language/Encoding guesser written in Python

encoding.py is a module that allows guessing (more formally: classifying) the language and encoding of textual input.

It builds on the Textcat library by Gertjan van Noord and the Python implementation ngram.py by Thomas Mangin.

Textcat offers language data in the form of the 400 most frequent N-grams for several languages, including Latin-1 ones (e.g. English, German, French), Chinese (gb, big5), Japanese (shift_jis, euc_jp), Cyrillic (windows1251, koi8_r, iso8859_5) and others.

This library is distributed under the GNU General Public License.

I adapted ngram.py to match the top five candidates against a given list of preferred languages, which avoids detecting unlikely languages such as Drents.
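To illustrate the approach, here is a minimal sketch of rank-order N-gram classification (the "out-of-place" measure Textcat builds on), together with a preferred-language check like the adaptation described above. This is not the actual ngram.py code; the profile building is simplified and the function names are made up:

from collections import defaultdict

def make_profile(text, max_n=3, top=400):
    """Rank the `top` most frequent 1- to max_n-grams of `text`."""
    counts = defaultdict(int)
    padded = "_%s_" % text.replace(" ", "_")
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    ranked = sorted(counts, key=counts.get, reverse=True)[:top]
    return dict((gram, rank) for rank, gram in enumerate(ranked))

def out_of_place(sample, reference, penalty=400):
    """Sum of rank differences; N-grams unknown to the reference
    profile get the maximum penalty."""
    return sum(abs(rank - reference.get(gram, penalty))
               for gram, rank in sample.items())

def classify(text, profiles, preferred=None, top_n=5):
    """Rank all languages by distance to the sample profile; if one of
    the `preferred` languages appears among the `top_n` best candidates,
    return it instead of the overall winner."""
    sample = make_profile(text)
    ranking = sorted(profiles,
                     key=lambda lang: out_of_place(sample, profiles[lang]))
    if preferred:
        for lang in ranking[:top_n]:
            if lang in preferred:
                return lang
    return ranking[0]

# Usage sketch:
# profiles = {"english": make_profile(english_corpus), ...}
# classify(u"some text", profiles, preferred=["english", "german"])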

Todo: Textcat supports 76 language/encoding pairs, but only slightly more than 20 of them state which encoding they use. Some have already been analysed, but most are still missing from the library's list. Please submit encodings for the Textcat data.

Attached are

  • encoding.py - encoding classifier
  • ngram.py - adapted Textcat Python implementation
  • ngram.top5.diff - diff against the original implementation by Thomas Mangin

Segmenting Pinyin through regular expressions

Playing a bit with segmenting strings written in Pinyin, I came up with a regular expression (regex) that does the job.

It's important to respect the vowels a, e, o, which can stand on their own and can be preceded by an apostrophe. Furthermore, it is important to know which vowel combinations account for one syllable and which for more. Example: aa is equivalent to two characters, but ai only to one.

There are two final sounds, n and ng, where the n, or the g of ng, can instead be the initial sound of the following syllable.

The whole regex, which does not handle tone marks:
(?:(?:(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)(?:i(?:ao|[uae])?|u(?:ai|[iaeo])?|üe?))|
(?:(?:(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)|')?(?:a[oi]?|ei?|ou?)))(?:(?:ng|n)(?![aeo]))?

We can break it down into major parts first:

  • (?:
    • Get syllables with vowels starting with i, u, ü first. All consonants except v can show up; sh, ch, zh are initial sounds with two consonants. Make sure vowel combinations like ii or üa can't occur together:
      (?:(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)(?:i(?:ao|[uae])?|u(?:ai|[iaeo])?|üe?))
    • |
    • Now get syllables starting with a, e, o. Consonants and vowel combinations as above, but deal with a possible apostrophe:
      (?:(?:(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)|')?(?:a[oi]?|ei?|ou?)))
  • Get finals n, ng only if no vowels a, e, o follow:
    (?:(?:ng|n)(?![aeo]))?
  • )

This regex works on the "garbage in, garbage out" principle: syllables that can't occur in Pinyin might not be reported. Furthermore, I can't guarantee this regex is free of errors; you might want to test it yourself before using it.

Here's the code in Python:

>>> import re
>>> decomp = re.compile(u"((?:(?:(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)" \
... + u"(?:i(?:ao|[uae])?|u(?:ai|[iaeo])?|üe?))" \
... + u"|(?:(?:(?:[bcdfghjklmnpqrstwxyz]|[zcs]h)|')?(?:a[oi]?|ei?|ou?)))" \
... + u"(?:(?:ng|n)(?![aeo]))?)")
>>> decomp.split(u"changan")
[u'', u'chan', u'', u'gan', u'']
>>> decomp.split(u"chang'an")
[u'', u'chang', u'', u"'an", u'']
>>> decomp.split(u"tiananmen")
[u'', u'tia', u'', u'nan', u'', u'men', u'']
>>> decomp.split(u"tian'anmen")
[u'', u'tian', u'', u"'an", u'', u'men', u'']
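Since the outer parentheses are the pattern's only capturing group, findall returns just the syllables, without the empty strings that split produces; and strings the pattern can't match at all simply pass through split unchanged:

>>> decomp.findall(u"chang'an")
[u'chang', u"'an"]
>>> decomp.split(u"xyz")
[u'xyz']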

How to load ISO 639-3 tables provided by SIL into MySQL

ISO 639-3 language codes are provided for download by SIL at http://www.sil.org/iso639-3/download.asp. As the codes are stored as a tab-separated list and a SQL CREATE TABLE statement is given, it is pretty easy to load the data into MySQL.

Everything written below applies to MySQL 5. Older versions can't use Unicode, which shouldn't be necessary for these tables anyway.

Here is how:

If you haven't installed MySQL yet, refer to your distribution's documentation on how to do so.
Afterwards, log in as admin:

mysql -u admin -p

and paste the following code to create a new user that has all rights on databases matching the pattern 'username_*', where username is your username. Replace "username" with your username and "password" with a password of your choice.

CREATE USER 'username'@'%' IDENTIFIED BY 'password';

GRANT USAGE ON *.* TO 'username'@'%' IDENTIFIED BY 'password' WITH MAX_QUERIES_PER_HOUR 0 MAX_CONNECTIONS_PER_HOUR 0 MAX_UPDATES_PER_HOUR 0 MAX_USER_CONNECTIONS 0;

GRANT ALL PRIVILEGES ON `username_%`.* TO 'username'@'%';

Log in as that user and create the database with the following code. Again, substitute "username" with your username.

CREATE DATABASE `username_ISO639-3` DEFAULT CHARACTER SET utf8 COLLATE utf8_bin;

Now copy the CREATE TABLE code from SIL's download site and paste it into MySQL. You might have to replace the '-' in the table name, i.e. 'CREATE TABLE ISO_639_3' instead of 'CREATE TABLE ISO_639-3'.
After you created the table you can import the tab separated list:

LOAD DATA LOCAL INFILE 'iso-639-3_20070814.tab' INTO TABLE ISO_639_3 CHARACTER SET utf8 FIELDS TERMINATED BY '\t' IGNORE 1 LINES;

You may have to add a line break at the end of the last line to avoid a warning.

If the import was successful, you can start asking for things like:
give me all macrolanguages that don't have an equivalent ISO 639-2 code.

SELECT * FROM ISO_639_3 WHERE Scope = 'M' AND Part2T = '';
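If you want to run such queries from Python, here is a minimal sketch using the MySQLdb module. The connection parameters are placeholders, and the column names Id and Ref_Name are taken from SIL's CREATE TABLE statement; adjust them if your version of the table differs:

import MySQLdb

# Placeholders: use your own username, password and database name.
conn = MySQLdb.connect(host="localhost", user="username",
                       passwd="password", db="username_ISO639-3",
                       charset="utf8")
cursor = conn.cursor()
cursor.execute("SELECT Id, Ref_Name FROM ISO_639_3 "
               "WHERE Scope = 'M' AND Part2T = ''")
for code, name in cursor.fetchall():
    print code, name
conn.close()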

Various extremes of Chinese characters

Sometimes, to learn about something new, it is good to look at its extremes. To give an insight into the various forms of Chinese characters, I want to show their different characteristics and compare them.

I might extend the list once I find something new and interesting. Sources are given in the form [1].

Frequency

The frequency of a character depends on the point of view. But as we don't want to satisfy any scientific requirements, we can be a bit imprecise here. I'd rather focus on a few examples.

Most frequent character

One way of finding the most frequent character is to run a statistical evaluation on a text corpus. Somebody already did that and wrote about how it technically works and which sources were used. You'll see that there are basically two different evaluations: one for Modern Chinese, one for Classical Chinese.

  • According to the statistics for Modern Chinese,

    的 is the most frequent character, with about 4% of use.
    This word is a possessive particle and can be used as in 我的书, wǒ de shū, "my book" (literally: "I particle book").

  • But for Classical Chinese,

    之 is the most frequent one, with about 1% of use. 不 has nearly the same frequency. In Classical Chinese, 之 is partly used where 的 would be used in Modern Chinese, so both statistics seem to show the same result (more on its usage).

What is 4%?

To have a quick look at how much 4% actually is, here's a Chinese phrase. It is taken from [2]; there are 3 occurrences of 的 in 69 characters (not counting typographic characters and Arabic numerals), which is about 4%:
“中國”一詞自古有之,最早指居於“天下”(古人对世界的称谓)中心的中原地帶;在近代以来,特别是1912年中华民国成立后,“中國”一詞始成爲民族國家意義上的法律和政治概念。
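A quick way to verify such a count in Python, using the CJK Unified Ideographs range as a rough stand-in for "Chinese character" (which conveniently skips the typographic characters and Arabic numerals):

# -*- coding: utf-8 -*-
phrase = (u"“中國”一詞自古有之,最早指居於“天下”(古人对世界的称谓)中心的"
          u"中原地帶;在近代以来,特别是1912年中华民国成立后,“中國”一詞"
          u"始成爲民族國家意義上的法律和政治概念。")
# Keep only characters from the CJK Unified Ideographs block.
hanzi = [c for c in phrase if u"\u4e00" <= c <= u"\u9fff"]
count = hanzi.count(u"的")
print u"%d of %d characters, about %.0f%%" % (
    count, len(hanzi), 100.0 * count / len(hanzi))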

Interestingly, the 6 most frequent characters in Modern Chinese sum up to 10%, whereas it takes the 12 most frequent characters in Classical Chinese to make up nearly the same amount.

Least frequent character

Again, which character is least frequent depends on the criterion. But as we are working with examples, we can overlook that.
The Chinese language, unlike Western languages, tended to have a character for every particular thing. I would conclude it is in its nature to have a few characters that are never used except for one thing.

  • 壢 is such a character. It's only used in association with 中壢, Zhōnglì, a city in the northwest of Taiwan [1]. As such, I think it qualifies as one of the least frequently used characters.

More to come...

OpenSync with Sony Ericsson T630

I had to synchronise a T630 with OpenSync [1] and couldn't really finish the job until I found out that the phone doesn't like syncing some things. Now I omit syncing "Events" and "Notes":

msynctool --sync setupname --filter-objtype note --filter-objtype event

It works that way, though I seem to have a problem with pictures in vCards being duplicated on every synchronisation. I just deleted them all in my local address book, so the phone has enough free memory for the entries.
