Language/encoding guesser written in Python

encoding.py is a module that allows guessing (more formally: classification) of the language and encoding of textual input.

It builds on the Textcat library by Gertjan van Noord and the Python implementation ngram.py by Thomas Mangin.

Textcat offers language data in the form of the 400 most frequent N-grams for several languages, including Latin-1 ones (e.g. English, German, French), Chinese (GB, Big5), Japanese (Shift_JIS, EUC-JP), Cyrillic (Windows-1251, KOI8-R, ISO-8859-5) and others.
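To illustrate what such a profile looks like, here is a minimal sketch of how a frequency-ranked N-gram profile can be built from text, in the style Textcat uses (tokens padded with underscores, all N-grams up to length 5 counted, top 400 kept). The function name and details are my own, not Textcat's actual code.

```python
from collections import Counter

def ngram_profile(text, max_rank=400, n_max=5):
    """Build a frequency-ranked N-gram profile (hypothetical sketch).

    Pads each whitespace token with underscores, counts all N-grams
    for N = 1..n_max, and keeps the max_rank most frequent ones --
    the same shape of data Textcat ships per language/encoding pair.
    """
    counts = Counter()
    for token in text.split():
        padded = "_%s_" % token
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Rank N-grams by descending frequency; ranks, not raw counts,
    # are what the classifier later compares.
    return [gram for gram, _ in counts.most_common(max_rank)]

profile = ngram_profile("the quick brown fox jumps over the lazy dog")
```

A real profile is trained on much more text, of course; this only shows the data structure.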

This library is distributed under the GNU General Public License.

I adapted ngram.py to match the top 5 candidates against a given list of preferred languages, to avoid detection of unlikely languages such as Drents.
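The idea can be sketched as follows: rank all language profiles by the usual out-of-place distance, then pick the first of the top 5 that is also in the preferred list, falling back to the overall best match. Function names and the tiny profiles below are hypothetical, not the actual ngram.py code.

```python
def classify(doc_profile, lang_profiles, preferred=None, top=5):
    """Pick the best language, biased toward a preferred list (sketch).

    doc_profile:   ranked N-gram list for the input text
    lang_profiles: dict mapping language name -> ranked N-gram list
    preferred:     optional set of language names to favor
    """
    def distance(lang_profile):
        # Out-of-place measure: sum of rank differences; N-grams
        # missing from the language profile get a maximum penalty.
        pos = {g: i for i, g in enumerate(lang_profile)}
        miss = len(lang_profile)
        return sum(abs(i - pos.get(g, miss))
                   for i, g in enumerate(doc_profile))

    ranked = sorted(lang_profiles, key=lambda name: distance(lang_profiles[name]))
    candidates = ranked[:top]
    if preferred:
        for name in candidates:
            if name in preferred:
                return name
    return candidates[0]

# Tiny synthetic profiles (made up for illustration only):
lang_profiles = {
    "english": ["_", "e", "t", "th", "he"],
    "drents":  ["x", "q", "zz", "_x", "qq"],
}
doc_profile = ["_", "e", "th", "t", "he"]
best = classify(doc_profile, lang_profiles, preferred={"english"})
```

The point of the top-5 filter is that an implausible language can still score well on short input; restricting the choice to a caller-supplied list keeps such matches out.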

Todo: Textcat supports 76 language/encoding pairs, but only slightly more than 20 of them state which encoding they use. Some have already been analysed, but most are still missing from the library's list. Please submit encodings for the Textcat data.

Attached are

  • encoding.py - encoding classifier
  • ngram.py - adapted Textcat Python implementation
  • ngram.top5.diff - diff against Thomas Mangin's original implementation
Attachments:

  • ngram.py.txt (4.82 KB)
  • ngram.top5.diff (1.15 KB)
  • encoding.py.txt (8.42 KB)

Official page of Textcat and list of languages

The official homepage of Textcat is http://software.wise-guys.nl/libtextcat/.

There is a list of supported languages, some with encodings.

CJK languages pose a whitespace problem that affects the detector: they are written without spaces between words, so whitespace-based tokenization breaks down.
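A quick illustration of the problem (the example sentence is my own): splitting on whitespace gives sensible word tokens for English, but a CJK sentence comes back as one giant token, so word-boundary padding adds almost no signal to the N-gram profile.

```python
text_en = "the quick brown fox"
text_zh = "这是一个中文句子"  # a Chinese sentence: no spaces between words

tokens_en = text_en.split()  # four word tokens
tokens_zh = text_zh.split()  # one undivided token

print(len(tokens_en), len(tokens_zh))
```

Byte- or character-level N-grams without token padding are one common way around this, but that changes the profile format.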

It seems like the project needs some care.