Language/encoding guesser written in Python

encoding.py is a module that allows guessing (more formally: classification) of the language and encoding of textual input.

It builds on the Textcat library by Gertjan van Noord and the Python implementation ngram.py by Thomas Mangin.

Textcat offers language data in the form of the 400 most frequent N-grams for several languages, including Latin-1 ones (e.g. English, German, French), Chinese (GB, Big5), Japanese (Shift_JIS, EUC-JP), Cyrillic (Windows-1251, KOI8-R, ISO-8859-5) and others.
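To illustrate what such a profile looks like, here is a minimal sketch of how a frequency-ranked N-gram profile can be built from text, in the style Textcat uses (tokens padded with underscores, all N-grams up to length 5 counted, top 400 kept). The function name and details are my own, not Textcat's actual code.

```python
from collections import Counter

def ngram_profile(text, max_rank=400, n_max=5):
    """Build a frequency-ranked N-gram profile (hypothetical sketch).

    Pads each whitespace token with underscores, counts all N-grams
    for N = 1..n_max, and keeps the max_rank most frequent ones --
    the same shape of data Textcat ships per language/encoding pair.
    """
    counts = Counter()
    for token in text.split():
        padded = "_%s_" % token
        for n in range(1, n_max + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Rank N-grams by descending frequency; ranks, not raw counts,
    # are what the classifier later compares.
    return [gram for gram, _ in counts.most_common(max_rank)]

profile = ngram_profile("the quick brown fox jumps over the lazy dog")
```

A real profile is trained on much more text, of course; this only shows the data structure.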

This library is distributed under the GNU General Public License.

I adapted ngram.py to match the top 5 candidates against a given list of preferred languages, to avoid detection of unlikely languages such as Drents.
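The idea can be sketched as follows: rank all language profiles by the usual out-of-place distance, then pick the first of the top 5 that is also in the preferred list, falling back to the overall best match. Function names and the tiny profiles below are hypothetical, not the actual ngram.py code.

```python
def classify(doc_profile, lang_profiles, preferred=None, top=5):
    """Pick the best language, biased toward a preferred list (sketch).

    doc_profile:   ranked N-gram list for the input text
    lang_profiles: dict mapping language name -> ranked N-gram list
    preferred:     optional set of language names to favor
    """
    def distance(lang_profile):
        # Out-of-place measure: sum of rank differences; N-grams
        # missing from the language profile get a maximum penalty.
        pos = {g: i for i, g in enumerate(lang_profile)}
        miss = len(lang_profile)
        return sum(abs(i - pos.get(g, miss))
                   for i, g in enumerate(doc_profile))

    ranked = sorted(lang_profiles, key=lambda name: distance(lang_profiles[name]))
    candidates = ranked[:top]
    if preferred:
        for name in candidates:
            if name in preferred:
                return name
    return candidates[0]

# Tiny synthetic profiles (made up for illustration only):
lang_profiles = {
    "english": ["_", "e", "t", "th", "he"],
    "drents":  ["x", "q", "zz", "_x", "qq"],
}
doc_profile = ["_", "e", "th", "t", "he"]
best = classify(doc_profile, lang_profiles, preferred={"english"})
```

The point of the top-5 filter is that an implausible language can still score well on short input; restricting the choice to a caller-supplied list keeps such matches out.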

Todo: Textcat supports 76 language/encoding pairs, but only slightly more than 20 of them state which encoding they use. Some have already been analysed, but most are still missing from the library's list. Please submit encodings for the Textcat data.

Attached are

  • encoding.py - encoding classifier
  • ngram.py - adapted Textcat Python implementation
  • ngram.top5.diff - diff against Thomas Mangin's original implementation
Attachments:

  • ngram.py.txt (4.82 KB)
  • ngram.top5.diff (1.15 KB)
  • encoding.py.txt (8.42 KB)

Official page of Textcat and list of languages

The official homepage of Textcat is http://software.wise-guys.nl/libtextcat/.

There is a list of supported languages, some with encodings.

CJK languages pose a whitespace problem that affects the detector: they are written without spaces between words, so whitespace-based tokenization breaks down.
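A quick illustration of the problem (the example sentence is my own): splitting on whitespace gives sensible word tokens for English, but a CJK sentence comes back as one giant token, so word-boundary padding adds almost no signal to the N-gram profile.

```python
text_en = "the quick brown fox"
text_zh = "这是一个中文句子"  # a Chinese sentence: no spaces between words

tokens_en = text_en.split()  # four word tokens
tokens_zh = text_zh.split()  # one undivided token

print(len(tokens_en), len(tokens_zh))
```

Byte- or character-level N-grams without token padding are one common way around this, but that changes the profile format.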

It seems like the project needs some care.