Mapping from Ethnologue 14 to ISO 639-2

SIL provides a mapping table from Ethnologue 14 to ISO 639-2 codes. Using this table a mapping from old Ethnologue codes to ISO 639 can be made.

This table has its limitations but can provide some additional usage: First of all no mapping to ISO 639-3 exists. (update, see: Updating Ethnologue codes from the 14th Edition to the 15th Edition) Thus a mapping from code in Ethnologue 14 to ISO 639 may lose some information. If a code in Ethnologue isn't encoded in ISO 639, it will be mapped to its language family with a collective code. Furthermore the mapping might have not been updated to included the latest ISO 639-2 codes, so additional codes might not be mapped to. But this information can be helpful on how to actualy map a single language to a collective code in ISO 639-2, even if a ISO 639-3 code is present.

The code table can't be choosen directly on the site, but it can be found over this direct link.

The following code can be useful for loading this table into MySQL:

CREATE TABLE MAPPING_ETHNO14_ISO_639_2 (
    Part2BT char(3) NOT NULL,
    Ethno14 char(3) NOT NULL,
    INDEX (Ethno14),
    INDEX (Part2BT)
)

LOAD DATA LOCAL INFILE 'all_639-2.tab' INTO TABLE
MAPPING_ETHNO14_ISO_639_2 FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\r\n';

Furthermore when making this mapping, the language names might be useful. They can be found in the regular table under the site's download section: LanguageCodes.tab

Here is the code for MySQL:

CREATE TABLE LanguageCodes (
   LangID      char(3) NOT NULL,        -- Three-letter code
   CountryID   char(2) NOT NULL,        -- Main country where used
   LangStatus  char(1) NOT NULL,        -- L(iving), N(early extinct),
                                        -- E(xtinct)
   Name        varchar(75) collate utf8_general_ci NOT NULL,
                                        -- Primary name in that country
   PRIMARY KEY (LangID)
)

LOAD DATA LOCAL INFILE 'LanguageCodes.tab' INTO TABLE
LanguageCodes FIELDS TERMINATED BY '\t' LINES
TERMINATED BY '\r\n';

List of constructed languages in ISO 639-3

Here is a list on constructed languages in ISO 639-3. See Creating statistics on ISO 639 language codes for how this was generated.

Versions of relevant table(s):

TableVersion
ISO_639_220081107
ISO_639_320081110

Constructed language list

Count of constructed language codes in ISO 639-3: 20
Query: SELECT COUNT(*) as count FROM ISO_639_3 WHERE Type='C'

ISO 639-3ISO 639-2 (B/T)ISO 639-1English Name
afhafhAfrihili
avkKotava
bztBrithenig
dwsDutton World Speedwords
epoepoeoEsperanto
eurEuropanto
idoidoioIdo
igsInterglossa
ileileieInterlingue
inainaiaInterlingua (International Auxiliary Language Association)
jbojboLojban
ldnL
lfnLingua Franca Nova
novNovial
qyaQuenya
rmvRomanova
sjnSindarin
tlhtlhKlingon
volvolvoVolap
zblzblBlissymbols

Python Unicode rant

If you are at the beginning of a new project and are currently considering which programming language to use, and furthermore you will definitely need Unicode support, then please consider some other languages than python.

Don't use python. Really only used it, if you have no other choice.

Everything python does inside needs some kind of an encoding. If you read from a command line set to utf8, your whole system actually is utf8, then you still need to decode your input from utf8 (as long it's not ascii).

It doesn't wrap everything in unicode (unlike Java), as the basic string is str and unicode strings are different objects. So str("hello") isn't the same as unicode("hello").

Furthermore you will stumple upon a lot of other culprits: e.g. htmlentitydefs has a table for mapping "entity definitions to their replacement text in ISO Latin-1". Ups, there it is again: Latin-1. You can't just use it, but you have to decode it first.

Or the csv module will say, that you actually can't import unicode strings that easily, you will need to build an extra wrapper.

The dream of using Unicode all over the world is still far away.

At least for python...

Creating statistics on ISO 639 language codes

I created a page showing simple ISO 639 language code statistics.

Creating this took some time as I had do figure out how to integrate that into PHP and I never did anything in that language before. Additionally though I already described how to load ISO 639-3 tables provided by SIL into MySQL, I had to make some changes to the given data, which I didn't do in the first place.

To not make MySQL search case sensitively it has to be told to use a proper collation. Where it seems needed I added it to the column specifications.

Steps:

  • Downloading all the tables from SIL and LoC
  • Modifying the database schemes, as I need indexing and keys:
    CREATE TABLE ISO_639_3 (
       Id      char(3) NOT NULL,  -- The three-letter 639-3 identifier
       Part2B  char(3) NULL,      -- Equivalent 639-2 identifier of the
                                  -- bibliographic applications code set, if
                                  -- there is one
       Part2T  char(3) NULL,      -- Equivalent 639-2 identifier of the terminology
                                  -- applications code set, if there is one
       Part1   char(2) NULL,      -- Equivalent 639-1 identifier, if there is one
       Scope   char(1) NOT NULL,  -- I(ndividual), M(acrolanguage), S(pecial)
       Type    char(1) NOT NULL,  -- A(ncient), C(onstructed),
                                  -- E(xtinct), H(istorical), L(iving), S(pecial)
       Ref_Name   varchar(150) collate utf8_general_ci NOT NULL,
                                  -- Reference language name
       Comment    varchar(150) collate utf8_general_ci NULL,
                                  -- Comment relating to one or more of the columns
       PRIMARY KEY (Id),
       INDEX (Part2B),
       INDEX (Part2T),
       INDEX (Part1)
    );
    
    CREATE TABLE ISO_639_3_Names (
       Id             char(3)     NOT NULL,  -- The three-letter 639-3 identifier
       Print_Name     varchar(75) collate utf8_general_ci NOT NULL,
                                             -- One of the names associated with
                                             -- this identifier
       Inverted_Name  varchar(75) collate utf8_general_ci NOT NULL,
                                             -- The inverted form of this
                                             -- Print_Name form
       INDEX (Id)
    );
    
    CREATE TABLE ISO_639_3_Macrolanguages (
       M_Id  char(3) NOT NULL,  -- The identifier for a macrolanguage
       I_Id  char(3) NOT NULL,  -- The identifier for an individual language
                                -- that is a member of the macrolanguage
       PRIMARY KEY (I_Id),
       INDEX (M_Id)
    );
    
    CREATE TABLE ISO_639_3_Retirements (
       Id          char(3)      NOT NULL,   -- The three-letter 639-3 identifier
       Ret_Reason  char(1)      NOT NULL,   -- code for retirement: C (change),
                                            -- D (duplicate), N (non-existent),
                                            -- S (split), M (merge)
       Change_To   char(3)      NULL,       -- in the cases of C, D, and M, the
                                            -- identifier to which all instances
                                            -- of this Id should be changed
       Ret_Remedy  varchar(200) NOT NULL,   -- The instructions for updating an
                                            -- instance of the retired (split)
                                            -- identifier
       Effective   date         NOT NULL,   -- The date the retirement became
                                            -- effective
       PRIMARY KEY (Id)
    );
    
    CREATE TABLE `ISO_639_2` (
       `Part2B` char(3) NOT NULL,
       `Part2T` char(3) default NULL,
       `Part1` char(2) default NULL,
       `Ref_Name_EN` varchar(150) collate utf8_general_ci NOT NULL,
       `Ref_Name_FR` varchar(150) collate utf8_general_ci NOT NULL,
       PRIMARY KEY (Part2B),
       INDEX (Part2T),
       INDEX (Part1)
    );
    
  • I added another table for saving the timestamps of the respective tables and added the values:
    CREATE TABLE `Versions` (
     `Entity` char(50) NOT NULL,
     `Version` char(20) NOT NULL,
     PRIMARY KEY (Entity)
     );
    


    insert into Versions values ('ISO 639-2', '20071029');
    ...

  • Before importing the data files, I had to do a bit of a make up, as the Byte Order Mark (BOM) had to be removed and line break had to be added here and there as to allow the import with MySQL 5 methods.
  • Then I used the LOAD DATA LOCAL FILE directive to import the text files into my local system (encoding utf8, downloaded files where created on a windows system, so use \r\n as line end):

    LOAD DATA LOCAL INFILE 'iso-639-3-macrolanguages_20070801.tab' INTO TABLE ISO_639_3_Macrolanguages CHARACTER SET utf8 FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\r\n' IGNORE 1 LINES;
    ...
  • After that I had to transfer them to the server online, where I wouldn't have access to the local filesystem:

    mysqldump --compatible=mysql40 --skip-lock-tables -p ISO639-3 > iso639codes.sql
    cat iso639codes.sql | grep -v "LOCK TABLES" | grep -v "UNLOCK TABLES" > iso639codes_clean.sql

    I don't have lock table rights so I had to tell mysqldump to omit these commands. Furthermore I use 5.0 locally and 4.0 on the server (--compatible=mysql40).

    Uploading is then easy:

    cat iso639codes_clean.sql | mysql -p -h "hostname" ISO639-3

  • After that I had to create a page where I could quickly add SQL SELECTs as to generate quick statistics.
    Drupal makes it difficult here, as they have a way to access additional databases through drupal library methods, though not allowing to use PHP's own ones (which would screw up drupal), and interfering with the CMS system:

    user warning: Table 'ISO639-3.system' doesn't exist query: SELECT * FROM system WHERE type = 'theme' in /home/.../includes/database.mysql.inc on line 173.
  • Finally I have data in the current versions:
    +--------------------------+----------+
    | Entity                   | Version  |
    +--------------------------+----------+
    | ISO 639-2                | 20071029 |
    | ISO 639-3                | 20071018 |
    | ISO 639-3 Name Index     | 20070814 |
    | ISO 639-3 Macrolanguages | 20070801 |
    | ISO 639-3 Retirements    | 20070814 |
    +--------------------------+----------+
    
  • Some more interesting queries are hopefully yet to come, as I really think it's interesting to see those codes that don't fit into the regular scheme, like Bihari (bh, bih), or Serbo-Croatian (hbs).
  • A last remark: The LoC's ISO 639-2 table doesn't supply any data to the Part2T column if the value is equal to Part2B. This has to be reflected in all queries dealing with this table.

ISO 639 language code statistics

Here is a short overview on ISO 639 language code statistics. If you have any other queries to add, please tell me. See Creating statistics on ISO 639 language codes for how this was generated.

Versions of tables:

TableVersion
ISO_639_320081110
ISO_639_3_Names20081110
ISO_639_3_Macrolanguages20080218
ISO_639_3_Retirements20081110
ISO_639_220081107

ISO 639-1

  • Count of ISO 639-1 language codes: 184
    Query: SELECT COUNT(*) AS count FROM ISO_639_2 WHERE Part1 != ''
  • Count of ISO 639-1 collective language codes: 1
    Query: SELECT COUNT(*) as count FROM ISO_639_2 LEFT JOIN ISO_639_3 USING (Id) WHERE ISO_639_3.Id IS NULL AND ISO_639_2.Part1 != ''
    Alpha2Alpha3 (B/T)English Name
    bhbihBihari

ISO 639-2

  • Count of ISO 639-2 language codes (without reserved): 484
    Query: SELECT COUNT(*) as count FROM ISO_639_2
  • Count of languages entries with different B/T codes: 20
    Query: SELECT COUNT(*) as count FROM ISO_639_2 WHERE Part2T != ''
  • Count of collective language codes (not in ISO 639-3): 66
    Query: SELECT COUNT(*) as count FROM ISO_639_2 LEFT JOIN ISO_639_3 USING (Id) WHERE ISO_639_3.Id IS NULL
  • Count of collective language codes for languages of a language family which don't have an own code: 34
    Query: SELECT COUNT(*) as count FROM ISO_639_2 LEFT JOIN ISO_639_3 USING (Id) WHERE ISO_639_3.Id IS NULL AND Ref_Name_En like '%(Other)%'
  • Count of deprecated ISO 639-2(B) codes: 2
    Query: SELECT COUNT(*) as count FROM ISO_639_3 LEFT JOIN ISO_639_2 USING (Id) where ISO_639_2.Part2T = '' AND ISO_639_3.Part2T != ISO_639_3.Part2B;
    Alpha3 (T)deprecated Alpha3 (B)English Name
    hrvscrCroatian
    srpsccSerbian

ISO 639-3

  • Count of ISO 639-3 language codes: 7700
    Query: SELECT COUNT(*) as count FROM ISO_639_3
  • Count of special codes in ISO 639-3: 4
    Query: SELECT COUNT(*) as count FROM ISO_639_3 WHERE Scope='S'
    Alpha3English Name
    misUncoded languages
    mulMultiple languages
    undUndetermined
    zxxNo linguistic content
  • Count of constructed language codes in ISO 639-3: 20
    Query: SELECT COUNT(*) as count FROM ISO_639_3 WHERE Type='C'
  • Count of ancient language codes in ISO 639-3: 112
    Query: SELECT COUNT(*) as count FROM ISO_639_3 WHERE Type='A'
  • Count of historical language codes in ISO 639-3: 63
    Query: SELECT COUNT(*) as count FROM ISO_639_3 WHERE Type='H'
  • Count of extinct language codes in ISO 639-3: 427
    Query: SELECT COUNT(*) as count FROM ISO_639_3 WHERE Type='E'
  • Count of living language codes in ISO 639-3: 7074
    Query: SELECT COUNT(*) as count FROM ISO_639_3 WHERE Type='L'

Macrolanguage codes

  • Count of macrolanguage codes in ISO 639-3: 57
    Query: SELECT COUNT(*) as count FROM ISO_639_3 WHERE Scope='M'
  • Count of (non deprecated) language codes belonging to a macrolanguage code: 400
    Query: SELECT COUNT(I_Id) as count FROM ISO_639_3_Macrolanguages WHERE I_Status = 'A'
  • Count of macrolanguage codes not in ISO 639-2: 3
    Query: SELECT COUNT(*) as count FROM ISO_639_3 LEFT JOIN ISO_639_2 USING (Id) WHERE ISO_639_2.Part2B IS NULL AND Scope='M'
    Alpha3English Name
    hbsSerbo-Croatian
    klnKalenjin
    luyLuyia
  • Count of macrolanguage codes not in ISO 639-1: 28
    Query: SELECT COUNT(*) as count FROM ISO_639_3 LEFT JOIN ISO_639_2 USING (Id) WHERE (ISO_639_2.Part2B IS NULL OR ISO_639_2.Part1 = '') AND Scope='M'
Syndicate content