Creating statistics on ISO 639 language codes

I created a page showing simple ISO 639 language code statistics.

Creating this took some time as I had do figure out how to integrate that into PHP and I never did anything in that language before. Additionally though I already described how to load ISO 639-3 tables provided by SIL into MySQL, I had to make some changes to the given data, which I didn't do in the first place.

To not make MySQL search case sensitively it has to be told to use a proper collation. Where it seems needed I added it to the column specifications.

Steps:

  • Downloading all the tables from SIL and LoC
  • Modifying the database schemes, as I need indexing and keys:
    CREATE TABLE ISO_639_3 (
       Id      char(3) NOT NULL,  -- The three-letter 639-3 identifier
       Part2B  char(3) NULL,      -- Equivalent 639-2 identifier of the
                                  -- bibliographic applications code set, if
                                  -- there is one
       Part2T  char(3) NULL,      -- Equivalent 639-2 identifier of the terminology
                                  -- applications code set, if there is one
       Part1   char(2) NULL,      -- Equivalent 639-1 identifier, if there is one    
       Scope   char(1) NOT NULL,  -- I(ndividual), M(acrolanguage), S(pecial)
       Type    char(1) NOT NULL,  -- A(ncient), C(onstructed),  
                                  -- E(xtinct), H(istorical), L(iving), S(pecial)
       Ref_Name   varchar(150) collate utf8_general_ci NOT NULL,
                                  -- Reference language name 
       Comment    varchar(150) collate utf8_general_ci NULL,
                                  -- Comment relating to one or more of the columns
       PRIMARY KEY (Id),
       INDEX (Part2B),
       INDEX (Part2T),
       INDEX (Part1)
    );
    
    CREATE TABLE ISO_639_3_Names (
       Id             char(3)     NOT NULL,  -- The three-letter 639-3 identifier
       Print_Name     varchar(75) collate utf8_general_ci NOT NULL,
                                             -- One of the names associated with
                                             -- this identifier 
       Inverted_Name  varchar(75) collate utf8_general_ci NOT NULL,
                                             -- The inverted form of this
                                             -- Print_Name form
       INDEX (Id)
    );
    
    CREATE TABLE ISO_639_3_Macrolanguages (
       M_Id  char(3) NOT NULL,  -- The identifier for a macrolanguage
       I_Id  char(3) NOT NULL,  -- The identifier for an individual language
                                -- that is a member of the macrolanguage
       PRIMARY KEY (I_Id),
       INDEX (M_Id)
    );
    
    CREATE TABLE ISO_639_3_Retirements (
       Id          char(3)      NOT NULL,   -- The three-letter 639-3 identifier
       Ret_Reason  char(1)      NOT NULL,   -- code for retirement: C (change),
                                            -- D (duplicate), N (non-existent),
                                            -- S (split), M (merge)
       Change_To   char(3)      NULL,       -- in the cases of C, D, and M, the
                                            -- identifier to which all instances
                                            -- of this Id should be changed
       Ret_Remedy  varchar(200) NOT NULL,   -- The instructions for updating an
                                            -- instance of the retired (split)
                                            -- identifier
       Effective   date         NOT NULL,   -- The date the retirement became
                                            -- effective
       PRIMARY KEY (Id)
    );
    
    CREATE TABLE `ISO_639_2` (
       `Part2B` char(3) NOT NULL,
       `Part2T` char(3) default NULL,
       `Part1` char(2) default NULL,
       `Ref_Name_EN` varchar(150) collate utf8_general_ci NOT NULL,
       `Ref_Name_FR` varchar(150) collate utf8_general_ci NOT NULL,
       PRIMARY KEY (Part2B),
       INDEX (Part2T),
       INDEX (Part1)
    );
    
  • I added another table for saving the timestamps of the respective tables and added the values:
    CREATE TABLE `Versions` (
     `Entity` char(50) NOT NULL,
     `Version` char(20) NOT NULL,
     PRIMARY KEY (Entity)
     );
    


    insert into Versions values ('ISO 639-2', '20071029');
    ...

  • Before importing the data files, I had to do a bit of a make up, as the Byte Order Mark (BOM) had to be removed and line break had to be added here and there as to allow the import with MySQL 5 methods.
  • Then I used the LOAD DATA LOCAL FILE directive to import the text files into my local system (encoding utf8, downloaded files where created on a windows system, so use \r\n as line end):

    LOAD DATA LOCAL INFILE 'iso-639-3-macrolanguages_20070801.tab' INTO TABLE ISO_639_3_Macrolanguages CHARACTER SET utf8 FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\r\n' IGNORE 1 LINES;
    ...
  • After that I had to transfer them to the server online, where I wouldn't have access to the local filesystem:

    mysqldump --compatible=mysql40 --skip-lock-tables -p ISO639-3 > iso639codes.sql
    cat iso639codes.sql | grep -v "LOCK TABLES" | grep -v "UNLOCK TABLES" > iso639codes_clean.sql

    I don't have lock table rights so I had to tell mysqldump to omit these commands. Furthermore I use 5.0 locally and 4.0 on the server (--compatible=mysql40).

    Uploading is then easy:

    cat iso639codes_clean.sql | mysql -p -h "hostname" ISO639-3

  • After that I had to create a page where I could quickly add SQL SELECTs as to generate quick statistics.
    Drupal makes it difficult here, as they have a way to access additional databases through drupal library methods, though not allowing to use PHP's own ones (which would screw up drupal), and interfering with the CMS system:

    user warning: Table 'ISO639-3.system' doesn't exist query: SELECT * FROM system WHERE type = 'theme' in /home/.../includes/database.mysql.inc on line 173.
  • Finally I have data in the current versions:
    +--------------------------+----------+
    | Entity                   | Version  |
    +--------------------------+----------+
    | ISO 639-2                | 20071029 |
    | ISO 639-3                | 20071018 |
    | ISO 639-3 Name Index     | 20070814 |
    | ISO 639-3 Macrolanguages | 20070801 |
    | ISO 639-3 Retirements    | 20070814 |
    +--------------------------+----------+
    
  • Some more interesting queries are hopefully yet to come, as I really think it's interesting to see those codes that don't fit into the regular scheme, like Bihari (bh, bih), or Serbo-Croatian (hbs).
  • A last remark: The LoC's ISO 639-2 table doesn't supply any data to the Part2T column if the value is equal to Part2B. This has to be reflected in all queries dealing with this table.