Creating statistics on ISO 639 language codes
Submitted by Christoph on 7 December, 2007 - 15:31
I created a page showing simple ISO 639 language code statistics.
Creating this took some time as I had do figure out how to integrate that into PHP and I never did anything in that language before. Additionally though I already described how to load ISO 639-3 tables provided by SIL into MySQL, I had to make some changes to the given data, which I didn't do in the first place.
To not make MySQL search case sensitively it has to be told to use a proper collation. Where it seems needed I added it to the column specifications.
Steps:
- Downloading all the tables from SIL and LoC
- Modifying the database schemes, as I need indexing and keys:
CREATE TABLE ISO_639_3 ( Id char(3) NOT NULL, -- The three-letter 639-3 identifier Part2B char(3) NULL, -- Equivalent 639-2 identifier of the -- bibliographic applications code set, if -- there is one Part2T char(3) NULL, -- Equivalent 639-2 identifier of the terminology -- applications code set, if there is one Part1 char(2) NULL, -- Equivalent 639-1 identifier, if there is one Scope char(1) NOT NULL, -- I(ndividual), M(acrolanguage), S(pecial) Type char(1) NOT NULL, -- A(ncient), C(onstructed), -- E(xtinct), H(istorical), L(iving), S(pecial) Ref_Name varchar(150) collate utf8_general_ci NOT NULL, -- Reference language name Comment varchar(150) collate utf8_general_ci NULL, -- Comment relating to one or more of the columns PRIMARY KEY (Id), INDEX (Part2B), INDEX (Part2T), INDEX (Part1) ); CREATE TABLE ISO_639_3_Names ( Id char(3) NOT NULL, -- The three-letter 639-3 identifier Print_Name varchar(75) collate utf8_general_ci NOT NULL, -- One of the names associated with -- this identifier Inverted_Name varchar(75) collate utf8_general_ci NOT NULL, -- The inverted form of this -- Print_Name form INDEX (Id) ); CREATE TABLE ISO_639_3_Macrolanguages ( M_Id char(3) NOT NULL, -- The identifier for a macrolanguage I_Id char(3) NOT NULL, -- The identifier for an individual language -- that is a member of the macrolanguage PRIMARY KEY (I_Id), INDEX (M_Id) ); CREATE TABLE ISO_639_3_Retirements ( Id char(3) NOT NULL, -- The three-letter 639-3 identifier Ret_Reason char(1) NOT NULL, -- code for retirement: C (change), -- D (duplicate), N (non-existent), -- S (split), M (merge) Change_To char(3) NULL, -- in the cases of C, D, and M, the -- identifier to which all instances -- of this Id should be changed Ret_Remedy varchar(200) NOT NULL, -- The instructions for updating an -- instance of the retired (split) -- identifier Effective date NOT NULL, -- The date the retirement became -- effective PRIMARY KEY (Id) ); CREATE TABLE `ISO_639_2` ( `Part2B` char(3) NOT NULL, `Part2T` char(3) default NULL, `Part1` char(2) default NULL, `Ref_Name_EN` varchar(150) collate utf8_general_ci NOT NULL, `Ref_Name_FR` varchar(150) collate utf8_general_ci NOT NULL, PRIMARY KEY (Part2B), INDEX (Part2T), INDEX (Part1) );
- I added another table for saving the timestamps of the respective tables and added the values:
CREATE TABLE `Versions` ( `Entity` char(50) NOT NULL, `Version` char(20) NOT NULL, PRIMARY KEY (Entity) );
insert into Versions values ('ISO 639-2', '20071029');
... - Before importing the data files, I had to do a bit of a make up, as the Byte Order Mark (BOM) had to be removed and line break had to be added here and there as to allow the import with MySQL 5 methods.
- Then I used the LOAD DATA LOCAL FILE directive to import the text files into my local system (encoding utf8, downloaded files where created on a windows system, so use \r\n as line end):
LOAD DATA LOCAL INFILE 'iso-639-3-macrolanguages_20070801.tab' INTO TABLE ISO_639_3_Macrolanguages CHARACTER SET utf8 FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\r\n' IGNORE 1 LINES;
... - After that I had to transfer them to the server online, where I wouldn't have access to the local filesystem:
mysqldump --compatible=mysql40 --skip-lock-tables -p ISO639-3 > iso639codes.sql
cat iso639codes.sql | grep -v "LOCK TABLES" | grep -v "UNLOCK TABLES" > iso639codes_clean.sql
I don't have lock table rights so I had to tell mysqldump to omit these commands. Furthermore I use 5.0 locally and 4.0 on the server (--compatible=mysql40).Uploading is then easy:
cat iso639codes_clean.sql | mysql -p -h "hostname" ISO639-3 - After that I had to create a page where I could quickly add SQL SELECTs as to generate quick statistics.
Drupal makes it difficult here, as they have a way to access additional databases through drupal library methods, though not allowing to use PHP's own ones (which would screw up drupal), and interfering with the CMS system:
user warning: Table 'ISO639-3.system' doesn't exist query: SELECT * FROM system WHERE type = 'theme' in /home/.../includes/database.mysql.inc on line 173. - Finally I have data in the current versions:
+--------------------------+----------+ | Entity | Version | +--------------------------+----------+ | ISO 639-2 | 20071029 | | ISO 639-3 | 20071018 | | ISO 639-3 Name Index | 20070814 | | ISO 639-3 Macrolanguages | 20070801 | | ISO 639-3 Retirements | 20070814 | +--------------------------+----------+
- Some more interesting queries are hopefully yet to come, as I really think it's interesting to see those codes that don't fit into the regular scheme, like Bihari (bh, bih), or Serbo-Croatian (hbs).
- A last remark: The LoC's ISO 639-2 table doesn't supply any data to the Part2T column if the value is equal to Part2B. This has to be reflected in all queries dealing with this table.