Mapping from Ethnologue 14 to ISO 639-2
Submitted by Christoph on 28 December, 2007 - 16:43SIL provides a mapping table from Ethnologue 14 to ISO 639-2 codes. Using this table a mapping from old Ethnologue codes to ISO 639 can be made.
This table has its limitations but can provide some additional usage: First of all no mapping to ISO 639-3 exists. (update, see: Updating Ethnologue codes from the 14th Edition to the 15th Edition) Thus a mapping from code in Ethnologue 14 to ISO 639 may lose some information. If a code in Ethnologue isn't encoded in ISO 639, it will be mapped to its language family with a collective code. Furthermore the mapping might have not been updated to included the latest ISO 639-2 codes, so additional codes might not be mapped to. But this information can be helpful on how to actualy map a single language to a collective code in ISO 639-2, even if a ISO 639-3 code is present.
The code table can't be choosen directly on the site, but it can be found over this direct link.
The following code can be useful for loading this table into MySQL:
CREATE TABLE MAPPING_ETHNO14_ISO_639_2 ( Part2BT char(3) NOT NULL, Ethno14 char(3) NOT NULL, INDEX (Ethno14), INDEX (Part2BT) ) LOAD DATA LOCAL INFILE 'all_639-2.tab' INTO TABLE MAPPING_ETHNO14_ISO_639_2 FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\r\n';
Furthermore when making this mapping, the language names might be useful. They can be found in the regular table under the site's download section: LanguageCodes.tab
Here is the code for MySQL:
CREATE TABLE LanguageCodes ( LangID char(3) NOT NULL, -- Three-letter code CountryID char(2) NOT NULL, -- Main country where used LangStatus char(1) NOT NULL, -- L(iving), N(early extinct), -- E(xtinct) Name varchar(75) collate utf8_general_ci NOT NULL, -- Primary name in that country PRIMARY KEY (LangID) ) LOAD DATA LOCAL INFILE 'LanguageCodes.tab' INTO TABLE LanguageCodes FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\r\n';
List of constructed languages in ISO 639-3
Here is a list on constructed languages in ISO 639-3. See Creating statistics on ISO 639 language codes for how this was generated.
Versions of relevant table(s):
Table | Version |
---|---|
ISO_639_2 | 20081107 |
ISO_639_3 | 20081110 |
Constructed language list
Count of constructed language codes in ISO 639-3: 20
Query: SELECT COUNT(*) as count FROM ISO_639_3 WHERE Type='C'
ISO 639-3 | ISO 639-2 (B/T) | ISO 639-1 | English Name |
---|---|---|---|
afh | afh | Afrihili | |
avk | Kotava | ||
bzt | Brithenig | ||
dws | Dutton World Speedwords | ||
epo | epo | eo | Esperanto |
eur | Europanto | ||
ido | ido | io | Ido |
igs | Interglossa | ||
ile | ile | ie | Interlingue |
ina | ina | ia | Interlingua (International Auxiliary Language Association) |
jbo | jbo | Lojban | |
ldn | L | ||
lfn | Lingua Franca Nova | ||
nov | Novial | ||
qya | Quenya | ||
rmv | Romanova | ||
sjn | Sindarin | ||
tlh | tlh | Klingon | |
vol | vol | vo | Volap |
zbl | zbl | Blissymbols |
Python Unicode rant
Submitted by Christoph on 11 December, 2007 - 00:51If you are at the beginning of a new project and are currently considering which programming language to use, and furthermore you will definitely need Unicode support, then please consider some other languages than python.
Don't use python. Really only used it, if you have no other choice.
Everything python does inside needs some kind of an encoding. If you read from a command line set to utf8, your whole system actually is utf8, then you still need to decode your input from utf8 (as long it's not ascii).
It doesn't wrap everything in unicode (unlike Java), as the basic string is str and unicode strings are different objects. So str("hello") isn't the same as unicode("hello")
.
Furthermore you will stumple upon a lot of other culprits: e.g. htmlentitydefs has a table for mapping "entity definitions to their replacement text in ISO Latin-1". Ups, there it is again: Latin-1. You can't just use it, but you have to decode it first.
Or the csv module will say, that you actually can't import unicode strings that easily, you will need to build an extra wrapper.
The dream of using Unicode all over the world is still far away.
At least for python...
Creating statistics on ISO 639 language codes
Submitted by Christoph on 7 December, 2007 - 15:31I created a page showing simple ISO 639 language code statistics.
Creating this took some time as I had do figure out how to integrate that into PHP and I never did anything in that language before. Additionally though I already described how to load ISO 639-3 tables provided by SIL into MySQL, I had to make some changes to the given data, which I didn't do in the first place.
To not make MySQL search case sensitively it has to be told to use a proper collation. Where it seems needed I added it to the column specifications.
Steps:
- Downloading all the tables from SIL and LoC
- Modifying the database schemes, as I need indexing and keys:
CREATE TABLE ISO_639_3 ( Id char(3) NOT NULL, -- The three-letter 639-3 identifier Part2B char(3) NULL, -- Equivalent 639-2 identifier of the -- bibliographic applications code set, if -- there is one Part2T char(3) NULL, -- Equivalent 639-2 identifier of the terminology -- applications code set, if there is one Part1 char(2) NULL, -- Equivalent 639-1 identifier, if there is one Scope char(1) NOT NULL, -- I(ndividual), M(acrolanguage), S(pecial) Type char(1) NOT NULL, -- A(ncient), C(onstructed), -- E(xtinct), H(istorical), L(iving), S(pecial) Ref_Name varchar(150) collate utf8_general_ci NOT NULL, -- Reference language name Comment varchar(150) collate utf8_general_ci NULL, -- Comment relating to one or more of the columns PRIMARY KEY (Id), INDEX (Part2B), INDEX (Part2T), INDEX (Part1) ); CREATE TABLE ISO_639_3_Names ( Id char(3) NOT NULL, -- The three-letter 639-3 identifier Print_Name varchar(75) collate utf8_general_ci NOT NULL, -- One of the names associated with -- this identifier Inverted_Name varchar(75) collate utf8_general_ci NOT NULL, -- The inverted form of this -- Print_Name form INDEX (Id) ); CREATE TABLE ISO_639_3_Macrolanguages ( M_Id char(3) NOT NULL, -- The identifier for a macrolanguage I_Id char(3) NOT NULL, -- The identifier for an individual language -- that is a member of the macrolanguage PRIMARY KEY (I_Id), INDEX (M_Id) ); CREATE TABLE ISO_639_3_Retirements ( Id char(3) NOT NULL, -- The three-letter 639-3 identifier Ret_Reason char(1) NOT NULL, -- code for retirement: C (change), -- D (duplicate), N (non-existent), -- S (split), M (merge) Change_To char(3) NULL, -- in the cases of C, D, and M, the -- identifier to which all instances -- of this Id should be changed Ret_Remedy varchar(200) NOT NULL, -- The instructions for updating an -- instance of the retired (split) -- identifier Effective date NOT NULL, -- The date the retirement became -- effective PRIMARY KEY (Id) ); CREATE TABLE `ISO_639_2` ( `Part2B` char(3) NOT NULL, `Part2T` char(3) default NULL, `Part1` char(2) default NULL, `Ref_Name_EN` varchar(150) collate utf8_general_ci NOT NULL, `Ref_Name_FR` varchar(150) collate utf8_general_ci NOT NULL, PRIMARY KEY (Part2B), INDEX (Part2T), INDEX (Part1) );
- I added another table for saving the timestamps of the respective tables and added the values:
CREATE TABLE `Versions` ( `Entity` char(50) NOT NULL, `Version` char(20) NOT NULL, PRIMARY KEY (Entity) );
insert into Versions values ('ISO 639-2', '20071029');
... - Before importing the data files, I had to do a bit of a make up, as the Byte Order Mark (BOM) had to be removed and line break had to be added here and there as to allow the import with MySQL 5 methods.
- Then I used the LOAD DATA LOCAL FILE directive to import the text files into my local system (encoding utf8, downloaded files where created on a windows system, so use \r\n as line end):
LOAD DATA LOCAL INFILE 'iso-639-3-macrolanguages_20070801.tab' INTO TABLE ISO_639_3_Macrolanguages CHARACTER SET utf8 FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\r\n' IGNORE 1 LINES;
... - After that I had to transfer them to the server online, where I wouldn't have access to the local filesystem:
mysqldump --compatible=mysql40 --skip-lock-tables -p ISO639-3 > iso639codes.sql
cat iso639codes.sql | grep -v "LOCK TABLES" | grep -v "UNLOCK TABLES" > iso639codes_clean.sql
I don't have lock table rights so I had to tell mysqldump to omit these commands. Furthermore I use 5.0 locally and 4.0 on the server (--compatible=mysql40).Uploading is then easy:
cat iso639codes_clean.sql | mysql -p -h "hostname" ISO639-3 - After that I had to create a page where I could quickly add SQL SELECTs as to generate quick statistics.
Drupal makes it difficult here, as they have a way to access additional databases through drupal library methods, though not allowing to use PHP's own ones (which would screw up drupal), and interfering with the CMS system:
user warning: Table 'ISO639-3.system' doesn't exist query: SELECT * FROM system WHERE type = 'theme' in /home/.../includes/database.mysql.inc on line 173. - Finally I have data in the current versions:
+--------------------------+----------+ | Entity | Version | +--------------------------+----------+ | ISO 639-2 | 20071029 | | ISO 639-3 | 20071018 | | ISO 639-3 Name Index | 20070814 | | ISO 639-3 Macrolanguages | 20070801 | | ISO 639-3 Retirements | 20070814 | +--------------------------+----------+
- Some more interesting queries are hopefully yet to come, as I really think it's interesting to see those codes that don't fit into the regular scheme, like Bihari (bh, bih), or Serbo-Croatian (hbs).
- A last remark: The LoC's ISO 639-2 table doesn't supply any data to the Part2T column if the value is equal to Part2B. This has to be reflected in all queries dealing with this table.
ISO 639 language code statistics
Here is a short overview on ISO 639 language code statistics. If you have any other queries to add, please tell me. See Creating statistics on ISO 639 language codes for how this was generated.
Versions of tables:
Table | Version |
---|---|
ISO_639_3 | 20081110 |
ISO_639_3_Names | 20081110 |
ISO_639_3_Macrolanguages | 20080218 |
ISO_639_3_Retirements | 20081110 |
ISO_639_2 | 20081107 |
ISO 639-1
- Count of ISO 639-1 language codes: 184
Query:SELECT COUNT(*) AS count FROM ISO_639_2 WHERE Part1 != ''
- Count of ISO 639-1 collective language codes: 1
Query:SELECT COUNT(*) as count FROM ISO_639_2 LEFT JOIN ISO_639_3 USING (Id) WHERE ISO_639_3.Id IS NULL AND ISO_639_2.Part1 != ''
Alpha2 Alpha3 (B/T) English Name bh bih Bihari
ISO 639-2
- Count of ISO 639-2 language codes (without reserved): 484
Query:SELECT COUNT(*) as count FROM ISO_639_2
- Count of languages entries with different B/T codes: 20
Query:SELECT COUNT(*) as count FROM ISO_639_2 WHERE Part2T != ''
- Count of collective language codes (not in ISO 639-3): 66
Query:SELECT COUNT(*) as count FROM ISO_639_2 LEFT JOIN ISO_639_3 USING (Id) WHERE ISO_639_3.Id IS NULL
- Count of collective language codes for languages of a language family which don't have an own code: 34
Query:SELECT COUNT(*) as count FROM ISO_639_2 LEFT JOIN ISO_639_3 USING (Id) WHERE ISO_639_3.Id IS NULL AND Ref_Name_En like '%(Other)%'
- Count of deprecated ISO 639-2(B) codes: 2
Query:SELECT COUNT(*) as count FROM ISO_639_3 LEFT JOIN ISO_639_2 USING (Id) where ISO_639_2.Part2T = '' AND ISO_639_3.Part2T != ISO_639_3.Part2B;
Alpha3 (T) deprecated Alpha3 (B) English Name hrv scr Croatian srp scc Serbian
ISO 639-3
- Count of ISO 639-3 language codes: 7700
Query:SELECT COUNT(*) as count FROM ISO_639_3
- Count of special codes in ISO 639-3: 4
Query:SELECT COUNT(*) as count FROM ISO_639_3 WHERE Scope='S'
Alpha3 English Name mis Uncoded languages mul Multiple languages und Undetermined zxx No linguistic content - Count of constructed language codes in ISO 639-3: 20
Query:SELECT COUNT(*) as count FROM ISO_639_3 WHERE Type='C'
- Count of ancient language codes in ISO 639-3: 112
Query:SELECT COUNT(*) as count FROM ISO_639_3 WHERE Type='A'
- Count of historical language codes in ISO 639-3: 63
Query:SELECT COUNT(*) as count FROM ISO_639_3 WHERE Type='H'
- Count of extinct language codes in ISO 639-3: 427
Query:SELECT COUNT(*) as count FROM ISO_639_3 WHERE Type='E'
- Count of living language codes in ISO 639-3: 7074
Query:SELECT COUNT(*) as count FROM ISO_639_3 WHERE Type='L'
Macrolanguage codes
- Count of macrolanguage codes in ISO 639-3: 57
Query:SELECT COUNT(*) as count FROM ISO_639_3 WHERE Scope='M'
- Count of (non deprecated) language codes belonging to a macrolanguage code: 400
Query:SELECT COUNT(I_Id) as count FROM ISO_639_3_Macrolanguages WHERE I_Status = 'A'
- Count of macrolanguage codes not in ISO 639-2: 3
Query:SELECT COUNT(*) as count FROM ISO_639_3 LEFT JOIN ISO_639_2 USING (Id) WHERE ISO_639_2.Part2B IS NULL AND Scope='M'
Alpha3 English Name hbs Serbo-Croatian kln Kalenjin luy Luyia - Count of macrolanguage codes not in ISO 639-1: 28
Query:SELECT COUNT(*) as count FROM ISO_639_3 LEFT JOIN ISO_639_2 USING (Id) WHERE (ISO_639_2.Part2B IS NULL OR ISO_639_2.Part1 = '') AND Scope='M'
