
PanLex: Printed knowledge sources

Identifying and prioritizing printed sources in libraries

Introduction

In our experience, many low-density languages’ lexicons are still documented only in print. Major research libraries appear to have copies of thousands of lexical resources for low-density languages. Therefore, in 02012 we decided to process a major catalogue of library resources in order to identify such resources and prioritize them as potential knowledge sources for PanLex. This document describes our method and results.

Catalog

We chose for this project a catalog titled “OL Dump”, maintained by the Open Library. We retrieved its version of 02012-04-30. It was available as a gzip-compressed file that was 37 GB in size after decompression. It contained 40.2 million newline-delimited publication (i.e. “edition” or “work”) records.
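Because the dump is so large, it is natural to stream it record by record instead of decompressing it in full. The sketch below (in Python, for illustration only; the project’s own tooling was written in Perl) shows the idea:

```python
import gzip

def iter_records(path):
    """Yield newline-delimited records from a gzip-compressed dump,
    decompressing on the fly rather than materializing 37 GB on disk."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield line.rstrip("\n")
```

Streaming keeps memory use constant regardless of dump size, which matters for a 40-million-record file.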

Subject extraction

We chose to identify PanLex-relevant resources by means of the subjects in the catalog entries. We began by extracting the subjects from the records with subjects.pl. This produced a 341-MB file, subjects.txt, containing only (unique) subjects and their occurrence counts. There were 6.4 million unique subjects in it.
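Conceptually, subjects.pl tallies how often each subject string occurs across all catalog records. A minimal Python sketch of that tally, assuming (hypothetically) that each parsed record carries its subject strings under a "subjects" key:

```python
from collections import Counter

def subject_counts(records):
    """Tally occurrence counts of subject strings across catalog
    records. Records lacking subjects are simply skipped."""
    counts = Counter()
    for rec in records:
        for subj in rec.get("subjects", []):
            counts[subj] += 1
    return counts
```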

The subjects were not uniformly structured. In the subjects.pl routine, we corrected two inconsistencies:

Subject relevance classification

Stage 1

We classified the 6.4 million subjects extracted from the catalog in terms of their relevance to PanLex, using dicsubs.pl. This script selected 55 thousand potentially relevant subjects and produced a 3.3-MB file, dicsubs.txt, containing them and their occurrence counts.

The rule applied by dicsubs.pl was simply that at least one of the following strings must appear in the subject:

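Whatever the exact string list, the rule amounts to a substring test over a fixed set of keywords. A Python sketch with illustrative stand-in keywords (not the actual list used by dicsubs.pl):

```python
def is_potentially_relevant(subject, keywords=("dictionar", "glossar", "vocabular")):
    """Return True if any keyword substring appears in the subject,
    case-insensitively. The keyword tuple here is a hypothetical
    stand-in for the real list."""
    s = subject.lower()
    return any(k in s for k in keywords)
```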
Stage 2

We further classified the 55 thousand potentially relevant subjects into two classes, apparently useful and apparently useless subjects, with plxsubs.pl. It produced a 500-KB file of 9 thousand apparently useful subjects, possubs.txt, and a 2.8-MB file of 47 thousand apparently useless subjects, negsubs.txt. The plxsubs.pl script was designed to classify monolingual resources and nonlexical resources as apparently useless.

Stage 3

We inspected the files of apparently useful and apparently useless subjects and identified misclassified subjects. For each file, we created a companion file containing copies of the subjects wrongly included in it: subjects wrongly included in the apparently useful file were copied into possub-not.txt, and subjects wrongly included in the apparently useless file were copied into negsub-not.txt.

We applied the files of misclassified subjects (possub-not.txt and negsub-not.txt) to the file of apparently useful subjects (possubs.txt) to improve its quality, with fixsubs.pl. This script produced a 434-KB file of 9 thousand subjects, goodsubs.txt.
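The correction logic can be summarized as a pair of set operations: remove the subjects flagged in possub-not.txt and restore those flagged in negsub-not.txt. A Python sketch (the actual fixsubs.pl may differ in detail):

```python
def fix_subjects(possubs, possub_not, negsub_not):
    """Correct the apparently-useful subject list: drop subjects
    flagged as wrongly included, and restore subjects flagged as
    wrongly excluded. Returns the corrected set of good subjects."""
    return (set(possubs) - set(possub_not)) | set(negsub_not)
```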

Language identification

Introduction

Source prioritization depends on the formal properties of sources (bilingual or multilingual dictionary, vocabulary, etc.) and also on the particular languages that they document. The highest-priority sources document languages about which PanLex has no or only meager data.

Multilingual resources are also more useful than bilingual ones, in general, for translation inference with PanLex.

Stage 1

To prioritize sources according to language, we identified the languages named in the surviving subjects (those in goodsubs.txt). PanLex uniquely identifies language varieties with a language code and a variety code (combined into language-variety UIDs, such as “deu-000” for German), but OL Dump subjects identify them with English or French names. Thus, we needed to extract English or French language names from goodsubs.txt and convert those names to PanLex UIDs.

To identify multilingual sources, we needed to classify subjects according to whether they were bilingual or multilingual.

For these purposes we annotated the subjects in goodsubs.txt with three additional parameters: a multilinguality value and source and target language-variety values.

The language-variety values were blank unless the multilinguality value was “f”. Where it was “f”, each language-variety parameter had “-” as its value if it was not determined, or otherwise a value of the format “<k>x”, in which “k” was replaced with a type and “x” with a name. The type was one of the following:

“l”: a language
“d”: a dialect
“f”: a language family
“t”: a territorial name

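A value in this format can be split back into its type and name parts with a simple pattern. A Python sketch (the function name is ours, not from the project’s tooling):

```python
import re

def parse_lang_value(value):
    """Parse a language-variety annotation of the form "<k>x", where
    k is a one-letter type and x is a name. Returns (type, name), or
    None for blank or "-" (undetermined) values."""
    if value in ("", "-"):
        return None
    m = re.fullmatch(r"<(.)>(.*)", value)
    if not m:
        raise ValueError(f"malformed annotation: {value!r}")
    return m.group(1), m.group(2)
```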
The script that produced these annotations was langfind.pl. It performed up to 60 tests per subject in order to populate these columns. The complexity of the script was due to the fact that the subjects neither systematically encoded the properties we needed to extract nor uniformly represented language names. For example, Greek was variously represented as:

The script also normalized the encodings of the language names, by:

The script produced a 606-KB file, sublangs.txt, containing these columns: the multilinguality value, the source language variety, the target language variety, the occurrence count, and the subject.

We extracted the unique type-name combinations from sublangs.txt with unique.pl, producing a 33-KB file, ulangs.txt, with the type in the first column and the name in the second column. It contained 2,342 entries.

Stage 2

We used PanLex to begin mapping the names in ulangs.txt to language varieties. We began by importing ulangs.txt into the “interim” schema of PanLex as a table and adding three more columns to it. The definition of this table, interim.ulangs, was:

                                             Table "interim.ulangs"
 Column |     Type     | Modifiers | Storage  |                                    Description                                     
--------+--------------+-----------+----------+------------------------------------------------------------------------------------
 tp     | character(1) | not null  | extended | type
 nm     | text         | not null  | extended | name
 lcvcct | smallint     |           | plain    | count of lvs whose tt = name
 lcct   | smallint     |           | plain    | count of lcs that are translations of English ex = name
 ex     | integer      |           | plain    | ex of nm in eng-000
 lcvc   | character(7) |           | extended | lcvc if exactly 1 label is identical to the name
 lc     | character(3) |           | extended | lc if exactly 1 lc is a translation of an eng-000 ex with tt identical to the name
Indexes:
    "ulangs_pkey" PRIMARY KEY, btree (tp, nm) CLUSTER

The initially populated columns were “tp” and “nm”. To populate the other columns, we executed queries on the table.

We populated column “ex” with the ID of the English (language variety 187) expression, if any, whose text was identical to the name in column “nm”. There were 1,645 such names. The query was:

update ulangs set ex = ex.ex from ex where lv = 187 and tt = nm;

We populated column “lcvcct” with the counts of language varieties whose labels were identical to the names in column “nm”. There were 586 such names. The distribution of counts was:

 lcvcct | count 
--------+-------
      1 |   544
      2 |    30
      3 |     8
      5 |     1
      6 |     3

The query was:

update ulangs set lcvcct = lvct from (select tt, count (lv) as lvct from (select distinct tt, lv from ulangs, lv where tt = nm) as tbl group by tt) as lvct where tt = nm;

The inner table “tbl” contained the (integer) IDs and labels of the language varieties whose labels were identical to the names in column “nm”. The outer table “lvct” contained the names in column “nm” and the counts of language varieties whose labels were identical to them.

We populated column “lcct” with the counts of expressions in the ISO 639 language variety (whose expressions are generally ISO 639 alpha-3 codes identifying languages) that are translations of the (English) expressions in column “ex”. There were 1,400 such expressions with at least one ISO 639 translation. The distribution of counts was:

 lcct | count 
------+-------
    1 |  1159
    2 |   151
    3 |    43
    4 |    25
    5 |     7
    6 |     8
    7 |     3
    8 |     2
   10 |     1
   14 |     1

The query was:

update ulangs set lcct = trct from (select ex0, count (ex1) as trct from (select distinct ulangs.ex as ex0, dn2.ex as ex1 from ulangs, dn as dn1, dn as dn2, ex where dn1.ex = ulangs.ex and dn2.mn = dn1.mn and ex.ex = dn2.ex and lv = 41) as tbl group by ex0) as outbl where ex0 = ex;

The inner table “tbl” contained the distinct translations from the (English) expressions in column “ex” to expressions in language variety ISO 639 (variety 41). The latter expressions were generally alpha-3 ISO 639 codes. A given English expression could have translations into 0 or more ISO 639 expressions. The outer table “outbl” contained the IDs of the English expressions whose texts were identical to the names in column “nm” and the counts of their ISO 639 translations.

We populated column “lcvc” with the UIDs of the 544 language varieties whose labels were uniquely identical to the names in column “nm”. The query was:

update ulangs set lcvc = lcvc (lv) from lv where lcvcct = 1 and tt = nm;

We populated column “lc” with the 1,159 unique ISO 639 translations of (English) expressions in column “ex”. The query was:

update ulangs set lc = tt from (select distinct ulangs.ex, tt from ulangs, dn as dn1, dn as dn2, ex where lcct = 1 and dn1.ex = ulangs.ex and dn2.mn = dn1.mn and ex.ex = dn2.ex and lv = 41) as tbl where tbl.ex = ulangs.ex;

The inner table “tbl” contained the distinct pairs of (English) expression IDs and (if unique) the texts (codes) of their ISO 639 translations.

Stage 3

We documented the 107 mappings arising in cases where a given name in column “nm” was identical to the labels of 2 or more language varieties. For this purpose we created a table interim.ulangsq, containing the types, names, and corresponding language-variety UIDs. The query was:

create table interim.ulangsq as select tp, nm, lcvc (lv) as lcode from ulangs, lv where lcvcct > 1 and tt = nm order by tp, nm, lcode;

The definition of this table was:

                       Table "interim.ulangsq"
 Column |     Type     | Modifiers | Storage  |     Description      
--------+--------------+-----------+----------+----------------------
 tp     | character(1) | not null  | extended | type
 nm     | text         | not null  | extended | name
 lcode  | text         | not null  | extended | candidate lcvc or lc
Indexes:
    "ulangsq_pkey" PRIMARY KEY, btree (tp, nm, lcode) CLUSTER

We documented the 675 mappings where a given name in column “nm” was an English expression with translations into 2 or more ISO 639 expressions (codes). For this purpose we created a temporary table ulangsq1, containing the types, names, and corresponding ISO 639 expression (code) texts. The query was:

create temporary table ulangsq1 as select distinct tp, nm, tt as lc from ulangs, dn as dn1, dn as dn2, ex where lcct > 1 and dn1.ex = ulangs.ex and dn2.mn = dn1.mn and ex.ex = dn2.ex and lv = 41 order by tp, nm, tt;

We added the second of these mapping documents to the first, enlarging table interim.ulangsq. Its column “lcode” thereby became populated with a mixture of language-variety UIDs and ISO 639 expressions (codes). The query was:

insert into ulangsq select * from ulangsq1;

We documented the mapping failures by creating a table interim.ulangsz, containing 896 type-name pairs that had led to no UIDs or ISO 639 expressions, whether via language-variety labels or English expressions. The table also contained those names’ IDs as English expressions (if any). The query was:

create table interim.ulangsz as select tp, nm, ex from ulangs where lcvcct is null and lcct is null order by tp, nm;

The definition of this table was:

                              Table "interim.ulangsz"
 Column |     Type     | Modifiers | Storage  |             Description             
--------+--------------+-----------+----------+-------------------------------------
 tp     | character(1) | not null  | extended | type
 nm     | text         | not null  | extended | name
 ex     | integer      |           | plain    | ID of name as an eng-000 expression
Indexes:
    "ulangsz_pkey" PRIMARY KEY, btree (tp, nm) CLUSTER

Stage 4

We discovered additional mappings. We limited our investigation to type-name pairs labeled as languages (l) or dialects (d), omitting family (f) and territorial (t) pairs. This investigation led to the tentative identification of 450 mappings of names to language-variety UIDs. Some of the names had not been mapped at all, and others had been mapped to ISO 639 expressions but not to language-variety UIDs. We recorded these mappings in the file edlangs.txt and imported it into the database as table interim.edlangs. The definition of this table was:

                                Table "interim.edlangs"
 Column |     Type     | Modifiers | Storage  |               Description               
--------+--------------+-----------+----------+-----------------------------------------
 tp     | character(1) | not null  | extended | type
 lcvc   | character(7) | not null  | extended | variety UID
 nm     | text         | not null  | extended | name
 etc    | text         |           | extended | autonym and/or other miscellaneous data
Indexes:
    "edlangs_pkey" PRIMARY KEY, btree (tp, nm) CLUSTER

We consolidated all of the mappings into the existing table interim.ulangsq. The queries were:

insert into ulangsq select tp, nm, lcvc from ulangs where lcvc is not null except select * from ulangsq;

insert into ulangsq select tp, nm, lc from ulangs where lc is not null except select * from ulangsq;

insert into ulangsq select tp, nm, lcvc from edlangs except select * from ulangsq;

These queries added 544, 1,159, and 450 mappings, respectively, to interim.ulangsq. This consolidation expanded the list of mappings in interim.ulangsq to 2,935 items, with 1,779 names being mapped to 1 or more codes.

These mappings exhibited language-name ambiguity. Some language names were translatable into multiple ISO 639 expressions or were labels of multiple language varieties. Some of these ambiguities involved names that mapped to very low-priority language varieties and also to very high-priority ones, but the vast majority of publications with subjects containing those names documented the low-priority (i.e. high-density) varieties.

Language prioritization

We defined and implemented a rule for the prioritization of language varieties. Given PanLex’s panlingual goal, the rule aimed to prioritize those language varieties that were not yet extensively documented in PanLex and for which no sources were in the queue awaiting consultation.

The rule defined a language code as high-priority if no variety of that code’s language had 20 or more expressions in PanLex and no approver in the queue declared any variety of that language.

The rule defined a language variety as high-priority if it had fewer than 20 expressions in PanLex and no approver in the queue declared it.
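Restated as a predicate, the code-level rule quantifies over all varieties of a language. A Python sketch, with hypothetical function and parameter names:

```python
def is_high_priority_code(varieties, expr_count, queue_declares):
    """A language code is high-priority iff no variety of its language
    has 20 or more expressions in PanLex and no approver in the queue
    declares any variety of that language.

    varieties: iterable of the language's variety UIDs
    expr_count: callable returning a variety's expression count
    queue_declares: callable returning whether a queued approver
    declares the variety
    """
    return all(expr_count(v) < 20 and not queue_declares(v) for v in varieties)
```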

We implemented this rule by creating a table of high-priority language codes and language-variety UIDs with these queries:

create temporary table biglcvc as select lc, vc from lv, (select lv, count (ex) as exs from ex group by lv) as tbl where exs > 19 and lv.lv = tbl.lv union select lc, vc from aped, lv, av where im and av.ap = aped.ap and lv.lv = av.lv order by lc, vc;

create table interim.prilcode as select lc::text as lcode from lc where tp = 'i' except select lc from biglcvc;

insert into interim.prilcode select lcvc (lv) from (select lv from lv except select lv from lv, biglcvc where lv.lc = biglcvc.lc and lv.vc = biglcvc.vc) as tbl;

insert into interim.prilcode select lcvc from edlangs except select lcvc (lv) from lv;

These queries produced table interim.prilcode, with 7,439 rows (4,571 language codes and 2,868 language-variety UIDs), whose definition was:

                                 Table "interim.prilcode"
 Column | Type | Modifiers | Storage  |                    Description                    
--------+------+-----------+----------+---------------------------------------------------
 lcode  | text | not null  | extended | ISO 639-3 individual language code or variety UID
Indexes:
    "prilcode_pkey" PRIMARY KEY, btree (lcode) CLUSTER

Language-name prioritization

We used the interim.prilcode table to prioritize the language names that we had mapped to codes in the interim.ulangsq table. For this purpose, we classified each type-name pair as either high- or low-priority. A type-name pair was classified as high-priority if and only if every mapping of the name was to a code in the interim.prilcode table. Thus, if a name was mapped to a high-priority code but also to a low-priority code, we assumed that the variety represented by the name was low-priority. The queries that performed this classification were:

create temporary table killname as select distinct nm from interim.ulangsq, (select lcode from interim.ulangsq except select lcode from interim.prilcode) as tbl where ulangsq.lcode = tbl.lcode;

create table interim.priname as select tp, nm from interim.ulangsq, interim.prilcode where ulangsq.lcode = prilcode.lcode except select tp, ulangsq.nm from interim.ulangsq, killname where ulangsq.nm = killname.nm;

The first query produced a temporary table of 1,222 language names mapped to low-priority codes (even if also mapped to high-priority ones). The second query produced a table with 580 rows, each containing a unique high-priority type-name combination. (This is 25% of the 2,342 unique type-name combinations in the interim.ulangs table.) The table definition was:

                  Table "interim.priname"
 Column |     Type     | Modifiers | Storage  | Description 
--------+--------------+-----------+----------+-------------
 tp     | character(1) | not null  | extended | type
 nm     | text         | not null  | extended | name
Indexes:
    "priname_pkey" PRIMARY KEY, btree (tp, nm) CLUSTER
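The name-classification rule above (a name is high-priority iff every code it maps to is high-priority) amounts to a subset test. A Python sketch, with made-up names and codes:

```python
def high_priority_names(mappings, pri_codes):
    """Return the names all of whose mapped codes are high-priority.
    A single mapping to a low-priority code disqualifies a name, and
    names with no mappings are excluded.

    mappings: dict of name -> set of codes
    pri_codes: set of high-priority codes
    """
    return {name for name, codes in mappings.items()
            if codes and codes <= pri_codes}
```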

Subject prioritization

We classified a subject as high-priority if it had been classified as relevant and thus included in the goodsubs.txt and sublangs.txt files, and also satisfied either of these two criteria: it specified a high-priority language or language variety as its source or target language, or it was multilingual.

Between these two criteria, we assigned a higher value to the first, because a large fraction of the multilingual lexical resources document only high-density languages. Thus, we considered a subject “top-priority” if it specified a high-priority language or language variety, and merely “high-priority” if it was multilingual. (As stated above, no multilingual subject specified a source or target language.)

To perform this classification, we first imported the sublangs.txt file into the “interim” schema as a table “sublangs”. The import command did not use the default parameters, because sublangs.txt contained (many) backslashes, which would by default be treated as metacharacters. The command was:

copy sublangs from '/var/local/panlex/sublangs.txt' (format csv, delimiter '\t');

The table’s definition was:

                                      Table "interim.sublangs"
 Column |     Type     | Modifiers | Storage  |                     Description                     
--------+--------------+-----------+----------+-----------------------------------------------------
 ml     | character(1) | not null  | extended | whether the subject is multilingual
 src    | text         |           | extended | source language, or “-” or blank if unknown or none
 tgt    | text         |           | extended | target language, or “-” or blank if unknown or none
 ct     | integer      | not null  | plain    | occurrence count
 sj     | text         | not null  | extended | subject
Indexes:
    "sublangs_pkey" PRIMARY KEY, btree (sj)

We extracted from this table a new table, interim.prisub, whose subjects were top- or high-priority, with this query:

create table interim.prisub as select true as top, sj from sublangs, priname where src = '<' || tp || '>' || nm or tgt = '<' || tp || '>' || nm union select false as top, sj from sublangs where ml = 't';

The table’s definition was:

                            Table "interim.prisub"
 Column |  Type   | Modifiers | Storage  |             Description             
--------+---------+-----------+----------+-------------------------------------
 top    | boolean | not null  | plain    | whether the subject is top-priority
 sj     | text    | not null  | extended | subject
Indexes:
    "prisub_pkey" PRIMARY KEY, btree (sj) CLUSTER

There were 2,620 rows in the table, of which 988 were top-priority and 1,632 were high-priority. These had been drawn from the 9 thousand subjects in interim.sublangs.

Book prioritization

We identified top- and high-priority books by extracting from the OL Dump file the “edition” records that were catalogued as having at least one top- or high-priority subject. To do this, we first created a file prisub.txt from the interim.prisub table. The command for this operation did not use the default parameters, because by default the (many) backslashes in the table would be exported as double backslashes. The command was:

copy prisub to '/var/local/panlex/prisub.txt' (format csv, delimiter '\t');

We identified the top- and high-priority records with prigrep.pl. It identified each subject of each publication, normalized the subject, and determined whether that subject was one of the top- or high-priority subjects in prisub.txt. If at least one subject was top-priority, it classified the publication as top-priority. If not, but at least one subject was high-priority, it classified the publication as high-priority. It output each top-priority record into booktop.txt, containing 1,172 records, and each high-priority record into bookhigh.txt, containing 6,854 records.
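The per-publication decision procedure of prigrep.pl can be summarized as follows (a Python sketch; the actual script also normalizes each subject before lookup):

```python
def classify_publication(subjects, top_subjects, high_subjects):
    """Return "top" if any subject is top-priority, else "high" if
    any subject is high-priority, else None (record not extracted)."""
    if any(s in top_subjects for s in subjects):
        return "top"
    if any(s in high_subjects for s in subjects):
        return "high"
    return None
```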

We reformatted the top-priority list for easier readability with cleantop.pl, which produced the file cleantop.txt.
