PanLex: Archive types

The major PanLex archive types are:

Approver files
Tools
Dumps

Approver files

Approver files are files associated with particular approvers. A digital dictionary’s HTML or PDF file is a typical example of an approver file that project editors acquire and then consult while adding content to the database. In addition to the files that contain the approver’s source’s original data, there are often other files that document the source, which editors use when recording the catalog data about the approver’s source. Finally, editors create other files while consulting sources: files that extract, validate, and modify data in the original files. These all constitute “approver files”.

The approver files are organized in two directories. One directory, “queued”, contains files of sources awaiting editorial consultation. The other, “used”, contains files of approvers that have already contributed data to PanLex. The latter typically include editorial files as well as original source files.

Tools

Tools are programs and data that PanLex editors use repeatedly when consulting approver sources.

The basic tools are located in the “tools” directory. Within this directory, the tools are organized in two subdirectories: “tabularize” and “serialize”. As described by Davis 02012, editors usually perform two operations on approver source files: First editors tabularize them, and then editors serialize the tabularized versions of them. The most commonly used tools for automating these operations are located in the respective subdirectories.

Dumps

Introduction

Project personnel can create archives that contain all or much of the content of the database, for delivery to researchers and other purposes. These files can be called dumps.

Standard dumps

PostgreSQL dumps can be created (by the database owner or a cluster superuser) with the “pg_dump” command.

Minimal-size dumps

One set of arguments to the “pg_dump” command produces a dump of minimal size by omitting derivative and secondary tables and omitting non-table objects (e.g., functions). The command with these arguments is:

pg_dump -t ap -t apli -t av -t cp -t cu -t df -t dm -t dn -t ex -t i1 -t lc -t lv -t md -t mi -t mn -t us -t wc -t wcex -OU postgres plx | bzip2 > plx-main.sql.bz2

XML dumps

A pair of routines produces an XML version of the PanLex data. The first is a function in the “util” schema that repopulates a table containing the XML dump. The table has this definition:

                               Table "interim.xmldump"
 Column |  Type   | Modifiers | Storage  |                Description                
--------+---------+-----------+----------+-------------------------------------------
 mn     | integer |           | plain    | meaning ID
 line   | text    |           | extended | XML representation of the meaning’s facts

The function, “xmldump ()”, has this definition:

List of functions
Schema	Name	Result data type	Argument data types	Type	Volatility	Owner	Language	Source code	Description
util	xmldump	void		normal	volatile	pool	plpgsql	--create or replace function util.xmldump () --returns void language plpgsql as declare mnrow record; begin -- Empty the output table. truncate table interim.xmldump; -- For each meaning (i.e. entry): for mnrow in select mn from mn order by mn loop ---- Output its XML representation. insert into interim.xmldump select mnrow.mn, util.mnxml (mnrow.mn); end loop; return; end;	Act: Output an XML representation of the facts of all meanings to table interim.xmldump

The “xmldump” function calls the “util.mnxml (integer)” function, with this definition:

List of functions
Schema	Name	Result data type	Argument data types	Type	Volatility	Owner	Language	Source code	Description
util	mnxml	text	integer	normal	volatile	pool	plpgsql	--create or replace function util.mnxml (integer) --returns text language plpgsql as declare item text; qrow record; sqrow record; string text; begin -- Initialize the result with an -- “en” opening tag. string := ' <en>\n'; -- Append an “mn” element. string := string \|\| ' <mn>' \|\| $1 \|\| '</mn>\n'; -- Identify the meaning’s approver’s label. select tt from mn, ap into item where mn = $1 and ap.ap = mn.ap; -- Append an “ap” element with it. string := string \|\| ' <ap>' \|\| item \|\| '</ap>\n'; -- Identify the meaning’s meaning ID, if any. select tt from mi into item where mn = $1; -- If it exists: if item is not null then ---- Append an “mi” element with it. string := string \|\| ' <mi>' \|\| item \|\| '</mi>\n'; end if; -- For each definition of the meaning: for qrow in select lc, vc, df.tt from df, lv where mn = $1 and lv.lv = df.lv order by lc, vc, df.tt loop ---- Append its XML representation. string := string \|\| ' <df>\n' \|\| ' <lc>' \|\| qrow.lc \|\| '</lc>\n' \|\| ' <vc>' \|\| qrow.vc \|\| '</vc>\n' \|\| ' <tt>' \|\| qrow.tt \|\| '</tt>\n' \|\| ' </df>\n'; end loop; -- For each domain of the meaning: for qrow in select lc, vc, ex.tt from dm, ex, lv where mn = $1 and ex.ex = dm.ex and lv.lv = ex.lv order by lc, vc, ex.tt loop ---- Append its XML representation. string := string \|\| ' <dm>\n' \|\| ' <lc>' \|\| qrow.lc \|\| '</lc>\n' \|\| ' <vc>' \|\| qrow.vc \|\| '</vc>\n' \|\| ' <tt>' \|\| qrow.tt \|\| '</tt>\n' \|\| ' </dm>\n'; end loop; -- For each denotation of the meaning: for qrow in select dn, lc, vc, ex.tt from dn, ex, lv where mn = $1 and ex.ex = dn.ex and lv.lv = ex.lv order by lc, vc, ex.tt loop ---- Append the start of its XML representation. string := string \|\| ' <dn>\n' \|\| ' <lc>' \|\| qrow.lc \|\| '</lc>\n' \|\| ' <vc>' \|\| qrow.vc \|\| '</vc>\n' \|\| ' <tt>' \|\| qrow.tt \|\| '</tt>\n'; ---- For each wc of it: for sqrow in select tt from wc, wcex where dn = qrow.dn and wcex.ex = wc.ex order by tt loop ------ Append its XML representation. string := string \|\| ' <wc>' \|\| sqrow.tt \|\| '</wc>\n'; end loop; ---- For each md of it: for sqrow in select vb, vl from md where dn = qrow.dn order by vb, vl loop ------ Append its XML representation. string := string \|\| ' <md>\n' \|\| ' <vb>' \|\| sqrow.vb \|\| '</vb>\n' \|\| ' <vl>' \|\| sqrow.vl \|\| '</vl>\n' \|\| ' </md>\n'; end loop; ---- Append the end of the XML representation ---- of the denotation. string := string \|\| ' </dn>\n'; end loop; -- Append an “en” closing tag. string := string \|\| ' </en>\n'; return string; end;	In: mn of a meaning. Out: XML representation of the meaning’s facts.

Once the “interim.xmldump” table has been repopulated, it is converted to an XML file with the script “xmldumpcopy.pl”, which has this definition:

#!/usr/bin/perl -w

# xmldumpcopy.pl
# Copy interim.xmldump’s “line” column to file panlex-mn.xml in the current directory,
# with modifications to make the contents of all elements PCDATA-compliant.

use warnings 'FATAL', 'all';
# Make every warning fatal.

use strict;
# Require strict checking of variable references, etc.

use utf8;
# Make Perl interpret the script as UTF-8.

use DBI;
# Import the general database-interface module. It imports DBD::Pg for PostgreSQL automatically.

my $dbh = DBI->connect(
	"dbi:Pg:dbname=plx;host=uf.utilika.org;port=5432", '', '',
	{ (AutoCommit => 0), (pg_enable_utf8 => 1) }
);
# Specify & connect to the PostgreSQL 9.0.1 database “plx”, with AutoCommit off
# and the UTF-8 flag on (without which strings read from the database and split into
# characters are split into bytes rather than Unicode character values. DBI automatically
# issues a begin_work statement, requiring an explicit commit statement before the
# disconnection to avoid an automatic rollback. Username and password will be obtained
# from local (client) environment variables PGUSER and PGPASSWORD, respectively.

my $sth = $dbh->prepare ('select line from interim.xmldump order by mn');
# Prepare a statement to read entries from the XML dump table.

$sth->execute ();
# Start the query.

my ($entry, @row);

open XMLOUT, '>:encoding(utf8)', 'panlex-mn.xml';
# Create or truncate the output file and open it for writing.

print XMLOUT '<?xml version="1.0" encoding="utf-8"?>', "\n", "<panlex>\n";
# Output the file header.

while (@row = $sth->fetchrow_array) {
# As long as the next row exists, read it and:

	print XMLOUT (&Xml ($row[0]));
	# Output its XML-safe conversion.

}

print XMLOUT "</panlex>\n";
# Output the file footer.

$dbh->commit;
# Commit the automatically begun database transaction.

$dbh->disconnect;
# Disconnect from the database.

close XMLOUT;
# Close the output file.

`bzip2 -fk panlex-mn.xml`;
# Create or replace a compressed version of the output file.

warn "interim.xmlout unchanged; truncate if panlex-mn.xml OK.\n";
# Report result to the user.

#### Xml
# Convert the specified string to an XML-tag-safe string, i.e. a string that may be sent to an
# XML file as parsed character data (PCDATA).
# Arguments:
#	0: string.

sub Xml {

	my $ret = $_[0];
	# Identify the string.

	if ($ret =~ m%^ +<([a-z]+)>.*[&<].*</\1>$%m) {
	# If it contains any complete element with nonblank content containing an
	# XML metacharacter:

		$ret =~ s/&/&amp;/g;
		# Replace all instances of “&” in it with "&amp;". (“&” occurs only in
		# element contents.)

		while ($ret =~ s%^ +<([a-z]+)>.*\K<(?=.*</\1>$)%&lt;%m) {}
		# Replace all instances of "<" in the contents of all its elements
		# with "&lt;".

	}

	return $ret;
	# Return the XML-safe string.

}

The “en” elements of the resulting XML file have a format exemplified by this entry:

  <en>
   <mn>10178755</mn>
   <ap>fra-eng:Deschamp2000</ap>
   <mi>accord</mi>
   <df>
    <lc>fra</lc>
    <vc>0</vc>
    <tt>Réglage destiné à produire ou à capter des signaux électriques ou acoustiques de fréquences déterminées</tt>
   </df>
   <dm>
    <lc>fra</lc>
    <vc>0</vc>
    <tt>audiovisuel</tt>
   </dm>
   <dn>
    <lc>eng</lc>
    <vc>0</vc>
    <tt>tuning</tt>
   </dn>
   <dn>
    <lc>fra</lc>
    <vc>0</vc>
    <tt>accord</tt>
    <wc>noun</wc>
    <md>
     <vb>gram</vb>
     <vl>n.m.</vl>
    </md>
   </dn>
  </en>

Expression-only dumps

To be completed.