PanLex: Database procedures

Jonathan Pool
Revision date: 1 September 02013

Contents

  1. Introduction
  2. Source-file conversion

Introduction

The PanLex database contains about 120 stored procedures. They are mainly self-documenting. Each procedure has a comment describing its arguments, outputs, and other actions. In addition, some procedures contain interlinear comments explaining the component statements. Additional documentation of some procedures appears below.

Source-file conversion

Conversion from source to file

The procedure apsf (integer) converts a source to a final source file. The procedure analyzes the source to determine the most compact format that its file can have and returns the lines of a file in that format, each line being returned as a single text value.

The procedure aptf (integer) converts a source to a tagged source file. The procedure returns the lines of the file, each line being returned as a single text value.

Conversion from file to source

The procedure sffad (integer, integer) ingests source data contained in a full-text-format final source file. The procedure validates the file while ingesting its data. If any error is encountered, the ingestion is entirely canceled and facts about the error are returned. The procedure requires the file content to have been inserted into a table, visible to the procedure, named sffad, with a format described below.

The procedure sfsad (integer, integer) ingests source data contained in a simple-text-format final source file. The procedure validates the file while ingesting its data. If any error is encountered, the ingestion is entirely canceled and facts about the error are returned. The procedure requires the file content to have been inserted into a table, visible to the procedure, named sfsad, with a format described below.

The procedure sffck (integer, integer) validates source data contained in a full-text-format final source file. If any error is encountered, the validation ends and facts about the error are returned. The procedure requires the file content to have been inserted into a table, visible to the procedure, named sffck, with a format described below.

The procedure sfsck (integer, integer) validates source data contained in a simple-text-format final source file. If any error is encountered, the validation ends and facts about the error are returned. The procedure requires the file content to have been inserted into a table, visible to the procedure, named sfsck, with a format described below.

The above procedures require the file content to have been inserted into a table. The table must have 2 columns: seq (type integer) and tt (type text). Column seq must contain the (1-origin) index of the line. Column tt must contain the text of the line. No requirement is imposed on the order of the rows of the table.
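The table format can be illustrated with a minimal Python sketch that turns the content of a file into (seq, tt) rows ready for insertion. The function name file_to_rows is hypothetical and is not part of PanLex. The sketch assumes that the empty lines separating the header and the entries are themselves rows of the table, and that the empty string left after the final line break is not a line.

```python
import re

# Line breaks permitted in final source files: CRLF, LF, CR, or U+2028 (LS).
# CRLF is listed first so that it is not split into two separate breaks.
LINE_BREAK = re.compile("\r\n|\n|\r|\u2028")


def file_to_rows(content):
    """Return (seq, tt) tuples: seq is the 1-origin line index, tt the line text."""
    lines = LINE_BREAK.split(content)
    # A file whose last line ends with a line break yields one trailing empty
    # string after splitting; drop it so that it is not counted as a line.
    if lines and lines[-1] == "":
        lines.pop()
    return [(seq, tt) for seq, tt in enumerate(lines, start=1)]


rows = file_to_rows(":\n0\n\nex\neng-000\nluminaire\n")
for seq, tt in rows:
    print(seq, repr(tt))
```

Because no requirement is imposed on the order of the rows, the resulting tuples could be inserted in any order, as long as each seq value stays paired with its own line.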

Tagged source files

As stated above, the procedure aptf (integer) creates a tagged source file. No procedure ingests or validates data in a tagged source file, but it can still be useful. For some purposes, it is easier to edit a tagged source file than to edit a final source file. Once a tagged source file has been edited, it can be converted to a final source file with a tool external to the database (reserialize.pl), and that final source file’s data can then be validated and ingested. Alternatively, with or without editing, a tagged source file can be converted to an untagged tab-delimited file with another external tool (retabularize.pl), and that file can be operated on with other external tools and then converted to a final source file.

A tagged source file contains one line per meaning. Each line contains a variable-size tab-delimited set of columns. Column 0 is always the meaning ID. The subsequent columns contain the other data of the meaning. Each such subsequent column begins with a tag identifying the datum type and ends with the datum. The tag formats, shown by means of examples, are:

The wc and md columns pertaining to any ex column must follow that ex column without any other intervening columns. For any metadatum, the md:vb and md:vl columns must be adjacent and must appear in that order.

The value of any word-classification datum (in this format and those below) must be selected from this list:

noun
verb
adjv
advb
name
pron
vpar
auxv
detr
prep
post
conj
ijec
affx
misc
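
A tool that prepares source files outside the database could check word-classification values against this list before submitting a file. The following Python sketch shows one way to do so; the names WORD_CLASSES and is_valid_word_class are hypothetical, and the check is assumed to be case-sensitive.

```python
# The 15 permitted word-classification values, copied from the list above.
WORD_CLASSES = frozenset({
    "noun", "verb", "adjv", "advb", "name",
    "pron", "vpar", "auxv", "detr", "prep",
    "post", "conj", "ijec", "affx", "misc",
})


def is_valid_word_class(value):
    """Return True if value is one of the permitted word classifications."""
    return value in WORD_CLASSES


print(is_valid_word_class("adjv"))
print(is_valid_word_class("adjective"))
```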

Final source files

All the above-described procedures except aptf (integer) either produce or consume final source files. Text-format final source files (XML-format final source files are not documented here) can have a family of 6 related formats. There are 2 levels of complexity: simple and full. There are 3 lingualities: varilingual, centrilingual, and bilingual. Each linguality is possible at each level of complexity, so there are 6 combinations.

Every final source file is composed of lines. Each line ends with a line break. Generally, it is a newline character (LF), but it can instead be a carriage return (CR), a carriage-return-linefeed pair (CRLF), or a Unicode line separator (LS). The line break must be uniform within any file.
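
The uniformity requirement can be checked before a file is submitted. The Python sketch below is only an illustration; the function names are hypothetical. It assumes that a carriage return immediately followed by a linefeed always counts as a single CRLF break.

```python
import re

# CRLF is listed first so that it is recognized as one break, not as CR + LF.
LINE_BREAK = re.compile("\r\n|\n|\r|\u2028")


def line_break_kinds(content):
    """Return the set of distinct line breaks that occur in content."""
    return set(LINE_BREAK.findall(content))


def has_uniform_line_breaks(content):
    """Return True if content uses at most one kind of line break."""
    return len(line_break_kinds(content)) <= 1


print(has_uniform_line_breaks(":\n0\n"))
print(has_uniform_line_breaks(":\r\n0\n"))
```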

Varilingual full-text

We can begin with the least compact format: the varilingual full-text format. All the others are variations on it.

A varilingual full-text final source file contains a header and entries, all delimited from one another with double line breaks.

The header begins with a line containing only a colon (:).

The next line contains only the number 0.

That is the end of the header.

Each entry contains meaning data and denotation data. Denotation data for multiple denotations may exist within a single entry. All the data for any denotation must be contiguous; they may not be interrupted by any meaning data or by data for any other denotation.

Meaning data of 3 kinds exist: meaning identifiers, domain specifications, and definitions. A datum of any of these kinds consists of 2 or more consecutive lines, whose formats are shown below by means of examples.

Meaning identifier:

mi
1234z

Domain specification:

dm
eng-000
electricity

Definition:

df
eng-000
lighting fixture or portable lighting apparatus

Denotation data of 3 kinds exist: expressions, word classifications, and metadata. A datum of any of these kinds consists of 2 or more consecutive lines, whose formats are shown below by means of examples.

Expression:

ex
eng-000
luminaire

Word classification:

wc
noun

Metadatum:

md
prag
tech

Centrilingual

A centrilingual file differs from a varilingual file as follows, but is otherwise identical.

The header’s number is 1, rather than 0.

The header contains a third line, consisting of the UID of a language variety, such as eng-000.

The first expression datum in each entry does not contain the UID of a language variety. That expression must be in the variety whose UID is in the header.

Bilingual

A bilingual file differs from a centrilingual file as follows, but is otherwise identical.

The header’s number is 2, rather than 1.

The header contains a fourth line, consisting of the UID of a language variety, such as fra-000.

No expression datum in any entry contains the UID of a language variety. The first expression in each entry must be in the variety whose UID is first in the header, and all other expressions must be in the variety whose UID is second in the header.
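
The differences among the three lingualities can be summarized in a short Python sketch that writes full-text files containing only expression data. All function names here are hypothetical, and the sketch is not the code that apsf uses. It assumes LF line breaks and takes each expression as a (UID, text) pair; the sample values are illustrative.

```python
def full_text_header(uids=()):
    """Header lines: a colon, the number of header UIDs (0, 1, or 2), the UIDs."""
    if len(uids) > 2:
        raise ValueError("a header holds at most 2 language-variety UIDs")
    return [":", str(len(uids)), *uids]


def expression_lines(expressions, uids=()):
    """Serialize one entry's expressions, omitting UIDs the header implies."""
    lines = []
    for index, (uid, text) in enumerate(expressions):
        if uids and index == 0:
            implied = uids[0]   # centrilingual or bilingual: first expression
        elif len(uids) == 2:
            implied = uids[1]   # bilingual: every other expression
        else:
            implied = None      # the UID must be written explicitly
        lines.append("ex")
        if implied is None:
            lines.append(uid)
        elif uid != implied:
            raise ValueError(f"expression {index} must be in {implied}")
        lines.append(text)
    return lines


def full_text_file(entries, uids=()):
    """Join the header and the entries with double line breaks."""
    blocks = [full_text_header(uids)]
    blocks += [expression_lines(entry, uids) for entry in entries]
    return "\n\n".join("\n".join(block) for block in blocks) + "\n"


entry = [("eng-000", "luminaire"), ("fra-000", "luminaire")]
print(full_text_file([entry]))                          # varilingual
print(full_text_file([entry], ("eng-000",)))            # centrilingual
print(full_text_file([entry], ("eng-000", "fra-000")))  # bilingual
```

The same entry occupies 6 lines in the varilingual file, 5 in the centrilingual file, and 4 in the bilingual file, which shows why the more specialized formats are more compact.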

Simple-text

A simple-text file differs from a full-text file as follows, but is otherwise identical.

The first line of the header contains a period (.), rather than a colon.

No meaning data may appear in the file.

No denotation data except expression data may appear in the file.

An expression datum does not include a line containing ex.
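
Applying these rules, a simple-text file can be produced as in the following Python sketch. The function name simple_text_file is hypothetical. The sketch assumes LF line breaks and assumes that the linguality rules for omitting UIDs apply to simple-text files exactly as they do to full-text files; the sample values are illustrative.

```python
def simple_text_file(entries, uids=()):
    """Build a simple-text file from entries given as lists of (UID, text) pairs."""
    if len(uids) > 2:
        raise ValueError("a header holds at most 2 language-variety UIDs")
    # The header starts with a period instead of a colon.
    blocks = [[".", str(len(uids)), *uids]]
    for expressions in entries:
        lines = []
        for index, (uid, text) in enumerate(expressions):
            # The UID is implied by the header for every expression in a
            # bilingual file and for the first expression in a centrilingual one.
            implied = len(uids) == 2 or (len(uids) == 1 and index == 0)
            if not implied:
                lines.append(uid)
            # No "ex" line is written: expressions are the only permitted data.
            lines.append(text)
        blocks.append(lines)
    return "\n\n".join("\n".join(block) for block in blocks) + "\n"


entry = [("eng-000", "luminaire"), ("fra-000", "luminaire")]
print(simple_text_file([entry], ("eng-000", "fra-000")))
```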

Minimum content

The above-described procedures that validate and ingest data from final source files treat a file as erroneous if any entry does not contain at least one definition or expression.
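
A comparable check can be made on a full-text entry before the file is submitted. The Python sketch below is an illustration only and does not reproduce the logic of the stored procedures; the names are hypothetical. It assumes that each datum has exactly the number of lines shown in the examples above, and it walks the entry datum by datum so that a value that happens to be the text "ex" or "df" is not mistaken for a tag.

```python
# Lines per datum, inferred from the examples above (an assumption).
FIXED_LENGTHS = {"mi": 2, "dm": 3, "df": 3, "wc": 2, "md": 3}


def entry_has_minimum_content(lines, uid_count=0):
    """Return True if a full-text entry has at least one definition or expression.

    uid_count is the header's number: 0 (varilingual), 1 (centrilingual),
    or 2 (bilingual). It determines which expressions omit their UID line.
    """
    position = 0
    expressions_seen = 0
    found = False
    while position < len(lines):
        tag = lines[position]
        if tag == "ex":
            implied = uid_count == 2 or (uid_count == 1 and expressions_seen == 0)
            position += 2 if implied else 3
            expressions_seen += 1
            found = True
        elif tag in FIXED_LENGTHS:
            position += FIXED_LENGTHS[tag]
            found = found or tag == "df"
        else:
            raise ValueError(f"unrecognized tag: {tag!r}")
        if position > len(lines):
            raise ValueError(f"truncated {tag} datum")
    return found


print(entry_has_minimum_content(["mi", "1234z", "ex", "eng-000", "luminaire"]))
print(entry_has_minimum_content(["mi", "1234z"]))
```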
