PanLex: Database design

Jonathan Pool
Revision date: 11 July 02012

Contents

  1. Introduction
  2. Constraints
  3. Users
  4. Sources
  5. Language Varieties
  6. Expressions
  7. Meanings
  8. Denotations
  9. Meaning Identifiers
  10. Domain Descriptors
  11. Definitions
  12. Word Classifications
  13. Metadata
  14. Source Varieties
  15. Exemplar Characters
  16. Approved Characters
  17. Source Editors
  18. Language Variety Editors
  19. Design and Implementation Issues

Introduction

This narrative description of the design of the PanLex database complements a set of slides on the same subject.

Constraints

Users

Sources

Language Varieties

Expressions

An expression is an object that lexically expresses a meaning in a language variety. Each expression has four properties: an ID, the ID of its language variety, a text, and a degraded text. These properties are the values of the fields in the ex table.

The “expression” concept differs from some similar concepts:

When there is no lexicalization of a meaning in a language variety, there is no expression in that variety with that meaning. There may still be an explanation of the meaning, but that is classified as a definition, rather than an expression. This classification is a matter of judgment, made by editors. Definitions typically consist of four or more words and express meanings compositionally.

A trigger function (td ()) automatically derives the degraded text of an expression from the text whenever a new expression is created or an existing expression is modified. The degraded text could be omitted from the expression record and computed whenever needed, but it is precomputed and stored for efficiency. Users can search for all expressions with particular degraded texts or parts thereof. It would be impractically inefficient to compute all degraded texts for each such search.

The algorithm for the derivation of the degraded text is implemented in the stored function td (text). It first subjects the text to NFKD normalization. If the text contains no characters in Indic scripts, the algorithm converts all upper-case characters to lower-case characters, converts “ı” (dotless i) to “i”, and deletes all characters except those having the Unicode character properties Ll (lower-case letter), Lo (other letter), and Nd (decimal number). If the text contains any characters in Indic scripts, the algorithm subjects the text to a more complex transformation, which varies from one Indic script to another. The portions of the algorithm for Indic-script characters were developed by Yadav Gowda in 02013.
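The non-Indic branch of the derivation described above can be approximated in a few lines of Python. This is only a sketch under stated assumptions, not the stored function td (text) itself: the name degrade_text is hypothetical, Python's built-in lower-casing stands in for whatever case conversion the stored function performs, and the Indic-script branch is not attempted, so texts containing Indic-script characters would need separate handling.

```python
import unicodedata

# Unicode general categories that survive degradation, per the description:
# Ll (lower-case letter), Lo (other letter), Nd (decimal number).
KEPT_CATEGORIES = {"Ll", "Lo", "Nd"}


def degrade_text(text: str) -> str:
    """Approximate the non-Indic degradation steps (hypothetical helper)."""
    # Step 1: NFKD normalization, which splits accented letters into a base
    # letter followed by combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Step 2: convert upper-case characters to lower-case.
    lowered = decomposed.lower()
    # Step 3: convert dotless i (U+0131) to "i".
    undotted = lowered.replace("\u0131", "i")
    # Step 4: delete every character whose category is not Ll, Lo, or Nd.
    # This removes spaces, hyphens, punctuation, and combining marks.
    return "".join(
        ch for ch in undotted if unicodedata.category(ch) in KEPT_CATEGORIES
    )
```

Under these assumptions, “ice cream” and “Ice-Cream” both degrade to “icecream”, and a text with diacritical marks degrades to the same string as its unmarked counterpart.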

The motivation for making degraded texts available is that sources of lexical data and users specifying the texts of expressions are not always exact or consistent in their specification of the texts. Texts differing from one another in many ways (such as texts with and without hyphens, texts written as one word and as two words, texts differing only in letter case, or texts with and without diacritical marks) can be perceptually the same. With degraded texts, a search can retrieve all expressions whose degraded texts are the same as the degraded text that is specified in the search query. However, the PanLex algorithm for text degradation fails to capture some similarities (such as “color” versus “colour”, or “pant” versus “pants”). Thus, it is only one of the possible algorithms that could be used to make text matching agree better with users' intuitions.
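To illustrate how precomputed degraded texts support such searches, the following sketch builds an in-memory index keyed by degraded text and looks up a query by degrading it in the same way. Everything here is an assumption made for illustration: the names degrade, build_index, and search are hypothetical, the degradation covers only the non-Indic case, and a Python dictionary stands in for the stored degraded-text field of the ex table.

```python
import unicodedata
from collections import defaultdict


def degrade(text):
    """Simplified non-Indic degradation (hypothetical stand-in for td)."""
    lowered = unicodedata.normalize("NFKD", text).lower().replace("\u0131", "i")
    return "".join(
        ch for ch in lowered if unicodedata.category(ch) in {"Ll", "Lo", "Nd"}
    )


def build_index(texts):
    """Precompute each degraded text once and group texts that share it."""
    index = defaultdict(list)
    for text in texts:
        index[degrade(text)].append(text)
    return index


def search(index, query):
    """Return every stored text whose degraded text equals the query's."""
    return index.get(degrade(query), [])


expressions = [
    "ice cream", "ice-cream", "Icecream",
    "r\u00e9sum\u00e9", "resume",
    "color", "colour",
]
index = build_index(expressions)

# Hyphenation, spacing, letter case, and diacritics no longer matter.
print(search(index, "ICE CREAM"))
print(search(index, "Resume"))
# Spelling variants are not unified: "color" does not retrieve "colour".
print(search(index, "color"))
```

Because the degraded text is computed once per stored expression, each search degrades only the query, which is the efficiency argument made above for storing the degraded text.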

Meanings

Denotations

Meaning Identifiers

Domain Descriptors

A meaning may optionally have domain descriptors. They are expressions. Attaching a domain descriptor to a meaning asserts that the meaning is within the domain described by the expression.

In earlier versions of PanLex, domain descriptors were structured identically to definitions. Specifically, each was an arbitrary string specified as being in a language variety. Experience showed that the texts of domain descriptors were usually identical to the texts of expressions. The design was modified to require domain descriptors to be expressions. Now, in the table of domain descriptors (“dm”), the column identifying the descriptor has as its value the ID of the expression serving as the descriptor.
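The effect of this requirement can be sketched with a foreign-key constraint. The example below is a minimal illustration, not the PanLex schema: only the table names ex and dm come from this document, while the column names, the helper add_domain_descriptor, and the use of an in-memory SQLite database are assumptions made so that the example is self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    -- Expressions: ID, language variety ID, text, degraded text.
    CREATE TABLE ex (
        id            INTEGER PRIMARY KEY,
        variety_id    INTEGER NOT NULL,
        text          TEXT NOT NULL,
        degraded_text TEXT NOT NULL
    );
    -- Domain descriptors: the descriptor must be an existing expression.
    CREATE TABLE dm (
        id            INTEGER PRIMARY KEY,
        meaning_id    INTEGER NOT NULL,
        expression_id INTEGER NOT NULL REFERENCES ex (id)
    );
    """
)


def add_domain_descriptor(conn, meaning_id, expression_id):
    """Attach an expression to a meaning as its domain descriptor.

    Returns False when the expression ID does not exist in ex.
    """
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO dm (meaning_id, expression_id) VALUES (?, ?)",
                (meaning_id, expression_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False


with conn:
    conn.execute(
        "INSERT INTO ex (id, variety_id, text, degraded_text) "
        "VALUES (1, 10, 'mammal', 'mammal')"
    )

print(add_domain_descriptor(conn, 500, 1))    # existing expression
print(add_domain_descriptor(conn, 500, 999))  # no such expression
```

With the descriptor stored as an expression ID rather than as free text, a descriptor that does not correspond to any expression is rejected at insertion.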

This constraint on domains is advocated by Gerard de Melo and Gerhard Weikum in their 02010 article, “Towards Universal Multilingual Knowledge Bases”. They say that “some knowledge bases rely on a separate vocabulary of domain labels …. We instead advocate following WordNet in using identifiers already present in the knowledge base …. This has the advantage of extensive information about the domains being readily available ….”

Because domain descriptors are expressions, I consider it a good practice to select lemmatic forms as domain descriptors, rather than adding expressions to PanLex for the purpose of having them act as domain descriptors. For example, when consulting a resource that uses “mammals” as an English domain descriptor, I recommend using “mammal” instead.

Definitions

Word Classifications

Metadata

Source Varieties

Exemplar Characters

Approved Characters

Source Editors

Language Variety Editors

Design and Implementation Issues