PanLex: Archive storage

PanLex archives

The PanLex project manages archives in addition to the PanLex database. These include:

Local storage

Editors and other project personnel store working files on local workstations for rapid access. Redundancy is provided by automatic hourly duplication of new and changed files onto external storage drives and by occasional rotation of those drives between home offices and the office of the project’s sponsor.

Remote storage

Need

Some files are shared by multiple users. These are stored not only on local workstations but also on a shared remote server for access by all who need them.

Host configuration

The remote storage system has been procured from Amazon Web Services. It is configured as a t1.micro instance of the Elastic Compute Cloud (EC2) service. After initially choosing the Amazon Linux platform and the machine image “Amazon Linux AMI x86_64 EBS”, we found that the current Ubuntu Linux (64-bit) AMI had more suitable functionality and reconfigured the system to be based on it. The instance has two attached Elastic Block Store (EBS) devices: an 8-gigabyte volume attached as /dev/sda1 and mounted as “/” (the root device), and a 50-gigabyte volume attached as /dev/sdf and mounted as “/opt/www/main”. These “sd…” devices appear as “xvd…” in the /etc/mtab file. The instance runs in the AWS US West (Oregon) region (us-west-2).

As of September 02012, the root device used only about 1 gigabyte of its 8-gigabyte capacity, and the /dev/sdf device used about 42 gigabytes of its 50-gigabyte capacity.
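Usage figures like these can be read from the server with “df”. The sketch below is one possible way to do so, not the project’s documented procedure; the function name “usage_percent” is hypothetical. It relies on the POSIX output format of “df -P”, in which the fifth column of the second line is the percentage of capacity in use. For reference, 42 of 50 gigabytes is 84 percent.

```shell
# usage_percent MOUNT_POINT: print the percentage of capacity in use
# on the file system that holds MOUNT_POINT (a value such as "84%").
usage_percent() {
    # -P forces the one-line-per-file-system POSIX format,
    # so field 5 of line 2 is the "Capacity" column.
    df -P "$1" | awk 'NR == 2 { print $5 }'
}

# Root device of whatever machine this runs on.
usage_percent /
# On the remote server one would also check the data volume:
#   usage_percent /opt/www/main
```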

The project initially established a remote storage system based on Amazon Web Services’ S3 service. That system’s limitations and unreliability rendered it useless.

Directory structure

The storage directories are subdirectories of a “panlex” directory. The ones referred to in this document are “dumps”, “sources”, and “tools”.

Access

Console

Administration of the remote storage system is possible via the EC2 Management Console.

HTTP

The remote storage system is also accessible to web browsers. It has a static IP address, 50.112.103.83, which can be reached via the domain “panlex.net”.

Access to the remote storage system is regulated with HTTP basic authentication. This minimally secure regulation is intended to prevent massive automated downloads, which could impose substantial costs on the project. Project personnel have access to a username and password that entitles them to access all the subdirectories. Users who are invited to download dump files are given access to a username and password that entitles them to access only the “dumps” subdirectory.
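A command-line download under this scheme might look like the sketch below. It is an illustration only: the function name “fetch_dump”, the environment variables CURL and PANLEX_USER, the “http” scheme, and the URL path “/panlex/dumps/” are all assumptions (the path is inferred from the directory names in this document, not confirmed). When “--user” is given a username without a password, curl prompts for the password.

```shell
# fetch_dump FILENAME: download one dump file with HTTP basic authentication.
# ASSUMPTIONS: the URL path /panlex/dumps/ and the default user name "guest".
fetch_dump() {
    # --fail makes curl exit non-zero on HTTP errors such as 401;
    # --remote-name saves the file under its remote name.
    # CURL can be overridden (for example with "echo") to inspect the command.
    "${CURL:-curl}" --fail --user "${PANLEX_USER:-guest}" --remote-name \
        "http://panlex.net/panlex/dumps/$1"
}

# Usage (the file name is hypothetical):
#   fetch_dump example-dump.tar.gz
```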

To configure the remote storage system for this controlled HTTP access, we have added this code to the Apache web server’s configuration file /etc/httpd/conf/httpd.conf on the remote server:

# <CUSTOM>

# Dump files: open to members of the "staff&guest" group.
<Directory /opt/www/main/html/panlex/dumps>
    AuthType Basic
    AuthName "Guest Dump Files"
    AuthBasicProvider file
    AuthGroupFile /var/local/htgroups
    AuthUserFile /var/local/htpasswords
    Require group staff&guest
</Directory>

# Source files: open to the "staff" user only.
<Directory /opt/www/main/html/panlex/sources>
    AuthType Basic
    AuthName "Staff Source Files"
    AuthBasicProvider file
    AuthUserFile /var/local/htpasswords
    Require user staff
</Directory>

# Tool files: open to the "staff" user only.
<Directory /opt/www/main/html/panlex/tools>
    AuthType Basic
    AuthName "Staff Tool Files"
    AuthBasicProvider file
    AuthUserFile /var/local/htpasswords
    Require user staff
</Directory>

# </CUSTOM>

Accordingly, we have created the specified user and group authentication files in /var/local. The “staff&guest” group includes the “staff” and “guest” users.
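For illustration, files of this kind could be produced as sketched below. This is not a record of the commands the project actually ran. The group file is written to a scratch directory here (the variable “auth_dir” is hypothetical) so that the sketch can be run safely; the real file lives in /var/local. The format is Apache’s group-file format: a group name, a colon, and a space-separated list of member user names. The password file would be created with Apache’s “htpasswd” utility, shown only in comments because it prompts interactively.

```shell
# Build a sample group file in a scratch directory
# (on the server the file is /var/local/htgroups).
auth_dir="$(mktemp -d)"

# One group per line: "groupname: member member ..."
printf '%s: %s %s\n' 'staff&guest' 'staff' 'guest' > "$auth_dir/htgroups"
cat "$auth_dir/htgroups"

# The password file would be created with htpasswd, which prompts for each password:
#   htpasswd -c /var/local/htpasswords staff   # -c creates the file
#   htpasswd /var/local/htpasswords guest      # adds a user to the existing file
```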

SSH

Administration of the remote storage system is also possible via an SSH connection. In accord with the Amazon Linux AMI default, connection is permitted only with RSA key-based authentication; password authentication is disabled. In addition, as advised by AWS in its article “Tips for Securing Your EC2 Instance”, SSH connection is permitted only to particular users, including the default user (i.e. “ubuntu” in the case of Ubuntu Linux instances). The authorized users use “sudo” to perform actions reserved for the “root” user.

Key-based authentication requires the local client to have two files within the user’s own “.ssh” directory: a file named “config” and an RSA private key file. The “config” file identifies the various synonymous addresses of the EC2 host and names the private key file that is to be used for connections to that host. The “config” file contains these lines:

Host db.panlex.org 54.245.107.128 panlex.net 50.112.103.83 ec2-50-112-103-83.us-west-2.compute.amazonaws.com ec2-54-245-107-128.us-west-2.compute.amazonaws.com
IdentityFile ~/.ssh/privatekeyfile.name

The string “privatekeyfile.name” is replaced with whatever the private key file’s name is. By default, for the ubuntu user, that file’s name is “ec2keypair.pem”.

The corresponding public key must be in the “authorized_keys” file in the .ssh directory in the user’s home directory on the server.
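Installing a public key on the server can be sketched as follows. The function name “install_public_key” and its arguments are hypothetical, and this is an illustration rather than the project’s own procedure. It appends the key only if it is not already present and sets the restrictive permissions that OpenSSH normally expects on the “.ssh” directory and the “authorized_keys” file. The public key itself can be derived from a private key with “ssh-keygen -y -f”.

```shell
# install_public_key PUBKEY_FILE HOME_DIR
# Append the public key in PUBKEY_FILE to HOME_DIR/.ssh/authorized_keys
# unless an identical line is already there.
install_public_key() {
    pubkey_file="$1"
    home_dir="$2"
    mkdir -p "$home_dir/.ssh"
    chmod 700 "$home_dir/.ssh"
    touch "$home_dir/.ssh/authorized_keys"
    # -x: whole-line match, -F: fixed string, -q: quiet
    if ! grep -qxF -e "$(cat "$pubkey_file")" "$home_dir/.ssh/authorized_keys"; then
        cat "$pubkey_file" >> "$home_dir/.ssh/authorized_keys"
    fi
    chmod 600 "$home_dir/.ssh/authorized_keys"
}

# Usage on the server, after deriving the public key on the client with
#   ssh-keygen -y -f ~/.ssh/ec2keypair.pem > ec2keypair.pub
# and copying it over:
#   install_public_key ec2keypair.pub "$HOME"
```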

Synchronization

As of early 02013, almost all content development takes place within the PanLex office network, and only one workstation is regularly used for the storage of local files. On that basis, the remote directories “sources” and “tools” are configured to mirror the corresponding local directories. The synchronization is managed by the cron daemon on the local host, ego.utilika.org. Six times a day, the local script /Users/Shared/Library/panlex/org.panlex.syncd.txt is executed. The script’s code is:
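A cron schedule that runs a script six times a day could look like the entry in the sketch below. The actual times of day, and the use of /bin/sh to run the script, are not stated in this document and are assumptions; the variable name “cron_entry” is hypothetical. The hour field “*/4” selects every fourth hour (0, 4, 8, 12, 16, 20).

```shell
# Hypothetical crontab entry: minute 0 of every fourth hour, six runs a day.
# ASSUMPTIONS: the schedule and the /bin/sh interpreter.
cron_entry='0 */4 * * * /bin/sh /Users/Shared/Library/panlex/org.panlex.syncd.txt'
printf '%s\n' "$cron_entry"

# To add it to the current user's crontab, keeping existing entries
# (left commented out so the sketch changes nothing when run):
#   ( crontab -l 2>/dev/null; printf '%s\n' "$cron_entry" ) | crontab -
```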

# Synchronize the EC2 remote repositories of approver files and tools with the local
# repositories on Ego.

# used (files of approvers already used for addition of content to PanLex)

/usr/local/bin/rsync --recursive --update --perms --times --omit-dir-times --delete --delete-excluded --compress-level=9 --exclude=.* --8-bit-output --progress /Topics/panlex/panlex-dics/panlex-dics-data/used ubuntu@panlex.net:/opt/panlex/sources

# queued (files of approvers catalogued, but not yet used for addition of
# content to PanLex)

/usr/local/bin/rsync --recursive --update --perms --times --omit-dir-times --delete --delete-excluded --compress-level=9 --exclude=.* --8-bit-output --progress /Topics/panlex/panlex-dics/panlex-dics-data/panlex-dics-data-todo/queued ubuntu@panlex.net:/opt/panlex/sources

# tabularize (scripts that convert approver files to tab-delimited column files)

/usr/local/bin/rsync --recursive --update --perms --times --omit-dir-times --delete --delete-excluded --compress-level=9 --exclude=.* --8-bit-output --progress /Topics/panlex/panlex-dics/panlex-dics-tools/panlex-dics-tools-parsing/tabularize ubuntu@panlex.net:/opt/panlex/tools

# serialize (scripts that convert tab-delimited column files to PanLex uploadable simple-text and full-text files)

/usr/local/bin/rsync --recursive --update --perms --times --omit-dir-times --delete --delete-excluded --compress-level=9 --exclude=.* --8-bit-output --progress /Topics/panlex/panlex-dics/panlex-dics-tools/panlex-dics-tools-parsing/serialize ubuntu@panlex.net:/opt/panlex/tools
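Because these commands use “--delete” and “--delete-excluded”, it can be useful to preview a run before it changes anything on the remote server. The sketch below is a suggestion, not part of the project’s script; the function name “preview_sync” and the RSYNC override variable are hypothetical. It reuses the options of the commands above, drops “--progress”, and adds rsync’s standard “--dry-run” and “--itemize-changes” options, so that rsync only lists what it would transfer or delete.

```shell
# preview_sync LOCAL_DIR REMOTE_DIR
# List what a synchronization would transfer or delete, without changing anything.
preview_sync() {
    # RSYNC defaults to the path used in the script; it can be overridden for testing.
    # The exclude pattern is quoted so the shell never expands it as a glob.
    "${RSYNC:-/usr/local/bin/rsync}" \
        --recursive --update --perms --times --omit-dir-times \
        --delete --delete-excluded --compress-level=9 \
        --exclude='.*' --8-bit-output \
        --dry-run --itemize-changes \
        "$1" "$2"
}

# Usage, with a source and destination taken from the script:
#   preview_sync /Topics/panlex/panlex-dics/panlex-dics-data/used ubuntu@panlex.net:/opt/panlex/sources
```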

Instead of “rsync”, the “rdiff-backup” package would offer additional security, by storing changes separately from mirrors of the source directories. This could, however, add substantially to the size of the remote archive, because whole approver directories are moved from “queued” to “used” whenever the approvers’ data are ingested. Thus, approver directories would often be held in duplicate there.

Dumps of the database are copied to the remote storage at the project’s discretion rather than automatically, generally in response to specific requests.
