
PanLex: Copying the PanLex database onto AWS

Introduction

This document describes a procedure for copying the PanLex database into a suitably configured PostgreSQL database cluster on an AWS EC2 instance.

Database export

Open a terminal window and log in or “su” as a database superuser on the existing database server from which the PanLex database will be copied.

Make a suitable directory (such as ~/temp) the current directory.

Create an SQL script to clone all the databases in the database cluster with the command pg_dumpall > yyyymmdd-pgdumpall.sql, replacing “yyyymmdd-pgdumpall” with any desired filename. The output script file is about 3.5 GB in size.

Compress the script file with the command bzip2 yyyymmdd-pgdumpall.sql. The compressed file yyyymmdd-pgdumpall.sql.bz2 is about 1 GB in size.
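
Taken together, and assuming a hypothetical export date of 20240101, the export steps look like this:

# On the existing database server, as a database superuser:
cd ~/temp
pg_dumpall > 20240101-pgdumpall.sql   # writes an SQL script of about 3.5 GB
bzip2 20240101-pgdumpall.sql          # produces 20240101-pgdumpall.sql.bz2, about 1 GB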

Log out of the SSH session. The terminal session continues as a session on your workstation.

From the workstation terminal session, copy the compressed script file to the current directory of your workstation with the command scp user@oldserver.tld:dir/yyyymmdd-pgdumpall.sql.bz2 ., replacing “user” with your username on the server, “oldserver.tld” with the server’s hostname, and “dir” with the directory containing the compressed script file.

Open a new terminal window or tab. In it, navigate to the directory containing the compressed script file. Then copy the compressed script file from your workstation to the AWS server with the command scp yyyymmdd-pgdumpall.sql.bz2 ubuntu@host.domain.tld:pgsql/temp. This operation will require about an hour.
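
For example, with “alice”, “olddb.example.org”, and “dir” standing in for the actual username, hostname, and directory, the two copy operations look like this:

# First terminal session, on the workstation: fetch the file from the old server.
scp alice@olddb.example.org:dir/20240101-pgdumpall.sql.bz2 .
# Second terminal session, also on the workstation: push the file to the AWS server.
scp 20240101-pgdumpall.sql.bz2 ubuntu@db.panlex.org:pgsql/temp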

Delete the “yyyymmdd-pgdumpall.sql.bz2” file from your workstation and from the server on which the file was created.

Log in as the “ubuntu” user on the EC2 instance, with that user’s home directory as the current directory, with the command ssh ubuntu@db.domain.tld (e.g., ssh ubuntu@db.panlex.org).

Navigate to the directory containing the compressed script file with the command cd pgsql/temp.

Decompress the script file with the command bunzip2 yyyymmdd-pgdumpall.sql.bz2, then make the decompressed file readable by the “postgres” user with the command sudo chgrp postgres yyyymmdd-pgdumpall.sql (followed by sudo chmod g+r yyyymmdd-pgdumpall.sql if the group read bit is not already set).
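
On the EC2 instance, the preparation steps (again with the hypothetical 20240101 date) are:

ssh ubuntu@db.panlex.org
cd pgsql/temp
bunzip2 20240101-pgdumpall.sql.bz2           # yields 20240101-pgdumpall.sql
sudo chgrp postgres 20240101-pgdumpall.sql   # assign the file to the postgres group
sudo chmod g+r 20240101-pgdumpall.sql        # grant group read access if it is not already set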

Database recreation

Verify that PostgreSQL can be run by stopping it if it is running and then starting it, using the command sudo service postgresql restart.

Connect to the (empty) “postgres” database as the “postgres” user and recreate all the exported databases with the command sudo -u postgres psql -f /home/ubuntu/pgsql/temp/yyyymmdd-pgdumpall.sql. The recreation of the databases from this file will take about 1 to 30 hours, depending on the instance type.
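
For example, again with the hypothetical 20240101 date:

sudo service postgresql restart
sudo -u postgres psql -f /home/ubuntu/pgsql/temp/20240101-pgdumpall.sql

Because the recreation can run for many hours, consider starting it inside a terminal multiplexer such as screen or tmux, so that a dropped SSH connection does not abort it.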

Connect to the “plx” database as the “postgres” user with the command sudo -u postgres psql plx. List all the databases in the cluster with the command \l+. The databases “plx”, “postgres”, “template0”, and “template1” are necessary. Delete each of the other databases with the command drop database x;, where “x” is replaced with that database’s name.

Optimize performance in the database with the command analyze;.
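
A sample cleanup session, assuming a hypothetical leftover database named “scratch”, looks like this:

sudo -u postgres psql plx
\l+                     -- list all databases in the cluster, with sizes
drop database scratch;  -- repeat for each database other than plx, postgres, template0, and template1
analyze;                -- collect planner statistics for the plx database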

Database protection

In an EC2 PanLex implementation, read/write access to the database comes from another server, and it is permitted (see “Connection configuration” above) to any existing login-capable role in the cluster, with authentication. Thus, the “plx” database’s owner, “apache”, must have a password. If the database that you copied onto this instance came from a server that also hosted the “apache” user running the only interfaces to the database, the “apache” user may not have had a password there. In that case, while still connected to the cluster as the “postgres” user, give a password to the “apache” user with the command alter role apache password 'x';, where “x” is replaced with the desired password.
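
For example, while connected as the “postgres” user:

\du apache                       -- confirm that the apache role exists and can log in
alter role apache password 'x';  -- replace x with the desired password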

Storage reclamation

Disconnect from the database cluster with the command \q.

While connected to the instance as the “ubuntu” user, reclaim some storage by either (1) deleting the export file with the command rm pgsql/temp/yyyymmdd-pgdumpall.sql or (2) recompressing the export file with the command bzip2 pgsql/temp/yyyymmdd-pgdumpall.sql.
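
For example, with the hypothetical 20240101 date:

# Option 1: delete the export file outright.
rm pgsql/temp/20240101-pgdumpall.sql
# Option 2: recompress it instead, retaining a roughly 1 GB backup.
bzip2 pgsql/temp/20240101-pgdumpall.sql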

Database caching

On advice from PostgreSQL Experts, we archive the main tables and indices daily in order to keep them in memory. These operations are presumably less useful on lower-memory instances, but for comparative testing we replicate them on all instances. We also force the vacuuming and analysis of the database weekly to maintain query execution efficiency. Implement these operations by editing the ubuntu user’s “crontab” file with the command crontab -e and including in it these lines:

# Create a PanLex archive every 24 hours for redundancy and table caching,
# replacing the previous archive. Store any errors in a file. There is no mail service
# for error reporting.
12 0 * * * sudo -u postgres pg_dump -a -t dn -t ex -t mn -t lv -t ap -Z 9 -f /home/ubuntu/pgsql/temp/plxdaydump.gz plx 2>>/home/ubuntu/pgsql/temp/errors.txt
# Cache large PanLex indices every 24 hours.
44 2 * * * sudo -u postgres psql -c 'select * from ixck()' plx 2>>/home/ubuntu/pgsql/temp/errors.txt
# Vacuum and analyze all databases every week, supplementing programmatic operations.
52 2 * * 0 sudo -u postgres vacuumdb -az 2>>/home/ubuntu/pgsql/temp/errors.txt

The “pg_dump” operation takes about 2 hours on a Micro instance and about 15 minutes on a High-Memory Double Extra Large instance.
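
To confirm that the scheduled jobs are installed and running cleanly, a minimal check (using the filenames from the crontab above) is:

crontab -l                                     # list the installed cron entries
ls -lh /home/ubuntu/pgsql/temp/plxdaydump.gz   # verify that the daily archive is being refreshed
cat /home/ubuntu/pgsql/temp/errors.txt         # inspect any accumulated error output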
