PanLex: Local database management

Introduction

The PanLex database begin in 2005 as the TransGraph database at the University of Washington Turing Center, managed as a Microsoft SQL Server database. In 2006 Utilika Foundation began the PanLex branch, converting the SQL Server database to a PostgreSQL database. The host for the PostgreSQL server was a Red Hat Enterprise Linux (RHEL) server housed in the office of the foundation. Utilika Foundation contributed it to The Long Now Foundation in 2012, and the server remained in operation, running under RHEL 5 by 02012.

This page documents the installation, configuration, and maintenance of PostgreSQL and of the PanLex database on a local server.

PostgreSQL version

The version of PostgreSQL that is currently hosting the database is 9.0.1. This version is more recent than the one provided with Red Hat Enterprise Linux 5.

PostgreSQL installation

We built and installed it from source files. Two other installation modes, using a pre-built binary package in the PostgreSQL Yum repository and using Postgres Plus published by Enterprise DB, were deemed too poorly documented.

We installed PostgreSQL in the standard directory, /usr/local/pgsql. We used the following commands, which elect optional Perl support and specify a particular version of Perl (likewise more recent than the RHEL 5 version):

cd /var/local/packages/distributions/postgresql-9.0.1
pg_dumpall -U postgres -f dumpall20101213
gmake distclean
configure --with-perl PERL='/opt/perl/bin/perl'
gmake world
    (Output ends with “PostgreSQL, contrib, and documentation successfully made. Ready to install.)
sudo -u postgres gmake check
    (Output ends with “All 122 tests passed.”)
gmake install-world
    (Output ends with “PostgreSQL, contrib, and documentation installation complete.”)
gmake clean

Post-Installation Optimizations

The documentation in section 15.6 of the PostgreSQL documentation on shared-library optimization advises executing the command “/sbin/ldconfig /usr/local/pgsql/lib”. The alternative that we have implemented is to put “/usr/local/pgsql/lib” into a file named /etc/ld.so.conf.d/postgresql-9.0.1.conf, and then to execute “ldconfig”.

Environment variables are defined in /etc/profile:

PGDATA=/usr/local/pgsql/data
PGPORT=5432

PATH=/usr/local/pgsql/bin:$PATH:/var/local/utils

MANPATH=/opt/perl/man:/opt/python2.6/share/man:/usr/local/pgsql/share/man:/usr/kerberos/man:/usr/local/share/man:/usr/share/man/en:/usr/share/man:/usr/share/locale/man

PostgreSQL configuration

Locale

The server’s main purpose is to host the PanLex database, which is not locale-specific. Locales other than the generic POSIX locale have a “performance impact” according to the PostgreSQL documentation on cluster creation. Therefore, the database cluster’s locale is set to POSIX. Several locale categories can be overridden, but we do not override any of them. For example, we leave the LC_COLLATE category as POSIX, partly because any other string-sort order imposes “a performance penalty” according to the PostgreSQL documentation on “initdb”.

The database cluster’s default encoding, if not set, is derived automatically from the cluster’s locale by the “initdb” program. The encoding derived from the POSIX locale is SQL_ASCII. However, PostgreSQL documentation advises (in section 22.2) against using SQL_ASCII as the encoding of a database containing non-ASCII data. It doesn’t assert that SQL_ASCII offers any performance advantage. Accordingly, starting with PostgreSQL version 9.0.1 we specify UTF8 as the default encoding of the database cluster when running “initdb”, making UTF8 the default encoding of any new database. The cluster-creation procedure was:

su postgres
/usr/local/pgsql/bin/initdb -D /usr/local/pgsql/data -E UTF8 --locale=POSIX

The path to “initdb” was specified because an older version of PostgreSQL still existed, with its version of initdb, in a different directory.

The default encoding before PostgreSQL 9.0.1 was SQL_ASCII, and this was also the encoding of all databases in the cluster. Our upgrade to PostgreSQL 9.0.1 required the conversion of all of the databases from SQL_ASCII to UTF8 encoding. We performed this conversion by executing pg_dumpall (in its latest version) to create a database-recreation script for all databases in the (previous) cluster, and then editing the script to change the encoding and the client encoding of each database to UTF8.

Memory

As documented in section 17.4 ("Linux Memory Overcommit") of the PostgreSQL documentation, we encountered out-of-memory crashes in PostgreSQL until we modified the /etc/sysctl.conf file to set the vm.overcommit_memory parameter's value to 2.

In 2010 we discovered that both the old and the new versions of PostgreSQL crashed with out-of memory errors when executing some index-creation commands, if both versions were running simultaneously. Investigation by PostgreSQL Experts produced the advice to increase the swap space from 2 GB to 8 GB. To accomplish this, we created a 6GB swap file. Nonetheless, we found both versions crashing when we next tried to install a later version (9.1.5) in parallel with 9.0.1.

Automatic Launch

Per section 17.3 of the PostgreSQL documentation, we (as root) copied contrib/start-scripts/linux from the source files into /etc/rc.d/init.d/postgresql-901, changed its permissions with “chmod 755 /etc/rc.d/init.d/postgresql-901”, and executed “chkconfig --add postgresql-901”, in order to cause the host to launch PostgreSQL automatically on each reboot. To stop the previous version of PostgreSQL from launching automatically, we removed its script from the init.d directory and the symbolic links to it from the runlevel directories where they existed.

Initial Launch

To launch a new version of PostgreSQL while the old version is still running, we executed the shell command “PGPORT=nnnn pppp/bin/pg_ctl start -D dddd/data”, where “nnnn” was the number of the new version’s port, “pppp” was the path to the new version’s “bin” directory, and “dddd” was the path to the new version’s “data” directory.

Upgrading PostgreSQL

We have attempted to upgrade to a new version of PostgreSQL by installing it in parallel with the current version and running them both until we have adequately tested the new version. This strategy has caused installation failures and post-installation crashes, as well as administrative difficulties. A reasonable conclusion is that this strategy should be avoided.

The parallel strategy is illustrated by the following description of our attempted upgrade from version 9.0.1 to version 9.1.5 in August 02012. The installation strategy is use of the PostgreSQL Yum Repository. It is documented by PGDG. If it fails, we resort to installation from source files.

If not already installed, download the appropriate PostgreSQL Yum Repository package (Red Hat Enterprise Linux 5 - x86_64) from the repository list, make its directory the current directory, and then install it with the command sudo yum install pgdg-redhat91-9.1-5.noarch.rpm.
Using the PostgreSQL Yum Repository package, fetch and install the PostgreSQL packages with the command sudo yum install postgresql91-server postgresql91-contrib postgresql91-docs postgresql91-plperl. The files are installed in /var/lib/pgsql/9.1, in /usr/pgsql-9.1, and (with a version-specific name) in /etc/rc.d/init.d. Thus, they don’t overwrite the currently running version.
Become the “postgres” user with “su” and “su postgres”.
Create a database cluster with the command /usr/pgsql-9.1/bin/initdb -D /var/lib/pgsql/9.1/data -E UTF8 --locale=C > /var/lib/pgsql/9.1/pgstartup.log. (The documentation advises using the command service postgresql-9.1 initdb -E UTF8 --no-locale for this, but it ignores the “no-locale” switch.)
Make /var/lib/pgsql/9.1/data the current directory.
Make a copy of /var/lib/pgsql/9.1/data/postgresql.conf as “postgresql-std.conf”, with the command cp -p postgresql.conf postgresql-std.conf. Then replace the original’s content with the cluster’s custom dual-operation main configuration, shown below.
Make a copy of /var/lib/pgsql/9.1/data/pg_hba.conf as “pg_hba-std.conf”, with the command cp -p pg_hba.conf pg_hba-std.conf. Then replace the original’s content with the cluster’s custom access configuration, shown below.
Make a copy of /var/lib/pgsql/9.1/data/pg_ident.conf as “pg_ident-std.conf”, with the command cp -p pg_ident.conf pg_ident-std.conf. Then replace the original’s content with the cluster’s custom name-mapping configuration, shown below.
Exit from “postgres” identity to “root” identity with the command “exit”.
Make the server launch on every reboot with the command chkconfig postgresql-9.1 on.
Perform manipulations required by a bug in the package that was downloaded: (a) mv /etc/ld.so.conf.d/postgresql-9.0.1.conf /etc/ld.so.conf.d/postgresql-temp-9.0.1.conf; (b) ldconfig; (c) mv /etc/ld.so.conf.d/postgresql-temp-9.0.1.conf /etc/ld.so.conf.d/postgresql-9.0.1.conf. Without these, the upgrade fails 2 steps below, when the attempt to launch PostgreSQL 9.1 succeeds in starting the PostgrSQL processes, but also elicits the error message “/usr/pgsql-9.1/bin/pg_ctl: symbol lookup error: /usr/pgsql-9.1/bin/pg_ctl: undefined symbol: PQping”.
Revise the configuration of the currently running version of PostgreSQL by changing the solo configuration to a dual configuration in its postgresql.conf file and, as the “postgres” user, executing “pg_ctl restart”.
Launch PostgreSQL 9.1 with the command sudo -u postgres /usr/pgsql-9.1/bin/pg_ctl start -w -D /var/lib/pgsql/9.1/data -l /var/lib/pgsql/9.1/start.log. (That log file is created, showing the time, but has no content.)
While the new and old versions of PostgreSQL are running in parallel, the old one is the production version, so leave the paths to executables and the environment variables as-is, but elevate the new version’s documentation to default status, as follows. Edit /etc/man.config to include /usr/pgsql-9.1/share/man in MANPATH. Then execute makewhatis -w to index the man pages on the MANPATH. Edit /etc/httpd/conf/httpd.conf to include Alias /pg91doc "/usr/share/doc/postgresql91-docs-9.1.5/html". (If you have a client browser bookmark for this documentation, edit it to point to http://panlex.org/pg91doc.)
Create a new incrementally numbered version of the PanLem interface and edit its CGI script to make use of the new version of PostgreSQL. Create the new version with the command cp -p opt/www/main/cgi-bin/plxu.cgi opt/www/main/cgi-bin/plxu29.cgi. Edit its definitions of $plvb to '9' and $act to '/cgi-bin/plxu29.cgi'. Edit the SRImp subroutine in that script to define the PostgreSQL listening port as 54321. Create the new version’s library of input and output routines with the command cp -Rp opt/www/main/local/panlex/u/28 opt/www/main/local/panlex/u/29. Change the ownership of the directory and its files with the command chown -R apache:wheel opt/www/main/local/panlex/u/29.
Copy the database cluster to the new version of PostgreSQL with the commands su postgres and /usr/pgsql-9.1/bin/pg_dumpall -p 5432 | psql -d postgres -p 54321.
Restore it with the new version for testing.

If the new version of PostgreSQL has different directories and a different port from the old version, other files must be edited. These may include:

/etc/crontab
/etc/profile
/etc/services
/etc/sysconfig/ip6tables
/etc/sysconfig/iptables
/etc/sysconfig/system-config-securitylevel
/etc/selinux/targeted/contexts/netfilter_contexts
/etc/selinux/targeted/modules/active/netfilter_contexts
/var/local/utils/reserved/config.inc.php
opt/www/main/cgi-bin/plrefs.cgi
opt/www/main/cgi-bin/plxp.cgi
opt/www/main/cgi-bin/plxu.cgi

Configuration

The custom configuration of the database cluster in which PanLex resides is contained in three files. The configuration shown below is that for PostgreSQL version 9.0.1. The configuration for version 9.1.5 is identical, except that the value of the “port” parameter is 54321. The release notes for version 9.1 apparently require or recommend no configuration changes.

There are 2 versions of “postgresql.conf”. One is for solo operation (only 1 version of PostgreSQL running, and the other is for dual operation (2 versions running in parallel).

postgresql.conf (solo operation)

# See postgresql-std.conf for embedded documentation and other options.
# This version is tuned for 32 GB of RAM.

authentication_timeout = 15s
# Per J Berkus, 2009-10-13 lecture.

autovacuum = off
# Per J Berkus, 2010-12-02 message.
# When autovacuum was on, other autovacuum settings were:
# autovacuum_analyze_scale_factor = 0.05
# autovacuum_naptime = 10800
# autovacuum_vacuum_scale_factor = 0.2

checkpoint_segments = 32
# 16 per J Berkus, 2009-10-13 lecture.
# Increased because of occasional log warnings of checkpoint intervals about 15 seconds.

constraint_exclusion = off
# Per J Berkus, 2009-10-13 lecture.

cpu_index_tuple_cost = 0.001
# Per J Berkus, 2009-10-13 lecture.

cpu_operator_cost = 0.0005
# Per J Berkus, 2009-10-13 lecture.

datestyle = 'iso, ymd'

effective_cache_size = 24GB
# effective_cache_size to be 0.667 of RAM
# J Berkus, 2009-10-13 lecture, recommends 0.75.

escape_string_warning = 0
lc_messages = 'POSIX'
lc_monetary = 'POSIX'
lc_numeric = 'POSIX'
lc_time = 'POSIX'
listen_addresses = 'localhost,70.36.157.216'
log_destination = 'syslog'
log_error_verbosity = default
log_min_duration_statement = 10000
log_min_error_statement = warning
log_connections = off
log_line_prefix = '%h '

max_connections = 30
max_prepared_transactions = 30
max_stack_depth = 8192
port = 5432

# statement_timeout = 30min
# Per J Berkus, 2009-10-13 lecture.
# Apparently caused function execution to abort.

synchronous_commit = off
# Per J Berkus, 2009-10-13 lecture.

syslog_facility = 'LOCAL0'
syslog_ident = 'pg'

track_functions = pl
# Per J Berkus, 2009-10-13 lecture.

vacuum_cost_delay = 0
# VCD = 1000ms slowed vacuuming enormously.
# J Berkus, 2009-10-13 lecture, recommends VCD less than 100ms.

wal_buffers = 8MB
# Per J Berkus, 2009-10-13 lecture.

###############################################################
# SOLO OPERATION
###############################################################

maintenance_work_mem = 2GB
# maintenance_work_mem to be 0.0625 of RAM
# temporarily doubled to diagnose out-of-memory failures

shared_buffers = 8GB
# shared_buffers to be 0.25 of RAM

work_mem = 512MB

postgresql.conf (dual operation)

Same as above, except replace the section starting “SOLO OPERATION” with this section:

###############################################################
# DUAL OPERATION
###############################################################

maintenance_work_mem = 1GB
# maintenance_work_mem to be 0.0625 of RAM
# normally 2GB; divided by 2 for dual-PostgreSQL operation
# can be temporarily doubled to diagnose out-of-memory failures

shared_buffers = 4GB
# shared_buffers to be 0.25 of RAM
# normally 8GB; divided by 2 for dual-PostgreSQL operation

work_mem = 256MB
# normally 512MB; divided by 2 for dual-PostgreSQL operation

pg_hba.conf

# See pg_hba-std.conf for documentation.
# TYPE  DATABASE    USER        CIDR-ADDRESS          METHOD
# "local" is for Unix domain socket connections only
# <CUSTOM>
local   all         all                               trust
local   all         postgres                          ident map=postgres
local   all         apache                            ident map=apache
local   all         pool                              ident map=pool
# </CUSTOM>
local   all         all                               md5
# IPv4 local connections:
# <CUSTOM>
host    all         all         127.0.0.1/32          trust
host    all         postgres    127.0.0.1/32          ident map=postgres
# </CUSTOM>
host    all         all         127.0.0.1/32          md5
# IPv6 local connections:
host    all         all         ::1/128               md5
# <CUSTOM>
# IPv4 remote connections:
host    all         all         0.0.0.0/0             md5
# </CUSTOM>

pg_ident.conf

# See pg_ident-std.conf for documentation.
# MAPNAME       SYSTEM-USERNAME         PG-USERNAME
# <CUSTOM>
apache        apache            apache
apache        pool              apache
apache        root              apache
postgres      postgres          postgres
postgres      pool              postgres
postgres      root              postgres
pool          apache            pool
pool          pool              pool
# </CUSTOM>

Redundancy

Our automatic periodic duplication process makes copies of our active databases weekly. They are saved in “/var/local/archives/dbms”, then combined and compressed into a file in “/var/local/archives/general”. We copy the compressed archives to our workstation and periodically onto removal storage drives for off-site deposit.

We can specially archive any particular database named “dbname” to a file named “fname” by executing the command “pg_dump -f fname dbname” as the user “postgres”.

To restore any database from its archived file, we must (at least during the restoration) make the database’s owner a superuser in the database cluster. That user makes the “/var/local/archives/dbms” directory current, then deletes the database with the “dropdb” command. The user then executes the “createdb -T template0” command, appending the database name as the last argument. The user then executes the “psql” command, giving the name of that database as the command argument, e.g. “psql ac”. If we are restoring the database to a fresh database cluster, we then assure that any users with privileges on any objects in the database already exist in the cluster; for this purpose, the user issues the applicable statements. For all databases, they are “create role pool login createdb” and “create role smc login”. For “panlex” and “ac”, they also include “create role apache login”. For “cal”, they also include “create role webcalng login”. Finally, the user, still in psql, issues the “\i” command, giving the file name as the argument, e.g. “\i ac_dump”. (The dump file contains a create-language command that only a superuser is permitted to execute.) The confirmation messages after this command should contain no warnings and no errors. This procedure is the only one that seems to work, among those described in the PostgreSQL documentation.

Documentation

The current version’s PostgreSQL documentation is accessible via an HTML client at "http://utilika.org/pgdoc". We have provided for this access in "/etc/httpd/conf/httpd.conf". If the directory containing the documentation changes, and we amend httpd.conf accordingly (in 2 places where the string "postgresql" appears), after which we execute "apachectl restart".