
PanLex: Performance

Introduction

The PanLex project’s principal current objectives are the collection of (particular kinds of) interoperable data and the deployment of those data for access by others. If the targeted users were large numbers of individuals wanting immediate answers to questions (e.g., “What is the best translation of ‘নিষ্পাদন’ from Bengali into Vietnamese?”), high performance of the database server in computing and delivering responses to queries would be necessary. If, however, the project intends to delegate the fulfillment of end-user queries to other organizations with their own servers operating on their own copies of the PanLex data, then satisfactory performance of the PanLex database server will be whatever the PanLex team needs for its own research and development.

In reality, the project is acting as if it intends to provide data to other organizations for end-user query fulfillment, but the project has also created end-user interfaces for demonstration purposes and invited interested persons to try them, as well as creating a search-engine-oriented interface (PanLinx) that it permits search engines to crawl. Therefore, performance of the database server in rapid query fulfillment is a secondary, but still significant, concern.

Obstacles to performance

Realistic queries of the PanLex database require computations complex enough to make while-you-wait query fulfillment difficult. Performance was accordingly a significant problem in the development of practical applications based on PanLex’s predecessors (Mausam, Soderland, Etzioni, Weld, Skinner, and Bilmes, 2009). Answering elementary questions about what is in the database is relatively easy, but implementing algorithms that traverse the graph of expressions, meanings, language varieties, and approvers to a depth useful for high-quality inference of additional truths from what is in the database can take minutes, hours, or even days.
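
The cost of such traversals can be sketched with a toy in-memory graph (the expression IDs and meanings below are hypothetical, and the real data, stored in PostgreSQL tables, are orders of magnitude larger):

```python
from collections import defaultdict

# Toy translation graph: each meaning links a set of expressions
# (hypothetical data; expression IDs here are "variety:text" strings).
meanings = [
    {"eng:tree", "fra:arbre", "deu:Baum"},
    {"fra:arbre", "rus:дерево"},
    {"deu:Baum", "rus:дерево", "vie:cây"},
]

# Adjacency map: expression -> expressions sharing a meaning with it.
adj = defaultdict(set)
for mn in meanings:
    for ex in mn:
        adj[ex] |= mn - {ex}

def paths_of_length(src, h):
    """Enumerate translation paths of length h starting at src.
    The frontier can grow multiplicatively with each hop, which is
    why deep traversals of the full database are slow."""
    frontier = [(src,)]
    for _ in range(h):
        frontier = [p + (nxt,) for p in frontier
                    for nxt in adj[p[-1]] if nxt not in p]
    return frontier

print(len(paths_of_length("eng:tree", 1)))  # 2 direct translations
print(len(paths_of_length("eng:tree", 2)))  # 5 two-hop paths
```

Even on this five-expression graph the number of paths grows at each hop; on hundreds of millions of denotations, the same growth makes deep traversal expensive.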

Performance-improvement strategies

Monitoring

Performance can be monitored in more detail, so that performance problems can be diagnosed better. Monitoring tools include:

Memory expansion

One hypothesis about performance improvement for PanLex has been that fitting the entire database in memory is crucial. This hypothesis rests on the assumption that slow performance always results from the need to read data from and write data to disk.

If this hypothesis is correct, the main strategy for performance improvement should be to expand the server’s memory, configure the database server accordingly, and exercise the database routinely so as to keep the data and indexes in the server’s memory.

If the cost of memory decreases and the limits on memory per server increase faster than the database grows, this strategy may be or become practical. One way to make it more practical would be to consume and pay for memory only when needed, but according to one expert this will probably require new standard forms of cloud computing, because memory is the resource that is most difficult to make elastic.

Observation has, however, cast some doubt on the truth of this hypothesis. When memory has been increased beyond about 8 GB, only small decreases in the execution times of slow queries have been achieved (see “Testing AWS”), and the operating system has usually reported only fractional utilization of memory during the execution of slow queries.

Storage acceleration

Another approach is to use faster storage hardware, such as solid-state drives. On AWS EC2, local instance storage might be faster than EBS storage (one Amazon expert says [on page 13] it’s faster for sequential I/O but slower for random I/O), and storing the database on local instance storage might be practical if modifications to it were continuously replicated to an EBS copy.

Precomputation

Another strategy is to precompute the final or intermediate answers to anticipated queries and store these for re-use. This is the basic strategy studied by Mausam et al. (see above) for the implementation of inference algorithms, and some additional exploration of this approach has taken place.
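
In miniature, the strategy trades storage and update cost for query speed. A sketch (with hypothetical toy data, not the PanDictionary algorithm itself): precompute every expression’s 2-hop translation set once, so that repeated queries become a dictionary lookup instead of a join.

```python
# Direct translations (hypothetical data).
direct = {
    "eng:tree": {"fra:arbre", "deu:Baum"},
    "fra:arbre": {"eng:tree", "rus:дерево"},
    "deu:Baum": {"eng:tree", "rus:дерево"},
}

# Precompute and store 2-hop translation sets for every expression.
# The cost is paid once (and again on every update to "direct"),
# instead of on every query.
two_hop = {}
for src, mids in direct.items():
    out = set()
    for mid in mids:
        out |= direct.get(mid, set()) - {src}
    two_hop[src] = out

print(two_hop["eng:tree"])  # {'rus:дерево'}
```

The precomputed table must be kept consistent with its sources, which is exactly the maintenance burden discussed for derivative tables below.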

Micro-redesign

Another strategy is to redesign the tables and stored procedures so execution is faster. This strategy has been used several times and has resulted in major improvements.

Macro-redesign

Another strategy is to reimplement the database with an architecture that is designed for efficient graph queries. This strategy has been discussed in the project and between the project and some partners, but no attempts have been made to test it.

Precomputation for arbitrary queries

As distinct from the predecessor project’s PanDictionary system, PanLex has no particular inference algorithm that it wants to implement efficiently. It wants any researcher to be able to define any inference algorithm and implement it efficiently with the PanLex data. If the PanLex database server were to be the host for such an implementation, then any queries required by the algorithm should be fulfilled rapidly on that server.

Some inference algorithms can be expected to require answers to questions about indirect translations, such as “What are all the translation paths of length h from expression ex to language variety v?”. We have not been able to guarantee answers to such questions in less than several minutes, even when h = 2 (i.e. when there is only 1 intermediate translation in the path). We have experimentally created derivative tables that might simplify the fulfillment of such queries. These tables have made fulfillment of some queries faster, but of other queries slower. Such tables also consume substantial storage and slow other processing, because they must be updated with trigger functions whenever tables on which they depend are modified.

Derivative tables

The database contains some derivative tables that contain redundant data but facilitate query processing. These include:

Since all of these derivative tables make the database more complex, we occasionally review whether their existence is warranted.

“trlv” experiment

An experiment conducted in September 02012 created a derivative table identifying the language varieties into which translations of expressions existed. The table, “aux.trlv”, was defined as follows:

                                                Table "aux.trlv"
 Column |  Type   | Modifiers | Storage |                              Description                               
--------+---------+-----------+---------+------------------------------------------------------------------------
 ex     | integer | not null  | plain   | expression
 lv     | integer | not null  | plain   | variety of any other expression(s) sharing a meaning of the expression
Indexes:
    "trlv_pkey" PRIMARY KEY, btree (ex, lv)

Function aux.trlvw

The function (result data type void, volatility volatile, owner “pool”, language SQL; description “Act: repopulate trlv.”) was defined as follows:

	create or replace function trlvw () returns void language sql as $$
	truncate aux.trlv;
	insert into aux.trlv
	select distinct dn0.ex, lv from dn as dn0, dn as dn1, ex
	where dn1.mn = dn0.mn
	and dn1.ex != dn0.ex
	and ex.ex = dn1.ex
	order by ex, lv;
	$$;

This table was intended to filter intermediate translations. If a user wanted chains of length 2 from some expression to some variety, this table would permit truncating paths at intermediate expressions known not to have any translations into the target variety.
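
The intended pruning can be sketched as follows (Python with hypothetical toy data; the real table stored integer ex and lv identifiers):

```python
# Toy data: direct translations and each expression's language variety
# (hypothetical; "variety:text" strings stand in for integer IDs).
direct = {
    "eng:tree": {"fra:arbre", "deu:Baum"},
    "fra:arbre": {"eng:tree", "rus:дерево"},
    "deu:Baum": {"eng:tree"},            # no Russian translation
}
variety = {"eng:tree": "eng", "fra:arbre": "fra",
           "deu:Baum": "deu", "rus:дерево": "rus"}

# Precomputed "trlv"-style map: expression -> varieties of its translations.
trlv = {ex: {variety[t] for t in ts} for ex, ts in direct.items()}

def two_hop_into(src, lv):
    """2-hop translations of src into variety lv, truncating paths at
    intermediate expressions known (via trlv) to have no translation
    into lv."""
    results = set()
    for mid in direct.get(src, set()):
        if lv not in trlv.get(mid, set()):
            continue  # prune: mid has no translation into lv
        results |= {t for t in direct[mid]
                    if variety[t] == lv and t != src}
    return results

print(two_hop_into("eng:tree", "rus"))  # {'rus:дерево'}
```

Here the path through “deu:Baum” is truncated without ever joining on its translations, because the precomputed map shows it leads nowhere in Russian.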

Testing revealed that this table improved performance in “hot” conditions, but worsened performance in “cold” conditions. Since typical queries would be processed in cold conditions, this result did not indicate a clear improvement. The table contained almost 700 million rows, and maintaining the table would enlarge the database by 37 GB (table 23 GB, primary-key index 14 GB) to 61 GB from its previous size of 23 GB. This size increase would make it impractical to keep the entire database in memory and would make the transfer of copies of the database to others or between servers slower and more expensive. The table would also require the use of trigger functions to update the table upon any modification of a dependent table, thereby worsening performance whenever expression or denotation data were modified. In the event of any failure to keep it updated, the regeneration of the table would require almost 2 hours.

Testing also revealed that queries selecting rows of that table with a particular “ex” value were fulfilled rapidly, but queries selecting rows with a particular “lv” value were very slow.

On balance, the “trlv” table appeared to do more harm than good, so the table was deleted and procedures were not redesigned to use it.

Redesign of stored procedures

Introduction

PanLex stored procedures exist in the database and also in scripts that are used by the Apache web server to process user queries. Occasionally we have found it possible to accelerate query fulfillment with the redesign of some procedures.

Indirect translations

In October 02012 we experimented with variations in a subroutine for the discovery of translation chains of length 2. Several versions of the subroutine “Trtr”, in the script “tvviz5w.pl” in version 2.8 of the PanLem interface, were tested. They all performed the equivalent operation (“Return a list of references to the IDs and texts of the 2-hop translations of the specified expression into the specified variety”), but with different degrees and kinds of sequentiality (i.e. performing the operation all at once, versus step-wise with intermediate results).

The “Trtr” subroutine began identically in all versions with:

	$dbh->do(
		'create temporary table temp0tr on commit drop as '
		. 'select distinct dn2.ex as ex1, ap from dn as dn1, dn as dn2, mn '
		. "where dn1.ex = $_[0] and dn2.mn = dn1.mn and dn2.ex != $_[0] and mn.mn = dn1.mn"
	);
	# Create a temporary table of the ex and ap of all translations of the source expression
	# except itself.

	$dbh->do(
		'create temporary table temp1tr on commit drop as select distinct ex1 from temp0tr'
	);
	# Create a temporary table of their distinct ex.

What followed is what varied from one version to another.

Version 0 was:

	$dbh->do(
		'create temporary table temp0trtr on commit drop as '
		. 'select dn1.ex as ex1, dn2.ex as ex2, tt, dn2.mn from temp1tr, dn as dn1, dn as dn2, ex '
		. "where dn1.ex = ex1 and dn2.mn = dn1.mn and lv = $_[1] and ex.ex = dn2.ex "
		. "and dn2.ex != ex1 and dn2.ex != $_[0]"
	);
	# Create a temporary table of them and the ex, tt, and mn of all their translations, excluding
	# translations into the source expression and into themselves, in the variety.

	return (&QRs (
		'distinct ex2, tt from temp0tr, temp0trtr, mn '
		. 'where temp0trtr.ex1 = temp0tr.ex1 and mn.mn = temp0trtr.mn and mn.ap != temp0tr.ap '
		. 'order by tt'
	));
	# Return a list of references to rows of a table of the ex and tt of the translations’
	# translations in which the translations’ translations’ approvers differ from the translations’
	# approvers.

Version 1 was:

	$dbh->do(
		'create temporary table temp0trtr on commit drop as '
		. 'select distinct dn1.ex as ex1, dn2.ex as ex2, ap from temp1tr, dn as dn1, dn as dn2, mn '
		. "where dn1.ex = ex1 and dn2.mn = dn1.mn and dn2.ex != ex1 and dn2.ex != $_[0] "
		. 'and mn.mn = dn1.mn'
	);
	# Create a temporary table of them and the ex and ap of all their translations, excluding
	# translations into the source expression and into themselves.

	$dbh->do(
		'create temporary table temp1trtr on commit drop as '
		. 'select distinct ex2 from temp0tr, temp0trtr '
		. 'where temp0trtr.ex1 = temp0tr.ex1 and temp0trtr.ap != temp0tr.ap'
	);
	# Create a temporary table of the distinct ex of the translations’ translations in which the
	# translations’ translations’ approvers differ from the translations’ approvers.

	return (&QRs ("ex, tt from temp1trtr, ex where ex = ex2 and lv = $_[1] order by tt"));
	# Return a list of references to rows of a table of the ex and tt of those that are in the
	# variety.

Version 2 was:

	$dbh->do(
		'create temporary table temp0trtr on commit drop as '
		. 'select distinct dn1.ex as ex1, dn2.ex as ex2, dn1.mn from temp1tr, dn as dn1, dn as dn2 '
		. "where dn1.ex = ex1 and dn2.mn = dn1.mn and dn2.ex != ex1 and dn2.ex != $_[0] "
	);
	# Create a temporary table of them and the ex and mn of all their translations, excluding
	# translations into the source expression and into themselves.

	$dbh->do(
		'create temporary table temp1trtr on commit drop as '
		. 'select distinct ex1, ex2, ap from temp0trtr, mn '
		. 'where mn.mn = temp0trtr.mn'
	);
	# Create a temporary table of the ex of the translations and their translations and the
	# latter’s ap.

	$dbh->do(
		'create temporary table temp2trtr on commit drop as '
		. 'select distinct ex2 from temp0tr, temp1trtr '
		. 'where temp1trtr.ex1 = temp0tr.ex1 and temp1trtr.ap != temp0tr.ap'
	);
	# Create a temporary table of the distinct ex of the translations’ translations in which the
	# translations’ translations’ approvers differ from the translations’ approvers.

	return (&QRs ("ex, tt from temp2trtr, ex where ex = ex2 and lv = $_[1] order by tt"));
	# Return a list of references to rows of a table of the ex and tt of those that are in the
	# variety.

Version 3 was:

	$dbh->do(
		'create temporary table temp0trtr on commit drop as '
		. 'select ex1, mn from temp1tr, dn where ex = ex1'
	);
	# Create a temporary table of them and the mn of all their denotations.

	$dbh->do(
		'create temporary table temp1trtr on commit drop as '
		. 'select distinct ex1, ex as ex2, dn.mn from temp0trtr, dn '
		. "where dn.mn = temp0trtr.mn and dn.ex != ex1 and dn.ex != $_[0] "
	);
	# Create a temporary table of them and the ex and mn of all their translations, excluding
	# translations into the source expression and into themselves.

	$dbh->do(
		'create temporary table temp2trtr on commit drop as '
		. 'select distinct ex1, ex2, ap from temp1trtr, mn '
		. 'where mn.mn = temp1trtr.mn'
	);
	# Create a temporary table of the ex of the translations and of their translations and the
	# latter’s ap.

	$dbh->do(
		'create temporary table temp3trtr on commit drop as '
		. 'select distinct ex2 from temp0tr, temp2trtr '
		. 'where temp2trtr.ex1 = temp0tr.ex1 and temp2trtr.ap != temp0tr.ap'
	);
	# Create a temporary table of the distinct ex of the translations’ translations in which the
	# translations’ translations’ approvers differ from the translations’ approvers.

	return (&QRs ("ex, tt from temp3trtr, ex where ex = ex2 and lv = $_[1] order by tt"));
	# Return a list of references to rows of a table of the ex and tt of those that are in the
	# variety.

Version 4 was:

	$dbh->do(
		'create temporary table temp0trtr on commit drop as '
		. 'select dn1.ex as ex1, dn2.ex as ex2, dn2.mn from temp1tr, dn as dn1, dn as dn2 '
		. "where dn1.ex = ex1 and dn2.mn = dn1.mn and dn2.ex != ex1 and dn2.ex != $_[0]"
	);
	# Create a temporary table of them and the ex and mn of all their translations, excluding
	# translations into the source expression and into themselves.

	$dbh->do(
		'create temporary table temp1trtr on commit drop as '
		. "select ex1, ex2, tt, mn from temp0trtr, ex where ex = ex2 and lv = $_[1]"
	);
	# Create a temporary table of them and the ex, tt, and mn of all their translations, excluding
	# translations into the source expression and into themselves, in the variety.

	return (&QRs (
		'distinct ex2, tt from temp0tr, temp1trtr, mn '
		. 'where temp1trtr.ex1 = temp0tr.ex1 and mn.mn = temp1trtr.mn and mn.ap != temp0tr.ap '
		. 'order by tt'
	));
	# Return a list of references to rows of a table of the ex and tt of the translations’
	# translations in which the translations’ translations’ approvers differ from the translations’
	# approvers.

Version 5 was:

	$dbh->do(
		'create temporary table temp0trtr on commit drop as '
		. 'select distinct dn1.ex as ex1, dn2.ex as ex2 from temp1tr, dn as dn1, dn as dn2 '
		. "where dn1.ex = temp1tr.ex1 and dn2.mn = dn1.mn and dn2.ex != dn1.ex and dn2.ex != $_[0]"
	);
	# Create a temporary table of them and the unique ex of all their translations, excluding
	# translations into the source expression and into themselves.

	$dbh->do(
		'create temporary table temp1trtr on commit drop as '
		. "select ex1, ex2 from temp0trtr, ex where ex = ex2 and lv = $_[1]"
	);
	# Create a temporary table of them limited to those in which the translations’ translations
	# are in the variety.

	$dbh->do(
		'create temporary table temp2trtr on commit drop as '
		. 'select ex1, ex2, dn2.mn from temp1trtr, dn as dn1, dn as dn2 '
		. 'where dn1.ex = ex1 and dn2.ex = ex2 and dn2.mn = dn1.mn'
	);
	# Create a temporary table of them and the mn of the translations’ translations.

	return (&QRs (
		'distinct ex2, tt from temp0tr, temp2trtr, mn, ex '
		. 'where temp2trtr.ex1 = temp0tr.ex1 and mn.mn = temp2trtr.mn and mn.ap != temp0tr.ap '
		. 'and ex.ex = ex2 order by tt'
	));
	# Return a list of references to rows of a table of the ex and tt of the translations’
	# translations in which the translations’ translations’ approvers differ from the translations’
	# approvers.

Version 6 was:

	$dbh->do(
		'create temporary table lvex on commit drop as '
		. "select ex from ex where lv = $_[1]"
	);
	# Create a temporary table of the ex of all expressions in the variety.

	$dbh->do(
		'create temporary table lvdn on commit drop as '
		. "select dn.ex, mn from ex, dn where lv = $_[1] and dn.ex = ex.ex"
	);
	# Create a temporary table of the ex and mn of all denotations of expressions in the
	# variety.

	$dbh->do(
		'create temporary table temp0trtr on commit drop as '
		. 'select distinct dn2.ex as ex2, tt from temp1tr, dn as dn1, lvdn as dn2, ex '
		. "where dn1.ex = temp1tr.ex1 and dn2.mn = dn1.mn and lv = $_[1] and dn2.ex = ex.ex "
		. "and dn2.ex != dn1.ex and dn2.ex != $_[0]"
	);
	# Create a temporary table of the unique ex and tt of all their translations in the variety,
	# excluding translations into the source expression and into themselves.

	$dbh->do(
		'create temporary table temp1trtr on commit drop as '
		. 'select distinct ex1, ex2, tt, dn2.mn '
		. 'from temp1tr, dn as dn1, dn as dn2, temp0trtr '
		. 'where dn1.ex = ex1 and dn2.ex = ex2 and dn2.mn = dn1.mn'
	);
	# Create a temporary table of the unique translations, their translations in the variety, and
	# the translations’ translations’ meanings.

	return (&QRs (
		'distinct ex2, tt from temp0tr, temp1trtr, mn '
		. 'where temp1trtr.ex1 = temp0tr.ex1 and mn.mn = temp1trtr.mn and mn.ap != temp0tr.ap '
		. 'order by tt'
	));
	# Return a list of references to rows of a table of the ex and tt of the translations’
	# translations in which the translations’ translations’ approvers differ from the translations’
	# approvers.

Version 7 was:

	$dbh->do(
		'create temporary table temp0trtr on commit drop as '
		. 'select dn1.ex as ex1, dn2.ex as ex2, tt, dn2.mn from temp1tr, dn as dn1, dn as dn2, ex '
		. "where dn1.ex = ex1 and dn2.mn = dn1.mn and lv = $_[1] and ex.ex = dn2.ex"
	);
	# Create a temporary table of them and the ex, tt, and mn of all their translations into the
	# variety.

	$dbh->do(
		'create temporary table temp1trtr on commit drop as '
		. "select * from temp0trtr where ex2 != ex1 and ex2 != $_[0]"
	);
	# Create a temporary table of them, excluding translations into the source expression and
	# into themselves.

	return (&QRs (
		'distinct ex2, tt from temp0tr, temp1trtr, mn '
		. 'where temp1trtr.ex1 = temp0tr.ex1 and mn.mn = temp1trtr.mn and mn.ap != temp0tr.ap '
		. 'order by tt'
	));
	# Return a list of references to rows of a table of the ex and tt of the translations’
	# translations in which the translations’ translations’ approvers differ from the translations’
	# approvers.

Several tests, ranging in expense, were conducted in “hot” conditions. Performance differences were nonmonotonic. The execution times, in seconds, of the most expensive test (translating English “tree” into Russian) and the cheapest test (translating Norwegian Bokmål “vilkårlig” into Swahili) were:

 Version | eng “tree” → rus | nob “vilkårlig” → swh
---------+------------------+----------------------
    0    |        19        |          17
    1    |        57        |         0–1
    2    |        87        |         0–1
    3    |        64        |         0–1
    4    |        28        |         0–1
    5    |        49        |          13
    6    |        59        |         0–1
    7    |        16        |          15

Version 7 was thus fastest in the most expensive test but among the slowest in the cheapest one. The step that consumed nearly all the time in version 7 was the creation of table “temp0trtr”; although that table contained only 15 rows in the cheapest test, creating it still took 15 seconds.

This experimentation yielded a design (version 7) that kept execution time within about 20 seconds even in the most expensive test, but only at the cost of taking nearly that long on queries that other versions fulfilled within a second.

After this experiment we installed version 0 as the working version of this procedure.

Subsequently, after additional testing, we amended the procedure to insert a statement before version 0:

	$dbh->do('analyze temp1tr');
	# Collect statistics on it.
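
The reason this statement helps is that a freshly created and populated temporary table has no planner statistics until it is analyzed, so the planner may choose a poor join order. SQLite behaves analogously, which makes for a self-contained illustration (in PostgreSQL, ANALYZE fills pg_statistic for temp1tr in the same way):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Simulate a freshly populated temporary table with an index.
cur.execute("create table temp1tr (ex1 integer)")
cur.execute("create index temp1tr_ix on temp1tr (ex1)")
cur.executemany("insert into temp1tr values (?)",
                [(i,) for i in range(1000)])

# Before ANALYZE the planner has no statistics; afterwards, the
# sqlite_stat1 table records per-index row counts that the planner
# consults when choosing join orders and index usage.
cur.execute("analyze")
stats = cur.execute("select * from sqlite_stat1").fetchall()
print(stats)  # at least one row for the analyzed index
```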
