Interlingual Technology Preview

Prepared for Esperantic Studies Foundation

Christie Evans and Jonathan Pool
2003/01/27
Revised 2004/07/22

Topics in Interlingual Technology

Example Topic: Interlingua-Based Translation

Topics in Interlingual Technology

Language Design

Theories of language design
Design of human and machine languages
Language design and intelligibility
Language design and translatability
Language design and learnability
Language design and neutrality
Esperanto as a case of language design
Special Englishes as cases of language design

Language Learning

Automated language instruction
Use of artificial languages to facilitate natural-language learning
Use of Internet technologies in language instruction
Use of artificial intelligence in language instruction

Multilingual Information Processing

Multilingual automated information searching and retrieval
Multilingual automated text recognition
Multilingual automated speech recognition
Multilingual automated speech generation
Multilingual standardization of information technology
Unicode as a case of multilingual standardization
Language-independent and multilingual software architectures
Linguistic aspects of software localization
Interlingual aspects of Jackpot, Semantic Web, Aspect-Oriented Programming, Intentional Software, and other theoretical revolutions in software architecture
Technologies for nonlingual communication
Technologies for translingual comparability in measurement and evaluation

Translation

Fully automatic translation
Human-machine collaborative translation
Strategies of translation automation
Pairwise translation automation
Interlingua-based translation
Controlled language for translingual processing

General

Linguistic theory and interlingual technology
Constraints on performance of interlingual technology
Integration of interlingual technology into other technology
Economic, political, social, cultural, and emotional effects of interlingual technology
Technology hybridity in interlingual solutions
Ownership and regulation of interlingual technology

Example Topic: Interlingua-Based Translation

To illustrate some features of work in interlingual technology, we present some information about one of the above topics: interlingua-based translation.

Terms and definitions.

The term machine translation (MT) most frequently refers to fully automatic machine translation (FAMT), or batch translation, in which a computer program takes source-language input and translates it into target-language output without interaction with a human. This contrasts with human-aided machine translation (HAMT), in which a person may be asked to resolve ambiguities or make other choices during the machine translation process, and with computer-aided human translation (CAHT or MAHT), in which a human translator uses computer tools such as bilingual dictionaries and example bases of translated texts to assist and speed the translation process.

One may distinguish MT architectures from MT technologies. An MT architecture is the basic process by which MT takes place; an MT technology is a method for performing one or more functions required by the MT architecture. Typically, an MT process implements only one architecture, but it may utilize multiple technologies. In these terms, interlingua-based translation automation is an architecture, not a technology. The two main architectures for MT systems today are pair-wise (also called "transfer"), in which the system translates directly from the source into the target, and interlingua, in which the system translates the source into an intermediate representation, which it then translates into the target language(s).

Research problems.

The automated understanding of natural language is the fundamental research problem for MT.

However, to the extent that the automated understanding of natural language is expensive or impossible, controlled input is a research problem. This is the problem of inducing and helping writers and speakers to produce and/or modify content so as to make it translatable.

Interlingua design is a basic research problem for interlingua translation. Interlinguas can be designed to take advantage of similarities between the source and target language and can be structured to cope most effectively with their differences. The design of interlinguas also responds to requirements and evaluation criteria with respect to precision, ambiguity, stylistic variety, and other attributes.

Theoretical and practical debates.

The most prominent discussion involving interlingua MT deals with the question: "Is interlingua MT superior or inferior to pair-wise MT?" There is reason to believe that the answer is: "It depends." Certain attributes of the translation problem and environment favor pair-wise MT, and some favor interlingua MT. One attribute is the number of languages involved:

In a pair-wise translation system, a separate translation program is needed for each pair of languages in each direction. For example, for three languages, English, French, and German, six programs are required: English to French, English to German, French to English, French to German, German to English, and German to French. The number of separate pair-wise translation programs for translating between n languages is n(n - 1). For a two-language system, two programs are required; for a six-language system, thirty programs are needed.
In an interlingua system, two translation programs will be needed for each language in the system, one that translates into the interlingua and one that translates from the interlingua. This rule holds regardless of the total number of languages in the system. For a two-language system, an interlingua system will require four programs; for a six-language system, twelve programs will be needed. If a system will include more than three languages, an interlingua translation approach will require fewer programs.

Thus, the benefits of interlingua MT, in comparison with pair-wise MT, generally increase when source content is to be translated into multiple target languages, and when sources to be translated originate in multiple languages.
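
The program counts above can be checked with a short calculation. The following Python sketch is only an illustration; the function names are our own and do not come from any MT system. It computes n(n - 1) for the pair-wise architecture and 2n for the interlingua architecture, and finds the smallest number of languages at which the interlingua architecture needs strictly fewer programs.

```python
def pairwise_programs(n: int) -> int:
    """Directed language pairs among n languages: n(n - 1)."""
    return n * (n - 1)


def interlingua_programs(n: int) -> int:
    """One program into and one out of the interlingua per language: 2n."""
    return 2 * n


def break_even(limit: int = 100) -> int:
    """Smallest n at which the interlingua design needs strictly fewer programs."""
    for n in range(2, limit + 1):
        if interlingua_programs(n) < pairwise_programs(n):
            return n
    raise ValueError("no break-even point found below limit")


if __name__ == "__main__":
    # Print: number of languages, pair-wise count, interlingua count.
    for n in (2, 3, 4, 6, 10):
        print(n, pairwise_programs(n), interlingua_programs(n))
```

At three languages the two architectures tie at six programs each; from four languages on, the interlingua count is lower and the gap widens quadratically.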

The two main architectures tend to exhibit some other relative advantages, as well. The pair-wise architecture has advantages when:

The system contains large example bases or statistical models or translation memories for the language pairs.
The translation task is not constrained, so the translation needs to capture all the flavors and subtleties of the source and target languages.
The source and target language are very similar, so that the translation task between them is relatively simple.

Conversely, the interlingua architecture has advantages when:

There are large example bases or statistical models or translation memories relating the various languages of the system and its interlingua, so that translation into and out of the interlingua can be optimized.
The translation task is constrained enough that nuances of the source and target language are not essential to the task.
The source and target languages are mutually exotic, with few one-to-one correspondences.

Within interlingua MT, the main debate is about the optimal nature and structure of an interlingua. The main alternatives discussed have been: (1) ordinary human language, (2) originally synthetic human language, and (3) formal (logical) representation. One view of the optimal interlingua is a code that can represent every nuance, including ambiguity, of all possible source and target languages. No actual interlingua has achieved that, and there is reason to believe that goal cannot be achieved. Thus, actual choices among interlinguas involve trade-offs between subgoals, including fidelity to particular source and target languages and particular types of content.

There has also been discussion about translation technologies, particularly about the advantages of different knowledge representations, algorithms, and paradigms. This discussion has evolved toward a consensus that technologies are not alternatives to each other, but complements. The most effective approaches to FAMT appear to involve combinations of technologies.

Philosophical, theoretical, scientific, engineering, and commercial achievements.

The acid test of any MT system is what, and how well, it succeeds in translating from the source to the target language.

Of the several interlingua MT projects, the most successful in producing usable translations has been the KANT project (standing for "Knowledge-based Accurate Natural-language Translation") at the Center for Machine Translation in the Language Technologies Institute at Carnegie Mellon University. It has been used to translate documents in "electric power utility management, heavy equipment technical documentation, medical records, car manuals, and TV captions". It makes use of a formal, not human, interlingua. It also makes some use of controlled input. For example, in its application to the translation of Caterpillar product manuals from English into French and Spanish, it provides tools for authors to support their use of Caterpillar Technical English (CTE). This allows them to write clear and consistent documents. It also offers authors a Service Information System (SIS), with which they annotate documents with information about the content's meaning and use. The KANT project produced 33 technical papers from 1991 to 2002. A recent example is a paper titled "Challenges in Adapting an Interlingua for Bidirectional English-Italian Machine Translation".

Another major interlingua MT project, now dormant, was Distributed Language Translation, operated by BSO in the Netherlands. It produced theoretical monographs and prototypes, but it was never applied.

Another contemporary interlingua MT project is one at the Indian Institute of Technology in Mumbai, India. There researchers are developing a system to translate between Hindi and English using UNL as an interlingua. UNL, Universal Networking Language, was designed by United Nations University as an encoding of knowledge for sharing across computer systems, especially on the web. Their initial results suggest that UNL can potentially capture natural-language constructs.

The interlingua architecture has also been applied to HAMT technologies. Ergane is a multilingual dictionary/translation system which uses Esperanto as an interlingua to limit the number of bilingual dictionaries required. It is available free at http://download.travlang.com/Ergane/frames-en.html
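
The idea of using a pivot language to limit the number of bilingual dictionaries can be sketched in a few lines of Python. This is a toy illustration under our own assumptions: the word lists, the language codes, and the function name translate_word are hypothetical and do not reflect Ergane's actual data or file format. Each language has a single dictionary into Esperanto, and the reverse direction is derived by inverting it.

```python
from typing import Optional

# Toy dictionaries (hypothetical data): each language maps only to the pivot.
TO_PIVOT = {
    "en": {"dog": "hundo", "house": "domo", "water": "akvo"},
    "nl": {"hond": "hundo", "huis": "domo", "water": "akvo"},
    "de": {"Hund": "hundo", "Haus": "domo", "Wasser": "akvo"},
}

# Derive the pivot-to-language direction by inverting each dictionary.
FROM_PIVOT = {
    lang: {pivot: word for word, pivot in entries.items()}
    for lang, entries in TO_PIVOT.items()
}


def translate_word(word: str, source: str, target: str) -> Optional[str]:
    """Translate one word via the pivot; return None if either step fails."""
    pivot = TO_PIVOT[source].get(word)
    if pivot is None:
        return None
    return FROM_PIVOT[target].get(pivot)


if __name__ == "__main__":
    print(translate_word("dog", "en", "de"))
    print(translate_word("huis", "nl", "en"))
```

Adding a fourth language here means adding one dictionary rather than three. The sketch assumes one-to-one word correspondences; real dictionaries are one-to-many, which is where the choice of pivot language starts to matter.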

Theoretical work on translation architectures has revealed that the two main architectures are closely related. Interlingua translation actually consists of two pair-wise translation processes, one from the source language into the interlingua and one from the interlingua into the target language. Therefore, progress in pair-wise translation can also constitute progress in interlingua translation.

The advancement of MT has consisted in part of the development of its contributing technologies. MT technologies include:

word-by-word direct translation
syntactic rules
semantic rules
complex dictionaries
rules developed from statistical analysis of large bodies of translated texts
example bases or translation memories of translated phrases and sentences which may provide matches to the current source to be translated and thus offer probable translations
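
The last item in this list, matching the current source against stored translations, can be sketched as follows. This is a minimal Python illustration under our own assumptions, not the method of any system described here: the memory contents, the 0.75 similarity cutoff, and the function name suggest are hypothetical, and the similarity measure is simply the character-level ratio provided by the standard difflib module.

```python
from difflib import get_close_matches
from typing import Optional, Tuple

# Hypothetical translation memory: English source segments with French targets.
MEMORY = {
    "Check the oil level.": "Vérifiez le niveau d'huile.",
    "Replace the air filter.": "Remplacez le filtre à air.",
    "Stop the engine.": "Arrêtez le moteur.",
}


def suggest(source: str, cutoff: float = 0.75) -> Optional[Tuple[str, str]]:
    """Return (matched source, stored translation) for the closest segment.

    Returns None when no stored segment reaches the similarity cutoff.
    """
    matches = get_close_matches(source, list(MEMORY), n=1, cutoff=cutoff)
    if not matches:
        return None
    best = matches[0]
    return best, MEMORY[best]


if __name__ == "__main__":
    print(suggest("Check the oil levels."))
    print(suggest("Tighten all of the bolts before operation."))
```

A near match only offers a probable translation; a human translator or a later processing step still has to decide whether the stored target fits the new source.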

Literary fantasies.

With a bit of license, one can construe the Babel fish described by Douglas Adams in The Hitchhiker's Guide to the Galaxy (1979) as an interlingua MT system. Adams writes: "The Babel fish ... is small, yellow and leechlike, and probably the oddest thing in the Universe. It feeds on the brainwave energy received not from its own carrier but from those around it. It absorbs all unconscious mental frequencies from this brainwave energy to nourish itself with. It then excretes into the mind of its carrier a telepathic matrix formed by combining the conscious thought frequencies with nerve signals picked up from the speech centers of the brain which has supplied them. The practical upshot of all this is that if you stick a Babel fish in your ear you can instantly understand anything said to you in any form of language. The speech patterns you actually hear decode the brainwave matrix which has been fed into your mind by your Babel fish." One interpretation of this somewhat imprecise account is that the Babel fish translates speech from any language into a formal representation (brain waves), from which its wearer derives a vocalizable understanding in the wearer's language. Adams also reminds us that MT achievements may be put to uses that aren't envisioned by the creators of MT technology: "Meanwhile, the poor Babel fish, by effectively removing all barriers to communication between different races and cultures, has caused more and bloodier wars than anything else in the history of creation." There is a summary of work possibly inspired by this fantasy in TC Forum.

History.

There have been many attempts to theorize about and realize MT. An important minority of them have used an interlingua architecture.

In a recent history of MT, John Hutchins says that one might "trace the origins of machine translation (MT) back to seventeenth century ideas of universal and philosophical languages, and of 'mechanical' dictionaries". Hutchins reports that the earliest actual attempts to design MT solutions took place in 1933. Two designs for mechanical multilingual dictionaries were patented in that year. The more ambitious of them, by Petr Trojanskij, included methods for "coding and interpreting grammatical functions using 'universal' (Esperanto-based) symbols". Thus, 1933 may reasonably mark the birth of the quest for a practical interlingua MT system.

Interest, belief, and activity in MT have waxed and waned since the 1930s, as chronicled by Hutchins. Activity was vigorous in the 1950s and early 1960s and was depressed from the mid-1960s to the mid-1970s, partly because of disillusionment expressed in a report by the NSF-sponsored Automatic Language Processing Advisory Committee (ALPAC). During what Hutchins calls the "quiet decade", the "principal innovative experiments ... focused on essentially interlingua approaches." "However, by the mid-1970s, the future of the interlingua approach seemed to be in doubt." "During the latter half of the 1980s there was a general revival of interest in interlingua systems, motivated in part by contemporary research in artificial intelligence and cognitive linguistics." DLT was one of the main products of this revival. Both DLT and KANT were "rule-based", and rule-based approaches dominated MT. "Since 1989, however, the dominance of the rule-based approach has been broken by the emergence of new methods and strategies which are now loosely called 'corpus-based' methods." For example, IBM's Candide system applied statistical analysis to the bilingual transcripts of Canadian parliamentary debates in order to translate between English and French, entirely without the application of linguistic rules. Nonetheless, interlingua MT is far from dead. As summarized by Hutchins, "In the mid 1990s other interlingua-based systems were started, e.g., the ULTRA system at the New Mexico State University developed by Sergei Nirenburg, the UNITRAN system based on the linguistic theory of Principles and Parameters, and the Pangloss project, a collaborative project involving the universities of Southern California, New Mexico State and Carnegie Mellon."

Hutchins notes that interlingua MT research in the late 1990s moved away from its original focus on syntax and now involves, more than other approaches, the compilation of lexical information (information on the use of particular words, such as what is given to language learners).

In 1988 Hutchins surveyed the preceding five years of MT work and described 41 projects around the world. Of these, eight were interlingua projects. Of those, all but two used formal interlinguas (for example, Rosetta, developed at Philips Research Laboratories). The two human languages used as interlinguas were Esperanto (in DLT, project 18) and Aymara (in ATAMIRI, project 41).

Two recognized pioneers in interlingua MT were DLT, which began in 1983, and KANT, which began in 1991. DLT used Esperanto as its interlingua, and KANT used a computer-language-like representation of the text content as its interlingua. DLT never became a commercial product. KANT was officially approved at Caterpillar for translating English to French in 1996 and English to Spanish in 1997; it continues to be expanded to translate to other languages. It has been rewritten as a more efficient and maintainable version in C++ for cross-platform compatibility (KANTOO) and continues to be developed.

Trends.

MT has exhibited a trend toward technological eclecticism. Within each architecture, multiple technologies are increasingly applied to the translation tasks. These include statistical, rule-based, and example-based technologies.

For trends in the demand for interlingua MT, see the section on "Supply, demand, pricing, and market sizes" below.

Forecasts.

We have not yet discovered any significant forecasts about the future of interlingua MT. But one might base forecasts in part on the forces that seem to have made DLT a practical dead end (even if a valuable research project) while KANT became a viable, valued production system. A number of key differences between the systems seem relevant:

The task: Probably most important, KANT is designed for a very constrained task. KANT's input is pre-edited technical documentation, such as maintenance manuals for Caterpillar equipment written in Caterpillar Technical English (CTE). DLT was intended to translate human communication in all its confusion, a much more difficult task.
The environment: KANT, the machine translation tool, exists within a network of related applications that support the writing, pre-editing, translation, and distribution of documentation. Authoring tools help writers generate consistent, unambiguous, clear documents; these are then edited by trained professionals who further annotate the documents to remove recognized ambiguities (such as indicating proper names so they will be skipped by translation). KANT is able to generate very high-quality target-language documentation because its input is designed for translation from its conception. DLT did not have the luxury of filtering its input, although an interactive authoring tool was part of its original specifications.
The development process: Caterpillar provided motivated experts and testers as well as appropriate documents for translation from KANT's earliest versions on. Development was done iteratively as the definition of the problem was clarified by the actual use of each new release. DLT didn't have this energetic give and take with involved users.
The interlingua: The content of KANT's interlingua also evolved iteratively, but its initial design didn't change significantly. The interlingua grew to include all the aspects found useful during the actual translation process on real documents. Since whole realms of human experience are excluded from Caterpillar's technical manuals, this language didn't have to deal with emotional content, implications, or literary references; nor, since the input was in a tightly structured language variety, did the interlingua need to capture complex grammatical structures. (However, see "Can Practical Interlinguas Be Used for Difficult Analysis Problems?" for an evaluation of KANT's interlingua applied to non-technical English text.) DLT, however, did need to capture all the possible subtleties of the languages for which it offered translation, so it required a more flexible interlingua. In choosing a human, even a planned human, language, DLT's developers added the ambiguities and choices of that language, as well as its richness and power, to the difficulties of their translation task.

Were experience the only guide, the experience of DLT and KANT might lead to a forecast that interlingua MT will gain commercial success in limited domains, for which it will make use of formal rather than human interlinguas. However, we suggest that this be considered a hypothesis rather than a working assumption.

There is also much theoretical work bearing on the possibilities and limits of interlingua MT. Hutchins summarizes the mid-1970s doubts about interlingua MT as follows: "The main problems identified were attributed by the Grenoble and Texas groups to the rigidity of the levels of analysis (failure at any stage meant failure to produce any output), the inefficiency of parsers (too many partial analyses which had to be 'filtered' out), and in particular loss of information about surface forms of the [source language] input which might have been used to guide the selection of [target language] forms and the construction of acceptable [target language] sentence structures."

Branches of commerce and industry.

We have not yet identified an industry or industry segment based on interlingua MT. However, interlingua MT offers its greatest advantages where there is a need to translate from or into many languages and where the form and content of originals are restricted. Some industries exhibit these attributes much more than others. For example, the pharmaceutical industry needs translations of user instructions into as many languages as users and their governments require, and user instructions, because of regulatory forces and their communicative purpose, tend to be restricted.

Supply, demand, pricing, and market sizes.

Globalization of industry is focusing business attention on localization issues, for websites, advertising, manuals, and other documentation. Localization aims to maximize appeal and usability around the world by providing information in the local language and style. In most cases, the problem definition involves a primary source language which needs to be translated into multiple other languages, which is a situation in which the interlingua architecture can be particularly valuable. Issues and goals are discussed at www.globalization.com and by the Localization Industry Standards Association.

Individual researchers.

The dean of MT is John Hutchins, who also writes superbly for general as well as technical audiences, with broad understanding and vision.
Klaus Schubert, of the DLT project, remains an active scholar at the University of Flensburg.
Toon Witkam, of the DLT project, is an emeritus Professor of Ergonomics at Delft University of Technology.
Eric H. Nyberg, Carnegie Mellon University, is a major contributor to the KANT project.
Another important KANT participant is Teruko Mitamura, also at Carnegie Mellon University.

Research centers.

The most important research centers for MT include:

The European Association for Machine Translation maintains a list of research centers.

Foundations and other sources of research support.

DLT was subsidized by the European Union and then by the Dutch government for several years, as well as by BSO.

We have not yet gathered extensive data on research support in this field. Sources known to have given support to work in it include:

National Science Foundation
Defense Advanced Research Projects Agency (DARPA): supported the Pangloss interlingua MT project
Japan Key Technology Center
American Society for Information Science and Technology (ASIS&T)
European Union

Scholarly associations.

Associations related to this topic include:

Trade associations.

We have not yet discovered any trade associations focused on interlingua MT. Some trade associations span multiple topics. For example, multiple kinds of interlingual technology affect software localization. The main trade association for that application realm is the Localization Industry Standards Association (LISA).

Bibliographies.

The most useful portal to MT bibliography is the one maintained by John Hutchins, the chief bibliographer (and also historian and prognosticator) of translation automation.

Periodicals.

The principal periodical dedicated to MT is MT News International.

Web sites and other public-information sources.

A useful general source on MT is the Association for Machine Translation in the Americas. Citations elsewhere in this proposal include links to Web sites and are not repeated here.

Specific articles and lecture outlines useful for an orientation on this topic include:

"The KANT Perspective", which compares MT architectures and technologies and discusses how they can be better used in integration than in competition
"Machine translation in the real world", which is a recent (2002) lecture outline
Linguistics 354--Computers and Human Language, the syllabus of a course which includes issues and topics in MT (including the extensive use of DLT to illustrate interlingua MT)