Christie Evans and Jonathan Pool
Topics in Interlingual Technology
Example Topic: Interlingua-Based Translation
To illustrate some features of work in interlingual technology, we present some information about one of the above topics: interlingua-based translation.
The term machine translation (MT) most frequently refers to fully automatic machine translation (FAMT), or batch translation, in which a computer program takes source language input and translates it into target language output, without interaction with a human. This is contrasted with human-aided machine translation (HAMT), in which a person may be asked to resolve ambiguities or make other choices during the machine translation process, and with computer-aided human translation (CAHT or MAHT), in which a human translator uses computer tools such as bilingual dictionaries and example bases of translated texts to assist and speed the translation process.
One may distinguish MT architectures from MT technologies. An MT architecture is the basic process by which MT takes place; an MT technology is a method for performing one or more functions required by the MT architecture. Typically, an MT process implements only one architecture, but it may utilize multiple technologies. In these terms, interlingua-based translation automation is an architecture, not a technology. The two main architectures for MT systems today are pair-wise (also called "transfer"), in which the system translates directly from the source into the target, and interlingua, in which the system translates the source into an intermediate representation, which it then translates into the target language(s).
The automated understanding of natural language is the fundamental research problem for MT.
However, to the extent that the automated understanding of natural language is expensive or impossible, controlled input is a research problem. This is the problem of inducing and helping writers and speakers to produce and/or modify content so as to make it translatable.
Interlingua design is a basic research problem for interlingua translation. Interlinguas can be designed to take advantage of similarities between the source and target language and can be structured to cope most effectively with their differences. The design of interlinguas also responds to requirements and evaluation criteria with respect to precision, ambiguity, stylistic variety, and other attributes.
Other research problems
The most prominent discussion involving interlingua MT deals with the question: "Is interlingua MT superior or inferior to pair-wise MT?" There is reason to believe that the answer is: "It depends." Certain attributes of the translation problem and environment favor pair-wise MT, and some favor interlingua MT. One attribute is the number of languages involved:
Thus, the benefits of interlingua MT, in comparison with pair-wise MT, generally increase when source content is to be translated into multiple target languages, and when sources to be translated originate in multiple languages.
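The dependence on the number of languages can be made concrete with a simple module count. The sketch below is illustrative only and rests on stated assumptions: every source language must reach every target language, a pair-wise system needs one directed translator per source-target pair, and an interlingua system needs one analyzer per source language and one generator per target language. The function names are hypothetical, and the count says nothing about the cost or quality of any individual module.

```python
def pairwise_modules(n_sources: int, n_targets: int) -> int:
    """Directed translators needed if every source-target pair gets its own module."""
    return n_sources * n_targets


def interlingua_modules(n_sources: int, n_targets: int) -> int:
    """One analyzer per source language plus one generator per target language."""
    return n_sources + n_targets


# Compare the two counts for a few (sources, targets) combinations.
for sources, targets in [(1, 1), (1, 5), (3, 5), (10, 10)]:
    print(
        f"{sources} source(s), {targets} target(s): "
        f"pair-wise={pairwise_modules(sources, targets)}, "
        f"interlingua={interlingua_modules(sources, targets)}"
    )
```

Under these assumptions the pair-wise count grows as a product and the interlingua count as a sum, so the gap widens as languages are added on either side. With a single source language the interlingua count is not smaller; any benefit in that case would have to come from performing source analysis only once.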
The two main architectures tend to exhibit some other relative advantages, as well. The pair-wise architecture has advantages when:
Conversely, the interlingua architecture has advantages when:
Within interlingua MT, the main debate is about the optimal nature and structure of an interlingua. The main alternatives discussed have been: (1) ordinary human language, (2) originally synthetic human language, and (3) formal (logical) representation. One view of the optimal interlingua is a code that can represent every nuance, including ambiguity, of all possible source and target languages. No actual interlingua has achieved that, and there is reason to believe that goal cannot be achieved. Thus, actual choices among interlinguas involve trade-offs between subgoals, including fidelity to particular source and target languages and particular types of content.
There has also been discussion about translation technologies, particularly about the advantages of different knowledge representations, algorithms, and paradigms. This discussion has evolved toward a consensus that technologies are not alternatives to each other but complements. The most effective approaches to FAMT appear to involve combinations of technologies.
The acid test of any MT system is what, and how well, it succeeds in translating from the source to the target language.
Of the several interlingua MT projects, the most successful in producing usable translations has been the KANT project (standing for "Knowledge-based Accurate Natural-language Translation") at the Center for Machine Translation in the Language Technologies Institute at Carnegie Mellon University. It has been used to translate documents in "electric power utility management, heavy equipment technical documentation, medical records, car manuals, and TV captions". It makes use of a formal, not human, interlingua. It also makes some use of controlled input. For example, in its application to the translation of Caterpillar product manuals from English into French and Spanish, it provides tools for authors to support their use of Caterpillar Technical English (CTE). This allows them to write clear and consistent documents. It also offers authors a Service Information System (SIS), with which they annotate documents with information about the content's meaning and use. The KANT project produced 33 technical papers between 1991 and 2002. A recent example is a paper titled "Challenges in Adapting an Interlingua for Bidirectional English-Italian Machine Translation".
Another major interlingua MT project, now dormant, was Distributed Language Translation, operated by BSO in the Netherlands. It produced theoretical monographs and prototypes, but it was never applied.
Another contemporary interlingua MT project is under way at the Indian Institute of Technology in Mumbai, India. Researchers there are developing a system to translate between Hindi and English using UNL as an interlingua. UNL, Universal Networking Language, was designed by United Nations University as an encoding of knowledge for sharing across computer systems, especially on the web. The researchers' initial results suggest that UNL can potentially capture natural-language constructs.
The interlingua architecture has also been applied to HAMT technologies. Ergane is a multilingual dictionary/translation system which uses Esperanto as an interlingua to limit the number of bilingual dictionaries required. It is available free at http://download.travlang.com/Ergane/frames-en.html
Theoretical work on translation architectures has revealed that the two main architectures are closely related. Interlingua translation actually consists of two pairwise translation processes, one from the source into the interlingua and one from the interlingua into the destination. Therefore, progress in pair-wise translation can also constitute progress in interlingua translation.
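The relationship between the two architectures can be sketched as function composition: an interlingua translator is a source-to-interlingua step followed by an interlingua-to-target step. The example below is a toy, word-level sketch under stated assumptions: the three-word lexicons, the choice of Esperanto as the pivot, and the names `compose`, `EN_TO_EO`, and `EO_TO_FR` are all hypothetical illustrations and do not reproduce how Ergane, DLT, or any other real system works.

```python
from typing import Callable, Dict, Optional

# Toy lexicons (illustrative only): English -> Esperanto and Esperanto -> French.
EN_TO_EO: Dict[str, str] = {"dog": "hundo", "cat": "kato", "house": "domo"}
EO_TO_FR: Dict[str, str] = {"hundo": "chien", "kato": "chat", "domo": "maison"}


def compose(
    to_pivot: Dict[str, str], from_pivot: Dict[str, str]
) -> Callable[[str], Optional[str]]:
    """Build a source-to-target lookup out of two pair-wise lookups via a pivot."""

    def translate(word: str) -> Optional[str]:
        pivot_word = to_pivot.get(word)
        if pivot_word is None:
            return None  # the source word is missing from the first lexicon
        return from_pivot.get(pivot_word)  # None if the pivot word is missing

    return translate


en_to_fr = compose(EN_TO_EO, EO_TO_FR)
print(en_to_fr("dog"))   # chien
print(en_to_fr("bird"))  # None (not in the toy lexicon)
```

Because each half is an ordinary pair-wise lookup, improving either lexicon improves the composed translator, which is the sense in which progress in pair-wise translation carries over to interlingua translation.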
The advancement of MT has consisted in part of the development of its contributing technologies. MT technologies include:
With a bit of license, one can construe the Babel fish described by Douglas Adams in The Hitchhiker's Guide to the Galaxy (1979) as an interlingua MT system. Adams writes: "The Babel fish ... is small, yellow and leechlike, and probably the oddest thing in the Universe. It feeds on the brainwave energy received not from its own carrier but from those around it. It absorbs all unconscious mental frequencies from this brainwave energy to nourish itself with. It then excretes into the mind of its carrier a telepathic matrix formed by combining the conscious thought frequencies with nerve signals picked up from the speech centers of the brain which has supplied them. The practical upshot of all this is that if you stick a Babel fish in your ear you can instantly understand anything said to you in any form of language. The speech patterns you actually hear decode the brainwave matrix which has been fed into your mind by your Babel fish." One interpretation of this somewhat imprecise account is that the Babel fish translates speech from any language into a formal representation (brain waves), from which its wearer derives a vocalizable understanding in the wearer's language. Adams also reminds us that MT achievements may be put to uses that aren't envisioned by the creators of MT technology: "Meanwhile, the poor Babel fish, by effectively removing all barriers to communication between different races and cultures, has caused more and bloodier wars than anything else in the history of creation." There is a summary of work possibly inspired by this fantasy in TC Forum.
There have been many attempts to theorize about and realize MT. An important minority of them have used an interlingua architecture.
In a recent history of MT, John Hutchins says that one might "trace the origins of machine translation (MT) back to seventeenth century ideas of universal and philosophical languages, and of 'mechanical' dictionaries". Hutchins reports that the earliest actual attempts to design MT solutions took place in 1933. Two designs for mechanical multilingual dictionaries were patented in that year. The more ambitious of them, by Petr Trojanskij, included methods for "coding and interpreting grammatical functions using 'universal' (Esperanto-based) symbols". Thus, 1933 may reasonably mark the birth of the quest for a practical interlingua MT system.
Interest, belief, and activity in MT have waxed and waned since the 1930s, as chronicled by Hutchins. Activity was vigorous in the 1950s and early 1960s and was depressed from the mid-1960s to the mid-1970s, partly because of disillusionment expressed in a report by the NSF-sponsored Automatic Language Processing Advisory Committee (ALPAC). During what Hutchins calls the "quiet decade", the "principal innovative experiments ... focused on essentially interlingua approaches." "However, by the mid-1970s, the future of the interlingua approach seemed to be in doubt." "During the latter half of the 1980s there was a general revival of interest in interlingua systems, motivated in part by contemporary research in artificial intelligence and cognitive linguistics." DLT was one of the main products of this revival. Both DLT and KANT were "rule-based", and rule-based approaches dominated MT. "Since 1989, however, the dominance of the rule-based approach has been broken by the emergence of new methods and strategies which are now loosely called 'corpus-based' methods." For example, IBM's Candide system applied statistical analysis to the bilingual transcripts of Canadian parliamentary debates in order to translate between English and French, entirely without the application of linguistic rules. Nonetheless, interlingua MT is far from dead. As summarized by Hutchins, "In the mid 1990s other interlingua-based systems were started, e.g., the ULTRA system at the New Mexico State University developed by Sergei Nirenburg, the UNITRAN system based on the linguistic theory of Principles and Parameters, and the Pangloss project, a collaborative project involving the universities of Southern California, New Mexico State and Carnegie Mellon."
Hutchins notes that interlingua MT research in the late 1990s has exhibited a move away from its original focus on syntax and now involves, more than other approaches, the compilation of lexical information (information on the use of particular words, such as what is given to language learners).
In 1988 Hutchins surveyed the preceding five years of MT work and described 41 projects around the world. Of these, eight were interlingua projects. Of those, all but two used formal interlinguas (for example, Rosetta, developed at Philips Research Laboratories). The two human languages used as interlinguas were Esperanto (in DLT, project 18) and Aymara (in ATAMIRI, project 41).
Two recognized pioneers in interlingua MT were DLT, which began in 1983, and KANT, which began in 1991. DLT used Esperanto as its interlingua, and KANT used a computer-language-like representation of the text content as its interlingua. DLT never became a commercial product. KANT was officially approved at Caterpillar for translating English to French in 1996 and English to Spanish in 1997; it continues to be expanded to translate into other languages. It has been rewritten in C++ as a more efficient, maintainable, and cross-platform version (KANTOO), which continues to be developed.
MT has exhibited a trend toward technological eclecticism. Within each architecture, multiple technologies are increasingly applied to the translation tasks. These include statistical, rule-based, and example-based technologies.
For trends in the demand for interlingua MT, see the section on "Supply, demand, pricing, and market sizes" below.
We have not yet discovered any significant forecasts about the future of interlingua MT. But one might base forecasts in part on the forces that seem to have made DLT a practical dead end (even if a valuable research project) while KANT became a viable, valued production system. A number of key differences between the systems seem relevant:
Were experience the only guide, the experience of DLT and KANT might lead to a forecast that interlingua MT will gain commercial success in limited domains, for which it will make use of formal rather than human interlinguas. However, we suggest that this be considered a hypothesis rather than a working assumption.
However, there is much theoretical work bearing on the possibilities and limits of interlingua MT. Hutchins summarizes the mid-1970s doubts about interlingua MT as follows: "The main problems identified were attributed by the Grenoble and Texas groups to the rigidity of the levels of analysis (failure at any stage meant failure to produce any output), the inefficiency of parsers (too many partial analyses which had to be 'filtered' out), and in particular loss of information about surface forms of the [source language] input which might have been used to guide the selection of [target language] forms and the construction of acceptable [target language] sentence structures."
We have not yet identified an industry or industry segment that is based on interlingua MT. However, interlingua MT offers its greatest advantages where there is a need to translate from or into many languages and where the form and content of originals are restricted. Some industries exhibit these attributes much more than others. For example, the pharmaceutical industry needs translations of user instructions into as many languages as users and their governments require, and user instructions tend to be restricted because of regulatory forces and their communicative purpose.
Globalization of industry is focusing business attention on localization issues, for websites, advertising, manuals, and other documentation. Localization aims to maximize appeal and usability around the world by providing information in the local language and style. In most cases, the problem definition involves a primary source language which needs to be translated into multiple other languages, which is a situation in which the interlingua architecture can be particularly valuable. Issues and goals are discussed at www.globalization.com and by the Localization Industry Standards Association.
The most important research centers for MT include:
The European Association for Machine Translation maintains a list of research centers.
DLT was subsidized by the European Union and then by the Dutch government for several years, as well as by BSO.
We have not yet gathered extensive data on research support in this field. Sources known to have given support to work in it include:
Associations related to this topic include:
We have not yet discovered any trade associations focused on interlingua MT. Some trade associations span multiple topics. For example, multiple kinds of interlingual technology affect software localization. The main trade association for that application realm is the Localization Industry Standards Association (LISA).
The most useful portal to MT bibliography is the one maintained by John Hutchins, the chief bibliographer (and also historian and prognosticator) of translation automation.
The principal periodical dedicated to MT is MT News International.
A useful general source on MT is the Association for Machine Translation in the Americas. Citations elsewhere in this proposal include links to Web sites and are not repeated here.
Specific articles and lecture outlines useful for an orientation on this topic include: