Panlingual Dogfood

Version 1

Jonathan Pool

Utilika Foundation
pool@utilika.org

This version is condensed and revised from Version 0.

Contents

Abstract
The Problem
Solution Tactics
A Vision
Conclusion
References

Abstract

The idea of a linguistically diverse yet efficiently interactive world is becoming popular, but no plausible model for its realization has yet been described. I envision here a process that supports this goal with collaborative panlingual semantic standardization. The process is panlingual not only in its product but also in its operation, thus complying with the "eat your own dogfood" principle and using its product to make itself possible. It is also participatory and lexicocentric, as illustrated by "dogfood". I envision the process helping to panlingualize interactive services, including search, question answering, discussion, correspondence, and knowledge dissemination.

The Problem

Situations of language contact have often produced language assimilation and language death. Although there are reportedly about 7,000 natural languages in the world (Gordon 2005), the contacts that occur amidst contemporary globalization will, according to several forecasts, make most of these languages extinct within a century (Woodbury 2006).

In order to preserve or restore the vitality of weak languages without enforcing the isolation of their speech communities, it appears necessary to make the use of weak languages a rational choice for their speakers. Mufwene (2002) argues that the proximate cause of most language death is the self-interested decisions made by the dying languages' own speakers to cease using them, and the only promising interventions to stop or reverse language death are ones that confer rewards on the users of weak languages. There is evidence that their speakers value them and would maintain them if it were "possible for speakers to earn their living competitively in these languages" (Mufwene 2002, 390).

Designing an intervention to facilitate global interactivity while preserving linguistic diversity (a condition that I shall abbreviate as "panlingual interactivity") seems difficult in light of this analysis. To succeed, it must apparently allow and motivate people to use their native languages as languages of productivity. If expressions' meanings can be reliably translated by automatic means from any language into any other language, they can be encoded in any language without losing value. But otherwise their value will be greater if they are encoded in a widely known language.

The problem of making expressions translatable among arbitrary natural languages is difficult in part because of discordant lexical ambiguity, in which an ambiguous lexeme in one language does not correspond to a lexeme with exactly the same set of denotations in every other language. Efficient panlingual interactivity seems to depend on systems that can reliably, automatically, and fully resolve discordant lexical ambiguity in any natural language. This dependence is problematic, because no system even tries to do this.
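
A small worked example may help. Under my own simplified glosses (an assumption of this sketch, not a claim about any particular dictionary), the English lexeme "paper" and the Spanish lexeme "papel" are both ambiguous, but over different sets of senses, so neither can safely translate the other in every context:

    # Discordant lexical ambiguity: two lexemes overlap in one sense
    # but each also has senses the other lacks. Glosses are simplified
    # and illustrative.
    ENGLISH = {
        "paper": {"writing material", "newspaper", "scholarly article"},
    }
    SPANISH = {
        "papel": {"writing material", "role (e.g., in a play)"},
    }

    shared = ENGLISH["paper"] & SPANISH["papel"]
    only_english = ENGLISH["paper"] - SPANISH["papel"]
    print(shared)        # {'writing material'}
    print(only_english)  # senses for which "papel" would be a mistranslation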

Solution Tactics

If there is a strategy that could make panlingual interactivity practical, we might find tactics for it in related existing projects. I consider five such projects below.

Plone

Plone (http://plone.org) is "a content management system with strong multilingual support". Its functionalities include Web and portal services, document publishing, and groupware services. A typical application is to operate a Web site whose content can be displayed in any of several languages.

In principle, Plone permits any content to be transmitted to a user in any language, and it permits content elements of a site that appear in multiple contexts to be translated once and reused. Plone's tactics include making language a parameter that can be applied to minimal units of language-dependent content and using negotiation between the server and the client to set that parameter.
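
To make these two tactics concrete, here is a minimal Python sketch of content units parameterized by language and served through Accept-Language negotiation. The data structures and function names are invented for this illustration and are not Plone's actual API.

    def parse_accept_language(header):
        """Return language tags from an Accept-Language header, best first."""
        tags = []
        for part in header.split(","):
            pieces = part.strip().split(";q=")
            tag = pieces[0].strip()
            q = float(pieces[1]) if len(pieces) > 1 else 1.0
            tags.append((q, tag))
        return [tag for q, tag in sorted(tags, reverse=True)]

    # Each minimal content unit carries one translation per language.
    CONTENT = {
        "welcome-message": {"en": "Welcome", "fr": "Bienvenue", "tr": "Hoş geldiniz"},
    }

    def render(unit_id, accept_language, default="en"):
        """Serve the best available translation of one content unit."""
        translations = CONTENT[unit_id]
        for tag in parse_accept_language(accept_language):
            if tag in translations:
                return translations[tag]
        return translations[default]

    print(render("welcome-message", "fr;q=0.9, en;q=0.8"))  # -> Bienvenue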

A final Plone tactic is to make its own development interface multilingual. Just as a site's end-user sees the site in whichever language he or she prefers, the site developer sees Plone itself in the developer's preferred language. Plone's development interface currently exists in about 50 languages.

Unicode

Unicode (http://www.unicode.org/) is a standardization project for the adoption of a single text encoding capable of representing texts in all languages, natural and artificial. Unicode makes provision for about 1.1 million characters (17 planes of 65,536 code points each). Thus, Unicode has made the character space large enough that the standard can be effectively panlingual. Unicode's abundant character space makes possible another tactic: decentralized initiative. Decisions on characters in one script do not interact with decisions on characters in another script, so experts on particular scripts can autonomously formulate proposals for the inclusion of their scripts. A third Unicode tactic is to make decisions by collaborative majoritarian opinion aggregation.
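
The scale and script-independence of this code space are easy to observe in any Unicode-aware programming language. The short Python sketch below shows characters from unrelated scripts coexisting in one string, each with its own code point and standard name; the particular characters are my illustrative choices.

    import unicodedata

    # One string mixing Latin, CJK, Hebrew, and Egyptian-hieroglyph
    # characters: all live in the same code space, so no encoding
    # switch is needed to combine them.
    text = "dog \u72ac \u05db\u05dc\u05d1 \U00013153"
    for ch in text:
        if not ch.isspace():
            # Every character has a unique code point and a standard name.
            print(f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNNAMED')}")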

Unicode was designed explicitly to be panlingual. It is still moving toward full panlinguality, particularly with respect to ancient scripts, but because it was architected for panlinguality from the start, it does not need to be rearchitected as its coverage expands.

WordNet

WordNet (http://wordnet.princeton.edu/) is an activity that identifies word senses and associates them with lexemes. Its main tactic is to compile and maintain a knowledge base that identifies senses of content lexemes (more specifically, noun, verb, adjective, and adverb words and collocations). This work is performed by experts who use judgment in ascertaining the senses that deserve to be distinguished, the glosses that explain their distinct meanings, and the lexeme-sense associations. The work has three relevant results. First, it provides a collection of word senses ("synsets"). Second, it makes it possible to determine for any sense which lexemes can express it, and for any lexeme which senses it can express. Third, it assigns to senses particular properties (such as verb frame, category, and usage) and relations (such as subtype-supertype, component, member, ingredient, pertinence, and attribute). These results are language-specific.
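
The three results can be made concrete with the NLTK library's Python interface to the Princeton WordNet (the use of NLTK is an assumption of this sketch; WordNet itself is the knowledge base, not any particular API). Running it requires installing nltk and downloading its wordnet corpus.

    from nltk.corpus import wordnet as wn

    # 1. An inventory of word senses: each synset is one sense, with a gloss.
    for synset in wn.synsets("dog"):
        print(synset.name(), "-", synset.definition())

    # 2. Lexeme-sense associations in both directions.
    canine = wn.synset("dog.n.01")
    print([lemma.name() for lemma in canine.lemmas()])  # lexemes expressing this sense

    # 3. Properties and relations of senses, e.g. subtype-supertype and membership.
    print(canine.hypernyms())        # e.g. canine.n.02, domestic_animal.n.01
    print(canine.member_holonyms())  # e.g. pack.n.06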

A second WordNet tactic is the replication of WordNets for multiple languages. WordNet began as a unilingual project, dealing only with United States English. Different initiators subsequently compiled WordNets for additional languages. There are now 47 WordNets covering 39 languages, including as many as 3 distinct WordNets for a single language and one trilingual WordNet.

A third WordNet tactic is the translingual association of word senses. This work is conducted by the Global WordNet Association (http://www.globalwordnet.org/), EuroWordNet (http://www.illc.uva.nl/EuroWordNet/), MultiWordNet (http://multiwordnet.itc.it/english/home.php), and Mimida (http://www.gittens.nl/SemanticNetworks.html) projects. The details of this tactic differ among the four projects. In general, associating senses translingually requires additional expert judgment as to whether some sense in one language is equivalent to some sense in another language. In general, this judgment in these projects is influenced more by the United States English version of WordNet than by any other-language versions. In the Global WordNet Association project a set of senses, called "Base Concepts", is selected according to both the United States English version of WordNet and universality, with "Global Base Concepts" being those "that act as Base Concepts in all languages of the world".

Grammar Matrix

The Grammar Matrix (http://www.delph-in.net/matrix/) is a project for the construction of a system that automatically generates an initial computational parsing and generating grammar for any human language, given a lexicon and that language's values on a set of parameters. What makes it possible to produce an initial grammar for a language on the basis of parameter specifications is the knowledge of linguistic universals and cross-linguistic typology contained in the Grammar Matrix kernel and modules.

The tactic employed for eliciting the lexicon and parameter values is to present a Web form (http://www.delph-in.net/matrix/modules.html) to a person who knows both the target language and any of the languages in which the form is available. The form asks questions answerable by a person who is fluent in the language. The parameters cover basic grammatical differences among the world's languages, including subject-verb-object word order, how sentences are negated, how declarative sentences are converted to yes-no questions, how two or more phrases are coordinated with the equivalent of "and", how modes such as "can" and "must" are expressed, which cases exist, how nouns are marked as indefinite ("a") or definite ("the"), how nouns are pluralized, what word categories must agree with each other, and how agreement is expressed.
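
As a rough sketch of what the elicited information amounts to, the Python record below collects parameter values of the kinds just listed for one language. The field names and values are invented for this illustration; the actual customization form defines its own "choices" format.

    from dataclasses import dataclass, field

    @dataclass
    class LanguageChoices:
        """A hypothetical record of typological answers for one language."""
        language: str
        word_order: str               # e.g. "sov", "svo", "free"
        negation: str                 # how sentences are negated
        yes_no_question: str          # how declaratives become yes-no questions
        coordination: str             # the equivalent of "and"
        cases: list = field(default_factory=list)
        definiteness_marking: str = "none"
        plural_marking: str = "none"

    turkish = LanguageChoices(
        language="Turkish",
        word_order="sov",
        negation="verbal suffix -mA",
        yes_no_question="question particle mI",
        coordination="particle ve",
        cases=["nom", "acc", "dat", "loc", "abl", "gen"],
        definiteness_marking="accusative suffix on definite objects",
        plural_marking="suffix -lAr",
    )
    print(turkish)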

The Grammar Matrix uses semantic representation (specifically, Minimal Recursion Semantics) as its tactic for linking the parsing of sentences with the generation of sentences and for translingual equivalence. If multiple languages' lexicons are identified (e.g., "dogfood" in English is formally named and typed the same as "aliments pour chiens" in French), then one grammar can generate surface forms from the semantic representation that another grammar produces in parsing, thereby translating between the languages.
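
The following toy sketch shows translation through a shared semantic representation: one "grammar" parses surface words into language-neutral predicates, and a second generates surface forms from the same predicates. The flat predicate list is a deliberately crude stand-in for Minimal Recursion Semantics, and both micro-lexicons are invented for this illustration.

    # Shared semantic vocabulary: language-neutral predicate names.
    ENGLISH_LEXICON = {"dogfood": "_dogfood_n", "exists": "_exist_v"}
    FRENCH_SURFACE = {"_dogfood_n": "aliments pour chiens", "_exist_v": "existe"}

    def parse_english(sentence):
        """'Parse' by mapping known English words to semantic predicates."""
        return [ENGLISH_LEXICON[w] for w in sentence.split() if w in ENGLISH_LEXICON]

    def generate_french(predicates):
        """'Generate' French surface forms from the same predicates."""
        return " ".join(FRENCH_SURFACE[p] for p in predicates)

    semantics = parse_english("dogfood exists")
    print(semantics)                   # ['_dogfood_n', '_exist_v']
    print(generate_french(semantics))  # aliments pour chiens existe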

The expert who provides the lexicon and the parameter values to the Grammar Matrix thereby defines the variety of a language for which a grammar will be produced, and therefore controls the ambiguity in that variety. By specifying a constrained morphology and syntax and a word-sense-differentiated lexicon, the expert can force the Grammar Matrix to generate a grammar for a "controlled" variety of the language, one that finds and generates no lexical or structural ambiguity.

Semantic Web

The "Semantic Web" initiative (Berners-Lee 2001) envisions a network platform for universal interactivity. Although various flavors of this initiative have been identified (Marshall 2003), the most widely advocated flavor calls for a universalization of the World Wide Web's producing and consuming publics, to include not only human beings everywhere but also computers and physical devices.

The main enabling tactic advocated in the Semantic Web initiative is the semantic formalization of Web content. This formalization takes two forms. First, statements made in Web documents are to comply with a formal syntax. Second, references within statements are to be expressly defined.

Decentralized authority applies to definitions in the Semantic Web. Authors wishing to use a term must cite a definition of it, but they are free to cite any definition of their choosing. They may cite an already-published definition, or they may publish their own definition and cite that. Nonetheless, the Semantic Web envisions a tactic of re-using consensual definitions and a practice of making definitions rich. If this consensual tactic were implemented thoroughly, then whenever a Web document characterized anything as "dogfood" in a particular sense, it would cite the same definition. This would permit services that inspect the Web's content to combine content from different authors correctly. The richness of definitions would be achieved by the organization of definitions into ontologies, which provide not merely natural-language explications of meanings but also formal constraints and entailments, such as value types, transitivity, and exclusivity. The use of shared ontologies would permit automated reasoning from Web content, deriving conclusions not expressly stated (such as using information about dogfood to answer questions about pet food).
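
The shared-definition tactic can be sketched in Python with the rdflib library. The petstore.example namespace and its terms are invented for this illustration; the point is that two authors who cite the same URI for "dogfood", defined as a subclass of pet food, make their statements automatically combinable, and the subclass axiom lets a reasoner answer pet-food questions from dogfood statements.

    from rdflib import Graph, Literal, Namespace, URIRef, RDF, RDFS

    EX = Namespace("http://petstore.example/ontology#")
    g = Graph()

    # A shared, rich definition: dogfood is declared a subclass of pet food,
    # with labels in more than one language.
    g.add((EX.DogFood, RDF.type, RDFS.Class))
    g.add((EX.DogFood, RDFS.subClassOf, EX.PetFood))
    g.add((EX.DogFood, RDFS.label, Literal("dogfood", lang="en")))
    g.add((EX.DogFood, RDFS.label, Literal("aliments pour chiens", lang="fr")))

    # Two independent authors cite the same definition for their products.
    g.add((URIRef("http://a.example/kibble1"), RDF.type, EX.DogFood))
    g.add((URIRef("http://b.example/kibble2"), RDF.type, EX.DogFood))

    # Because DogFood is a subclass of PetFood, an RDFS reasoner could answer
    # a question about pet food using statements that mention only dogfood.
    print(g.serialize(format="turtle"))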

A Vision

We can combine some of the tactics described above to design a strategic vision for panlingual interactivity. In particular, I propose to combine:

language-parameterized content and a multilingual development interface (Plone);
a code space designed from the outset to be panlingual, populated by decentralized initiative and collaborative majoritarian decision-making (Unicode);
an expert-maintained inventory of word senses, mapped to lexemes and associated translingually (WordNet);
parameter-based generation of grammars for controlled varieties of any language, linked through a shared semantic representation (Grammar Matrix);
the semantic formalization of Web content, with shared and rich definitions (Semantic Web).

The proposed strategy would support panlingual interactivity with what I'll call "collaborative panlingual semantic standardization". For brevity, I'll call the strategy "CPSS". CPSS as I imagine it can be considered a component of the Semantic Web, providing functionalities inspired by the other four projects.

CPSS would make Semantic Web panlinguality practical. Without CPSS, the Semantic Web would permit ordinary Web users to construct their own ontologies and semantically annotated content. Content creators would then not use others' existing ontologies unless they could understand them, which in most cases they could not, because each ontology author would provide glosses only in the language or languages the author knew. The result would be an explosion of synonymity even greater than the one forecast by Marshall (2003), where panlinguality was not under consideration. The synonymity, however, would be imperfect, because ontology authors would be influenced by the lexeme-sense mappings of their own languages. Ontology matching would therefore be error-laden, and panlingual search and knowledge extraction would suffer. Even when multiple definitions were equivalent, there would be no way to guarantee the automatic discovery of the equivalence; and even if the equivalence were discovered, the effort spent formulating multiple equivalent definitions would have been wasted.

While lexical standardization is optional in the Semantic Web vision, it would be essential in CPSS, and this would require that the community of CPSS adopters be adequately motivated to share ontologies. The cost of sharing, for each adopter, would be forgoing the ability to define concepts exactly as the adopter prefers, and partly forgoing concepts that conform maximally to the lexeme senses of the adopter's main language. Moderating that cost, CPSS would permit adopters to participate equitably and efficiently in the construction of the shared ontologies. The concepts in the ontologies would not be discovered, as the synsets of (at least some) WordNets are in principle, but would be chosen collaboratively, with deliberation and bargaining, by the CPSS community. Should an ontology contain a concept expressed by the English verb "to dogfood"? The community's decision process would answer that, and no single language's word-sense inventory would be the presumed determinant of the concept inventory.

Syntactic standardization, unlike lexical standardization, is mandatory in the Semantic Web vision, but CPSS would help make it panlingually accessible to the user community. One of the most plausible doubts about the Semantic Web vision is that its cognitive demands on users would be intolerable. Marshall (2003) argues that no user-friendly authoring interface could spare users the need to make their knowledge, intentions, and presuppositions precise and that this requirement is equivalent to the training that turns a person into a knowledge engineer. If there is any escape from this burden, it seems likely to lie in the use of native-language intuitions to help users reason correctly about ontological design and use. CPSS would use the Grammar Matrix approach to permit native speakers of the world's languages to define morphosyntactic fragments of their languages that can unambiguously generate at least semantic representations equivalent to those that the Semantic Web's OWL, RDF Schema, and RDF can express. CPSS would make use of knowledge about controlled-language expressivity and ambiguity (Pool 2005), knowledge from various related unilingual controlled-language projects (e.g., Clark 2005, Kaljurand 2006, Sowa 2004), and knowledge about the efficiency of constrained communication (Ford 1979).
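
As a toy illustration of such an unambiguous fragment, the sketch below maps each of two controlled-English sentence patterns deterministically onto an RDF-style triple, and rejects anything outside the fragment rather than guessing. The patterns and output are invented for this sketch; real controlled languages such as Attempto Controlled English (Kaljurand 2006) are far richer.

    import re

    PATTERNS = [
        # "Every dog is an animal." -> subclass axiom
        (re.compile(r"^Every (\w+) is an? (\w+)\.$"),
         lambda m: (m.group(1), "rdfs:subClassOf", m.group(2))),
        # "Rex is a dog." -> class membership
        (re.compile(r"^(\w+) is an? (\w+)\.$"),
         lambda m: (m.group(1), "rdf:type", m.group(2))),
    ]

    def interpret(sentence):
        """Return the unique triple for a sentence, or None if unparseable."""
        for pattern, build in PATTERNS:
            m = pattern.match(sentence)
            if m:
                return build(m)
        return None  # outside the controlled fragment: reject, never guess

    print(interpret("Every dog is an animal."))  # ('dog', 'rdfs:subClassOf', 'animal')
    print(interpret("Rex is a dog."))            # ('Rex', 'rdf:type', 'dog')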

CPSS would make maximal use of dogfooding in order to panlingualize not only its products but also its process. Other multilingual projects, such as Plone, pursue this approach to some extent, but stop after permitting consumer users and individual site creators to work in multiple languages. CPSS would be more thorough about this. It would not rely on the assumption that every activist (e.g., policymaking participant) in the adopter community shares a common natural language (such as English). Its aim would be to facilitate participation in all aspects of the activity by persons with relevant knowledge and skill, regardless of their native languages and regardless of which additional languages, if any, they know. The Grammar Matrix lexicon-input and parameterization form would be a CPSS-compliant document, so as soon as it has been completed by a bilingual user it would become available in the language for which it has just been completed. This would permit monolingual speakers of the language to inspect and question the choices made, and permit bilinguals who know that language to use the form in that language for the generation of grammars for additional languages. The voting process in which members of the community adopt ontology decisions would likewise have an interface made from CPSS-compliant forms, permitting participation via any CPSS-enabled language. The deliberative process would be a greater dogfooding challenge, but one from which it would not be reasonable for CPSS to retreat, since deliberation is an activity type that those envisioning a universally accessible Web want to include in its scope. More than the other activities, the dogfooding of deliberation would validly test the expressivity achieved by CPSS as it panlingually develops.

Conclusion

The strategic vision described here, if attempted, might succeed or fail, but in either case it would be a realistic test of the feasibility of the mass-participation element of the proposed Semantic Web. It is unreasonable to debate the merits of the Semantic Web idea without asking it to deal explicitly with the existence of 7,000 languages in the world. An attempt to implement the CPSS vision would provide empirical measures of the cost and limits of doing so.

Were CPSS to prove practical, it would bring foreseeable improvements to information retrieval, information extraction, question answering, summarization, transaction processing, deliberation, discussion, and publication. In addition, it would change the world's economy of language. Languages that are now treated as doomed would become useful for participation in global information exchange, and this would increase the value of using them productively and in other ways. The value of creating writing systems for as-yet unwritten languages, and the value of becoming literate in written minority languages, would also increase greatly. The global linguistic equilibrium could change significantly, as it became less costly to satisfy the preference for linguistic diversity and linguistic preservation.

References

Berners-Lee 2001. Tim Berners-Lee, James Hendler, and Ora Lassila, "The Semantic Web", Scientific American, 284(5), 2001, 34-43. http://www.scientificamerican.com/article.cfm?articleID=00048144-10D2-1C70-84A9809EC588EF21&catID=2

Clark 2005. Peter Clark et al., "Acquiring and Using World Knowledge Using a Restricted Subset of English", 2005. http://www.cs.utexas.edu/users/pclark/papers/flairs.pdf.

Ford 1979. W. Randolph Ford, Alphonse Chapanis, and Gerald D. Weeks, "Self-Limited and Unlimited Word Usage during Problem Solving in Two Telecommunication Modes", Journal of Psycholinguistic Research, 8, 1979, 451-475.

Gordon 2005. Raymond G. Gordon, Jr. (ed.), Ethnologue: Languages of the World, 15th edn. Dallas: SIL International, 2005. http://www.ethnologue.com/.

Kaljurand 2006. Kaarel Kaljurand and Norbert E. Fuchs, "Bidirectional Mapping between OWL DL and Attempto Controlled English", delivered at Fourth Workshop on Principles and Practice of Semantic Web Reasoning, Budva, Montenegro, 2006. http://www.ifi.unizh.ch/attempto/publications/papers/ppswr2006_kaljurand.pdf.

Marshall 2003. Catherine C. Marshall and Frank M. Shipman, "Which Semantic Web?", Hypertext '03 Proceedings, 2003. http://www.csdl.tamu.edu/~marshall/ht03-sw-4.pdf.

Mufwene 2002. Salikoko S. Mufwene, "Colonization, Globalization and the Plight of 'Weak' Languages", Journal of Linguistics, 38, 2002, 375-395.

Pool 2005. Jonathan Pool, "Can Controlled Languages Scale to the Web?", manuscript. http://utilika.org/pubs/etc/ambigcl/.

Sowa 2004. John F. Sowa, "Common Logic Controlled English", 2004. http://www.jfsowa.com/clce/specs.htm.

Woodbury 2006. Anthony C. Woodbury, "What is an Endangered Language?". Linguistic Society of America, 2006. http://www.lsadc.org/info/pdf_files/Endangered_Languages.pdf.
