Presentation by Mr. Nabil Ali
MACHINE TRANSLATION: A CONTRASTIVE LINGUISTIC PERSPECTIVE
NABIL ALI
I Background
In 1983, the author was assigned the task of developing the first bilingual Arabic-English home computer. This task involved the development of a bilingual operating system (MSX-based), as well as the establishment of a software development unit dedicated to Arabic applications. Because the project was oriented towards the home, the emphasis was mainly on culture-ware and education-ware. This placed a higher priority on natural language processing and on solving the myriad problems associated with Arabic computation.
At that time, Arabic was extremely underprivileged in the field of computation, suffering from the limitations of a minimal system working at a pure character level and from poor printing and display quality. It was therefore necessary to move to a more developed level, dealing with larger linguistic units, namely the word, the sentence, and the continuous text. Since English was the most established example in computation, we had to draw on its resources and techniques. We soon discovered that these techniques are not suitable for Arabic, simply because Arabic, compared to English, is much more complex at almost all linguistic levels (with phonology as the sole exception).
At the character level, the complexity lies in the cursive shape and concatenation of Arabic letters and, above all, in their high degree of context sensitivity: the appropriate shape of a letter is determined by the surrounding letters. Hence, the author had to implement an automaton that generates the appropriate shape of each letter depending on its context. At the word level, the morphology of Arabic (the language subsystem dealing with the structure of words) is, as is well known, the most sophisticated of all the world's languages.
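The context-sensitive shaping described at the character level can be illustrated with a small sketch. It is not the author's automaton. It assumes the usual four positional forms (isolated, initial, medial, final) and a deliberately incomplete table of joining behaviour, and the names DUAL_JOINING, RIGHT_JOINING_ONLY and positional_forms are hypothetical. The sketch returns only the name of the form for each letter, not the glyph itself.

```python
# Assumption: a tiny, incomplete joining table. A real shaper would take the
# joining type of every letter from the Unicode data files.
# Letters that connect to both neighbours: beh, teh, seen, ain, kaf, lam, meem, noon.
DUAL_JOINING = set("\u0628\u062A\u0633\u0639\u0643\u0644\u0645\u0646")
# Letters that connect only to the preceding letter: alef, dal, thal, reh, zain, waw.
RIGHT_JOINING_ONLY = set("\u0627\u062F\u0630\u0631\u0632\u0648")


def positional_forms(word):
    """Return the positional form name of each letter, in logical (reading) order."""
    forms = []
    for i, ch in enumerate(word):
        prev_ch = word[i - 1] if i > 0 else None
        next_ch = word[i + 1] if i + 1 < len(word) else None
        # The previous letter must be able to connect forward, and this one backward.
        joins_prev = (prev_ch in DUAL_JOINING
                      and (ch in DUAL_JOINING or ch in RIGHT_JOINING_ONLY))
        # This letter must be able to connect forward, and the next one backward.
        joins_next = (ch in DUAL_JOINING and next_ch is not None
                      and (next_ch in DUAL_JOINING or next_ch in RIGHT_JOINING_ONLY))
        if joins_prev and joins_next:
            forms.append("medial")
        elif joins_prev:
            forms.append("final")
        elif joins_next:
            forms.append("initial")
        else:
            forms.append("isolated")
    return forms


# kaf + teh + alef + beh: the alef does not connect forward, so beh stands alone.
kitab = "\u0643\u062A\u0627\u0628"
print(positional_forms(kitab))
```

The same letter thus receives a different form depending only on its neighbours, which is the property that makes a per-character display scheme inadequate for Arabic.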
Given this morphological complexity, the author had to develop a morphological processor capable of analyzing any Arabic word into its morphological primitives, as well as synthesizing the final form of words out of these primitives. Lastly came the syntactic level, which proved to be the most difficult, primarily because Arabic is usually written without vowels. In essence, written Arabic is a quasi-stenographic script, and this results in a severe mixture of ambiguities unparalleled in other languages. The morphological ambiguity, due to the absence of vowels, is intermixed with other types of ambiguity, mainly those associated with word sense, part of speech, and syntactic structure. For a non-Arabic speaker to appreciate the problem, let us assume hypothetically that an English sentence such as "SOME FIRMS LEND MONEY" is written in the Arabic fashion. The result will be the following string of consonants: "SM FRMS LND MNY". Each of these consonantal forms may have a set of alternative vowelized interpretations (Figure 1).
Figure 1: Morphological ambiguity due to the dropping of vowels
Thus, any syntactic processor taking Arabic text as its input must first disambiguate this quasi-stenographic script. An automatic vowelizer therefore became a mandatory prerequisite for Arabic computation. To solve this problem, the author developed a "shallow" automatic understander of Arabic in order to disambiguate the unvowelized text and to restore the missing vowels. This required three main computational-linguistic tasks: (1) the development of an Arabic parser; (2) the development of a lexical-semantic processor; (3) the development of an automatic generator of the vowelized text.
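The English analogy used above ("SM FRMS LND MNY") can be reproduced with a few lines of code. This is only an illustrative sketch: the mini-lexicon is hypothetical and is not the content of Figure 1, and the functions skeleton and readings are names introduced here. Dropping every English vowel is also a simplification of how Arabic script omits short vowels.

```python
# Hypothetical mini-lexicon, chosen only for illustration; it is not the
# content of Figure 1.
LEXICON = ["some", "same", "seem", "sum",
           "firms", "farms", "forms", "frames",
           "lend", "land", "lined",
           "money", "many"]

VOWELS = set("AEIOU")


def skeleton(word):
    """Drop the vowels a, e, i, o, u and return the upper-case consonant skeleton."""
    return "".join(ch for ch in word.upper() if ch not in VOWELS)


def readings(skel, lexicon=LEXICON):
    """Return every lexicon word whose consonant skeleton equals skel."""
    return sorted(w for w in lexicon if skeleton(w) == skel)


sentence = "SOME FIRMS LEND MONEY"
skeletons = [skeleton(w) for w in sentence.split()]
print(" ".join(skeletons))

# Count how many word-by-word combinations a disambiguator would face.
total = 1
for s in skeletons:
    options = readings(s)
    total *= len(options)
    print(s, "->", options)
print("combinations before disambiguation:", total)
```

Even with this very small lexicon, the four skeletons admit dozens of word-by-word combinations, which is why the vowelizer has to rely on parsing and lexical-semantic knowledge rather than on dictionary lookup alone.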
Since parsing techniques developed for English proved inadequate for Arabic, both in function and in performance, a parsing system based on a multi-level grammar was developed and implemented. This system is capable of handling the previously mentioned intermixed set of ambiguities. The disambiguation mechanism works incrementally at every level of the grammar. Residual ambiguities are resolved heuristically, by resorting to preference principles operating at both the syntactic and the semantic levels.
The availability of this sophisticated system suggested the idea of using it as a generalized model for other languages. The system was successfully slimmed down to handle English. For example, the morphological processor was slimmed down to serve as an English stemmer (extracting the stem of inflected word forms), the shallow understander was slimmed down to an efficient English parser, and the lexical-semantic disambiguator was tailored to handle the types of ambiguity encountered in English.
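To give a rough idea of what stemming involves on the English side, here is a deliberately naive suffix-stripping sketch. It is not the slimmed-down morphological processor described above. The rule list SUFFIX_RULES, the function stem and the min_stem threshold are all assumptions made for illustration.

```python
# Assumption: a handful of ordered (suffix, replacement) rules; longer suffixes first.
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")]


def stem(word, min_stem=3):
    """Strip the first matching suffix, provided at least min_stem letters remain."""
    w = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[: len(w) - len(suffix)] + replacement
    return w


for form in ["firms", "lending", "lends", "companies", "money", "is"]:
    print(form, "->", stem(form))
```

Rules of this kind over-strip many words (for example, "frames" would lose its final "e"), which is one reason a full morphological analyzer is a stronger starting point than a list of suffixes.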
In general, the developed Arabic-English bi-directional system could be characterized as engineering-oriented. The author felt that its rather ad hoc approach had to be refined theoretically through more serious investigation in the field of contrastive linguistics. In this regard, the author has made intensive use of a basic property of Arabic, namely its non-exotic character, which places it in an intermediate position between the linguistic extremes found in other languages.
The paper is intended to present machine translation from the standpoint of linguistic divergence. First, multi-linguality will be reviewed contrastively as a preface to a more specific discussion focusing on translation. The paper concludes with a brief description of the contrastive aspects of the bi-directional Arabic-English translation system which is currently being developed.
II Multi-linguality: A Contrastive Perspective
Language in the present study is viewed as a system comprising two main components, grammar and lexicon. Each of the two components will be considered in turn.
According to Robert Freidin (1), the comparative work carried out by nineteenth-century grammarians was concerned with establishing an explanatory basis for the relationships between languages and groups of languages, primarily in terms of a common ancestor. Contemporary comparative grammar, in contrast, is significantly broader in scope. It is concerned with a theory of grammar that is postulated to be an innate component of the human mind/brain. In this way, the theory of grammar is a theory of human language and hence establishes the relationship among all languages, not just those that happen to be related by historical accident (for instance, via common ancestry).
One can safely say that the current advances in contrastive linguistics can be attributed to the adoption of the generative paradigm. This paradigm has been applied successfully across the different language subsystems, mainly morphology, syntax, and semantics. A general theory of generative morphology, both derivational and inflectional, has been developed (2). It has reached a level of maturity that allows it to be applied cross-linguistically. This has led to various attempts to develop a universal computational morphological system. These universal approaches deal basically with affixational morphology. Their performance with regard to diffusional (non-concatenative) morphology is still questionable. Furthermore, the majority of morphological research focuses on word form much more than on word semantics and information content. Aronoff has initiated a semantic-based morphology which views the derivation of words as a generative process from one word sense to another. This shift toward semantic-based morphology is essential for contrastive analysis at the word level in general, and for translation in particular.
At the syntactic level, the diversity of languages is accounted for by the "Government-Binding" (GB) theory developed by Chomsky. According to GB, a language is not a system of rules, but a system of specifications for parameters in an invariant system of principles of Universal Grammar (UG). Linguistic diversity can be explained as variation in the setting of certain values for a principle of UG. Hence, linguistic variation would, in part, be reduced to parametric variation for principles of UG. An example of such a parameter is the "null-subject" parameter, which differentiates between languages that require explicit subjects, such as French and English, and those that permit the omission of the subject, such as Arabic and Spanish. While the GB parameterization-based model is theoretically well suited, its general nature makes it too abstract to be applied in practical computational systems for analysis (parsing) and synthesis (generation).
As we move towards semantics, linguistic divergence fades out. Although languages usually exhibit broad disparity at the morphological and syntactic levels, this disparity is greatly diminished at the semantic level, at which various syntactic forms are converted to their corresponding logical forms. Formal logic is basically universal and thus can transcend linguistic boundaries. At this logical level, sameness of meaning is explicitly expressed; this, in turn, enables cross-linguistic mapping and transformation. Engineering-oriented approaches to a universal semantic processor that can work multi-lingually have been developed. They adopt a compositional paradigm which decomposes meaning using a universal set of semantic primitives. A well-known example of these processors is the conceptual-dependency model developed by Schank.
Regarding the lexicon, and according to James Pustejovsky (3), "Computational and theoretical linguistics have largely treated the lexicon as a static set of word senses tagged with features for syntactic, morphological and semantic information." This view has undermined the role of the lexicon within the overall language system. Indeed, the pendulum has now swung to the other extreme, that of viewing the whole language as residing within the lexicon itself. The notion that morphology, syntax and semantics are in the lexicon is gaining acceptance. Under this lexicon-based framework, the major role of these language subsystems is to express formally (by rules) the lexical redundancies and regularities encountered among the different lexical entries. Currently, the generative paradigm is being introduced in the field of the lexicon in an attempt to upgrade the art of lexicography to the level of an exact science, now termed "lexicology". MIT has launched a cross-linguistic lexicon project with the main objective of systematizing different aspects of lexical divergence.
A major challenge facing generative lexicology is that of metaphors. Since metaphors are strongly linked to culture, it is difficult to isolate the culture-dependent subparts from those that are independent. However, Lakoff and Johnson, in their seminal work Metaphors We Live By, shed light on how to tackle the metaphor problem cross-linguistically. By providing many examples of commonly used metaphors that exist in many different languages, they initiate a new approach to perceiving metaphors at a higher, universal level.
III Application to the Field of Machine Translation
With the above overview of multi-linguality, we now apply our contrastive analysis to the context of machine translation. This transition is easier to follow if we note that the changing attitude towards translation studies has always been determined by transformations in the theory of language. According to Marie-Therese Abdel-Messih (4), one could summarize these transformations in terms of the following milestones:
Plato’s theory of language: which assumes the existence of a fundamental prior utterance with a determinant literal meaning that can be transformed inter-lingually.
isomorphic grammar model (figure 2). Here, the former two will be discussed while the latter will be dealt with separately in the coming section.
Figure 2: Machine translation models
(a) The Inter-Lingua model: To appreciate the significance of the inter-lingua machine translation model, one may compare it with the much simpler transfer scheme used to translate uni-directionally from one language into another. Generally speaking, this scheme abides by the previously mentioned "sameness" principle, whereby the generated translation is defined by the grammar of the source language and by the transfer subsystem which converts it to the corresponding target-language representation. Each language pair requires its own transfer module. This is a demanding requirement, which makes the transfer scheme extremely impractical for multilingual, bi-directional automatic translators. The inter-lingua came as a solution to this problem. In this model, a grammar of the source language defines an analysis component which translates from the source language into an intermediate language known as the inter-lingua. A grammar of the target language defines a generation component which translates from the inter-lingua to the target language (6). Thus, instead of using a different transfer module for each language pair, the inter-lingua works as a multi-lingual transfer module. A major drawback of the inter-lingua model is that it requires specific analysis and synthesis components for each member language.
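The scaling argument behind this comparison can be made concrete with a short calculation. The sketch below assumes that a bi-directional transfer system needs one transfer module per ordered language pair, and that an inter-lingua system needs one analysis and one generation component per language. Both function names are hypothetical.

```python
def transfer_modules(n_languages):
    """Transfer modules needed when every ordered (source, target) pair has its own."""
    return n_languages * (n_languages - 1)


def interlingua_components(n_languages):
    """One analysis plus one generation component per language."""
    return 2 * n_languages


for n in (2, 3, 4, 10):
    print(n, "languages:",
          transfer_modules(n), "transfer modules vs",
          interlingua_components(n), "inter-lingua components")
```

Under these assumptions the transfer scheme grows quadratically with the number of languages while the inter-lingua scheme grows linearly, so the advantage of the inter-lingua appears only once more than three languages are involved.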
(b) The Parameterization Model: The parameterization model enhances the inter-lingua model by overcoming the major drawback just mentioned. Its approach to translation is similar to the traditional approach in that it uses a language-independent representation. In this approach, however, there is only one analysis component (a multi-lingual parser) and one synthesis component, both of which work multi-lingually. Language-specific information is factored out in terms of parameter settings, and the translation mapping operates uniformly across all languages. The UNITRAN system developed by Dorr is one implementation of this model. It uses parameterization for both syntactic and lexical distinctions (5). The parameter-setting approach is desirable from a number of different perspectives. First, it allows language-specific knowledge to be represented independently of the syntactic principles and the inter-lingual representation. Second, it accommodates the contrastive aspects of the languages involved. Third, it allows a machine-translation system to be easily modified and augmented. As previously mentioned, however, the parameterization-based machine translation model is still too abstract, and this renders its implementation a difficult task.
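The idea of factoring language-specific information out into parameter settings can be sketched as follows. This is not UNITRAN. It is a toy with a single generation routine and two assumed parameters (basic constituent order and the null-subject setting), and the words are abstract glosses rather than real vocabulary. The order values given for each language are simplifications, and all names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Params:
    order: str          # basic constituent order, e.g. "SVO" or "VSO"
    null_subject: bool  # may a pronominal subject be left unexpressed?


# Assumption: simplified settings, for illustration only.
SETTINGS = {
    "english": Params(order="SVO", null_subject=False),
    "spanish": Params(order="SVO", null_subject=True),
    "arabic": Params(order="VSO", null_subject=True),
}


def linearize(frame, language):
    """One generator for all languages; only the parameter settings differ."""
    p = SETTINGS[language]
    slots = {"S": frame.get("subject"), "V": frame["verb"], "O": frame.get("object")}
    if p.null_subject and frame.get("subject_is_pronoun", False):
        slots["S"] = None  # the subject is left unexpressed
    return [slots[role] for role in p.order if slots[role] is not None]


frame = {"subject": "HE", "subject_is_pronoun": True,
         "verb": "WRITE.PAST", "object": "LETTER"}
for lang in SETTINGS:
    print(lang, linearize(frame, lang))
```

Adding a language to such a scheme means adding one entry to the settings table rather than writing a new generator, which is the practical appeal of the approach. The difficulty noted above lies in finding a parameter set rich enough for real grammars.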
IV Isomorphic Grammar Approach
In developing the bi-directional Arabic-English machine translation system, the author has adopted the isomorphic grammar approach, whereby the grammars of the source and target languages are attuned to one another. This differs from the inter-lingua model, in which grammars for different languages can be developed independently. Isomorphism has been established through different lexical and grammatical development arrangements. These are:
Writing isomorphic grammars for different languages, particularly those that belong to different language families, as is the case with Arabic and English (Arabic belonging to the Semitic family and English to the Indo-European), is considered difficult. However, in the author's experience of writing grammars for both languages, the task has proved less difficult than expected. Isomorphism emerges naturally, provided that the style of grammar writing abides by a set of well-established principles. Needless to say, perfect isomorphism cannot be attained, owing to genuine differences between languages. Regarding the Arabic-English language pair, the major differences can be summarized as follows:
V Some Requirements for the Globalization of Machine Translation Technology