Presentation by Mr. Nabil ALI

MACHINE TRANSLATION: A CONTRASTIVE LINGUISTIC PERSPECTIVE

NABIL ALI

I Background

In 1983, the author was assigned the task of developing the first bilingual Arabic-English home computer. This task involved the development of a bilingual operating system (MSX-based), as well as the establishment of a software development unit dedicated to Arabic applications. Because the project was oriented towards home use, the emphasis was mainly on culture-ware and education-ware. This placed a higher priority on natural language processing and on solving the myriad problems associated with Arabic computation. At that time, Arabic was extremely underprivileged in the computation field, suffering from the limitations of a minimal system at a pure character level and from poor printing and display quality. Thus, it was necessary to shift to a more developed level, dealing with larger linguistic units, namely the word, the sentence, and continuous text. Since English was the most established example of language computation, we had to draw on its resources and techniques. We soon discovered that these techniques are not suitable for Arabic. This is simply because Arabic, compared to English, is much more complex at almost all linguistic levels (with phonology as the sole exception). At the character level, the complexity lies in the cursive shape and concatenation of Arabic letters and, above all, in their high degree of context sensitivity: the appropriate shape of each letter is determined by the surrounding letters. Hence, the author had to implement an automaton that generates the appropriate shape of each letter depending on its context. At the word level, the morphology of Arabic (the language subsystem dealing with the structure of words) is known to be among the most sophisticated of all languages.
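The context-sensitive letter shaping described in this paragraph can be illustrated with a minimal sketch. It is not the author's automaton; it assumes transliterated letter names, a partial (hypothetical) list of letters that do not connect to the following letter, and four shape labels. The single piece of state carried from letter to letter is whether the previous letter connects forward.

```python
# Letters that join only to the preceding letter (partial list, for illustration).
RIGHT_JOINING = {"alef", "dal", "thal", "reh", "zain", "waw"}

def shape_word(letters):
    """Assign a contextual shape to each letter of one word.

    `letters` is a list of transliterated letter names in reading order.
    """
    shapes = []
    prev_joins_forward = False  # automaton state: does the previous letter connect to this one?
    for i, letter in enumerate(letters):
        is_last = i == len(letters) - 1
        joins_forward = letter not in RIGHT_JOINING and not is_last
        if prev_joins_forward and joins_forward:
            shape = "medial"
        elif prev_joins_forward:
            shape = "final"
        elif joins_forward:
            shape = "initial"
        else:
            shape = "isolated"
        shapes.append((letter, shape))
        prev_joins_forward = joins_forward
    return shapes

# The word "kitab" (book): kaf, teh, alef, beh
print(shape_word(["kaf", "teh", "alef", "beh"]))
```

The same letter therefore surfaces in different shapes purely as a function of its neighbours, which is why shaping cannot be handled by a fixed character table.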
Given this morphological complexity, the author had to develop a morphological processor capable of analyzing any Arabic word into its morphological primitives, as well as synthesizing the final form of words out of these primitives. Lastly came the syntactic level, which proved to be the most difficult, primarily because Arabic is usually written without vowels. In essence, written Arabic is a quasi-stenographic script, and this results in a severe mélange of ambiguities that are largely absent from other languages. The morphological ambiguity, due to the absence of vowels, is intermixed with other types of ambiguity, mainly those associated with word sense, part of speech, and syntactic structure. For a non-Arabic speaker to appreciate such a problem, let us assume hypothetically that an English sentence such as "SOME FIRMS LEND MONEY" is written in the Arabic fashion. The result will be the following string of consonants: "SM FRMS LND MNY". Each of these consonantal forms may have a set of alternative vowelized interpretations (Figure 1).
 
 

Figure 1: Morphological ambiguity due to dropped vowels
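The "SM FRMS LND MNY" thought experiment can be reproduced with a short sketch. The mini-lexicon below is hypothetical and chosen only to make the collisions visible; it is not the content of Figure 1. Each word is reduced to its consonantal skeleton, and words sharing a skeleton become competing readings of the same written form.

```python
from collections import defaultdict
from math import prod

VOWELS = set("AEIOU")

def skeleton(word):
    """Drop the vowels of a word, keeping only its consonants."""
    return "".join(ch for ch in word.upper() if ch not in VOWELS)

# Hypothetical mini-lexicon, for illustration only.
LEXICON = ["SOME", "SAME", "SEAM",
           "FIRMS", "FARMS", "FORMS", "FRAMES",
           "LEND", "LAND", "LINED",
           "MONEY", "MANY"]

def build_index(words):
    """Map each consonantal skeleton to every word that collapses onto it."""
    index = defaultdict(list)
    for word in words:
        index[skeleton(word)].append(word)
    return index

def candidates(text, index):
    """Return the list of possible vowelized readings for each written token."""
    return [index.get(token, []) for token in text.split()]

index = build_index(LEXICON)
readings = candidates("SM FRMS LND MNY", index)
combinations = prod(len(r) for r in readings)

for token, options in zip("SM FRMS LND MNY".split(), readings):
    print(token, "->", options)
print("sentence-level readings to disambiguate:", combinations)
```

Even with this tiny lexicon the number of sentence-level readings multiplies quickly, which is why disambiguation has to draw on syntactic and semantic knowledge rather than on the word forms alone.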

Thus, any syntactic processor taking Arabic text as its input must first disambiguate such quasi-stenographic script. As a result, an automatic vowelizer became a prerequisite for Arabic computation. To solve this problem, the author developed a "shallow" automatic understander of Arabic in order to disambiguate the unvowelized text and to restore the missing vowels. This required three main computational-linguistic tasks: (1) the development of an Arabic parser; (2) the development of a lexical-semantic processor; (3) the development of an automatic generator of the vowelized text.

Since parsing techniques developed for English proved inadequate for Arabic, both in function and in performance, a parsing system based on a multi-level grammar was developed and implemented. This system is capable of handling the previously mentioned intermixed set of ambiguities. The disambiguation mechanism works incrementally at every level of the grammar. Residual ambiguities are resolved heuristically, resorting to preferential principles working on both the syntactic and semantic levels.

The availability of this sophisticated system suggested using it as a generalized model for other languages. The system was successfully slimmed down to handle English. For example, the morphological processor was slimmed down to serve as an English stemmer (to extract the stem of inflected word forms), the shallow understander was slimmed down to an efficient English parser, and the lexical-semantic disambiguator was tailored to handle the types of ambiguity encountered in English.

In general, the developed Arabic-English bi-directional system could be characterized as engineering-oriented. The author felt that its rather ad hoc approach had to be refined theoretically through more serious investigation in the field of contrastive linguistics. In this regard, the author has intensively exploited a basic property of Arabic, namely its non-exotic character, which places it in an intermediate position between the linguistic extremes found in other languages.

This paper presents machine translation from the perspective of linguistic divergence. First, multi-linguality is overviewed contrastively as a preface to a more specific discussion focusing on translation. The paper concludes with a brief description of the contrastive aspects of the bi-directional Arabic-English translation system that is currently being developed.

II Multi-linguality: A Contrastive Perspective

Language in the present study is viewed as a system composed of two main components: grammar and lexicon. The two components are considered in turn.

According to Robert Freidin (1), the comparative work carried out by nineteenth-century grammarians was concerned with establishing an explanatory basis for the relationships between languages and groups of languages, primarily in terms of a common ancestor. Contemporary comparative grammar, in contrast, is significantly broader in scope. It is concerned with a theory of grammar that is postulated to be an innate component of the human mind/brain. In this way, the theory of grammar is a theory of human language and hence establishes the relationship among all languages, not just those that happen to be related by historical accident (for instance, via common ancestry).

One can safely say that the current advances in contrastive linguistics are attributable to the adoption of the generative paradigm. This paradigm has been applied successfully across the different language subsystems, mainly morphology, syntax, and semantics. A general theory of generative morphology, both derivational and inflectional, has been developed (2). It has reached a level of maturity that allows it to be applied cross-linguistically. This has led to various attempts to develop a universal computational morphological system. These universal approaches deal basically with affixational morphology. Their performance with regard to diffusional (non-concatenative) morphology is still questionable. Furthermore, the majority of morphological research focuses on the word form much more than on the word's semantics and information content. Aronoff has initiated a semantic-based morphology which views the derivation of words as a generative process from one word sense to another. This shift toward semantic-based morphology is essential for contrastive analysis at the word level in general, and for translation in particular.

At the syntactic level, the diversity of languages is accounted for by the "Government and Binding" (GB) theory developed by Chomsky. According to GB, a language is not a system of rules, but a system of specifications for parameters in an invariant system of principles of Universal Grammar (UG). Linguistic diversity can be explained as variation in the setting of certain values for a principle of UG. Hence, linguistic variation would, in part, be reduced to parametric variation for principles of UG. An example of such a parameter is the "null-subject" parameter, which differentiates between languages that require explicit subjects, such as French and English, and those that permit the omission of the subject, such as Arabic and Spanish. While the GB parameterization-based model is theoretically well suited to this purpose, its general nature makes it too abstract to be applied directly in practical computational systems for analysis (parsing) and synthesis (generation).

As we move towards semantics, linguistic divergence fades out. Although languages usually exhibit broad disparity at the morphological and syntactic levels, such disparity is greatly diminished at the semantic level, at which various syntactic forms are converted to their corresponding logical forms. Formal logic is basically universal and thus can transcend linguistic boundaries. At this logical level, the sameness of meaning is explicitly expressed; this in turn enables cross-linguistic mapping and transformation. Engineering-oriented approaches have been taken to developing universal semantic processors that work multi-lingually. They adopt a compositional paradigm which decomposes meaning using a universal set of semantic primitives. A well-known example of these processors is the conceptual-dependency model developed by Schank.

Regarding the lexicon, according to James Pustejovsky (3), "Computational and theoretical linguistics have largely treated the lexicon as a static set of word senses, tagged with features for syntactic, morphological, and semantic information." This view has undermined the role of the lexicon within the overall language system. Indeed, the pendulum has now swung to the other extreme of viewing the whole language as residing within the lexicon itself. The notion that morphology, syntax, and semantics are in the lexicon is gaining acceptance. Under this lexicon-based framework, the major role of these language subsystems is to express formally (through rules) the lexical redundancies and regularities encountered among the different lexical entries. Currently, the generative paradigm is being introduced in the field of the lexicon in an attempt to raise the art of lexicography to the level of an exact science, now termed "lexicology". MIT has launched a cross-linguistic lexicon project with the main objective of systematizing different aspects of lexical divergence.

A major challenge facing generative lexicology is that of metaphor. Since metaphors are strongly linked to culture, it is difficult to isolate the culture-dependent subparts from those that are culture-independent. However, Lakoff and Johnson, in their seminal work, Metaphors We Live By, shed light on how to tackle the metaphor problem cross-linguistically. By providing many examples of commonly used metaphors that exist in many different languages, they initiate a new approach that treats metaphors at a higher, universal level.

III Application to the Field of Machine Translation

With the above overview of multi-linguality, we now place our contrastive analysis in the context of machine translation. This transition can be better understood if we note that changing attitudes towards translation studies have always been determined by transformations in the theory of language. According to Marie-Therese Abdel-Messih (4), one can summarize these transformations in terms of the following milestones:

      1. Plato's theory of language, which assumes the existence of a fundamental prior utterance with a determinate literal meaning that can be transformed inter-lingually.
      2. Defining translation in terms of sameness: a discourse that privileged the source language, thus making the corresponding target language a by-product of the original.
      3. The rejection of the sameness approach: a trend which gave way to plural interpretations that destabilized the concept of fidelity. Languages, according to Benjamin, are not strangers to one another. Rather, they are interrelated in what they want to express. However, this kinship of languages does not refer to superficial similarities. Instead, it manifests itself in deep-seated properties which can only be brought to the surface through in-depth contrastive investigation.
      4. The rejection of the concept of a universal language that could be reached through accurate translation. Jacques Derrida re-interpreted the art of reading in terms of the growth of language, whereby translation supplements one language with what the other lacks.

In the field of machine translation, "the kinship of languages" has different realizations. We can distinguish three kinds of multi-lingual systems that can translate from one set of languages to another. These are the inter-lingua translation model, the parameterization model, and the isomorphic grammar model (Figure 2). The first two are discussed here, while the third is dealt with separately in the next section.

Figure 2: Machine translation models

(a) The Inter-Lingua Model: To recognize the significance of the inter-lingua machine translation model, one may refer to the much simpler transfer scheme used to translate mono-directionally from one language to another. Generally speaking, this scheme abides by the previously mentioned "sameness" principle, whereby the generated translation is defined by the grammar of the source language and by the transfer subsystem which converts it to the corresponding target language representation. For each language pair, there is a specific transfer module. This is a demanding requirement, which makes the transfer scheme extremely impractical for multi-lingual, bi-directional automatic translators. The inter-lingua came as a solution to this problem. In this model, a grammar of the source language defines an analysis component which translates from the source language into an intermediate language known as the inter-lingua. A grammar of the target language defines a generation component which translates from the inter-lingua to the target language (6). In this way, instead of using a different transfer module for each language pair, the inter-lingua works as a multi-lingual transfer module. A major drawback of the inter-lingua model is that it requires specific analysis and synthesis components for each member language.
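The scaling argument behind the inter-lingua model can be made concrete with a small calculation. The sketch below assumes one transfer module per ordered (source, target) pair in the transfer scheme, and one analysis plus one generation component per language in the inter-lingua scheme; the function names are illustrative only.

```python
def transfer_modules(n_languages):
    """Transfer modules needed to cover every ordered language pair."""
    return n_languages * (n_languages - 1)

def interlingua_components(n_languages):
    """One analysis and one generation component per language."""
    return 2 * n_languages

for n in (2, 3, 5, 10):
    print(f"{n} languages: {transfer_modules(n)} transfer modules "
          f"vs {interlingua_components(n)} inter-lingua components")
```

Under these assumptions the transfer scheme grows quadratically with the number of languages while the inter-lingua scheme grows linearly, although each language still needs its own analysis and synthesis components.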

(b) The Parameterization Model: The parameterization model enhances the inter-lingua model by overcoming the major drawback just mentioned. Its approach to translation is similar to the traditional approach in that it uses a language-independent representation. In this approach, however, there is only one analysis component (a multi-lingual parser) and one synthesis component, both of which work multi-lingually. Language-specific information is factored out in terms of parameter settings, and the translation mapping operates uniformly across all languages. The UNITRAN system developed by Dorr is one implementation of this model. It uses parameterization for both syntactic and lexical distinctions (5). The parameter-setting approach is desirable from a number of different perspectives. First, it allows language-specific knowledge to be represented independently of the syntactic principles and the inter-lingual representation. Second, it accommodates the contrastive aspects. Third, it allows a machine translation system to be easily modified and augmented. As previously mentioned, however, the parameterization-based model is still too abstract, which makes its implementation a difficult task.
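The idea of factoring language-specific information into parameter settings can be sketched as follows. This is a toy illustration, not UNITRAN's design: the parameter names, the `Clause` structure, and the English glosses standing in for words are all assumptions. A single generator linearizes one language-independent clause, and only the parameter values differ between the two settings.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

NounPhrase = Tuple[str, Optional[str]]  # (noun, optional adjective)

@dataclass(frozen=True)
class Params:
    word_order: str        # order of the slots S, V, O, e.g. "SVO" or "VSO"
    null_subject: bool     # may a pronominal subject be omitted?
    adj_before_noun: bool  # does the adjective precede the noun it modifies?

@dataclass(frozen=True)
class Clause:
    verb: str
    subject: NounPhrase
    obj: NounPhrase
    pronoun_subject: bool = False

# Hypothetical settings, loosely following the contrasts named in the text.
ENGLISH_LIKE = Params(word_order="SVO", null_subject=False, adj_before_noun=True)
ARABIC_LIKE = Params(word_order="VSO", null_subject=True, adj_before_noun=False)

def noun_phrase(noun, adjective, p):
    """Order a noun and its optional adjective according to the parameters."""
    if adjective is None:
        return [noun]
    return [adjective, noun] if p.adj_before_noun else [noun, adjective]

def linearize(clause, p):
    """One generator for all languages; only the parameter values differ."""
    drop_subject = p.null_subject and clause.pronoun_subject
    slots = {
        "S": [] if drop_subject else noun_phrase(*clause.subject, p),
        "V": [clause.verb],
        "O": noun_phrase(*clause.obj, p),
    }
    return [token for slot in p.word_order for token in slots[slot]]

full = Clause("READ", ("STUDENT", "NEW"), ("BOOK", "OLD"))
pronominal = Clause("READ", ("HE", None), ("BOOK", None), pronoun_subject=True)

for params in (ENGLISH_LIKE, ARABIC_LIKE):
    print(linearize(full, params), linearize(pronominal, params))
```

Adding a language in this scheme means adding a new `Params` value rather than a new analysis or synthesis component, which is the property that makes such a system easy to modify and augment.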

IV Isomorphic Grammar Approach

In developing his bi-directional Arabic-English machine translation system, the author has adopted the isomorphic grammar approach, whereby the grammars of the source and target languages are attuned to one another. This differs from the inter-lingua models, in which grammars for different languages can be developed independently. Isomorphism has been established through different lexical and grammatical development arrangements. These are:

      1. The same grammar model is used for the language pair. In our system, the Generalized Phrase-Structure Grammar (GPSG) developed by Gazdar has been chosen.
      2. Common guidelines for grammar writing with regard to categorization and subcategorization, as well as the sequence of syntactic constructions and derivations.
      3. The same lexicalization principles, which guarantee the compatibility of the detailed parts of speech between the source language and the target language.
      4. An identical verbal argument scheme (semantic valence patterns).
      5. The same feature system to support the compositional semantic analysis of meaning and the specification of morpho-syntactic properties.
      6. Same notational system for both grammar and lexicon.

The bi-directionality of the Arabic-English translation system has rendered its syntactic and lexical transfer components much more sophisticated than those usually found in mono-directional machine translation systems. The main reason is that both members of the language pair can act as the source language. Hence, the translation system does not have the luxury of subsetting the linguistic coverage, either lexically or syntactically.

Writing isomorphic grammars for different languages, particularly those that belong to different language families, as Arabic and English do (Arabic belonging to the Semitic family and English to the Indo-European), is considered difficult. However, in the author's experience of writing grammars for both languages, the task has proved less difficult than expected. Isomorphism emerges naturally, provided that the style of grammar writing abides by a set of well-established principles. Needless to say, perfect isomorphism cannot be attained, owing to genuine differences between languages. For the Arabic-English language pair, the major differences can be summarized as follows:

      1. Word formation: English word formation is mainly affixational in nature. Arabic combines both affixational and diffusional methods of word formation.
      2. Difference in basic word order: while English is categorized as SVO (S: subject, V: verb, O: object), Arabic is basically VSO. Arabic also allows nominal sentence patterns that begin with the subject.
      3. Null-subject parameter: English requires an explicit subject, whereas Arabic allows the subject to be dropped.
      4. Syntactic flexibility: English syntax follows strict word order. Arabic syntax is more flexible.
      5. Use of pronouns: Unlike English, Arabic does not allow stranded prepositions; resumptive pronouns are used instead to construct a full prepositional phrase.
      6. Distinction due to pre/post adjectival modification: in English, the adjective precedes the noun it modifies, while in Arabic the adjective follows it.
      7. Distinction due to pre/post genitive construction: English uses the post-genitive construction (X's Y), while in Arabic the head of the genitive precedes its complement (X Y).
      8. Dropping of the modified noun: Arabic tends to drop the modified noun in the case of abstract or rational nouns. For example, "the wise" and "the clear" imply "the wise man" and "the clear object". While this elliptical construction is extremely productive in Arabic, it is highly restricted in English and is used in limited cases such as "the rich", "the blind", or "the disabled".
      9. Relativization: While English relativizes both definite and indefinite nouns, Arabic restricts relativization to definite nouns only.
      10. Punctuation: Punctuation in English follows strict rules. Arabic punctuation, on the other hand, is much more flexible, and its usage is rather discretionary.

In brief, the isomorphic Arabic-English translator is data-intensive, and the transfer model has been reduced to a direct mapper between the grammars of the source and target languages. Lack of isomorphism due to differences in the style of grammar writing can be resolved via explosion and implosion of syntactic categories on both sides of the language pair. This is achieved via what is known technically as recursive ascent or descent along the grammar rules. In this isomorphic model, the same grammar is used for both analysis (parsing) and synthesis (generation). Constraints on applying the different grammatical rules are interpreted differently by the parser and the generator. For instance, while verb-subject agreement is treated during parsing as a condition for the well-formedness of the input sentence, the same agreement is used during generation to produce the inflectional features, or the appropriate forms, of the verb and subject.
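The dual reading of a single agreement constraint can be illustrated with a minimal sketch. The feature names, the dictionary representation, and the function names are assumptions made for illustration, not the author's notation. One declarative constraint is stated once; the parser uses it as a test, and the generator uses it to fill in missing features.

```python
# Hypothetical agreement features; a fuller grammar might add gender, for example.
AGREEMENT_FEATURES = ("person", "number")

def agrees(subject, verb):
    """The constraint itself: subject and verb share all agreement features."""
    return all(subject[f] == verb[f] for f in AGREEMENT_FEATURES)

def parse_check(subject, verb):
    """Parser reading: agreement is a well-formedness condition on the input."""
    return agrees(subject, verb)

def generate_verb(subject, lemma):
    """Generator reading: agreement supplies the verb's inflectional features."""
    verb = {"lemma": lemma}
    verb.update({f: subject[f] for f in AGREEMENT_FEATURES})
    return verb

subject = {"lemma": "firm", "person": 3, "number": "pl"}
bad_verb = {"lemma": "lend", "person": 3, "number": "sg"}

print(parse_check(subject, bad_verb))   # the parser rejects the mismatch
good_verb = generate_verb(subject, "lend")
print(good_verb, parse_check(subject, good_verb))
```

Because the generator derives the verb's features from the same constraint the parser checks, anything it produces is accepted by the parser by construction, which is one practical benefit of sharing a single grammar across both directions.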

V Some Requirements for the Globalization of Machine Translation Technology

To support the globalization of machine translation technology, the author recommends the following:
      1. The development of a universal meta-language for both grammar and lexicon.
      2. The development of automatic tools for conversion between different grammar models.
      3. The standardization of lexical organization and content.
      4. The development of translation-oriented multi-lingual textual corpora.
      5. Greater attention to textual contrastive linguistics.
      6. The encouragement of research in the field of contrastive computational linguistics.

References
      1. Freidin, Robert. 1992. Principles and Parameters in Comparative Grammar. Cambridge: MIT Press, pp. 1-6.
      2. Aronoff, Mark. 1994. Morphology by Itself: Stems and Inflectional Classes. Cambridge: MIT Press.
      3. Pustejovsky, James. 1995. The Generative Lexicon. Cambridge: MIT Press, pp. 105-131.
      4. Abdel-Messih, Marie Therese. 1999. "Translation as a Cross-Cultural Discourse." Proceedings: The Fifth International Symposium on Comparative Literature. Cairo: Faculty of Arts Press, pp. 283-313.
      5. Dorr, Bonnie Jean. 1993. Machine Translation: A View from the Lexicon. Cambridge: MIT Press, pp. 2-22.
      6. Landsbergen, Jan. 1987. "Isomorphic Grammars and Their Use in the Rosetta Translation System." In Machine Translation Today, ed. Margaret King. Edinburgh: Edinburgh University Press, pp. 351-372.