Monday, 20 February 2017

MARY'S ROOM; AND/OR LEXICAL MISMATCH

According to John Locke (1690, adapted to make the word "child" gender-neutral): “If a child were kept in a place where they never saw any other (colour) but black and white till they came of age, they would have no more ideas of scarlet or green than those who from their childhood never tasted an oyster or a pineapple have of those particular relishes”.
Now let's call the child Mary, teach her the basics of colour on black and white paper, and do the Mary's Room thought experiment:
Mary is a brilliant scientist who is, for whatever reason, forced to investigate the world from a black and white room via a black and white television monitor since her infancy. She specializes in the neurophysiology of vision and acquires, let us suppose, all the physical information there is to obtain about what goes on when we see ripe tomatoes, or a cloudless day sky, and use terms like 'red', 'blue', and so on. She discovers, for example, just which wavelength combinations from the sky stimulate the retina, and exactly how this produces via the central nervous system the contraction of the vocal cords and expulsion of air from the lungs that results in the uttering of the sentence 'The day sky is blue'. [···] What will happen when Mary is released from her black and white room or is given a colour television monitor? Will she learn anything or not?
According to those who devised the Mary's Room experiment, she would learn something new.
  1. Mary (before her release) knows everything physical there is to know about other people/colour.
  2. Mary (before her release) does not know everything there is to know about other people/colour (because she learns something about them on her release).
  3. Therefore, there are truths about other people or colour (and herself) which escape the physicalist story.


We cannot review here all the works which have dealt with this issue; the list is impressive. Very briefly, one can distinguish a divergence (roughly speaking, same meaning but different syntactic structure) from a mismatch (roughly speaking, the grammar and the lexicon of the Source Language (SL) do not make some distinctions which are required by the grammar and the lexicon of the Target Language (TL)) by stating that the former shows a difference in construction (such that he swam across the river translates into French as il a traversé la rivière à la nage), whereas the latter shows a difference in meanings which are equivalent but not identical from one language to another (such that fish translates into Spanish as pez and pescado, the former being a living fish whereas the latter is the one you eat). More attention has been paid to divergences than to mismatches, mainly for two reasons:
1 divergences have been used to provide arguments in favour of or against transfer-based and interlingua-based approaches,
2 divergences, being a syntactic phenomenon, can be detected and resolved more easily than mismatches which involve a semantic treatment, as there is, in this case, hardly any syntactic trigger.

The case of mismatches is even more problematic, as there is need not only for contextual knowledge but also for extra-linguistic knowledge, as discussed in [Kameyama et al., 1991]. We present below the semantic distinctions emphasised by [Heid, 1993]:
2.1 the TL word exhibits more semantic distinctions or finer-grained distinctions than the SL one, such that fish is lexicalised in Spanish by pez and pescado,
2.2 the TL word exhibits fewer semantic distinctions or coarser-grained distinctions than the SL one, such that the Spanish nouns pez and pescado are both lexicalised in English as fish,
2.3 the TL and SL words do not carry the same semantic distinctions; for instance, the Spanish verb madrugar is lexicalised in English by get up early.
We would like to add to the above list:
2.4 the TL and SL share the same semantic features but differ in the stylistic or pragmatic usage of their lexicalisations;
2.5 the two conceptual worlds between the languages differ; in other words, when we have a conceptual mismatch.
(For instance, for insurance policies one should not make the same inferences based on driving in left hand-side and right hand-side countries, unless the conceptual worlds have been rendered “equivalent”. For instance, the French text extracted from the French UAP corpus: l'adversaire qui prenait son virage complètement à gauche m'a heurté et maintenant il profite de ce que j'avais bu pour me donner tous les torts. Honnêtement est-ce qu'il vaut mieux être saoul à droite ou chauffard à gauche? translates into English as the adversary who took his turn completely on the left [lane] is the one who drove into me, and now he takes advantage of the fact that I had been drinking to make me responsible for all casualties. Honestly, which is better, a drunkard on the right or a roadhog on the left? Having a Natural Language Processing (NLP) system make the same inferences for the two conceptual worlds could lead to wrong inferences in resolving further coreferences.)
2.6 there is a lexical conceptual gap between the TL and the SL; SL has a lexeme whose meaning is absent in the TL.
We call all the above distinctions “language gaps”. Our interest in resolving language gaps (i.e. when there is not a one-to-one mapping between languages, whatever the linguistic level: lexical, semantic, syntactic, etc.) using a knowledge-based approach along with planning techniques comes from noticing that all earlier work ([Lindop & Tsujii, 1993], [Dorr, 1995], [Heid, 1993], [Kameyama et al., 1991], [Levin & Nirenburg, 1993], [Palmer & Wu, 1995], ...), whatever the approach or paradigm adopted, seems to fail to solve language gaps completely (i.e., to both recognise and generate them). More generally, if we want to account for all types of “language gaps”, we suggest distinguishing between four major types of “language gap”, corresponding to their level of treatment:

conceptual: when the conceptual worlds representing different realities can be made “equivalent”
pragmatical: when the languages have different conventional ways of expressing a meaning
semantic: when the language units share some semantics, most of it overlapping; or hardly share any semantics
lexical: when the languages share semantics but differ in lexicalisation.

We consider the four kinds of gaps as listed in (Figure 1) from a processing viewpoint, specifically, as three sub-problems of the “language gaps” theory for lexical selection in generation: synonymy, hypernymy (including hyponymy), relevancy.
2. Hypernymy
Figure 3 shows a case of hypernymy, that is, when the TL does not make distinctions required by the SL. This case is not difficult in the sense that the TL is not ambiguous with respect to itself, but only from the SL perspective. The fish example shows that in English it does not matter whether we are talking about food or animal: the word fish “conflates” both interpretations in a single word. One might talk about “vagueness” in this case. From the processing viewpoint, in a knowledge-based approach, selecting the appropriate translation candidate for the Spanish pez or pescado is equivalent to searching for the least common hypernym of the semantics of the Spanish lexical items.
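The search for a least common hypernym can be sketched in a few lines of Python. This is only an illustration, not the system described in the excerpt: the toy taxonomy, the concept names LIVE_FISH, FISH_AS_FOOD and ENTITY, and the helper functions are all assumptions made up for the example.

```python
# Hypothetical toy taxonomy (an assumption for illustration): concept -> hypernym.
HYPERNYM = {
    "LIVE_FISH": "FISH",
    "FISH_AS_FOOD": "FISH",
    "FISH": "ENTITY",
}

# Hypothetical lexicons mapping words to concepts and concepts to words.
SPANISH = {"pez": "LIVE_FISH", "pescado": "FISH_AS_FOOD"}
ENGLISH = {"FISH": "fish"}


def hypernym_chain(concept):
    """Return the concept followed by its hypernyms, most specific first."""
    chain = [concept]
    while concept in HYPERNYM:
        concept = HYPERNYM[concept]
        chain.append(concept)
    return chain


def least_common_hypernym(a, b):
    """Return the most specific concept that subsumes both a and b, or None."""
    candidates = set(hypernym_chain(b))
    for concept in hypernym_chain(a):
        if concept in candidates:
            return concept
    return None


def english_for(spanish_words):
    """Find the English word for the concept shared by all the Spanish words."""
    concepts = [SPANISH[word] for word in spanish_words]
    shared = concepts[0]
    for concept in concepts[1:]:
        shared = least_common_hypernym(shared, concept)
    return ENGLISH.get(shared)


print(english_for(["pez", "pescado"]))  # prints "fish"
```

Going from Spanish to English only requires climbing the taxonomy, which is why this case counts as the easy one: no context is consulted at all.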

3. Relevancy
Figure 4 shows the most challenging case of semantic gap. This type of gap does not directly support a translation between SL and TL, but only some approximate translation that we call relevancy. By relevancy, we mean to focus on the most relevant information from the SL text to be carried across to TL to best match the most equivalently relevant information in TL. From a processing viewpoint, this case involves taking into account static and dynamic resources: conceptual world model, “script-like” information, and an engine to draw inferences on the static resources in context. Although we cannot detail the process in this paper, we will illustrate it through an example.
The relevancy process determines for a particular word or phrase in SL (sl11) the set of possible candidates, whether lexicalised or not: words and phrasals (tl21, ..., tl2n), as well as semantic representations (semk). This set will be added to the set of candidates, input to the lexical selection process. The hyper and hypo in Figure 4 stand for hypernymy and hyponymy respectively. The most difficult case of relevancy arises when SL has a lexical item or expression whose meaning is not found in TL. There, the SL lexeme(s) must be given a definiens, trying to find the best words in TL to express it; this process might involve using hypernymy and hyponymy treatments and will require an inference engine.
Hyponymy can be understood as a sub-type of the relevancy type: further specifying the meaning of an SL word (sl11) to best “match” the meanings of the words from TL (tl21, tl22) requires contextual processing, but not necessarily extralinguistic knowledge. (In this sense, the hyponymy treatment includes Nirenburg’s notion of saliency, which holds at the lexical level only. By saliency the author meant lexicalising the most semantic information of the input in as few TL lexemes as possible. For instance, for madrugar → get up early, we would rightly match the pair instead of generating for madrugar, say, get up in the morning before 6am.) For instance, assuming the semantics for fish, pez, pescado given below, going from English to Spanish might require more or less contextual reasoning to match the SL text:
fish (X) sem: FISH (X)
pez (X) sem: FISH (X), LOCATION (WATER)
pescado (X) sem: FISH (X), EDIBLE (X)
In the presence of LOCATION (WATER) in the context of FISH, the language matcher will try to match as much of the semantics as possible in TL, selecting the Spanish pez as in the example I saw many fish in Lake Powell. However, more contextual processing might be involved for the language matcher to find the best solution, in particular in the case of non-literal language such as in I liked the fish I had at noon, what was it?, where the event ellipsis EAT first has to be reconstructed ([Viegas & Nirenburg, 1995]), to find that in this context FISH, as a potential theme of EAT, is of type EDIBLE and therefore pescado will be selected. EAT illustrates a case of semk in Figure 4.
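The matching step described above can be sketched as a feature-overlap score. This is a hedged illustration only: the set-based encoding of the semantics, the function names best_match and add_event_constraints, and the rule that the theme of EAT is EDIBLE are assumptions made for the example, not the matcher of the cited work.

```python
# The semantics given above, encoded as feature sets (an assumed encoding).
LEXICON_ES = {
    "pez": {"FISH", "LOCATION:WATER"},
    "pescado": {"FISH", "EDIBLE"},
}


def add_event_constraints(features, events):
    """Add features implied by reconstructed events.

    Assumption for illustration: the theme of an EAT event is EDIBLE.
    """
    if "EAT" in events:
        return features | {"EDIBLE"}
    return set(features)


def best_match(context_features, lexicon=LEXICON_ES):
    """Return the TL words sharing the most features with the context.

    More than one word is returned when the context leaves the choice open.
    """
    scores = {word: len(sem & context_features) for word, sem in lexicon.items()}
    top = max(scores.values())
    return sorted(word for word, score in scores.items() if score == top)


# "I saw many fish in Lake Powell": the lake supplies LOCATION (WATER).
print(best_match({"FISH", "LOCATION:WATER"}))  # ['pez']

# "I liked the fish I had at noon": the elided EAT event is reconstructed first.
context = add_event_constraints({"FISH"}, events={"EAT"})
print(best_match(context))  # ['pescado']

# "I bought two fish": nothing in the context decides between the two.
print(best_match({"FISH"}))  # ['pescado', 'pez']
```

The last call shows why a bare FISH stays underspecified: both candidates score the same, so further context is needed before lexical selection can commit.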

Some confusion with respect to semantic gaps seems to come from a widely held belief that an SL which has fewer lexical units corresponding to a greater number of lexical units in the TL is ambiguous from a monolingual perspective, such as in the examples:
fish → pez/pescado (Spanish)
se trouver (French) → stand/lie
The word fish (ditto fisk, Fisch, poisson) becomes ambiguous only with respect to Spanish, se trouver (French) with respect to English. (There is no consensus on what underspecification is (see [Van Deemter & Peters (eds.), 1996] for different approaches). In this paper, we will consider a lexeme as semantically underspecified when its meaning can be further specified for a particular truth value in context. For instance, fish is underspecified with respect to its ANIMAL or FOOD meanings in I bought two fish. It becomes specified in I bought two fish to put them in the aquarium, and in I bought two fish to fry them with the chips.)

Oh... and, like, Spanish has "pez" for /fish as animal/ and "pescado" for /fish as food/. And the Scandinavian languages lack an exact word for the noun "mind" (translating the corresponding word, depending on the context, as "förstånd" --reason--, "minne" --memory--, "tanke" --thought--, "själ" --soul--, "hjärna" --brain--, "sinne" --sense--, or "psyke"): I mean, they lack an exact word for /mind/ yet can tell between several different kinds of /snow/ (nysnö --new snow--, kornsnö --granulated snow--, snömos --creamy dirty slush formed on the streets--, kramsnö --malleable snow, ideal for building snowmen, igloos, et al.--, and so on)... A mind is a terrible thing to translate into Swedish, for instance. Seeing these cases through a Whorfian lens...

[Kameyama et al., 1991] Kameyama, M., R. Ochitani and S. Peters. 1991. Resolving Translation Mismatches With Information Flow. In Proceedings of the Association for Computational Linguistics, 1991, pp. 193-200.
[Heid, 1993] Heid, U. 1993. Le lexique: quelques problèmes de description et de représentation lexicale pour la traduction automatique. In P. Bouillon and A. Clas (eds), pp. 169-196.

To close with yet another thought experiment: would you rather live in a country in whose language there's a long paraphrase for /death penalty/capital punishment/ and a single easy word for /holiday/ (as in most European languages), or in one whose language has a single word for /death penalty/capital punishment/ and a paraphrase for /holiday/? I have already made my choice, with my own cultures (Mediterranean and Scandinavian) in mind: the first of the two countries proposed.
