From the back cover:
The majority of natural language processing (NLP) is English language processing, and while there is good technology support for (standard varieties of) English, support for Albanian, Burmese, or Cebuano--and most other languages--remains limited. Being able to bridge this digital divide is important for scientific and democratic reasons but also represents an enormous growth potential.
The ubiquity of communications is a fantastic blessing but presents humanity with a very significant cultural-heritage challenge. Many natural languages are disappearing or are in danger of disappearing, and this is a tremendous cultural loss for all of us. As a “representative” of a small group of people with few speakers of their language, and even fewer who understand it, I feel strongly about this issue; beyond democracy, this issue may be critical for humanity’s future.
This book is a well-done (albeit humble) leap forward in the struggle to maintain an important part of our collective humanity. However, the challenges are still very large. The book describes a computational technique to aid in NLP: word embedding. This is an important technique, but I fear that the authors, despite their obvious good intentions, are missing some of the obstacles, some of which may actually be reinforced by this technique unless practitioners are very careful. I shall try to illustrate this with an example.
Some years ago, while researching a paper (which later grew into a book), I critically examined 16 translations of The Song of Songs, which is Solomon’s. That, by the way, is itself a “cute” example because the book is misnamed: it is not a “song”; it is a poem, and setting it to music is actually prohibited in Judaism. Its name in English would be better rendered as “poem of poems.”
Each of the 16 English renditions could be considered “correct” on a word-pair basis, but each of them eventually arrived at a “storyboard” completely different from the original. In a follow-up study, I asked ten bilingual people to examine the translations of the book in their language. The languages were Dutch, Japanese, German, French, Arabic, Persian, Afrikaans, an African language that I regret to say I forget the name of, and Moroccan (Berber). In all cases, their conclusions were identical to my own experience with the translations in English: word-for-word reasonable, but the story was messed up.
The paper and book were called The opposite of time, a phrase that is not really comprehensible in English. English, for instance, always assumes symmetry in word pairs for opposites: if A is the opposite of B, then of necessity B is the opposite of A. This is not the case in most Semitic languages. Specifically, the opposite of time in Hebrew would be holiness. However, the opposite of holiness is the secular, which is quite far from time in meaning.
Symmetric pairing does occur but is far from a “rule.” Similarly, in perusing the New Testament in both English and Hebrew for meanings, I frequently encounter (in Hebrew) constructs that simply nobody would ever say. (This is, of course, no reflection on the content, only on the rendition.)
Again, word pairings can be superb for shallow (for example, technical) documents, or they may be a method for a first draft of a translation. However, they are insufficient for translations of anything real, particularly when the issue concerns two (or more) languages that are very far from one another in cultural milieu. And this, of course, reflects nothing on any particular language or any grouping of languages. Some natural languages tend to be flat while others may have multiple dimensions.
The book as it stands is important, but the technology is far from “industrial strength” for now. The bibliography on digital NLP is very extensive and impressive. However, it appears to me that the technologies involved still have a long way to go before they are sufficient for the aims the authors set for themselves. I strongly recommend the book to anyone involved in the field. However, the reader should be aware ahead of time of this very significant limitation, which still affects the field of study quite deeply.