Computing Reviews
Automatic language identification in texts: a survey
Jauhiainen T., Lui M., Zampieri M., Baldwin T., Lindén K. Journal of Artificial Intelligence Research 65 (1): 675-782, 2019. Type: Article
Date Reviewed: Mar 19 2020

One might think that automatic language identification (LI) is straightforward; surely, distinguishing English from Polish is easy. This survey shows that the problem is much harder than one might expect. For example, distinguishing Modern Standard Arabic (MSA) from the Egyptian dialect is hard, especially when the text is a tweet. The survey sets out to cover the current state of the art of LI. To get some idea of its scope and of the material reviewed, note that the paper runs to more than 100 pages, more than one-third of which is bibliography.

The authors divide the LI process into four steps. First, a document representation is selected. Second, a language model for each of a predefined set of languages is derived from a training corpus. Third, a function is defined that determines how well a given document fits the language model for each training language. Finally, the language of the document is predicted from those scores.
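The four steps can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' method: it assumes character trigrams as the representation, raw counts as the language model, and an add-one smoothed log-likelihood as the scoring function. The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def char_ngrams(text, n=3):
    """Step 1: represent a document as a bag of character n-grams."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def train(corpus, n=3):
    """Step 2: derive one n-gram count model per language from a training corpus."""
    models = {}
    for lang, texts in corpus.items():
        counts = Counter()
        for t in texts:
            counts.update(char_ngrams(t, n))
        models[lang] = counts
    return models

def score(doc_ngrams, model, vocab_size):
    """Step 3: log-likelihood of the document under one model (add-one smoothing)."""
    total = sum(model.values())
    return sum(
        count * math.log((model[gram] + 1) / (total + vocab_size))
        for gram, count in doc_ngrams.items()
    )

def identify(text, models, n=3):
    """Step 4: predict the language whose model scores the document highest."""
    vocab = set().union(*models.values())
    doc = char_ngrams(text, n)
    # +1 reserves probability mass for n-grams never seen in training.
    return max(models, key=lambda lang: score(doc, models[lang], len(vocab) + 1))

# Hypothetical toy corpus; a real system would train on far more text.
TOY_CORPUS = {
    "en": ["the quick brown fox jumps over the lazy dog",
           "this is a short english sentence"],
    "pl": ["szybki brązowy lis przeskakuje nad leniwym psem",
           "to jest krótkie polskie zdanie"],
}

models = train(TOY_CORPUS)
print(identify("the dog is lazy", models))
```

With so little training text the sketch only separates languages that differ as sharply as English and Polish; closely related varieties and very short inputs are where the survey locates the real difficulty.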

Language models can be constructed from n-grams of characters (bearing in mind that "character" is not well defined for all languages), from n-grams of words, or from many other possible features. The authors give an exhaustive survey of the features that can be used. The probability of such a feature in a text can then enter the quantitative function used for language identification. The authors also survey papers that mix several types of features. Support vector machines, decision trees, and neural networks, among others, have been used as classifiers.
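One way to see why "character" is slippery is to compare n-grams over different units of the same text. The sketch below is an assumption-laden illustration rather than anything taken from the paper: it contrasts Unicode code points in two normalization forms with UTF-8 bytes, and adds word n-grams for comparison. The helper `ngrams` is hypothetical.

```python
import unicodedata

def ngrams(seq, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

word = "caf\u00e9"  # "café" written with a precomposed e-acute

# The same visible string has different "characters" depending on the unit chosen.
nfc = unicodedata.normalize("NFC", word)   # 4 code points
nfd = unicodedata.normalize("NFD", word)   # 5 code points: "e" plus a combining accent
utf8 = list(nfc.encode("utf-8"))           # 5 bytes: the accented letter takes two

code_point_bigrams_nfc = ngrams(list(nfc), 2)
code_point_bigrams_nfd = ngrams(list(nfd), 2)
byte_bigrams = ngrams(utf8, 2)

# Word n-grams use whitespace tokens here, which itself assumes a script
# that separates words with spaces.
word_bigrams = ngrams("distinguishing english from polish".split(), 2)

print(len(code_point_bigrams_nfc), len(code_point_bigrams_nfd), len(byte_bigrams))
print(word_bigrams)
```

A model trained on one of these units will not match features extracted with another, so the choice of unit has to be fixed before training and applied identically at prediction time.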

The paper also surveys empirical evaluations of LI systems. One issue here is the length a document must have before its language can be identified; think of tweets or search phrases. The authors note the need for standardized datasets to enable comparisons between systems. One section of the paper is devoted to application areas, and another reviews off-the-shelf language identifiers. A lengthy section covers research directions and open issues, among them multilingual documents.

The paper contains a significant number of tables, each listing the papers that cover a specified topic. These will be very useful to readers wishing to pursue a particular topic further. The paper is an impressive contribution that highlights the complexities of an area that one might otherwise overlook.

Reviewer:  J. P. E. Hodgson Review #: CR146936 (2008-0197)
Language Parsing And Understanding (I.2.7 ... )
Natural Language Interfaces (I.2.1 ... )
Text Analysis (I.2.7 ... )
Text Processing (I.5.4 ... )
General (I.2.0 )
