Computing Reviews
Text mining with MATLAB
Banchs R., Springer Publishing Company, Incorporated, New York, NY, 2012. 366 pp. Type: Book (978-1-461441-50-2)
Date Reviewed: Feb 12 2013

Knowledge discovery from natural language texts is an exciting field of research with wide-ranging real-life applications. The explosive growth of the Internet as a huge repository of information and knowledge stored in texts has led scientists to develop new computing tools and methodologies for managing this information. Besides the sheer volume of information, there are additional difficulties, such as the dynamic and evolving character of texts and the heterogeneity of data sources. In general, most of the knowledge hidden either in structured libraries (for example, medical documents in PubMed) or in unstructured streams (for example, social network postings) remains to be discovered. That is why text mining, the scientific discipline that combines mathematical and computational tools for knowledge discovery from texts, is so important today.

Although numerous software tools implement text mining methodologies (for example, the tm package for the R statistical language), MATLAB is a very popular mathematical software package among scientists from various disciplines; the author’s effort to cover text mining in MATLAB is therefore welcome. Most parts of the book are introductory and can be studied without previous MATLAB or text mining experience.

The book consists of an introductory chapter followed by 11 chapters divided into three parts. Each chapter concludes with sections containing references for further reading, proposed exercises, and projects. The concepts are introduced gradually, starting from the simplest ones (variable types) and proceeding to operators, functions, procedures, methodologies, and finally applications. Even the data types are presented gradually, from the character level to the document collection level. Every concept is illustrated with examples in MATLAB. All of this makes the book a very good educational aid, able to support an introductory course in text mining.

Chapter 1, the introduction, first discusses the benefits of using MATLAB for text mining. It then outlines the book’s contents and how the book can be used, and concludes with a very brief introduction to the basic principles of the MATLAB environment.

The first part, “Fundamentals,” contains four chapters (chapters 2 through 5) on how MATLAB handles strings of characters. Chapter 2 presents the variable types and classes used in MATLAB to represent text. It also introduces the basic functions that are useful for string management and manipulation. Chapter 3 proceeds with the use of regular expressions. It starts with operators for matching characters, then presents functions for matching sequences of characters and operators for conditional matching, and concludes with the definition and use of tokens. Chapter 4 is devoted to procedures for operating on strings. It presents procedures for searching and comparing strings; for string replacement, insertion, segmentation, and concatenation; and finally for set operations that can be applied to characters or tokens. The last chapter of this part, chapter 5, describes the main functions for reading and writing files in different formats.
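To give a flavor of what "tokens" mean in regular expression matching, here is a minimal sketch. It is written in Python rather than MATLAB and is not taken from the book; the function name `extract_tokens` and the sample `key=value` input are assumptions made purely for illustration. Tokens correspond to the parenthesized groups of a pattern, which Python's `re` module calls capture groups.

```python
import re

def extract_tokens(text):
    """Return (key, value) pairs captured from 'key=value' fragments.

    Hypothetical helper for illustration only; each pair of parentheses
    in the pattern defines one token (capture group).
    """
    pattern = re.compile(r"(\w+)=(\w+)")
    # findall returns one tuple of captured groups per match
    return pattern.findall(text)

pairs = extract_tokens("lang=matlab topic=mining")
for key, value in pairs:
    print(key, value)
```

The same idea should carry over to MATLAB's own regular expression functions, although the calling conventions differ.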

The second part, “Mathematical Models,” contains four chapters (chapters 6 through 9) describing the two main approaches to the mathematical modeling of textual data: the statistical approach and the geometric approach. Chapter 6 explains basic statistical properties of natural language, such as Zipf’s law, and methods for analyzing relations between words, such as word co-occurrences and mutual information. Chapter 7 discusses the problem of modeling dependencies in larger units of text. Two classes of statistical models are considered: n-gram models and bag-of-words models. Chapter 8 presents the alternative approach of geometric models, in which texts are represented as vectors in a vector space. The chapter begins with the term-document matrix, proceeds with the tf-idf weighting scheme, and concludes with the concept of distance between vectors representing text. Chapter 9 discusses three methods for dimensionality reduction (vocabulary pruning and merging, linear transformation, and nonlinear projection) that address the problem of sparseness in high-dimensional spaces.
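The geometric pipeline described for chapter 8 (term-document matrix, tf-idf weighting, distance between vectors) can be sketched in a few lines. The following is an illustrative Python sketch, not the book's MATLAB code. The three toy documents, the whitespace tokenization, the plain `log(N/df)` form of idf, and the choice of cosine distance are all assumptions made here; the book may use different variants.

```python
import math
from collections import Counter

def term_document_matrix(docs):
    """Build a term-document count matrix (rows: terms, columns: documents)."""
    vocab = sorted({w for d in docs for w in d.lower().split()})
    counts = [Counter(d.lower().split()) for d in docs]
    matrix = [[c[t] for c in counts] for t in vocab]
    return vocab, matrix

def tfidf(matrix):
    """Weight raw counts by idf = log(N / df), one common variant."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        df = sum(1 for x in row if x > 0)  # number of documents containing the term
        idf = math.log(n_docs / df)
        weighted.append([tf * idf for tf in row])
    return weighted

def column(matrix, j):
    """Extract the vector of document j from the matrix."""
    return [row[j] for row in matrix]

def cosine_distance(u, v):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)

# Toy collection, invented for this sketch
docs = [
    "text mining with matlab",
    "text mining with python",
    "cooking with garlic",
]
vocab, counts = term_document_matrix(docs)
weights = tfidf(counts)
d01 = cosine_distance(column(weights, 0), column(weights, 1))
d02 = cosine_distance(column(weights, 0), column(weights, 2))
print(d01 < d02)  # the two text mining documents are closer to each other
```

Note that the word "with" occurs in every document, so its idf is zero and it contributes nothing to the distances, which is the intended effect of the weighting.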

The third part, “Methods and Applications,” contains three chapters (chapters 10 through 12). This last part discusses some common problems in text mining and natural language processing applications. Chapter 10 is devoted to document categorization. It starts with the preparation of data and then discusses two different approaches to categorization: unsupervised clustering and supervised classification. Chapter 11 focuses on an information retrieval problem, document search, introducing concepts such as binary, vector, and cross-language search; keyword extraction; and evaluation measures such as recall and precision. Chapter 12 deals with the content analysis of documents, presenting concepts such as polarity estimation and property extraction.
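For readers unfamiliar with the evaluation measures mentioned for chapter 11, precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A minimal Python sketch follows; the function name `precision_recall` and the document identifiers are hypothetical and are not drawn from the book.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for one query.

    retrieved: iterable of document ids returned by the search
    relevant:  iterable of document ids judged relevant
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented example: 4 documents retrieved, 3 relevant, 2 in common
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(p, r)
```

Returning 0.0 for an empty retrieved or relevant set is a convention chosen for this sketch; other conventions exist.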

In conclusion, the book offers a good introduction to techniques and methods for text mining with MATLAB. It is recommended for MATLAB users, especially those who are comfortable writing code; beginners, however, will also find it interesting and easy to follow. Readers interested in text mining will also find MATLAB’s overall matrix-oriented philosophy interesting. Teachers especially will find the book useful for project assignments.

Reviewer:  Lefteris Angelis Review #: CR140925 (1305-0354)
Data Mining (H.2.8 ... )
Matlab (G.4 ... )
Text Processing (I.5.4 ... )
Applications (I.5.4 )
Mathematical Software (G.4 )