Data mining, the extraction or discovery of useful patterns from data for various applications, has become an important field in recent years. With this importance in mind, Larose has begun a series on data mining, initially planned as three volumes, two of which have already been published. This book, the second volume, advances the knowledge laid out in the first and third volumes [1,2].

This book follows a white box approach, in which the reader is walked through the various algorithms using specific examples applied to large data sets. The exercises are excellent, divided into concept clarification, working with data, and hands-on analysis. Several commonly available programs, such as Clementine, SPSS, Minitab, and Weka, are used in the book; the use of Weka should be particularly welcome, as it is open-source software.

The book is divided into seven chapters. Each chapter opens with introductory material and ends with a summary, useful references, and exercises. The first chapter focuses on dimension reduction methods, specifically principal component analysis and its variant, factor analysis. The necessity for dimension reduction is explained before these methods are discussed. An interesting aspect of this chapter is the brief discussion of user-defined composites as a method for dimension reduction, as there are not many texts that address this method explicitly.
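To give a flavor of the dimension reduction the chapter covers, the following minimal sketch computes principal components via the eigendecomposition of the covariance matrix; the toy data set is made up for illustration and is not from the book.

```python
import numpy as np

# Toy data: 5 records, 3 correlated predictors (illustrative only; the book
# works with far larger data sets).
X = np.array([[2.5, 2.4, 1.0],
              [0.5, 0.7, 0.2],
              [2.2, 2.9, 1.1],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4]])

# Center the data, then take eigenvectors of the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))

# Sort components by descending variance and keep the top k = 2.
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]
scores = Xc @ components     # data projected onto 2 principal components

print(scores.shape)          # dimension reduced from 3 predictors to 2
```

The eigenvalues measure the variance captured by each component, which is what justifies discarding the low-variance directions.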

The next chapter is on regression modeling, where only simple linear regression methods (that is, a single input predictor variable is used to predict a single output response variable) are discussed. The last part of the chapter covers simple transformations, such as the use of logarithms, to convert relationships between variables from nonlinear to linear (not to be confused with nonlinear/linear analysis in signal and system courses).
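The log transformation mentioned above can be sketched as follows; the power-law data here is hypothetical and noise-free for clarity, not taken from the book.

```python
import numpy as np

# Hypothetical power-law relationship y = 2 * x**1.5; taking logs turns it
# into the linear model log y = log 2 + 1.5 * log x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x ** 1.5

# Fit a straight line in log-log space with ordinary least squares.
slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)

print(round(slope, 3))               # recovered exponent: 1.5
print(round(np.exp(intercept), 3))   # recovered multiplier: 2.0
```

The same linear regression machinery thus handles a nonlinear relationship once the variables are suitably transformed.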

Chapter 3 extends the material discussed in the previous chapter by including regression methods where there is more than one input predictor variable. The multiple regression model is discussed with several related topics: inference, regression with categorical predictors, multicollinearity, and variable selection methods. The use of principal components as predictors is also discussed. In the following chapter, a further extension of regression techniques is covered: logistic regression methods are used to treat situations with categorical response variables.

Bayesian statistics, which assumes the data is known and the parameters are random variables, is covered in chapter 5. First, the difference between Bayesian and classical statistics is introduced. The maximum a posteriori (MAP) classification method is explained using a simple example. This is followed by naive Bayes classification, which handles larger numbers of predictor and target variables. Finally, Bayesian belief networks, an extension of naive Bayes classification, are covered.
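The flavor of naive Bayes classification can be conveyed with a minimal categorical sketch; the weather-style records and feature names below are made up for illustration, not the book's example.

```python
from collections import Counter, defaultdict

# Tiny illustrative training set: (features, class) pairs.
records = [({"outlook": "sunny", "windy": "no"},  "play"),
           ({"outlook": "sunny", "windy": "yes"}, "stay"),
           ({"outlook": "rainy", "windy": "yes"}, "stay"),
           ({"outlook": "sunny", "windy": "no"},  "play")]

priors = Counter(label for _, label in records)
likelihood = defaultdict(Counter)   # per-class (feature, value) counts
for features, label in records:
    for item in features.items():
        likelihood[label][item] += 1

def map_classify(features):
    # Pick the class maximizing P(class) * prod P(feature value | class),
    # assuming conditional independence of the features (the "naive" part).
    def score(label):
        p = priors[label] / len(records)
        for item in features.items():
            p *= likelihood[label][item] / priors[label]
        return p
    return max(priors, key=score)

print(map_classify({"outlook": "sunny", "windy": "no"}))
```

A production implementation would also smooth the counts to avoid zero probabilities for unseen feature values.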

Chapter 6 is on genetic algorithms (GAs), where both binary and continuous (real-valued) chromosomes are described. The basic principles behind GAs are explained. The sections on enhancements to the standard selection and crossover operators are very useful. Several variations of the crossover operators are described; however, Larose does not cover the inversion operator, which is useful in some applications (such as the traveling salesman problem). A hands-on Weka application uses GAs to obtain optimal neural network weights.
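As a point of reference for the crossover discussion, the standard single-point crossover on binary chromosomes can be sketched as follows (an illustrative sketch only; the book's enhanced variants modify this basic scheme).

```python
import random

def single_point_crossover(parent_a, parent_b, rng):
    # Cut both parents at the same random point between genes and swap tails.
    point = rng.randrange(1, len(parent_a))
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

rng = random.Random(42)
a, b = [0] * 8, [1] * 8
child_a, child_b = single_point_crossover(a, b, rng)
print(child_a, child_b)
```

Each child inherits a contiguous prefix from one parent and the remaining suffix from the other, which is what the enhanced operators described in the chapter refine.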

The final chapter treats a detailed case study, "Modeling Response to Direct Mail Marketing," which was carried out using the cross-industry standard process for data mining (CRISP-DM). The six phases involved in the standard are business understanding, data understanding, data preparation, modeling, evaluation, and deployment. Each is described along with several appropriate methods. However, some of the methods, such as C5.0 and CART, were applied using Clementine software and may seem like "black box" material to the general reader.

The material is easy to read, and does not require an extensive mathematics background or much programming experience. A basic understanding of statistics would be beneficial, as the book is oriented toward statistical data mining (three chapters are dedicated to regression analysis and one to Bayesian methods, though there is a chapter on GAs). The use of other open-source software, such as Scilab or Octave (open-source alternatives to MATLAB), along with some technical hands-on examples, could have attracted readers with an engineering or computer science background.

Overall, the book is interesting to read, and the methods will be useful for data mining researchers working with one or more of the discussed approaches. Some or all of the text could also be adapted for advanced undergraduate or graduate-level teaching. There is a companion Web site (www.dataminingconsultant.com) that provides an array of teaching resources, should an instructor decide to adopt the book.