Decision trees are valuable counterparts to neural networks in data mining, and the book confirms this. Evolutionary decision trees in large-scale data mining is a didactic introduction to, and overview of, almost two decades of research on the fruitful application of genetic/evolutionary algorithms to decision tree induction. Because evolutionary algorithms generate more and more candidate solutions as the generations proliferate, the search must be kept within bounds. The approach evolves a population of trees: genetic operators (mutation, crossover) produce descendants, and selection retains the fitter ones. Within each tree, the decisions at internal nodes divide the objects into subgroups.
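The evolutionary loop described above can be sketched in a few lines. This is a minimal illustration under assumed conventions (a tree as a nested dict of splits, mutation as threshold perturbation, survival of the fitter of parent and child), not the book's GDT algorithm:

```python
import random

# Illustrative sketch of evolutionary decision-tree induction. A tree is
# either a class label (leaf) or a dict {"feat": index, "thr": threshold,
# "lo": subtree, "hi": subtree}; this representation is an assumption.

def predict(tree, x):
    while isinstance(tree, dict):
        tree = tree["lo"] if x[tree["feat"]] <= tree["thr"] else tree["hi"]
    return tree

def accuracy(tree, data):
    return sum(predict(tree, x) == y for x, y in data) / len(data)

def mutate(tree, rng):
    # Genetic operator: perturb one split threshold.
    if isinstance(tree, dict):
        t = dict(tree)
        t["thr"] += rng.uniform(-0.5, 0.5)
        return t
    return tree

def evolve(population, data, rng, generations=50):
    # Each generation, every tree competes with a mutated descendant;
    # the fitter of the two survives (ties favor the parent).
    for _ in range(generations):
        population = [
            max((ind, mutate(ind, rng)), key=lambda t: accuracy(t, data))
            for ind in population
        ]
    return max(population, key=lambda t: accuracy(t, data))
```

A real system would add crossover (swapping subtrees between parents) and structural mutations (growing or pruning nodes); the selection step shown here never lets fitness decrease, which is the "kept within bounds" aspect of the search.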
Complex decision structures can be developed to reach good solutions through balanced branches. The author describes how decision trees are used as classification and regression predictors, and demonstrates through examples why multi-objective optimization is necessary. Several methods for balancing potentially conflicting objectives are described, including Pareto-optimal solutions, and the author discusses how to avoid overfitting.
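The Pareto-optimality idea behind such multi-objective balancing can be shown concretely. The following is a hedged sketch; the objective pairs (e.g., prediction error and tree size, both minimized) are illustrative stand-ins, not taken from the book:

```python
# Pareto dominance for multi-objective evaluation of candidate trees.
# Objectives are tuples to be minimized, e.g. (error_rate, tree_size).

def dominates(a, b):
    # a dominates b if it is no worse in every objective
    # and strictly better in at least one.
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    # Keep exactly the candidates that no other candidate dominates.
    return [c for c in candidates if not any(dominates(o, c) for o in candidates)]
```

The Pareto front contains the trade-off solutions among which no single "best" exists: improving one objective necessarily worsens another.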
The presented approach is inherently iterative: stepwise growth of the decision tree is controlled via feedback, by analyzing previous layers of the tree under construction. The method has also been extended to treat the whole tree at once, which the author calls “global induction”; it is implemented in a native C++ application, the global decision tree (GDT) system. Both simple and sophisticated functional variants of the method are described, and the method has been adapted to specific data mining applications.
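The essence of global induction is that fitness is assigned to the complete tree rather than to one split at a time, typically trading accuracy against complexity. A minimal sketch, with an assumed penalty form and weight alpha (the book's actual fitness function may differ):

```python
# Illustrative whole-tree ("global") fitness: score the complete tree,
# trading accuracy against size. The linear penalty and alpha=0.02 are
# assumptions for the sketch, not the book's formula.

def predict(tree, x):
    while isinstance(tree, dict):
        tree = tree["lo"] if x[tree["feat"]] <= tree["thr"] else tree["hi"]
    return tree

def tree_size(tree):
    if not isinstance(tree, dict):
        return 1  # leaf
    return 1 + tree_size(tree["lo"]) + tree_size(tree["hi"])

def global_fitness(tree, data, alpha=0.02):
    correct = sum(predict(tree, x) == y for x, y in data)
    return correct / len(data) - alpha * tree_size(tree)
```

Under such a fitness, two trees with equal accuracy are ranked by size, so the evolutionary search itself exerts pruning pressure instead of requiring a separate post-pruning pass.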
The author describes and discusses many implementations of this unified framework, following the theory. The most robust implementations are parallel, using either (i) general-purpose graphics processing units (GPGPU) with the CUDA architecture; (ii) Apache Spark, an open-source distributed system that improves on the disk-based Hadoop by keeping data in memory; or (iii) Weka, a collection of machine learning algorithms for solving real-world data mining problems, which is written in Java and runs on almost any platform.
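The common pattern behind these parallel variants is data-parallel fitness evaluation: each partition of the training data computes local statistics, which are then merged. The sketch below runs the partitions in-process for illustration; in Spark or CUDA the map step would execute on separate workers or GPU threads (the function names are illustrative, not the book's API):

```python
from functools import reduce

# Data-parallel fitness evaluation in the spirit of map/reduce.

def predict(tree, x):
    while isinstance(tree, dict):
        tree = tree["lo"] if x[tree["feat"]] <= tree["thr"] else tree["hi"]
    return tree

def partial_counts(tree, partition):
    # "map": count correct predictions within one data partition.
    correct = sum(predict(tree, x) == y for x, y in partition)
    return (correct, len(partition))

def distributed_accuracy(tree, partitions):
    # "reduce": merge per-partition counts into a global accuracy.
    correct, total = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]),
                            (partial_counts(tree, p) for p in partitions))
    return correct / total
```

Because the merge is a simple associative sum, partitions can be evaluated in any order and on any number of workers, which is what makes fitness evaluation the natural place to parallelize evolutionary induction.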
The structure of the book is well thought out. It consists of four parts. Part 1 explains the basics: evolutionary computation, decision trees in data mining, and parallel and distributed computation. Part 2 describes GDT for univariate, oblique, and mixed decision trees. Part 3 introduces two complex applications: cost-sensitive tree induction for financial data, and multi-test decision trees for gene expression data. Part 4 deals with the previously mentioned parallel implementation variants and combinations for evolutionary induction.
The methods are described and explained starting from the simple ones and concluding with sophisticated variants. The diagrams, figures, and tables are clear and easy to follow. References are provided at the end of each chapter, and there is a comprehensive index at the end of the book. The included experimental results are compared and discussed. Further details of the implementations and their applications can be found in the references.
A possible research direction would be to apply this iterative, global-induction way of thinking to multilayer neural networks, Bayesian networks, and sets of decision rules. Another possibility is its application to fast-changing data streams.
I recommend the book for students, researchers, and developers interested in real-life applications of big data analysis.