In his earlier book from 2018, Data mining algorithms in C++, the author indicated that a “volume 2 will appear some day.” This is it. It builds on the techniques and tools of the earlier book and follows a similar style of presentation. Each chapter presents a problem and guides readers through the code that implements the solution. The complete code of each routine is not given; only the small segments pertinent to the particular point being developed appear in the text. However, the author makes the complete source code available for download on his website (http://www.timothymasters.info/). Since the book is a guide to the software, downloading the code and reading it along with the book is essential.
After a brief introductory chapter, each of the following five chapters addresses a specific issue or task. The second chapter is on forward selection (and backward selection) of independent components that will account for the variance of a dataset. This chapter lacks an extended theoretical introduction and refers readers to a paper by Puggini and McLoone for background. Simple forward selection, which adds components incrementally, will generate different results, and a different understanding of the important features, than deleting components and then adding others.
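To make the idea of forward selection concrete, the following is a minimal sketch of a greedy procedure, not the author's code. It assumes that "accounting for the variance" means picking, at each step, the variable whose values best reconstruct all columns by least squares, and then removing that variable's contribution from every column. The names `Matrix`, `dot_columns`, and `forward_select` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // data[case][variable]

// Dot product of two columns of the data matrix.
static double dot_columns(const Matrix& m, std::size_t a, std::size_t b) {
    double sum = 0.0;
    for (const auto& row : m) sum += row[a] * row[b];
    return sum;
}

// Hypothetical sketch: greedily pick up to max_vars variables, each time
// choosing the one that explains the most remaining variance of all columns.
std::vector<std::size_t> forward_select(Matrix data, std::size_t max_vars) {
    assert(!data.empty() && !data[0].empty());
    const std::size_t n_vars = data[0].size();
    const double tiny = 1e-12;  // assumed tolerance for "nothing left to explain"

    // Center every column so that squared length measures variance.
    for (std::size_t j = 0; j < n_vars; ++j) {
        double mean = 0.0;
        for (const auto& row : data) mean += row[j];
        mean /= static_cast<double>(data.size());
        for (auto& row : data) row[j] -= mean;
    }

    std::vector<std::size_t> chosen;
    while (chosen.size() < max_vars) {
        double best_score = tiny;
        std::size_t best = n_vars;  // sentinel: no candidate found yet
        for (std::size_t k = 0; k < n_vars; ++k) {
            const double norm = dot_columns(data, k, k);
            if (norm <= tiny) continue;  // already selected or fully explained
            double score = 0.0;          // variance of all columns explained by k
            for (std::size_t j = 0; j < n_vars; ++j) {
                const double d = dot_columns(data, k, j);
                score += d * d;
            }
            score /= norm;
            if (score > best_score) { best_score = score; best = k; }
        }
        if (best == n_vars) break;  // no remaining variance to explain
        chosen.push_back(best);

        // Deflate: subtract the chosen column's contribution from every column.
        const double norm = dot_columns(data, best, best);
        std::vector<double> coef(n_vars);
        for (std::size_t j = 0; j < n_vars; ++j)
            coef[j] = dot_columns(data, best, j) / norm;
        for (auto& row : data) {
            const double pivot = row[best];
            for (std::size_t j = 0; j < n_vars; ++j) row[j] -= coef[j] * pivot;
        }
    }
    return chosen;
}
```

Because each choice is made once and never revisited, a sketch like this can settle on a different subset than a procedure that is also allowed to delete a component and add another.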
The third chapter is on local feature selection. This is the problem of identifying large features that separate data and the problem of identifying features that differentiate datasets within these large features, especially when the large features are in different states. The theoretical background is in a paper by Armanfard, Reilly, and Komeili. The author uses the example of financial markets, where different local features may be more important in a rising market than in a static one. The basis of the algorithm is the simplex method of linear programming.
The fourth chapter, “Memory in Time Series Features,” is the longest in the book, at nearly one-third of the volume. It covers the extraction of features when samples are not independent of each other, using a hidden Markov model to identify unobservable variables and estimate their values from a set of data items linked by time. Each data item can be described by a set of state variables that undergo transitions from one condition to another. Some of these variables may be observable and others may be unknown and of potentially great interest. There is a small amount of theoretical background in this chapter. However, hidden Markov models are a topic of much greater scope, and readers will need to find suitable reference material for more information.
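For readers new to hidden Markov models, the standard forward algorithm shows how hidden states and observed data fit together. The sketch below is not taken from the book's code; it assumes a small model with discrete observation symbols, and the names `Matrix` and `sequence_likelihood` are hypothetical. It computes the probability of an observed sequence by summing over every possible path through the hidden states.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// Forward algorithm: probability of the observation sequence under the model.
// initial[i]        probability of starting in hidden state i
// transition[i][j]  probability of moving from state i to state j
// emission[i][k]    probability of observing symbol k while in state i
double sequence_likelihood(const std::vector<double>& initial,
                           const Matrix& transition,
                           const Matrix& emission,
                           const std::vector<std::size_t>& observations) {
    const std::size_t n = initial.size();
    assert(n > 0 && transition.size() == n && emission.size() == n);
    assert(!observations.empty());

    // alpha[i]: probability of the observations so far, ending in state i.
    std::vector<double> alpha(n);
    for (std::size_t i = 0; i < n; ++i)
        alpha[i] = initial[i] * emission[i][observations[0]];

    for (std::size_t t = 1; t < observations.size(); ++t) {
        std::vector<double> next(n, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            double reach = 0.0;  // probability of arriving in state j
            for (std::size_t i = 0; i < n; ++i)
                reach += alpha[i] * transition[i][j];
            next[j] = reach * emission[j][observations[t]];
        }
        alpha = next;
    }

    double total = 0.0;
    for (double a : alpha) total += a;
    return total;
}
```

This unscaled version is only suitable for short sequences; for long series the products underflow, so practical implementations rescale `alpha` at each step or work with logarithms.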
The fifth chapter, “Stepwise Selection on Steroids,” returns to the topic of selection. Specifically, it deals with the problems of overlooking variables that are important, but only in conjunction with other variables; of validating variables to avoid adding superfluous items; and of rejecting the addition of variables when simple random luck can account for the variance (that is, avoiding trying to fit random noise). The technique used is linear quadratic regression, which includes the variables, their squares, and all possible cross products.
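The set of terms in a linear-quadratic model can be sketched in a few lines. The function below is illustrative only and is not from the book's code; the name `expand_linear_quadratic` is hypothetical, and "all possible cross products" is assumed to mean every pairwise product of distinct variables. For n predictors it produces the n original values, their n squares, and the n(n-1)/2 cross products.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: expand one case of n predictors into the terms of a
// linear-quadratic model, in the order linear, squares, cross products.
std::vector<double> expand_linear_quadratic(const std::vector<double>& x) {
    assert(!x.empty());
    const std::size_t n = x.size();
    std::vector<double> terms;
    terms.reserve(2 * n + n * (n - 1) / 2);

    for (double v : x) terms.push_back(v);      // linear terms
    for (double v : x) terms.push_back(v * v);  // squared terms
    for (std::size_t i = 0; i < n; ++i)         // pairwise cross products
        for (std::size_t j = i + 1; j < n; ++j)
            terms.push_back(x[i] * x[j]);
    return terms;
}
```

The term count grows quadratically with the number of predictors (3 predictors give 9 terms, 10 give 65), which helps explain why guarding against fitting random noise matters for this kind of model.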
The last chapter, on converting nominal variables to ordinal, is the shortest in the book. Nominal variables may have numerical values, but these do not imply quantity or magnitude and do not support any meaningful arithmetic. The months of the year are nominal quantities; that is, numbers are associated with them, but only for limited purposes. For example, if February is represented as 2 and August as 8, adding them does not yield October at 10. Yet ordinal equivalents allow software to use the fact that February is earlier than August, which is in turn earlier than October. Nominal data are difficult to handle as variables in a program. The fact that the months can be correlated with numerical values makes their appropriate use possible. Actual data will be more complicated. The example developed in this chapter focuses on putting data in bins based on comparisons that have no inherent magnitude.
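The months example can be made concrete with a small sketch. It is illustrative only and does not reproduce the book's conversion method; the helpers `ordinal_codes` and `earlier` are hypothetical names. Each label receives a code equal to its position in an ordering supplied by the caller, and only order comparisons are made on those codes.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: give each label a 1-based code equal to its position
// in the caller-supplied ordering. Labels are assumed to be unique.
std::map<std::string, int> ordinal_codes(
        const std::vector<std::string>& ordered_labels) {
    std::map<std::string, int> codes;
    for (std::size_t i = 0; i < ordered_labels.size(); ++i) {
        const bool inserted =
            codes.emplace(ordered_labels[i], static_cast<int>(i) + 1).second;
        assert(inserted);  // a repeated label would make the order ambiguous
        (void)inserted;    // silence the unused warning when asserts are off
    }
    return codes;
}

// True when label a comes before label b. Only the order of the codes is
// used; sums or differences of the codes carry no meaning.
bool earlier(const std::map<std::string, int>& codes,
             const std::string& a, const std::string& b) {
    return codes.at(a) < codes.at(b);
}
```

With the twelve month names passed in calendar order, `earlier(codes, "Feb", "Aug")` is true, while the sum of the two codes (2 + 8) says nothing about October.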
This is an excellent book directed toward those who are already working in data mining. Novices should master the earlier volume before plunging into this one. The VarScreen software is available on the author’s web page; the executable (Windows version only) and manual are also online (http://www.timothymasters.info/varscreen.html). The source code is in a zip file (http://www.timothymasters.info/data-mining.html); on this page, the book is referred to using a different title, but the code is easily found.