A cluster is a group of similar objects; objects from different clusters are not alike. Clustering is an important tool in exploratory data analysis and is used in several disciplines, such as artificial intelligence, pattern recognition, geology, biology, psychology, and information retrieval. A clustering algorithm generates clusters from the definitions of objects, and cluster analysis is the formal analysis of these algorithms.
This excellent book emphasizes informal algorithms for clustering data and interpreting results. The authors, whose names should be familiar to researchers working in cluster analysis, masterfully introduce mathematical and statistical theory only when necessary.
The book consists of five chapters. Chapter 1 introduces the general concepts and the literature. Chapter 2 presents the authors’ view of data and introduces the representation of objects (pattern matrix), the idea of a proximity matrix, different ways to represent data, and various proximity indices (ratio, nominal, and probabilistic). This chapter also introduces the concept of normalization, linear and nonlinear projections to permit visual examination and dimension reduction of multivariate data, intrinsic dimensionality (the problem of dimension reduction), and multidimensional scaling.
In chapter 3 the authors present clustering methods and algorithms. They begin by classifying various approaches to classification; then they present clustering algorithms under two headings--hierarchical clustering and partitional clustering--and provide information about the available cluster analysis packages. They discuss the clustering methodology for the major steps of exploratory data analysis--data collection, initial screening, representation, clustering tendency, clustering strategy, validation, and interpretation--and indicate which sections of the book are relevant to each step. The authors conclude this part by introducing various approaches from the literature for the comparative analysis of clustering methods.
Chapter 4, the most important part of the book, presents a comprehensive summary of procedures for the objective validation of cluster analysis results. Jain and Dubes approach the validity problem from the viewpoint of probability and statistics and begin by providing background information that includes fundamentals of statistical testing of hypotheses, statistics that can be used to test cluster validity, and procedures for a Monte Carlo analysis. The authors then describe three types of criteria for validating a clustering structure: external criteria measure performance by matching a clustering structure and a priori information; internal criteria assess the fit between the clustering structure and the data used to describe objects; and relative criteria help one to choose the most stable of several clustering structures or the structure most appropriate for the data. Later in this chapter they discuss the validity of hierarchical and partitional structures and consider individual clusters in terms of external, internal, and relative validity indices. Their discussion of the validity of partitional structures thoroughly covers the fundamental problem of partitional cluster structure validity: how many clusters do the data contain? They close the chapter by covering the clustering tendency problem, which is usually neglected. The concern here is whether the data are random: if they are, clusters will be artifacts of the clustering algorithm.
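To make the idea of an external criterion concrete, here is a minimal sketch of one common pair-counting measure, the Rand index, which scores how well a clustering agrees with a priori labels. This is an illustration only, not necessarily the index or the formulation the authors use; the function name `rand_index` and the toy label lists are assumptions introduced for the example.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of object pairs on which two labelings agree.

    A pair counts as an agreement when both labelings put the two
    objects in the same cluster, or both put them in different clusters.
    Illustrative helper (hypothetical name), not taken from the book.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must cover the same objects")
    n = len(labels_a)
    if n < 2:
        raise ValueError("need at least two objects to form a pair")

    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(n), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        if same_in_a == same_in_b:
            agreements += 1
        total_pairs += 1
    return agreements / total_pairs


if __name__ == "__main__":
    a_priori = [0, 0, 1, 1]    # hypothetical known categories
    clustering = [1, 1, 0, 0]  # same partition, different label names
    print(rand_index(a_priori, clustering))
```

Because the measure compares pairs rather than label values, renaming the clusters does not change the score. On its own the raw value says little; judging whether it is unusually high requires a reference distribution, which is where the statistical testing and Monte Carlo procedures mentioned above come in.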
In chapter 5 Jain and Dubes briefly discuss applications of clustering to image processing and computer vision. The eight appendices briefly cover concepts of pattern recognition, the normal and hypergeometric distributions, linear algebra, scatter matrices, factor analysis, multivariate analysis of variance, some definitions from graph theory, and an algorithm that creates clustered data in a d-dimensional unit hypercube sampling window.
This book is for people who gather and interpret data and would be an excellent reference for researchers working on cluster analysis and especially on cluster validity. It could be used as a textbook for a graduate course on clustering algorithms and exploratory data analysis or as a supplemental text in courses on research methodology, pattern recognition, image processing, remote sensing, and information retrieval. The authors state in the preface that interested readers may contact them to obtain homework problems.
The length of the book is appropriate; the authors use the space economically and do not repeat themselves. The language is simple and easy to understand, and the purpose of each section is stated clearly and achieved well. I have found no typos.
The 31 examples are carefully chosen and make the book easy to understand. The most unusual feature of the book, and to me the best one, is the excellent discussion of cluster validity, which occupies 25 percent of the book. This discussion is an important contribution by itself and provides many references to the literature. The bibliography, which contains 434 citations from almost 100 journals and over 380 researchers, is also excellent: the references are timely and cover many aspects of the clustering literature. The text cites most of these works and explains their contents, and the book also contains an author index as well as a detailed and helpful general index.
This important book on cluster analysis is distinct from most other works in the field, as it combines results from several disciplines and will lead to cross-fertilization. If you are using or researching cluster analysis and do not wish to reinvent the wheel, this excellent book must be in your library. I consider it a classic of cluster analysis literature.