FensiVAT is a rapid hybrid clustering algorithm that identifies clusters in large datasets characterized by many instances (N) and multiple features (p) in each instance.
FensiVAT improves on popular algorithms that rely on random sampling, such as clustering large applications (CLARA), which applies k-medoids to samples; clustering using representatives (CURE); and clustering with improved visual assessment of tendency (clusiVAT). It also improves on algorithms that reduce dimensionality by projecting data onto a lower-dimensional space, such as CLIQUE and PROCLUS. These approaches suffer from space and/or time complexity issues.
FensiVAT integrates random projection with the visual assessment of cluster tendency (VAT). It projects the dataset into a lower-dimensional space, draws samples using a scheme called maximin and random sampling (MMRS), and aggregates multiple distance matrices using principal component analysis (PCA) and linear discriminant analysis (LDA).
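To make the two ingredients named above concrete, the following is a minimal sketch of a Gaussian random projection followed by maximin and random sampling. It is an illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian projection matrix, the choice of index 0 as the first seed, and the proportional allocation of random draws are all hypothetical choices made here.

```python
import math
import random


def random_projection(data, q, rng):
    """Project p-dimensional rows onto q dimensions with a Gaussian random matrix."""
    p = len(data[0])
    # Assumption: entries drawn from N(0, 1/q), a common choice for random projection.
    proj = [[rng.gauss(0.0, 1.0 / math.sqrt(q)) for _ in range(q)] for _ in range(p)]
    return [[sum(row[i] * proj[i][j] for i in range(p)) for j in range(q)]
            for row in data]


def maximin_sample(points, k, start=0):
    """Pick k indices, each one farthest from the indices already chosen."""
    chosen = [start]
    # nearest[i] holds the distance from point i to its closest chosen point.
    nearest = [math.dist(pt, points[start]) for pt in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=nearest.__getitem__)
        chosen.append(nxt)
        nearest = [min(d, math.dist(pt, points[nxt]))
                   for d, pt in zip(nearest, points)]
    return chosen


def mmrs_sample(points, k, n, rng):
    """Maximin picks k seeds; random draws from each seed's group give about n indices."""
    seeds = maximin_sample(points, k)
    groups = {s: [] for s in seeds}
    for i, pt in enumerate(points):
        closest = min(seeds, key=lambda s: math.dist(pt, points[s]))
        groups[closest].append(i)
    sample = []
    for members in groups.values():
        # Assumption: each group contributes in proportion to its size.
        share = max(1, round(n * len(members) / len(points)))
        sample.extend(rng.sample(members, min(share, len(members))))
    return sample
```

In this sketch, `mmrs_sample` would normally be called on the output of `random_projection`, so that all distance computations happen in the lower-dimensional space.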
The authors’ ten-step algorithm includes input, dataset generation in down-space, near-MMRS sampling, reduced image (iVAT) generation, application of VAT/iVAT to distance matrices, clustering, and extension in down-space. They apply FensiVAT to the US Census 1990, KDD CUP, FOREST, MiniBooNE, MNIST, and ACT datasets. They report that FensiVAT is an order of magnitude faster than clusiVAT and several orders of magnitude faster than the other approaches, without compromising accuracy.
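The VAT, clustering, and extension steps listed above can be sketched roughly as follows. This is a simplified illustration, not the paper's ten-step procedure: it uses the standard VAT reordering (a Prim-style traversal of the dissimilarity matrix), omits the iVAT path-distance transform, and assumes that clusters are obtained by cutting the largest edges along the ordering and that the remaining points take the label of their nearest sampled point. All function names are hypothetical.

```python
import math


def vat_order(dist):
    """Return the VAT ordering of indices and the edge weight that attached each one."""
    n = len(dist)
    # Start from one endpoint of the largest dissimilarity.
    first = max(range(n), key=lambda i: max(dist[i]))
    order, weights = [first], [0.0]
    # best[j] is the smallest dissimilarity from unvisited j to any visited index.
    best = {j: dist[first][j] for j in range(n) if j != first}
    while best:
        nxt = min(best, key=best.get)
        order.append(nxt)
        weights.append(best.pop(nxt))
        for j in best:
            best[j] = min(best[j], dist[nxt][j])
    return order, weights


def cut_into_clusters(order, weights, k):
    """Label points by cutting the k-1 largest edges along the VAT ordering."""
    ranked = sorted(range(1, len(order)), key=lambda t: weights[t], reverse=True)
    cuts = set(ranked[: k - 1])
    labels, current = {}, 0
    for t, idx in enumerate(order):
        if t in cuts:
            current += 1
        labels[idx] = current
    return labels


def extend_labels(points, sample, labels):
    """Give every point the label of its nearest sampled point."""
    return [labels[min(range(len(sample)), key=lambda s: math.dist(pt, sample[s]))]
            for pt in points]
```

Here `dist` is the square dissimilarity matrix of the sampled points only, which keeps the quadratic-cost reordering small; `k` would be read off the number of dark diagonal blocks in the reordered image.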
This well-written paper has 55 references and will interest the big data community.