Efficient data management is the distinguishing factor for the success of computer applications. Traditional software made use of querying and updating persistent datasets saved in a stable storage format. With growing usage, increased automation, and widespread connectivity, many modern systems are bombarded with big data, especially characterized by its velocity, in the form of continuous high-speed streams. Rapid response is required in real time, and the luxury of deploying previous techniques that relied on post processing and making multiple passes over a static collection is no longer affordable. This emerging trend poses different and difficult challenges, so there is a strong need to develop new methodologies to handle the complexity.
It is a pleasure then to come across this impressive volume that covers both the theory and algorithms related to processing high-speed data streams. Numerous authors and editors have come together and prepared 25 chapters of material organized into five parts, covering basics, problem models, effective solutions, and practical applications of this domain.
Foundations, like sampling, quantile queries, join sizes, frequency moments, distinct value estimation, and so on, are very well explained in six chapters, which form the first part of the book. In addition, the sliding window computation model is considered in detail along with related algorithms and results. Such a model is useful in applications where recent items are more important.
Sampling is a basic operation in providing quick approximate answers. In Bernoulli sampling, each element is included in the sample with the same probability; in Poisson sampling, the probability of inclusion varies by element. Both methods have a drawback that the sample size is not deterministic; hence, various reservoir sampling methods are also presented. As stated in the book, “in general, sampling from a sliding window is a much harder problem than sampling from a stationary window”; various techniques for both schemes are well described.
According to the book, “computing the median, the 99-percentile, or the quartiles of a set are examples of quantile queries.” A general formulation of this problem is in terms of finding an element of a given rank in a sorted set. Two kinds of models are considered here. In a cash-register model, an observation once presented is not removed later from the set. A “model that allows for deletion of elements” is known as the turnstile model. For both models, deterministic and randomized algorithms are included.
“Many problems over streams of data can be modeled as problems over vectors” defined by incremental data updates. We are often interested in finding common attributes between multiple streams, and in “estimating the join size between pairs of relations,” which “corresponds to computing the inner product of their frequency-distribution vectors.” Necessary background is introduced, and algorithms are presented for computing join sizes along with theoretical results on their complexity.
Finding the most frequent items in a data stream is another fundamental problem. Solving this accurately in a single pass requires linear storage. If some small error in the results is tolerable, then several ingenious methods can be developed, and the authors briefly mention those. Methods that update multiple counters for each element are of particular interest, and data structures like count-sketch are described in detail.
Estimating the number of classes in a population is another well-studied problem. Given a dataset where each item is from a universe of possible values, “the number of distinct values (called the zeroth frequency moment) is the number of values from the universe that occur at least once.” Earlier methods based on sampling were unable to provide good error estimates. The Flajolet-Martin (FM) algorithm is the first small-space distinct values estimation algorithm presented. It uses a synopsis consisting of a bit vector, logarithmic in size. Items in the dataset select a random bit of the synopsis and set it with a geometric distribution. The Alon-Matias-Szegedy (AMS) algorithm adapts the FM algorithm to use linear hash functions and provides stronger error guarantees. Theoretical results along with other variants of these algorithms are very well covered in the book. Extensions to other scenarios, updated streams and distributed streams, are also considered.
The rest of the book contains more advanced material. Mining from data streams is covered in Part 2. Several discussions on multi-query processing, queries over distributed streams, and so on are included in Part 3. Some specific stream processing engines and systems are described in Part 4. Finally, descriptions on applications like network monitoring and so on form Part 5.
Overall, this is a very useful and well-written text that can be recommended for students, practitioners, and researchers alike. The subject matter fills in a range of details, starting from the basics (including related mathematical theorems) and progressing to describe recent developments along with adequate references to the existing literature and suggestions for future research.