This is a slightly expanded new edition of Sakr’s 2016 book. The first edition correctly assessed the trajectory of big data systems, including their initial success and future limits. This second edition adds a chapter on large-scale machine learning and deep learning frameworks.
Overall, this short book is well written and informative. With a plethora of techniques and software in the ecosystem, Sakr has wisely chosen a subset of systems that illustrates the field well. The book is aimed at three distinct audiences: students, researchers, and practitioners. As a survey of the field, aimed at bootstrapping research on several topics, the book offers a set of technologies and references for each chapter, which Sakr has updated to reflect the field as of 2020.
The introductory chapter familiarizes readers with the concepts of big data, cloud computing, storage, processing, and analytics systems. Sakr also presents a short roadmap of the book. Chapter 2 covers general-purpose big data systems. Sakr uses the same systems from the first edition: Hadoop/MapReduce, Spark, Flink, and AsterixDB.
In the third chapter, Sakr introduces the large-scale processing of structured data. With the recognition that SQL is still the main avenue for data management, he illustrates with Hive, Impala, IBM Big SQL, HadoopDB, Presto, Tajo, Google BigQuery, Phoenix, and PolyBase. While having a brief one- or two-page introduction to each is not sufficient to understand their deep capabilities, it is sufficient to form an idea of their domains and provide links to more information.
Chapter 4 targets large-scale graph processing systems with a slant toward academically produced software. While Sakr has added Spark-based systems to the mix, some other dominant solutions are not covered. With the current trend toward analyzing networks and their relationships in multiple domains, the chapter deserves a little more depth.
The fifth chapter covers large-scale stream processing. Here the author holds on to the same systems covered in the first edition: Storm, InfoSphere, and pipelining systems such as Pig Latin and Tez. Dominant Apache streaming projects such as Kafka, Flume, Samza, Spark, NiFi, and Beam are only very briefly mentioned.
The main new addition to the book is chapter 6, “Large-Scale Machine/Deep Learning Frameworks.” While only ten pages long, it introduces emerging material on techniques aimed at uncovering value within accumulated large datasets. Both machine learning frameworks and deep learning frameworks are presented. Leading frameworks such as CNTK, TensorFlow, Keras, and PyTorch are mentioned, but without much detail due to limited space. The final chapter, “Conclusions and Outlook,” outlines the progress made in the field of big data in the last decade as well as the complex problems still to be solved.
As a survey, the book succeeds in raising awareness of the topic and reinforcing the view of its depth. As a research tool, it works as a stepping stone for the curious manager or researcher wanting a short introduction to a wide range of big data areas. An easy read on the topic, it does not require advanced technical or mathematical experience. The target audience is not developers or programmers, who would likely read more in-depth material or texts focused on a particular system. Nevertheless, the book may be captivating even if your shortlist of technologies differs from the material covered; after all, big data has many facets. Its many references provide a solid foundation for further study.