Since the introduction of Apache Hadoop nearly a decade ago, new tools and methods for analyzing large datasets have evolved rapidly, dramatically improving performance and providing easier and more powerful query languages.
This trend continues, as outlined in Quinto’s guide to three recent advances in data analytics: Kudu, Impala, and enhancements to Spark, all of which are open-source Apache Software Foundation projects. These technologies borrow concepts from relational databases and offer SQL or SQL-like interfaces, which makes their query syntax quicker to learn.
The author first introduces Kudu, a columnar database that supports both Hadoop and Spark infrastructures and uses SQL for transaction queries. He presents numerous well-explained examples and use cases for creating and managing structured Kudu data.
Impala is a high-performance SQL engine for Hadoop. Quinto carefully describes the Impala architecture and again gives many examples of standard SQL queries that take advantage of that architecture’s parallelism.
Several chapters are devoted to describing and using Apache Spark, an in-memory analytical framework that includes SQL support as well as application programming interfaces (APIs) for familiar languages such as Java, R, and Python. Spark is rapidly becoming the most frequently used big data processing framework due to its orders-of-magnitude performance advantages over traditional Hadoop implementations.
The book includes additional chapters on data governance and management, topics often overlooked in other big data references, and adds helpful discussions of data management on cloud computing platforms such as AWS, Azure, and Google Cloud.
In the final chapter, Quinto briefly summarizes six big data case studies, each of which makes use of the technologies presented earlier. For example, British Telecom built a large Hadoop-based data analytics platform that included Spark and Impala, significantly increasing query performance and throughput.
Each chapter of Quinto’s book concludes with extensive references leading to greater detail on the topics discussed. The book assumes familiarity with basic data analytics methods and tools, especially Hadoop, and presents high-level introductions to emerging technologies. It thus serves as a helpful guide for data analytics professionals seeking to keep pace with this dynamic and quickly advancing field.