Data cleaning provides an extensive literature review. It showcases the body of work that academia has produced over the last decades on the subject of data cleaning automation. Identifying and correcting dirty data by means of a computer is a task with widespread business and commercial applications; yet the interest generated in academia, and the ensuing practical results, have so far failed to gain much traction for broad application to real-world problems.
Data cleaning is aware of this state of affairs. One consequence is that real-world solutions tend to be piecemeal works, custom-tailored to a specific use case, rather than structured applications of broad techniques supported by a theoretical framework. However, even though it presents a critique of the academic approach, Data cleaning cannot provide an overarching, organic framework, because such a framework for data cleaning does not yet exist.
This book nevertheless forges forward, itemizing in the form of a curated catalogue many technical approaches to different aspects of the automated cleanup of bad data. It is addressed to researchers as well as practitioners with disparate interests in this general subject. For the former, it serves as a summary, with a rich bibliography, of what has already been invented, including sensible suggestions for new directions that need trailblazing and an overview of the pitfalls impairing current results. For the latter, it serves as a comprehensive collection of structured techniques, each briefly described and often accompanied by short partial examples, enabling an assessment of the applicability of each to some practical case that needs an expedient solution.
As a practitioner, I found some of the presentations and examples inadequate as standalone items; that is, it was difficult to decide on the applicability of the underlying techniques to a specific problem. On the other hand, it is simple enough to broaden the reach of an insufficient explanation when necessary. I wish to express my boundless appreciation for how concise Data cleaning is. At a little over 200 pages, the text is understandably too terse to reach into the details of everything. Yet its balance of brevity, completeness, and candor is enabling, to some extent engrossing, and could be hailed as exemplary. The book is a well-structured and comprehensive catalogue and critique of an evolving discipline that, all things considered, still has a way to go. It was published in 2019, though I wish it could have been on my shelves sooner.
The book is organized in eight chapters. The first, “Introduction,” summarizes the context and the principles inspiring the organization and structure of the book. The following two chapters, “Outlier Detection” and “Data Deduplication,” collect and illustrate structured techniques aimed at these respective goals, including means for taming, to some extent, the curse of dimensionality, as well as managing algorithmic complexity with regard to data size. These two chapters encompass known and novel solutions with potentially broad applicability; a reasonable background in statistics will help the reader gain the most from this overview, which covers approximately one third of the book.
Chapter 4, “Data Transformation,” provides a perspective on techniques for employing extract-transform-load (ETL) jobs as data cleanup tools with an increasing level of sophistication. Chapters 5, “Data Quality Rule Definition and Discovery,” and 6, “Rule Based Data Cleaning,” are dedicated to defining, discovering, and applying rules that enable the detection, and possibly correction, of mutually inconsistent states in otherwise coherent bodies of data. Much work in these chapters focuses on entity-relationship databases, with some attention to data stored in different manners and with different (or no) structure. This section, covering approximately one third of the book, offers several techniques that can operate under often limiting constraints. Some are based on formal logic, so a background in this area can enhance the reader’s understanding. Chapter 7, “Machine Learning and Probabilistic Data Cleaning,” examines the applicability of automation to large-scale data cleanup. It examines, in particular, machine learning for deduplication, machine learning for repair, and machine learning for analytics cleanup. The last chapter, “Conclusion and Future Thoughts,” provides commentary on the current state of academic work for automated data cleanup and proposes directions for further investigation.