One would hope that professionals calling themselves data scientists would have extensive training in both statistical theory and practice. Yet current data analytics curricula, while naturally including at least one general statistics course, often neglect the integration of those two essential components. This neglect can lead to simplistic explorations, analyses, and modeling of complex datasets.

Kaptein and van den Heuvel teach statistics and data science at the Eindhoven University of Technology and at Tilburg University in the Netherlands, and have developed a new course and textbook for data scientists incorporating a more rigorous foundation in probability and statistics than found in many other popular data science texts.

The authors assume some prerequisite coursework in mathematics and programming, and have taught their course to undergraduate students in computer science, economics, and even social sciences. Their text focuses on the use of modern applied statistical methods and includes the extremely important yet often minimally covered topic of sampling. The book’s extensive examples and exercises use the R language and include numerous datasets for illustrating basic data concepts, sampling and estimation, probability, distributions, multivariate techniques, and Bayesian analysis. The authors’ website for the textbook (http://www.nth-iteration.com/statistics-for-data-scientist/) includes access to the sample datasets, R source code, and recorded whiteboard lectures.

Each chapter begins with a general introduction to the major topic and presents detailed analytical examples using R paired with the relevant theoretical concepts and formulae. The text relies heavily on mathematical notation, which might be a challenge for readers without the prerequisite background. One important chapter covers multivariate exploration and analysis of datasets and the concepts and measures of dependency and association for different data types. The final chapter on Bayesian statistics presents a readable and comprehensive discussion of that approach to estimation and decision-making, although entire libraries have been written on that topic. The authors nicely summarize and illustrate the differences between Bayesian and frequentist probability methods, yet admit that there is much more to learn about them.

Having taught data analytics at the introductory graduate level, I welcome the authors’ textbook as an essential resource for training well-grounded entry-level data scientists. The Data Science Association’s Code of Conduct [1] lists competence as its first requirement:

A data scientist shall provide competent data science professional services to a client. Competent data science professional services requires the knowledge, skill, thoroughness and preparation reasonably necessary for the services.

Training in both the theory and practice of data analytics is a requirement for such competence. The authors’ textbook definitely provides a valuable resource for such training.