This lively article deals with creating datasets for testing as standalone projects, separate from the applications they support. The starting point is simple: live data is generally unsuitable for demos because of privacy concerns. The first solution that comes to mind seems just as simple: preload applications with manually generated sample data, then let salespeople run those applications during their presentations, following some predefined script.
In such a context, sample dataset creation is not a one-time activity, but one that must be repeated over time. This means requirements for sample datasets become as stringent as those for any other dataset or code. Sample data must adapt to newer data, different database schemas, added program features, and many other similar constraints. As such, sample dataset creation becomes a creative collaboration between marketing, which creates the scripts, and information and communications technology (ICT), which creates the matching datasets.
Here, the author goes a step further. He explains how to generate sample datasets by means of a function library written in a programming language of choice. Such a library would be able to import data from different sources, filter it, and then automatically create fictitious data according to requirements.
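To make the idea concrete, here is a minimal Python sketch of what such a library might look like. It is not taken from the article: the function names (import_rows, filter_rows, fictionalize, build_sample), the inline CSV source, and the column names are all hypothetical and chosen only for illustration.

```python
import csv
import io
import random

# Hypothetical source data; a real library might read a file or a database export.
SOURCE_CSV = """customer_id,name,country,balance
1,Alice Real,DE,1200.50
2,Bob Real,FR,310.00
3,Carol Real,DE,87.25
"""

# Hypothetical pool of placeholder names used instead of real ones.
FAKE_NAMES = ["Customer A", "Customer B", "Customer C", "Customer D"]


def import_rows(text):
    """Import step: parse CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def filter_rows(rows, country):
    """Filter step: keep only the rows for one country."""
    return [row for row in rows if row["country"] == country]


def fictionalize(rows, seed=0):
    """Generation step: replace identifying and sensitive fields with made-up values.

    Only the country is carried over from the source rows; IDs, names,
    and balances are newly generated. The seed makes the output repeatable.
    """
    rng = random.Random(seed)
    result = []
    for index, row in enumerate(rows, start=1):
        result.append({
            "customer_id": index,
            "name": FAKE_NAMES[(index - 1) % len(FAKE_NAMES)],
            "country": row["country"],
            "balance": round(rng.uniform(0, 2000), 2),
        })
    return result


def build_sample(text, country, seed=0):
    """Chain the three steps into one call."""
    return fictionalize(filter_rows(import_rows(text), country), seed)


sample = build_sample(SOURCE_CSV, "DE", seed=42)
for row in sample:
    print(row)
```

In this sketch, the source rows only determine how many records are produced and which country they belong to; no real name or balance survives into the sample.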
At first, creating a function library of this sort certainly requires more time than hastily putting some data together, but this time is repaid later, with interest, when the need for a different dataset arises. With this approach, a new dataset need not be created from scratch; it can be generated automatically, just by modifying requirements and constraints within the library functions. Another advantage of this method is that appropriate datasets can be created by anybody, even salespeople (wink), not just programmers. Moreover, such datasets prove useful not only during sales demos but also during software development, where sample data simply becomes part of the release cycle.
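As a rough illustration of regenerating a dataset by changing requirements rather than starting over, consider the following Python sketch. The Requirements class, its fields, and the generate function are hypothetical names invented for this example and do not come from the article.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Requirements:
    """Hypothetical requirements for a sample dataset."""
    rows: int = 5
    countries: tuple = ("DE", "FR")
    max_balance: float = 1000.0
    seed: int = 0


def generate(req):
    """Create fictitious records that satisfy the given requirements."""
    rng = random.Random(req.seed)  # a fixed seed gives a repeatable dataset
    return [
        {
            "customer_id": i,
            "country": rng.choice(req.countries),
            "balance": round(rng.uniform(0, req.max_balance), 2),
        }
        for i in range(1, req.rows + 1)
    ]


# Dataset for the original demo script.
base = Requirements()
demo = generate(base)

# A later demo needs more rows and an extra country: only the requirements change.
bigger = replace(base, rows=20, countries=("DE", "FR", "IT"))
big_demo = generate(bigger)

print(len(demo), len(big_demo))
```

Because the requirements are plain values, they could be kept under version control next to the application code, and someone who is not a programmer could adjust them without touching the generator itself.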
Languages to create such datasets include R, Python, and Perl. Environments include Git or any other version control system. Together these languages and environments are often part of modern continuous integration/continuous delivery (CI/CD) systems.
All this is told in a very enjoyable style, making the whole article not only useful but also entertaining.