Computing Reviews
Data sketching
Cormode G.  Queue 15 (2): 49-67, 2017. Type: Article
Date Reviewed: Feb 28 2020

This lively article deals with creating sample datasets for testing as standalone projects, separate from the applications they support. The starting point is simple: live data is generally not suitable for demos because of privacy concerns. The first solution that comes to mind seems just as simple: preload applications with manually generated sample data, and then let salespeople run those applications during their presentations, following some predefined script.

In such a context, sample dataset creation is not a one-time activity, but one that must be repeated over time. This means the requirements for sample datasets become as stringent as those for any other dataset or code. Sample data must adapt to newer data, different database schemas, added program features, and many other similar constraints. As such, sample dataset creation becomes a creative collaboration between marketing, which creates the scripts, and information and communications technology (ICT), which creates the matching datasets.

Here, the author goes a step further. He explains how to generate sample datasets from a function library written in a programming language of choice. Such a library would import data from different sources, filter it, and then automatically create fictitious data according to requirements.
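To make the idea concrete, here is a minimal Python sketch of what such a library might look like. It is not taken from the article: the function names (import_rows, filter_rows, fictionalize), the name pools, and the inline CSV source are all hypothetical, chosen only to illustrate the import, filter, and generate steps.

```python
import csv
import io
import random

# Hypothetical name pools; a real library would load these from a vetted source.
FIRST_NAMES = ["Alex", "Sam", "Jordan", "Taylor", "Morgan"]
LAST_NAMES = ["Rossi", "Nguyen", "Smith", "Okafor", "Larsen"]


def import_rows(csv_text):
    """Import rows from CSV text as a list of dicts (one dict per row)."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def filter_rows(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def fictionalize(rows, seed=0):
    """Replace identifying fields with fictitious values.

    A fixed seed makes the output reproducible, so the same script
    always yields the same sample dataset.
    """
    rng = random.Random(seed)
    result = []
    for i, row in enumerate(rows, start=1):
        fake = dict(row)  # copy, so the imported rows stay untouched
        fake["name"] = f"{rng.choice(FIRST_NAMES)} {rng.choice(LAST_NAMES)}"
        fake["email"] = f"user{i}@example.com"
        result.append(fake)
    return result


# Stand-in for a real data source (assumption: a CSV export).
SOURCE = """name,email,country,orders
Real Person,real@corp.test,IT,12
Another One,another@corp.test,US,3
Third Customer,third@corp.test,IT,7
"""

rows = import_rows(SOURCE)
italian = filter_rows(rows, lambda r: r["country"] == "IT")
sample = fictionalize(italian, seed=42)
```

Non-identifying columns such as country and orders pass through unchanged, so the sample keeps the shape of the source while the personal fields are replaced.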

At first, creating a function library of this sort certainly requires more time than hastily putting some data together, but this time is repaid later, with interest, when the need for a different dataset arises. With this approach, the new dataset need not be built from scratch; it is generated automatically, just by modifying the requirements and constraints within the library functions. Another advantage of this method is that appropriate datasets can be created by anybody, even salespeople (wink), not just programmers. Moreover, such datasets prove useful not only during sales demos, but also during software development. In that case, sample data simply becomes part of the release cycle.
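The point about regenerating a dataset by changing requirements can be sketched as follows. This is an illustration under stated assumptions, not the article's code: the Requirements fields and the generate function are hypothetical names, and the record layout is invented for the example.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Requirements:
    """Hypothetical knobs describing the dataset a demo needs."""
    n_customers: int = 5
    countries: tuple = ("IT", "US")
    max_orders: int = 20
    seed: int = 1


def generate(req):
    """Build a fictitious customer list that satisfies the requirements."""
    rng = random.Random(req.seed)  # seeded, so reruns give identical data
    return [
        {
            "id": i,
            "country": rng.choice(req.countries),
            "orders": rng.randint(0, req.max_orders),
        }
        for i in range(1, req.n_customers + 1)
    ]


base = Requirements()
demo_v1 = generate(base)

# A different demo needs more customers in another market:
# only the requirements change, not the generating code.
demo_v2 = generate(replace(base, n_customers=50, countries=("DE",)))
```

Because the requirements are plain data and the generator is seeded, both can be kept under version control and rerun as part of a build.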

Languages suited to creating such datasets include R, Python, and Perl; supporting environments include Git or any other version control system. Together, these languages and environments are often part of modern continuous integration/continuous delivery (CI/CD) systems.

All this is told in a very enjoyable style, making the whole article not only useful but also entertaining.

Reviewer:  Andrea Paramithiotti Review #: CR146912 (2005-0113)
General (E.0 )
 
 
Linear Approximation (G.1.2 ... )
 
 
Graph Theory (G.2.2 )
 
Other reviews under "General": Date
 Core data analysis: summarization, correlation, and visualization (2nd ed.)
Mirkin B.,  Springer International Publishing, New York, NY, 2019. 540 pp. Type: Book (978-3-030002-70-1), Reviews: (2 of 2)
May 5 2022
Learn RStudio IDE: quick, effective, and productive data science
Campbell M.,  Apress, New York, NY, 2019. 164 pp. Type: Book (978-1-484245-10-1)
May 21 2020
Core data analysis: summarization, correlation, and visualization (2nd ed.)
Mirkin B.,  Springer International Publishing, New York, NY, 2019. 540 pp. Type: Book (978-3-030002-70-1), Reviews: (1 of 2)
Mar 12 2020

Reproduction in whole or in part without permission is prohibited.   Copyright © 2000-2022 ThinkLoud, Inc.