This video is an hour-long talk by Michael Stonebraker on potential disruptions in the world of big data. “Wait a minute!” you might say, “I thought big data was the disrupter. How can it be disrupted as well?” Well, it can. And in this extremely informative talk, Stonebraker lays out the case for disruption in big data clearly and economically, and with the erudition and insights of someone who has been a pundit and player in the world of databases for over four decades. I watched this video three times in preparation for this review and could easily watch it again. This is not because it is hard to understand; it is very clear and easy to follow. It is because it is packed with so many useful insights and opinions that I didn’t want to miss or forget any of its important points. A couple of the highlights follow.
We often define big data as data that exceeds the capacity of existing relational technology. As a result, we see relational databases and NoSQL databases as being on separate development paths. Not so fast, says Stonebraker. The performance problem is at the physical level. Today’s relational databases are row store databases, whereas NoSQL databases are column store databases. Storing data by column at the physical level can improve throughput by orders of magnitude for queries that read only a few columns. Another distinction between relational databases and NoSQL databases is that relational databases have a SQL front end whereas NoSQL databases rely on parallel procedural frameworks such as Hadoop with Java. But that is beginning to change as well. Over time, NoSQL databases will adopt SQL while relational databases will migrate to column store at the physical level. As this happens, the distinction between relational databases and NoSQL databases will disappear. This is surprising at first glance. But, as Stonebraker explains it, it makes perfect sense. The disruption here will be among the vendors of database technology. The vendor or vendors who figure this out the soonest will push the others out of the lead, maybe even out of the game. Further disruption will occur in organizations or applications that bet on the wrong horse.
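For readers who have not met the row store versus column store distinction, the following is a minimal sketch of the idea, not anything shown in the talk. The `Order` record, the sample values, and the function names are all hypothetical. It shows that summing one field in a row layout visits every whole record, while a column layout reads only the one list it needs.

```python
from dataclasses import dataclass


@dataclass
class Order:
    # Hypothetical record type, used only for illustration.
    order_id: int
    customer: str
    amount: float


# Row layout: each record keeps all of its fields together.
rows = [
    Order(1, "ann", 10.0),
    Order(2, "bob", 25.5),
    Order(3, "ann", 4.5),
]


def to_columns(records):
    """Pivot a list of records into one list per field (column layout)."""
    return {
        "order_id": [r.order_id for r in records],
        "customer": [r.customer for r in records],
        "amount": [r.amount for r in records],
    }


columns = to_columns(rows)


def total_amount_row_store(records):
    # Every whole record is visited, even though only one field is used.
    return sum(r.amount for r in records)


def total_amount_column_store(cols):
    # Only the "amount" column is read; the other columns are never touched.
    return sum(cols["amount"])


print(total_amount_row_store(rows))
print(total_amount_column_store(columns))
```

Both functions return the same total. In a real database the difference shows up as disk and memory traffic: a wide table with hundreds of fields forces a row store to read all of them, while a column store reads only the columns a query names.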
While a unified technology platform is a big plus for data applications, there is also a huge barrier. Along with the tremendous growth in the volume of data, a similar growth has occurred in the complexity of data integration. In the good old days, we had to worry about integrating data from two different departments, such as a customer file from the marketing department and a customer file from the shipping and receiving department. But integrating two files is child’s play compared to integrating files from tens of thousands of sources. These “data lakes,” to use the current buzzword, are data swamps according to Stonebraker. And the 800-pound gorilla in the room is the data integration problem.
Stonebraker provides the clearest vision of the future of big data that I have seen anywhere. The video is only an hour long and is well worth watching several times. So, anyone who works with data, wants to work with data, wants to keep working with data, or even just uses the word “data” should watch this video. It will save you a lot of time that could be wasted making bad decisions.