Computing Reviews
Dataset popularity prediction for caching of CMS big data
Meoni M., Perego R., Tonellotto N. Journal of Grid Computing 16(2): 211-228, 2018. Type: Article
Date Reviewed: Dec 24 2018

The Compact Muon Solenoid (CMS) experiment at the European Organization for Nuclear Research (CERN) is a collaboration between thousands of physicists who share a distributed computing and storage infrastructure consisting of 70 sites. The authors use historical usage data to investigate whether machine learning can produce “predictive models able to forecast which datasets will become popular”; such predictions could improve the caching of CMS data and make data access more efficient. They compare the performance of popularity prediction caching (PPC), a novel data caching policy, to other popular caching policies: “Experiments conducted on large traces of real dataset accesses show that PPC outperforms [least recently used, LRU], [significantly] reducing the number of cache misses.”

Hadoop and Spark were used to analyze data from the CMS logs, focusing on the number of accesses while also considering some of the other logged parameters. Predicting data access was modeled as a binary classification problem: each dataset is classified as either popular or unpopular at a specific time at a specific CMS storage site.
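To make the binary framing concrete, the following is a minimal sketch of how access counts might be turned into popular/unpopular labels. It is not the authors' pipeline: the log layout `(dataset, site, week)`, the function name `label_popular`, and the fixed `threshold` are all assumptions chosen for illustration, and plain Python stands in for the Hadoop and Spark jobs the paper describes.

```python
from collections import Counter

def label_popular(access_log, week, threshold):
    """Label each (dataset, site) pair accessed in `week` as popular or not.

    access_log: iterable of (dataset, site, week) tuples (hypothetical layout).
    threshold:  minimum number of accesses to count as popular (assumed value,
                not taken from the paper).
    """
    # Count accesses per (dataset, site) pair for the requested week only.
    counts = Counter((d, s) for (d, s, w) in access_log if w == week)
    # A pair is "popular" when its access count reaches the threshold.
    return {key: n >= threshold for key, n in counts.items()}
```

Pairs with no accesses in the chosen week do not appear in the result; a real feature pipeline would need to decide how to treat them.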

Due to the enormous size of the data produced at CERN, the CMS storage architecture is organized into tiers; only tier-1 contains all the data. The amount of storage varies from site to site, and each new data transfer carries a significant time cost because of the data size. PPC “optimize[s] the eviction policy implemented at each site.” The model also accounts for data aging, refreshing the meaning of “popular” through a sliding window that is recalculated on a weekly basis.
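The weekly sliding window can be illustrated with a short sketch. The review does not state the window length or how the popularity cutoff is chosen, so `window`, `top_fraction`, and the function name `popular_in_window` are hypothetical; the point is only that a dataset's status depends on recent weeks and can lapse as older accesses fall out of the window.

```python
def popular_in_window(weekly_counts, current_week, window=4, top_fraction=0.5):
    """Return the set of datasets considered popular in the current window.

    weekly_counts: {week: {dataset: access_count}} (hypothetical layout).
    window:        number of weeks in the sliding window (assumed).
    top_fraction:  share of windowed datasets labeled popular (assumed).
    """
    totals = {}
    # Sum accesses over the last `window` weeks, ending at current_week.
    for week in range(current_week - window + 1, current_week + 1):
        for dataset, n in weekly_counts.get(week, {}).items():
            totals[dataset] = totals.get(dataset, 0) + n
    if not totals:
        return set()
    # Rank by windowed access count (ties broken by name for determinism).
    ranked = sorted(totals, key=lambda d: (-totals[d], d))
    keep = max(1, int(len(ranked) * top_fraction))
    return set(ranked[:keep])
```

Calling the function again with the next week number moves the window forward, so a dataset that was heavily used a month ago can lose its popular label without any explicit expiry step.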

Dataset popularity approaches are tailored to the specific needs and data formats of the corresponding research domain. Prior techniques described in the literature were used to optimize classification results.

The authors give a detailed and thorough description of the steps they took while developing this novel approach, and present numerous well-thought-out experiments to determine “optimum” and prove “better.” In the end, PPC was deployed as an enhancement to LRU: before a dataset is evicted by LRU rules, the policy checks whether that dataset is predicted to be popular. This approach is especially useful for CMS sites with limited storage and/or bandwidth.
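The eviction idea described above can be sketched as an LRU cache that consults a popularity predictor before evicting. This is an illustration under stated assumptions, not the deployed PPC code: the class name `PopularityAwareLRU`, the `is_popular` callback, and the fallback to plain LRU when every cached dataset is predicted popular are all choices made for this example.

```python
from collections import OrderedDict

class PopularityAwareLRU:
    """LRU cache that tries not to evict datasets predicted to be popular.

    capacity:   maximum number of cached datasets (assumed to be >= 1).
    is_popular: callable dataset -> bool, standing in for the trained
                classifier (hypothetical interface).
    """

    def __init__(self, capacity, is_popular):
        self.capacity = capacity
        self.is_popular = is_popular
        self.entries = OrderedDict()  # ordered from least to most recently used
        self.misses = 0

    def access(self, dataset):
        """Record an access; return True on a cache hit, False on a miss."""
        if dataset in self.entries:
            self.entries.move_to_end(dataset)  # mark as most recently used
            return True
        self.misses += 1
        if len(self.entries) >= self.capacity:
            self._evict()
        self.entries[dataset] = True
        return False

    def _evict(self):
        # Walk from least to most recently used and evict the first dataset
        # that is not predicted to be popular.
        for candidate in self.entries:
            if not self.is_popular(candidate):
                del self.entries[candidate]
                return
        # Assumed fallback: if everything is predicted popular, use plain LRU.
        self.entries.popitem(last=False)
```

Passing a predictor that always returns False reduces the class to ordinary LRU, which makes it easy to compare miss counts for the two policies on the same access trace.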

This is an excellent paper, especially as an example of how to quantitatively compare results from different algorithms and approaches.

Reviewer:  Jill Gemmill Review #: CR146354 (1903-0079)
 
Grid computing (C.2.4 ... )
Cloud Computing (C.2.4 ... )
Distributed Systems (C.2.4 )
 