“High-performance computing (HPC) is facing an exponential growth in job output dataset sizes,” which will necessitate a “significant commitment of supercomputing center resources--most notably, precious scratch space--in handling data staging and offloading.” In addition, the purge policies that are usually employed aren’t well matched to user needs in this regard.
The authors observe that in HPC environments such as TeraGrid, much of the work is collaborative, with jobs submitted by users from geographically dispersed sites; in many instances, result data must be delivered to several locations. The authors therefore devised a mechanism that exploits these characteristics by performing a collaborative offload of job output data. The data is pushed in chunks to an initial “level 1” set of intermediate nodes, with Reed-Solomon 4:5 coding providing protection against node and link failures. A BitTorrent-style scatter-gather protocol transfers chunks through successive levels of intermediate nodes until the data arrives at nodes near the submission site(s), from which it is retrieved using a pull mechanism.
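The 4:5 redundancy idea can be illustrated with a minimal sketch: the output is split into four data chunks plus one parity chunk, so any single lost chunk can be reconstructed. XOR parity is the one-check-symbol special case of Reed-Solomon; a real system would use a general RS codec, and the function names and chunk handling here are illustrative only.

```python
def encode(data: bytes, k: int = 4) -> list[bytes]:
    """Split data into k equal chunks (zero-padded) plus one XOR parity chunk."""
    size = -(-len(data) // k)  # ceiling division
    chunks = [data[i * size:(i + 1) * size].ljust(size, b"\x00") for i in range(k)]
    parity = bytearray(size)
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return chunks + [bytes(parity)]

def recover(chunks: list[bytes], lost: int) -> bytes:
    """Reconstruct the chunk at index `lost` by XORing the surviving chunks."""
    size = len(chunks[(lost + 1) % len(chunks)])
    out = bytearray(size)
    for i, chunk in enumerate(chunks):
        if i != lost:
            for j, b in enumerate(chunk):
                out[j] ^= b
    return bytes(out)
```

With 16 bytes of input, `encode` yields four 4-byte data chunks and one parity chunk; if any one of the five is lost in transit, `recover` rebuilds it from the other four (the lost chunk's contents are ignored).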
Each intermediate node runs a Network Weather Service (NWS) component, which is used in selecting intermediate nodes and the paths to them. A set of extensions to the Portable Batch System (PBS) job scheduler allows users to nominate, in their submission script, a destination site, the allowable intermediate nodes and their capacities, and a deadline.
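The selection step can be sketched as follows, assuming (as NWS does) that bandwidth is forecast from a history of measurements; the moving-average predictor, node names, and function signatures below are illustrative, not the paper's actual algorithm.

```python
def forecast(history: list[float], window: int = 3) -> float:
    """Predict the next bandwidth value as a moving average over the last
    `window` measurements (one of several simple predictors NWS chooses among)."""
    recent = history[-window:]
    return sum(recent) / len(recent)

def pick_nodes(measurements: dict[str, list[float]], n: int) -> list[str]:
    """Choose the n candidate intermediate nodes with the highest forecast
    bandwidth to the next hop, given per-node measurement histories."""
    return sorted(measurements,
                  key=lambda node: forecast(measurements[node]),
                  reverse=True)[:n]
```

For example, a node with steady 11 MB/s beats one with a single 20 MB/s spike followed by low readings, since the forecast averages recent history rather than trusting the latest sample.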
Testing was done over 22 PlanetLab sites distributed across the US, and the results, presented in tabular and graphical formats, are impressive. I would encourage all who manage or use HPC facilities to take a look.