Presented by Robust Storage Management On Desktop, in Machine Room, and Beyond Xiaosong Ma Computer Science and Mathematics Oak Ridge National Laboratory Department of Computer Science North Carolina State University Contributors Sudharshan Vazhkudai Stephen Scott
Beckerman_ Ma_FL_0611 More data will be produced by PetaFlop machines High-end computing (HEC) storage systems face severe performance and reliability challenges Storage failures: significant contributor to system down time There is a need to reduce I/O related job failures/resubmissions Storage availability challenge – data producer side System# CPUsMTBF/IOutage source ASCI Q hrsStorage, CPU ASCI White hrsStorage, CPU Google reboots/dayStorage, mem Microscopic view 3 to 7% per year of disk failures; 3 to 16% per year of controller failures; up to 12% per year of SAN switches Ten times the rate expected from the disk vendor specification sheets
Beckerman_ Ma_FL_0611 Storage availability challenge – data consumer side Data eventually get analyzed and visualized at desktop machines These are indispensable human interaction devices They are highly limited in both storage capacity and performance Meanwhile, many nearby workstations have unused disk space and idle system resources Large scientific datasets can be striped to donated space Access can be much faster than using local disk However, donated nodes can become unavailable at any time
Beckerman_ Ma_FL_0611 Common point: Need to address availability of transient data Scientific data in HEC settings are different from general-purpose data They are produced and access in a distributed manner Supercomputers are neither the source nor the destination of the data Input data are pre-staged, read-only Output data are offloaded after run, write-once during simulation Output data become read-only during analysis Data are analyzed with temporal and space locality Storage space is a precious shared resource New technology is needed for improving data availability
Beckerman_ Ma_FL_0611 FreeLoader overview Enabling trends: Unused storage: More than 50% of desktop storage is unused Immutable data: Data are usually write-once, read-many, with remote source copies Connectivity: Well connected, secure LAN settings FreeLoader aggregate storage cache: Scavenges O(GB) of contributions from desktops Parallel I/O environment across loosely-connected workstations, aggregating I/O as well as network bandwidth NOT a file system, but a low-cost, local storage solution enabling client-side caching and locality
Beckerman_ Ma_FL_0611 Dealing with unreliable FreeLoader nodes Collective downloading Combination of two techniques: Utilizing multiple nodes in patching missing data Each node retrieving long, contiguous file segment Local data shuffling to desired striping pattern Prefix caching Only storing part of datasets Overlapping tail-patching with serving cached prefix
Beckerman_ Ma_FL_0611 Extending FreeLoader fault-tolerence to machine room scenario Node-local storage can be efficiently aggregated Temporary scratch space to store transient job data Augmenting parallel file system metadata Extend file system metadata to include recovery hints Information regarding “source” and “sink” of user’s job data Sample metadata items: URIs, credentials, etc Metadata automatically extracted from job scripts Enables elegant, automatic data recovery and offloading without manual intervention 7 Ma_FL_0611
Beckerman_ Ma_FL_0611 Online data recovery Motivation Staged input data have natural redundancy and are immutable RAID does not help with I/O node failures Idea: patch data from the staging source How ? Enhance parallel file systems with “recovery metadata” Deploy multiple nodes for parallel patching Preliminary results Large remote I/O requests (256MB) and local shuffling can be overlapped with client activity Data reconstruction from ORNL to PSC shows good scalability 8 Ma_FL_0611
Beckerman_ Ma_FL_0611 Eager offloading of result data Offloading result-data equally important for end-user visualization and interpretation Storage system failure and purging of scratch space can cause loss of result-data Eager offloading: Transparent data migration using destination embedded in metadata Data offloading can be overlapped with computation Can failover to intermediate storage/archives Needs coordination with parallel file system and job management tools 9 Ma_FL_0611
Beckerman_ Ma_FL_0611 Summary: System overview Supercomputer Center End-user/Mirror Sites Parallel file system Archives Source copy of dataset accessed gsiftp://source/dataset Computer nodes I/O access Failure Online reconstruction of staged data Result-data offloading Data Source Caches near by io-node
Beckerman_ Ma_FL_0611 Contacts Xiaosong Ma Assistant Professor, Department of Computer Science North Carolina State University Joint Faculty, Computer Science and Mathematics Oak Ridge National Laboratory (919) Sudharshan S. Vazhkudai Computing & Computational Sciences Directorate Computer Science and Mathematics Oak Ridge National Laboratory (865) Ma_FL_0611