MosaStore - A Versatile Storage System
Lauro Costa, Abdullah Gharaibeh, Samer Al-Kiswany, Matei Ripeanu, Emalayan Vairavanathan (and many others from UBC, ANL, ORNL)
Networked Systems Laboratory (NetSysLab), University of British Columbia
A golf course … … a (nudist) beach (… and 199 days of rain each year)
The Landscape
Platforms: Supercomputers, Desktop Grids, Cloud Computing (diverse platform capabilities)
Workloads: Workflows, Checkpointing, Data Analysis (diverse workload characteristics)
In between: Storage System Middleware
Challenge: Design an efficient storage system middleware
Motivation: Underprovisioned storage systems on many HPC platforms (e.g., BlueGene/P at ANL)
Diagram labels: 10 Gb/s Switch Complex; GPFS, 24 servers; IO rate: 8 GBps = 51 KBps / core; 2.5K IO Nodes, 850 MBps per 64 nodes; 160K cores; Hi-Speed Network, 2.5 GBps per node
The shared storage is a bottleneck.
There are underutilized resources close to the application.
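The per-core figure on this slide can be checked with a short calculation. The sketch below assumes binary units (1 GB = 1024^3 bytes, 160K cores = 160 x 1024), which the slide does not state. The function name is illustrative only.

```python
# Back-of-the-envelope check of the per-core I/O rate quoted on the slide.
# Assumption (not stated on the slide): binary units throughout.
GIB = 1024 ** 3
KIB = 1024


def per_core_rate_kib(aggregate_bytes_per_s: float, cores: int) -> float:
    """Return the fair-share I/O rate per core in KiB/s."""
    return aggregate_bytes_per_s / cores / KIB


gpfs_rate = 8 * GIB      # 8 GBps aggregate GPFS I/O rate (from the slide)
cores = 160 * 1024       # 160K cores (from the slide)
share = per_core_rate_kib(gpfs_rate, cores)
print(f"{share:.1f} KiB/s per core")
```

Under these assumptions the fair share comes out at roughly 51 KB/s per core, which matches the slide's figure and illustrates why the shared file system becomes the bottleneck at this scale.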
Solution: a temporary shared datastore
Diagram labels: 10 Gb/s Switch Complex; GPFS, 24 servers; IO rate: 8 GBps = 51 KBps / core; 2.5K IO Nodes, 850 MBps per 64 nodes; 160K cores; Shared data-store, 2.5 GBps per node
Nodes dedicated to an application.
Storage system coupled with the application’s execution.
Benefits
Diagram labels: 10 Gb/s Switch Complex; GPFS, 24 servers; IO rate: 8 GBps = 51 KBps / core; 2.5K IO Nodes, 850 MBps per 64 nodes; 160K cores; Shared data-store, 2.5 GBps per node
Storage closer to the application.
Ability to specialize.
Evaluation: Harnessing ‘Close to Application’ Underutilized Resources
Workflow: DOCK6. Results at 8K cores.
Stage 1: Read input, compute, and write temporary results. Storage optimization: cache the input data. Result: 1.06x
Stage 2: Summarize, sort, and select. Storage optimization: cache temporary files. Result: 11.76x
Stage 3: Archive. Storage optimization: asynchronously flush results to GPFS. Result: 1.51x
Overall: 1.52x
Exploiting the underutilized resources can substantially improve storage system performance.
Zhang et al., “Design and Evaluation of a Collective I/O Model for Loosely-coupled Petascale Programming”, MTAGS ’08.
Evaluation: Specialization
MosaStore throughput at larger scale (pool of 35 nodes)
Experiment by: Henry Monti (Virginia Tech) on a Cray XT4 cluster at ORNL
Deduplication benefits a checkpointing workload:
3x higher throughput
25-70% less storage space and network effort
Scales to hundreds of clients
Specialization can substantially improve storage system performance.
[S. Al-Kiswany, M. Ripeanu, S. Vazhkudai, A. Gharaibeh, “stdchk: A Checkpoint Storage System for Desktop Grid Computing”, ICDCS ’08]
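To show why deduplication helps a checkpointing workload, here is a minimal sketch of content-addressed chunk storage. It is not MosaStore's or stdchk's implementation. The class name, the fixed-size chunking, the SHA-1 digest, and the tiny chunk size are all assumptions chosen to keep the example short.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self, chunk_size: int = 4):  # tiny chunk size, for illustration
        self.chunk_size = chunk_size
        self.chunks = {}        # digest -> chunk bytes (stored once)
        self.files = {}         # file name -> ordered list of chunk digests
        self.logical_bytes = 0  # bytes the clients asked to write

    def put(self, name: str, data: bytes) -> None:
        recipe = []
        for off in range(0, len(data), self.chunk_size):
            chunk = data[off:off + self.chunk_size]
            digest = hashlib.sha1(chunk).hexdigest()
            # Only new content is stored; repeated chunks cost nothing extra.
            self.chunks.setdefault(digest, chunk)
            recipe.append(digest)
        self.files[name] = recipe
        self.logical_bytes += len(data)

    def get(self, name: str) -> bytes:
        return b"".join(self.chunks[d] for d in self.files[name])

    def stored_bytes(self) -> int:
        return sum(len(c) for c in self.chunks.values())


store = DedupStore()
store.put("ckpt-1", b"AAAABBBBCCCC")
store.put("ckpt-2", b"AAAABBBBDDDD")  # successive checkpoint, mostly unchanged
saved = 1 - store.stored_bytes() / store.logical_bytes
print(f"logical={store.logical_bytes} stored={store.stored_bytes()} saved={saved:.0%}")
```

Successive checkpoints of the same application often share most of their content, so only the changed chunks need to be stored or sent over the network. That is the effect behind the space and network savings reported on the slide.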
Summary so far
MosaStore: a versatile storage architecture that:
Exploits underutilized resources ‘close’ to the application.
Supports specialization and configurability.
Is configured at deployment time.
Has a deployment lifetime coupled with that of the target application.
[S. Al-Kiswany, A. Gharaibeh, M. Ripeanu, “The Case for a Versatile Storage System”, HotStorage ’09]
MosaStore - Storage System Prototype
Goals: (1) exploration platform, and (2) support for large-scale computational science research projects.
Versatile Storage: Configurable and extensible storage system that can be specialized for a broad set of apps. [ICDCS ’08, HotStorage ’09]
StoreGPU: How to harness massively multicore processors to support storage system operations? [HPDC ’08, JoCC ’09, IPCCC ’09, HPDC ’10]
Cross-layer Optimizations (CMFS API): Can one enable cross-layer optimizations? [HPDC HotTopics ’08, CCGrid ’12, WSLF ’11]
Automating config. choice: How do I choose a good configuration for my application? [ERSS ’11, GRID ’10]
Application -> Storage System: Applications can present hints on the desired use of the data, e.g., desired replication levels, caching, data importance, etc.
Storage System -> Application: Storage can expose storage-level attributes, e.g., file location characteristics, file health status.
Today: applications and storage systems treat data items uniformly.
Opportunity: additional information can enable differentiated treatment of data items.
Diagram labels: POSIX API; Custom Metadata
Our use case: a workflow-aware file system
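The two information flows on this slide can be sketched with a toy in-memory store that keeps per-file key/value metadata. Everything here is hypothetical: the class, the tag names ("replication", "location"), and the placement policy are illustrative assumptions, not the MosaStore or CMFS API.

```python
class TaggedStore:
    """Toy store where per-file custom metadata carries cross-layer hints."""

    def __init__(self, nodes):
        self.nodes = list(nodes)
        self.data = {}       # path -> bytes
        self.tags = {}       # path -> dict of application-supplied hints
        self.placement = {}  # path -> nodes holding a replica

    def set_tag(self, path, key, value):
        """Application -> storage: attach a hint to a file."""
        self.tags.setdefault(path, {})[key] = value

    def get_tag(self, path, key, default=None):
        """Storage -> application: 'location' is computed by the store."""
        if key == "location":
            return list(self.placement.get(path, []))
        return self.tags.get(path, {}).get(key, default)

    def write(self, path, payload):
        # The store consults the hint instead of treating all files uniformly.
        replicas = int(self.tags.get(path, {}).get("replication", 1))
        replicas = max(1, min(replicas, len(self.nodes)))
        self.data[path] = payload
        self.placement[path] = self.nodes[:replicas]


store = TaggedStore(["n1", "n2", "n3"])
store.set_tag("/scratch/input.dat", "replication", 2)  # hint from the application
store.write("/scratch/input.dat", b"payload")
store.write("/scratch/tmp.dat", b"payload")            # no hint: default treatment
print(store.get_tag("/scratch/input.dat", "location"))
print(store.get_tag("/scratch/tmp.dat", "location"))
```

A scheduler could read the "location" attribute to run a task on a node that already holds the file. This is the kind of storage-to-application flow the slide describes.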
Workflow Applications
Figure: Montage workflow. Source: [Zhao et al., 2012]. 512 BG/P cores, GPFS intermediate file system.
File-based communication
Irregular and application-dependent data access
s of process, runs for weeks
Generate large I/O volumes (100 TB cumulative).
I/O Patterns in Workflow Applications
Pipeline, Broadcast, Reduce, Scatter, Gather
Wozniak et al., “Case studies in storage access by loosely coupled petascale applications”, PDSW 2009
Application: Montage
Figure: Montage workflow stages annotated with I/O patterns.
Stage 5: Reduce pattern
Stages 6, 7, 8: Pipeline pattern
Stage 9: Pipeline pattern
Stage 10: Reduce pattern
I/O Patterns and Storage Optimizations
Pattern      Optimizations
Pipeline     Locality-aware scheduling
Broadcast    Replication
Reduce       Data placement; locality-aware scheduling
Scatter      Block-level placement; locality-aware scheduling
Gather       Block-level co-placement; locality-aware scheduling
Data-item-specific patterns and optimizations!
Need for information flows in both directions.
Idea: cross-layer communication to support this.
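The pattern-to-optimization pairing on this slide can be written as a simple lookup table. This is only a sketch: the function names and the idea of tagging each file path with a pattern string are assumptions for illustration, not an interface from the talk.

```python
# Pattern -> optimizations, as listed on the slide.
OPTIMIZATIONS = {
    "pipeline": ["locality-aware scheduling"],
    "broadcast": ["replication"],
    "reduce": ["data placement", "locality-aware scheduling"],
    "scatter": ["block-level placement", "locality-aware scheduling"],
    "gather": ["block-level co-placement", "locality-aware scheduling"],
}


def optimizations_for(pattern: str) -> list:
    """Return the optimizations for a pattern; unknown patterns get none."""
    return list(OPTIMIZATIONS.get(pattern.strip().lower(), []))


def plan(file_patterns: dict) -> dict:
    """Map each file (hypothetical path -> pattern tag) to its optimizations."""
    return {path: optimizations_for(p) for path, p in file_patterns.items()}


hints = {
    "/wf/stage5.out": "Reduce",
    "/wf/stage6.out": "pipeline",
    "/wf/params.cfg": "broadcast",
    "/wf/misc.log": "unknown",
}
for path, opts in plan(hints).items():
    print(path, "->", opts or ["default (uniform) treatment"])
```

Because the table is keyed per data item, different files in the same workflow can receive different treatment. This is why the slide calls for per-file information to flow between the application and the storage layer.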
A workflow-aware file system
Thesis: cross-layer communication supported by file-level metadata is the key mechanism to enable a workflow-aware file system.
Progress so far: promising evaluation of potential gains (CCGrid ’12).
Next step: build the system and evaluate it with applications (SC ’12?).
MosaStore - Storage System Prototype
Goals: (1) exploration platform, and (2) support for large-scale computational science research projects.
Versatile Storage: Configurable and extensible storage system that can be specialized for a broad set of apps. [ICDCS ’08, HotStorage ’09]
StoreGPU: Harnessing massively multicore processors to support storage system operations. [HPDC ’08, JoCC ’09, IPCCC ’09, HPDC ’10]
Cross-layer Optimizations (CMFS API): Enabling bidirectional cross-layer optimizations. [HPDC HotTopics ’08, CCGrid ’12, WSLF ’11]
Automating config. choice: How do I choose a good configuration for my application? [ERSS ’11, GRID ’10]
Thank you