1
Towards a High-Performance and Scalable Storage System for Workflow Applications
Emalayan Vairavanathan
Department of Electrical and Computer Engineering
The University of British Columbia
Speaker notes: I would like to begin by thanking Sathish and Tor for serving on my committee and for reading my thesis.
2
Background: Workflow Applications
A large number of independent tasks collectively work on a problem.
Common characteristics:
- File-based communication
- Large number of tasks
- Large amount of storage I/O
- Regular data access patterns
Figure: modFTDock workflow (arrows show files; circles show computation stages).
Speaker notes: Explain what is meant by a large amount of I/O.
3
Background – ModFTDock in Argonne Blue Gene/P
- 1.2 M docking tasks
- File-based communication
- Large I/O volume
- I/O throughput < 1 MBps per core
Figure (labeled "Scale"): the workflow runtime engine runs application tasks on compute nodes, each with local storage; all nodes share a central storage system (e.g., GPFS, NFS).
Speaker notes: Next, I explain the storage system design challenges in the context of workflow applications. TODO: find out the actual bandwidth; the I/O rate numbers may be too old, so verify them. Also correlate this slide with the modFTDock diagram.
4
Background – Central Storage Bottleneck
Figure: Montage workflow (512 BG/P CPU cores, GPFS). Source: Z. Zhang et al., SC'12.
Speaker notes: To illustrate the problem, Fig. 1 shows the core-time distribution of a Montage benchmark problem on 512 cores of an IBM BG/P with intermediate results stored on GPFS. Even on this small number of cores, I/O dominates: 73.6% of core-time is consumed by I/O, task execution takes 13.4%, and 13.0% is consumed by scheduling overhead and CPU idle time due to workload imbalances. Of that last share, around 39% (5.1% of total core-time) is idle time due to a gather operation, in which all but one core sit idle while data is fetched from GPFS. TODO: add more evidence from Justin and others.
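As a quick sanity check on the figures quoted above, the three core-time shares should sum to 100%, and the gather idle time should be roughly 39% of the 13.0% overhead share. The sketch below only redoes that arithmetic; the variable names are illustrative and do not come from the source.

```python
# Core-time shares quoted on this slide (percent of total core-time).
io_share = 73.6
task_share = 13.4
overhead_share = 13.0  # scheduling overhead plus CPU idle time

# The three shares should account for all core-time.
total = io_share + task_share + overhead_share

# About 39% of the overhead share is idle time caused by the gather operation.
gather_idle_share = overhead_share * 0.39

print(f"total = {total:.1f}%")
print(f"gather idle = {gather_idle_share:.1f}% of total core-time")
```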
5
Contributions - Alleviating storage I/O bottleneck
Intermediate Storage System
- Designed and implemented a prototype
- Integrated with workflow runtime
- Evaluated with applications on BG/P
- The Case for Cross-Layer Optimizations in Storage: A Workflow-Optimized Storage System. S. Al-Kiswany, Emalayan Vairavanathan, L. B. Costa, H. Yang, M. Ripeanu. Submitted - FAST '13.
Workflow-aware Storage System
- Identified new data access patterns
- Studied the viability of a workflow-aware storage
- A Workflow-Aware Storage System: An Opportunity Study. Emalayan Vairavanathan, S. Al-Kiswany, L. B. Costa, Z. Zhang, D. Katz, M. Wilde, M. Ripeanu. CCGRID '12. Acceptance rate: 27%.
- A Case for Workflow-Aware Storage: An Opportunity Study using MosaStore. Emalayan Vairavanathan, S. Al-Kiswany, A. Barros, L. B. Costa, H. Yang, G. Fedak, D. Katz, M. Wilde, M. Ripeanu. Submitted - FGCS Journal.
MosaStore Storage System
- Experimental platform for other studies
- Predicting Intermediate Storage Performance for Workflow Applications. L. B. Costa, A. Barros, Emalayan Vairavanathan, S. Al-Kiswany, M. Ripeanu. Submitted - CCGRID '13.
6
Intermediate Storage System
Opportunities: underutilized resources
Figure: the workflow runtime engine runs application tasks on compute nodes, each with local storage; tasks access the intermediate storage through a POSIX API; data is staged in from, and staged out to, the central storage system (e.g., GPFS, NFS).
Speaker notes: The performance gain may come mainly from the following:
- Underutilized network throughput
- Network latency: proximity of the intermediate storage
- Partial POSIX implementation
Assumption: the application is data intensive, that is, the storage I/O performed during workflow execution exceeds the storage I/O performed during stage-in and stage-out. This is the case for most applications.
7
Evaluation - modFTDock on Blue Gene/P
- 20-40% improvement
- 2x improvement
8
Contributions - Alleviating storage I/O bottleneck
9
A Workflow-aware Storage System
Opportunities:
- Dedicated intermediate storage
- Exposing data location
- Regular data access patterns
Figure: the workflow runtime engine performs task scheduling; application tasks on compute nodes, each with local storage, access a shared, workflow-aware intermediate storage through a POSIX API; data is staged in from, and out to, the central storage system (e.g., GPFS). The intermediate storage is deployed on the compute nodes.
Speaker notes: Assumption: storage nodes and storage clients are co-deployed on a single node. This assumption affects the pipeline and reduce patterns. Other patterns are not affected by it, for example replication, scatter, data reuse, and data distribution. The high-level takeaway is still that one can optimize the storage for workflow applications and observe significant gains.
10
Data Access Patterns in Workflow Applications
Pattern               Optimization
Pipeline              Locality and location-aware scheduling
Broadcast             Replication
Reduce                Collocation and location-aware scheduling
Scatter and Gather    Block-level data placement

In the pattern diagrams, arrows show files and circles show computation stages.
References: Wozniak et al., PDSW'09; Katz et al., BlueWater; Shibata et al., HPDC'10.
Speaker notes: Explain why each optimization improves its pattern. Explain what locality and location-aware scheduling, collocation, and block-level data placement are.
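Location-aware scheduling, named on this slide as an optimization for the pipeline and reduce patterns, can be illustrated with a small sketch: given where each input file is stored, run the task on the node that already holds the most input bytes. This is a minimal illustration under stated assumptions, not the scheduler used in the thesis; `pick_node`, its parameters, and the node and file names are all hypothetical.

```python
from typing import Dict, List


def pick_node(input_files: List[str],
              file_locations: Dict[str, str],
              file_sizes: Dict[str, int],
              nodes: List[str]) -> str:
    """Return the node that already stores the most input bytes for a task."""
    local_bytes = {node: 0 for node in nodes}
    for name in input_files:
        node = file_locations.get(name)
        if node in local_bytes:
            local_bytes[node] += file_sizes.get(name, 0)
    # max() keeps the first node on ties, so the choice is deterministic.
    return max(nodes, key=lambda n: local_bytes[n])


# Hypothetical cluster state: which node stores each file, and file sizes in bytes.
nodes = ["n0", "n1", "n2"]
locations = {"a.dat": "n1", "b.dat": "n2", "c.dat": "n1"}
sizes = {"a.dat": 100, "b.dat": 150, "c.dat": 100}

chosen = pick_node(["a.dat", "b.dat", "c.dat"], locations, sizes, nodes)
print(chosen)
```

In this example `n1` holds 200 of the 350 input bytes, so the task is placed there and only `b.dat` has to cross the network.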
11
Data Access Patterns in ModFTDock
Figure: ModFTDock workflow annotated with the broadcast, reduce, and pipeline patterns.
12
Evaluation - Baselines
MosaStore, NFS, and node-local storage vs. workflow-aware storage
Figure: application tasks on compute nodes, each with local storage, use a shared intermediate storage; data is staged in from, and out to, the central storage system (e.g., GPFS, NFS).
Speaker notes: Wherever possible, we use ext3 as the optimal baseline.
13
Evaluation - Platform
- Cluster of 20 machines: Intel Xeon 4-core, 2.33-GHz CPU, 4-GB RAM, 1-Gbps NIC, and a RAID-1 on two 300-GB 7200-rpm SATA disks
- Central storage (NFS server): Intel Xeon E core, 2.33-GHz CPU, 8-GB RAM, 1-Gbps NIC, and 6 SATA disks in a RAID 5 configuration
- The NFS server is better provisioned
Speaker notes: Only mention that the NFS server is better provisioned: it runs on a machine with 6 SATA disks in a RAID 5 configuration.
14
Evaluation – Benchmarks and Application
Synthetic benchmark

Workload   Pipeline                 Broadcast        Reduce
Small      100 KB, 200 KB, 10 KB    100 KB, 1 KB     10 KB, 100 KB
Medium     100 MB, 200 MB, 1 MB     100 MB, 1 MB     10 MB, 200 MB
Large      1 GB, 2 GB, 10 MB        1 GB, 10 MB      100 MB, 2 GB

Application and workflow runtime engine
- Montage
- modFTDock

Speaker notes: There are three synthetic benchmarks: pipeline, broadcast, and reduce. They represent the patterns found in real workflows and make it easy to analyze the performance impact of our techniques. Every benchmark stages in the data from backend storage to intermediate storage, processes it, and produces the final results on intermediate storage. The final results are then copied from intermediate storage to backend storage.
15
Synthetic Benchmark - Pipeline
Optimization: locality and location-aware scheduling
Figure: average runtime for the medium workload. In the pattern diagram, arrows show files and circles show computation stages.
3x improvement in workflow time
16
Synthetic Benchmarks - Broadcast
Optimization: replication
Figure: average runtime for the medium workload on disk. In the pattern diagram, arrows show files and circles show computation stages.
60% improvement in the runtime
17
Evaluation – Montage
Figure: total application time on five different systems (Montage workflow).
10% improvement in the runtime
Speaker notes: Here we do not have time for stage-in and stage-out.
18
Contributions - Alleviating storage I/O bottleneck
Intermediate Storage System
- Designed and implemented a prototype
- Integrated with workflow runtime
- Evaluated with applications on BG/P
- The Case for Cross-Layer Optimizations in Storage: A Workflow-Optimized Storage System. S. Al-Kiswany, Emalayan Vairavanathan, L. B. Costa, H. Yang, M. Ripeanu. Submitted - FAST '13.
Workflow-aware Storage System
- Identified new data access patterns
- Studied the viability of a workflow-aware storage
- A Workflow-Aware Storage System: An Opportunity Study. Emalayan Vairavanathan, S. Al-Kiswany, L. B. Costa, Z. Zhang, D. Katz, M. Wilde, M. Ripeanu. CCGRID '12. Acceptance rate: 27% (one of the top 15 papers).
- A Case for Workflow-Aware Storage: An Opportunity Study using MosaStore. Emalayan Vairavanathan, S. Al-Kiswany, A. Barros, L. B. Costa, H. Yang, G. Fedak, D. Katz, M. Wilde, M. Ripeanu. Submitted - FGCS Journal.
MosaStore Storage System
- Experimental platform for other studies
- Predicting Intermediate Storage Performance for Workflow Applications. L. B. Costa, A. Barros, Emalayan Vairavanathan, S. Al-Kiswany, M. Ripeanu. Submitted - CCGRID '13.
Storage optimizations have a significant impact on application performance.
19
THANK YOU
20
BACKUP SLIDES
21
Background – Many-task workflows
- Large amount of legacy code
- Rapid application development
- Portability (workstations to supercomputers)
- Easy to debug
- Implicit fault tolerance
- Expression of natural parallelism
22
Background – Motivation
- Many-task applications are becoming popular
- Better utilization of costly hardware and energy savings (a lot of time is spent executing workflow applications)
- Better scalability and higher performance help solve large problems more accurately
- Large number of available workflow applications
23
Blue Gene/P Architecture
- 40,960 compute nodes (160K cores)
- Torus network: 6.4 Gbps per link
- Tree network: 850 MBps x 640
- 640 I/O nodes
- 10 Gbps switch complex
- 10 Gb/s x 128
- GPFS: deployed on 128 file server nodes (3 petabytes of storage capacity)
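The link figures on this slide allow a back-of-the-envelope estimate of the central-storage bandwidth available per core. The sketch below assumes that "10 Gb/s x 128" refers to one 10 Gb/s link per GPFS file server, that each compute node has 4 cores (160K cores over 40,960 nodes), and that units are decimal (1 Gb/s = 125 MB/s). Under those assumptions the result is in line with the "IO throughput < 1MBps / core" figure quoted earlier in the deck, although the source does not say how that figure was obtained.

```python
# Figures taken from the slide; the interpretation of each is an assumption.
compute_nodes = 40_960
cores_per_node = 4             # assumption: 160K cores / 40,960 nodes
io_nodes = 640
tree_mbps_per_io_node = 850    # MB/s per I/O node on the tree network
file_servers = 128
link_gbps = 10                 # Gb/s per file server (assumed meaning of "10 Gb/s x 128")

cores = compute_nodes * cores_per_node

# Aggregate bandwidth of each stage on the path to central storage, in MB/s.
tree_total_mbps = io_nodes * tree_mbps_per_io_node
server_total_mbps = file_servers * link_gbps * 1000 / 8

# The narrower stage bounds the throughput every core can get at once.
bottleneck_mbps = min(tree_total_mbps, server_total_mbps)
per_core_mbps = bottleneck_mbps / cores

print(f"tree network total: {tree_total_mbps} MB/s")
print(f"file server links total: {server_total_mbps:.0f} MB/s")
print(f"per-core share: {per_core_mbps:.2f} MB/s")
```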
24
Example Workflow Software Stack
Figure: the Swift compiler translates a Swift script into intermediate code for the workflow runtime engine (e.g., Swift). The engine passes tasks to a task dispatching service (e.g., Coasters), which exchanges tasks and notifications with the workers. Each worker performs storage I/O against the shared storage system.
25
Intermediate Storage System
MosaStore
- Each file is divided into fixed-size chunks.
- Chunks are stored on the storage nodes.
- The manager maintains a block-map for each file.
- A POSIX interface provides access to the system.
Figure: MosaStore distributed storage architecture
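The chunking and block-map design described on this slide can be sketched in a few lines. This is a toy in-memory model, not MosaStore code: the `Manager` class, the `write_file` and `read_file` helpers, the 4-byte chunk size, and the round-robin placement are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # bytes; tiny on purpose so the example is easy to follow

# Each storage node is modeled as a dict keyed by (filename, chunk index).
Storage = Dict[str, Dict[Tuple[str, int], bytes]]


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[bytes]:
    """Divide a file's contents into fixed-size chunks (the last may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


class Manager:
    """Keeps one block-map per file: chunk index -> storage node."""

    def __init__(self, nodes: List[str]) -> None:
        self.nodes = list(nodes)
        self.block_maps: Dict[str, List[str]] = {}

    def place(self, filename: str, num_chunks: int) -> List[str]:
        # Round-robin placement is an assumption, chosen for simplicity.
        block_map = [self.nodes[i % len(self.nodes)] for i in range(num_chunks)]
        self.block_maps[filename] = block_map
        return block_map


def write_file(manager: Manager, storage: Storage, filename: str, data: bytes) -> None:
    chunks = split_into_chunks(data)
    block_map = manager.place(filename, len(chunks))
    for index, (node, chunk) in enumerate(zip(block_map, chunks)):
        storage[node][(filename, index)] = chunk


def read_file(manager: Manager, storage: Storage, filename: str) -> bytes:
    block_map = manager.block_maps[filename]
    return b"".join(storage[node][(filename, i)] for i, node in enumerate(block_map))


nodes = ["n0", "n1", "n2"]
storage: Storage = {node: {} for node in nodes}
manager = Manager(nodes)

write_file(manager, storage, "out.dat", b"workflow data")
print(manager.block_maps["out.dat"])
print(read_file(manager, storage, "out.dat"))
```

A reader only needs the manager's block-map to locate every chunk, which is what lets the data path bypass the manager once the map is known.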
26
Contribution - Intermediate Storage System
MosaStore Storage System
- Support for a set of POSIX APIs (random read and write, delete, close)
- Garbage collection
- Replication (eager and lazy)
- Client-side caching
27
Viability study – Changes in MosaStore
- Optimized data placement for the pipeline pattern: priority to local writes and reads
- Optimized data placement for the reduce pattern: collocating files on a single benefactor
- Replication mechanism optimized for the broadcast pattern: parallel replication
- Data block placement for the scatter and gather patterns

Speaker notes: The goal is to answer the following questions.
- Objective and background: What problem am I attacking? Introduction to workflows. Why storage-based solutions rather than something else? Why is this important (bottleneck analysis)?
- Challenges: Why is this problem hard to attack?
- Methodology: How am I going to attack this problem? What are the success criteria, and how will I evaluate them?
- Contribution: What is the impact of my work on the research community? Will someone refer to these findings 10 years from now?
- Summary: How can this work be generalized? What are the future directions?
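The per-pattern placement changes listed on this slide can be summarized as a single decision function. The sketch below is an illustration under stated assumptions, not the MosaStore implementation: `place_file` and its parameters are hypothetical, replica targets are picked as ring neighbours of the writer, and scatter/gather blocks are spread round-robin. It only chooses target nodes; the parallel copying itself is not modeled.

```python
from typing import List


def place_file(pattern: str, writer_node: str, nodes: List[str],
               num_blocks: int = 1, replicas: int = 2,
               collocation_node: str = "") -> List[List[str]]:
    """Return, for each block of a file, the nodes that should store it."""
    if pattern == "pipeline":
        # Priority to local writes: the next stage can run where the data is.
        return [[writer_node] for _ in range(num_blocks)]
    if pattern == "reduce":
        # Collocate all inputs of the reduce task on one node (a "benefactor").
        target = collocation_node or nodes[0]
        return [[target] for _ in range(num_blocks)]
    if pattern == "broadcast":
        # Replicate every block so that many readers do not hit a single node.
        # Raises ValueError if writer_node is not in nodes.
        start = nodes.index(writer_node)
        count = min(replicas, len(nodes))
        chosen = [nodes[(start + k) % len(nodes)] for k in range(count)]
        return [list(chosen) for _ in range(num_blocks)]
    if pattern in ("scatter", "gather"):
        # Block-level placement: spread the blocks of one file across nodes.
        return [[nodes[b % len(nodes)]] for b in range(num_blocks)]
    raise ValueError(f"unknown pattern: {pattern}")


nodes = ["n0", "n1", "n2"]
print(place_file("pipeline", "n1", nodes, num_blocks=2))
print(place_file("reduce", "n1", nodes, collocation_node="n2"))
print(place_file("broadcast", "n1", nodes, replicas=2))
print(place_file("scatter", "n0", nodes, num_blocks=4))
```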
28
Evaluation - Synthetic Benchmark on Blue Gene/P
Pipeline benchmark
Figure: runtime at different scales.
100% performance gain in the application runtime
29
Synthetic Benchmarks - Reduce
Optimization: collocation and location-aware scheduling
Figure: average runtime for the medium workload. In the pattern diagram, arrows show files and circles show computation stages.
2x improvement in the runtime
30
Synthetic benchmarks – Small workload
Figures: reduce benchmark; broadcast benchmark.
31
Evaluation – ModFTDock
Figure: total application time on three different systems (ModFTDock workflow).
20% improvement in the runtime
Speaker notes: Here we do not have time for stage-in and stage-out.
32
Evaluation – Montage per stage time
Figure: total application time on five different systems.
Speaker notes: Here we do not have time for stage-in and stage-out.