Emalayan Vairavanathan

Presentation transcript:

Towards a High-Performance and Scalable Storage System for Workflow Applications
Emalayan Vairavanathan
Department of Electrical and Computer Engineering, The University of British Columbia
Speaker notes: To begin, I would like to thank Sathish and Tor for serving on my committee and for reading my thesis.

Background: Workflow Applications
A large number of independent tasks collectively work on a problem.
Common characteristics:
- File-based communication
- Large number of tasks
- Large amount of storage I/O
- Regular data access patterns
Figure: the modFTDock workflow; arrows show files and circles show computation stages.
Speaker notes: Mention what is meant by "a large amount of I/O".
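To make the file-based communication model concrete, here is a minimal sketch (with hypothetical stage names and paths that are not taken from modFTDock) of two workflow stages exchanging data purely through files:

```python
# Minimal sketch of file-based communication between two workflow stages.
# The stage functions, file names, and directory are illustrative only.
from pathlib import Path

def stage_dock(protein: Path, ligand: Path, out: Path) -> None:
    # A task reads its inputs from files and writes its result to a file;
    # the next stage consumes that file, so tasks need no shared memory or
    # message passing and can run on any node that can reach the storage.
    out.write_text(protein.read_text() + ligand.read_text())

def stage_score(docked: Path, out: Path) -> None:
    out.write_text(f"score for {docked.name}\n")

work = Path("shared")                      # stands in for a shared storage mount
work.mkdir(exist_ok=True)
(work / "1a2b.pdb").write_text("protein\n")
(work / "lig.mol").write_text("ligand\n")

stage_dock(work / "1a2b.pdb", work / "lig.mol", work / "1a2b.docked")
stage_score(work / "1a2b.docked", work / "1a2b.score")
```

Because every task touches only files, the choice of storage system determines how well the workflow scales, which is the theme of the rest of the talk.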

Background – ModFTDock on the Argonne Blue Gene/P
- 1.2 M docking tasks
- File-based communication, large I/O volume
- Scale: 40,960 compute nodes
- Observed I/O throughput < 1 MBps per core
Figure: the workflow runtime engine dispatches application tasks to compute nodes (each with local storage); all tasks read and write through a central storage system (e.g., GPFS, NFS).
Speaker notes: Next, explain the storage system design challenges in the context of workflow applications. TODO: verify the bandwidth figure; the I/O rate numbers may be outdated. Also correlate this with the modFTDock diagram.

Background – Central Storage Bottleneck
Figure: core-time distribution for the Montage workflow (512 BG/P CPU cores, GPFS) [Z. Zhang et al., SC'12].
To illustrate the problem, the figure shows the core-time distribution of a Montage benchmark on 512 cores of an IBM BG/P with intermediate results stored on GPFS. Even at this small scale, I/O dominates: 73.6% of core-time is consumed by I/O, task execution takes 13.4%, and 13.0% goes to scheduling overhead and CPU idle time due to workload imbalance. Of the latter, around 39% (5.1% of total time) is idle time caused by a gather operation in which all but one core sit idle while data is fetched from GPFS.
Speaker notes: Add more evidence from Justin and others.

Contributions - Alleviating the storage I/O bottleneck
Intermediate storage system: designed and implemented a prototype, integrated it with the workflow runtime, and evaluated it with applications on BG/P.
- "The Case for Cross-Layer Optimizations in Storage: A Workflow-Optimized Storage System." S. Al-Kiswany, Emalayan Vairavanathan, L. B. Costa, H. Yang, M. Ripeanu. Submitted to FAST '13.
Workflow-aware storage system: identified new data access patterns and studied the viability of a workflow-aware storage system.
- "A Workflow-Aware Storage System: An Opportunity Study." Emalayan Vairavanathan, S. Al-Kiswany, L. B. Costa, Z. Zhang, D. Katz, M. Wilde, M. Ripeanu. CCGRID '12. Acceptance rate: 27%.
- "A Case for Workflow-Aware Storage: An Opportunity Study using MosaStore." Emalayan Vairavanathan, S. Al-Kiswany, A. Barros, L. B. Costa, H. Yang, G. Fedak, D. Katz, M. Wilde, M. Ripeanu. Submitted to the FGCS journal.
MosaStore storage system: experimental platform for other studies.
- "Predicting Intermediate Storage Performance for Workflow Applications." L. B. Costa, A. Barros, Emalayan Vairavanathan, S. Al-Kiswany, M. Ripeanu. Submitted to CCGRID '13.

Intermediate Storage System
Opportunity: underutilized resources on the compute nodes.
Figure: the workflow runtime engine dispatches application tasks to compute nodes; the tasks access an intermediate storage system (aggregated from node-local storage and exposed through a POSIX API), with data staged in from, and staged out to, the central storage system (e.g., GPFS, NFS).
Speaker notes: The performance gain comes mainly from two sources: underutilized network throughput, and lower network latency due to the proximity of the intermediate storage. The system provides a partial POSIX implementation. Assumption: the application is data intensive, i.e., the storage I/O performed during the workflow exceeds the storage I/O performed during stage-in and stage-out; this holds for most applications.
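A minimal sketch of the stage-in / compute / stage-out flow described above; the directory names, file names, and helper functions are assumptions for illustration, not the actual integration with the workflow runtime:

```python
# Sketch of the stage-in / compute / stage-out flow around an intermediate
# storage mount. The directory names stand in for the real mount points
# (e.g., a GPFS path and the aggregated intermediate storage mount).
import shutil
from pathlib import Path

CENTRAL = Path("central_storage")        # stands in for central storage (e.g., GPFS)
INTERMEDIATE = Path("intermediate")      # stands in for the intermediate storage mount
CENTRAL.mkdir(exist_ok=True)
INTERMEDIATE.mkdir(exist_ok=True)
(CENTRAL / "input.dat").write_text("raw input\n")

def stage_in(names):
    for name in names:
        shutil.copy2(CENTRAL / name, INTERMEDIATE / name)

def run_workflow():
    # All workflow I/O, including intermediate files, hits the intermediate
    # storage; the central system sees only stage-in and stage-out traffic.
    data = (INTERMEDIATE / "input.dat").read_text()
    (INTERMEDIATE / "partial.dat").write_text(data.upper())
    (INTERMEDIATE / "final.dat").write_text("final result\n")

def stage_out(names):
    for name in names:
        shutil.copy2(INTERMEDIATE / name, CENTRAL / name)

stage_in(["input.dat"])
run_workflow()
stage_out(["final.dat"])      # only the final output goes back to central storage
```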

Evaluation - modFTDock on Blue Gene/P
Figure callouts: 20-40% improvement; 2x improvement.
Speaker notes: Add animation introducing the next topics.

Contributions - Alleviating the storage I/O bottleneck (outline slide repeated; see the full list above).

A Workflow-Aware Storage System
Opportunities:
- Dedicated intermediate storage
- Exposing data location to the workflow runtime for task scheduling
- Regular data access patterns
Figure: the workflow runtime engine schedules application tasks onto compute nodes; the tasks access a shared, workflow-aware intermediate storage (aggregated from node-local storage, POSIX API), with stage-in/stage-out to the central storage system (e.g., GPFS).
Speaker notes: Assumption: storage nodes and storage clients are co-deployed on the same node. This assumption affects the pipeline and reduce patterns, but other patterns (e.g., replication, scatter, data reuse, data distribution) are unaffected. The high-level takeaway is that the storage can be optimized for workflow applications and that doing so yields significant gains.
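As an illustration of how exposing data location enables location-aware scheduling, the sketch below has the storage report which nodes hold a file and the scheduler prefer one of those nodes; the data structures and node names are assumptions, not the actual MosaStore or Swift interfaces:

```python
# Sketch of location-aware task scheduling; the block map and node list
# are illustrative stand-ins for whatever the storage system exposes.
from typing import Dict, List

# file name -> nodes that hold (most of) its chunks
block_locations: Dict[str, List[str]] = {
    "1a2b.docked": ["node17"],
    "merged.dat": ["node03", "node17"],
}

def schedule(task_input: str, free_nodes: List[str]) -> str:
    """Prefer a free node that already stores the task's input file."""
    for node in block_locations.get(task_input, []):
        if node in free_nodes:
            return node               # data-local execution: no network transfer
    return free_nodes[0]              # fall back to any free node

print(schedule("1a2b.docked", ["node03", "node17"]))  # -> node17
```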

Data Access Patterns in Workflow Applications
- Pipeline: locality and location-aware scheduling
- Broadcast: replication
- Reduce: collocation and location-aware scheduling
- Scatter and gather: block-level data placement
(Patterns reported by Wozniak et al., PDSW'09; Katz et al., Blue Waters; Shibata et al., HPDC'10.)
Speaker notes: Arrows show files and circles show computation stages. Explain why each optimization improves its pattern, and define locality-aware scheduling, collocation, and block-level data placement.
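One way to picture the cross-layer idea behind these per-pattern optimizations is a runtime that tags each file with its expected access pattern so the storage can choose a per-file policy; the sketch below is illustrative only, and the policy values are assumptions rather than the actual system's defaults:

```python
# Illustrative mapping from access pattern to a per-file storage policy.
# The pattern names follow the slide; the policy actions are assumptions.
POLICY = {
    "pipeline":  {"placement": "local",       "replicas": 1},  # write to the producer's node
    "broadcast": {"placement": "any",         "replicas": 4},  # replicate for many readers
    "reduce":    {"placement": "collocate",   "replicas": 1},  # pack inputs on one node
    "scatter":   {"placement": "block-level", "replicas": 1},  # place chunks near consumers
}

def storage_hint(filename: str, pattern: str) -> dict:
    """Return the placement/replication hint a workflow runtime might
    attach to a file it is about to create."""
    hint = dict(POLICY[pattern])
    hint["file"] = filename
    return hint

print(storage_hint("ref.db", "broadcast"))
```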

Data Access Patterns in ModFTDock
Figure: the modFTDock workflow exhibits the broadcast, reduce, and pipeline patterns.

Evaluation - Baselines
The workflow-aware storage system is compared against MosaStore (unmodified intermediate storage), NFS (central storage), and node-local storage.
Figure: compute nodes with local storage run application tasks against a shared intermediate storage, with stage-in/stage-out to the central storage system (e.g., GPFS, NFS).
Speaker notes: Wherever possible, we use node-local ext3 as an optimal baseline.

Evaluation - Platform
- Cluster of 20 machines: Intel Xeon 4-core 2.33-GHz CPU, 4 GB RAM, 1 Gbps NIC, and RAID-1 on two 300 GB 7200-rpm SATA disks.
- Central storage (NFS server): Intel Xeon E5345 8-core 2.33-GHz CPU, 8 GB RAM, 1 Gbps NIC, and six SATA disks in a RAID-5 configuration.
The NFS server is better provisioned than the cluster nodes.
Speaker notes: Only mention that the NFS server is better provisioned; it runs on a machine with six SATA disks in RAID-5.

Evaluation – Benchmarks and Applications
Synthetic benchmark workloads (file sizes per pattern):
- Small:  Pipeline 100 KB, 200 KB, 10 KB | Broadcast 100 KB, 1 KB | Reduce 10 KB, 100 KB
- Medium: Pipeline 100 MB, 200 MB, 1 MB  | Broadcast 100 MB, 1 MB | Reduce 10 MB, 200 MB
- Large:  Pipeline 1 GB, 2 GB, 10 MB     | Broadcast 1 GB, 10 MB  | Reduce 100 MB, 2 GB
Applications (with the workflow runtime engine): Montage and modFTDock.
Speaker notes: There are three synthetic benchmarks: pipeline, broadcast, and reduce. They represent the patterns found in real workflows and make it easy to analyze the performance impact of our techniques. All benchmarks stage the data in from backend storage to intermediate storage, process it, produce the final results on intermediate storage, and then copy the final results back to backend storage.

Synthetic Benchmark - Pipeline
Optimization: locality and location-aware scheduling.
Figure: average runtime for the medium workload; a 3x improvement in workflow time. Arrows show files and circles show computation stages.

Synthetic Benchmark - Broadcast
Optimization: replication.
Figure: average runtime for the medium workload on disk; a 60% improvement in runtime. Arrows show files and circles show computation stages.

Evaluation – Montage
Figure: total application time for the Montage workflow on five different systems; a 10% improvement in runtime.
Speaker notes: Stage-in and stage-out times are not included here.

Contributions - Alleviating the storage I/O bottleneck (outline slide repeated as a summary; the CCGRID '12 paper is noted here as one of the top 15 papers).
Takeaway: storage optimizations have a significant impact on application performance.

THANK YOU

BACKUP SLIDES

Background – Many-Task Workflows
Benefits:
- Reuse of a large amount of legacy code
- Rapid application development
- Portability (from workstations to supercomputers)
- Easy to debug
- Implicit fault tolerance
- Natural expression of parallelism
Speaker notes: Add animation introducing the next topics.

Background – Motivation
- Many-task applications are becoming popular.
- Better utilization of costly hardware and energy savings (a lot of time is spent executing workflow applications).
- Better scalability and higher performance will help solve large problems more accurately.
- A large number of workflow applications are available.
Speaker notes: Add animation introducing the next topics.

Blue Gene/P Architecture
- 40,960 compute nodes (160K cores); torus network, 6.4 Gbps per link
- 640 I/O nodes; tree network (850 MBps x 640)
- 10 Gbps switch complex (10 Gb/s x 128)
- GPFS deployed on 128 file server nodes (3 petabytes storage capacity)

Example Workflow Software Stack
Figure: a Swift script is translated by the Swift compiler into intermediate code, which the workflow runtime engine (e.g., Swift) executes; tasks and notifications flow through a task dispatching service (e.g., Coasters) to the workers, which perform their storage I/O against the shared storage system.
Speaker notes: Add animation introducing the next topics.
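A minimal sketch of the dispatch chain in this stack, with a queue standing in for the task dispatching service and a directory standing in for the shared storage system; the task format and helper names are assumptions, not the Swift or Coasters APIs:

```python
# Sketch of the runtime-engine -> dispatcher -> worker chain. A "task" here
# is just a command name plus input and output file names.
import queue
import threading
from pathlib import Path

SHARED = Path("shared_storage")            # stands in for the shared storage system
SHARED.mkdir(exist_ok=True)

tasks: "queue.Queue[tuple[str, str, str]]" = queue.Queue()

def worker(worker_id: int) -> None:
    while True:
        cmd, infile, outfile = tasks.get()
        data = (SHARED / infile).read_text()            # worker performs storage I/O
        (SHARED / outfile).write_text(f"{cmd}({data.strip()})\n")
        tasks.task_done()                               # notification back to the engine

# The workflow runtime engine would derive this task list from the compiled
# Swift script; here it is hard-coded for illustration.
(SHARED / "in0.dat").write_text("raw data\n")
tasks.put(("project", "in0.dat", "out0.dat"))

for i in range(2):
    threading.Thread(target=worker, args=(i,), daemon=True).start()
tasks.join()
print((SHARED / "out0.dat").read_text())
```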

Intermediate Storage System – MosaStore
- Each file is divided into fixed-size chunks.
- Chunks are stored on the storage nodes.
- The manager maintains a block map for each file.
- A POSIX interface provides access to the system.
Figure: the MosaStore distributed storage architecture.
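A minimal sketch of the per-file block map the manager maintains; the chunk size, field names, and round-robin placement are assumptions for illustration, not MosaStore's actual metadata format or placement policy:

```python
# Illustrative per-file block map kept by the manager: each file is split
# into fixed-size chunks, and each chunk is mapped to a storage node.
from dataclasses import dataclass
from typing import Dict, List

CHUNK_SIZE = 1 << 20  # 1 MB chunks (assumed; a real system makes this configurable)

@dataclass
class BlockMap:
    size: int          # file size in bytes
    chunks: List[str]  # chunk index -> storage node holding that chunk

class Manager:
    def __init__(self) -> None:
        self.files: Dict[str, BlockMap] = {}

    def create(self, path: str, size: int, nodes: List[str]) -> BlockMap:
        n_chunks = (size + CHUNK_SIZE - 1) // CHUNK_SIZE
        # Round-robin placement as a stand-in for the real placement policy.
        placement = [nodes[i % len(nodes)] for i in range(n_chunks)]
        self.files[path] = BlockMap(size=size, chunks=placement)
        return self.files[path]

    def locate(self, path: str, offset: int) -> str:
        """Return the storage node holding the chunk that covers `offset`."""
        return self.files[path].chunks[offset // CHUNK_SIZE]

m = Manager()
m.create("/intermediate/out.dat", 3_500_000, ["node01", "node02"])
print(m.locate("/intermediate/out.dat", 2_200_000))   # -> node01
```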

Contribution - Intermediate Storage System (MosaStore)
- Supports a set of POSIX APIs (random read and write, delete, close)
- Garbage collection
- Replication (eager and lazy)
- Client-side caching
Speaker notes: Add animation introducing the next topics.
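To illustrate the difference between eager and lazy replication listed above, here is a sketch of a write path that either creates replicas synchronously or defers them to a background thread; the function names and queue-based design are assumptions, not the MosaStore implementation:

```python
# Sketch contrasting eager and lazy replication on the chunk write path.
import queue
import threading
from typing import List, Tuple

replication_queue: "queue.Queue[Tuple[bytes, str]]" = queue.Queue()

def store(chunk: bytes, node: str) -> None:
    """Placeholder for sending a chunk to a storage node."""
    pass

def write_chunk(chunk: bytes, nodes: List[str], eager: bool) -> None:
    store(chunk, nodes[0])                        # primary copy, always synchronous
    if eager:
        for node in nodes[1:]:                    # eager: replicas are created before
            store(chunk, node)                    # the write is acknowledged
    else:
        for node in nodes[1:]:                    # lazy: replicas are created later by
            replication_queue.put((chunk, node))  # a background replication thread

def replicator() -> None:
    while True:
        chunk, node = replication_queue.get()
        store(chunk, node)
        replication_queue.task_done()

threading.Thread(target=replicator, daemon=True).start()
write_chunk(b"chunk data", ["node01", "node02", "node03"], eager=False)
replication_queue.join()                          # wait until lazy replicas are done
```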

Viability Study – Changes in MosaStore
- Optimized data placement for the pipeline pattern: priority to local writes and reads
- Optimized data placement for the reduce pattern: collocating files on a single benefactor
- Replication mechanism optimized for the broadcast pattern: parallel replication
- Data block placement for the scatter and gather patterns
Speaker notes: Next, explain the storage system design challenges in the context of workflow applications. Talk outline: Objective and background (what problem am I attacking, an introduction to workflows, why storage-based solutions rather than something else, and why this matters: bottleneck analysis). Challenges (why this problem is hard to attack). Methodology (how I attack the problem, what the success criteria are, and how I evaluate them). Contribution (the impact of my work on the research community; will someone refer to these findings 10 years from now?). Summary (how this work can be generalized, and other future directions).
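The per-pattern placement changes above can be summarized as one placement decision per file; the sketch below captures that decision with assumed node names and replica counts, and is not the actual MosaStore code:

```python
# Sketch of the per-pattern placement decisions listed above. Node names,
# replica counts, and the function signature are illustrative only.
import random
from typing import List

def place_file(pattern: str, writer_node: str, all_nodes: List[str],
               reduce_target: str) -> List[str]:
    """Return the storage node(s) that should hold a new file."""
    if pattern == "pipeline":
        return [writer_node]            # keep data local to its producer/consumer
    if pattern == "reduce":
        return [reduce_target]          # collocate all partial results on one benefactor
    if pattern == "broadcast":
        return all_nodes[:4]            # seed replicas for parallel replication
    # scatter/gather: spread blocks across nodes (block-level placement)
    return random.sample(all_nodes, k=min(4, len(all_nodes)))

nodes = [f"node{i:02d}" for i in range(8)]
print(place_file("reduce", "node05", nodes, reduce_target="node00"))
```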

Evaluation - Synthetic Benchmark on Blue Gene/P
Figure: pipeline benchmark runtime at different scales; a 100% performance gain in application runtime.
Speaker notes: Add animation introducing the next topics.

Synthetic Benchmark - Reduce
Optimization: collocation and location-aware scheduling.
Figure: average runtime for the medium workload; a 2x improvement in runtime. Arrows show files and circles show computation stages.

Synthetic Benchmarks – Small Workload
Figures: reduce benchmark and broadcast benchmark.
Speaker notes: Add animation introducing the next topics.

Evaluation – ModFTDock
Figure: total application time for the modFTDock workflow on three different systems; a 20% improvement in runtime.
Speaker notes: Stage-in and stage-out times are not included here.

Evaluation – Montage, Per-Stage Time
Figure: total application time on five different systems, broken down per stage.
Speaker notes: Stage-in and stage-out times are not included here.