How Much SSD Is Useful For Resilience In Supercomputers

Aiman Fang (1) and Andrew A. Chien (1,2)
(1) The University of Chicago, (2) Argonne National Laboratory
Fault Tolerance at Extreme Scale (FTXS) at HPDC 2015, Portland, Oregon, June 15, 2015

Outline
- Motivation & Problem
- Main Contributions
- Modeling & Case Studies
- Related Work
- Summary and Future Work

Motivation: Checkpoint/Restart and Its Bottleneck
- Today's checkpointing time: ~30 minutes [1], versus a future system MTBF of less than one hour.
- Parallel file system bandwidth today: ~100 GB/s, versus future bandwidth demands in the TB/s range [2].
- Hard disk drive (HDD) bandwidth is therefore a critical performance bottleneck.

[1] F. Cappello, 2009, Fault tolerance in petascale/exascale systems: current knowledge, challenges and opportunities.
[2] N. Liu, 2012, On the role of burst buffers in leadership-class storage systems.

Motivation: Burst Buffer Systems
- Burst buffers are a high-bandwidth storage tier between compute nodes and disk storage, typically built from solid state drives (SSDs).
- They drain checkpoints quickly, decoupling checkpoint writes from parallel file system bandwidth.

Motivation: SSD Characteristics
- High bandwidth: ~4x that of HDD.
- Limited write/erase cycles: 10^4 - 10^5.
- Relatively high cost: 6-7x that of HDD.

Intel SSD DC S3710 specifications:

  Capacity                      200 GB     400 GB     800 GB     1.2 TB
  Sequential write              300 MB/s   470 MB/s   460 MB/s   520 MB/s
  Endurance                     3.6 PB     8.3 PB     16.9 PB    24.3 PB
  Lifetime at full write rate   138 days   204 days   425 days   540 days
  MSRP                          $309       $619       $1,249     $1,909
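The "lifetime at full write rate" row is simply endurance divided by sequential write bandwidth; for the 800 GB model:

\[
\frac{16.9\ \mathrm{PB}}{460\ \mathrm{MB/s}} = \frac{1.69\times 10^{16}\ \mathrm{B}}{4.6\times 10^{8}\ \mathrm{B/s}} \approx 3.67\times 10^{7}\ \mathrm{s} \approx 425\ \mathrm{days}
\]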

Motivation: An Example
Summit, Oak Ridge National Laboratory's next-generation supercomputer:
- ~3,400 nodes, each with 512 GB memory and 800 GB NVRAM as burst buffers.
- Suppose we use half of the burst buffers for checkpointing and write them at full rate (500 MB/s x 3,400 nodes).
- Total SSD lifetime available for checkpointing: (800 GB x 3,400 x 1/2) x 10^4 cycles = 13,600 PB.
- Write time to exhaust it: 13,600 PB / (500 MB/s x 3,400) = 8x10^6 seconds = 92 days.
- Annual replacement cost: $1,500 x 3,400 x (365/92) = ~20 million dollars.

Effective use of SSD lifetime is important!
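The arithmetic is easy to sanity-check. A minimal Python sketch using only the figures quoted on this slide (the 10^4 write/erase cycles and the $1,500 per-node SSD cost are the slide's assumptions):

    # Sanity-check of the Summit burst-buffer arithmetic above.
    NODES = 3400
    SSD_BYTES = 800e9          # 800 GB NVRAM per node
    CYCLES = 1e4               # write/erase endurance (conservative end)
    CKPT_FRACTION = 0.5        # half the burst buffer used for checkpoints
    WRITE_BW = 500e6           # full write rate per node, bytes/s

    lifetime = NODES * SSD_BYTES * CKPT_FRACTION * CYCLES
    print(f"lifetime for checkpointing: {lifetime / 1e15:,.0f} PB")  # 13,600 PB

    seconds = lifetime / (WRITE_BW * NODES)                          # 8.0e6 s
    print(f"exhausted after: {seconds / 86400:.1f} days")            # ~92 days

    annual_cost = 1500 * NODES * (365 * 86400 / seconds)
    print(f"annual replacement cost: ${annual_cost / 1e6:.1f}M")     # ~$20M

Note that the pool of lifetime works out to 13,600 PB, not TB; the slide's own write-time figure (8x10^6 seconds) only follows from the petabyte value.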

Problem #1: Allocation
For a set of jobs, how should SSD lifetime be allocated to maximize efficiency? Allocating less lifetime to a job means fewer checkpoints and hence more work lost per failure; allocating more consumes the shared lifetime budget.

Figure: an example of SSD lifetime allocation across two jobs.

Problem #2: Provisioning
Given a supercomputer with a particular error rate, how much SSD lifetime is worth buying? Over-provisioning means paying for lifetime no one uses; under-provisioning trades efficiency for SSD cost. The key question: how much efficiency does each increment of SSD lifetime buy?

Figure: job or system efficiency versus provisioned SSD lifetime (GB); efficiency saturates (around 98%) as provisioning grows, creating opportunities for cost-effective under-provisioning.

Main Contributions
A model to determine optimal SSD lifetime allocation for a variety of objectives:
- job-size fairness (size-based allocation)
- equal job efficiency (job-efficiency based allocation)
- maximum system efficiency (system-efficiency based allocation)

Key findings:
- A global perspective is required for SSD lifetime allocation; otherwise system and job efficiency suffer.
- With size-based and system-efficiency based allocation, large jobs suffer 40% lower job efficiency than small jobs.
- Job-efficiency based allocation eliminates job-size unfairness, but must allocate 50% more lifetime to large jobs.
- Job-efficiency based allocation's fairness comes at a cost, decreasing system efficiency by as much as 14%.
- On cost-effective provisioning: only 10-20% of the optimal lifetime is needed to achieve 90% system efficiency, even at failure rates three times those of current systems.

Modeling Systems and Jobs: Wall Clock Time
Following Young (1974) and Daly (2006), with interval checkpointing:

  wall clock time = solve time + dump time + rework time

The number of checkpoints is the solve time divided by the checkpoint interval; the number of failures is the failure rate α times the wall clock time; each failure incurs rework.
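The slide's equations were images and did not survive the transcript. A plausible reconstruction from the definitions above, following the usual first-order treatment (the averaged rework term, half an interval per failure, is a standard assumption rather than a detail the slide confirms):

\[
T_w = \underbrace{T_s}_{\text{solve}} + \underbrace{\frac{T_s}{\tau}\,\delta}_{\text{dump}} + \underbrace{\alpha\,T_w\,\frac{\tau+\delta}{2}}_{\text{rework}}
\quad\Longrightarrow\quad
T_w = \frac{T_s\,(1+\delta/\tau)}{1-\alpha(\tau+\delta)/2}
\]

where \(T_s\) is the solve time, \(\tau\) the checkpoint interval, \(\delta\) the time to write one checkpoint, and \(\alpha\) the failure rate: the job writes \(T_s/\tau\) checkpoints, suffers \(\alpha T_w\) failures, and loses on average half an interval of work per failure.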

Modeling Systems and Jobs: Many Jobs on a System
- A large-scale system with N nodes and a failure rate of λ failures per hour.
- The system has a limited SSD lifetime budget of L gigabytes.
- In a workload, M jobs run concurrently on the system.
- The SSD lifetime l_i used by job i depends on its solve time T_s,i, its checkpoint interval τ_i, and its checkpoint size s_i; the per-job lifetimes must respect the system's budget.
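In symbols (assembled from the quantities the slide defines in words):

\[
l_i = \frac{T_{s,i}}{\tau_i}\, s_i, \qquad \sum_{i=1}^{M} l_i \le L
\]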

Modeling Systems and Jobs: The Allocation Problem
Without a resource constraint, each job checkpoints at the optimal interval given by Young's formula, where δ is the time to write one checkpoint and M is the MTBF of the job; this in turn determines the job's optimum SSD lifetime. With the system's resource constraint, how do we decide each job's SSD lifetime and checkpoint interval?
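Young's formula, and the unconstrained-optimum lifetime that follows from the lifetime expression above (here M denotes the job's MTBF, not the number of jobs):

\[
\tau_{\text{opt}} = \sqrt{2\,\delta\,M}, \qquad l_i^{\text{opt}} = \frac{T_{s,i}}{\tau_{\text{opt},i}}\, s_i
\]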

Size-Based (SB) Allocation
SB allocates SSD lifetime in proportion to job size (number of nodes).
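The formulation image is lost from the transcript; the natural reading of "proportional to job size", with \(n_i\) the node count of job \(i\), is:

\[
l_i = L \cdot \frac{n_i}{\sum_{j=1}^{M} n_j},
\]

with each job then checkpointing at the largest interval its allocation permits, \(\tau_i = T_{s,i}\, s_i / l_i\).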

Job-Efficiency Based (JEB) Allocation
JEB allocates SSD lifetime such that job efficiencies are equalized within a workload, where a job's efficiency is its solve time divided by its wall clock time. The resulting equalization problem is solved numerically with Newton's method, in logarithmic time.
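A minimal sketch of the equal-efficiency search, built on the wall-clock model reconstructed earlier. The paper reports Newton's method with logarithmic complexity; bisection is shown here for simplicity, and all names and the job-tuple format are illustrative, not from the paper:

    import math

    def wall_clock(Ts, tau, delta, alpha):
        # First-order model: T_w = Ts*(1 + delta/tau) / (1 - alpha*(tau+delta)/2)
        return Ts * (1 + delta / tau) / (1 - alpha * (tau + delta) / 2)

    def efficiency(Ts, tau, delta, alpha):
        # Job efficiency: solve time over wall clock time.
        return Ts / wall_clock(Ts, tau, delta, alpha)

    def tau_for_efficiency(e, Ts, delta, alpha):
        """Checkpoint interval >= Young's optimum at which a job's efficiency
        drops to e (efficiency falls monotonically past the optimum)."""
        tau_opt = math.sqrt(2 * delta / alpha)   # Young's formula, MTBF = 1/alpha
        if efficiency(Ts, tau_opt, delta, alpha) <= e:
            return tau_opt                       # e unreachable: use the optimum
        lo, hi = tau_opt, 0.999 * (2 / alpha - delta)  # hi: just below blow-up
        for _ in range(60):
            mid = (lo + hi) / 2
            if efficiency(Ts, mid, delta, alpha) > e:
                lo = mid
            else:
                hi = mid
        return hi

    def jeb_allocate(jobs, L):
        """jobs: list of (Ts, delta, alpha, ckpt_size); L: lifetime budget.
        Bisect on the common efficiency e: the lifetime consumed,
        sum_i (Ts_i / tau_i(e)) * s_i, grows monotonically with e."""
        def lifetimes(e):
            return [Ts / tau_for_efficiency(e, Ts, d, a) * s
                    for Ts, d, a, s in jobs]
        # The common target is bounded by the weakest job's efficiency
        # at its own optimum interval.
        lo = 1e-6
        hi = min(efficiency(Ts, math.sqrt(2 * d / a), d, a)
                 for Ts, d, a, _ in jobs)
        for _ in range(60):
            e = (lo + hi) / 2
            if sum(lifetimes(e)) <= L:
                lo = e
            else:
                hi = e
        return lifetimes(lo)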

System-Efficiency Based (SEB) Allocation
SEB allocates SSD lifetime so as to maximize system efficiency.
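Both the definition and the formulation appeared as images. A plausible reconstruction, reading system efficiency as node-weighted useful work over node-weighted machine time (an assumption, though consistent with the metric's description later):

\[
E_{\text{sys}} = \frac{\sum_{i=1}^{M} n_i\, T_{s,i}}{\sum_{i=1}^{M} n_i\, T_{w,i}},
\qquad
\max_{l_1,\dots,l_M} E_{\text{sys}} \;\;\text{s.t.}\;\; \sum_{i=1}^{M} l_i \le L
\]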

Case Studies
- Impact of job characteristics on allocation.
- Properties and performance of the SB, JEB, and SEB allocation policies.
- How to provision systems to achieve acceptable system efficiency.

Case Studies: System Model
An Intrepid (Blue Gene/P)-like system. Key characteristics:
- Nodes: 40,960 (Intrepid)
- Node failure rate: 130 - 20,000 FIT/node (projections from Snir et al. 2014; FIT = failures per billion machine hours)
- Memory size: 2 GB per node (Intrepid)
- SSD bandwidth: 320 GB/s (Gordon system configuration)
- SSD provisioning ratio: 100%, 25%, and 6.25% (each step a 4x reduction)

Case Studies: Workload Model
Trace-based realistic workloads. The first four isolate the impact of a single job feature; the last three are each dominated by one job type.

  Workload name             Key difference
  Job Size WL               Job size varies from 512 to 16,384 nodes
  Solve Time WL             Solve time varies from 1 hour to 17 hours
  Checkpoint Size Ratio WL  Checkpoint size ratio varies from 0.1 to 0.9
  Mixed WL                  Job size varies from 512 to 8,192 nodes and checkpoint size ratio from 0.1 to 0.9
  Small Job Heavy WL        Small jobs (<= 512 nodes) are more than 60% of the workload
  Medium Job Heavy WL       Medium jobs (1,024-4,096 nodes) are more than 60% of the workload
  Large Job Heavy WL        Large jobs (>= 8,192 nodes) are more than 60% of the workload

Case Studies: Metrics
- SSD service ratio: a job's allocated SSD lifetime relative to the optimum SSD lifetime given by Young's formula.
- Job efficiency: the performance of an individual job.
- System efficiency: the standard metric for evaluating whole-system performance; high system efficiency means jobs make rapid progress, which is what system administrators want.
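In symbols, using the model's quantities (a reconstruction; the slide's definition images are lost):

\[
\text{service ratio}_i = \frac{l_i}{l_i^{\text{opt}}}, \qquad
E_i = \frac{T_{s,i}}{T_{w,i}}, \qquad
E_{\text{sys}} = \frac{\sum_i n_i\, T_{s,i}}{\sum_i n_i\, T_{w,i}}
\]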

Case Study #1: Effect of Job Size on Allocation
System: 20,000 FIT/node, 6.25% SSD provisioning.
- SB and SEB produce the same SSD service ratio (6.25%) as job size is varied.
- JEB exhibits a preference for large jobs: their service ratio rises from 6.25% to 9.25%, i.e., 50% more lifetime is required to achieve equal job efficiency.
- Under SB and SEB, large jobs suffer job efficiency 40% lower than that of small jobs.

Case Study #2: Effect of Solve Time on Allocation
System: 20,000 FIT/node, 6.25% SSD provisioning.
- SB gives a fixed allocation based on size, so jobs with short solve times get a higher service ratio.
- JEB and SEB produce the same SSD service ratio (6.25%) as solve time varies.
- Under SB, jobs with long solve times suffer degraded job efficiency.

Case Study #3: Effect of Checkpoint Size Ratio on Allocation
System: 20,000 FIT/node, 6.25% SSD provisioning.
- SB prefers jobs with small checkpoint size ratios; JEB prefers jobs with large ones.
- SEB is neutral: a 6.25% SSD service ratio for all jobs.
- Jobs with large checkpoint size ratios have 6-10% lower job efficiency than those with small ones.
- Overall, checkpoint size ratio has an effect on allocation similar to job size, but smaller.

Case Study #4: Comparison of SB, JEB, and SEB
System: 20,000 FIT/node, 6.25% SSD provisioning.
- SEB always produces the best system efficiency.
- JEB can produce much lower system efficiency: a 14% drop on the Job Size WL.
- SB produces 5.5% lower system efficiency on the Mixed WL.
- Overall, with differences as large as 5-14%, the allocation policy must be chosen carefully.

Case Study #5: SSD Provisioning
SSD provisioning ratio required to achieve 90%, 95%, and 98% system efficiency, for varied workloads:
- Only 10-20% of the optimum SSD lifetime is needed to achieve 90% system efficiency, even at a failure rate three times that of today's systems (Blue Waters' failure rate is 6,100 FIT/node).
- About 37% is needed to achieve 95% system efficiency; moving from 90% to 95% increases the required SSD lifetime by 2-2.5x.
- Under-provisioning may therefore be desirable.

Related Work
- Optimal checkpoint interval: Young and Daly's work [Young 1974, "A First Order Approximation to the Optimum Checkpoint Interval"; Daly 2006, "A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps"]. Each job is optimized individually, with no resource constraint.
- Resource-constrained optimization: power-constrained scheduling [Sarood 2014, "Maximizing Throughput of Overprovisioned HPC Data Centers Under a Strict Power Budget"]. Assumes jobs run at 100% efficiency, with no resource-sharing interaction.
- Resource provisioning: virtual machine (VM) provisioning [Di 2013, "Error-Tolerant Resource Allocation and Payment Minimization for Cloud System"; Chaisiri 2011, "Cost Minimization for Provisioning Virtual Servers in Amazon Elastic Compute Cloud"]. Focuses on user cost rather than system cost.
- We know of no prior work on lifetime allocation across jobs.

Usage of Our Model
- For an expected workload, the model can pre-compute SSD lifetime allocations from the properties of the job mix.
- The model can also be applied to the system periodically, based on job history, or even dynamically at runtime; see the sketch below.
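For instance, the JEB sketch from the allocation slide could be re-run each scheduling epoch. The workload tuples, budget, and all names here are illustrative, not from the paper:

    # Hypothetical workload: (solve_time_s, ckpt_write_time_s, failures_per_s, ckpt_GB)
    workload = [
        (6 * 3600,   60, 1 / (24 * 3600),  512),   # small job
        (12 * 3600, 300, 4 / (24 * 3600), 8192),   # large job
    ]
    budget_GB = 50_000                             # lifetime budget for this epoch
    print(jeb_allocate(workload, budget_GB))       # per-job lifetime allocations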

Summary
- We derive a model that captures system and job characteristics, and use it to formulate the SSD lifetime allocation problem.
- Exploring three allocation objectives (SB, JEB, and SEB), we show that a binding lifetime constraint changes the checkpoint interval, and thereby the achievable job and system efficiency.
- The results suggest that once SSD lifetime is a managed resource, there is a trade-off between job efficiency and system efficiency; careful management of SSD lifetime in burst buffers is therefore important.
- The provisioning study reveals that low provisioning suffices to achieve 90% and even 95% system efficiency.

Future Work
- Study a broader variety of workloads and system parameters.
- Extend the model to capture burst buffer contention.
- Study simultaneous variation of system and workload parameters.

Questions? Thanks!