How Much SSD Is Useful For Resilience In Supercomputers
Aiman Fang (1) and Andrew A. Chien (1,2)
(1) The University of Chicago, (2) Argonne National Laboratory
Fault Tolerance at Extreme Scale (FTXS) at HPDC 2015, Portland, Oregon, June 15, 2015
Outline
- Motivation & Problem
- Main Contributions
- Modeling & Case Studies
- Related Work
- Summary and Future Work
Motivation: Checkpoint/Restart and Bottleneck
- Today's checkpointing time: 30 min [1] vs. future system MTBF: < 1 hour
- Parallel file system bandwidth: ~100 GB/s vs. future bandwidth demand: TB/s [2]
- Hard disk drive (HDD) bandwidth is a critical performance bottleneck.
[1] F. Cappello, "Fault tolerance in petascale/exascale systems: current knowledge, challenges and opportunities," 2009.
[2] N. Liu, "On the role of burst buffers in leadership-class storage systems," 2012.
Motivation: Burst Buffer Systems
- Burst buffers are a high-bandwidth storage tier between compute nodes and disk storage.
- In the form of solid state drives (SSDs).
- Drain checkpoints quickly.
Motivation: SSD Characteristics
- High bandwidth: ~4x that of HDD
- Limited write/erase cycles: 10^4 to 10^5
- Relatively high cost: 6 to 7x that of HDD

Intel SSD DC S3710 Specifications
Capacity                     200GB     400GB     800GB     1.2TB
Sequential Write             300MB/s   470MB/s   460MB/s   520MB/s
Endurance                    3.6PB     8.3PB     16.9PB    24.3PB
Lifetime at full write rate  138 days  204 days  425 days  540 days
MSRP                         $309      $619      $1,249    $1,909
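The "Lifetime at full write rate" row follows from dividing rated endurance by sequential write bandwidth. A minimal sketch of that arithmetic, assuming decimal units (1 PB = 10^15 bytes, 1 MB = 10^6 bytes) and truncation to whole days; the function and dictionary names are illustrative only:

```python
# Spec values copied from the S3710 table; decimal units are an assumption.
DRIVES = {
    "200GB": {"write_mb_s": 300, "endurance_pb": 3.6},
    "400GB": {"write_mb_s": 470, "endurance_pb": 8.3},
    "800GB": {"write_mb_s": 460, "endurance_pb": 16.9},
    "1.2TB": {"write_mb_s": 520, "endurance_pb": 24.3},
}

SECONDS_PER_DAY = 86_400


def lifetime_days(endurance_pb: float, write_mb_s: float) -> float:
    """Days until rated endurance is consumed when writing continuously."""
    seconds = (endurance_pb * 1e15) / (write_mb_s * 1e6)
    return seconds / SECONDS_PER_DAY


for capacity, spec in DRIVES.items():
    days = lifetime_days(spec["endurance_pb"], spec["write_mb_s"])
    # Truncate to whole days, which matches the figures quoted in the table.
    print(f"{capacity}: {int(days)} days")
```

Under these assumptions the four drives come out at 138, 204, 425, and 540 days, so even the largest drive would wear out in well under two years of continuous writing.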
Motivation: An Example
Summit: Oak Ridge National Laboratory's next-generation supercomputer
- ~3,400 nodes, each with 512 GB memory and 800 GB NVRAM as burst buffers
- Suppose we use half of the burst buffers for checkpointing
- Write burst buffers at full rate (500 MB/s × 3400)
Total SSD lifetime available for checkpointing: (800 GB × 3400 × 1/2) × 10^4 = 13,600 PB
Write time: 13,600 PB ÷ (500 MB/s × 3400) = 8 × 10^6 seconds ≈ 92 days
Annual cost: $1500 × 3400 × (365/92) ≈ 20 million dollars
Effective use of SSD lifetime is important!
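The Summit estimate can be reproduced step by step. This is a sketch of the slide's arithmetic only, assuming decimal units and the slide's figures (10^4 write cycles, $1500 of SSD per node); the constant and function names are illustrative:

```python
NODES = 3400
NVRAM_GB_PER_NODE = 800
CHECKPOINT_FRACTION = 0.5       # half of each burst buffer used for checkpoints
WRITE_CYCLES = 10**4            # low end of the 10^4 to 10^5 endurance range
WRITE_MB_S_PER_NODE = 500
SSD_COST_PER_NODE_USD = 1500
SECONDS_PER_DAY = 86_400


def summit_example():
    """Return (lifetime in PB, write time in s, write time in days, annual cost in USD)."""
    capacity_bytes = NODES * NVRAM_GB_PER_NODE * 1e9 * CHECKPOINT_FRACTION
    lifetime_bytes = capacity_bytes * WRITE_CYCLES
    write_rate_bytes_s = NODES * WRITE_MB_S_PER_NODE * 1e6
    write_seconds = lifetime_bytes / write_rate_bytes_s
    write_days = write_seconds / SECONDS_PER_DAY
    # Replacing every node's SSD each time the lifetime is exhausted.
    annual_cost = SSD_COST_PER_NODE_USD * NODES * 365 / write_days
    return lifetime_bytes / 1e15, write_seconds, write_days, annual_cost


lifetime_pb, seconds, days, cost = summit_example()
print(f"lifetime: {lifetime_pb:,.0f} PB")
print(f"write time: {seconds:.1e} s ({days:.1f} days)")
print(f"annual cost: ${cost / 1e6:.1f} million")
```

Expressed in petabytes, the total writable volume is 13,600 PB, which at an aggregate 1.7 TB/s is consumed in roughly 92 days.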
Problem #1: Allocation
For a set of jobs, how should SSD lifetime be allocated to maximize efficiency?
[Figure: An example of SSD lifetime allocation on two jobs, contrasting less allocation with more allocation; less allocation means fewer checkpoints and more work loss.]
Problem #2: Provisioning
Given a supercomputer with a particular error rate, how much SSD lifetime is worth buying?
- Over-provisioning: Paid too much! No one uses…
- Under-provisioning: How much increment? How much efficiency?
- Opportunities for under-provisioning: a trade-off between efficiency and SSD cost.
[Figure: How to provision supercomputers with SSD lifetime? Axes: SSD Lifetime (GB) vs. Job or System Efficiency; annotation: 98%.]
Main Contributions
- A model to determine the optimal SSD lifetime allocation for a variety of objectives: job-size fairness (size-based allocation), equal job efficiency (job-efficiency based allocation), and maximum system efficiency (system-efficiency based allocation).
- SSD lifetime allocation requires a global perspective; otherwise, system or job efficiency suffers.
  - With size-based and system-efficiency based allocation, large jobs suffer 40% lower job efficiency than small jobs.
  - Job-efficiency based allocation eliminates job-size unfairness, but must allocate 50% more lifetime to large jobs.
  - The fairness of job-efficiency based allocation comes at a cost: system efficiency decreases by as much as 14%.
- For cost-effective provisioning, only 10-20% of the optimal lifetime is needed to achieve 90% system efficiency at failure rates three times those of current systems.
Modeling Systems and Jobs: Wall Clock Time
- Scheme: interval checkpointing.
- Wall clock time = solve time + dump time + rework time (Young 1974, Daly 2006).
[Equation: wall clock time of one job, with annotated terms for the number of checkpoints, the rework after a failure, and the number of failures; α is the failure rate.]
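To make the decomposition into solve, dump, and rework time concrete, here is a minimal first-order sketch. The exact form is an assumption and not necessarily the equation on the slide: it takes the number of failures as α times the failure-free run time and assumes each failure loses half a checkpoint interval of work on average. All names are hypothetical.

```python
def wall_clock_time(solve_time, interval, dump_time, failure_rate):
    """First-order wall clock estimate (assumed form); all times in hours.

    solve_time   : failure-free compute time
    interval     : compute time between checkpoints
    dump_time    : time to write one checkpoint
    failure_rate : failures per hour seen by the job (alpha)
    """
    n_checkpoints = solve_time / interval
    failure_free = solve_time + n_checkpoints * dump_time
    n_failures = failure_rate * failure_free
    rework_per_failure = interval / 2   # assumption: half an interval lost on average
    return failure_free + n_failures * rework_per_failure


# Example: 10 h of solve time, 3 min checkpoints, one failure every 10 h.
for interval in (0.5, 1.0, 2.0):
    t = wall_clock_time(10.0, interval, 0.05, 0.1)
    print(f"interval {interval} h -> wall clock {t:.3f} h")
```

With these example numbers the middle interval gives the shortest wall clock time: checkpointing more often adds dump time, while checkpointing less often adds rework.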
Modeling Systems and Jobs: Many Jobs on a System
- A large-scale system has N nodes and a failure rate of λ failures per hour.
- The system has a limited SSD lifetime of L gigabytes.
- In a workload, M jobs run concurrently on the system.
- SSD lifetime $l_i$ used by a job: $l_i = \frac{T_{s,i}}{\tau_i} \, s_i$, where $T_{s,i}$ is the solve time, $\tau_i$ is the checkpoint interval, and $s_i$ is the checkpoint size.
- SSD lifetime constraint: $\sum_{i=1}^{M} l_i \le L$
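A small sketch of the per-job lifetime and the system-wide constraint, assuming each job writes one checkpoint of size s_i every τ_i hours of solve time (number of checkpoints times checkpoint size). The `Job` class and function names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Job:
    solve_time_h: float     # T_s,i
    interval_h: float       # tau_i
    checkpoint_gb: float    # s_i


def lifetime_used_gb(job: Job) -> float:
    """SSD lifetime consumed: number of checkpoints times checkpoint size."""
    return job.solve_time_h / job.interval_h * job.checkpoint_gb


def within_budget(jobs, budget_gb: float) -> bool:
    """Check the constraint that total lifetime used does not exceed L."""
    return sum(lifetime_used_gb(j) for j in jobs) <= budget_gb


jobs = [Job(10.0, 0.5, 100.0), Job(4.0, 1.0, 50.0)]
print([lifetime_used_gb(j) for j in jobs])
print(within_budget(jobs, 2500.0), within_budget(jobs, 2000.0))
```

Because lifetime use is inversely proportional to the interval, a tight budget L can only be met by lengthening checkpoint intervals for some or all jobs.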
Modeling Systems and Jobs: Allocation Problem
Optimal allocation without resource constraint:
- Optimal checkpoint interval (Young's formula): $\tau_{opt} = \sqrt{2 \delta M}$, where δ is the time to write one checkpoint and M is the MTBF of the job (here M denotes MTBF, not the number of jobs).
- Optimum SSD lifetime: $l_{opt} = \frac{T_s}{\tau_{opt}} \, s$
With the resource constraints of a system, how do we decide the SSD lifetime and checkpoint interval for each job?
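Young's formula, with δ as the checkpoint write time and M as the job MTBF, can be evaluated directly. The sketch below also derives the unconstrained optimum lifetime, assuming lifetime use equals the number of checkpoints times the checkpoint size; function names are illustrative:

```python
import math


def young_interval(dump_time_h: float, mtbf_h: float) -> float:
    """Young's first-order optimum checkpoint interval: sqrt(2 * delta * M)."""
    return math.sqrt(2.0 * dump_time_h * mtbf_h)


def optimum_lifetime_gb(solve_time_h, dump_time_h, mtbf_h, checkpoint_gb):
    """SSD lifetime a job would consume if it checkpointed at Young's interval."""
    n_checkpoints = solve_time_h / young_interval(dump_time_h, mtbf_h)
    return n_checkpoints * checkpoint_gb


tau = young_interval(0.02, 1.0)                      # 72 s dump, 1 h MTBF
print(f"optimum interval: {tau:.3f} h")
print(f"optimum lifetime: {optimum_lifetime_gb(10.0, 0.02, 1.0, 100.0):.0f} GB")
```

Note the scaling: as MTBF shrinks, the optimum interval shrinks with its square root, so the lifetime a job would like to consume grows as failures become more frequent.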
Size Based (SB) Allocation
- SB allocates SSD lifetime proportional to job size.
[Equation: SB formulation]
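Proportional allocation can be sketched in a few lines. This assumes "job size" means node count and that each job's share of the budget L is its node count divided by the total; the checkpoint interval then follows from the per-job lifetime relation (checkpoints times checkpoint size). Function names are hypothetical.

```python
def size_based_allocation(job_sizes, budget_gb):
    """Split the lifetime budget in proportion to job size (node count)."""
    total_nodes = sum(job_sizes)
    return [budget_gb * n / total_nodes for n in job_sizes]


def implied_interval(solve_time_h, checkpoint_gb, lifetime_gb):
    """Checkpoint interval a job must use to stay within its allocation."""
    return solve_time_h * checkpoint_gb / lifetime_gb


sizes = [512, 1024, 2560]
alloc = size_based_allocation(sizes, 4096.0)
print(alloc)
# A 10 h job writing 64 GB checkpoints within a 512 GB allocation:
print(implied_interval(10.0, 64.0, alloc[0]))
```

The allocation ignores solve time and checkpoint size, which is why jobs that differ in those properties can end up with very different service under SB.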
Job-Efficiency Based (JEB) Allocation
- JEB allocates SSD lifetime such that job efficiencies are equalized within a workload.
[Equation: job efficiency definition]
[Equation: JEB formulation]
- Solved with Newton's method; time complexity is logarithmic.
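The idea of equalizing job efficiencies under a lifetime budget can be illustrated with a small solver. This is a sketch under stated assumptions, not the slide's formulation: job efficiency is taken as solve time over wall clock time with a first-order model, 1 / ((1 + δ/τ)(1 + ατ/2)), and the common efficiency is found by bisection rather than the Newton's method named on the slide. All names and the dictionary layout are hypothetical.

```python
import math


def efficiency(interval, dump_time, failure_rate):
    """Assumed first-order job efficiency: solve time / wall clock time."""
    return 1.0 / ((1 + dump_time / interval) * (1 + failure_rate * interval / 2))


def interval_for_efficiency(target, dump_time, failure_rate):
    """Longest interval whose efficiency still reaches `target`.

    Solves (a/2) t^2 - b t + d = 0 and takes the larger root, which lies at or
    beyond Young's optimum and therefore uses the least SSD lifetime.
    """
    b = 1.0 / target - 1.0 - failure_rate * dump_time / 2
    disc = b * b - 2 * failure_rate * dump_time
    return (b + math.sqrt(max(disc, 0.0))) / failure_rate


def best_efficiency(dump_time, failure_rate):
    """Efficiency at Young's interval under the assumed model."""
    return 1.0 / (1 + math.sqrt(failure_rate * dump_time / 2)) ** 2


def jeb_allocation(jobs, budget_gb, iterations=200):
    """Find one efficiency shared by all jobs whose lifetimes sum to the budget."""
    def lifetimes(target):
        return [
            j["solve_time"] * j["checkpoint_gb"]
            / interval_for_efficiency(target, j["dump_time"], j["failure_rate"])
            for j in jobs
        ]

    lo = 1e-9
    hi = min(best_efficiency(j["dump_time"], j["failure_rate"]) for j in jobs)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        # Higher target efficiency needs shorter intervals, hence more lifetime.
        if sum(lifetimes(mid)) > budget_gb:
            hi = mid
        else:
            lo = mid
    return lo, lifetimes(lo)


jobs = [
    {"solve_time": 10.0, "dump_time": 0.05, "failure_rate": 0.1, "checkpoint_gb": 100.0},
    {"solve_time": 10.0, "dump_time": 0.05, "failure_rate": 0.8, "checkpoint_gb": 800.0},
]
target, alloc = jeb_allocation(jobs, 2000.0)
print(f"shared efficiency: {target:.3f}, allocations (GB): {alloc}")
```

Bisection works here because the total lifetime needed rises monotonically with the target efficiency; it also converges in a logarithmic number of steps in the desired precision.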
System-Efficiency Based (SEB) Allocation
- SEB allocates SSD lifetime such that system efficiency is maximized.
[Equation: system efficiency definition]
[Equation: SEB formulation]
Case Studies
- Impact of job characteristics on allocation.
- Properties and performance of the SB, JEB, and SEB allocation policies.
- How to provision systems to achieve acceptable system efficiency.
Case Studies: System Model
An Intrepid BlueGene/P-like system. Key system characteristics:
- Nodes: 40,960 (Intrepid)
- Node Failure Rate: 130 to 20,000 FIT/node (projections from Snir et al. 2014). FIT: number of failures per billion machine hours.
- Memory Size: 2 GB per node (Intrepid)
- SSD Bandwidth: 320 GB/s (Gordon system configuration)
- SSD Provisioning Ratio: 100%, 25%, and 6.25% (each step a 4x reduction)
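To give a feel for the FIT range, the per-node rates can be converted into a system-level failure rate and MTBF. A minimal sketch, assuming independent node failures so that rates add across nodes; function names are illustrative:

```python
def failure_rate_per_hour(nodes: int, fit_per_node: float) -> float:
    """Aggregate failures per hour; 1 FIT = 1 failure per 10^9 device-hours."""
    return nodes * fit_per_node / 1e9


def mtbf_hours(nodes: int, fit_per_node: float) -> float:
    """Mean time between failures for a group of nodes, in hours."""
    return 1.0 / failure_rate_per_hour(nodes, fit_per_node)


NODES = 40_960  # Intrepid-like system
for fit in (130, 20_000):
    rate = failure_rate_per_hour(NODES, fit)
    print(f"{fit:>6} FIT/node: {rate:.4f} failures/h, MTBF {mtbf_hours(NODES, fit):.1f} h")
```

Under this assumption, the full 40,960-node system at 20,000 FIT/node fails about every 1.2 hours, while at 130 FIT/node it fails about every 188 hours. The same functions apply to a single job by passing the job's node count.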
Case Studies: Workload Model
Trace-based realistic workloads

Workload Name             Key Difference
Job Size WL               Vary job size from 512 to 16384
Solve Time WL             Vary solve time from 1 hour to 17 hours
Checkpoint Size Ratio WL  Vary checkpoint size ratio from 0.1 to 0.9
Mixed WL                  Vary job size from 512 to 8192, and checkpoint size ratio from 0.1 to 0.9
Small Job Heavy WL        Small jobs (≤ 512 nodes) are more than 60% of the workload
Medium Job Heavy WL       Medium jobs (1024-4096 nodes) are more than 60% of the workload
Large Job Heavy WL        Large jobs (≥ 8192 nodes) are more than 60% of the workload

Workload groups: those that isolate the impact of a job feature, and those in which one job type is dominant.
Case Studies: Metrics
- SSD Service Ratio: the SSD lifetime allocated to a job relative to the optimum SSD lifetime indicated by Young's formula.
- Job Efficiency: represents the performance of a job.
- System Efficiency: a typical metric for evaluating system performance. High system efficiency indicates that jobs progress quickly, which is what system administrators want.
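The SSD service ratio has a direct consequence for checkpoint frequency. The sketch below assumes lifetime use is the number of checkpoints times the checkpoint size, so a job given a fraction r of its optimum lifetime must stretch Young's interval by 1/r; function names are hypothetical:

```python
def service_ratio(allocated_gb: float, optimum_gb: float) -> float:
    """Allocated SSD lifetime as a fraction of the Young-optimal lifetime."""
    return allocated_gb / optimum_gb


def constrained_interval(young_interval_h: float, ratio: float) -> float:
    """Checkpoint interval forced by a service ratio below 1."""
    return young_interval_h / ratio


r = service_ratio(62.5, 1000.0)
print(f"service ratio: {r:.2%}")
print(f"interval: {constrained_interval(0.2, r):.1f} h instead of 0.2 h")
```

At the 6.25% provisioning level used in the case studies, a job served at that ratio checkpoints 16 times less often than Young's formula recommends.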
Case Study #1: Effect of Job Size on Allocation
System: 20,000 FIT/node, 6.25% SSD provisioning
[Figure annotations: 40%; JEB: 9.25% for large jobs; 9%]
- SB and SEB produce the same SSD service ratio (6.25%) as job size is varied.
- JEB exhibits a preference for large jobs: an increase from 6.25% to 9.25%, or about 50% more lifetime, is required to achieve equal job efficiency.
- Under SB and SEB, large jobs suffer 40% lower job efficiency than small jobs.
Case Study #2: Effect of Solve Time on Allocation
System: 20,000 FIT/node, 6.25% SSD provisioning
- SB gives a fixed allocation based on size, so jobs with short solve times get a higher service ratio.
- JEB and SEB produce the same SSD service ratio (6.25%) as solve time varies.
- Under SB, jobs with long solve times suffer degraded job efficiency.
Case Study #3: Effect of Checkpoint Size Ratio on Allocation
System: 20,000 FIT/node, 6.25% SSD provisioning
- SB prefers jobs with a small checkpoint size ratio.
- JEB prefers jobs with a large checkpoint size ratio.
- SEB is neutral: 6.25% SSD service ratio for all jobs.
- Large jobs have 6-10% lower job efficiency compared to small jobs.
- Checkpoint size ratio has a similar but smaller effect on allocation.
Case Study #4: Comparison of SB, JEB, and SEB
System: 20,000 FIT/node, 6.25% SSD provisioning
[Figure annotations: 5.5%, 14%]
- SEB always produces the best system efficiency.
- JEB can produce much lower system efficiency, with a 14% drop for the Job Size WL.
- SB produces 5.5% lower system efficiency in the Mixed WL.
- Overall, with differences as large as 5-14%, careful choices are required.
Case Study #5: SSD Provisioning
[Figure: SSD provisioning ratio needed to achieve 90%, 95%, and 98% system efficiency, for varied workloads; annotations: 37%, 2x]
- Only 10-20% of the optimum SSD lifetime is needed to achieve 90% system efficiency, even at a failure rate three times that of today. The Blue Waters failure rate is 6100 FIT/node.
- 37% is needed to achieve 95% system efficiency.
- Moving from 90% to 95% increases the required SSD lifetime by 2-2.5x.
- Under-provisioning may be desirable!
Related Work
Optimal checkpoint interval
- Young's and Daly's work [Young 1974, "A First Order Approximation to the Optimum Checkpoint Interval"; Daly 2006, "A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps"]
- Optimizes each job individually, with no resource constraint
Resource-constrained optimization
- Power-constrained problem [Sarood 2014, "Maximizing Throughput of Overprovisioned HPC Data Centers Under a Strict Power Budget"]
- Assumes jobs run at 100% efficiency
- No resource-sharing interaction
Resource provisioning
- Virtual machine (VM) resource provisioning [Di 2013, "Error-Tolerant Resource Allocation and Payment Minimization for Cloud System"; Chaisiri 2011, "Cost Minimization for Provisioning Virtual Servers in Amazon Elastic Compute Cloud"]
- Focuses on user cost rather than system cost
We know of no work that looks at lifetime allocation across jobs.
Usage of Our Model
- For an expected workload, the model can be used to pre-compute SSD lifetime allocations based on the properties of the job mix.
- The model can be applied to the system periodically, based on history, or even dynamically.
Summary
- We derive a model that captures system and job characteristics, and use it to formulate the SSD lifetime allocation problem.
- Exploring three allocation objectives (SB, JEB, and SEB), we show that a critical lifetime constraint changes the checkpoint interval, and thereby the achievable job and system efficiency.
- The results suggest that once SSD lifetime is taken into account, there is a trade-off between job efficiency and system efficiency. Careful management of SSD lifetime in burst buffers is therefore important.
- The provisioning study reveals that low provisioning is sufficient to achieve 90% and 95% system efficiency.
Future Work
- Study a broader variety of workloads and system parameters.
- Extend the model to capture burst buffer contention.
- Study simultaneous variation of system and workload parameters.
Questions? Thanks!