On Dynamic Resource Availability in Grids


On Dynamic Resource Availability in Grids
A. Iosup, O. Sonmez, D.H.J. Epema (PDS Group, ST/EWI, TU Delft); M. Jan (LRI/INRIA Futurs, Paris)
IEEE Grid 2007, Austin, TX, USA

Grids are far from being reliable job execution environments
- Server: 99.99999% reliable
- Small cluster: 99.999% reliable
- Production cluster: 99.9% reliable; 5x decrease in failure rate after the first year [Sch06]
- CERN LCG: 25.29% of jobs now fail, after 5 years (3 in ~production). Source: dboard-gr.cern.ch, May 2007.
- DAS-2: >10% of jobs fail [Ios06]
- TeraGrid: 20-45% failures [Kha06]
- Grid3: 27% failures, 5-10 retries [Dum05]
In today's grids, reliability is more important than performance!

Speaker notes: We see each day that a commercially acceptable server needs to be 99.99999% reliable; that is one error every … hours! As soon as we increase in size, failures become more frequent: a top cluster manufacturer advertises only "three nines" reliability, and we would not be unhappy with that value as a goal for grids. It is well known that the reliability of a system is lower at the beginning and towards the end of its life (the famous bath-tub shape). Production clusters have been found to be xx to yy % reliable after passing the ??-year margin [Sch06]. With grids, we seem to still be at the beginning: errors occur very frequently, as recent studies have shown [Ios06, Kha06]. For large-scale clusters, assume roughly 1 error per day, each lasting a few hours (say 4): at 1,000 resources, that leaves on average 999 + (24-4)/24 ≈ 999.8 resources available per day.

[Ios06] A. Iosup and D. H. J. Epema. GrenchMark: A framework for analyzing, testing, and comparing grids. In CCGRID, pages 313-320. IEEE Computer Society, 2006.
[Kha06] O. Khalili, J. He, et al. Measuring the performance and reliability of production computational grids. In GRID. IEEE Computer Society, 2006.
[Dum05] C. Dumitrescu, I. Raicu, and I. T. Foster. Experiences in running workloads over Grid3. In H. Zhuge and G. Fox, editors, GCC, volume 3795 of LNCS, pages 274-286, 2005.
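To make the "nines" and the speaker-note arithmetic concrete, here is a minimal Python sketch; the helper name and the exact failure figures are our illustration, not data from the talk.

```python
# Back-of-the-envelope reliability arithmetic (hypothetical helper names):
# convert "nines" to expected downtime, then reproduce the speaker-note
# example of 1,000 resources losing one node for 4 hours each day.

HOURS_PER_YEAR = 24 * 365

def downtime_per_year(nines: int) -> float:
    """Hours of downtime per year at a given number of nines."""
    availability = 1 - 10 ** (-nines)
    return (1 - availability) * HOURS_PER_YEAR

print(f"5 nines (99.999%):   {downtime_per_year(5):.2f} h/year down")
print(f"7 nines (99.99999%): {downtime_per_year(7) * 3600:.1f} s/year down")

# Speaker-note example: 1,000 resources, 1 failure/day lasting 4 hours.
resources, outage_h = 1000, 4
avg_available = (resources - 1) + (24 - outage_h) / 24
print(f"Average available resources per day: {avg_available:.1f}")  # 999.8
```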

Sources of failure in grids and what to do about them
- Layer 1, user applications: failures [HLi07], error propagation [Tha02]
- Layer 2, grid + user middleware: grid middleware tests [Ios07]
- Layer 3, machine + OS (resource availability): desktop grids [Kon04]; cluster-based grids?
We don't know much about our work-horse, the cluster-based grid!

Research Questions
Q1: What are the characteristics of resource (un)availability in grids?
Q2: What is the performance impact of dynamic grid resource availability?

Outline
- Introduction and Motivation
- Resource Availability in Grids
  - Environment and Traces
  - Availability Characteristics
  - Availability Model
- Performance Impact of Resource Availability
- Conclusion and Future Work

2.1. Grid'5000: Environment and Traces
- Experimental platform Grid'5000: 9 sites, 15 clusters; target: 5000 nodes, >5000 CPUs
- All clusters managed by OAR
- Traces: jobs 05/2004-11/2006 (30 mo.); resource availability 05/2005-11/2006 (18 mo.)
- Jobs: 951 K; availability status change events: 600 K
- Users: 473; groups: 10
- CPUs: ~2,500; consumed CPU time: 651 years

2.2. Availability Characteristics (1/3): Grid-Level View
(Figure: number of resources over time, with the average availability marked.)
- Grid-level availability: range 35-92%, average 69%
- Cluster-level availability: range 32-98%

2.2. Availability Characteristics (2/3): MTBF and MTTR
(Figure: CDF of the time between failures (TBF) [s], grid-level view.)
- Mean Time Between Failures (MTBF): grid-level ~12 minutes; node-level ~2 days
- Mean Time To Repair (MTTR): node-level ~half a day
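To show how such numbers fall out of a trace, a minimal sketch follows; the (fail_time, repair_time) tuple layout is an assumed toy format, not the Grid'5000 trace format, and the toy values are chosen to land near the node-level figures above.

```python
# Derive MTBF and MTTR for one node from a time-sorted list of outage
# intervals (fail_time, repair_time), times in seconds.

from statistics import mean

def mtbf_mttr(outages: list[tuple[float, float]]) -> tuple[float, float]:
    """Return (MTBF, MTTR) in seconds for one node's outage intervals."""
    # Time between failures: gap between consecutive failure events.
    tbf = [b[0] - a[0] for a, b in zip(outages, outages[1:])]
    # Time to repair: length of each outage.
    ttr = [repair - fail for fail, repair in outages]
    return mean(tbf), mean(ttr)

# Toy trace: three outages of a single node.
outages = [(0, 40_000), (170_000, 210_000), (350_000, 400_000)]
mtbf, mttr = mtbf_mttr(outages)
print(f"MTBF: {mtbf / 86_400:.1f} days, MTTR: {mttr / 3_600:.1f} h")
```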

2.2. Availability Characteristics (3/3): Correlated Failures
Definition: a maximal set of failures (ordered by increasing event time), with time parameter Δ, in which for any two successive failures E and F, T(F) − T(E) ≤ Δ, where T(·) returns the timestamp of the event; Δ = 1-3600 s.
(Figure: CDF of the size of correlated failures, grid-level view.)
- Size: range 1-339, average 11
- Cluster span: range 1-3, average 1.06; failures "stay" within a cluster
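The definition translates into a greedy single-pass grouping over time-sorted failures. The sketch below is our illustration of that definition, with an assumed (timestamp, cluster) event layout; it is not the authors' analysis code.

```python
# Group failures (sorted by timestamp) into maximal sets in which any two
# successive failures are at most delta seconds apart.

def correlated_failures(events, delta):
    """Group (timestamp, cluster) failure events; gap <= delta joins a group."""
    groups = []
    for ts, cluster in sorted(events):
        if groups and ts - groups[-1][-1][0] <= delta:
            groups[-1].append((ts, cluster))
        else:
            groups.append([(ts, cluster)])
    return groups

events = [(10, "c1"), (25, "c1"), (4000, "c2"), (4005, "c2"), (9000, "c1")]
for g in correlated_failures(events, delta=3600):
    clusters = {c for _, c in g}
    print(f"size={len(g)}, cluster span={len(clusters)}")
```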

2.3. Availability Model (see article for per-cluster parameter values)
Model components: failure location, MTBF, MTTR, correlated failures.
- Location: f_s, the fraction of failures at cluster s; assume no correlation of failure occurrence between clusters.
- Inter-arrival time (IAT) of failures: Weibull distribution; a shape parameter > 1 means an increasing hazard rate, i.e., the longer a node is online, the higher the chance that it will fail.
- Reboot machines just to keep them in the "safe zone"? Software recovers from accumulating problems, but hardware may crash during the reboot; otherwise a reboot won't help.
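A minimal sketch of the "shape > 1" point: the Weibull hazard rate h(t) = (k/λ)(t/λ)^(k−1) grows with t when the shape k > 1. The parameter values below are illustrative only; see the article for the per-cluster fits.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Weibull hazard rate h(t) = (k/lam) * (t/lam)**(k-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)

shape, scale = 1.5, 100.0  # illustrative values; shape > 1: increasing hazard
for t in (10, 50, 100, 200):
    print(f"t={t:>3}: h(t)={weibull_hazard(t, shape, scale):.4f}")
# h(t) rises with t: the longer a node is up, the likelier it fails soon.
# This motivates the slide's preventive-reboot question: a reboot resets t
# for software-induced failures, but not for hardware ones.
```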

Outline
- Introduction and Motivation
- Resource Availability in Grids
- Performance Impact of Resource Availability
  - Models of Availability Information
  - Performance Evaluation
- Conclusion and Future Work

3.1. Models of Availability Information
Four types of systems, distinguished by resource availability (static vs. dynamic) and availability-information delay (on-time, short period, long period):
1. SA (Steady Availability): static availability, on-time information (delay 0); not a real system.
2. KA (Known Availability): dynamic availability, on-time information; not a real system.
3. AMA (Automated Monitoring of Availability): dynamic availability, short information delay; real.
4. HMA (Human Monitoring of Availability): dynamic availability, long information delay; real.
Grids vs. clusters: resource availability information is not always present.
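One way to read the four models is as a single staleness knob: the scheduler sees the availability state as of (now − delay). The sketch below (class and variable names are hypothetical) illustrates this reading; it is not the paper's formal model.

```python
import bisect

class DelayedAvailabilityView:
    """Availability state seen through an information delay (hypothetical)."""
    def __init__(self, history, delay):
        # history: time-sorted (timestamp, up_node_count) state changes.
        self.times = [t for t, _ in history]
        self.counts = [c for _, c in history]
        self.delay = delay

    def nodes_up(self, now):
        """Node count as the scheduler sees it, i.e., delay seconds ago."""
        i = bisect.bisect_right(self.times, now - self.delay) - 1
        return self.counts[max(i, 0)]

history = [(0, 100), (3_600, 90), (7_200, 100)]  # a failure, then a repair
ama = DelayedAvailabilityView(history, delay=60)       # automated monitoring
hma = DelayedAvailabilityView(history, delay=604_800)  # weekly human check
print(ama.nodes_up(4_000), hma.nodes_up(4_000))  # 90 100: HMA sees stale data
```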

3.2. Performance Evaluation (1/2): Overview
- Scenario: SA vs. KA vs. AMA vs. HMA. AMA: information updated every 60 s or 1 h; HMA: information updated every week, every month, or never (fixed availability information).
- Metrics: traditional (average wait time, AWT; average response time, ART); adapted for dynamic availability (utilization, throughput, goodput); failures, by number and reason (submission or execution).
- Experimental setup: custom trace-driven discrete-event simulator; simulates Grid'5000 using the job and availability traces of 06/2005-10/2006 (739 K jobs).
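As a rough illustration of what "trace-driven discrete-event simulation" means here, a heavily simplified sketch follows; it is not the authors' simulator, and the job and capacity data are toy values.

```python
# Replay a job trace against a time-varying capacity and count jobs that
# cannot start because resources have vanished.

import heapq

def replay(jobs, capacity_at):
    """jobs: (submit_time, cpus, runtime); capacity_at(t) -> available CPUs."""
    running, in_use, failures, done = [], 0, 0, 0
    for submit, cpus, runtime in sorted(jobs):
        # Free whatever finished before this submission.
        while running and running[0][0] <= submit:
            in_use -= heapq.heappop(running)[1]
        if in_use + cpus <= capacity_at(submit):
            heapq.heappush(running, (submit + runtime, cpus))
            in_use += cpus
            done += 1
        else:
            failures += 1  # submission failure: not enough live resources
    return done, failures

jobs = [(0, 4, 100), (10, 8, 50), (20, 4, 30)]
print(replay(jobs, capacity_at=lambda t: 10 if t < 15 else 6))  # (1, 2)
```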

3.2. Performance Evaluation (2/2): Sample Results (see article for detailed results)
(Figure: average normalized goodput [CPUs/day/proc], average normalized throughput [jobs/day/proc], and job failures [%] per model: SA, KA, AMA 60s, AMA 1h, HMA 1wk, HMA 1mo, HMA never.)
- Goodput and throughput decrease with intervention delay: KA = AMA > HMA 1wk > HMA 1mo >> SA.
- Human intervention leads to submission failures; still, any intervention is better than no intervention.

4. Conclusion and Future Work
- Q1: What are the characteristics of resource (un)availability in grids? Location, inter-arrival time (MTBF), duration (MTTR), size (correlated failures); a model for grid resource availability.
- Q2: What is the performance impact of dynamic grid resource availability? Four models of grid resource availability information; trace-based performance evaluation through simulation: KA = AMA > HMA >> SA.
- Future work: validate the model for other grids; model-based performance evaluation; availability-aware scheduling policies for grids.

Thank you! Questions? Remarks? Observations?
Help build our community's Grid Workloads Archive (http://gwa.ewi.tudelft.nl/): add your job and resource availability traces!