Grid Failure Monitoring and Ranking using FailRank
Demetris Zeinalipour (Open University of Cyprus), Kyriacos Neocleous, Chryssis Georgiou, Marios D. Dikaiakos

Presentation transcript:

Grid Failure Monitoring and Ranking using FailRank
Demetris Zeinalipour (Open University of Cyprus)
Kyriacos Neocleous, Chryssis Georgiou, Marios D. Dikaiakos (University of Cyprus)

2 Motivation
"Things tend to fail." Examples:
– The FlexX and Autodock challenges of the WISDOM project [1] (Aug '05) showed that only 32% and 57% of the jobs, respectively, exited with an "OK" status.
– Our group conducted a 9-month study [2] of the SEE-VO (Feb '06 – Nov '06) and found that only 48% of the jobs completed successfully.
Our objective: A Dependable Grid
– An extremely complex task that currently relies on over-provisioning of resources, ad-hoc monitoring and user intervention.
[2] Analyzing the Workload of the South-East Federation of the EGEE Grid Infrastructure, CoreGRID TR-0063, G.D. Costa, S. Orlando, M.D. Dikaiakos.

3 Solutions?
To make the Grid dependable we have to efficiently manage failures. Currently, administrators monitor the Grid for failures through monitoring sites such as GridICE and GStat.

4 Limitations of Current Monitoring Systems
They require human monitoring and intervention:
– This introduces errors and omissions.
– Human resources are very expensive.
Reactive vs. proactive failure prevention:
– Reactive: administrators (might) respond to important failure conditions after the fact.
– Proactive: prevention mechanisms could instead be utilized to identify failures in advance and divert job submissions away from sites that will fail.

5 Problem Definition
Can we coalesce information from monitoring systems into useful knowledge that can be exploited for:
– Online applications: e.g., predicting failures and subsequently improving job scheduling.
– Offline applications: e.g., finding interesting rules (e.g., whenever the Disk Pool Manager fails, then cy-01-kimon and cy-03-intercollege fail as well) and time-series similarity search (e.g., which attribute (disk utilization, waiting jobs, etc.) is similar to the CPU utilization for a given site).

6 Our Approach: FailRank
A new framework for failure management in very large and complex environments such as Grids.
FailRank outline:
1. Integrate & Rank the failure-related information from monitoring systems (e.g., GStat, GridICE).
2. Identify Candidates that have the highest potential to fail (based on the acquired information).
3. (Temporarily) Exclude Candidates from the pool of resources available to the Resource Broker.
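To make the three steps concrete, here is a minimal Python sketch of one FailRank cycle; the queue names, scores, the set-based "broker pool" and the function name are invented for illustration and are not taken from the actual system:

# Minimal sketch of one FailRank cycle, assuming the per-queue failure
# scores of step 1 (integration and ranking) are already available.
# Queue names, scores and the set-based "broker pool" are illustrative.

def failrank_cycle(scores, broker_pool, k):
    """scores: {queue: failure_score}. Returns the K excluded queues."""
    # Step 2: identify the K queues with the highest potential to fail.
    candidates = sorted(scores, key=scores.get, reverse=True)[:k]
    # Step 3: temporarily exclude them from the Resource Broker's pool.
    broker_pool.difference_update(candidates)
    return candidates

pool = {"cy-01-kimon", "cy-03-intercollege", "gr-01-auth"}
excluded = failrank_cycle(
    {"cy-01-kimon": 0.81, "cy-03-intercollege": 0.12, "gr-01-auth": 0.55},
    pool, k=1)
print(excluded)   # ['cy-01-kimon']
print(pool)       # the two remaining, non-excluded queues

In the actual architecture the exclusion would be communicated to the Resource Broker rather than applied to a local set; the sketch only illustrates the control flow.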

7 Presentation Outline
– Motivation and Introduction
– The FailRank Architecture
– The FailBase Repository
– Experimental Evaluation
– Conclusions & Future Work

8 FailRank Architecture Grid Sites: i) report statistics to the Feedback sources; ii) allow the execution of micro-benchmarks that reveal the performance characteristics of a site.

9 FailRank Architecture
Feedback Sources (Monitoring Systems), examples:
– Information Index LDAP queries: grid status at a fine granularity.
– Service Availability Monitoring (SAM): periodic test jobs.
– Grid statistics: published by sites such as GStat and GridICE.
– Network tomography data: obtained through pinging and tracerouting.
– Active benchmarking: low-level probes using tools such as GridBench, DiPerf, etc.

10 FailRank Architecture
– FailShot Matrix (FSM): a snapshot of all failure-related parameters at a given timestamp.
– Top-K Ranking Module: efficiently finds the K sites with the highest potential to feature a failure by utilizing the FSM.
– Data Exploration Tools: offline tools used for exploratory data analysis, learning and prediction, utilizing the FSM.

11 The FailShot Matrix
The FailShot Matrix (FSM) integrates the failure information, available in a variety of formats and sources, into a representative array of numeric vectors. The FailBase Repository we developed contains 75 attributes and 2,500 queues from 5 feedback sources.

12 The Top-K Ranking Module
Objective: to continuously rank the FSM and identify the K highest-ranked sites that will feature an error.
Scoring Function: combines the individual attributes into a single score per site (queue), e.g., with weights W_CPU = 0.1, W_DISK = 0.2, W_NET = 0.2, W_FAIL = 0.5.
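As a hedged illustration of the FSM and the scoring it feeds, the sketch below assumes the score of a queue is the weighted sum of its normalized attribute values (score_i = sum_j W_j * m_ij); the queue rows, attribute values and the normalization are invented for this example, while the weights copy the slide's example:

import numpy as np

# Illustrative FailShot Matrix: one row per CE queue, one column per
# failure-related attribute; values are assumed normalized to [0, 1].
# All numbers below are invented for this sketch.
attributes = ["cpu", "disk", "net", "fail"]
queues = ["cy-01-kimon", "cy-03-intercollege", "gr-01-auth"]
fsm = np.array([
    [0.35, 0.80, 0.10, 0.90],   # cy-01-kimon
    [0.20, 0.15, 0.05, 0.00],   # cy-03-intercollege
    [0.70, 0.40, 0.30, 0.50],   # gr-01-auth
])

# Weights from the slide's example (W_CPU, W_DISK, W_NET, W_FAIL); they sum to 1.
weights = np.array([0.1, 0.2, 0.2, 0.5])

def top_k(fsm, weights, queues, k):
    """Rank queues by score_i = sum_j W_j * m_ij; return the K most failure-prone."""
    scores = fsm @ weights
    order = np.argsort(scores)[::-1]      # highest score first
    return [(queues[i], round(float(scores[i]), 3)) for i in order[:k]]

print(top_k(fsm, weights, queues, k=2))
# [('cy-01-kimon', 0.665), ('gr-01-auth', 0.46)]

With these made-up numbers, cy-01-kimon ranks first mainly because of its high SAM-failure value, which carries half of the total weight.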

13 Presentation Outline
– Introduction and Motivation
– The FailRank Architecture
– The FailBase Repository
– Experimental Evaluation
– Conclusions & Future Work

14 The FailBase Repository
A 38 GB corpus of feedback information that characterizes EGEE for one month in 2007. It paves the way to systematically study and uncover new, previously unknown knowledge from the EGEE operation.
Trace interval: March 16 – April 17, 2007.
Size: 2,565 Computing Element queues.
Testbed: dual Xeon 2.4 GHz, 1 GB RAM, connected to GEANT at 155 Mbps.

15 Presentation Outline
– Introduction and Motivation
– The FailRank Architecture
– The FailBase Repository
– Experimental Evaluation
– Conclusions & Future Work

16 Experimental Methodology
We use a trace-driven simulator that replays 197 OPS queues from the FailBase repository over 32 days. At each chronon we identify:
– the top-K queues which might fail (denoted I_set), and
– the top-K queues that have actually failed (denoted R_set), derived through the SAM tests.
We then measure the Penalty, i.e., the number of queues that failed but were not identified as failing (|R_set \ I_set|).
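In code, the penalty at a chronon is simply the size of the set difference between the failed and the identified queues; a minimal sketch with made-up queue identifiers:

def penalty(i_set, r_set):
    """Number of queues that actually failed (r_set) but were not flagged (i_set)."""
    return len(set(r_set) - set(i_set))

# Made-up example: three of the four failed queues were flagged in advance.
identified = {"q01", "q05", "q09", "q12"}     # top-K predicted to fail (I_set)
failed     = {"q01", "q05", "q09", "q17"}     # actually failed per SAM tests (R_set)
print(penalty(identified, failed))            # 1  (only "q17" was missed)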

17 Experiment 1: Evaluating FailRank
Task: "At each chronon, identify K = 20 (~8%) of the queues that might fail."
Evaluation strategies:
– FailRank selection: utilize the FSM to determine which queues have to be eliminated.
– Random selection: choose the queues to be eliminated at random.

18 Experiment 1: Evaluating FailRank
FailRank misses failing sites in 9% of the cases (average penalty ~2.14 out of K = 20), while Random misses them in 91% of the cases (average penalty ~18.19).
Point A: missing values in the trace.
Point B: a penalty greater than K can occur when |R_set| > K.

19 Experiment 2: The Scoring Function
Question: "Can we decrease the penalty even further by adjusting the scoring weights?" That is, instead of setting W_j = 1/m (Naïve Scoring), use different weights for individual attributes, e.g., W_CPU = 0.1, W_DISK = 0.2, W_NET = 0.2, W_FAIL = 0.5.
Methodology: we asked our administrators to provide indicative weights for each attribute (Expert Scoring).
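A small sketch of the two weight assignments being compared; the expert values mirror the slide's example, while the single FSM row is invented:

import numpy as np

m = 4                                            # attributes: cpu, disk, net, fail
naive_weights  = np.full(m, 1.0 / m)             # Naïve Scoring: W_j = 1/m
expert_weights = np.array([0.1, 0.2, 0.2, 0.5])  # Expert Scoring (slide example)

row = np.array([0.35, 0.80, 0.10, 0.90])         # one queue's (invented) FSM row
print(round(float(row @ naive_weights), 4))      # 0.5375
print(round(float(row @ expert_weights), 4))     # 0.665  (SAM failures weigh more)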

20 Experiment 2: The Scoring Function
Expert scoring misses failing sites in only 7.4% of the cases (average penalty ~1.48), while Naïve scoring misses them in 9% of the cases (average penalty ~2.14).
Point A: missing values in the trace.

21 Experiment 2: The Scoring Function
Expert Scoring advantages:
– Fine-grained (compared to the Random strategy).
– Significantly reduces the penalty.
Expert Scoring disadvantages:
– Requires manual tuning.
– Does not guarantee the optimal assignment of weights.
– Shifting conditions might degrade the relevance of the initially identified weights.
Future work: automatically tune the weights.

22 Presentation Outline
– Introduction and Motivation
– The FailRank Architecture
– The FailBase Repository
– Experimental Evaluation
– Conclusions & Future Work

23 Conclusions
We have presented FailRank, a new framework for integrating and ranking information sources that characterize failures in a Grid environment. We have also presented the structure of the FailBase Repository. Experimenting with FailRank has shown that it can accurately identify the sites that will fail in 91% of the cases.

24 Future Work
– In-depth assessment of the ranking algorithms presented in this paper. Objective: minimize the number of attributes required to compute the K highest-ranked sites. Study the trade-offs of different K and different scoring functions.
– Develop and deploy a real prototype of the FailRank system. Objective: validate that the FailRank concept can be beneficial in a real environment.

Grid Failure Monitoring and Ranking using FailRank
Thank you!
This presentation is available at:
Related publications available at:
Questions?