Young Suk Moon
Chair: Dr. Hans-Peter Bischof
Reader: Dr. Gregor von Laszewski
Observer: Dr. Minseok Kwon



Outline
- Introduction to the Water Threat Management Project
- Motivation
- Research Objectives
- Fault-Tolerant Queue
- Evaluation
- Conclusion

Water Threat Management
Motivation
- Urban Water Distribution Systems (WDSs) can be an easy target of terror attacks, e.g. contamination of the water.
Methods
- Detect contamination using the sensors located across the WDSs.
- Run algorithms (developed by NCSU) to determine sensor locations that minimize the time needed to find the contaminant source locations.

Existing Water Threat Management System Architecture
- Optimization Engine: runs the Evolutionary Algorithm (EA)
- Simulation Engine: runs EPANET

Water Threat Management System Requirements
Requirements
- Time sensitive
- Massive calculation
- Dynamic adaptation to a Grid environment
- Fault tolerance
Our goal
- The current system is not fault-tolerant: develop a fault-tolerant framework for the dynamic environment.

Motivation
- Resource (site) outage: 5% down during 2009
- Queue wait time
TeraGrid User & System News (

Research Objectives
- Develop a fault-tolerant framework dealing with resource outages
  Strategy: generation distribution on multiple sites
- Reduce queue wait time
  Strategy: dynamic job dependency

Water Threat Management Application
- Sequential & parallel processing

Generation Distribution
- Divide generations into multiple parts as multiple jobs.
- Distribute them on multiple sites.
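The division step can be illustrated with a minimal sketch. It assumes the generations are numbered from 1 and split into contiguous, nearly equal ranges, one job per site; the function name `split_generations` and the tuple layout are hypothetical and not taken from the presentation.

```python
def split_generations(total_generations, sites):
    """Divide generations 1..total_generations into contiguous ranges, one per site.

    Hypothetical helper: returns a list of (site, first_generation, last_generation).
    Earlier sites receive one extra generation when the split is uneven.
    """
    if total_generations < 1 or not sites:
        raise ValueError("need at least one generation and one site")
    base, extra = divmod(total_generations, len(sites))
    plan = []
    start = 1
    for index, site in enumerate(sites):
        size = base + (1 if index < extra else 0)
        if size == 0:
            continue  # more sites than generations: leave this site unused
        end = start + size - 1
        plan.append((site, start, end))
        start = end + 1
    return plan


if __name__ == "__main__":
    # Two sites sharing a 20-generation run.
    for site, first, last in split_generations(20, ["Abe", "Big Red"]):
        print(f"{site}: generations {first}-{last}")
```

Each range would then be submitted as a separate job on its site, so an outage at one site affects only part of the run.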

Dynamic Job Dependency
Problems of generation distribution on multiple sites
- Additional queue wait times
- Each job is dependent on another.
- Cannot submit a job before the prior job finishes.
Solution: determine job dependency at run time.
- Submit jobs at the same time.
- Whichever job starts first computes the first set of generations.
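A minimal sketch of run-time dependency resolution follows. It assumes all jobs are submitted up front and that each job, when it leaves its queue, claims the next unassigned generation range from a shared coordinator. The class `GenerationChunks` is hypothetical, and the in-process lock stands in for whatever cross-site coordination the real framework uses. Handing the intermediate results from one range to the next is not modeled here.

```python
import threading


class GenerationChunks:
    """Shared list of generation ranges that jobs claim in start order (hypothetical)."""

    def __init__(self, ranges):
        self._ranges = list(ranges)
        self._next = 0
        self._lock = threading.Lock()
        self.assignments = []  # (job_name, range) in the order jobs started

    def claim(self, job_name):
        """Return the next unclaimed range, or None when every range is taken."""
        with self._lock:
            if self._next >= len(self._ranges):
                return None
            chunk = self._ranges[self._next]
            self._next += 1
            self.assignments.append((job_name, chunk))
            return chunk


if __name__ == "__main__":
    chunks = GenerationChunks([(1, 10), (11, 20)])
    # Suppose the Big Red job leaves its queue before the Abe job.
    print("Big Red job computes", chunks.claim("Big Red job"))
    print("Abe job computes", chunks.claim("Abe job"))
```

Because the range is chosen when a job starts rather than when it is submitted, no job has to wait in a queue behind a predecessor that has not yet been scheduled.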

Dynamic WTM Workflow Management
- Example scenario

Fault-tolerant Queue
Most common fault-tolerant strategies in a Grid
- Replication
- Checkpointing
Limitations of checkpointing under time-criticality
- Checkpointing performance degradation
- Checkpointing may not be compatible on a different site (heterogeneity)
- Cannot reschedule a job on the same site in case of site outage
Choosing the replication strategy within the fault-tolerant queue

Fault-tolerant Queue Design
Components
- Command Line Interface
- Task Pool
- Resource Pool
- Scheduler
- Resource Checker (integration with the TeraGrid Information Services)
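How the task pool, resource pool, scheduler, and resource checker might interact under a replication strategy can be sketched as below. This is an illustration under assumptions, not the presented implementation: the class `FaultTolerantQueue`, its method names, and the default of two replicas per task are all hypothetical.

```python
class FaultTolerantQueue:
    """Toy replication scheduler (hypothetical names and policy)."""

    def __init__(self, resource_names, replicas=2):
        # Resource pool: resource name -> True while the resource is believed up.
        self.resource_pool = {name: True for name in resource_names}
        # Task pool: task name -> resources currently running a replica of it.
        self.task_pool = {}
        self.replicas = replicas

    def _pick(self, exclude, count):
        """Scheduler policy: first available resources not already in use by the task."""
        candidates = [name for name, up in self.resource_pool.items()
                      if up and name not in exclude]
        return candidates[:count]

    def submit(self, task_name):
        """Place replicas of a task on distinct available resources."""
        placed = self._pick(exclude=(), count=self.replicas)
        if not placed:
            raise RuntimeError("no resource available")
        self.task_pool[task_name] = placed
        return placed

    def mark_down(self, resource_name):
        """Resource-checker callback: drop lost replicas and place replacements."""
        self.resource_pool[resource_name] = False
        for placed in self.task_pool.values():
            if resource_name in placed:
                placed.remove(resource_name)
                placed.extend(self._pick(exclude=placed, count=1))


if __name__ == "__main__":
    queue = FaultTolerantQueue(["Abe", "Big Red", "Queen Bee"])
    print(queue.submit("wtm-run"))       # replicas on the first two resources
    queue.mark_down("Abe")               # outage reported for Abe
    print(queue.task_pool["wtm-run"])    # replica moved to a remaining resource
```

Replication avoids the checkpoint compatibility problem across heterogeneous sites, at the cost of running the same work more than once.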

Fault Detection in Fault-tolerant Queue
Fault detection
- Message from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit
  - Communicate with GRAM to detect job failure
- TeraGrid Information Services
  - GRAM service may fail when the resource is down
  - Publishes XML documents containing the outage information

Evaluation – WTM performance
WTM application performance (original)
- #CPUs: 16
- CPU per node: 8 (Abe), 4 (Big Red)

Evaluation – Queue Wait Time
Queue wait time statistics (Abe, Big Red)
Avg. (min)8242 Var sd

Evaluation – Performance Overhead
Performance overhead
- Integrating a fault-tolerant framework usually causes performance degradation
- No performance loss in our framework

Evaluation – Workflow Performance
Run time comparison of different workflow types
- Original deployment vs. fault-tolerant deployment
- Dynamic job dependency vs. static job dependency
- Test each type of deployment in the real Grid system, including queue wait time

Workflow        Dependency  Site Name     # Jobs  Gen. range
Original        -           Abe           1       1-20
Original        -           Big Red       1       1-20
Fault-tolerant  static      Abe, Big Red  2       1-10 (Abe), (Big Red)
Fault-tolerant  dynamic     Abe, Big Red  2       1-10,

Workflow comparison results
- Experiment 1
- Experiment 2
- Experiment 3

Simulation – Worst Case Run Time Comparison
A threat management system must deliver results under any circumstances. Thus, the worst-case run time is a critical factor in the Water Threat Management system.

Simulation – Worst Case Run Time Comparison
Simulation setup
- The generations are equally distributed among the machines.
- Use the 2009 TeraGrid outage data.
- Submit jobs every 5 minutes starting from 1/1/ :00 am EST.
Machines: Abe, Big Red, Queen Bee
Run Time per Gen. (min)
#CPUs16 8
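One way such a worst-case simulation could be structured is sketched below. The model is an assumption, not the presented simulator: a job that overlaps an outage is lost and resubmitted when the outage ends, it waits in the queue again, and a replicated job finishes when its earliest replica finishes. All times are in minutes, and the outage windows and wait times in the example are made-up placeholders rather than the 2009 TeraGrid data.

```python
def finish_time(submit, wait, run, outages):
    """Completion time on one site, restarting after any outage that interrupts the job.

    outages is a list of (down_from, down_until) windows in minutes.
    """
    start = submit + wait
    for down_from, down_until in sorted(outages):
        if down_until <= start:
            continue  # outage ended before the job started
        if down_from >= start + run:
            break  # job completes before this outage begins
        start = down_until + wait  # job lost: resubmit after the outage, queue again
    return start + run


def worst_case_turnaround(submit_times, sites):
    """Largest submit-to-finish time over all submissions.

    sites is a list of (wait, run, outages); each job is replicated on every
    listed site and the earliest replica to finish determines the result.
    """
    return max(
        min(finish_time(t, wait, run, outages) for wait, run, outages in sites) - t
        for t in submit_times
    )


if __name__ == "__main__":
    submissions = range(0, 60, 5)  # one submission every 5 minutes
    single_site = [(10, 60, [(30, 100)])]
    two_sites = [(10, 60, [(30, 100)]), (20, 80, [])]
    print(worst_case_turnaround(submissions, single_site))
    print(worst_case_turnaround(submissions, two_sites))
```

Taking the maximum over all submission times, rather than the mean, matches the emphasis on delivering results even when a site outage coincides with a run.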

Simulation – Worst Case Run Time Comparison
Simulation queue wait time setup (unit: minutes)

Simulation – Worst Case Run Time Comparison
TeraGrid User & System News (


Simulation – Median Run Time, Worst Case (Max.) Run Time

Conclusion
- Achievement: the worst-case run time is significantly reduced.
- Limitation: in "general" cases, the dynamic workflow shows performance degradation, due to the low failure rate and the difference in compute performance between machines.
- Possible improvement: migrate the generation process to a faster machine whenever possible.