Published by Dwayne Moody. Modified over 9 years ago.
1
Young Suk Moon. Chair: Dr. Hans-Peter Bischof. Reader: Dr. Gregor von Laszewski. Observer: Dr. Minseok Kwon.
2
Outline: Introduction to the Water Threat Management Project; Motivation; Research Objectives; Fault-Tolerant Queue; Evaluation; Conclusion
3
Water Threat Management
Motivation: Urban water distribution systems (WDSs) can be an easy target for terrorist attacks, e.g. contamination of the water.
Methods: Detect contamination using sensors located across the WDS. Run algorithms (developed by NCSU) to determine sensor locations that minimize the time needed to find the contaminant source locations.
4
Existing Water Threat Management System Architecture
Optimization Engine: runs the Evolutionary Algorithm (EA).
Simulation Engine: runs EPANET.
5
Water Threat Management System Requirements
Requirements: time sensitivity; massive calculation; dynamic adaptation to a Grid environment; fault tolerance.
Our goal: The current system is not fault-tolerant, so we develop a fault-tolerant framework for the dynamic environment.
6
Motivation
Resource (site) outage: 5% down during 2009.
Queue wait time.
Source: TeraGrid User & System News (http://news.teragrid.org/)
7
Research Objectives
Develop a fault-tolerant framework that deals with resource outages. Strategy: generation distribution on multiple sites.
Reduce queue wait time. Strategy: dynamic job dependency.
8
Water Threat Management Application
Sequential and parallel processing
9
Generation Distribution
Divide the generations into multiple parts and run each part as a separate job.
Distribute the jobs across multiple sites.
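The splitting step can be sketched as follows. This is a minimal illustration, not the project's code: the function name `split_generations` and the 20-generation range are assumptions chosen to match the two-site example used later in the evaluation.

```python
def split_generations(first, last, parts):
    """Split the inclusive generation range [first, last] into `parts`
    contiguous chunks whose sizes differ by at most one."""
    total = last - first + 1
    if parts < 1 or parts > total:
        raise ValueError("parts must be between 1 and the number of generations")
    base, extra = divmod(total, parts)
    chunks = []
    start = first
    for i in range(parts):
        # The first `extra` chunks take one additional generation.
        size = base + (1 if i < extra else 0)
        chunks.append((start, start + size - 1))
        start += size
    return chunks


# Hypothetical plan: one job per site, each covering one chunk.
sites = ["Abe", "Big Red"]
plan = dict(zip(sites, split_generations(1, 20, len(sites))))
print(plan)  # {'Abe': (1, 10), 'Big Red': (11, 20)}
```

Each chunk would then be submitted as its own job on the corresponding site.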
10
Dynamic Job Dependency
Problems of generation distribution on multiple sites: additional queue wait times; each job depends on another, so a job cannot be submitted before the prior job finishes.
Solution: determine job dependency at run time. Submit all jobs at the same time; whichever job starts first computes the first set of generations.
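One way to picture run-time dependency resolution is a shared coordinator that hands out generation chunks in order to whichever job leaves its queue first. The sketch below is an assumption-laden illustration: the class `GenerationCoordinator`, the job names, and the start order are all hypothetical, and it leaves out the hand-over of results between consecutive chunks.

```python
import threading


class GenerationCoordinator:
    """Hands out generation chunks in order to whichever job asks first.
    Hypothetical helper; not part of the original system."""

    def __init__(self, chunks):
        self._chunks = list(chunks)
        self._next = 0
        self._lock = threading.Lock()
        self.assignments = {}

    def claim(self, job_name):
        """Return the next unclaimed chunk, or None when all are taken."""
        with self._lock:
            if self._next >= len(self._chunks):
                return None
            chunk = self._chunks[self._next]
            self._next += 1
            self.assignments[job_name] = chunk
            return chunk


coordinator = GenerationCoordinator([(1, 10), (11, 20)])

# Both jobs are submitted at the same time; suppose the Big Red job
# leaves its queue first and the Abe job starts later.
for job in ["job@BigRed", "job@Abe"]:
    print(job, "->", coordinator.claim(job))
```

Because the chunk is bound to a job only when that job actually starts, no job has to wait in a queue for a specific predecessor on another site.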
11
Dynamic WTM Workflow Management
Example scenario
12
Fault-tolerant Queue
Most common fault-tolerant strategies in a Grid: replication and checkpointing.
Limitations of checkpointing under time-criticality: checkpointing degrades performance; a checkpoint may not be compatible with a different site (heterogeneity); a job cannot be rescheduled on the same site in case of a site outage.
The fault-tolerant queue therefore uses the replication strategy.
13
Fault-tolerant Queue Design
Components: Command Line Interface; Task Pool; Resource Pool; Scheduler; Resource Checker (integration with the TeraGrid Information Services).
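To make the roles of the task pool, resource pool, scheduler, and resource checker concrete, here is a minimal sketch of replication-based scheduling. Everything in it is an assumption for illustration: the class `FaultTolerantQueue`, the `replicas` parameter, and the `is_up` callback standing in for the resource checker do not come from the slides.

```python
from dataclasses import dataclass, field


@dataclass
class FaultTolerantQueue:
    """Hypothetical sketch: replicate each task on several available sites."""
    resource_pool: list                    # site names, e.g. ["Abe", "Big Red"]
    replicas: int = 2                      # assumed number of copies per task
    task_pool: list = field(default_factory=list)

    def add_task(self, task):
        self.task_pool.append(task)

    def schedule(self, is_up):
        """Map each task to up to `replicas` sites that `is_up` reports as
        available. `is_up` stands in for the resource checker."""
        available = [site for site in self.resource_pool if is_up(site)]
        return {task: available[: self.replicas] for task in self.task_pool}


queue = FaultTolerantQueue(["Abe", "Big Red", "Queen Bee"])
queue.add_task("wtm-run")

# Pretend the resource checker reports Abe as down.
placement = queue.schedule(lambda site: site != "Abe")
print(placement)  # {'wtm-run': ['Big Red', 'Queen Bee']}
```

With replicas on different sites, a single site outage does not by itself stop the task from completing.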
14
Fault Detection in Fault-tolerant Queue
Messages from Grid Resource Allocation and Management (GRAM) in the Globus Toolkit: communicate with GRAM to detect job failure.
TeraGrid Information Services: the GRAM service may fail when the resource is down; the Information Services publish XML documents containing the outage information.
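The two detection paths can be combined roughly as below. This is a sketch under stated assumptions: the XML layout in `OUTAGE_XML` is invented for illustration (the real TeraGrid Information Services schema is not shown in the slides), and the status strings are placeholders for whatever GRAM reports.

```python
import xml.etree.ElementTree as ET

# Hypothetical outage document; element names are assumptions.
OUTAGE_XML = """<outages>
  <outage><resource>Abe</resource><status>down</status></outage>
  <outage><resource>Queen Bee</resource><status>up</status></outage>
</outages>"""


def sites_down(xml_text):
    """Return the set of resources that the outage document marks as down."""
    root = ET.fromstring(xml_text)
    return {
        outage.findtext("resource")
        for outage in root.findall("outage")
        if outage.findtext("status") == "down"
    }


def job_failed(site, gram_status, down_sites):
    """Decide whether a job should be treated as failed.

    gram_status is a placeholder string such as "ACTIVE" or "FAILED",
    or None when GRAM itself could not be reached.
    """
    if gram_status == "FAILED":
        return True
    if gram_status is None:
        # GRAM may be unreachable because the whole site is down,
        # so fall back to the published outage information.
        return site in down_sites
    return False


down = sites_down(OUTAGE_XML)
print(job_failed("Abe", None, down))         # True: GRAM silent, site listed as down
print(job_failed("Big Red", "ACTIVE", down)) # False: job still running
```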
15
Evaluation – WTM performance
WTM application performance (original)
Sites: Abe, Big Red
#CPUs: 16
CPUs per node: 8 (Abe), 4 (Big Red)
16
Evaluation – Queue Wait Time
Queue wait time statistics
Avg. (min): Abe 82, Big Red 42
Var.: Abe 38513, Big Red 5354
Sd.: Abe 196, Big Red 73
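As a quick consistency check, taking the variances on this slide as 38513 (Abe) and 5354 (Big Red), the standard deviations should be their square roots. The snippet below only redoes that arithmetic; the dictionary name is an assumption.

```python
import math

# Variances of queue wait time as read from the slide.
variance = {"Abe": 38513, "Big Red": 5354}

for site, var in variance.items():
    # Rounded square root should match the reported sd. (196 and 73).
    print(site, round(math.sqrt(var)))
```

In both cases the standard deviation is well above the average wait, which indicates how unpredictable the queue wait time is.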
17
Evaluation – Performance Overhead
Integrating a fault-tolerant framework usually causes performance degradation.
Our framework shows no performance loss.
18
Evaluation – Workflow Performance
Run time comparison of different workflow types: original deployment vs. fault-tolerant deployment; dynamic job dependency vs. static job dependency.
Each type of deployment is tested in the real Grid system, including queue wait time.
Workflow | Dependency | Site name | # Jobs | Gen. range
Original | - | Abe | 1 | 1-20
Original | - | Big Red | 1 | 1-20
Fault-tolerant | static | Abe, Big Red | 2 | 1-10 (Abe), 11-20 (Big Red)
Fault-tolerant | dynamic | Abe, Big Red | 2 | 1-10, 11-20
19
Workflow comparison results: Experiment 1, Experiment 2, Experiment 3
20
Simulation – Worst Case Run Time Comparison
A threat management system must deliver results under any circumstances. The worst-case run time is therefore a critical factor in the Water Threat Management system.
21
Simulation – Worst Case Run Time Comparison
Simulation setup: The generations are equally distributed among the machines. The 2009 TeraGrid outage data is used. Jobs are submitted every 5 minutes starting from 1/1/2009 12:00 am EST.
Machines: Abe, Big Red, Queen Bee
Run time per gen. (min): Abe 0.52, Big Red 2.07, Queen Bee 1.02
#CPUs: 16, 8
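The effect of distributing generations equally can be estimated from the per-generation run times on this slide. The sketch below covers compute time only and ignores queue wait time and outages; the function name `compute_minutes` and the figure of 30 generations are assumptions picked so that the split is even for one, two, or three machines.

```python
# Run time per generation in minutes, as given on the slide.
RUN_TIME_PER_GEN = {"Abe": 0.52, "Big Red": 2.07, "Queen Bee": 1.02}


def compute_minutes(machines, generations):
    """Total compute time when `generations` are split equally among
    `machines` and the chunks run one after another."""
    if generations % len(machines) != 0:
        raise ValueError("generations must divide evenly among the machines")
    per_machine = generations // len(machines)
    return sum(per_machine * RUN_TIME_PER_GEN[m] for m in machines)


GENERATIONS = 30  # hypothetical workload size

for machines in (["Abe"], ["Abe", "Big Red"], ["Abe", "Big Red", "Queen Bee"]):
    minutes = compute_minutes(machines, GENERATIONS)
    print(", ".join(machines), f"{minutes:.2f} min")
```

Running part of the work on a slower machine raises the total compute time compared with running everything on Abe, which is the trade-off accepted in exchange for tolerance to a site outage.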
22
Simulation – Worst Case Run Time Comparison
Simulation queue wait time setup (unit: minutes)
23
Simulation – Worst Case Run Time Comparison
Source: TeraGrid User & System News (http://news.teragrid.org/)
24
Simulation – Worst Case Run Time Comparison
25
Simulation – Worst Case Run Time Comparison
26
Simulation – Median Run Time, Worst Case (Max.) Run Time
27
Conclusion
Achievement: The worst-case run time is significantly reduced.
Limitation: In "general" cases, the dynamic workflow shows performance degradation, due to the low failure rate and the difference in compute performance between machines.
Possible improvement: Migrate the generation process to a faster machine whenever possible.