R. Zhang, A. Chien, A. Mandal, C. Koelbel,


Decoupled Resource Selection and Application Scheduling with Virtual Grids
R. Zhang, A. Chien, A. Mandal, C. Koelbel, H. Casanova, J. Chou, K. Kennedy, R. Huang

Motivation
Application scheduling algorithms can be unscalable (albeit polynomial) and thus unusable in large-scale environments.
One reason for this unscalability is that they perform implicit resource selection.
Over the past years, Grid infrastructures have been deployed at larger and larger scales, with envisioned deployments comprising tens of thousands of resources.
Therefore, scheduling algorithm scalability is a critical problem.

Three Hypotheses
  1. One can achieve better scalability by decoupling resource selection from scheduling (aka “decoupled” algorithms).
  2. One can achieve performance similar to that of the non-decoupled approach (aka “one step” algorithms) by selecting resources judiciously.
  3. If the application is communication intensive, one can achieve better performance by structuring resources into “close” (in terms of connectivity) groups.

Experimental Design
Case study: workflow applications (DAGs), using Anirban’s scheduler as a starting point.
Define and generate the experimental environment, including the universe of compute and network resources and DAGs representing different applications.
Three scheduling approaches:
  1. Improve the scheduler’s implementation so that it can handle large-scale environments (over 36K nodes).
  2. Equip the scheduler with the capability to sort and select resources, then schedule applications within the pre-selected resources in a decoupled fashion.
  3. Query vgES, using vgDL, to get VGs with different structures, and try scheduling applications within those VGs.
Conduct experiments and compare the three approaches.

Experimental Environment
We use simulation.
Environment:
  Application model: representative DAGs from EMAN and Montage, with a few simple parameters varied (e.g., width, comp/comm ratio).
  Network model: we use BRITE to generate the network topology, and also two random sets that follow normal distributions.
  Resource model: we use Yang-Suk’s synthetic cluster generator.
Assumptions:
  The performance model and the network measurements are accurate.
  There is no other load on the nodes we use.
  Binding is instantaneous and always successful.
  The time to obtain resource information is negligible.

One-step Approach
Run a polynomial-time scheduling algorithm over all resources.
Objective: minimize application turnaround time (scheduling time + makespan).
We measure the scheduling time and compute the makespan.
Scheduling algorithms: Greedy; Anirban’s min-min, min-max, sufferage heuristic.
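To make the min-min heuristic named on this slide concrete, here is a minimal sketch for a set of independent tasks (for example, one level of a DAG). It is not the scheduler used in the experiments: the `exec_time` matrix and the function name `min_min` are illustrative assumptions, and data transfer costs are ignored.

```python
def min_min(exec_time):
    """Min-min list scheduling for independent tasks.

    exec_time[t][r] is the estimated run time of task t on resource r
    (a hypothetical input; the slides do not specify a data format).
    Returns (schedule, makespan), where schedule maps
    task -> (resource, start, finish).
    """
    n_res = len(exec_time[0])
    ready = [0.0] * n_res                 # time at which each resource is free
    unscheduled = set(range(len(exec_time)))
    schedule = {}

    while unscheduled:
        best = None                       # (finish, task, resource)
        # For every unscheduled task, look at its completion time on every
        # resource; the global minimum is the task whose best completion
        # time is smallest, which is exactly the min-min choice.
        for t in sorted(unscheduled):
            for r in range(n_res):
                finish = ready[r] + exec_time[t][r]
                if best is None or finish < best[0]:
                    best = (finish, t, r)
        finish, t, r = best
        schedule[t] = (r, ready[r], finish)
        ready[r] = finish
        unscheduled.remove(t)

    return schedule, max(ready)
```

Each round scans every remaining task against every resource, so this naive version costs O(T^2 * R) for T tasks and R resources. That dependence on R is what makes running it over the full resource universe expensive.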

Decoupled Approach
Perform resource selection:
  Random selection (out of 36K resources, pick X at random).
  Guided selection (out of 36K resources, pick the X fastest in terms of clock rate).
  vgDL specification, with the selected resources returned as a VG.
Run the one-step algorithms over the selected resources.
Measure the time for selection and the time to compute the schedule, and compute the schedule length.
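The random and guided selection steps can be sketched as follows. This is an illustration only: the resource records (dicts with `id` and `clock_ghz` fields) and the function names are assumptions, not the format used in the experiments.

```python
import heapq
import random


def random_selection(resources, x, seed=None):
    """Pick x resources uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(resources, x)


def guided_selection(resources, x):
    """Pick the x resources with the highest clock rate.

    'clock_ghz' is a hypothetical field name for the clock rate.
    """
    return heapq.nlargest(x, resources, key=lambda r: r["clock_ghz"])
```

The pre-selected subset is then passed to a one-step scheduler in place of the full universe, so the scheduler's cost depends on X rather than on the 36K resources.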

Experimental Methodology (without vgES)
[Diagram; component labels: BRITE, DML file, Cluster Generator, DML parser, random selection, non-random selection, Scheduler (Alg #1, Alg #2, Alg #3, ...)]

Experimental Methodology (with vgES)
[Diagram; component labels: BRITE, DML file, vgES DB, Cluster Generator, DML Wrapper Agent, vgFAB, vgDL spec, Scheduler (Alg #1, Alg #2, Alg #3, ...), VG]

Three Questions
  1. What is the gain in scalability?
  2. How does one create a “good” vgDL spec?
  3. What is the change in the total schedule length?

Scalability

One-step vs Pre-selection

What kind of VG to ask for?
VG structure: the overall structure is a LooseBag of TightBags of nodes. The argument is that it guarantees the desired subset of resources, which are both fast and close together.
Type of resources: we have a rough performance model for certain processors and prefer nodes with higher clock speed. What if there is no performance model, or the performance is not good enough?
Number of resources: the simplest estimate is based on the DAG width. More experiments will help create more precise models.
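One way to obtain a width-based estimate of the number of resources is to count the tasks at each level of the DAG and take the largest count. The sketch below assumes this level-based definition of width (the slides do not say which definition was used) and a hypothetical successor-list representation of the DAG.

```python
from collections import Counter


def dag_level_width(succ):
    """Return the largest number of tasks that share the same level.

    succ maps each task to a list of its successors (hypothetical format).
    A task's level is the length of the longest path from any entry task.
    """
    tasks = set(succ) | {v for vs in succ.values() for v in vs}
    if not tasks:
        return 0

    indegree = {t: 0 for t in tasks}
    for vs in succ.values():
        for v in vs:
            indegree[v] += 1

    level = {t: 0 for t in tasks}
    frontier = [t for t in tasks if indegree[t] == 0]
    # Kahn-style traversal: a task is expanded only after all of its
    # predecessors, so its level is final when it is popped.
    while frontier:
        t = frontier.pop()
        for v in succ.get(t, []):
            level[v] = max(level[v], level[t] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                frontier.append(v)

    return max(Counter(level.values()).values())
```

Asking for more nodes than this number cannot increase the parallelism available within a level, which is why the width is a natural first estimate.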

vgDL query
  VG = LooseBagOf(tb) [1:500] [LooseBag.Nodes == 379]
  {
    tb = TightBagOf(node) [1:500] [Rank = Nodes]
    node = [ (Processor == OPTERON) || (Processor == ITANIUM) ]
    Rank = Clock
  }

VG Performance

VG Performance

Future work
Communication/computation ratio:
  The current approach simply sums the computation time on each node and the communication time between nodes.
  Both the experimental results and the analysis show that this is not good enough to reveal the properties of the DAG.
Generate vgDL from the DAG:
  The structure of the vgDL largely depends on the comm/comp ratio of the DAG. It is a trade-off between better resources and closer resources.
  More experiments are needed to determine the right approach.
Experiments on real resources:
  The Pegasus/vgES integration.
  Schedule on the test bed.
  Consider the cost model.
  Schedule based on clusters (hybrid model).
  Fault tolerance (detect failure and reschedule after it happens).
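The simple ratio described on this slide, total communication time over total computation time, can be written as a short sketch. The dictionary inputs and the function name are assumptions made for illustration. As the slide notes, a single aggregate number like this hides where in the DAG the communication occurs.

```python
def comm_comp_ratio(comp_time, comm_time):
    """Aggregate communication-to-computation ratio of a DAG.

    comp_time maps task -> computation time.
    comm_time maps (src, dst) edge -> communication time.
    Both formats are hypothetical.
    """
    total_comp = sum(comp_time.values())
    total_comm = sum(comm_time.values())
    if total_comp == 0:
        raise ValueError("total computation time must be positive")
    return total_comm / total_comp
```

For example, a DAG with 40 time units of computation and 20 units of communication gets a ratio of 0.5 regardless of whether the communication is spread evenly or concentrated on one edge.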