Optimizing the Execution of Parallel Applications in Volunteer Environments
Parallel Software Technologies Laboratory
Department of Computer Science, University of Houston
Presented by: Rakhi Anand
Advisor: Edgar Gabriel

Outline
- Introduction: background, challenges, related work
- Research objectives
- Introduction to VolpexMPI: design and experimental results
- Target selection module: implementation of various algorithms; experiments and results
- Runtime environment support: event handler and tools implemented
- Summary and conclusion
- Future work

Introduction (I)
Why parallel computing:
- To solve larger problems
- To solve given problems faster
Classes of parallel applications:
- Bag-of-tasks applications (no communication): e.g. ...
- Low-degree, static communication pattern: e.g. CFD ...
- Low-degree, dynamic communication pattern: e.g. adaptive mesh refinement ...
- High-degree communication pattern: e.g. FFTs ...

Introduction (II)
Cluster computing:
- Tightly coupled computers that act like a single system
- Expensive computing resources
Grid computing:
- The combination of computing resources from multiple administrative domains for a common goal
- Supports only non-interactive jobs

Introduction (III)
Cloud computing:
- Internet-based computing providing compute resources and/or software on demand
- High-speed network interconnects are not currently supported

Introduction (IV)
Volunteer computing:
- Volunteers around the world donate a portion of their computing resources
- E.g. ... runs on about 1 million computers
Advantages:
- High resources in terms of memory and computing power
- Easy to use
- Low compute cost

Challenges of volunteer computing (I)
High failure rate:
- Hard failures: system crash, shutdown, or hardware failure
- Soft failures: the owner starting to utilize their machine
- Failure rates are much higher in volunteer environments than, e.g., on compute clusters
Communication requirements:
- No information about where the nodes are located
- No guarantee that public IP addresses are used

Challenges of volunteer computing (II)
Heterogeneity:
- Volunteer environments provide a heterogeneous collection of nodes
- Different processor types, frequencies, memory sizes, ...
- Node properties change dynamically

Related Work (I)
Various volunteer computing systems exploit unused cycles on ordinary desktops:
- BOINC: provides support for bag-of-tasks applications; no mechanism for node-to-node communication
- Condor: a batch scheduler that allows control of distributed resources and can run MPI jobs on clusters; no support for executing MPI applications on volunteer nodes

Related Work (II)
MPI is the dominant programming paradigm for parallel scientific applications:
- Message passing paradigm
- Support for heterogeneous environments through:
  - Notion of data types
  - Process grouping
  - Collective (group) communication operations
The MPI specification does not provide any mechanism to deal with process failures.
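For context, a minimal standard MPI message exchange in C (plain MPI, not VolpexMPI-specific) illustrates the message passing paradigm and the datatype support mentioned above:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double value = 0.0;
        if (rank == 0) {
            value = 3.14;
            /* The (buffer, count, datatype) triple lets MPI convert data
               representations between heterogeneous nodes. */
            MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %f\n", value);
        }

        MPI_Finalize();
        return 0;
    }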

Related Work (III)
Research projects exploring fault tolerance for MPI can be divided into three categories:
- Extension of MPI semantics: FT-MPI; requires major changes to the user program
- Roll-back recovery: MPICH-V, LAM/MPI; better for less frequent failures
- Replication: MPI/FT, P2P-MPI, rMPI; all replicas executed in lock-step fashion

Thesis Goals (I)
Extend an efficient and scalable communication infrastructure for volatile compute environments:
- Deal with the heterogeneity of volunteer computing environments
- Deal with frequent process failures efficiently
- Deal with the challenges of new computer architectures
Increase the class of applications that can be executed in volunteer computing.

Thesis Goals (II)
Target selection:
- Optimize communication operations by contacting the most appropriate copy of the target process
Runtime environment support:
- Support for a flexible number of processes, replication levels, process failures, and various process management systems
- Support for tools for managing the runtime environment

Introduction to VolpexMPI
VolpexMPI is an MPI library designed for executing parallel applications in volunteer environments.
Key features:
- Controlled redundancy
- Receiver-based direct communication
- Distributed sender-based logging

VolpexMPI Design (I)
Point-to-point communication:
- Based on a pull model
- Goal: make the application progress according to the fastest replica
Message matching scheme:
- Virtual timestamp: a sequence counter counting the number of messages that share the same message envelope [communicator id, message tag, sender rank, receiver rank]
- Messages are matched on message envelope + virtual timestamp (a sketch of this test follows below)
[Diagram: sender-side buffer and receiver]
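A minimal sketch of such envelope-plus-timestamp matching; the struct and function names are illustrative assumptions, not VolpexMPI's actual internals:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical message envelope, following the slide's
       [communicator id, message tag, sender rank, receiver rank],
       plus the virtual timestamp (per-envelope sequence counter). */
    typedef struct {
        int      comm_id;
        int      tag;
        int      sender;
        int      receiver;
        uint64_t vts;   /* virtual timestamp */
    } envelope_t;

    /* A message satisfies a request only if the whole envelope and
       the virtual timestamp agree, so a message re-pulled from a
       different replica is recognized as the same logical message. */
    static bool envelope_matches(const envelope_t *a, const envelope_t *b) {
        return a->comm_id  == b->comm_id  &&
               a->tag      == b->tag      &&
               a->sender   == b->sender   &&
               a->receiver == b->receiver &&
               a->vts      == b->vts;
    }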

VolpexMPI Design (II)
Buffer management:
- Messages are stored together with their message envelope
- A circular buffer is used to store messages (a sketch follows below)
- The oldest entry is overwritten
Data transfer:
- Non-blocking sockets
- Connection setup is handled on demand
- Timeouts for connection establishment and communication operations
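A minimal sketch of such a circular sender-side log, assuming fixed-size slots and reusing the illustrative envelope_t from the sketch above:

    #include <stddef.h>
    #include <string.h>

    #define LOG_SLOTS 1024
    #define MAX_MSG   4096

    /* One slot of the hypothetical sender-side log:
       envelope plus payload. */
    typedef struct {
        envelope_t env;
        size_t     len;
        char       data[MAX_MSG];
    } log_slot_t;

    typedef struct {
        log_slot_t slots[LOG_SLOTS];
        size_t     next;   /* index of the slot to overwrite next */
    } send_log_t;

    /* Store a sent message; once the buffer wraps around, the oldest
       entry is silently overwritten, as described on the slide. */
    static void log_store(send_log_t *log, const envelope_t *env,
                          const void *data, size_t len) {
        log_slot_t *slot = &log->slots[log->next];
        slot->env = *env;
        slot->len = len < MAX_MSG ? len : MAX_MSG;
        memcpy(slot->data, data, slot->len);
        log->next = (log->next + 1) % LOG_SLOTS;
    }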

Managing Replicated MPI Processes
- Processes are spawned in teams
- Only in case of failure is a process from a different team contacted (as sketched below)
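A minimal sketch of this team-first target order, assuming a hypothetical alive() callback that reports whether a given replica is reachable:

    /* Return the team whose replica of target_rank should be contacted:
       our own team first; other teams only if that replica has failed. */
    static int pick_target_team(int target_rank, int my_team, int num_teams,
                                int (*alive)(int rank, int team)) {
        if (alive(target_rank, my_team))
            return my_team;
        for (int t = 0; t < num_teams; t++)
            if (t != my_team && alive(target_rank, t))
                return t;
        return -1;  /* no live replica of this rank */
    }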

Experimental results VolpexMPI (I)
[Plots: runs for 32 processes; runs for 64 processes]

Experimental results VolpexMPI (II)
[Plots: redundancy runs (16 processes); failure runs (16 processes)]

The Target Selection Problem (I)
- Identifying the best set of replicas
- It is beneficial to connect to the fastest replica
- But doing so makes the fast replica slow by making it handle a larger number of requests (up to n-1)

The Target Selection Problem (II)
Definition: create an ordering of the replicas of each application process in order to optimize the performance of the application.
Four algorithms explored:
- Network performance based
- Virtual timestamp based
- Timeout based
- Hybrid approach

Network performance based algorithm
- Each process prioritizes replicas based on latency/bandwidth values (sketch below)
- Measurements are performed during the regularly occurring communication operations of the application
- Advantage: can dynamically detect changes in network characteristics
- Disadvantage: misleading performance numbers due to overlapping, asynchronous communication operations
[Diagram: replica pairs (0,A), (1,B), (0,B), (1,A)]
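A minimal sketch of latency-based replica ordering; the record layout and names are illustrative assumptions:

    #include <stdlib.h>

    /* Per-replica record: round-trip latency in seconds, measured
       during the application's normal communication operations. */
    typedef struct {
        int    replica_id;
        double latency;
    } replica_info_t;

    static int by_latency(const void *a, const void *b) {
        double la = ((const replica_info_t *)a)->latency;
        double lb = ((const replica_info_t *)b)->latency;
        return (la > lb) - (la < lb);
    }

    /* Order the candidate replicas of a target rank so that the
       lowest-latency copy is contacted first. */
    static void order_replicas(replica_info_t *replicas, size_t n) {
        qsort(replicas, n, sizeof(replica_info_t), by_latency);
    }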

Virtual timestamp based algorithm
- Two processes that are close from an execution perspective should contact each other (sketch below)
- Advantage: processes close in execution state group together without interfering with the execution of other processes
- Disadvantage: difficult to determine for synchronized MPI applications
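A minimal sketch of picking the replica closest in execution progress, using the virtual timestamp as the progress measure (names are illustrative; assumes at least one candidate):

    #include <stddef.h>
    #include <stdint.h>

    static uint64_t vts_gap(uint64_t a, uint64_t b) {
        return a > b ? a - b : b - a;
    }

    /* Index of the replica whose virtual timestamp is closest to ours,
       so processes at similar points in the execution pair up. */
    static size_t closest_in_progress(const uint64_t *replica_vts,
                                      size_t n, uint64_t my_vts) {
        size_t best = 0;
        for (size_t i = 1; i < n; i++)
            if (vts_gap(replica_vts[i], my_vts) <
                vts_gap(replica_vts[best], my_vts))
                best = i;
        return best;
    }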

Timeout based algorithm
- Switch to another replica if a request is not handled within a given time frame (sketch below)
- Advantage: if a replica is too slow, the application advances at the speed of the fast set of replicas
- Disadvantage: difficult to determine a good timeout value
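A minimal sketch of the switching rule, with an assumed fixed timeout; the bookkeeping around pending requests is elided:

    #include <time.h>

    #define TIMEOUT_SEC 5.0  /* assumed value; hard to choose well, per the slide */

    /* If the current replica has not answered within TIMEOUT_SEC,
       move on to the next candidate in the priority list. */
    static int maybe_switch_replica(int current, int num_replicas,
                                    time_t request_start) {
        if (difftime(time(NULL), request_start) > TIMEOUT_SEC)
            return (current + 1) % num_replicas;  /* try the next replica */
        return current;                           /* keep waiting */
    }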

Hybrid algorithm
Combination of the network based and virtual timestamp based algorithms.
First step:
- Pairwise communication is initiated during initialization
- The best target is determined based on network parameters
Second step:
- Based on the virtual timestamp
- If a process is lagging behind, it changes its target to the slow process
Disadvantage: increased initialization time

Experiments (I)
Three different sets of experiments were performed:
- Heterogeneous network: reduce network performance to Fast Ethernet for selected nodes
- Heterogeneous processor: reduce processor frequency to 1.1 GHz for selected nodes
- Heterogeneous network and processor: reduce network performance to Fast Ethernet and processor frequency to 1.1 GHz for selected nodes
Executed the NAS Parallel Benchmarks (Class B):
- BT, CG, EP, FT, IS, SP (for double redundancy, x2)
- CG, EP, IS (for triple redundancy, x3)

Experiments (II)
- No two replicas of the same rank are on the same network or the same processor type
- All teams have processes on each set of nodes
- The runtime of the original team-based approach is compared with all other algorithms

Results for heterogeneous network configuration

Results for heterogeneous processor configurations

Results for heterogeneous network and processor (I)

Results for heterogeneous network and processor (II)
[Table: runtimes of CG, IS, and EP for Team A, Team B, and Team C]
- Team A running on shared memory
- Team B running on Gigabit Ethernet
- Team C running on Fast Ethernet

Findings
- The hybrid approach shows a significant performance benefit over the other algorithms for the most common scenarios
- Pairwise communication is required to determine the network parameters

Run Time Environment Support
A lightweight environment for spawning processes in volunteer environments.
Challenges involved:
- Flexible number of processes
- Replication levels
- Process failures
- Supporting various software infrastructures (ssh, Condor)
- Managing MPI processes
- Dynamically adding new processes or hosts, or deleting existing processes

Process Manager (I)
Command used to start different processes on different hosts:

    ./mpirun -np 2 -redundancy 2 -hostfile host.txt ./test

Here -np gives the number of processes, -redundancy the redundancy level, -hostfile the file with the list of host names, and ./test the name of the executable.

Process Manager (II)
- Goal: spawn the given number of processes according to the desired replication level
- Maintains information about hosts and processes
- Processes are spawned according to the requested execution environment (ssh, Condor)
- If a host list is provided, processes are distributed across the hosts in round-robin manner (see the sketch below)
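A minimal sketch of round-robin placement with redundancy; the placement rule shown is an assumption, not necessarily the process manager's exact one:

    #include <stdio.h>

    /* Replica `copy` of rank `rank` goes to host
       (rank * redundancy + copy) % num_hosts. */
    static int host_for(int rank, int copy, int redundancy, int num_hosts) {
        return (rank * redundancy + copy) % num_hosts;
    }

    int main(void) {
        const char *hosts[] = { "hostA", "hostB", "hostC" };
        int np = 2, redundancy = 2, num_hosts = 3;

        for (int r = 0; r < np; r++)
            for (int c = 0; c < redundancy; c++)
                printf("rank %d, replica %d -> %s\n",
                       r, c, hosts[host_for(r, c, redundancy, num_hosts)]);
        return 0;
    }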

Process Manager (III)
Spawning on Condor (volunteer environment):
- The process manager creates a submit file for Condor and spawns the executable on each available node (an illustrative submit file follows)
- During initialization, each process sends its hostname and asks for its rank
- On receiving the message from each process, the process manager sends the necessary information to each process
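A minimal Condor submit file of the kind the process manager might generate; only standard Condor submit syntax is used, and the concrete values are assumptions:

    universe   = vanilla
    executable = test
    output     = test.out.$(Process)
    error      = test.err.$(Process)
    log        = test.log
    queue 4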

Experimental results for Condor

Event Management System
System messages use an integrated event management system:
- PRODUCER: generates an event (an existing process or an external process)
- DISTRIBUTOR: distributes the generated event (the event manager, via distributor event handling functions)
- CONSUMER: should be made aware of the event (existing processes, via consumer event handling functions)

Types of Events
- ADD: adding a new process
- DEATH: deleting an existing process
- EXIT: ask a process to stop execution
- ABORT: aborting the jobid
- INFO: get details of a process, host, or job
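A minimal sketch of how these event types and the producer/distributor/consumer roles could be encoded; the API shape is an assumption, only the event names come from the slides:

    /* The five event types listed above. */
    typedef enum {
        EV_ADD,    /* a new process was added             */
        EV_DEATH,  /* an existing process was deleted     */
        EV_EXIT,   /* ask a process to stop execution     */
        EV_ABORT,  /* abort the whole job (jobid)         */
        EV_INFO    /* query details of a process/host/job */
    } event_type_t;

    typedef struct {
        event_type_t type;
        int          rank;   /* process the event refers to, if any */
        int          jobid;
    } event_t;

    /* Consumers register a handler; the distributor (event manager)
       invokes it for every event they should be aware of. */
    typedef void (*event_handler_t)(const event_t *ev);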

Handling process failure
[Diagram: Proc 1, Proc 2, Proc 3 connected to the process manager, which also acts as the event manager]

Tools Implemented
Process monitor:
- Analyzes the progress of an application
- Retrieves information about processes, hosts, and jobs
Process controller:
- Adds or deletes processes
- Adds a new host
- Provides the capability to spawn a new replica

Summary and conclusion (I)
The new target selection algorithms lead to significant benefits in heterogeneous settings:
- The hybrid algorithm's performance numbers are similar to the results obtained as if all processes were running on fast nodes
- The initial node distribution plays a very important role
- The communication characteristics of an application can help reduce the initialization time of the hybrid approach

Summary and conclusion (II)
An architecture for a dynamic, fault-tolerant runtime environment was developed:
- Support for various replication levels
- Support for dynamic process management
- Support for various software infrastructures

Future Work (I)
Initial node selection:
- Using available network proximity algorithms
- Distributing processes according to their communication patterns
Extending the runtime environment:
- Integration with BOINC
- Adding support for a new event: CHECKPOINT
- Adding a policy component that allows specifying rules for automatic actions

Future Work (II)
Multi-core optimizations:
- Matching the number of processes with the number of cores offered by volunteer clients
- Sharing buffer management between multiple processes on a multi-core processor
- Adapting the communication scheme to the process layout
Exploring real-world applications in volunteer compute environments.
Optimizing collective operations:
- Using the underlying network topology
- Extending topology-aware algorithms

Timetable

Publications
- Troy LeBlanc, Rakhi Anand, Edgar Gabriel, and Jaspal Subhlok. VolpexMPI: An MPI Library for Execution of Parallel Applications on Volatile Nodes. In M. Ropo, J. Westerholm, J. Dongarra (Eds.), Recent Advances in Parallel Virtual Machine and Message Passing Interface, LNCS 5759, pp.
- Rakhi Anand, Edgar Gabriel, and Jaspal Subhlok. Communication Target Selection for Replicated MPI Processes. Submitted to: EuroMPI 2010 Conference, Stuttgart, Germany, September 12-15, 2010.
- Troy LeBlanc, Rakhi Anand, Edgar Gabriel, and Jaspal Subhlok. A Robust and Efficient Message Passing Library for Volunteer Computing Environments. Submitted to: Journal of Grid Computing, April 2010.