High Throughput Computing (HTC)
Miron Livny, Center for High Throughput Computing, Computer Sciences Department, University of Wisconsin-Madison


The words of Koheleth son of David, king in Jerusalem (~200 B.C.E.): “Only that shall happen which has happened, only that occur which has occurred; there is nothing new beneath the sun!” Ecclesiastes (קֹהֶלֶת, Kohelet), Chapter 1, verse 9. [Slide image: wood engraving of Solomon by Gustave Doré (1832–1883)]

In 1983 I wrote a Ph.D. thesis – “Study of Load Balancing Algorithms for Decentralized Distributed Processing Systems”

BASICS OF AN M/M/1 SYSTEM
• Expected # of customers in the system is ρ/(1-ρ), where ρ is the utilization
• When utilization is 80%, you wait on average 4 units for every unit of service
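These claims follow from the standard M/M/1 formulas; here is a short worked version (my addition), writing ρ for the utilization and 1/μ for the mean service time:

```latex
% Standard M/M/1 results (utilization \rho = \lambda/\mu < 1):
\[
  \mathbb{E}[\text{customers in system}] \;=\; \frac{\rho}{1-\rho},
  \qquad
  \frac{\mathbb{E}[\text{waiting time}]}{\text{mean service time}}
  \;=\; \frac{\rho}{1-\rho}.
\]
% At 80% utilization:
\[
  \rho = 0.8 \;\Longrightarrow\; \frac{\rho}{1-\rho} = \frac{0.8}{0.2} = 4,
\]
% i.e., four units of waiting, on average, for every unit of service.
```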

BASICS OF TWO M/M/1 SYSTEMS
• When utilization is 80%, you wait on average 4 units for every unit of service
• When utilization is 80%, 25% of the time a customer is waiting for service while a server is idle
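Where the 25% comes from (my derivation, assuming the two queues are independent and that “a customer is waiting” means a queue holds at least two customers):

```latex
% For one M/M/1 queue at utilization \rho:
%   P(server idle)           = P(N = 0)   = 1 - \rho
%   P(a customer is waiting) = P(N \ge 2) = \rho^2
% With two independent queues, "wait while idle" means one queue has a waiting
% customer while the other server sits idle (two disjoint cases):
\[
  P(\text{WwI}) \;=\; 2\,\rho^{2}\,(1-\rho) \;=\; 2\,(0.8)^{2}(0.2) \;\approx\; 0.26.
\]
```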

[Figure: Prob(WwI) vs. utilization for a system of m independent M/M/1 queues, with curves for m = 2, 5, 10, and 20]
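The curves can be reconstructed from the same model; the sketch below is my illustrative reconstruction (not the original plotting code), assuming m independent M/M/1 queues at a common utilization:

```python
# Probability of "Wait while Idle" (WwI) in m independent M/M/1 queues:
# some queue holds a waiting customer (N >= 2) while some server is idle (N = 0).
# Using inclusion-exclusion on the complements:
#   P(WwI) = 1 - P(no queue waiting) - P(no server idle) + P(neither)
#          = 1 - (1 - rho**2)**m - rho**m + (rho * (1 - rho))**m

def prob_wwi(rho: float, m: int) -> float:
    """P(WwI) for m independent M/M/1 queues, each at utilization rho."""
    return 1 - (1 - rho**2)**m - rho**m + (rho * (1 - rho))**m

if __name__ == "__main__":
    for m in (2, 5, 10, 20):
        row = ", ".join(f"rho={rho:.1f}: {prob_wwi(rho, m):.3f}"
                        for rho in (0.2, 0.5, 0.8))
        print(f"m={m:2d}  {row}")
    # For m=2 and rho=0.8 this gives ~0.256, matching the previous slide.
```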

Claims for “benefits” provided by Distributed Processing Systems
• High Availability and Reliability
• High System Performance
• Ease of Modular and Incremental Growth
• Automatic Load and Resource Sharing
• Good Response to Temporary Overloads
• Easy Expansion in Capacity and/or Function
P.H. Enslow, “What is a Distributed Data Processing System?”, Computer, January 1978

Definitional Criteria for a Distributed Processing System
• Multiplicity of resources
• Component interconnection
• Unity of control
• System transparency
• Component autonomy
P.H. Enslow and T.G. Saponas, “Distributed and Decentralized Control in Fully Distributed Processing Systems”, Technical Report, 1981

Multiplicity of resources The system should provide a number of assignable resources for any type of service demand. The greater the degree of replication of resources, the better the ability of the system to maintain high reliability and performance.

Component interconnection A Distributed System should include a communication subnet which interconnects the elements of the system. The transfer of information via the subnet should be controlled by a two-party, cooperative protocol (loose coupling).

Unity of Control All the components of the system should be unified in their desire to achieve a common goal. This goal will determine the rules according to which each of these elements will be controlled.

System transparency From the user's point of view, the set of resources that constitutes the Distributed Processing System acts like a “single virtual machine”. When requesting a service, the user should not be required to be aware of the physical location or the instantaneous load of the various resources.

Component autonomy The components of the system, both logical and physical, should be autonomous and are thus afforded the ability to refuse a request for service made by another element. However, in order to achieve the system’s goals they have to interact in a cooperative manner and thus adhere to a common set of policies. These policies should be carried out by the control schemes of each element.

“ … Since the early days of mankind the primary motivation for the establishment of communities has been the idea that by being part of an organized group the capabilities of an individual are improved. The great progress in the area of inter-computer communication led to the development of means by which stand-alone processing sub-systems can be integrated into multi-computer ‘communities’. … ” Miron Livny, “Study of Load Balancing Algorithms for Decentralized Distributed Processing Systems”, Ph.D. thesis, July 1983.

High Throughput Computing
We first introduced the distinction between High Performance Computing (HPC) and High Throughput Computing (HTC) in a seminar at the NASA Goddard Space Flight Center in July of 1996 and a month later at the European Laboratory for Particle Physics (CERN). In June of 1997 HPCWire published an interview on High Throughput Computing.

Why HTC? For many experimental scientists, scientific progress and quality of research are strongly linked to computing throughput. In other words, they are less concerned about instantaneous computing power. Instead, what matters to them is the amount of computing they can harness over a month or a year --- they measure computing power in units of scenarios per day, wind patterns per week, instruction sets per month, or crystal configurations per year.

High Throughput Computing is a 24-7-365 activity: FLOPY ≠ (60*60*24*7*52)*FLOPS
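My gloss on the inequality, not part of the original slide: the product is just the number of seconds in a 52-week year, so peak speed turns into yearly throughput only if it is sustained around the clock:

```latex
\[
  60 \cdot 60 \cdot 24 \cdot 7 \cdot 52 \;=\; 31{,}449{,}600 \;\approx\; 3.1\times 10^{7}\ \text{seconds per year}
\]
% A resource rated at F FLOPS delivers F * 3.1e7 floating point operations in a
% year only if it is fully utilized 24-7-365; downtime, contention, and failures
% make the achieved FLOPY figure smaller, which is what HTC tries to maximize.
```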

Obstacles to HTC
› Ownership Distribution (Sociology)
› Customer Awareness (Education)
› Size and Uncertainties (Robustness)
› Technology Evolution (Portability)
› Physical Distribution (Technology)

[Figure: The ’94 Worldwide Condor Flock: flocked Condor pools in Madison, Amsterdam, Delft, Geneva, Warsaw, and Dubna/Berlin, with the number of machines at each site]

The grid promises to fundamentally change the way we think about and use computing. This infrastructure will connect multiple regional and national computational grids, creating a universal source of pervasive and dependable computing power that supports dramatically new classes of applications. The Grid provides a clear vision of what computational grids are, why we need them, who will use them, and how they will be programmed. The Grid: Blueprint for a New Computing Infrastructure Edited by Ian Foster and Carl Kesselman July 1998, 701 pages.

“ … We claim that these mechanisms, although originally developed in the context of a cluster of workstations, are also applicable to computational grids. In addition to the required flexibility of services in these grids, a very important concern is that the system be robust enough to run in “production mode” continuously even in the face of component failures. … ” Miron Livny & Rajesh Raman, "High Throughput Resource Management", in “The Grid: Blueprint for a New Computing Infrastructure”.

[Diagram: Client and Server; Master and Worker]

The NUG30 Quadratic Assignment Problem (QAP):
\[
  \min_{p} \sum_{i=1}^{30} \sum_{j=1}^{30} a_{ij}\, b_{p(i)\,p(j)}
\]
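To make the objective concrete, here is a minimal sketch (illustrative only; the matrices and permutation are made-up placeholders, not NUG30 data) of evaluating the QAP cost for a candidate permutation:

```python
# Quadratic Assignment Problem: given flow matrix a and distance matrix b,
# find the permutation p minimizing sum_{i,j} a[i][j] * b[p[i]][p[j]].
from itertools import permutations

def qap_cost(a, b, p):
    """Cost of assigning facility i to location p[i]."""
    n = len(p)
    return sum(a[i][j] * b[p[i]][p[j]] for i in range(n) for j in range(n))

if __name__ == "__main__":
    # Tiny 3x3 example (placeholder data, not the 30x30 NUG30 instance).
    a = [[0, 5, 2], [5, 0, 3], [2, 3, 0]]       # flows between facilities
    b = [[0, 8, 15], [8, 0, 13], [15, 13, 0]]   # distances between locations
    best = min(permutations(range(3)), key=lambda p: qap_cost(a, b, p))
    print(best, qap_cost(a, b, best))
    # NUG30 has 30! candidate permutations, which is why solving it took a
    # branch-and-bound master driving hundreds of Condor workers.
```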

NUG30 Personal Grid … managed by one Linux box at Wisconsin
Flocking:
-- the main Condor pool at Wisconsin (500 processors)
-- the Condor pool at Georgia Tech (284 Linux boxes)
-- the Condor pool at UNM (40 processors)
-- the Condor pool at Columbia (16 processors)
-- the Condor pool at Northwestern (12 processors)
-- the Condor pool at NCSA (65 processors)
-- the Condor pool at INFN Italy (54 processors)
Glide-in:
-- Origin 2000 (through LSF) at NCSA (512 processors)
-- Origin 2000 (through LSF) at Argonne (96 processors)
Hobble-in:
-- Chiba City Linux cluster (through PBS) at Argonne (414 processors)

Solution Characteristics:
Scientists: 4
Workstations: 1
Wall Clock Time: 6:22:04:31
Avg. # CPUs: 653
Max. # CPUs: 1007
Total CPU Time: approx. 11 years
Nodes: 11,892,208,412
LAPs: 574,254,156,532
Parallel Efficiency: 92%

[Figure: The NUG30 Workforce: number of workers over the run, with dips annotated “Condor crash”, “Application Upgrade”, and “System Upgrade”]

Being a Master
A customer “delegates” task(s) to the master, who is responsible for (a sketch follows this list):
• Obtaining allocation of resources
• Deploying and managing workers on allocated resources
• Delegating work units to deployed workers
• Receiving and processing results
• Delivering results to the customer
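A minimal sketch of those responsibilities as a master loop (an illustrative structure with made-up names; not an HTCondor API):

```python
# A toy master: accepts work from a customer, hands units to workers,
# collects results, and re-queues units whose workers disappear.
import queue
from dataclasses import dataclass, field

@dataclass
class Master:
    pending: "queue.Queue[object]" = field(default_factory=queue.Queue)
    in_flight: dict = field(default_factory=dict)   # worker_id -> work unit
    results: list = field(default_factory=list)

    def accept(self, work_units):
        """Customer delegates task(s) to the master."""
        for unit in work_units:
            self.pending.put(unit)

    def assign(self, worker_id):
        """Delegate the next work unit to a deployed worker."""
        unit = self.pending.get_nowait()
        self.in_flight[worker_id] = unit
        return unit

    def complete(self, worker_id, result):
        """Receive and record a result from a worker."""
        self.in_flight.pop(worker_id, None)
        self.results.append(result)

    def worker_lost(self, worker_id):
        """A worker (or its resource allocation) went away: recover its unit."""
        unit = self.in_flight.pop(worker_id, None)
        if unit is not None:
            self.pending.put(unit)

if __name__ == "__main__":
    m = Master()
    m.accept(["unit-1", "unit-2"])
    m.assign("worker-A"); m.assign("worker-B")
    m.worker_lost("worker-B")            # unit-2 goes back in the queue
    m.complete("worker-A", "answer-1")
    print(m.results, m.pending.qsize())  # ['answer-1'] 1
```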

A Master must be …
› Persistent – work and results must be safely recorded on non-volatile media
› Resourceful – delegates “DAGs” of work to other masters
› Speculative – takes chances and knows how to recover from failure
› Self-aware – knows its own capabilities and limitations
› Obedient – manages work according to plan
› Reliable – can manage “large” numbers of work items and resource providers
› Portable – can be deployed “on the fly” to act as a “sub-master”

“ … Grid computing is a partnership between clients and servers. Grid clients have more responsibilities than traditional clients, and must be equipped with powerful mechanisms for dealing with and recovering from failures, whether they occur in the context of remote execution, work management, or data output. When clients are powerful, servers must accommodate them by using careful protocols. … ” Douglas Thain & Miron Livny, "Building Reliable Clients and Servers", in “The Grid: Blueprint for a New Computing Infrastructure”, 2nd edition.

Resource Allocation (resource -> customer) vs. Work Delegation (job -> resource)

Resource Allocation
A limited assignment of temporary “ownership” of a resource offered by a provider to a requestor:
• The requestor is charged for the allocation regardless of actual consumption
• The requestor may be given the right to allocate the resource to others
• The provider has the right and means to revoke the allocation
• The allocation is governed by an “agreement” between the provider and the requestor
• The allocation is a “lease” that expires if not renewed by the requestor
• Allocations can form a tree

Work Delegation
An assignment of responsibility to perform a task (a sketch of both lease trees follows below):
• Delegation involves a definition of these “responsibilities”
• Responsibilities may be further delegated
• Delegation consumes resources
• Delegation is a “lease” that expires if not renewed by the assigner
• Delegations can form a tree
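Both slides describe the same mechanism: an expiring lease that must be renewed, can be revoked by its grantor, and can be sub-granted, forming a tree. A minimal sketch of that idea (my illustration, with made-up names; not Condor code):

```python
# A lease expires unless renewed, can be revoked by its grantor, and can be
# sub-granted to others, forming a tree (of allocations or of delegations).
import time

class Lease:
    def __init__(self, grantor, holder, duration):
        self.grantor, self.holder = grantor, holder
        self.duration = duration
        self.expires_at = time.time() + duration
        self.children = []            # sub-allocations / sub-delegations
        self.revoked = False

    def renew(self):
        self.expires_at = time.time() + self.duration

    def revoke(self):
        """The grantor revokes this lease and, transitively, its subtree."""
        self.revoked = True
        for child in self.children:
            child.revoke()

    def valid(self):
        return not self.revoked and time.time() < self.expires_at

    def sub_grant(self, new_holder, duration):
        """The holder grants part of what it holds to someone else."""
        child = Lease(grantor=self.holder, holder=new_holder, duration=duration)
        self.children.append(child)
        return child

if __name__ == "__main__":
    allocation = Lease("provider", "requestor", duration=300)
    worker_slot = allocation.sub_grant("worker-1", duration=60)
    allocation.renew()
    print(allocation.valid(), worker_slot.valid())  # True True
    allocation.revoke()
    print(worker_slot.valid())                      # False
```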

The focus of the grid “movement” has been remote job delegation (Gatekeepers), while commercial clouds are all about remote resource allocation (Provisioning).

In Condor we use a two-phase matchmaking process to first allocate a resource to a requestor and then select a task to be delegated to this resource.

[Diagram: matchmaking. A provider advertises “I am D and I am willing to offer you a resource”, a requestor advertises “I am S and am looking for a resource”, the matchmaker (MM) notifies both of the match (“Match!”), and work items (W3, …, Wi) are then delegated to the matched resource]
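A minimal sketch of the two-phase idea described above (an illustration of the concept, not the ClassAd language or the real HTCondor protocol): phase one matches a resource offer to a requestor, phase two lets the requestor pick which of its tasks to run there:

```python
# Two-phase matchmaking (conceptual sketch):
#   phase 1: the matchmaker pairs a resource offer with a resource request
#   phase 2: the requestor chooses which queued task to delegate to the match

def satisfies(offer, request):
    """Does the offered resource meet the request's minimum requirements?"""
    return (offer["memory_mb"] >= request["min_memory_mb"]
            and offer["os"] == request["os"])

def matchmake(offers, requests):
    """Phase 1: return (offer, request) pairs; each side is matched at most once."""
    matches, used = [], set()
    for req in requests:
        for i, offer in enumerate(offers):
            if i not in used and satisfies(offer, req):
                matches.append((offer, req))
                used.add(i)
                break
    return matches

def select_task(request, offer):
    """Phase 2: the requestor picks a task from its own queue for this offer."""
    runnable = [t for t in request["queue"] if t["memory_mb"] <= offer["memory_mb"]]
    return max(runnable, key=lambda t: t["priority"], default=None)

if __name__ == "__main__":
    offers = [{"name": "slot1", "memory_mb": 4096, "os": "LINUX"}]
    requests = [{"name": "schedd-S", "min_memory_mb": 1024, "os": "LINUX",
                 "queue": [{"id": "W3", "memory_mb": 2048, "priority": 10},
                           {"id": "Wi", "memory_mb": 8192, "priority": 20}]}]
    for offer, req in matchmake(offers, requests):
        task = select_task(req, offer)
        print(f"{req['name']} runs {task['id']} on {offer['name']}")  # W3 on slot1
```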

Overlay Resource Managers Ten years ago we introduced the concept of Condor glide-ins as a tool to support ‘just-in-time scheduling’ in a distributed computing infrastructure that consists of resources that are managed by (heterogeneous) autonomous resource managers. By dynamically deploying a distributed resource manager on resources allocated (provisioned) by the local resource managers, the overlay resource manager can implement a unified resource allocation policy. In other words, we use remote job invocation to get resources allocated.

[Diagram: glide-in overlay architecture. An Application/Portal submits to a local SchedD (Condor-G) with its matchmaker (MM); through Cloud/Grid interfaces (EC2, LSF, Condor) it reaches remote resources, where glide-in StartDs are started; these report to an overlay Condor matchmaker, to which SchedDs (Condor-C) delegate the C-app and G-app jobs]
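The overlay pattern in the previous two slides can be sketched in a few lines (a conceptual structure with made-up interfaces; real glide-in systems use the providers' own submission protocols rather than anything shown here):

```python
# Pilot / glide-in pattern (conceptual): instead of sending user jobs to each
# site's batch system, submit "pilot" jobs that start a worker which joins an
# overlay pool; the overlay scheduler then places user jobs on those workers.

class OverlayPool:
    """Stands in for the overlay matchmaker: pilots register, jobs get placed."""
    def __init__(self):
        self.idle_workers, self.placements = [], []

    def register(self, worker_id):             # called by a pilot once it starts
        self.idle_workers.append(worker_id)

    def place(self, job):                      # overlay scheduling decision
        if self.idle_workers:
            self.placements.append((job, self.idle_workers.pop(0)))
            return True
        return False

def submit_pilot(site_submit, pool, worker_id):
    """Use the site's own job-submission interface to start a pilot that
    registers with the overlay pool (remote job invocation -> allocation)."""
    site_submit(lambda: pool.register(worker_id))

if __name__ == "__main__":
    pool = OverlayPool()
    run_now = lambda payload: payload()        # toy "batch system": runs the pilot
    for i in range(3):
        submit_pilot(run_now, pool, f"glidein-{i}")
    for job in ["C-app-1", "C-app-2"]:
        pool.place(job)
    print(pool.placements)  # [('C-app-1', 'glidein-0'), ('C-app-2', 'glidein-1')]
```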

How can we accommodate an unbounded need for computing and an unbounded amount of data with an unbounded amount of resources?