(HT)Condor - Past and Future

(HT)Condor - Past and Future
Miron Livny, John P. Morgridge Professor of Computer Science
Wisconsin Institutes for Discovery, University of Wisconsin-Madison

חי (chai) has the numerical value of 18

חי (chai) means alive

Europe HTCondor Week #3
(HT)Condor Week #18
12 years since the Open Science Grid (OSG)
21 years since High Throughput Computing (HTC)
32 years since the first production deployment of (HT)Condor
40 years since I started to work on distributed systems

CERN 92

1994: Worldwide Flock of Condors. Linked Condor pools in Madison, Delft, Amsterdam, Geneva, Warsaw, and Dubna/Berlin.
D. H. J. Epema, Miron Livny, R. van Dantzig, X. Evers, and Jim Pruyne, “A Worldwide Flock of Condors: Load Sharing among Workstation Clusters”, Future Generation Computer Systems, Volume 12, 1996.

INFN Condor Pool on WAN: checkpoint domains. [Map of the GARR-B topology (155 Mbps ATM) linking INFN sites across Italy, with the Central Manager and default checkpoint domain at CNAF Bologna and a 155 Mbps ESnet link to the USA; the pool grew from ~180 to 500-1000 machines and from 6 to 25 checkpoint servers.]

“The members of OSG are united by a commitment to promote the adoption and to advance the state of the art of distributed high throughput computing (DHTC) – shared utilization of autonomous resources where all the elements are optimized for maximizing computational throughput.”

1.42B core hours in 12 months! Almost all jobs executed by the OSG leverage (HT)Condor technologies:
Condor-G
HTCondor-CE
Bosco
Condor Collectors
HTCondor overlays
HTCondor pools

Two weeks ago …

“Only that shall happen which has happened, only that occur which has occurred; there is nothing new beneath the sun!”
The words of Koheleth (קֹהֶלֶת, Kohelet), son of David, king in Jerusalem, alias Solomon. Ecclesiastes, Chapter 1, verse 9, ~200 B.C.E. Wood engraving by Gustave Doré (1832–1883).

Claims for “benefits” provided by Distributed Processing Systems:
High Availability and Reliability
High System Performance
Ease of Modular and Incremental Growth
Automatic Load and Resource Sharing
Good Response to Temporary Overloads
Easy Expansion in Capacity and/or Function
P. H. Enslow, “What is a Distributed Data Processing System?”, Computer, January 1978.

Minimize Wait while Idle: the time a job/task sits queued (waiting) while a resource that is capable and willing to serve it is running a lower-priority job/task (idle, from the job's perspective).

Submit locally (queue and manage your jobs/tasks locally; leverage your local resources) and run globally (acquire any resource that is capable and willing to run your job/task)

Job owner identity is local: the owner's identity should never “travel” with the job to the execution site.
Owner attributes are local.
Name spaces are local: file names are locally defined.
Resource acquisition is local: the submission site (local) is responsible for the acquisition of all resources.

“External” forces moved us away from this “pure” local-centric view of the distributed computing environment. With the help of capabilities (short-lived tokens) and a reassignment of responsibilities, we are committed to regaining local control. Handing users money (real or funny) to acquire computing resources helps push us in this positive direction.

In 1996 I introduced the distinction between High Performance Computing (HPC) and High Throughput Computing (HTC) in a seminar at the NASA Goddard Space Flight Center and, a month later, at the European Laboratory for Particle Physics (CERN). In June 1997 HPCWire published an interview on High Throughput Computing.

“… many fields today rely on high-throughput computing for discovery.”
“Many fields increasingly rely on high-throughput computing.”

High Throughput Computing requires automation, as it is a 24-7-365 activity that involves large numbers of jobs.
FLOPY ≠ (60*60*24*7*52)*FLOPS
100K Hours * 1 Job ≠ 1 Hour * 100K Jobs
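
As a quick gloss on the first inequality (my arithmetic, not part of the slide), the constant on the right is simply the number of seconds in a 52-week year:

% Seconds in a 52-week year, as used on the slide
\[
  60 \times 60 \times 24 \times 7 \times 52 = 31{,}449{,}600 \ \text{seconds/year}
\]
% Peak speed times seconds-per-year is only an upper bound: sustained yearly
% throughput (FLOPY) falls short of it because machines are not busy, healthy,
% and fed with work every second of the year, which is why HTC focuses on FLOPY.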

HTC is about many jobs, many users, many servers, many sites, and (potentially) long-running workflows.

HTCondor uses a matchmaking process to dynamically acquire resources. HTCondor uses a matchmaking process to provision them to queued jobs. HTCondor launches jobs via a task delegation protocol.
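
To make the matchmaking idea concrete, here is a toy Python sketch of two-sided matching between job and machine advertisements. It is purely illustrative; it is not HTCondor's ClassAd language or its negotiator, and all names in it are made up.

# Toy two-sided matchmaking: each side advertises attributes plus a
# requirements predicate, and a match needs both predicates to hold.
# This is an illustration only, not HTCondor's actual implementation.

machine_ads = [
    {"Name": "slot1@node01", "Memory": 4096, "OpSys": "LINUX",
     "Requirements": lambda job: job["RequestMemory"] <= 4096},
    {"Name": "slot1@node02", "Memory": 16384, "OpSys": "LINUX",
     "Requirements": lambda job: job["RequestMemory"] <= 16384},
]

job_ads = [
    {"Owner": "alice", "RequestMemory": 8192,
     "Requirements": lambda machine: machine["OpSys"] == "LINUX"},
    {"Owner": "bob", "RequestMemory": 2048,
     "Requirements": lambda machine: machine["OpSys"] == "LINUX"},
]

def match(job, machine):
    """Both sides must be satisfied (two-way matchmaking)."""
    return job["Requirements"](machine) and machine["Requirements"](job)

for job in job_ads:
    candidates = [m for m in machine_ads if match(job, m)]
    if candidates:
        # The real negotiator also applies ranks, priorities, and fair share;
        # here we simply take the first compatible machine.
        print(job["Owner"], "->", candidates[0]["Name"])
    else:
        print(job["Owner"], "-> no match; the job stays queued")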

HTCondor 101:
Jobs are submitted to the HTCondor SchedD (see the sketch below).
A job can be a Container or a VM.
The SchedD can Flock to additional Matchmakers.
The SchedD can delegate a job for execution to an HTCondor StartD.
The SchedD can delegate a job for execution to another Batch system.
The SchedD can delegate a job for execution to a Grid Compute Element (CE).
The SchedD can delegate a job for execution to a Commercial Cloud.
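
As a concrete example of the first item above, submitting a job to the SchedD through HTCondor's Python bindings might look roughly like the sketch below. It assumes the htcondor Python package is installed and a SchedD is running locally; exact call signatures vary across HTCondor versions, so treat this as an outline rather than a definitive recipe.

import htcondor  # HTCondor Python bindings

# Describe the job, much as a submit description file would.
sub = htcondor.Submit({
    "executable": "/bin/echo",
    "arguments": "hello from HTCondor",
    "output": "hello.out",
    "error": "hello.err",
    "log": "hello.log",
    "request_memory": "128MB",
})

schedd = htcondor.Schedd()   # talk to the local SchedD
result = schedd.submit(sub)  # queue one instance of the job
print("Submitted cluster", result.cluster())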

All Managed by HTCondor! 350K

A crowded and growing space of distributed execution environments that can benefit from the capabilities offered by HTCondor

97

The road from PVM to Dask: In the '80s we added support for the resource management primitives (AddHost, DelHost, and Spawn) of the Parallel Virtual Machine (PVM) message-passing environment. Today we are working on adding support for the dynamic (on the fly) addition and removal of worker nodes to Dask; a sketch follows the reference below.

J. Pruyne and M. Livny, “Parallel Processing on Dynamic Resources with CARMI”, in Job Scheduling Strategies for Parallel Processing, D. G. Feitelson and L. Rudolph (eds.), Springer-Verlag, 1995.
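
From the user's side, dynamic grow-and-shrink of Dask workers on top of HTCondor can be sketched roughly as below. This assumes the dask-jobqueue package and its HTCondorCluster backend; whether that is the exact mechanism the slide refers to is my assumption, so read it as an illustration of the idea rather than the project's agreed interface.

from dask.distributed import Client
from dask_jobqueue import HTCondorCluster  # assumes dask-jobqueue is installed

# Each cluster "job" is an HTCondor job that hosts one Dask worker.
cluster = HTCondorCluster(cores=1, memory="2GB", disk="1GB")

# Let the cluster add and remove workers on the fly based on load:
# this is the dynamic behavior the previous slide describes.
cluster.adapt(minimum=0, maximum=20)

client = Client(cluster)

# A trivial workload; workers are acquired while it runs and
# released again once the queue drains.
futures = client.map(lambda x: x * x, range(1000))
print(sum(client.gather(futures)))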

How do we expose all shared resources that require protection (allocation) to application (user) software and to control (configuration)?

Request/Acquire
Schedule/Provision
Allocate
Protect
Monitor
Reclaim
Account
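
One way to read that list is as the lifecycle interface any resource manager must offer. The hypothetical Python sketch below simply names each verb as a method; none of these names come from HTCondor itself.

from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass
class Lease:
    """A handle for an allocation: what was granted, to whom, until when."""
    resource_id: str
    owner: str
    expires_at: float

class ResourceManager(ABC):
    """Hypothetical lifecycle interface mirroring the verbs on the slide."""

    @abstractmethod
    def request(self, spec: dict) -> str: ...          # Request/Acquire
    @abstractmethod
    def schedule(self, request_id: str) -> Lease: ...  # Schedule/Provision
    @abstractmethod
    def allocate(self, lease: Lease) -> None: ...      # Allocate and Protect
    @abstractmethod
    def monitor(self, lease: Lease) -> dict: ...       # Monitor usage
    @abstractmethod
    def reclaim(self, lease: Lease) -> None: ...       # Reclaim on expiry
    @abstractmethod
    def account(self, owner: str) -> dict: ...         # Account for usage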

Miron Livny, John P. Morgridge Professor of Computer Science, Director of the Center for High Throughput Computing, University of Wisconsin-Madison
It is all about Storage Management, stupid! (Allocate B bytes for T time units; see the toy sketch below.)
We do not have tools to manage storage allocations.
We do not have tools to schedule storage allocations.
We do not have protocols to request allocation of storage.
We do not have means to deal with “sorry, no storage available now”.
We do not know how to manage, use, and reclaim opportunistic storage.
We do not know how to budget for storage space.
We do not know how to schedule access to storage devices.
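
As a toy illustration of “allocate B bytes for T time units”, the Python sketch below shows the shape of the missing tooling the slide lists: leases with an expiry, an explicit “sorry, no storage available now” answer, and reclamation of expired space. Everything in it is hypothetical.

import time

class StorageAllocator:
    """Toy allocator: grant B bytes for T seconds, reclaim on expiry."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.leases = {}   # lease_id -> (nbytes, expiry_time)
        self.next_id = 0

    def _reclaim_expired(self):
        now = time.time()
        for lid in [l for l, (_, exp) in self.leases.items() if exp <= now]:
            del self.leases[lid]   # reclaim expired allocations

    def available(self):
        self._reclaim_expired()
        return self.capacity - sum(b for b, _ in self.leases.values())

    def allocate(self, nbytes, seconds):
        """Allocate B bytes for T time units, or decline."""
        if nbytes > self.available():
            return None            # "sorry, no storage available now"
        self.next_id += 1
        self.leases[self.next_id] = (nbytes, time.time() + seconds)
        return self.next_id

# Example: a 10 GB pool and a 4 GB lease for one hour.
pool = StorageAllocator(10 * 1024**3)
lease = pool.allocate(4 * 1024**3, 3600)
print("lease id:", lease, "bytes still available:", pool.available())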