Introduction to Distributed HTC and overlay systems
2012 OSG User School, Tuesday morning, 9:00am
Igor Sfiligoi, Leader of the OSG Glidein Factory Operations, University of California San Diego

About Me
- Working with distributed computing since 1996
- Working with Grids since 2005
- Leader of the OSG glidein factory ops since 2009
- Deeply involved in overlay system development and deployments
- Mostly worked with Physics communities (KLOE, CDF, CMS)

Logistical reminder
It is OK to ask questions:
- During the lecture
- During the demos
- During the exercises
- During the breaks
If I don't know the answer, I will find someone who likely does.

High Throughput Computing
- Alain yesterday introduced you to HTC: the concept of getting as many CPU cycles as possible over the long run.
- Based on batch job processing; no interactive access to resources.
[Diagram: Scheduler, Jobs, Batch slots]
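To make the batch-processing model concrete, here is a minimal HTCondor submit file of the kind such a scheduler consumes. This is an illustrative sketch rather than material from the slides; the executable and file names are made up.

  universe   = vanilla
  executable = analyze.sh        # hypothetical user program
  arguments  = input.$(Process)
  output     = out.$(Process)
  error      = err.$(Process)
  log        = jobs.log
  queue 100                      # 100 independent jobs; no interactive access

Each of the 100 jobs waits in the queue until the scheduler finds it a free batch slot; the user never logs into the worker nodes.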

HTC in words
As our esteemed Miron would put it:
"HTC is about extending the compute power of my own machine. I could run my work on my own machine, but then it would take a very large number of calendar days/months/years to complete. To finish the computation in a reasonable time, I have to expand the capacity of my own machine by obtaining and using temporary resources."

Introducing DHTC
So what is Distributed HTC? HTC is always distributed, right?
What we mean here is MASSIVELY distributed, i.e. more than you can afford to host and operate in one place.

Anatomy of DHTC
So DHTC is about computing on more than one HTC system.
[Diagram: Scheduler]

Why DHTC
Many reasons:
- Practical: a site has a limit to how much HW it can host.
- Political: you only get money for HW if it is hosted at X.
- Economical: hosting and operating HW myself is too expensive; someone else can offer you hosted HW for less.
- Opportunistic: the owners of site X temporarily have no jobs, so they might as well allow others to use the resources (for free or for pay).

Why is DHTC different?
- Not a single system anymore: how do I partition my jobs? (We will concentrate on this part today.)
- Different clusters are likely operated by different people, which leads to variations in the compute environment.
- Likely no globally shared file system.
- Typically Wide Area Networking, so likely lower bandwidth and higher RTT.
(The last two are tomorrow's topic, i.e. you will have to do explicit data movement.)

Job partitioning?
Used naively, DHTC requires job partitioning, and that is very hard to do right in a multi-user world:
- Unknown job length
- Higher priority jobs arriving later
- Different policies at different sites
- Partial information
Is there a better way?
[Diagram: Scheduler]
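To see why naive partitioning hurts, consider splitting the same workload by hand across two sites with Condor-G style grid-universe jobs. This is a hedged sketch: the CE addresses and the 600/400 split are invented.

  # submit_siteA.sub: guess that site A can absorb 600 jobs
  universe      = grid
  grid_resource = gt5 ce.siteA.example.edu/jobmanager-pbs    # hypothetical CE
  executable    = analyze.sh
  queue 600

  # submit_siteB.sub: guess that site B can absorb 400 jobs
  universe      = grid
  grid_resource = gt5 ce.siteB.example.edu/jobmanager-condor # hypothetical CE
  executable    = analyze.sh
  queue 400

If site B happens to drain quickly while site A is busy with higher-priority local users, the 400 jobs finish early and the remaining 600 sit in a remote queue, so the time to the last job balloons. That is exactly the problem described above.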

If only we had a global scheduler
This would make life easy again.
Unfortunately, it is not an option.
[Diagram: Scheduler]

Why can't we have it?
- Existing infrastructure
- Local users, local policies
- Money and politics
- Different technologies
- Being able to keep working when the WAN goes down
...

Idea: create a virtual-private HTC
What if we collected a set of batch slots, possibly for a set of users, and only then scheduled jobs on them?
From the user point of view, there is a single, global scheduler; but the available batch slots change over time.
[Diagram: Scheduler]

A leasing model
Imagine leasing some of the batch slots: once you have them, you decide what to do with them.
[Diagram: Scheduler]

Batch leasing
- Sites still own the resources, so we have to play by the sites' rules.
- We must use the sites' schedulers to request the lease.
- The lease request is thus a batch job.
[Diagram: Scheduler, Lease manager]
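A hedged sketch of what such a lease request could look like when expressed as a Condor-G (grid universe) batch job; the CE address and the pilot script name are placeholders, and a real pilot factory sets many more attributes.

  universe            = grid
  grid_resource       = gt5 ce.example.edu/jobmanager-pbs   # hypothetical site gatekeeper
  executable          = pilot_startup.sh                    # placeholder: the pilot, not a user job
  transfer_executable = True
  output              = pilot.$(Process).out
  error               = pilot.$(Process).err
  log                 = pilot.log
  queue 10                                                  # ask the site for 10 batch-slot leases

Note that the payload being sent is the pilot itself; the user jobs will only be fetched later, once the pilot is running and holding the slot.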

Overlay or Pilot systems
- We submit pilot jobs to sites.
- Each pilot job holds the lease on the batch slot, creating an overlay HTC system (overlay == 2nd level).
- Also known as resource provisioning.
But didn't we just move the problem?
[Diagram: Scheduler, Lease manager]

Provisioning not as hard
- The main problem in user job partitioning: all jobs are important! The typical user is interested in when the last job finishes.
- In pilot job provisioning: all jobs (pilots) are the same, and the user is only interested in the total number of resources provisioned.

Fighting heterogeneity
- Pilot jobs can tweak the environment before starting user jobs, so users see a much more homogeneous system.
- They cannot do miracles, of course: they are usually limited to unprivileged-user tweaks (e.g. they cannot replace the kernel), but this is enough most of the time.
- Only a few pilot system administrators need to worry about heterogeneity.
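As one hedged illustration of an unprivileged tweak, a pilot could point HTCondor's USER_JOB_WRAPPER knob at a small script that normalizes the environment before each user job starts; the wrapper name and the specific tweak are invented.

  # In the pilot-generated HTCondor configuration (sketch):
  USER_JOB_WRAPPER = $(LOCAL_DIR)/normalize_env.sh

  #!/bin/sh
  # normalize_env.sh (invented name): unprivileged tweaks only,
  # e.g. point temporary files at the pilot's own scratch directory
  export TMPDIR=${TMPDIR:-$PWD/tmp}
  mkdir -p "$TMPDIR"
  exec "$@"    # finally run the actual user job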

Pilot systems
- Many possible implementations.
- We will concentrate on glideinWMS: based on Condor, and the one used by most user communities on OSG.
- Others available: PANDA, DIRAC, ALIEN, ...

glideinWMS
- A Condor-based overlay system, i.e. it looks like a regular Condor system to the users.
- Adds a resource provisioning service (i.e. the lease manager).
[Diagram: Scheduler, Schedd, glideinWMS, Collector, Negotiator]

Condor pilots
- Condor pilot == a glidein
- A properly configured Condor Startd
[Diagram: Scheduler, Schedd, glideinWMS, Collector, Negotiator, Startd]
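A much-simplified, hedged sketch of the kind of HTCondor configuration a glidein could write on the worker node before launching its Startd; the collector address, site name and timeout are invented, and a real glidein generates far more configuration (including security settings).

  # Join the overlay pool, not the site's own pool
  CONDOR_HOST = overlay-collector.example.edu
  DAEMON_LIST = MASTER, STARTD

  # Advertise where this glidein runs, so jobs can match on it
  GLIDEIN_Site = UCSD
  STARTD_ATTRS = $(STARTD_ATTRS) GLIDEIN_Site

  # Behave like a lease: shut down if no job claims the slot for 20 minutes
  STARTD_NOCLAIM_SHUTDOWN = 1200
  START = True

Once the Startd reports to the overlay collector it looks, to the negotiator and to the users, just like any other slot in a regular Condor pool.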

Two-level matchmaking
- The system now has two matchmaking points: the glideinWMS decides when and where to send glideins, and the Condor negotiator decides which job runs on which glidein.
- The two must treat jobs the same way, or we end up with either unused glideins or jobs that never start.

Moving policy in glideinWMS
- In glideinWMS, user jobs never have requirements.
- All policy is implemented by system administrators; users just provide parameters.

Example policy (set by the administrators):
  (DESIRED_SITES =?= "any") || stringListMember(GLIDEIN_Site, DESIRED_SITES)

Example job JDL:
  +DESIRED_SITES = "UCSD,Wisconsin"
  queue

Example job JDL:
  +DESIRED_SITES = "any"
  queue
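One hedged way the administrators' policy could be attached to every job at the submit host is HTCondor's APPEND_REQ_VANILLA configuration knob, which appends an expression to the Requirements of all vanilla-universe jobs; whether a given glideinWMS installation does it exactly this way is an assumption.

  # Schedd configuration sketch (administrator-owned);
  # users only set +DESIRED_SITES in their submit files
  APPEND_REQ_VANILLA = (MY.DESIRED_SITES =?= "any") || \
                       stringListMember(TARGET.GLIDEIN_Site, MY.DESIRED_SITES)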

Know your system
- The matchmaking is thus less flexible: you can only work within the frame of the system policy.
- But it is arguably easier to use: no complex boolean expressions to write.
- Be sure to ask for the system policy of your system.
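Besides asking the administrators, you can often inspect the running system yourself with standard HTCondor tools; a small example, assuming the glideins advertise a GLIDEIN_Site attribute as in the policy above.

  # Which sites are currently providing glideins, and how many slots each?
  condor_status -autoformat GLIDEIN_Site | sort | uniq -c

  # Why is my job not matching? Shows which slots match its requirements and why others do not.
  condor_q -better-analyze <job_id>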

Down to practice
That is all for the theoretical part. Next we have the hands-on session.

Questions?
Questions? Comments?
Feel free to ask me questions later: Igor Sfiligoi
Upcoming sessions:
- Now to 11:00am: Hands-on exercises
- 11:00am to 11:15am: Break
- 11:15am: Next lecture, The Grid and glideinWMS architecture

Beware the power
[Image courtesy of fanpop.com]