Condor Week 09Condor WAN scalability improvements1 Condor Week 2009 Condor WAN scalability improvements A needed evolution to support the CMS compute model.

Slides:



Advertisements
Similar presentations
Todd Tannenbaum Condor Team GCB Tutorial OGF 2007.
Advertisements

Jaime Frey Computer Sciences Department University of Wisconsin-Madison OGF 19 Condor Software Forum Routing.
Current methods for negotiating firewalls for the Condor ® system Bruce Beckles (University of Cambridge Computing Service) Se-Chang Son (University of.
Building a secure Condor ® pool in an open academic environment Bruce Beckles University of Cambridge Computing Service.
M. Muztaba Fuad Masters in Computer Science Department of Computer Science Adelaide University Supervised By Dr. Michael J. Oudshoorn Associate Professor.
Dan Bradley Computer Sciences Department University of Wisconsin-Madison Schedd On The Side.
Condor and GridShell How to Execute 1 Million Jobs on the Teragrid Jeffrey P. Gardner - PSC Edward Walker - TACC Miron Livney - U. Wisconsin Todd Tannenbaum.
Efficiently Sharing Common Data HTCondor Week 2015 Zach Miller Center for High Throughput Computing Department of Computer Sciences.
First steps implementing a High Throughput workload management system Massimo Sgaravatto INFN Padova
Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova.
Distributed Systems Early Examples. Projects NOW – a Network Of Workstations University of California, Berkely Terminated about 1997 after demonstrating.
Parallel Computing The Bad News –Hardware is not getting faster fast enough –Too many architectures –Existing architectures are too specific –Programs.
glideinWMS: Quick Facts  glideinWMS is an open-source Fermilab Computing Sector product driven by CMS  Heavy reliance on HTCondor from UW Madison and.
Institute of Computer and Communication Network Engineering OFC/NFOEC, 6-10 March 2011, Los Angeles, CA Lessons Learned From Implementing a Path Computation.
Presented by Xiaoyu Qin Virtualized Access Control & Firewall Virtualization.
Hao Wang Computer Sciences Department University of Wisconsin-Madison Security in Condor.
BaBar MC production BaBar MC production software VU (Amsterdam University) A lot of computers EDG testbed (NIKHEF) Jobs Results The simple question:
Campus Grids Report OSG Area Coordinator’s Meeting Dec 15, 2010 Dan Fraser (Derek Weitzel, Brian Bockelman)
Use of Condor on the Open Science Grid Chris Green, OSG User Group / FNAL Condor Week, April
ETICS All Hands meeting Bologna, October 23-25, 2006 NMI and Condor: Status + Future Plans Andy PAVLO Peter COUVARES Becky GIETZEL.
Todd Tannenbaum Computer Sciences Department University of Wisconsin-Madison Condor RoadMap.
The Alternative Larry Moore. 5 Nodes and Variant Input File Sizes Hadoop Alternative.
Condor week – April 2006Artyom Sharov, Technion, Haifa1 Adding High Availability to Condor Central Manager Artyom Sharov Technion – Israel Institute of.
Michael Fenn CPSC 620, Fall 09.  Grid computing is the process of allowing loosely-coupled virtual organizations to share resources over a wide area.
Condor week – March 2005©Gabriel Kliot, Technion1 Adding High Availability to Condor Central Manager Gabi Kliot Technion – Israel Institute of Technology.
Eileen Berman. Condor in the Fermilab Grid FacilitiesApril 30, 2008  Fermi National Accelerator Laboratory is a high energy physics laboratory outside.
22 nd Oct 2008Euro Condor Week 2008 Barcelona 1 Condor Gotchas III John Kewley STFC Daresbury Laboratory
Dan Bradley Condor Project CS and Physics Departments University of Wisconsin-Madison CCB The Condor Connection Broker.
Parag Mhashilkar Computing Division, Fermi National Accelerator Laboratory.
Campus Grid Technology Derek Weitzel University of Nebraska – Lincoln Holland Computing Center (HCC) Home of the 2012 OSG AHM!
3 Compute Elements are manageable By hand 2 ? We need middleware – specifically a Workload Management System (and more specifically, “glideinWMS”) 3.
UCS D OSG Summer School 2011 Life of an OSG job OSG Summer School A peek behind the scenes The life of an OSG job by Igor Sfiligoi University of.
How to get the needed computing Tuesday afternoon, 1:30pm Igor Sfiligoi Leader of the OSG Glidein Factory Operations University of California San Diego.
Introduction to Distributed HTC and overlay systems Tuesday morning, 9:00am Igor Sfiligoi Leader of the OSG Glidein Factory Operations University of California.
Condor Week 2007Glidein Factories - by I. Sfiligoi1 Condor Week 2007 Glidein Factories (and in particular, the glideinWMS) by Igor Sfiligoi.
Introduction to the Grid and the glideinWMS architecture Tuesday morning, 11:15am Igor Sfiligoi Leader of the OSG Glidein Factory Operations University.
UCS D OSG Summer School 2011 Intro to DHTC OSG Summer School An introduction to Distributed High-Throughput Computing with emphasis on Grid computing.
OSG Consortium Meeting - March 6th 2007Evaluation of WMS for OSG - by I. Sfiligoi1 OSG Consortium Meeting Evaluation of Workload Management Systems for.
UCS D OSG Summer School 2011 Overlay systems OSG Summer School An introduction to Overlay systems Also known as Pilot systems by Igor Sfiligoi University.
Condor Week 2006, University of Wisconsin 1 Matthew Norman Using Condor Glide-ins and GCB to run in a grid environment Elliot Lipeles, Matthew Norman,
Madison, Apr 2010Igor Sfiligoi1 Condor World 2010 Condor-G – A few lessons learned by Igor UCSD.
Condor Week May 2012No user requirements1 Condor Week 2012 An argument for moving the requirements out of user hands - The CMS experience presented.
Honolulu - Oct 31st, 2007 Using Glideins to Maximize Scientific Output 1 IEEE NSS 2007 Making Science in the Grid World - Using Glideins to Maximize Scientific.
Arlington, Dec 7th 2006 Glidein Based WMS 1 A pilot-based (PULL) approach to the Grid An overview by Igor Sfiligoi.
Progress Apama Fundamentals
DISTRIBUTED SYSTEMS Principles and Paradigms Second Edition ANDREW S
HTCondor Networking Concepts
HTCondor Networking Concepts
HTCondor Security Basics
Dynamic Deployment of VO Specific Condor Scheduler using GT4
Examples Example: UW-Madison CHTC Example: Global CMS Pool
Operating a glideinWMS frontend by Igor Sfiligoi (UCSD)
Why API?.
Workload Management System
Messaging at CERN Lionel Cons – CERN IT/CM 18 Jan 2017
High Availability in HTCondor
Advanced Operating Systems Lecture notes
The CMS use of glideinWMS by Igor Sfiligoi (UCSD)
Building Grids with Condor
Introduction to Computers
Chapter 21: Cloud Computing and Related Security Issues
Chapter 22: Cloud Computing Technology and Security
The Stanford Clean Slate Program
HTCondor Security Basics HTCondor Week, Madison 2016
Hold up, wait a minute, let me put some async in it
Developing for Windows Azure
Condor: Firewall Mirroring
Condor-G Making Condor Grid Enabled
Condor-G: An Update.
Presentation transcript:

Condor Week 09Condor WAN scalability improvements1 Condor Week 2009 Condor WAN scalability improvements A needed evolution to support the CMS compute model by Dan Bradley, Igor Sfiligoi and Todd Tannenbaum

Condor Week 09Condor WAN scalability improvements2 Condor and CMS ● CMS relies on Grids to manage the O(10k) cores needed to analyze its data ● CMS uses Condor in many different ways ● Burt H. had a talk about this yesterday ● One use of Condor is to use it for creating a virtual- private Condor pool on top of the Grid ➔ Condor glide-ins

Condor Week 09Condor WAN scalability improvements3 Condor glide-ins Collector Negotiato r Grid API Glidein Startd Glidein Startd Glidein Factory Schedd (1) (2) (3) (4) (5)

Condor Week 09Condor WAN scalability improvements4 Glidein scalability at CMS ● Spring 2007 ● GCB is unreliable – Although OK with a few hunderd of glideins – But breaks easily ➔ Glidein usable on LAN but not on WAN (firewalls) Igor: Glideins are the way to go! But I need your help. Miron: We have to fix GCB! Alan/Jaime/Todd/Derek: We will make it work, trust us.

Condor Week 09Condor WAN scalability improvements5 GCB get fixed ● Fall 2007 ● GCB code has been polished – Reduce port use – Use only TCP (before UDP was used as well) – Fix many bugs ● GCB now scales over 5k glideins Igor is happy.

Condor Week 09Condor WAN scalability improvements6 Glidein scalability at CMS ● Winter 2008 ● CMS tested at Fermilab – One schedd node + 3 GCBs (for test purposes) ✔ 10k running jobs & 200k queued jobs ● Life is good Igor: We are ready for production Frank/Sanjay: We will run CCRC08 * with glideins! * CMS scalability challenge

Condor Week 09Condor WAN scalability improvements7 CMS starts CCRC08 ● CCRC08 (Spring 2008) ● CMS sets up a glidein factory to all CMS Tier-2s –... and the whole hell breaks loose... ● Condor is not scaling as expected! – Difficult to sustain O(1k) running jobs – Many glideins are sitting underutilized without work Frank: This thing is broken! Igor: Don't worry, we will find out what is wrong. Igor: Why is it not scaling as in my last tests at Fermilab? Dan: Must be the network latencies.

Condor Week 09Condor WAN scalability improvements8 Why are latencies hurting so much? ● In one word: Secure Authentication ● Glideins require strong, mutual authentication ● Secure authentication requires multiple message exchanges ● But just once, then use session key ● Condor daemons are single-threaded CollectorStartd I am new and I support FS and GSI OK, let's use GSI Here is my server ID Here is my client ID Welcome Use this session key in the future Here is my ClassAd

Condor Week 09Condor WAN scalability improvements9 Where is the major bottleneck? ● The collector handles all the daemons ● With 1.4s per daemon, it takes a long time to register O(10k) of them Igor: Why don't we create a tree of collectors? Like you do with CondorView? Dan: Requires minor changes, but should work! Collector Startd

Condor Week 09Condor WAN scalability improvements10 Life looks good. Collector tree at CCRC08 ● CMS deploys a tree of 1+20 collectors ● All on the same node, each using a different port, CONDOR_VIEW_HOST points to the main one ● The collector now handles 5k glideins with ease Collector Startd Collector

Condor Week 09Condor WAN scalability improvements11 More troubles with CCRC08 ● Efficiencies are still very low ● Most of the glideins are just sitting there idle! ● The schedd is very slow at claiming the glideins, and is hitting timeouts left and right Sanjay: The system is still broken! Igor: Let me have a look Igor: Looks like we have problems with latencies again! But why? Dan: Let's analyze what is going on.

Condor Week 09Condor WAN scalability improvements12 Schedd talks a lot, too ● Schedd handles many connections ● Each connection is a new secure handshake – Glideins come and go ● All the above block 1.4s on WAN Igor: What can we do? I don't want to use many schedds! Dan: I will make all connections nonblocking. Startd Schedd I am new and I support FS and GSI OK, let's use GSI Here is my server ID Here is my client ID Welcome Use this session key in the future Please run this job

Condor Week 09Condor WAN scalability improvements13 A couple months later... ● Just in time for the last round of CCRC08 ● Many Condor libraries have been modified to be non-blocking ● Bringing WAN blocking time to 1.0s ● System behaves better, but O(10k) still just a dream Dan: Give me a few more months and I will make all connections non-blocking. Todd: Don’t unwind all our code; use cooperative threads. Igor: I trust you, but I need a solution soon Miron: Guys, think! Is there no better solution?

Condor Week 09Condor WAN scalability improvements14 Miron was right! condor_config.local: c = km/s

Condor Week 09Condor WAN scalability improvements15 The better solution ● Instead of trying to brute force the problem, we found a better solution ● Use the Collector as trust manager ● Welcome “(security) match sessions” (enabled via SEC_ENABLE_MATCH_PASSWORD_AUTHENTICATION) Startd Schedd Collector Negotiato r ClassAd + match session match session I have the match session Please run my job (1) (2) (3) (4)

Condor Week 09Condor WAN scalability improvements16 Glidein scalability at CMS ● Winter 2009 ● CMS tested across the ocean – 1+70 collectors (and using CCB) – Using the “match sessions” ● 23k running jobs & 400k queued jobs – Limited by port usage (2 ports x running job) – But way above the target of 10k+ running jobs ● 200k jobs processed in a day! ● Life is good again

Condor Week 09Condor WAN scalability improvements17 Glideins in production at CMS ● Winter 2009: ● CMS uses glideins for worldwide data processing

Condor Week 09Condor WAN scalability improvements18 A comment on CCB ● Since Fall 2007 GCB scaled fine, but ● Used a lot of ports (5-6 per glidein → max 8k glideins x GCB ) ● Was not fault tolerant (could not restart GCB without losing the pool) ● CCB was designed based on GCB experience ● Uses just one port ● It can be restarted without harm ● Scales just as much as GCB

Condor Week 09Condor WAN scalability improvements19 Conclusion ● Major progress made in WAN setups ● GCB fixed → experience inspired CCB ● Tree of collectors to distribute authentication load ● “Match sessions” for smarter security ● CMS heavy user of glideins ● Via glideinWMSglideinWMS ● Could not have used them without the effort invested by the Condor team

Condor Week 09Condor WAN scalability improvements20 Acknowledgments ● This work was supported by ● U.S. DoE under contract No. DE-AC02-07CH11359 ● U.S. NSF grants PHY (RACE) and PHY (DISUN)