Grid Laboratory of Wisconsin (GLOW)
UW Madison’s Campus Grid
Dan Bradley
Department of Physics & CS
Representing the GLOW + Condor Teams
http://www.cs.wisc.edu/condor/glow
May 13, 2019
2006 ESCC/Internet2 Joint Techs Workshop
The Premise
  Many researchers have computationally intensive problems.
  Individual workflows rise and fall over the course of weeks and months.
  Computers and computing people are less volatile than a researcher’s demand for them.
Grid Laboratory of Wisconsin
  2003 initiative funded by NSF/UW
  Six initial GLOW sites
    Computational Genomics, Chemistry
    Amanda, IceCube, Physics/Space Science
    High Energy Physics/CMS, Physics
    Materials by Design, Chemical Engineering
    Radiation Therapy, Medical Physics
    Computer Science
  Diverse users with different deadlines and usage patterns.
UW Madison Campus Grid
  Condor pools in various departments, made accessible via Condor ‘flocking’
    Users submit jobs to their own private or department Condor scheduler.
    Jobs are dynamically matched to available machines.
  Crosses multiple administrative domains.
    No common uid-space across campus.
    No cross-campus NFS for file access.
    Users rely on Condor remote I/O, file-staging, AFS, SRM, gridftp, etc.
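The flocking idea on this slide can be illustrated with a toy model. The sketch below is not Condor code and does not reproduce how Condor negotiates matches; the `Pool` class, the `submit` function, and the slot counts are all hypothetical names and numbers chosen for illustration. It only shows the ordering: a job is offered to the user's own pool first and moves on to the other pools, in a configured order, when no local machine is free.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Pool:
    """A hypothetical pool with a fixed number of free machine slots."""
    name: str
    free_slots: int
    running: List[str] = field(default_factory=list)

    def try_run(self, job: str) -> bool:
        """Start the job if a slot is free; report whether it started."""
        if self.free_slots > 0:
            self.free_slots -= 1
            self.running.append(job)
            return True
        return False


def submit(job: str, local_pool: Pool, flock_to: List[Pool]) -> Optional[str]:
    """Offer the job to the local pool first, then to each flock target in order.

    Returns the name of the pool that accepted the job, or None if the job
    stays idle in the queue.
    """
    for pool in [local_pool, *flock_to]:
        if pool.try_run(job):
            return pool.name
    return None


# Assumed example sizes: a small department pool plus two larger shared pools.
hep = Pool("HEP", free_slots=1)
glow = Pool("GLOW", free_slots=2)
cs = Pool("CS", free_slots=1)

placements = [submit(f"job{i}", hep, [glow, cs]) for i in range(5)]
print(placements)  # ['HEP', 'GLOW', 'GLOW', 'CS', None]
```

The first job runs locally, the excess jobs spill over to the shared pools, and the last one waits because every slot is taken.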
UW Campus Grid Machines
  GLOW Condor pool is distributed across the campus to provide locality with big users.
    1200 2.8 GHz Xeon CPUs
    200 1.8 GHz Opteron cores
    100 TB disk
  Computer Science Condor pool
    1000 ~1 GHz CPUs
    testbed for new Condor releases
  Other private pools
    job submission and execution
    private storage space
    excess jobs flock to GLOW and CS pools
New GLOW Members
  Proposed minimum involvement
    One rack with about 50 CPUs
    Identified system support person who joins GLOW-tech
      Can be an existing member of GLOW-tech
    PI joins the GLOW executive committee
    Adhere to current GLOW policies
    Sponsored by existing GLOW members
  UW ATLAS and other physics groups were proposed by CMS and CS, and were accepted as new members
  Expressions of interest from other groups
Housing the Machines
  Condominium Style
    centralized computing center
    space, power, cooling, management
    standardized packages
  Neighborhood Association Style
    each group hosts its own machines
    each contributes to administrative effort
    base standards (e.g. Linux & Condor) to make sharing of resources easy
  GLOW has elements of both, but leans towards neighborhood style
What About “The Grid”?
  Who needs a campus grid?
  Why not have each cluster join “The Grid” independently?
The Value of Campus Scale
  simplicity
    software stack is just Linux + Condor
  fluidity
    high common denominator makes sharing easier and provides richer feature-set
  collective buying power
    we speak to vendors with one voice
  standardized administration
    e.g. GLOW uses one centralized cfengine
  synergy
    face-to-face technical meetings
    mailing list scales well at campus level
The Value of the Big G
  Our users want to collaborate outside the bounds of the campus (e.g. ATLAS and CMS are international).
  We also don’t want to be limited to sharing resources with people who have made identical technological choices.
  The Open Science Grid gives us the opportunity to operate at both scales, which is ideal.
On the OSG Map
  Any GLOW member is free to link their resources to other grids.
  [Map entry: facility: WISC, site: UWMadisonCMS]
Submitting Jobs within UW Campus Grid
  [Diagram labels: UW HEP User; HEP matchmaker; CS matchmaker; GLOW matchmaker; schedd (Job caretaker); condor_submit; flocking; startd (Job Executor)]
  Supports full feature-set of Condor:
    matchmaking
    remote system calls
    checkpointing
    MPI
    suspension
    VMs
    preemption policies
Submitting Jobs through OSG to UW Campus Grid
  [Diagram labels: Open Science Grid User; HEP matchmaker; CS matchmaker; GLOW matchmaker; flocking; Globus gatekeeper; schedd (Job caretaker); condor_submit; schedd (Job caretaker); startd (Job Executor); condor gridmanager]
Routing Jobs from UW Campus Grid to OSG
  [Diagram labels: HEP matchmaker; CS matchmaker; GLOW matchmaker; schedd (Job caretaker); condor_submit; Grid JobRouter; globus gatekeeper; condor gridmanager]
  Combining both worlds:
    simple, feature-rich local mode when possible
    transform to grid job for traveling globally
GLOW Architecture in a Nutshell
  One big Condor pool
    But backup central manager runs at each site (Condor HAD service)
    Users submit jobs as members of a group (e.g. “CMS” or “MedPhysics”)
    Computers at each site give highest priority to jobs from same group (via machine RANK)
    Jobs run preferentially at the “home” site, but may run anywhere when machines are available
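The group-priority rule on this slide can be sketched as a toy matchmaker. This is an illustration under stated assumptions, not Condor's matchmaking algorithm and not the GLOW configuration: `Job`, `Machine`, `rank`, and `match` are hypothetical names, rank is reduced to 1 for a job from the machine's own group and 0 otherwise, and the tie-breaking order (home machine first, then idle over preempting) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Job:
    name: str
    group: str


@dataclass
class Machine:
    name: str
    owner_group: str
    running: Optional[Job] = None

    def rank(self, job: Job) -> int:
        """Toy stand-in for machine rank: 1 for the owning group, 0 otherwise."""
        return 1 if job.group == self.owner_group else 0


def match(job: Job, machines: List[Machine]) -> Tuple[Optional[str], Optional[str]]:
    """Place a job and return (machine name, name of any evicted job).

    A machine is usable if it is idle, or if it ranks the new job strictly
    higher than the job it is currently running.
    """
    def usable(m: Machine) -> bool:
        return m.running is None or m.rank(job) > m.rank(m.running)

    candidates = [m for m in machines if usable(m)]
    if not candidates:
        return None, None

    # Assumed preference: home machines first, then idle over preempting.
    best = max(candidates, key=lambda m: (m.rank(job), m.running is None))
    evicted = best.running
    best.running = job
    return best.name, (evicted.name if evicted else None)


machines = [Machine("cms-node", "CMS"), Machine("med-node", "MedPhysics")]
results = [
    match(Job("med-A", "MedPhysics"), machines),  # runs at home
    match(Job("med-B", "MedPhysics"), machines),  # home is busy, runs at CMS
    match(Job("cms-C", "CMS"), machines),         # owner reclaims its machine
    match(Job("med-D", "MedPhysics"), machines),  # nothing usable, stays idle
]
print(results)
```

In this run the second MedPhysics job borrows the idle CMS machine, and the CMS job later displaces it because the machine ranks its own group higher.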
Accommodating Special Cases
  Members have flexibility to make arrangements with each other when needed
    Example: granting 2nd priority
  Opportunistic access
    Long-running jobs which can’t easily be checkpointed can be run as bottom feeders that are suspended instead of being killed by higher priority jobs
  Computing on Demand
    tasks requiring low latency (e.g. interactive analysis) may quickly suspend any other jobs while they run
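Why suspension matters for jobs that cannot be checkpointed can be shown with a small calculation. The sketch below is a hypothetical model, not a Condor policy: `completion_time`, the policy names "suspend" and "kill", and the example numbers are all assumptions. It compares the wall-clock finish time of one job when interruptions pause it versus when they kill it and it restarts from zero.

```python
from typing import List, Tuple


def completion_time(work: float,
                    interruptions: List[Tuple[float, float]],
                    policy: str) -> float:
    """Wall-clock time at which a job needing `work` hours of CPU finishes.

    `interruptions` is a list of (start, duration) pairs in wall-clock hours,
    assumed sorted and non-overlapping. Under "suspend" the job keeps its
    progress; under "kill" it loses all progress and restarts afterwards.
    """
    if policy not in ("suspend", "kill"):
        raise ValueError("policy must be 'suspend' or 'kill'")

    now = 0.0   # current wall-clock time
    done = 0.0  # hours of useful work completed so far
    for start, duration in interruptions:
        if now + (work - done) <= start:
            break  # the job finishes before this interruption begins
        done += start - now
        now = start + duration
        if policy == "kill":
            done = 0.0  # no checkpoint, so all progress is lost
    return now + (work - done)


# Assumed example: a 10-hour job interrupted for 1 hour at hours 4 and 7.
interruptions = [(4, 1), (7, 1)]
suspended = completion_time(10, interruptions, "suspend")
killed = completion_time(10, interruptions, "kill")
print(suspended, killed)  # 12.0 18.0
```

With suspension the job only loses the two hours it was paused; with killing it loses six hours of completed work as well, and a job longer than the gap between interruptions would never finish at all.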
Example Uses
  Chemical Engineering
    Students do not know where the computing cycles are coming from - they just do it
    Largest user group
  ATLAS
    Over 15 million proton collision events simulated at 10 minutes each
  CMS
    Over 70 million events simulated, reconstructed and analyzed (total ~10 minutes per event) in the past year
  IceCube / Amanda
    Data filtering used 12 CPU-years in one month
  Computational Genomics
    Prof. Shwartz asserts that GLOW has opened up a new paradigm of work patterns in his group
    They no longer think about how long a particular computational job will take - they just do it
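The figures on this slide can be put on a common scale by converting them to CPU-years. The sketch below is plain arithmetic on the numbers quoted above; the function names and the 365-day year are assumptions, and since the slide says "over", the results are lower bounds.

```python
MINUTES_PER_YEAR = 60 * 24 * 365  # assumes a 365-day year


def cpu_years(events: float, minutes_per_event: float) -> float:
    """Total CPU time in CPU-years for a number of events at a per-event cost."""
    return events * minutes_per_event / MINUTES_PER_YEAR


def average_busy_cpus(total_cpu_years: float, elapsed_years: float) -> float:
    """Average number of CPUs kept busy to deliver the work in the elapsed time."""
    return total_cpu_years / elapsed_years


atlas = cpu_years(15e6, 10)             # about 285 CPU-years
cms = cpu_years(70e6, 10)               # about 1,330 CPU-years
icecube = average_busy_cpus(12, 1 / 12)  # about 144 CPUs busy for the month

print(f"ATLAS: {atlas:.0f} CPU-years")
print(f"CMS: {cms:.0f} CPU-years")
print(f"IceCube / Amanda: {icecube:.0f} CPUs on average for one month")
```

Read this way, the CMS figure corresponds to well over a thousand CPUs kept busy for a full year, and the IceCube / Amanda filtering to roughly 144 CPUs running continuously for a month.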
Summary
  Researchers are demanding to be well connected to both local and global computing resources.
  The Grid Laboratory of Wisconsin is our attempt to meet that demand.
  We hope you too will find a solution!