Grid Laboratory Of Wisconsin (GLOW)

Grid Laboratory Of Wisconsin (GLOW): UW Madison's Campus Grid
Dan Bradley, Department of Physics & CS, representing the GLOW + Condor teams
http://www.cs.wisc.edu/condor/glow
2006 ESCC/Internet2 Joint Techs Workshop

The Premise
Many researchers have computationally intensive problems. Individual workflows rise and fall over the course of weeks and months, while computers and computing people are less volatile than any one researcher's demand for them.

Grid Laboratory of Wisconsin
A 2003 initiative funded by NSF and UW. Six initial GLOW sites:
- Computational Genomics, Chemistry
- Amanda, IceCube, Physics/Space Science
- High Energy Physics/CMS, Physics
- Materials by Design, Chemical Engineering
- Radiation Therapy, Medical Physics
- Computer Science
Diverse users with different deadlines and usage patterns.

UW Madison Campus Grid
- Condor pools in various departments, made accessible via Condor "flocking".
- Users submit jobs to their own private or department Condor scheduler; jobs are dynamically matched to available machines.
- Crosses multiple administrative domains: there is no common uid-space across campus and no cross-campus NFS for file access, so users rely on Condor remote I/O, file staging, AFS, SRM, GridFTP, etc.
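
Flocking is configured on both sides: the submitting schedd lists the central managers of the pools it may flock to, and each remote pool lists the schedds it will accept. A minimal sketch of the relevant condor_config settings, with hypothetical hostnames standing in for the real GLOW machines:

    # On a department submit machine: central managers this schedd may flock to
    FLOCK_TO = cm.glow.example.edu, cm.cs.example.edu

    # On the remote pool's central manager and execute nodes:
    # schedds that are allowed to flock in
    FLOCK_FROM = submit.hep.example.edu
    ALLOW_WRITE = $(ALLOW_WRITE), submit.hep.example.edu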

UW Campus Grid Machines
- GLOW Condor pool, distributed across campus to provide locality with big users: 1200 2.8 GHz Xeon CPUs, 200 1.8 GHz Opteron cores, 100 TB of disk.
- Computer Science Condor pool: 1000 ~1 GHz CPUs; testbed for new Condor releases.
- Other private pools: job submission and execution, private storage space; excess jobs flock to the GLOW and CS pools.

New GLOW Members
Proposed minimum involvement:
- One rack with about 50 CPUs.
- An identified system support person who joins GLOW-tech (can be an existing member of GLOW-tech).
- The PI joins the GLOW executive committee.
- Adherence to current GLOW policies.
- Sponsorship by existing GLOW members.
UW ATLAS and other physics groups were proposed by CMS and CS and accepted as new members; there are expressions of interest from other groups.

Housing the Machines
- Condominium style: a centralized computing center provides space, power, cooling, and management, with standardized packages.
- Neighborhood association style: each group hosts its own machines and contributes to the administrative effort; base standards (e.g. Linux & Condor) make it easy to share resources.
GLOW has elements of both, but leans towards the neighborhood style.

What About "The Grid"?
Who needs a campus grid? Why not have each cluster join "The Grid" independently?

The Value of Campus Scale
- Simplicity: the software stack is just Linux + Condor.
- Fluidity: a high common denominator makes sharing easier and provides a richer feature set.
- Collective buying power: we speak to vendors with one voice.
- Standardized administration: e.g. GLOW uses one centralized cfengine.
- Synergy: face-to-face technical meetings and a mailing list scale well at the campus level.

The Value of the Big G
Our users want to collaborate outside the bounds of the campus (e.g. ATLAS and CMS are international). We also don't want to be limited to sharing resources with people who have made identical technological choices. The Open Science Grid gives us the opportunity to operate at both scales, which is ideal.

On the OSG Map
Any GLOW member is free to link their resources to other grids. (On the OSG map: facility WISC, site UWMadisonCMS.)

Submitting Jobs within the UW Campus Grid
[Diagram: a UW HEP user runs condor_submit against the local HEP schedd (job caretaker); the job is matched by the HEP matchmaker or, via flocking, by the CS and GLOW matchmakers, and runs on a startd (job executor).]
This path supports the full feature set of Condor: matchmaking, remote system calls, checkpointing, MPI, suspension, VMs, preemption policies.
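
A job enters this system as an ordinary Condor submit description; the schedd and flocking decide where it runs. A minimal sketch of such a submit file (the executable and file names are hypothetical):

    universe   = vanilla
    executable = simulate_events
    arguments  = --run $(Process)
    output     = sim.$(Process).out
    error      = sim.$(Process).err
    log        = sim.log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue 100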

Submitting Jobs through OSG to the UW Campus Grid
[Diagram: an Open Science Grid user runs condor_submit to a remote schedd (job caretaker); the Condor gridmanager forwards the job to the site's Globus gatekeeper, which hands it to a local schedd, from which it flocks via the HEP, CS, and GLOW matchmakers to a startd (job executor).]
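
From the outside, such a submission typically takes the form of a Condor-G "grid universe" job aimed at the site's gatekeeper. A hedged sketch, with a hypothetical gatekeeper hostname standing in for the real one:

    universe      = grid
    grid_resource = gt2 cmsgrid.example.edu/jobmanager-condor
    executable    = simulate_events
    output        = sim.out
    error         = sim.err
    log           = sim.log
    queue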

Routing Jobs from the UW Campus Grid to OSG
[Diagram: jobs submitted with condor_submit to the local schedd may be picked up by the JobRouter and transformed into grid jobs, which the Condor gridmanager sends to a remote Globus gatekeeper instead of matching locally via the HEP, CS, and GLOW matchmakers.]
Combining both worlds: a simple, feature-rich local mode when possible; transform into a grid job for traveling globally.
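
The JobRouter is driven by routing rules in the Condor configuration. A minimal sketch assuming a single hypothetical OSG gatekeeper (the route name and hostname are illustrative, not GLOW's actual routes):

    # Run the JobRouter alongside the schedd
    DAEMON_LIST = $(DAEMON_LIST), JOB_ROUTER

    # Only route jobs that opt in, and cap how many are routed at once
    JOB_ROUTER_DEFAULTS = [ requirements = target.WantJobRouter is True; \
                            MaxIdleJobs = 10; MaxJobs = 200; ]

    # One route: transform matching vanilla jobs into Globus (gt2) grid jobs
    JOB_ROUTER_ENTRIES = [ name = "OSG_Site_A"; \
                           GridResource = "gt2 osg-gw.example.edu/jobmanager-condor"; ]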

GLOW Architecture in a Nutshell
- One big Condor pool, but a backup central manager runs at each site (Condor HAD service).
- Users submit jobs as members of a group (e.g. "CMS" or "MedPhysics").
- Computers at each site give highest priority to jobs from the same group, via machine RANK (see the sketch below).
- Jobs run preferentially at the "home" site, but may run anywhere when machines are available.
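
A hedged sketch of how machine RANK can express "home group first". The GlowGroup attribute name is hypothetical, standing in for whatever group attribute GLOW jobs actually carry:

    # In the submit file of a CMS user: advertise the group as a custom job attribute
    +GlowGroup = "CMS"

    # In the condor_config of execute nodes owned by the CMS site:
    # prefer jobs from the owning group over all others
    RANK = (TARGET.GlowGroup =?= "CMS") * 1000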

Accommodating Special Cases
Members have the flexibility to make arrangements with each other when needed, for example granting 2nd priority.
- Opportunistic access: long-running jobs which can't easily be checkpointed can run as "bottom feeders" that are suspended instead of being killed by higher-priority jobs.
- Computing on Demand: tasks requiring low latency (e.g. interactive analysis) may quickly suspend any other jobs while they run.

Example Uses
- Chemical Engineering: students do not know where the computing cycles are coming from, they just do it; the largest user group.
- ATLAS: over 15 million proton collision events simulated, at 10 minutes each.
- CMS: over 70 million events simulated, reconstructed, and analyzed (total ~10 minutes per event) in the past year.
- IceCube / Amanda: data filtering used 12 CPU-years in one month.
- Computational Genomics: Prof. Shwartz asserts that GLOW has opened up a new paradigm of work patterns in his group; they no longer think about how long a particular computational job will take, they just do it.

Summary
Researchers are demanding to be well connected to both local and global computing resources. The Grid Laboratory of Wisconsin is our attempt to meet that demand. We hope you too will find a solution!