Miron Livny Computer Sciences Department University of Wisconsin-Madison Submit locally and run globally – The GLOW and OSG Experience

What impact does our computing infrastructure have on our scientists?

Supercomputing in Social Science Maria Marta Ferreyra Carnegie Mellon University

 What would happen if many families in the largest U.S. metropolitan areas received vouchers for private schools?
 Completed my dissertation with Condor's help
 The contributions from my research are made possible by Condor – the questions could not be answered otherwise

Research question
 Vouchers allow people to choose the type of school they want
 Vouchers may affect where families choose to live
 The problem has many moving parts (a general equilibrium problem)

Why Condor was a great match to my needs (cont.)
 I did not have to alter my code
 I did not have to pay
 Since 19 March 2001 I have used 462,667 hours (about 53 years with one 1 GHz processor)

The search for SUSY*
› Sanjay Padhi is a UW Chancellor Fellow who is working in the group of Prof. Sau Lan Wu, located at CERN (Geneva)
› Using Condor technologies he established a "grid access point" in his office at CERN
› Through this access point he managed to harness, in 3 months (12/05-2/06), more than 500 CPU-years from the LHC Computing Grid (LCG), the Open Science Grid (OSG), the Grid Laboratory Of Wisconsin (GLOW), and local group-owned desktop resources.
* Super-Symmetry

Claims for "benefits" provided by Distributed Processing Systems
 High Availability and Reliability
 High System Performance
 Ease of Modular and Incremental Growth
 Automatic Load and Resource Sharing
 Good Response to Temporary Overloads
 Easy Expansion in Capacity and/or Function
"What is a Distributed Data Processing System?", P.H. Enslow, Computer, January 1978

Democratization of Computing: You do not need to be a super-person to do super-computing

High Throughput Computing
We first introduced the distinction between High Performance Computing (HPC) and High Throughput Computing (HTC) in a seminar at the NASA Goddard Space Flight Center in July of 1996, and a month later at the European Laboratory for Particle Physics (CERN). In June of 1997 HPCWire published an interview on High Throughput Computing.

HTC
For many experimental scientists, scientific progress and quality of research are strongly linked to computing throughput. In other words, they are less concerned about instantaneous computing power. Instead, what matters to them is the amount of computing they can harness over a month or a year --- they measure computing power in units of scenarios per day, wind patterns per week, instruction sets per month, or crystal configurations per year.

High Throughput Computing is a 24-7-365 activity. FLOPY ≠ (60*60*24*7*52)*FLOPS
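To make the comparison concrete, here is the arithmetic behind the slogan (a worked example added here, not from the original slides). A year of round-the-clock operation contains

  60*60*24*7*52 = 31,449,600 seconds

so a resource with a peak rate of F FLOPS could deliver at most about 3.1 x 10^7 * F floating-point operations per year (FLOPY). Sustained yearly throughput almost never reaches that peak-rate product, because of downtime, scheduling gaps and failures; that gap between FLOPY and (seconds per year)*FLOPS is what HTC cares about.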

“ … We claim that these mechanisms, although originally developed in the context of a cluster of workstations, are also applicable to computational grids. In addition to the required flexibility of services in these grids, a very important concern is that the system be robust enough to run in “production mode” continuously even in the face of component failures. … “ Miron Livny & Rajesh Raman, "High Throughput Resource Management", in “The Grid: Blueprint for a New Computing Infrastructure”, 1998.

HTC leads to a “bottom up” approach to building and operating a distributed computing infrastructure

My jobs should run …
› … on my laptop if it is not connected to the network
› … on my group resources if my grid certificate expired
› … on my campus resources if the meta-scheduler is down
› … on my national grid if the trans-Atlantic link was cut by a submarine

Taking HTC to the next level – The Open Science Grid (OSG)

What is OSG? The Open Science Grid is a US national distributed computing facility that supports scientific computing via an open collaboration of science researchers, software developers and computing, storage and network providers. The OSG Consortium is building and operating the OSG, bringing resources and researchers from universities and national laboratories together and cooperating with other national and international infrastructures to give scientists from a broad range of disciplines access to shared resources worldwide.

The OSG Project
 Co-funded by DOE and NSF at an annual rate of ~$6M for 5 years
 Involves 4 DOE Labs and 12 universities (16 institutions)
 Currently the main stakeholders are from physics – the US LHC experiments, LIGO, the STAR experiment, the Tevatron Run II and astrophysics experiments
 A mix of DOE-Lab and campus resources
 Active "engagement" effort to add new domains and resource providers to the OSG consortium

OSG PEP - Organization

OSG Project Execution Plan (PEP) – FTEs
  Facility operations                   5.0
  Security and troubleshooting          4.5
  Software release and support          6.5
  Engagement                            2.0
  Education, outreach & training        2.0
  Facility management                   1.0
  Extensions in capability and scale    9.0
  Staff                                 3.0
  Total FTEs                           33.0

Part of the OSG Consortium (diagram: the Contributors and the Project)

OSG Principles
Characteristics –
 Provide guaranteed and opportunistic access to shared resources.
 Operate a heterogeneous environment, both in the services available at any site and for any VO, with multiple implementations behind common interfaces.
 Interface to Campus and Regional Grids.
 Federate with other national/international Grids.
 Support multiple software releases at any one time.
Drivers –
 Delivery to the schedule, capacity and capability of LHC and LIGO: contributions to/from and collaboration with the US ATLAS, US CMS, and LIGO software and computing programs.
 Support for, and collaboration with, other physics and non-physics communities.
 Partnerships with other Grids – especially EGEE and TeraGrid.
 Evolution by deployment of externally developed new services and technologies.

Grid of Grids – from Local to Global: Campus, Community, National

Who are you? A resource can be accessed by a user via the campus, community or national grid. A user can access a resource with a campus, community or national grid identity.

32 Virtual Organizations – participating groups
 3 with >1000 jobs max. (all particle physics)
 3 with … jobs max. (all outside physics)
 5 with … jobs max. (particle, nuclear, and astrophysics)

OSG Middleware Layering (a layered stack, from infrastructure to applications)
 Infrastructure – NSF Middleware Initiative (NMI): Condor, Globus, MyProxy
 Virtual Data Toolkit (VDT) Common Services – NMI + VOMS, CEMon (common EGEE components), MonALISA, Clarens, AuthZ
 OSG Release Cache – VDT + configuration, validation, VO management
 Applications – ATLAS Services & Framework; CMS Services & Framework; LIGO Data Grid; CDF, D0 SAMGrid & Framework

OSG Middleware Deployment (flow): domain science requirements → OSG stakeholders and middleware developer (joint) projects → test on a "VO-specific grid" → integrate into the VDT release (drawing on Condor, Globus, Privilege, EGEE components, etc.) → deploy on the OSG integration grid → provision in the OSG release & deploy to OSG production.

Interoperability with Campus Grids
At this point we have three operational campus grids – Fermi, Purdue and Wisconsin. We are working on adding Harvard (Crimson) and Lehigh. FermiGrid is an interesting example of the challenges we face when making the resources of a campus grid (in this case a DOE laboratory) accessible to the OSG community.

What is FermiGrid?
 Integrates resources across most (soon all) owners at Fermilab.
 Supports jobs from Fermilab organizations to run on any/all accessible campus (FermiGrid) and national (Open Science Grid) resources.
 Supports jobs from OSG to be scheduled onto any/all Fermilab sites.
 Unified and reliable common interface and services for the FermiGrid gateway – including security, job scheduling, user management, and storage.
 More information is available at …

Job Forwarding and Resource Sharing
 The gateway currently interfaces 5 Condor pools with diverse file systems and >1000 job slots; plans to grow to 11 clusters (8 Condor, 2 PBS and 1 LSF).
 Job scheduling policies and sharing agreements already in place allow fast response to changes in the resource needs of Fermilab and OSG users.
 The gateway provides the single bridge between the OSG wide-area distributed infrastructure and the FermiGrid local sites; it consists of a Globus gatekeeper and a Condor-G, and each cluster has its own Globus gatekeeper (a submit-file sketch follows below).
 Storage and job execution policies are applied through site-wide managed security and authorization services.
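To make the gateway mechanics concrete, here is a minimal Condor-G submit description of the kind a user might hand to condor_submit to send a job through a GT2-era Globus gatekeeper such as the one fronting FermiGrid. This is a sketch, not taken from the talk; the gatekeeper host name, jobmanager and file names are hypothetical placeholders.

  # Condor-G sketch: send one job through a Globus (GT2) gatekeeper
  # (hypothetical host, jobmanager and file names)
  universe      = grid
  grid_resource = gt2 gateway.example.org/jobmanager-condor

  executable    = analyze
  arguments     = run.config
  transfer_input_files = run.config
  output        = analyze.out
  error         = analyze.err
  log           = analyze.log

  queue

Submitting this hands the job to the local schedd; its gridmanager then negotiates with the remote gatekeeper, which is exactly the "submit locally, run globally" pattern of the talk's title.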

Access to FermiGrid (diagram: OSG general users, OSG "agreed" users, and Fermilab users submit via Condor-G to the FermiGrid gateway's Globus gatekeeper (GT-GK); behind it, per-cluster gatekeepers front the CDF, DZero, CMS, and shared Condor pools).

The Crimson Grid is
 a scalable collaborative computing environment for research at the interface of science and engineering,
 a gateway/middleware release service to enable campus/community/national/global computing infrastructures for interdisciplinary research,
 a test bed for faculty & IT-industry affiliates within the framework of a production environment for integrating HPC solutions for higher education & research,
 a campus resource for skills & knowledge sharing for advanced systems administration & management of switched architectures.

CrimsonGrid Role as a Campus Grid Enabler

Homework? (diagram: CrimsonGrid, ATLAS, OSG, an OSG Tier-II, and other campus grids)

HTC on the UW campus (or: what can you do with $1.5M?)

Grid Laboratory of Wisconsin (GLOW)
2003 initiative funded by NSF (MRI)/UW at ~$1.5M
Six initial GLOW sites:
 Computational Genomics (Chemistry)
 Amanda, IceCube (Physics/Space Science)
 High Energy Physics/CMS (Physics)
 Materials by Design (Chemical Engineering)
 Radiation Therapy (Medical Physics)
 Computer Science
Diverse users with different deadlines and usage patterns.

Example Uses
 Chemical Engineering – Students do not know where the computing cycles are coming from; they just do it – largest user group
 ATLAS – Over 15 million proton collision events simulated, at 10 minutes each
 CMS – Over 70 million events simulated, reconstructed and analyzed (total ~10 minutes per event) in the past year
 IceCube / Amanda – Data filtering used 12 CPU-years in one month
 Computational Genomics – Prof. Shwartz asserts that GLOW has opened up a new paradigm of work patterns in his group: they no longer think about how long a particular computational job will take, they just do it

GLOW Usage 4/04–9/05
 Over 7.6 million CPU-hours (865 CPU-years) served!
 Takes advantage of "shadow" jobs
 Takes advantage of checkpointing jobs
 Leftover cycles available for "others"

UW-Madison Campus Grid
 Condor pools in various departments, made accessible via Condor "flocking" (a configuration sketch follows below)
  – Users submit jobs to their own private or department Condor scheduler.
  – Jobs are dynamically matched to available machines.
 Crosses multiple administrative domains.
  – No common uid-space across campus.
  – No cross-campus NFS for file access; users rely on Condor remote I/O, file staging, AFS, SRM, GridFTP, etc.
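To illustrate how such cross-department sharing is wired up, the sketch below shows the two condor_config knobs that enable flocking. The host names are hypothetical, and a real campus deployment would pair this with matching security (ALLOW_*/authentication) settings.

  # Flocking sketch (hypothetical host names)

  # On a department submit machine: if the local pool cannot run the
  # jobs, the schedd tries these remote central managers in order.
  FLOCK_TO = condor.hep.example.edu, condor.cs.example.edu

  # On the remote pool: the schedds allowed to flock in; the default
  # security settings reference this macro.
  FLOCK_FROM = submit01.chem.example.edu, submit02.physics.example.edu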

Housing the Machines
 Condominium style
  – centralized computing center
  – space, power, cooling, management
  – standardized packages
 Neighborhood-association style
  – each group hosts its own machines
  – each contributes to the administrative effort
  – base standards (e.g. Linux & Condor) make it easy to share resources
 GLOW has elements of both, but leans towards the neighborhood style

The value of the big G
Our users want to collaborate outside the bounds of the campus (e.g. ATLAS and CMS are international). We also don't want to be limited to sharing resources with people who have made identical technological choices. The Open Science Grid (OSG) gives us the opportunity to operate at both scales, which is ideal.

Submitting Jobs within the UW Campus Grid (diagram: a UW HEP user runs condor_submit to a local schedd – the job caretaker – which the HEP, CS, and GLOW matchmakers, linked by flocking, match to a startd – the job executor). Supports the full feature set of Condor: matchmaking, remote system calls, checkpointing, MPI, suspension, VMs, preemption policies. A submit-file sketch follows below.
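A minimal submit description of the kind implied by this slide might look as follows. It is a sketch with hypothetical file names, shown in the vanilla universe; relinking the program with condor_compile and using "universe = standard" would add the checkpointing and remote system calls listed above.

  # Sketch: queue 50 simulation jobs with the local department schedd
  universe    = vanilla
  executable  = simulate
  arguments   = input.$(Process)
  output      = out.$(Process)
  error       = err.$(Process)
  log         = simulate.log

  should_transfer_files   = YES
  when_to_transfer_output = ON_EXIT

  queue 50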

Submitting Jobs through OSG to the UW Campus Grid (diagram: an Open Science Grid user runs condor_submit to their own schedd; the Condor gridmanager forwards the job through a Globus gatekeeper to a schedd inside the campus grid, where the HEP, CS, and GLOW matchmakers, linked by flocking, place it on a startd).

Routing Jobs from the UW Campus Grid to OSG (diagram: a user runs condor_submit to the local schedd; a Grid JobRouter, working alongside the HEP, CS, and GLOW matchmakers, transforms selected jobs and hands them to the Condor gridmanager and a Globus gatekeeper). Combining both worlds: simple, feature-rich local mode when possible; transform to a grid job for traveling globally. A configuration sketch follows below.
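As a hedged illustration, a schedd-on-the-side JobRouter configuration along these lines could be written in condor_config as below; the route name and gatekeeper host are hypothetical, and a production setup would add per-site limits and attributes.

  # JobRouter sketch (hypothetical route and gatekeeper)
  # The JOB_ROUTER daemon must also be added to DAEMON_LIST.

  # Only consider jobs that have opted in with +WantJobRouter = True.
  JOB_ROUTER_DEFAULTS = \
    [ requirements = target.WantJobRouter is True; \
      MaxIdleJobs = 10; \
      MaxJobs = 200; ]

  # One route: transform eligible vanilla jobs into grid-universe
  # jobs aimed at a GT2 gatekeeper on the OSG.
  JOB_ROUTER_ENTRIES = \
    [ name = "OSG_Site_X"; \
      GridResource = "gt2 gatekeeper.example.org/jobmanager-condor"; ]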

GLOW Architecture in a Nutshell
 One big Condor pool, but a backup central manager runs at each site (Condor HAD service)
 Users submit jobs as members of a group (e.g. "CMS" or "MedPhysics")
 Computers at each site give highest priority to jobs from the same group (via machine RANK; a sketch follows below)
 Jobs run preferentially at the "home" site, but may run anywhere when machines are available
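A minimal sketch of the machine-RANK idea, assuming jobs advertise their group in a custom attribute set with +Group = "CMS" in the submit file (the attribute name is an assumption, not necessarily what GLOW used):

  # Sketch: execute-machine condor_config at the (hypothetical) CMS site.
  # A higher RANK wins, so jobs advertising Group == "CMS" are preferred
  # at this site and can displace opportunistic jobs from other groups,
  # while any GLOW job is still accepted when the machine is idle.
  START = True
  RANK  = (TARGET.Group =?= "CMS")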

Accommodating Special Cases
 Members have flexibility to make arrangements with each other when needed
  – Example: granting 2nd priority
 Opportunistic access
  – Long-running jobs which can't easily be checkpointed can run as "bottom feeders" that are suspended, instead of being killed, by higher-priority jobs
 Computing on Demand
  – Tasks requiring low latency (e.g. interactive analysis) may quickly suspend any other jobs while they run

Schedd On The Side – Elevating from GLOW to OSG (diagram: the local job queue, Job 1 through Job 5 and beyond, with Job 4 mirrored as a routed copy, Job 4*).

The Grid Universe (diagram: the schedd sends vanilla jobs through Gatekeeper X to startds at site X)
 easier to live with private networks
 may use non-Condor resources
 restricted Condor feature set (e.g. no standard universe over grid)
 must pre-allocate jobs between the vanilla and grid universes

Dynamically Routing Jobs (diagram: the schedd runs some jobs on local startds while a schedd-on-the-side routes others through gatekeepers X, Y, and Z to sites X, Y, and Z)
 dynamic allocation of jobs between the vanilla and grid universes
 not every job is appropriate for transformation into a grid job (see the submit-side sketch below)
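One hedged way to express that last point, matching the JOB_ROUTER_DEFAULTS sketch shown earlier: let each job declare whether it is safe to route, so only self-contained jobs are transformed into grid jobs. The attribute name follows the earlier sketch and is an assumption.

  # Sketch: a self-contained job opts in to routing (hypothetical names)
  universe     = vanilla
  executable   = montecarlo
  arguments    = seed.$(Process)
  output       = mc.out.$(Process)
  error        = mc.err.$(Process)
  log          = mc.log
  +WantJobRouter = True

  queue 500

Jobs that depend on campus-only features (AFS paths, standard-universe checkpointing, suspension-based Computing on Demand) would simply omit the attribute and stay in the local vanilla pool.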

What is the right $ balance between HPC & HTC?