Open Science Grid

Frank Würthwein
OSG Application Coordinator
Experimental Elementary Particle Physics, UCSD
Particle Physics & Computing

Science Driver
  — Event rate = Luminosity x Cross section
  — LHC Revolution starting in 2008
  — Luminosity x 10
  — Cross section x 150 (e.g. top-quark)
Computing Challenge
  — 20 PB in first year of running
  — ~100 MSpecInt2000, close to 100,000 cores
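A quick worked form of the slide's scaling; the combined factor below is our own arithmetic, not a number quoted on the slide:

```latex
% Event rate R for a given process, from luminosity L and cross section \sigma:
\begin{align*}
  R &= L\,\sigma \\
  \frac{R_{\mathrm{LHC}}}{R_{\mathrm{before}}}
    &= \frac{L_{\mathrm{LHC}}}{L_{\mathrm{before}}}
       \times
       \frac{\sigma_{\mathrm{LHC}}}{\sigma_{\mathrm{before}}}
     \approx 10 \times 150 = 1500
     \quad \text{(e.g. top-quark production)}
\end{align*}
```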
Overview

OSG in a nutshell
Organization
"Architecture"
Using the OSG
Present Utilization & Expected Growth
Summary of OSG Status
OSG in a nutshell

High Throughput Computing
  — Opportunistic scavenging on cheap hardware.
  — Owner controlled policies.
"Open consortium"
  — Add OSG project to an open consortium to provide cohesion and sustainability.
Heterogeneous middleware stack
  — Minimal site requirements & optional services.
  — Production grid allows coexistence of multiple OSG releases.
"Linux rules": mostly RHEL3 on Intel/AMD.
Grid of clusters
  — Compute & storage (mostly) on private Gb/s LANs.
  — Some sites with (multiple) 10 Gb/s WAN "uplink".
Organization

Started in 2005 as a Consortium with contributed effort only.
Now adding the OSG project to sustain the production grid.
  — People coming together to build …
  — People paid to operate …
Consortium & Project

Consortium
  — Council
  — IT Departments & their hardware resources
  — Science Application Communities
  — Middleware Providers
Funded Project (starting 9/06)
  — Operate services for a distributed facility.
  — Improve, Extend, Expand & Interoperate.
  — Engagement, Education & Outreach.
Council Members: Argonne Nat. Lab., Brookhaven Nat. Lab., CCR SUNY Buffalo, Fermi Nat. Lab., Thomas Jefferson Nat. Lab., Lawrence Berkeley Nat. Lab., Stanford Lin. Acc. Center, Texas Adv. Comp. Center, RENCI, Purdue, US Atlas Collaboration, BaBar Collaboration, CDF Collaboration, US CMS Collaboration, D0 Collaboration, GRASE, LIGO, SDSS, STAR, US Atlas S&C Project, US CMS S&C Project, Condor, Globus, SRM, OSG Project.
(The slide labels the consortium stakeholders as Hardware, User Support, and Middleware.)
OSG Management

Executive Director: Ruth Pordes
Facility Coordinator: Miron Livny
Application Coordinators: Torre Wenaus & fkw
Resource Managers: P. Avery & A. Lazzarini
Education Coordinator: Mike Wilde
Engagement Coordinator: Alan Blatecky
Council Chair: Bill Kramer

Diverse set of people from universities & national labs, including CS, science apps, & IT infrastructure people.
OSG Management Structure
"Architecture"

Grid of sites
Me - My friends - the anonymous grid
Grid of Grids
Grid of sites

IT departments at universities & national labs make their hardware resources available via OSG interfaces.
  — CE: (modified) pre-WS GRAM
  — SE: SRM for large volume, gftp & (N)FS for small volume
Today's scale:
  — "active" sites (depending on definition of "active")
  — ~5,000 batch slots
  — ~500 TB storage
  — ~10 "active" sites with shared 10 Gbps or better connectivity
Expected scale for end of 2008:
  — ~50 "active" sites
  — ~30-50,000 batch slots
  — Few PB of storage
  — ~25-50% of sites with shared 10 Gbps or better connectivity
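To make the CE/SE split concrete, here is a minimal Python sketch of the kind of per-site record an information schema might carry; the field names, endpoint formats, and numbers are illustrative assumptions, not the actual OSG schema.

```python
from dataclasses import dataclass

# Illustrative only: field names and endpoints are assumptions,
# not the actual OSG information schema.
@dataclass
class SiteRecord:
    name: str
    ce_endpoint: str      # pre-WS GRAM gatekeeper contact string
    se_endpoint: str      # SRM endpoint for large-volume storage
    batch_slots: int
    storage_tb: float
    wan_gbps: float

example_site = SiteRecord(
    name="EXAMPLE_SITE",
    ce_endpoint="gatekeeper.example.edu/jobmanager-condor",
    se_endpoint="srm://se.example.edu:8443/srm/managerv1",
    batch_slots=500,
    storage_tb=20.0,
    wan_gbps=10.0,
)

if __name__ == "__main__":
    print(example_site)
```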
Making the Grid attractive

Minimize entry threshold for resource owners
  — Minimize software stack.
  — Minimize support load.
Minimize entry threshold for users
  — Feature rich software stack.
  — Excellent user support.
Resolve the contradiction via a "thick" Virtual Organization layer of services between users and the grid.
Me -- My friends -- The anonymous grid

O(10^4) Users, O( ) Sites, O( ) VOs

Me: thin client (domain science specific)
My friends: thick VO middleware & support (domain science specific)
The anonymous grid: thin "Grid API" (common to all sciences)
Grid of Grids - from local to global

CS/IT Campus Grids (e.g. GLOW, FermiGrid, …)
Science Community Infrastructure (e.g. Atlas, CMS, LIGO, …)
National & International CyberInfrastructure for Science (e.g. TeraGrid, EGEE, …)

OSG enables its users to operate transparently across Grid boundaries globally.
Using the OSG

Authentication & Authorization
Moving & Storing Data
Submitting jobs & "workloads"
Authentication & Authorization

OSG responsibilities
  — X509 based middleware
  — Accounts may be dynamic/static, shared/FQAN-specific
VO responsibilities
  — Instantiate VOMS
  — Register users & define/manage their roles
Site responsibilities
  — Choose security model (what accounts are supported)
  — Choose VOs to allow
  — Default accept of all users in a VO, but individuals or groups within the VO can be denied.
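As a concrete example of the X509/VOMS machinery on the user side, here is a minimal sketch that requests a VOMS proxy via the standard voms-proxy-init client; the VO name, role, and lifetime are placeholders, and this is just one common invocation, not an OSG-mandated one.

```python
#!/usr/bin/env python
# Sketch: create a short-lived VOMS proxy carrying VO/role (FQAN) attributes
# before talking to any OSG service. Assumes the standard voms-proxy-init
# client is installed and the user's certificate is in place.
import subprocess

def make_voms_proxy(vo: str, role: str = "", hours: int = 12) -> None:
    fqan = "{0}:/{0}/Role={1}".format(vo, role) if role else vo
    subprocess.check_call([
        "voms-proxy-init",
        "-voms", fqan,
        "-valid", "{}:00".format(hours),
    ])

if __name__ == "__main__":
    make_voms_proxy("osg")                       # default OSG VO, no role
    # make_voms_proxy("cms", role="production")  # FQAN-specific example
```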
User Management

User obtains a DN from a CA that is vetted by the TAGPMA.
User registers with a VO and is added to the VO's VOMS.
  — VO responsible for registration of its VOMS with the OSG GOC.
  — VO responsible for users signing the AUP.
  — VO responsible for VOMS operations.
  — VOMS shared for operations on multiple grids globally by some VOs.
  — Default OSG VO exists for new communities & single PIs.
Sites decide which VOs to support (striving for default admit).
  — Site populates GUMS daily from the VOMSes of all VOs.
  — Site chooses uid policy for each VO & role: dynamic vs. static vs. group accounts (see the sketch below).
User uses whatever services the VO provides in support of users.
  — VOs generally hide the grid behind a portal.
Any and all support is the responsibility of the VO:
  — Helping its users.
  — Responding to complaints from grid sites about its users.
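The site-side uid policy can be illustrated with a toy mapping function in the spirit of GUMS, but it is not GUMS itself; the VO names, account patterns, and DNs are invented for illustration.

```python
from typing import Optional

# Toy authorization decision: map (certificate DN, VO) to a local account
# under a per-VO uid policy, with VO-level default-accept and per-user deny.
SITE_POLICY = {
    # vo name -> (policy, parameter)
    "cms":  ("group",   "cmsuser"),        # everyone shares one group account
    "osg":  ("dynamic", "osgpool{:03d}"),  # leased accounts from a pool
    "ligo": ("static",  None),             # pre-registered per-user accounts
}
DENIED_DNS = {"/DC=org/DC=example/CN=Revoked User"}   # per-user deny list
STATIC_MAP = {"/DC=org/DC=example/CN=Alice": "alice"}
_next_pool_slot = [0]

def map_user(dn: str, vo: str) -> Optional[str]:
    """Return a local account name, or None if access is denied."""
    if dn in DENIED_DNS:
        return None                    # individual denied despite VO accept
    if vo not in SITE_POLICY:
        return None                    # site does not support this VO
    policy, param = SITE_POLICY[vo]
    if policy == "group":
        return param                   # shared group account
    if policy == "static":
        return STATIC_MAP.get(dn)      # must be pre-registered
    if policy == "dynamic":
        _next_pool_slot[0] += 1        # lease the next pool account
        return param.format(_next_pool_slot[0])
    return None

if __name__ == "__main__":
    print(map_user("/DC=org/DC=example/CN=Alice", "ligo"))         # alice
    print(map_user("/DC=org/DC=example/CN=Bob", "cms"))            # cmsuser
    print(map_user("/DC=org/DC=example/CN=Revoked User", "cms"))   # None
```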
Moving & storing data

OSG responsibilities
  — Define storage types & their APIs from WAN & LAN.
  — Define information schema for "finding" storage.
  — All storage is local to a site - no global filesystem!
VO responsibilities
  — Manage data transfer & catalogues.
Site responsibilities
  — Choose storage type to support & how much.
  — Implement storage type according to OSG rules.
  — Truth in advertisement.
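A minimal sketch of the VO-managed transfer step, assuming a gridftp client (globus-url-copy), a valid grid proxy, and placeholder endpoints; the catalogue here is a stand-in dictionary, not a real VO catalogue service.

```python
#!/usr/bin/env python
# Sketch of the VO-side view of data movement: the VO (not OSG, not the site)
# drives transfers into a site's gftp/SRM-managed data area and records them
# in its own catalogue.
import subprocess

CATALOGUE = {}  # stand-in for a VO-managed replica catalogue

def stage_to_site(local_path: str, site_gftp_url: str, dataset: str) -> None:
    """Copy one file into a site's data area and record the replica."""
    # An SRM client against the site's SRM endpoint could be used instead;
    # plain gridftp is shown to keep the sketch minimal.
    subprocess.check_call(["globus-url-copy",
                           "file://" + local_path, site_gftp_url])
    CATALOGUE.setdefault(dataset, []).append(site_gftp_url)

if __name__ == "__main__":
    stage_to_site("/data/vo/input_001.dat",
                  "gsiftp://se.example.edu/vo/data/input_001.dat",
                  dataset="example-input-v1")
    print(CATALOGUE)
```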
Disk areas in some detail

Shared filesystem as applications area at a site.
  — Read only from the compute cluster.
  — Role based installation via GRAM.
Batch slot specific local work space.
  — No persistency beyond the batch slot lease.
  — Not shared across batch slots.
  — Read & write access (of course).
SRM/gftp controlled data area.
  — "Persistent" data store beyond job boundaries.
  — Job related stage in/out.
  — SRM v1.1 today.
  — SRM v2 expected in late 2006 (space reservation).
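The three areas can also be seen from the point of view of a single job; the sketch below assumes the commonly used OSG_APP and OSG_WN_TMP environment variables and placeholder paths, none of which are guaranteed by the slide.

```python
#!/usr/bin/env python
# Sketch of how one batch job uses the three disk areas: read the application
# from the shared (read-only) area, work in the batch slot's local scratch,
# then push results to the SRM/gftp data area. No global filesystem is assumed.
import os
import shutil
import subprocess
import tempfile

def run_job(app_rel_path: str, output_url: str) -> None:
    app_area = os.environ.get("OSG_APP", "/shared/app")        # read-only
    scratch  = os.environ.get("OSG_WN_TMP", tempfile.gettempdir())

    workdir = tempfile.mkdtemp(dir=scratch)                    # slot-local, volatile
    try:
        exe = os.path.join(app_area, app_rel_path)
        out = os.path.join(workdir, "result.dat")
        with open(out, "w") as fh:
            subprocess.check_call([exe], cwd=workdir, stdout=fh)
        # Stage the output to the site's gftp/SRM-managed area.
        subprocess.check_call(["globus-url-copy",
                               "file://" + out, output_url])
    finally:
        shutil.rmtree(workdir)          # scratch does not outlive the lease

if __name__ == "__main__":
    run_job("myvo/analyze",  # hypothetical executable installed via GRAM
            "gsiftp://se.example.edu/vo/output/result.dat")
```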
Securing your data

Archival storage in your trusted archive.
  — You control where your data is archived.
Data moved by a party you trust.
  — You control who moves your data.
  — You control encryption of your data.
You compute at sites you trust.
  — E.g. sites that guarantee a specific unix uid for you.
  — E.g. sites whose security model satisfies your needs.
You decide how secure your data needs to be!
Submitting jobs/workloads

OSG responsibilities
  — Define the interface to the batch system (today: pre-WS GRAM).
  — Define the information schema.
  — Provide middleware that implements the above.
VO responsibilities
  — Manage submissions & workflows.
  — VO controlled workload management system, or a wms from other grids, e.g. EGEE/LCG.
Site responsibilities
  — Choose the batch system.
  — Configure the interface according to OSG rules.
  — Truth in advertisement.
Simple Workflow

Install application software at site(s)
  — VO admin installs via GRAM.
  — VO users have read only access from batch slots.
"Download" data to site(s)
  — VO admin moves data via SRM/gftp.
  — VO users have read only access from batch slots.
Submit job(s) to site(s)
  — VO users submit job(s)/DAG via condor-g (see the sketch below).
  — Jobs run in batch slots, writing output to local disk.
  — Jobs copy output from local disk to the SRM/gftp data area.
Collect output from site(s)
  — VO users collect output from site(s) via SRM/gftp as part of the DAG.
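A sketch of the condor-g submission step: write a grid-universe (gt2, i.e. pre-WS GRAM) submit description and hand it to condor_submit. The gatekeeper contact string, executable, and job count are placeholders, and this is one common Condor-G form rather than the form prescribed by OSG.

```python
#!/usr/bin/env python
# Sketch: generate a Condor-G submit description targeting a site's pre-WS
# GRAM gatekeeper and submit it. Requires a working Condor-G client and a
# valid grid proxy.
import subprocess

SUBMIT_TEMPLATE = """\
universe      = grid
grid_resource = gt2 {gatekeeper}
executable    = {executable}
transfer_executable = true
output        = job_$(Cluster).$(Process).out
error         = job_$(Cluster).$(Process).err
log           = job.log
queue {count}
"""

def submit(gatekeeper: str, executable: str, count: int = 1) -> None:
    submit_file = "osg_job.sub"
    with open(submit_file, "w") as f:
        f.write(SUBMIT_TEMPLATE.format(
            gatekeeper=gatekeeper, executable=executable, count=count))
    subprocess.check_call(["condor_submit", submit_file])

if __name__ == "__main__":
    submit("gatekeeper.example.edu/jobmanager-condor", "analyze.sh", count=10)
```

For a multi-step workflow, per-site submit descriptions like this one would typically be tied together as nodes of a DAGMan DAG, matching the "job(s)/DAG" phrasing above.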
Some technical details

Job submission
  — Condor: condor-g, "schedd on the side" (simple multi-site brokering using the condor schedd), Condor glide-in.
  — EGEE workload management system: the OSG CE is compatible with the glite Classic CE; submissions via either the LCG 2.7 RB or the glite RB, including bulk submission.
  — Virtual Data System (VDS) in use on OSG.
Data placement using SRM
  — SRM/dCache in use to virtualize many disks into one storage system.
  — Schedule WAN transfers across many gftp servers; typical WAN IO capability today ~10 TB/day ~ 2 Gbps.
  — Schedule random access from batch slots to many disks via the LAN; typical LAN IO capability today ~ GByte/sec.
  — Space reservation.
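As a sanity check on the quoted WAN figure, a rough conversion (our arithmetic, decimal terabytes assumed) relates TB/day to link speed:

```latex
% Rough conversion of the quoted sustained WAN rate (decimal terabytes assumed):
\[
  10\ \mathrm{TB/day}
  = \frac{10 \times 10^{12}\ \mathrm{B} \times 8\ \mathrm{bit/B}}{86400\ \mathrm{s}}
  \approx 0.93\ \mathrm{Gbps},
\]
% i.e. a link provisioned at ~2 Gbps runs at roughly half utilization on
% average, leaving headroom for bursts and protocol overhead.
```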
Middleware lifecycle

Domain science requirements.
Joint projects between the OSG applications group & middleware developers to develop & test on community grids.
Integrate into the VDT and deploy on the OSG-itb.
Inclusion into an OSG release & deployment on (part of) the production grid.
EGEE et al.
Status of Utilization

OSG job = job submitted via an OSG CE.
"Accounting" of OSG jobs not (yet) required!
OSG use by Numbers

32 Virtual Organizations
  — 3 with >1000 jobs max. (all particle physics)
  — 3 with … max. (all outside physics)
  — 5 with … max. (particle, nuclear, and astro physics)
[Plot: running OSG jobs by community, 5/05-5/06; legend: Experimental Particle Physics, Bio/Eng/Med/Math, Campus Grids, Non-HEP physics; marked levels at 100, 850, and 2250 jobs; annotations: "GADU using VDS", "PI from Campus Grid".]
Example GADU run in April

Bioinformatics application using VDS across 8 sites on OSG.
Number of running (and monitored) "OSG jobs" in June
CMS Xfer on OSG in June 2006

All CMS sites have exceeded 5 TB per day in June.
Caltech, Purdue, UCSD, UFL, UW exceeded 10 TB/day.
Hoping to reach 30-40 TB/day capability by end of …
[Plot: transfer rate in MByte/sec.]
Grid of Grids

OSG enables single PIs and user communities to operate transparently across Grid boundaries globally.
E.g.: CMS, a particle physics experiment.
CMS Experiment - a global community grid

Data & jobs move locally, regionally & globally within the CMS grid, transparently across grid boundaries from campus to global.
[Map: CMS sites spanning OSG and EGEE, including CERN, Caltech, Florida, MIT, Purdue, UCSD, UNL, Wisconsin, France, Germany, Italy, Taiwan, UK.]
Grid of Grids - Production Interop

Job submission: 16,000 jobs per day submitted across EGEE & OSG via the "LCG RB". Jobs brokered transparently onto both grids.
Data transfer: peak IO of 5 Gbps from FNAL to 32 EGEE and 7 OSG sites. All 8 CMS sites on OSG have exceeded the 5 TB/day goal. Caltech, FNAL, Purdue, UCSD/SDSC, UFL, UW exceed 10 TB/day.
CMS Xfer FNAL to World

The US CMS center at FNAL transfers data to 39 sites worldwide in the CMS global Xfer challenge. Peak Xfer rates of ~5 Gbps are reached.
Summary of OSG Status

OSG facility opened July 22nd.
OSG facility is under steady use
  — ~… jobs at all times
  — Mostly HEP, but large Bio/Eng/Med occasionally
  — Moderate other physics (Astro/Nuclear)
OSG project
  — 5 year proposal to DOE & NSF, funded starting FY07.
  — Facility & Improve/Expand/Extend/Interoperate & E&O.
Off to a running start … but lots more to do.
  — Routinely exceeding 1 Gbps at 6 sites; scale by x4 by 2008 and many more sites.
  — Routinely exceeding 1000 running jobs per client; scale by at least x10 by 2008.
  — Have reached 99% success rate for 10,000 jobs per day submission; need to reach this routinely, even under heavy load.