
1  Open Science Grid
Frank Würthwein, OSG Application Coordinator, Experimental Elementary Particle Physics, UCSD

2  Particle Physics & Computing
- Science driver: Event rate = Luminosity x Cross section
- LHC revolution starting in 2008
  - Luminosity x 10
  - Cross section x 150 (e.g. top quark)
- Computing challenge
  - 20 PB in the first year of running
  - ~100 MSpecInt2000 (close to 100,000 cores)
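For readers outside particle physics, the scaling argument on this slide can be written out explicitly. The combined factor of 1500 is simply the product of the two numbers quoted above (it is implied rather than stated on the slide), and "prev" denotes the pre-LHC generation of experiments:

```latex
% Event rate as the slide defines it:
R \;=\; \mathcal{L} \times \sigma
% Going to the LHC: luminosity up ~10x, top-quark cross section up ~150x
\frac{R_{\text{LHC}}}{R_{\text{prev}}} \;\approx\; 10 \times 150 \;=\; 1500
```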

3  Overview
- OSG in a nutshell
- Organization
- "Architecture"
- Using the OSG
- Present utilization & expected growth
- Summary of OSG status

4  OSG in a nutshell
- High-throughput computing
  - Opportunistic scavenging on cheap hardware
  - Owner-controlled policies
- "Open consortium"
  - Add the OSG project to an open consortium to provide cohesion and sustainability
- Heterogeneous middleware stack
  - Minimal site requirements & optional services
  - The production grid allows coexistence of multiple OSG releases
- "Linux rules": mostly RHEL3 on Intel/AMD
- Grid of clusters
  - Compute & storage (mostly) on private Gb/s LANs
  - Some sites with (multiple) 10 Gb/s WAN "uplinks"

5  Organization
Started in 2005 as a consortium with contributed effort only; now adding the OSG project to sustain the production grid.
- People coming together to build ...
- People paid to operate ...

6  Consortium & Project
- Consortium Council
  - IT departments & their hardware resources
  - Science application communities
  - Middleware providers
- Funded project (starting 9/06)
  - Operate services for a distributed facility
  - Improve, extend, expand & interoperate
  - Engagement, education & outreach
Council members: Argonne Nat. Lab., Brookhaven Nat. Lab., CCR SUNY Buffalo, Fermi Nat. Lab., Thomas Jefferson Nat. Lab., Lawrence Berkeley Nat. Lab., Stanford Lin. Acc. Center, Texas Adv. Comp. Center, RENCI, Purdue, US Atlas Collaboration, BaBar Collaboration, CDF Collaboration, US CMS Collaboration, D0 Collaboration, GRASE, LIGO, SDSS, STAR, US Atlas S&C Project, US CMS S&C Project, Condor, Globus, SRM, OSG Project.

7  Consortium & Project (continued)
Same membership as the previous slide, with the contributions labeled: Middleware, Hardware, User Support.

8  OSG Management
- Executive Director: Ruth Pordes
- Facility Coordinator: Miron Livny
- Application Coordinators: Torre Wenaus & fkw
- Resource Managers: P. Avery & A. Lazzarini
- Education Coordinator: Mike Wilde
- Engagement Coordinator: Alan Blatecky
- Council Chair: Bill Kramer
A diverse set of people from universities & national labs, including CS, science applications, & IT infrastructure people.

9  OSG Management Structure

10  "Architecture"
- Grid of sites
- Me - my friends - the anonymous grid
- Grid of Grids

11  Grid of sites
- IT departments at universities & national labs make their hardware resources available via OSG interfaces
  - CE: (modified) pre-WS GRAM
  - SE: SRM for large volume, GridFTP & (N)FS for small volume
- Today's scale
  - 20-50 "active" sites (depending on the definition of "active")
  - ~5,000 batch slots
  - ~500 TB of storage
  - ~10 "active" sites with shared 10 Gbps or better connectivity
- Expected scale for the end of 2008
  - ~50 "active" sites
  - ~30-50,000 batch slots
  - A few PB of storage
  - ~25-50% of sites with shared 10 Gbps or better connectivity

12  Making the Grid attractive
- Minimize the entry threshold for resource owners
  - Minimize the software stack
  - Minimize the support load
- Minimize the entry threshold for users
  - Feature-rich software stack
  - Excellent user support
Resolve the contradiction via a "thick" Virtual Organization (VO) layer of services between users and the grid.

13  Me -- My friends -- The grid
- "Me": O(10^4) users, each with a thin client (domain-science specific)
- "My friends": O(10^1-2) VOs providing thick VO middleware & support (domain-science specific)
- "The anonymous grid": O(10^2-3) sites behind a thin "Grid API" (common to all sciences)

14  Grid of Grids - from local to global
- CS/IT campus grids (e.g. GLOW, FermiGrid, ...)
- Science community infrastructure (e.g. Atlas, CMS, LIGO, ...)
- National & international cyberinfrastructure for science (e.g. TeraGrid, EGEE, ...)
OSG enables its users to operate transparently across grid boundaries globally.

15  Using the OSG
- Authentication & authorization
- Moving & storing data
- Submitting jobs & "workloads"

16  Authentication & Authorization
- OSG responsibilities
  - X509-based middleware
  - Accounts may be dynamic/static, shared/FQAN-specific
- VO responsibilities
  - Instantiate VOMS
  - Register users & define/manage their roles
- Site responsibilities
  - Choose the security model (what accounts are supported)
  - Choose which VOs to allow
  - Default accept of all users in a VO, but individuals or groups within a VO can be denied
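As a hedged illustration of what the user-side authentication step looks like in practice, the sketch below wraps the standard VOMS client tools of that era (voms-proxy-init / voms-proxy-info). It assumes those tools are installed and on PATH; the VO name "myvo" and the "production" role are placeholders, not real OSG names.

```python
import subprocess

def make_voms_proxy(vo, role=None, hours=12):
    """Create a short-lived VOMS proxy from the user's X509 credentials.

    Assumes usercert.pem / userkey.pem live in ~/.globus (the conventional
    location) and the VOMS client tools are installed.
    """
    voms_arg = vo if role is None else f"{vo}:/{vo}/Role={role}"
    subprocess.run(["voms-proxy-init", "--voms", voms_arg,
                    "--valid", f"{hours}:00"], check=True)
    # Show the proxy's identity, VO attributes (FQANs) and remaining lifetime.
    subprocess.run(["voms-proxy-info", "--all"], check=True)

if __name__ == "__main__":
    # Placeholder VO and role, for illustration only.
    make_voms_proxy("myvo", role="production")
```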

17  User Management
- User obtains a DN from a CA that is vetted by TAGPMA
- User registers with a VO and is added to the VO's VOMS
  - VO is responsible for registering its VOMS with the OSG GOC
  - VO is responsible for users signing the AUP
  - VO is responsible for VOMS operations
  - Some VOs share a VOMS for operations on multiple grids globally
  - A default OSG VO exists for new communities & single PIs
- Sites decide which VOs to support (striving for default admit)
  - Site populates GUMS daily from the VOMSes of all VOs
  - Site chooses a uid policy for each VO & role: dynamic vs. static vs. group accounts
- User uses whatever services the VO provides in support of users
  - VOs generally hide the grid behind a portal
- Any and all support is the responsibility of the VO
  - Helping its users
  - Responding to complaints from grid sites about its users

18  Moving & storing data
- OSG responsibilities
  - Define storage types & their APIs from WAN & LAN
  - Define the information schema for "finding" storage
  - All storage is local to a site - no global filesystem!
- VO responsibilities
  - Manage data transfer & catalogues
- Site responsibilities
  - Choose which storage type(s) to support & how much
  - Implement the storage type according to OSG rules
  - Truth in advertisement
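As a hedged illustration of the VO-side transfer responsibility, here is a small wrapper around the SRM and GridFTP command-line clients in use at the time (srmcp from the dCache/SRM tools, globus-url-copy from the Globus toolkit). The endpoint URLs are hypothetical placeholders, and exact URL forms varied between client versions.

```python
import subprocess

def stage_to_site(local_path, srm_url):
    """Copy a local file (absolute path) to a site's SRM-managed storage over the WAN."""
    subprocess.run(["srmcp", f"file://{local_path}", srm_url], check=True)

def fetch_with_gridftp(gsiftp_url, local_path):
    """Pull a file from a site's GridFTP server (small-volume transfers)."""
    subprocess.run(["globus-url-copy", gsiftp_url, f"file://{local_path}"], check=True)

if __name__ == "__main__":
    # Hypothetical endpoints, for illustration only.
    stage_to_site("/data/input/events_001.root",
                  "srm://se.example.edu:8443/data/myvo/events_001.root")
    fetch_with_gridftp("gsiftp://gridftp.example.edu/osg/data/myvo/results_001.root",
                       "/data/output/results_001.root")
```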

19  Disk areas in some detail
- Shared filesystem as the applications area at a site
  - Read-only from the compute cluster
  - Role-based installation via GRAM
- Batch-slot-specific local work space
  - No persistency beyond the batch slot lease
  - Not shared across batch slots
  - Read & write access (of course)
- SRM/GridFTP-controlled data area
  - "Persistent" data store beyond job boundaries
  - Job-related stage in/out
  - SRM v1.1 today
  - SRM v2 expected in late 2006 (space reservation)
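A minimal sketch of how a job might use the three areas described above. The environment variable names (OSG_APP for the shared application area, OSG_WN_TMP for the batch-slot scratch space), the application path, and the GridFTP endpoint are assumptions for illustration, not guaranteed by the slide.

```python
import os
import shutil
import subprocess

def run_job():
    # Shared, read-only application area (installed earlier via GRAM by the VO admin).
    app_dir = os.environ.get("OSG_APP", "/osg/app")      # assumed variable name
    # Batch-slot-local scratch: read/write, gone when the slot lease ends.
    work_dir = os.environ.get("OSG_WN_TMP", "/tmp")       # assumed variable name

    scratch = os.path.join(work_dir, "myjob")
    os.makedirs(scratch, exist_ok=True)

    # Run the VO's application out of the shared area, writing only to scratch.
    subprocess.run([os.path.join(app_dir, "myvo", "bin", "analyze"),
                    "--output", os.path.join(scratch, "out.root")], check=True)

    # Stage the output into the SRM/GridFTP data area so it outlives the slot lease.
    subprocess.run(["globus-url-copy",
                    "file://" + os.path.join(scratch, "out.root"),
                    "gsiftp://gridftp.example.edu/osg/data/myvo/out.root"], check=True)

    shutil.rmtree(scratch)  # leave the batch slot clean

if __name__ == "__main__":
    run_job()
```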

20  Securing your data
- Archival storage in an archive you trust
  - You control where your data is archived
- Data moved by a party you trust
  - You control who moves your data
  - You control encryption of your data
- You compute at sites you trust
  - E.g. sites that guarantee a specific unix uid for you
  - E.g. sites whose security model satisfies your needs
You decide how secure your data needs to be!

21  Submitting jobs/workloads
- OSG responsibilities
  - Define the interface to the batch system (today: pre-WS GRAM)
  - Define the information schema
  - Provide middleware that implements the above
- VO responsibilities
  - Manage submissions & workflows
  - Use a VO-controlled workload management system, or a WMS from other grids, e.g. EGEE/LCG
- Site responsibilities
  - Choose the batch system
  - Configure the interface according to OSG rules
  - Truth in advertisement
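For concreteness, a hedged sketch of a single Condor-G submission to an OSG compute element through pre-WS GRAM. The gatekeeper hostname and jobmanager are placeholders, and the exact grid-universe syntax varied between Condor releases of that period.

```python
import subprocess
import textwrap

# A Condor-G submit description targeting a (hypothetical) OSG gatekeeper via
# the pre-WS GRAM interface ("gt2"). Written from Python only so the whole
# example stays in one language; normally this would be a plain text file.
SUBMIT_DESCRIPTION = textwrap.dedent("""\
    universe      = grid
    grid_resource = gt2 gatekeeper.example.edu/jobmanager-condor
    executable    = analyze.sh
    arguments     = run001
    output        = analyze.out
    error         = analyze.err
    log           = analyze.log
    queue
""")

with open("analyze.sub", "w") as f:
    f.write(SUBMIT_DESCRIPTION)

# condor_submit hands the job to the local schedd, which forwards it to the
# remote gatekeeper via GRAM.
subprocess.run(["condor_submit", "analyze.sub"], check=True)
```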

22  Simple Workflow
- Install application software at site(s)
  - VO admin installs via GRAM
  - VO users have read-only access from batch slots
- "Download" data to site(s)
  - VO admin moves data via SRM/GridFTP
  - VO users have read-only access from batch slots
- Submit job(s) to site(s)
  - VO users submit job(s)/DAGs via Condor-G
  - Jobs run in batch slots, writing output to local disk
  - Jobs copy output from local disk to the SRM/GridFTP data area
- Collect output from site(s)
  - VO users collect output from site(s) via SRM/GridFTP as part of the DAG
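The workflow steps above map naturally onto a small Condor DAGMan workflow. The sketch below is an illustrative assumption (the file names and the helper submit files it references are hypothetical), not the VO tooling the slide describes.

```python
import subprocess
import textwrap

# Each JOB line points at a Condor(-G) submit file: stage_in moves data via
# SRM/GridFTP, analyze runs in a remote batch slot, stage_out collects output.
DAG = textwrap.dedent("""\
    JOB stage_in   stage_in.sub
    JOB analyze    analyze.sub
    JOB stage_out  stage_out.sub
    PARENT stage_in  CHILD analyze
    PARENT analyze   CHILD stage_out
""")

with open("workflow.dag", "w") as f:
    f.write(DAG)

# condor_submit_dag hands the DAG to DAGMan, which enforces the ordering
# and retries failed nodes.
subprocess.run(["condor_submit_dag", "workflow.dag"], check=True)
```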

23  Some technical details
- Job submission
  - Condor:
    - Condor-G
    - "Schedd on the side" (simple multi-site brokering using the Condor schedd)
    - Condor glide-in
  - EGEE workload management system
    - OSG CE is compatible with the gLite Classic CE
    - Submissions via either the LCG 2.7 RB or the gLite RB, including bulk submission
  - Virtual Data System (VDS) in use on OSG
- Data placement using SRM
  - SRM/dCache in use to virtualize many disks into one storage system
  - Schedule WAN transfers across many GridFTP servers
  - Typical WAN IO capability today: ~10 TB/day, ~2 Gbps
  - Schedule random access from batch slots to many disks via the LAN
  - Typical LAN IO capability today: ~0.5-5 GByte/sec
  - Space reservation
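To make the throughput numbers on these slides concrete, a small conversion helper (pure arithmetic, assuming decimal terabytes and 86,400 seconds per day); a steady 10 TB/day works out to roughly 1 Gbps sustained.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def tb_per_day_to_gbps(tb_per_day):
    """Convert a daily transfer volume (decimal TB) to an average rate in Gbps."""
    bits_per_day = tb_per_day * 1e12 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9

if __name__ == "__main__":
    # Volumes quoted on the slides:
    print(f"10 TB/day ~ {tb_per_day_to_gbps(10):.2f} Gbps sustained")  # ~0.93 Gbps
    print(f"30 TB/day ~ {tb_per_day_to_gbps(30):.2f} Gbps sustained")  # ~2.8 Gbps
    print(f"40 TB/day ~ {tb_per_day_to_gbps(40):.2f} Gbps sustained")  # ~3.7 Gbps
```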

24  Middleware lifecycle
- Domain science requirements
- Joint projects between the OSG applications group & middleware developers to develop & test on community grids
- Integrate into the VDT and deploy on the OSG-ITB
- Inclusion into an OSG release & deployment on (part of) the production grid
- EGEE et al.

25  Status of Utilization
- OSG job = a job submitted via an OSG CE
- "Accounting" of OSG jobs is not (yet) required!

26  OSG use by numbers
- 32 Virtual Organizations
- 3 with >1,000 jobs max. (all particle physics)
- 3 with 500-1,000 jobs max. (all outside physics)
- 5 with 100-500 jobs max. (particle, nuclear, and astro physics)

27  OSG usage by community, 5/05-5/06 (chart)
Communities shown: Experimental Particle Physics, Bio/Eng/Med/Math, Campus Grids, Non-HEP physics. Annotated job levels: ~100, ~850, ~2,250 jobs. Callouts: GADU using VDS; a PI from a Campus Grid.

28  Example: GADU run in April
A bioinformatics application using VDS across 8 sites on OSG.

29  Number of running (and monitored) "OSG jobs" in June 2006 (chart; vertical scale ~3,000).

30  CMS transfers on OSG in June 2006
- All CMS sites exceeded 5 TB per day in June 2006.
- Caltech, Purdue, UCSD, UFL, and UW exceeded 10 TB/day.
- Hoping to reach 30-40 TB/day capability by the end of 2006.
(Chart annotation: 450 MByte/sec.)

31  Grid of Grids
OSG enables single PIs and user communities to operate transparently across grid boundaries globally. E.g.: CMS, a particle physics experiment.

32  CMS Experiment - a global community grid (map)
Data & jobs move locally, regionally & globally within the CMS grid, transparently across grid boundaries from campus to the globe. Sites shown span OSG and EGEE: CERN, USA@FNAL, Caltech, UCSD, Wisconsin, Purdue, MIT, UNL, Florida, France, Germany, Italy, UK, Taiwan.

33  Grid of Grids - Production Interop
- Job submission: 16,000 jobs per day submitted across EGEE & OSG via the "LCG RB"; jobs are brokered transparently onto both grids.
- Data transfer: Peak IO of 5 Gbps from FNAL to 32 EGEE and 7 OSG sites. All 8 CMS sites on OSG have exceeded the 5 TB/day goal; Caltech, FNAL, Purdue, UCSD/SDSC, UFL, and UW exceed 10 TB/day.

34  CMS transfers from FNAL to the world (chart)
The US CMS center at FNAL transfers data to 39 sites worldwide in the CMS global transfer challenge. Peak transfer rates of ~5 Gbps are reached.

35  Summary of OSG Status
- The OSG facility opened July 22nd, 2005.
- The OSG facility is under steady use
  - ~2,000-3,000 jobs at all times
  - Mostly HEP, but large Bio/Eng/Med occasionally
  - Moderate other physics (astro/nuclear)
- OSG project
  - 5-year proposal to DOE & NSF, funded starting FY07
  - Facility & improve/expand/extend/interoperate & E&O
- Off to a running start ... but lots more to do
  - Routinely exceeding 1 Gbps at 6 sites; scale by x4 by 2008, and many more sites
  - Routinely exceeding 1,000 running jobs per client; scale by at least x10 by 2008
  - Have reached a 99% success rate for 10,000 jobs per day submitted; need to reach this routinely, even under heavy load

