Presentation is loading. Please wait.

Presentation is loading. Please wait.

Open Science Grid and Applications Bockjoo Kim U of KISTI on July 5, 2007.

Similar presentations


Presentation on theme: "Open Science Grid and Applications Bockjoo Kim U of KISTI on July 5, 2007."— Presentation transcript:

1 Open Science Grid and Applications Bockjoo Kim U of Florida @ KISTI on July 5, 2007

2 2 07/05/2007 An Overview of OSG

3 3 07/05/2007 What is an OSG? A scientific grid consortium and project Rely on the commitments of the participants Share common goals and vision other projects An evolution of Grid3 Provides benefit to large scale science in the US

4 4 07/05/2007 Driving Principles for OSG Simple and flexible Built from the bottom up Coherent but heterogeneous Performing and persistent Maximize eventual commonality Principles apply end-to-end

5 5 07/05/2007 Virtual Organization in OSG Self Operated Research Vos: 15 Collider Detector at Fermilab (CDF) Compact Muon Solenoid (CMS) CompBioGrid (CompBioGrid) D0 Experiment at Fermilab (DZero) Dark Energy Survey (DES) Functional Magnetic Resonance Imaging (fMRI) Geant4 Software Toolkit (geant4) Genome Analysis and Database Update (GADU) International Linear Collider (ILC) Laser Interferometer Gravitational-Wave Observatory (LIGO) nanoHUB Network for Computational Nanotechnology (NCN) (nanoHUB) Sloan Digital Sky Survey (SDSS) Solenoidal Tracker at RHIC (STAR) Structural Biology Grid (SBGrid) United States ATLAS Collaboration (USATLAS) Campus Grids: 5. Georgetown University Grid (GUGrid) Grid Laboratory of Wisconsin (GLOW) Grid Research and Education Group at Iowa (GROW) University of New York at Buffalo (GRASE) Fermi National Accelerator Center (Fermilab) Regional Grids: 4 NYSGRID Distributed Organization for Scientific and Academic Research (DOSAR) Great Plains Network (GPN) Northwest Indiana Computational Grid (NWICG) OSG Operated VOs: 4 Engagement (Engage) Open Science Grid (OSG) OSG Education Activity (OSGEDU) OSG Monitoring & Operations

6 6 07/05/2007 Timeline 1999200020012002200520032004200620072008 2009 PPDG GriPhyN iVDGL TrilliumGrid3 OSG (DOE) (DOE+NSF) (NSF) Campus, regional grids LHC operations LHC construction, preparation LIGO operation LIGO preparation European Grid + Worldwide LHC Computing Grid OSG Consortium

7 7 07/05/2007 Levels of Participation  Participating in the OSG Consortium  Using the OSG  Sharing Resources on OSG => Either or both with minimal entry threshold  Becoming a Stakeholder  All (large scale) users & providers are stakeholders  Determining the Future of OSG  Council Members determine the Future  Taking on Responsibility for OSG Operations  OSG Project is responsible for OSG Operations

8 8 07/05/2007 OSG Architecture and How to Use OSG

9 9 07/05/2007 OSG : A Grid of Sites/Facilities  IT Departments at Universities & National Labs make their hardware resources available via OSG interfaces. —CE: (modified) pre-ws GRAM —SE: SRM for large volume, gftp & (N)FS for small volume  Today’s scale: —20-50 “active” sites (depending on definition of “active”) —~ 5000 batch slots —~ 1000TB storage —~ 10 “active” sites with shared 10Gbps or better connectivity  Expected Scale for End of 2008 —~50 “active” sites —~30-50,000 batch slots —Few PB of storage —~ 25-50% of sites with shared 10Gbps or better connectivity

10 10 07/05/2007 OSG Components: Compute Element From ~20 CPU Department Computers to 10,000 CPU Super Computers Jobs run under any local batch system OSG gateway machine + services the network & other OSG resources Globus GRAM interface (Pre-WS) which supports many different local batch system Priorities and policies : Through VO role mapping, Batch queue priority setting according to Site policites and priorities. OSG Base (OSG 0.6.0) OSG Environment/Publication OSG Monitoring/Accounting EGEE Interop

11 11 07/05/2007 Disk Areas in an OSG site  Shared filesystem as applications area at site. —Read only from compute cluster. —Role based installation via GRAM.  Batch slot specific local work space. —No persistency beyond batch slot lease. —Not shared across batch slots. —Read & write access (of course).  SRM/gftp controlled data area. —“persistent” data store beyond job boundaries. —Job related stage in/out. —SRM v1.1 today. —SRM v2.2 expected in Q2 2007 (space reservation).

12 12 07/05/2007 OSG Components: Storage Element From 20 GBytes Disk Cache To 4 Petabyte Robotic Tape Systems AnyShared Storage, e.g., dCache OSG SE gateway the network & other OSG resources Storage Services - access storage through storage resource manager (SRM) interface and GridFtp (Typically) VO oriented: Allocation of shared storage through agreements between site and VO(s) facilitated by OSG gsiftp://mygridftp.nowhere. edu srm://myse.nowhere.edu ( srm protocol ~ https protocol)

13 13 07/05/2007 Authentication and Authorization  OSG Responsibilities —X509 based middleware —Accounts may be dynamic/static, shared/FQAN-specific  VO Responsibilities —Instantiate VOMS —Register users & define/manage their roles  Site Responsibilities —Choose security model (what accounts are supported) —Choose VOs to allow —Default accept of all users in VO but individuals or groups within VO can be denied.

14 14 07/05/2007 User Management  User obtains CERT from CA that is vetted by TAGPMA  User registers with VO and is added to VOMS of VO. —VO responsible for registration of VOMS with OSG GOC. —VO responsible for users to sign AUP. —VO responsible for VOMS operations. —VOMS shared for ops on multiple grids globally by some VOs. —Default OSG VO exists for new communities & single PIs.  Sites decide which VOs to support (striving for default admit) —Site populates GUMS daily from VOMSes of all VOs —Site chooses uid policy for each VO & role —Dynamic vs static vs group accounts  User uses whatever services the VO provides in support of users —VOs generally hide grid behind portal  Any and all support is responsibility of VO —Helping its users —Responding to complains from grid sites about its users.

15 15 07/05/2007 Resource Management Many resources are owned or statically allocated to one user community –The institutions which own resources typically have ongoing relationships with (a few) particular user communities (VOs) The remainder of an organization’s available resources can be “used by everyone or anyone else” –Organization can decide against supporting particular VOs. –OSG staffs are responsible for monitoring and, if needed, managing this usage Our challenge is to maximize good - successful - output from the whole system

16 16 07/05/2007 Applications and Runtime Model Condor-G client Pre-WS or WS Gram as site gateway Priority through VO role and policy, mitigate by site policy User specific portion that comes with the job VO specific portion is preinstalled and published CPU access policies vary from site to site  Ideal runtime ~ O(hours)  Small enough to not loose too much due to preemption policies.  Large enough to be efficient despite long scheduling times of grid middleware.

17 17 07/05/2007 Simple Workflow  Install Application Software at site(s) —VO admin install via GRAM. —VO users have read only access from batch slots.  “Download” data to site(s) —VO admin move data via SRM/gftp. —VO users have read only access from batch slots.  Submit job(s) to site(s) —VO users submit job(s)/DAG via condor-g. —Jobs run in batch slots, writing output to local disk. —Jobs copy output from local disk to SRM/gftp data area.  Collect output from site(s) —VO users collect output from site(s) via SRM/gftp as part of DAG.

18 18 07/05/2007 Late Binding(A Strategy)  Grid is a hostile environment:  Scheduling policies are unpredictable  Many sites preempt, and only idle resources are free  Inherent diversity of Linux variants  Not everybody is truthful in their advertisement  Submit “pilot” jobs instead of user jobs  Bind user to pilot only after batch slot at a site is successfully leased, and “sanity checked”.  Re-bind user jobs to new pilot upon failure.

19 19 07/05/2007 OSG Activies

20 20 07/05/2007 OSG Activity Breakdown Software (UW)– provide a software stack that meets the needs of OSG sites, OSG VOs and OSG operation while supporting interoperability with other national and inter-national cyber infrastructures Integration (UC) – Verify, test and evaluate the OSG software Operation (IU) – Coordinate the OSG sites, monitor the facility, and maintain and operate centralized services Security (FNAL) – Define and evaluate procedures and software stack to prevent un-authorized activities and minimize interruption in service due to security concerns Troubleshooting (UIOWA) – help sites and VOs to identify and resolve unexpected behavior of the OSG software stack Engagement – (RENCI) Identify VOs and sites that can benefit from joining the OSG and “hold their hand” while becoming a productive member of the OSG community Resource Management (UF) - Manages resource Facility Management (UW) – Overall facility coordination.

21 21 07/05/2007 OSG Facility Management Activies Led by Miron Livny(Wisconsin, Condor) Help sites join the facility and enable effective guaranteed and opportunistic usage of their resources by remote users Help VOs join the facility and enable effective guaranteed and opportunistic harnessing of remote resources Identify (through active engagement) new sites and VOs

22 22 07/05/2007 OSG Software Activies Package the Virtual Data Toolkit(Led by Wisconsin Condor Team) –Requires local building and testing of all components –Tools for incremental installation –Tools for verification of configuration –Tools for functional testing Integration of the OSG stack –Verification Testbed (VTB) –Integration Testbed (ITB) Deployment of the OSG stack –Build and deploy Pacman caches

23 23 07/05/2007 OSG Software Release Process Input from stakeholders and OSG directors VDT Release OSG Integration Testbed Release OSG Production Release Test on OSG Validation Testbed

24 24 07/05/2007 How Many Softwares? 15 Linux-like platforms supported ~45 components on 8 platforms built

25 25 07/05/2007 OSG Security Activies Infrastructure X509 certificate based Operational security a priority Exercise incident response Prepare signed agreements, template policies Audit, assess and train Infrastructure X509 certificate based Operational security a priority Exercise incident response Prepare signed agreements, template policies Audit, assess and train

26 26 07/05/2007 Operations & Troubleshooting Activities Well established Grid Operations Center at Indiana University Users support distributed, including osg-general@opensciencegrid.org community support. Site coordinator supports team of sites. –Accounting and Site Validation required services of sites. Troubleshooting(U Iowa) looks at targetted end to end problems –Partnering with LBNL Troubleshooting work for auditing and forensics. Well established Grid Operations Center at Indiana University Users support distributed, including osg-general@opensciencegrid.org community support. Site coordinator supports team of sites. –Accounting and Site Validation required services of sites. Troubleshooting(U Iowa) looks at targetted end to end problems –Partnering with LBNL Troubleshooting work for auditing and forensics.

27 27 07/05/2007 OSG and Related Grids

28 28 07/05/2007 Campus Grids Sharing across compute clusters is a change and a challenge for many Universities. OSG, TeraGrid, Internet2, Educause working together Sharing across compute clusters is a change and a challenge for many Universities. OSG, TeraGrid, Internet2, Educause working together

29 29 07/05/2007 OSG and TeraGrid Complementary and interoperating infrastructures TeraGridOSG Networks supercomputer centers.Includes small to large clusters and organizations. Based on Condor & Globus s/w stack built at Wisconsin Build and Test. Based on Same versions of Condor & Globus in the Virtual Data Toolkit. Development of User Portals/Science Gateways. Supports jobs/data from TeraGrid science gateways. Currently relies mainly on remote login. No login access. Many sites expect VO attributes in the proxy certificate Training covers OSG and TeraGrid usage.

30 30 07/05/2007 International Activities Interoperate with Europe for large physics users. –Deliver the US based infrastructure for the World Wide Large Hadron Collider (LHC) Grid Collaboration (WLCG) in support of the LHC experiments. Include off-shore sites when approached. Help bring common interfaces and best practices to the standards forums. Interoperate with Europe for large physics users. –Deliver the US based infrastructure for the World Wide Large Hadron Collider (LHC) Grid Collaboration (WLCG) in support of the LHC experiments. Include off-shore sites when approached. Help bring common interfaces and best practices to the standards forums.

31 31 07/05/2007 Applications and Status of Utilization

32 32 07/05/2007 Particile Physics and Computing  Science Driver Event rate = Luminosity x Crossection  LHC Revolution starting in 2008 —Luminosity x 10 —Crossection x 150 (e.g. top-quark)  Computing Challenge —20PB in first year of running —~ 100MSpecInt2000 ~ close to 100,000 cores

33 33 07/05/2007 CMS Experiment Germany CMS Experiment (P-P Collision Particle Physics Experiment) Taiwan UK Italy Data & jobs moving locally, regionally & globally within CMS grid. Transparently across grid boundaries from campus to globus. Florida USA@FNAL CERN Caltech Wisconsin UCSD France Purdue MIT UNL OSG EGEE

34 34 07/05/2007 CMS Data Analysis

35 35 07/05/2007 Opportunistic Resource Use In Nov ‘06 D0 asked to use 1500-2000 CPUs for 2-4 months for re-processing of an existing dataset (~500 million events) for science results for the summer conferences in July ‘07. The Executive Board estimated there were currently sufficient opportunistically available resources on OSG to meet the request; We also looked into the local storage and I/O needs. The Council members agreed to contribute resources to meet this request. In Nov ‘06 D0 asked to use 1500-2000 CPUs for 2-4 months for re-processing of an existing dataset (~500 million events) for science results for the summer conferences in July ‘07. The Executive Board estimated there were currently sufficient opportunistically available resources on OSG to meet the request; We also looked into the local storage and I/O needs. The Council members agreed to contribute resources to meet this request.

36 36 07/05/2007 D0 Throughput D0 Event Throughput D0 OSG CPUHours / Week

37 37 07/05/2007 Lessons Learned from D0 Case Consortium members contributed significant opportunistic resources as promised. –VOs can use a significant number of sites they “don’t own” to achieve a large effective throughput. Combined teams make large production runs effective. How does this scale? –how we going to support multiple requests that oversubcribe the resources? We anticipate this may happen soon. Consortium members contributed significant opportunistic resources as promised. –VOs can use a significant number of sites they “don’t own” to achieve a large effective throughput. Combined teams make large production runs effective. How does this scale? –how we going to support multiple requests that oversubcribe the resources? We anticipate this may happen soon.

38 38 07/05/2007 Use Case by Other Disciplines Rosetta@Kuhlman lab(protein research): in production across ~15 sites since April Weather Research Forecast: MPI job running on 1 OSG site; more to come CHARMM molecular dynamic simulation to the problem of water penetration in staphylococcal nuclease Genome Analysis and Database Update system (GADU): portal across OSG & TeraGrid. Runs Blast. NanoHUB at Purdue: Biomoca and Nanowire production. Rosetta@Kuhlman lab(protein research): in production across ~15 sites since April Weather Research Forecast: MPI job running on 1 OSG site; more to come CHARMM molecular dynamic simulation to the problem of water penetration in staphylococcal nuclease Genome Analysis and Database Update system (GADU): portal across OSG & TeraGrid. Runs Blast. NanoHUB at Purdue: Biomoca and Nanowire production.

39 39 07/05/2007 OSG Usage By Numbers 39 Virtual Communities 6 VOs with >1000 jobs max. (5 particle physics & 1 campus grid) 4 VOs with 500-1000 max. (two outside physics) 10 VOs with 100-500 max (campus grids and physics)

40 40 07/05/2007 Running Jobs During Last Year

41 41 07/05/2007 Jobs Running at Sites >1k max 5 sites >0.5k max 10 sites >100 max 29 sites Total: 47 sites Many small sites, or with mostly local activity.

42 42 07/05/2007 CMS Xfer on OSG in June ‘06 All CMS sites have exceeded 5TB per day in June 2006. Caltech, Purdue, UCSD, UFL, UW exceeded 10TB/day. 450MByte/sec

43 43 07/05/2007

44 44 07/05/2007 Summary OSG Facility utilization is steadily being increased –~2-4500 jobs all the time –HEP, Astro, Nuclear Phys. but also Bio/Eng/Med Constant effort/troubleshooting is being poured to make OSG usable, robust and performant. Show use to other sciences. Trying to bring campus into a pervasive distributed infrastructure. Bring research into a ubiquitous appreciation of the value of (distributed, opportunistic) computation Educate people to utilize the resources OSG Facility utilization is steadily being increased –~2-4500 jobs all the time –HEP, Astro, Nuclear Phys. but also Bio/Eng/Med Constant effort/troubleshooting is being poured to make OSG usable, robust and performant. Show use to other sciences. Trying to bring campus into a pervasive distributed infrastructure. Bring research into a ubiquitous appreciation of the value of (distributed, opportunistic) computation Educate people to utilize the resources

45 45 07/05/2007 Out of Bound Slides

46 46 07/05/2007 Principle: Simple and Flexible The OSG architecture will follow the principles of symmetry and recursion wherever possible This principle guides us in our approaches to - Support for hierachies of and property inheritance of VOs. - Federations and interoperability of grids (grids of grids). - Treatment of policies and resources.

47 47 07/05/2007 Principle: Coherent but heterogeneous The OSG architecture is VO based. Most services are instantiated within the context of a VO. This principle guides us in our approaches to - Scope of namespaces & action of services. - Definition of and services in support of an OSG-wide VO. - No concept of “global” scope. - Support for new and dynamic VOs should be light-weight.

48 48 07/05/2007 OSG Security Activies (continues…) User VO Site Jobs VO infra. Data Storage CECE W WW WWW WW W WW W W W W I trust it is the VO (or agent) I trust it is the user I trust it is the user’s job I trust the job is for the VO

49 49 07/05/2007 Principle: Bottom up/Persistency All services should function and operate in the local environment when disconnected from the OSG environment. This principle guides us in our approaches to - The Architecture of Services. E.G. Services are required to manage their own state, ensure their internal state is consistent and report their state accurately. - Development and execution of applications in a local context, without an active connection to the distributed services.

50 50 07/05/2007 Principle: Commonality OSG will provide baseline services and a reference implementation. The infrastructure will support support incremental upgrades The OSG infrastructure should have minimal impact on a Site. Services that must run with superuser privileges will be minimized Users are not required to interact directly with resource providers.

51 51 07/05/2007 Scale needed in 2008/2009 20-30 Petabyte tertiary automated tape storage at 12 centers world-wide physics and other scientific collaborations. High availability (365x24x7) and high data access rates (1GByte/sec) locally and remotely. Evolving and scaling smoothly to meet evolving requirements. E.g. for a single experiment 20-30 Petabyte tertiary automated tape storage at 12 centers world-wide physics and other scientific collaborations. High availability (365x24x7) and high data access rates (1GByte/sec) locally and remotely. Evolving and scaling smoothly to meet evolving requirements. E.g. for a single experiment

52 52 07/05/2007 OSG Software Concerns How quickly (and at what FTE cost) can we patch the OSG stack and redeploy it? –Critical for security patches –Very important for stability and QoS How dependable is our software? –Focus on testing (at all phases), troubleshooting of deployed software and careful adoption of new software Functionality of our software –Close consultation with stakeholders Impact on other cyber-infrastructures –Critical for interoperability

53 53 07/05/2007 OSG Software Providers OSG doesn’t write software, but gets it from providers –Condor Project –Globus Alliance Globus, MyProxy, GSISSH –EGEE VOMS, CEMon, Fetch-CRL… –OSG Extensions Gratia –Various open source projects Apache, MySQL, and many more


Download ppt "Open Science Grid and Applications Bockjoo Kim U of KISTI on July 5, 2007."

Similar presentations


Ads by Google