
1 Yet Another Grid Project: The Open Science Grid at SLAC. Matteo Melani, Booker Bense and Wei Yang, SLAC. HEPiX Conference, 10/13/05, SLAC, Menlo Park, CA, USA

2 July 22nd, 2005: “The Open Science Grid Consortium today officially inaugurated the Open Science Grid, a national grid computing infrastructure for large scale science. The OSG is built and operated by teams from U.S. universities and national laboratories, and is open to small and large research groups nationwide from many different scientific disciplines.” (Science Grid This Week)

3 Outline: OSG in a nutshell; OSG at SLAC: the “PROD_SLAC” site; Authentication and Authorization in OSG; LSF-OSG integration; Running applications: US CMS and US ATLAS; Final thought.

4 Outline: OSG in a nutshell; OSG at SLAC: the “PROD_SLAC” site; Authentication and Authorization in OSG; LSF-OSG integration; Running applications: US CMS and US ATLAS; Final thought.

5 Once upon a time there was… 30 sites, ~3,600 CPUs. The goal was to build a shared Grid infrastructure supporting opportunistic use of resources for stakeholders; the stakeholders are the NSF- and DOE-sponsored Grid projects (PPDG, GriPhyN, iVDGL) and the US LHC software program. A team of computer and domain scientists deployed (simple) services in a common infrastructure, with common interfaces, across existing computing facilities. It has operated stably for over a year in support of computationally intensive applications (CMS DC04, ATLAS DC2) and has added communities without perturbation.


8 Vision (1): The Open Science Grid, a production-quality national grid infrastructure for large scale science. Robust and scalable; fully managed; interoperates with other Grids.

9 Vision (2)

10 What is the Open Science Grid? (Ian Foster) Open: a new sort of multidisciplinary cyberinfrastructure community; an experiment in governance, incentives, architecture; part of a larger whole, with TeraGrid, EGEE, LCG, etc. Science: driven by demanding scientific goals and projects that need results today (or yesterday); also a computer science experimental platform. Grid: standardized protocols and interfaces; software implementing infrastructure, services, applications; physical infrastructure (computing, storage, networks); people who know and understand these things!

11 OSG Consortium. Members of the OSG Consortium are those organizations that have made agreements to contribute to the Consortium. DOE labs: SLAC, BNL, FNAL. Universities: CCR, University at Buffalo. Grid projects: iVDGL, PPDG, Grid3, GriPhyN. Experiments: LIGO, US CMS, US ATLAS, CDF Computing, D0 Computing, STAR, SDSS. Middleware projects: Condor, Globus, SRM Collaboration, VDT. Partners are those organizations with whom we are interfacing to work on interoperation of grid infrastructures and services: LCG, EGEE, TeraGrid.

12 Character of Open Science Grid (1). Pragmatic approach: experiments and users drive requirements; “keep it simple and make it more reliable”. Guaranteed and opportunistic use of resources is provided through Facility-VO contracts. Validated, supported core services are based on VDT and NMI middleware (currently GT3 based, moving soon to GT4). Adiabatic evolution to increase scale and complexity. Services and applications are contributed from external projects, with a low threshold to contributions and new services.

13 Character of Open Science Grid (2). Heterogeneous infrastructure: all Linux, but different versions of the software stack at different sites. Site autonomy: distributed ownership of resources with diverse local policies, priorities, and capabilities; “no” Grid software on compute nodes. But users want direct access for diagnosis and monitoring; as one CDF physicist put it: “Experiments need to keep under control the progress of their application to take proper actions, helping the Grid to work by having it expose much of its status to the users.”

14 Architecture

15 Services. Computing service: GRAM from GT3.2.1 + patches. Storage service: the SRM interface (v1.1) as the common interface to storage, DRM and dCache; most sites use NFS + GridFTP, and we are looking into an SRM-xrootd solution. File transfer service: GridFTP. VO management service: INFN VOMS. AA: GUMS v1.0.1, PRIMA v0.3, gPlazma. Monitoring service: MonALISA v1.2.34, MDS. Information service: jClarens v0.5.3-2, GridCat. Accounting service: partially provided by MonALISA.
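
The file transfer service above is plain GridFTP. As a rough illustration of moving a file with it (not taken from the talk): the gsiftp source URL and local destination below are placeholders, and a valid grid proxy is assumed to exist already.

```python
# Sketch: copying a file from a site's GridFTP server with globus-url-copy.
# The source URL and destination path are placeholders, and a grid proxy
# (e.g. from grid-proxy-init) is assumed to be in place.
import subprocess

SRC = "gsiftp://se.example.edu/osg/data/sample.root"   # placeholder source
DST = "file:///tmp/sample.root"                        # placeholder destination

subprocess.run(["globus-url-copy", SRC, DST], check=True)
```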

16 Open Science Grid Release 0.2 (architecture diagram, courtesy of Ruth Pordes). Identity and roles: X.509 certificates. Submit host: Condor-G, Globus RSL. User portal. Catalogs and displays: GridCat, ACDC, MonALISA. Monitoring and information: GridCat, ACDC, MonALISA, SiteVerify. At the site boundary (WAN->LAN), CE authentication mapping: GUMS. Compute Element: GT2 GRAM, grid monitor; worker nodes with $WN_TMP; common space across WNs: $DATA (local SE), $APP, $TMP. Storage Element: SRM v1.1, GridFTP. Virtual Organization management: PRIMA, gPlazma, batch queue job priority.

17 OSG 0.4 (architecture diagram, courtesy of Ruth Pordes). Same elements as Release 0.2 (X.509 identity and roles; Condor-G/Globus RSL submit host; user portal; GridCat, ACDC and MonALISA catalogs, displays and monitoring, plus SiteVerify; GUMS authentication mapping at the site boundary; GT2 GRAM Compute Element with $WN_TMP and the common $DATA/$APP/$TMP space; SRM v1.1/GridFTP Storage Element; PRIMA/gPlazma VO management with batch queue job priority), adding: an Edge Service Framework (Xen) with lifetime-managed VO services; bandwidth management at some sites; GT4 GRAM; a full local SE; job monitoring and exit-code reporting; accounting; and service discovery (GIP + BDII).
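
In both release diagrams, jobs reach a site's GT2 GRAM gatekeeper from a submit host running Condor-G with Globus RSL. The following is only a rough sketch of what such a submission could look like: the gatekeeper host name, file names and exact attribute set are assumptions, not taken from the talk, and real submit files varied across Condor-G versions.

```python
# Hypothetical sketch: generate a Condor-G ("globus" universe) submit
# description targeting a GT2 GRAM gatekeeper in front of an LSF cluster,
# then submit it.  The host "osg-gate.example.edu" and file names are
# placeholders; a valid grid proxy is assumed.
import subprocess
from textwrap import dedent

submit = dedent("""\
    universe            = globus
    globusscheduler     = osg-gate.example.edu/jobmanager-lsf
    executable          = my_analysis.sh
    transfer_executable = true
    output              = job.out
    error               = job.err
    log                 = job.log
    queue
""")

with open("osg_job.sub", "w") as f:
    f.write(submit)

# Hand the description to Condor-G on the submit host.
subprocess.run(["condor_submit", "osg_job.sub"], check=True)
```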

18 Software distribution. Software is contributed by individual OSG members into collections we call “packages”. OSG provides collections of software for common services, built on top of the VDT, to facilitate participation. There is very little OSG-specific software, and we strive to use standards-based interfaces where possible. OSG software packages are currently distributed as Pacman caches. Latest release on May 24th; VDT 1.3.6.
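
As a sketch of what installing from a Pacman cache amounts to: Pacman pulls a named package out of a cache given a cache:package target. The cache label "OSG" and package name "ce" below are illustrative placeholders, not verified names from the 2005 releases.

```python
# Minimal sketch (assumed cache and package names): fetch an OSG
# compute-element collection from a Pacman cache.  "OSG" and "ce" are
# placeholders for whatever the release actually published.
import subprocess

def pacman_get(cache: str, package: str) -> None:
    """Invoke the Pacman package tool with a <cache>:<package> target."""
    subprocess.run(["pacman", "-get", f"{cache}:{package}"], check=True)

if __name__ == "__main__":
    pacman_get("OSG", "ce")
```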

19 OSG’s deployed Grids. The OSG Consortium operates two grids. OSG is the production grid: stable, for sustained production. 14 VOs. 38 sites, ~5,000 CPUs, 10 VOs. Support provided. http://osg-cat.grid.iu.edu/. OSG-ITB is the test and development grid: for testing new services, technologies, versions… 29 sites, ~2,400 CPUs. http://osg-itb.ivdgl.org/gridcat/

20 Operations and support. VOs are responsible for first-level support. Distributed operations and support model from the outset: difficult to explain, but scalable, and it puts most support “local”. The key core component is a central ticketing system with automated routing and import/export capabilities to other ticketing systems and text-based information. Grid Operations Center (iGOC). Incident response framework, coordinated with EGEE.

21 Outline: OSG in a nutshell; OSG at SLAC: the “PROD_SLAC” site; Authentication and Authorization in OSG; LSF-OSG integration; Running applications: US CMS and US ATLAS; Final thought.

22 PROD_SLAC. 100 job slots available in true resource sharing. 0.5 TB of disk space. Contact: osg-support@slac.stanford.edu. LSF 5.1 batch system. VO role-based authentication and authorization. VOs: BaBar, US ATLAS, US CMS, LIGO, iVDGL.

23 PROD_SLAC. 4 Sun V20z dual-processor machines. Storage is provided over NFS: three directories, $APP, $DATA and $TMP. We do not run Ganglia or GRIS.
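
Jobs are expected to find the shared $APP/$DATA/$TMP areas plus the per-worker-node $WN_TMP scratch space shown in the earlier release diagrams. A minimal sketch of how a job might locate them, assuming they are advertised as environment variables; the exact variable names (e.g. OSG_APP vs. plain APP) are an assumption here, not something stated in the talk.

```python
# Minimal sketch (assumed variable names): locate the shared application,
# data and temp areas plus the per-worker-node scratch space before running.
import os
import sys

# A real site might export OSG_APP/OSG_DATA/OSG_TMP/OSG_WN_TMP or the bare
# names; adjust AREAS to whatever the site actually advertises.
AREAS = ["APP", "DATA", "TMP", "WN_TMP"]

def grid_dirs() -> dict:
    """Return the storage areas the site advertises, skipping unset ones."""
    found = {}
    for name in AREAS:
        path = os.environ.get("OSG_" + name) or os.environ.get(name)
        if path and os.path.isdir(path):
            found[name] = path
    return found

if __name__ == "__main__":
    dirs = grid_dirs()
    for name, path in sorted(dirs.items()):
        print(f"{name:7s} -> {path}")
    if "WN_TMP" not in dirs:
        sys.exit("No per-node scratch area found; refusing to run.")
```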

24 Outline: OSG in a nutshell; OSG at SLAC: the “PROD_SLAC” site; Authentication and Authorization in OSG; LSF-OSG integration; Running applications: US CMS and US ATLAS; Conclusions.

25 AA using GUMS

26 UNIX account issue. The problem: SLAC Unix accounts did not fit the OSG model; normal SLAC accounts have too many default privileges, and the gatekeeper-AFS interaction is problematic. The solution: we created a new class of Unix accounts just for the Grids, with a new creation process for this account type. The new account type has minimal privileges: no email, no login access, home directories on Grid-dedicated NFS, and no write access beyond the Grid NFS server.

27 DN-UID mapping. Each (DN, voGroup) pair is mapped to a unique UNIX account; there is no group mapping. Account name schema: osg + VOname + VOgroup + NNNNN. For example, a DN in the USCMS VO (voGroup /uscms/) maps to osguscms00001, and the iVDGL VO, group mis (voGroup /ivdgl/mis) maps to osgivdglmis00001. If revoked, the account name/UID will never be reused (unlike ordinary UNIX accounts). Grid UNIX accounts are tracked like ordinary UNIX user accounts (in RES), with 1,000,000 < UID < 10,000,000. All Grid UNIX accounts belong to one single UNIX group; home directories are on Grid-dedicated NFS and shells are /bin/false.
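
The naming schema is simple enough to capture in a few lines. The following is only a sketch, not SLAC's actual tooling: it builds an account name from a VO group path and a sequence number following the osg + VOname + VOgroup + NNNNN pattern and the five-digit examples on the slide; how the sequence counter is allocated and persisted is an assumption.

```python
# Rough sketch of the PROD_SLAC account-name schema described above:
# osg + VOname + VOgroup + NNNNN.  Illustrative only, not SLAC's real code;
# sequence-number allocation and sanitization rules are assumptions.
import re

def grid_account_name(vo_group: str, sequence: int) -> str:
    """Build an account name from a VO group path like "/uscms/" or "/ivdgl/mis".

    The leading path component is the VO name; any remaining components form
    the VO group.  Both are lower-cased and stripped of non-alphanumerics.
    """
    parts = [p for p in vo_group.strip("/").split("/") if p]
    if not parts:
        raise ValueError(f"empty VO group: {vo_group!r}")
    clean = [re.sub(r"[^a-z0-9]", "", p.lower()) for p in parts]
    return "osg" + "".join(clean) + f"{sequence:05d}"

# Matches the examples on the slide:
assert grid_account_name("/uscms/", 1) == "osguscms00001"
assert grid_account_name("/ivdgl/mis", 1) == "osgivdglmis00001"
```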

28 Outline: OSG in a nutshell; OSG at SLAC: the “PROD_SLAC” site; Authentication and Authorization in OSG; OSG-LSF integration; Running applications: US CMS and US ATLAS; Final thought.

29 GRAM issue. The problem: the gatekeeper polls job status too aggressively and overloads the LSF scheduler. Race conditions: the LSF job manager is unable to distinguish between an error condition and a loaded system (we usually have more than 2K jobs running). This may be reduced in the next version of LSF. The solution: rewrite part of the LSF job manager (lsf.pm); we are also looking into writing a custom bjobs with local caching.
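
The custom bjobs idea is to answer the gatekeeper's frequent per-job status queries from a short-lived local cache instead of hitting the LSF scheduler every time. A minimal sketch of that idea follows; the cache lifetime, the choice of a one-shot "bjobs -a -u all" snapshot, and the column parsing are assumptions, not SLAC's actual implementation.

```python
# Sketch of a caching front end for LSF job-status queries: refresh the full
# job list at most once per CACHE_SECONDS and answer individual lookups from
# the cached snapshot.  Output parsing (JOBID in column 1, STAT in column 3)
# and the cache policy are assumptions, not SLAC's actual custom bjobs.
import subprocess
import time

CACHE_SECONDS = 60
_cache = {"stamp": 0.0, "status": {}}

def _refresh() -> None:
    out = subprocess.run(
        ["bjobs", "-a", "-u", "all"],
        capture_output=True, text=True, check=True,
    ).stdout
    status = {}
    for line in out.splitlines()[1:]:          # skip the header line
        fields = line.split()
        if len(fields) >= 3:
            status[fields[0]] = fields[2]      # JOBID -> STAT (RUN, PEND, ...)
    _cache["status"] = status
    _cache["stamp"] = time.time()

def job_status(job_id: str) -> str:
    """Return the cached LSF status for one job, refreshing at most once a minute."""
    if time.time() - _cache["stamp"] > CACHE_SECONDS:
        _refresh()
    return _cache["status"].get(job_id, "UNKNOWN")
```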

30 The straw that broke the camel's back. SLAC has more than 4,000 job slots scheduled by a single machine, and we operate in full production mode: operational disruption has to be avoided at all costs. Too many monitoring tools (ACDC, MonALISA, users' own monitoring tools…) can easily overload the LSF scheduler by running bjobs -u all. The implementation of monitoring is a concern!

31 Outline: OSG in a nutshell; OSG at SLAC: the “PROD_SLAC” site; Authentication and Authorization in OSG; LSF-OSG integration; Running applications: US CMS and US ATLAS; Final thought.

32 US CMS Application. Intentionally left blank! We could run 10-100 jobs right away.

33 US ATLAS Application. ATLAS reconstruction and analysis jobs require access to remote database servers at CERN, BNL, and elsewhere, but SLAC batch nodes don't have internet access. The solution is to host a clone of the database within the SLAC network or to create a tunnel.
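
The tunnel option amounts to forwarding a local port, on a host the batch nodes can reach, to the remote database server through a gateway machine that does have outside connectivity. A rough sketch using SSH local port forwarding follows; the host names, port and gateway are invented placeholders, and whether plain SSH forwarding would satisfy the lab's network policy is not addressed here.

```python
# Rough sketch: forward a local port to a remote database through a gateway
# host with outside connectivity, so jobs can point their DB connection at
# localhost instead of the unreachable server.  All host names and the port
# numbers are placeholders, not real ATLAS or SLAC endpoints.
import subprocess

LOCAL_PORT = 13306                       # where jobs will connect
REMOTE_DB = "dbserver.example.cern.ch"   # placeholder database host
REMOTE_PORT = 3306                       # placeholder database port
GATEWAY = "gateway.example.edu"          # placeholder host with WAN access

def open_tunnel() -> subprocess.Popen:
    """Start 'ssh -N -L local:remote:port gateway' and return the process."""
    return subprocess.Popen([
        "ssh", "-N",
        "-L", f"{LOCAL_PORT}:{REMOTE_DB}:{REMOTE_PORT}",
        GATEWAY,
    ])

if __name__ == "__main__":
    tunnel = open_tunnel()
    try:
        # Jobs would now connect to localhost:13306 as if it were the remote DB.
        tunnel.wait()
    except KeyboardInterrupt:
        tunnel.terminate()
```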

34 Outline: OSG in a nutshell; OSG at SLAC: the “PROD_SLAC” site; Authentication and Authorization in OSG; LSF-OSG integration; Running applications: US CMS and US ATLAS; Final thought.

35 Final thought. “Parva sed apta mihi sed…” (“Small, but suited to me, yet…”) - Ludovico Ariosto

36 QUESTIONS?

37 Spare

38 Governance

39 Ticketing Routing Example (physical-view diagram: steps 1-10 across the OSG infrastructure and the SC private infrastructure). A user in VO1 notices a problem at RP3 and notifies their SC (1). SC-C opens a ticket (2) and assigns it to SC-F. SC-F gets an automatic notice (3) and contacts RP3 (4). The admin at RP3 fixes the problem and replies to SC-F (5). SC-F notes the resolution in the ticket and marks it resolved (6). SC-C gets an automatic notice of the update to the ticket (7). SC-C notifies the user of the resolution (8). The user can complain if dissatisfied, and SC-C can re-open the ticket (9, 10).
