FermiGrid
Keith Chadwick, Fermilab Computing Division, Communications and Computing Fabric Department, Fabric Technology Projects Group
OSG Consortium Meeting, August 22, 2006


1 FermiGrid
Keith Chadwick (chadwick@fnal.gov)
Fermilab Computing Division, Communications and Computing Fabric Department
Fabric Technology Projects Group Leader

2 People
FermiGrid Operations Team:
- Keith Chadwick (CD/CCF/FTP) – Project Leader – Monitoring and Metrics
- Steve Timm (CD/CSS/FCS) – Linux OS Support, Condor, Globus Gatekeeper
- Dan Yocum (CD/CCF/FTP) – Middleware Support – VOMS, VOMRS, GUMS
- Neha Sharma (CD/CCF/FTP) – Middleware Support – Storage, Accounting
FermiGrid Stakeholder Representatives:
- Keith Chadwick – FermiGrid Common Services Representative
- Doug Benjamin – CDF Stakeholder Representative
- Ian Fisk – U.S. CMS Representative
- Amber Boehnlein & Alan Jonckheere – D0 Stakeholder Representatives
- Nickolai Kouropatkine – DES & SDSS Stakeholder Representative
- Steve Timm – Fermilab General Purpose Farms Representative
- Ruth Pordes – OSG Stakeholder Representative
- Eileen Berman & Rob Kennedy – Storage (enstore & dcache) Representatives
FermiGrid Web Site & Additional Documentation: http://fermigrid.fnal.gov/

3 What is FermiGrid?
FermiGrid is:
- A set of common services, including: the Fermilab site globus gateway, VOMS, VOMRS, GUMS, SAZ, MyProxy, Gratia accounting, etc.
- A forum for promoting stakeholder interoperability within Fermilab.
- The portal from the Open Science Grid to Fermilab Compute Services:
  - Production: fermigrid1, fngp-osg, fcdfosg1, fcdfosg2, docabosg2, sdss-tam, etc.
  - Integration: fgtest1, fnpcg, etc.
- The portal from the Open Science Grid to Fermilab Storage Services:
  - Production: FNAL_FERMIGRID_SE (public dcache), stken, etc.
  - Integration: none at present.

4 Site Wide Gateway Today – Animation
Clusters behind the gateway: CMS WC1, CDF OSG1, CDF OSG2, D0 CAB2, SDSS TAM, GP Farm, LQCD. Gateway services: MyProxy server, VOMS server, SAZ server, GUMS server. The clusters send ClassAds via condor_advertise to the site-wide gateway.
Step 1 – user issues voms-proxy-init and receives VOMS-signed credentials.
Step 2 – user stores their VOMS-signed credentials on the MyProxy server.
Step 3 – user submits their grid job via globus-job-run, globus-job-submit, or Condor-G.
Step 4 – gateway requests the GUMS mapping based on VO & role.
Step 5 – gateway checks against the Site AuthoriZation service.
Step 6 – gateway retrieves the previously stored proxy.
Step 7 – the grid job is forwarded to the target cluster.
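Steps 1 through 3 happen entirely on the user side before the gateway is involved. A minimal sketch of that client sequence in Python, assuming the VOMS, MyProxy, and Globus client tools are installed and on the PATH (the MyProxy server name below is illustrative, not the production hostname):

```python
import subprocess

GATEWAY = "fermigrid1.fnal.gov"   # the site-wide gateway (slide 3)
MYPROXY = "myproxy.example.gov"   # illustrative; not the production server name

# Step 1: obtain VOMS-signed credentials for the fermilab VO.
subprocess.run(["voms-proxy-init", "-voms", "fermilab"], check=True)

# Step 2: store the VOMS-signed credentials on the MyProxy server,
# where the gateway can retrieve them later (step 6).
subprocess.run(["myproxy-init", "-s", MYPROXY], check=True)

# Step 3: submit a grid job via globus-job-run (globus-job-submit and
# Condor-G are the other submission paths named on the slide).
subprocess.run(["globus-job-run", GATEWAY, "/bin/hostname"], check=True)
```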

5 Guest vs. Owner VO Access
[Diagram] OSG "guest" VO users and Fermilab "guest" VO users are required to come in through the FermiGrid gateway & central services, and are not allowed to submit directly to a resource head node. "Owner" VO users are allowed to submit either through the gateway or directly to their own resource head node.

6 Site Wide Gateway with CEMon – Animation
Clusters behind the gateway: CMS WC1, CDF OSG1, CDF OSG2, D0 CAB2, SDSS TAM, GP Farm, LQCD. Gateway services: VOMS server, SAZ server, GUMS server. The clusters send ClassAds via CEMon to the site-wide gateway.
Step 1 – user issues voms-proxy-init and receives VOMS-signed credentials.
Step 2 – user submits their grid job via globus-job-run, globus-job-submit, or Condor-G.
Step 3 – gateway requests the GUMS mapping based on VO & role.
Step 4 – gateway checks against the Site AuthoriZation service.
Step 5 – the grid job is forwarded to the target cluster.

7 Virtual Organizations
FermiGrid currently hosts the following Virtual Organizations:
auger     http://www.auger.org/
des       http://decam.fnal.gov/
dzero     http://www-d0.fnal.gov/
fermilab  http://www.fnal.gov/
gadu      http://www-wit.mcs.anl.gov/Alex/GADU/Index.cgi
nanohub   http://www.nanohub.org/
sdss      http://www.sdss.org/
ilc       http://ilc.fnal.gov/
lqcd      http://lqcd.fnal.gov/
i2u2      http://www-ed.fnal.gov/uueo/i2u2.html

8 SAZ – Site AuthoriZation Module
We are in the process of implementing a SAZ (Site AuthoriZation) module for the Fermilab "Open Science Enclave". Our current plan is to operate in a default-accept mode for user credentials that are associated with known and "trusted" VOs and CAs.
The site authorization callout on the globus gateway sends a SAZ authorization request (example):
user:/DC=org/DC=doegrids/OU=People/CN=Keith Chadwick 800325
VO:fermilab
attribute:/fermilab/Role=NULL/Capability=NULL
CA:/DC=org/DC=DOEGrids/OU=Certificate Authorities/CN=DOEGrids CA 1
The SAZ server on fermigrid4 receives the SAZ authorization request and:
1. Verifies the certificate and trust chain.
2. If the certificate does not verify or the trust chain is invalid, SAZ returns "Not-Authorized".
3. Issues a select on "user:" against the SAZDB USER table.
4. If the select on "user:" fails, a record corresponding to the "user:" is inserted into the SAZDB USER table with (Enabled=Y, Trusted=F).
5. Issues a select on "VO:" against the local SAZDB VO table.
6. If the select on "VO:" fails, a record corresponding to the "VO:" is inserted into the SAZDB VO table with (Enabled=Y, Trusted=F).
7. Issues a select on "VO-Role:" against the local SAZDB VO-ROLE table.
8. If the select on "VO-Role:" fails, a record corresponding to the "VO-Role:" is inserted into the SAZDB VO-ROLE table with (Enabled=Y, Trusted=F).
9. Issues a select on "CA:" against the local SAZDB CA table.
10. If the select on "CA:" fails, a record corresponding to the "CA:" is inserted into the SAZDB CA table with (Enabled=Y, Trusted=F).
11. The SAZ server then returns the logical AND of (user.enabled, vo.enabled, vo-role.enabled, ca.enabled) to the SAZ client, which was called by either the globus gatekeeper or gLexec.
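In short: an invalid trust chain is an immediate denial; any entity (user, VO, VO-Role, CA) not yet known to SAZDB is auto-registered as enabled but untrusted; and the answer is the logical AND of the four Enabled flags. A minimal sketch of that decision logic, using a single SQLite table with a "kind" column as an illustrative stand-in for the four SAZDB tables (not the production schema):

```python
import sqlite3

# Illustrative stand-in for SAZDB: one table with a "kind" column instead of
# the four production tables (USER, VO, VO-ROLE, CA).
db = sqlite3.connect("sazdb.sqlite")
db.execute("""CREATE TABLE IF NOT EXISTS entity
              (kind TEXT, name TEXT, enabled TEXT, trusted TEXT,
               PRIMARY KEY (kind, name))""")

def lookup_or_register(kind, name):
    """Select the record; if the select fails, insert (Enabled=Y, Trusted=F)."""
    row = db.execute("SELECT enabled FROM entity WHERE kind=? AND name=?",
                     (kind, name)).fetchone()
    if row is None:
        db.execute("INSERT INTO entity VALUES (?, ?, 'Y', 'F')", (kind, name))
        db.commit()
        return True          # newly registered entities default to Enabled=Y
    return row[0] == "Y"

def authorize(user_dn, vo, vo_role, ca_dn, cert_chain_ok):
    """Return True (Authorized) only if the trust chain verifies and the
    user, VO, VO-Role, and CA Enabled flags are all Y."""
    if not cert_chain_ok:    # steps 1-2: invalid chain means Not-Authorized
        return False
    return (lookup_or_register("user", user_dn) and
            lookup_or_register("VO", vo) and
            lookup_or_register("VO-Role", vo_role) and
            lookup_or_register("CA", ca_dn))
```

On first sight of the example request above, all four records would be auto-registered and the request authorized; disabling any one record in SAZDB then denies subsequent requests.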

9 SAZ – A couple of caveats
What about grid-proxy-init or voms-proxy-init without a VO?
- The NULL VO is specifically disabled (Enabled=F, Trusted=F).
- If a user has "Trusted=Y" in their "user:" record, then they will be allowed to execute jobs without VO "sponsorship". This will not be automatic.
What about glide-in operation?
- To comply with the (draft) Fermilab policy on glide-ins, VOs will shortly be required to use gLexec to launch their glide-in jobs.
- SAZ queries for gLexec will require the VO to have "Trusted=Y" in the VO-Role record that they are using for glide-ins. This will not be automatic.
Authorization for "Trusted=Y" flags in the SAZ database will be granted and revoked by the Fermilab Computer Security Executive.

10 Stakeholder Interoperability
The second component of FermiGrid is bilateral interoperability across the stakeholders' (CDF, D0, CMS, GP Farms, SDSS, etc.) computing resources. Most of this work takes place in the various stakeholder organizations without direct FermiGrid Operations Team involvement. But there are certainly vigorous discussions…

11 OSG Interfaces for Fermilab
The third component of FermiGrid is enabling the opportunistic use of FNAL computing resources through Open Science Grid (OSG) interfaces. Most of the work to accomplish this happened in the context of the installation and configuration of the Fermilab Common Grid Services and their deployment and integration on the GP Farm, CDF OSG1 & OSG2, and D0 CAB2 clusters.
The Fermilab "job-forwarding" gateway has caused some problems for users who assume that a job submitted to the fork jobmanager will have access to the same filesystem as a job submitted to the condor jobmanager. We are in the process of deploying a multi-cluster filesystem based on the BlueArc NFS server appliance to provide a unified filesystem view from all clusters participating in FermiGrid job forwarding.

12 OSG Access to Permanent Storage
The fourth component of FermiGrid is enabling the opportunistic use of our storage resources through Open Science Grid (OSG) interfaces. We have deployed 7 TBytes in our public dcache (FNAL_FERMIGRID_SE). We have also deployed three separate storage system interfaces:
- CDFEN – Enstore Mass Storage Production Service for CDF Run II.
- D0EN – Enstore Mass Storage Production Service for D0 Run II.
- STKEN – Enstore Mass Storage Production Service for all other users.
These facilities are exposed to the Open Science Grid and are available for use by Fermilab experiments. The storage elements are either buffering (used to cache and assemble data sets) or custodial (used for long-term retention of scientific data).
The preferred Grid interface is via the Storage Resource Manager (SRM). In addition, the following access methods are supported:
- Direct GridFTP service (although it is not scalable).
- HTTP service for reads.
- Kerberized FTP service.
- A read-only service with weak passwords.
Fermilab is prototyping an opportunistic role with other experiments as well.
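As a concrete illustration of the direct GridFTP access method, a file can be written to a storage element with globus-url-copy. The door host and path below are placeholders, not the actual FNAL_FERMIGRID_SE endpoint, which is not given on the slide:

```python
import subprocess

# Placeholder endpoint; substitute the real storage element door and path.
DEST = "gsiftp://se-door.example.fnal.gov:2811/pnfs/fermigrid/volatile/myfile"

# Requires a valid grid proxy (e.g., from voms-proxy-init).
subprocess.run(["globus-url-copy", "file:///tmp/myfile", DEST], check=True)
```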

13 Metrics and Service Monitors
In addition to the normal operational effort of running the various FermiGrid services, significant effort has been spent over the last year to collect and publish operational metrics and to instrument service monitors:
- Globus gatekeeper calls by jobmanager per day
- Globus gatekeeper IP connections per day
- VOMS calls per day
- VOMS IP connections per day
- VOMS service monitor
- GUMS calls per day
- GUMS service monitor
- Resource selection and acceptance of the fermilab VO at OSG sites
Metrics typically run once a day. Service monitors typically run multiple times per day and are equipped to detect problems with the service that they are monitoring, notify administrators, and automatically restart services as necessary to ensure continuous operations (a sketch of this loop follows below):
http://fermigrid.fnal.gov/fermigrid-metrics.html
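A minimal sketch of a service monitor's check/notify/restart loop. The check command, restart command, and admin address are hypothetical placeholders; the production monitors use service-specific probes and mail the administrators:

```python
import subprocess
import time

CHECK_CMD   = ["/usr/bin/curl", "-sf", "https://voms.example.fnal.gov:8443/"]
RESTART_CMD = ["/sbin/service", "voms", "restart"]
ADMINS      = "fermigrid-admins@example.fnal.gov"   # placeholder address

def notify(subject):
    # Placeholder: the production monitors mail ADMINS; here we just log.
    print(f"NOTIFY {ADMINS}: {subject}")

def monitor(interval_seconds=300):
    """Check the service periodically; on failure, notify and restart it."""
    while True:
        ok = subprocess.run(CHECK_CMD, capture_output=True).returncode == 0
        if not ok:
            notify("service check failed, attempting restart")
            subprocess.run(RESTART_CMD)
        time.sleep(interval_seconds)

if __name__ == "__main__":
    monitor()
```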

14 Metrics – Globus Gatekeeper – fermigrid1 [plot]

15 Metrics – Gatekeeper IP Connections [plot]

16 Metrics – VOMS – fermigrid2 [plot]

17 Metrics – VOMS – fermilab VO groups [plot]

18 Metrics – VOMS IP Connections [plot]

19 Monitoring – VOMS – fermigrid2 [plot]

20 Metrics – GUMS – fermigrid3 [plot]

21 Monitoring – GUMS – fermigrid3 [plot]

22 fermilab VO acceptance within the OSG
We have recently started development of a fermilab VO acceptance probe. Goals:
- See where members of the fermilab VO can execute grid jobs in the OSG.
- Cross-check the VO Resource Selector report.
- Record detailed results on success / failure.
- Provide a historical record.
- Serve as a tool for Fermilab management.

23 OSG VO Resource Selector
Here are the VORS results for Grid=OSG + VO=fermilab (30 sites). Every entry has Type=compute, Grid=OSG, Status=PASS:

Name                 Gatekeeper                           Last Test Date
ASGC_OSG             osgc01.grid.sinica.edu.tw:2119       2006-08-17 18:10:58
BU_ATLAS_Tier2       atlas.bu.edu:2119                    2006-08-17 18:23:07
CIT_CMS_T2           cit-gatekeeper.ultralight.org:2119   2006-08-17 18:24:17
DARTMOUTH            pbs-01.grid.dartmouth.edu:2119       2006-08-17 18:28:52
FIU-PG               fiupg.ampath.net:2119                2006-08-17 18:29:36
FNAL_FERMIGRID       fermigrid1.fnal.gov:2119             2006-08-17 18:30:58
FNAL_GPFARM          fngp-osg.fnal.gov:2119               2006-08-17 18:34:20
GRASE-BINGHAMTON     rommel.cs.binghamton.edu:2119        2006-08-17 18:38:54
GRASE-CCR-U2         u2-grid.ccr.buffalo.edu:2119         2006-08-17 18:39:57
GROW-UNI-P           grow.cs.uni.edu:2119                 2006-08-17 18:40:32
HAMPTONU             hercules.hamptonu.edu:2119           2006-08-17 18:42:18
IU_ATLAS_Tier2       atlas.iu.edu:2119                    2006-08-17 18:43:24
MIT_CMS              ce01.cmsaf.mit.edu:2119              2006-08-17 18:53:00
NERSC-PDSF           pdsfgrid2.nersc.gov:2119             2006-08-17 18:55:34
OSG_LIGO_PSU         grid3.aset.psu.edu:2119              2006-08-17 19:01:19
Purdue-ITaP          osg.rcac.purdue.edu:2119             2006-08-17 19:20:41
Purdue-Physics       grid.physics.purdue.edu:2119         2006-08-17 19:27:10
SDSS_TAM             tam01.fnal.gov:2119                  2006-08-17 19:28:10
STAR-Bham            rhilxs.ph.bham.ac.uk:2119            2006-08-17 19:32:40
STAR-WSU             rhic23.physics.wayne.edu:2119        2006-08-17 19:42:01
TTU-ANTAEUS          antaeus.hpcc.ttu.edu:2119            2006-08-17 19:43:12
UC_ATLAS_MWT2        tier2-osg.uchicago.edu:2119          2006-08-17 19:44:12
UC_Teraport          tp-osg.uchicago.edu:2119             2006-08-17 19:45:10
UERJ_HEPGRID         prod-frontend.hepgrid.uerj.br:2119   2006-08-17 19:46:14
UNM_HPC              milta.alliance.unm.edu:2119          2006-08-17 19:52:04
USCMS-FNAL-WC1-CE    cmsosgce.fnal.gov:2119               2006-08-17 19:52:44
UTA-DPCC             atlas.dpcc.uta.edu:2119              2006-08-17 19:56:31
UVA-sunfire          sunfire1.cs.virginia.edu:2119        2006-08-17 19:59:07
UWMilwaukee          nest.phys.uwm.edu:2119               2006-08-17 20:07:08
VAMPIRE-Vanderbilt   vampire.accre.vanderbilt.edu:2119    2006-08-17 20:08:04

24 Actions performed by the fermilab VO probe
The probe fetches the current Gridcat summary report (production or integration). For each Compute Service entry in the Gridcat summary report, the fermilab VO probe fetches the detailed Gridcat record corresponding to the "Site Name" and parses out:
- The Site Name
- The Gatekeeper Host Name
- The Site Application Directory ($APP)
- The Site Data Directory ($DATA)
- The Site Temporary Directory ($TMP)
- The Site Worker Node Temporary Directory ($WNTMP)
The VO probe then runs the following series of tests against the site:
1. ping -c 1 gatekeeper_host_name
2. globus-job-run gatekeeper_host_name /usr/bin/printenv
3. globus-url-copy -v -cd local_file gsiftp://gatekeeper_host_name:2811/temporary_file_name
4. globus-url-copy -v -cd gsiftp://gatekeeper_host_name:2811/temporary_file_name /dev/null
5. globus-job-run gatekeeper_host_name /bin/rm temporary_file_name
The start and end times of the tests are recorded, as are the results of the individual probes. If an individual test in the sequence fails, then all remaining tests are skipped. Finally a detail report, summary report, and trend plots are generated from the probe results.
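A minimal sketch of the probe's test sequence in Python, assuming the grid client tools are on the PATH. As described above, the first failing test skips all remaining tests; the local file and remote temporary path are illustrative:

```python
import subprocess
import time

def run_step(cmd):
    """Run one probe step, returning (passed, elapsed_seconds)."""
    start = time.time()
    rc = subprocess.run(cmd, capture_output=True).returncode
    return rc == 0, time.time() - start

def probe_site(gatekeeper, tmpname="fermilab-probe.tmp"):
    gridftp = f"gsiftp://{gatekeeper}:2811/tmp/{tmpname}"   # illustrative path
    steps = [
        ["ping", "-c", "1", gatekeeper],
        ["globus-job-run", gatekeeper, "/usr/bin/printenv"],
        ["globus-url-copy", "-v", "-cd", "file:///etc/hosts", gridftp],
        ["globus-url-copy", "-v", "-cd", gridftp, "file:///dev/null"],
        ["globus-job-run", gatekeeper, "/bin/rm", f"/tmp/{tmpname}"],
    ]
    results = []
    for cmd in steps:
        passed, elapsed = run_step(cmd)
        results.append((cmd[0], passed, elapsed))
        if not passed:       # a failed test skips all remaining tests
            break
    return results
```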

25 fermilab VO Probe – Success List
Here is the list of sites which successfully passed the fermilab VO probe (14 sites):

Site Name             Gatekeeper IP Name              Gatekeeper Directory
ASGC_OSG              osgc01.grid.sinica.edu.tw       /opt/wntmp/fermilab
FNAL_GPFARM           fngp-osg.fnal.gov               /local/stage1/fermilab
FNAL_FERMIGRID        fermigrid1.fnal.gov             /local/stage1/fermilab
osg-gw-2_t2_ucsd_edu  osg-gw-2.t2.ucsd.edu            /state/data/osgtmp/fermilab
GRASE-CCR-U2          u2-grid.ccr.buffalo.edu         /san4/scratch/grid-tmp/fermilab
USCMS-FNAL-WC1-CE     cmsosgce.fnal.gov               /uscms_data/d1/grid_tmp/osg/fermilab
SDSS_TAM              tam01.fnal.gov                  /tmp/wn-tmp/fermilab
UC_Teraport           tp-osg.uchicago.edu             /tmp/fermilab
HAMPTONU              hercules.hamptonu.edu           /wntmp/fermilab
UERJ_HEPGRID          prod-frontend.hepgrid.uerj.br   /storage/raid1/osg/osg-tmp/fermilab
CIT_CMS_T2            cit-gatekeeper.ultralight.org   /tmp/fermilab
MIT_CMS               ce01.cmsaf.mit.edu              /osg/tmp/fermilab
VAMPIRE-Vanderbilt    vampire.accre.vanderbilt.edu    /tmp/fermilab
OSG_LIGO_PSU          grid3.aset.psu.edu              /tmp/fermilab

Out of 54 OSG sites, 14 (15) pass all fermilab VO probe tests, but 4 of these 14 (15) sites are at Fermilab!
http://fermigrid.fnal.gov/monitor/fermigrid0-fermilab-vo-production-monitor-summary.html
http://fermigrid.fnal.gov/monitor/fermigrid0-fermilab-vo-production-monitor-detail.html

26 ping failure list
Here is the list of sites which failed the ping test (12 sites):

Site Name        Gatekeeper IP Name                   Gatekeeper Directory
BNL_ATLAS_2      gridgk02.racf.bnl.gov                /tmp//fermilab
BNL_ATLAS_1      gridgk01.racf.bnl.gov                /tmp/fermilab
UTA-DPCC         atlas.dpcc.uta.edu                   /scratch/fermilab
TTU-ANTAEUS ***  antaeus.hpcc.ttu.edu                 /tmp/fermilab
GRASE-ALBANY     grid.rit.albany.edu                  /tmp/fermilab
UNM_HPC          milta.alliance.unm.edu               /tmp/fermilab
OU_OSCER_OSG     boomer2.oscer.ou.edu                 /localscratch/fermilab
TACC             osg-login.lonestar.tacc.utexas.edu   /data/OSG/tmp/fermilab
STAR-WSU         rhic23.physics.wayne.edu             /tmp/fermilab
STAR-BNL         stargrid02.rcf.bnl.gov               /tmp/fermilab
KNU              cluster28.knu.ac.kr                  /fermilab
grow-UNI-P       grow.cs.uni.edu                      /tmp/fermilab

If the standard Unix ping is replaced by a globus ping:
globusrun -a -r gatekeeper_host_name/jobmanager-fork
then the one site marked with *** will pass the fermilab VO probe. The remaining 11 still fail the globus ping.

27 globus-job-run failure list
Here is the list of sites which failed the globus-job-run test (25 sites):

Site Name                 Gatekeeper IP Name            Gatekeeper Directory
Purdue-Physics            grid.physics.purdue.edu       /scratch/fermilab
Purdue-ITaP               osg.rcac.purdue.edu           /tmp/fermilab
NERSC-PDSF                pdsfgrid2.nersc.gov           /scratch/fermilab
UFlorida-EO               ufgrid05.phys.ufl.edu         /state/partition1/wntmp/fermilab
UIOWA-OSG-PROD            rtgrid1.its.uiowa.edu         /tmp/fermilab
OUHEP_OSG                 ouhep0.nhn.ou.edu             /myhome1/atlas/grid3/tmp/fermilab
PROD_SLAC                 osgserv01.slac.stanford.edu   /nfs/slac/g/grid/osg/tmp/fermilab
STAR-SAO_PAULO            stars.if.usp.br               /home/grid/WN_TMP/fermilab
sunfire1_cs_virginia_edu  sunfire1.cs.virginia.edu      /tmp/osg/wntmp/fermilab
GRASE-BINGHAMTON          rommel.cs.binghamton.edu      /opt/grid3_share/scratch/fermilab
UFlorida-PG               ufloridapg.phys.ufl.edu       /wntmp/fermilab
UWMadisonCMS              cmsgrid02.hep.wisc.edu        /tmp/fermilab
Nebraska                  red.unl.edu                   /scratch/fermilab
SPRACE                    spgrid.if.usp.br              /scratch/OSG/fermilab
NTU_HEP                   grid1.phys.ntu.edu.tw         /fermilab
FIU-PG                    fiupg.ampath.net              /state/partition1/wntmp/fermilab
IU_BANDICOOT              bandicoot.uits.indiana.edu    /tmp/fermilab
FNAL_LQCD                 lqcd.fnal.gov                 /fermilab
UFlorida-IHEPA            ufgrid01.phys.ufl.edu         /wntmp/fermilab
OSG_INSTALL_TEST_2        cms-xen2.fnal.gov             /usr/local/osg-ce/OSG.DIRS/wn_tmp/fermilab
BU_ATLAS_Tier2            atlas.bu.edu                  /scratch/osg-tmp/fermilab
Rice                      osg-gate.rice.edu             /net/data11/tmp//fermilab
DARTMOUTH                 pbs-01.grid.dartmouth.edu     /tmp/fermilab
OU_OCHEP_SWT2             tier2-01.ochep.ou.edu         /state/partition1/gridtmp/fermilab
Purdue-Lear               lepton.rcac.purdue.edu        /tmp/fermilab

28 globus-url-copy failure list
Here is the list of sites which failed the globus-url-copy test (3 sites):

Site Name       Gatekeeper IP Name       Gatekeeper Directory
IU_ATLAS_Tier2  atlas.iu.edu             /osgscr/fermilab
UWMilwaukee     nest.phys.uwm.edu        /localscratch/fermilab
UC_ATLAS_MWT2   tier2-osg.uchicago.edu   /scratch/fermilab

29 fermilab VO probe – Plots
[Trend plots; see:]
http://fermigrid.fnal.gov/monitor/fermigrid0-fermilab-vo-production-monitor.html
http://fermigrid.fnal.gov/monitor/fermigrid0-fermilab-vo-integration-monitor.html

30 Current Work and Future Plans
Policy, Security and Authorization Infrastructure:
- Complete development and deployment of the Site AuthoriZation (SAZ) service (target is Oct 1, 2006).
- gLexec development and deployment (target is Oct 1, 2006).
- Fermilab Open Science Enclave enumeration, trust relationships, and policy.
Public storage and storage element:
- FNAL_FERMIGRID_SE – public dcache storage element – 7 TBytes.
- Migration to BlueArc NFS appliance storage.
Further Metrics and Service Monitor Development:
- Automate the cross-checking of VORS and the fermilab VO probe results.
- Generate appropriate tickets to the OSG GOC for sites which are listed in VORS but fail the fermilab VO probe.
- Enhance and extend metrics and service monitors.
Service Failover:
- At the present time the services are running in non-redundant mode.
- We are evaluating the best ways to implement service failover; Linux-HA clustering, BlueArc, Xen, and other technologies are being looked at.
Research & Development & Deployment of future ITBs and OSG releases:
- Ongoing work… testbed systems are essential so that we can perform the research, development & integration without impacting our production services.
- Looking at Xen to facilitate this work as well.

31 Fin
Any Questions?

