Grid Operations Lessons Learned Rob Quick Open Science Grid Operations Center - Indiana University.



R. Quick "WLCG-OSG-EGEE Interop" 26 Jan 2007

Outline
How We Operate
Lessons Learned
Lessons Not Yet Learned

The Open Science Grid Operations Center (GOC)
Critical Infrastructure Support
Communication Hub
Security Response
Central Software Caches

OSG Operations Infrastructure
Monitoring/Status
–VORS
–MonALISA
–GridCat
–VOMS Monitor
–CEMon/BDII Integrated Information Server
Services
–VOMS (6 VOs)
–RSS News Feed
–GOC Informational Pages
Trouble Ticketing
–Exchange with Peering Grids and Support Centers
Scheduled Downtime Tool
OSG Software Cache
Registration DB
Duplicate Infrastructure for the OSG ITB

Communication Hub
Operator Available 24/7/365 to Receive Call/ and Open/Route Ticket
Trouble Ticketing
–~3500 Tickets since the GOC's Inception
–~30 New Tickets Opened Per Week
–Automated exchange of tickets with GGUS, FNAL, VDT, ATLAS, CMS
Weekly Operations Call
OSG-Operations Mailing List

Security Response
Technician on call 24/7/365 to evaluate security incidents.
Critical Incidents are Immediately Addressed with OSG Security Officer
opensciencegrid.org 24/7/365 phone availability

Software Caches
OSG and ITB Caches
Compute Element Configuration of Condor, PBS, LSF, SGE
Worker Node Client
Client
VOMS
GUMS

Lesson One
Event: Release of OSG Software
Situation: Winter 2006, the OSG Software stack has been validated and is ready for release; however, documentation is in horrible shape.
Solution: 3 people work non-stop for 2 weeks to get baseline documents in shape.
Lesson: Documentation is as important as Validation, Integration, and Deployment.
Corollary: Incorrect documentation is often worse than no documentation.

Lesson Two
Event: A Java service is using resources poorly
Situation: MonALISA monitoring used on a large group of grid resources takes tremendous amounts of I/O
Solution: GOC is asked to beef up the hardware
Lesson: The fix for poor software performance is better hardware.
Wait: that didn’t work!!!
Real Lesson: A bigger hammer will still not drive nails into rocks.

Lesson Three
Event: DZero Production Run
Situation: DZero has 250 million events to process and merge.
Solution: OSG Resources are urged to support the DZero VO and troubleshooting team works with application developers.
Original Goal: ~3M events/day. Up to ~7.7M events/day processed.
Lesson: There's nothing you can't do if you have a Swiss Army Knife, a roll of duct tape, and your wits.
Actual Lesson: The resources are available on OSG, but there is still effort needed to coordinate large runs.
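The numbers in Lesson Three can be turned into a rough run-length estimate. This is only a back-of-the-envelope sketch: it assumes a constant daily rate, which the slide does not claim (7.7M events/day is quoted as an "up to" figure), and the function and variable names are illustrative.

```python
import math

# Figures quoted on the slide (approximate).
TOTAL_EVENTS = 250_000_000   # events DZero needs to process and merge
GOAL_RATE = 3_000_000        # original goal, events per day
PEAK_RATE = 7_700_000        # best rate reached, events per day


def days_to_process(total_events: int, events_per_day: int) -> int:
    """Whole days needed at a constant daily rate (an assumption, not a slide claim)."""
    return math.ceil(total_events / events_per_day)


days_at_goal = days_to_process(TOTAL_EVENTS, GOAL_RATE)
days_at_peak = days_to_process(TOTAL_EVENTS, PEAK_RATE)

print(f"At the goal rate: about {days_at_goal} days")
print(f"At the peak rate: about {days_at_peak} days")
```

Under these assumptions the run takes roughly twelve weeks at the goal rate and under five weeks at the peak rate. A real run would land somewhere in between, since the peak was not necessarily sustained.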

Lesson Four
Event: Joint WLCG/OSG/EGEE Operations Meeting
Situation: We need a way to seamlessly exchange problems between peering grids.
Solution: Develop a translator between the EGEE GGUS ticketing system and the OSG Footprints system.
Lesson: Communication is the key to grid interoperability.
Alternate Lesson: If you can’t be at the World Cup, Geneva is the next best option.
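The translator idea in Lesson Four can be pictured as a field-by-field mapping between two ticket schemas. The sketch below is not the actual GOC translator: every class, field name, and priority value is hypothetical, chosen only to show the shape of a two-way mapping that keeps a ticket traceable to its origin.

```python
from dataclasses import dataclass

# Hypothetical priority vocabularies; the slides do not give the real values.
GGUS_TO_FP_PRIORITY = {"less urgent": 4, "urgent": 3, "very urgent": 2, "top priority": 1}
FP_TO_GGUS_PRIORITY = {level: name for name, level in GGUS_TO_FP_PRIORITY.items()}
PREFIX = "GGUS-"  # hypothetical marker recording where a ticket came from


@dataclass(frozen=True)
class GgusTicket:            # hypothetical GGUS-side schema
    ticket_id: str
    subject: str
    description: str
    priority: str


@dataclass(frozen=True)
class FootprintsTicket:      # hypothetical Footprints-side schema
    number: str
    title: str
    body: str
    priority: int


def ggus_to_footprints(ticket: GgusTicket) -> FootprintsTicket:
    """Map a GGUS ticket onto the Footprints schema, tagging its origin."""
    return FootprintsTicket(
        number=PREFIX + ticket.ticket_id,
        title=ticket.subject,
        body=ticket.description,
        # Unknown priorities fall back to the lowest urgency.
        priority=GGUS_TO_FP_PRIORITY.get(ticket.priority, 4),
    )


def footprints_to_ggus(ticket: FootprintsTicket) -> GgusTicket:
    """Map a Footprints ticket back, stripping the origin marker if present."""
    number = ticket.number
    if number.startswith(PREFIX):
        number = number[len(PREFIX):]
    return GgusTicket(
        ticket_id=number,
        subject=ticket.title,
        description=ticket.body,
        priority=FP_TO_GGUS_PRIORITY.get(ticket.priority, "less urgent"),
    )


original = GgusTicket("12345", "Job failures at site X", "Jobs abort on submit.", "urgent")
translated = ggus_to_footprints(original)
round_tripped = footprints_to_ggus(translated)
print(translated)
print(round_tripped == original)
```

Keeping the mapping in plain lookup tables makes it easy for both operations teams to review the translation rules together, which fits the communication point of the lesson.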

Lesson Five
Event: High Level Collaborator Resigns
Situation: OSG Collaborator running a critical status availability service suddenly resigns and the service is turned off.
Solution: Several developers design equivalent services.
Lesson: Critical services should have multiple administrators and be located centrally, or co-located at the GOC.
Alternate Lesson: If a potential security incident happens during a first date and pulls you away, there will probably not be a second.

Lesson Six
Event: Chicago Marathon
Situation: 13.1 miles to go, halfway point… spectator is offering bananas and tequila.
Solution: Take some of both.
Lesson: Sometimes motivation comes in the most unlikely form.
Corollary: If someone offers you tequila no matter what the situation… drink it!

Lessons That Need to Be Learned
How to accurately advertise VO support
How to efficiently interoperate with peering grids
How to understand and advertise site policy to users
What services are necessary to provide users with all of the information they need to effectively use the OSG
How to handle an explosion in user base

Thank You
Special Thanks
GOC Team: John Rosheck, Tim Silvers, Kyle Gross, and Arvind Gopu