Grid Operations Lessons Learned Rob Quick Open Science Grid Operations Center - Indiana University.


2 Grid Operations Lessons Learned Rob Quick Open Science Grid Operations Center - Indiana University

3 R. Quick "WLCG-OSG-EGEE Interop" 26 Jan 2007
Outline
How We Operate
Lessons Learned
Lessons Not Yet Learned

4 The Open Science Grid Operations Center (GOC)
Critical Infrastructure Support
Communication Hub
Security Response
Central Software Caches

5 OSG Operations Infrastructure
Monitoring/Status
–VORS
–MonALISA
–GridCat
–VOMS Monitor
–CEMon/BDII Integrated Information Server
Services
–VOMS (6 VOs)
–RSS News Feed
–GOC Informational Pages
Trouble Ticketing
–Exchange with Peering Grids and Support Centers
Scheduled Downtime Tool
OSG Software Cache
Registration DB
Duplicate Infrastructure for the OSG ITB

6 Communication Hub
Operator Available 24/7/365 to Receive Calls/Email and Open/Route Tickets
Trouble Ticketing
~3500 Tickets since the GOC's Inception
~30 New Tickets Opened Per Week
Automated Exchange of Tickets with GGUS, FNAL, VDT, ATLAS, CMS
Weekly Operations Call
OSG-Operations Mailing List

7 Security Response
Technician on-call 24/7/365 to evaluate security incidents.
Critical Incidents are Immediately Addressed with OSG Security Officer
security@, incident@, abuse@ opensciencegrid.org
24/7/365 phone availability

8 Software Caches
OSG and ITB Caches
Compute Element
Configuration of Condor, PBS, LSF, SGE
Worker Node Client
Client
VOMS
GUMS

9 Lesson One
Event: Release of OSG 0.4.0 Software
Situation: Winter 2006. The OSG software stack has been validated and is ready for release, but the documentation is in horrible shape.
Solution: 3 people work non-stop for 2 weeks to get baseline documents in shape.
Lesson: Documentation is as important as Validation, Integration, and Deployment.
Corollary: Incorrect documentation is often worse than no documentation.

10 Lesson Two
Event: A Java service is using resources poorly
Situation: MonALISA monitoring, used on a large group of grid resources, consumes tremendous amounts of I/O.
Solution: GOC is asked to beef up the hardware.
Lesson: The fix for poor software performance is better hardware.
Wait: that didn't work!!!
Real Lesson: A bigger hammer will still not drive nails into rocks.

11 Lesson Three
Event: DZero Production Run
Situation: DZero has 250 million events to process and merge.
Solution: OSG Resources are urged to support the DZero VO, and the troubleshooting team works with application developers.
Original Goal: ~3M events/day. Up to ~7.7M events/day were processed.
Lesson: There's nothing you can't do if you have a Swiss Army Knife, a roll of duct tape, and your wits.
Actual Lesson: The resources are available on OSG, but effort is still needed to coordinate large runs.
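
For scale, the figures on this slide can be turned into a rough run length. This is only a back-of-the-envelope sketch: it assumes a constant daily rate, which the slide does not claim (the ~7.7M figure is a peak, so the second number is a lower bound on duration), and the function name is made up for illustration.

```python
import math

# Figures taken from the slide.
TOTAL_EVENTS = 250_000_000   # events DZero needed to process and merge
GOAL_RATE = 3_000_000        # original goal, events per day
PEAK_RATE = 7_700_000        # best observed rate, events per day

def days_to_process(total_events: int, events_per_day: int) -> int:
    """Whole days needed at a constant rate (an idealized assumption)."""
    return math.ceil(total_events / events_per_day)

days_at_goal = days_to_process(TOTAL_EVENTS, GOAL_RATE)
days_at_peak = days_to_process(TOTAL_EVENTS, PEAK_RATE)

print(f"At the goal rate: about {days_at_goal} days")
print(f"At the peak rate: at least {days_at_peak} days")
```

Under these assumptions the run takes roughly 12 weeks at the goal rate and no less than about 5 weeks if the peak rate were sustained.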

12 Lesson Four
Event: Joint WLCG/OSG/EGEE Operations Meeting
Situation: We need a way to seamlessly exchange problems between peering grids.
Solution: Develop a translator between the EGEE GGUS ticketing system and the OSG Footprints system.
Lesson: Communication is the key to grid interoperability.
Alternate Lesson: If you can't be at the World Cup, Geneva is the next best option.
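
The talk does not describe how the translator works. As a purely illustrative sketch of the general idea (mapping one system's ticket fields and status values onto another's), something like the following could be imagined. Every field name, status value, and function name here is hypothetical and is not taken from GGUS, Footprints, or the actual GOC translator.

```python
# Hypothetical field mapping: peer-grid field name -> local field name.
GGUS_TO_OSG_FIELDS = {
    "ticket_id": "peer_ticket_id",
    "subject": "title",
    "description": "description",
    "priority": "priority",
}

# Hypothetical status mapping; unknown statuses fall back to "Open".
GGUS_TO_OSG_STATUS = {
    "new": "Open",
    "in progress": "Assigned",
    "solved": "Resolved",
}

def translate_ticket(ggus_ticket: dict) -> dict:
    """Build a local ticket record from a peer-grid ticket record."""
    osg_ticket = {
        osg_field: ggus_ticket[ggus_field]
        for ggus_field, osg_field in GGUS_TO_OSG_FIELDS.items()
        if ggus_field in ggus_ticket
    }
    status = ggus_ticket.get("status", "").lower()
    osg_ticket["status"] = GGUS_TO_OSG_STATUS.get(status, "Open")
    osg_ticket["origin"] = "GGUS"  # remember where the ticket came from
    return osg_ticket

example = translate_ticket(
    {"ticket_id": "12345", "subject": "CE unreachable", "status": "In Progress"}
)
print(example)
```

Keeping the mappings in plain tables rather than in code paths means a change on either side only requires editing a table entry.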

13 Lesson Five
Event: High Level Collaborator Resigns
Situation: An OSG Collaborator running a critical status availability service suddenly resigns, and the service is turned off.
Solution: Several developers design equivalent services.
Lesson: Critical services should have multiple administrators and be located centrally, or co-located at the GOC.
Alternate Lesson: If a potential security incident happens during a first date and pulls you away, there will probably not be a second.

14 Lesson Six
Event: Chicago Marathon
Situation: 13.1 miles to go, halfway point… a spectator is offering bananas and tequila.
Solution: Take some of both.
Lesson: Sometimes motivation comes in the most unlikely form.
Corollary: If someone offers you tequila no matter what the situation… drink it!

15 Lessons That Need to Be Learned
How to accurately advertise VO support
How to efficiently interoperate with peering grids
How to understand and advertise site policy to users
What services are necessary to provide users with all of the information they need to effectively use the OSG
How to handle an explosion in user base

16 Thank You
Special Thanks
GOC Team: John Rosheck, Tim Silvers, Kyle Gross, and Arvind Gopu
www.opensciencegrid.org
www.grid.iu.edu

