TeraGrid Operations Overview. Mike Pingleton, NCSA TeraGrid Operations. December 2nd, 2004.


1 TeraGrid Operations Overview. Mike Pingleton, NCSA TeraGrid Operations. December 2nd, 2004

2 TeraGrid Operations Center Provides continuous and coordinated operational support, user assistance, and incident response for the nation-wide TeraGrid

3 TOC Capabilities
- 24/7 single source of assistance for TeraGrid users and staff, via email or telephone
- Dedicated TeraGrid trouble-ticket system (TTS) ensures timely resolution of problems and event response
- Leverages and pools vast experience of existing operations staff and system administrators
- Capable of monitoring systems/queues at multiple remote sites

4 “use existing infrastructure” - NSF

5 TOC Technical Approach
- TG Operations Center staffed by NCSA and SDSC Operations staff, with a 12-hour shift for each site
- TOC provides front-line evaluation, resolution, and routing of problems
- TOC coordinates and participates in event response (security issues, downtime, etc.)

6 NCSA & SDSC Ops Centers: Expanded Scope, but Business as Usual

7 Monitoring Capabilities

8 Monitoring
- Currently ‘passively’ monitoring most TeraGrid clusters using CluMon
- Ramping up efforts to monitor the TeraGrid network
- Monitoring capacity untapped at this point (not yet monitoring grid fabric)

9 TeraGrid Ticketing System

10 Technical Approach - TeraGrid Ticketing System
- help@teragrid.org or a toll-free number receives all incoming requests
- TTS is a browser-based, database-driven system developed from NCSA’s in-house ticketing system (use existing infrastructure!)
- Users are able to track the progress of their tickets
- New TG sites are easily integrated into the system (all new ETF sites already integrated)

11 Technical Approach - TeraGrid Ticketing System (continued)
- Problem resolution: a tiered approach
  - Front-line evaluation, routing or resolution by TG Ops staff
  - Site-specific issues routed to site leads for resolution
  - TG-wide issues routed to user support team to coordinate resolution by technical leads
- Front-line resolution an important factor
  - 22% of all trouble tickets resolved by TOC staff
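The tiered routing on slide 11 can be sketched roughly as follows. This is an illustration only, not the TTS implementation: the `Scope` categories, the `Ticket` fields, and the returned destination strings are all hypothetical names chosen to mirror the three tiers on the slide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Scope(Enum):
    """Hypothetical problem categories mirroring the three tiers on the slide."""
    FRONT_LINE = "front_line"  # TOC staff can resolve it directly
    SITE = "site"              # specific to one TeraGrid site
    GRID_WIDE = "grid_wide"    # affects TeraGrid as a whole


@dataclass
class Ticket:
    number: int
    summary: str
    scope: Scope
    site: Optional[str] = None  # only meaningful for site-specific tickets


def route(ticket: Ticket) -> str:
    """Return who the ticket goes to after front-line evaluation (assumed labels)."""
    if ticket.scope is Scope.FRONT_LINE:
        return "TOC staff"
    if ticket.scope is Scope.SITE:
        if ticket.site is None:
            raise ValueError("a site-specific ticket needs a site")
        return f"site lead: {ticket.site}"
    # TG-wide issues go to the user support team, which coordinates
    # resolution with the technical leads.
    return "user support team"
```

For example, `route(Ticket(1, "queue stalled", Scope.SITE, site="NCSA"))` returns `"site lead: NCSA"`.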

12 Trouble Ticket Processing From Open To Close
- When a ticket is created, user receives auto-notification with ticket number
- User receives personal reply within 30 minutes
- Ticket is assigned to a project & to someone
- User is kept updated on progress, resolution
- Problem behind ticket is resolved
- User is notified
- User receives auto-notification of closure, with summary
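One way to picture the open-to-close sequence on slide 12 is as a small state machine with a check on the 30-minute reply target. The sketch below is an assumption-laden illustration: the state names, the `advance` and `reply_on_time` helpers, and the strictly linear ordering are hypothetical and are not taken from the TTS itself.

```python
from datetime import datetime, timedelta
from enum import Enum, auto

# The slide states that users receive a personal reply within 30 minutes.
REPLY_TARGET = timedelta(minutes=30)


class State(Enum):
    """Hypothetical ticket states, in the order the slide lists them."""
    CREATED = auto()   # auto-notification with ticket number sent
    REPLIED = auto()   # personal reply sent to the user
    ASSIGNED = auto()  # assigned to a project and a person
    RESOLVED = auto()  # underlying problem fixed, user notified
    CLOSED = auto()    # auto-notification of closure, with summary


# Assumed linear progression; a real system may allow reopening or skipping.
NEXT = {
    State.CREATED: State.REPLIED,
    State.REPLIED: State.ASSIGNED,
    State.ASSIGNED: State.RESOLVED,
    State.RESOLVED: State.CLOSED,
}


def advance(state: State) -> State:
    """Move a ticket to the next state in the assumed sequence."""
    if state is State.CLOSED:
        raise ValueError("ticket is already closed")
    return NEXT[state]


def reply_on_time(created: datetime, replied: datetime) -> bool:
    """True if the personal reply met the 30-minute target."""
    return replied - created <= REPLY_TARGET
```

Calling `advance` four times starting from `State.CREATED` walks a ticket through to `State.CLOSED`.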

13 Problem Resolution Workflow
(Diagram; labels: TeraGrid User Community, help@teragrid.org, TeraGrid Operations, User Support Team, TeraGrid Sites)

14

15 Pulling Ops Centers Together
- A common set of web-based procedures documentation:
  - Routing & Assignment Guides
  - ‘20 Questions’ Guides for problem determination
  - Basic operational policies and procedures
- ‘Shift Turnover’ phone calls
- Open communication & assistance

16 Challenges
- TeraGrid is a huge learning curve for Ops staff (must know at least a little bit about everything)
- Keeping abreast of a constant state of change
- Working with people who are very far away (and sometimes on vacation)
- Promoting the concept of Problem Resolution (new to some) and getting everyone to use the Ticketing System
- Inexperienced users on the horizon

17 Lessons Learned
- More tickets than anyone expected
- Problem Resolution on a global scale is expensive in terms of time and talent consumed
- TG Ops Center is more than just a problem-routing switchboard
- Communication & coordination between RPs, services and TOC is vital to success

