Published by Evangeline Harriet Charles. Modified over 9 years ago.
1
TeraGrid Operations Overview
Mike Pingleton, NCSA TeraGrid Operations
December 2nd, 2004
2
TeraGrid Operations Center
Provides continuous and coordinated operational support, user assistance, and incident response for the nationwide TeraGrid
3
TOC Capabilities
- 24/7 single source of assistance for TeraGrid users and staff, via email or telephone
- Dedicated TeraGrid trouble-ticket system (TTS) ensures timely resolution of problems and event response
- Leverages and pools vast experience of existing operations staff and system administrators
- Capable of monitoring systems/queues at multiple remote sites
4
“use existing infrastructure” - NSF
5
TOC Technical Approach
- TG Operations Center staffed by NCSA and SDSC Operations staff, with a 12-hour shift for each site
- TOC provides front-line evaluation, resolution, and routing of problems
- TOC coordinates and participates in event response (security issues, downtime, etc.)
6
NCSA & SDSC Ops Centers: Expanded Scope, but Business as Usual
7
Monitoring Capabilities
8
Monitoring
- Currently ‘passively’ monitoring most TeraGrid clusters using CluMon
- Ramping up efforts to monitor the TeraGrid network
- Monitoring capacity untapped at this point (not yet monitoring grid fabric)
9
TeraGrid Ticketing System
10
Technical Approach - TeraGrid Ticketing System
- help@teragrid.org or toll-free number receive all incoming requests
- TTS is a browser-based, database-driven system developed from NCSA’s in-house ticketing system (use existing infrastructure!)
- Users are able to track the progress of their tickets
- New TG sites are easily integrated into the system (all new ETF sites already integrated)
11
Technical Approach - TeraGrid Ticketing System (continued)
Problem resolution: a tiered approach
- Front-line evaluation, routing, or resolution by TG Ops staff
- Site-specific issues routed to site leads for resolution
- TG-wide issues routed to user support team to coordinate resolution by technical leads
Front-line resolution is an important factor
- 22% of all trouble tickets resolved by TOC staff
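The tiered approach on this slide can be illustrated with a minimal sketch. It is not the TTS implementation: the `Scope` enum, the `Ticket` fields, and the `route` function are hypothetical names made up for illustration, and the sketch assumes the front-line evaluation has already decided which tier a ticket belongs to.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Scope(Enum):
    """Hypothetical outcome of the front-line evaluation."""
    FRONT_LINE = "front_line"  # TOC staff can resolve it directly
    SITE = "site"              # specific to one TeraGrid site
    GRID_WIDE = "grid_wide"    # affects TeraGrid as a whole


@dataclass
class Ticket:
    number: int
    summary: str
    scope: Scope
    site: Optional[str] = None  # only meaningful for site-specific tickets


def route(ticket: Ticket) -> str:
    """Return who should own the ticket under the tiered approach."""
    if ticket.scope is Scope.FRONT_LINE:
        return "TOC staff"
    if ticket.scope is Scope.SITE:
        if ticket.site is None:
            raise ValueError("a site-specific ticket needs a site name")
        return f"site lead: {ticket.site}"
    # TG-wide issues go to the user support team, which coordinates
    # resolution by the technical leads.
    return "user support team"
```

For example, `route(Ticket(1, "queue stuck", Scope.SITE, site="NCSA"))` returns `"site lead: NCSA"`.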
12
Trouble Ticket Processing From Open To Close
1. When a ticket is created, user receives auto-notification with ticket number
2. User receives personal reply within 30 minutes
3. Ticket is assigned to a project & to someone
4. User is kept updated on progress, resolution
5. Problem behind ticket is resolved
6. User is notified
7. User receives auto-notification of closure, with summary
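The open-to-close sequence above can be sketched as a small state machine. This is an illustrative assumption, not how TTS is built: the state names, `NEXT_STATE`, `advance`, and `reply_is_late` are all hypothetical, and only the 30-minute personal-reply target comes from the slide.

```python
from datetime import datetime, timedelta
from enum import Enum


class State(Enum):
    """Hypothetical ticket states, following the order on the slide."""
    CREATED = 1       # auto-notification with ticket number sent
    ACKNOWLEDGED = 2  # personal reply sent to the user
    ASSIGNED = 3      # assigned to a project and a person
    RESOLVED = 4      # underlying problem fixed, user notified
    CLOSED = 5        # auto-notification of closure with summary sent


# Target from the slide: personal reply within 30 minutes.
REPLY_DEADLINE = timedelta(minutes=30)

# Each state has exactly one successor; CLOSED is terminal.
NEXT_STATE = {
    State.CREATED: State.ACKNOWLEDGED,
    State.ACKNOWLEDGED: State.ASSIGNED,
    State.ASSIGNED: State.RESOLVED,
    State.RESOLVED: State.CLOSED,
}


def advance(state: State) -> State:
    """Move a ticket to its next state, refusing to go past CLOSED."""
    if state not in NEXT_STATE:
        raise ValueError(f"{state.name} is a terminal state")
    return NEXT_STATE[state]


def reply_is_late(created_at: datetime, replied_at: datetime) -> bool:
    """True if the personal reply missed the 30-minute target."""
    return replied_at - created_at > REPLY_DEADLINE
```

A check such as `reply_is_late` could feed a report on how often the 30-minute target is met; the slides do not say whether TTS tracks this.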
13
Problem Resolution Workflow
[Diagram: TeraGrid User Community, help@teragrid.org, TeraGrid Operations, User Support Team, TeraGrid Sites]
15
Pulling Ops Centers Together:
- A common set of web-based procedures documentation:
  - Routing & Assignment Guides
  - ‘20 Questions’ Guides for problem determination
  - Basic operational policies and procedures
- ‘Shift Turnover’ phone calls
- Open communication & assistance
16
Challenges
- TeraGrid is a huge learning curve for Ops Staff (must know at least a little bit about everything)
- Keeping abreast of a constant state of change
- Working with people who are very far away (and sometimes on vacation)
- Promoting the concept of Problem Resolution (new to some) and getting everyone to use the Ticketing System
- Inexperienced users on the horizon
17
Lessons Learned
- More tickets than anyone expected
- Problem Resolution on a global scale is expensive in terms of time and talent consumed
- TG Ops Center is more than just a problem-routing switchboard
- Communication & coordination between RPs, services, and TOC are vital to success