TeraGrid Operations Overview
Mike Pingleton, NCSA TeraGrid Operations
December 2nd, 2004

TeraGrid Operations Center
Provides continuous, coordinated operational support, user assistance, and incident response for the nationwide TeraGrid.

TOC Capabilities  24/7 single source of assistance for TeraGrid users and staff, via or telephone  Dedicated TeraGrid trouble-ticket system (TTS) ensures timely resolution of problems and event response  Leverages and pools vast experience of existing operations staff and system administrators  Capable of monitoring systems/queues at multiple remote sites

“use existing infrastructure” - NSF

TOC Technical Approach  TG Operations Center staffed by NCSA and SDSC Operations staff, 12 hour shift for each site  TOC provides front-line evaluation, resolution, and routing of problems  TOC coordinates, participates in event response – security issues, down time, etc.

NCSA & SDSC Ops Centers: Expanded Scope, but Business as Usual

Monitoring Capabilities

Monitoring  Currently ‘passively’ monitoring most TeraGrid clusters using CluMon  Ramping up efforts to monitor the TeraGrid network  Monitoring capacity untapped at this point (not yet monitoring grid fabric)

TeraGrid Ticketing System

Technical Approach - TeraGrid Ticketing System  or toll-free number receive all incoming requests  TTS is a browser-based, db-driven system developed from NCSA’s in-house ticketing system (use existing infrastructure!)  Users are able to track the progress of their tickets  New TG sites are easily integrated into system (all new ETF sites already integrated)

Technical Approach - TeraGrid Ticketing System (continued)
• Problem resolution: a tiered approach
  • Front-line evaluation, routing, or resolution by TG Ops staff
  • Site-specific issues routed to site leads for resolution
  • TG-wide issues routed to the user support team, which coordinates resolution by technical leads
• Front-line resolution is an important factor
  • 22% of all trouble tickets are resolved by TOC staff
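
The tiered routing can be sketched as a small decision function. This is a hypothetical Python illustration, not the TTS implementation. The `Ticket` fields are assumptions; in particular, `resolvable_by_toc` stands in for the outcome of front-line evaluation. A helper computes the share of tickets closed at the front line, which is the figure the 22% statistic reports.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Tier(Enum):
    TOC = "front-line resolution by TG Ops staff"
    SITE_LEAD = "routed to the site lead"
    USER_SUPPORT = "routed to the user support team"


@dataclass
class Ticket:
    summary: str
    site: Optional[str]      # affected site, or None for a TG-wide issue
    resolvable_by_toc: bool  # hypothetical outcome of front-line evaluation


def route(ticket: Ticket) -> Tier:
    """Apply the tiers in order: TOC first, then site lead, then user support."""
    if ticket.resolvable_by_toc:
        return Tier.TOC
    if ticket.site is not None:
        return Tier.SITE_LEAD
    return Tier.USER_SUPPORT


def front_line_rate(tiers: List[Tier]) -> float:
    """Fraction of tickets resolved at the front line (0.0 when there are none)."""
    if not tiers:
        return 0.0
    return sum(1 for t in tiers if t is Tier.TOC) / len(tiers)
```

For example, `route(Ticket("batch queue stalled", "SDSC", False))` returns `Tier.SITE_LEAD`. The TOC check comes first because front-line evaluation happens before any ticket is routed elsewhere.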

Trouble Ticket Processing From Open To Close  When a ticket is created, user receives auto- notification with ticket number  User receives personal reply within 30 minutes  Ticket is assigned to a project & to someone  User is kept updated on progress, resolution  Problem behind ticket is resolved  User is notified  User receives auto-notification of closure, with summary

Problem Resolution Workflow (diagram: TeraGrid User Community, TeraGrid Operations, User Support Team, TeraGrid Sites)

Pulling Ops Centers Together:  A common set of web-based procedures documentation –  Routing & Assignment Guides  ’20 Questions’ Guides for problem determination  Basic operational policies and procedures  ‘Shift Turnover’ phone calls  Open communication & assistance

Challenges  TeraGrid is a huge learning curve for Ops Staff (must know at least a little bit about everything)  Keeping abreast with a constant state of change  Working with people who are very far away (and sometimes on vacation)  Promoting the concept of Problem Resolution (new to some) and getting everyone to use the Ticketing System  Inexperienced users on the horizon

Lessons Learned  More tickets than anyone expected  Problem Resolution on a global scale is expensive wrt time and talent consumed  TG Ops Center more than just a problem routing switchboard  Communication & coordination between RPs, services and TOC vital to success