Best practices and experiences in user support – A case study at GARUDA
  Mr Santhosh J, santhoshj@cdac.in
  Centre for Development of Advanced Computing, Bangalore, India
That holds true for most businesses and services. And it holds true for Grid or Cloud computing environments as well. 3/22/2018 ISGC 2018
GARUDA
  India's first national grid initiative, bringing together academic, scientific and research communities for their data- and compute-intensive applications of national importance.
  The GARUDA grid is an aggregation of resources, comprising computational nodes and mass storage, distributed across the country.
  No. of GARUDA partners: 75
  NKN (National Knowledge Network) connectivity at 10 Gbps
  About GARUDA and its complex environment:
    Distributed administrative domains
    Varied types of users: general grid users, developers, remote admins, application enablement, VO managers, users from different domains
    Multiple applications and VOs
    PKI infrastructure
GARUDA – computational resources
  Distributed resources
  Distributed administrative domains
  Varied architecture
  Heterogeneous resources:
    Hardware: Opteron, POWER5 (AIX, Linux), Xeon, etc.
    OS: Linux (various distributions such as RHEL and CentOS), AIX, etc.
    Schedulers: SGE, PBS, PBS Pro, LoadLeveler
High-level system components of GARUDA
  About GARUDA and its complex environment: multiple applications and VOs, PKI infrastructure
Indian Grid Certification Authority (IGCA)
  IGCA is an accredited member of APGridPMA.
  Issues X.509 certificates to support a secure environment for the Grid.
  Issues certificates for users and resources of the GARUDA grid, for institutes in India that do research in grid computing, and for foreign institutes that collaborate with GARUDA.
  http://ca.garudaindia.in
Interoperability with international grids
  Integrating the technological components of GARUDA and EGI
  gLite and Globus
  Customizing the GridWay meta-scheduler
  To run real-life applications across both infrastructures
Grid support challenges
  Support for interoperable grids: EU-India Grid, CHAIN
  Lack of a knowledge base
  Lack of tracking, prioritization and work allocation
  Different types of users: Grid users, developers, remote admins, VO managers, Grid certificate requests, application enablement
  Distributed resources and administrators
  Decentralized support requests: 80% of users request support via email, 20% make phone calls
  Distributed support teams: network, R&D, admin, security, HPC admins, application enablement
  Incident management: incidents, attacks and recovery
  Release management: software, portal and application updates
  Change management: addition and removal of resources, maintenance of resources
  We need a support system to handle all these challenges. It boils down to addressing each challenge/module.
Transforming Grid Support
  Centralized support system: moving from decentralized to centralized support, bringing all the support teams under one umbrella.
  Integrating distributed support teams.
  Tracking, prioritizing, categorizing and assigning tickets to the right team.
  Automating Grid operations: service recovery, testing of job submission (daily cron jobs), automated data backup and recovery, setting up high availability, user creation, DN mapping, certificate renewal notices, etc.
  Integrated FAQs and knowledge base.
  Integrated reporting and analytics: which service is getting more tickets, how tickets are being resolved, and how long an issue takes.
  Weekly review meetings: held with remote administrators over video conference, to resolve pending cases as soon as possible and to share experiences.
  Effective user support: the aim of this transformation.
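One of the automated operations named on this slide, the certificate renewal notice, could be sketched roughly as below. This is only an illustration under stated assumptions, not GARUDA's actual tooling: the function name `renewal_notices`, the 30-day warning window, and the example DNs are all hypothetical, and expiry dates are assumed to be already known rather than read from certificate files.

```python
from datetime import date, timedelta


def renewal_notices(certs, today, warn_days=30):
    """Return (dn, status, days) tuples for certificates needing attention.

    certs maps a certificate subject DN to its expiry date.
    warn_days is a hypothetical warning window, not a GARUDA policy.
    """
    cutoff = today + timedelta(days=warn_days)
    notices = []
    # Oldest expiry first, so the most urgent cases head the list.
    for dn, expires in sorted(certs.items(), key=lambda item: item[1]):
        if expires < today:
            notices.append((dn, "expired", (today - expires).days))
        elif expires <= cutoff:
            notices.append((dn, "expiring", (expires - today).days))
    return notices


# Hypothetical example data.
example_certs = {
    "/C=IN/O=EXAMPLE/CN=Alice": date(2018, 4, 1),
    "/C=IN/O=EXAMPLE/CN=Bob": date(2018, 3, 1),
    "/C=IN/O=EXAMPLE/CN=Carol": date(2019, 1, 1),
}
example_notices = renewal_notices(example_certs, today=date(2018, 3, 22))
```

Run from a daily cron job, a list like `example_notices` could feed an email reminder or open a ticket for each affected user.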
Transforming Grid Support
  Integrated ticketing system
    Convert all incoming emails, calls and chats into tickets.
    Prioritize, categorize and assign them to the right people.
  Integrating support teams
    The GARUDA grid has many support teams distributed across institutes: VO managers, HPC administrators, Grid administrators, portal and PSE developers, application enablers and security handling groups.
    The integrated ticketing system brings all of those distributed groups into a unified team.
    Enables team collaboration and avoids collisions.
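The "prioritize, categorize and assign" step can be illustrated with a small keyword-based triage sketch. The team names follow the groups listed on the slide, but the keyword table, the urgency words and the `triage` function are assumptions made for illustration; the slides do not describe how the actual ticketing system routes requests.

```python
import re
from dataclasses import dataclass

# Hypothetical routing table: first matching keyword set wins.
ROUTES = [
    ({"certificate", "proxy", "attack"}, "security"),
    ({"portal", "pse"}, "portal-developers"),
    ({"vo", "membership"}, "vo-managers"),
    ({"queue", "scheduler", "pbs", "sge"}, "hpc-admins"),
    ({"compile", "porting", "application"}, "application-enablement"),
]
DEFAULT_TEAM = "grid-admins"
URGENT_WORDS = {"down", "outage", "attack"}  # illustrative only


@dataclass
class Ticket:
    channel: str  # "email", "phone" or "chat"
    subject: str
    body: str
    team: str = DEFAULT_TEAM
    priority: str = "normal"


def triage(channel, subject, body):
    """Turn an incoming request into a categorized, assigned ticket."""
    words = set(re.findall(r"[a-z0-9]+", f"{subject} {body}".lower()))
    ticket = Ticket(channel, subject, body)
    for keywords, team in ROUTES:
        if words & keywords:
            ticket.team = team
            break
    if words & URGENT_WORDS:
        ticket.priority = "high"
    return ticket
```

For example, `triage("phone", "Portal is down", "Login page times out")` would be assigned to `portal-developers` with `high` priority, while a request matching no keyword falls back to `grid-admins`.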
Automating Grid operations
  Setting up a Grid Operation Center to monitor resources and events.
  Monitoring and recovering services automatically, using scripts.
  Simplifying registrations, certificate issuance, VO subscription, credential/proxy management, and mapping of certificate DNs across resources.
  Incident management and notifications.
  Configuration management (Chef.io or Puppet).
  Automated tests for compliance, security and other policies (https://www.inspec.io/).
  Grid operations monitor the job flow by submitting test jobs.
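A script that monitors services and recovers them automatically might look roughly like the sketch below. It assumes hosts managed by systemd and uses `systemctl is-active` and `systemctl restart`; the slides do not say which init system or scripts GARUDA used, so treat this as one possible shape rather than the actual implementation. The probe and recovery functions are passed in as parameters so the logic can be exercised without touching a real host.

```python
import subprocess


def is_active(service):
    """Assumes systemd: exit code 0 means the unit is active."""
    result = subprocess.run(["systemctl", "is-active", "--quiet", service])
    return result.returncode == 0


def restart(service):
    """Assumes systemd: exit code 0 means the restart was accepted."""
    result = subprocess.run(["systemctl", "restart", service])
    return result.returncode == 0


def check_and_recover(services, probe=is_active, recover=restart):
    """Probe each service; try one restart for any that is not running.

    Returns a dict mapping service name to "ok", "recovered" or "failed".
    """
    report = {}
    for service in services:
        if probe(service):
            report[service] = "ok"
        elif recover(service) and probe(service):
            report[service] = "recovered"
        else:
            report[service] = "failed"
    return report
```

Services reported as `failed` are the natural candidates for an automatically raised incident ticket, since a single restart did not bring them back.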
Integrated FAQs and knowledge base
  Frequently asked questions and a knowledge base are integrated into the ticketing system.
  This reduces the volume of tickets raised.
  We saw up to 40% of users using the knowledge base to solve their issues.
Reporting and analytics
  A ticketing system with integrated reporting and analytics helped us measure and understand the entire user experience.
  It helps to distinguish channels by the volume of tickets raised.
  ISO standards: adhering to ISO standards ensures users get reliable, timely and efficient grid services.
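The kind of report described here (ticket volume per channel, the service attracting the most tickets, and how long issues take) can be sketched with a few lines of aggregation. The ticket fields `channel`, `service`, `opened` and `resolved` are hypothetical; a real ticketing system would expose its own schema and built-in reports.

```python
from collections import Counter
from datetime import datetime
from statistics import mean


def summarize(tickets):
    """Aggregate a list of ticket dicts into a small report."""
    by_channel = Counter(t["channel"] for t in tickets)
    by_service = Counter(t["service"] for t in tickets)
    # Resolution time in hours, for resolved tickets only.
    hours = [
        (t["resolved"] - t["opened"]).total_seconds() / 3600
        for t in tickets
        if t.get("resolved")
    ]
    return {
        "by_channel": dict(by_channel),
        "top_service": by_service.most_common(1)[0][0] if by_service else None,
        "mean_resolution_hours": mean(hours) if hours else None,
        "open": sum(1 for t in tickets if not t.get("resolved")),
    }


# Hypothetical sample data.
sample = [
    {"channel": "email", "service": "job-submission",
     "opened": datetime(2016, 1, 4, 9), "resolved": datetime(2016, 1, 4, 11)},
    {"channel": "email", "service": "job-submission",
     "opened": datetime(2016, 1, 5, 9), "resolved": datetime(2016, 1, 5, 13)},
    {"channel": "phone", "service": "portal",
     "opened": datetime(2016, 1, 6, 9), "resolved": None},
]
report = summarize(sample)
```

The `open` count and the slowest categories are what a weekly review meeting would typically start from.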
Weekly review meetings
  Weekly review meetings are conducted to understand the issues.
  All remote support teams participate via video conference.
  The meetings help to discuss pending issues and to share expertise and suggestions.
Manual support vs. ticketing system
  How a ticketing system can improve user support when compared with manual support.
  Before (manual support)
    Identify, explain and assign to the right group
    Manual communication
    Monitor status
    Respond to users
    Timeline: 10 mins engage, 5 mins assign, 5 mins follow up, 5 mins resolve
  After (ticketing system)
    Ticket raised and automatically assigned to the right group
    Automated communication
    Automated response to users
    Timeline: 2 mins assign, 1 min resolved
  Set SLAs in the ticketing system.
No. of tickets: 20,000+ (stats from 2007 to 2016)
GARUDA stats (2007 to 2016)
  No. of jobs executed: 1,10,687
  No. of VOs: 10
  Total no. of users: 2500
  Active users: 500+
Lessons learnt
  Even Grid and HPC support systems can learn from industry best practices.
  Automation of grid operations helps reduce the volume of issues raised.
  Knowledge bases and FAQs helped users solve issues themselves.
  The ticketing system helped us offer effective user support.
  Collaboration is crucial to get everyone working on an issue.
  No tool provides all solutions. Hence, certain modules were developed in-house to deploy a complete ticketing system: knowledge base, phone-to-ticket, chat-to-ticket, automation of Grid operations, etc.
  Finally, we found that a proper monitoring methodology, automation and adherence to quality standards give good results in providing high grid availability, and hence effective user support.
Conclusion
  Our paper aims to share our experience in offering user support for a highly complex Grid computing environment, GARUDA.
  However, the approach will suit any HPC computing environment.
Thank you