HP BSM implementation summary
IVAN TABARAVETS
May 25, 2015
PROBLEM STATEMENT

INCIDENT PROCESSING EFFICIENCY
Incidents are identified and reported by users.
Processing efficiency depends heavily on assignment precision.
Alert storms negate effective incident prioritization.

SERVICE LEVEL CONTROL
Independent monitoring of system operations results in a lack of service health awareness.
Unacceptable service performance is recognized reactively, upon report analysis.

USER EXPERIENCE
Technically healthy applications provide an unsatisfying user experience.
Lack of user behavior emulation monitoring leads to long-standing usability issues.
KEY HIGHLIGHTS: INCIDENT PROCESSING STATISTICS
ONLY 9% PROACTIVELY RESOLVED
The overwhelming majority of incidents related to application services are identified and reported by users.
In most cases, even incidents identified and addressed by the service team are discovered by users before they are resolved.
KEY HIGHLIGHTS: INCIDENT PROCESSING EFFICIENCY
80% OF OVERHEAD IN 17% OF THE JOB
In most cases, incidents are handled with an outstanding efficiency score of 97%.
However, processing efficiency depends heavily on assignment precision, and each consecutive reassignment dramatically decreases it.
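The "80% of overhead in 17% of the job" figure is a concentration metric: a small share of incidents carries most of the handling overhead. Below is a minimal Python sketch of how such a figure could be computed, assuming the 17% refers to incidents that needed at least one reassignment and that each incident record carries a reassignment count and an overhead estimate in hours. The `Incident` record and `overhead_concentration` function are hypothetical illustrations, not HP BSM or ticketing-system APIs.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative assumptions."""
    incident_id: str
    reassignments: int      # how many times the ticket was reassigned
    overhead_hours: float   # time spent on routing rather than on fixing


def overhead_concentration(incidents):
    """Return (share of incidents with >= 1 reassignment,
    share of total overhead those incidents carry)."""
    total_overhead = sum(i.overhead_hours for i in incidents)
    if not incidents or total_overhead == 0:
        return 0.0, 0.0
    # Assumption: "the job" is split by whether an incident was ever reassigned.
    reassigned = [i for i in incidents if i.reassignments > 0]
    job_share = len(reassigned) / len(incidents)
    overhead_share = sum(i.overhead_hours for i in reassigned) / total_overhead
    return job_share, overhead_share
```

For example, three incidents with one hour of overhead each plus one incident reassigned twice with twelve hours of overhead yields `(0.25, 0.8)`: a quarter of the job carrying 80% of the overhead.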
KEY HIGHLIGHTS: SERVICE LEVEL CONTROL
IS IT AN ISSUE WITH SLA?
System operation monitoring, often carried out by separate teams, does not guarantee sufficient awareness of service performance for complex applications.
SLA breaches are typically identified upon report analysis and are not addressed in a timely manner.
[Figure: example architecture of a complex application, with data sources feeding a B2B ingestion layer, an operational data store layer, a central repository layer, a data delivery layer and a data archive]
PROJECT OVERVIEW

THE GOAL
Enhance incident identification and processing efficiency, enable end-to-end control over service delivery conditions, and improve user experience.

THE PROJECT
Implement a company-wide event management system built on the HP BSM platform, covering the application services essential for delivery operations.

THE EFFORT
The first stage of the project required a ten-month engagement of a project manager, a solution architect and three engineers, totaling approximately thirty man-months of effort.
KEY HIGHLIGHTS
3 OF 60 APPS FOR 30% OF INCIDENTS
Project scope definition was driven by application service complexity, monthly incident flow and impact on delivery operations.
The first stage of the implementation aimed to cover EPAM Cloud, Atlassian Jira and Atlassian Confluence.
[Figure: application services considered, including Cloud, Confluence, CTC, Jira, Git and Exchange]
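One way to arrive at a "few apps, large share of incidents" scope is a simple Pareto cut over monthly incident counts. The sketch below is a minimal Python illustration under that assumption only: it ranks applications by incident volume and takes the fewest that reach a target share. The `shortlist` function and the app names in the example are hypothetical, and the sketch deliberately ignores the complexity and delivery-impact criteria the slide also mentions.

```python
from __future__ import annotations


def shortlist(monthly_incidents: dict[str, int], target_share: float) -> tuple[list[str], float]:
    """Pick the fewest apps, by descending incident count, whose combined
    share of all incidents reaches target_share. Returns (apps, share covered)."""
    total = sum(monthly_incidents.values())
    if total == 0:
        return [], 0.0
    # Rank by incident volume only; complexity and impact are not modeled here.
    ranked = sorted(monthly_incidents.items(), key=lambda kv: kv[1], reverse=True)
    chosen, covered = [], 0
    for app, count in ranked:
        if covered / total >= target_share:
            break
        chosen.append(app)
        covered += count
    return chosen, covered / total
```

With three hypothetical apps at 15, 10 and 5 incidents per month and seventy more at one incident each, `shortlist(apps, 0.3)` returns the first three apps covering 30% of incidents.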
PROJECT ROADMAP
Activity: deliverables
Architecture Definition: Architecture vision
Functional Analysis and Solution Design: Architecture definition document; Solution requirements specification
Solution Implementation: Solution deployed
Process Engineering: Updated process documentation
Data Source Connectors Development: Integrated data sources
Integration with Event Handling Systems: Integrated with event handling systems
Rollout to Production: Solution deployed to production
Discover Service Topology: Service topology maps
Discover Service Delivery Parameters: System operation metrics mapped to service delivery parameters
Define Event Handling Logic: Event classes associated with handlers
Classify Event Handlers: Event handlers catalog
Classify Events: Event classes catalog
QA: Issues in defect tracking system; Test specification, artifacts & results
SOLUTION DESIGN

PRESENTATION LAYER
Service Health Dashboard (Service Health Analyzer): visualize service health and present consolidated information about it in real time.

PROCESSING LAYER
Event Management Subsystem (Event Management System, Operation Analytics, Event Handlers): analyze historical data, call applicable event handlers and update service health information in uCMDB.
Event Correlation Subsystem (Event Correlation System, Event Log): map events to service topology data from uCMDB; consolidate, filter and log events to cut excessive information.

AGGREGATION LAYER
Data Aggregation Subsystem (Data Aggregation System, Monitoring Systems): capture monitoring data from the underlying monitoring systems and present it in a unified form of events.
uCMDB: holds the service topology data used for correlation and the service health information updated by event handlers.
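The correlation step described in the solution design (map unified events to service topology, then filter excessive information) can be sketched in a few lines. This is a minimal Python illustration, not HP BSM's correlation engine: it assumes a plain dictionary from configuration item to services stands in for uCMDB topology, treats repeats of the same (CI, event class) within a time window as an alert storm, and drops events whose CI is not in the topology. `Event`, `correlate`, `window_s` and the CI names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Unified event as produced by an aggregation layer (illustrative fields)."""
    source: str        # monitoring system that raised the event
    ci: str            # configuration item, e.g. a host or database
    event_class: str   # e.g. "disk_full", "cpu_high"
    timestamp: float   # seconds since epoch


def correlate(events: list[Event], ci_to_services: dict[str, list[str]],
              window_s: float = 300.0) -> list[tuple[Event, list[str]]]:
    """Return (event, impacted services) pairs after storm suppression and
    topology filtering. ci_to_services stands in for uCMDB topology (assumption)."""
    last_seen: dict[tuple[str, str], float] = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.ci, ev.event_class)
        prev = last_seen.get(key)
        # Sliding window: every repeat refreshes the timer, so a continuous
        # storm stays suppressed until it pauses for window_s seconds.
        last_seen[key] = ev.timestamp
        if prev is not None and ev.timestamp - prev < window_s:
            continue
        services = ci_to_services.get(ev.ci, [])
        if not services:
            continue  # assumption: CIs outside the topology are treated as noise
        kept.append((ev, sorted(services)))
    return kept
```

Refreshing the window on every repeat is one design choice among several; a fixed window from the first occurrence would instead let one event through every `window_s` seconds during a long storm.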
PROJECT OUTCOMES

INCIDENT PROCESSING EFFICIENCY
Up to 25% of incidents are now resolved proactively, nearly a threefold increase over the 9% baseline.
Incident processing overhead was cut by approximately 150 hours per month.

SERVICE LEVEL CONTROL
System operation monitoring parameters are mapped to service delivery objectives.
Service health information is reported in real time, allowing timely corrective action.

USER EXPERIENCE
Proactive incident handling and better control over SLAs improved user experience.
User behavior emulation helped us discover and address dozens of usability issues.
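Mapping system operation monitoring parameters to service delivery objectives means each service's health can be derived directly from the latest metric values rather than from after-the-fact reports. The sketch below is a minimal Python illustration of that idea, assuming each objective is a single metric with a threshold; the `Objective` type, `service_health` function, metric names and thresholds are hypothetical and do not reflect how HP BSM Service Health computes status.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """A service delivery objective bound to one monitored metric (illustrative)."""
    metric: str
    threshold: float
    higher_is_worse: bool = True   # False for metrics like availability


def service_health(objectives: dict[str, list[Objective]],
                   sample: dict[str, float]) -> dict[str, str]:
    """Return 'OK', 'BREACH' or 'UNKNOWN' per service for one metric sample.
    A breach outranks missing data, and missing data outranks OK."""
    status = {}
    for service, objs in objectives.items():
        state = "OK"
        for o in objs:
            value = sample.get(o.metric)
            if value is None:
                state = "UNKNOWN"  # no data for this objective yet
                continue
            breached = value > o.threshold if o.higher_is_worse else value < o.threshold
            if breached:
                state = "BREACH"
                break
        status[service] = state
    return status
```

Evaluated on every incoming sample, a function like this gives the real-time health view the outcome describes; the thresholds themselves have to come from a well-defined SLA.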
LESSONS LEARNT

DEFINE SERVICE DELIVERY PARAMETERS
Ensure you have a well-defined SLA for application services to map system monitoring parameters to.

DISCOVER SERVICE TOPOLOGY MAPS
Service topology discovery and continuous updating are an absolute must to make things work.

PUT TOGETHER EVENT HANDLING PROCEDURES
Event classes are many, but event handlers are typically few, so an event handlers catalog is a good starting point.

OVERESTIMATE INTEGRATION EFFORTS
This is a lesson you can never completely learn: always overestimate your integration efforts and control your project scope to deliver results on time and on budget.
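The "many event classes, few event handlers" observation suggests organizing event handling as a catalog that maps each event class to one of a small set of handlers. A minimal Python sketch follows, assuming handlers are plain functions and unknown classes fall back to logging; every event class, handler name and return string here is hypothetical and stands in for real actions such as paging or ticket creation.

```python
from __future__ import annotations

from typing import Callable

Handler = Callable[[dict], str]


# A handful of handlers (hypothetical); each returns a description of the action.
def notify_on_call(event: dict) -> str:
    return f"page on-call for {event['ci']}"


def create_incident(event: dict) -> str:
    return f"open incident for {event['ci']}"


def log_only(event: dict) -> str:
    return f"log {event['class']}"


# Hypothetical catalog: many event classes share the same few handlers.
HANDLER_CATALOG: dict[str, Handler] = {
    "disk_full": create_incident,
    "db_connection_lost": create_incident,
    "service_down": notify_on_call,
    "synthetic_transaction_failed": notify_on_call,
    "cpu_high": log_only,
    "backup_completed": log_only,
}


def dispatch(event: dict, default: Handler = log_only) -> str:
    """Look up the handler for the event's class; unknown classes get the default."""
    return HANDLER_CATALOG.get(event["class"], default)(event)
```

Starting from the short list of handlers and then assigning event classes to them keeps the catalog small as new event classes are discovered.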
THANKS FOR YOUR ATTENTION
IVAN TABARAVETS
IVAN_TABARAVETS@EPAM.COM
375.029.686.9232