May 2003 Statistical Exception Detection System, Based on MASF Technique Igor Trubin, Ph.D., Kevin McLaughlin Capital One Services, Inc.


1 May 2003 Statistical Exception Detection System, Based on MASF Technique Igor Trubin, Ph.D., Kevin McLaughlin Capital One Services, Inc. igor.trubin@capitalone.com

2 May 2003 Introduction: Environment
Capital One
– 6th largest card issuer in the United States
– Added to the S&P 500 in 1998
– Fortune 500 company since 2000
– Managed loans of $59.7 billion as of Q1 2003
– 47.4 million accounts as of Q1 2003
– CIO 100 Award “Master of the Customer Connection”
– Information Week “Innovation 100” Award Winner
– ComputerWorld “Top 100 places to work in IT”

3 May 2003 Three Significant Capacity Management Problems
– Filtering real capacity and performance issues from the noise
– Disk I/O type metrics, which lack natural thresholds such as those for CPU or bandwidth utilization
– Automatic recognition and web reporting/email alerting of potential performance/capacity issues
Task
– Web-based exception reporting against a computer performance database for a large, multi-platform environment
Methods
– Statistical Process Control (SPC)
– Multivariate Adaptive Statistical Filtering (MASF)
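The MASF idea named above can be illustrated with a minimal sketch. It assumes, following the usual description of the technique, that history is grouped by weekday and hour and that limits sit at the mean plus or minus three standard deviations. The grouping key, the multiplier `k`, and the function names are illustrative assumptions, not details taken from these slides or from the SAS implementation.

```python
from collections import defaultdict
from statistics import mean, pstdev


def masf_limits(history, k=3.0):
    """Build per-(weekday, hour) control limits from historical samples.

    history: iterable of (weekday, hour, value) tuples from past weeks.
    Returns {(weekday, hour): (lower, upper)} with limits at mean -/+ k * std.
    The grouping and k=3 are assumptions made for this sketch.
    """
    groups = defaultdict(list)
    for weekday, hour, value in history:
        groups[(weekday, hour)].append(value)
    limits = {}
    for key, values in groups.items():
        m = mean(values)
        s = pstdev(values)  # population standard deviation of the reference set
        limits[key] = (m - k * s, m + k * s)
    return limits


def find_exceptions(today, limits):
    """Return the samples from `today` that fall outside their limits."""
    exceptions = []
    for weekday, hour, value in today:
        bounds = limits.get((weekday, hour))
        if bounds is None:
            continue  # no history for this time slot, nothing to compare against
        lower, upper = bounds
        if value < lower or value > upper:
            exceptions.append((weekday, hour, value))
    return exceptions
```

Because each time slot has its own limits, a metric without a natural threshold, such as a disk I/O rate, is judged only against its own past behavior for that hour.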

4 May 2003 Review of the existing tools
– SAS/QC (Quality Control)
– JMP from SAS
Control charts for monitoring variations in a process under statistical control
[Figure: Variables Control Chart, XBar of TOT_MEM, INT_STA0=23]

5 May 2003 Review of the existing tools
– MASF: the Patrol Perform and Predict tool from BMC Software
– Other: tools for Teradata or Oracle; Concord eHealth DFN (Deviation From Normal)

6 May 2003 Result: Exception Detection System (EDS)
Performance database (PDB)
– SAS/ITSV; BMC Visualiser Database
Homemade programs
– SAS 8.2; Unix scripting (awk/sed/perl); VisualBasic.NET/SQL
Reporting
– Intranet web server; HTML, email
Special features
– exception estimation;
– statistical exception alerts;
– exceptions database and others…
Rules for what to exclude from consideration:
– noise (such as runaway processes);
– insignificant exceptions (such as slight workload increases on underutilized servers);
– other insignificant patterns, based on the analyst’s interpretation.

7 May 2003 EDS Structure
– Exception detectors for the most important metrics, such as CPU, memory and disk utilization, memory page rate, and CPU run queue;
– statistical process control daily profile chart generator;
– exception server name list generator;
– Leader/Outsider servers detector and detector of runaway processes;
– Exception Detection System database with history of exceptions;
– Leaders/Outsiders bar charts generator.
(see the next slide)

8 May 2003 SPC Chart for Web Publishing

9 May 2003 Application vs. Global SPC Charts

10 May 2003 Email Notification Sections
– Exception list: servers that had exceptions yesterday. Each server name is preceded by a sublist of the applications that also had exceptions, so the critical workload can be identified immediately.
– Null data list: servers that had no performance data for any reason (a data delivery problem).
– Insufficient data list: servers whose number of observations is below a certain quantity (the empirical rule is "< 6").
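The three notification sections can be sketched as a simple classification step. The input layout (a dictionary keyed by server name with `observations`, `exception`, and `apps` fields) is a hypothetical structure chosen for illustration. Only the "< 6" rule comes from the slide.

```python
def classify_servers(stats, min_obs=6):
    """Sort servers into the three email sections.

    stats: {server: {"observations": int, "exception": bool, "apps": [str]}}
           (a hypothetical layout, not the actual EDS record format).
    min_obs: the empirical "< 6" rule from the slide.
    """
    sections = {"exceptions": {}, "null_data": [], "insufficient_data": []}
    for server, info in sorted(stats.items()):
        n = info["observations"]
        if n == 0:
            # No performance data arrived at all: a data delivery problem.
            sections["null_data"].append(server)
        elif n < min_obs:
            # Too few observations for a meaningful statistical comparison.
            sections["insufficient_data"].append(server)
        elif info.get("exception", False):
            # Keep the applications that also had exceptions with the server.
            sections["exceptions"][server] = sorted(info.get("apps", []))
    return sections
```

Checking for missing and insufficient data first keeps servers with unreliable statistics out of the exception list.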

11 May 2003 Exception Database Structure

12 May 2003 Exception Database Example

13 May 2003 System performance daily web report based on EDS database History (see the next slide)

14 May 2003 Server Statistical Health Check against EDS Database (History of Exceptions)

15 May 2003 "Extra Volume" Metric

16 May 2003 "Extra Volume" Metric
ExtraVolume = UpperVolume + LowerVolume
UpperVolume > 0: the area between the upper limit curve and the actual data curve
LowerVolume < 0: the area between the lower limit curve and the actual data curve
For CPU utilization, ExtraVolume is the daily CPU time (ExtraTime) that the server used beyond its standard-deviation-based limits.
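A minimal sketch of this calculation follows. It assumes equally spaced samples and approximates each area with a rectangle rule, counting only the intervals where the actual value lies outside a limit. The function name, the `interval_hours` parameter, and the rectangle approximation are assumptions, not the authors' SAS code.

```python
def extra_volume(actual, lower, upper, interval_hours=1.0):
    """Approximate ExtraVolume = UpperVolume + LowerVolume for one day.

    actual, lower, upper: equal-length sequences sampled at the same times.
    Returns (extra, upper_volume, lower_volume). upper_volume is >= 0 and
    lower_volume is <= 0, matching the sign convention on the slide.
    """
    upper_volume = 0.0
    lower_volume = 0.0
    for value, lo, up in zip(actual, lower, upper):
        if value > up:
            upper_volume += (value - up) * interval_hours
        elif value < lo:
            lower_volume += (value - lo) * interval_hours  # negative contribution
    return upper_volume + lower_volume, upper_volume, lower_volume


# Hourly CPU utilization (%) against flat limits of 20% and 80%.
extra, up_vol, low_vol = extra_volume([50, 95, 85, 60], [20] * 4, [80] * 4)
# up_vol is 20.0 percent-hours: 15 from the second hour plus 5 from the third.
```

With utilization in percent and hourly samples, the result is in percent-hours. Dividing by 100 gives an ExtraTime in CPU-hours for a single-CPU equivalent.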

17 May 2003 TOP Leaders Charts (ExtraVolume >0)

18 May 2003 TOP Outsiders Charts (ExtraVolume <0)
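The Leaders and Outsiders lists can be sketched as two rankings over the per-server ExtraVolume. The input mapping, the function name, and the cutoff `n` are hypothetical. The slides state only that Leaders have ExtraVolume > 0 and Outsiders have ExtraVolume < 0.

```python
def top_leaders_and_outsiders(extra_by_server, n=10):
    """Rank servers by ExtraVolume.

    extra_by_server: {server_name: extra_volume} (hypothetical input layout).
    Leaders are servers with ExtraVolume > 0, largest first.
    Outsiders are servers with ExtraVolume < 0, most negative first.
    """
    positive = [(s, v) for s, v in extra_by_server.items() if v > 0]
    negative = [(s, v) for s, v in extra_by_server.items() if v < 0]
    leaders = sorted(positive, key=lambda pair: pair[1], reverse=True)[:n]
    outsiders = sorted(negative, key=lambda pair: pair[1])[:n]
    return leaders, outsiders
```

Servers with an ExtraVolume of exactly zero had no net excursion and appear in neither list.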

19 May 2003 EDS Database and Server Size Factor To compare servers with different configurations, the ExtraVolume for the CPU utilization metric can be converted to an abstract transaction rate using a TPC benchmark (see www.tpc.org and reference [3]).
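One way to picture the size adjustment is to scale each server's ExtraVolume by a benchmark rating, as in the sketch below. The rating values, the function name, and the simple proportional scaling are assumptions for illustration. The slides do not give the actual conversion formula, and the numbers are not real TPC results.

```python
def size_adjusted_extra_volume(extra_percent_hours, benchmark_rating):
    """Convert a CPU ExtraVolume into an abstract, size-adjusted quantity.

    extra_percent_hours: ExtraVolume for CPU utilization, in percent-hours.
    benchmark_rating: hypothetical throughput rating of the server at 100%
                      utilization (for example, derived from a TPC benchmark).
    Assumes work done scales linearly with utilization, which is a
    simplification made for this sketch.
    """
    return extra_percent_hours * benchmark_rating / 100.0


# Hypothetical servers: a large one with a small excursion and a small one
# with a large excursion.
large = size_adjusted_extra_volume(20.0, 10000.0)
small = size_adjusted_extra_volume(40.0, 2000.0)
```

In this made-up case the large server ranks first after adjustment even though its raw ExtraVolume is smaller, which is the effect the sizing factor is meant to expose.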

20 May 2003 EDS Report With Server Sizing Adjustment:

21 May 2003 Ad-hoc Analyses Against EDS Database

22 May 2003 This analysis leads to the following conclusions:
– The usage of the server's resources is not balanced.
– The CPU subsystem has excess capacity.
– The disk subsystem experienced most of the impact and is a possible performance and/or capacity bottleneck.
– Memory page rate had a few exceptions, which probably correlate with disk I/O activity and are not a concern.
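An ad-hoc analysis of this kind can be sketched as a count of historical exceptions per metric for one server. The record layout with `server` and `metric` fields is hypothetical. The actual EDS database structure is shown only as a figure in the original slides.

```python
from collections import Counter


def exceptions_by_metric(records, server):
    """Count historical exceptions per metric for one server.

    records: iterable of dicts with "server" and "metric" keys
             (a hypothetical stand-in for rows of the EDS database).
    """
    counts = Counter()
    for record in records:
        if record["server"] == server:
            counts[record["metric"]] += 1
    return counts


# Made-up history for illustration only.
history = [
    {"server": "srv1", "metric": "disk_io"},
    {"server": "srv1", "metric": "disk_io"},
    {"server": "srv1", "metric": "disk_io"},
    {"server": "srv1", "metric": "page_rate"},
    {"server": "srv2", "metric": "cpu"},
]
profile = exceptions_by_metric(history, "srv1")
```

A profile dominated by one subsystem, as in this made-up data, points to unbalanced resource usage of the kind described in the conclusions above.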

23 May 2003 Summary The Exception Detection System was developed by combining the classical SPC approach with some new ideas, such as an EDS database that keeps a history of exceptions and new integrative metrics like ExtraVolume that help analyze unusual consumption of server resources. An application level has been added to the system.

24 May 2003 Summary The system adequately supports the rapid growth of the company and, because it uses existing SAS tools, does not require buying new analysis software. Its efficiency has helped reduce both the reaction time to exceptions and the time needed to prepare exception reports.

25 May 2003 References
[1] Krajewski / Ritzman: "Operations Management", Addison-Wesley Publishing Company, Inc., 1990.
[2] Jeffrey Buzen and Annie Shum: "MASF – Multivariate Adaptive Statistical Filtering", Proceedings of the Computer Measurement Group, 1995, pp. 1-10.
[3] Bob Chan: "Unix Server Sizing - Or What to do When There are No MIPS", Proceedings of the Computer Measurement Group, 2000.
[4] Kevin McLaughlin and Igor Trubin: "Exception Detection System, Based on the Statistical Process Control Concept", Proceedings of the Computer Measurement Group, 2001.

26 May 2003 Thanks! Igor Trubin Tech Services Capacity Planning Capital One Services, Inc. igor.trubin@capitalone.com

