CS548 Fall 2017 Anomaly Detection


CS548 Fall 2017 Anomaly Detection Showcase
by Jun Dao, Qiming Wang, Emily Weber, Zijun Xu, Ruosi Zhang
Showcasing work by Harrou, F., Kadri, F., Chaabane, S., Tahon, C., Sun, Y. on "Improved principal component analysis for anomaly detection: Application to an emergency department"

References
[1] Harrou, F., Kadri, F., Chaabane, S., Tahon, C., Sun, Y. (2015). Improved principal component analysis for anomaly detection: Application to an emergency department. Computers and Industrial Engineering, 88, 63-77.
[2] Ruiz, C. Class Lecture, Topic: "Anomaly Detection." CS548, Worcester Polytechnic Institute, Worcester, MA, Nov. 9, 2017.
[3] Hines, J., Penha, R. (2001). Using Principal Component Analysis Modeling to Monitor Temperature Sensors in a Nuclear Research Reactor.

Data set
From: Pediatric Emergency Department (PED) in Lille Regional Hospital, France
Attributes: 10 time-series variables, each a daily number of patients, with a high degree of cross-correlation among the variables
Dates: daily records from January to December; 2011 data for training, 2012 data for testing
Data matrix: 362 rows × 10 columns

Arrival number (X1): Daily number of patient arrivals
Arrival means (X2): Daily number of patient arrivals not by emergency vehicle
CCMU1 (X3): Daily number of non-urgent patient arrivals
CCMU2 (X4): Daily number of patient arrivals with a stable prognosis
GEMSA2 (X5): Daily number of unexpected patients
Radiology (X6): Daily number of patient arrivals for radiology
Scanner (X7): Daily number of patient arrivals for scanner
Echography (X8): Daily number of patient arrivals for echography
Biology (X9): Daily number of patient arrivals for biology (labs)
Discharge-home (X10): Daily number of patients discharged (sent home)

Introduction
Issue:
  From the National Academies of Sciences, Engineering, and Medicine: between 1993 and 2003 in the U.S., patient numbers increased by 26% while the number of emergency departments (EDs) decreased by 9%
  Patient influx to EDs generates strain situations that affect building safety and reliability
Solution:
  Detecting abnormal demands on EDs will improve the management of patients and medical resources
Technique: Anomaly Detection

Figures (taken from [1]):
  Monthly PED arrivals: actual number of arrivals per month from January 2011 to December 2011
  Daily PED arrivals

Anomaly Detection: PCA based Statistical Modeling
Build a profile of "normal behavior"
  Use a training set of "normal" operations containing no anomalies
  Scale the training set to have zero mean and unit variance
  Build a PCA model using the training set
  Compute control limits for normal operations
Use the "normal" profile to detect anomalies
  Scale each new point with the mean and standard deviation from the training set
  For the new point, calculate residuals using the PCA model
  Compute the monitoring statistic for the new point
Wording taken from [2]
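The training steps on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names (fit_pca_model, scale_new), the choice of 3 retained components, and the synthetic stand-in data are all assumptions made for the example.

```python
import numpy as np

def fit_pca_model(X_train, n_components):
    """Scale anomaly-free training data and fit PCA by eigendecomposition."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    Xs = (X_train - mean) / std                 # zero mean, unit variance
    cov = (Xs.T @ Xs) / (Xs.shape[0] - 1)       # correlation matrix of the raw data
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]           # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    return {
        "mean": mean,
        "std": std,
        "P": eigvecs[:, :n_components],         # retained loadings
        "eigvals": eigvals,                     # all eigenvalues, descending
    }

def scale_new(x_new, model):
    """Scale a new observation with the TRAINING mean and standard deviation."""
    return (x_new - model["mean"]) / model["std"]

# Synthetic stand-in for the 362 x 10 training matrix (assumption, not PED data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(362, 10))
model = fit_pca_model(X_train, n_components=3)
x_scaled = scale_new(X_train[0], model)
```

Scaling new points with the training statistics, rather than their own, keeps the test data on the same footing as the "normal" profile.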

Anomaly Detection: PCA based Statistical Modeling
Definition of outlier: An outlier is a time period that has an abnormal number of patient arrivals
Anomaly score function: A data instance's monitoring statistic is greater than the control limits of normal operations
How does the approach work?
  Calculate control limits for normal operations ($T^2_\alpha$ or $Q_\alpha$)
  Calculate the monitoring statistic for the new point ($T^2$ or $Q$)
  If $T^2 > T^2_\alpha$ or $Q > Q_\alpha$, then an anomaly is declared
Wording taken from [2]

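PCA based statistical monitoring
Raw data matrix $X$
Decomposition of the scaled data $X_s$ into a process subspace and a residual subspace:
$$X_s = [\,T \mid \tilde{T}\,]\,[\,P \mid \tilde{P}\,]^T = T P^T + \tilde{T}\tilde{P}^T = \underbrace{X_s P P^T}_{\hat{X}} + \underbrace{X_s (I_m - P P^T)}_{E}$$
Taken from [1]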
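As a quick numerical check of this decomposition, the sketch below splits a scaled matrix into its modeled part and its residual and confirms that the two add back up. The data are synthetic and the number of retained components (l = 2) is an arbitrary assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                      # synthetic data (assumption)
Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

eigvals, eigvecs = np.linalg.eigh(np.cov(Xs, rowvar=False))
order = np.argsort(eigvals)[::-1]                 # descending eigenvalues
l = 2                                             # retained components (assumption)
P = eigvecs[:, order[:l]]                         # process-subspace loadings
P_res = eigvecs[:, order[l:]]                     # residual-subspace loadings

T = Xs @ P                                        # retained scores
T_res = Xs @ P_res                                # residual scores
X_hat = T @ P.T                                   # modeled part, Xs P P^T
E = Xs @ (np.eye(Xs.shape[1]) - P @ P.T)          # residual part
```

The residual E computed from the projector equals the product of the residual scores and residual loadings, which is the equality between the two middle expressions in the formula.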

Control Limits and Monitoring Statistics
Hotelling's $T^2$ statistic
Measures the variation within the PCA model
Monitoring statistic:
$$T^2 = x_s^T P \Lambda^{-1} P^T x_s = \sum_{i=1}^{l} \frac{t_i^2}{\lambda_i}$$
where $\Lambda$ is the diagonal matrix with the eigenvalues of the retained PCs
Control limit:
$$T^2_\alpha = \chi^2_{l,\alpha}$$
where $\alpha$ is the level of significance (between 1% and 5%)
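A small sketch of the T2 computation, as the sum of squared scores divided by their eigenvalues. To stay dependency-free, the chi-square quantile is obtained here with the Wilson-Hilferty approximation, which is an assumption of this example and not something the slides prescribe; the toy loadings, eigenvalues, and data point are likewise made up.

```python
import numpy as np
from statistics import NormalDist

def hotelling_t2(x_s, P, eigvals_retained):
    """T2 = sum_i t_i^2 / lambda_i over the l retained components."""
    t = P.T @ x_s                                  # scores of the scaled point
    return float(np.sum(t**2 / eigvals_retained))

def chi2_upper_quantile(df, alpha):
    """Approximate upper-alpha chi-square quantile (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    a = 2.0 / (9.0 * df)
    return float(df * (1.0 - a + z * np.sqrt(a)) ** 3)

# Toy model: 3 variables, 2 retained components aligned with the axes.
P = np.eye(3)[:, :2]
eigvals_retained = np.array([2.0, 0.5])
x_s = np.array([2.0, 1.0, 5.0])

t2 = hotelling_t2(x_s, P, eigvals_retained)        # 4/2 + 1/0.5
t2_limit = chi2_upper_quantile(df=2, alpha=0.05)
is_anomaly = t2 > t2_limit
```

In this toy case the large third coordinate lies entirely outside the retained subspace, so T2 does not react to it; that part of the point is what the residual-based statistic is meant to catch.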

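Control Limits and Monitoring Statistics
Q statistic
Measures how well the new point fits the PCA model
Monitoring statistic:
$$Q = \|(I - P P^T)\, x_s\|^2 = \|E\|^2$$
the squared distance of the new point from the PCA model
Control limit:
$$Q_\alpha = \varphi_1 \left[ \frac{h_0\, c_\alpha \sqrt{2\varphi_2}}{\varphi_1} + 1 + \frac{\varphi_2\, h_0 (h_0 - 1)}{\varphi_1^2} \right]^{1/h_0}$$
where $\varphi_i = \sum_{j=l+1}^{m} \lambda_j^i$ for $i = 1, 2, 3$, and $h_0 = 1 - \dfrac{2\varphi_1\varphi_3}{3\varphi_2^2}$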
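The Q statistic and its control limit can be sketched as below, using the standard Jackson-Mudholkar form of the limit. The slide does not define c_alpha; this example assumes it is the standard normal quantile at 1 - alpha, which is the usual convention. The residual eigenvalues and the toy point are invented for illustration, and the function does not guard against degenerate cases such as a non-positive h0.

```python
import numpy as np
from statistics import NormalDist

def q_statistic(x_s, P):
    """Squared norm of the part of x_s not explained by the retained PCs."""
    residual = x_s - P @ (P.T @ x_s)               # (I - P P^T) x_s
    return float(residual @ residual)

def q_control_limit(eigvals_residual, alpha):
    """Q_alpha computed from the eigenvalues of the discarded components."""
    lam = np.asarray(eigvals_residual, dtype=float)
    phi1, phi2, phi3 = (np.sum(lam**i) for i in (1, 2, 3))
    h0 = 1.0 - 2.0 * phi1 * phi3 / (3.0 * phi2**2)
    c_alpha = NormalDist().inv_cdf(1.0 - alpha)    # assumption: normal quantile
    bracket = (h0 * c_alpha * np.sqrt(2.0 * phi2) / phi1
               + 1.0
               + phi2 * h0 * (h0 - 1.0) / phi1**2)
    return float(phi1 * bracket ** (1.0 / h0))

P = np.eye(3)[:, :2]                               # toy loadings (assumption)
x_s = np.array([2.0, 1.0, 5.0])
q = q_statistic(x_s, P)                            # only the third coordinate remains

q_limit = q_control_limit([0.3, 0.2, 0.1], alpha=0.05)
is_anomaly = q > q_limit
```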

Taken from [1]

Problems with PCA based Statistical Modeling
The T2 and Q statistics cannot detect small anomalies
Both statistics depend heavily on how many principal components are kept
A detector is needed that has higher sensitivity and is less dependent on the number of PCs

PCA based MCUSUM Anomaly Detection Taken from [1]

PCA based MCUSUM Anomaly Detection
Multivariate Cumulative Sum (MCUSUM) control chart
  Used to monitor the uncorrelated residuals obtained from the PCA model
  Normal operations = residuals close to zero
  Abnormal operations = residuals that deviate from zero, indicating a new condition that is different from normal operations
Detecting an anomaly is done almost the same way as before, except:
  Monitoring statistic → decision function $C_t$
  Control limits → $H$, where $H$ is chosen by simulation to provide a pre-defined in-control Average Run Length
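The slides do not say which MCUSUM variant is used or how its parameters are set, so the sketch below uses one common formulation (Crosier's MCUSUM) purely as an illustration. The reference value k, the threshold H, the identity covariance, and the synthetic residuals are all hypothetical; in practice H would be tuned by simulation to reach the desired in-control Average Run Length.

```python
import numpy as np

def mcusum(residuals, cov, k, H, target=None):
    """Crosier-style MCUSUM: returns the chart statistic and alarm flags."""
    residuals = np.asarray(residuals, dtype=float)
    n, m = residuals.shape
    a = np.zeros(m) if target is None else np.asarray(target, dtype=float)
    cov_inv = np.linalg.inv(cov)                   # assumes a nonsingular covariance
    S = np.zeros(m)                                # cumulative sum vector
    stats = np.empty(n)
    for t in range(n):
        d = S + residuals[t] - a
        C = np.sqrt(d @ cov_inv @ d)               # length of the updated sum
        # Shrink the sum toward zero by k, or reset it when it is short.
        S = np.zeros(m) if C <= k else d * (1.0 - k / C)
        stats[t] = np.sqrt(S @ cov_inv @ S)
    return stats, stats > H

# Synthetic residuals: in control for 20 samples, then a small sustained shift.
residuals = np.zeros((30, 2))
residuals[20:, 0] = 1.0
stats, alarms = mcusum(residuals, cov=np.eye(2), k=0.5, H=3.2)
first_alarm = int(np.argmax(alarms))               # index of the first alarm
```

Because the sum accumulates evidence over consecutive samples, a shift that is too small to cross a single-sample limit can still raise an alarm after a few observations.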

Experiments
Abrupt anomaly: sudden increase in patient arrivals
  Case A1: Add 50% of the total variation in X1 to samples 141 to 147 in the testing set
  Case A2: Add 25% of the total variation in X1 to samples 141 to 147 in the testing set
  Case single-data strain: Add 25% of the total variation in X1 to sample 147
Gradual anomaly: a slow increase in patient arrivals
  Case B: A slow gradual anomaly with slope = 0.1 is added to X1
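A rough sketch of how such anomalies might be injected into a test matrix. Two things here are assumptions rather than facts from the slides: "total variation" is interpreted as the range (max minus min) of the column, and the gradual anomaly is given a hypothetical starting sample of 150 because the slide does not state one. Sample numbers are treated as 1-indexed and inclusive.

```python
import numpy as np

def add_abrupt_anomaly(X, col, first, last, fraction):
    """Add fraction * range(column) to samples first..last (1-indexed, inclusive)."""
    X_out = X.astype(float)                        # astype returns a copy
    magnitude = fraction * np.ptp(X[:, col])       # assumption: variation = range
    X_out[first - 1:last, col] += magnitude
    return X_out

def add_gradual_anomaly(X, col, start, slope):
    """Add a linear ramp with the given slope from sample `start` onward."""
    X_out = X.astype(float)
    steps = np.arange(X.shape[0] - (start - 1))    # 0, 1, 2, ... from `start`
    X_out[start - 1:, col] += slope * steps
    return X_out

# Synthetic test matrix whose first column spans a range of 20 (assumption).
X_test = np.zeros((200, 10))
X_test[:, 0] = np.linspace(10.0, 30.0, 200)

case_a1 = add_abrupt_anomaly(X_test, col=0, first=141, last=147, fraction=0.50)
case_a2 = add_abrupt_anomaly(X_test, col=0, first=141, last=147, fraction=0.25)
single = add_abrupt_anomaly(X_test, col=0, first=147, last=147, fraction=0.25)
case_b = add_gradual_anomaly(X_test, col=0, start=150, slope=0.1)
```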

Case A1 Taken from [1]

Case A2 Taken from [1]

Single-data strain Taken from [1]

Case B Taken from [1]

Conclusion
Detection of abnormal demand for patient care is beneficial for reactive control of strain situations
Knowing when abnormalities take place can:
  Help managers be proactive in preparing for them
  Determine when and why abnormalities occur
  Help managers act quickly when a strain situation occurs

Questions?