
Proximity-Based Anomaly Detection using Sparse Structure Learning
Tsuyoshi Idé (IBM Tokyo Research Lab), Aurelie C. Lozano, Naoki Abe, and Yan Liu (IBM T. J. Watson Research Center)
SDM 2009 | Tokyo Research Laboratory © Copyright IBM Corporation 2009

Page 2/16 | Goal: compute an anomaly score for each variable that captures anomalous behavior in the dependencies between variables.
- Some anomalies cannot be detected by looking at individual variables alone (e.g., no increase in RPM when accelerating); comparing against reference data reveals, for instance, that "something is wrong between x_2 and x_4".
- In practice, we need to reduce the pairwise dependency information to a single anomaly score per variable (figure: bar chart of anomaly score vs. variable).

Page 3/16 | Difficulty -- correlation values are extremely unstable (1/2): an example from econometric data.
- Data: daily spot prices of foreign currencies in dollars, over two different terms.
- There is no evidence that the international relationships changed between the terms.
- However, most of the correlation coefficients are completely different between the two terms (figure: correlation matrices for term 1 and term 2, with data source).
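A minimal sketch of the kind of comparison this slide describes, using synthetic random-walk series rather than the actual currency data: it estimates correlation matrices on two disjoint time windows and looks at how much the entries move.

```python
# Illustrative only: synthetic data, not the spot-price data from the slide.
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((400, 6))   # 400 "days", 6 series
X = np.cumsum(X, axis=0)            # random-walk-like price series

term1, term2 = X[:200], X[200:]
C1 = np.corrcoef(term1, rowvar=False)   # correlation matrix, term 1
C2 = np.corrcoef(term2, rowvar=False)   # correlation matrix, term 2

# Even with no structural change between the windows, many correlation
# coefficients differ substantially between the two estimates.
print("max |difference| over all pairs:", np.abs(C1 - C2).max())
```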

Page 4/16 | Difficulty -- correlation values are extremely unstable (2/2): we can make meaningful comparisons by focusing on neighborhoods.
- Important observation: highly correlated pairs are stable.
- Therefore, look only at the neighborhood of each variable to make the comparison robust.

Page 5/16 | We want to remove spurious dependencies caused by noise and keep only the essential dependencies.
- Input: multivariate (time-series) data.
- Output: a weighted graph representing the essential dependencies between the variables; the graph will be sparse.
- Node = variable; edge = dependency between two variables; no edge = the two variables are independent of each other.

Page 6/16 | Approach: (1) select neighbors using a sparse structure learning method, (2) compute an anomaly score based on the selected neighbors.
- Our problem: compute an anomaly (or change) score for each variable by comparing the target data with reference data.
- (Figure: pipeline of (1) sparse structure learning and (2) scoring each variable, applied to both target and reference data.)

Page 7/16 | We use the Graphical Gaussian Model (GGM) for structure learning, where each graph is uniquely related to a precision matrix.
- Precision matrix Λ = inverse of the covariance matrix S.
- General rule: there is no edge between x_i and x_j if the corresponding element Λ_ij of the precision matrix is zero.
- Ex. 1: if Λ_12 = 0, there is no edge between x_1 and x_2, implying they are statistically independent given the rest of the variables. Why? Because this condition factorizes the distribution.
- Ex. 2: a six-variable case (graph shown on the slide).
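A minimal sketch of the edge-reading rule above: given a sparse precision matrix, an edge exists wherever an off-diagonal entry is nonzero (here up to a small tolerance). The matrix below is a made-up example, not data from the paper.

```python
import numpy as np

# Made-up sparse precision matrix for 4 variables (illustrative only).
Lambda = np.array([
    [ 1.2, -0.4,  0.0,  0.0],
    [-0.4,  1.5, -0.3,  0.0],
    [ 0.0, -0.3,  1.1, -0.2],
    [ 0.0,  0.0, -0.2,  0.9],
])

tol = 1e-8
d = Lambda.shape[0]
adjacency = (np.abs(Lambda) > tol) & ~np.eye(d, dtype=bool)
edges = [(i, j) for i in range(d) for j in range(i + 1, d) if adjacency[i, j]]
print(edges)   # [(0, 1), (1, 2), (2, 3)] -> x_1-x_2, x_2-x_3, x_3-x_4 are connected
```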

Page 8/16 | Recent trends in GGM: classical methods are being replaced with modern sparse algorithms.
- Covariance selection (classical method):
  - Dempster [1972]: sequentially prunes the smallest elements of the precision matrix.
  - Drton and Perlman [2008]: improved statistical tests for pruning.
  - Serious limitation in practice: breaks down when the covariance matrix is not invertible.
- L1-regularization-based methods (currently very active):
  - Meinshausen and Bühlmann [Ann. Stat. 2006]: Lasso regression for neighborhood selection.
  - Banerjee et al. [JMLR 2008]: block sub-gradient algorithm for finding the precision matrix.
  - Friedman et al. [Biostatistics 2008]: efficient fixed-point equations based on a sub-gradient algorithm.
  - Structure learning is possible even when # variables > # samples.

Page 9/16 | One-page summary of the Meinshausen-Bühlmann (MB) algorithm: solve a separate Lasso problem for each variable.
- Step 1: pick one variable (say the k-th).
- Step 2: treat it as the target "y" and the remaining variables as the predictors "z".
- Step 3: solve the Lasso regression problem between y and z, obtaining a sparse weight vector w.
- Step 4: connect the k-th node to the nodes whose weights in w are nonzero. (See the sketch below.)
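A minimal sketch of the four steps above using scikit-learn's Lasso. The regularization value and the OR symmetrization rule are placeholder choices, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import Lasso

def mb_neighborhood_selection(X, alpha=0.1):
    """Sketch of Meinshausen-Buhlmann neighborhood selection.

    X: (n_samples, n_variables), assumed standardized.
    Returns a boolean adjacency matrix (OR rule for symmetrization).
    """
    n, d = X.shape
    adjacency = np.zeros((d, d), dtype=bool)
    for k in range(d):
        y = X[:, k]                             # Step 2: k-th variable is the target
        z = np.delete(X, k, axis=1)             # the rest are the predictors
        w = Lasso(alpha=alpha).fit(z, y).coef_  # Step 3: sparse regression weights
        others = [j for j in range(d) if j != k]
        for j, wj in zip(others, w):            # Step 4: nonzero weight -> edge
            if abs(wj) > 1e-10:
                adjacency[k, j] = adjacency[j, k] = True
    return adjacency
```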

Page 10/16 | Instead, we solve an L1-regularized maximum likelihood problem for structure learning.
- Input: covariance matrix S.
  - Assumes standardized data (mean 0, variance 1).
  - S is generally rank-deficient, so its inverse does not exist.
- Output: sparse precision matrix Λ.
  - Originally, Λ is defined as the inverse of S, but S is not directly invertible.
  - We need to find a sparse matrix that can be thought of as an inverse of S.
- Approach: solve the L1-regularized maximum likelihood problem
  Λ* = argmax_Λ [ ln det Λ − tr(SΛ) − ρ ||Λ||_1 ],
  where the first two terms are the Gaussian log likelihood and the last term is the sparsity regularizer.
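One practical way to solve this objective is scikit-learn's GraphicalLasso estimator, which implements the graphical lasso of Friedman et al. The regularization value and the toy data below are placeholders, not settings from the paper.

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))            # toy data: 50 samples, 10 variables
X = (X - X.mean(axis=0)) / X.std(axis=0)     # standardize (mean 0, variance 1)

model = GraphicalLasso(alpha=0.2).fit(X)     # alpha plays the role of rho above
Lambda = model.precision_                    # sparse precision matrix estimate
off_diag = ~np.eye(Lambda.shape[0], dtype=bool)
print("fraction of zero off-diagonal entries:",
      (np.abs(Lambda[off_diag]) < 1e-8).mean())
```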

Page 11/16 | From matrix optimization to vector optimization: solving coupled Lasso problems, one for each variable.
- Focus on one row (column) of Λ at a time, keeping the others fixed.
- The optimization problem for that row (the "blue" vector in the slide) turns out to be a Lasso problem, i.e., L1-regularized quadratic programming (see the paper for the derivation).
- Difference from MB's algorithm: the resulting Lasso problems are coupled.
  - The "gray" part is not actually constant; it changes after each Lasso problem is solved.
  - This coupling is essential for stability under noise, as discussed later.

Page 12/16 | Defining the anomaly score using the sparse graphical models.
- We now have two sparse Gaussians, one for the reference data (A) and one for the target data (B).
- We use the Kullback-Leibler divergence between them as the discrepancy metric.
- The resulting anomaly score of the i-th variable decomposes as
  d_i^{AB} = (change in the degree of node x_i) + (change in the "tightness" of node x_i) + (change in the variance of x_i itself),
  where the first two terms compare the neighbors of x_i in data set A with its neighbors in data set B.
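The paper derives a closed-form per-variable decomposition of this quantity; as a generic illustration only (not the paper's exact score), the sketch below evaluates the KL divergence between two zero-mean Gaussians specified by their precision matrices.

```python
import numpy as np

def kl_gaussian_zero_mean(Lambda_A, Lambda_B):
    """KL( N(0, Lambda_A^{-1}) || N(0, Lambda_B^{-1}) ) for zero-mean Gaussians.

    Generic textbook formula, shown only to illustrate the KL-based comparison;
    the paper's anomaly score is a per-variable decomposition, not this scalar.
    """
    d = Lambda_A.shape[0]
    Sigma_A = np.linalg.inv(Lambda_A)
    # KL = 0.5 * [ tr(Lambda_B Sigma_A) - d + ln det Lambda_A - ln det Lambda_B ]
    term_trace = np.trace(Lambda_B @ Sigma_A)
    _, logdet_A = np.linalg.slogdet(Lambda_A)
    _, logdet_B = np.linalg.slogdet(Lambda_B)
    return 0.5 * (term_trace - d + logdet_A - logdet_B)
```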

Page 13/16 | Experiment (1/4) -- structure learning under collinearity: experimental settings.
- Data: daily spot prices (the currency data from the beginning slides), which contain strong collinearity; we focus on a single term.
- Procedure: perform structure learning on the data, then learn again after introducing noise, and observe how the structure changes.
  - Added Gaussian noise with standard deviation equal to 10% of the standard deviation of the original data.
- Three structure learning methods compared:
  - "Glasso": Friedman, Hastie, & Tibshirani, Biostatistics, 2008.
  - "Lasso": Meinshausen & Bühlmann, Ann. Stat., 2006.
  - "AdaLasso": an improved version of MB's algorithm in which the regression uses the Adaptive Lasso [H. Zou, JASA, 2006] rather than the plain Lasso.

Page 14/16 | Experiment (2/4) -- structure learning under collinearity: only the graphical lasso was stable.
- MB's algorithm does not work under collinearity, while Glasso shows reasonable stability.
  - This is due to the general tendency of the Lasso to select one of a group of correlated features almost at random; cf. Bolasso [Bach 08], Stability Selection [MB 08].
  - Glasso avoids this problem by solving a coupled version of the Lasso.
- Metrics (see the sketch below): sparsity = ratio of disconnected edges to all possible edges; flip probability = proportion of edges that change after introducing the noise.
- Take-home message: don't reduce structure learning to separate regression problems over individual variables; treat the precision matrix as a matrix.
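A small sketch of the two metrics just defined, computed from boolean adjacency matrices (for instance, the output of the mb_neighborhood_selection sketch above, or a thresholded precision_ from GraphicalLasso). The thresholds and the exact averaging used in the paper are not reproduced here.

```python
import numpy as np

def sparsity_and_flip_prob(adj_before, adj_after):
    """Sparsity of a graph and flip probability between two graphs.

    adj_before, adj_after: boolean (d, d) adjacency matrices on the same nodes,
    e.g., learned before and after adding noise to the data.
    """
    d = adj_before.shape[0]
    upper = np.triu_indices(d, k=1)              # each variable pair counted once
    e_before = adj_before[upper]
    e_after = adj_after[upper]
    sparsity = 1.0 - e_before.mean()             # fraction of disconnected pairs
    flip_prob = (e_before != e_after).mean()     # fraction of pairs whose edge status changed
    return sparsity, flip_prob
```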

Page 15/16 | Experiment (3/4) -- anomaly detection using automobile sensor data: experimental settings.
- Automobile sensor data: 44 variables; 79 reference and 20 faulty data sets.
- In the faulty data, two variables (x_24 and x_25, not shown) exhibit a correlation anomaly: a loss of correlation relative to the normal data.
- Compute a set of anomaly scores for each of the 79 x 20 reference/faulty data pairs.
- Results are summarized as an ROC curve; the Area Under the Curve (AUC) is 1 if the top two variables by anomaly score are always the truly faulty ones.
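One simple way to turn per-variable scores into such a summary (a hypothetical per-pair version; the paper's exact aggregation over the 79 x 20 pairs may differ) is to label the truly faulty variables as positives and feed the scores to a standard ROC/AUC routine.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def per_pair_auc(scores, faulty_vars):
    """AUC for one reference/faulty data pair.

    scores: per-variable anomaly scores (length = number of variables).
    faulty_vars: indices of variables known to be faulty, e.g. [23, 24] for x_24, x_25.
    """
    labels = np.zeros(len(scores), dtype=int)
    labels[list(faulty_vars)] = 1
    return roc_auc_score(labels, scores)

# Hypothetical usage: average the per-pair AUC over all reference/faulty pairs.
# aucs = [per_pair_auc(score_vec, [23, 24]) for score_vec in all_score_vectors]
# print(np.mean(aucs))
```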

Page 16/16 | Experiment (4/4) -- anomaly detection using automobile sensor data: our method substantially reduced false positives.
- Methods compared:
  - likelihood-based score (conventional);
  - k-NN method for neighborhood selection;
  - a stochastic neighborhood selection method [Idé et al., ICDM 07].
- Our KL-divergence-based method gives the best results (marked "our approach" in the figure).

Thank you!