Handling Outliers and Missing Data in Statistical Data Models Kaushik Mitra Date: 17/1/2011 ECSU Seminar, ISI.

Statistical Data Models Goal: find structure in data Applications – Finance – Engineering – Sciences (biological) – Wherever we deal with data Some examples – Regression – Matrix factorization Challenges: outliers and missing data

Outliers Are Quite Common Google search results for 'male faces'

Need to Handle Outliers Properly Removing salt-and-pepper (outlier) noise (figure: noisy image, Gaussian-filtered image, desired result)

Missing Data Problem Missing tracks in structure from motion: completing missing feature tracks (figure: incomplete tracks, tracks completed by a sub-optimal method, desired result)

Our Focus Outliers in regression – Linear regression – Kernel regression Matrix factorization in the presence of missing data

Robust Linear Regression for High-Dimensional Problems

What is Regression? Regression – Find a functional relation between y and x, where x is the independent variable and y the dependent variable – Given data: (y_i, x_i) pairs Model: y = f(x, w) + n – Estimate w – Predict y for a new x

Robust Regression Real-world data are corrupted with outliers Outliers make estimates unreliable Robust regression – Unknown parameter w and unknown outliers – Combinatorial problem: with N data points and k outliers there are C(N, k) possible outlier sets

Prior Work Combinatorial algorithms – Random sample consensus (RANSAC) – Least median of squares (LMedS) – Exponential in dimension M-estimators – Robust cost functions – Prone to local minima

Robust Linear Regression Model Linear regression model: y_i = x_i^T w + e_i – e_i, Gaussian noise Proposed robust model: e_i = n_i + s_i – n_i, inlier noise (Gaussian) – s_i, outlier noise (sparse) Matrix-vector form – y = Xw + n + s, with y = [y_1, ..., y_N]^T, X = [x_1, ..., x_N]^T, n = [n_1, ..., n_N]^T, s = [s_1, ..., s_N]^T, w ∈ R^D Estimate w and s
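To make the model concrete, here is a minimal numpy sketch of generating data from y = Xw + n + s; the sizes, outlier fraction, and noise levels are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 200, 8                        # number of samples and dimension (illustrative)
X = rng.standard_normal((N, D))
w_true = rng.standard_normal(D)

n = 0.05 * rng.standard_normal(N)    # dense inlier (Gaussian) noise
s = np.zeros(N)                      # sparse outlier vector
out_idx = rng.choice(N, N // 5, replace=False)
s[out_idx] = 5.0 * rng.standard_normal(out_idx.size)  # a few gross errors

y = X @ w_true + n + s               # the robust observation model: y = Xw + n + s
```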

Simplification Objective (RANSAC-like): find w that minimizes the number of outliers Eliminate w Model: y = Xw + n + s Premultiply by C with CX = 0 (possible when N ≥ D) – Cy = CXw + Cs + Cn – z = Cs + g – g Gaussian Problem becomes: solve for s → identify outliers → least squares on the inliers → w
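Continuing the sketch above, the elimination step can be realized with an orthonormal basis of the left null space of X (scipy's null_space); this is just one convenient way to construct such a C.

```python
from scipy.linalg import null_space

# Rows of C span the left null space of X, so CX = 0 and w drops out:
#   z = C y = C s + g,   where g = C n is still Gaussian.
C = null_space(X.T).T          # shape (N - D, N)
z = C @ y
assert np.allclose(C @ X, 0.0)  # the regression parameter is eliminated
```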

Relation to Sparse Learning Solve: min_s ||s||_0 subject to ||z − Cs||_2 ≤ ε – Combinatorial problem Sparse basis selection / sparse learning Two approaches: – Basis pursuit (Chen, Donoho, Saunders 1995) – Bayesian sparse learning (Tipping 2001)

Basis Pursuit Robust Regression (BPRR) Solve: min_s ||s||_1 subject to ||z − Cs||_2 ≤ ε – Basis pursuit denoising (Chen et al. 1995) – Convex problem – Cubic complexity: O(N^3) From compressive sensing theory (Candes 2005) – Equivalent to the original problem if s is sparse and C satisfies the Restricted Isometry Property (RIP) Isometry: ||s_1 − s_2|| ≈ ||C(s_1 − s_2)|| Restricted: to the class of sparse vectors In general, no guarantees for our problem
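A hedged sketch of BPRR's convex step, continuing the variables above: instead of the constrained basis pursuit denoising program it uses the Lagrangian (Lasso) surrogate from scikit-learn, and the regularization weight and outlier threshold are illustrative choices, not the talk's settings.

```python
from sklearn.linear_model import Lasso

# L1 surrogate for: min ||s||_1  s.t.  ||z - Cs||_2 <= eps
lasso = Lasso(alpha=0.01, fit_intercept=False, max_iter=10000)
lasso.fit(C, z)
s_hat = lasso.coef_                           # estimated sparse outlier vector

inliers = np.abs(s_hat) < 1.0                 # illustrative threshold on |s_i|
w_hat, *_ = np.linalg.lstsq(X[inliers], y[inliers], rcond=None)  # refit on inliers
```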

Bayesian Sparse Robust Regression (BSRR) Sparse Bayesian learning technique (Tipping 2001) – Puts a sparsity-promoting prior on s: p(s_i) ∝ 1/|s_i| – Likelihood: p(z|s) = N(Cs, εI) – Solves the MAP problem via the posterior p(s|z) – Cubic complexity: O(N^3)
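This is not the exact BSRR implementation, only a minimal sketch of the standard sparse Bayesian learning (ARD) updates applied to z = Cs + g from the earlier sketch, assuming the inlier noise variance is known and fixed.

```python
# ARD-style evidence updates (Tipping 2001) for the sparse vector s in z = Cs + g.
sigma2 = 0.05 ** 2                       # assumed (fixed) inlier noise variance
alpha = np.ones(C.shape[1])              # one precision hyperparameter per s_i

for _ in range(200):
    Sigma = np.linalg.inv(np.diag(alpha) + C.T @ C / sigma2)   # posterior covariance
    mu = Sigma @ C.T @ z / sigma2                              # posterior mean of s
    gamma = 1.0 - alpha * np.diag(Sigma)                       # well-determined factors
    alpha = gamma / (mu ** 2 + 1e-12)                          # re-estimate precisions

s_map = mu    # entries with large |s_map[i]| are flagged as outliers
```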

Setup for Empirical Studies Synthetically generated data Performance criterion – Angle between the ground-truth and estimated hyperplanes

Vary Outlier Fraction BSRR performs well in all dimensions Combinatorial algorithms like RANSAC, MSAC, and LMedS are not practical in high dimensions (plots for dimension = 2, 8, and 32)

Facial Age Estimation FG-NET dataset: 1002 images of 82 subjects Regression – y: age – x: geometric feature vector

Outlier Removal by BSRR Label data as inliers and outliers Detected 177 outliers in 1002 images Leave-one-out testing results for BSRR: inlier MAE 3.73, outlier MAE 19.14, overall MAE 6.45

Summary for Robust Linear Regression Modeled outliers as a sparse variable Formulated robust regression as a sparse learning problem – BPRR and BSRR BSRR gives the best performance Limitation: restricted to the linear regression model – extended next to a kernel model

Robust RVM Using a Sparse Outlier Model

Relevance Vector Machine (RVM) RVM model: y(x) = Σ_j w_j k(x, x_j) + e – k(·,·): kernel function Examples of kernels – k(x_i, x_j) = (x_i^T x_j)^2: polynomial kernel – k(x_i, x_j) = exp(−||x_i − x_j||^2 / (2σ^2)): Gaussian kernel Kernel trick: k(x_i, x_j) = ψ(x_i)^T ψ(x_j) – Maps x_i to the feature space ψ(x_i)
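A small self-contained sketch of the two kernels named above; the data and the bandwidth are placeholders.

```python
import numpy as np

def gaussian_kernel(X1, X2, sigma=1.0):
    # k(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def polynomial_kernel(X1, X2, degree=2):
    # k(x_i, x_j) = (x_i^T x_j)^degree
    return (X1 @ X2.T) ** degree

X_train = np.random.default_rng(0).standard_normal((100, 3))
K = gaussian_kernel(X_train, X_train)   # N x N kernel matrix used in y = Kw + e
```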

RVM: A Bayesian Approach Bayesian approach – Prior distribution: p(w) – Likelihood: p(y|w) Prior specification – p(w): sparsity-promoting prior, p(w_i) ∝ 1/|w_i| – Why sparse? Use a smaller subset of the training data for prediction, as in the support vector machine Likelihood – Gaussian noise Non-robust: susceptible to outliers

Robust RVM Model Original RVM model – e, Gaussian noise Explicitly model outliers: e_i = n_i + s_i – n_i, inlier noise (Gaussian) – s_i, outlier noise (sparse and heavy-tailed) Matrix-vector form – y = Kw + n + s Parameters to be estimated: w and s

Robust RVM Algorithms y = [K|I] w_s + n – w_s = [w^T s^T]^T: sparse vector Two approaches – Bayesian – Optimization
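Continuing the kernel sketch, the augmented design matrix makes both unknowns explicit; this only sets up the linear system, after which any sparse solver (Bayesian or L1-based) can be applied as in the linear case.

```python
# Augmented model: y = [K | I] w_s + n, with w_s = [w^T  s^T]^T.
N = K.shape[0]
K_aug = np.hstack([K, np.eye(N)])   # N x 2N design matrix
# After solving for w_s, the first N entries give w and the last N give s.
```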

Robust Bayesian RVM (RB-RVM) Prior specification – w and s independent: p(w, s) = p(w)p(s) – Sparsity-promoting prior for s: p(s_i) ∝ 1/|s_i| Solve for the posterior p(w, s|y) Prediction: use the w inferred above Computation: a bigger RVM – w_s instead of w – [K|I] instead of K

Basis Pursuit RVM (BP-RVM) Optimization approach – Combinatorial problem; take the closest convex approximation From compressive sensing theory – Same solution if [K|I] satisfies the RIP In general, this cannot be guaranteed

Experimental Setup

Prediction: Asymmetric Outliers Case

Image Denoising Salt-and-pepper noise – Outliers Regression formulation – Image as a surface over the 2D grid: y is the intensity, x the 2D grid location The denoised image is obtained by prediction
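A sketch of how the denoising problem is cast as regression; the image here is a random stand-in, and the robust regressor itself (e.g. RB-RVM) is not implemented.

```python
import numpy as np

img = np.random.default_rng(0).random((64, 64))       # stand-in noisy image
rows, cols = np.meshgrid(np.arange(64), np.arange(64), indexing="ij")

X = np.stack([rows.ravel(), cols.ravel()], axis=1)    # inputs: 2D grid locations
y = img.ravel()                                       # targets: pixel intensities
# Salt-and-pepper pixels act as outliers in (X, y); fitting a robust kernel
# regressor and predicting back on the grid yields the denoised image.
```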

Salt and Pepper Noise

Some More Results (figure panels: RVM, RB-RVM, median filter)

Age Estimation from Facial Images RB-RVM detected 90 outliers Leave-one-person-out testing

Summary for Robust RVM Modeled outliers as sparse variables Jointly estimated the parameters and the outliers The Bayesian approach gives very good results

Limitations of Regression Regression: y = f(x, w) + n – Noise in only "y" – Not always reasonable: all variables have noise – M = [x_1 x_2 … x_N] – Principal component analysis (PCA): [x_1 x_2 … x_N] = AB^T – A: principal components – B: coefficients – M = AB^T: matrix factorization (our next topic)

Matrix Factorization in the Presence of Missing Data

Applications in Computer Vision Matrix factorization: M = AB^T Applications: building 3-D models from images – Geometric approach (multiple views): structure from motion (SfM) – Photometric approach (multiple lightings): photometric stereo

Matrix Factorization Applications in vision – Affine structure from motion (SfM): M, the matrix of tracked image coordinates (x_ij, y_ij), factors as M = CS^T, a rank-4 matrix – Photometric stereo: M = NS^T, rank 3 Solution (complete data): SVD – M = USV^T – Truncate S to rank r – A = US^{1/2}, B = VS^{1/2}
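For the complete-data case, the SVD recipe above is easy to verify with a short numpy sketch (the sizes and rank are arbitrary).

```python
import numpy as np

def svd_factorize(M, r):
    # M ~ A B^T with A = U_r S_r^{1/2}, B = V_r S_r^{1/2}
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    A = U[:, :r] * np.sqrt(s[:r])
    B = Vt[:r, :].T * np.sqrt(s[:r])
    return A, B

rng = np.random.default_rng(0)
M = rng.standard_normal((40, 5)) @ rng.standard_normal((5, 30))   # exactly rank 5
A, B = svd_factorize(M, r=5)
assert np.allclose(A @ B.T, M)    # exact recovery for a rank-5 matrix
```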

Missing Data Scenario Missed feature tracks in SfM Specularities and shadows in photometric stereo (figure: incomplete feature tracks)

Challenges in the Missing Data Scenario Can't use the SVD Solve: min_{A,B} ||W ⊙ (M − AB^T)||_F^2 + λ(||A||_F^2 + ||B||_F^2) – W: binary weight matrix (1 for observed entries), λ: regularization parameter Challenges – Non-convex problem – Newton's-method-based algorithm (Buchanan et al. 2005) is very slow Design goals – Fast (handles large-scale data) – Flexible enough to handle additional constraints, e.g. orthonormality constraints in orthographic SfM
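For reference, a minimal sketch of the "alternation" baseline used in the later comparisons: ridge-regularized alternating least squares on the masked objective above. This is a baseline under simple default settings, not the LRSDP method proposed in the talk.

```python
import numpy as np

def als_factorize(M, W, r, lam=1e-3, iters=50, seed=0):
    """Alternating minimization of ||W*(M - A B^T)||_F^2 + lam(||A||_F^2 + ||B||_F^2)."""
    rng = np.random.default_rng(seed)
    m, n = M.shape
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    for _ in range(iters):
        for i in range(m):                      # update each row of A from its observed entries
            obs = W[i] > 0
            Bo = B[obs]
            A[i] = np.linalg.solve(Bo.T @ Bo + lam * np.eye(r), Bo.T @ M[i, obs])
        for j in range(n):                      # update each row of B likewise
            obs = W[:, j] > 0
            Ao = A[obs]
            B[j] = np.linalg.solve(Ao.T @ Ao + lam * np.eye(r), Ao.T @ M[obs, j])
    return A, B
```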

Proposed Solution Formulate matrix factorization as a low-rank semidefinite program (LRSDP) – LRSDP: a fast implementation of SDP (Burer, 2001) solved with a quasi-Newton algorithm Advantages of the proposed formulation: – Solves large-scale matrix factorization problems – Handles additional constraints

Low-Rank Semidefinite Programming (LRSDP) Stated as: min_R tr(C RR^T) subject to tr(A_l RR^T) = b_l, l = 1, …, k – Variable: R – Constants: C, the cost matrix; A_l, b_l, the constraint data Challenge: formulating matrix factorization as an LRSDP, i.e. designing C, A_l, and b_l

Matrix Factorization as LRSDP: Noiseless Case We want to formulate: find A, B with minimum ||A||_F^2 + ||B||_F^2 such that (AB^T)_ij = M_ij for all observed entries (i, j) As: min_R tr(C RR^T) subject to tr(A_l RR^T) = b_l, with R formed by stacking A and B LRSDP formulation: C is the identity matrix, each A_l an indicator matrix selecting one observed entry, b_l = M_ij
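A quick numerical check of why this encoding works, assuming R stacks A on top of B as the slide's choice of C (identity) and A_l (indicator matrices) suggests: the trace cost equals the Frobenius regularizer on the factors, and an indicator constraint reads an observed entry out of RR^T.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 6, 5, 2
A = rng.standard_normal((m, r))
B = rng.standard_normal((n, r))

R = np.vstack([A, B])          # LRSDP variable
G = R @ R.T                    # (m+n) x (m+n) positive semidefinite matrix

assert np.isclose(np.trace(G), np.linalg.norm(A)**2 + np.linalg.norm(B)**2)
i, j = 2, 3                    # an "observed" entry (i, j)
assert np.isclose(G[i, m + j], (A @ B.T)[i, j])   # indicator constraint recovers M_ij
```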

Affine SfM Dinosaur sequence (72% missing data) MF-LRSDP gives the best reconstruction

Photometric Stereo Face sequence (42% missing data) MF-LRSDP and damped Newton give the best results

Additional Constraints: Orthographic Factorization Dinosaur sequence

Summary Formulated missing-data matrix factorization as an LRSDP – Handles large-scale problems – Handles additional constraints Overall summary – Two statistical data models – Regression in the presence of outliers: the role of sparsity – Matrix factorization in the presence of missing data: low-rank semidefinite programming

Thank you! Questions?

Robust Bayesian RVM (RB-RVM) Prior specification – w and s independent: p(w, s) = p(w)p(s) – Hierarchical prior for w and s, with α_i and β_i uniformly distributed – True nature of the prior: p(s_i) ∝ 1/|s_i|, sparsity promoting

RB-RVM: Inference and Prediction First estimate α, β, and σ Prior and conditional are Gaussian – The posterior p(w, s|y) is Gaussian, specified by its mean and covariance Prediction – Use the w inferred above – The predicted y is also Gaussian

RB-RVM: Fast Algorithm A bigger RVM – w_s instead of w – [K|I] instead of K Fast implementation (Tipping et al., 2003)

Vary Outlier Fraction

Uniqueness and Global Minima When is the factorization unique? – Matrix completion theory (Candes, 2008) – Enough observed entries (O(r n^1.2 log n)) – M dense Global minima – Attained under the above conditions (empirical observation)

Empirical Evaluations Synthetic data – Generate random matrices A, B of size n×r – Obtain M = AB^T – Reveal a fraction of the entries – Add noise N(0, σ^2)
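A sketch of this data-generation protocol; the observed fraction and noise level are placeholder defaults rather than the talk's exact settings.

```python
import numpy as np

def make_synthetic(n=500, r=5, frac_observed=0.2, sigma=0.0, seed=0):
    """Random rank-r matrix with a revealed fraction of (optionally noisy) entries."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, r))
    B = rng.standard_normal((n, r))
    M = A @ B.T + sigma * rng.standard_normal((n, n))       # add N(0, sigma^2) noise
    W = (rng.random((n, n)) < frac_observed).astype(float)  # binary observation mask
    return W * M, W

M_obs, W = make_synthetic()
```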

Noiseless Case n = 500, r = 5 Damped Newton is too slow to run Reconstruction is counted as successful when the recovery error falls below a small threshold MF-LRSDP gives the best reconstruction results, followed by OptSpace and alternation

Vary Size, Rank and Noise Variance MF-LRSDP gives better reconstruction results across different sizes and ranks The noise performance of all algorithms is similar