Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection. Takafumi Kanamori, Shohei Hido. NIPS 2008.

Presentation transcript:

Efficient Direct Density Ratio Estimation for Non-stationarity Adaptation and Outlier Detection. Takafumi Kanamori, Shohei Hido. NIPS 2008

Outline: Motivation; Importance Estimation; Direct Importance Estimation; Approximation Algorithm; Experiments; Conclusions

Motivation: Importance Sampling; Covariate Shift; Outlier Detection

Importance Sampling Rather than sampling directly from the distribution p, importance sampling reduces the variance of the estimate Ê[f(X)] by drawing samples from an appropriately chosen proposal distribution q; samples from q can be more "important" for estimating the integral, hence the name importance sampling. Other reasons for its use include the difficulty of drawing samples from p directly, or efficiency considerations. [2] R. Srinivasan, Importance Sampling: Applications in Communications and Detection, Springer-Verlag, Berlin. [3] P. J. Smith, M. Shafi, and H. Gao, "Quick simulation: A review of importance sampling techniques in communication systems," IEEE J. Select. Areas Commun., vol. 15, May 1997.
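As a reminder, the standard identity behind this trick (general importance sampling, not specific to this paper) is:

```latex
% Rewrite an expectation under p as an expectation under a proposal q,
% then estimate it with samples drawn from q.
\mathbb{E}_{p}[f(X)]
  = \int f(x)\,p(x)\,dx
  = \int f(x)\,\frac{p(x)}{q(x)}\,q(x)\,dx
  = \mathbb{E}_{q}\!\left[f(X)\,\frac{p(X)}{q(X)}\right],
\qquad
\hat{\mathbb{E}}[f(X)] = \frac{1}{n}\sum_{i=1}^{n} f(x_i)\,\frac{p(x_i)}{q(x_i)},
\quad x_i \sim q .
```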

Covariate Shift Under covariate shift, the input distributions of the training and test sets differ, while the conditional distribution of the output given the input is unchanged. Standard learning techniques such as MLE or CV then become biased. The bias can be compensated by weighting the training samples according to the importance. [4] Jiayuan Huang, Alexander J. Smola, Arthur Gretton, et al. Correcting Sample Selection Bias by Unlabeled Data, NIPS 2006.
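A minimal sketch of how the weighting works, with p_train, p_test, the loss ℓ, and the model f_θ used as generic notation rather than the paper's exact symbols: the importance reweights the training loss so that its expectation matches the test distribution.

```latex
w(x) = \frac{p_{\mathrm{test}}(x)}{p_{\mathrm{train}}(x)},
\qquad
\hat{\theta} = \arg\min_{\theta}\;
\frac{1}{n_{\mathrm{tr}}}\sum_{i=1}^{n_{\mathrm{tr}}}
w(x_i^{\mathrm{tr}})\,\ell\bigl(y_i^{\mathrm{tr}}, f_{\theta}(x_i^{\mathrm{tr}})\bigr).
```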

Outlier Detection The importance values for regular samples are close to one, while those for outliers tend to deviate significantly from one. The importance value can therefore be used as an index of the degree of outlyingness.

Related Works Kernel Density Estimation. Kernel Mean Matching (KMM), which uses a map Φ into a feature space and the expectation operator μ(Pr) := E_{x~Pr(x)}[Φ(x)]. [4] Jiayuan Huang, Alexander J. Smola, Arthur Gretton, et al. Correcting Sample Selection Bias by Unlabeled Data, NIPS 2006.
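A hedged sketch of the KMM idea from [4]: choose training-sample weights so that the weighted mean embedding of the training data matches the (empirical) mean embedding of the test distribution; the exact box constraints and tolerance used in [4] are abbreviated here.

```latex
\min_{w \ge 0}\;
\Bigl\| \mu(P_{\mathrm{test}})
  - \frac{1}{n_{\mathrm{tr}}}\sum_{i=1}^{n_{\mathrm{tr}}} w_i\,\Phi(x_i^{\mathrm{tr}}) \Bigr\|^2
\quad\text{subject to}\quad
\Bigl|\frac{1}{n_{\mathrm{tr}}}\sum_{i=1}^{n_{\mathrm{tr}}} w_i - 1\Bigr| \le \epsilon,
```

where in practice μ(P_test) is replaced by the empirical mean of Φ over the test samples.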

Direct Importance Estimation

Least-squares Approach Model the importance w(x) with a linear model in fixed basis functions. Determine the parameters α so that the squared error to the true importance, evaluated over the training samples, is minimized:
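A sketch of this objective, in notation assumed to match the paper's LSIF formulation (b basis functions φ_l, training density p_train):

```latex
\hat{w}(x) = \sum_{l=1}^{b} \alpha_l\,\varphi_l(x),
\qquad
J(\alpha) = \frac{1}{2}\int \bigl(\hat{w}(x) - w(x)\bigr)^2\, p_{\mathrm{train}}(x)\,dx .
```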

Least-Squares Importance Fitting (LSIF) Empirical estimation of the squared-error objective, plus a regularization term to avoid over-fitting:
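A hedged reconstruction of the empirical problem (notation assumed, following the LSIF formulation): expanding the squared error and dropping the term that does not depend on α gives a quadratic whose matrix is estimated from the training samples and whose linear term is estimated from the test samples; a regularizer λ1ᵀα and a non-negativity constraint complete the LSIF optimization problem.

```latex
\hat{H}_{ll'} = \frac{1}{n_{\mathrm{tr}}}\sum_{i=1}^{n_{\mathrm{tr}}}
  \varphi_l(x_i^{\mathrm{tr}})\,\varphi_{l'}(x_i^{\mathrm{tr}}),
\qquad
\hat{h}_{l} = \frac{1}{n_{\mathrm{te}}}\sum_{j=1}^{n_{\mathrm{te}}}
  \varphi_l(x_j^{\mathrm{te}}),
\\[4pt]
\min_{\alpha \in \mathbb{R}^{b}}\;
\frac{1}{2}\,\alpha^{\top}\hat{H}\alpha - \hat{h}^{\top}\alpha
  + \lambda\,\mathbf{1}^{\top}\alpha
\quad\text{subject to}\quad \alpha \ge 0 .
```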

Model Selection for LSIF The model consists of the regularization parameter λ and the basis functions φ. Model selection is carried out by cross-validation.

Heuristics for Basis Function Design Use Gaussian kernels centered at the test samples:
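A sketch of this basis design (the kernel width σ is one of the model-selection parameters):

```latex
\varphi_l(x) = \exp\!\left(-\,\frac{\lVert x - c_l \rVert^{2}}{2\sigma^{2}}\right),
\qquad
c_l \in \{x_1^{\mathrm{te}}, \dots, x_b^{\mathrm{te}}\},
```

with the centers c_l taken from (a subset of) the test samples.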

Unconstrained Least-squares Approach (uLSIF) Ignore the non-negativity constraints, so the solution can be computed analytically; the learned parameters can then be negative. To compensate for this approximation error, the solution is modified (negative coefficients are set to zero).
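A minimal NumPy sketch of the uLSIF steps just described, assuming Gaussian basis functions centered at a subset of test samples; variable names and the fixed choices of σ and λ are illustrative, not the paper's experimental settings (in practice both would be chosen by the LOOCV procedure on the next slide).

```python
import numpy as np

def gaussian_basis(X, centers, sigma):
    """Gaussian kernel values between the rows of X and the basis centers."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def ulsif(x_train, x_test, sigma=1.0, lam=0.1, n_basis=100, seed=0):
    """Unconstrained least-squares importance fitting (illustrative sketch).

    Returns a callable estimating w(x) = p_test(x) / p_train(x).
    """
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(x_test), size=min(n_basis, len(x_test)), replace=False)
    centers = x_test[idx]

    phi_tr = gaussian_basis(x_train, centers, sigma)   # (n_tr, b)
    phi_te = gaussian_basis(x_test, centers, sigma)    # (n_te, b)

    H = phi_tr.T @ phi_tr / len(x_train)               # empirical H-hat
    h = phi_te.mean(axis=0)                            # empirical h-hat

    # Dropping the non-negativity constraint gives a ridge-type analytic
    # solution; negative coefficients are then clipped to zero to
    # compensate for the relaxation.
    alpha = np.linalg.solve(H + lam * np.eye(len(h)), h)
    alpha = np.maximum(alpha, 0.0)

    return lambda X: gaussian_basis(np.asarray(X), centers, sigma) @ alpha

if __name__ == "__main__":
    # Synthetic setup from the experiments slide: p_train = N(0, I),
    # p_test = N((1, 0, ..., 0)^T, I).
    rng = np.random.default_rng(1)
    d = 5
    x_tr = rng.normal(size=(500, d))
    x_te = rng.normal(size=(500, d))
    x_te[:, 0] += 1.0
    w_hat = ulsif(x_tr, x_te)
    print(w_hat(x_tr[:5]))             # estimated importance at a few training points
    print(np.exp(x_tr[:5, 0] - 0.5))   # true importance exp(x_1 - 1/2), for comparison
```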

Efficient Computation of LOOCV The leave-one-out score can be computed without re-learning the solution for each held-out sample: by the Sherman-Morrison-Woodbury formula, the matrix inverse needs to be computed only once.
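For reference, the rank-one (Sherman-Morrison) form of this identity, which is the kind of update needed when a single sample is added or removed:

```latex
(A + u v^{\top})^{-1}
  = A^{-1} - \frac{A^{-1} u\, v^{\top} A^{-1}}{1 + v^{\top} A^{-1} u},
```

so once A⁻¹ is available, each leave-one-out solution follows from cheap vector operations rather than a fresh matrix inversion.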

Experiments: Importance Estimation p_train is the d-dimensional normal distribution with mean zero and identity covariance. p_test is the d-dimensional normal distribution with mean (1, 0, …, 0)ᵀ and identity covariance. Performance is measured by the normalized mean squared error.
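For this synthetic setup the true importance has a closed form, which is what makes the error against the ground truth computable; with μ = (1, 0, …, 0)ᵀ and identity covariances:

```latex
w(x) = \frac{p_{\mathrm{test}}(x)}{p_{\mathrm{train}}(x)}
     = \exp\!\Bigl(-\tfrac{1}{2}\lVert x-\mu\rVert^{2} + \tfrac{1}{2}\lVert x\rVert^{2}\Bigr)
     = \exp\!\bigl(\mu^{\top}x - \tfrac{1}{2}\lVert\mu\rVert^{2}\bigr)
     = \exp\!\bigl(x_1 - \tfrac{1}{2}\bigr).
```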

Covariate Shift Adaptation In classification and regression: given the training inputs, the test inputs, and the outputs of the training samples, the task is to predict the outputs for the test samples.

Experimental Description Divide the training samples into R disjoint subsets. For each subset, the function is learned by IWRLS on the other subsets, and its importance-weighted mean error on the held-out subset is computed:
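A hedged sketch of the two pieces referred to above, with notation assumed rather than taken from the paper: importance-weighted regularized least squares (IWRLS) weights each squared residual by the estimated importance, and the importance-weighted cross-validation score weights the held-out errors in the same way so that it approximates the test error under covariate shift.

```latex
\text{IWRLS:}\quad
\hat{\theta} = \arg\min_{\theta}
\Bigl[\sum_{i} \hat{w}(x_i^{\mathrm{tr}})
  \bigl(f_{\theta}(x_i^{\mathrm{tr}}) - y_i^{\mathrm{tr}}\bigr)^{2}
  + \gamma\,\lVert\theta\rVert^{2}\Bigr],
\\[4pt]
\text{IWCV:}\quad
\widehat{\mathrm{Err}} = \frac{1}{R}\sum_{r=1}^{R}
\frac{1}{|\mathcal{Z}_r|}\sum_{(x,y)\in \mathcal{Z}_r}
\hat{w}(x)\,\mathrm{loss}\bigl(\hat{f}_{r}(x), y\bigr),
```

where Z_r is the r-th held-out subset and f̂_r is the function learned on the remaining subsets.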

Covariate shift adaptation

Experiment: Outlier Detection

Conclusions Applications: covariate shift adaptation; outlier detection; feature selection; conditional distribution estimation; ICA; …

Reference
[1] Takafumi Kanamori, Shohei Hido. Efficient direct density ratio estimation for non-stationarity adaptation and outlier detection, NIPS 2008.
[2] R. Srinivasan, Importance Sampling: Applications in Communications and Detection, Springer-Verlag, Berlin.
[3] P. J. Smith, M. Shafi, and H. Gao, "Quick simulation: A review of importance sampling techniques in communication systems," IEEE J. Select. Areas Commun., vol. 15, May 1997.
[4] Jiayuan Huang, Alexander J. Smola, Arthur Gretton, et al. Correcting Sample Selection Bias by Unlabeled Data, NIPS 2006.
[5] Jing Jiang. A Literature Survey on Domain Adaptation of Statistical Classifiers.