Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees Radford M. Neal and Jianguo Zhang, winners of the NIPS 2003 feature selection challenge, University of Toronto

The results The winning entry combines Bayesian neural networks with classification based on Bayesian clustering under a Dirichlet diffusion tree model. A Dirichlet diffusion tree method is used for Arcene. Bayesian neural networks (as in BayesNN-large) are used for Gisette, Dexter, and Dorothea. For Madelon, the class probabilities from a Bayesian neural network and from a Dirichlet diffusion tree method are averaged, then thresholded to produce predictions.

Their General Approach Use simple techniques to reduce the computational difficulty of the problem, then apply more sophisticated Bayesian methods. –The simple techniques: PCA and feature selection by significance tests. –The sophisticated methods: Bayesian neural networks and Automatic Relevance Determination.

(I) First-level feature reduction

Feature selection using significance tests (first level) An initial feature subset was found by simple univariate significance tests (correlation coefficient, symmetrical uncertainty). Assumption: relevant variables will be at least somewhat relevant on their own. For all tests, a p-value was found by comparing the test statistic to its distribution under random permutations of the class labels.
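As a concrete illustration, here is a minimal Python sketch of such a permutation-based screen, using the absolute correlation coefficient as the test statistic. The function name and the smoothed p-value convention are my own; the authors' actual tests also included statistics such as symmetrical uncertainty.

```python
import numpy as np

def permutation_p_values(X, y, n_perm=1000, seed=0):
    """Univariate screening: for each feature, a p-value of its
    absolute correlation with the labels, estimated by comparing
    against label permutations. A sketch, not the authors' code."""
    rng = np.random.default_rng(seed)
    Xc = X - X.mean(axis=0)                      # center each feature
    col_norms = np.maximum(np.linalg.norm(Xc, axis=0), 1e-12)

    def abs_corr(labels):
        lc = labels - labels.mean()
        return np.abs(Xc.T @ lc) / (col_norms * max(np.linalg.norm(lc), 1e-12))

    observed = abs_corr(y)
    exceed = np.zeros(X.shape[1])
    for _ in range(n_perm):
        exceed += abs_corr(rng.permutation(y)) >= observed
    return (exceed + 1) / (n_perm + 1)           # smoothed permutation p-values
```

Features whose p-value falls below a chosen cutoff form the initial subset passed on to the Bayesian models.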

Dimensionality reduction with PCA (an alternative to FS) There are probably better dimensionality reduction methods than PCA, but that's what we used. One reason is that it's feasible even when p is huge, provided n is not too large: the time required is of order min(pn², np²). PCA was done using all the data (training, validation, and test).
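The min(pn², np²) cost comes from the standard trick of eigendecomposing whichever of the p × p covariance matrix or the n × n Gram matrix is smaller. A sketch of the n ≪ p case, assuming centered numeric data; this is the textbook construction, not the authors' code.

```python
import numpy as np

def pca_scores_highdim(X, k):
    """Top-k principal-component scores when p >> n: work with the
    n x n Gram matrix instead of the p x p covariance, for O(n^2 p)
    cost. If X = U S V^T, then G = X X^T = U S^2 U^T and the PC
    scores X V equal U S."""
    Xc = X - X.mean(axis=0)              # center features
    G = Xc @ Xc.T                        # n x n Gram matrix
    vals, vecs = np.linalg.eigh(G)       # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:k]   # keep the k largest
    vals, vecs = vals[order], vecs[:, order]
    return vecs * np.sqrt(np.maximum(vals, 0))   # scores U S, shape (n, k)
```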

(II) Building the learning model & second-level feature selection

Bayesian Neural Networks

Conventional neural network learning

Bayesian Neural Network Learning Based on the statistical interpretation of conventional neural network learning.
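For readers without the original slide figures, the statistical interpretation referred to here is the standard one: with a Gaussian noise model and a Gaussian prior on the weights, the negative log posterior is the familiar squared error plus weight decay, so conventional penalized training finds the MAP parameters:

```latex
-\log p(\theta \mid D)
  \;=\; \frac{1}{2\sigma^2} \sum_{i=1}^{n} \bigl(y_i - f(x_i;\theta)\bigr)^2
  \;+\; \frac{\alpha}{2}\,\lVert \theta \rVert^2 \;+\; \text{const}
```

Minimizing this is least-squares training with weight decay coefficient λ = ασ².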

Bayesian Neural Network Learning Bayesian predictions are found by integration rather than maximization. For a test case x, y is predicted from the posterior predictive distribution p(y | x, D) = ∫ p(y | x, θ) p(θ | D) dθ. A conventional neural network considers only the parameters with maximum posterior probability; a Bayesian neural network considers all possible parameters in the parameter space. The integral can be implemented by a Gaussian approximation or by MCMC.
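With MCMC, the predictive integral is approximated by averaging network outputs over posterior samples of the parameters. A minimal sketch, where net_forward is an assumed helper returning P(y = 1 | x, θ):

```python
import numpy as np

def bayesian_predict(posterior_samples, net_forward, x):
    """Monte Carlo approximation of the Bayesian predictive
    probability: average P(y=1 | x, theta) over parameter samples
    theta drawn (e.g. by MCMC) from the posterior p(theta | D)."""
    probs = [net_forward(theta, x) for theta in posterior_samples]
    return np.mean(probs)   # approximates P(y=1 | x, D)
```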

ARD Prior Still remember weight decay? How does ARD work? (By optimizing the decay parameters.) –Associate the weights from each input with their own decay parameter. –There are theories for optimizing the decays. Result: if an input feature x is irrelevant, its relevance hyperparameter β = 1/α will tend to be small, forcing the weights from that input to be near zero.
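In code, an ARD prior is just a zero-mean Gaussian on the first-layer weights with one precision hyperparameter per input. A sketch of its log density, assuming weights stored as an (n_inputs × n_hidden) matrix; names are illustrative:

```python
import numpy as np

def ard_log_prior(W, alpha):
    """Log density of an ARD Gaussian prior on first-layer weights.
    W has shape (n_inputs, n_hidden); alpha[i] is the precision
    (inverse variance) shared by all weights leaving input i.
    Large alpha[i], i.e. small relevance 1/alpha[i], pins input i's
    weights near zero. A sketch, not the authors' exact prior."""
    n_hidden = W.shape[1]
    quad = -0.5 * (alpha[:, None] * W**2).sum()          # -0.5 * sum_i alpha_i w_ij^2
    norm = 0.5 * n_hidden * np.log(alpha / (2 * np.pi)).sum()
    return norm + quad
```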

Some Strong Points of This Algorithm Bayesian learning integrates over the posterior distribution of the network parameters, rather than picking a single "optimal" set of parameters. This further helps to avoid overfitting. ARD can be used to adjust the relevance of input features. We can use priors to incorporate external knowledge.

Dirichlet Diffusion Trees A Bayesian hierarchical clustering method
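To give a feel for the model, here is a heavily simplified, time-discretized sketch of the Dirichlet diffusion tree generative process (Neal, 2003): the first point is Brownian motion from the origin over t in [0, 1]; each later point follows an earlier point's path and diverges with hazard a(t) = c / (1 - t), then diffuses independently. The proper process also reinforces shared segments by how many points traversed them; that bookkeeping is omitted here, so treat this strictly as an illustration.

```python
import numpy as np

def sample_ddt(n, dim=2, c=1.0, steps=200, seed=0):
    """Simplified Dirichlet diffusion tree sampler: generate n points
    as the t=1 endpoints of diffusion paths that share initial
    segments with earlier points (hierarchical, cluster-like data)."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / steps
    paths = np.zeros((n, steps + 1, dim))
    for i in range(n):
        if i == 0:  # first point: pure Brownian motion from the origin
            inc = rng.normal(0, np.sqrt(dt), (steps, dim))
            paths[0, 1:] = np.cumsum(inc, axis=0)
            continue
        guide = paths[rng.integers(i)]    # follow one earlier point's path
        diverged = False
        for s in range(steps):
            t = s * dt
            if not diverged and rng.random() < (c / (1 - t)) * dt:
                diverged = True           # hazard a(t) = c/(1-t) forces
            if diverged:                  # divergence before t = 1
                paths[i, s + 1] = paths[i, s] + rng.normal(0, np.sqrt(dt), dim)
            else:
                paths[i, s + 1] = guide[s + 1]
    return paths[:, -1]                   # the generated data points at t = 1
```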

The methods BayesNN-small: features selected using significance tests. BayesNN-large: principal components. BayesNN-DFT-combo: the class probabilities from a Bayesian neural network and from a Dirichlet diffusion tree method are averaged, then thresholded to produce predictions.
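The combo entry's final step is easy to state in code: average the two models' class-1 probabilities and threshold. A trivial sketch; the names and the 0.5 default are illustrative:

```python
import numpy as np

def combo_predict(p_bnn, p_ddt, threshold=0.5):
    """BayesNN-DFT-combo style prediction: average the predictive
    class-1 probabilities from the two models, then threshold."""
    p_avg = (np.asarray(p_bnn) + np.asarray(p_ddt)) / 2.0
    return (p_avg >= threshold).astype(int)
```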

About the datasets

The results

Thanks. Any questions?