
Supervised learning in high-throughput data  General considerations  Dimension reduction with outcome variables  Classification models

General considerations. [Table: a gene-by-sample matrix of expression values, with one row per gene (Gene 1, Gene 2, ...) and one column per sample (Control 1 through Control 25, Disease 1 through Disease 40).] This is the common structure of microarray gene expression data from a simple cross-sectional case-control design. Data from other high-throughput technologies are often similar.

Fisher Linear Discriminant Analysis. Find the lower-dimensional space in which the classes are most separated.

In the projection, two goals are to be fulfilled: (1) maximize the between-class distance; (2) minimize the within-class scatter. Both are achieved by maximizing, over all non-zero vectors w, the criterion $J(w) = \frac{w^T S_B w}{w^T S_W w}$, where $S_B$ is the between-class scatter matrix (between-class distance) and $S_W$ is the within-class scatter matrix (within-class scatter).

Fisher Linear Discriminant Analysis. In the two-class case, we project onto a line to find the best separation. [Figure: two class clouds with means mean1 and mean2, the projection direction, and the decision boundary.] Maximizing J(w) yields $w \propto S_W^{-1}(m_1 - m_2)$, where $m_1$ and $m_2$ are the class means; the decision boundary is a hyperplane orthogonal to the projection direction, placed between the two projected class means.
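A minimal NumPy sketch of the two-class case (illustrative, not the slides' code; the data and variable names are made up): compute $w = S_W^{-1}(m_1 - m_2)$ and classify by comparing the projection to the midpoint of the projected means.

```python
# Two-class Fisher LDA sketch (illustrative).
import numpy as np

rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0, 0], scale=1.0, size=(50, 2))   # class 1 samples
X2 = rng.normal(loc=[3, 2], scale=1.0, size=(50, 2))   # class 2 samples

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter: sum of the two per-class scatter matrices
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)

w = np.linalg.solve(S_W, m1 - m2)          # Fisher direction, w = S_W^{-1}(m1 - m2)
threshold = w @ (m1 + m2) / 2              # midpoint of the projected class means

# A new point is assigned to class 1 if its projection falls on the m1 side.
x_new = np.array([1.0, 0.5])
print("class 1" if w @ x_new > threshold else "class 2")
```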

EDR space. Now we turn to regression. The data are $\{x_i, y_i\}$. Is dimension reduction on the X matrix alone helpful here? Possibly, if the dimension reduction preserves the essential structure of Y|X; but this is questionable. Effective Dimension Reduction (EDR): reduce the dimension of X without losing the information that is essential for predicting Y.

EDR space. The model: $Y = g(\beta_1^T X, \beta_2^T X, \ldots, \beta_K^T X, \varepsilon)$, i.e., Y is predicted through a set of linear combinations of X. If g() is known, this is not very different from a generalized linear model. For dimension reduction purposes, is there a scheme that can work for almost any g(), without knowledge of its actual form?

EDR space. The general model encompasses many common regression models as special cases.

EDR space. Under this general model, the space B spanned by $\beta_1, \beta_2, \ldots, \beta_K$ is called the e.d.r. space. Reducing X to this subspace causes no loss of information for predicting Y. As in factor analysis, the subspace B is identifiable, but the individual vectors are not. Any non-zero vector in the e.d.r. space is called an e.d.r. direction.

This model is stated in nearly its weakest form, to reflect the hope that a low-dimensional projection of a high-dimensional regressor contains most of the information that can be gathered from a sample of modest size. It does not impose any structure on how the projected regressor variables affect the output variable. Most regression models assume K=1, plus additional structure on g().

EDR space. The philosophical point of Sliced Inverse Regression: estimating the projection directions can be a more important statistical issue than estimating the structure of g() itself. After finding a good e.d.r. space, we can project the data onto this smaller space. We are then in a better position to decide what should be pursued further: model building, response surface estimation, cluster analysis, heteroscedasticity analysis, variable selection, and so on.

SIR: Sliced Inverse Regression. In regular regression, our interest is the conditional density h(Y|X); most important are E(Y|X) and var(Y|X). SIR reverses the roles: it treats Y as the independent variable and X as the dependent variable. Given Y = y, what values will X take? This takes us from one p-dimensional problem (subject to the curse of dimensionality) back to p one-dimensional curve-fitting problems: $E(X_i \mid Y)$, $i = 1, \ldots, p$.

SIR

Let $\hat\Sigma_\eta$ be the covariance matrix of the slice means of X, weighted by the slice sizes, and let $\hat\Sigma_x$ be the sample covariance matrix of the $X_i$'s. Find the SIR directions by conducting the generalized eigenvalue decomposition of $\hat\Sigma_\eta$ with respect to $\hat\Sigma_x$.
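A short NumPy sketch of this slicing-plus-eigendecomposition step (illustrative, not the slides' code; the function name and parameters such as n_slices are made up, and it assumes n > p so the sample covariance is invertible):

```python
# Illustrative SIR sketch.
import numpy as np

def sir_directions(X, y, n_slices=10, n_directions=2):
    """Slice the data by y, form the weighted covariance of the slice means of X,
    and eigen-decompose it with respect to the sample covariance of X."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)                      # center the predictors
    Sigma_x = np.cov(Xc, rowvar=False)           # sample covariance of X

    order = np.argsort(y)                        # slice observations by the order of y
    slices = np.array_split(order, n_slices)

    Sigma_eta = np.zeros((p, p))                 # weighted covariance of slice means
    for idx in slices:
        m = Xc[idx].mean(axis=0)
        Sigma_eta += (len(idx) / n) * np.outer(m, m)

    # Generalized eigenproblem  Sigma_eta b = lambda Sigma_x b
    evals, evecs = np.linalg.eig(np.linalg.solve(Sigma_x, Sigma_eta))
    top = np.argsort(evals.real)[::-1][:n_directions]
    return evecs.real[:, top]                    # columns = estimated e.d.r. directions

# Toy check: y depends on X only through one linear combination.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
beta = np.array([1.0, 1.0, 0.0, 0.0, 0.0, 0.0])
y = (X @ beta) ** 3 + 0.5 * rng.normal(size=500)
print(sir_directions(X, y, n_directions=1).ravel().round(2))  # roughly proportional to beta
```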

SIR An example response surface found by SIR.

PLS: Partial Least Squares. Find latent factors in X that can predict Y. X is multi-dimensional; Y can be either a random variable or a random vector. The model looks like $Y = \sum_j \beta_j T_j + \varepsilon$, where each $T_j$ is a linear combination of X. PLS is suitable for handling the p >> N situation.

PLS. Data: a sample $\{(x_i, y_i)\}$, $i = 1, \ldots, N$, with $x_i$ p-dimensional. Goal: find weight vectors $a_k$ such that the latent components $T_k = a_k^T X$ capture the part of X that is most predictive of Y.

PLS. Solution: $a_{k+1}$ is the (k+1)-th eigenvector of a cross-covariance matrix of X and Y. Alternatively, the PLS components can be characterized as minimizers of a least-squares criterion, which can be solved by iterative regression.

PLS. Example: PLS vs. PCA in regression, where Y is related only to $X_1$.
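A small illustration of this contrast (not the slides' example), using scikit-learn's PCA and PLSRegression on simulated data where Y depends only on $X_1$ but the irrelevant predictors have larger variance:

```python
# PLS vs. PCA when Y is related to X_1 only (illustrative).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
n, p = 200, 10
X = rng.normal(size=(n, p))
X[:, 1:] *= 5.0                              # irrelevant predictors have larger variance
y = X[:, 0] + 0.1 * rng.normal(size=n)       # Y is related to X_1 only

pca_dir = PCA(n_components=1).fit(X).components_[0]
pls_dir = PLSRegression(n_components=1).fit(X, y).x_weights_[:, 0]

print("PCA first direction:", np.round(pca_dir, 2))   # ~zero weight on X_1
print("PLS first direction:", np.round(pls_dir, 2))   # dominated by X_1
```

Because PCA looks only at the variance of X, its first component ignores $X_1$; PLS uses Y and so its first latent factor points at $X_1$.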

Classification Tree An example classification tree.

Classification Trees. Every split (mostly binary) should increase node purity. The drop in impurity serves as the criterion for variable selection at each split. The tree should not be overly complex, so we may prune the tree.

Classification tree

Issues: How many splits should be allowed at a node? Which property should be used at a node? When should we stop splitting a node and declare it a “leaf”? How do we adjust the size of the tree? (Tree size controls model complexity: too large a tree overfits; too small a tree fails to capture the underlying structure.) How do we assign the classification decision at each leaf? How do we handle missing data?

Classification Tree Binary split.

Classification Tree. To decide which split criterion to use, we need a measure of node impurity i(N), where $p(\omega_j)$ is the proportion of class $\omega_j$ at node N. Entropy: $i(N) = -\sum_j p(\omega_j)\log p(\omega_j)$. Misclassification: $i(N) = 1 - \max_j p(\omega_j)$. Gini impurity: $i(N) = \sum_{j \ne k} p(\omega_j)p(\omega_k) = 1 - \sum_j p(\omega_j)^2$ (the expected error rate if the class label is randomly permuted according to the class distribution at the node).
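The three measures as a small sketch (illustrative, not the slides' code), taking the vector of class proportions at a node as input:

```python
# Node impurity measures (illustrative).
import numpy as np

def entropy(p):
    p = p[p > 0]                      # avoid log(0)
    return -np.sum(p * np.log2(p))    # base-2 log; any base gives the same ordering

def misclassification(p):
    return 1.0 - np.max(p)

def gini(p):
    return 1.0 - np.sum(p ** 2)

p = np.array([0.7, 0.2, 0.1])         # class proportions at a node
print(entropy(p), misclassification(p), gini(p))
```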

Classification Tree. Growing the tree is a greedy search: at every step, choose the query that decreases the impurity as much as possible. For a real-valued predictor, one may use gradient descent to find the optimal cut value. When to stop?
- Stop when the reduction in impurity is smaller than a threshold.
- Stop when the leaf node is too small.
- Stop when a global criterion is met.
- Hypothesis testing.
- Cross-validation.
- Fully grow and then prune.

Pruning the tree:
- Merge leaves when the resulting increase in impurity is not severe.
- Cost-complexity pruning allows elimination of a whole branch in a single step.
When priors and misclassification costs are present, adjust training by adjusting the Gini impurity. Assigning a class label to a leaf:
- No prior: take the class with the highest frequency at the node.
- With a prior: weight the frequencies by the prior.
- With a loss function: always minimize the expected loss.
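One practical way to do cost-complexity pruning is scikit-learn's implementation for CART-style trees; this is an illustrative sketch (dataset and parameter values are made up), not the slides' procedure:

```python
# Cost-complexity pruning path with scikit-learn (illustrative).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Each ccp_alpha on the path corresponds to pruning away one or more branches.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_tr, y_tr)
for alpha in path.ccp_alphas[::5]:
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_tr, y_tr)
    print(f"alpha={alpha:.4f}  leaves={tree.get_n_leaves()}  "
          f"test acc={tree.score(X_te, y_te):.3f}")
```

Larger alpha values prune more aggressively, trading training fit for simpler trees.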

Classification Tree. Choice of features.

Classification Tree Multivariate tree.

Bootstrapping. Directly assess uncertainty from the training data. Basic idea: assuming the empirical distribution of the data approximates the true underlying density, re-sampling from it gives us an idea of the uncertainty caused by sampling.
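A minimal sketch of this idea (illustrative data, not from the slides): resample with replacement and look at the spread of the re-computed statistic.

```python
# Bootstrap estimate of the standard error of the sample mean (illustrative).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=100)      # observed sample

B = 1000
boot_means = np.empty(B)
for b in range(B):
    idx = rng.integers(0, len(x), size=len(x))    # resample indices with replacement
    boot_means[b] = x[idx].mean()

print("estimate:", x.mean(), "bootstrap SE:", boot_means.std(ddof=1))
```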

Bootstrapping

Bagging: “bootstrap aggregation.” Resample the training dataset, build a prediction model on each resampled dataset, and average the predictions. This is a Monte Carlo estimate of $\hat f_{\mathrm{bag}}(x) = E_{\hat{\mathcal P}}[\hat f^*(x)]$, where $\hat{\mathcal P}$ is the empirical distribution putting equal probability 1/N on each of the data points. Bagging only differs from the original estimate when f() is a non-linear or adaptive function of the data; when f() is a linear function, $\hat f_{\mathrm{bag}}(x) = \hat f(x)$. Trees are a perfect candidate for bagging – each bootstrap tree will differ in structure.
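A hand-rolled sketch of bagging classification trees (illustrative data and names, not the slides' code): fit one tree per bootstrap resample and combine the predictions by majority vote. In practice scikit-learn's BaggingClassifier packages the same loop.

```python
# Bagging trees by hand (illustrative).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=600, n_features=20, n_informative=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

B = 50
votes = np.zeros((B, len(X_te)))
for b in range(B):
    idx = rng.integers(0, len(X_tr), size=len(X_tr))        # bootstrap resample
    tree = DecisionTreeClassifier(random_state=b).fit(X_tr[idx], y_tr[idx])
    votes[b] = tree.predict(X_te)

y_bag = (votes.mean(axis=0) > 0.5).astype(int)              # majority vote over B trees
single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
print("single tree test accuracy:", single.score(X_te, y_te))
print("bagged trees test accuracy:", (y_bag == y_te).mean())
```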

Bagging trees. The bagged trees differ in structure.

Bagging trees Error curves.

Random Forest

Bagging can be seen as a method to reduce the variance of an estimated prediction function. It mostly helps high-variance, low-bias classifiers. By comparison, boosting builds weak classifiers one by one, allowing the collection to evolve in the right direction. Random forest is a substantial modification of bagging – it builds a collection of de-correlated trees. - Similar performance to boosting. - Simpler to train and tune than boosting.

Random Forest. The intuition – the average of random variables. For B i.i.d. random variables, each with variance $\sigma^2$, the mean has variance $\sigma^2/B$. For B identically distributed (but not independent) random variables, each with variance $\sigma^2$ and with pairwise correlation $\rho$, the mean has variance $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$. Bagged trees are identically distributed but correlated samples. Random forest aims to reduce this correlation, and hence the variance, by random selection of variables at each split.
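A quick simulation check of the variance formula for correlated averages (illustrative, not from the slides; the equicorrelation setup and parameter values are made up):

```python
# Variance of the mean of B equicorrelated variables (illustrative check).
import numpy as np

B, rho, sigma2 = 50, 0.3, 1.0
cov = sigma2 * (rho * np.ones((B, B)) + (1 - rho) * np.eye(B))   # equicorrelation matrix

rng = np.random.default_rng(0)
draws = rng.multivariate_normal(np.zeros(B), cov, size=20000)
print("empirical var of mean:", draws.mean(axis=1).var())
print("theory rho*s2+(1-rho)*s2/B:", rho * sigma2 + (1 - rho) * sigma2 / B)
```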

Random Forest

Example comparing RF to boosted trees.

Random Forest. Benefit of RF – the out-of-bag (OOB) sample acts as a built-in cross-validation. For sample i, compute its RF error using only the trees built from bootstrap samples in which sample i did not appear. The OOB error rate is close to the N-fold cross-validation error rate. Unlike many other nonlinear estimators, RF can therefore be fit in a single sequence: stop growing the forest when the OOB error stabilizes.
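A short sketch of reading the OOB error in practice (illustrative data, using scikit-learn's RandomForestClassifier; not the slides' code):

```python
# OOB error from a random forest as a cross-validation substitute (illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=500, oob_score=True,
                            random_state=0).fit(X, y)
print("OOB error rate:", 1 - rf.oob_score_)   # each sample scored only by trees that never saw it
```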

Random Forest. Variable importance – finding the most relevant predictors. At every split of every tree, one variable contributes to the improvement of the impurity measure. Accumulating the reduction in i(N) for each variable over all trees gives a measure of the relative importance of the variables: the predictors that appear most often at split points, and lead to the largest reductions in impurity, are the important ones. Another method: permute the values of a predictor in the OOB samples of each tree; the resulting decrease in prediction accuracy, accumulated over all trees, is also a measure of importance.
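Both flavors of importance are available in scikit-learn; a sketch (illustrative data, not the slides' code). Note that scikit-learn's permutation_importance permutes on whatever data you pass in, rather than per-tree OOB samples, so it is an approximation of the OOB-permutation measure described above.

```python
# Impurity-based vs. permutation importance for a random forest (illustrative).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=500, n_features=20, n_informative=4,
                           random_state=0)
rf = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

print("impurity-based importance:", rf.feature_importances_.round(3))
perm = permutation_importance(rf, X, y, n_repeats=10, random_state=0)
print("permutation importance   :", perm.importances_mean.round(3))
```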

Random Forest