Supervised learning in high-throughput data
- General considerations
- Dimension reduction with outcome variables
- Classification models
General considerations

             Control 1  Control 2  …  Control 25  Disease 1  Disease 2  …  Disease 40
Gene 1          9.25       9.77    …     9.4         8.58       5.62    …     6.88
Gene 2          6.99       5.85    …     5           5.14       5.43    …     5.01
Gene 3          4.55       5.3     …     4.73        3.66       4.27    …     4.11
Gene 4          7.04       7.16    …     6.47        6.79       6.87    …     6.45
Gene 5          2.84       3.21    …     3.2         3.06       3.26    …     3.15
Gene 6          6.08       6.26    …     7.19        6.12       5.93    …     6.44
Gene 7          4          4.41    …     4.22        4.42       4.09    …     4.26
Gene 8          4.01       4.15    …     3.45        3.77       3.55    …     3.82
Gene 9          6.37       7.2     …     8.14        5.13       7.06    …     7.27
Gene 10         2.91       3.04    …     3.03        2.83       3.86    …     2.89
Gene 11         3.71       3.79    …     3.39        5.15       6.23    …     4.44
…
Gene 50000      3.65       3.73    …     3.8         3.87       3.76    …     3.62

This is the common structure of microarray gene expression data from a simple cross-sectional case-control design. Data from other high-throughput technologies are often similar.
Fisher Linear Discriminant Analysis. Find the lower-dimensional space in which the classes are best separated.
In the projection, two goals are to be fulfilled: (1) maximize the between-class distance, and (2) minimize the within-class scatter. Both are achieved by maximizing, over all non-zero vectors w, the Fisher criterion
$J(w) = \dfrac{w^T S_B w}{w^T S_W w}$,
where $S_B$ is the between-class scatter matrix (between-class distance) and $S_W$ is the within-class scatter matrix (within-class scatter).
Fisher Linear Discriminant Analysis. In the two-class case, we project onto a line to find the best separation (figure: the two class means, mean1 and mean2, with the decision boundary between them). Maximization yields $w \propto S_W^{-1}(m_1 - m_2)$. Decision boundary: $w^T x = w^T (m_1 + m_2)/2$, i.e., classify a point by which side of the projected midpoint it falls on (assuming equal priors).
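As a concrete illustration of the two-class solution above, here is a minimal NumPy sketch (the function and variable names are illustrative, not from the slides) that computes the Fisher direction and the midpoint decision rule:

```python
import numpy as np

def fisher_lda(X, y):
    """Two-class Fisher LDA: X is (n, p), y contains labels 0/1."""
    X0, X1 = X[y == 0], X[y == 1]
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter S_W = sum of the two class scatter matrices
    S_W = (X0 - m0).T @ (X0 - m0) + (X1 - m1).T @ (X1 - m1)
    # Maximizing J(w) gives w proportional to S_W^{-1} (m1 - m0)
    w = np.linalg.solve(S_W, m1 - m0)
    # Midpoint decision threshold (equal priors assumed)
    w0 = w @ (m0 + m1) / 2
    return w, w0

def lda_predict(X, w, w0):
    """Assign class 1 to points whose projection exceeds the midpoint threshold."""
    return (X @ w > w0).astype(int)
```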
EDR space. Now we turn to regression. The data are $\{x_i, y_i\}$. Is dimension reduction on the X matrix alone helpful here? Possibly, but only if the dimension reduction preserves the essential structure of Y|X, and that is not guaranteed. Effective Dimension Reduction (EDR): reduce the dimension of X without losing the information that is essential for predicting Y.
EDR space. The model: $Y = g(\beta_1^T X, \beta_2^T X, \ldots, \beta_K^T X, \varepsilon)$, i.e., Y is predicted through a set of linear combinations of X. If g() is known, this is not very different from a generalized linear model. For the purpose of dimension reduction, is there a scheme that works for almost any g(), without knowledge of its actual form?
EDR space. This general model encompasses many familiar regression models as special cases; for example, the linear model $Y = \beta^T X + \varepsilon$ corresponds to K = 1 with a linear g().
EDR space. Under this general model, the space B generated by $\beta_1, \beta_2, \ldots, \beta_K$ is called the e.d.r. space. Reducing to this subspace causes no loss of information for predicting Y. As in factor analysis, the subspace B is identifiable, but the individual vectors are not. Any non-zero vector in the e.d.r. space is called an e.d.r. direction.
This model makes almost the weakest possible assumption: it reflects the hope that a low-dimensional projection of a high-dimensional regressor contains most of the information that can be gathered from a sample of modest size. It imposes no structure on how the projected regressor variables affect the output variable. Most regression models assume K = 1, plus additional structure on g().
EDR space. The philosophical point of Sliced Inverse Regression: estimating the projection directions can be a more important statistical issue than estimating the structure of g() itself. After finding a good e.d.r. space, we can project the data onto this smaller space. Then we are in a better position to decide what should be pursued further: model building, response surface estimation, cluster analysis, heteroscedasticity analysis, variable selection, etc.
SIR: Sliced Inverse Regression. In regular regression, our interest is the conditional density h(Y|X); most important are E(Y|X) and var(Y|X). SIR reverses the roles: it treats Y as the independent variable and X as the dependent variable. Given Y = y, what values will X take? This turns a p-dimensional problem (subject to the curse of dimensionality) into one-dimensional curve-fitting problems: $E(X_i \mid Y)$, $i = 1, \ldots, p$.
SIR
Let $\hat\Sigma_\eta$ be the covariance matrix of the slice means of X, weighted by the slice sizes, and let $\hat\Sigma_x$ be the sample covariance matrix of the $X_i$'s. Find the SIR directions by conducting the eigenvalue decomposition of $\hat\Sigma_\eta$ with respect to $\hat\Sigma_x$: solve $\hat\Sigma_\eta v_j = \lambda_j \hat\Sigma_x v_j$.
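The following is a rough NumPy sketch of the procedure just described (the number of slices and the names are illustrative choices, not from the slides): slice the data by the sorted values of Y, compute the slice means of X, form the slice-size-weighted covariance of those means, and solve the eigen-decomposition with respect to the sample covariance of X.

```python
import numpy as np
from scipy.linalg import eigh  # solves the generalized symmetric eigenproblem

def sir_directions(X, y, n_slices=10, n_dirs=2):
    """Sliced Inverse Regression: return estimated e.d.r. directions."""
    n, p = X.shape
    x_bar = X.mean(axis=0)
    Sigma_x = np.cov(X, rowvar=False)              # sample covariance of X
    # Slice the data according to the order of y
    order = np.argsort(y)
    slices = np.array_split(order, n_slices)
    # Covariance of the slice means of X, weighted by the slice sizes
    Sigma_eta = np.zeros((p, p))
    for idx in slices:
        diff = X[idx].mean(axis=0) - x_bar
        Sigma_eta += (len(idx) / n) * np.outer(diff, diff)
    # Eigen-decomposition of Sigma_eta with respect to Sigma_x
    eigvals, eigvecs = eigh(Sigma_eta, Sigma_x)
    # eigh returns eigenvalues in ascending order; keep the largest ones
    return eigvecs[:, ::-1][:, :n_dirs], eigvals[::-1][:n_dirs]
```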
SIR An example response surface found by SIR.
PLS: finding latent factors in X that can predict Y. X is multi-dimensional; Y can be either a random variable or a random vector. The model looks like $Y = \sum_j \phi_j T_j + \varepsilon$, where each latent component $T_j$ is a linear combination of X. PLS is well suited to the p ≫ N situation.
PLS. Data: observations $\{(x_i, y_i)\}_{i=1}^{N}$ with $x_i \in \mathbb{R}^p$ (and possibly vector-valued $y_i$). Goal: find a small number of latent components $T_j = a_j^T X$ that predict Y well.
PLS solution: $a_{k+1}$ is the (k+1)-th eigenvector of a matrix built from the cross-covariance between X and Y. Alternatively, the PLS components can be characterized as the solution of an optimization criterion that can be solved by iterative regression.
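As a concrete illustration of the iterative-regression view, here is a minimal NIPALS-style PLS sketch for a univariate response (a common formulation shown under the assumption that X and y have already been centered; the names, such as n_components, are illustrative):

```python
import numpy as np

def pls_nipals(X, y, n_components=2):
    """PLS1 via NIPALS-style deflation. X: (n, p) centered, y: (n,) centered."""
    X_work, y_work = X.copy(), y.copy().astype(float)
    weights, scores, x_loadings, y_loadings = [], [], [], []
    for _ in range(n_components):
        # Weight vector: direction in X with maximal covariance with y
        w = X_work.T @ y_work
        w /= np.linalg.norm(w)
        t = X_work @ w                      # latent score: a linear combination of X
        p = X_work.T @ t / (t @ t)          # X loading (regression of X on t)
        q = y_work @ t / (t @ t)            # y loading (regression of y on t)
        # Deflate: remove the part of X and y explained by this component
        X_work -= np.outer(t, p)
        y_work -= q * t
        weights.append(w); scores.append(t); x_loadings.append(p); y_loadings.append(q)
    return np.array(weights).T, np.array(scores).T, np.array(x_loadings).T, np.array(y_loadings)
```

Standard libraries implement the same idea (for example scikit-learn's PLSRegression), so in practice this sketch is only meant to make the deflation steps explicit.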
PLS example: PLS vs. PCA in regression, where Y is related to X1.
Classification Tree An example classification tree.
Classification Trees. Every split (usually binary) should increase node purity. The drop in impurity serves as the criterion for variable selection at each split. The tree should not be overly complex; it may be pruned.
Classification tree
Issues:
- How many splits should be allowed at a node?
- Which property should be used at a node?
- When should we stop splitting a node and declare it a “leaf”?
- How should the size of the tree be adjusted? Tree size reflects model complexity: too large a tree overfits; too small a tree may not capture the underlying structure.
- How should the classification decision be assigned at each leaf?
- How should missing data be handled?
Classification Tree Binary split.
Classification Tree. To decide which split criterion to use, we need a measure of node impurity. With $p_j$ the proportion of class j at node N:
- Entropy: $i(N) = -\sum_j p_j \log p_j$
- Misclassification: $i(N) = 1 - \max_j p_j$
- Gini impurity: $i(N) = \sum_{j \ne k} p_j p_k = 1 - \sum_j p_j^2$ (the expected error rate if the class label is assigned at random from the node's class distribution).
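A small sketch of these three impurity measures, assuming the class labels at a node are supplied as an array (illustrative code, not from the slides):

```python
import numpy as np

def class_proportions(labels):
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(labels):
    p = class_proportions(labels)
    return -np.sum(p * np.log2(p))            # 0 for a pure node

def misclassification(labels):
    return 1.0 - class_proportions(labels).max()

def gini(labels):
    p = class_proportions(labels)
    return 1.0 - np.sum(p ** 2)               # expected error under random labeling
```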
Classification Tree. Growing the tree. Greedy search: at every step, choose the query that decreases the impurity as much as possible. For a real-valued predictor, one may use gradient descent to find the optimal cut value (a simple split search is sketched after this list). When to stop?
- Stop when the reduction in impurity is smaller than a threshold.
- Stop when the leaf node is too small.
- Stop when a global criterion is met.
- Hypothesis testing.
- Cross-validation.
- Fully grow the tree and then prune.
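Here is a minimal sketch of the greedy split search for one real-valued predictor; instead of gradient descent it scans the midpoints of the sorted observed values (exhaustive search, the more common implementation choice), and it expects one of the impurity functions from the previous sketch, e.g. gini:

```python
import numpy as np

def best_split(x, labels, impurity):
    """Find the cut on a single real-valued predictor x with the largest impurity drop."""
    n = len(labels)
    parent_impurity = impurity(labels)
    best_cut, best_drop = None, 0.0
    # Candidate cuts: midpoints between consecutive sorted unique values
    values = np.unique(x)
    for cut in (values[:-1] + values[1:]) / 2:
        left, right = labels[x <= cut], labels[x > cut]
        child_impurity = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
        drop = parent_impurity - child_impurity
        if drop > best_drop:
            best_cut, best_drop = cut, drop
    return best_cut, best_drop
```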
Pruning the tree.
- Merge leaves when the resulting increase in impurity is not severe.
- Cost-complexity pruning allows elimination of a whole branch in a single step.
When priors and costs are present, adjust training by adjusting the Gini impurity.
Assigning a class label to a leaf:
- No prior: take the class with the highest frequency at the node.
- With prior: weight the frequencies by the prior.
- With a loss function: always minimize the expected loss.
Classification Tree. Choice of features.
Classification Tree Multivariate tree.
Bootstrapping: directly assess uncertainty from the training data. Basic idea: assuming the observed data approximate the true underlying density, re-sampling from them gives us an idea of the uncertainty caused by sampling.
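A minimal sketch of the idea: estimate the sampling uncertainty (standard error) of a statistic by resampling the observed data with replacement (the choice of statistic and the number of resamples are illustrative):

```python
import numpy as np

def bootstrap_se(data, statistic=np.median, n_boot=1000, seed=0):
    """Bootstrap estimate of the standard error of a statistic."""
    rng = np.random.default_rng(seed)
    n = len(data)
    replicates = np.empty(n_boot)
    for b in range(n_boot):
        sample = rng.choice(data, size=n, replace=True)   # resample with replacement
        replicates[b] = statistic(sample)
    return replicates.std(ddof=1)
```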
Bootstrapping
Bagging (“bootstrap aggregation”). Resample the training dataset, build a prediction model on each resampled dataset, and average the predictions. It is a Monte Carlo estimate of $E_{\hat{\mathcal{P}}}[\hat{f}^*(x)]$, where $\hat{\mathcal{P}}$ is the empirical distribution putting equal probability 1/N on each of the data points. Bagging only differs from the original estimate when f() is a non-linear or adaptive function of the data; when f() is a linear function of the data, the bagged estimate agrees with the original fit as B grows. Trees are perfect candidates for bagging – each bootstrap tree will differ in structure.
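A rough sketch of the procedure, using scikit-learn's DecisionTreeClassifier as the base learner for illustration (the function names and n_bags are placeholders, and integer class labels 0..K-1 are assumed):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagged_trees(X, y, n_bags=50, seed=0):
    """Fit one decision tree per bootstrap resample of the training data."""
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_bags):
        idx = rng.choice(n, size=n, replace=True)   # bootstrap resample (with replacement)
        trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))
    return trees

def bagged_predict(trees, X):
    """Aggregate by majority vote over the trees (labels assumed to be integers)."""
    votes = np.stack([t.predict(X) for t in trees])          # shape: (n_bags, n_samples)
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])
```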
Bagging trees. The bagged trees differ in structure.
Bagging trees Error curves.
Random Forest
Bagging can be seen as a method for reducing the variance of an estimated prediction function. It mostly helps high-variance, low-bias classifiers. By comparison, boosting builds weak classifiers one by one, allowing the collection to evolve in the right direction. Random forest is a substantial modification of bagging – it builds a large collection of de-correlated trees.
- Similar performance to boosting.
- Simpler to train and tune than boosting.
Random Forest. The intuition – the average of random variables. For B i.i.d. random variables, each with variance $\sigma^2$, the mean has variance $\sigma^2 / B$. For B identically distributed (i.d.) random variables, each with variance $\sigma^2$ and with pairwise correlation $\rho$, the mean has variance $\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$.
Bagged trees are i.d. samples. Random forest aims at reducing the correlation in order to reduce the variance. This is achieved by random selection of variables at each split.
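For completeness, the correlated-average variance used above follows from a one-line calculation (a standard result, not shown on the slide):

$\operatorname{Var}\!\Big(\tfrac{1}{B}\sum_{i=1}^{B} X_i\Big) = \tfrac{1}{B^2}\big(B\,\sigma^2 + B(B-1)\,\rho\,\sigma^2\big) = \rho\,\sigma^2 + \tfrac{1-\rho}{B}\,\sigma^2 .$

As B grows, the second term vanishes, so the pairwise correlation $\rho$ sets the floor on the achievable variance reduction, which is why the forest tries to de-correlate the trees.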
Random Forest
Example comparing RF to boosted trees.
Random Forest. A benefit of RF – the out-of-bag (OOB) samples give a built-in cross-validation error. For sample i, compute its RF prediction error using only the trees whose bootstrap samples did not include sample i. The OOB error rate is close to the N-fold cross-validation error rate. Unlike many other nonlinear estimators, RF can therefore be fit in a single sequence: stop growing the forest when the OOB error stabilizes.
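In practice the OOB error is reported directly by standard implementations; for example, a minimal scikit-learn sketch (parameter values are illustrative, and X_train, y_train are placeholders for your training data):

```python
from sklearn.ensemble import RandomForestClassifier

# oob_score=True asks the forest to evaluate each training sample
# using only the trees that did not see it during training.
rf = RandomForestClassifier(n_estimators=500, oob_score=True, random_state=0)
rf.fit(X_train, y_train)                 # X_train, y_train: placeholder training data
print("OOB accuracy:", rf.oob_score_)    # roughly an N-fold cross-validation estimate
```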
Random Forest. Variable importance – find the most relevant predictors. At every split of every tree, a variable contributes to the improvement of the impurity measure. Accumulating the reduction in i(N) for every variable gives a measure of the relative importance of the variables: the predictors that appear most often at split points, and lead to the largest reductions in impurity, are the important ones.
Another method: permute the values of a predictor in the OOB samples of every tree; the resulting decrease in prediction accuracy, accumulated over all trees, is also a measure of importance.
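Both flavors of importance are exposed by common implementations; a brief scikit-learn sketch, continuing the hypothetical rf, X_train, y_train from the previous example (note that scikit-learn's permutation importance is computed on whatever data you pass in, rather than per-tree OOB samples as described above):

```python
from sklearn.inspection import permutation_importance

# Impurity-based importance: accumulated impurity reduction per predictor
impurity_importance = rf.feature_importances_

# Permutation importance: drop in accuracy when each predictor is shuffled
perm = permutation_importance(rf, X_train, y_train, n_repeats=10, random_state=0)
permutation_scores = perm.importances_mean
```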
Random Forest