Using Random Forests to Explore a Complex Metabolomic Data Set
Susan Simmons
Department of Mathematics and Statistics, University of North Carolina Wilmington
Collaborators
Dr. David Banks (Duke)
Dr. Jacqueline Hughes-Oliver (NC State)
Dr. Stan Young (NISS)
Dr. Young Truong (UNC)
Dr. Chris Beecher (Metabolon)
Dr. Xiaodong Lin (SAMSI)
Large data sets
Examples:
– Walmart: 20 million transactions daily
– AT&T: 100 million customers, carrying 200 million calls a day on its long-distance network
– Mobil Oil: over 100 terabytes of data from oil exploration
– Human genome: gigabytes of data
– IRS
Dimensionality
3,000 metabolites
40,000 genes
100,000 chemicals
Try to find the signal in these data sets (and not the noise): data mining.
Examples of data mining techniques: pattern recognition, expert systems, genetic algorithms, neural networks, random forests.
Today's talk
Focus on classification (supervised learning: use a response to guide the learning process)
Response is categorical (each observation belongs to a "class")
Interested in the relationship between the variables and the response
Short, fat data (instead of long, skinny data)
Long, skinny data
(Illustrative table: a few variables, X, Y, and Z, with many rows of observations.)
Short, fat data
The n < p problem
(Illustrative table: many variables, X, Y, Z, S, T, V, M, N, R, Q, L, H, G, K, B, C, and W, with only a few rows of observations.)
Random Forests
Developed by Leo Breiman (Berkeley) and Adele Cutler (Utah State)
Can handle the n < p problem
Random forests are comparable in accuracy to support vector machines
Random forests are a combination of tree predictors
Constructing a tree
Observation  Gender  Height (inches)
1            F       60
2            F       66
3            M       68
4            F       70
5            F       66
6            M       72
7            F       64
8            M       67
Tree for the previous data set
All observations: N = 8
  Height ≤ 66: N = 4 (Male N = 0, Female N = 4)
  Height > 66: N = 4 (Male N = 3, Female N = 1)
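For comparison, here is a minimal sketch (not from the talk) that reproduces this single split with scikit-learn; max_depth=1 is an assumption that stops the tree after one split, matching the slide.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# The eight observations from the slide: predict gender from height.
heights = np.array([[60], [66], [68], [70], [66], [72], [64], [67]])
gender = np.array(["F", "F", "M", "F", "F", "M", "F", "M"])

tree = DecisionTreeClassifier(max_depth=1, random_state=0).fit(heights, gender)
print(export_text(tree, feature_names=["Height"]))
# Splits at Height <= 66.5: the left leaf is all female,
# the right leaf holds 3 males and 1 female.
```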
Random Forest
First, the number of trees to be grown must be specified, along with the number of variables randomly selected at each node (m). Each tree is constructed in the following manner:
1. At each node, randomly select m variables to split on.
2. The node is split using the best split among the selected variables.
3. Continue until each node has only one observation, or all the observations in a node belong to the same class.
Do this for each tree in the "forest". A sketch of the node-splitting step follows.
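Here is a minimal sketch of steps 1 and 2 (not the authors' code), assuming Gini impurity as the split criterion, which the slides do not name; `best_split` and its signature are illustrative.

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y, m, rng):
    """Choose a split at one node, searching only m randomly chosen variables.

    Returns (variable index, threshold) minimizing the weighted Gini
    impurity of the two child nodes.
    """
    n, p = X.shape
    candidates = rng.choice(p, size=min(m, p), replace=False)  # step 1
    best_j, best_t, best_score = None, None, np.inf
    for j in candidates:                       # step 2: best split among them
        for t in np.unique(X[:, j]):
            left = X[:, j] <= t
            if left.all() or not left.any():   # skip degenerate splits
                continue
            score = (left.sum() * gini(y[left]) +
                     (~left).sum() * gini(y[~left])) / n
            if score < best_score:
                best_j, best_t, best_score = j, t, score
    return best_j, best_t
```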
Example: Cereal Data
All observations: N = 70 (40 G, 30 K)
  Calories < 100: (2 G, 15 K)
    Fat < 1: 15 K
    Fat > 1: 2 G
  Calories ≥ 100: (38 G, 15 K)
    Carbo < 12: 15 K
    Carbo ≥ 12: 38 G
Random Forest
Another important feature is that each tree is grown on a bootstrap sample of the learning set.
Each bootstrap sample contains approximately 2/3 of the observations, so roughly 1/3 is left out; the sketch below checks this figure.
Each observation can then be run down the trees whose bootstrap samples did not contain it, and those trees "vote" on which class the observation belongs to. This gives an estimate of the error rate. An example follows the sketch.
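The 2/3 figure is the classic bootstrap coverage result: an observation appears in a sample of size n with probability 1 - (1 - 1/n)^n ≈ 1 - 1/e ≈ 0.632. A quick illustrative check (not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 70, 10_000  # n matches the cereal example's 70 observations

# Fraction of distinct observations captured by each bootstrap sample.
frac = np.mean([np.unique(rng.integers(0, n, size=n)).size / n
                for _ in range(reps)])
print(f"average in-bag fraction: {frac:.3f}")  # ~0.632, i.e. about 2/3
```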
All observations: N = 70 (40 G, 30 K)
  Calories < 100: (2 G, 15 K)
    Fat < 1: 15 K
    Fat > 1: 2 G
  Calories ≥ 100: (38 G, 15 K)
    Carbo < 12: 15 K
    Carbo ≥ 12: 38 G

Observation withheld from creating this tree:
Calories = 98, Fat = 2, Carbo = 10, Mfr = K
Running it down the tree: Calories = 98 < 100, then Fat = 2 > 1, so this tree votes G, while the true class is K.
Random Forest
This gives us an "out of bag" (OOB) error rate.
Random forests also give us an idea of which variables are important for classifying individuals.
They also give information about outliers.
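For illustration only (the talk does not say which software was used), scikit-learn's RandomForestClassifier exposes the OOB error rate and variable importances directly; the synthetic data and the parameter values below are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in with the shape of the data described later:
# 58 individuals, 105 metabolites, 3 classes.
rng = np.random.default_rng(0)
X = rng.normal(size=(58, 105))
y = np.repeat([0, 1, 2], [42, 6, 10])

forest = RandomForestClassifier(
    n_estimators=500,     # number of trees to grow
    max_features="sqrt",  # m variables tried at each node
    oob_score=True,       # score each point using trees that never saw it
    random_state=0,
).fit(X, y)

print("OOB error rate:", 1 - forest.oob_score_)
print("most important variables:",
      np.argsort(forest.feature_importances_)[::-1][:5])
```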
The era of the “omics” sciences
Just a few of the "omics" sciences
Genomics, Transcriptomics, Proteomics, Metabolomics, Phenomics, Toxicogenomics, Phylomics, Foldomics, Kinomics, Interactomics, Behavioromics, Variomics, Pharmacogenomics
Functional Genomics
Genomics
Transcriptomics
Proteomics
Metabolomics
Metabolites are all the small molecules in a cell (e.g., ATP, sugars, pyruvate, urea).
There are roughly 3,000 metabolites in the human body (compared to about 35,000 genes and approximately 100,000 proteins).
Metabolites are the most direct measure of cell physiology.
GC/MS and LC/MS are used to obtain the measurements.
Data
Currently we only have the GC/MS information.
Missing values are very informative (they indicate levels below the detection limits).
Missing values were imputed using uniform random variables from 0 to the minimum observed value, as sketched below.
105 metabolites
58 individuals (42 "disease 1", 6 "disease 2", and 10 "controls")
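A minimal sketch of that imputation rule, assuming the measurements sit in a NumPy array with NaN marking values below the detection limit (the Uniform(0, column minimum) draw is from the slide; everything else is illustrative):

```python
import numpy as np

def impute_below_detection(X, rng=None):
    """Replace each NaN with a Uniform(0, column minimum) draw."""
    rng = rng or np.random.default_rng(0)
    X = X.copy()
    for j in range(X.shape[1]):             # one column per metabolite
        missing = np.isnan(X[:, j])
        if not missing.any() or missing.all():
            continue                        # nothing to impute, or no bound
        col_min = np.nanmin(X[:, j])        # smallest detected value
        X[missing, j] = rng.uniform(0.0, col_min, size=missing.sum())
    return X
```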
Confusion matrix

       1    2    3
1     40    1    8
2      0    5    1
3      2    0    1

OOB error = 20.69% (12 of the 58 observations misclassified; the columns sum to the true class sizes of 42, 6, and 10)
Outlier
Variable Importance
Visual Data Dostat
Conclusions
Random forests, support vector machines, and neural networks are among the newest algorithms for understanding large data sets.
There is still much more to be done.
Thank you