
Using Random Forests to explore a complex Metabolomic data set Susan Simmons Department of Mathematics and Statistics University of North Carolina Wilmington

Collaborators
– Dr. David Banks (Duke)
– Dr. Jacqueline Hughes-Oliver (NC State)
– Dr. Stan Young (NISS)
– Dr. Young Truong (UNC)
– Dr. Chris Beecher (Metabolon)
– Dr. Xiaodong Lin (SAMSI)

Large data sets. Examples:
– Walmart: 20 million transactions daily
– AT&T: 100 million customers; carries 200 million calls a day on its long-distance network
– Mobil Oil: over 100 terabytes of data from oil exploration
– Human genome: gigabytes of data
– IRS

Dimensionality

– 3,000 metabolites
– 40,000 genes
– 100,000 chemicals
Trying to find the signal in these data sets (and not the noise) is data mining. Examples of data mining techniques: pattern recognition, expert systems, genetic algorithms, neural networks, random forests.

Today's talk
– Focus on classification (supervised learning: a response guides the learning process)
– The response is categorical (each observation belongs to a "class")
– Interested in the relationship between the variables and the response
– Short, fat data (instead of long, skinny data)

Long, skinny data: many observations on a few variables (X, Y, Z), so n > p. (Table of example data omitted.)

Short, fat data (the n < p problem): few observations on many variables (X, Y, Z, S, T, V, M, N, R, Q, L, H, G, K, B, C, W).

Random Forests
– Developed by Leo Breiman (Berkeley) and Adele Cutler (Utah State)
– Can handle the n < p problem
– Comparable in accuracy to support vector machines
– A random forest is a combination of tree predictors

Constructing a tree

Observation   Gender   Height (inches)
1             F        60
2             F        66
3             M        68
4             F        70
5             F        66
6             M        72
7             F        64
8             M        67

Tree for the previous data set

All observations (N = 8)
├─ Height ≤ 66 (N = 4): 0 male, 4 female
└─ Height > 66 (N = 4): 3 male, 1 female
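
This toy tree can be reproduced with a single-split decision tree; a minimal sketch, assuming scikit-learn (the slides name no software), using the eight observations above:

```python
# Minimal sketch of the height/gender tree above using scikit-learn
# (library choice is an assumption; the slides name no software).
import numpy as np
from sklearn.tree import DecisionTreeClassifier

heights = np.array([[60], [66], [68], [70], [66], [72], [64], [67]])
genders = np.array(["F", "F", "M", "F", "F", "M", "F", "M"])

tree = DecisionTreeClassifier(max_depth=1)  # a single split, as on the slide
tree.fit(heights, genders)

print(tree.tree_.threshold[0])       # split point: 66.5 inches
print(tree.predict([[64], [68]]))    # -> ['F' 'M'], matching the two leaves
```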

Random Forest
First, the number of trees to be grown must be specified, along with the number of variables (m) randomly selected at each node. Each tree is constructed as follows:
1. At each node, randomly select m variables as candidates for the split.

2. Split the node using the best split among the selected variables.
3. Continue until each node contains only one observation or all of its observations belong to the same class.
Do this for each tree in the "forest"; a code sketch follows.
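
As a hedged illustration of these steps (the talk predates this library, so treat it as a sketch rather than the author's method), scikit-learn's RandomForestClassifier exposes both tuning choices named above; the data here are synthetic stand-ins with the talk's 58 x 105 shape:

```python
# Sketch only: n_estimators is the number of trees to grow and
# max_features is m, the variables randomly tried at each node.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=58, n_features=105, random_state=0)

forest = RandomForestClassifier(
    n_estimators=500,    # number of trees, specified up front
    max_features=10,     # m: candidate variables per split
    min_samples_leaf=1,  # grow until nodes are pure or singletons
    random_state=0,
)
forest.fit(X, y)
```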

Example: Cereal Data

N = 70 (40 G, 30 K)
├─ Calories < 100 (2 G, 15 K)
│   ├─ Fat < 1: 15 K
│   └─ Fat > 1: 2 G
└─ Calories ≥ 100 (38 G, 15 K)
    ├─ Carbo < 12: 15 K
    └─ Carbo > 12: 38 G

(G and K are the two manufacturer classes.)

Random Forest
– Another important feature: each tree is grown on a bootstrap sample of the learning set.
– Each bootstrap sample contains approximately 2/3 of the data, so approximately 1/3 is left out; a numerical check follows below.
– The trees whose bootstrap samples do not contain a given observation can be used to estimate the error rate: each such tree "votes" on which class that observation belongs to. An example appears on the next slide.
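
The "approximately 2/3" figure can be checked numerically; a small sketch in plain NumPy (n = 10,000 is an arbitrary choice):

```python
# Sampling n points with replacement leaves each point out with
# probability (1 - 1/n)^n -> 1/e, so ~63% of points land in the bag.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
bag = rng.integers(0, n, size=n)          # one bootstrap sample of indices
in_bag = np.unique(bag).size / n
print(f"fraction in bag: {in_bag:.3f}")   # ~0.632; ~1/3 stays out of bag
```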

N = 70 (40 G, 30 K)
├─ Calories < 100 (2 G, 15 K)
│   ├─ Fat < 1: 15 K
│   └─ Fat > 1: 2 G
└─ Calories ≥ 100 (38 G, 15 K)
    ├─ Carbo < 12: 15 K
    └─ Carbo > 12: 38 G

An observation withheld from building this tree (its Calories, Fat, and Carbo values, with Mfr = K) is dropped down the tree, and the tree's vote for its class is recorded.

Random Forest
– This gives an "out-of-bag" (OOB) error rate.
– Random forests also indicate which variables are important for classifying individuals.
– They also give information about outliers.
A sketch of the first two quantities follows.
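
The OOB error and variable importance have direct counterparts in scikit-learn (again an assumption about tooling, not what the talk used); outlier measures based on proximities come from Breiman and Cutler's original randomForest implementation instead:

```python
# oob_score_ is accuracy from out-of-bag votes, so 1 - oob_score_ is the
# OOB error rate; feature_importances_ ranks the variables.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)
forest = RandomForestClassifier(n_estimators=500, oob_score=True,
                                random_state=0)
forest.fit(X, y)

print(f"OOB error rate: {1 - forest.oob_score_:.3f}")
top5 = forest.feature_importances_.argsort()[::-1][:5]
print("five most important variables:", top5)
```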

The era of the “omics” sciences

Just a few of the "omics" sciences: genomics, transcriptomics, proteomics, metabolomics, phenomics, toxicogenomics, phylomics, foldomics, kinomics, interactomics, behavioromics, variomics, pharmacogenomics.

Functional Genomics: Genomics → Transcriptomics → Proteomics → Metabolomics

– Metabolites are all the small molecules in a cell (e.g., ATP, sugars, pyruvate, urea)
– There are about 3,000 metabolites in the human body (compared to 35,000 genes and approximately 100,000 proteins)
– Metabolites are the most direct measure of cell physiology
– Measurements are obtained by GC/MS and LC/MS

Data
– Currently have only GC/MS information
– Missing values are very informative (below detection limits)
– Missing values imputed using uniform random variables from 0 to the minimum observed value (sketch below)
– 105 metabolites
– 58 individuals (42 "disease 1", 6 "disease 2", and 10 "controls")
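
A sketch of the imputation rule just described; the 58 x 105 shape matches the slide, but the values and the 20% missingness rate are made up for illustration:

```python
# Replace each missing (below-detection) value with a uniform draw
# between 0 and that metabolite's smallest observed value.
import numpy as np

rng = np.random.default_rng(0)
data = rng.lognormal(size=(58, 105))          # stand-in metabolite table
data[rng.random(data.shape) < 0.2] = np.nan   # simulate non-detects

for j in range(data.shape[1]):
    col = data[:, j]                          # view into column j
    below = np.isnan(col)
    if below.any():
        col[below] = rng.uniform(0, np.nanmin(col), size=below.sum())
```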

Confusion matrix (figure): OOB error = 20.69%

Outlier

Variable Importance

Visual Data Dostat

Conclusions
– Random forests, support vector machines, and neural networks are among the newest algorithms for understanding large data sets.
– There is still much more to be done.

Thank you