1 High Throughput Target Identification Stan Young, NISS Doug Hawkins, U Minnesota Christophe Lambert, Golden Helix Machine Learning, Statistics, and Discovery.

Presentation transcript:

1 High Throughput Target Identification Stan Young, NISS Doug Hawkins, U Minnesota Christophe Lambert, Golden Helix Machine Learning, Statistics, and Discovery 25 June 03

2 Micro Array Literature

3 Guilt by Association : You are known by the company you keep.

4 Data Matrix Goal: Associations over the genes. Guilty Gene Genes Tissues

5 Goals 1. Associations. 2. Deep associations, beyond 1st-level correlations. 3. Uncover multiple mechanisms.

6 Problems 1. n << p. 2. Strong correlations. 3. Missing values. 4. Non-normal distributions. 5. Outliers. 6. Multiple testing.

7 Technical Approach 1. Recursive partitioning. 2. Resampling-based, adjusted p-values. 3. Multiple trees.

8 Recursive Partitioning Tasks 1. Create classes. 2. How to split. 3. How to stop.

9 Differences
Recursive Partitioning: Top-down analysis. Can use any type of descriptor. Uses biological activities to determine which features matter. Produces a classification tree for interpretation and prediction. Big N is not a problem! Missing values are ok. Multiple trees, big p is ok.
Clustering: Often bottom-up. Uses “gestalt” matching. Requires an external method for determining the right feature set. Difficult to interpret or use for prediction. Big N is a severe problem!!

10 Forming Classes, Categories, Groups
Profession: Av. Income
Baseball Players: 1.5M
Football Players: 1.2M
Doctors: 0.8M
Dentists: 0.5M
Lawyers: 0.23M
Professors: 0.09M
...

11 Forming Classes from “Continuous” Descriptor How many “cuts” and where to make them?
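The slide poses the question without giving an algorithm. As a minimal sketch, assuming a single cut chosen to maximize the absolute pooled two-sample t statistic of the response, a cut search over one continuous descriptor could look like this. The names `pooled_t`, `best_cut`, and `min_size` are hypothetical and do not come from the presentation.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Pooled-variance two-sample t statistic for mean(b) - mean(a)."""
    df = len(a) + len(b) - 2
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / df
    return (mean(b) - mean(a)) / sqrt(pooled_var * (1 / len(a) + 1 / len(b)))


def best_cut(x, y, min_size=2):
    """Try every midpoint between adjacent distinct x values.

    Returns (cut, t) for the cut with the largest |t|, or None if no
    cut leaves at least min_size observations on each side.
    """
    pairs = sorted(zip(x, y))
    best = None
    for i in range(min_size, len(pairs) - min_size + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut can separate tied descriptor values
        left = [resp for _, resp in pairs[:i]]
        right = [resp for _, resp in pairs[i:]]
        t = pooled_t(left, right)
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        if best is None or abs(t) > abs(best[1]):
            best = (cut, t)
    return best


# Toy data (made up for illustration): the response jumps between x = 4 and x = 5.
toy_x = [1, 2, 3, 4, 5, 6, 7, 8]
toy_y = [0.1, 0.3, 0.2, 0.4, 2.1, 2.3, 2.2, 2.4]
cut, t = best_cut(toy_x, toy_y)
print(f"best cut at x = {cut}, t = {t:.1f}")
```

Allowing more than one cut turns the two-group t statistic into a multi-group F statistic, and searching over many candidate cuts is one of the sources of multiplicity that the raw p-value has to be adjusted for.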

12 Splitting: t-test
Parent node: n = 1650, ave = 0.34, sd = 0.81
Child nodes: n = 1614, ave = 0.29, sd = 0.73; n = 36, ave = 2.60, sd = 0.9
t = Signal / Noise
NN-CC TT: NN-CC, rP = 2.03E-70, aP = 1.30E-66
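The numeric value of t did not survive in the transcript. As a hedged sketch, the statistic can be recomputed from the node summaries on this slide, assuming a pooled-variance two-sample t-test (the slide does not say whether a pooled or unequal-variance form was used). The function name `pooled_t_from_summary` is hypothetical.

```python
from math import sqrt


def pooled_t_from_summary(n1, mean1, sd1, n2, mean2, sd2):
    """Pooled-variance t statistic from group sizes, means, and SDs.

    Signal is the difference in means; noise is its standard error.
    Returns (t, degrees_of_freedom).
    """
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    std_err = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean2 - mean1) / std_err, df


# Summaries of the two child nodes as given on the slide.
t_stat, df = pooled_t_from_summary(1614, 0.29, 0.73, 36, 2.60, 0.9)
print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

Because the inputs are rounded to two decimals, the result is only approximate and is not guaranteed to reproduce the p-values quoted on the slide.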

13 Splitting: F-test
Parent node: n = 1650, ave = 0.34, sd = 0.81
Child nodes: n = 1553, ave = 0.21, sd = 0.73; n = 36, ave = 2.60, sd = 0.9; n = 61, ave = 1.29, sd = 0.83
$F = \dfrac{\text{Signal}}{\text{Noise}} = \dfrac{\text{Among Var}}{\text{Within Var}} = \dfrac{\sum (X_{i\cdot} - X_{\cdot\cdot})^2 / df_1}{\sum (X_{ij} - X_{i\cdot})^2 / df_2}$
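As with the t-test slide, the numeric F value is missing from the transcript. The sketch below recomputes an F ratio from the three child-node summaries, assuming the standard one-way ANOVA form in which each group's squared deviation from the grand mean is weighted by its group size (the formula on the slide does not show that weight). The function name `f_from_summary` is hypothetical.

```python
def f_from_summary(sizes, means, sds):
    """One-way ANOVA F ratio from per-group sizes, means, and SDs.

    Returns (F, df1, df2) with df1 = k - 1 and df2 = N - k.
    """
    k = len(sizes)
    total_n = sum(sizes)
    grand_mean = sum(n * m for n, m in zip(sizes, means)) / total_n
    # Among-group (signal) sum of squares, weighted by group size.
    ss_among = sum(n * (m - grand_mean) ** 2 for n, m in zip(sizes, means))
    # Within-group (noise) sum of squares, recovered from the group SDs.
    ss_within = sum((n - 1) * s ** 2 for n, s in zip(sizes, sds))
    df1, df2 = k - 1, total_n - k
    return (ss_among / df1) / (ss_within / df2), df1, df2


# Summaries of the three child nodes as given on the slide.
f_stat, df1, df2 = f_from_summary(
    sizes=[1553, 36, 61],
    means=[0.21, 2.60, 1.29],
    sds=[0.73, 0.9, 0.83],
)
print(f"F = {f_stat:.1f} on ({df1}, {df2}) degrees of freedom")
```

The grand mean is recomputed from the rounded child summaries rather than taken from the parent node, so the result is approximate.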

14 How to Stop Examine each current terminal node. Stop if no variable/class has a significant split, multiplicity adjusted.
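A minimal sketch of this stopping rule follows. It uses a Bonferroni adjustment as a simple stand-in for the resampling-based adjustment the talk relies on, so it illustrates the decision logic only. The name `should_stop` and the default `alpha=0.05` are assumptions.

```python
def should_stop(raw_p_values, alpha=0.05):
    """Return True if a terminal node should not be split further.

    raw_p_values holds one raw p-value per candidate split of the node.
    Each is Bonferroni-adjusted for the number of candidates; the node
    stops when no adjusted p-value falls below alpha.
    """
    m = len(raw_p_values)
    if m == 0:
        return True  # nothing left to split on
    adjusted = [min(1.0, p * m) for p in raw_p_values]
    return min(adjusted) >= alpha


print(should_stop([0.03, 0.5, 0.9]))  # 0.03 * 3 = 0.09, not significant
print(should_stop([1e-6, 0.4]))       # one split stays significant
```

A raw p-value of 0.03 looks significant on its own but no longer is after adjusting for three candidate splits, which is the point of the multiplicity adjustment.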

15 Levels of Multiple Testing 1. Raw p-value. 2. Adjust for class formation, segmentation. 3. Adjust for multiple predictors. 4. Adjust for multiple splits in the tree. 5. Adjust for multiple trees.

16 Understanding observations NB: Splitting variables govern the process, linked to response variable. Multiple Mechanisms. Conditionally important descriptors.

17 Multiple Mechanisms

18 Reality: Example Data 60 Tissues 1453 Genes Gene 510 is the “guilty” gene, the Y.

19 1 st Split of Gene 510 (Guilty Gene)

20 Split Selection 14 splitters with adjusted p-value < 0.05

21 Histogram Non-normal, hence resampling p-values make sense.

22 Resampling-based Adjusted p-value
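The transcript gives only the slide title here. One common way to obtain resampling-based adjusted p-values, sketched below as an assumption rather than as the authors' procedure, is a single-step max-statistic permutation scheme: permute the response, record the largest statistic over all candidate predictors, and compare each observed statistic with that null distribution. The names `abs_mean_diff` and `maxstat_adjusted_p` are hypothetical, and the toy data are made up.

```python
import random
from statistics import mean


def abs_mean_diff(y, flags):
    """Absolute difference in mean response between the two groups of a binary predictor."""
    hi = [v for v, f in zip(y, flags) if f]
    lo = [v for v, f in zip(y, flags) if not f]
    return abs(mean(hi) - mean(lo))


def maxstat_adjusted_p(y, predictors, n_perm=2000, seed=0):
    """Permutation p-values adjusted for testing several predictors at once."""
    rng = random.Random(seed)
    observed = [abs_mean_diff(y, flags) for flags in predictors]
    exceed = [0] * len(predictors)
    y_perm = list(y)
    for _ in range(n_perm):
        rng.shuffle(y_perm)  # break any link between response and predictors
        max_stat = max(abs_mean_diff(y_perm, flags) for flags in predictors)
        for j, obs in enumerate(observed):
            if max_stat >= obs:
                exceed[j] += 1
    # Add-one correction keeps every p-value strictly positive.
    return [(count + 1) / (n_perm + 1) for count in exceed]


# Toy data: predictor 0 separates low from high responses, predictor 1 is noise.
toy_y = [0.1, 0.2, 0.0, 0.3, 0.1, 0.2, 2.5, 2.6, 2.4, 2.7, 2.5, 2.6]
toy_predictors = [[0] * 6 + [1] * 6, [0, 1] * 6]
adjusted = maxstat_adjusted_p(toy_y, toy_predictors)
print(adjusted)
```

No distributional assumption is needed, which fits the remark on the histogram slide that the data are non-normal.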

23 Single Tree RP Drawbacks Data greedy. Only one view of the data. May miss other mechanisms. Highly correlated variables may be obscured. Higher order interactions may be masked. No formal mechanisms for follow-up experimental design. Disposition of outliers is difficult.

24 Etc. Multiple Trees, how and why?

25 How do you get multiple trees? 1. Bootstrap the sample, one tree per sample. 2. Randomize over valid splitters. Etc.
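Option 1 can be sketched as follows, under heavy simplification: each "tree" is reduced to its first split, chosen as the binary predictor with the largest absolute mean difference on a bootstrap sample of the rows. Tallying which predictor wins across samples gives a crude view of which variables matter. The names `split_strength` and `bootstrap_first_splits` are hypothetical, and the toy data are made up.

```python
import random
from collections import Counter
from statistics import mean


def split_strength(y, flags):
    """Absolute mean difference between the two groups; 0 if a group is empty."""
    hi = [v for v, f in zip(y, flags) if f]
    lo = [v for v, f in zip(y, flags) if not f]
    if not hi or not lo:
        return 0.0
    return abs(mean(hi) - mean(lo))


def bootstrap_first_splits(y, predictors, n_trees=200, seed=0):
    """Count how often each predictor gives the best first split."""
    rng = random.Random(seed)
    n = len(y)
    chosen = Counter()
    for _ in range(n_trees):
        rows = [rng.randrange(n) for _ in range(n)]  # sample rows with replacement
        y_boot = [y[i] for i in rows]
        strengths = [
            split_strength(y_boot, [flags[i] for i in rows])
            for flags in predictors
        ]
        chosen[strengths.index(max(strengths))] += 1
    return chosen


# Toy data: predictor 0 separates low from high responses, predictor 1 is noise.
toy_y = [0.1, 0.2, 0.0, 0.3, 0.1, 0.2, 2.5, 2.6, 2.4, 2.7, 2.5, 2.6]
toy_predictors = [[0] * 6 + [1] * 6, [0, 1] * 6]
counts = bootstrap_first_splits(toy_y, toy_predictors)
print(counts.most_common())
```

Option 2 would instead keep the original sample and pick at random among the splitters that pass the significance threshold, so that correlated or secondary predictors also get a chance to appear in some trees.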

26 Random Tree Browsing, 1000 Trees.

27 Example Tree

28 1 st Split

29 Example Tree, 2 nd Split

30 Conclusion for Gene G510 If G518 < … and G790 < … then G510 = … +/- 0.30

31 Using Multiple Trees to Understand variables Which variables matter? How to rank variables in importance. Correlations. Synergistic variables.

32 Correlation Interaction Matrix Red=Syn.

33 Summary Reviewed recursive partitioning. Demonstrated multiple-tree RP’s capabilities: find associated genes; group correlated predictors (genes); find synergistic predictors (genes that predict together). Used to understand a complex data set.

34 Needed research Real data sets with known answers. Benchmarking. Linking to gene annotations. Scale (1,000*10,000). Multiple testing in complex data sets. Good visualization methods. Outlier detection for large data sets. Missing values. (see NISS paper 123)

35 Teams NC State University: Jacqueline Hughes-Oliver, Katja Rimlinger. U Waterloo: Will Welch, Hugh Chipman, Marcia Wang, Yan Yuan. U. Minnesota: Douglas Hawkins. NISS: Alan Karr (Consider post docs). GSK: Lei Zhu, Ray Lam.

36 References/Contact papers 122 and GSK patent.

37 Questions