Introduction to Challenge 2 The NIEHS - NCATS - UNC DREAM Toxicogenetics Challenge THE DATA Fred A. Wright, Ph.D. Professor and Director of the Bioinformatics.

Presentation transcript:

Introduction to Challenge 2
The NIEHS - NCATS - UNC DREAM Toxicogenetics Challenge
THE DATA
Fred A. Wright, Ph.D.
Professor and Director of the Bioinformatics Research Center
Departments of Statistics and Biological Sciences
North Carolina State University
amateurbrainsurgery.com

In vitro cytotoxicity screening of human cell lines to characterize variability and map susceptibility loci
Many caveats are obvious, but bear repeating:
- limitations of the in vitro environment
- cell type
- sources of technical variation
On the other hand, we are working with the correct species, and there is much that can be done:
- heritability analysis
- identification of potential mechanisms underlying variability, mostly via genetic mapping
- characterization of average response and variation across agents/chemicals, to prioritize in vitro data used for predictive toxicity models

Image courtesy of M. Andersen and D. Krewski

Much of the previous work has been in pharmacogenomics, especially cytotoxicity screening of anticancer agents. However, most of the principles apply to any agent/chemical.
Figure: cytotoxicity heritability estimates from 125 lymphoblastoid cell lines (LCLs), 29 chemotherapeutic agents

CYTOTOXICITY PROFILING - BOILING DOWN TO A NUMBER(?)
Experiments done in batches
Challenge: estimation of cytotoxic response or other relevant phenotype per cell line in the presence of variation
Solution: likelihood-based fitting of EC10 values, with outlier detection and batch correction
Figure axes: log10(concentration) vs. cytotoxicity (normalized % cell survival)

The concept of population toxicity involves means and true variability, obscured by technical variation
Figure labels (Chemical 1): observed data, true variation across population, measurement variation
The measure of susceptibility/resistance (e.g. EC10) for one cell line has error

A vulnerable subpopulation
The concept of population toxicity involves means and true variability, obscured by technical variation

In the high-throughput screening toxicology literature, there is relatively little data to support these concepts across multiple populations
Figure panels: Chemical 1, Chemical 2, Chemical 3, Chemical 4
Prioritizing chemicals for vulnerable subpopulations depends on both means and variances
Observed variability has the potential to provide finer-grained uncertainty factors in risk assessment

The Challenge Data

The data in context - previous cell line vs. chemical/drug studies
Figure: heatmap of the EC10 values (axes to scale)

Ranking chemicals by average cytotoxicity is of obvious interest – even with this large sample size, some uncertainty in ranking

EC10 for each cell line
The 5th and 95th percentiles/quantiles are of interest from a risk assessment perspective. We call q95 - q05 the “fold-range”.
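A minimal sketch of how the quantile-based summaries above could be computed for one chemical. It assumes the EC10 values are held on a log10 scale, so that the difference q95 - q05 reads as a fold change; that scale, the linear-interpolation quantile, the function names, and the sample values are all illustrative assumptions rather than details taken from the slides.

```python
def quantile(values, q):
    """Quantile by linear interpolation between order statistics (0 <= q <= 1)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("need at least one value")
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac


def fold_range(ec10_values):
    """q95 - q05 of the per-cell-line EC10 values for one chemical."""
    return quantile(ec10_values, 0.95) - quantile(ec10_values, 0.05)


# Hypothetical log10(EC10) values for one chemical across 21 cell lines.
ec10 = [i / 10 for i in range(21)]
print("median:", quantile(ec10, 0.5))
print("fold-range:", fold_range(ec10))
```

Other quantile conventions exist and give slightly different values in small samples; with several hundred cell lines per chemical the choice matters little.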

884 lines that are “unrelated” (i.e. no first-degree relatives), split into training, test, and validation sets
Subchallenge 1 - predict EC10 from SNPs and RNA-Seq data
156 chemicals that are “predictable”: 106 training, 50 test
Subchallenge 2 - predict average and fold-range from chemical descriptors

The NIEHS - NCATS - UNC DREAM Toxicogenetics Challenge
OVERALL RESULTS
Federica Eduati, Ph.D.
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI)
Cambridge, United Kingdom

Subchallenge 1: Data
Predict interindividual variability in cytotoxicity based on genomic profiles
Training:
- EC10 data for 106 compounds and 487 cell lines
- Genotype data for 487 cell lines
- RNAseq data for 192 cell lines
Leaderboard (released Aug 31st):
- Genotype data for 133 cell lines
- RNAseq data for 48 cell lines
- Predict: EC10 data for 106 compounds and 133 cell lines
Final test:
- Genotype data for 264 cell lines
- RNAseq data for 97 cell lines
- Predict: EC10 data for 106 compounds and 264 cell lines
Figure: cytotoxicity data (EC10) matrices of compounds (Comp 1 ... Comp 106) by cell lines for the training (487), leaderboard (133), and final test (264) sets

Probabilistic C-index
Accounts for the probabilistic nature of the gold standard: because of experimental error, the exact order of the cell lines is variable if there is noise
For each compound: to each pair of cell lines, it assigns a score given by the probability that the predicted ranking is supported by the noisy gold standard
Figure: Comp 1 ranking of cell lines 1-4 under exact measures vs. noisy measures
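A minimal sketch of a pairwise score of this kind, for one compound. The slides do not give the formula, so this version makes explicit assumptions: each measured EC10 has a known standard error, the measurement noise is independent and Gaussian, and the pair scores are averaged with equal weight. The function names are hypothetical and the challenge's exact definition may differ.

```python
import math
from itertools import combinations


def prob_greater(x_i, x_j, se_i, se_j):
    """P(true value i > true value j), assuming independent Gaussian noise."""
    scale = math.sqrt(se_i ** 2 + se_j ** 2)
    if scale == 0:
        # No noise: the observed order is certain (0.5 for an exact tie).
        return 1.0 if x_i > x_j else 0.0 if x_i < x_j else 0.5
    z = (x_i - x_j) / scale
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probabilistic_c_index(predicted, observed, std_err):
    """Mean probability that each predicted pair order matches the noisy data."""
    if len(predicted) < 2:
        raise ValueError("need at least two cell lines")
    scores = []
    for i, j in combinations(range(len(predicted)), 2):
        p = prob_greater(observed[i], observed[j], std_err[i], std_err[j])
        if predicted[i] > predicted[j]:
            scores.append(p)
        elif predicted[i] < predicted[j]:
            scores.append(1.0 - p)
        else:
            scores.append(0.5)  # a tied prediction carries no ordering information
    return sum(scores) / len(scores)


# Hypothetical EC10 values for four cell lines on one compound.
observed = [1.0, 2.0, 3.0, 4.0]
print(probabilistic_c_index(observed, observed, [0.5] * 4))
```

With zero standard errors the score reduces to the ordinary concordance index; as the noise grows, even a perfect ordering scores below 1 because closely spaced pairs are only weakly supported by the data.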

Scoring metrics
Correlation between predicted and observed values:
- Pearson correlation
Ranking of cytotoxicity for different cell lines:
- Probabilistic C-index
- Spearman correlation
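For reference, the two correlation metrics can be sketched with the standard textbook formulas, using only the Python standard library. Spearman correlation is computed here as Pearson correlation of the ranks, with tied values sharing their average rank; the function names and sample values are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """Ascending ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            ranks[k] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical observed and predicted values: same order, nonlinear relation.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 4.0, 9.0, 16.0]
print(pearson(predicted, observed), spearman(predicted, observed))
```

The example shows why both kinds of metric are useful: a prediction that preserves the order perfectly but not the linear relation gets a perfect rank correlation and a lower Pearson correlation.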

Predictions vs null hypothesis

Scoring
1. For each submission, compute the following metrics compound by compound:
   a. Pearson correlation
   b. Probabilistic C-index
2. For each metric:
   a. Rank submissions for each compound
   b. Compute the mean ranking over all compounds
   c. Rank submissions according to the mean ranking
3. The final ranking is obtained by averaging the rankings obtained with the 2 different metrics
Figure: each submission (Submission 1 ... Submission M) is a matrix of cytotoxicity data (EC10) for Test cell line 1 ... N by Comp 1 ... Comp 106; the per-compound ranks form a submissions-by-compounds table with a mean ranking column
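The rank-aggregation steps above can be sketched as follows. The input is assumed to be one table per metric, where `table[s][c]` holds the score of submission `s` on compound `c` and higher is better. Tie handling (average ranks) and all function names are assumptions made for illustration; the slides do not specify them.

```python
def rank_descending(values):
    """Rank 1 = largest value; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda k: -values[k])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in order[pos:end + 1]:
            ranks[k] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def rank_by_metric(table):
    """Steps 2a-2c: per-compound ranks, mean rank, then rank of the mean rank."""
    n_sub, n_comp = len(table), len(table[0])
    mean_rank = [0.0] * n_sub
    for c in range(n_comp):
        ranks = rank_descending([table[s][c] for s in range(n_sub)])
        for s in range(n_sub):
            mean_rank[s] += ranks[s] / n_comp
    # A smaller mean rank is better, so negate before ranking descending.
    return rank_descending([-m for m in mean_rank])


def final_ranking(pearson_table, pci_table):
    """Step 3: average the two per-metric rankings (lower is better)."""
    by_pearson = rank_by_metric(pearson_table)
    by_pci = rank_by_metric(pci_table)
    return [(a + b) / 2 for a, b in zip(by_pearson, by_pci)]


# Hypothetical scores for 3 submissions on 3 compounds.
pearson_table = [[0.5, 0.4, 0.6], [0.3, 0.5, 0.2], [0.1, 0.1, 0.1]]
pci_table = [[0.6, 0.6, 0.6], [0.7, 0.7, 0.7], [0.5, 0.5, 0.5]]
print(final_ranking(pearson_table, pci_table))
```

Working with ranks rather than raw scores keeps any single compound, or either metric's scale, from dominating the result.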

Robustness (sampling) analysis
Verify if the rank is robust with respect to the compounds
For [count missing in source] times:
1. randomly mask data for 10% of the compounds
2. re-compute the score
Figure legend: * marks significantly different vs. not significantly different (one-sided Wilcoxon signed-rank test, FDR < [threshold missing in source])
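A minimal sketch of the masking loop described above, for a single metric. The number of repetitions is not preserved in the transcript, so `n_rounds` is left as a required argument; the function names, the seed, and the tie handling are assumptions. The score recomputed here is the mean rank over the compounds that remain unmasked.

```python
import random


def mean_ranks(table, compounds):
    """Mean rank of each submission over the given compound indices (1 = best)."""
    n_sub = len(table)
    totals = [0.0] * n_sub
    for c in compounds:
        column = [table[s][c] for s in range(n_sub)]
        for s in range(n_sub):
            better = sum(1 for v in column if v > column[s])
            tied = sum(1 for v in column if v == column[s]) - 1
            totals[s] += 1 + better + 0.5 * tied  # average rank under ties
    return [t / len(compounds) for t in totals]


def robustness_samples(table, n_rounds, mask_fraction=0.10, seed=0):
    """Repeatedly mask a random fraction of compounds and re-compute the score."""
    rng = random.Random(seed)
    n_comp = len(table[0])
    n_masked = max(1, round(mask_fraction * n_comp))
    samples = []
    for _ in range(n_rounds):
        masked = set(rng.sample(range(n_comp), n_masked))
        kept = [c for c in range(n_comp) if c not in masked]
        samples.append(mean_ranks(table, kept))
    return samples


# Hypothetical scores: 3 submissions on 20 compounds, 5 masking rounds.
table = [[0.9] * 20, [0.5] * 20, [0.1] * 20]
for sample in robustness_samples(table, n_rounds=5):
    print(sample)
```

The per-round scores of two submissions form paired samples, which is the kind of input a one-sided Wilcoxon signed-rank test takes; that test and the FDR correction are not implemented here.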

Wisdom of crowds

Subchallenge 2: Data
Predict population-level parameters of cytotoxicity of chemicals based on structural attributes of compounds.
Training:
- EC10 data for 106 compounds and 620 cell lines
- Chemical attributes for 106 chemicals
Final test:
- Chemical attributes for 50 chemicals
- Predict: population-level parameters for 50 compounds
  (a) Median EC10 values
  (b) Interquantile distance (q95-q05)
Figure: DATA is the cytotoxicity (EC10) matrix of Cell line 1 ... Cell line 620 by Train Comp 1 ... Train Comp 106; PREDICTIONS are (a) and (b) for Test Comp 1 ... Test Comp 50

Predictions vs null hypothesis

Scoring
1. For each submission, compute the following metrics for each predicted population parameter (median, q95-q05):
   a. Pearson correlation
   b. Spearman correlation
2. For each metric:
   a. Rank submissions for each population parameter
   b. Compute the mean ranking over the 2 population parameters
   c. Rank submissions according to the mean ranking
3. The final ranking is obtained by averaging the rankings obtained with the 2 different metrics
Figure: each submission (Submission 1 ... Submission M) provides Median EC10 and Q95-Q05 for Test Comp 1 ... Test Comp 50; the per-parameter ranks form a submissions-by-parameters table with a mean ranking column

Robustness (sampling) analysis
Verify if the rank is robust with respect to the compounds
For [count missing in source] times:
1. randomly mask data of 10% of the compounds
2. re-compute the score
Figure legend: * marks significantly different vs. not significantly different (one-sided Wilcoxon signed-rank test, FDR < [threshold missing in source])

Wisdom of crowds

Conclusions
Predictive models of toxicity were developed by participants, with a great response from the community:
- Subchallenge 1: 99 submissions from 34 teams
- Subchallenge 2: 85 submissions from 24 teams
Predictions were scored against a hidden test set
Top-performing models provide significant predictions that could be useful to assess health risk
Best performers are robustly ranked first, but there are other models which provide good predictions
- wisdom of crowds: the aggregation of predictions can increase overall performance
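The slides do not say how predictions were aggregated for the wisdom-of-crowds analysis. As one simple possibility, an unweighted element-wise mean across submissions could look like the sketch below; the function name and sample values are hypothetical.

```python
def aggregate_predictions(submissions):
    """Element-wise mean of equally long prediction lists (one per submission)."""
    if not submissions:
        raise ValueError("need at least one submission")
    n = len(submissions[0])
    if any(len(s) != n for s in submissions):
        raise ValueError("all submissions must predict the same items")
    return [sum(s[k] for s in submissions) / len(submissions) for k in range(n)]


# Hypothetical predictions from three teams for the same four test items.
team_predictions = [
    [1.0, 2.0, 3.0, 4.0],
    [1.5, 1.5, 3.5, 3.5],
    [0.5, 2.5, 2.5, 4.5],
]
print(aggregate_predictions(team_predictions))
```

Averaging tends to help when the individual models make partly independent errors, since those errors cancel in the mean.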

Rebecca Boyles, Allen Dearry, Raymond Tice, Nour Abdo, Paul Gallins, Oksana Kosyk, Ivan Rusyn, Jessica Wignall, Fred Wright, Kai Xia, Yi-Hui Zhou, Christopher Austin, Ruili Huang, Anton Simeonov, Menghang Xia, Chris Bare, Stephen Friend, Mike Kellen, Lara Mangravite, Thea Norman, Federica Eduati, Michael Menden, Kely Norel, Julio Saez-Rodriguez, Gustavo Stolovitzky
213 participants