DNA methylation age of human tissues and cell types. Genome Biol. 2013 14(10):R115 PMID: 24138928.



Statistical goal and challenge
Goal: build an age prediction method based on tens of thousands of variables
–Dependent variable y = transformed version of chronological age (in years)
–Covariates = CpGs
–Approach: penalized regression (elastic net)
Challenge: how to combine multiple training data sets generated by different labs, etc.
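For orientation, the elastic net criterion can be written as below. This is one common parametrization, not necessarily the exact one used for the clock: the penalty weight \lambda and the mixing parameter \alpha are not given on the slides, and F denotes the (unspecified) age transformation. Here x_{ij} is the methylation value of CpG j in sample i.

```latex
\[
(\hat\beta_0, \hat\beta) \;=\; \arg\min_{\beta_0,\,\beta}\;
\frac{1}{2n} \sum_{i=1}^{n} \Bigl( F(\mathrm{age}_i) - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Bigr)^{2}
\;+\; \lambda \Bigl[ \alpha \,\lVert \beta \rVert_1 + \frac{1-\alpha}{2}\, \lVert \beta \rVert_2^{2} \Bigr]
\]
```

The \ell_1 part of the penalty sets most coefficients exactly to zero, which is how a regression on tens of thousands of CpGs can end up selecting only a few hundred.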

Training data sets

Test data sets

Construction of the epigenetic clock
A large DNA methylation data set was assembled by combining publicly available individual data sets measured on the Illumina 27K or Illumina 450K array platform. The training and test data involved n=7844 non-cancer samples from 82 individual data sets, which assess DNA methylation levels in 51 different tissues and cell types. Although many data sets were collected for studying certain diseases, they largely involved healthy tissues.
–In particular, cancer tissues were excluded.

Illumina data sets
The first 39 data sets were used to construct ("train") the age predictor. Other data sets were used to test (validate) the age predictor, and further data sets served other purposes, e.g. to estimate the DNAm age of embryonic stem and iPS cells.
Training data were chosen i) to represent a wide spectrum of tissues/cell types, ii) to involve samples whose mean age (43 years) is similar to that in the test data, and iii) to involve a high proportion of samples (37%) measured on the Illumina 450K platform, since many ongoing studies use this recent Illumina platform.
Only CpGs (measured with the Infinium type II assay) that were present on both Illumina platforms (Infinium 450K and 27K) and had fewer than 10 missing values across the data sets were studied.
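The CpG filter described above can be sketched roughly as follows. This is a minimal illustration, not the original pipeline: the function names, the dictionary layout (CpG identifier mapped to beta values pooled across data sets, with None for a missing value) and the probe identifiers are all hypothetical.

```python
from typing import Dict, List, Optional, Set


def count_missing(values: List[Optional[float]]) -> int:
    """Number of missing (None) beta values for one CpG."""
    return sum(v is None for v in values)


def filter_cpgs(
    betas: Dict[str, List[Optional[float]]],
    probes_27k: Set[str],
    probes_450k: Set[str],
    max_missing: int = 10,
) -> List[str]:
    """Keep CpGs present on both platforms with fewer than max_missing gaps.

    The layout of `betas` is an assumption made for this sketch.
    """
    shared = probes_27k & probes_450k  # probes measured on both arrays
    return sorted(
        cpg
        for cpg, values in betas.items()
        if cpg in shared and count_missing(values) < max_missing
    )


# Toy example with made-up probe names.
toy_betas = {
    "cg_a": [0.1] * 20,                # complete, on both platforms
    "cg_b": [None] * 10 + [0.5] * 10,  # 10 missing values: dropped
    "cg_c": [None] * 9 + [0.2] * 11,   # 9 missing values: kept
    "cg_d": [0.3] * 20,                # only on the 450K array: dropped
}
toy_27k = {"cg_a", "cg_b", "cg_c"}
toy_450k = {"cg_a", "cg_b", "cg_c", "cg_d"}
kept = filter_cpgs(toy_betas, toy_27k, toy_450k)
```

Restricting to the intersection of the two probe sets is what lets a single predictor be applied to data from either platform.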

Age predictor
To ensure an unbiased validation in the test data, only the training data were used to define the age predictor. A transformed version of chronological age was regressed on the CpGs using a penalized regression model (elastic net). The elastic net regression model automatically selected 353 CpGs. I refer to the 353 CpGs as (epigenetic) clock CpGs, since their weighted average (formed by the regression coefficients) amounts to an epigenetic clock.
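As a rough sketch of how such a weighted average turns into an age estimate: the linear score is computed from the clock CpGs and then mapped back to years by inverting the age transformation. The slides do not state the transformation, so the log-linear form and the adult-age cutoff of 20 used here are assumptions for illustration, and the coefficients and probe names are made up.

```python
import math
from typing import Dict

ADULT_AGE = 20.0  # assumed cutoff; not stated on the slides


def transform_age(age: float, adult_age: float = ADULT_AGE) -> float:
    """Assumed transformation: logarithmic up to adult_age, linear after."""
    if age <= adult_age:
        return math.log(age + 1) - math.log(adult_age + 1)
    return (age - adult_age) / (adult_age + 1)


def inverse_transform(x: float, adult_age: float = ADULT_AGE) -> float:
    """Map a transformed value back to age in years."""
    if x < 0:
        return (1 + adult_age) * math.exp(x) - 1
    return (1 + adult_age) * x + adult_age


def dnam_age(betas: Dict[str, float], coefs: Dict[str, float], intercept: float) -> float:
    """Weighted average of clock CpGs, back-transformed to years."""
    score = intercept + sum(w * betas[cpg] for cpg, w in coefs.items())
    return inverse_transform(score)


# Hypothetical two-CpG "clock" purely for illustration.
toy_coefs = {"cg_a": 1.0, "cg_b": -0.5}
toy_sample = {"cg_a": 0.6, "cg_b": 0.2}
toy_estimate = dnam_age(toy_sample, toy_coefs, intercept=0.5)
```

Regressing a transformed age rather than raw age lets the model follow rapid change early in life and a slower, roughly linear rate later on, if the assumed form holds.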

Accuracy across tissues and cell types (training)

Accuracy across test data

Accuracy in brain tissue

Results sent to me via … Blood data from Marco Boks, Jan 2014.

Excerpts from …
Epigenetic clock applied to large cohort studies. Median error is less than 3.5 years.
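A median error of this kind can be computed as the median absolute difference between DNAm age and chronological age. The sketch below assumes that definition; the function name and the toy ages are hypothetical.

```python
from statistics import median
from typing import Sequence


def median_abs_error(dnam_ages: Sequence[float], chrono_ages: Sequence[float]) -> float:
    """Median absolute difference between predicted and chronological age."""
    if len(dnam_ages) != len(chrono_ages):
        raise ValueError("both sequences must have the same length")
    return median([abs(p - a) for p, a in zip(dnam_ages, chrono_ages)])


# Made-up ages in years, for illustration only.
toy_error = median_abs_error([30, 42, 55, 61], [28, 45, 55, 60])
```

The median is less sensitive than the mean to a few badly predicted samples, which matters when pooling many cohorts.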

Aging clock applied to urine
Figure created by bioinformatician Wei Guo at Zymo Research.

Factors influencing accuracy: standard deviation of age, tissue

Using the clock for measuring the age of different parts of the body

The clock works in the genus Pan: common chimpanzees and bonobos

ES cells and iPS cells are perfectly young

Heritability (based on twin studies) of age acceleration is 40% in older subjects and 100% in newborns
Rows correspond to 2 different twin data sets.
Red dots = monozygotic twin pairs
Black dots = dizygotic twin pairs
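One standard way to turn twin correlations into a heritability estimate is Falconer's formula, h² = 2(r_MZ − r_DZ). The slides do not say which estimator was used, so the sketch below is an assumption. It also assumes that age acceleration means DNAm age minus chronological age, and the twin values are made up.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[float, float]  # age acceleration of twin 1 and twin 2


def age_acceleration(dnam_age: float, chrono_age: float) -> float:
    """Assumed definition: DNAm age minus chronological age."""
    return dnam_age - chrono_age


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def falconer_heritability(mz_pairs: List[Pair], dz_pairs: List[Pair]) -> float:
    """Falconer's estimate: twice the gap between MZ and DZ correlations."""
    r_mz = pearson([a for a, _ in mz_pairs], [b for _, b in mz_pairs])
    r_dz = pearson([a for a, _ in dz_pairs], [b for _, b in dz_pairs])
    return 2 * (r_mz - r_dz)


# Toy twin pairs (age acceleration in years), for illustration only.
toy_mz = [(1.0, 1.0), (2.0, 2.0), (3.0, 3.0), (-1.0, -1.0)]
toy_dz = [(1.0, 2.0), (2.0, 1.0), (3.0, 4.0), (4.0, 3.0)]
toy_h2 = falconer_heritability(toy_mz, toy_dz)
```

Because the estimate is a difference of sample correlations, it is noisy in small twin sets and can fall outside the range 0 to 1.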

Conclusions
Most studies that involved telomere length and other biomarkers can be revisited.
User-friendly software can be found on my webpage.
–I recommend the online age calculator, since it outputs a host of array quality statistics that can be used to identify samples where the age prediction may not be accurate. Data get deleted right after you upload them.
–Don't pre-process data too much. Don't remove batch effects, etc. Raw beta values will be fine.
I am always happy to collaborate.