Genome-Wide Association Studies


Genome-Wide Association Studies
Caitlin Collins, Thibaut Jombart
MRC Centre for Outbreak Analysis and Modelling, Imperial College London
Genetic data analysis using […] 30-10-2014

Outline
Introduction to GWAS
Study design: GWAS design; issues and considerations in GWAS
Testing for association: univariate methods; multivariate methods (penalized regression methods, factorial methods)

Genomics & GWAS

The genomics revolution
Sequencing technology: 1977, Sanger; 1995, 1st bacterial genomes; 2003, 1st human genome. Throughput: from < 10,000 to > 10,000,000,000,000 bases per day per machine.
GWAS publications: 2005, 1st GWAS (age-related macular degeneration); 2014, 1,991 publications and 14,342 associations.

A few GWAS discoveries…

So what is GWAS?
Genome-Wide Association Study: looking for SNPs associated with a phenotype.
Purpose, explain: understanding, mechanisms, therapeutics.
Purpose, predict: intervention, prevention (understanding not required).

Association
Definition: any relationship between two measured quantities that renders them statistically dependent.
Heritability: the proportion of variance explained by genetics, with P = G + E + G*E. Heritability > 0.
Data: p SNPs measured in n individuals (cases and controls).
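
The decomposition above can be illustrated with a small simulation. This is a minimal sketch, assuming no G*E term and normally distributed G and E; the function name and the variances are illustrative, not taken from the slides.

```python
import random
import statistics

def simulate_heritability(n=20000, var_g=0.4, var_e=0.6, seed=1):
    """Simulate P = G + E (no G*E term) and return Var(G) / Var(P)."""
    rng = random.Random(seed)
    # Genetic and environmental components, drawn independently
    g = [rng.gauss(0.0, var_g ** 0.5) for _ in range(n)]
    e = [rng.gauss(0.0, var_e ** 0.5) for _ in range(n)]
    p = [gi + ei for gi, ei in zip(g, e)]
    # Heritability: share of phenotypic variance explained by genetics
    return statistics.pvariance(g) / statistics.pvariance(p)

h2 = simulate_heritability()
print(round(h2, 2))
```

With these assumed variances the estimate should land close to 0.4, the share of the simulated phenotypic variance that is genetic.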

Why?
Environment, gene-environment interactions
Complex traits, small effects, rare variants
Gene expression levels
GWAS methodology?

Study Design

GWAS design
Case-control: well-defined "case"; known heritability.
Variations: quantitative phenotypic data (e.g. height, biomarker concentrations); explicit models (e.g. dominant or recessive).

Issues & Considerations
Data quality: 1% rule.
Controlling for confounding: sex, age, health profile; correlation with other variables.
Population stratification
Linkage disequilibrium

Population stratification
Definition: "population stratification" = population structure. Systematic difference in allele frequencies between sub-populations, possibly due to different ancestry.
Problem: violates the assumptions of population homogeneity and independent observations → confounding, spurious associations. Case population more likely to be related than control population → over-estimation of the significance of associations.

Population stratification II
Solutions
Visualise: phylogenetics, PCA.
Correct: genomic control; regression on the principal components of PCA.
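
Genomic control, one of the corrections listed above, can be sketched in a few lines. This is a minimal sketch, assuming 1-df chi-squared association statistics; the function name and the example values are illustrative, and clamping the inflation factor at 1 is a common convention rather than something stated in the slides.

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-squared distribution with 1 df

def genomic_control(chi2_stats):
    """Return (lambda_gc, corrected statistics) for 1-df chi-squared tests."""
    # Inflation factor: observed median relative to the expected null median
    lam = statistics.median(chi2_stats) / CHI2_1DF_MEDIAN
    # Assumed convention: only deflate statistics, never inflate them
    lam = max(lam, 1.0)
    return lam, [x / lam for x in chi2_stats]

# Made-up test statistics, for illustration only
stats = [0.2, 0.5, 0.9, 1.4, 12.0]
lam, corrected = genomic_control(stats)
print(round(lam, 3))
```

After correction, the median of the statistics matches the median expected under the null, which is the point of the method: stratification inflates all statistics, and dividing by the inflation factor removes that shared excess.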

Linkage disequilibrium (LD)
Definition: alleles at separate loci are NOT independent of each other.
Problem? Too much LD is a problem → noise >> signal. Some (predictable) LD can be beneficial → enables use of "marker" SNPs.
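
One common way to quantify LD between two biallelic loci is the squared correlation r², built from D = p_AB − p_A·p_B. The sketch below assumes phased haplotypes with alleles coded 0/1; the function name and the toy data are illustrative.

```python
def ld_r2(haplotypes):
    """r^2 between two biallelic loci.

    haplotypes: list of (a, b) pairs, alleles coded 0/1 at locus A and locus B.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # departure from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    # A monomorphic locus carries no LD information
    return d * d / denom if denom > 0 else 0.0

perfect = ld_r2([(0, 0), (1, 1), (0, 0), (1, 1)])      # alleles always co-occur
independent = ld_r2([(0, 0), (0, 1), (1, 0), (1, 1)])  # all combinations equally frequent
print(perfect, independent)
```

A marker SNP in strong LD with a causal variant (r² near 1) carries nearly the same information, which is what makes "marker" SNPs usable.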

Testing for Association

Methods for association testing
Standard GWAS: univariate methods.
Incorporating interactions: multivariate methods, namely penalized regression methods (LASSO) and factorial methods (DAPC-based FS).

Univariate methods
Data: p SNPs measured in n individuals (cases and controls).
Approach: individual test statistics, then correction for multiple testing.
Variations, testing: Fisher's exact test, Cochran-Armitage trend test, chi-squared test, ANOVA. Gold standard: Fisher's exact test.
Variations, correcting: Bonferroni. Gold standard: FDR.
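
The two steps (one test per SNP, then a multiple-testing correction) can be sketched as follows. This is a minimal sketch under stated assumptions: it uses the Pearson chi-squared test on allele counts rather than Fisher's exact test, and Benjamini-Hochberg as one FDR procedure (the slides only say "FDR"). Function names and counts are illustrative.

```python
import math

def allelic_chi2(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Pearson chi-squared test (1 df) on a 2x2 table of allele counts.

    Assumes every row and column total is non-zero.
    """
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    rows = [sum(r) for r in table]
    cols = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = rows[i] * cols[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-squared distribution with 1 df
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

def bonferroni(p_values, alpha=0.05):
    """Reject where p < alpha / number of tests."""
    return [p < alpha / len(p_values) for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, k = largest rank with p <= alpha * rank / m."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank
    rejected = [False] * m
    for rank, i in enumerate(order, start=1):
        rejected[i] = rank <= cutoff_rank
    return rejected

# Made-up allele counts for three SNPs: (case_alt, case_ref, ctrl_alt, ctrl_ref)
counts = [(60, 40, 40, 60), (50, 50, 50, 50), (52, 48, 49, 51)]
results = [allelic_chi2(*c) for c in counts]
p_values = [p for _, p in results]
bonf = bonferroni(p_values)
bh = benjamini_hochberg(p_values)
print(bonf, bh)
```

Bonferroni controls the chance of any false positive and is the more conservative of the two; Benjamini-Hochberg controls the expected proportion of false positives among the rejections, so it typically rejects at least as many hypotheses.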

Univariate: strengths & weaknesses
Strengths: straightforward; computationally fast; conservative; easy to interpret.
Weaknesses: multivariate system, univariate framework; effect size of individual SNPs may be too small; marginal effects of individual SNPs ≠ combined effects.

What about interactions?

Interactions
Epistasis: "Deviation from linearity under a general linear model"
$$\mu_i = \delta + \beta_1 x_i + \beta_2 y_i \quad \text{vs.} \quad \mu_i = \delta + \beta_1 x_i + \beta_2 y_i + \gamma x_i y_i$$
With p predictors, there are $\binom{p}{k} \approx \frac{p^k}{k!}$ k-way interactions.
p = 1,000,000 → $\approx 5 \times 10^{11}$. That's 500 BILLION possible pair-wise interactions!
Need some way to limit the number of pairwise interactions considered…
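
The size of the interaction space can be checked directly. A minimal sketch: the function names and the value of p (one million SNPs) are illustrative assumptions.

```python
import math

def n_interactions(p, k=2):
    """Exact number of k-way interactions among p predictors."""
    return math.comb(p, k)

def n_interactions_approx(p, k=2):
    """Approximation p^k / k!, accurate when p is much larger than k."""
    return p ** k / math.factorial(k)

p = 1_000_000  # illustrative number of SNPs
exact = n_interactions(p)
approx = n_interactions_approx(p)
print(exact, approx)
```

For pairwise interactions the count grows with the square of p, so a tenfold increase in the number of SNPs means roughly a hundredfold increase in the number of pairs to test.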

Multivariate methods
Neural networks: genetic programming optimized neural networks; parametric decreasing method.
Penalized regression: LASSO penalized regression; the elastic net; ridge regression.
Logic trees: logic feature selection; Monte Carlo logic regression; logic regression; modified logic regression-gene expression programming; genetic programming for association studies.
Bayesian approaches: Bayesian partitioning; Bayesian logistic regression with stochastic search variable selection; Bayesian epistasis association mapping.
Factorial methods: sparse-PCA; supervised-PCA; DAPC-based FS (snpzip).
Non-parametric methods: multi-factor dimensionality reduction (MDR) method; odds-ratio-based MDR; random forests; restricted partitioning method; combinatorial partitioning method; set association approach.

Multivariate methods
Penalized regression methods: LASSO penalized regression.
Factorial methods: DAPC-based feature selection.

Penalized regression methods
Approach: regression models multivariate association; shrinkage estimation → feature selection.
Variations: LASSO, ridge, elastic net, logic regression. Gold standard: LASSO penalized regression.

LASSO penalized regression
Generalized linear model ("glm").
Penalization: L1 norm; coefficients → 0.
Feature selection!
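
How an L1 penalty sets coefficients exactly to zero can be shown with a small coordinate-descent solver. This is a sketch under simplifying assumptions, not the method used in the slides: it minimises the squared-error objective (1/2n)·‖y − Xβ‖² + λ‖β‖₁ (a linear model rather than a general GLM), with centred predictors and no intercept. All names and data are illustrative.

```python
def soft_threshold(value, lam):
    """Shrink value towards zero by lam; values within [-lam, lam] become 0."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def lasso_coordinate_descent(X, y, lam, n_iter=200):
    """Minimise (1/2n) * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # equals y - X @ beta while beta is all zeros
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(v * v for v in col) / n
            if z == 0.0:
                continue  # constant-zero column: coefficient stays at 0
            # Covariance of column j with the residual, ignoring its own effect
            rho = sum(v * (r + v * beta[j]) for v, r in zip(col, residual)) / n
            new_beta = soft_threshold(rho, lam) / z
            delta = new_beta - beta[j]
            if delta != 0.0:
                residual = [r - v * delta for v, r in zip(col, residual)]
                beta[j] = new_beta
    return beta

# Toy design with three orthogonal predictors; y depends strongly on the
# first, weakly on the second, and not at all on the third.
X = [[1, 1, 1], [-1, 1, -1], [1, -1, -1], [-1, -1, 1]]
y = [2.1, -1.9, 1.9, -2.1]
beta = lasso_coordinate_descent(X, y, lam=0.5)
print(beta)
```

The weak and null predictors are dropped entirely while the strong one is kept but shrunk, which is the "coefficients → 0, feature selection" behaviour named on the slide. The penalty strength is a user-chosen value here; in practice it is usually calibrated, for example by cross-validation.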

LASSO: strengths & weaknesses
Strengths: stability; interpretability; likely to accurately select the most influential predictors; sparsity.
Weaknesses: multicollinearity; not designed for high-p; computationally intensive; calibration of penalty parameters (user-defined → variability); sparsity.

Factorial methods
Approach: place all variables (SNPs) in a multivariate space; identify the discriminant axis → best separation; select the variables with the highest contributions to that axis.
Variations: supervised-PCA, sparse-PCA, DA, DAPC-based FS. Our focus: DAPC with feature selection (snpzip).

DAPC-based feature selection
[Figure: alleles by individuals for the healthy ("controls") and diseased ("cases") groups; density of individuals along the discriminant axis.]

DAPC-based feature selection
Where should we draw the line? → Hierarchical clustering
[Figure: density of individuals along the discriminant axis.]
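
The idea of letting clustering choose the cut-off can be sketched on a list of per-variable contributions. This is a simplified stand-in, not the snpzip implementation: it assumes single-linkage clustering cut into two groups, which in one dimension amounts to splitting the sorted values at their largest gap. The function name and the contribution values are illustrative.

```python
def split_by_largest_gap(contributions):
    """Two-cluster single-linkage split of 1-D values.

    Returns the indices of the variables in the high-contribution cluster.
    Requires at least two values.
    """
    order = sorted(range(len(contributions)), key=lambda i: contributions[i])
    # Gaps between consecutive sorted contributions
    gaps = [contributions[order[k + 1]] - contributions[order[k]]
            for k in range(len(order) - 1)]
    # The last single-linkage merge bridges the widest gap, so cut there
    cut = max(range(len(gaps)), key=lambda k: gaps[k])
    return sorted(order[cut + 1:])

# Made-up contributions of six SNPs to the discriminant axis
contrib = [0.01, 0.02, 0.30, 0.015, 0.28, 0.03]
selected = split_by_largest_gap(contrib)
print(selected)
```

No threshold is supplied by the user: the number of selected variables follows from the data, which is also why that number varies from one data set to another.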

Hierarchical clustering (FS) Hooray!

DAPC: strengths & weaknesses
Strengths: more likely to catch all relevant SNPs (signal); computationally quick; good exploratory tool; redundancy > sparsity.
Weaknesses: sensitive to n.pca; n.snps.selected varies; no "p-value"; redundancy > sparsity.

Conclusions
Study design: GWAS design; issues and considerations in GWAS
Testing for association: univariate methods; multivariate methods (penalized regression methods, factorial methods)

Thanks for listening!

Questions?