Genome-wide association interaction analysis (GWAI)

Slides:



Advertisements
Similar presentations
Statistical methods for genetic association studies
Advertisements

What is an association study? Define linkage disequilibrium
Qualitative and Quantitative traits
METHODS FOR HAPLOTYPE RECONSTRUCTION
Genetic Analysis in Human Disease
Association Mapping David Evans. Outline Definitions / Terminology What is (genetic) association? How do we test for association? When to use association.
QTL Mapping R. M. Sundaram.
How to catch epistasis: theory and practice
Transmission Genetics: Heritage from Mendel 2. Mendel’s Genetics Experimental tool: garden pea Outcome of genetic cross is independent of whether the.
Lab 13: Association Genetics. Goals Use a Mixed Model to determine genetic associations. Understand the effect of population structure and kinship on.
Gene-gene and gene-environment interactions Manuel Ferreira Massachusetts General Hospital Harvard Medical School Center for Human Genetic Research.
CS177 Lecture 9 SNPs and Human Genetic Variation Tom Madej
Estimating “Heritability” using Genetic Data David Evans University of Queensland.
More Powerful Genome-wide Association Methods for Case-control Data Robert C. Elston, PhD Case Western Reserve University Cleveland Ohio.
A coalescent computational platform for tagging marker selection for clinical studies Gabor T. Marth Department of Biology, Boston College
MSc GBE Course: Genes: from sequence to function Genome-wide Association Studies Sven Bergmann Department of Medical Genetics University of Lausanne Rue.
Gene Set Analysis 09/24/07. From individual gene to gene sets Finding a list of differentially expressed genes is only the starting point. Suppose we.
Integrating domain knowledge with statistical and data mining methods for high-density genomic SNP disease association analysis Dinu et al, J. Biomedical.
Using biological networks to search for interacting loci in genome-wide association studies Mathieu Emily et. al. European journal of human genetics, e-pub.
Something related to genetics? Dr. Lars Eijssen. Bioinformatics to understand studies in genomics – São Paulo – June Image:
Give me your DNA and I tell you where you come from - and maybe more! Lausanne, Genopode 21 April 2010 Sven Bergmann University of Lausanne & Swiss Institute.
Pathway Analysis. Goals Characterize biological meaning of joint changes in gene expression Organize expression (or other) changes into meaningful ‘chunks’
Genetic Analysis in Human Disease. Learning Objectives Describe the differences between a linkage analysis and an association analysis Identify potentially.
Rare and common variants: twenty arguments G.Gibson Homework 3 Mylène Champs Marine Flechet Mathieu Stifkens 1 Bioinformatics - GBIO K.Van Steen.
Lecture 5: Segregation Analysis I Date: 9/10/02  Counting number of genotypes, mating types  Segregation analysis: dominant, codominant, estimating segregation.
Standardization of Pedigree Collection. Genetics of Alzheimer’s Disease Alzheimer’s Disease Gene 1 Gene 2 Environmental Factor 1 Environmental Factor.
Methods of Genome Mapping linkage maps, physical maps, QTL analysis The focus of the course should be on analytical (bioinformatic) tools for genome mapping,
Systematic reviews of genetic association studies Robert Walton Fiona Fong 15 March 2013.
Chapter 5 Characterizing Genetic Diversity: Quantitative Variation Quantitative (metric or polygenic) characters of Most concern to conservation biology.
Case(Control)-Free Multi-SNP Combinations in Case-Control Studies Dumitru Brinza and Alexander Zelikovsky Combinatorial Search (CS) for Disease-Association:
The Complexities of Data Analysis in Human Genetics Marylyn DeRiggi Ritchie, Ph.D. Center for Human Genetics Research Vanderbilt University Nashville,
The medical relevance of genome variability Gabor T. Marth, D.Sc. Department of Biology, Boston College Medical Genomics Course – Debrecen,
CS177 Lecture 10 SNPs and Human Genetic Variation
From Genome-Wide Association Studies to Medicine Florian Schmitzberger - CS 374 – 4/28/2009 Stanford University Biomedical Informatics
Copyright © 2013 Pearson Education, Inc. All rights reserved. Chapter 4 Genetics: From Genotype to Phenotype.
Genome-Wide Association Study (GWAS)
Experimental Design and Data Structure Supplement to Lecture 8 Fall
Quantitative Genetics
Methods in genome wide association studies. Norú Moreno
Lab 13: Association Genetics December 5, Goals Use Mixed Models and General Linear Models to determine genetic associations. Understand the effect.
Lecture 13: Linkage Analysis VI Date: 10/08/02  Complex models  Pedigrees  Elston-Stewart Algorithm  Lander-Green Algorithm.
Future Directions Pak Sham, HKU Boulder Genetics of Complex Traits Quantitative GeneticsGene Mapping Functional Genomics.
Lecture 24: Quantitative Traits IV Date: 11/14/02  Sources of genetic variation additive dominance epistatic.
Lecture 21: Quantitative Traits I Date: 11/05/02  Review: covariance, regression, etc  Introduction to quantitative genetics.
An quick overview of human genetic linkage analysis
Association analysis Genetics for Computer Scientists Biomedicum & Department of Computer Science, Helsinki Päivi Onkamo.
Association mapping for mendelian, and complex disorders January 16Bafna, BfB.
Genome wide association studies (A Brief Start)
The International Consortium. The International HapMap Project.
An quick overview of human genetic linkage analysis Terry Speed Genetics & Bioinformatics, WEHI Statistics, UCB NWO/IOP Genomics Winterschool Mathematics.
Fast test for multiple locus mapping By Yi Wen Nisha Rajagopal.
Lecture 22: Quantitative Traits II
Multiple-Locus Genome-Wide Association Testing David Dean CSE280A.
Chapter 22 - Quantitative genetics: Traits with a continuous distribution of phenotypes are called continuous traits (e.g., height, weight, growth rate,
Association tests. Basics of association testing Consider the evolutionary history of individuals proximal to the disease carrying mutation.
Biostatistics Regression and Correlation Methods Class #10 April 4, 2000.
Genetics of Gene Expression BIOS Statistics for Systems Biology Spring 2008.
STT2073 Plant Breeding and Improvement. Quality vs Quantity Quality: Appearance of fruit/plant/seed – size, colour – flavour, taste, texture – shelflife.
Complex Patterns of Inheritance INCOMPLETE DOMINANCE, CODOMINANCE, MULTIPLE ALLELES, EPISTASIS AND POLYGENIC INHERITANCE.
An atlas of genetic influences on human blood metabolites Nature Genetics 2014 Jun;46(6)
SNPs and complex traits: where is the hidden heritability?
Complex disease and long-range regulation: Interpreting the GWAS using a Dual Colour Transgenesis Strategy in Zebrafish.
upstream vs. ORF binding and gene expression?
Genome Wide Association Studies using SNP
Gene Hunting: Design and statistics
High level GWAS analysis
Medical genomics BI420 Department of Biology, Boston College
Medical genomics BI420 Department of Biology, Boston College
Kernel Methods for large-scale Genomics Data Analysis
Presentation transcript:

Genome-wide association interaction analysis (GWAI) Epistasis Genome-wide association interaction analysis (GWAI) Elena S. Gusareva, PhD egusareva@ulg.ac.be (1) Systems and Modeling Unit, Montefiore Institute (2) Bioinformatics and Modeling, GIGA-R Université de Liège Belgium

Outline Epistasis: definition and example of biological epistasis Protocol for genome-wide association interaction analysis (GWAI) Data collection Quality control Choosing a strategy for GWAI (exhaustive and selective epistasis screening) Tests of association LD-pruning Confounders and population stratification Interpretation and follow-up (replication analysis and validation) GWAI screening: an example on Alzheimer’s disease

Environmental factor 1…n Human diseases Monogenic disease - Phenylketonuria (Phenylalanine hydroxylase – PAH gene) Single gene Ph: disease mutation Linkage analysis Complex disease - Crohn's disease (99 disease susceptibility loci ~ 25% of heritability of CD) Ph: disease Gender Age Gene 1 Gene2 Gene …n Environmental factor 1…n GWA analysis MISSING HERITABILITY !!!

Missing heritability Where does heritability hide? Overestimated heritability Inaccurate definition of pathological traits Low frequency variants and rare variants De novo mutations Epigenetic effects, CNVs, STRs, etc. Gene X Environment interactions effects Epistasis Etc.

How blue eyed parents can have a brown eyed child? Biological epistasis William Bateson, 1909 - “compositional epistasis” driven by biology Distortions of Mendelian segregation ratios due to one gene masking the effects of another Whenever two or more loci interact to create new phenotypes Whenever an allele at one locus masks or modifies the effects of alleles at one or more other loci Epistasis is an interaction at the phenotypic level of organization. It does not necessary imply biochemical interaction between gene products. How blue eyed parents can have a brown eyed child? By Dr. Barry Starr, Stanford University

pigment compound for pea flower Simple examples of epistasis Genes interact to create new phenotypes precursor Step 1 Step 2 Anthocyanin pigment compound for pea flower Gene 1: c/C - white Gene 2: p/P - white purple   Genotype at locus P Genotype at locus C pp pP PP cc white cC purple CC Masking effect of gene Gene 2: g/G - dominant Gene 1: BB and bB – black bb - white X grey Hair color in mice   Genotype at locus G Genotype at locus B gg gG GG bb white Grey bB Black BB

Statistical epistasis Ronald Fisher, 1918 - “statistical epistasis” Epistasis is when two (or more) different genes contribute to a single phenotype and their effects are not merely additive (deviations from a model of additive multiple effects for quantitative traits). Örjan Carlborg and Chris S. Haley, Nature Reviews Genetics, V 5, 2004

Why is there epistasis? C.H. Waddington, 1942: canalization and stabilizing selection theory: Phenotypes are stable in the presence of mutations through natural selection. The genetic architecture of phenotypes is comprised of networks of genes that are redundant and robust. Only when there are multiple mutational hits to the gene network occur the phenotypes can change dramatically. Epistasis create dependencies among the genes in the network and thus keep the stability of the system. Identification of epistasis is a step to systems-level genetics where we can understand all the complexity of underling biology of the complex traits. Greenspan, R. The flexible genome.  Nature Reviews Genetics 2,385. 

How big can be a disease variance due to epistasis? Joshua S. Bloom et al. & Leonid Kruglyak, Nature 2013 Finding the sources of missing heritability in a yeast cross. Researchers mated two yeast strains (BY from a laboratory and RM from a vineyard) and collected the offspring (tetrads) for genome-wide analysis to look for gene variations that contributed to survival (46 quantitative traits). This experimental setup allowed the researchers to determine how much of the inherited survival traits could be detected by genome-wide scan and how much was "missing" or undetectable using a genome-wide scan. (Illustration by Joshua Bloom, Kruglyak Lab) Broad-sense heritability (total heritability) - narrow-sense heritability (heritability due to main-effect variants or additive genetic factors) = variance due to gene–gene interactions: from 2% to 54% of heritability from 1 till 16 pair-wise interactions associated with 24 quantitative traits two-locus interactions in which neither locus has a detectable main effect were uncommon

Complications in humans ??? Can we use the statistical evidence of epistasis at the population level to infer biological or genetical epistasis in an individual? ??? Does biological evidence of epistasis imply that statistical evidence will be found? Epistasis can be very complex depending on Number of loci in epistasis (if more than one epistatic interaction occurs to cause a disease, then identifying the genes involved and defining their relationships becomes even more difficult.) Manner of inheritance of each particular locus (dominant, co-dominant, recessive, additive) Gene penetrance Confounding factors: environment, age, gender, etc. The trait (phenotype) they contribute to (binary, continuous, complex traits - disease)

Ho to identify epistasis? There are plenty methods exist each utilizing different statistical methodologies and addressing different aspects of biological epistais. Kilpatrick 2009 Clear strategy is required…

Protocol for GWAI 0. Data collecting and genotyping 1. Samples and markers quality control (e.g. SVS 7.5, PLINK software): HWE test (P > 1*10-4), call rate > 98%, marker allele frequency (MAF > 0.05) 2.b.1 Markers prioritization/pre-filtering (e.g. Biofilter 1.1.0 tool) 2.b.2 LD pruning (e.g. SVS 7.5, PLINK software): window size 50 bp, window increment 1 bp, LD r2 threshold 0.75 2.b.4 (Genome-Wide) Screening for pair-wise SNP interactions (e.g. MB-MDR2D analysis, SD plot, logistic regression-based methods) 2.a.1 LD pruning (e.g. SVS 7.5, PLINK software): 2.a.3 Exhaustive genome-wide screening for pair-wise SNP interactions (e.g. BOOST analysis) 4. Replication of epistasis in the independent data and meta-analysis (e.g. fixed effects and random effects models) 3. Replication analysis with alternative methods for epistasis detection: follow up the selected set of markers 2.b.1 Selection of SNPs basing on their function (e.g. SNPper - SNP Finder tool) 2.b.1 Selection of SNPs from candidate genes (data from literature) Exhaustive epistasis screening (a) Selective epistasis screening (b) 2.b.3 Adjustment for confounders (e.g. R software, via logistic regression), family structure (e.g. GenABEL software, via mixed polygenic model), population stratification (e.g. GenABEL software, via mixed polygenic model; SVS 7.5 software, via PCA) 5. Biological validation (e.g. immunological pathway analysis, eQTL analysis, DNA transcription factor binding sites analysis, composite elements binding sites analysis, etc.)

Selecting strategy for GWAI The exhaustive screening includes testing for all possible pair-wise interactions across all genetic markers. + all information is used for the analysis + new genetic loci can be detected - computationally demanding - test statistics has to be quite simple to be run in a reasonable time - power the analysis has to be very large to pass through stringent multiple testing criteria The selective screening, exploits particular assumptions and/or special methods to substantially reduce number of markers in the analysis and search for pair-wise interactions only across potentially more promising genetic loci. + computationally less demanding + more robust statistical methods can be applied (including adjustment for confounders) + less severe multiple testing correction is needed - smaller chance for detection of previously unreported epistasis

Pre-filtering of genetic markers for selective screening The selection of genetic markers is usually based on prior expert knowledge about a trait/disease under investigation. candidate genes/markers markers from coding regions of genes that can potentially change protein structure selection using filtering tools that take into account the biology behind the trait under investigation Biofilter uses biological information about gene-gene relationships and gene-disease relationships to construct multi-SNP models before conducting any statistical analysis. Model production is gene centric. Biofilter data-sources: Gene Ontology KEGG - The Kyoto Encyclopedia of Genes and Genomes Net Path - source of curated immune signaling and cancer pathways PFAM - Protein Families Database Reactome - database of curated core pathways and reactions in human biology DIP - The Database of Interacting Proteins

Exhaustive epistasis screening method BOOST (BOolean Operation-based Screening and Testing) is a fast two-stage (screening and testing) approach to search for epistasis associated with a binary outcome. Stage 1: In the screening stage, a non-iterative method is used to approximate the likelihood ratio statistic. Stage 2: In the testing stage, the classical likelihood ratio test is employed to measure the interaction effects of selected SNP pairs + can calculated interaction around over 0.5 millions of SNPs - can not deal with continuous traits (only for binary traits) cannot deal with LD does not perform automatically the multiple testing correction has limitations with respect to statistical power The BOOST is based on simple logistic regression (co-dominant model of inheritance): A logistic regression produces a logistic curve, which is limited to values between 0 and 1. Logistic regression is similar to a linear regression, but the curve is constructed using the natural logarithm of the “odds” of the independent variable, rather than the probability. Moreover, the predictors do not have to be normally distributed or have equal variance in each group.

Selective epistasis screening methods: MB-MDR Model-Based Multifactor Dimensionality Reduction (MB-MDR) method implies association testing between a trait and a factor consisting of multilocus genotype information. Step1: For every pair of markers, each multilocus genotype (MLG) is tested for association with a trait against of the group of other MLGs. Basing on this statistics each MLG is classified as “high risk”, “low risk” or “no evidence for risk” (by default risk threshold = 0.1), and than all MLGs of the same class are merged. Step2: For each risk category, “high” and “low” (captures summarized information about the importance of the pair of markers), a new association test is performed. Step3: The significance is explored through a permutation test (1000 permutations) and correction for the multiple testing bb bB BB aa   aA AA bb bB BB aa L aA O H AA H L vs + binary and continuous traits + different study designs + adjustment for covariates + adjust for main effect - computationally demanding

Selective epistasis screening methods: SD plot Synergy disequilibrium (SD) method: to assess interaction in a small set of genetic markers and graphically represent the results The synergy between two SNPs Si and Sj with respect to a trait/disease C is defined as the amount of information conveyed by the pair of SNPs about the presence of the disease, minus the sum of the corresponding amounts of information conveyed by each SNP: I(Si,Sj;C)−[I(Si;C)+I(Sj;C)] + can distinguish between LD and epistasis + good for results visualization does not correct for multiple testing can be used for a limited number of markers

Marker LD pruning LD pruning is a procedure of filtering genetic markers by linkage disequilibrium leaving for the analysis only tagging SNPs that are representatives of the genetic haplotype blocks. + allow avoiding top ranked SNP-SNP interactions that are redundant and merely due to the high correlation between genetic markers. + decrease computational burden + relax the excessive multiple testing correction LD, correlation between SNPs, is calculated via r2 statistics - Pearson test statistic for independence in a 2 × 2 table of haplotype counts. LD r2 filtering threshold  > 0.75 (rather conservative but will best minimize the amount of redundant interactions). LD pruning procedure in SVS (Golden Helix Inc.) For any pair of markers under testing whose r2  > 0.75, the first marker of the pair is discarded. Window increment 1 (number of markers by which the beginning window position was incremented). Where pi, pj are the marginal allelic frequencies at the ith and jth SNP respectively and pij is the frequency of the two-marker haplotype

Confounding factors in genetic studies A variable can confound the results of a statistical analysis only if it is related to both the dependent variable and at least one of the other independent variables in the analysis and also do not lie in the directed causal path from the dependent to independent variable. Special case of confounding factors: Population stratification is a systematic difference in allele frequencies between subpopulations in a population. The basic cause of population stratification is different genetic ancestry as a result of nonrandom mating between subgroups in a population due to various reasons (social, cultural, geographical, etc.). The shared ancestry corresponds to relatedness, or kinship, and so population structure can be defined in terms of patterns of kinship among groups of individuals. Dependent variable – Effect (e.g. disease trait) Independent variable – Cause (e.g. genetic factors) Confounding variable (e.g. sex, age, genetic background, ethnicity, etc.)

Principal component analysis (PCS) What is PCA? It is a way of identifying patterns in data, and expressing the data in such a way as to highlight their similarities and differences. PCA allows data transformation to a new coordinate system such that the projection of the data along the first new coordinate has the largest variance; the second principal component has the second largest variance, and so on. + easy to apply + availability of efficient algorithms + easy to adjust variables for by including PCs in regression models - not clear how to select an optimal number of PCs for adjustment - change the original presentation of the variables (specially severe for binary traits) - does not take care of cryptic relatedness (when apparently unrelated individuals actually have some unsuspected relatedness )

Protocol for GWAI 0. Data collecting and genotyping 1. Samples and markers quality control (e.g. SVS 7.5, PLINK software): HWE test (P > 1*10-4), call rate > 98%, marker allele frequency (MAF > 0.05) 2.b.1 Markers prioritization/pre-filtering (e.g. Biofilter 1.1.0 tool) 2.b.2 LD pruning (e.g. SVS 7.5, PLINK software): window size 50 bp, window increment 1 bp, LD r2 threshold 0.75 2.b.4 (Genome-Wide) Screening for pair-wise SNP interactions (e.g. MB-MDR2D analysis, SD plot, logistic regression-based methods) 2.a.1 LD pruning (e.g. SVS 7.5, PLINK software): 2.a.3 Exhaustive genome-wide screening for pair-wise SNP interactions (e.g. BOOST analysis) 4. Replication of epistasis in the independent data and meta-analysis (e.g. fixed effects and random effects models) 3. Replication analysis with alternative methods for epistasis detection: follow up the selected set of markers 2.b.1 Selection of SNPs basing on their function (e.g. SNPper - SNP Finder tool) 2.b.1 Selection of SNPs from candidate genes (data from literature) Exhaustive epistasis screening (a) Selective epistasis screening (b) 2.b.3 Adjustment for confounders (e.g. R software, via logistic regression), family structure (e.g. GenABEL software, via mixed polygenic model), population stratification (e.g. GenABEL software, via mixed polygenic model; SVS 7.5 software, via PCA) 5. Biological validation (e.g. immunological pathway analysis, eQTL analysis, DNA transcription factor binding sites analysis, composite elements binding sites analysis, etc.)

Epistasis replication and validation Given the availability of a comprehensive meta-analysis toolbox, it may be surprising that hardly any meta-GWAIs have been published as the core topic of the publication. Original population Replication study Original Random variation Different population Validation samples Systematic variation Igl et al. 2009 (Mission Impossible @ google) Validation: Gene-based rather then genetic variants-based validation analysis Meta-analytic approaches (replication analysis in an independent sample) Trustworthy biological validation (systematic literature review, use of structured knowledge from databases, biological experiments)

Biological validation of statistical epistasis signals J Thornton, EBI

Useful public sources for studding biology behind statistical epistais signals http://www.ncbi.nlm.nih.gov/  - The National Center for Biotechnology Information (NCBI) advances science and health by providing access to biomedical and genomic information. http://www.ensembl.org/index.html – The Ensembl project produces genome databases for vertebrates and other eukaryotic species, and makes this information freely available online. http://hapmap.ncbi.nlm.nih.gov/ – HapMap – multi-country effort to identify and catalog genetic similarities and differences in human beings. http://lynx.ci.uchicago.edu/ - LYNX – Gene Annotations, Enrichment Analysis and Genes Prioritization. http://www.genemania.org/ – GeneMANIA - Indexing 1,421 association networks containing 266,984,699 interactions mapped to 155,238 genes from 7 organisms.