Association analysis Genetics for Computer Scientists 15.3.-19.3.2004 Biomedicum & Department of Computer Science, Helsinki Päivi Onkamo.

Lecture outline
– Genetic association analysis
– Allelic association
– χ² test
– Linkage disequilibrium (LD) process
– Formulation of the computational problem for LD mapping
– Limitations of LD mapping
– Approaches, for example HPM

Genetic association analysis
Search for significant correlations between gene variants and phenotype.
For example, locus A for SLE: 100 cases and 100 controls genotyped.
            Affected   Unaffected
Allele 1       79          46
Allele 2       21          54

Allelic association = an allele is associated with a trait.
Allele 1 seems to be associated, based on sheer numbers, but how sure can one be about it?

            Affected   Healthy     Σ
Allele 1       79         46      125
Allele 2       21         54       75
Σ             100        100      200

The idea is to compare the observed frequencies to the frequencies expected under the hypothesis of no association between alleles and the occurrence of the disease (independence between variables).
Test statistic: $\chi^2 = \sum_{i=1}^{k} \frac{(o_i - e_i)^2}{e_i}$
where $o_i$ is the observed frequency for class $i$, $e_i$ is the frequency expected under $H_0$ of no association, and $k$ is the number of classes (cells) in the table.
Degrees of freedom for the test: df = (r-1)(s-1), where r and s are the numbers of rows and columns in the table.

Expected frequencies (observed in parentheses):
            Affected     Healthy       Σ
Allele 1    62.5 (79)    62.5 (46)    125
Allele 2    37.5 (21)    37.5 (54)     75
Σ           100          100          200
df = 1, p << 0.001
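The test on the allele table can be reproduced in a few lines. The following is a minimal sketch rather than the lecturer's own code: the function names are illustrative, and the p-value shortcut via `math.erfc` is valid only for df = 1.

```python
import math


def chi_square(table):
    """Pearson chi-square for a contingency table given as rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    expected = []
    for r, row in enumerate(table):
        exp_row = []
        for c, observed in enumerate(row):
            # Expected count under H0 (independence of allele and disease status).
            e = row_totals[r] * col_totals[c] / n
            exp_row.append(e)
            chi2 += (observed - e) ** 2 / e
        expected.append(exp_row)
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected


def p_value_df1(chi2):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))


# Rows: allele 1, allele 2; columns: affected, healthy (counts from the slide).
observed = [[79, 46], [21, 54]]
chi2, df, expected = chi_square(observed)
p = p_value_df1(chi2)
print(f"chi2 = {chi2:.2f}, df = {df}, p = {p:.2g}")
```

For these counts the statistic is about 23.2, which is consistent with the p << 0.001 stated on the slide.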

Interpretation of the test results
The p-value is low enough that H0 can be rejected: the probability that the observed frequencies would differ this much (or even more) from the expected ones by coincidence alone is well below 0.001.
Critical values: χ² tables (Appendix), internet resources, etc.

Genetic association is a population-level correlation between a known genetic variant and a trait: an allele is over-represented in affected individuals.
→ From a genetic point of view, an association does not imply a causal relationship.
Often the gene is not a direct cause of the disease, but is in LD with a causative gene.

Linkage disequilibrium (LD)
Closely located genes often show linkage disequilibrium with each other.
Consider locus 1 with alleles A and a, and locus 2 with alleles B and b, at a distance of a few centiMorgans from each other.
At equilibrium, the frequency of the AB haplotype should equal the product of the allele frequencies of A and B: $p_{AB} = p_A p_B$. If this holds, then $p_{Ab} = p_A p_b$, $p_{aB} = p_a p_B$ and $p_{ab} = p_a p_b$ as well. Any deviation from these values implies LD.
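The deviation from equilibrium can be quantified directly. The sketch below uses the standard coefficient D (observed AB haplotype frequency minus the product of the allele frequencies) and its normalized form |D'|. The slide itself only states the equilibrium condition, so the function name and the example frequencies are illustrative assumptions.

```python
def ld_measures(p_AB, p_A, p_B):
    """Return (D, |D'|) for two biallelic loci.

    p_AB is the frequency of the AB haplotype; p_A and p_B are the
    frequencies of alleles A and B.
    """
    p_a, p_b = 1.0 - p_A, 1.0 - p_B
    D = p_AB - p_A * p_B  # zero at linkage equilibrium
    # Largest magnitude D can reach given the allele frequencies.
    if D >= 0:
        d_max = min(p_A * p_b, p_a * p_B)
    else:
        d_max = min(p_A * p_B, p_a * p_b)
    d_prime = abs(D) / d_max if d_max > 0 else 0.0
    return D, d_prime


# Hypothetical frequencies: both alleles at 50%, AB haplotype at 40%.
D, d_prime = ld_measures(p_AB=0.40, p_A=0.5, p_B=0.5)
print(f"D = {D:.3f}, |D'| = {d_prime:.2f}")

# At equilibrium (p_AB = p_A * p_B) both measures are zero.
D_eq, d_prime_eq = ld_measures(p_AB=0.25, p_A=0.5, p_B=0.5)
```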

Linkage disequilibrium (LD)
LD follows from the fact that closely located genes are transmitted as a "block" which only rarely breaks up in meiosis.
An example:
– Locus 1: marker gene
– Locus 2: disease locus, with allele b as a dominant susceptibility allele with 100% penetrance

An example

Association evaluated → Locus 1 also seems associated, even though it has nothing to do with the disease; the association is observed only because of LD.

LD mapping: utilizing founder effect
– A new disease mutation arose n generations ago in a relatively small, isolated population
– The original ancestral haplotype slowly decays as a function of generations
– In the last generation, only small stretches of the founder haplotype can be observed in the disease-associated chromosomes

LD mapping: Utilizing founder effect

Data: Searching for a needle in a haystack
[Figure: table of marker alleles per chromosome (SNP1, SNP2, …, with "?" for missing values), the disease status of each chromosome, and the disease gene]

Task is to find either an allele or an allele string (haplotype) which is overrepresented in disease-associated chromosomes
– markers may vary: SNPs, microsatellites
– populations vary: the strength of marker-to-marker LD
Many approaches:
– "old-fashioned" allele association with some simple test (problem: multiple testing)
– TDT; modelling of the LD process: Bayesian, EM algorithm, integrated linkage & LD

Limitations of LD mapping
The relationship between the distance between markers and the strength of LD: theoretical curve
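The curve itself is not preserved in the transcript. A common textbook model, assumed here, is that LD between two loci separated by recombination fraction θ shrinks by a factor (1 - θ) per generation, so that D_n = D_0 (1 - θ)^n. The sketch below tabulates this decay. The conversion of 1 cM to θ ≈ 0.01 holds only for short distances, and the 20 generations are an illustrative choice.

```python
def theta_from_cm(distance_cm):
    """Approximate recombination fraction for a short map distance (assumption: 1 cM ~ 0.01)."""
    return distance_cm / 100.0


def expected_ld(d0, theta, generations):
    """Expected LD after a number of generations under the simple decay model."""
    return d0 * (1.0 - theta) ** generations


# Fraction of the initial LD left after 20 generations at various distances.
for cm in (0.1, 0.5, 1.0, 5.0, 10.0):
    remaining = expected_ld(1.0, theta_from_cm(cm), generations=20)
    print(f"{cm:5.1f} cM: {remaining:.3f}")

remaining_1cm = expected_ld(1.0, theta_from_cm(1.0), generations=20)
```

Under this model markers within a fraction of a centiMorgan retain most of the founder LD, while markers several centiMorgans away lose it quickly, which is why the map has to be dense around the disease locus.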

Linkage disequilibrium (D’) for the African American (red) and European (blue) populations binned in 5 kb classes after removing all SNPs with minor allele frequencies less than 20% SNPs were included (Source

Limitations: LD is a random process
LD is a continuous process, which is created and decreased by several factors:
– genetic drift
– population structure
– natural selection
– new mutations
– founder effect
→ limits the accuracy of association mapping

Research challenges
– Haplotyping methods are needed as a prerequisite for association/LD methods
– … or, searching for association directly from genotype data (without the haplotyping stage)
– Better methods for measuring the association (and/or the effects of the genes)
– Taking disease models into consideration

A methodological project: Haplotype Pattern Mining (HPM)
AJHG 67, 2000
– Search the haplotype data for recurrent patterns with no pre-specified sequence
– Patterns may contain gaps, taking into consideration missing and erroneous data
– The patterns are evaluated for their strength of association
– A markerwise 'score' of association is calculated

Algorithm
1. Find a set of associated haplotype patterns
   – number of gaps allowed (2)
   – maximum gap length (1 marker)
   – maximum pattern length (7 markers)
   – association threshold (χ² = 9)
2. Score loci based on the patterns
Evaluate significance by permutation tests
Extendable to quantitative traits
Extendable to multiple genes
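The two steps can be illustrated on toy data. The following is a simplified sketch, not the published HPM implementation. It assumes that a missing allele (`None`) in the data matches any pattern allele, and that a marker's score is the number of associated patterns whose span, from first to last specified allele, covers that marker. The data, the candidate patterns and the threshold of 4.0 are hypothetical.

```python
def matches(pattern, haplotype):
    """True if every specified pattern allele agrees with the haplotype.

    None in the pattern is a gap ('*'); None in the haplotype is a missing
    allele, assumed here to match anything.
    """
    return all(p is None or h is None or p == h
               for p, h in zip(pattern, haplotype))


def pattern_chi2(pattern, haplotypes, statuses):
    """Pearson chi-square of the 2x2 table (pattern match) x (disease status)."""
    table = [[0, 0], [0, 0]]  # rows: match / no match; columns: affected / control
    for hap, affected in zip(haplotypes, statuses):
        row = 0 if matches(pattern, hap) else 1
        col = 0 if affected else 1
        table[row][col] += 1
    n = len(haplotypes)
    chi2 = 0.0
    for r in range(2):
        for c in range(2):
            expected = sum(table[r]) * (table[0][c] + table[1][c]) / n
            if expected > 0:
                chi2 += (table[r][c] - expected) ** 2 / expected
    return chi2


def marker_scores(patterns, haplotypes, statuses, threshold):
    """Count, for each marker, the associated patterns whose span covers it."""
    scores = [0] * len(haplotypes[0])
    for pattern in patterns:
        if pattern_chi2(pattern, haplotypes, statuses) < threshold:
            continue
        specified = [i for i, allele in enumerate(pattern) if allele is not None]
        if not specified:
            continue
        for marker in range(specified[0], specified[-1] + 1):
            scores[marker] += 1
    return scores


# Toy data: 4 markers, 4 case and 4 control chromosomes (hypothetical).
haplotypes = [(1, 2, 1, 1), (1, 2, 1, 2), (1, 2, 2, 1), (2, 2, 1, 1),
              (2, 1, 2, 2), (2, 1, 1, 2), (1, 1, 2, 1), (2, 2, 2, 2)]
statuses = [1, 1, 1, 1, 0, 0, 0, 0]
patterns = [(None, 2, 1, None), (1, None, 1, None), (None, 2, None, None)]
scores = marker_scores(patterns, haplotypes, statuses, threshold=4.0)
print(scores)
```

In this toy set the first and third patterns pass the threshold, so the score peaks at the second marker, which is where a localization method would point.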

Example: a set of associated patterns
[Table: patterns P1 to P9 over the markers, with "*" marking gap positions, listed by decreasing χ²: 9.6, 9.2, 8.9, 8.1, 7.4, 7.1, 7.1, 6.9, 6.8; the bottom row gives the markerwise score]

Pattern selection
The set of potential patterns is large.
Depth-first search for all potential patterns
Search parameters limit the search space:
– number of gaps
– maximum gap length
– maximum pattern length
– association threshold
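The way the structural parameters bound the search can be sketched as a recursive enumeration. This is an illustrative depth-first search, not the published algorithm. It extends each pattern one marker to the right, prunes branches whose support (number of matching chromosomes) falls below a hypothetical `min_support`, and leaves the χ² association filter to a later step. Missing alleles (`None`) are assumed to match anything.

```python
def enumerate_patterns(haplotypes, max_gaps=2, max_gap_len=1, max_len=7,
                       min_support=1):
    """Depth-first enumeration of haplotype patterns as (start_marker, alleles).

    In the alleles tuple, None marks a gap position. A pattern always starts
    and ends with a specified allele.
    """
    n_markers = len(haplotypes[0])
    alleles = [sorted({h[m] for h in haplotypes if h[m] is not None})
               for m in range(n_markers)]
    found = []

    def extend(start, body, matching, gaps, gap_run):
        if body[-1] is not None:
            found.append((start, tuple(body)))
        pos = start + len(body)  # next marker to decide on
        if pos >= n_markers or len(body) >= max_len:
            return
        # Option 1: specify an allele at the next marker.
        for allele in alleles[pos]:
            sub = [i for i in matching if haplotypes[i][pos] in (allele, None)]
            # Support can only shrink when a pattern grows, so pruning is safe.
            if len(sub) >= min_support:
                extend(start, body + [allele], sub, gaps, 0)
        # Option 2: leave the next marker unspecified (gap).
        if gap_run > 0:
            if gap_run < max_gap_len:
                extend(start, body + [None], matching, gaps, gap_run + 1)
        elif gaps < max_gaps:
            extend(start, body + [None], matching, gaps + 1, 1)

    for start in range(n_markers):
        for allele in alleles[start]:
            matching = [i for i, h in enumerate(haplotypes)
                        if h[start] in (allele, None)]
            if len(matching) >= min_support:
                extend(start, [allele], matching, 0, 0)
    return found


toy = [(1, 2, 1), (1, 1, 1)]
all_patterns = enumerate_patterns(toy, max_gaps=1, max_gap_len=1, max_len=3)
frequent = enumerate_patterns(toy, max_gaps=1, max_gap_len=1, max_len=3,
                              min_support=2)
print(len(all_patterns), len(frequent))
```

Even with three markers the unpruned search returns 11 patterns, which shows why the gap, length and threshold limits matter on real marker maps.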

Score and localization: an example

Permutation tests
– random permutation of the status fields of the chromosomes
– 10,000 permutations
– HPM and marker scores recalculated for each permuted data set
– proportion of permuted data sets in which score > true score → empirical p-value
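The procedure is generic and can wrap any score. In the minimal sketch below the score function is a simple stand-in (the absolute difference in carrier counts between cases and controls), not the HPM marker score. The data, the seed and the reduced number of permutations are illustrative assumptions.

```python
import random


def permutation_p_value(statuses, score_fn, n_permutations=10000, seed=0):
    """Empirical p-value: share of permuted data sets scoring above the true score."""
    rng = random.Random(seed)
    true_score = score_fn(statuses)
    shuffled = list(statuses)
    exceed = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # permute the status fields only
        if score_fn(shuffled) > true_score:
            exceed += 1
    return exceed / n_permutations


# Hypothetical data: 1 = chromosome carries the allele or pattern of interest.
carriers = [1] * 8 + [0] * 2 + [0] * 8 + [1] * 2
statuses = [1] * 10 + [0] * 10  # first 10 chromosomes are cases


def carrier_difference(status_labels):
    """Stand-in score: |carriers among cases - carriers among controls|."""
    in_cases = sum(c for c, s in zip(carriers, status_labels) if s == 1)
    in_controls = sum(c for c, s in zip(carriers, status_labels) if s == 0)
    return abs(in_cases - in_controls)


p_emp = permutation_p_value(statuses, carrier_difference, n_permutations=2000)
print(f"empirical p = {p_emp}")
```

The strict comparison follows the slide. Many implementations instead count ties as well and add one to numerator and denominator, so that the estimate can never be exactly zero.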

Permutation surface (A=7.5 %). The solid line is the observed frequency.

Localization power with simulated SNP data (density 3 SNPs per 1 cM). Isolated population with a 500-year history was simulated. Disease model was monogenic with disease allele frequency varying from % in the affecteds % of data was missing. Sample size 100 cases and 100 controls.

Benefits & drawbacks
+ Non-parametric, yet efficient approach; no disease model specification is needed
+ Powerful even with weak genetic effects and small data sets
+ Robust to genotyping errors, mutations, missing data
+ Allows for gaps in haplotypes

+ Flexible: easily extended to different types of markers, environmental covariates, and quantitative measurements
- Optimal pattern search parameters may need to be specified case by case
- No rigid statistical theory background
- Requires a dense enough map to find the area where the DS gene is in LD with nearby markers

Search of the susceptibility gene:
1. With good luck, and information from gene banks, pick up the correct candidate gene
2. The genetic region with a positive linkage signal is saturated with markers, and this data is then searched for a secondary correlation: correlation of marker allele(s) with the actual disease mutation (LD)

Improved statistical methods to detect LD
– Terwilliger (1995)
– Devlin, Risch, Roeder (1996)
– McPeek and Strahs (1999)
– Service, Lang et al. (1999)
Statistical power of association test statistics
– Long, Langley (1999)
Review on statistical approaches to gene mapping
– Ott, Hoh (2000)