Association mapping Fundamental Principles and a few methods Thomas Mailund Slides: www.daimi.au.dk/~mailund/association-mapping/

Outline ● Introduction ➔ Goals and setup ➔ Genetic variation in humans ● Marker/disease association ➔ Indirect association and linkage disequilibrium ➔ Background population genetics concepts ➔ “Global” and “local” genealogies ● Mapping methods ➔ Local genealogies (Blossoc) ➔ Clustering (HapCluster)

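What are we looking for? Conditions ordered from mostly environmental to mostly genetic causes (Environment ➔ Genes): gunshot wounds, car accidents, smoking-induced lung cancer, cardiovascular disease, obesity, type 2 diabetes, Alzheimer's disease, schizophrenia, BRCA1 breast cancer, cystic fibrosis, hemophilia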

Goal of association mapping Identification of a susceptibility variant, replication in a different cohort/population, and understanding of the genetic function at the cell level. This can lead to 1. identification of druggable targets, 2. development of drugs for prevention, 3. better understanding of the cellular processes involved in disease ➔ treatments. Mapping function = P(Disease gene location | data)

Case-control studies [Figure: cases (disease / responder) and controls (non-responder), each individual carrying allele 1 or allele 2 of marker A; marker A is associated with the phenotype] Cautions ● Subgroup analysis and multiple testing ● Poorly defined phenotypes ● Poorly matched controls, population stratification ● Failure to replicate ● Optimistic interpretation of results ● Positive publication bias ● Measuring the wrong variation

Relative risk ● Relative risk (RR) ➔ RR is the risk of disease in the exposed group (carriers of the susceptibility allele or genotype) divided by the risk in the unexposed group (non-carriers) ➔ E.g. RR = 1.5 indicates that carriers of the A allele have 1.5 times the disease risk of non-carriers, i.e. they are 50% more likely to get the disease ● Genotype relative risk (GRR) ➔ Relative risk assigned to genotypes AA, Aa, aa ➔ GRR(Aa) = P(disease | Aa) / P(disease | aa) ➔ GRR(AA) = P(disease | AA) / P(disease | aa) ● E.g. additive model: ➔ P(disease | aa) = b; P(disease | Aa) = b + e; P(disease | AA) = b + 2e ➔ GRR(Aa) = (b+e)/b; GRR(AA) = (b+2e)/b = 2 GRR(Aa) − 1 ➔ If GRR(Aa) = 1.5, then GRR(AA) = 2
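The additive model on this slide can be checked numerically. The sketch below is only an illustration: the function name `additive_grr` and the values b = 0.05, e = 0.025 are assumptions chosen so that GRR(Aa) comes out at 1.5.

```python
def additive_grr(baseline, effect):
    """Genotype relative risks under the additive model.

    baseline = P(disease | aa) = b, effect = e (hypothetical parameter names).
    Returns (GRR(Aa), GRR(AA)), both relative to genotype aa.
    """
    p_aa = baseline
    p_het = baseline + effect       # P(disease | Aa) = b + e
    p_hom = baseline + 2 * effect   # P(disease | AA) = b + 2e
    return p_het / p_aa, p_hom / p_aa


grr_het, grr_hom = additive_grr(baseline=0.05, effect=0.025)
print(f"GRR(Aa) = {grr_het:.2f}, GRR(AA) = {grr_hom:.2f}")
```

The identity GRR(AA) = 2 GRR(Aa) − 1 holds for any b > 0 and e, since (b + 2e)/b = 2(b + e)/b − 1.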

Relative risk: Examples ● Relative risk for relatives of an affected individual (any genetic effect; examples from Lon Cardon) ➔ Huntington’s Disease >1000 ➔ Cystic Fibrosis 400 ➔ Autism 75 ➔ Inflammatory Bowel Disease 60 ➔ Multiple Sclerosis 20 ➔ Juvenile Diabetes 15 ➔ Schizophrenia 10 ➔ Asthma 6 ➔ Prostate Cancer 5 ➔ Late Onset Diabetes 2-3 ➔ Breast Cancer 2 ● Genes with relative risk for schizophrenia ➔ Neuregulin (NRG1) GRR: 2 ➔ Calcineurin (PPP3CC) GRR: 1.3 ➔ Catechol-O-methyl transferase (COMT) GRR: 1.5

Genetic variation ● Very little variation in humans (compared to related species)

Genetic variation ● Each new cell contains ~3 new mutations ● Each new “child” carries ~20 new mutations ● On average, the sequences of any two human genomes are 99.9% the same ➔ 0.1% of the genome ~ 3 million base pairs ➔ Maybe as many as 2.5 billion sites have variation in the entire population ● This genetic variation (plus environmental influences) is responsible for variation in human traits.

Types of variation Annu. Rev. Genom. Human Genet. 2006.7:407-442.

Single Nucleotide Polymorphisms ● The most common form of genetic polymorphism ● Common variants (MAF>5%) estimated to occur every 100-300 bp (10 – 30 million SNPs)

HapMap ● Phase I: 1 million SNPs in 90 individuals from Europe, Africa and Asia ● Phase II: 3 million SNPs ➔ SNP selection made easy ➔ 250K & 500K SNPs, based on non-redundant Phase I SNPs, commercially available ➔ Genome-wide scans a reality

Setup: case/control sequences Sequences of nucleotides at known polymorphic sites, split into cases (affected) and controls (unaffected):
--A--------C--------A----G--------T---C---A----
--T--------G--------A----G--------C---C---A----
--A--------G--------G----G--------C---C---A----
--A--------C--------A----G--------T---C---A----
--T--------C--------A----G--------T---C---A----
--T--------C--------A----T--------T---A---A----
--A--------C--------A----G--------T---C---A----
--A--------C--------A----G--------T---C---G----
--T--------C--------A----T--------T---C---A----
--A--------C--------A----G--------T---C---A----
--A--------C--------G----T--------C---A---A----
--A--------C--------A----G--------C---C---G----

Actual setup: unphased sequences Sequences of pairs of nucleotides at known polymorphic sites:
-A/T------C/G------A/A--G/G------T/C-C/C-A/A---
-A/T------C/G------G/A--G/G------T/C-C/C-A/A---
-T/T------C/C------A/A--G/T------T/T-C/A-A/A---
-A/A------C/C------A/A--G/G------T/T-C/C-A/A---
-A/T------C/C------A/A--G/T------T/T-C/C-G/A---
-A/A------C/C------G/A--G/T------T/C-C/A-A/A---
Phase inference software: PHASE, SNPHAP (but see also Morris et al. 2004)

Association mapping We are searching for an association between variant and disease status: is there a significant difference in the distributions between cases and controls? [Haplotype panel as on the “Setup: case/control sequences” slide]

Significant difference in distribution Consider a single marker... [Haplotype panel as on the “Setup: case/control sequences” slide]

Significant difference in distribution Contingency table:
             Allele G         Allele T
Affected     G, Affected      T, Affected
Unaffected   G, Unaffected    T, Unaffected

Significant difference in distribution Conditional disease status:
             Allele G          Allele T
Affected     Affected | G      Affected | T
Unaffected   Unaffected | G    Unaffected | T

Significant difference in distribution Null hypothesis: the marker does not affect the disease status, so each joint probability is the product of the marginals P(G), P(T), P(Affected), P(Unaffected): P(T, Unaffected) = P(T)P(Unaffected) P(G, Unaffected) = P(G)P(Unaffected) P(T, Affected) = P(T)P(Affected) P(G, Affected) = P(G)P(Affected)

Significant difference in distribution ● The null hypothesis is tested with ➔ Fisher’s exact test (for small data sets) ➔ χ² test (large sample approximation) when each cell has count > 5 ➔ Allelic level: 2x2 matrix ➔ Genotype level: 2x3 matrix ➔ For two loci, there are 9 different two-locus genotypes, i.e. interactions can be tested in a 2x9 matrix
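As a rough illustration of the allelic 2x2 test, the following sketch computes the Pearson χ² statistic from the expected counts under the independence null hypothesis. The counts are made up for the example, and the function name `chi_square_2x2` is hypothetical; for one degree of freedom the p-value can be written with the complementary error function, so only the standard library is needed.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test on a 2x2 table of allele counts.

    table[i][j] is the count for disease status i and allele j.
    Returns (statistic, p_value) for 1 degree of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under the null: P(status) * P(allele) * n
            expected = row_totals[i] * col_totals[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of chi-square with 1 df: erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical counts: rows = affected / unaffected, columns = allele G / T
stat, p_value = chi_square_2x2([[60, 40], [40, 60]])
print(f"chi2 = {stat:.2f}, p = {p_value:.4f}")
```

For small counts (any expected cell around 5 or below) the slide's advice applies and an exact test is the safer choice.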

Relative risk and power Statistical power: The probability of rejecting the null hypothesis when it is in fact false Simulations by M. Schierup 1000 simulations with additive disease model P(A) = 0.1; 1000 cases and 1000 controls 5% significance level (0.005% with Bonferroni correction)

The Central Dogma: the common disease / common variant (CD/CV) hypothesis (Reich & Lander 2001) ● In a small population, allelic heterogeneity is small ● < 100,000 years ago the human population was very small; the population expansion happened < 100,000 years ago ● Even though the human population today is large, the frequency spectrum of variants still reflects the recent small size/bottleneck ➔ common diseases are caused by a few common variants (and a lot of rare, undetectable variants caused by recent mutations) ● If association studies locate many susceptibility variants, this supports the hypothesis [Figure: rare and common variants, from past to present]

Frequency and power Simulations by M. Schierup 1000 simulations with additive disease model 1000 cases and 1000 controls, P(disease | aa) = 0.05 5% significance level (0.005% with Bonferroni correction)

Example: Cystic fibrosis χ²-test for different distributions, Kerem et al. (1989) ● Control group: 92 SNP haplotypes ● Case group: 94 SNP haplotypes ● 23 SNP markers

An indirect approach ● Disease site (X) unlikely to be among our markers ➔ Might be an unknown polymorphic site (and not necessarily a SNP) ➔ Just not part of the typed markers (maybe typed 500K out of 3 billion nucleotides!)
--A--------C--------A----G---X----T---C---A----
--T--------G--------A----G---X----C---C---A----
--A--------G--------G----G---X----C---C---A----
--A--------C--------A----G---X----T---C---A----
--T--------C--------A----G---X----T---C---A----
--T--------C--------A----T---X----T---A---A----
--A--------C--------A----G---X----T---C---A----
--A--------C--------A----G---X----T---C---G----
--T--------C--------A----T---X----T---C---A----
--A--------C--------A----G---X----T---C---A----
--A--------C--------G----T---X----C---A---A----
--A--------C--------A----G---X----C---C---G----

An indirect approach ● The markers are not independent ➔ Knowing one marker is partial knowledge of others ➔ The non-independence is called LD: Linkage Disequilibrium [Haplotype panel with the unobserved disease site X, as on the first “An indirect approach” slide]

Genealogical view of LD Variations in Chromosomes Within a Population Common Ancestor Emergence of Variations Over Time time present Disease Mutation

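Linkage disequilibrium Variations in Chromosomes Within a Population (A and B denote the alleles considered at the two sites) ● First pair of sites: $P(AB) = P(A) = 0.42$, $P(B) = 0.42$, $P(A)P(B) = 0.17$, $D(AB) = P(AB) - P(A)P(B) = 0.24$ ● Second pair of sites: $P(A) = 0.17$, $P(B) = 0.29$, $P(AB) = 0.17$, $P(A)P(B) = 0.05$, $D(AB) = P(AB) - P(A)P(B) = 0.12$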

Measures of LD ● $D(AB) = P(AB) - P(A)P(B) = D(ab) = -D(Ab) = -D(aB)$ ● $D'(AB)$, Lewontin (1964): $D$ scaled by its range as constrained by the allele frequencies, giving a value in [0,1]: $D'(AB) = D(AB) / \min(P(A)P(b), P(a)P(B))$ if $D(AB) > 0$, else $D'(AB) = -D(AB) / \min(P(A)P(B), P(a)P(b))$ ● Correlation coefficient measure $r^2(AB)$ in [0,1], Hill & Robertson (1968): $r^2(AB) = D^2(AB) / (P(A)P(a)P(B)P(b))$

Linkage disequilibrium Variations in Chromosomes Within a Population (A and B denote the alleles considered at the two sites) $P(AB) = P(A) = 0.42$, $P(B) = 0.42$, $P(A)P(B) = 0.17$, $D(AB) = P(AB) - P(A)P(B) = 0.24$; $D'(AB) = D(AB) / \min\{P(A)(1-P(B)), (1-P(A))P(B)\} = 0.24 / \min\{0.42 \times 0.58, 0.58 \times 0.42\} = 1$; $r^2(AB) = D^2(AB) / (P(A)(1-P(A))P(B)(1-P(B))) = 0.06 / (0.42 \times 0.58 \times 0.42 \times 0.58) = 1$

Linkage disequilibrium Variations in Chromosomes Within a Population (A and B denote the alleles considered at the two sites) $P(A) = 0.17$, $P(B) = 0.29$, $P(AB) = 0.17$, $P(A)P(B) = 0.05$, $D(AB) = P(AB) - P(A)P(B) = 0.12$; $D'(AB) = D(AB) / \min\{P(A)(1-P(B)), (1-P(A))P(B)\} = 0.12 / \min\{0.17 \times 0.71, 0.83 \times 0.29\} = 1$; $r^2(AB) = D^2(AB) / (P(A)(1-P(A))P(B)(1-P(B))) = 0.01 / (0.17 \times 0.83 \times 0.29 \times 0.71) = 0.49$
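The three measures D, D' and r² from the “Measures of LD” slide can be computed directly from haplotype counts. The sketch below is an illustration only: the function name `ld_measures` is hypothetical, and the counts (out of 24 chromosomes) are an assumption chosen because they reproduce the rounded frequencies 0.42 and 0.17/0.29 used in the two worked examples. Both sites are assumed to be polymorphic, otherwise the denominators are zero.

```python
def ld_measures(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D', r^2) for two biallelic sites from haplotype counts."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n
    p_B = (n_AB + n_aB) / n
    d = p_AB - p_A * p_B
    # Largest |D| attainable given the allele frequencies (Lewontin's scaling)
    if d > 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_A * (1 - p_A) * p_B * (1 - p_B))
    return d, d_prime, r2


# Assumed counts: 10 AB and 14 ab chromosomes (P(AB) = P(A) = P(B) ~ 0.42)
print(ld_measures(10, 0, 0, 14))
# Assumed counts: 4 AB, 3 aB, 17 ab (P(A) ~ 0.17, P(B) ~ 0.29, P(AB) ~ 0.17)
print(ld_measures(4, 0, 3, 17))
```

In the second case D' is 1 while r² stays well below 1, which is the point of the second worked example: allele A only ever occurs together with B, but B also occurs without A.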

Causes of LD (from time t ago until now) ● Creates LD: drift, selection, admixture ● Breaks down LD: recombination, (gene conversion)

An indirect approach ● The markers are not independent ➔ Knowing one marker is partial knowledge of others ➔ This non-independence decreases with distance [Haplotype panel with the unobserved disease site X, as on the first “An indirect approach” slide]

A short detour: population genetics ● Diploid model of reproduction (without recombination): Parents ➔ Gametes ➔ Offspring ● Chromosome reproduction (without recombination)

Wright-Fisher model ● Discrete, non-overlapping generations ● Constant population size ● Each individual in one generation is ➔ a random copy of an individual from the previous generation ➔ or a new mutation Mutation
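A minimal simulation sketch of these rules, assuming a haploid population in which every mutation creates a brand-new allele label. The function name `wright_fisher`, the integer allele labels and the parameter values are hypothetical choices for illustration.

```python
import itertools
import random


def wright_fisher(pop_size, generations, mutation_rate, seed=0):
    """Simulate discrete, non-overlapping generations of constant size.

    Each individual is either a new mutation (a fresh allele label) or a
    random copy of an individual from the previous generation.
    """
    rng = random.Random(seed)
    new_allele = itertools.count(1)   # allele 0 is the ancestral type
    population = [0] * pop_size
    for _ in range(generations):
        population = [
            next(new_allele) if rng.random() < mutation_rate
            else rng.choice(population)
            for _ in range(pop_size)
        ]
    return population


final = wright_fisher(pop_size=50, generations=20, mutation_rate=0.01, seed=1)
print(len(final), "individuals carrying", len(set(final)), "distinct alleles")
```

Without mutation the population stays fixed for the ancestral allele; with mutation, random copying (drift) decides which new alleles spread and which are lost.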

Recombinations

Non-Ancestral Material Crossover point

Wright-Fisher with recombination ● Discrete, non-overlapping generations ● Constant population size ● Each individual in one generation is ➔ a random copy of an individual from the previous generation ➔ a new mutation ➔ a recombination of two individuals from the previous generation, at a random cross-over point Recombination

Mutation + Recombination Mutation Point

Mutation + Recombination Mutation Point (+ )

Indirect association Mutation Point Complete association Less association Even less association

An indirect approach χ²-test for different distributions: markers are highly associated because they are close to the disease-affecting site

An indirect approach Prostate cancer in Iceland, BRCA2 gene: linkage disequilibrium measured by r² using Haploview 3.12. These associations are NOT independent, i.e. they probably mark the same variant

Extent of relatedness ● “Nearby” is ~ 0.1–0.01 cM ➔ ~ 100–10 Kbp ➔ ~ 1/30,000 – 1/300,000 of the genome ● Closer spacing needed for accuracy ➔ ~ 500,000–1,000,000 markers for whole genome ➔ ~ 10–100 markers for typical gene

LD as a function of distance ● From: Clark et al. 2003, AJHG 73:285-300. Empirical results from HapMap data ● From: Hein et al. 2005. Simulation results [Axes: LD (r²) against recombination rate]

Extent of relatedness ● Population dependent ➔ Founding age ➔ Isolation (inbreeding) ● Populations range from those where LD extends over longer distances (“low” density marker map, low resolution) to those where LD extends over shorter distances (“high” density marker map, high resolution): ➔ Isolated, recently founded: Quebec, Cajun Acadiana, Utah, Amish, Iceland, Faroe Islands ➔ Isolated, relatively old: Kainuu (Finland), North Karelia (Finland), Sardinia, Ashkenazi Jews ➔ Non-isolated, relatively old (bottlenecks): European, Asian ➔ Africa

Variation in recombination rate Sperm analysis Population genetic data Myers et al. 2005 McVean et al. 2005

Tagging SNPs ● Close markers are in linkage disequilibrium, i.e. one marker carries information on nearby variation ● LD between SNPs is so high that typing the whole set will provide no more information than typing a few tagSNPs ● tagSNPs: a minimal number of informative markers that can be used to identify the common haplotypes in each block

But notice! Alzheimer and ApoE: closeness to the disease marker does not guarantee significance! [Figure labels: responsible marker; 6 markers with low association; x-axis: distance from APOE locus (Kbp)]

Multi-marker approaches... ● Single marker approach: [haplotype panel as on the “Setup: case/control sequences” slide] ● Multi marker approach: [the same haplotype panel]

Using the (local) genealogy of the locus ● Tree at disease site, with leaves labelled H (healthy) or D (diseased): ➔ “Perfect” setup: HHHHHHHH DDDDD ➔ Incomplete penetrance: HHHHHHHH DDDHD ➔ Other disease causes: HDHHHDHH DDDHD Templeton et al 1987

Using the (local) genealogy of the locus ● At the disease site: ➔ A significant clustering of diseased/healthy HDHHHDHH DDDHD Templeton et al 1987

Using the (local) genealogy of the locus ● Local genealogies ➔ Each site a different genealogy ➔ Nearby genealogies only slightly different
--T--------G--------A----G---X----C----C-----A--
--A--------G--------G----G---X----C----C-----A--
--A--------C--------A----G---X----T----C-----A--
--T--------C--------A----G---X----T----C-----A--
--T--------C--------A----T---X----T----A-----A--
--A--------C--------A----G---X----T----C-----A--
[Figure: trees drawn over the markers, labelled “a nearby tree” and “an imperfect local tree”]

Detour: Genealogies... [Figure labels: MRCA of the sampled sequences; a coalescent event for two sampled sequences; sampled sequences 1, 2, 3]

Detour: Genealogies... [Figure labels: MRCA of the sampled sequences; a coalescent event for two sampled sequences; a recombination event; sampled sequences 1, 2, 3]

Ancestral Recombination Graph (Hudson 1990, Griffiths & Marjoram 1996) [Figure labels: sampled sequences; MRCA]

Ancestral Recombination Graph (Hudson 1990, Griffiths & Marjoram 1996) [Figure labels: recombinations; coalescence]

Ancestral Recombination Graph (Hudson 1990, Griffiths & Marjoram 1996) [Figure labels: non-ancestral material]

Ancestral Recombination Graph (Hudson 1990, Griffiths & Marjoram 1996) [Figure labels: mutations; sampled sequences 1, 2, 3, 4]

Ancestral Recombination Graph (Hudson 1990, Griffiths & Marjoram 1996) The ARG is a complete genealogy for the sampled sequences

“Local” genealogies For each “point” on the chromosome, the ARG determines a (local) tree:

“Local” genealogies ● Different topologies ● Different branch lengths ● Different inheritance

“Local” genealogies Type 1: No change Type 2: Change in branch lengths Type 3: Change in topology From Hein et al. 2005

“Local” genealogies Tree measure: $M_{AB} = \frac{\sum_{i,j} I_{\{i=j\}}\, bl(i)\, bl(j)}{tbl(A)\, tbl(B)}$, $S_{AB} = M_{AB} / M_{AA}$ [Figure: tree measure against recombination rate, from Hein et al. 2005]

Using the (local) genealogy of the locus Tree at disease site resembles neighbours [Haplotype panel with the disease site X and trees over nearby markers, as on the earlier “Local genealogies” slide]

Using the (local) genealogy of the locus ● Near the disease site: ➔ A significant clustering of diseased/healthy HDHHHDHH DDDHD Templeton et al 1987 Zöllner&Pritchard 2004

Using the (local) genealogy of the locus ● Approach: ➔ Infer trees over regions ➔ Score the regions wrt their clustering HDHHHDHH DDDHD Templeton et al 1987 Zöllner&Pritchard 2004

BLOck aSSOCiation (BLOSSOC) Mailund et al. 2006 ● In the infinite sites model: ➔ Each mutation occurs only once ➔ Each mutation splits the sample in two ➔ A consistent tree can efficiently be inferred for a region without recombinations

BLOck aSSOCiation (BLOSSOC) Mailund et al. 2006 Use the four-gamete test to find regions that can be explained by a tree
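The four-gamete test itself is simple to state in code: under the infinite sites model, two sites can be explained by one tree only if at most three of the four possible allele combinations occur. The sketch below is a hedged illustration, not Blossoc's actual implementation; the function names and the greedy left-to-right region extension are assumptions, and sites are taken to be biallelic.

```python
def four_gamete_compatible(haplotypes, i, j):
    """True if sites i and j show at most three of the four gametes."""
    gametes = {(h[i], h[j]) for h in haplotypes}
    return len(gametes) < 4


def compatible_region_end(haplotypes, start):
    """Extend a region rightwards from `start` while every pair of sites
    in it passes the four-gamete test; returns the exclusive end index."""
    n_sites = len(haplotypes[0])
    end = start + 1
    while end < n_sites and all(
        four_gamete_compatible(haplotypes, k, end) for k in range(start, end)
    ):
        end += 1
    return end


# Toy haplotypes (made up): sites 0 and 3 show all four gametes
haps = ["0000", "0101", "1100", "1111"]
print(compatible_region_end(haps, 0))   # region starting at site 0
print(compatible_region_end(haps, 1))   # region starting at site 1
```

Each region found this way admits a tree consistent with all its sites, which is what the following slides build and score.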

BLOck aSSOCiation (BLOSSOC) Mailund et al. 2006 Build a tree for each such region

BLOck aSSOCiation (BLOSSOC) Mailund et al. 2006 Score the tree, and assign the score to the region

Scoring trees... Red=cases Green=controls Are the case chromosomes significantly overrepresented in some clusters?

Cystic Fibrosis example

Simulated Example (CoaSim)

Augmented HapMap data

Implementation... Homepage: www.daimi.au.dk/~mailund/Blossoc Command line and graphical user interface...

Statistical model based approaches... Some model explaining the sequences and status, drawing on a statistical framework, molecular biology, genetics and prior knowledge [Haplotype panel as on the “Setup: case/control sequences” slide]

Statistical model based approaches... A model gives a probability distribution on the data: $P(D \mid \theta)$ ● $D$: data, e.g. sequences and disease status ● $\theta$: parameters, e.g. penetrance, disease locus, and genealogy

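Statistical model based approaches... A model gives a probability distribution on the data: $P(D \mid \theta)$ ● This also gives us likelihood approaches: $\mathrm{lhd}(\theta) = P(D \mid \theta)$, MLE: $\arg\max_\theta \mathrm{lhd}(\theta)$ ● and Bayesian approaches: $P(\theta \mid D) \propto P(D \mid \theta) P(\theta) = \mathrm{lhd}(\theta) P(\theta)$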

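MCMC (Metropolis) approach 1. Compute the likelihood at the current point, $\mathrm{lhd}(\theta) = L$ 2. Suggest a new point, $\theta'$ 3. Compute the likelihood at this point, $\mathrm{lhd}(\theta') = L'$ 4. If $L \le L'$, go to point $\theta'$ 5. If $L > L'$, go to point $\theta'$ with probability $L'/L$ ● Likelihood of a single parameter $x$: $\mathrm{lhd}(x) = \int \mathrm{lhd}(x, \theta)\, d\theta$, where $\theta$ denotes all parameters except $x$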

MCMC (Metropolis) approach Projection on one axis is equivalent to integration over the remaining parameters. The resulting samples approximate the likelihood lhd [Figure: successive points of the chain, with proposed moves accepted or rejected]
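The five Metropolis steps and the projection idea can be sketched on a toy problem. Everything below is an assumption for illustration: the two-parameter target (independent normals), the symmetric random-walk proposal, and all function names. The comparison is done on log-likelihoods, which is equivalent to comparing L and L' directly but avoids numerical underflow.

```python
import math
import random


def metropolis(log_lhd, start, propose, n_steps, rng):
    """Run the chain and return the visited point after each step."""
    current, current_ll = start, log_lhd(start)
    chain = []
    for _ in range(n_steps):
        candidate = propose(current, rng)
        candidate_ll = log_lhd(candidate)
        # Move if L' >= L; otherwise move with probability L'/L
        if (candidate_ll >= current_ll
                or rng.random() < math.exp(candidate_ll - current_ll)):
            current, current_ll = candidate, candidate_ll
        chain.append(current)
    return chain


def toy_log_lhd(point):
    """Hypothetical target: x with sd 1 and y with sd 2, independent."""
    x, y = point
    return -0.5 * (x * x + (y / 2.0) ** 2)


def toy_propose(point, rng):
    """Symmetric random-walk proposal, as the Metropolis rule assumes."""
    x, y = point
    return (x + rng.gauss(0.0, 1.0), y + rng.gauss(0.0, 1.0))


rng = random.Random(1)
chain = metropolis(toy_log_lhd, (0.0, 0.0), toy_propose, 20000, rng)
xs = [x for x, _ in chain]   # projection on the x axis
mean_x = sum(xs) / len(xs)
var_x = sum((x - mean_x) ** 2 for x in xs) / len(xs)
print(f"mean(x) = {mean_x:.2f}, var(x) = {var_x:.2f}")
```

Keeping only the x coordinate of each sample is the projection from the slide: no explicit integration over y is performed, yet the retained values follow the marginal for x.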

The HapCluster model Waldron et al. 2006 ---A----G----X---C---C---A---- ---A----T--------C---G---A---- ---A----G--------C---G---A---- ---T----G--------C---C---G---- ---A----T--------C---G---A---- ---T----G--------C---G---A---- Unrelated “wildtypes” (Locally) related “mutants”

The HapCluster model Waldron et al. 2006 ---A----G----X---C---C---A---- ---A----T--------C---G---A---- ---A----G--------C---G---A---- ---T----G--------C---C---G---- (Locally) related “mutants” ● “Mutants” defined by local sequence similarity to “ancestral” sequence ● Implicitly assuming star-genealogy

The HapCluster model Waldron et al. 2006 ● Given “ancestral” sequence and a distance measure: ➔ Defines cluster around the ancestral sequence ➔ Sequences above a given similarity threshold considered “mutants” ➔ Sequences below considered “wild types”

The HapCluster model Waldron et al. 2006 ● Each individual has one of the genotypes: ➔ “mutant” & “mutant” ➔ “mutant” & “wild type” ➔ “wild type” & “wild type” ● Each has a different risk ( MM, MW, WW ) of being affected ● Likelihood:

The HapCluster model Waldron et al. 2006 ● Risks considered nuisance parameters and integrated out

HapCluster MCMC approach Point: trait-locus, ancestral haplotype, other (nuisance) parameters Change functions: move trait-locus, change cluster size, change ancestral haplotype... Likelihood function: product of Beta functions Waldron et al. 2006

Example: Simulated dataset

Implementation... Homepage: www.daimi.au.dk/~mailund/HapCluster Command line version only...

Summary ● Introduction ➔ Goals and setup ➔ Genetic variation in humans ● Marker/disease association ➔ Indirect association and linkage disequilibrium ➔ Background population genetics concepts ➔ “Global” and “local” genealogies ● Mapping methods ➔ Local genealogies (Blossoc) ➔ Clustering (HapCluster)
