Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237.



Linkage has its limits
To determine that a trait is closer to marker 1 than to marker 2, we need to see recombination between marker 2 and the trait locus. As the distance between the markers decreases, the number of informative meioses needed to see recombination increases. At some point linkage analysis becomes impractical because too many families are needed.

Association Studies
Association is a statistical term that describes the co-occurrence of alleles or phenotypes. An allele A is associated with disease D if people with D have a different frequency of A than people without D.

Possible causes for allelic association
best: allele increases disease susceptibility
– candidate gene studies
good: some subjects share a common ancestor
– linkage disequilibrium studies
[Figure: alleles and loci diagram]
Under linkage equilibrium, P(A,D) = P(A)*P(D). Violation of this equality is termed linkage disequilibrium.

Linkage Disequilibrium
Suppose one of the population founders carries an allelic variant that increases risk of a disease. The disease gene is very close to a marker, so the recombination fraction $\theta$ is very small.
[Figure: founder population haplotypes at the disease locus (D/d) and the marker locus (A/a)]
Ancestral haplotypes are dA, da, and Da. Note that D is associated with a: P(a|D) is close to one.
Over many generations (n), there is occasionally recombination between the two genes, so that the population looks like:
[Figure: population haplotypes after n generations]
The degree of association between D and a has decreased, but still P(a|D) > P(a), that is, P(aD) > P(a)P(D).

The Degree of Association Between Two Genes Depends on the Distance Between Them and the Age of the Population
1. Let $\delta_{aD} = P(aD) - P(a)P(D)$, and similarly for other alleles. After n generations, $\delta_{aD}(n) = \delta_{aD}(0)\,(1-\theta)^n$.
2. At linkage equilibrium,
P(a/a|D/d) = P(a/a|d/d) = P(a/a|D/D) = P(a/a)
P(A/a|D/d) = P(A/a|d/d) = P(A/a|D/D) = P(A/a)
P(A/A|D/d) = P(A/A|d/d) = P(A/A|D/D) = P(A/A)
Violation of these equalities is evidence of linkage disequilibrium.
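The decay relation above can be explored numerically. The following is a minimal sketch; the starting disequilibrium, the recombination fractions, and the function name are illustrative assumptions, not values from the lecture.

```python
def ld_after_generations(delta0, theta, n):
    """Disequilibrium after n generations: delta(n) = delta(0) * (1 - theta)**n."""
    return delta0 * (1.0 - theta) ** n


# Hypothetical starting disequilibrium and recombination fractions.
delta0 = 0.05
for theta in (0.001, 0.01, 0.1):
    fraction_left = ld_after_generations(delta0, theta, 100) / delta0
    print(f"theta={theta}: fraction of initial LD left after 100 generations = {fraction_left:.4f}")
```

Tightly linked loci (small theta) keep most of their disequilibrium over many generations, which is why association is useful at distances where linkage analysis runs out of recombinants.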

Allelic association studies test whether alleles are associated with the trait
2 types of association tests:
– population-based association tests
   cases and controls are unrelated
   cross-classify by genotype
   use $\chi^2$ test or logistic regression
– family-based association tests
   cases and controls are related: parents, sibs, etc.
   often based on allele transmission rates
   prime example: TDT
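As an illustration of the population-based test, here is a minimal sketch of a Pearson chi-square on a case/control by genotype table. The counts are hypothetical, the function name is an assumption, and the closed-form p-value used here holds only for 2 degrees of freedom (a 2 x 3 table). All row and column totals are assumed to be nonzero.

```python
import math


def pearson_chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(col_totals) - 1)
    return stat, df


# Hypothetical genotype counts (A/A, A/a, a/a) for cases and controls.
genotype_counts = [
    [30, 50, 20],  # cases
    [45, 45, 10],  # controls
]
stat, df = pearson_chi_square(genotype_counts)
# For df = 2 the chi-square upper tail is exactly exp(-x / 2).
p_value = math.exp(-stat / 2.0)
print(f"chi-square = {stat:.3f} on {df} df, p = {p_value:.4f}")
```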

How concerned should we be about population stratification invalidating case/control results?
1. The allele frequencies and disease prevalence rarely differ as dramatically by race as in the example.
2. Good epidemiological methods can reduce the problem. Collect information on racial/ethnic background.
3. Sometimes there is no alternative to a case/control design. Family controls may not be available.
On the other hand,
1. Better safe than sorry: family-based control designs.
2. Family-based designs require more genotyping, but not more phenotyping, than case/control designs.

The Transmission Disequilibrium Test eliminates concern over false positives due to population stratification (Spielman et al., 1993; Terwilliger and Ott, 1992)
A simple illustration of the TDT: collect parent-child trios.
[Figure: parents with marker genotypes A/a and A/A]
If the child is chosen without regard to disease status, then the child's genotype is equally likely to be A/a or A/A.

[Figure: trio with parents D/d A/a and d/d A/A and a child D/d A/a]
However, if the child is chosen because they are affected, and the marker allele a is associated with the disease allele D, then the child is more likely to have the A/a genotype at the marker than the A/A genotype.

What are we testing with the TDT?
For a single affected child per family, the null and alternative hypotheses are equivalent to:
Ho: no linkage or no association
Ha: linkage and association
When more than one affected child per family is used, the TDT confounds linkage and association. Thus little is gained by running the TDT on a data set consisting of several very large pedigrees if linkage of the trait and marker has already been established. With many small unrelated pedigrees, information on association can still be gained.
A strongly positive result suggests that the marker tested is a trait susceptibility locus or that the marker is closely linked to a trait susceptibility locus.

The TDT has been extended to multiple alleles per locus

                        Allele transmitted
Allele not transmitted  1          2          ...  k-1          k          Total
1                       C_{1,1}    C_{1,2}    ...  C_{1,k-1}    C_{1,k}    n_1
2                       C_{2,1}    C_{2,2}    ...  C_{2,k-1}    C_{2,k}    n_2
...
k-1                     C_{k-1,1}  C_{k-1,2}  ...  C_{k-1,k-1}  C_{k-1,k}  n_{k-1}
k                       C_{k,1}    C_{k,2}    ...  C_{k,k-1}    C_{k,k}    n_k
Total                   t_1        t_2        ...  t_{k-1}      t_k

Ho: transmission to affected child is not dependent on allele type
Ha: transmission to affected child depends on allele type
t_i represents the column sum omitting the diagonal term, and n_i the row sum, also omitting the diagonal.
Test statistics include: [equation not preserved in the transcript]
Mendel's TDT 1 is proportional to this statistic.

Under some conditions, T_mh is asymptotically distributed as chi-square with k-1 degrees of freedom.
Numerical example: data from a locus with 5 alleles. 120 transmissions from heterozygous parents to affected children.
[Table: transmitted and not-transmitted counts by allele, with totals t and n; counts not preserved in the transcript]
T_mh = ? TDT 2 = ? Is there evidence of transmission distortion?
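The slide's formula for T_mh did not survive in the transcript. The sketch below assumes the commonly used marginal-homogeneity form, T_mh = ((k-1)/k) * sum_i (t_i - n_i)^2 / (t_i + n_i), with t_i and n_i the off-diagonal column and row sums defined above. The function name and the counts are hypothetical and are not the lecture's 5-allele data.

```python
def tdt_marginal_homogeneity(table):
    """Multi-allele TDT statistic, assuming the marginal-homogeneity form.

    table[i][j] = number of heterozygous parents who did NOT transmit
    allele i and DID transmit allele j (rows: not transmitted, columns: transmitted).
    Returns (statistic, degrees of freedom).
    """
    k = len(table)
    stat = 0.0
    for i in range(k):
        t_i = sum(table[r][i] for r in range(k) if r != i)  # column sum, no diagonal
        n_i = sum(table[i][c] for c in range(k) if c != i)  # row sum, no diagonal
        if t_i + n_i > 0:
            stat += (t_i - n_i) ** 2 / (t_i + n_i)
    return (k - 1) / k * stat, k - 1


# Hypothetical 3-allele transmission table.
counts = [
    [0, 12, 9],
    [20, 0, 7],
    [15, 8, 0],
]
t_mh, df = tdt_marginal_homogeneity(counts)
print(f"T_mh = {t_mh:.3f} on {df} df")
```

With two alleles this expression reduces to the familiar (b - c)^2 / (b + c) McNemar statistic, which is a useful sanity check on the table orientation.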

MENDEL determines significance using permutation procedures
Why? If the sample size is small or alleles are rare, the distribution of the TDT statistic is poorly approximated by a chi-square distribution.
How?
(1) For each iteration (usually 10,000 or more):
   (a) Calculate a new TDT table. Hold the parental genotypes fixed. For each child, designate with equal probability that the child gets one of the parental alleles.
   (b) Calculate the TDT statistic and determine whether it is larger than the observed TDT statistic.
(2) The p-value is equal to the number of iterations in which the TDT statistic is larger than the observed, divided by the total number of iterations.
What is the reason for the standard error? Permutation p-values are estimated using Monte Carlo simulation with a finite number of iterations.
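The procedure can be sketched for the simplest case. This is not MENDEL's implementation: it assumes a biallelic marker, treats each heterozygous parent as an independent fair coin flip under the null, and uses hypothetical counts b and c for the number of times each allele was transmitted. The standard error is the usual binomial one for a Monte Carlo proportion.

```python
import math
import random


def tdt_statistic(b, c):
    """Biallelic TDT statistic (b - c)^2 / (b + c)."""
    return (b - c) ** 2 / (b + c) if b + c > 0 else 0.0


def permutation_p_value(b, c, iterations=10000, seed=0):
    """Monte Carlo p-value and its standard error for the biallelic TDT."""
    rng = random.Random(seed)
    observed = tdt_statistic(b, c)
    n_het = b + c  # parental genotypes are held fixed
    exceed = 0
    for _ in range(iterations):
        # Each heterozygous parent transmits either allele with probability 1/2.
        sim_b = sum(rng.random() < 0.5 for _ in range(n_het))
        # Strictly "larger than the observed", as in step (1b) above.
        if tdt_statistic(sim_b, n_het - sim_b) > observed:
            exceed += 1
    p = exceed / iterations
    standard_error = math.sqrt(p * (1.0 - p) / iterations)
    return p, standard_error


p, se = permutation_p_value(30, 10, iterations=2000, seed=1)
print(f"p = {p:.4f} (Monte Carlo s.e. {se:.4f})")
```

The standard error shrinks with the square root of the number of iterations, which is why small p-values need many iterations to be estimated with useful precision.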

TDT Summary
ignores transmissions from homozygous parents
with two alleles it has an approximate chi-square(1) distribution (McNemar test)
– but exact p-values can be computed from the Binomial(p = .5) distribution in the bi-allelic case
If there is one affected per nuclear family, this tests the null: no linkage or no association
– If the test is significant, there is linkage and association
If there are multiple affecteds, the TDT will confound linkage and association owing to the dependencies of the trios.
– users should not expect new insight when the data consist of one or two large disease pedigrees already showing linkage
– with many small unrelated pedigrees, the chance of confusing linkage with association becomes less of an issue, and the TDT can help in identifying associated marker alleles.
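For the bi-allelic case, both the McNemar form and the exact binomial p-value are easy to compute. A minimal sketch follows; the counts b and c (transmissions of each allele from heterozygous parents) and the function names are hypothetical, and the two-sided exact p-value is taken as twice the upper tail, capped at 1.

```python
from math import comb


def tdt_chi_square(b, c):
    """McNemar-type TDT statistic, approximately chi-square with 1 df."""
    return (b - c) ** 2 / (b + c)


def tdt_exact_p(b, c):
    """Two-sided exact p-value from the Binomial(n = b + c, p = 0.5) distribution."""
    n = b + c
    k = max(b, c)
    upper_tail = sum(comb(n, x) for x in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * upper_tail)


# Hypothetical counts: allele a transmitted 30 times, not transmitted 10 times.
print(f"chi-square = {tdt_chi_square(30, 10):.2f}, exact p = {tdt_exact_p(30, 10):.5f}")
```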

Limitations of the original TDT
(1) Nuclear families
(2) Qualitative traits
(3) Codominant markers
Many methods for extending the TDT have been developed. We will discuss one in detail, the gamete competition model.

One way to extend the TDT:
Lange (1988), Jin et al. (1994), and Sham and Curtis (1995) considered a model (Bradley and Terry, 1952) that was originally used to rank teams and predict the outcome of team sports.
How does the model work? Look at a specific example: suppose we are interested in predicting the outcome of a playoff game where the Diamondbacks play the Dodgers. Or suppose we want to know the probability that the Dodgers will be the National League West winners this year, given the regular season results for last year.

Suppose results are:
[Table: number of games won, cross-classified by winner and loser, for the D'backs, Dodgers, Giants, Rockies, and Padres; counts not preserved in the transcript]
Let D'backs/Dodgers → Dodgers denote the event that the D'backs and Dodgers play and the Dodgers win. In general, for each team i we assign a win parameter $\tau_i$ so that the probability that i beats j is $P(i \text{ beats } j) = \tau_i / (\tau_i + \tau_j)$.

Bradley-Terry Model of Competing Sports Teams
Note that multiplying each $\tau_i$ by any constant c > 0 does not change the win probabilities, so one $\tau_i$ can be fixed at 1. We fix $\tau_{d'backs} = 1$.
Note that if $\tau_i > \tau_j$ for all j, then i is the best team.
Let $y_{ij}$ denote the number of times that i plays j and i wins. For example, the D'backs beat the Giants 8 times and the Giants beat the D'backs 4 times ($y_{ij} = 8$ and $y_{ji} = 4$).
The win parameters can be determined using a recurrence relationship [equation not preserved in the transcript], where the loglikelihood is $\ln L(\tau) = \sum_i \sum_{j \ne i} y_{ij} \, [\ln \tau_i - \ln(\tau_i + \tau_j)]$.
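The recurrence on the slide is not preserved, so the sketch below assumes one standard choice, the minorize-maximize update tau_i <- (total wins of i) / sum_j (y_ij + y_ji) / (tau_i + tau_j), followed by rescaling so the first team's parameter is 1. The win matrix is hypothetical except for the D'backs versus Giants record (8 and 4) quoted above, and every team is assumed to have at least one win and one game played.

```python
import math


def fit_bradley_terry(wins, iterations=500):
    """Estimate win parameters; wins[i][j] = times team i beat team j. tau[0] is fixed at 1."""
    k = len(wins)
    tau = [1.0] * k
    for _ in range(iterations):
        new_tau = []
        for i in range(k):
            total_wins = sum(wins[i][j] for j in range(k) if j != i)
            denom = sum((wins[i][j] + wins[j][i]) / (tau[i] + tau[j])
                        for j in range(k) if j != i)
            new_tau.append(total_wins / denom)
        tau = [t / new_tau[0] for t in new_tau]  # fix the first parameter at 1
    return tau


def log_likelihood(wins, tau):
    """Bradley-Terry loglikelihood: sum of y_ij * [ln tau_i - ln(tau_i + tau_j)]."""
    k = len(wins)
    return sum(wins[i][j] * (math.log(tau[i]) - math.log(tau[i] + tau[j]))
               for i in range(k) for j in range(k) if j != i)


# Hypothetical records for three teams (order: D'backs, Dodgers, Giants).
wins = [
    [0, 5, 8],
    [7, 0, 6],
    [4, 6, 0],
]
tau_hat = fit_bradley_terry(wins)
lrt = 2.0 * (log_likelihood(wins, tau_hat) - log_likelihood(wins, [1.0] * 3))
print("tau:", [round(t, 3) for t in tau_hat], "LRT vs equal teams:", round(lrt, 3))
```

The same likelihood ratio comparison against all parameters equal to 1 is what the null hypothesis "all teams are equally likely to win" refers to.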

RESULTS
[Table: number of games won, cross-classified by winner and loser, for the D'backs, Dodgers, Giants, Rockies, and Padres; counts not preserved in the transcript]
Ho: all teams are equally likely to win ($\tau_i = 1$ for all i)
LRT = 3.63. With a p-value of 0.46, we fail to reject the null hypothesis.
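The reported p-value can be checked by hand. The sketch below assumes 4 degrees of freedom (five teams with one win parameter fixed at 1, which the slide does not state explicitly) and uses the closed-form chi-square upper tail that exists for even degrees of freedom. The function name is an assumption.

```python
import math


def chi_square_sf_even_df(x, df):
    """Upper-tail probability of a chi-square variable with an even number of df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here requires a positive even df")
    half = x / 2.0
    term, total = 1.0, 0.0
    for i in range(df // 2):
        if i > 0:
            term *= half / i  # builds (x/2)^i / i!
        total += term
    return math.exp(-half) * total


p_value = chi_square_sf_even_df(3.63, 4)
print(f"p = {p_value:.2f}")  # rounds to 0.46, matching the slide
```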

We get more information from this analysis
We get the relative rankings: $\tau_{dodgers} = 1.23$, $\tau_{d'backs} = 1.00$, $\tau_{giants} = 0.87$, $\tau_{rockies} = 0.71$, $\tau_{padres} = 0.67$.
With these rankings we can make predictions about the outcomes of games.
Note that these probabilities are different from the predictions we would get if we just used the individual match-up records. The estimate is not 8/12 = .67 for the Dodgers beating the Giants.
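Using the win parameters quoted above, predicted probabilities follow directly from the Bradley-Terry form tau_i / (tau_i + tau_j). A small sketch (the dictionary layout and function name are illustrative assumptions):

```python
# Win parameters as reported on the slide.
tau = {"Dodgers": 1.23, "D'backs": 1.00, "Giants": 0.87, "Rockies": 0.71, "Padres": 0.67}


def win_probability(tau, i, j):
    """Probability that team i beats team j under the Bradley-Terry model."""
    return tau[i] / (tau[i] + tau[j])


p = win_probability(tau, "Dodgers", "Giants")
print(f"P(Dodgers beat Giants) = {p:.3f}")  # 1.23 / 2.10, about 0.586 rather than 0.67
```

The model estimate pools information from every match-up, so it differs from the raw head-to-head proportion.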

How is this sports analysis analogous to the TDT?
Think of:
(1) Each possible allele at a locus = a team
(2) A heterozygous parent = a match-up
(3) Allele received by a child from a heterozygous parent = the winner of the game
(4) The transmission parameters = the win parameters
(5) The win/loss record is determined by the transmissions from heterozygous parents.

[Table: alleles transmitted versus not transmitted; counts not preserved in the transcript]
When we ignore disease status, the Bradley-Terry model provides a form of segregation analysis. When we consider the transmission to affected members only (as in this example), we have a form of TDT analysis.

Assessing significance
We use a likelihood ratio test statistic, LRT = 2*( ln(L_Ha) - ln(L_Ho) ), where L_Ha and L_Ho are the maximum likelihoods under the alternative and null hypotheses.
Significance? Approximate p-values can be calculated by assuming the distribution is chi-square or by gene dropping.

Gamete Competition contrasted with the TDT
(1) Gamete competition works on extended pedigrees. No need to break up large families into nuclear families.
(2) If we have only trios, the gamete competition and the TDT are equivalent. Their null hypothesis is no linkage or no association. The alternative hypothesis is linkage and association.
(3) When considering more than one affected per family, the TDT and gamete competition confound association with linkage.
(4) Exact p-values can be determined with the TDT. Gamete competition p-values are asymptotic.
(5) The gamete competition model can be used when there is missing marker information. Allele frequencies can be fixed at population estimates or estimated along with the $\tau$'s.
(6) When there is missing data, the gamete competition is not immune to the effects of population stratification or rare alleles.

Example: Families affected with Noninsulin Dependent Diabetes and linkage to a marker within the sulfonyl urea receptor-1 gene
27 Mexican-American extended pedigrees with 74 affected offspring (all genotyped) at SUR.
The likelihood ratio test has 9 degrees of freedom (the value of the test statistic is not preserved in the transcript). P-value = 0.043.
[Table: allele, frequency, $\tau_i$, and s.e. of $\tau_i$, with one $\tau_i$ fixed; values not preserved in the transcript]

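For a quantitative trait, the transmission parameter can be written $\tau_i = e^{\omega_i x_p}$, where $x_p$ denotes child p's standardized trait value and i denotes allele i, and the probability of an i/j heterozygous parent transmitting i is $e^{\omega_i x_p} / (e^{\omega_i x_p} + e^{\omega_j x_p})$.
Note that one $\omega$ is set to zero. This is equivalent to conditional logistic regression.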
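A minimal sketch of the quantitative-trait transmission probability, assuming the exponential form tau_i = exp(omega_i * x) for the transmission parameter. The omega values, trait values, and function name below are hypothetical. The last lines show why this matches a logistic model in the difference omega_i - omega_j.

```python
import math


def transmission_probability(omega_i, omega_j, x):
    """P(an i/j heterozygous parent transmits i) for a child with standardized trait value x."""
    tau_i = math.exp(omega_i * x)
    tau_j = math.exp(omega_j * x)
    return tau_i / (tau_i + tau_j)


# Hypothetical parameters: one omega fixed at zero, the other free.
omega_fixed, omega_free = 0.0, 0.4
for x in (-2.0, 0.0, 2.0):
    p = transmission_probability(omega_free, omega_fixed, x)
    logistic = 1.0 / (1.0 + math.exp(-(omega_free - omega_fixed) * x))
    print(f"x = {x:+.1f}: P(transmit) = {p:.3f}, logistic form = {logistic:.3f}")
```

At the trait mean (x = 0) transmission is Mendelian, and under the null of all omegas equal to zero it is Mendelian for every trait value.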

Quantitative Trait Example: ACE
High ACE concentration is associated with a deletion within an intron of the ACE gene. 404 people in 69 families (Sinsheimer et al., 2000).
[Table: mle and s.e. of mle for $\omega_{insertion}$ (fixed) and $\omega_{deletion}$; the only surviving numeric entry is 0.17]
Ho: $\omega_{deletion} = 0$
Ha: $\omega_{deletion} \ne 0$
LRT = [value missing]. Asymptotic p-value < 1 x [exponent missing].

Another Example: Analyzing tightly linked SNPs
SNPs (single nucleotide polymorphisms) tend to be more stable and more abundant than microsatellite markers. They are predominantly biallelic, so we would like to use several tightly linked markers simultaneously to increase the overall information content.
Recall that we use the allele transmissions from heterozygous parents. Assuming HWE, the maximum possible proportion of heterozygous parents for a biallelic system is 50% (H = 1/2). For an n-allele system, it is H = (n-1)/n. More alleles, more information.
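The maximum heterozygosity claim can be verified with a few lines. Under HWE the expected proportion of heterozygotes is 1 minus the sum of squared allele frequencies, which is largest when all n alleles are equally frequent. The function name and the unequal frequencies are illustrative assumptions.

```python
def expected_heterozygosity(frequencies):
    """Expected heterozygote proportion under HWE: 1 - sum of squared allele frequencies."""
    return 1.0 - sum(p * p for p in frequencies)


for n in (2, 3, 5, 10):
    equal = [1.0 / n] * n
    print(f"n = {n}: H = {expected_heterozygosity(equal):.3f}, (n-1)/n = {(n - 1) / n:.3f}")

# Unequal frequencies give a smaller value than the equal-frequency maximum.
print(f"biallelic with p = 0.8: H = {expected_heterozygosity([0.8, 0.2]):.2f}")
```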

The phase of these multilocus SNPs may not be known
Example: suppose there are three SNPs. An individual with multilocus genotype 1/2, 1/2, 1/2 could have one of the following haplotype pairs: (1) 111 and 222, (2) 122 and 211, (3) 121 and 212, or (4) 112 and 221.
The gamete competition allows the use of non-codominant markers, so we don't need to determine which of these haplotype combinations is present in a particular individual.
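The four phase possibilities in the example can be enumerated mechanically. The sketch below is illustrative only (it is not how MENDEL handles phase); it assumes single-character allele labels and an unphased genotype given as one allele pair per locus, and the function name is hypothetical.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """All unordered haplotype pairs consistent with an unphased multilocus genotype."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotype)):
        hap1 = "".join(str(locus[c]) for locus, c in zip(genotype, choice))
        hap2 = "".join(str(locus[1 - c]) for locus, c in zip(genotype, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return pairs


# The triple heterozygote from the example: 1/2, 1/2, 1/2.
for pair in sorted(compatible_haplotype_pairs([(1, 2), (1, 2), (1, 2)])):
    print(pair)
```

In general an individual heterozygous at m loci has 2^(m-1) compatible haplotype pairs, so phase ambiguity grows quickly as more SNPs are combined.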

If we are using tightly linked SNPs, then $\theta$ is effectively zero and the transmission probability reduces to: [equation not preserved in the transcript]
For two linked loci associated with a quantitative trait, the transmission probability is expressed as: [equation not preserved in the transcript]

An Example
Again we use sex-adjusted ACE levels as a quantitative trait. The three SNPs are labeled by their position and the nucleotides present at the position: A-240T, T1237C, and G2350A. Because the ACE gene spans only 26 kb, the recombination fractions between these SNPs are effectively zero.
The pedigree data consist of 83 white British families ranging in size from 4 to 18 members. ACE levels were determined on 405 family members. Genotypes were collected on 555 family members.

In MENDEL, the most important difference from the previous example will be observed in the locus file. We need to allow for phase ambiguities (lack of certainty in haplotypes).

    L469    AUTOSOME 6 27    <- # haplotypes, # phenotypes
    ATA
    ATG
    ACA
    ACG
    T*A    ! T*A corresponds to haplotypes TTA and TCA
    T*G    ! T*G corresponds to haplotypes TTG and TCG

We are no longer assuming codominant markers, so we must specify the phenotype (of the marker) / genotype relationship. These phenotypes correspond to the marker phenotypes used in the pedigree file.

RESULTS

Many other extensions / alternatives to the TDT have been developed. These include:
TDT using sibling controls
– Sib-TDT (Spielman and Ewens, 1998)
– DAT (Boehnke and Langefeld, 1998)
– SDT (Horvath and Laird)
TDT for quantitative traits
– Allison (1997), Rabinowitz (1997), Abecasis (2000)
Joint modeling of linkage and association that allows estimation of recombination
– Hästbacka (1992)
– Kaplan, Hill and Weir (1995)
– Terwilliger (1995)