Published by Melinda Morgan. Modified over 9 years ago.
1
Gene mapping by association 3/4/04 Biomath/HG 207B/Biostat 237
2
Linkage has its limits

To determine that a trait is closer to marker 1 than to marker 2, we need to see recombination between marker 2 and the trait locus. As the distance between the markers decreases, the number of informative meioses needed to see a recombination increases. At some point linkage analysis becomes impractical because too many families are needed.
3
Association Studies

Association is a statistical term that describes the co-occurrence of alleles or phenotypes. An allele A is associated with disease D if people with D have a different frequency of A than people without D.
4
Possible causes for allelic association

– best: the allele increases disease susceptibility (candidate gene studies)
– good: some subjects share a common ancestor (linkage disequilibrium studies)

[Figure: diagram of alleles and loci]

Under linkage equilibrium, $P(A, D) = P(A)\,P(D)$. Violation of this equality is termed linkage disequilibrium.
5
Linkage Disequilibrium

Suppose one of the population founders carries an allelic variant that increases the risk of a disease. The disease gene is very close to a marker, so the recombination fraction $\theta$ is very small.

[Figure: haplotypes in the founder generation. Ancestral haplotypes are dA, da, and Da.]

Note that D is associated with a: $P(a \mid D)$ is close to one.

Over many generations ($n$), there is occasionally recombination between the two genes, so that the population looks like:

[Figure: haplotypes in the present-day population]

The degree of association between D and a has decreased, so $P(a \mid D)$ is smaller than before, but still $P(a \mid D) > P(a)$, that is, $P(aD) > P(a)\,P(D)$.
6
The Degree of Association Between Two Genes Depends on the Distance Between Them and the Age of the Population

1. Let $\delta_{aD} = P(aD) - P(a)\,P(D)$, and similarly for the other alleles. Then
   $\delta_{aD}(n) = \delta_{aD}(0)\,(1 - \theta)^n$

2. At linkage equilibrium,
   $P(a/a \mid D/d) = P(a/a \mid d/d) = P(a/a \mid D/D) = P(a/a)$
   $P(A/a \mid D/d) = P(A/a \mid d/d) = P(A/a \mid D/D) = P(A/a)$
   $P(A/A \mid D/d) = P(A/A \mid d/d) = P(A/A \mid D/D) = P(A/A)$
   Violation of these equalities is evidence of linkage disequilibrium.
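To make the decay relation concrete, here is a minimal Python sketch of disequilibrium shrinking by a factor of (1 - θ) per generation. The starting value 0.2 and the recombination fractions are hypothetical, and the function names are illustrative only.

```python
import math

def ld_after(delta0: float, theta: float, n: int) -> float:
    """Disequilibrium after n generations: delta(n) = delta(0) * (1 - theta)**n."""
    return delta0 * (1.0 - theta) ** n

def half_life(theta: float) -> float:
    """Number of generations for the disequilibrium to fall to half its initial value."""
    return math.log(0.5) / math.log(1.0 - theta)

delta0 = 0.2  # hypothetical initial disequilibrium
for theta in (0.001, 0.01, 0.1):  # hypothetical recombination fractions
    print(f"theta={theta}: delta(100)={ld_after(delta0, theta, 100):.4f}, "
          f"half-life={half_life(theta):.1f} generations")
```

The tighter the linkage (smaller θ), the longer the association persists, which is why association can localize a gene at distances where linkage analysis runs out of recombinants.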
7
Allelic association studies test whether alleles are associated with the trait

2 types of association tests:
– population-based association tests
   cases and controls are unrelated
   cross-classify by genotype
   use a $\chi^2$ test or logistic regression
– family-based association tests
   cases and controls are related: parents, sibs, etc.
   often based on allele transmission rates
   prime example: the TDT
10
How concerned should we be about population stratification invalidating case/control results?

1. The allele frequencies and disease prevalence rarely differ as dramatically by race as in the example.
2. Good epidemiological methods can reduce the problem, for example collecting information on racial/ethnic background.
3. Sometimes there is no alternative to a case/control design. Family controls may not be available.

On the other hand,
1. Better safe than sorry: use family-based control designs.
2. Family-based designs require more genotyping, but not more phenotyping, than case/control designs.
11
The Transmission Disequilibrium Test eliminates concern over false positives due to population stratification (Spielman et al., 1993; Terwilliger and Ott, 1992)

A simple illustration of the TDT: collect parent-child trios.

[Figure: parents with marker genotypes A/a and A/A]

If the child is chosen without regard to disease status, then the child’s genotype is equally likely to be A/a or A/A.
12
[Figure: trio with parental genotypes D/d A/a and d/d A/A, and a child with genotype D/d A/a]

However, if the child is chosen because they are affected, and the marker allele a is associated with the disease allele D, then the child is more likely to have the A/a genotype at the marker than the A/A genotype.
14
What are we testing with the TDT?

For a single affected child per family, the null and alternative hypotheses are equivalent to:
   $H_0$: no linkage or no association
   $H_a$: linkage and association

When more than one affected child per family is used, the TDT confounds linkage and association. Thus little is gained by running the TDT on a data set consisting of several very large pedigrees if linkage of the trait and marker has already been established. With many small unrelated pedigrees, information on association can still be gained.

A strongly positive result suggests that the marker tested is a trait susceptibility locus or that the marker is closely linked to a trait susceptibility locus.
15
The TDT has been extended to multiple alleles per locus

                        Allele transmitted
Allele not transmitted  1          2          ...  k-1          k          Row sum
1                       ---        C_{1,2}    ...  C_{1,k-1}    C_{1,k}    n_1
2                       C_{2,1}    ---        ...  C_{2,k-1}    C_{2,k}    n_2
...
k-1                     C_{k-1,1}  C_{k-1,2}  ...  ---          C_{k-1,k}  n_{k-1}
k                       C_{k,1}    C_{k,2}    ...  C_{k,k-1}    ---        n_k
Column sum              t_1        t_2        ...  t_{k-1}      t_k

$H_0$: transmission to the affected child does not depend on allele type
$H_a$: transmission to the affected child depends on allele type

$t_i$ is the column sum omitting the diagonal term; $n_i$ is the row sum, also omitting the diagonal.

Test statistics include: [equation missing in source]. Mendel’s TDT 1 is proportional to this statistic.
16
Numerical example: data from a locus with 5 alleles; 120 transmissions from heterozygous parents to affected children.

                  Allele transmitted
Not transmitted   1    2    3    4    5    n
1                 ---  6    4    4    5    19
2                 6    ---  5    7    4    22
3                 8    7    ---  7    5    27
4                 8    5    5    ---  6    24
5                 7    8    7    6    ---  28
t                 29   26   21   24   20   120

Under some conditions, $T_{mh}$ is asymptotically distributed as chi-square with $k-1$ degrees of freedom.

$T_{mh}$ = ?   TDT 2 = ?   Is there evidence of transmission distortion?
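As a rough check on the numerical example, the sketch below computes a marginal-homogeneity statistic from the 5-allele table. The formula used, $T_{mh} = \frac{k-1}{k}\sum_i (t_i - n_i)^2 / (t_i + n_i)$, is the form commonly given for the multi-allele TDT. It is an assumption here, because the slide's own equation did not survive, and the function name `t_mh` is illustrative.

```python
# counts[r][c]: number of transmissions in which allele r+1 was NOT transmitted
# and allele c+1 WAS transmitted (heterozygous parents only; diagonal unused).
counts = [
    [0, 6, 4, 4, 5],
    [6, 0, 5, 7, 4],
    [8, 7, 0, 7, 5],
    [8, 5, 5, 0, 6],
    [7, 8, 7, 6, 0],
]

def t_mh(table):
    """Return (statistic, degrees of freedom) for the assumed T_mh formula."""
    k = len(table)
    n = [sum(table[i][j] for j in range(k) if j != i) for i in range(k)]  # row sums
    t = [sum(table[i][j] for i in range(k) if i != j) for j in range(k)]  # column sums
    total = sum((t[i] - n[i]) ** 2 / (t[i] + n[i])
                for i in range(k) if t[i] + n[i] > 0)
    return (k - 1) / k * total, k - 1

stat, df = t_mh(counts)
print(f"T_mh = {stat:.2f} on {df} degrees of freedom")
```

Under this assumed formula the table gives $T_{mh} = 3.60$ on 4 degrees of freedom, well below the 5% chi-square critical value of 9.49, so the data show little sign of transmission distortion.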
17
MENDEL determines significance using permutation procedures

Why? If the sample size is small or alleles are rare, the distribution of the TDT statistic is poorly approximated by a chi-square distribution.

How?
(1) For each iteration (usually 10,000 or more):
   (a) Calculate a new TDT table. Hold the parental genotypes fixed. For each child, designate with equal probability which of the parental alleles the child receives.
   (b) Calculate the TDT statistic and determine whether it is larger than the observed TDT statistic.
(2) The p-value is equal to the number of iterations in which the TDT statistic is larger than the observed statistic, divided by the total number of iterations.

What is the reason for the standard error? Permutation p-values are estimated using Monte Carlo simulation with a finite number of iterations.
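The procedure above can be sketched in Python as follows. This is a simplified illustration, not MENDEL's implementation: it treats every transmission from a heterozygous parent as an independent coin flip, uses the assumed statistic $\frac{k-1}{k}\sum_i (t_i - n_i)^2/(t_i + n_i)$, and reports the binomial standard error $\sqrt{p(1-p)/N}$ of the Monte Carlo estimate. The data are the 5-allele example table; all names are hypothetical.

```python
import math
import random

# observed[r][c]: allele r+1 not transmitted, allele c+1 transmitted.
observed = [
    [0, 6, 4, 4, 5],
    [6, 0, 5, 7, 4],
    [8, 7, 0, 7, 5],
    [8, 5, 5, 0, 6],
    [7, 8, 7, 6, 0],
]

def statistic(table):
    """Assumed marginal-homogeneity form of the multi-allele TDT statistic."""
    k = len(table)
    n = [sum(table[i][j] for j in range(k) if j != i) for i in range(k)]
    t = [sum(table[i][j] for i in range(k) if i != j) for j in range(k)]
    return (k - 1) / k * sum((t[i] - n[i]) ** 2 / (t[i] + n[i])
                             for i in range(k) if t[i] + n[i] > 0)

def permutation_p(table, iterations=2000, seed=0):
    """Monte Carlo p-value and its standard error, parental genotypes held fixed."""
    rng = random.Random(seed)
    k = len(table)
    obs = statistic(table)
    # The number of transmissions from i/j heterozygous parents stays fixed.
    pairs = {(i, j): table[i][j] + table[j][i]
             for i in range(k) for j in range(i + 1, k)}
    larger = 0
    for _ in range(iterations):
        sim = [[0] * k for _ in range(k)]
        for (i, j), m in pairs.items():
            i_sent = sum(rng.random() < 0.5 for _ in range(m))
            sim[j][i] = i_sent      # allele i transmitted, j not transmitted
            sim[i][j] = m - i_sent  # allele j transmitted, i not transmitted
        if statistic(sim) > obs:
            larger += 1
    p = larger / iterations
    return p, math.sqrt(p * (1.0 - p) / iterations)

p_value, std_err = permutation_p(observed)
print(f"permutation p-value = {p_value:.3f} (s.e. {std_err:.3f})")
```

Increasing `iterations` shrinks the standard error in proportion to one over the square root of the number of iterations, which is why the reported p-value carries a standard error at all.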
18
TDT Summary

– ignores transmissions from homozygous parents
– with two alleles it has an approximate chi-square(1) distribution (McNemar test)
   – but exact p-values can be computed from the Binomial(p = .5) distribution in the bi-allelic case
– If there is one affected per nuclear family, this tests the null: no linkage or no association
   – If the test is significant, there is linkage and association
– If there are multiple affecteds, the TDT will confound linkage and association owing to the dependencies of the trios.
   – users should not expect new insight when the data consist of one or two large disease pedigrees already showing linkage
   – with many small unrelated pedigrees, the chance of confusing linkage with association becomes less of an issue, and the TDT can help in identifying associated marker alleles.
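For the bi-allelic case summarized above, a minimal Python sketch of both the McNemar chi-square version and the exact binomial version might look like this. Here `b` and `c` are the numbers of heterozygous parents transmitting allele 1 and allele 2 respectively; the counts 45 and 25 are hypothetical, chosen only for illustration.

```python
import math

def tdt_chi2(b: int, c: int) -> float:
    """McNemar-type TDT statistic, approximately chi-square with 1 df under the null."""
    return (b - c) ** 2 / (b + c)

def chi2_1df_p(x: float) -> float:
    """Upper tail probability of a chi-square(1) variable."""
    return math.erfc(math.sqrt(x / 2.0))

def tdt_exact_p(b: int, c: int) -> float:
    """Two-sided exact p-value from Binomial(b + c, 0.5)."""
    n, hi = b + c, max(b, c)
    tail = sum(math.comb(n, x) for x in range(hi, n + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

b, c = 45, 25  # hypothetical transmission counts
x2 = tdt_chi2(b, c)
print(f"chi-square = {x2:.2f}, asymptotic p = {chi2_1df_p(x2):.4f}, "
      f"exact p = {tdt_exact_p(b, c):.4f}")
```

The exact version is the safer choice when `b + c` is small, since that is where the chi-square approximation is weakest.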
19
Limitations of the original TDT

(1) Nuclear families
(2) Qualitative traits
(3) Codominant markers

Many methods for extending the TDT have been developed. We will discuss one in detail, the gamete competition model.
20
One way to extend the TDT: Lange (1988), Jin et al. (1994), and Sham and Curtis (1995) considered a model (Bradley and Terry, 1952) that was originally used to rank teams and predict the outcome of team sports.

How does the model work? Look at a specific example: suppose we are interested in predicting the outcome of a playoff game where the Diamondbacks play the Dodgers. Or suppose we want to know the probability that the Dodgers will be the National League West winners this year, given the regular season results from last year.
21
Suppose results are:

           Winner
Loser      D’backs  Dodgers  Giants  Rockies  Padres
D’backs    ---      6        4       4        5
Dodgers    6        ---      7       5        4
Giants     8        5        ---     5        6
Rockies    8        7        7       ---      5
Padres     7        8        6       7        ---

Let D’backs/Dodgers → Dodgers denote the event that the D’backs and Dodgers play and the Dodgers win. In general, for each team i we assign a win parameter $\tau_i$ so that the probability that i beats j is:
   $P(i \text{ beats } j) = \dfrac{\tau_i}{\tau_i + \tau_j}$
22
Bradley-Terry Model of Competing Sports Teams

Note that multiplying each $\tau_i$ by any $\lambda > 0$ does not change the win probabilities, so one $\tau_i$ can be fixed at 1. We fix $\tau_{d’backs} = 1$. Note that if $\tau_i > \tau_j$ for all $j \ne i$, then i is the best team.

Let $y_{ij}$ denote the number of times that i plays j and i wins. For example, the D’backs beat the Giants 8 times and the Giants beat the D’backs 4 times ($y_{ij} = 8$ and $y_{ji} = 4$).

The win parameters can be determined using a recurrence relationship [equation missing in source], where the loglikelihood is
   $L(\tau) = \sum_i \sum_{j \ne i} y_{ij}\,[\ln \tau_i - \ln(\tau_i + \tau_j)]$
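One way to fit the win parameters is sketched below in Python. It uses the standard minorize-maximize update for the Bradley-Terry model, $\tau_i \leftarrow \sum_{j \ne i} y_{ij} \big/ \sum_{j \ne i} (y_{ij} + y_{ji})/(\tau_i + \tau_j)$, which is assumed here to be the kind of recurrence the slide refers to. The data are the win/loss table from the slides (rows are losers, columns are winners), and the variable and function names are illustrative.

```python
import math

teams = ["D'backs", "Dodgers", "Giants", "Rockies", "Padres"]

# lost[r][c]: number of games that row team r lost to column team c.
lost = [
    [0, 6, 4, 4, 5],
    [6, 0, 7, 5, 4],
    [8, 5, 0, 5, 6],
    [8, 7, 7, 0, 5],
    [7, 8, 6, 7, 0],
]
k = len(teams)
# y[i][j]: number of times team i beat team j.
y = [[lost[j][i] for j in range(k)] for i in range(k)]

def loglik(tau):
    """Bradley-Terry loglikelihood: sum of y_ij * ln(tau_i / (tau_i + tau_j))."""
    return sum(y[i][j] * (math.log(tau[i]) - math.log(tau[i] + tau[j]))
               for i in range(k) for j in range(k) if i != j)

def fit(iterations=1000):
    """Iterate the assumed MM update, rescaling so the first team's parameter is 1."""
    tau = [1.0] * k
    for _ in range(iterations):
        new = []
        for i in range(k):
            wins = sum(y[i][j] for j in range(k) if j != i)
            denom = sum((y[i][j] + y[j][i]) / (tau[i] + tau[j])
                        for j in range(k) if j != i)
            new.append(wins / denom)
        tau = [t / new[0] for t in new]
    return tau

tau = fit()
lrt = 2.0 * (loglik(tau) - loglik([1.0] * k))
for name, value in zip(teams, tau):
    print(f"{name:8s} {value:.2f}")
print(f"LRT against equal win parameters = {lrt:.2f}")
```

Because the parameters are only defined up to a common scale, compare ratios between teams rather than raw values when setting this output beside other reported estimates. The predicted probability that team i beats team j is then `tau[i] / (tau[i] + tau[j])`.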
23
RESULTS

           Winner
Loser      D’backs  Dodgers  Giants  Rockies  Padres
D’backs    ---      6        4       4        5
Dodgers    6        ---      7       5        4
Giants     8        5        ---     5        6
Rockies    8        7        7       ---      5
Padres     7        8        6       7        ---

$H_0$: all teams are equally likely to win ($\tau_i = 1$ for all i)

LRT = 3.63; the p-value of 0.46 gives no evidence against the null hypothesis.
24
We get the relative rankings:
   $\tau_{dodgers} = 1.23$, $\tau_{d’backs} = 1.00$, $\tau_{giants} = 0.87$, $\tau_{rockies} = 0.71$, $\tau_{padres} = 0.67$

With these rankings we can make predictions about the outcomes of games: [equation missing in source]

We get more information from this analysis. Note that these probabilities are different from the predictions we would make from the individual match-up records alone. The estimate is not 8/12 = .67 for dodgers beating giants.
25
How is this sports analysis analogous to the TDT?

Think of:
(1) Each possible allele at the locus = a team
(2) A heterozygous parent = a match-up
(3) Allele received by the child from a heterozygous parent = the winner of the game
(4) The transmission parameters = the win parameters
(5) The win/loss record is determined by the transmissions from heterozygous parents.
26
                  Allele transmitted
Not transmitted   1    2    3    4    5
1                 ---  6    4    4    5
2                 6    ---  7    5    4
3                 8    5    ---  5    6
4                 8    7    7    ---  5
5                 7    8    6    7    ---

When we ignore disease status, the Bradley-Terry model provides a form of segregation analysis. When we consider the transmissions to affected members only (as in this example), we have a form of TDT analysis.
28
Assessing significance

We use a likelihood ratio test statistic
   $LRT = 2\,(\ln L_{H_a} - \ln L_{H_0})$
where $L_{H_a}$ and $L_{H_0}$ are the maximum likelihoods under the alternative and null hypotheses.

Significance? Approximate p-values can be calculated by assuming the distribution is chi-square or by gene dropping.
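For the chi-square route, the asymptotic p-value is the upper tail probability of a chi-square distribution at the observed LRT. The sketch below is a small standard-library implementation for integer degrees of freedom, built on the recurrence $Q(a+1, z) = Q(a, z) + z^a e^{-z}/\Gamma(a+1)$ for the regularized upper incomplete gamma function. The function name `chi2_sf` is illustrative, and a statistics library would normally be used instead.

```python
import math

def chi2_sf(x: float, df: int) -> float:
    """Upper tail probability P(X > x) for a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        q, a = math.exp(-half), 1.0               # closed form for df = 2
    else:
        q, a = math.erfc(math.sqrt(half)), 0.5    # closed form for df = 1
    # Step up two degrees of freedom at a time until a = df / 2.
    while 2 * a < df:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q

# LRT = 3.63 with 4 degrees of freedom (five teams, one parameter fixed).
print(f"p-value for the sports example: {chi2_sf(3.63, 4):.2f}")
```

With five teams and one parameter fixed, the sports example has 4 degrees of freedom, and an LRT of 3.63 gives a p-value of about 0.46.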
29
Gamete Competition contrasted with the TDT

(1) Gamete competition works on extended pedigrees. There is no need to break up large families into nuclear families.
(2) If we have only trios, the gamete competition and the TDT are equivalent. Their null hypothesis is no linkage or no association. The alternative hypothesis is linkage and association.
(3) When considering more than one affected per family, the TDT and gamete competition confound association with linkage.
(4) Exact p-values can be determined with the TDT. Gamete competition p-values are asymptotic.
(5) The gamete competition model can be used when there is missing marker information. Allele frequencies can be fixed at population estimates or estimated along with the $\tau$’s.
(6) When there is missing data, the gamete competition is not immune to the effects of population stratification or rare alleles.
30
Example: Families affected with noninsulin-dependent diabetes and linkage to a marker within the sulfonylurea receptor-1 (SUR) gene

27 Mexican-American extended pedigrees with 74 affected offspring (all genotyped at SUR)

The likelihood ratio test statistic is 9.133 with 9 degrees of freedom. P-value = 0.043

allele   freq   $\tau_i$   s.e. of $\tau_i$
1        .054   .288       .215
2        .210   1.00       fixed
3        .190   .810       .447
4        .048   1.40       .985
5        .047   .697       .681
6        .108   .383       .204
7        .140   .556       .288
8        .091   .567       .322
9        .071   .499       .509
10       .042   .082       .104
31
[equation missing in source]

where $x_p$ denotes child p’s standardized trait value, i denotes allele i, and the probability of an i/j heterozygous parent transmitting i is [equation missing in source].

Note that one of the allele parameters is set to zero. This is equivalent to conditional logistic regression.
32
Quantitative Trait Example: ACE

High ACE concentration is associated with a deletion within an intron of the ACE gene. 404 people in 69 families (Sinsheimer et al., 2000).

              insertion   deletion
mle           0.00        1.31
s.e. of mle   fixed       0.17

$H_0$: deletion parameter = 0
$H_a$: deletion parameter ≠ 0

LRT = 82.76, asymptotic p-value < $1 \times 10^{-19}$
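To give a feel for what an estimate of 1.31 means, the sketch below evaluates a transmission probability for children with different standardized trait values. It assumes the logistic form $P(i/j \to i) = e^{\omega_i x_p} / (e^{\omega_i x_p} + e^{\omega_j x_p})$, which is consistent with the remark that the model is equivalent to conditional logistic regression. The formula, the symbol $\omega$, and the function name are assumptions, since the slide's equations are not available.

```python
import math

def transmit_prob(omega_i: float, omega_j: float, x: float) -> float:
    """P(an i/j parent transmits i to a child with standardized trait value x).

    Assumed logistic form: exp(omega_i * x) / (exp(omega_i * x) + exp(omega_j * x)).
    """
    return 1.0 / (1.0 + math.exp((omega_j - omega_i) * x))

omega_insertion, omega_deletion = 0.0, 1.31  # estimates quoted for the ACE example
for x in (-1.0, 0.0, 1.0):
    p = transmit_prob(omega_deletion, omega_insertion, x)
    print(f"trait value {x:+.1f}: P(deletion transmitted) = {p:.3f}")
```

Under this assumed form, a child with an average trait value (x = 0) receives either allele with probability one half, while children with high ACE levels are more likely to have received the deletion.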
33
Another Example: Analyzing tightly linked SNPs

SNPs (single nucleotide polymorphisms) tend to be more stable and more abundant than microsatellite markers. They are predominantly biallelic, so we would like to use several tightly linked markers simultaneously to increase the overall information content.

Recall that we use the allele transmissions from heterozygous parents. Assuming HWE, the maximum possible proportion of heterozygous parents for a biallelic system is 0.50. For an n-allele system, it is H = (n-1)/n. More alleles, more information.
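The heterozygosity claim can be checked numerically. Under Hardy-Weinberg equilibrium the expected proportion of heterozygotes is $H = 1 - \sum_i p_i^2$, which reaches its maximum $(n-1)/n$ when all n alleles are equally frequent. The short Python sketch below illustrates this with hypothetical allele frequencies.

```python
def heterozygosity(freqs):
    """Expected proportion of heterozygotes under HWE: 1 - sum of squared frequencies."""
    assert abs(sum(freqs) - 1.0) < 1e-9, "allele frequencies must sum to 1"
    return 1.0 - sum(p * p for p in freqs)

# Equal frequencies give the maximum (n - 1) / n for an n-allele system.
for n in (2, 4, 8):
    print(n, round(heterozygosity([1.0 / n] * n), 4))

# An unbalanced biallelic SNP (hypothetical frequencies) is far less informative.
print(round(heterozygosity([0.9, 0.1]), 4))
```

Combining several biallelic SNPs into haplotypes raises the number of distinguishable "alleles" and therefore the achievable heterozygosity.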
34
The phase of these multilocus SNPs may not be known.

Example: suppose there are three SNPs. An individual with multilocus genotype 1/2, 1/2, 1/2 could have one of the following haplotype pairs: (1) 111 and 222, (2) 122 and 211, (3) 121 and 212, or (4) 112 and 221.

The gamete competition allows the use of non-codominant markers, so we don’t need to determine which of these haplotype combinations is present in a particular individual.
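The four possibilities in the example can be enumerated mechanically. The sketch below lists every unordered haplotype pair compatible with an unphased multilocus genotype; the function name and the genotype encoding (one pair of alleles per locus) are illustrative choices.

```python
from itertools import product

def haplotype_pairs(genotypes):
    """List the unordered haplotype pairs consistent with unphased genotypes.

    genotypes: one (allele, allele) tuple per locus, in locus order.
    """
    pairs = set()
    for picks in product((0, 1), repeat=len(genotypes)):
        first = "".join(str(g[p]) for g, p in zip(genotypes, picks))
        second = "".join(str(g[1 - p]) for g, p in zip(genotypes, picks))
        pairs.add(tuple(sorted((first, second))))
    return sorted(pairs)

for pair in haplotype_pairs([(1, 2), (1, 2), (1, 2)]):
    print(pair)
```

An individual heterozygous at h loci has $2^{h-1}$ possible phase resolutions, so the ambiguity grows quickly as more SNPs are combined.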
35
If we are using tightly linked SNPs, then the recombination fraction $\theta$ is effectively zero and the transmission probability reduces to: [equation missing in source]

For two linked loci associated with a quantitative trait, the transmission probability is expressed as: [equation missing in source]
36
An Example

Again we use sex-adjusted ACE levels as a quantitative trait. The three SNPs are labeled by their position and the nucleotides present at that position: A-240T, T1237C, and G2350A. Because the ACE gene spans only 26 kb, the recombination fractions between these SNPs are effectively zero. The pedigree data consist of 83 white British families ranging in size from 4 to 18 members. ACE levels were determined on 405 family members. Genotypes were collected on 555 family members.
37
In MENDEL, the most important difference from the previous example is in the locus file. We need to allow for phase ambiguities (lack of certainty in haplotypes).

   L469    AUTOSOME  6 27    <- # haplotypes, # phenotypes
   ATA     0.40190
   ATG     0.00780
   ACA     0.06740
   ACG     0.18310
   T*A     0.01340    ! T*A corresponds to haplotypes TTA and TCA
   T*G     0.32640    ! T*G corresponds to haplotypes TTG and TCG

We are no longer assuming codominant markers, so we must specify the phenotype (of the marker) / genotype relationship. These phenotypes correspond to the marker phenotypes used in the pedigree file.
38
RESULTS
39
Many other extensions / alternatives to the TDT have been developed. These include:

TDT using sibling controls
– Sib-TDT (Spielman and Ewens, 1998)
– DAT (Boehnke and Langefeld, 1998)
– SDT (Horvath and Laird)

TDT for quantitative traits
– Allison (1997), Rabinowitz (1997), Abecasis (2000)

Joint modeling of linkage and association that allows estimation of recombination
– Hastabacka (1992)
– Kaplan, Hill and Weir (1995)
– Terwilliger (1995)