Linkage Disequilibrium Mapping of Complex Binary Diseases

Two types of complex traits
Quantitative traits: continuous variation
Dichotomous traits: discontinuous variation
  o Binary, e.g., presence (1) or absence (0) of a disease
  o Multiple outcomes, e.g., none, moderate or severe disease

Special topic for Rebecca and Amy's project

Consider a natural population

One marker with two alleles M and m: Prob(M) = $p$, Prob(m) = $1-p$
One QTL (affecting a binary trait) with two alleles A and a: Prob(A) = $q$, Prob(a) = $1-q$

Four haplotypes:
Prob(MA) = $p_{11} = pq + D$
Prob(Ma) = $p_{10} = p(1-q) - D$
Prob(mA) = $p_{01} = (1-p)q - D$
Prob(ma) = $p_{00} = (1-p)(1-q) + D$

Conversely:
$p = p_{11} + p_{10}$
$q = p_{11} + p_{01}$
$D = p_{11}p_{00} - p_{10}p_{01}$

$D$ is the linkage disequilibrium between the marker and the underlying QTL.

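The relations on this slide can be checked numerically. The sketch below is only an illustration: the function names and the example values of p, q and D are hypothetical, and the range check simply requires all four haplotype frequencies to be non-negative.

```python
def haplotype_freqs(p, q, D):
    """Return (p11, p10, p01, p00) from allele frequencies p, q and LD D."""
    freqs = (
        p * q + D,              # p11: haplotype carrying M and A
        p * (1 - q) - D,        # p10: haplotype carrying M and a
        (1 - p) * q - D,        # p01: haplotype carrying m and A
        (1 - p) * (1 - q) + D,  # p00: haplotype carrying m and a
    )
    if min(freqs) < 0:
        raise ValueError("D is outside the range allowed by p and q")
    return freqs


def allele_freqs_and_ld(p11, p10, p01, p00):
    """Recover (p, q, D) from the four haplotype frequencies."""
    p = p11 + p10
    q = p11 + p01
    D = p11 * p00 - p10 * p01
    return p, q, D


# Hypothetical example values, not taken from any data set.
p11, p10, p01, p00 = haplotype_freqs(0.6, 0.3, 0.05)
p, q, D = allele_freqs_and_ld(p11, p10, p01, p00)
```

Round-tripping through both functions should return the starting values of p, q and D up to floating-point error.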
Data structure

Sample   Binary ($y_i$)   Marker ($j$)
1        1                MM (2)
2        1                Mm (1)
3        1                Mm (1)
4        1                mm (0)
5        0                MM (2)
6        0                Mm (1)
7        0                Mm (1)
8        0                mm (0)

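Arrange the data in a 2 x 3 contingency table

Counts:
                Marker genotype
                2               1               0               Total
Affected (1)    $n_{12}$        $n_{11}$        $n_{10}$        $n_{1\cdot}$
Normal (0)      $n_{02}$        $n_{01}$        $n_{00}$        $n_{0\cdot}$
Total           $n_{\cdot 2}$   $n_{\cdot 1}$   $n_{\cdot 0}$   $n$

Frequencies ($g_{lj} = n_{lj}/n$):
                Marker genotype
                2               1               0               Total
Affected (1)    $g_{12}$        $g_{11}$        $g_{10}$        $g_{1\cdot}$
Normal (0)      $g_{02}$        $g_{01}$        $g_{00}$        $g_{0\cdot}$
Total           $g_{\cdot 2}$   $g_{\cdot 1}$   $g_{\cdot 0}$   1
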
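Independence test

$$\chi^2_{df=2} = \sum_{l=0}^{1}\sum_{j=0}^{2} \frac{(n_{lj} - m_{lj})^2}{m_{lj}} = n \sum_{l=0}^{1}\sum_{j=0}^{2} \frac{(g_{lj} - g_{l\cdot}\,g_{\cdot j})^2}{g_{l\cdot}\,g_{\cdot j}}$$

where $m_{lj}$ is the expected value of $n_{lj}$, $m_{lj} = n\,g_{l\cdot}\,g_{\cdot j}$.

H0: $g_{lj} = g_{l\cdot}\,g_{\cdot j}$
H1: $g_{lj} \neq g_{l\cdot}\,g_{\cdot j}$

Under H0, $\chi^2_{df=2}$ follows a central $\chi^2$ distribution for a large sample size $n$, with df = (2-1) x (3-1) = 2.
If H0 is rejected, there is a significant $D$.
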
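As a rough illustration of the independence test, the sketch below computes the statistic for a 2 x 3 table of counts using only the standard library. The function name and the counts are hypothetical. The p-value uses the fact that the chi-square survival function with 2 degrees of freedom is exp(-x/2), and the code assumes every row and column total is positive.

```python
import math


def independence_test(table):
    """Pearson chi-square test for a 2 x 3 table of counts.

    table[l][j]: rows are affected (1) / normal (0), columns are the three
    marker genotypes. All row and column totals are assumed to be positive.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for l, row in enumerate(table):
        for j, n_lj in enumerate(row):
            m_lj = row_tot[l] * col_tot[j] / n  # expected count under H0
            chi2 += (n_lj - m_lj) ** 2 / m_lj
    p_value = math.exp(-chi2 / 2.0)  # exact chi-square tail for df = 2
    return chi2, p_value


# Hypothetical counts: columns are marker genotypes 2, 1, 0.
counts = [[30, 50, 20],   # affected
          [15, 45, 40]]   # normal
chi2, p_value = independence_test(counts)
```

For the eight-sample illustration in the data-structure slide, the table is [[1, 2, 1], [1, 2, 1]], which gives a statistic of 0 because affected and normal samples have identical marker genotype counts.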
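Regression analysis

                                            Marker model    QTL model
Sample   Binary ($y_{ij}$)   Marker ($j$)   #M ($T_{ij}$)   Prob. of two A's ($\omega_{2|j}$)
1        1                   MM (2)         2               $\omega_{2|2} = p_{11}^2 / p^2$
2        1                   Mm (1)         1               $\omega_{2|1} = 2p_{11}p_{01} / [2p(1-p)]$
3        1                   Mm (1)         1               $\omega_{2|1} = 2p_{11}p_{01} / [2p(1-p)]$
4        1                   mm (0)         0               $\omega_{2|0} = p_{01}^2 / (1-p)^2$
5        0                   MM (2)         2               $\omega_{2|2} = p_{11}^2 / p^2$
6        0                   Mm (1)         1               $\omega_{2|1} = 2p_{11}p_{01} / [2p(1-p)]$
7        0                   Mm (1)         1               $\omega_{2|1} = 2p_{11}p_{01} / [2p(1-p)]$
8        0                   mm (0)         0               $\omega_{2|0} = p_{01}^2 / (1-p)^2$

$p_{11} = pq + D$, $p_{01} = (1-p)q - D$
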
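Joint and conditional ($\omega_{k|ij}$) genotype probabilities between marker and QTL

Joint probabilities:
       AA (2)             Aa (1)                             aa (0)             Obs
MM     $p_{11}^2$         $2p_{11}p_{10}$                    $p_{10}^2$         $n_2$
Mm     $2p_{11}p_{01}$    $2(p_{11}p_{00}+p_{10}p_{01})$     $2p_{10}p_{00}$    $n_1$
mm     $p_{01}^2$         $2p_{01}p_{00}$                    $p_{00}^2$         $n_0$

Conditional probabilities (each joint probability divided by the marker genotype frequency):
       AA (2)                        Aa (1)                                        aa (0)                        Obs
MM     $p_{11}^2 / p^2$              $2p_{11}p_{10} / p^2$                         $p_{10}^2 / p^2$              $n_2$
Mm     $2p_{11}p_{01} / [2p(1-p)]$   $2(p_{11}p_{00}+p_{10}p_{01}) / [2p(1-p)]$    $2p_{10}p_{00} / [2p(1-p)]$   $n_1$
mm     $p_{01}^2 / (1-p)^2$          $2p_{01}p_{00} / (1-p)^2$                     $p_{00}^2 / (1-p)^2$          $n_0$
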
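The conditional probabilities in this table can be tabulated directly from the four haplotype frequencies. The sketch below is an illustration only: the function name, the nested-dictionary layout omega[j][k] and the example frequencies are hypothetical choices.

```python
def conditional_qtl_probs(p11, p10, p01, p00):
    """Return omega[j][k] = Pr(QTL genotype k | marker genotype j).

    j and k are coded 2, 1, 0 (number of M alleles and of A alleles).
    """
    p = p11 + p10  # frequency of marker allele M
    joint = {
        2: {2: p11 ** 2, 1: 2 * p11 * p10, 0: p10 ** 2},
        1: {2: 2 * p11 * p01, 1: 2 * (p11 * p00 + p10 * p01), 0: 2 * p10 * p00},
        0: {2: p01 ** 2, 1: 2 * p01 * p00, 0: p00 ** 2},
    }
    marker = {2: p ** 2, 1: 2 * p * (1 - p), 0: (1 - p) ** 2}
    return {j: {k: joint[j][k] / marker[j] for k in (2, 1, 0)} for j in (2, 1, 0)}


# Hypothetical haplotype frequencies (p = 0.6, q = 0.3, D = 0.05).
omega = conditional_qtl_probs(0.23, 0.37, 0.07, 0.33)
```

Each row omega[j] sums to 1. When D = 0 every row reduces to the QTL genotype frequencies q^2, 2q(1-q), (1-q)^2, so the marker then carries no information about the QTL.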
Statistical models

Marker model

$$y_{ij} = a + bT_{ij} + \varepsilon_{ij}$$

The least squares approach can be used to estimate $a$ and $b$.
The size of $b$ reflects the marker effect, which is confounded by the QTL effect and the marker-QTL LD.

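A minimal least squares sketch for the marker model follows, using the closed-form estimates for simple linear regression. The function name is hypothetical. The data are the eight illustrative samples from the data-structure slide, with T the number of M alleles.

```python
def fit_marker_model(y, T):
    """Least squares estimates (a, b) for y = a + b*T + error."""
    n = len(y)
    y_bar = sum(y) / n
    T_bar = sum(T) / n
    sxy = sum((t - T_bar) * (yi - y_bar) for t, yi in zip(T, y))
    sxx = sum((t - T_bar) ** 2 for t in T)  # must be non-zero
    b = sxy / sxx
    a = y_bar - b * T_bar
    return a, b


# Eight illustrative samples: phenotype y and number of M alleles T.
y = [1, 1, 1, 1, 0, 0, 0, 0]
T = [2, 1, 1, 0, 2, 1, 1, 0]
a, b = fit_marker_model(y, T)
```

For these eight samples the fit gives a = 0.5 and b = 0, since the affected and normal groups have the same marker genotype counts.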
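QTL model

The phenotype of sample $i$ within marker genotype group $j$ is modeled by

$$y_{ij} = \begin{cases} 1 & \text{if } z_{ij} \ge \gamma \\ 0 & \text{if } z_{ij} < \gamma \end{cases}$$

where $\gamma$ is the threshold for the underlying liability of the trait, $z$, which is formulated as

$$z_{ij} = \sum_{k=0}^{2} \xi_{ik}\,\mu_k + e_{ij}$$

$\mu_k$ = the genotypic value of QTL genotype $k$
$\xi_{ik}$ = the (1/0) indicator variable for sample $i$ (1 if sample $i$ carries QTL genotype $k$, 0 otherwise)
$e_{ij}$ = normally distributed residual variable with mean 0 and variance 1
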
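The conditional probability of $y_{ij} = 1$ given sample $i$'s QTL genotype (say $G_{ij} = k$) is obtained by

$$\begin{aligned}
f_k &= \Pr(y_{ij} = 1 \mid G_{ij} = k, \Theta) \\
    &= \Pr(z_{ij} \ge \gamma \mid G_{ij} = k, \Theta) \\
    &= 1 - \Pr(z_{ij} < \gamma \mid G_{ij} = k, \Theta) \\
    &= 1 - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\gamma} \exp\!\left[-\frac{(z - \mu_k)^2}{2}\right] dz
\end{aligned}$$

$f_k$ is called the penetrance of QTL genotype $k$.
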
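Because the residual is standard normal, the penetrance is one minus the normal distribution function evaluated at the threshold minus the genotypic value. The sketch below evaluates it with the standard library error function. The function names and the argument names mu_k and threshold are hypothetical labels for the genotypic value and the liability threshold.

```python
import math


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def penetrance(mu_k, threshold):
    """Pr(liability >= threshold) when liability ~ Normal(mu_k, 1)."""
    return 1.0 - normal_cdf(threshold - mu_k)


# A genotypic value equal to the threshold gives a penetrance of one half.
f_equal = penetrance(0.0, 0.0)
```

Raising the genotypic value relative to the threshold raises the penetrance, so only the difference between the two matters for the probability of disease.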
F-values as a function of q and D

[Figure: landscape of F over q and D]

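Maximum likelihood analysis: Mixture model

$$\begin{aligned}
\log L(\Theta \mid y) = \sum_{j=0}^{2}\sum_{i=1}^{n_j} \log \big[\, & \omega_{2|ij} \Pr\{y_{ij}=1 \mid G_{ij}=2, \Theta\}^{y_{ij}} \Pr\{y_{ij}=0 \mid G_{ij}=2, \Theta\}^{(1-y_{ij})} \\
+\ & \omega_{1|ij} \Pr\{y_{ij}=1 \mid G_{ij}=1, \Theta\}^{y_{ij}} \Pr\{y_{ij}=0 \mid G_{ij}=1, \Theta\}^{(1-y_{ij})} \\
+\ & \omega_{0|ij} \Pr\{y_{ij}=1 \mid G_{ij}=0, \Theta\}^{y_{ij}} \Pr\{y_{ij}=0 \mid G_{ij}=0, \Theta\}^{(1-y_{ij})} \,\big] \\
= \sum_{j=0}^{2}\sum_{i=1}^{n_j} \log \big[\, & \omega_{2|ij} f_2^{y_{ij}} (1-f_2)^{(1-y_{ij})} + \omega_{1|ij} f_1^{y_{ij}} (1-f_1)^{(1-y_{ij})} + \omega_{0|ij} f_0^{y_{ij}} (1-f_0)^{(1-y_{ij})} \,\big]
\end{aligned}$$

$\Theta = (p_{11}, p_{10}, p_{01}, p_{00}, f_2, f_1, f_0)$ (6 parameters, because the four haplotype frequencies sum to 1)
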
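EM algorithm

Define

$$\Omega_{2|ij} = \frac{\omega_{2|ij} f_2^{y_{ij}} (1-f_2)^{(1-y_{ij})}}{\omega_{2|ij} f_2^{y_{ij}} (1-f_2)^{(1-y_{ij})} + \omega_{1|ij} f_1^{y_{ij}} (1-f_1)^{(1-y_{ij})} + \omega_{0|ij} f_0^{y_{ij}} (1-f_0)^{(1-y_{ij})}} \qquad (1)$$

$$\Omega_{1|ij} = \frac{\omega_{1|ij} f_1^{y_{ij}} (1-f_1)^{(1-y_{ij})}}{\omega_{2|ij} f_2^{y_{ij}} (1-f_2)^{(1-y_{ij})} + \omega_{1|ij} f_1^{y_{ij}} (1-f_1)^{(1-y_{ij})} + \omega_{0|ij} f_0^{y_{ij}} (1-f_0)^{(1-y_{ij})}} \qquad (2)$$

$$\Omega_{0|ij} = \frac{\omega_{0|ij} f_0^{y_{ij}} (1-f_0)^{(1-y_{ij})}}{\omega_{2|ij} f_2^{y_{ij}} (1-f_2)^{(1-y_{ij})} + \omega_{1|ij} f_1^{y_{ij}} (1-f_1)^{(1-y_{ij})} + \omega_{0|ij} f_0^{y_{ij}} (1-f_0)^{(1-y_{ij})}} \qquad (3)$$

as the posterior probabilities of QTL genotypes given marker genotypes for sample $i$.
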
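Population genetic parameters

Posterior probabilities:
       AA                 Aa                 aa                 Obs
MM     $\Omega_{2|2i}$    $\Omega_{1|2i}$    $\Omega_{0|2i}$    $n_{\cdot 2}$
Mm     $\Omega_{2|1i}$    $\Omega_{1|1i}$    $\Omega_{0|1i}$    $n_{\cdot 1}$
mm     $\Omega_{2|0i}$    $\Omega_{1|0i}$    $\Omega_{0|0i}$    $n_{\cdot 0}$

$$p_{11} = \frac{1}{2n}\left\{ \sum_{i=1}^{n_{\cdot 2}} \left[2\Omega_{2|2i} + \Omega_{1|2i}\right] + \sum_{i=1}^{n_{\cdot 1}} \left[\Omega_{2|1i} + \phi\,\Omega_{1|1i}\right] \right\} \qquad (4)$$

$$p_{10} = \frac{1}{2n}\left\{ \sum_{i=1}^{n_{\cdot 2}} \left[2\Omega_{0|2i} + \Omega_{1|2i}\right] + \sum_{i=1}^{n_{\cdot 1}} \left[\Omega_{0|1i} + (1-\phi)\,\Omega_{1|1i}\right] \right\} \qquad (5)$$

$$p_{01} = \frac{1}{2n}\left\{ \sum_{i=1}^{n_{\cdot 0}} \left[2\Omega_{2|0i} + \Omega_{1|0i}\right] + \sum_{i=1}^{n_{\cdot 1}} \left[\Omega_{2|1i} + (1-\phi)\,\Omega_{1|1i}\right] \right\} \qquad (6)$$

$$p_{00} = \frac{1}{2n}\left\{ \sum_{i=1}^{n_{\cdot 0}} \left[2\Omega_{0|0i} + \Omega_{1|0i}\right] + \sum_{i=1}^{n_{\cdot 1}} \left[\Omega_{0|1i} + \phi\,\Omega_{1|1i}\right] \right\} \qquad (7)$$

where $\phi = p_{11}p_{00} / (p_{11}p_{00} + p_{10}p_{01})$ is the probability that a double heterozygote MmAa carries the haplotypes MA and ma rather than Ma and mA.
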
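Quantitative genetic parameters

$$f_2 = \frac{\sum_{j=0}^{2}\sum_{i=1}^{n_j} \Omega_{2|ij}\, y_{ij}}{\sum_{j=0}^{2}\sum_{i=1}^{n_j} \Omega_{2|ij}} \qquad (8)$$

$$f_1 = \frac{\sum_{j=0}^{2}\sum_{i=1}^{n_j} \Omega_{1|ij}\, y_{ij}}{\sum_{j=0}^{2}\sum_{i=1}^{n_j} \Omega_{1|ij}} \qquad (9)$$

$$f_0 = \frac{\sum_{j=0}^{2}\sum_{i=1}^{n_j} \Omega_{0|ij}\, y_{ij}}{\sum_{j=0}^{2}\sum_{i=1}^{n_j} \Omega_{0|ij}} \qquad (10)$$
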
EM algorithm

(1) Give initial values $\Theta^{(0)} = (p_{11}, p_{10}, p_{01}, p_{00}, f_2, f_1, f_0)^{(0)}$,
(2) Calculate $\Omega_{2|ij}^{(1)}$, $\Omega_{1|ij}^{(1)}$ and $\Omega_{0|ij}^{(1)}$ using Eqs. 1-3,
(3) Calculate $\Theta^{(1)}$ using $\Omega_{2|ij}^{(1)}$, $\Omega_{1|ij}^{(1)}$ and $\Omega_{0|ij}^{(1)}$ based on Eqs. 4-10,
(4) Repeat (2) and (3) until convergence.

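The four steps above can be sketched as a short loop. This is an illustrative sketch under stated assumptions, not the course program: the function name, the stopping rule, the starting values and the example counts are all hypothetical, and the phase probability phi for double heterozygotes is taken to be p11*p00 / (p11*p00 + p10*p01).

```python
def em_ld_binary(y, marker, theta0, max_iter=500, tol=1e-8):
    """EM sketch for the marker-QTL mixture model.

    y      : list of 0/1 phenotypes
    marker : marker genotypes coded 2 (MM), 1 (Mm), 0 (mm)
    theta0 : starting values (p11, p10, p01, p00, f2, f1, f0)
    """
    theta = tuple(theta0)
    n = len(y)
    for _ in range(max_iter):
        p11, p10, p01, p00, f2, f1, f0 = theta
        f = {2: f2, 1: f1, 0: f0}
        # Joint marker-QTL genotype probabilities joint[j][k]. The marker
        # genotype frequency cancels in the posterior, so the joint
        # probabilities can replace the conditional ones in Eqs. 1-3.
        joint = {
            2: {2: p11 ** 2, 1: 2 * p11 * p10, 0: p10 ** 2},
            1: {2: 2 * p11 * p01, 1: 2 * (p11 * p00 + p10 * p01), 0: 2 * p10 * p00},
            0: {2: p01 ** 2, 1: 2 * p01 * p00, 0: p00 ** 2},
        }
        # Assumed phase probability for a double heterozygote.
        phi = p11 * p00 / (p11 * p00 + p10 * p01)
        h11 = h10 = h01 = h00 = 0.0
        num = {2: 0.0, 1: 0.0, 0: 0.0}
        den = {2: 0.0, 1: 0.0, 0: 0.0}
        for yi, j in zip(y, marker):
            # E-step: posterior QTL genotype probabilities (Eqs. 1-3).
            w = {k: joint[j][k] * (f[k] if yi == 1 else 1.0 - f[k]) for k in (2, 1, 0)}
            total = sum(w.values())
            post = {k: w[k] / total for k in w}
            # M-step accumulators for haplotype counts (Eqs. 4-7).
            if j == 2:
                h11 += 2 * post[2] + post[1]
                h10 += 2 * post[0] + post[1]
            elif j == 1:
                h11 += post[2] + phi * post[1]
                h10 += post[0] + (1 - phi) * post[1]
                h01 += post[2] + (1 - phi) * post[1]
                h00 += post[0] + phi * post[1]
            else:
                h01 += 2 * post[2] + post[1]
                h00 += 2 * post[0] + post[1]
            # M-step accumulators for penetrances (Eqs. 8-10).
            for k in (2, 1, 0):
                num[k] += post[k] * yi
                den[k] += post[k]
        new = (h11 / (2 * n), h10 / (2 * n), h01 / (2 * n), h00 / (2 * n),
               num[2] / den[2], num[1] / den[1], num[0] / den[0])
        change = max(abs(a - b) for a, b in zip(new, theta))
        theta = new
        if change < tol:
            break
    return theta


# Hypothetical counts: 100 affected and 100 normal samples.
marker = [2] * 30 + [1] * 50 + [0] * 20 + [2] * 15 + [1] * 45 + [0] * 40
y = [1] * 100 + [0] * 100
estimates = em_ld_binary(y, marker, (0.3, 0.2, 0.1, 0.4, 0.8, 0.5, 0.2), max_iter=50)
```

Starting values with D = 0 and equal penetrances are a fixed point of these updates, so it is safer to start away from them and to try several starting points.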
Three genotypic values

$\mu_2 = \mu + a$ for AA
$\mu_1 = \mu + d$ for Aa
$\mu_0 = \mu - a$ for aa

With the MLEs of $\mu_k$, we can estimate $\mu$, $a$ and $d$.

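Solving the three equations above for the parameters gives the following worked form, written here with mu_2, mu_1 and mu_0 for the three genotypic values and mu for the overall mean (the symbol names are assumed for illustration):

```latex
\begin{aligned}
\mu &= \tfrac{1}{2}\,(\mu_2 + \mu_0), \\
a   &= \tfrac{1}{2}\,(\mu_2 - \mu_0), \\
d   &= \mu_1 - \tfrac{1}{2}\,(\mu_2 + \mu_0).
\end{aligned}
```

Plugging the estimated genotypic values into the right-hand sides yields the corresponding estimates of the overall mean, the additive effect and the dominance effect.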
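How to estimate $\mu_k$?

$$f_2 = 1 - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\gamma} \exp\!\left[-\frac{(z - \mu_2)^2}{2}\right] dz$$

$$f_1 = 1 - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\gamma} \exp\!\left[-\frac{(z - \mu_1)^2}{2}\right] dz$$

$$f_0 = 1 - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\gamma} \exp\!\left[-\frac{(z - \mu_0)^2}{2}\right] dz$$

We can use numerical approaches to estimate $\mu_2$, $\mu_1$ and $\mu_0$.
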
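One possible numerical approach is to invert the standard normal distribution function, since each penetrance equals the normal distribution function evaluated at the genotypic value minus the threshold. The sketch below is an assumption-laden illustration: the function name is hypothetical, and the threshold is fixed at 0 by default because the penetrances determine each genotypic value only relative to the threshold.

```python
from statistics import NormalDist


def genotypic_values(f2, f1, f0, threshold=0.0):
    """Return (mu2, mu1, mu0) implied by penetrances f2, f1, f0.

    Uses f_k = Phi(mu_k - threshold), so mu_k = threshold + Phi^{-1}(f_k).
    Each penetrance must lie strictly between 0 and 1, otherwise
    NormalDist.inv_cdf raises statistics.StatisticsError.
    """
    std_normal = NormalDist()  # mean 0, standard deviation 1
    return tuple(threshold + std_normal.inv_cdf(f) for f in (f2, f1, f0))


# Hypothetical penetrance estimates.
mu2, mu1, mu0 = genotypic_values(0.8, 0.5, 0.2)
```

A penetrance of 0.5 maps to a genotypic value equal to the threshold, and penetrances placed symmetrically around 0.5 map to genotypic values placed symmetrically around the threshold.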
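Hypothesis test

H0: $f_2 = f_1 = f_0$
H1: at least one equality does not hold

$$LR = -2\left[\log L(\hat\Theta_0 \mid y, M, D) - \log L(\hat\Theta_1 \mid y, M, D)\right]$$

for $D$ in the interval $[\max\{-pq, -(1-p)(1-q)\},\ \min\{p(1-q), (1-p)q\}]$, the range that keeps all four haplotype frequencies non-negative.

$\hat\Theta_0$ = MLE under H0
$\hat\Theta_1$ = MLE under H1
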
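The likelihood ratio can be sketched as below. This is a hedged illustration: the function names are hypothetical, theta1 is assumed to be an estimate already obtained under H1 (for example from an EM run), and under H0 the mixture collapses to a single penetrance whose MLE is the observed proportion of affected samples. The sample must contain both affected and normal individuals.

```python
import math


def mixture_loglik(y, marker, theta):
    """Log-likelihood of the mixture model at theta.

    theta = (p11, p10, p01, p00, f2, f1, f0); marker genotypes are coded
    2 (MM), 1 (Mm), 0 (mm) and phenotypes y are 0/1.
    """
    p11, p10, p01, p00, f2, f1, f0 = theta
    p = p11 + p10
    f = {2: f2, 1: f1, 0: f0}
    joint = {
        2: {2: p11 ** 2, 1: 2 * p11 * p10, 0: p10 ** 2},
        1: {2: 2 * p11 * p01, 1: 2 * (p11 * p00 + p10 * p01), 0: 2 * p10 * p00},
        0: {2: p01 ** 2, 1: 2 * p01 * p00, 0: p00 ** 2},
    }
    freq = {2: p ** 2, 1: 2 * p * (1 - p), 0: (1 - p) ** 2}
    loglik = 0.0
    for yi, j in zip(y, marker):
        # Mixture over the three QTL genotypes, weighted by Pr(k | j).
        lik = sum(joint[j][k] / freq[j] * (f[k] if yi == 1 else 1.0 - f[k])
                  for k in (2, 1, 0))
        loglik += math.log(lik)
    return loglik


def lr_statistic(y, marker, theta1):
    """LR = -2 [log L(H0) - log L(theta1)] with a common penetrance under H0."""
    f_null = sum(y) / len(y)  # MLE of the common penetrance under H0
    loglik0 = sum(math.log(f_null if yi == 1 else 1.0 - f_null) for yi in y)
    return -2.0 * (loglik0 - mixture_loglik(y, marker, theta1))


# Eight illustrative samples and a hypothetical H1 estimate.
y = [1, 1, 1, 1, 0, 0, 0, 0]
marker = [2, 1, 1, 0, 2, 1, 1, 0]
lr = lr_statistic(y, marker, (0.23, 0.37, 0.07, 0.33, 0.7, 0.5, 0.3))
```

When the three penetrances in theta1 all equal the observed proportion of affected samples, the statistic is 0, which is a convenient sanity check on an implementation.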
LR as a function of D

[Figure: profile of LR plotted against D, with the D axis running between the lower and upper bounds of D]

Dr Ma will write the program.