Computational Issues on Statistical Genetics

Research workflow:
– Research Questions
– Review the Literature
– Develop Methods
– Data Collection
– Analyze Data
– Write Reports/Papers

Computational issues:
– Test the power and robustness by computer simulation
– Database construction (Excel, Access)
– Translate data to analyzable form
– Preliminary results (figures, tables)
– Program languages: efficient, feasible
– Graphics: Excel graphics, programmable graphics
Program Languages
– Fortran, C, C++
– Matrix language: MATLAB, S-Plus, R, SAS IML
– Symbolic calculation: Mathematica, Maple, MATLAB
– Interface programming: .NET, C#, Visual Basic
– SAS, SPSS, BMDP
– Database: Access, Excel, SQL, SAS, Oracle
– MACRO
  – Excel, Access, PowerPoint, Word
  – Editor: WinEdt
  – SAS Macro
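Two Point Analysis in F2: Fully Informative Markers (codominant)

| | | BB | Bb | bb |
|---|---|---|---|---|
| AA | Obs. | $n_{22}$ | $n_{21}$ | $n_{20}$ |
| | Freq. | $\frac14(1-r)^2$ | $\frac12 r(1-r)$ | $\frac14 r^2$ |
| | Recom. | 0 | 1 | 2 |
| Aa | Obs. | $n_{12}$ | $n_{11}$ | $n_{10}$ |
| | Freq. | $\frac12 r(1-r)$ | $\frac12(1-r)^2+\frac12 r^2$ | $\frac12 r(1-r)$ |
| | Recom. | 1 | $2r^2/[(1-r)^2+r^2]$ | 1 |
| aa | Obs. | $n_{02}$ | $n_{01}$ | $n_{00}$ |
| | Freq. | $\frac14 r^2$ | $\frac12 r(1-r)$ | $\frac14(1-r)^2$ |
| | Recom. | 2 | 1 | 0 |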
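EM algorithm to estimate the recombination fraction r:
1. Choose a starting value $r^{(0)}$.
2. For $t = 0, 1, 2, \dots$, repeat while $|r^{(t+1)} - r^{(t)}| > 10^{-8}$:
   E-step: calculate $\phi^{(t)} = \dfrac{(r^{(t)})^2}{(1-r^{(t)})^2 + (r^{(t)})^2}$, the probability that a double heterozygote AaBb carries two recombinant gametes, so that $2\phi^{(t)}$ is its expected number of recombination events.
   M-step: $r^{(t+1)} = \dfrac{1}{2n}\left[2(n_{20}+n_{02}) + (n_{21}+n_{12}+n_{10}+n_{01}) + 2\phi^{(t)} n_{11}\right]$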
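The two updates follow from counting recombinant gametes in the codominant F2 table. The short derivation below is a sketch that uses only the cell frequencies and recombinant counts given in that table; the symbol $\phi$ matches the variable `phi` in the Matlab program.

```latex
% A double heterozygote AaBb arises either from two non-recombinant gametes
% (AB/ab), with frequency (1/2)(1-r)^2, or from two recombinant gametes
% (Ab/aB), with frequency (1/2) r^2. Given the genotype AaBb:
\[
  \phi \;=\; \Pr(\text{two recombinant gametes} \mid AaBb)
       \;=\; \frac{\tfrac12 r^2}{\tfrac12 (1-r)^2 + \tfrac12 r^2}
       \;=\; \frac{r^2}{(1-r)^2 + r^2}.
\]
% Each individual carries two gametes, so n individuals carry 2n gametes.
% Summing the recombinant counts (0, 1 or 2 per cell, and 2*phi for AaBb):
\[
  r^{(t+1)} \;=\;
  \frac{2\,(n_{20}+n_{02}) \;+\; (n_{21}+n_{12}+n_{10}+n_{01})
        \;+\; 2\,\phi^{(t)}\, n_{11}}{2n}.
\]
```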
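Two Point Analysis in F2: Fully Informative Markers (codominant)
[Worksheet figure: 3 × 3 table of counts (rows AA, Aa, aa; columns BB, Bb, bb) with total n; fields labeled Input, Result, and r0.]
$\phi^{(t)} = \dfrac{(r^{(t)})^2}{(1-r^{(t)})^2 + (r^{(t)})^2}$
$r^{(t+1)} = \dfrac{1}{2n}\left[2(n_{20}+n_{02}) + (n_{21}+n_{12}+n_{10}+n_{01}) + 2\phi^{(t)} n_{11}\right]$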
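Two Point Analysis in F2: Fully Informative Markers (codominant)
Matlab program to estimate the recombination fraction r

    function r=rEstF2(n22,n21,n20,n12,n11,n10,n02,n01,n00)
    % EM estimate of r from the nine codominant F2 genotype counts
    n=n22+n21+n20+n12+n11+n10+n02+n01+n00;   % number of individuals
    r=0.2;    % starting value r(0)
    r1=-1;    % previous iterate; -1 forces the first pass
    while (abs(r1-r)>1.e-8)
        r1=r;
        %E-step: probability that AaBb carries two recombinant gametes
        phi=r^2/((1-r)^2+r^2);
        %M-step: expected recombinant gametes divided by 2n gametes
        r=1/(2*n)*(2*(n20+n02)+(n21+n12+n10+n01)+2*phi*n11);
    end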
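One way to sanity-check the EM update is to feed it counts that equal the expected cell frequencies for a known r; the iteration should then return that r. The script below is a minimal sketch under that assumption: `rTrue`, `n = 400`, the starting value 0.4, and the iteration cap are illustrative choices, not data from the slides.

```matlab
% Hypothetical check data: expected codominant F2 counts for r = 0.2, n = 400.
rTrue = 0.2;  n = 400;
n22 = n*(1-rTrue)^2/4;  n21 = n*rTrue*(1-rTrue)/2;  n20 = n*rTrue^2/4;
n12 = n21;              n11 = n*((1-rTrue)^2 + rTrue^2)/2;  n10 = n21;
n02 = n20;              n01 = n21;                          n00 = n22;

% Same E-step and M-step as the slide, started away from the answer.
r = 0.4;  rOld = -1;  iter = 0;
while abs(rOld - r) > 1e-10 && iter < 1000   % cap guards against non-convergence
    rOld = r;
    phi = r^2/((1-r)^2 + r^2);                                  % E-step
    r = (2*(n20+n02) + (n21+n12+n10+n01) + 2*phi*n11)/(2*n);    % M-step
    iter = iter + 1;
end
fprintf('r = %.6f after %d iterations\n', r, iter);
```

Because the counts are exact expectations, r = 0.2 is a fixed point of the update, so any drift away from it would point to a coding error.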
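Log-likelihood ratio test statistic
Two alternative hypotheses: $H_0: r = 0.5$ vs. $H_1: r \ne 0.5$

Likelihood value under $H_1$:
$$L_1(r \mid n_{ij}) = \frac{n!}{n_{22}!\cdots n_{00}!}\left[\tfrac14(1-r)^2\right]^{n_{22}+n_{00}}\left[\tfrac14 r^2\right]^{n_{20}+n_{02}}\left[\tfrac12 r(1-r)\right]^{n_{21}+n_{12}+n_{10}+n_{01}}\left[\tfrac12(1-r)^2+\tfrac12 r^2\right]^{n_{11}}$$

Likelihood value under $H_0$:
$$L_0(r=0.5 \mid n_{ij}) = \frac{n!}{n_{22}!\cdots n_{00}!}\left[\tfrac14(1-0.5)^2\right]^{n_{22}+n_{00}}\left[\tfrac14(0.5)^2\right]^{n_{20}+n_{02}}\left[\tfrac12(0.5)(1-0.5)\right]^{n_{21}+n_{12}+n_{10}+n_{01}}\left[\tfrac12(1-0.5)^2+\tfrac12(0.5)^2\right]^{n_{11}}$$

$$\mathrm{LOD} = \log_{10}\frac{L_1(r \mid n_{ij})}{L_0(r=0.5 \mid n_{ij})} = (n_{22}+n_{00})\,2\left[\log_{10}(1-r)-\log_{10}(1-0.5)\right] + \dots = 6.08 > \text{critical LOD} = 3$$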
Two Point Analysis in F2: Fully Informative Markers (codominant)
Matlab program to calculate log likelihood test score (LOD)

    function LOD=calcLOD_F2(r,n22,n21,n20,n12,n11,n10,n02,n01,n00)
    % The multinomial coefficient is the same under H1 and H0 and cancels
    %log likelihood under H1
    LOD=(n22+n00)*log10((1-r)^2/4)...
        +(n20+n02)*log10(r^2/4)...
        +(n21+n12+n10+n01)*log10(r*(1-r)/2)...
        +n11*log10((1-r)^2/2+r^2/2);
    %log likelihood under H0
    r=0.5;
    LOD0=(n22+n00)*log10((1-r)^2/4)...
        +(n20+n02)*log10(r^2/4)...
        +(n21+n12+n10+n01)*log10(r*(1-r)/2)...
        +n11*log10((1-r)^2/2+r^2/2);
    LOD=LOD-LOD0;
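The same LOD score can be written compactly by storing the nine counts and the nine cell probabilities as 3 × 3 matrices. The script below is a sketch under stated assumptions: the counts are hypothetical (expected counts for r = 0.2 and n = 400, not the data behind the slide's value of 6.08), and `cellProb` and `logL` are illustrative names.

```matlab
% Hypothetical counts, rows AA/Aa/aa and columns BB/Bb/bb.
counts = [64 32 4; 32 136 32; 4 32 64];

% Cell probabilities of the codominant F2 table as a function of r.
cellProb = @(r) [(1-r)^2/4,  r*(1-r)/2,         r^2/4; ...
                 r*(1-r)/2,  ((1-r)^2+r^2)/2,   r*(1-r)/2; ...
                 r^2/4,      r*(1-r)/2,         (1-r)^2/4];

% log10 likelihood without the multinomial coefficient, which cancels in the ratio.
logL = @(r) sum(sum(counts .* log10(cellProb(r))));

rHat    = 0.2;                        % estimate, e.g. from the EM iteration
lod     = logL(rHat) - logL(0.5);     % LOD = log10(L1/L0)
lodNull = logL(0.5)  - logL(0.5);     % LOD is zero when the estimate is 0.5
probSum = sum(sum(cellProb(0.3)));    % cell probabilities should sum to 1
fprintf('LOD = %.2f\n', lod);
```

Keeping the probabilities in one function means the H1 and H0 terms cannot drift apart, which is a risk when the expression is typed twice.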
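Two Point Analysis in F2: Partial Informative Markers (codominant × dominant)

| | | BB | Bb | bb |
|---|---|---|---|---|
| AA | Obs. | $n_{22}$ | $n_{21}$ | $n_{20}$ |
| | Freq. | $\frac14(1-r)^2$ | $\frac12 r(1-r)$ | $\frac14 r^2$ |
| | Recom. | 0 | 1 | 2 |
| Aa | Obs. | $n_{12}$ | $n_{11}$ | $n_{10}$ |
| | Freq. | $\frac12 r(1-r)$ | $\frac12(1-r)^2+\frac12 r^2$ | $\frac12 r(1-r)$ |
| | Recom. | 1 | $2r^2/[(1-r)^2+r^2]$ | 1 |
| aa | Obs. | $n_{02}$ | $n_{01}$ | $n_{00}$ |
| | Freq. | $\frac14 r^2$ | $\frac12 r(1-r)$ | $\frac14(1-r)^2$ |
| | Recom. | 2 | 1 | 0 |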
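Two Point Analysis in F2: Partial Informative Markers (codominant × dominant)

| | | B_ | bb |
|---|---|---|---|
| AA | Obs. | $n_{2\_} = n_{22}+n_{21}$ | $n_{20}$ |
| | Freq. | $\frac14(1-r)^2+\frac12 r(1-r)$ | $\frac14 r^2$ |
| | Recom. | $c_1 = \dfrac{\frac12 r(1-r)}{\frac14(1-r)^2+\frac12 r(1-r)}$ | 2 |
| Aa | Obs. | $n_{1\_} = n_{12}+n_{11}$ | $n_{10}$ |
| | Freq. | $\frac12 r(1-r)+\frac12(1-r)^2+\frac12 r^2$ | $\frac12 r(1-r)$ |
| | Recom. | $c_2 = \dfrac{\frac12 r(1-r)+r^2}{\frac12 r(1-r)+\frac12(1-r)^2+\frac12 r^2}$ | 1 |
| aa | Obs. | $n_{0\_} = n_{02}+n_{01}$ | $n_{00}$ |
| | Freq. | $\frac14 r^2+\frac12 r(1-r)$ | $\frac14(1-r)^2$ |
| | Recom. | $c_3 = \dfrac{2\cdot\frac14 r^2+\frac12 r(1-r)}{\frac14 r^2+\frac12 r(1-r)}$ | 0 |

Estimate of r: each count is weighted by its expected number of recombinant gametes, so the bb column contributes $2n_{20} + 1\cdot n_{10} + 0\cdot n_{00}$:
$$r = \frac{c_1 n_{2\_} + c_2 n_{1\_} + c_3 n_{0\_} + 2 n_{20} + n_{10}}{2n}$$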
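Two Point Analysis in F2: Partial Informative Markers (codominant × dominant)

E-step:
$$c_1 = \frac{\frac12 r(1-r)}{\frac14(1-r)^2+\frac12 r(1-r)},\qquad c_2 = \frac{\frac12 r(1-r)+r^2}{\frac12 r(1-r)+\frac12(1-r)^2+\frac12 r^2},\qquad c_3 = \frac{2\cdot\frac14 r^2+\frac12 r(1-r)}{\frac14 r^2+\frac12 r(1-r)}$$

M-step (the Aa bb class $n_{10}$ carries one recombinant gamete, the aa bb class $n_{00}$ carries none):
$$r = \frac{c_1 n_{2\_} + c_2 n_{1\_} + c_3 n_{0\_} + 2 n_{20} + n_{10}}{2n}$$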
Two Point Analysis in F2: Partial Informative Markers (codominant × dominant)
[Worksheet figure: 3 × 2 table of counts (rows AA, Aa, aa; columns B_, bb) with total n; fields labeled Input, Result, and r0.]
Two Point Analysis in F2: Partial Informative Markers (codominant × dominant)
Matlab program to estimate the recombination fraction r

    function r=rEstF2CoXdomin(n2_,n1_,n0_,n20,n10,n00)
    % EM estimate of r when marker A is codominant and marker B is dominant
    n=n2_+n1_+n0_+n20+n10+n00;
    r=0.2;r1=-1;
    while(abs(r1-r)>1.e-8)
        r1=r;
        %E-step: expected recombinant gametes per individual in the B_ classes
        c1=(1/2*r*(1-r))/(1/4*(1-r)^2+1/2*r*(1-r));
        c2=(1/2*r*(1-r)+r^2)/(1/2*r*(1-r)+1/2*(1-r)^2+1/2*r^2);
        c3=(2*1/4*r^2+1/2*r*(1-r))/(1/4*r^2+1/2*r*(1-r));
        %M-step: bb classes carry 2 (n20), 1 (n10) and 0 (n00) recombinant gametes
        r=(c1*n2_+c2*n1_+c3*n0_+2*n20+n10)/(2*n);
    end
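A quick consistency check for the codominant × dominant update is to run it on counts equal to the expected frequencies for a known r. The script below is a sketch under that assumption: `rTrue`, `n = 400`, the starting value, and the iteration cap are illustrative, and the bb column is weighted 2, 1, 0 as in the recombinant counts of the table.

```matlab
% Hypothetical check data: expected counts for r = 0.2, n = 400.
rTrue = 0.2;  n = 400;
n2_ = n*((1-rTrue)^2/4 + rTrue*(1-rTrue)/2);                    % AA B_
n1_ = n*(rTrue*(1-rTrue)/2 + (1-rTrue)^2/2 + rTrue^2/2);        % Aa B_
n0_ = n*(rTrue^2/4 + rTrue*(1-rTrue)/2);                        % aa B_
n20 = n*rTrue^2/4;                                              % AA bb
n10 = n*rTrue*(1-rTrue)/2;                                      % Aa bb
n00 = n*(1-rTrue)^2/4;                                          % aa bb

r = 0.4;  rOld = -1;  iter = 0;
while abs(rOld - r) > 1e-10 && iter < 1000   % cap guards against non-convergence
    rOld = r;
    % E-step: expected recombinant gametes per individual in each B_ class
    c1 = (r*(1-r)/2)/((1-r)^2/4 + r*(1-r)/2);
    c2 = (r*(1-r)/2 + r^2)/(r*(1-r)/2 + (1-r)^2/2 + r^2/2);
    c3 = (r^2/2 + r*(1-r)/2)/(r^2/4 + r*(1-r)/2);
    % M-step: bb classes contribute 2, 1 and 0 recombinant gametes
    r = (c1*n2_ + c2*n1_ + c3*n0_ + 2*n20 + n10)/(2*n);
    iter = iter + 1;
end
fprintf('r = %.6f after %d iterations\n', r, iter);
```

With these counts the update returns r = 0.2 only when the Aa bb class is weighted by one recombinant gamete, which makes the script a useful regression test for the M-step.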
Two Point Analysis in F2: Partial Informative Markers (codominant × dominant)
Matlab program to calculate log likelihood test score (LOD)

    function LOD=calcLOD_F2CoXdomin(r,n2_,n1_,n0_,n20,n10,n00)
    %log likelihood under H1 (natural log; converted to base 10 at the end)
    LOD=log(1/4*(1-r)^2+1/2*r*(1-r))*n2_...
        +log(1/2*r*(1-r)+1/2*(1-r)^2+1/2*r^2)*n1_...
        +log(1/4*r^2+1/2*r*(1-r))*n0_...
        +log(r^2/4)*n20+log(r*(1-r)/2)*n10+log((1-r)^2/4)*n00;
    %log likelihood under H0
    r=0.5;
    LOD0=log(1/4*(1-r)^2+1/2*r*(1-r))*n2_...
        +log(1/2*r*(1-r)+1/2*(1-r)^2+1/2*r^2)*n1_...
        +log(1/4*r^2+1/2*r*(1-r))*n0_...
        +log(r^2/4)*n20+log(r*(1-r)/2)*n10+log((1-r)^2/4)*n00;
    LOD=LOD-LOD0;
    LOD=LOD/log(10);   % convert natural log to log10
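Two Point Analysis in F2: Partial Informative Markers (dominant)

| | | BB | Bb | bb |
|---|---|---|---|---|
| AA | Obs. | $n_{22}$ | $n_{21}$ | $n_{20}$ |
| | Freq. | $\frac14(1-r)^2$ | $\frac12 r(1-r)$ | $\frac14 r^2$ |
| | Recom. | 0 | 1 | 2 |
| Aa | Obs. | $n_{12}$ | $n_{11}$ | $n_{10}$ |
| | Freq. | $\frac12 r(1-r)$ | $\frac12(1-r)^2+\frac12 r^2$ | $\frac12 r(1-r)$ |
| | Recom. | 1 | $2r^2/[(1-r)^2+r^2]$ | 1 |
| aa | Obs. | $n_{02}$ | $n_{01}$ | $n_{00}$ |
| | Freq. | $\frac14 r^2$ | $\frac12 r(1-r)$ | $\frac14(1-r)^2$ |
| | Recom. | 2 | 1 | 0 |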
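Two Point Analysis in F2: Partial Informative Markers (dominant)

| | | B_ | bb |
|---|---|---|---|
| A_ | Obs. | $n_1 = n_{22}+n_{21}+n_{12}+n_{11}$ | $n_2 = n_{20}+n_{10}$ |
| | Freq. | $\frac14(1-r)^2 + r(1-r) + \frac12(1-r)^2 + \frac12 r^2$ | $\frac14 r^2 + \frac12 r(1-r)$ |
| | Recom. | $c_1$ | $c_2$ |
| aa | Obs. | $n_3 = n_{02}+n_{01}$ | $n_4 = n_{00}$ |
| | Freq. | $\frac14 r^2 + \frac12 r(1-r)$ | $\frac14(1-r)^2$ |
| | Recom. | $c_2$ | 0 |

where $c_1$ and $c_2$ are the expected numbers of recombinant gametes:
$$c_1 = \frac{r^2 + r(1-r)}{\frac14(1-r)^2 + r(1-r) + \frac12(1-r)^2 + \frac12 r^2},\qquad c_2 = \frac{2\left(\frac14 r^2\right) + \frac12 r(1-r)}{\frac14 r^2 + \frac12 r(1-r)}$$

Estimate of r:
$$r = \frac{c_1 n_1 + c_2 n_2 + c_2 n_3}{2n}$$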
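Two Point Analysis in F2: Partial Informative Markers (dominant)
[Worksheet figure: 2 × 2 table of counts (rows A_, aa; columns B_, bb) with total n; fields labeled Input, Result, and r0.]
$$c_1 = \frac{r^2 + r(1-r)}{\frac14(1-r)^2 + r(1-r) + \frac12(1-r)^2 + \frac12 r^2},\qquad c_2 = \frac{2\left(\frac14 r^2\right) + \frac12 r(1-r)}{\frac14 r^2 + \frac12 r(1-r)}$$
Estimate of r:
$$r = \frac{c_1 n_1 + c_2 n_2 + c_2 n_3}{2n}$$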
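Two Point Analysis in F2: Partial Informative Markers (dominant)
Matlab program to estimate the recombination fraction r

    function r=rEstF2Partial(n1,n2,n3,n4)
    % EM estimate of r when both markers are dominant
    % n1: A_B_, n2: A_bb, n3: aaB_, n4: aabb
    n=n1+n2+n3+n4;
    r=0.2;r1=-1;
    while (abs(r1-r)>1.e-8)
        r1=r;
        %E-step: expected recombinant gametes per individual
        c1=(r^2+r*(1-r))/((1-r)^2/4+r*(1-r)+(1-r)^2/2+r^2/2);
        c2=(r^2/2+r*(1-r)/2)/(r^2/4+r*(1-r)/2);
        %M-step: n4 (aabb) carries no recombinant gametes
        r=1/(2*n)*(c1*n1+c2*n2+c2*n3);
    end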
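With two dominant markers only four phenotype classes remain, so each EM step recovers less information and convergence is slower than in the codominant case. The script below is a sketch that illustrates this on hypothetical counts (expected frequencies for r = 0.2 and n = 400); `rTrue`, the starting value, and the iteration cap are assumptions made for the illustration.

```matlab
% Hypothetical check data: expected phenotype counts for r = 0.2, n = 400.
rTrue = 0.2;  n = 400;
n1 = n*((1-rTrue)^2*3/4 + rTrue*(1-rTrue) + rTrue^2/2);   % A_B_
n2 = n*(rTrue^2/4 + rTrue*(1-rTrue)/2);                   % A_bb
n3 = n2;                                                  % aaB_
n4 = n*(1-rTrue)^2/4;                                     % aabb

r = 0.4;  rOld = -1;  iter = 0;
while abs(rOld - r) > 1e-10 && iter < 1000   % cap guards against non-convergence
    rOld = r;
    % E-step: expected recombinant gametes per individual
    c1 = (r^2 + r*(1-r))/((1-r)^2/4 + r*(1-r) + (1-r)^2/2 + r^2/2);
    c2 = (r^2/2 + r*(1-r)/2)/(r^2/4 + r*(1-r)/2);
    % M-step: the aabb class (n4) contributes no recombinant gametes
    r = (c1*n1 + c2*n2 + c2*n3)/(2*n);
    iter = iter + 1;
end
fprintf('r = %.6f after %d iterations\n', r, iter);
```

The iteration count printed here can be compared with the codominant run on the same r to see the cost of the lost genotype information.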
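Log-likelihood ratio test statistic: Partial Informative Markers (dominant)
Two alternative hypotheses: $H_0: r = 0.5$ vs. $H_1: r \ne 0.5$

Likelihood value under $H_1$:
$$L_1(r \mid n_{ij}) = \frac{n!}{n_1!\cdots n_4!}\left[\tfrac34(1-r)^2 + r(1-r) + \tfrac12 r^2\right]^{n_1}\left[\tfrac14 r^2 + \tfrac12 r(1-r)\right]^{n_2+n_3}\left[\tfrac14(1-r)^2\right]^{n_4}$$

Likelihood value under $H_0$:
$$L_0(r=0.5 \mid n_{ij}) = \frac{n!}{n_1!\cdots n_4!}\left[\tfrac34(1-0.5)^2 + 0.5(1-0.5) + \tfrac12(0.5)^2\right]^{n_1}\left[\tfrac14(0.5)^2 + \tfrac12(0.5)(1-0.5)\right]^{n_2+n_3}\left[\tfrac14(1-0.5)^2\right]^{n_4}$$

$$\mathrm{LOD} = \log_{10}\frac{L_1(r \mid n_{ij})}{L_0(r=0.5 \mid n_{ij})} = 3.17 > \text{critical LOD} = 3$$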
Two Point Analysis in F2: Partial Informative Markers (dominant)
Matlab program to calculate log likelihood test score (LOD)

    function LOD=calcLOD_F2Partial(r,n1,n2,n3,n4)
    %log likelihood under H1
    LOD=(n1)*log10((1-r)^2*3/4+r^2/2+r*(1-r))...
        +(n2+n3)*log10(r^2/4+r*(1-r)/2)...
        +(n4)*log10((1-r)^2/4);
    %log likelihood under H0
    r=0.5;
    LOD0=(n1)*log10((1-r)^2*3/4+r^2/2+r*(1-r))...
        +(n2+n3)*log10(r^2/4+r*(1-r)/2)...
        +(n4)*log10((1-r)^2/4);
    LOD=LOD-LOD0;
Three Point Analysis in Backcross: a rice data set
[Figure: rice linkage map with marker positions on four chromosomes (chrom1, chrom2, chrom3, chrom4); marker labels include RG472, K5, U10, RG532, W1, RG173, RZ276, and Amy1B.]
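Three Point Analysis in Backcross
Summarize the data as:

| A, B, C | Obs. | A & B | B & C |
|---|---|---|---|
| abc | $n_{abc}$ | | |
| abC | $n_{abC}$ | | |
| aBc | $n_{aBc}$ | | |
| aBC | $n_{aBC}$ | | |
| Abc | $n_{Abc}$ | | |
| AbC | $n_{AbC}$ | | |
| ABc | $n_{ABc}$ | | |
| ABC | $n_{ABC}$ | 0 | 0 |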
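Rice Data

| A, B, C | Obs. | A & B | B & C |
|---|---|---|---|
| abc | $n_{abc}$ = | | |
| abC | $n_{abC}$ = | | |
| aBc | $n_{aBc}$ = | | |
| aBC | $n_{aBC}$ = | | |
| Abc | $n_{Abc}$ = | | |
| AbC | $n_{AbC}$ = | | |
| ABc | $n_{ABc}$ = | | |
| ABC | $n_{ABC}$ = 38 | 0 | 0 |

Marker RG472 is denoted by A, RG246 by B, and K5 by C.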
Multilocus likelihood: determination of a most likely gene order
Consider three markers A, B, C, with no particular order assumed. A triply heterozygous F1 ABC/abc is backcrossed to a pure parent abc/abc.

| Genotype | ABC or abc | ABc or abC | Abc or aBC | AbC or aBc |
|---|---|---|---|---|
| Obs. | $n_{00} = 69$ | $n_{01} = 12$ | $n_{10} = 16$ | $n_{11} = 3$ |
| Frequency under order A-B-C | $(1-r_{AB})(1-r_{BC})$ | $(1-r_{AB})\,r_{BC}$ | $r_{AB}(1-r_{BC})$ | $r_{AB}\,r_{BC}$ |
| Frequency under order A-C-B | $(1-r_{AC})(1-r_{BC})$ | $r_{AC}\,r_{BC}$ | $r_{AC}(1-r_{BC})$ | $(1-r_{AC})\,r_{BC}$ |
| Frequency under order B-A-C | $(1-r_{AB})(1-r_{AC})$ | $(1-r_{AB})\,r_{AC}$ | $r_{AB}\,r_{AC}$ | $r_{AB}(1-r_{AC})$ |

$r_{AB}$ = the recombination fraction between A and B = $(n_{10}+n_{11})/n = 0.19$
$r_{BC}$ = the recombination fraction between B and C = $(n_{01}+n_{11})/n = 0.15$
$r_{AC}$ = the recombination fraction between A and C = $(n_{01}+n_{10})/n = 0.28$
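The three pairwise recombination fractions can be reproduced directly from the four class counts on the slide. The script below is a small sketch; the variable `expectedDouble` (expected double recombinants under order A-B-C with independent intervals) is an added illustration and not part of the slide.

```matlab
% Backcross class counts from the slide.
n00 = 69;   % ABC or abc
n01 = 12;   % ABc or abC
n10 = 16;   % Abc or aBC
n11 = 3;    % AbC or aBc
n = n00 + n01 + n10 + n11;

% Pairwise recombination fractions: fraction of offspring in which
% the two markers of the pair were separated.
rAB = (n10 + n11)/n;
rBC = (n01 + n11)/n;
rAC = (n01 + n10)/n;

% Illustration only: expected number of double recombinants under order A-B-C
% if recombination in the two intervals is independent.
expectedDouble = n*rAB*rBC;
fprintf('rAB = %.2f, rBC = %.2f, rAC = %.2f\n', rAB, rBC, rAC);
```

The class AbC or aBc is the rarest, which is what a double-recombinant class looks like when B lies between A and C.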
Which order is the most likely?

$$L_{ABC} \propto (1-r_{AB})^{n_{00}+n_{01}}\,(1-r_{BC})^{n_{00}+n_{10}}\,(r_{AB})^{n_{10}+n_{11}}\,(r_{BC})^{n_{01}+n_{11}}$$
$$L_{ACB} \propto (1-r_{AC})^{n_{00}+n_{11}}\,(1-r_{BC})^{n_{00}+n_{10}}\,(r_{AC})^{n_{01}+n_{10}}\,(r_{BC})^{n_{01}+n_{11}}$$
$$L_{BAC} \propto (1-r_{AB})^{n_{00}+n_{01}}\,(1-r_{AC})^{n_{00}+n_{11}}\,(r_{AB})^{n_{10}+n_{11}}\,(r_{AC})^{n_{01}+n_{10}}$$

Log(LABC) =
Log(LACB) =
Log(LBAC) =

According to the maximum likelihood principle, the linkage order that gives the maximum likelihood for a data set is the order best supported by the data.
The best linkage order is A - B - C, with map distances of 20 cM between A and B and 15 cM between B and C.
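The comparison of the three orders can be scripted by evaluating each log-likelihood at the pairwise estimates. The script below is a sketch that uses the slide's counts and formulas; base-10 logarithms and the names `logL` and `orders` are choices made for this illustration.

```matlab
% Backcross class counts and pairwise estimates from the slides.
n00 = 69;  n01 = 12;  n10 = 16;  n11 = 3;
n = n00 + n01 + n10 + n11;
rAB = (n10 + n11)/n;  rBC = (n01 + n11)/n;  rAC = (n01 + n10)/n;

% log10 likelihood of each candidate order (constant factors omitted).
logLABC = (n00+n01)*log10(1-rAB) + (n00+n10)*log10(1-rBC) ...
        + (n10+n11)*log10(rAB)   + (n01+n11)*log10(rBC);
logLACB = (n00+n11)*log10(1-rAC) + (n00+n10)*log10(1-rBC) ...
        + (n01+n10)*log10(rAC)   + (n01+n11)*log10(rBC);
logLBAC = (n00+n01)*log10(1-rAB) + (n00+n11)*log10(1-rAC) ...
        + (n10+n11)*log10(rAB)   + (n01+n10)*log10(rAC);

logL = [logLABC, logLACB, logLBAC];
orders = {'A-B-C', 'A-C-B', 'B-A-C'};
[bestLogL, best] = max(logL);      % index of the most likely order
fprintf('Most likely order: %s\n', orders{best});
```

The order with the largest log-likelihood is the one that uses the two smallest recombination fractions for its adjacent intervals, here A-B-C.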
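DATA

| Genotype | ABC or abc | ABc or abC | Abc or aBC | AbC or aBc |
|---|---|---|---|---|
| Obs. | $n_{00} = 69$ | $n_{01} = 12$ | $n_{10} = 16$ | $n_{11} = 3$ |

Result:
$r_{AB} = 0.19$, $r_{BC} = 0.15$, $r_{AC} = 0.28$
$d_{AB} = \frac14 \ln\left[(1+2r_{AB})/(1-2r_{AB})\right] = 0.20$ Morgan $= 20$ cM
$d_{BC} = \frac14 \ln\left[(1+2r_{BC})/(1-2r_{BC})\right] = 0.15$ Morgan $= 15$ cM
Log(LABC) =
Log(LACB) =
Log(LBAC) =
The best linkage order is A - B - C, with map distances of 20 cM between A and B and 15 cM between B and C.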
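The distance formula used here, d = ¼ ln[(1+2r)/(1-2r)], is the Kosambi mapping function, which returns distances in Morgans; multiplying by 100 gives centimorgans. The script below is a brief sketch of that conversion for the two adjacent intervals; the function handle name `kosambi` is an illustrative choice.

```matlab
% Kosambi mapping function: recombination fraction -> map distance in Morgans.
% Valid for 0 <= r < 0.5; the distance grows without bound as r approaches 0.5.
kosambi = @(r) 0.25*log((1 + 2*r)./(1 - 2*r));

rAB = 0.19;  rBC = 0.15;          % estimates from the slide
dAB = 100*kosambi(rAB);           % distance A-B in cM
dBC = 100*kosambi(rBC);           % distance B-C in cM
fprintf('dAB = %.1f cM, dBC = %.1f cM\n', dAB, dBC);
```

For small r the function is close to the identity, so the distances differ only slightly from 100·r; the correction grows as r approaches 0.5.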