Lecture 12: Linkage Analysis V. Date: 10/03/02. Least squares; an EM algorithm; simulated distribution; marker coverage and density.
True Multi-Locus Mapping True multi-locus mapping would use all the data to build an order and distance between loci. BUT... Large number of unknown parameters. For $l$ loci there are $2^{l-1}$ gamete types, and the sample size is usually not large enough to populate all of these types. Computationally intensive, as there are $l!/2$ possible orders.
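To give a feel for how quickly these two counts grow, here is a minimal sketch that evaluates $2^{l-1}$ and $l!/2$ for a few values of $l$. The function names are illustrative only and do not come from the lecture.

```python
from math import factorial

def gamete_types(l):
    """Number of gamete types for l loci: 2^(l-1)."""
    return 2 ** (l - 1)

def locus_orders(l):
    """Number of distinct locus orders for l loci: l!/2
    (an order and its reverse describe the same map)."""
    return factorial(l) // 2

for l in (3, 5, 10):
    print(l, gamete_types(l), locus_orders(l))
```

With only 10 loci there are already 512 gamete types and 1,814,400 candidate orders.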
Least Squares Method $r_{ij}$ is the recombination fraction between loci $i$ and $j$. $M_{ij}$ is the map distance between loci $i$ and $j$. $s_{r_{ij}}$ is the standard deviation of $r_{ij}$. $m_i$ is the map distance between loci $i$ and $i+1$.
[Figure: loci along a map with adjacent map distances $m_1, \dots, m_6$ and recombination fractions $r_{12}, r_{23}, r_{34}, r_{45}, r_{56}$; labels: recomb. fraction, map distance.]
Least Squares Method (cont)
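Least Squares: Haldane Map Function Recall the map function: $r = \frac{1}{2}\left(1 - e^{-2M}\right)$. Find the inverse map function: $M = F(r) = -\frac{1}{2}\ln(1 - 2r)$. Take the first derivative: $F'(r) = \frac{1}{1 - 2r}$. Plug the first derivative into the approximate formula for $s_M$: $s_M \approx F'(r)\, s_r = \frac{s_r}{1 - 2r}$.
Least Squares: Kosambi Map Function Recall the inverse map function: $M = F(r) = \frac{1}{4}\ln\frac{1 + 2r}{1 - 2r}$. Take the first derivative: $F'(r) = \frac{1}{1 - 4r^2}$. Plug the first derivative into the approximate formula for $s_M$: $s_M \approx F'(r)\, s_r = \frac{s_r}{1 - 4r^2}$.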
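The conversion from a recombination fraction and its standard deviation to a map distance and its approximate standard deviation can be sketched as follows. This is a minimal illustration using the standard Haldane and Kosambi inverse map functions and the first-order approximation $s_M \approx F'(r)\, s_r$; the function names and the input values $r = 0.10$, $s_r = 0.03$ are assumptions chosen for illustration.

```python
from math import log

def haldane_distance(r, s_r):
    """Map distance (Morgans) and approximate s.d. under Haldane.
    M = -0.5 * ln(1 - 2r), dM/dr = 1 / (1 - 2r)."""
    m = -0.5 * log(1.0 - 2.0 * r)
    s_m = s_r / (1.0 - 2.0 * r)
    return m, s_m

def kosambi_distance(r, s_r):
    """Map distance (Morgans) and approximate s.d. under Kosambi.
    M = 0.25 * ln((1 + 2r) / (1 - 2r)), dM/dr = 1 / (1 - 4r^2)."""
    m = 0.25 * log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))
    s_m = s_r / (1.0 - 4.0 * r * r)
    return m, s_m

# Illustrative input (assumed values): r = 0.10 with s.d. 0.03
print(haldane_distance(0.10, 0.03))
print(kosambi_distance(0.10, 0.03))
```

Both functions require $0 \le r < 0.5$; at $r = 0.5$ the map distance is unbounded.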
Least Squares Method (cont)
Least Squares: Data
Markers   r (s_r)    M Haldane (s_M)   Expected
          (0.03)     0.11 (0.038)      m_1
          (0.04)     0.18 (0.057)      m_2
          (0.13)     0.46 (0.325)      m_1 + m_2
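A minimal sketch of how the adjacent distances $m_1$ and $m_2$ could be estimated from the three Haldane distances in this data slide. It assumes the criterion is weighted least squares with weights $1/s_M^2$, that is, minimizing $\sum (M_{ij} - \text{expected}_{ij})^2 / s_{M_{ij}}^2$; the exact criterion on the slides is not reproduced in the text, so treat the weighting as an assumption. The function name and the small linear solver are illustrative.

```python
def weighted_least_squares(X, y, s):
    """Minimize sum_k ((y_k - X_k . m) / s_k)^2 via the normal equations.
    X: design rows (1 if the interval is part of the distance, else 0),
    y: observed map distances, s: their standard deviations."""
    n, p = len(y), len(X[0])
    w = [1.0 / (sk * sk) for sk in s]  # assumed inverse-variance weights
    A = [[sum(w[k] * X[k][i] * X[k][j] for k in range(n)) for j in range(p)]
         for i in range(p)]
    b = [sum(w[k] * X[k][i] * y[k] for k in range(n)) for i in range(p)]
    # Gaussian elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda i: abs(A[i][c]))
        A[c], A[piv] = A[piv], A[c]
        b[c], b[piv] = b[piv], b[c]
        for i in range(c + 1, p):
            f = A[i][c] / A[c][c]
            for j in range(c, p):
                A[i][j] -= f * A[c][j]
            b[i] -= f * b[c]
    # Back substitution
    m = [0.0] * p
    for i in reversed(range(p)):
        m[i] = (b[i] - sum(A[i][j] * m[j] for j in range(i + 1, p))) / A[i][i]
    return m

# Rows correspond to expected distances m1, m2, and m1 + m2
X = [[1, 0], [0, 1], [1, 1]]
M = [0.11, 0.18, 0.46]       # Haldane distances from the data slide
s_M = [0.038, 0.057, 0.325]  # their standard deviations
m1, m2 = weighted_least_squares(X, M, s_M)
print(round(m1, 3), round(m2, 3))
```

Because the third distance has a much larger standard deviation, it receives little weight and the estimates stay close to the two directly observed adjacent distances.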
Least Squares: Calculation
Least Squares: Variance Estimation
Least Squares: Variance Calculation
Why is this Least Squares?
Alternative Weighting Use LOD score for linkage as weight. Then the equation becomes:
EM Algorithm (Lander-Green) Make an initial guess for $\theta^{0} = (\theta_1, \theta_2, \dots, \theta_{l-1})$. E Step: Compute the expected number of recombinants for each interval assuming the current $\theta^{old}$. M Step: Treating the expected values as true, compute the maximum likelihood estimate $\theta^{new}$. Iterate EM until the likelihood converges.
EM Algorithm
                                AB            BC            AC
True recombination fraction     $\theta_1$    $\theta_2$
True number of recombinants     $t_1$         $t_2$
Total observed gametes          $N_{12}$      $N_{23}$      $N_{13}$
Number observed recombinants    $R_{12}$      $R_{23}$      $R_{13}$
EM Algorithm: E Step
$t_1 = R_{12} + P(\text{rec. in AB} \mid \text{rec. in AC})\, R_{13} + P(\text{rec. in AB} \mid \text{no rec. in AC})\, (N_{13} - R_{13})$
$t_2 = R_{23} + P(\text{rec. in BC} \mid \text{rec. in AC})\, R_{13} + P(\text{rec. in BC} \mid \text{no rec. in AC})\, (N_{13} - R_{13})$
EM Algorithm: E Step (cont)
EM Algorithm: M Step
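The E and M steps for the three-locus case (A, B, C) can be sketched as below. This is a hedged illustration, not the lecture's own code: it assumes no interference when computing the conditional probabilities in the E step, assumes the M step divides the expected recombinant counts by the number of gametes informative for each interval ($N_{12} + N_{13}$ and $N_{23} + N_{13}$), and stops when the parameter estimates stop changing rather than monitoring the likelihood. The counts in the example are made up.

```python
def em_three_locus(N12, R12, N23, R23, N13, R13,
                   theta=(0.1, 0.1), tol=1e-10, max_iter=1000):
    """EM estimates of (theta1, theta2) for intervals AB and BC."""
    th1, th2 = theta
    for _ in range(max_iter):
        # E step: split the AC gametes between the two intervals.
        p_rec_ac = th1 * (1 - th2) + (1 - th1) * th2   # odd number of crossovers
        p_ab_rec = th1 * (1 - th2) / p_rec_ac          # P(rec AB | rec AC)
        p_bc_rec = (1 - th1) * th2 / p_rec_ac          # P(rec BC | rec AC)
        p_both = th1 * th2 / (1 - p_rec_ac)            # P(rec AB and BC | no rec AC)
        t1 = R12 + p_ab_rec * R13 + p_both * (N13 - R13)
        t2 = R23 + p_bc_rec * R13 + p_both * (N13 - R13)
        # M step: treat expected counts as observed counts.
        new1 = t1 / (N12 + N13)
        new2 = t2 / (N23 + N13)
        converged = abs(new1 - th1) + abs(new2 - th2) < tol
        th1, th2 = new1, new2
        if converged:
            break
    return th1, th2

# Made-up counts for illustration only
th1, th2 = em_three_locus(N12=100, R12=10, N23=100, R23=15, N13=100, R13=22)
print(round(th1, 4), round(th2, 4))
```

The starting values must lie strictly between 0 and 1 so that the conditional probabilities are defined.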
Simulation Find map function which fits the data well by comparing the likelihoods of the data. Distribution of likelihood difference is unknown, so simulation is needed to obtain it empirically.
Simulation: Evidence for Interference Recall that if you are given pairwise recombination fractions $\theta_{ij}$ and a map function, you know how to find the gametic frequencies. Then the log likelihood is given by ($m = 2^{l-1}$)
Simulation: Implementation To simulate under the null hypothesis of no interference, we take the recombination fractions between neighboring loci as given and simulate gametes in which recombination events in different intervals occur independently.
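A minimal sketch of the simulation step, assuming that "no interference" means recombination events in adjacent intervals are independent. Each gamete is represented by its recombination pattern across the $l-1$ intervals, which gives the $2^{l-1}$ gamete types. The multinomial log likelihood shown here is the standard form and is an assumption about what the slide's formula contained; all names and the example recombination fractions are illustrative.

```python
import random
from collections import Counter
from math import log

def simulate_gametes(thetas, n_gametes, seed=None):
    """Simulate recombination patterns with independent intervals.
    thetas: recombination fractions between neighboring loci."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_gametes):
        pattern = tuple(int(rng.random() < th) for th in thetas)
        counts[pattern] += 1
    return counts

def no_interference_freq(pattern, thetas):
    """Expected frequency of a recombination pattern under no interference."""
    p = 1.0
    for rec, th in zip(pattern, thetas):
        p *= th if rec else (1.0 - th)
    return p

def log_likelihood(counts, thetas):
    """Multinomial log likelihood (up to a constant) of the observed counts."""
    return sum(n * log(no_interference_freq(pat, thetas))
               for pat, n in counts.items())

thetas = [0.10, 0.15, 0.20]   # assumed values for four loci
counts = simulate_gametes(thetas, n_gametes=200, seed=1)
print(sum(counts.values()), len(counts), log_likelihood(counts, thetas))
```

Repeating the simulation many times and recording the likelihood difference between competing map functions on each replicate yields the empirical distribution the slides call for.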
Marker Coverage and Map Density The proportion of the genome covered by markers is the marker coverage. It is simply the genomic map length divided by the total genome length. The size of the genome segment between two adjacent markers is an indicator of map density. It is measured as the average or maximum map distance between two adjacent markers.
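As a small worked example of these definitions, the sketch below computes coverage and both density measures from marker positions. It assumes the map length is the sum of the spans of the linkage groups; the function name, marker positions, and genome length are hypothetical.

```python
def coverage_and_density(linkage_groups, genome_length):
    """linkage_groups: list of lists of marker positions (same units as genome_length).
    Returns (coverage, average adjacent distance, maximum adjacent distance)."""
    map_length = sum(max(g) - min(g) for g in linkage_groups)
    gaps = []
    for g in linkage_groups:
        ordered = sorted(g)
        gaps.extend(b - a for a, b in zip(ordered, ordered[1:]))
    coverage = map_length / genome_length
    return coverage, sum(gaps) / len(gaps), max(gaps)

# Hypothetical map: two linkage groups on a 100 cM genome
groups = [[0, 10, 25, 40], [0, 15, 20]]
coverage, avg_gap, max_gap = coverage_and_density(groups, genome_length=100)
print(coverage, avg_gap, max_gap)
```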
Random Distribution of Markers Markers are generally assumed to be distributed randomly throughout the genome. Nonrandom distribution will generally decrease coverage and lower density. Unfortunately, markers may be non-randomly distributed. Name some reasons.
Mapping Population Even if you have many markers, if your sample is small you may have insufficient information to achieve high coverage and density. Unattached genome segments are the most common coverage problem. Solutions: increase the sample size or use a mapping population with more information (greater polymorphism).
Data Analysis and Models A wrong gene order can inflate the map length, thus overestimating map coverage and underestimating density. The wrong mapping function may convert recombination fractions into the wrong map distances, causing over- or underestimation. Different grouping criteria can lead to different linkage groups. The more stringent the criterion, the more linkage groups, the lower the coverage, and the higher the density.
Prediction of Marker Coverage and Density A method for predicting marker coverage and density is based on the assumption of random distribution: the confidence probability $P$ is the probability that at least one marker is located in a genome segment of length $2d$ M.
Calculations Suppose the genome has total length $L$. P(a marker does not fall on the $2d$ segment) $= 1 - 2d/L$. P($n$ markers do not fall on the $2d$ segment) $= (1 - 2d/L)^n$.
Calculations P(at least one marker on the $2d$ segment) $= 1 - (1 - 2d/L)^n$
Calculations When $2d/L < 0.1$, then $P \approx 1 - e^{-2dn/L}$.
Predicted Number of Markers Needed
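Solving $P = 1 - (1 - 2d/L)^n$ for $n$ gives $n = \ln(1 - P) / \ln(1 - 2d/L)$. The sketch below evaluates this under the random-distribution assumption and rounds up to a whole marker. The function names and the example values ($P = 0.95$, $d = 10$ cM, $L = 1000$ cM) are illustrative assumptions.

```python
from math import ceil, log

def coverage_probability(n, d, L):
    """P(at least one of n random markers falls in a segment of length 2d)."""
    return 1.0 - (1.0 - 2.0 * d / L) ** n

def markers_needed(P, d, L):
    """Smallest n with coverage_probability(n, d, L) >= P."""
    return ceil(log(1.0 - P) / log(1.0 - 2.0 * d / L))

# Assumed example: 95% confidence of a marker within d = 10 cM on a 1000 cM genome
n = markers_needed(0.95, d=10, L=1000)
print(n, coverage_probability(n, 10, 1000))
```

`d` and `L` must be in the same units, and `P` must be strictly less than 1.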
Prediction when Genome Length Unknown Use all (500) markers to estimate a genetic map and assume the genome length is the length of this map, say $L_{500}$. Randomly draw 100 markers from the dataset with replacement. Estimate the genome length for the 100 markers only, say $L_{100}$.
Advantages of the Simulation Approach No assumptions on marker distribution needed. No prior information about actual genome length is needed. Approach can be used to test other factors that might affect marker coverage, as long as those factors can be resampled.
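The resampling idea can be sketched as follows. This is a simplified illustration and not the lecture's procedure in full: instead of re-estimating a genetic map for each draw, it assumes marker positions are already known and takes the map length to be the sum of per-chromosome spans. The ratio $L_{100} / L_{500}$ is used here as the coverage estimate, which is an assumption about how the two lengths are compared. The dataset and all names are hypothetical.

```python
import random

def map_length(markers):
    """markers: list of (chromosome, position). Sum of per-chromosome spans."""
    spans = {}
    for chrom, pos in markers:
        lo, hi = spans.get(chrom, (pos, pos))
        spans[chrom] = (min(lo, pos), max(hi, pos))
    return sum(hi - lo for lo, hi in spans.values())

def resampled_coverage(markers, n_draw, n_reps, seed=None):
    """Draw n_draw markers with replacement, n_reps times;
    return map length of each draw relative to the full map."""
    rng = random.Random(seed)
    full = map_length(markers)
    return [map_length(rng.choices(markers, k=n_draw)) / full
            for _ in range(n_reps)]

# Hypothetical dataset: 500 markers on 5 chromosomes of 100 cM each
rng = random.Random(0)
markers = [(rng.randrange(5), rng.uniform(0, 100)) for _ in range(500)]
ratios = resampled_coverage(markers, n_draw=100, n_reps=200, seed=1)
print(min(ratios), sum(ratios) / len(ratios), max(ratios))
```

The spread of the ratios across replicates indicates how much coverage is expected to vary when only 100 markers are available.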
Summary Least squares method for building genetic maps. EM algorithm method for building genetic maps. Simulated likelihood ratio statistic distribution for hypothesis tests. Predicting marker coverage and density.