Bioinformatic tools for Genome Mapping Avraham Korol 8240-449 (2449), room 217 in multipurpose building Course Assistant: Irit.

Bioinformatic tools for Genome Mapping Avraham Korol 8240-449 (2449), room 217 in multipurpose building korol@research.haifa.ac.il Course Assistant: Irit Cohen irit.cs.haifa@gmail.com

Methods of Genome Analysis (linkage maps, physical maps, QTL analysis). The focus of the course is on analytical (bioinformatic) tools for genetic/genomic analysis, including some background from (a) statistics, (b) applied mathematics, (c) software.

A few elementary genetic and molecular-genetic notions (subjects) you are expected to know:
General Genetics: meiosis, syngamy, gamete, zygote, DNA, genome, nucleus, chromosome, centromere, bivalent, hybrid, homozygote, F1, F2, heterozygote, inbred, haploid, diploid, mutant, gene, allele, locus, phenotype, Mendelian segregation (single-, two-, multilocus), dominant, co-dominant, recessive, additive, linkage, recombination, epistasis, quantitative variation, heritability, test-cross, backcross, intercross, linkage phase (coupling, repulsion), multiple crossovers, interference, polymorphism, linkage disequilibrium, haplotype.
Molecular Genetics: PCR, tandem repeats, microsatellite, SNP, DNA cloning, BAC-clone, genomic library, DNA fingerprinting, overlapping clones, contig, radiation hybrid, candidate gene, microarray.

Structural genomics includes genetic mapping, physical mapping, and sequencing of entire genomes.
[Figure: results of consecutive steps of structural genomics]

What is genome mapping?
a. Positioning of DNA markers → genetic maps
b. Positioning of DNA pieces → physical maps
c. Locating Mendelian genes relative to markers
d. Mapping quantitative trait loci → QTL maps
[Figure: steps a, b, c, d leading to gene cloning and sequencing]

Building a contig map: an ordered set of clones

RECOMBINATION ANALYSIS - THE BASIS OF GENETIC MAPPING
Mapping is the relative positioning of entities in some space; maps can be derived after defining a "distance" in that space. Genetics uses recombination rates to build genetic maps.
"Two-point back-cross": two homozygotes, AB/AB × ab/ab, give the heterozygote AB/ab, which is then crossed to the homozygote ab/ab.
Gametes produced in meiosis by the heterozygote AB/ab:
sperm: parental types $(1-r_m)\{AB + ab\}$, recombinants $r_m\{Ab + aB\}$
eggs: parental types $(1-r_f)\{AB + ab\}$, recombinants $r_f\{Ab + aB\}$

The simplest statistical analysis: how can we test whether two loci are linked?
The major method of genetic mapping is to analyze the proportions of parental and recombinant associations of alleles in the progeny: AB/ab × ab/ab → {AB/ab, Ab/ab, aB/ab, ab/ab}. If r is the (unknown) rate of recombination and N is the sample size, the progeny structure (sizes of phenotypic classes) will be

               AB/ab       Ab/ab    aB/ab    ab/ab
Expected (e)   ½N(1-r)     ½Nr      ½Nr      ½N(1-r)
Observed (o)   n_1         n_2      n_3      n_4

$\chi^2$ statistic: $\chi^2_{total} = \sum_i (n_{io} - n_{ie})^2 / n_{ie}$ (d.f. = 3), where the expected counts are taken under the null hypothesis of independent segregation, r = ½, so that each $n_{ie} = N/4$.

Example: test-cross AB/ab × ab/ab → {AB/ab, Ab/ab, aB/ab, ab/ab}

               AB/ab       Ab/ab    aB/ab    ab/ab
Expected (e)   ½N(1-r)     ½Nr      ½Nr      ½N(1-r)
Observed (o)   65          12       8        79        $\sum_i n_{io} = N = 164$

Components of $\chi^2$: $\chi^2_{total} = \chi^2_{A:a} + \chi^2_{B:b} + \chi^2_{linkage}$. Degrees of freedom: 3 = 1 + 1 + 1.
1. Testing for normal (Mendelian) monogenic segregation ratios:
A : a = 77 : 87; $\chi^2_{A:a} = (77-82)^2/82 + (87-82)^2/82 = 0.61 < 3.84$
B : b = 73 : 91; $\chi^2_{B:b} = (73-82)^2/82 + (91-82)^2/82 = 1.98 < 3.84$
2. Testing for 'linkage' vs. 'independent segregation':
If the genes are unlinked and the monogenic ratios are normal, we expect (A:a)(B:b) = 1:1:1:1, i.e., n_parental = n_recombinant = ½N, or $(n_{AB}+n_{ab})_{expected} = ½N$ and $(n_{Ab}+n_{aB})_{expected} = ½N$. Thus $\chi^2_{linkage} = (144-82)^2/82 + (20-82)^2/82 = 93.76 \gg 3.84$, and we should reject the hypothesis of independent segregation.

Testing for 'linkage' (continued)
How to conduct the linkage test if the monogenic ratios are not normal? The expectations are now (A : a)(B : b) ≠ 1:1:1:1. Still, upon independence of A:a and B:b, one expects the following to hold in the 2 × 2 table:

        A         a
B       n_AB      n_aB
b       n_Ab      n_ab

$n_{AB} : n_{aB} = n_{Ab} : n_{ab}$, or $n_{AB} : n_{Ab} = n_{aB} : n_{ab}$, or $n_{AB} n_{ab} = n_{Ab} n_{aB}$, or $D = n_{AB} n_{ab} - n_{Ab} n_{aB} = 0$.
To test this, we employ the 2 × 2 analysis: $\chi^2_{linkage} = D^2 N / (n_A n_a n_B n_b)$ with d.f. = 1.
We need to remember the basic ideas of the statistical testing paradigm:
Significance: the probability of type I error $\alpha$ (false positive: declaring linkage when it does not exist), e.g., $\alpha$ = 0.05, 0.01, 0.001.
Probability of type II error $\beta$ (false negative: declaring "no linkage" when it exists).
Power $(1-\beta)$: the probability of detecting linkage when it exists (e.g., $1-\beta$ = 0.8, 0.9).
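The 2 × 2 statistic uses only the observed marginal totals, so it does not require Mendelian 1:1 ratios. A minimal sketch follows (the function name `chi2_2x2` is an illustrative assumption; all four marginal totals are assumed to be non-zero):

```python
def chi2_2x2(n_AB, n_Ab, n_aB, n_ab):
    """1 d.f. chi-square for independence of A:a and B:b in a 2 x 2 table."""
    N = n_AB + n_Ab + n_aB + n_ab
    n_A, n_a = n_AB + n_Ab, n_aB + n_ab    # marginal totals for locus A
    n_B, n_b = n_AB + n_aB, n_Ab + n_ab    # marginal totals for locus B
    D = n_AB * n_ab - n_Ab * n_aB          # zero under independence
    return D * D * N / (n_A * n_a * n_B * n_b)


if __name__ == "__main__":
    # Test-cross counts used earlier in the lecture: 65, 12, 8, 79
    print(chi2_2x2(65, 12, 8, 79))
```

The result is compared with the 1 d.f. critical value (3.84 at the 0.05 level).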

Testing for 'linkage' (continued): the information test
The same null hypothesis, independence of A:a and B:b (so that $D = n_{AB} n_{ab} - n_{Ab} n_{aB} = 0$ in the 2 × 2 table even when the monogenic ratios deviate from 1:1), can be tested with the information statistic, which applies to k × m tables in general:
$\chi^2 = 2\{\sum_{ij} n_{ij} \ln n_{ij} - \sum_i n_{i.} \ln n_{i.} - \sum_j n_{.j} \ln n_{.j} + N \ln N\}$, d.f. = (k-1)(m-1)
Kullback S. 1959. Information Theory and Statistics. Wiley & Sons, NY.
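As an illustration, the information statistic for a k × m table can be sketched as below. The function name `information_statistic` is an assumption made for this example, and empty cells are handled with the usual convention 0·ln 0 = 0.

```python
import math


def information_statistic(table):
    """Return (statistic, d.f.) of the information test for a k x m table.

    `table` is a list of rows of non-negative counts.
    """
    def xlogx(n):
        # convention: 0 * ln(0) = 0
        return n * math.log(n) if n > 0 else 0.0

    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    N = sum(row_sums)
    stat = 2.0 * (sum(xlogx(n) for row in table for n in row)
                  - sum(xlogx(n) for n in row_sums)
                  - sum(xlogx(n) for n in col_sums)
                  + xlogx(N))
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


if __name__ == "__main__":
    # Rows: B, b; columns: A, a (test-cross counts 65, 12, 8, 79)
    print(information_statistic([[65, 8], [12, 79]]))
```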

Another example: inter-cross, or F2 = F1 × F1 (AB/ab × AB/ab)
Female gametes: ½(1-r_f) AB, ½r_f Ab, ½r_f aB, ½(1-r_f) ab
Male gametes: ½(1-r_m) AB, ½r_m Ab, ½r_m aB, ½(1-r_m) ab
Combining the four female with the four male gamete types gives 16 combinations → 10 genotypic classes, with frequencies $z_{ij} = g_i g_j = f(r_m, r_f)$. Assuming dominance → 4 phenotypic classes:

F2 phenotypic class    AB           Ab           aB           ab
Expected frequency     ¼(2+θ)N      ¼(1-θ)N      ¼(1-θ)N      ¼θN

where $\theta = (1-r_f)(1-r_m) \approx (1-r)^2$ [or, for an F1 in repulsion phase, $\theta = r_f r_m \approx r^2$].
"Linked" or "independent"? Use the $\chi^2_{linkage}$ test as in the test-cross, i.e., the 2 × 2 analysis (or 2 × 3, or 3 × 2, or ?).

How to estimate the rate of recombination from data? Example: test-cross

               AB/ab       Ab/ab    aB/ab    ab/ab
Expected (e)   ½N(1-r)     ½Nr      ½Nr      ½N(1-r)
Observed (o)   n_1         n_2      n_3      n_4
Class          P_1         R_1      R_2      P_2

To derive an estimate of r from the four numbers n_1 to n_4, we need some statistical principle, or method. Indeed, consider three estimates:
(1) $\psi = n_1 n_4 / (n_2 n_3) = (1-r)^2/r^2 = (1/r - 1)^2$ → $r = 1/(1+\sqrt{\psi})$
(2) $r = n_2 / (n_1 + n_2)$
(3) $r = (n_2 + n_3) / N$
Are these estimates equivalent (and in what sense)? If not, which should we choose?
For the observed counts 65, 12, 8, 79: (1) $\psi$ = 53.49, r = 12.03%; (2*) 15.58%, 13.19%, 10.96%, 9.20% (estimate (2) and its analogues $n_2/(n_2+n_4)$, $n_3/(n_1+n_3)$, $n_3/(n_3+n_4)$); (3) 12.20%.

How to estimate the rate of recombination from data? Example: inter-cross

               AB           Ab           aB           ab
Expected (e)   ¼(2+θ)N      ¼(1-θ)N      ¼(1-θ)N      ¼θN
Observed (o)   n_1          n_2          n_3          n_4
Class          P_1          R_1          R_2          P_2

To derive an estimate of r from the four numbers n_1 to n_4, we need some statistical principle, or method. Indeed, consider three estimates:
(1) $\psi = n_1 n_4 / (n_2 n_3) = (2+\theta)\theta/(1-\theta)^2$ → quadratic equation for $\theta$ → r (with $\theta \approx (1-r)^2$, $r = 1 - \sqrt{\theta}$)
(2) $\theta = n_4 / (n_3 + n_4)$
(3) $\theta = 4 n_4 / N$
Are these estimates equivalent? Are there better ones than these? Which should we choose?
Methods of statistical estimation of parameters: (a) method of moments (MM), (b) least squares (LS), (c) method of maximum likelihood (MML).

Sir R. Fisher: the method of maximum likelihood (MML). A short elementary introduction

               AB/ab       Ab/ab    aB/ab    ab/ab
Expected (e)   ½N(1-r)     ½Nr      ½Nr      ½N(1-r)
Observed (o)   n_1         n_2      n_3      n_4
Class          P_1         R_1      R_2      P_2

Pooling the two recombinant and the two parental classes:

               Recombinants    Parental
Expected       Nr              N(1-r)
Observed       n_2 + n_3       n_1 + n_4

The probability of getting $n_2+n_3$ recombinants out of N genotypes is a function of the unknown r:
$P(n_2+n_3; N \mid r) = \binom{N}{n_2+n_3} r^{n_2+n_3} (1-r)^{n_1+n_4} = A\, r^{n_2+n_3} (1-r)^{n_1+n_4} = L(r) \to \max$
Note: the phase here is known (it was a test-cross, with F1 = AB/ab). If the phase is unknown, both alternatives should be represented in the likelihood function: $L(r) \propto [\tfrac12 L(\theta = r) + \tfrac12 L(\theta = 1-r)]$ (i.e., F1 = AB/ab and Ab/aB).
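A rough numerical illustration of the unknown-phase likelihood is given below. It is a sketch under stated assumptions: the two phases are given equal prior weight ½ as in the formula above, the constant A is dropped, and the maximum is located by a simple grid search over 0 < r < 0.5 (the mixture is symmetric about r = 0.5). The function names are hypothetical.

```python
import math


def log_lik_unknown_phase(r, n_rec, n_par):
    """log of [1/2 r^n_rec (1-r)^n_par + 1/2 (1-r)^n_rec r^n_par]."""
    a = n_rec * math.log(r) + n_par * math.log(1.0 - r)   # phase as labelled
    b = n_rec * math.log(1.0 - r) + n_par * math.log(r)   # opposite phase
    m = max(a, b)                                          # avoid underflow
    return m + math.log(0.5 * math.exp(a - m) + 0.5 * math.exp(b - m))


def ml_r_unknown_phase(n_rec, n_par, steps=4999):
    """Grid-search ML estimate of r on the open interval (0, 0.5)."""
    grid = [0.5 * (i + 1) / (steps + 1) for i in range(steps)]
    return max(grid, key=lambda r: log_lik_unknown_phase(r, n_rec, n_par))


if __name__ == "__main__":
    # Same estimate whichever pair of classes is labelled "recombinant"
    print(ml_r_unknown_phase(20, 144), ml_r_unknown_phase(144, 20))
```

With strongly informative data the mixture is dominated by one phase, so the estimate is close to the smaller of the two class proportions.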

The probability of observing some $r = r_o = (n_2+n_3)/N$ under a certain (unknown!) real value $r = r_{real}$. So, what can we say about $r_{real}$?
[Figure: $P\{r_o = (n_2+n_3)/N \mid r_{real}\}$ plotted against $r_o$, with axis ticks at 0, 0.50 and 1.0]

Looking for the ML-estimate of r
To derive the ML-estimate of r, maximize L(r) or, equivalently, log L(r):
$\log L(r) = \log A + (n_2+n_3)\log r + (n_1+n_4)\log(1-r) \to \max$
by solving the maximum likelihood equation $d\log L(r)/dr = 0$:
$(n_2+n_3)/r - (n_1+n_4)/(1-r) = 0$, or $[n_2+n_3 - r(n_1+n_4+n_2+n_3)]/[r(1-r)] = 0$, so that
$r = (n_2+n_3)/N$ is the MML estimate of r.
How accurate is this estimate?
R. Fisher. Statistical Methods for Research Workers. 14th ed., Edinburgh, Oliver & Boyd, 1970.

Looking for the ML-estimate of r: variance of the estimate
To get the expected variance of the estimate, Fisher suggests $V_r = [-E\{d^2\log L(r)/dr^2\}]^{-1}$:
$d^2\log L(r)/dr^2 = d[(n_2+n_3)/r - (n_1+n_4)/(1-r)]/dr = -(n_2+n_3)/r^2 - (n_1+n_4)/(1-r)^2$
$-E\{d^2\log L(r)/dr^2\} = rN/r^2 + (1-r)N/(1-r)^2 = N/r + N/(1-r) = N/[r(1-r)]$
or $V_r = [-E\{d^2\log L(r)/dr^2\}]^{-1} = r(1-r)/N$.
Fisher defines the (asymptotic) information content in the sample about the unknown parameter $\theta$ as $I_\theta = 1/V_\theta$.
NB: MML gives estimates with minimum $V_\theta$, i.e., maximum $I_\theta$.
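Applied to the test-cross counts used earlier (65, 12, 8, 79), the ML estimate and its asymptotic variance can be computed as in the sketch below. The function name is an assumption for this illustration, and the estimate is assumed to lie strictly between 0 and 1 so that the information is finite.

```python
import math


def ml_testcross(n1, n2, n3, n4):
    """ML estimate of r, its asymptotic variance and standard error."""
    N = n1 + n2 + n3 + n4
    r_hat = (n2 + n3) / N                    # solves d log L / dr = 0
    info = N / (r_hat * (1.0 - r_hat))       # -E{d^2 log L / dr^2} at r_hat
    var = 1.0 / info                         # V_r = r(1-r)/N
    return r_hat, var, math.sqrt(var)


if __name__ == "__main__":
    r_hat, var, se = ml_testcross(65, 12, 8, 79)
    print(round(r_hat, 4), round(se, 4))
```

Plugging the estimate into the variance formula, as done here, is the usual practical step because the true r is unknown.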

Applying the MML theory to F2 (Fisher's example)

F2 phenotypic class    AB           Ab           aB           ab
Expected frequency     ¼(2+θ)N      ¼(1-θ)N      ¼(1-θ)N      ¼θN
Observed               n_1          n_2          n_3          n_4

where $\theta = (1-r_f)(1-r_m) \approx (1-r)^2$ [or $\theta = r_f r_m \approx r^2$].
$L(\theta) = A\,(2+\theta)^{n_1} (1-\theta)^{n_2+n_3}\, \theta^{n_4} \to \max$
$\log L(\theta) = \log A + n_1\log(2+\theta) + (n_2+n_3)\log(1-\theta) + n_4\log\theta \to \max$
To find the ML-estimate of $\theta$, we need to solve the ML-equation
$d\log L(\theta)/d\theta = n_1/(2+\theta) - (n_2+n_3)/(1-\theta) + n_4/\theta = 0$,
or, equivalently, $N\theta^2 - (n_1 - 2n_2 - 2n_3 - n_4)\theta - 2n_4 = 0$ → ML estimate of $\theta$ → r.
How to find the variance of the estimated parameter? Fisher defines the (asymptotic) information content in the sample about the unknown parameter $\theta$ as $I_\theta = -E\{d^2\log L(\theta)/d\theta^2\}$. The variance $V_\theta$ is the inverse of $I_\theta$:
$V_\theta = I_\theta^{-1} = [-E\{d^2\log L(\theta)/d\theta^2\}]^{-1} = 2\theta(2+\theta)(1-\theta) / [N(1+2\theta)]$
NB: MML gives estimates with minimum $V_\theta$, i.e., maximum $I_\theta$.
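The quadratic ML-equation for the F2 can be solved directly. The sketch below is illustrative only: the function name is hypothetical, theta denotes the parameter of the expected frequencies ¼(2+θ)N, ¼(1-θ)N, ¼(1-θ)N, ¼θN, the conversion r = 1 - sqrt(theta) assumes coupling phase with equal recombination in both sexes, and the counts in the usage line are synthetic (they equal the expected values for r = 0.2 and N = 400), not real data.

```python
import math


def ml_f2(n1, n2, n3, n4):
    """ML estimate of theta for F2 classes AB, Ab, aB, ab (dominant markers)."""
    N = n1 + n2 + n3 + n4
    b = n1 - 2 * n2 - 2 * n3 - n4
    # positive root of N*theta^2 - b*theta - 2*n4 = 0
    theta = (b + math.sqrt(b * b + 8 * N * n4)) / (2 * N)
    # asymptotic variance V = 2*theta*(2+theta)*(1-theta) / (N*(1+2*theta))
    var_theta = 2 * theta * (2 + theta) * (1 - theta) / (N * (1 + 2 * theta))
    r_coupling = 1 - math.sqrt(theta)   # assumes theta = (1 - r)^2
    return theta, var_theta, r_coupling


if __name__ == "__main__":
    # Synthetic counts: expected class sizes for r = 0.2, N = 400
    print(ml_f2(264, 36, 36, 64))
```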

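From 2 to 3 points: ML-estimation of linkage with 3 loci
For two adjacent intervals with recombination rates $\theta_1$ and $\theta_2$, the class frequencies are $(1-\theta_1)(1-\theta_2)$, $\theta_1(1-\theta_2)$, $(1-\theta_1)\theta_2$ and $\theta_1\theta_2$.
We now have a set (vector) of parameters $\Theta = \{\theta_1, \theta_2\}$:
$L(\Theta) = L(\theta_1, \theta_2) \to \max$
$d\log L(\Theta)/d\Theta = 0$, i.e., $\partial\log L(\Theta)/\partial\theta_1 = 0$ and $\partial\log L(\Theta)/\partial\theta_2 = 0$,
which gives a system $F_1(\theta_1, \theta_2) = 0$, $F_2(\theta_1, \theta_2) = 0$ → ML-estimates $\Theta = (\theta_1, \theta_2)$.
$V_\Theta = I_\Theta^{-1} = [-E\{d^2\log L(\Theta)/d\Theta^2\}]^{-1}$, where the information matrix has elements $-E\{\partial^2\log L(\Theta)/\partial\theta_i\,\partial\theta_j\}$.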
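For the simplest three-locus case the ML system has a closed-form solution. The sketch below assumes a test-cross with known phase and no interference, with complementary classes pooled into four counts: no recombination (`n_00`), recombination only in interval 1 (`n_10`), only in interval 2 (`n_01`), and in both (`n_11`). Under these assumptions the likelihood factorizes into one binomial term per interval, so the information matrix is diagonal. The names and the synthetic counts in the usage line are illustrative, not from the lecture.

```python
def ml_three_point(n_00, n_10, n_01, n_11):
    """ML estimates (theta1, theta2) and their asymptotic variances.

    Class probabilities: (1-t1)(1-t2), t1(1-t2), (1-t1)t2, t1*t2.
    """
    N = n_00 + n_10 + n_01 + n_11
    t1 = (n_10 + n_11) / N          # recombinants in interval 1
    t2 = (n_01 + n_11) / N          # recombinants in interval 2
    variances = (t1 * (1 - t1) / N, t2 * (1 - t2) / N)
    return (t1, t2), variances


if __name__ == "__main__":
    # Synthetic counts: expected values for theta1 = 0.1, theta2 = 0.2, N = 1000
    print(ml_three_point(720, 80, 180, 20))
```

With interference or other complications the two equations no longer decouple and must be solved numerically.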

Further biological complications in linkage estimation analysis
- Male vs. female recombination: $\Theta = \{r_m, r_f\}$
- Deviations from Mendelian segregation due to survival, penetrance, meiotic drive, or deviation from random syngamy (certation): $\Theta$ = {r, plus a segregation-distortion parameter}
- Inter-dependence of recombination in different intervals: $\{\theta_1, \theta_2, c\}$
- Problems with dominant markers (especially in repulsion phase)
- Variation of recombination among families: $\Theta = \{\theta_1, \theta_2, \theta_3, \ldots\}$
- Unknown linkage phases (coupling, repulsion)
- Missing data
- Various combinations of the foregoing complications

Recombination rate and map distance
For genetic mapping we need the genetic distance x = d(a, b): the average number of recombination events that occurred in the segment, taken across many meiotic cells.
A problem, "observed vs. occurred": only odd numbers of exchanges (1, 3, 5, ...) result in recombinant phenotypes that can be registered.
Let $p_k$ be the probability of k (k = 0, 1, 2, ...) exchanges in the interval between loci A/a and B/b. Then
$x = 0 \cdot p_0 + 1 \cdot p_1 + 2 \cdot p_2 + 3 \cdot p_3 + \ldots = \sum_{k=0}^{\infty} k\, p_k$,
but the recombination rate r is defined as the proportion of recombinant gametes:
$r = p_1 + p_3 + \ldots = \sum_{k=0}^{\infty} p_{2k+1}$.
What about the relationship x ↔ r?
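The difference between "occurred" and "observed" exchanges can be made concrete with a small sketch. The distribution used in the usage line is hypothetical and chosen only for illustration; the function name is likewise an assumption.

```python
def distance_and_recombination(p):
    """Return (x, r) for p[k] = probability of k exchanges in the interval.

    x is the mean number of exchanges; r is the probability of an odd number.
    """
    x = sum(k * pk for k, pk in enumerate(p))
    r = sum(pk for k, pk in enumerate(p) if k % 2 == 1)
    return x, r


if __name__ == "__main__":
    # Hypothetical distribution of 0, 1, 2, 3 exchanges (sums to 1)
    print(distance_and_recombination([0.5, 0.3, 0.15, 0.05]))
```

Because even numbers of exchanges go unobserved, r is smaller than x whenever multiple exchanges have non-zero probability.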

Mapping function. Haldane mapping function (J.B.S. Haldane)
The relationship between r and x is referred to as the mapping function, r = f(x); f(x) depends on the mode of multiple exchanges, or interference. r can be estimated from data, and then $x = f^{-1}(r)$.
Let potential recombination points be randomly distributed along the chromosome, independently of each other (i.e., with no interference). Then the probability of k exchanges between two loci is
$P_k(x) = e^{-x} x^k / k!$, k = 0, 1, 2, ... (Poisson distribution).
Thus r can be calculated as
$r(x) = \sum_{k=0}^{\infty} P_{2k+1}(x) = 0.5\,(1 - e^{-2x})$.
From this, $x = f^{-1}(r) = -0.5 \ln(1 - 2r)$.
The main assumption above is the independence of exchanges. Note that the genetic distance scale is additive, unlike the recombination rate scale.
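The Haldane function, its inverse, and the Poisson sum it comes from can be checked with a short sketch (function names are illustrative assumptions; distances are in Morgans):

```python
import math


def haldane_r(x):
    """Recombination rate for map distance x (Morgans), no interference."""
    return 0.5 * (1.0 - math.exp(-2.0 * x))


def haldane_x(r):
    """Map distance (Morgans) for recombination rate r < 0.5."""
    return -0.5 * math.log(1.0 - 2.0 * r)


def poisson_odd_sum(x, terms=40):
    """Direct sum of P_{2k+1}(x) = exp(-x) x^(2k+1) / (2k+1)!."""
    return sum(math.exp(-x) * x ** (2 * k + 1) / math.factorial(2 * k + 1)
               for k in range(terms))


if __name__ == "__main__":
    r1, r2 = haldane_r(0.1), haldane_r(0.2)
    # distances add; recombination rates combine as r1 + r2 - 2*r1*r2
    print(haldane_r(0.3), r1 + r2 - 2 * r1 * r2, poisson_odd_sum(0.3))
```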

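Interference: deviation from independence
Consider three loci A/a, B/b, C/c with recombination rate $r_1$ between A and B, $r_2$ between B and C, and $r_3$ between A and C.
With independence: $r_3 = r_1(1-r_2) + r_2(1-r_1) = r_1 + r_2 - 2r_{12}$, where $r_{12} = r_1 r_2$ is the expected probability of double exchanges.
In fact, $c = r_{12}(\text{observed}) / r_{12}(\text{expected}) = r_{12}(\text{observed}) / (r_1 r_2)$ generally differs from 1. c is the coefficient of coincidence; thus $r_3 = r_1 + r_2 - 2c\, r_1 r_2$.
Crossover interference: c < 1 is positive interference, c > 1 is negative interference.
Chromatid interference: defines how 2 out of 4 strands are involved.
[Figure: double exchanges among the four strands 1-4. A first exchange between strands 2-3 combined with a second exchange between strands 2-3, 2-4, 1-3 or 1-4 involves two, three, three or four strands, respectively; with no chromatid interference the two-, three- and four-strand doubles occur in the ratio 1 : 2 : 1.]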
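A coefficient of coincidence can be estimated from three-point data as sketched below. The function name and the counts in the usage line are hypothetical (synthetic numbers chosen for round results); `n_single_1` and `n_single_2` count progeny recombinant in only the first or only the second interval, and `n_double` counts progeny recombinant in both.

```python
def coincidence(n_double, n_single_1, n_single_2, N):
    """Return (c, r1, r2, r3) from three-point counts."""
    r1 = (n_single_1 + n_double) / N     # recombination in interval A-B
    r2 = (n_single_2 + n_double) / N     # recombination in interval B-C
    r12_observed = n_double / N          # observed double exchanges
    c = r12_observed / (r1 * r2)         # coefficient of coincidence
    r3 = r1 + r2 - 2 * c * r1 * r2       # recombination between A and C
    return c, r1, r2, r3


if __name__ == "__main__":
    # Synthetic counts for illustration only
    print(coincidence(n_double=10, n_single_1=90, n_single_2=190, N=1000))
```

Here r3 equals the proportion of single recombinants, since double recombinants look parental with respect to the outer loci A and C.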

Four-strand meiotic configurations

Mapping with Interference
Consider three loci A/a, B/b, C/c, with map distance x between A and B and a small increment dx between B and C. Then
$r(x+dx) = r(x) + r(dx) - 2c(r)\, r(x)\, r(dx)$, or $r(x+dx) - r(x) = r(dx)\,[1 - 2c(r)\, r(x)]$.
In the limit, when dx tends to zero (so that $r(dx) \approx dx$), we obtain the differential equation
$dr(x)/dx = 1 - 2r(x)\,c(r(x))$.
This equation is a tool for generating mapping functions. Examples:
$c(r) \equiv 0$ → $r(x) = x$: Morgan's function
$c(r(x)) \equiv 1$ → $r(x) = ½[1 - \exp(-2x)]$: Haldane's function
$c(r(x)) = 2r$ → $r(x) = ½\tanh(2x)$: Kosambi's function (complete interference for short distances and no interference for large distances)
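The differential equation can also be integrated numerically for any choice of c(r), which is a quick way to check the three closed-form examples. The sketch below uses a classical fourth-order Runge-Kutta step; the integrator, the function name and the step count are choices made for this illustration, not part of the lecture.

```python
import math


def mapping_function(c, x, steps=1000):
    """Integrate dr/dx = 1 - 2*r*c(r) from r(0) = 0 up to distance x (RK4)."""
    def f(r):
        return 1.0 - 2.0 * r * c(r)

    h, r = x / steps, 0.0
    for _ in range(steps):
        k1 = f(r)
        k2 = f(r + 0.5 * h * k1)
        k3 = f(r + 0.5 * h * k2)
        k4 = f(r + h * k3)
        r += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
    return r


if __name__ == "__main__":
    x = 0.5
    print(mapping_function(lambda r: 0.0, x))      # Morgan: c = 0
    print(mapping_function(lambda r: 1.0, x))      # Haldane: c = 1
    print(mapping_function(lambda r: 2.0 * r, x))  # Kosambi: c = 2r
```

Other coincidence functions c(r) can be passed in the same way to generate further mapping functions.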

Kosambi interference
Let us put c(r(x)) = 2r into the basic equation $dr(x)/dx = 1 - 2r(x)\,c(r(x))$. Its solution is called the Kosambi mapping function:
$r = ½\,\dfrac{1 - \exp(-4x)}{1 + \exp(-4x)} = ½\tanh(2x)$.
Clearly, $r(x) \to 0.5$ when $x \to \infty$.
Back transformation of $r = ½\tanh(2x)$ gives $x = ¼\ln[(1+2r)/(1-2r)]$.
The formula for combining recombination fractions in adjacent intervals, corresponding to the Kosambi function, takes the form $r_3 = (r_1 + r_2)/(1 + 4 r_1 r_2)$.
Compare with:
$r_3 = r_1 + r_2$ (Morgan's function)
$r_3 = r_1 + r_2 - 2 r_1 r_2$ (Haldane's function)
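The Kosambi function, its inverse and its addition formula can be verified against each other with a few lines (function names are illustrative assumptions; distances are in Morgans):

```python
import math


def kosambi_r(x):
    """Recombination rate for map distance x (Morgans)."""
    return 0.5 * math.tanh(2.0 * x)


def kosambi_x(r):
    """Map distance (Morgans) for recombination rate r < 0.5."""
    return 0.25 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def kosambi_add(r1, r2):
    """Recombination rate across two adjacent intervals."""
    return (r1 + r2) / (1.0 + 4.0 * r1 * r2)


if __name__ == "__main__":
    r1, r2 = kosambi_r(0.1), kosambi_r(0.2)
    # adding distances agrees with combining the recombination rates
    print(kosambi_r(0.3), kosambi_add(r1, r2))
```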

Some comments on interference and mapping functions
- The assumption of a Poisson distribution is not valid: as a rule, there is one obligate exchange.
- Obligate exchange + Poisson model (no interference) → positive interference.
- The differential equation approach deals with 3 loci; there is no multilocus extension.
- The differential equation approach ignores variation of interference along the chromosome (arm).
- The probabilistic models consider recombination as a serial process that starts from a point (e.g., the centromere) and proceeds along the arm (in mathematical terms, renewal processes with subsequent "switches" at points t_0, t_1, t_2, t_3, ...).
- Count-Location approach: (1) a count distribution $c = \{c_0, c_1, c_2, \ldots\}$ with $c_i \ge 0$ and $\sum c_i = 1$ (e.g., Poisson); (2) a location function $F_k$ (e.g., even distribution). C-L functions can generate positive and negative interference, depending on c.
Sam Karlin & Uri Liberman, 1994. Theoretical recombination processes incorporating interference. Theor. Population Biology, 46: 198-231.
Bailey N.T.J. 1961. Introduction to the Mathematical Theory of Genetic Linkage. Oxford Univ. Press.

Negative interference (excess of double exchanges) in chromosome 1B of wheat
Loci 1 to 5: Xgwm18, Xgwm11, Xgwm413, Xgwm273, Xgwm911
[Table: values reported for combinations of intervals (layout lost in extraction): 0.94, 6.45***, 4.42***, 5.70***, 5.47***]

Islands of negative interference (excess of double crossovers) in wheat chromosome 1B (highlighted in red). In fact, negative interference seems to be a general phenomenon in wheat, as well as in barley, Drosophila, and other species.

Four-strand meiotic configurations

Tetrads
In some organisms (e.g., yeasts) all four products of an individual meiosis can be recovered together in what is known as an ascus. These are called tetrads. The four ascospores can be typed for marker loci (e.g., SSRs) "individually". In some cases (e.g., N. crassa) there is one further mitotic division, but the resulting octads are ordered.

Tetrad analysis of recombination
[Figure panels: with exchange, second-division segregation pattern; no exchange, first-division segregation pattern]
