Bioinformatic tools for Genome Mapping Avraham Korol 8240-449 (2449), room 217 in multipurpose building Course Assistant: Irit.


Bioinformatic tools for Genome Mapping Avraham Korol (2449), room 217 in multipurpose building Course Assistant: Irit Cohen

Methods of Genome Analysis (linkage maps, physical maps, QTL analysis) The focus of the course is on analytical (bioinformatic) tools for genetic/genomic analysis, including some background from (a) statistics, (b) appl. math., (c) software

A few elementary genetic and molecular-genetic notions (subjects) you are supposed to know

General Genetics: meiosis, syngamy, gamete, zygote, DNA, genome, nucleus, chromosome, centromere, bivalent, hybrid, homozygote, F1, F2, heterozygote, inbred, haploid, diploid, mutant, gene, allele, locus, phenotype, Mendelian segregation (single-, two-, multilocus), dominant, co-dominant, recessive, additive, linkage, recombination, epistasis, quantitative variation, heritability, test-cross, backcross, intercross, linkage phase (coupling, repulsion), multiple crossovers, interference, polymorphism, linkage disequilibrium, haplotype

Molecular Genetics: PCR, tandem repeats, microsatellite, SNP, DNA cloning, BAC-clone, genomic library, DNA fingerprinting, overlapping clones, contig, radiation hybrid, candidate gene, microarray

Structural genomics includes genetic mapping, physical mapping and sequencing of entire genomes.

[Figure: results of consecutive steps of structural genomics]

What is genome mapping?

a. Positioning of DNA markers → genetic maps
b. Positioning DNA pieces → physical maps
c. Locating Mendelian genes relative to markers
d. Mapping quantitative trait loci → QTL maps

[Figure: panels a, b, c, d, leading to gene cloning and sequencing]

Building a contig map: an ordered set of clones

RECOMBINATION ANALYSIS - THE BASIS OF GENETIC MAPPING

Mapping is the relative positioning of entities in some space. Maps can be derived after defining a "distance" in that space. Genetics uses recombination rates to build genetic maps.

"Two-point back-cross": ab/ab × AB/AB (both homozygous) gives the heterozygote AB/ab, which is then crossed to ab/ab.

Meiosis in AB/ab produces parental types and recombinants:

sperm: $(1 - r_m)\{AB + ab\}$ parental types, $r_m\{Ab + aB\}$ recombinants
eggs: $(1 - r_f)\{AB + ab\}$ parental types, $r_f\{Ab + aB\}$ recombinants

The simplest statistical analysis: how can we test whether two loci are linked?

The major method of genetic mapping is to analyze the proportion of parental and recombinant associations of alleles in the progeny: AB/ab × ab/ab → {AB/ab, Ab/ab, aB/ab, ab/ab}. If r is the (unknown) rate of recombination and N the sample size, the progeny structure (sizes of phenotypic classes) will be

              AB/ab       Ab/ab    aB/ab    ab/ab
Expected (e)  ½ N(1-r)    ½ Nr     ½ Nr     ½ N(1-r)
Observed (o)  n1          n2       n3       n4

$\chi^2$ statistic: $\chi^2_{total} = \sum_i (n_{io} - n_{ie})^2 / n_{ie}$ (d.f. = 3)

Example: test-cross AB/ab × ab/ab → {AB/ab, Ab/ab, aB/ab, ab/ab}

              AB/ab       Ab/ab    aB/ab    ab/ab
Expected (e)  ½ N(1-r)    ½ Nr     ½ Nr     ½ N(1-r)
Observed (o)  65          12       8        79         $\sum n_{io} = N = 164$

Components of $\chi^2$: $\chi^2_{total} = \chi^2_{A:a} + \chi^2_{B:b} + \chi^2_{linkage}$
Degrees of freedom: 3 = 1 + 1 + 1

1. Testing for normal (Mendelian) monogenic segregation ratios

A : a = 77 : 87; $\chi^2_{A:a} = (77 - 82)^2/82 + (87 - 82)^2/82 = 0.61 < 3.84$
B : b = 73 : 91; $\chi^2_{B:b} = (73 - 82)^2/82 + (91 - 82)^2/82 = 1.98 < 3.84$

2. Testing for 'linkage' vs. 'independent segregation'

If the genes are unlinked, and if monogenic ratios are normal, then we expect (A:a)(B:b) = 1:1:1:1, or $n_{parental} = n_{recombinants} = ½N$, that is, $(n_{AB} + n_{ab})_{expected} = ½N$ and $(n_{Ab} + n_{aB})_{expected} = ½N$. Thus,

$\chi^2_{linkage} = (144 - 82)^2/82 + (20 - 82)^2/82 = 93.76 \gg 3.84$

and we should reject the hypothesis of independent segregation.
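The partition of the total chi-square into two segregation components and one linkage component can be checked numerically. The following is a minimal sketch, not part of the original slides; it assumes the four class counts 65, 12, 8, 79, which are the only counts consistent with the marginals quoted above (A:a = 77:87, B:b = 73:91, 144 parental vs. 20 recombinant). The function names are hypothetical.

```python
# Assumed test-cross counts, in the order AB/ab, Ab/ab, aB/ab, ab/ab.
# They are reconstructed from the marginal ratios quoted in the slide.
counts = {"AB": 65, "Ab": 12, "aB": 8, "ab": 79}


def chi2_one_to_one(x, y):
    """Chi-square (1 d.f.) for a 1:1 split of x + y observations."""
    expected = (x + y) / 2
    return (x - expected) ** 2 / expected + (y - expected) ** 2 / expected


def chi2_partition(c):
    """Return (chi2 A:a, chi2 B:b, chi2 linkage) for a test-cross."""
    chi_a = chi2_one_to_one(c["AB"] + c["Ab"], c["aB"] + c["ab"])
    chi_b = chi2_one_to_one(c["AB"] + c["aB"], c["Ab"] + c["ab"])
    chi_link = chi2_one_to_one(c["AB"] + c["ab"], c["Ab"] + c["aB"])
    return chi_a, chi_b, chi_link


def chi2_total(c):
    """Total chi-square (3 d.f.) against the 1:1:1:1 expectation."""
    n = sum(c.values())
    expected = n / 4
    return sum((obs - expected) ** 2 / expected for obs in c.values())


chi_a, chi_b, chi_link = chi2_partition(counts)
total = chi2_total(counts)
print(f"A:a {chi_a:.2f}  B:b {chi_b:.2f}  linkage {chi_link:.2f}  total {total:.2f}")
```

With a 1:1:1:1 expectation the three one-degree-of-freedom components add up exactly to the total, which is what the sketch lets you confirm.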

Testing for 'linkage' (continued)

How to conduct the linkage test if the monogenic ratios are not normal? For the expectations, we now have (A : a)(B : b) ≠ 1:1:1:1. Still, upon independence of A:a and B:b, one expects that the following holds in the 2 × 2 table:

        A         a
B    n_AB      n_aB
b    n_Ab      n_ab

$n_{AB} : n_{aB} = n_{Ab} : n_{ab}$, or $n_{AB} : n_{Ab} = n_{aB} : n_{ab}$, or $n_{AB}\,n_{ab} = n_{Ab}\,n_{aB}$, or $D = n_{AB}\,n_{ab} - n_{Ab}\,n_{aB} = 0$.

To test this, we employ 2 × 2 analysis:

$$\chi^2_{linkage} = \frac{D^2 N}{n_A\, n_a\, n_B\, n_b}, \quad \text{d.f.} = 1$$

We need to remember the basic ideas from the statistical testing paradigm:

Significance: probability of type I error $\alpha$ (false positive: declare linkage when it does not exist), e.g., $\alpha$ = 0.05, 0.01
Probability of type II error $\beta$ (false negative: declare "no linkage" when it exists)
Power $(1 - \beta)$: probability to detect linkage when it exists (e.g., $1 - \beta$ = 0.8, 0.9)
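A short sketch of the 2 × 2 linkage test based on D follows. It is an illustration only: the function name is hypothetical, and the example counts (65, 12, 8, 79) are assumed from the marginals of the test-cross example rather than quoted from the slides.

```python
def chi2_linkage_2x2(n_AB, n_Ab, n_aB, n_ab):
    """2 x 2 chi-square (1 d.f.): D^2 * N / (n_A * n_a * n_B * n_b)."""
    n = n_AB + n_Ab + n_aB + n_ab
    n_A, n_a = n_AB + n_Ab, n_aB + n_ab   # marginal totals for locus A
    n_B, n_b = n_AB + n_aB, n_Ab + n_ab   # marginal totals for locus B
    d = n_AB * n_ab - n_Ab * n_aB         # D = 0 under independence
    return d * d * n / (n_A * n_a * n_B * n_b)


# Assumed example counts; 3.84 is the 5% critical value for 1 d.f.
chi2_example = chi2_linkage_2x2(65, 12, 8, 79)
print(f"chi2 = {chi2_example:.2f}, linked at 5% level: {chi2_example > 3.84}")
```

Because the statistic uses the observed marginal totals instead of assuming 1:1 ratios, it stays valid when segregation at either locus is distorted.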

Testing for 'linkage' (continued)

Information test for independence in k × m tables:

$$\chi^2 = 2\Big\{\sum_{i,j} n_{ij}\ln n_{ij} - \sum_i n_{i\cdot}\ln n_{i\cdot} - \sum_j n_{\cdot j}\ln n_{\cdot j} + N\ln N\Big\}, \quad \text{d.f.} = (k-1)(m-1)$$

Kullback S. Information Theory and Statistics. Wiley & Sons, NY.
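The information statistic for a k × m table can be sketched as below. This is a hedged illustration: the function names are hypothetical, the convention 0·ln 0 = 0 is assumed for empty cells, and the example table uses counts assumed from the test-cross example (rows B, b; columns A, a).

```python
import math


def xlogx(n):
    """n * ln(n), with the usual convention 0 * ln(0) = 0."""
    return n * math.log(n) if n > 0 else 0.0


def information_chi2(table):
    """Return (statistic, d.f.) of the information test for independence."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    stat = 2.0 * (
        sum(xlogx(n) for row in table for n in row)
        - sum(xlogx(n) for n in row_sums)
        - sum(xlogx(n) for n in col_sums)
        + xlogx(total)
    )
    df = (len(row_sums) - 1) * (len(col_sums) - 1)
    return stat, df


# Assumed 2 x 2 table: rows B, b; columns A, a.
stat, df = information_chi2([[65, 8], [12, 79]])
print(f"information chi2 = {stat:.2f} with {df} d.f.")
```

The statistic is algebraically the same as twice the sum of observed × ln(observed / expected), with expected counts built from the table margins.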

Another example: inter-cross, or F2 = F1 × F1 (AB/ab × AB/ab)

Female gametes: ½(1-r_f) AB, ½ r_f Ab, ½ r_f aB, ½(1-r_f) ab
Male gametes: ½(1-r_m) AB, ½ r_m Ab, ½ r_m aB, ½(1-r_m) ab

The 4 × 4 table of female by male gametes gives 16 combinations → 10 classes, with zygote frequencies $z_{ij} = g_i g_j = f(r_m, r_f)$. Assume dominance → 4 classes.

Phenotypic classes F2   AB          Ab          aB          ab
Expected frequency      ¼(2+θ)N     ¼(1-θ)N     ¼(1-θ)N     ¼θN

where $\theta = (1 - r_f)(1 - r_m) \approx (1 - r)^2$ [or $\theta = r_f r_m \approx r^2$]

"Linked" or "independent"? $\chi^2$ linkage test, as in the test-cross, i.e. using 2 × 2 analysis (or 2 × 3, or 3 × 2, or ?)

How to estimate the rate of recombination from data? Example: test-cross

              AB/ab       Ab/ab    aB/ab    ab/ab
Expected (e)  ½ N(1-r)    ½ Nr     ½ Nr     ½ N(1-r)
Observed (o)  n1          n2       n3       n4
              P1          R1       R2       P2

To derive an estimate of r from the 4 numbers $n_1, \dots, n_4$, we need some statistical principle, or method. Indeed, consider three estimates:

(1) $\lambda = n_1 n_4 / (n_2 n_3) = (1-r)^2 / r^2 = (1/r - 1)^2 \Rightarrow r = 1/(1 + \sqrt{\lambda})$
(2) $r = n_2 / (n_1 + n_2)$
(3) $r = (n_2 + n_3) / N$

For the example data: (1) $\lambda$ = 53.49, r = 12.03%; (2*) 15.58%; 13.19%; 10.96%; 9.20%, from the four one-sided ratios $n_2/(n_1+n_2)$, $n_2/(n_2+n_4)$, $n_3/(n_1+n_3)$, $n_3/(n_3+n_4)$; (3) 12.20%

Are these estimates equivalent (and in what sense)? If not, what should we choose?

How to estimate the rate of recombination from data? Example: inter-cross

              AB          Ab          aB          ab
Expected (e)  ¼(2+θ)N     ¼(1-θ)N     ¼(1-θ)N     ¼θN
Observed (o)  n1          n2          n3          n4
              P1          R1          R2          P2

To derive an estimate of r from the 4 numbers $n_1, \dots, n_4$, we need some statistical principle, or method. Indeed, consider three estimates:

(1) $\lambda = n_1 n_4 / (n_2 n_3) = (2+\theta)\theta / (1-\theta)^2 \Rightarrow$ quadratic equation for $\theta \Rightarrow r$ (with $\theta \approx (1-r)^2$, $r = 1 - \sqrt{\theta}$; with $\theta \approx r^2$, $r = \sqrt{\theta}$)
(2) $\theta = n_4 / (n_3 + n_4)$
(3) $\theta = 4 n_4 / N$

Are these estimates equivalent? Are there better ones than these? What should we choose?

Methods of statistical estimation of parameters: (a) method of moments (MM), (b) least squares (LS), (c) method of maximum likelihood (MML)

Sir R. Fisher: method of maximum likelihood (MML)

MML: a short elementary introduction

              AB/ab       Ab/ab    aB/ab    ab/ab
Expected (e)  ½ N(1-r)    ½ Nr     ½ Nr     ½ N(1-r)
Observed (o)  n1          n2       n3       n4
              P1          R1       R2       P2

Pooling the classes: recombinants, expected $Nr$, observed $n_2 + n_3$; parental types, expected $N(1-r)$, observed $n_1 + n_4$.

The probability to get $n_2 + n_3$ recombinants out of N genotypes is a function of the unknown r:

$$P(n_2+n_3;\, N \mid r) = \binom{N}{n_2+n_3}\, r^{\,n_2+n_3}(1-r)^{\,n_1+n_4} = A\, r^{\,n_2+n_3}(1-r)^{\,n_1+n_4} = L(r) \to \max$$

Note that the phase here is known (it was a test-cross, with F1 = AB/ab). If the phase is unknown, the alternatives should be represented in the likelihood function L(r):

$$L(r) \propto \tfrac{1}{2} L(\rho = r) + \tfrac{1}{2} L(\rho = 1 - r)$$

(i.e., F1 = AB/ab and Ab/aB)

The probability to observe some $r = r_o = (n_2 + n_3)/N$ under a certain (unknown!) real value $r = r_{real}$. So, what can we say about $r_{real}$?

[Figure: $P\{r_o = (n_2 + n_3)/N \mid r_{real}\}$ plotted against $r_o$]

Looking for the ML-estimate of r

To derive the ML-estimate of r, maximize L(r) or log L(r):

$$\log L(r) = \log A + (n_2+n_3)\log r + (n_1+n_4)\log(1-r) \to \max$$

by solving the equation $d\log L(r)/dr = 0$ (maximum likelihood equation):

$$\frac{n_2+n_3}{r} - \frac{n_1+n_4}{1-r} = 0, \quad\text{or}\quad \frac{n_2+n_3 - r\,(n_1+n_4+n_2+n_3)}{r(1-r)} = 0,$$

so that $r = (n_2+n_3)/N$ is the MML estimate of r.

How accurate is this estimate?

R. Fisher. Statistical Methods for Research Workers. 14th ed., Edinburgh, Oliver & Boyd, 1970

Looking for the ML-estimate of r

To get the expected variance of the estimate, Fisher suggests $V_r = \big[-E\{d^2\log L(r)/dr^2\}\big]^{-1}$:

$$\frac{d^2\log L(r)}{dr^2} = \frac{d}{dr}\left[\frac{n_2+n_3}{r} - \frac{n_1+n_4}{1-r}\right] = -\frac{n_2+n_3}{r^2} - \frac{n_1+n_4}{(1-r)^2}$$

$$-E\left\{\frac{d^2\log L(r)}{dr^2}\right\} = \frac{rN}{r^2} + \frac{(1-r)N}{(1-r)^2} = \frac{N}{r} + \frac{N}{1-r} = \frac{N}{r(1-r)}$$

or

$$V_r = \left[-E\left\{\frac{d^2\log L(r)}{dr^2}\right\}\right]^{-1} = \frac{r(1-r)}{N}$$

Fisher defines the (asymptotic) information content in the sample about the unknown parameter $\theta$ as $I_\theta = 1/V_\theta$.

NB: MML gives estimates with min $V_\theta$, i.e. max $I_\theta$!
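The closed-form ML estimate and its asymptotic standard error can be cross-checked against a brute-force maximization of the log-likelihood. This is a minimal sketch with hypothetical function names; the counts 65, 12, 8, 79 are assumed from the test-cross example, and the constant log A is dropped because it does not affect the maximum.

```python
import math


def ml_recombination(n1, n2, n3, n4):
    """Closed-form ML estimate of r and its asymptotic standard error."""
    n = n1 + n2 + n3 + n4
    r_hat = (n2 + n3) / n
    variance = r_hat * (1.0 - r_hat) / n   # V_r = r(1-r)/N evaluated at r_hat
    return r_hat, math.sqrt(variance)


def log_likelihood(r, n1, n2, n3, n4):
    """log L(r) up to the additive constant log A."""
    return (n2 + n3) * math.log(r) + (n1 + n4) * math.log(1.0 - r)


data = (65, 12, 8, 79)  # assumed example counts
r_hat, se = ml_recombination(*data)

# Brute-force check: evaluate log L on a grid of r values in (0, 0.5).
grid = [i / 10000 for i in range(1, 5000)]
r_grid = max(grid, key=lambda r: log_likelihood(r, *data))

print(f"r_hat = {r_hat:.4f} +/- {se:.4f}; grid maximum at r = {r_grid:.4f}")
```

The grid search is only a sanity check; for the test-cross the analytic solution is exact.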

Applying the MML theory to F2 (Fisher's example)

Phenotypic classes F2   AB          Ab          aB          ab
Expected frequency      ¼(2+θ)N     ¼(1-θ)N     ¼(1-θ)N     ¼θN
Observed                n1          n2          n3          n4

where $\theta = (1 - r_f)(1 - r_m) \approx (1 - r)^2$ [or $\theta = r_f r_m \approx r^2$]

$$L(\theta) = A\,(2+\theta)^{n_1}(1-\theta)^{n_2+n_3}\,\theta^{n_4} \to \max$$

$$\log L(\theta) = \log A + n_1\log(2+\theta) + (n_2+n_3)\log(1-\theta) + n_4\log\theta \to \max$$

To find the ML-estimate of $\theta$, we need to solve the ML-equation

$$\frac{d\log L(\theta)}{d\theta} = \frac{n_1}{2+\theta} - \frac{n_2+n_3}{1-\theta} + \frac{n_4}{\theta} = 0,$$

or, equivalently, solve

$$N\theta^2 - (n_1 - 2n_2 - 2n_3 - n_4)\,\theta - 2n_4 = 0 \;\Rightarrow\; \text{ML estimate of } \theta \;\Rightarrow\; r$$

How to find the variance of the estimated parameter? Fisher defines the (asymptotic) information content in the sample about the unknown parameter $\theta$ as $I_\theta = -E\{d^2\log L(\theta)/d\theta^2\}$. The variance $V_\theta$ is the inverse of $I_\theta$:

$$V_\theta = I_\theta^{-1} = \left[-E\left\{\frac{d^2\log L(\theta)}{d\theta^2}\right\}\right]^{-1} = \frac{2\theta(2+\theta)(1-\theta)}{N(1+2\theta)}$$

NB: MML gives estimates with min $V_\theta$, i.e. max $I_\theta$!
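The quadratic ML equation for the F2 case is easy to solve numerically. The sketch below is illustrative: function names are hypothetical, the counts are illustrative numbers that do not come from the slides, and coupling phase (θ ≈ (1 − r)²) is assumed when converting θ to r.

```python
import math


def f2_theta_ml(n1, n2, n3, n4):
    """Positive root of N*t^2 - (n1 - 2*n2 - 2*n3 - n4)*t - 2*n4 = 0."""
    n = n1 + n2 + n3 + n4
    b = -(n1 - 2 * n2 - 2 * n3 - n4)
    c = -2 * n4
    return (-b + math.sqrt(b * b - 4 * n * c)) / (2 * n)


def f2_score(theta, n1, n2, n3, n4):
    """d log L / d theta; it should vanish at the ML estimate."""
    return n1 / (2 + theta) - (n2 + n3) / (1 - theta) + n4 / theta


def f2_theta_variance(theta, n):
    """Asymptotic variance 2*t*(2+t)*(1-t) / (N*(1+2*t))."""
    return 2 * theta * (2 + theta) * (1 - theta) / (n * (1 + 2 * theta))


# Illustrative counts for classes AB, Ab, aB, ab (not from the slides).
counts = (1997, 906, 904, 32)
theta_hat = f2_theta_ml(*counts)
se_theta = math.sqrt(f2_theta_variance(theta_hat, sum(counts)))
r_coupling = 1 - math.sqrt(theta_hat)   # assumes theta = (1 - r)^2

print(f"theta_hat = {theta_hat:.5f} (SE {se_theta:.5f}), r = {r_coupling:.4f} if coupling")
```

Evaluating the score function at the returned root is a quick way to confirm that the quadratic and the ML equation agree.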

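From 2- to 3-points: ML-estimation of linkage with 3 loci

Probabilities for the two adjacent intervals: $(1-\theta_1)(1-\theta_2)$, $(1-\theta_1)\theta_2$, $\theta_1(1-\theta_2)$, $\theta_1\theta_2$

$\Theta = \{\theta_1, \theta_2\}$: we now have a set (vector) of parameters

$$L(\Theta) = L(\theta_1, \theta_2) \to \max$$

$$\frac{d\log L(\Theta)}{d\Theta} = 0: \quad \frac{\partial \log L(\Theta)}{\partial\theta_1} = 0,\; \frac{\partial \log L(\Theta)}{\partial\theta_2} = 0 \;\Rightarrow\; F_1(\theta_1,\theta_2) = 0,\; F_2(\theta_1,\theta_2) = 0 \;\Rightarrow\; \text{ML-estimates } \hat\Theta = (\hat\theta_1, \hat\theta_2)$$

$$V_\Theta = I_\Theta^{-1} = \left[-E\left\{\frac{d^2\log L(\Theta)}{d\Theta^2}\right\}\right]^{-1}, \quad\text{with elements of } I_\Theta \text{ given by } -E\left\{\frac{\partial^2\log L(\Theta)}{\partial\theta_1\,\partial\theta_2}\right\}$$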

Further biological complications in linkage estimation analysis

- Male vs. female recombination: $\Theta = \{r_m, r_f\}$
- Deviations from Mendelian segregation, due to survival, penetrance, meiotic drive, deviation from random syngamy (certation): $\Theta$ = {r, plus segregation parameters}
- Inter-dependence of recombination in different intervals: $\{\theta_1, \theta_2, c\}$
- Problems with dominant markers (especially in repulsion phase)
- Variation of recombination among families: $\Theta = \{\theta_1, \theta_2, \theta_3, \dots\}$
- Unknown linkage phases (coupling, repulsion)
- Missing data
- Various combinations of the foregoing complications

Recombination rate and map distance

For genetic mapping we need the genetic distance x = d(a, b): the average number of recombination events occurring in the segment across many meiotic cells.

A problem, "observed vs. occurred": only odd numbers of exchanges (1, 3, 5, ...) result in recombinant phenotypes that can be registered.

Let $p_k$ be the probability of k (k = 0, 1, 2, ...) exchanges in the interval between loci A/a and B/b. Thus

$$x = 0\cdot p_0 + 1\cdot p_1 + 2\cdot p_2 + 3\cdot p_3 + \dots = \sum_{k=0}^{\infty} k\,p_k,$$

but the recombination rate r is defined as the proportion of recombinant gametes:

$$r = p_1 + p_3 + \dots = \sum_{k=0}^{\infty} p_{2k+1}$$

What about the relationship between x and r?

Mapping function

The relationship of r and x is referred to as the mapping function, r = f(x); f(x) depends on the mode of multiple exchanges, or interference; r can be estimated from data, and then $x = f^{-1}(r)$.

Haldane mapping function (J.B.S. Haldane)

Let potential recombination points be randomly distributed along the chromosome, independently of each other (i.e., with no interference). Then the probability of k exchanges between two loci is

$$P_k(x) = e^{-x} x^k / k!, \quad k = 0, 1, 2, \dots \quad\text{(Poisson distribution)}$$

Thus, r can be calculated as

$$r(x) = \sum_{k=0}^{\infty} P_{2k+1}(x) = \tfrac{1}{2}\left(1 - e^{-2x}\right).$$

From this, $x = r^{-1}(r) = -\tfrac{1}{2}\ln(1 - 2r)$.

The main assumption in the above is the independence of exchanges. Note that the genetic distance scale is additive, unlike the recombination rate scale.
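The Haldane function, its inverse, and the Poisson sum it comes from can be sketched as follows. Function names are hypothetical, and the series is truncated at an assumed number of terms that is ample for map distances of a few Morgans.

```python
import math


def haldane_r(x):
    """Recombination rate for map distance x (in Morgans), no interference."""
    return 0.5 * (1.0 - math.exp(-2.0 * x))


def haldane_x(r):
    """Map distance (in Morgans) for recombination rate r < 0.5."""
    return -0.5 * math.log(1.0 - 2.0 * r)


def r_from_poisson(x, terms=40):
    """Sum of Poisson probabilities of an odd number of exchanges."""
    return sum(
        math.exp(-x) * x ** (2 * k + 1) / math.factorial(2 * k + 1)
        for k in range(terms)
    )


x1, x2 = 0.10, 0.25
r1, r2 = haldane_r(x1), haldane_r(x2)
# Distances add; recombination rates combine as r1 + r2 - 2*r1*r2.
r_sum_of_distances = haldane_r(x1 + x2)
r_combined = r1 + r2 - 2.0 * r1 * r2
print(r1, r2, r_sum_of_distances, r_combined)
```

The last two values coincide, which illustrates why the distance scale, not the recombination scale, is the additive one.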

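Interference: deviation from independence

Three loci A, B, C (alleles a, b, c), with recombination rates $r_1$ (A-B), $r_2$ (B-C) and $r_3$ (A-C).

With independence: $r_3 = r_1(1 - r_2) + r_2(1 - r_1) = r_1 + r_2 - 2r_{12}$, where $r_{12} = r_1 r_2$ is the expected probability of double exchanges.

In fact: $c = r_{12}(\text{observed}) / r_{12}(\text{expected}) = r_{12}(\text{observed}) / (r_1 r_2) \ne 1$, where c is the coefficient of coincidence; thus $r_3 = r_1 + r_2 - 2c\,r_{12}$ with $r_{12} = r_1 r_2$.

Crossover interference: c < 1 positive, c > 1 negative.

Chromatid interference: defines how 2 out of 4 strands are involved. Double exchanges may involve two, three, three or four strands; with no chromatid interference, two-, three- and four-strand doubles occur as 1 : 2 : 1.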
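A small worked example of the coefficient of coincidence from three-point data is sketched below. All counts are hypothetical and chosen only for illustration; the function name is likewise an assumption.

```python
def coincidence(n_total, single_1, single_2, doubles):
    """Return (r1, r2, r3, c) from three-point gamete counts.

    single_1 / single_2: gametes with one exchange in interval 1 / interval 2.
    doubles: gametes with an exchange in both intervals.
    """
    r1 = (single_1 + doubles) / n_total
    r2 = (single_2 + doubles) / n_total
    r12_observed = doubles / n_total
    c = r12_observed / (r1 * r2)           # coefficient of coincidence
    r3 = r1 + r2 - 2.0 * c * r1 * r2       # flanking-marker recombination
    return r1, r2, r3, c


# Hypothetical counts: 1000 gametes, 100 and 150 singles, 10 doubles.
r1, r2, r3, c = coincidence(1000, 100, 150, 10)
kind = "positive" if c < 1 else "negative or none"
print(f"r1={r1:.3f} r2={r2:.3f} r3={r3:.3f} c={c:.3f} ({kind} interference)")
```

Here r3 equals the proportion of gametes recombinant for the flanking markers, (100 + 150)/1000, because double exchanges restore the parental combination of the outer loci.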

Four-strand meiotic configurations

Mapping with interference

Three loci A, B, C (alleles a, b, c): distance x between A and B, a small increment dx between B and C, and x + dx between A and C.

$$r(x+dx) = r(x) + r(dx) - 2c(r)\,r(x)\,r(dx), \quad\text{or}\quad r(x+dx) - r(x) = r(dx)\,[1 - 2c(r)\,r(x)].$$

In the limit, when dx tends to zero, we obtain the differential equation

$$\frac{dr(x)}{dx} = 1 - 2\,r(x)\,c(r(x))$$

This equation is a tool for generating mapping functions. Examples:

- $c(r) \equiv 0 \Rightarrow r(x) = x$: Morgan's function
- $c(r(x)) \equiv 1 \Rightarrow r(x) = \tfrac{1}{2}[1 - \exp(-2x)]$: Haldane's function
- $c(r(x)) = 2r \Rightarrow r(x) = \tfrac{1}{2}\tanh(2x)$: Kosambi's function (complete interference for short distances and no interference for large distances)
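The differential equation can also be integrated numerically for any choice of c(r), which is a convenient way to confirm the three closed forms. This is a sketch under stated assumptions: the integrator is a plain fourth-order Runge-Kutta scheme starting from r(0) = 0, and all names are hypothetical.

```python
import math


def integrate_mapping_function(c, x_end, steps=2000):
    """Integrate dr/dx = 1 - 2*r*c(r) from r(0) = 0 using classical RK4."""
    def slope(r):
        return 1.0 - 2.0 * r * c(r)

    h = x_end / steps
    r = 0.0
    for _ in range(steps):
        k1 = slope(r)
        k2 = slope(r + 0.5 * h * k1)
        k3 = slope(r + 0.5 * h * k2)
        k4 = slope(r + h * k3)
        r += h * (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0
    return r


x = 0.3  # map distance in Morgans
r_morgan = integrate_mapping_function(lambda r: 0.0, x)       # c = 0
r_haldane = integrate_mapping_function(lambda r: 1.0, x)      # c = 1
r_kosambi = integrate_mapping_function(lambda r: 2.0 * r, x)  # c = 2r
print(r_morgan, r_haldane, r_kosambi)
```

Swapping in another coincidence function c(r) gives the corresponding mapping function without needing a closed-form solution.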

Kosambi interference

Let us put c(r(x)) = 2r into the basic equation dr(x)/dx = 1 - 2r(x)c(r(x)). Its solution is called the Kosambi mapping function:

$$r = \frac{1}{2}\,\frac{1 - \exp(-4x)}{1 + \exp(-4x)} = \frac{1}{2}\tanh(2x).$$

Clearly, $r(x) \to 0.5$ when $x \to \infty$.

Back transformation of $r = \tfrac{1}{2}\tanh(2x)$ gives $x = \tfrac{1}{4}\ln[(1+2r)/(1-2r)]$.

The formula for combining recombination fractions in adjacent intervals, corresponding to the Kosambi function, takes the following form:

$$r_3 = \frac{r_1 + r_2}{1 + 4 r_1 r_2}.$$

Compare with
$r_3 = r_1 + r_2$ (Morgan's function)
$r_3 = r_1 + r_2 - 2 r_1 r_2$ (Haldane's function)
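The Kosambi function, its inverse, and the addition rule for adjacent intervals can be sketched as below. The function names are hypothetical; distances are in Morgans.

```python
import math


def kosambi_r(x):
    """Recombination rate for map distance x under Kosambi interference."""
    return 0.5 * math.tanh(2.0 * x)


def kosambi_x(r):
    """Map distance for recombination rate r < 0.5 (inverse of kosambi_r)."""
    return 0.25 * math.log((1.0 + 2.0 * r) / (1.0 - 2.0 * r))


def kosambi_add(r1, r2):
    """Recombination rate across two adjacent intervals."""
    return (r1 + r2) / (1.0 + 4.0 * r1 * r2)


x1, x2 = 0.10, 0.25
r1, r2 = kosambi_r(x1), kosambi_r(x2)
r3_from_distance = kosambi_r(x1 + x2)   # add distances, then convert
r3_from_rule = kosambi_add(r1, r2)      # combine recombination rates
print(r1, r2, r3_from_distance, r3_from_rule)
```

The two routes to r3 agree because the addition rule is the hyperbolic-tangent addition formula rewritten in terms of r.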

Some comments on interference and mapping functions

- The assumption of a Poisson distribution is not valid: as a rule, there is one obligate exchange
- Obligate exchange + Poisson model (no interference) → positive interference
- The differential equation approach deals with 3 loci; there is no multilocus extension
- The differential equation approach ignores variation of interference along the chromosome (arm)
- The probabilistic models consider recombination as a serial process: it starts from a point (e.g., the centromere) and proceeds along the arm (maths: renewal processes, with subsequent "switches" at points $t_0, t_1, t_2, t_3$)
- Count-Location approach: (1) count distribution $c = \{c_0, c_1, c_2, \dots\}$ with $c_i \ge 0$, $\sum c_i = 1$ (e.g., Poisson); (2) location function $F_k$ (e.g., even distribution). C-L functions can generate positive and negative interference, depending on c

Sam Karlin & Uri Liberman. Theoretical recombination processes incorporating interference. Theor. Population Biology, 46.
Bailey N.T.J. Introduction to the mathematical theory of genetic linkage. Oxford Univ. Press.

Negative interference (excess of double exchanges) in chromosome 1B of wheat

[Table: loci Xgwm18, Xgwm11, Xgwm413, Xgwm273, Xgwm911; the surviving entries are 4.42***, 5.70*** and 5.47***]

Islands of negative interference (excess of double crossovers) in wheat chromosome 1B (highlighted in red). In fact, negative interference in wheat seems to be a general phenomenon, as it is in barley, Drosophila, and other species.

Four-strand meiotic configurations

Tetrads

In some organisms (e.g., yeasts) all four products of an individual meiosis can be recovered together in what is known as an ascus. These are called tetrads. The four ascospores can be typed for marker loci (e.g., SSRs) "individually". In some cases (e.g., N. crassa) there is one further mitotic division, but the resulting octads are ordered.

Tetrad analysis of recombination

[Figure: with exchange, second-division segregation pattern; no exchange, first-division segregation pattern]