Matthew Mazowita, Lani Haque, and David Sankoff
Stability of Rearrangement Measures in the Comparison of Genome Sequences
Presented by: Charlotte Wagner Searle
What are they trying to do?
Present data-analytic and statistical tools for:
  studying rates of rearrangement of whole genomes
  assessing the stability of these methods as the level of resolution of the genomic data changes
Building on the ideas of Sankoff et al. (1997, 2000, 2005):
  derive an estimator for the number of reciprocal translocations
  use simulations to show that the bias and standard deviation of the estimator are less than 5%
  consider 2 models of random translocation, with and without conservation of the centromere
How?
Construct datasets:
  containing the number of conserved syntenies and conserved segments shared by pairs of animal genomes
  at different levels of resolution (30 kb, 100 kb, 300 kb, and 1 Mb)
Fit the data to an evolutionary tree:
  find the rates of rearrangement
Conserved syntenies vs. conserved segments
Conserved syntenies: pairs of chromosomes, one from each species, containing at least one sufficiently long stretch of homologous sequence
Conserved segments: the total number of such stretches of homologous sequence; that is, regions of chromosomes in two related species in which both gene content and gene order are parallel in the two species (Sankoff, Ferretti and Nadeau, 1997)
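The two counts defined above can be sketched in a few lines of Python. This is only an illustration: the segment list is made up, and the function name `count_syntenies_and_segments` is hypothetical rather than anything from the paper.

```python
from collections import Counter

# Hypothetical homologous stretches, each recorded as
# (chromosome in species A, chromosome in species B).
# These values are invented for illustration, not taken from real data.
segments = [
    ("A1", "B1"),
    ("A1", "B1"),
    ("A1", "B2"),
    ("A2", "B2"),
    ("A2", "B3"),
    ("A2", "B3"),
]

def count_syntenies_and_segments(segments):
    """Return (number of conserved syntenies, number of conserved segments)."""
    per_pair = Counter(segments)          # stretches per chromosome pair
    n_syntenies = len(per_pair)           # chromosome pairs sharing at least one stretch
    n_segments = sum(per_pair.values())   # total number of stretches
    return n_syntenies, n_segments

n_syntenies, n_segments = count_syntenies_and_segments(segments)
```

Here four chromosome pairs share at least one stretch, so there are 4 conserved syntenies, while all 6 stretches count as conserved segments.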
1. Datasets
Secondary data from the UCSC Genome Browser
Species: human, mouse, chimp, rat, dog, and chicken
Different levels of resolution: 30 kb, 100 kb, 300 kb, and 1 Mb
“The key to using whole genome sequences to study evolutionary rearrangements is being able to partition each genome into segments conserved by two genomes since their divergence.”
2. Models of translocation
Model the autosomes of a genome as c linear segments with lengths p(1),…,p(c)
Assume the two breakpoints of a translocation are chosen independently
Do not consider chromosome fusion or fission
REMINDER: a reciprocal translocation between two chromosomes consists of breaking each one at an interior point, creating two segments from each, and rejoining the four resulting segments such that two new chromosomes are produced
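A single reciprocal translocation on chromosome lengths can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' code: breakpoints are assumed to fall on chromosomes with probability proportional to length and uniformly within a chromosome, and when the centromere is not conserved the two rejoining patterns are assumed equally likely. The function name and the example lengths are hypothetical.

```python
import random

def reciprocal_translocation(lengths, rng, conserve_centromere=True):
    """Return a new list of chromosome lengths after one reciprocal translocation."""
    # Assumption: each breakpoint lands on a chromosome with probability
    # proportional to its length; redraw until two different chromosomes are hit.
    while True:
        i, j = rng.choices(range(len(lengths)), weights=lengths, k=2)
        if i != j:
            break
    # Assumption: breakpoints are uniform along each chosen chromosome.
    left_i = rng.uniform(0, lengths[i])
    left_j = rng.uniform(0, lengths[j])
    right_i = lengths[i] - left_i
    right_j = lengths[j] - left_j
    if conserve_centromere or rng.random() < 0.5:
        # Left-hand fragments rejoin right-hand fragments.
        new_i, new_j = left_i + right_j, left_j + right_i
    else:
        # An inverted left-hand fragment rejoins the other left-hand fragment.
        new_i, new_j = left_i + left_j, right_i + right_j
    result = list(lengths)
    result[i], result[j] = new_i, new_j
    return result

rng = random.Random(0)
before = [100.0, 80.0, 60.0]   # hypothetical lengths p(1), p(2), p(3)
after = reciprocal_translocation(before, rng, conserve_centromere=True)
after_free = reciprocal_translocation(before, rng, conserve_centromere=False)
```

In both variants the number of chromosomes and the total genome length are unchanged, consistent with excluding fusion and fission.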
What are the models?
Model with conservation of the centromere:
  left-right orientation on each chromosome
  estimator derived from this version
  simulate to test the estimator
Model without conservation of the centromere:
  inverted left-hand fragment may rejoin another left-hand fragment
  high level of neocentromeric activity
  simulate to test the estimator
3. Prediction and estimation
Assume the process is reversible
Assume the equilibrium state of the process well approximates the distribution of human genomes
Consequence of reversibility: when comparing two genomes, there is no need to consider them as diverging independently from a common ancestor
Equations… (indexed by i and j; the equations themselves are not reproduced in this text)
4. Estimator for genomes with different numbers of chromosomes
Human, mouse, chimp, rat, dog, and chicken have different numbers of chromosomes
Solution? Not good… Therefore, an estimator is needed that takes into account the number of chromosomes in both genomes
1) Process-based estimator
Derived from equation (7)
2) State-based estimator
Based on the expectation over all chromosomes in A
Remember: in Table 2, t̂ is calculated according to equation (9)
5. Simulations
i. Equilibrium distribution of chromosome size
Sankoff and Ferretti (1996): models of accumulated reciprocal translocations for explaining the observed range of chromosome sizes in genome data
  proposed a lower threshold on chromosome size
A cap on the largest chromosome size has also been proposed (Schubert and Oud, 1997) and is effective (De et al., 2001)
Simulate the translocation process 100 times, up to 10,000 translocations each, to produce Figure 1
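A repeated-translocation simulation with size limits can be sketched as below. This is a rough sketch under stated assumptions, not the procedure used for Figure 1: the starting lengths, the threshold and cap values, and the rule of simply rejecting a translocation whose products fall outside the limits are all hypothetical. The demo uses fewer runs and steps than the 100 runs of up to 10,000 translocations mentioned above so that it finishes quickly.

```python
import random

def simulate_run(start_lengths, n_steps, min_size, max_size, rng):
    """Apply up to n_steps random reciprocal translocations to chromosome lengths."""
    lengths = list(start_lengths)
    for _ in range(n_steps):
        # Two breakpoints chosen independently, proportional to chromosome length.
        i, j = rng.choices(range(len(lengths)), weights=lengths, k=2)
        if i == j:
            continue  # both breakpoints on one chromosome: not a translocation
        left_i = rng.uniform(0, lengths[i])
        left_j = rng.uniform(0, lengths[j])
        new_i = left_i + (lengths[j] - left_j)
        new_j = left_j + (lengths[i] - left_i)
        # Assumption: outcomes below the threshold or above the cap are rejected.
        if min(new_i, new_j) < min_size or max(new_i, new_j) > max_size:
            continue
        lengths[i], lengths[j] = new_i, new_j
    return sorted(lengths, reverse=True)

def mean_size_profile(start_lengths, n_runs, n_steps, min_size, max_size, seed=0):
    """Average the sorted chromosome sizes over several independent runs."""
    rng = random.Random(seed)
    runs = [simulate_run(start_lengths, n_steps, min_size, max_size, rng)
            for _ in range(n_runs)]
    return [sum(sizes) / n_runs for sizes in zip(*runs)]

# Hypothetical starting genome: 22 equal chromosomes of length 100 units.
start = [100.0] * 22
profile = mean_size_profile(start, n_runs=20, n_steps=500,
                            min_size=20.0, max_size=300.0)
```

`profile` holds the mean size of the largest chromosome, the second largest, and so on, which is one simple way to summarise the size distribution that the process settles into.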
ii. Performance of the estimators
Process-based estimator
“State-based” estimator
6. Fitting the data to animal phylogeny
Assume the phylogenetic tree in Figure 4 to infer the rates of rearrangement on evolutionary lineages
Fit the data in Table 2 to the tree
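One common way to fit pairwise values to a fixed tree is least squares on the branch values, sketched below. The slides do not say which fitting method was used, so least squares is an assumption here; the four-species tree ((A,B),(C,D)), the pairwise numbers, and the function names are all hypothetical and are not the tree of Figure 4 or the data of Table 2.

```python
def solve_linear_system(matrix, rhs):
    """Solve a square linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[k]] for k, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    solution = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * solution[k] for k in range(r + 1, n))
        solution[r] = (aug[r][n] - tail) / aug[r][r]
    return solution

def fit_branch_values(paths, pairwise, n_branches):
    """Least-squares branch values: each pairwise value is modelled as the
    sum of the branches on the path between the two species."""
    design = [[1.0 if b in path else 0.0 for b in range(n_branches)]
              for path in paths]
    # Normal equations: (X^T X) beta = X^T y
    xtx = [[sum(row[r] * row[c] for row in design) for c in range(n_branches)]
           for r in range(n_branches)]
    xty = [sum(row[r] * y for row, y in zip(design, pairwise))
           for r in range(n_branches)]
    return solve_linear_system(xtx, xty)

# Branch indices: 0..3 are the terminal branches to A, B, C, D; 4 is the internal branch.
paths = [{0, 1}, {0, 4, 2}, {0, 4, 3}, {1, 4, 2}, {1, 4, 3}, {2, 3}]
# Hypothetical pairwise estimates for AB, AC, AD, BC, BD, CD.
pairwise = [5.0, 12.0, 9.0, 11.0, 8.0, 5.0]
branches = fit_branch_values(paths, pairwise, n_branches=5)
```

With these made-up, perfectly additive numbers the fit recovers branch values of 3, 2, 4, 1 for A, B, C, D and 5 for the internal branch; with real data the fit would leave residuals. Dividing each branch value by the time that branch spans would then give a rate, if divergence times are available.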
7. Observations
8. Discussion
Proposed an estimator
The estimator is very accurate in simulations
Applied the estimator to animal genome comparisons at various levels of resolution
Translocation estimate: stable; inversion estimate: increases as resolution becomes finer
At detailed levels of resolution, the translocation number probably reflects other processes as well
The increased inversion numbers are likely to reflect the inversion process
THANK YOU!