Published by Janice Nelson. Modified over 9 years ago.
1
15. Lecture WS 2004/05 Bioinformatics III
V15: genome rearrangement – current status
* Genome comparison mouse – human: syntenic regions
* Breakpoint analysis
* Breakpoint reuse
* Heuristic MGR algorithm
* Comparison of genomes: mouse – rat – human
* Microsyntenies – macrosyntenies
V16: reversal distance problem (Hannenhalli, Tesler, Pevzner) versus using conserved intervals (Bergeron, Stoye)
2
Processes of Genome Evolution
Two genomes may have many genes in common, but the genes may be arranged in a different order or be moved between chromosomes. Such differences in gene order are the result of rearrangement events that are common in molecular evolution, although their frequency is low (only ca. 1 event per million years).
- Substitution
- Insertion
- Deletion
- Translocation
- Inversion / Reversal
- Duplication
3
What is a reversal (= inversion)?
Break and invert:
before:
  A T G C C T G T A C T A
  T A C G G A C A T G A T
after:
  A T G T A C A G G C T A
  T A C A T G T C C G A T
Purines (A, G) and pyrimidines (C, T) switch strands.
Many organisms have highly similar genes but very different gene orders. This is very prominent in prokaryotes, mitochondrial DNA and the mammalian X chromosome.
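The break-and-invert step on a double strand can be sketched in a few lines. This is only an illustration: the function names `invert_segment` and `bottom_strand` and the 0-based, end-exclusive coordinates are assumptions, not part of the lecture.

```python
# Complement table for the four bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def invert_segment(top, start, end):
    """Invert top[start:end] (0-based, end exclusive) on the top strand.

    The segment is replaced by its reverse complement, which is what an
    inversion of the double strand looks like when read on one strand.
    """
    segment = top[start:end]
    flipped = segment.translate(COMPLEMENT)[::-1]
    return top[:start] + flipped + top[end:]

def bottom_strand(top):
    """Complementary strand, written left to right under the top strand."""
    return top.translate(COMPLEMENT)

before = "ATGCCTGTACTA"
after = invert_segment(before, 2, 10)
print(after)                 # ATGTACAGGCTA
print(bottom_strand(after))  # TACATGTCCGAT
```

The coordinates 2 and 10 were chosen so that the output reproduces the strands shown on the slide.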
4
Types of Genome Rearrangements
In unichromosomal genomes, the most common rearrangement events are reversals, in which a contiguous interval of genes is put into the reverse order.
For multichromosomal genomes, the most common rearrangement events are reversals, translocations, fissions, and fusions.
The pairwise genome rearrangement problem is to find an optimal scenario transforming one genome into another via these rearrangement events.
Genomic distance: the number of inversions and translocations needed to transform one genome into another. Fissions and fusions may be included as a special case of translocations in which one of the input or output chromosomes is empty.
5
Representation of a genome
We consider a unichromosomal genome to be a sequence of n genes. The genes are represented by the numbers 1, 2, ..., n. The two orientations of gene i are represented by i and -i. A genome is represented as a signed permutation of the numbers 1, 2, ..., n.
For example, a unichromosomal genome with n = 5 genes is
5 -3 4 2 -1
6
Multichromosomal Genome
A multichromosomal genome consists of n genes spread over m chromosomes. We represent it as a signed permutation of 1, 2, ..., n, with delimiters "$" or ";" inserted between the chromosomes.
For example, a genome with 12 genes spread over 3 chromosomes is
7 -2 8 3 $ 5 9 -6 -1 12 $ 11 4 10 $
The order of the chromosomes and the direction of each chromosome do not matter in the multichromosomal algorithms. Thus, we could represent this same genome by flipping the first chromosome (reversing the order of its entries and negating them) and then moving the last chromosome to the beginning:
11 4 10 $ -3 -8 2 -7 $ 5 9 -6 -1 12 $
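One way to check that two such strings describe the same genome is to bring both into a canonical form. The sketch below is a minimal illustration; `parse_genome`, `flip` and `canonical` are hypothetical helper names, and choosing the lexicographically smaller orientation is just one convenient convention.

```python
def parse_genome(text):
    """Split a '$'- or ';'-delimited string into chromosomes (tuples of signed ints)."""
    chromosomes = []
    for part in text.replace(";", "$").split("$"):
        genes = tuple(int(token) for token in part.split())
        if genes:  # skip the empty piece after the final delimiter
            chromosomes.append(genes)
    return chromosomes

def flip(chromosome):
    """Reverse the gene order and negate every gene."""
    return tuple(-g for g in reversed(chromosome))

def canonical(genome):
    """Pick one orientation per chromosome, then sort the chromosomes.

    After this step neither chromosome order nor chromosome direction matters.
    """
    return sorted(min(c, flip(c)) for c in genome)

a = parse_genome("7 -2 8 3 $ 5 9 -6 -1 12 $ 11 4 10 $")
b = parse_genome("11 4 10 $ -3 -8 2 -7 $ 5 9 -6 -1 12 $")
print(canonical(a) == canonical(b))  # True
```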
7
Unichromosomal genomes: sorting by reversal
A reversal in a signed permutation is an operation that takes an interval in a permutation, reverses the order of the numbers, and changes all their signs. For example,
5 1 3 2 -9 7 -4 6 8  →  5 1 -7 9 -2 -3 -4 6 8
The reversal distance between two genomes is the minimum number of reversals it takes to get from one genome to the other. For a given pair of genomes, the reversal distance is unique, but there are usually many possible reversal scenarios with this distance.
However, this mathematical notion of reversal distance can of course underestimate the actual number of steps that occurred biologically.
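A signed reversal is short enough to write out directly. The sketch below assumes 1-based, inclusive positions; the function name `reversal` is illustrative.

```python
def reversal(perm, i, j):
    """Apply a signed reversal to positions i..j (1-based, inclusive).

    The block is reversed in order and every sign in it is flipped.
    """
    block = [-g for g in reversed(perm[i - 1:j])]
    return perm[:i - 1] + block + perm[j:]

genome = [5, 1, 3, 2, -9, 7, -4, 6, 8]
print(reversal(genome, 3, 6))  # [5, 1, -7, 9, -2, -3, -4, 6, 8]
```

Applying the same reversal twice restores the original genome, so every step of a reversal scenario can be undone.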
8
Multichromosomal genomes: rearrangement operations
We treat four elementary rearrangement events in multichromosomal genomes: reversals, translocations, fusions, and fissions.
Reversal: An interval within a single chromosome may be reversed in the same fashion as a reversal acts in the unichromosomal case. Here the last three genes of chromosome 2 are reversed:
before: 7 -2 8 3 $ 5 9 -6 -1 12 $ 11 4 10 $
after:  7 -2 8 3 $ 5 9 -12 1 6 $ 11 4 10 $
Note: When the GRIMM program is run in unichromosomal mode, the genomes 3 1 2 and -2 -1 -3 are considered different (one reversal apart, distance = 1), while in multichromosomal mode those same genomes are considered equivalent (distance = 0), because flipping an entire chromosome gives an equivalent genome in multichromosomal mode.
9
Translocation
Two chromosomes "A B" and "C D" may be rearranged into "A D" and "C B". (The letters A, B, C, D stand for sequences of genes.) Because flipping chromosomes does not alter a genome (only its representation is altered), "A -C" and "-B D" is another possible translocation. (-B means to reverse the order of the genes in sequence B and negate each one.)
For example, a translocation on chromosomes 1 and 3 is
before: 7 -2 8 3 $ 5 9 -6 -1 12 $ 11 4 10 $
after:  7 -2 8 -4 -11 $ 5 9 -6 -1 12 $ -3 10 $
10
Fusion & Fission
Fusion: Two chromosomes may be fused together into a single chromosome. Due to chromosome flipping, there are 4 distinct fusions possible between each pair of chromosomes. Here is one of the fusions between chromosomes 1 and 3:
before: 7 -2 8 3 $ 5 9 -6 -1 12 $ 11 4 10 $
after:  7 -2 8 3 -10 -4 -11 $ 5 9 -6 -1 12 $
Fission: A chromosome may be broken into two chromosomes between any pair of adjacent genes:
before: 7 -2 8 3 $ 5 9 -6 -1 12 $ 11 4 10 $
after:  7 -2 8 3 $ 5 9 $ -6 -1 12 $ 11 4 10 $
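The three multichromosomal operations can be sketched on chromosomes stored as tuples of signed integers. The helper names (`flip`, `translocate`, `fuse`, `fission`) and the cut-position parameters are assumptions made for this illustration; they are not taken from GRIMM.

```python
def flip(chrom):
    """Reverse the gene order and negate every gene."""
    return tuple(-g for g in reversed(chrom))

def translocate(x, y, cut_x, cut_y, flip_second=False):
    """Cut x = A B after cut_x genes and y = C D after cut_y genes, then exchange ends.

    The default returns (A D, C B); flip_second=True returns (A -C, -B D).
    """
    a, b = x[:cut_x], x[cut_x:]
    c, d = y[:cut_y], y[cut_y:]
    if flip_second:
        return a + flip(c), flip(b) + d
    return a + d, c + b

def fuse(x, y):
    """Join two chromosomes end to end (flip one first for the other variants)."""
    return x + y

def fission(x, cut):
    """Break a chromosome after the first `cut` genes."""
    return x[:cut], x[cut:]

chr1, chr2, chr3 = (7, -2, 8, 3), (5, 9, -6, -1, 12), (11, 4, 10)
print(translocate(chr1, chr3, 3, 2, flip_second=True))  # ((7, -2, 8, -4, -11), (-3, 10))
print(fuse(chr1, flip(chr3)))                           # (7, -2, 8, 3, -10, -4, -11)
print(fission(chr2, 2))                                 # ((5, 9), (-6, -1, 12))
```

The three calls reproduce the translocation, fusion and fission examples from the slides.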
11
Signed and unsigned genomes
Most comparative mapping techniques determine the physical locations and relative order of genes in each chromosome, but do not determine which of the two orientations each gene has. Current sequencing methods do provide the orientations.
It turns out that the genome rearrangement problem (uni- and multichromosomal) for unsigned permutations is NP-hard, but the same problems for signed data can be solved in polynomial time.
Fortunately, with many genomes currently being sequenced, it is likely that many comparative maps (corresponding to unsigned permutations) will soon be replaced by sequencing data (corresponding to signed permutations).
12
Multichromosomal genomes: rearrangement operations
For example, turning the unsigned genome 1 2 3 4 5 into the unsigned genome 1 4 3 2 5 requires one unsigned reversal. Signs can be assigned in the source and destination genomes so that the signed reversal scenario requires this same number of steps. Here, we get
1 2 3 4 5  →  1 -4 -3 -2 5
which also takes one step. Note that there may be other sign assignments taking this minimum number of steps.
13
Multichromosomal genomes: rearrangement operations
It is possible that correctly signed data would have increased the number of steps:
1 2 3 4 5  →  1 -4 -3 -2 5  →  1 -4 3 -2 5
If the data collection method did not determine signs, it is impossible to know mathematically whether the one-step or the two-step scenario is more biologically accurate; the mathematical problem the genome rearrangement programs solve is to find the signs giving the minimum possible distance.
14
A biological model case
cabbage: 8 7 6 5 4 3 2 1 11 10 9
turnip:  4 3 2 8 7 1 5 6 11 10 9
Palmer and Herbon found that the mitochondrial genomes in cabbage and turnip had very similar gene sequences, but fairly different gene orders. How can a "transformation" of cabbage into turnip be designed?
The mitochondrial DNA of cabbage and turnip is composed of five conserved blocks of genes that are shuffled in cabbage as compared to turnip. Every conserved block has a direction that is shown by a + or – sign.
15
Inversion, Transposition and inverted Transposition
[Figure: three panels illustrating inversion, transposition, and inverted transposition]
16
Sorting by Reversals
Cabbage: 8 7 6 5 4 3 2 1 11 10 9
         8 2 3 4 5 6 7 1 11 10 9
         8 2 3 4 5 1 7 6 11 10 9
         4 3 2 8 5 1 7 6 11 10 9
Turnip:  4 3 2 8 7 1 5 6 11 10 9
17
Permutation ($\pi$): an ordered arrangement of the set {1, 2, …, n}
Reversal ($\rho$): a rearrangement that inverts a block in $\pi$
{3 4 7 6 1 5 2} · $\rho(3,6)$ = {3 4 5 1 6 7 2}
Signed permutation: a permutation in which the elements are oriented; a reversal also switches the orientation of the elements in the block
{+3 -4 +7 -6 +1 -5 +2} · $\rho(3,6)$ = {+3 -4 +5 -1 +6 -7 +2}
18
easy to do by eye...
[Figure: the cabbage-to-turnip reversal series, with the successive steps labelled $\rho_1$, $\rho_1\rho_2$, $\rho_1\rho_2\rho_3$, ...]
$\pi \cdot \rho_1 \rho_2 \cdots \rho_t = \sigma$ and $\sigma \cdot \rho_t \cdots \rho_2 \rho_1 = \pi$
19
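Formal Approach: Sorting by Reversals
The order of genes in 2 organisms is represented by permutations $\pi = \pi_1 \pi_2 \ldots \pi_n$ and $\sigma = \sigma_1 \sigma_2 \ldots \sigma_n$.
A reversal $\rho(i,j)$ of an interval [i,j] is the permutation
$$\rho(i,j) = \begin{pmatrix} 1 & 2 & \cdots & i-1 & i & i+1 & \cdots & j-1 & j & j+1 & \cdots & n \\ 1 & 2 & \cdots & i-1 & j & j-1 & \cdots & i+1 & i & j+1 & \cdots & n \end{pmatrix}$$
$\pi \cdot \rho(i,j)$ has the effect of reversing the order of $\pi_i \pi_{i+1} \ldots \pi_j$ and transforming $\pi_1 \ldots \pi_{i-1} \pi_i \ldots \pi_j \pi_{j+1} \ldots \pi_n$ into $\pi \cdot \rho(i,j) = \pi_1 \ldots \pi_{i-1} \pi_j \ldots \pi_i \pi_{j+1} \ldots \pi_n$.
Given permutations $\pi$ and $\sigma$, the reversal distance problem is to find a series of reversals $\rho_1 \rho_2 \ldots \rho_t$ such that $\pi \cdot \rho_1 \rho_2 \cdots \rho_t = \sigma$ and t is minimal.
t is called the reversal distance between $\pi$ and $\sigma$.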
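For very small permutations the definition can be checked by brute force. The sketch below is not one of the polynomial-time algorithms for signed data; it is a plain breadth-first search over unsigned permutations, with illustrative function names, and it is only usable for a handful of genes because the search space grows as n!.

```python
from collections import deque

def neighbours(perm):
    """All permutations reachable from perm by one unsigned reversal."""
    n = len(perm)
    for i in range(n):
        for j in range(i + 1, n):  # a single-element block changes nothing
            yield perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]

def reversal_distance(source, target):
    """Minimum number of unsigned reversals turning source into target."""
    source, target = tuple(source), tuple(target)
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        perm, dist = queue.popleft()
        if perm == target:
            return dist
        for nxt in neighbours(perm):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("target is not a rearrangement of source")

print(reversal_distance((1, 2, 3, 4, 5), (1, 4, 3, 2, 5)))  # 1
print(reversal_distance((3, 1, 2), (1, 2, 3)))              # 2
```

Because breadth-first search visits permutations in order of increasing number of steps, the first time the target is reached gives the minimal t.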
20
Reconstruction of phylogenetic trees from WG data
1. Phylogeny reconstruction as an optimization problem? Attempt to reconstruct an evolutionary scenario with a minimum number of permitted evolutionary events (e.g. duplications, insertions, deletions, inversions, transpositions) on a tree. All known approaches are NP-hard. Also, no automated tool exists so far.
2. Estimate leaf-to-leaf distances (based on some metric) between all genomes. Then use a standard distance-based method such as neighbour-joining to construct the tree. Such approaches are quite fast but cannot recover the ancestral gene order.
2a. Breakpoint phylogeny (Blanchette & Sankoff) for the special case in which the genomes all have the same set of genes, and each gene appears once. Use the breakpoint distances as the distance matrix.
21
Reversal distance problem
The reversal distance for a pair of genomes can be computed in polynomial time (Hannenhalli & Pevzner 1999 and others; also see the Bioinformatics 1 lecture). However, its use in studies of multiple genome rearrangements was somewhat limited, since it was not clear how to combine pairwise rearrangement scenarios into a multiple rearrangement scenario.
In particular, Caprara (1999) demonstrated that even the simplest version of the Multiple Genome Rearrangement Problem, the Median Problem, is NP-hard.
Therefore, this line of research was abandoned for a while in favor of the breakpoint analysis approach (see Blanchette & Sankoff). The existing tools BPAnalysis and GRAPPA use the so-called breakpoint distance to derive rearrangement scenarios.
22
Breakpoint phylogeny
When each genome has the same set of genes and each gene appears exactly once, a genome can be described by a (circular or linear) ordering (= permutation) of these genes. Each gene has either positive ($g_i$) or negative ($-g_i$) orientation.
Given 2 genomes G and G' on the same set of genes, a breakpoint in G is defined as an ordered pair of genes $(g_i, g_j)$ such that $g_i$ and $g_j$ appear consecutively in that order in G, but neither $(g_i, g_j)$ nor $(-g_j, -g_i)$ appears consecutively in that order in G'.
The breakpoint distance between two genomes is simply the number of breakpoints between that pair of genomes.
The breakpoint score of a tree in which each node is labelled by a signed ordering of genes is then the sum of the breakpoint distances along the edges of the tree.
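The breakpoint definition translates almost literally into code. The sketch below assumes linear chromosomes without end markers, so only internal adjacencies are counted; `adjacencies` and `breakpoint_distance` are illustrative names.

```python
def adjacencies(genome):
    """Ordered gene pairs of a linear signed genome, read on either strand.

    The pair (a, b) read left to right is the same adjacency as (-b, -a)
    read on the opposite strand, so both are stored.
    """
    pairs = set()
    for a, b in zip(genome, genome[1:]):
        pairs.add((a, b))
        pairs.add((-b, -a))
    return pairs

def breakpoint_distance(g, g_prime):
    """Number of consecutive pairs of g that are not adjacencies of g_prime."""
    present = adjacencies(g_prime)
    return sum((a, b) not in present for a, b in zip(g, g[1:]))

print(breakpoint_distance([1, 2, 3, 4, 5], [1, -4, -3, -2, 5]))  # 2
```

In this example the pairs (1, 2) and (4, 5) are breakpoints, while (2, 3) and (3, 4) survive inside the inverted block.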
23
Breakpoint Graph
Sorting a permutation by reversals is a hard problem. Breakpoints were introduced by Watterson et al. (1982) and by Nadeau and Taylor (1984), and correlations were noticed between the reversal distance and the number of breakpoints.
Let $i \sim j$ if $|i - j| = 1$. Extend a permutation $\pi = \pi_1 \pi_2 \ldots \pi_n$ by adding $\pi_0 = 0$ and $\pi_{n+1} = n + 1$. We call a pair of elements $(\pi_i, \pi_{i+1})$, $0 \le i \le n$, of $\pi$ an adjacency if $\pi_i \sim \pi_{i+1}$, and a breakpoint if $\pi_i \not\sim \pi_{i+1}$.
Example: 2 3 1 4 6 5 7 is extended to 0 2 3 1 4 6 5 7 8. The adjacencies are (2,3), (6,5) and (7,8); the breakpoints are (0,2), (3,1), (1,4), (4,6) and (5,7).
As the identity permutation has no breakpoints, sorting by reversals corresponds to eliminating breakpoints. The observation that every reversal can eliminate at most 2 breakpoints implies that the reversal distance satisfies $d(\pi) \ge b(\pi)/2$, where $b(\pi)$ is the number of breakpoints in $\pi$. However, this lower bound is a very inaccurate estimate of the reversal distance.
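Counting breakpoints of an unsigned permutation and deriving the lower bound takes only a few lines. The function name `count_breakpoints` is illustrative.

```python
def count_breakpoints(perm):
    """Number of breakpoints b(pi) of an unsigned permutation of 1..n."""
    extended = [0] + list(perm) + [len(perm) + 1]  # add pi_0 = 0 and pi_{n+1} = n + 1
    return sum(abs(a - b) != 1 for a, b in zip(extended, extended[1:]))

pi = [2, 3, 1, 4, 6, 5, 7]
b = count_breakpoints(pi)
lower_bound = (b + 1) // 2  # d(pi) >= b(pi) / 2, rounded up because d is an integer
print(b, lower_bound)       # 5 3
```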
24
Breakpoint Graph
The breakpoint graph of a permutation $\pi$ is an edge-colored graph $G(\pi)$ with n + 2 vertices $\{\pi_0, \pi_1, \ldots, \pi_n, \pi_{n+1}\} = \{0, 1, \ldots, n, n+1\}$.
We join vertices $\pi_i$ and $\pi_{i+1}$ by a black edge for $0 \le i \le n$.
We join vertices $\pi_i$ and $\pi_j$ by a gray edge if $\pi_i \sim \pi_j$.
[Figure: black path and gray path drawn on the vertices 0 2 3 1 4 6 5 7 ...; their superposition forms the breakpoint graph]
A breakpoint graph is obtained by superposing a black path traversing the vertices 0, 1, ..., n, n+1 in the order given by the permutation $\pi$ and a gray path traversing the vertices in the order given by the identity permutation.
more next week...
25
Comparison of mouse and man at genome level
Key findings:
* The mouse genome is about 14% smaller than the human genome. The difference probably reflects a higher rate of deletion in mouse.
* Over 90% of the mouse and human genomes can be partitioned into corresponding regions of conserved synteny (segments in which the gene order in the most recent common ancestor has been conserved in both species).
* At the nucleotide level, ca. 40% of the human genome can be aligned to the mouse genome. These sequences seem to represent most of the orthologous sequences that remain in both lineages from the common ancestor. The rest was probably deleted in one or both genomes.
* The neutral substitution rate has been roughly half a nucleotide substitution per site since the divergence of the species. About twice as many of these substitutions have occurred in mouse as in human.
See the paper of the Mouse Genome Sequencing Consortium, "Initial sequencing and comparative analysis of the mouse genome", Nature 420, 520-562 (5.12.2002). Excellent paper, very readable!
26
Comparison of mouse and man at genome level
Key findings:
* The proportion of small (50-100 bp) segments in the mammalian genome that is under (purifying) selection is ca. 5%, i.e. much higher than can be explained by protein-coding sequences alone. The genome therefore contains many additional features (UTRs, regulatory elements, non-protein-coding genes, chromosomal structural elements) under selection for biological function.
* The mammalian genome is evolving in a non-uniform manner; various measures of divergence show substantial variation across the genome.
* Mouse and human genomes each seem to contain ca. 30,000 protein-coding genes. The proportion of mouse genes with a single identifiable orthologue in the human genome is ca. 80%. The proportion of mouse genes without any homologue currently detectable in the human genome (and vice versa) is < 1%.
27
Conservation of synteny between human and mouse
Starting from a common ancestral genome approximately 75 million years ago, the human and mouse genomes have each been shuffled by chromosomal rearrangements. The rate of these changes is low enough that local gene order remains largely intact (ca. 3.2 chromosomal rearrangements per million years in mouse, and 1.6 per million years in human).
In their pioneering paper, Nadeau and Taylor (1984) estimated that the mouse and human genomes could be parsed into roughly 180 syntenic regions, a surprisingly small number. This led to the random-breakage model.
Today, gene-based synteny maps define about 200 - 500 syntenic regions, depending on the minimal segment length considered.
The mouse genome. Nature 420, 520 - 562
28
Detect syntenic regions with PatternHunter
- Perform a sequence comparison of the entire mouse and human genome sequences to identify regions with a high similarity score > 40 (corresponding to a 40-base perfect match, with penalties for mismatches and gaps).
- Also require that each sequence is the other's unique match above this threshold. Such regions probably reflect orthologous sequence pairs.
About 558,000 pairs were found, with a mean spacing of 4.4 kb and an N50 length of ca. 500 kb. Together they make up 7.5% of the mouse genome. But there may be many more that have evolved too quickly to be detected.
Use RepeatMasker to remove repeats (breakpoint analysis requires unique matches between genomes).
The mouse genome. Nature 420, 520 - 562
29
Identify regions of conserved synteny
Syntenic segment: maximal region in which a series of landmarks occur in the same order on a single chromosome in both species.
Syntenic block: one or more syntenic segments that are all adjacent on the same chromosome in human and on the same chromosome in mouse; they may otherwise be shuffled with respect to order and orientation. (Only regions > 300 kb are considered.)
Each genome could be parsed into a total of 342 conserved syntenic segments. On average, each landmark resides in a segment containing 1600 other landmarks. Segments vary greatly in length: 303 kb – 64.9 Mb.
About 90.2% of the human and 93.3% of the mouse genome unambiguously reside within conserved syntenic segments.
The mouse genome. Nature 420, 520 - 562
30
Conservation of synteny between human and mouse
A typical 510-kb segment of mouse chromosome 12 that shares common ancestry with a 600-kb section of human chromosome 14 is shown. Blue lines connect the reciprocal unique matches in the two genomes. The cyan bars represent sequence coverage in each of the two genomes for the regions. In general, the landmarks in the mouse genome are more closely spaced, reflecting the 14% smaller overall genome size.
The mouse genome. Nature 420, 520 - 562
31
Correspondence of syntenic regions
Segments and blocks >300 kb in size with conserved synteny in human are superimposed on the mouse genome. Each colour corresponds to a particular human chromosome. The 342 segments are separated from each other by thin, white lines within the 217 blocks of consistent colour.
The mouse genome. Nature 420, 520 - 562
32
Dot plots of conserved syntenic segments
For each of three human (a–c) and mouse (d–f) chromosomes, the positions of orthologous landmarks are plotted along the x axis and the corresponding position of the landmark on chromosomes in the other genome is plotted on the y axis. Different chromosomes in the corresponding genome are differentiated with distinct colours.
In a remarkable example of conserved synteny, human chromosome 20 (a) consists of just three segments from mouse chromosome 2 (d), with only one small segment altered in order. Human chromosome 17 (b) also shares segments with only one mouse chromosome (11) (e), but the 16 segments are extensively rearranged.
However, most of the mouse and human chromosomes consist of multiple segments from multiple chromosomes, as shown for human chromosome 2 (c) and mouse chromosome 12 (f). Circled areas and arrows denote matching segments in mouse and human.
The mouse genome. Nature 420, 520 - 562
33
Size distribution of elements with conserved synteny
Size distribution of segments and blocks with synteny conserved between mouse and human. The number of segments (a) and blocks (b) with synteny conserved between mouse and human in 5-Mb bins (starting with 0.3–5 Mb) is plotted on a logarithmic scale. The dots indicate the expected values for the exponential curve of random breakage given the number of blocks and segments, respectively.
The mouse genome. Nature 420, 520 - 562
34
Genome rearrangement?
Using the Pevzner & Tesler algorithms, one can compute the minimal number of rearrangements needed to "transform" one genome into the other. When applied to the 342 syntenic segments, the most parsimonious (= shortest) path has 295 rearrangements.
The analysis suggests that chromosomal breaks may have a tendency to reoccur in certain regions.
With only two species, however, it is not yet possible to recover the ancestral chromosomal order or to reconstruct the precise pathway of rearrangements. This is only possible when more than 2 mammalian species are considered.
The mouse genome. Nature 420, 520 - 562
35
Genome Rearrangements: Synteny
(a) Human and mouse synteny blocks of conserved gene order. Every block corresponds to a rectangle, with a diagonal showing whether the arrangements of anchors in human and mouse (within the synteny block) are the same or reversed.
(b) Combining anchors into clusters by the GRIMM-Synteny algorithm at G = 100 kb. The edges in the anchor graph connect the closest ends of the anchors. The anchors are color-coded by the resulting clusters. At G = 1 Mb, this forms a single cluster, which in turn forms a synteny block (the lower right block in the human 18/mouse 17 rectangle in a).
Pevzner, Tesler, Genome Res 13, 37 (2003)
36
From Anchors to Breakpoint Graphs
X chromosome: from local similarities, to synteny blocks, to breakpoint graph, to rearrangement scenario.
(a) Dot plot of anchors. Anchors are enlarged for visibility.
(b) Clusters of anchors.
(c) Rectified clusters.
(d) Synteny blocks.
(e) Synteny blocks (symbolic representation as genome rearrangement units; after rescaling, each block has the same length on the x- and y-axis).
(f) 2D breakpoint graph superimposed on synteny blocks. The projections of the 2D graph onto the human and mouse axes form the conventional breakpoint graphs.
(g) 2D breakpoint graph. The four cycles in the breakpoint graph are shown in different colors.
(h) A most parsimonious rearrangement scenario for the human and mouse X chromosomes.
Pevzner, Tesler, Genome Res 13, 37 (2003)
37
Genome Rearrangements
Construction of the breakpoint graph from synteny blocks.
(a) Solid path through human.
(b) Dotted path through mouse.
(c) Superposition of paths.
(d) Remove blocks to obtain cycles.
Pevzner, Tesler, Genome Res 13, 37 (2003)
38
Multichromosomal breakpoint graph
Multichromosomal breakpoint graph of the whole human and mouse genomes. The conventional chromosome order and orientation are not suitable for such graphs; an optimal chromosome order and orientation were determined by the algorithm in Tesler (2002). Three "null chromosomes," N1, N2, N3, were added to mouse to equalize the number of chromosomes in the two genomes.
Pevzner, Tesler, Genome Res 13, 37 (2003)
39
Multiple Genome Rearrangement Problem
Find a phylogenetic tree describing the most "plausible" rearrangement scenario for multiple species.
The genomic distance in the case of genome rearrangement is defined in terms of (1) reversals, (2) translocations, (3) fusions, and (4) fissions, which are the most common rearrangement events in multichromosomal genomes.
The special case of three genomes (m = 3) is called the Median Problem: given the gene order of three unichromosomal genomes $G_1$, $G_2$, and $G_3$, find the ancestral genome A which minimizes the total reversal distance $d(A, G_1) + d(A, G_2) + d(A, G_3)$.
40
Multiple Genome Rearrangement Problem
New approach: Given a set of m permutations (existing genomes) of order n, find a tree T with the m permutations as leaf nodes and assign permutations (ancestral genomes) to internal nodes such that D(T) is minimized, where $D(T) = \sum_{(\pi,\sigma) \in T} d(\pi,\sigma)$ is the sum of reversal distances over all edges of the tree.
Breakpoint analysis attempts to solve the Median Problem by minimizing the breakpoint distance instead of the reversal distance. However, the breakpoint distance, in contrast to the reversal distance, does not correspond to a minimum number of rearrangement events. As a result, the breakpoint median recovered by breakpoint analysis rarely corresponds to the ancestral median, the genome that minimizes the overall number of rearrangements in the evolutionary scenario.
41
New algorithm
Aim: Among all possible reversals for each of the three genomes, identify good reversals. A good reversal in a genome $G_1$ is a reversal that brings the genome closer to the ancestral genome. But since the ancestor is unknown, it is unclear how to find good reversals.
Instead: assume that reversals that reduce both the reversal distance between $G_1$ and $G_2$ and the reversal distance between $G_1$ and $G_3$ are likely to be good reversals.
With $\Delta(\rho) = [d(G_1, G_2) + d(G_1, G_3)] - [d(G_1 \cdot \rho, G_2) + d(G_1 \cdot \rho, G_3)]$ as the overall reduction in the reversal distances, the reversal $\rho$ is good if $\Delta(\rho) = 2$.
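The notion of a good reversal can be tried out on toy genomes. The sketch below is not the MGR implementation: the reversal distance is obtained by an exhaustive breadth-first search, which is only feasible for a handful of genes, and all function names are illustrative.

```python
from collections import deque

def reversals(g):
    """Yield ((i, j), result) for every signed reversal of positions i..j (0-based, inclusive)."""
    for i in range(len(g)):
        for j in range(i, len(g)):
            yield (i, j), g[:i] + tuple(-x for x in reversed(g[i:j + 1])) + g[j + 1:]

def distance(a, b):
    """Brute-force signed reversal distance by breadth-first search (tiny n only)."""
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        g, d = queue.popleft()
        if g == b:
            return d
        for _, nxt in reversals(g):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    raise ValueError("genomes do not share the same gene set")

def good_reversals(g1, g2, g3):
    """Reversals of g1 that reduce the distance to both g2 and g3, i.e. delta = 2."""
    base = distance(g1, g2) + distance(g1, g3)
    good = []
    for span, nxt in reversals(g1):
        delta = base - (distance(nxt, g2) + distance(nxt, g3))
        if delta == 2:
            good.append(span)
    return good

g1, g2, g3 = (1, -3, -2, 4), (1, 2, 3, 4), (1, 2, 3, 4)
print(good_reversals(g1, g2, g3))  # [(1, 2)]
```

Since a single reversal changes each pairwise distance by at most 1, a reduction of 2 means the reversal moves $G_1$ one step closer to both other genomes at once.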
42
New algorithm
Iteratively carry out these good rearrangements until the genomes $G_1$, $G_2$, and $G_3$ are transformed into an identical genome, hoping that this is the most likely "ancestral median".
When we are dealing with multichromosomal genomes and with four different types of rearrangements, ambiguous situations may occur too.
43
Ambiguities again possible
E.g. $G_1$ = 1 2 3 4 5, $G_2$ = 1 2 -5 -4 -3, $G_3$ = 1 2 $ 3 4 5
The parsimony principle does not allow the evolutionary scenario to be reconstructed unambiguously. If the ancestor coincides with $G_1$, then a reversal occurred on the way to $G_2$, and a fission occurred on the way to $G_3$. One can just as well take $G_2$ or $G_3$ as the ancestor.
This kind of ambiguity does not exist for unichromosomal genomes because, there, it is impossible to find 3 genomes that would all be within one reversal of each other.
44
Strategy for choosing reversals
Therefore one has to select carefully among the good rearrangements. Observe that in most genomes of interest reversals and translocations are more common than fusions and fissions. Therefore, as a rule, always select reversals/translocations before fusions/fissions.
Often, the list of good reversals contains nonoverlapping reversals, and the order in which these reversals are performed is often irrelevant.
For each good reversal $\rho$, compute the number of good reversals $n(\rho)$ that will be available if $\rho$ is carried out. Then choose the good reversal with the maximal $n(\rho)$ to be carried out.
If we run out of good reversals before reaching a solution, the best reversal to be taken will be the result of a depth-k search minimizing the total pairwise rearrangement distances.
45
How good a measure is the reversal distance?
The authors claim that the reversal distance is a good approximation of the true distance for many biologically relevant cases.
Let $\sigma$ be a genome that evolved from a genome $\pi$ by k reversals, i.e. the true distance between $\pi$ and $\sigma$ is k. We say that $\pi$ and $\sigma$ form a valid pair if $d(\pi, \sigma) = k$. Otherwise we say that $d(\pi, \sigma)$ underestimates the true distance.
Typically two genomes form a valid pair if the number of rearrangements between them is relatively small, which is exactly the case in a number of genome rearrangement studies.
46
Reversal distance vs. true distance
Reversal distance, $d(\pi, \sigma)$, versus the actual number of reversals performed to transform $\pi$ into $\sigma$, where $\sigma$ is a genome/permutation that evolved from the identity permutation $\pi$ = 1, 2, ..., 100 by k random reversals. The simulations were repeated 10 times for every k. Shown is the average difference between the reversal distance and the actual number of reversals performed (k).
For a genome with n = 100 markers, the reversal distance approximates the true distance very well as long as the number of reversals remains below 0.4 n. This is the case in many biologically relevant cases.
Bourque, Pevzner, Genome Res (2002)
47
Test on simulated data
Start from the identity permutation A with n genes/markers (n = 30 or 100). Perform k reversals to get genome $G_1$, k to get genome $G_2$, and k to get genome $G_3$. Use these as input to MGR-MEDIAN and GRAPPA. Check whether the programs reconstruct the ancestral identity permutation.
The simulations were repeated 10 times for every ratio #reversals/#markers = 3k/n.
48
Comparison of MGR-MEDIAN and GRAPPA
(a) and (b) show the average difference between the number of reversals on the tree recovered by the algorithm and the number of reversals on the actual tree (equal to 3k). (c) and (d) show the average reversal distance between the solution recovered and the actual ancestor.
GRAPPA and MGR-MEDIAN produce very similar solutions for r < 0.20. As the ratio r increases, GRAPPA starts making errors. MGR-MEDIAN sometimes finds solutions that even have fewer reversals than the actual ancestor. Reason: for increasing r, the assumption that the ancestor corresponds to the most parsimonious scenario sometimes fails.
Bourque, Pevzner, Genome Res (2002)
49
Tests on simulated data: non-equidistant genomes
The genomes $G_1$, $G_2$, and $G_3$ are obtained by k, k, and 2k reversals, respectively, from the ancestral identity permutation 1 2 ... n (n = 30 and n = 100). The simulations were repeated 10 times for every ratio #reversals/#markers = 4k/n.
Figs. (a) - (d) have the same meaning as on the previous slide. The same behavior is found.
Also tested for 4 – 10 genomes. GRAPPA cannot handle more than 10 genomes because the tree space is too large.
Bourque, Pevzner, Genome Res (2002)
50
Herpesvirus Data
Herpes simplex virus (HSV), Epstein-Barr virus (EBV), and Cytomegalovirus (CMV) gene orders (Hannenhalli et al. 1995) as well as the ancestral gene order (A) and optimal evolutionary scenario recovered by MGR-MEDIAN.
MGR finds a solution with 7 reversals; GRAPPA finds one with 8 reversals. Here, the ratio r of #reversals / #markers is 7/25 = 0.28.
Bourque, Pevzner, Genome Res (2002)
51
mtDNA of human, fruit fly, and sea urchin
Human, sea urchin, and fruit fly mitochondrial gene orders taken from Sankoff et al. (1996). A is the ancestral gene order suggested by MGR-MEDIAN. The solution found differs from that of Sankoff et al., but the total reversal distance (39) is the same.
Here, the ratio of #reversals / #markers is 39/33 = 1.18, marking this as a difficult problem.
Running GRAPPA on these genomes gives a solution with a total reversal distance of 43.
Bourque, Pevzner, Genome Res (2002)
52
Metazoan mtDNA data
Data (36 common genes) of 11 metazoan genomes that were studied before by BPA. Shown here: the phylogeny reconstructed by MGR.
The genomes come from 6 major metazoan groupings: nematodes (NEM), annelids (ANN), mollusks (MOL), arthropods (ART), echinoderms (ECH), and chordates (CHO). Numbers show the number of reversals (150 in total).
The tree is very similar to that of Blanchette et al., which was constructed in a semiautomated fashion. After 48 CPU hours, GRAPPA finds three optimal trees with 175 reversals and 200 breakpoints.
Bourque, Pevzner, Genome Res (2002)
53
Campanulaceae cpDNA data
Campanulaceae chloroplast data set with 13 cpDNAs and 105 markers. The tree space for 13 genomes cannot be searched exhaustively by GRAPPA. Therefore, the trees were constrained by Moret et al. (2001). They found 216 trees with a total of 67 reversals.
MGR (without using constraints) gives a tree with 65 reversals. The tree topology corresponds to the GRAPPA tree, but the labelling of internal nodes differs.
Bourque, Pevzner, Genome Res (2002)
54
15. Lecture WS 2004/05Bioinformatics III54 Nadeau & Taylor model (1984) - suggested the presence of conserved segments (i.e., segments whose gene order is preserved, undisrupted by rearrangements) - estimated that there are ca. 180 conserved segments in human and mouse - provided convincing evidence for the random breakage model of genomic evolution postulated by Ohno. The model assumes a random (i.e., uniform and independent) distribution of chromosome rearrangement breakpoints. It is supported by the observation that the lengths of synteny blocks shared by human and mouse are well fitted by the exponential distribution predicted by the random breakage model, f(x) = (1/L) exp(-x/L), where L is the average length of segments. - the model has become widely accepted - newer studies of significantly larger datasets confirmed that newly discovered synteny blocks still fit the predicted exponential distribution very well.
55
15. Lecture WS 2004/05Bioinformatics III55 Breakpoint reusage Two different most parsimonious scenarios that transform the order of the 11 synteny blocks on the mouse X chromosome into the order on the human X chromosome. The arrangement of synteny blocks in the ancestor is unspecified (and is assumed to coincide with one of intermediate arrangements) because it cannot be inferred without availability of a third genome. Breakpoint uses are shown as short vertical yellow lines, and breakpoint region reuses are shown as double yellow lines. In the first scenario (Left) the breakpoint reuses are located in human in breakpoint regions (3,4), (4,5), and (5,6), whereas in the second one (Right) they are located in (5,6), (6,7), and after block 11. In the second scenario, a potential hidden block is shown as a black dot; it restricts the set of possible most parsimonious scenarios, and it separates two breakpoint uses that would have been a breakpoint region reuse. Our theory implies that any rearrangement scenario based on these 11 blocks has at least three reuses of breakpoint regions (possibly including chromosome ends). Pevzner, Tesler, PNAS 100, 7672 (2003)
56
15. Lecture WS 2004/05Bioinformatics III56 Breakpoint reusage An extension of the Hannenhalli–Pevzner theory implies that any rearrangement scenario based on these 11 blocks has at least three reuses of breakpoint regions (although one cannot unambiguously infer where these breakpoint reuses happened). This indicates that there are at least three more "hidden" synteny blocks in addition to the 11 "large" synteny blocks. Some of these blocks may be detected by lowering the threshold for synteny block detection, whereas others may escape such detection. The analysis further reveals at least 190 breakpoint region reuses over the whole genome on the evolutionary path from mouse to human. Pevzner, Tesler, PNAS 100, 7672 (2003)
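The counting argument behind a bound of this kind can be sketched in a few lines. Each reversal cuts the chromosome in two places, so a scenario with d reversals makes 2d "breakpoint uses"; if only b distinct breakpoint regions are available, at least 2d - b of those uses must fall into a region that was already used. The function name and the input values d = 7, b = 11 are assumptions chosen for illustration (they happen to reproduce a bound of three); the slide itself does not state d and b.

```python
def min_breakpoint_reuses(num_reversals: int, num_breakpoints: int) -> int:
    """Lower bound on breakpoint-region reuses in a reversal scenario.

    Every reversal uses two breakpoints, giving 2*d uses in total.
    With only b distinct breakpoint regions, at least 2*d - b uses
    must hit a region that has been used before.
    """
    return max(0, 2 * num_reversals - num_breakpoints)


# Illustrative (assumed) values, not taken from the slide.
d, b = 7, 11
print(f"d={d}, b={b}: at least {min_breakpoint_reuses(d, b)} reuses")
```

If the scenario is most parsimonious, d is the reversal distance, so the bound depends only on the two genomes and not on the particular scenario chosen.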
57
15. Lecture WS 2004/05Bioinformatics III57 Length of synteny blocks (Left) Histogram of synteny block lengths in human for N_b = 281 synteny blocks of length at least 1 Mb, fitted by an exponential distribution with mean block length L = G_b / N_b = 9.6 Mb, where G_b = 2,707 Mb is the overall length of syntenic blocks. The bin size is 2.5 Mb. (Center) The same histogram superimposed with the 190 hidden synteny blocks revealed by genome rearrangement analysis, under the assumption that all hidden blocks are short, i.e., <1 Mb in length. (Right) Histogram of breakpoint region lengths in the human genome (bin size is 100 kb). Most breakpoint regions are very short, with 109 of 258 regions being <100 kb. However, there is a small number of long breakpoint regions: 17 regions are between 1 and 2.5 Mb, and 15 are >2.5 Mb (shown by a single bar at the right end). The rearrangement analysis confirms the existence of many short (hidden) synteny blocks. Their existence immediately implies that an exponential distribution is not a good fit to reality, thus pointing to limitations of the random breakage model. Pevzner, Tesler, PNAS 100, 7672 (2003)
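The arithmetic behind the fitted curve can be checked with a short sketch. It computes the mean block length from the two totals quoted on the slide and the block counts that a plain exponential distribution would predict per 2.5 Mb bin. This is a simplified illustration: it ignores the fact that only blocks of at least 1 Mb were counted, and the helper name `expected_count` is an assumption, not something from the cited paper.

```python
import math

G_B_MB = 2707.0  # total length of synteny blocks in Mb (from the slide)
N_B = 281        # number of synteny blocks of length >= 1 Mb (from the slide)
BIN_MB = 2.5     # histogram bin size in Mb (from the slide)

# Mean block length L = G_b / N_b
mean_len = G_B_MB / N_B


def expected_count(lo: float, hi: float, mean: float, n: int) -> float:
    """Expected number of blocks with length in [lo, hi) if lengths
    follow an exponential distribution with the given mean."""
    return n * (math.exp(-lo / mean) - math.exp(-hi / mean))


print(f"mean block length L = {mean_len:.1f} Mb")
for i in range(4):
    lo, hi = i * BIN_MB, (i + 1) * BIN_MB
    print(f"{lo:4.1f}-{hi:4.1f} Mb: {expected_count(lo, hi, mean_len, N_B):.1f} blocks expected")
```

Adding a large number of blocks shorter than 1 Mb to the first bin, as in the center panel, pushes the observed count in that bin far above what this curve predicts.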
58
15. Lecture WS 2004/05Bioinformatics III58 Rat – mouse – man Bourque, Pevzner, Tesler, Genome Res. 14, 507 (2004) Aligned portions and origins of sequences in rat, mouse and human genomes.
59
15. Lecture WS 2004/05Bioinformatics III59 Rat – mouse – man Bourque, Pevzner, Tesler, Genome Res. 14, 507 (2004)
60
15. Lecture WS 2004/05Bioinformatics III60 Rat – mouse – man The Rat Sequencing Consortium, Nature 428, 493 (2004)
61
15. Lecture WS 2004/05Bioinformatics III61 Rat – mouse – man The Rat Sequencing Consortium, Nature 428, 493 (2004) X chromosome on each pair. GRIMM synteny for 16 orthologous pairs. Arrangement of the 16 blocks: 15 rearrangement events necessary. Shown is one of a number of most parsimonious inversion scenarios. The last common ancestor of human, mouse and rat should be on the evolutionary path between median ancestor and human.
62
15. Lecture WS 2004/05Bioinformatics III62 Rat – mouse – man Bourque, Pevzner, Tesler, Genome Res. 14, 507 (2004)
63
15. Lecture WS 2004/05Bioinformatics III63 Bourque, Pevzner, Tesler, Genome Res. 14, 507 (2004) Rat – mouse – man
64
15. Lecture WS 2004/05Bioinformatics III64 Bourque, Pevzner, Tesler, Genome Res. 14, 507 (2004) Rat – mouse – man – example 2
65
15. Lecture WS 2004/05Bioinformatics III65 Rat – mouse – man Bourque, Pevzner, Tesler, Genome Res. 14, 507 (2004) Full evolutionary model (all chromosomes)
66
15. Lecture WS 2004/05Bioinformatics III66 Rat – mouse – man Bourque, Pevzner, Tesler, Genome Res. 14, 507 (2004)
67
15. Lecture WS 2004/05Bioinformatics III67 Summary Breakpoint analysis (BPA) is a robust technique for small rearrangement problems. It suffers from ambiguity between different optimal solutions. Although complexity could be dramatically reduced by algorithmic improvements (e.g. GRAPPA), the method is still too expensive for more than 10 genomes. The heuristic MGR algorithm by Bourque & Pevzner minimizes reversal distance instead of breakpoint distance. (The number of breakpoints divided by 2 is a lower bound for the reversal distance, but not a tight one.) It runs more efficiently + can be applied to much larger problems + provides only one or a few solutions. MGR algorithm: analogy to conformational search in some energy landscape... What is the correct way to identify the biologically correct = true evolutionary trees: by minimizing the breakpoint distance or the reversal distance or something else?
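The relation between breakpoints and reversals mentioned in the summary can be illustrated with a minimal sketch for a single signed permutation compared against the identity order. A breakpoint is counted wherever two neighbouring elements (after framing the permutation with 0 and n+1) are not consecutive; since one reversal changes only two adjacencies, it can remove at most two breakpoints, which gives the lower bound ceil(b/2). The function names are illustrative assumptions; this is neither the BPA nor the MGR implementation.

```python
def count_breakpoints(perm):
    """Breakpoints of a signed permutation relative to the identity.

    The permutation is framed with 0 and n+1; a pair of neighbours
    (a, b) is an adjacency if b - a == 1, otherwise a breakpoint.
    """
    ext = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if b - a != 1)


def reversal_lower_bound(perm):
    """A reversal removes at most 2 breakpoints, so d >= ceil(b / 2)."""
    return (count_breakpoints(perm) + 1) // 2


def apply_reversal(perm, i, j):
    """Reverse the segment perm[i..j] (inclusive) and flip its signs."""
    return perm[:i] + [-x for x in reversed(perm[i:j + 1])] + perm[j + 1:]


genome = [1, -3, -2, 4]
print(count_breakpoints(genome), reversal_lower_bound(genome))
print(apply_reversal(genome, 1, 2))
```

In this toy case the bound is met by a single reversal of the middle segment; in general the true reversal distance can be larger than ceil(b/2), which is why minimizing breakpoints and minimizing reversals can lead to different answers.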
68
15. Lecture WS 2004/05Bioinformatics III68 Summary II Joint analysis of > 2 genomes makes it possible to identify common ancestors. But just 3 genomes are not sufficient to identify a unique most parsimonious evolutionary path. Right now we don't understand what to do with the repeats. Why throw them away (using RepeatMasker)? Following duplication history should be as powerful as following genome rearrangement.