Combining with phylogeny Wafa Jobran Seminar in Bioinformatics Technion spring 2005.


1 Combining with phylogeny. Wafa Jobran, Seminar in Bioinformatics, Technion, spring 2005.

2 Schedule: genome representation; the GNT model; distance-based methods; true evolutionary distance; BP and IEBP variance; INV and EDE variance; simulation.

3 Representing a chromosome. A chromosome is represented by an ordering (linear or circular) of signed genes. We assign the same number to the same gene in each genome. In a linear genome the sign indicates which strand the gene is located on. For a circular genome we break the circle between two neighboring genes and choose either the clockwise or the counterclockwise direction as the positive direction.

4 Representing a chromosome, example: a circular genome with genes 1, 2, 3. Some of the linear representations for this genome: (1,2,3), (2,3,1) or (-1,-3,-2).

5 The generalized Nadeau-Taylor model (GNT). We are particularly interested in the following three types of rearrangements along the edges: 1. inversions.

6 Inversions: starting with genome G = (g_1, g_2, ..., g_n), an inversion between indices a and b, 1 ≤ a < b ≤ n+1, produces (g_1, g_2, ..., g_{a-1}, -g_b, ..., -g_a, g_{b+1}, ..., g_n).

7 The generalized Nadeau-Taylor model (GNT). We are particularly interested in the following three types of rearrangements along the edges: 1. inversions. 2. transpositions.

8 Transpositions: starting with genome G = (g_1, g_2, ..., g_n), a transposition on the three indices a, b, c, with 1 ≤ a < b ≤ n and 2 ≤ c ≤ n+1, c ≠ a and c ≠ b, produces (g_1, ..., g_{a-1}, g_{b+1}, ..., g_c, g_a, g_{a+1}, ..., g_b, g_{c+1}, ..., g_n).

9 The generalized Nadeau-Taylor model (GNT). We are particularly interested in the following three types of rearrangements along the edges: 1. inversions. 2. transpositions. 3. inverted transpositions.

10 Inverted transpositions: starting with genome G = (g_1, g_2, ..., g_n), an inverted transposition on the three indices a, b, c, with 1 ≤ a < b ≤ n and 2 ≤ c ≤ n+1, c ≠ a and c ≠ b, produces (g_1, ..., g_{a-1}, g_{b+1}, ..., g_c, -g_b, -g_{b-1}, ..., -g_a, g_{c+1}, ..., g_n).

11 Examples: G = (1 2 3 4 5 6 7 8 9 10). Inversion with a=4, b=6: G → (1 2 3 -6 -5 -4 7 8 9 10). Transposition with a=4, b=6, c=8: G → (1 2 3 7 8 4 5 6 9 10). Inverted transposition with a=4, b=6, c=8: G → (1 2 3 7 8 -6 -5 -4 9 10).
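
The three rearrangements can be sketched as list operations. This is a minimal illustration, assuming 1-based inclusive indices a..b (the convention the worked examples follow) and, for the two transposition variants, c > b. The function names are hypothetical and not taken from the slides.

```python
from typing import List

Genome = List[int]


def inversion(g: Genome, a: int, b: int) -> Genome:
    """Reverse and negate the segment g_a..g_b (1-based, inclusive)."""
    segment = [-x for x in reversed(g[a - 1:b])]
    return g[:a - 1] + segment + g[b:]


def transposition(g: Genome, a: int, b: int, c: int) -> Genome:
    """Move the segment g_a..g_b so that it follows g_c (assumes c > b)."""
    return g[:a - 1] + g[b:c] + g[a - 1:b] + g[c:]


def inverted_transposition(g: Genome, a: int, b: int, c: int) -> Genome:
    """Like a transposition, but the moved segment is also reversed and negated."""
    segment = [-x for x in reversed(g[a - 1:b])]
    return g[:a - 1] + g[b:c] + segment + g[c:]


if __name__ == "__main__":
    G = list(range(1, 11))
    print(inversion(G, 4, 6))
    print(transposition(G, 4, 6, 8))
    print(inverted_transposition(G, 4, 6, 8))
```

Each function returns a new list, so the same starting genome G can be reused for all three examples.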

12 The generalized Nadeau-Taylor model (GNT). We are particularly interested in the following three types of rearrangements along the edges: 1. inversions. 2. transpositions. 3. inverted transpositions. Different inversions have equal probability, and so do different transpositions and different inverted transpositions.

13 Cont.: the generalized Nadeau-Taylor model (GNT). Each model tree has two parameters: α is the probability that a rearrangement event is a transposition, and β is the probability that a rearrangement event is an inverted transposition. 1 - α - β is then the probability that a rearrangement event is an inversion.

14 Reconstructing the true tree T. A phylogenetic tree T on a set of taxa S is a tree representation of the evolutionary history of S: T is a tree leaf-labeled by S such that the internal nodes reflect past speciation events. Every edge e in T is associated with a number k_e, the actual number of rearrangements along edge e. The true evolutionary distance (t.e.d.) between two leaves G_i and G_j in T is k_ij = Σ_{e ∈ P_ij} k_e, where P_ij is the simple path on T between G_i and G_j. Using good estimates of the true evolutionary distances between genomes greatly improves the performance of distance-based methods.
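
As a small illustration of the path-sum definition, the sketch below stores a tree as an adjacency list with the event count k_e on each edge and sums the counts along the unique path between two leaves. The tree, its node names and the helper names are made up for the example.

```python
from typing import Dict, List, Optional, Tuple

Tree = Dict[str, List[Tuple[str, int]]]


def build_tree(edges: List[Tuple[str, str, int]]) -> Tree:
    """Build an undirected adjacency list from (u, v, k_e) triples."""
    tree: Tree = {}
    for u, v, k_e in edges:
        tree.setdefault(u, []).append((v, k_e))
        tree.setdefault(v, []).append((u, k_e))
    return tree


def true_evolutionary_distance(tree: Tree, src: str, dst: str) -> Optional[int]:
    """Sum k_e along the simple path from src to dst (depth-first search)."""
    stack = [(src, None, 0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == dst:
            return dist
        for neighbor, k_e in tree[node]:
            if neighbor != parent:  # a tree has no cycles, so this suffices
                stack.append((neighbor, node, dist + k_e))
    return None


if __name__ == "__main__":
    # Hypothetical tree: three leaves and two internal nodes X and Y.
    t = build_tree([("G1", "X", 3), ("G2", "X", 5), ("X", "Y", 4), ("G3", "Y", 2)])
    print(true_evolutionary_distance(t, "G1", "G3"))
```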

15 Reconstructing the true tree T: distance-based methods.
NJ (neighbor joining): consists of two main steps that are repeated until the tree is completed. 1. Choose a pair of taxa to be joined and replaced by a single new node representing their immediate common ancestor. 2. Infer the distances from the new node to all other nodes. NJ is widely used because of its elegance and speed, and because, when given exact distances, it is guaranteed to reproduce the correct tree.
BioNJ and Weighbor (weighted neighbor joining): use the variance of good t.e.d. estimators and yield more accurate trees than NJ. Weighbor proceeds as in neighbor joining, but when choosing a pair of taxa to join it takes into account that errors in distance estimates are exponentially larger for longer distances; this is done by using the variance.
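
The two NJ steps can be sketched for a single iteration. This assumes the standard Saitou-Nei selection criterion Q(i,j) = (n-2)·d(i,j) - Σ_k d(i,k) - Σ_k d(j,k), which the slide does not spell out. The 5-taxon distance matrix is an illustrative example, not data from the talk.

```python
from itertools import combinations
from typing import List, Tuple

Matrix = List[List[float]]


def nj_pick_pair(d: Matrix) -> Tuple[int, int]:
    """Step 1: pick the pair (i, j) minimizing the NJ criterion Q(i, j)."""
    n = len(d)
    row_sums = [sum(row) for row in d]
    best = None
    for i, j in combinations(range(n), 2):
        q = (n - 2) * d[i][j] - row_sums[i] - row_sums[j]
        if best is None or q < best[0]:
            best = (q, i, j)
    return best[1], best[2]


def nj_new_distances(d: Matrix, i: int, j: int) -> List[float]:
    """Step 2: distances from the new node (joining i and j) to every other taxon."""
    return [0.5 * (d[i][k] + d[j][k] - d[i][j])
            for k in range(len(d)) if k not in (i, j)]


if __name__ == "__main__":
    dist = [
        [0, 5, 9, 9, 8],
        [5, 0, 10, 10, 9],
        [9, 10, 0, 8, 7],
        [9, 10, 8, 0, 3],
        [8, 9, 7, 3, 0],
    ]
    i, j = nj_pick_pair(dist)
    print(i, j, nj_new_distances(dist, i, j))
```

BioNJ and Weighbor keep this overall loop but weight the choices by the variance of the distance estimates; that part is not shown here.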

16 Estimating true evolutionary distance (t.e.d.) using genome rearrangements. The assumption is that the genomes have evolved from a common ancestor under the GNT model of evolution.

17 Estimating true evolutionary distance (t.e.d.) using genome rearrangements. The edit distance between two gene orders is the minimum total weight of any sequence of events from the given set that transforms one gene order into the other. For example, the inversion distance is the edit distance when only inversions are permitted and all inversions have weight 1.

18 Estimating true evolutionary distance using genome rearrangements. The edit distance. The breakpoint distance: given two genomes G and G', a breakpoint in G is an ordered pair of genes (g_a, g_b) such that g_a and g_b appear consecutively in that order in G, but neither (g_a, g_b) nor (-g_b, -g_a) appears consecutively in that order in G'. The breakpoint distance is the number of breakpoints in G relative to G'. For example, with G = (1,2,3,4,5) and G' = (1,-4,-3,2,5), three pairs of genes are adjacent in G but not in G': (1,2), (2,3) and (4,5), so the breakpoint distance is 3.
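
The breakpoint definition translates directly into code. The sketch below treats genomes as linear and, as an assumption matching the slide's example, does not add framing genes at the two ends. The function name is hypothetical.

```python
from typing import List


def breakpoint_distance(g: List[int], h: List[int]) -> int:
    """Count ordered pairs adjacent in g that are not adjacent in h.

    A pair (x, y) counts as present in h if h contains x, y or -y, -x
    consecutively, i.e. the same adjacency read on either strand.
    """
    adjacencies = set()
    for x, y in zip(h, h[1:]):
        adjacencies.add((x, y))
        adjacencies.add((-y, -x))
    return sum((x, y) not in adjacencies for x, y in zip(g, g[1:]))


if __name__ == "__main__":
    print(breakpoint_distance([1, 2, 3, 4, 5], [1, -4, -3, 2, 5]))
```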

19 Estimating true evolutionary distance using genome rearrangements. The edit distance. The breakpoint distance. Exact-IEBP (inverting the expected breakpoint distance): replaces the approximation in the IEBP method by computing the expected breakpoint distance exactly. To compute the Exact-IEBP estimate k' of the true evolutionary distance between two genomes G and G':
1. For all k = 1, ..., r (where r is some integer large enough to bring a genome to random), compute E[BP(G_0, G_k)], where G_k is G_0 after k events.
2. Compute the breakpoint distance b = BP(G, G'), then find the integer k', 0 ≤ k' ≤ r, such that |E[BP(G_0, G_{k'})] - b| is minimized.
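
Step 2 is a table lookup once the expected curve is available. The sketch below shows only that inversion step. The curve passed in is an assumption: as a stand-in it uses the simple saturating form n(1 - (1 - 2/n)^k), not the exact expectation that Exact-IEBP computes.

```python
from typing import List


def invert_expected_curve(expected: List[float], b: float) -> int:
    """Return the k in 0..r whose expected breakpoint distance is closest to b.

    expected[k] must hold E[BP(G_0, G_k)] for k = 0, ..., r.
    """
    return min(range(len(expected)), key=lambda k: abs(expected[k] - b))


def stand_in_curve(n: int, r: int) -> List[float]:
    """Placeholder expected curve (an assumption, not the Exact-IEBP formula)."""
    return [n * (1.0 - (1.0 - 2.0 / n) ** k) for k in range(r + 1)]


if __name__ == "__main__":
    curve = stand_in_curve(n=120, r=300)
    observed_b = 60
    print(invert_expected_curve(curve, observed_b))
```

Because the curve flattens as k grows, small changes in b translate into large changes in the estimate for strongly rearranged genomes, which is why the variance of the estimator matters later in the talk.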

20 Estimating true evolutionary distance using genome rearrangements. The edit distance. The breakpoint distance. Exact-IEBP (inverting the expected breakpoint distance). EDE (empirically derived estimator): we estimate the true evolutionary distance by inverting the expected inversion distance. Given two genomes with the same set of n genes and inversion distance d between them, we define the EDE distance as n·f^{-1}(d/n), where f is an approximation to the expected inversion distance normalized by the number of genes.

21 Experiments: accuracy of the estimators by absolute difference. GNT model with 120 genes. Starting with the unrearranged genome G_0, we apply k events to it to obtain the genome G_k, for k = 1, ..., 300. For each value of k we simulate 500 runs and then compute the five distances.

22 Accuracy of the estimators by absolute difference. Both the BP and INV distances underestimate the actual number of events. EDE slightly overestimates the actual number of events. The IEBP and Exact-IEBP distances are both unbiased.

24 Now we will find the variance of the breakpoint distance in an approximating model, the variance of the IEBP estimator, and the variance of the inversion and EDE distances. Based on these variance estimators we will see four new methods: BioNJ-IEBP, Weighbor-IEBP, BioNJ-EDE and Weighbor-EDE.

25 Variance of the breakpoint distance

26 Deriving the variance (BP). Difficulties: 1. Even the expected BP distance between G and G' with n genes after k rearrangements in the GNT model is still an unsimplified sum. 2. The breakpoints are not independent (under any evolutionary model). Solution: an approximating model.

27 The approximating model. We motivate the approximating model by the case of inversion-only evolution on a signed circular genome. Let n be the number of genes and b the number of breakpoints of the current genome G.

28 The approximating model. When we apply a random inversion to G, we have the following cases according to the two endpoints of the inversion. Case 1: neither of the two endpoints of the inversion is a breakpoint. The number of breakpoints is increased by 2. There are C(n-b, 2) such inversions.

29 The approximating model. Case 1 (neither of the two endpoints of the inversion is a breakpoint), example: G = (1,2,3,4,5,6,7,8,9,10), G' = (1,2,-5,-4,-3,6,7,8,9,10). Inverting the segment with endpoints 8, 9 gives G'' = (1,2,-5,-4,-3,6,7,-9,-8,10).

30 The approximating model. When we apply a random inversion to G, we have the following cases according to the two endpoints of the inversion. Case 2: exactly one of the two endpoints of the inversion is a breakpoint. The number of breakpoints is increased by 1. There are b(n-b) such inversions.

31 The approximating model. Case 2 (exactly one of the two endpoints of the inversion is a breakpoint), example: G = (1,2,3,4,5,6,7,8,9,10), G' = (1,2,-5,-4,-3,6,7,8,9,10). Inverting the segment with endpoints 6, 8 gives G'' = (1,2,-5,-4,-3,-8,-7,-6,9,10).

32 The approximating model. When we apply a random inversion to G, we have the following cases according to the two endpoints of the inversion. Case 3: the two endpoints of the inversion are two breakpoints. There are C(b, 2) such inversions, and three subcases.

33 The approximating model. Case 3: the two endpoints of the inversion are two breakpoints. Let g_i and g_{i+1} be the left and right genes at the left breakpoint, and let g_j and g_{j+1} be the left and right genes at the right breakpoint. The inversion takes (..., g_i, g_{i+1}, ..., g_j, g_{j+1}, ...) to (..., g_i, -g_j, ..., -g_{i+1}, g_{j+1}, ...). There are three subcases.

34 The approximating model. Case 3, subcase A: neither (g_i, -g_j) nor (-g_{i+1}, g_{j+1}) is an adjacency in G_0. The number of breakpoints is unchanged.

35 The approximating model. Case 3, subcase B: exactly one of (g_i, -g_j) and (-g_{i+1}, g_{j+1}) is an adjacency in G_0. The number of breakpoints is decreased by 1.

36 The approximating model. Case 3, subcase C: both (g_i, -g_j) and (-g_{i+1}, g_{j+1}) are adjacencies in G_0. The number of breakpoints is decreased by 2.

37 The approximating model. Case 3: the two endpoints of the inversion are two breakpoints. When b ≥ 3, out of the C(b, 2) inversions in case 3, cases 3(B) and 3(C) account for at most b inversions, because for every breakpoint there is only one specific inversion that can cancel it. This means that, given that an inversion belongs to case 3, with probability at least 1 - b/C(b, 2) = (b-3)/(b-1) it does not change the breakpoint distance. This probability is close to 1 when b is large.

38 The approximating model. Summary of the cases (number of inversions; change in the number of breakpoints):
Case 1: C(n-b, 2) inversions; +2.
Case 2: b(n-b) inversions; +1.
Case 3.A: the remaining case-3 inversions; 0.
Case 3.B: -1, and case 3.C: -2; together at most b inversions.
Therefore, when n is large, we can drop cases 3(B) and 3(C) without drastically affecting the distribution of the breakpoint distance.

39 The approximating model. Approximating box model: boxes correspond to breakpoints. Let us be given n boxes, initially empty. At each iteration two boxes are chosen at random, and we place a ball into each of these two boxes if it is empty. The number of nonempty boxes after k iterations, b_k, can be used to estimate the number of breakpoints after k rearrangement events are applied to an unrearranged genome.
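
The box model is easy to simulate. The sketch below is a minimal inversion-only version, assuming the two boxes in each iteration are distinct and chosen uniformly. The function name and the seeded generator are illustrative choices.

```python
import random
from typing import Set


def simulate_box_model(n: int, k: int, rng: random.Random) -> int:
    """Return b_k, the number of nonempty boxes after k iterations.

    Each iteration picks two distinct boxes uniformly at random and marks
    both as nonempty (a ball is only added to a box that is still empty).
    """
    nonempty: Set[int] = set()
    for _ in range(k):
        nonempty.update(rng.sample(range(n), 2))
    return len(nonempty)


if __name__ == "__main__":
    rng = random.Random(2005)
    runs = [simulate_box_model(120, 60, rng) for _ in range(500)]
    print(sum(runs) / len(runs))
```

Averaging many runs for each k gives an empirical curve of b_k against k that flattens as the boxes fill up.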

40 This model can also be extended to approximate the GNT model: at each iteration, with probability 1 - α - β (the probability of an inversion) we choose 2 boxes, and with probability α + β (the probability of a transposition or an inverted transposition) we choose 3 boxes.

41 Derivation of the variance. In the inversion-only model, let S = ((x_1x_2 + x_1x_3 + ... + x_{n-1}x_n) / C(n, 2))^k. Each term of the expansion corresponds to the ways of choosing two boxes k times: the power of x_i is the total number of times box i is chosen, and the coefficient of the term is the total probability of these ways. For example, the coefficient of x_1^3 x_2 x_3^2 is the probability of choosing box 1 three times, box 2 once, and box 3 twice.
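
For small n and k the expansion can be checked by brute force. The sketch below enumerates every sequence of k pairs, records how often each box is chosen, and accumulates the probability of each exponent pattern. It assumes the k-th power of the normalized pair sum as the generating function, and the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction
from itertools import combinations, product
from typing import Dict, Tuple


def term_probabilities(n: int, k: int) -> Dict[Tuple[int, ...], Fraction]:
    """Map each exponent tuple (e_1, ..., e_n) to its coefficient in S."""
    pairs = list(combinations(range(1, n + 1), 2))
    weight = Fraction(1, len(pairs)) ** k  # probability of one fixed sequence
    coeffs: Dict[Tuple[int, ...], Fraction] = Counter()
    for sequence in product(pairs, repeat=k):
        counts = Counter(box for pair in sequence for box in pair)
        exponents = tuple(counts[i] for i in range(1, n + 1))
        coeffs[exponents] += weight
    return coeffs


if __name__ == "__main__":
    coeffs = term_probabilities(n=4, k=3)
    # Coefficient of x_1^3 x_2 x_3^2 (box 4 never chosen).
    print(coeffs[(3, 1, 2, 0)])
```

The enumeration grows as C(n, 2)^k, so this is only a sanity check for tiny cases, which is the reason the talk works with expectation and variance rather than the full distribution.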

42 If transpositions and inverted transpositions are present (with probabilities α and β): S = ((1-α-β)/C(n, 2) · Σ_{i<j} x_i x_j + (α+β)/C(n, 3) · Σ_{i<j<l} x_i x_j x_l)^k. Let u_i be the sum of the coefficients of the terms with i distinct symbols; u_i is the probability that i boxes are nonempty after k iterations. Solving for u_i exactly for all k is difficult and unnecessary; instead we can find the expectation and variance of b_k directly.

45 Expectation and variance of b_k. Let S(a_1, a_2, ..., a_n) be the value of S when x_i = a_i for all i, and let S_j = S(1, 1, ..., 1, 0, ..., 0) with j ones. Results for the inversion-only model: 1. E[b_k] = n(1 - S_{n-1}). 2. Var[b_k] = n S_{n-1} - n^2 S_{n-1}^2 + n(n-1) S_{n-2}.
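
The moments of b_k can be checked against exhaustive enumeration for small cases. The sketch assumes the inversion-only box model, where S_j = (C(j, 2)/C(n, 2))^k is the probability that every chosen pair falls inside a fixed set of j boxes. The variance is written as n·S_{n-1} - n^2·S_{n-1}^2 + n(n-1)·S_{n-2}, which follows from counting empty boxes. Function names are illustrative.

```python
from fractions import Fraction
from itertools import combinations, product
from math import comb
from typing import Tuple


def s_j(n: int, j: int, k: int) -> Fraction:
    """Probability that all k chosen pairs lie within j fixed boxes."""
    return Fraction(comb(j, 2), comb(n, 2)) ** k


def expected_bk(n: int, k: int) -> Fraction:
    return n * (1 - s_j(n, n - 1, k))


def variance_bk(n: int, k: int) -> Fraction:
    s1 = s_j(n, n - 1, k)  # one given box is still empty
    s2 = s_j(n, n - 2, k)  # two given boxes are both still empty
    return n * s1 - n * n * s1 * s1 + n * (n - 1) * s2


def exact_moments(n: int, k: int) -> Tuple[Fraction, Fraction]:
    """Mean and variance of b_k by enumerating all C(n, 2)^k sequences."""
    pairs = list(combinations(range(n), 2))
    values = [len(set().union(*seq)) for seq in product(pairs, repeat=k)]
    mean = Fraction(sum(values), len(values))
    second = Fraction(sum(v * v for v in values), len(values))
    return mean, second - mean ** 2


if __name__ == "__main__":
    print(expected_bk(5, 3), variance_bk(5, 3))
    print(exact_moments(5, 3))
```

Exact rational arithmetic is used so the closed forms and the enumeration can be compared for equality rather than within a tolerance.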

46 Expectation and variance of b_k. Results for the GNT model: [formulas 1 and 2 are not preserved in the transcript].

47 Estimating the true evolutionary distance. To estimate the true evolutionary distance we use Exact-IEBP. The variance of the resulting estimate can be approximated using a common statistical technique called the delta method.
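
The slide's own formula is not preserved, so the following is only a sketch of how the delta method is commonly applied here. If the estimate is obtained by inverting the curve k → E[b_k], then Var(estimate) ≈ Var(b_k) / (dE[b_k]/dk)^2. The code assumes the inversion-only box model, with S_{n-1} = (1 - 2/n)^k. All names are illustrative.

```python
from math import comb, log


def s_j(n: int, j: int, k: float) -> float:
    """Probability that all k chosen pairs lie within j fixed boxes."""
    return (comb(j, 2) / comb(n, 2)) ** k


def expected_bk(n: int, k: float) -> float:
    return n * (1.0 - s_j(n, n - 1, k))


def variance_bk(n: int, k: float) -> float:
    s1 = s_j(n, n - 1, k)
    s2 = s_j(n, n - 2, k)
    return n * s1 - n * n * s1 * s1 + n * (n - 1) * s2


def d_expected_bk(n: int, k: float) -> float:
    """Derivative of n(1 - r^k) with respect to k, where r = 1 - 2/n."""
    r = 1.0 - 2.0 / n
    return -n * (r ** k) * log(r)


def delta_method_variance(n: int, k: float) -> float:
    """Approximate variance of the inverted estimate via the delta method."""
    return variance_bk(n, k) / d_expected_bk(n, k) ** 2


if __name__ == "__main__":
    for k in (10, 50, 150):
        print(k, delta_method_variance(120, k))
```

The derivative shrinks as the curve saturates, so the approximate variance of the estimate grows quickly for large k even though Var(b_k) itself stays moderate.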

48 Accuracy of the estimators for the variance. [Figure: two panels, Var(BP_k) and Var(k(b_k)).] Each figure consists of two sets of curves, corresponding to the values from simulation and from theoretical estimation. The number of genes is 120. The number of rearrangement events k ranges from 1 to 220. The evolutionary model is inversion-only GNT. For each k, 500 runs are performed.

49 Variance of the inversion and EDE distances. The EDE distance: given two genomes with the same set of n genes and inversion distance d between them, we define the EDE distance as n·f^{-1}(d/n), where f is an approximation to the expected inversion distance normalized by the number of genes.

50 Variance of the inversion and EDE distances. Let x be the normalized number of inversions (k/n) and let y = d/n. We simulate the inversion-only GNT model to evaluate the relationship between the inversion distance and the actual number of inversions applied. Regression on simulation results suggests a = 1, b = 0.5956, and c = 0.4577.
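
The regression formula itself is missing from the transcript. The sketch below therefore takes f as a parameter and inverts it numerically by bisection. The default f, min(x, (a·x^2 + b·x)/(x^2 + c·x + b)), is an assumption recalled from the EDE literature and should be checked against the original paper before use. Only the constants a, b, c come from the slide.

```python
from typing import Callable


def f_assumed(x: float, a: float = 1.0, b: float = 0.5956, c: float = 0.4577) -> float:
    """Assumed form of the normalized expected inversion distance."""
    rational = (a * x * x + b * x) / (x * x + c * x + b)
    return min(x, rational)


def ede_distance(d: float, n: int,
                 f: Callable[[float], float] = f_assumed,
                 x_max: float = 5.0, tol: float = 1e-9) -> float:
    """Return n * f^{-1}(d / n), inverting f by bisection on [0, x_max].

    Assumes f is increasing on [0, x_max]; values of d/n at or above
    f(x_max) are clamped to x_max.
    """
    y = d / n
    if y >= f(x_max):
        return n * x_max
    lo, hi = 0.0, x_max
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < y:
            lo = mid
        else:
            hi = mid
    return n * hi


if __name__ == "__main__":
    for d in (12, 60, 90):
        print(d, ede_distance(d, 120))
```

Passing f as an argument keeps the inversion step usable if the fitted form differs from the one assumed here.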

51 Variance of the inversion and EDE distances. Using the same technique, a regression formula is fitted for the standard deviation of the inversion distance, normalized by the number of genes, after nx inversions are applied, with q = -0.6998, u = 0.1684, v = 0.1573, w = -1.3893 and t = 0.8224. Var(EDE) can be obtained using the delta method on Var(INV).

52 Simulation study

53 The accuracy of the new methods. We use the original Weighbor and BioNJ implementations and modify them so that they use the new variance formulas. The following four distance estimators are used with neighbor joining: BP, INV, Exact-IEBP and EDE. According to past simulation studies, NJ(EDE) has the best accuracy, followed closely by NJ(Exact-IEBP).

54 Quantifying error. Given an inferred tree, we assess its topological accuracy by computing the false negatives with respect to the true tree. False negative edge: let T be the true tree and T' the inferred tree. An edge e in T is missing in T' if T' does not contain an edge defining the same bipartition of the leaf set. The external edges are trivial in the sense that they are in every tree with the same set of leaves. The false negative rate is the number of false negative edges in T' with respect to T divided by the number of internal edges in T.
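
The false negative rate can be computed from the bipartitions alone. In this sketch each internal edge is given by the set of leaves on one of its sides, and both sides are normalized to a canonical form so that the same split always compares equal. The representation and the names are assumptions made for the example.

```python
from typing import FrozenSet, Iterable, Set


def canonical(side: Iterable[str], leaves: Set[str]) -> FrozenSet[str]:
    """Pick a canonical side of the bipartition (smaller, then alphabetical)."""
    one = frozenset(side)
    other = frozenset(leaves) - one
    return min(one, other, key=lambda s: (len(s), sorted(s)))


def false_negative_rate(true_edges: Iterable[Iterable[str]],
                        inferred_edges: Iterable[Iterable[str]],
                        leaves: Set[str]) -> float:
    """Fraction of internal edges of the true tree missing from the inferred tree."""
    true_splits = {canonical(s, leaves) for s in true_edges}
    inferred_splits = {canonical(s, leaves) for s in inferred_edges}
    return len(true_splits - inferred_splits) / len(true_splits)


if __name__ == "__main__":
    taxa = {"A", "B", "C", "D", "E"}
    true_tree = [{"A", "B"}, {"C", "D"}]            # AB|CDE and CD|ABE
    inferred_tree = [{"C", "D", "E"}, {"D", "E"}]   # AB|CDE and DE|ABC
    print(false_negative_rate(true_tree, inferred_tree, taxa))
```

Only internal edges are passed in, since external edges define trivial bipartitions that every tree on the same leaves shares.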

55 False negatives. [Figure: false negative rates; 120 genes, 160 genomes.] Weighbor-EDE has the best accuracy of all the methods.

56 When we compare methods based on the breakpoint distance with methods based on the inversion distance, the inversion-based methods are always better. This suggests that INV is a better statistic than BP for the true evolutionary distance under the GNT model, even when transpositions and inverted transpositions are present.

57 Running time. NJ, BioNJ-IEBP and BioNJ-EDE all finish within 1 second for all settings on the Pentium simulation workstation running Linux. However, Weighbor-IEBP and Weighbor-EDE take about 30 minutes to finish for 160 genomes.

58 Conclusion. We studied the variance of the breakpoint and inversion distances under the generalized Nadeau-Taylor model. We used these results to obtain four new methods: BioNJ-IEBP, Weighbor-IEBP, BioNJ-EDE, and Weighbor-EDE. Of these, Weighbor-IEBP and Weighbor-EDE yield very accurate phylogenetic trees and are robust against errors in the model parameters.

59 References
[1] W. J. Bruno, N. D. Socci, and A. L. Halpern. Weighted neighbor joining: a likelihood-based approach to distance-based phylogeny reconstruction. Mol. Biol.
[2] O. Gascuel. BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol., 14:685-695, 1997.
[3] N. Saitou and M. Nei. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4:406-425, 1987.
[4] L.S. Wang and T. Warnow. Estimating true evolutionary distances between genomes. In Proc. 33rd Annual ACM Symp. on Theory of Comp. (STOC 2001), pages 637-646. ACM Press, 2001.
[5] L.S. Wang. Exact-IEBP: a new technique for estimating evolutionary distances between whole genomes.
[6] L.S. Wang. Genome rearrangement phylogeny using Weighbor.

