1 CAP5510 – Bioinformatics Phylogeny Tamer Kahveci CISE Department University of Florida.

2 Goals
Understand phylogenetic trees
Learn
– distance matrix based methods
– maximum likelihood method
– character based methods

3 What is phylogeny?

4 Phylogeny
Shows the ancestral relationship between genes or organisms
Relationships are inferred from genotype rather than phenotype

5 Why Phylogeny?
Understand history of organisms
Understand how various functions evolved
Multiple sequence alignment
Gene function prediction

6 Phylogenetic Tree (1)
Node = taxonomic unit
– Leaf node = gene or organism
– Internal node = inferred ancestor
Bifurcating = two lineages
Multifurcating = more than two lineages
Branch = ancestral relationship

7 Phylogenetic Tree (2)
Rooted = a single node is the common ancestor of all other nodes
Unrooted = provides no information about the direction of evolution
[Figure: viruses of the family Reoviridae]

8 Phylogenetic Tree (3)
n = number of taxa (leaves)
Rooted => $N_R = \frac{(2n-3)!}{2^{n-2}\,(n-2)!}$
Unrooted => $N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$
n     N_R            N_U
2     1              1
3     3              1
5     105            15
10    34 x 10^6      2 x 10^6
15    213 x 10^12    7 x 10^12
20    8 x 10^21      0.2 x 10^21
Find the number of rooted trees for n = 3. The three rooted trees, in Newick format:
3 -> ((1, 2), 3)
2 -> ((1, 3), 2)
1 -> ((3, 2), 1)
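The two counting formulas on this slide can be checked numerically. The sketch below is a minimal illustration; the function names are made up for this example, and the n = 2 case for unrooted trees is hard-coded to 1 as an assumption taken from the slide's table, because the closed form is undefined there.

```python
from math import factorial

def num_rooted_trees(n: int) -> int:
    """Number of rooted bifurcating trees: (2n-3)! / (2^(n-2) (n-2)!)."""
    if n < 2:
        raise ValueError("need at least two taxa")
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def num_unrooted_trees(n: int) -> int:
    """Number of unrooted bifurcating trees: (2n-5)! / (2^(n-3) (n-3)!)."""
    if n < 2:
        raise ValueError("need at least two taxa")
    if n == 2:
        return 1  # assumption: taken from the slide's table, not the formula
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (2, 3, 5, 10, 15, 20):
    print(n, num_rooted_trees(n), num_unrooted_trees(n))
```

The exact values grow quickly, which is why exhaustive enumeration of trees is only practical for a handful of taxa.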

9 Distance Matrix Methods
UPGMA (Unweighted Pair Group Method with Arithmetic mean)

10 UPGMA (1)
Create a distance matrix between all pairs of taxa
Iteratively do the following until all taxa are merged
– Merge the pair (x, y) with the smallest distance d(x, y) to form the cluster xy
– Set distance d(z, xy) = (d(z, x) + d(z, y))/2 for every other cluster z
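The merge-and-average loop can be sketched in a few lines. This is an illustration, not the course's reference implementation: the function name, the dictionary-of-frozensets representation, and the tie-breaking rule (first minimal pair in list order) are assumptions. The update uses the plain average given on this slide; many references define UPGMA with an average weighted by cluster size, which gives different numbers once the merged clusters have unequal sizes. The input is the five-taxon matrix used in the worked example on the following slides.

```python
from itertools import combinations

def upgma_simple_average(labels, dist):
    """Cluster with the update d(z, xy) = (d(z, x) + d(z, y)) / 2.

    labels: taxon names; dist: dict frozenset({a, b}) -> distance.
    Returns (newick_like_string, merges), each merge being (x, y, height).
    """
    d = dict(dist)
    clusters = list(labels)
    merges = []
    while len(clusters) > 1:
        # Closest pair; ties go to the first pair in list order (assumption).
        x, y = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        new = f"({x}, {y})"
        # The new node sits at half the distance between the merged clusters.
        merges.append((x, y, d[frozenset((x, y))] / 2))
        for z in clusters:
            if z not in (x, y):
                d[frozenset((z, new))] = (d[frozenset((z, x))] + d[frozenset((z, y))]) / 2
        # The merged cluster takes the place of x; y is dropped.
        clusters = [new if c == x else c for c in clusters if c != y]
    return clusters[0], merges

labels = ["A", "B", "C", "D", "E"]
upper = {"AB": 10, "AC": 12, "AD": 9, "AE": 7, "BC": 4,
         "BD": 4, "BE": 14, "CD": 6, "CE": 16, "DE": 13}
dist = {frozenset(pair): value for pair, value in upper.items()}

tree, merges = upgma_simple_average(labels, dist)
print(tree)
for x, y, height in merges:
    print(f"merge {x} + {y} at height {height}")
```

On this input the sketch merges B with C, then D, then A with E, which is the same topology as the tree produced in the worked example, with children listed in a different order.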

11 UPGMA (2)
Choose two clusters with minimum distance and combine them
      A    B    C    D    E
A     0   10   12    9    7
B          0    4    4   14
C               0    6   16
D                    0   13
E                         0
[Figure: tree under construction over leaves A, B, C, D, E]

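12 UPGMA (3)
Update distance matrix
The distance from the new cluster node to each of the two merged clusters is half of their original distance (here d(B, C)/2 = 2)
       A   BC    D    E
A      0   11    9    7
BC          0    5   15
D                0   13
E                     0
[Figure: tree under construction over leaves A, B, C, D, E; branch labels 2, 2]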

13 UPGMA (4)
       A   BC    D    E
A      0   11    9    7
BC          0    5   15
D                0   13
E                     0
[Figure: tree under construction over leaves A, B, C, D, E; branch labels 2, 2]

14 UPGMA (5)
        A  BCD    E
A       0   10    7
BCD          0   14
E                 0
[Figure: tree under construction over leaves A, B, C, D, E; branch labels 2, 2, 2.5, 0.5]

15 UPGMA (6)
        A  BCD    E
A       0   10    7
BCD          0   14
E                 0
[Figure: tree under construction over leaves A, B, C, D, E; branch labels 2, 2, 2.5, 0.5]

16 UPGMA (7)
       AE  BCD
AE      0   12
BCD          0
[Figure: tree under construction over leaves A, B, C, D, E; branch labels 2, 2, 2.5, 0.5, 3.5]

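17 UPGMA (8)
Produced tree: (((B, C), D), (A, E))
[Figure: final tree over leaves A, B, C, D, E; branch labels 2, 2, 2.5, 0.5, 3.5, 2.5]
      A    B    C    D    E
A     0   10   12    9    7
B          0    4    4   14
C               0    6   16
D                    0   13
E                         0
Not additive: path lengths in the tree may not equal the original distances. E.g., C and D: the tree path is 2 + 0.5 + 2.5 = 5, but d(C, D) = 6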

18 Other distance based methods

19 Neighbor Relation Method (1)
Consider all possible arrangements
Choose the one that satisfies the distance relation
[Figure: unrooted tree over leaves A, B, C, D with branches a, b, c, d and internal branch e]
AC + BD = AD + BC
AB + CD < AC + BD

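20 Neighbor Relation Method (2) (Sattath, Tversky, 1977)
For each four-taxon subset of {A, B, C, D, E, F, …}, take the pairing with the minimum sum and vote for its pairs
{A, B, C, D}: 1. AB + CD  2. AC + BD  3. AD + BC  -> min
{A, B, C, E}: 1. AB + CE  2. AE + BC  3. AC + BE  -> min
...
[Figure: vote matrix with rows and columns A, B, C, D, E, F, G, H, …]
Run UPGMA on the votes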
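The quartet comparison and the voting step can be sketched as follows. The function names and the data layout are assumptions made for this illustration, and the distances are the five-taxon matrix from the UPGMA example rather than data from these two slides. The final step on the slide (running UPGMA on the vote matrix) is not shown.

```python
from collections import Counter
from itertools import combinations

def best_pairing(dist, quartet):
    """Among the three ways to split four taxa into two pairs,
    return the split whose summed pair distances is smallest."""
    a, b, c, d = quartet
    options = [((a, b), (c, d)), ((a, c), (b, d)), ((a, d), (b, c))]
    def total(option):
        (p, q), (r, s) = option
        return dist[frozenset((p, q))] + dist[frozenset((r, s))]
    return min(options, key=total)

def neighbor_votes(labels, dist):
    """Each quartet votes once for both pairs of its best split."""
    votes = Counter()
    for quartet in combinations(labels, 4):
        for pair in best_pairing(dist, quartet):
            votes[frozenset(pair)] += 1
    return votes

labels = ["A", "B", "C", "D", "E"]
upper = {"AB": 10, "AC": 12, "AD": 9, "AE": 7, "BC": 4,
         "BD": 4, "BE": 14, "CD": 6, "CE": 16, "DE": 13}
dist = {frozenset(pair): value for pair, value in upper.items()}

votes = neighbor_votes(labels, dist)
for pair, count in votes.most_common():
    print(sorted(pair), count)
```

On this matrix the pairs (B, C) and (A, E) collect the most votes, so they are the strongest neighbor candidates.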

21 Neighbor Joining Method
Start with a star tree
Merge the pair of nodes that minimizes the sum of branch lengths
[Figure: star tree over leaves A, B, C, D, E, and the tree after one pair has been merged]
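The slide does not give a selection formula. A common way to choose the pair that minimizes total branch length is the criterion Q(i, j) = (n - 2) d(i, j) - r_i - r_j, where r_i is the sum of distances from taxon i to all others. The sketch below shows only that first selection step, under the assumption that this standard criterion is the one intended; the function name and the reuse of the UPGMA example matrix are also assumptions.

```python
from itertools import combinations

def nj_pick_pair(labels, dist):
    """Return the pair minimizing Q(i, j) = (n - 2) d(i, j) - r_i - r_j."""
    n = len(labels)
    # r[i]: total distance from taxon i to every other taxon
    r = {i: sum(dist[frozenset((i, k))] for k in labels if k != i) for i in labels}
    def q(pair):
        i, j = pair
        return (n - 2) * dist[frozenset(pair)] - r[i] - r[j]
    return min(combinations(labels, 2), key=q)

labels = ["A", "B", "C", "D", "E"]
upper = {"AB": 10, "AC": 12, "AD": 9, "AE": 7, "BC": 4,
         "BD": 4, "BE": 14, "CD": 6, "CE": 16, "DE": 13}
dist = {frozenset(pair): value for pair, value in upper.items()}

print(nj_pick_pair(labels, dist))
```

Note that the chosen pair is not simply the closest one: A and E are joined first here even though B and C are nearer, because the criterion also accounts for how far each taxon is from everything else.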

22 Maximum Likelihood Method

23 Maximum Likelihood Method
Generate all possible trees
Find the likelihood of each tree
– Use substitution probabilities (e.g., Jukes-Cantor)
Choose the tree with the highest likelihood
Exhaustive search: very slow
Requires computation of inferred ancestors
[Figure: tree over the sequences ACGCTAFKI, GCGCTAFKI, ACGCTAFKL, GCGCTGFKI, GCGCTLFKI, ASGCTAFKL, ACACTAFKL; branches labeled with the substitutions A → G, I → L, A → G, A → L, C → S, G → A]
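As a small illustration of the substitution probabilities mentioned here, the sketch below evaluates the Jukes-Cantor model for nucleotides and uses it to score one branch. The parameter names (`alpha` for the rate of each specific substitution, `t` for branch length) and the short nucleotide strings are assumptions for this example; the sequences in the slide's figure contain amino-acid letters and are not used. A full likelihood computation would also sum over the unknown ancestral states at every internal node.

```python
import math

def jc_prob(same: bool, alpha: float, t: float) -> float:
    """Jukes-Cantor probability that a site is the same (or one specific
    different) nucleotide after time t, with per-substitution rate alpha."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if same else 0.25 - 0.25 * decay

def pair_log_likelihood(ancestor: str, descendant: str, alpha: float, t: float) -> float:
    """Log-probability of the descendant given the ancestor along one branch,
    treating sites as independent."""
    return sum(math.log(jc_prob(a == b, alpha, t))
               for a, b in zip(ancestor, descendant))

# Hypothetical nucleotide sequences differing at one site.
for t in (0.1, 0.5, 2.0):
    print(t, pair_log_likelihood("ACGCTA", "GCGCTA", alpha=0.1, t=t))
```

Scanning `t` this way shows how the likelihood of a branch depends on its length, which is the quantity a maximum likelihood search optimizes for every candidate tree.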

24 Character Based Methods

25 Parsimony (1)
There are various trees that could explain the phylogeny of the following sequences: AAG, AAA, GGA, AGA
[Figure: two candidate trees over these sequences, with inferred ancestors at the internal nodes]
Parsimony prefers the second tree because it requires fewer substitution events

26 Parsimony (2)
Multiply align sequences
For each column of the alignment
– Generate all possible trees
– Compute the number of substitutions
– Vote for the tree with the smallest number of substitutions
Pick the tree with the best vote
1: G G G G G G
2: G G G A G T
3: G G A T A G
4: G A T C A T
[Figure: two candidate trees for one column, with leaves 1G, 2G, 3A, 4A]

27 How can we infer the ancestors?
[Figure: tree whose internal nodes are marked with question marks]

28 Inferring Ancestor (1/3)
[Figure: tree over leaves A, T, G, G, A; child nodes X and Y with parent Z]
If X ∩ Y = ∅ then Z = X ∪ Y
Else Z = X ∩ Y

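29 Inferring Ancestor (2/3)
[Figure: three trees over the same leaves; inferred internal sets {G, A} and {G, A, T} in the first, {G, T} and {G, A, T} in the second, {G} and {G, A} in the third]
If X ∩ Y = ∅ then Z = X ∪ Y
Else Z = X ∩ Y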

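30 Inferring Ancestor (3/3)
[Figure: three trees over the same leaves; inferred internal sets {G, A} and {G, A, T} in the first, {G, T} and {G, A, T} in the second, {G} and {G, A} in the third]
Minimum number of substitutions = # unique characters - 1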
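The bottom-up rule from these slides (take the intersection of the two child sets when it is non-empty, otherwise take their union) can be sketched as a short recursion. Counting one substitution per union step is the usual way to score a column with this rule. The nested-tuple tree encoding, the function name, and the example tree are assumptions for this illustration and are not the tree drawn on the slides.

```python
def fitch(tree):
    """Return (candidate ancestral states, substitution count) for one
    alignment column on a rooted binary tree.

    tree: a single-character string for a leaf, or a 2-tuple of subtrees.
    """
    if isinstance(tree, str):
        return {tree}, 0
    left, right = tree
    x, cost_x = fitch(left)
    y, cost_y = fitch(right)
    common = x & y
    if common:
        # Children agree on at least one state: no new substitution needed.
        return common, cost_x + cost_y
    # Children disagree: keep all candidates and count one substitution.
    return x | y, cost_x + cost_y + 1

# Hypothetical column with leaves A, G, A, T.
states, substitutions = fitch((("A", "G"), ("A", "T")))
print(states, substitutions)
```

Running the function on every candidate tree for a column gives the substitution counts that the parsimony vote compares.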

31 Branch and Bound Method
1. Find an upper bound L on the tree length
– E.g., use UPGMA
2. Start with a small tree
3. Incrementally add more branches to the tree
– Exclude trees with length > L

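32 Branch and Bound Example
[Figure: starting tree over A, B, C, followed by the three trees obtained by adding D: one pairing D with the B, C group, one pairing B with D, one pairing D with C]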

33 Consensus Trees
There may be many trees of the same parsimony
A consensus tree summarizes them by collapsing nodes
– Resulting tree may not be bifurcating
Strict consensus
T% majority rule consensus

34 Consensus
1: (A, ((B, (C, D)), (E, (F, G))))
2: ((A, (C, (B, D))), (E, (F, G)))
3: ((A, (D, (B, C))), (E, (F, G)))
Strict: (A, (B, C, D), (E, (F, G)))
50%: ((A, (B, C, D)), (E, (F, G)))
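The strict and 50% results on this slide can be reproduced by counting how often each clade (the set of leaves below an internal node) appears across the three trees. The sketch below assumes the trees are written as nested Python tuples instead of parsed Newick strings, and the function names are made up for this example. It returns the retained clades; assembling them back into a tree is left out.

```python
from collections import Counter

def collect_clades(tree, found):
    """Add the leaf set of every internal node to `found`;
    return the leaf set of this node."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(collect_clades(child, found) for child in tree))
    found.add(leaves)
    return leaves

def consensus_clades(trees, fraction):
    """Keep clades present in all trees, or in more than `fraction` of them."""
    counts = Counter()
    for tree in trees:
        found = set()
        collect_clades(tree, found)
        counts.update(found)
    n = len(trees)
    return {clade for clade, k in counts.items() if k == n or k > fraction * n}

trees = [
    ("A", (("B", ("C", "D")), ("E", ("F", "G")))),
    (("A", ("C", ("B", "D"))), ("E", ("F", "G"))),
    (("A", ("D", ("B", "C"))), ("E", ("F", "G"))),
]
strict = consensus_clades(trees, 1.0)
majority = consensus_clades(trees, 0.5)
print(sorted("".join(sorted(c)) for c in strict))
print(sorted("".join(sorted(c)) for c in majority))
```

The clade {A, B, C, D} appears in two of the three trees, so it survives the 50% rule but is collapsed in the strict consensus, matching the two trees on the slide.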

35 Tree Confidence
Is the resulting tree reliable?
Usually a confidence is computed for each part of the tree
– Bootstrapping

36 Bootstrapping
Given a phylogenetic tree T
1. Multiply align sequences based on T
2. Randomly select columns from the alignment (with replacement) to create a new dataset of the same size
3. Find the phylogenetic tree T’ for the resampled dataset
4. Repeat steps 2-3 many times
5. Compute the fraction of times T’ overlaps with T
Original alignment:
   0 1 2 3 4 5 6 7 8 9
1: G G G A G G A T C A
2: G G G A G T A T C A
3: G G A T A G A C A T
4: G A T C A T G T A T
5: G T T C A T A T C T
Resampled alignment (column indices drawn with replacement):
   0 0 2 4 4 5 5 8 8 8
1: G G G G G G G C C C
2: G G G G G T T C C C
3: G G A A A G G A A A
4: G G T A A T T A A A
5: G G T A A T T C C C
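Step 2 (drawing columns with replacement) can be sketched as below. The function names are assumptions for this illustration, and the indices are sorted only so the output reads like the slide's example; the order of columns does not matter for methods that treat columns independently. Steps 3 to 5 need a tree-building method and are not shown.

```python
import random

def take_columns(alignment, cols):
    """Build a new alignment from the given column indices (repeats allowed)."""
    return ["".join(row[c] for c in cols) for row in alignment]

def bootstrap_replicate(alignment, rng):
    """Draw as many columns as the alignment has, with replacement."""
    width = len(alignment[0])
    cols = sorted(rng.randrange(width) for _ in range(width))
    return cols, take_columns(alignment, cols)

alignment = ["GGGAGGATCA", "GGGAGTATCA", "GGATAGACAT", "GATCATGTAT", "GTTCATATCT"]

# Seeded generator so the example is repeatable.
cols, replicate = bootstrap_replicate(alignment, random.Random(0))
print(cols)
for row in replicate:
    print(row)
```

Each replicate has the same number of sequences and columns as the original alignment, so the same tree-building method can be applied to it unchanged.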

37 Reading Assignment
Krane, Chapters 4, 5
Mount, Chapter 7
