Alignments and Phylogenetic tree Reading: Introduction to Bioinformatics. Arthur M. Lesk. Fourth Edition Chapter 5.


1 Alignments and Phylogenetic tree Reading: Introduction to Bioinformatics. Arthur M. Lesk. Fourth Edition Chapter 5

2 Sequence Alignment

3 Dot Matrix. Sequence A: CTTAACT. Sequence B: CGGATCAT. [Figure: dot matrix with Sequence B (C G G A T C A T) across the top and Sequence A (C T T A A C T) down the side.]

4 Pairwise Alignment. Sequence A: CTTAACT. Sequence B: CGGATCAT. An alignment of A and B:
    Sequence A: C---TTAACT
    Sequence B: CGGATCA--T

5 Pairwise Alignment. Sequence A: CTTAACT. Sequence B: CGGATCAT. An alignment of A and B:
    C---TTAACT
    CGGATCA--T
Column types: match (C over C), insertion gap (--- over GGA), mismatch (T over C), deletion gap (AC over --).

6 Alignment Graph. Sequence A: CTTAACT. Sequence B: CGGATCAT. [Figure: grid with Sequence B across the top and Sequence A down the side; the marked path corresponds to the alignment C---TTAACT / CGGATCA--T.]

7 A simple scoring scheme. Match: +8 (w(x, y) = 8 if x = y). Mismatch: -5 (w(x, y) = -5 if x ≠ y). Each gap symbol: -3 (w(-, x) = w(x, -) = -3).
     C  -  -  -  T  T  A  A  C  T
     C  G  G  A  T  C  A  -  -  T
    +8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12 (alignment score)
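The column-by-column scoring on slide 7 can be sketched in a few lines of Python. This is only an illustration: the function name `score_alignment` and its keyword arguments are assumptions made for this note, not part of the lecture.

```python
def score_alignment(top, bottom, match=8, mismatch=-5, gap=-3):
    """Sum the column scores of a gapped alignment ('-' marks a gap symbol).

    Hypothetical helper; the default weights are the ones given on slide 7.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    total = 0
    for x, y in zip(top, bottom):
        if x == "-" or y == "-":
            total += gap        # w(-, x) = w(x, -) = -3
        elif x == y:
            total += match      # w(x, y) = 8 if x = y
        else:
            total += mismatch   # w(x, y) = -5 if x != y
    return total


print(score_alignment("C---TTAACT", "CGGATCA--T"))  # the alignment from slide 7
```

For the alignment on slide 7 this reproduces the score of +12.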

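8 An optimal alignment: the alignment of maximum score. Let A = a_1 a_2 … a_m and B = b_1 b_2 … b_n. S_{i,j}: the score of an optimal alignment between a_1 a_2 … a_i and b_1 b_2 … b_j. With proper initializations, S_{i,j} can be computed as follows: S_{i,j} = max { S_{i-1,j} + w(a_i, -), S_{i,j-1} + w(-, b_j), S_{i-1,j-1} + w(a_i, b_j) }.

9 Computing S_{i,j}. [Figure: cell (i, j) is reached from (i-1, j) with weight w(a_i, -), from (i, j-1) with weight w(-, b_j), and from (i-1, j-1) with weight w(a_i, b_j); the optimal score is S_{m,n}.]

10 Initializations
                C   G   G   A   T   C   A   T
            0  -3  -6  -9 -12 -15 -18 -21 -24
    C  -3
    T  -6
    T  -9
    A -12
    A -15
    C -18
    T -21

11 S_{3,5} = ?
                C   G   G   A   T   C   A   T
            0  -3  -6  -9 -12 -15 -18 -21 -24
    C  -3   8   5   2  -1  -4  -7 -10 -13
    T  -6   5   3   0  -3   7   4   1  -2
    T  -9   2   0  -2  -5   ?
    A -12
    A -15
    C -18
    T -21

12 S_{3,5} = 5. The optimal score is the bottom-right entry, S_{7,8} = 14.
                C   G   G   A   T   C   A   T
            0  -3  -6  -9 -12 -15 -18 -21 -24
    C  -3   8   5   2  -1  -4  -7 -10 -13
    T  -6   5   3   0  -3   7   4   1  -2
    T  -9   2   0  -2  -5   5   2  -1   9
    A -12  -1  -3  -5   6   3   0  10   7
    A -15  -4  -6  -8   3   1  -2   8   5
    C -18  -7  -9 -11   0  -2   9   6   3
    T -21 -10 -12 -14  -3   8   6   4  14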

13 An optimal alignment, read off by tracing back through the matrix of slide 12:
     C  T  T  A  A  C  -  T
     C  G  G  A  T  C  A  T
    +8 -5 -5 +8 -5 +8 -3 +8 = 14
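A minimal Python sketch of the global alignment procedure on slides 8 to 13 follows. The function name `global_align` and the tie-breaking order in the traceback (diagonal first, then up, then left) are assumptions for this note; other tie-breaking rules can return different alignments with the same score.

```python
def global_align(a, b, match=8, mismatch=-5, gap=-3):
    """Fill S[i][j] with the slide-8 recurrence, then trace back one optimal alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # first column: i gap symbols
        S[i][0] = i * gap
    for j in range(1, n + 1):          # first row: j gap symbols
        S[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + w,   # aligned pair
                          S[i - 1][j] + gap,     # a_i against a gap
                          S[i][j - 1] + gap)     # b_j against a gap

    # Traceback from S[m][n]; assumed tie-breaking: diagonal, then up, then left.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            w = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + w:
                top.append(a[i - 1])
                bottom.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return S[m][n], "".join(reversed(top)), "".join(reversed(bottom))


print(global_align("CTTAACT", "CGGATCAT"))
```

With the scoring scheme of slide 7, the two example pairs in this lecture give optimal scores 14 and 27.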

14 Now try this example in class. Sequence A: CAATTGA. Sequence B: GAATCTGC. What is their optimal alignment?

15 Initializations
                G   A   A   T   C   T   G   C
            0  -3  -6  -9 -12 -15 -18 -21 -24
    C  -3
    A  -6
    A  -9
    T -12
    T -15
    G -18
    A -21

16 S_{4,2} = ?
                G   A   A   T   C   T   G   C
            0  -3  -6  -9 -12 -15 -18 -21 -24
    C  -3  -5  -8 -11 -14  -4  -7 -10 -13
    A  -6  -8   3   0  -3  -6  -9 -12 -15
    A  -9 -11   0  11   8   5   2  -1  -4
    T -12 -14   ?
    T -15
    G -18
    A -21

17 S_{5,5} = ?
                G   A   A   T   C   T   G   C
            0  -3  -6  -9 -12 -15 -18 -21 -24
    C  -3  -5  -8 -11 -14  -4  -7 -10 -13
    A  -6  -8   3   0  -3  -6  -9 -12 -15
    A  -9 -11   0  11   8   5   2  -1  -4
    T -12 -14  -3   8  19  16  13  10   7
    T -15 -17  -6   5  16   ?
    G -18
    A -21

18 S_{5,5} = 14. The optimal score is the bottom-right entry, S_{7,8} = 27.
                G   A   A   T   C   T   G   C
            0  -3  -6  -9 -12 -15 -18 -21 -24
    C  -3  -5  -8 -11 -14  -4  -7 -10 -13
    A  -6  -8   3   0  -3  -6  -9 -12 -15
    A  -9 -11   0  11   8   5   2  -1  -4
    T -12 -14  -3   8  19  16  13  10   7
    T -15 -17  -6   5  16  14  24  21  18
    G -18  -7  -9   2  13  11  21  32  29
    A -21 -10   1  -1  10   8  18  29  27

19 An optimal alignment, read off by tracing back through the matrix of slide 18:
     C  A  A  T  -  T  G  A
     G  A  A  T  C  T  G  C
    -5 +8 +8 +8 -3 +8 +8 -5 = 27

20 Global Alignment vs. Local Alignment. [Figure: schematic contrasting a global alignment with a local alignment.]

21 An optimal local alignment. S_{i,j}: the score of an optimal local alignment ending at a_i and b_j. With proper initializations, S_{i,j} can be computed as follows: S_{i,j} = max { 0, S_{i-1,j} + w(a_i, -), S_{i,j-1} + w(-, b_j), S_{i-1,j-1} + w(a_i, b_j) }.

22 Local alignment. Match: 8. Mismatch: -5. Gap symbol: -3.
              C  G  G  A  T  C  A  T
           0  0  0  0  0  0  0  0  0
    C  0  8  5  2  0  0  8  5  2
    T  0  5  3  0  0  8  5  3 13
    T  0  2  0  0  0  8  5  2 11
    A  0  0  0  0  8  5  3  ?
    A  0
    C  0
    T  0

23 Local alignment. Match: 8. Mismatch: -5. Gap symbol: -3. The best score is the largest entry in the matrix, 18.
              C  G  G  A  T  C  A  T
           0  0  0  0  0  0  0  0  0
    C  0  8  5  2  0  0  8  5  2
    T  0  5  3  0  0  8  5  3 13
    T  0  2  0  0  0  8  5  2 11
    A  0  0  0  0  8  5  3 13 10
    A  0  0  0  0  8  5  2 11  8
    C  0  8  5  2  5  3 13 10  7
    T  0  5  3  0  2 13 10  8 18

24 An optimal local alignment, read off by tracing back from the best score (18) in the matrix of slide 23 until a 0 is reached:
     A  -  C  -  T
     A  T  C  A  T
    +8 -3 +8 -3 +8 = 18
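The local alignment recurrence on slide 21 differs from the global one only in the extra 0 option and in where the traceback starts and stops. A minimal Python sketch, assuming the hypothetical name `local_align` and a diagonal-first tie-breaking rule in the traceback:

```python
def local_align(a, b, match=8, mismatch=-5, gap=-3):
    """Smith-Waterman style local alignment with a linear gap penalty (sketch)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]   # row 0 and column 0 stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(0,
                          S[i - 1][j - 1] + w,
                          S[i - 1][j] + gap,
                          S[i][j - 1] + gap)
            if S[i][j] > best:                  # remember the first largest entry
                best, bi, bj = S[i][j], i, j

    # Trace back from the best cell until a 0 entry is reached.
    top, bottom = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and S[i][j] > 0:
        w = match if a[i - 1] == b[j - 1] else mismatch
        if S[i][j] == S[i - 1][j - 1] + w:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif S[i][j] == S[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(local_align("CTTAACT", "CGGATCAT"))
```

On the lecture's first example this yields the best score 18 with the alignment A-C-T over ATCAT.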

25 Now try this example in class. Sequence A: CAATTGA. Sequence B: GAATCTGC. What is their optimal local alignment?

26 Did you get it right?
              G  A  A  T  C  T  G  C
           0  0  0  0  0  0  0  0  0
    C  0  0  0  0  0  8  5  2  8
    A  0  0  8  8  5  5  3  0  5
    A  0  0  8 16 13 10  7  4  2
    T  0  0  5 13 24 21 18 15 12
    T  0  0  2 10 21 19 29 26 23
    G  0  8  5  7 18 16 26 37 34
    A  0  5 16 13 15 13 23 34 32

27 An optimal local alignment, read off by tracing back from the best score (37) in the matrix of slide 26:
     A  A  T  -  T  G
     A  A  T  C  T  G
    +8 +8 +8 -3 +8 +8 = 37

28 Affine gap penalties. Match: +8 (w(x, y) = 8 if x = y). Mismatch: -5 (w(x, y) = -5 if x ≠ y). Each gap symbol: -3 (w(-, x) = w(x, -) = -3). Each gap is charged an extra gap-open penalty: -4.
     C  -  -  -  T  T  A  A  C  T
     C  G  G  A  T  C  A  -  -  T
    +8 -3 -3 -3 +8 -5 +8 -3 -3 +8 = +12
The alignment opens two gaps, each charged -4. Alignment score: 12 - 4 - 4 = 4.

29 Affine gap penalties. A gap of length k is penalized x + k·y, where x is the gap-open penalty and y is the gap-symbol penalty. Three cases for alignment endings: 1. ...x / ...x (an aligned pair); 2. ...x / ...- (a deletion); 3. ...- / ...x (an insertion).

30 Affine gap penalties. Let D(i, j) denote the maximum score of any alignment between a_1 a_2 … a_i and b_1 b_2 … b_j ending with a deletion. Let I(i, j) denote the maximum score of any alignment between a_1 a_2 … a_i and b_1 b_2 … b_j ending with an insertion. Let S(i, j) denote the maximum score of any alignment between a_1 a_2 … a_i and b_1 b_2 … b_j.

31 Affine gap penalties. (A gap of length k is penalized x + k·y.) D(i, j) = max { D(i-1, j) - y, S(i-1, j) - x - y }. I(i, j) = max { I(i, j-1) - y, S(i, j-1) - x - y }. S(i, j) = max { S(i-1, j-1) + w(a_i, b_j), D(i, j), I(i, j) }.

32 Affine gap penalties. [Figure: transitions among the S, D, and I matrices: extending a gap costs -y, opening a gap costs -x-y, and an aligned pair adds w(a_i, b_j).]
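The three-matrix scheme for affine gaps (slides 29 to 32) can be sketched in Python as below. This is a score-only sketch under stated assumptions: the name `affine_global_score`, the boundary initialization (a leading gap of length k costs x + k·y), and passing the penalties as positive numbers x = `gap_open` and y = `gap_symbol` are choices made for this note.

```python
NEG_INF = float("-inf")


def affine_global_score(a, b, match=8, mismatch=-5, gap_open=4, gap_symbol=3):
    """Optimal global score when a gap of length k costs gap_open + k * gap_symbol."""
    m, n = len(a), len(b)
    S = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    D = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # alignments ending with a deletion
    I = [[NEG_INF] * (n + 1) for _ in range(m + 1)]  # alignments ending with an insertion
    S[0][0] = 0
    for i in range(1, m + 1):          # a_1..a_i against one long gap
        D[i][0] = -gap_open - i * gap_symbol
        S[i][0] = D[i][0]
    for j in range(1, n + 1):          # b_1..b_j against one long gap
        I[0][j] = -gap_open - j * gap_symbol
        S[0][j] = I[0][j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # extend an existing gap (-y) or open a new one (-x - y)
            D[i][j] = max(D[i - 1][j], S[i - 1][j] - gap_open) - gap_symbol
            I[i][j] = max(I[i][j - 1], S[i][j - 1] - gap_open) - gap_symbol
            w = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + w, D[i][j], I[i][j])
    return S[m][n]


print(affine_global_score("CTTAACT", "CGGATCAT"))
```

Setting `gap_open=0` reduces this to the simple scoring scheme, which gives a quick consistency check against the earlier global alignment scores.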

33 Constant gap penalties. Match: +8 (w(x, y) = 8 if x = y). Mismatch: -5 (w(x, y) = -5 if x ≠ y). Each gap symbol: 0 (w(-, x) = w(x, -) = 0). Each gap is charged a constant penalty: -4.
     C  -  -  -  T  T  A  A  C  T
     C  G  G  A  T  C  A  -  -  T
    +8  0  0  0 +8 -5 +8  0  0 +8 = +27
The alignment contains two gaps, each charged -4. Alignment score: 27 - 4 - 4 = 19.

34 Constant gap penalties. Let D(i, j) denote the maximum score of any alignment between a_1 a_2 … a_i and b_1 b_2 … b_j ending with a deletion. Let I(i, j) denote the maximum score of any alignment between a_1 a_2 … a_i and b_1 b_2 … b_j ending with an insertion. Let S(i, j) denote the maximum score of any alignment between a_1 a_2 … a_i and b_1 b_2 … b_j.

35 Constant gap penalties. With a constant penalty x per gap (the affine case with y = 0): D(i, j) = max { D(i-1, j), S(i-1, j) - x }. I(i, j) = max { I(i, j-1), S(i, j-1) - x }. S(i, j) = max { S(i-1, j-1) + w(a_i, b_j), D(i, j), I(i, j) }.

36 Restricted affine gap penalties. A gap of length k is penalized x + f(k)·y, where f(k) = k for k <= c and f(k) = c for k > c. Five cases for alignment endings: 1. ...x / ...x (an aligned pair); 2. ...x / ...- (a deletion); 3. ...- / ...x (an insertion); 4. and 5. for long gaps.

37 Restricted affine gap penalties

38 D(i, j) vs. D'(i, j). Case 1: the best alignment ending at (i, j) with a deletion at the end has a last deletion gap of length < c; then D(i, j) >= D'(i, j). Case 2: the best alignment ending at (i, j) with a deletion at the end has a last deletion gap of length >= c; then D(i, j) <= D'(i, j).

39 max { S(i, j) - x - k·y, S(i, j) - x - c·y } [Figure: plot over the gap length k, with c marked on the axis.]

40 k best local alignments. Smith-Waterman (Smith and Waterman, 1981; Waterman and Eggert, 1987). FASTA (Wilbur and Lipman, 1983; Lipman and Pearson, 1985). BLAST (Altschul et al., 1990; Altschul et al., 1997).

41 FASTA. 1) Find runs of identities, and identify regions with the highest density of identities. 2) Re-score using a PAM matrix, and keep the top-scoring segments. 3) Eliminate segments that are unlikely to be part of the alignment. 4) Optimize the alignment in a band.

42 FASTA Step 1: Find runs of identities, and identify regions with the highest density of identities. [Figure: dot plot of Sequence A against Sequence B.]

43 FASTA Step 2: Re-score using a PAM matrix, and keep the top-scoring segments.

44 FASTA Step 3: Eliminate segments that are unlikely to be part of the alignment.

45 FASTA Step 4: Optimize the alignment in a band.

46 BLAST: Basic Local Alignment Search Tool (by Altschul, Gish, Miller, Myers and Lipman). The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words.

47 The maximal segment pair measure. A maximal segment pair (MSP) is defined to be the highest-scoring pair of identical-length segments chosen from 2 sequences (for DNA: identities +5; mismatches -4). The MSP score may be computed in time proportional to the product of the sequence lengths. (How?) An exact procedure is too time-consuming. BLAST heuristically attempts to calculate the MSP score.

48 BLAST. 1) Build the hash table for Sequence A. 2) Scan Sequence B for hits. 3) Extend hits.

49 BLAST Step 1: Build the hash table for Sequence A (3-tuple example).
For DNA sequences: Seq. A = AGATCGAT (positions 1 to 8). The table has one entry per 3-tuple, from AAA to TTT; the non-empty entries are:
    AGA: 1
    ATC: 3
    CGA: 5
    GAT: 2, 6
    TCG: 4
For protein sequences: Seq. A = ELVIS. Add xyz to the hash table if Score(xyz, ELV) ≥ T; add xyz to the hash table if Score(xyz, LVI) ≥ T; add xyz to the hash table if Score(xyz, VIS) ≥ T.
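The DNA part of slide 49 amounts to recording, for every 3-letter word, the positions where it starts in Sequence A. A minimal Python sketch follows; the name `build_word_index` is an assumption for this note, and a plain dictionary stands in for the hash table. It covers only exact DNA words, not the protein neighborhood rule with threshold T.

```python
from collections import defaultdict


def build_word_index(seq, k=3):
    """Map each k-letter word in seq to its 1-based start positions (as on slide 49)."""
    index = defaultdict(list)
    for start in range(len(seq) - k + 1):
        index[seq[start:start + k]].append(start + 1)   # 1-based, like the slide
    return dict(index)


for word, positions in sorted(build_word_index("AGATCGAT").items()):
    print(word, positions)
```

For Seq. A = AGATCGAT this lists GAT at positions 2 and 6 and every other occurring word once, matching the slide.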

50 BLAST Step 2: Scan Sequence B for hits.

51 BLAST Step 2: Scan Sequence B for hits. Step 3: Extend hits. Terminate if the score of the extension fades away (that is, when we reach a segment pair whose score falls a certain distance below the best score found for shorter extensions). BLAST 2.0 saves the time spent in extension, and considers gapped alignments.
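Steps 2 and 3 can be illustrated with a small ungapped sketch in Python. It is a toy version under stated assumptions, not the BLAST implementation: the names `find_hits`, `_extend`, and `extend_hit`, the 0-based positions, and the `drop` cutoff value of 10 are all hypothetical; the +5 / -4 scores are the DNA values quoted on slide 47.

```python
def find_hits(seq_a, seq_b, k=3):
    """Return 0-based (pos_a, pos_b) pairs where the same k-letter word occurs."""
    index = {}
    for i in range(len(seq_a) - k + 1):
        index.setdefault(seq_a[i:i + k], []).append(i)
    hits = []
    for j in range(len(seq_b) - k + 1):          # scan Sequence B word by word
        for i in index.get(seq_b[j:j + k], []):
            hits.append((i, j))
    return hits


def _extend(a, b, i, j, step, match, mismatch, drop):
    """Walk from (i, j) in direction step; return (best score gain, its length)."""
    score = best = best_len = length = 0
    while 0 <= i < len(a) and 0 <= j < len(b):
        score += match if a[i] == b[j] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= drop:               # the extension has faded away
            break
        i += step
        j += step
    return best, best_len


def extend_hit(a, b, i, j, k=3, match=5, mismatch=-4, drop=10):
    """Extend a word hit at (i, j) in both directions without gaps.

    Returns (score, start_a, start_b, length) of the extended segment pair.
    """
    right, right_len = _extend(a, b, i + k, j + k, +1, match, mismatch, drop)
    left, left_len = _extend(a, b, i - 1, j - 1, -1, match, mismatch, drop)
    return k * match + left + right, i - left_len, j - left_len, left_len + k + right_len


hits = find_hits("AGATCGAT", "GATCGA")
print(hits)
print(extend_hit("AGATCGAT", "GATCGA", *hits[0]))
```

The extension keeps the best score seen so far and stops once the running score falls `drop` below it, which is the "fades away" rule described on slide 51.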

52 Remarks. Filtering is based on the observation that a good alignment usually includes short identical or very similar fragments. The idea of filtration was used in both FASTA and BLAST.

53 Phylogenetic trees

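54 Benjamin Loyle 2004 Cse 397. DNA Sequence Evolution. [Figure: a tree drawn against a time axis (-3 mil yrs, -2 mil yrs, -1 mil yrs, today). The ancestral sequence AAGACTT evolves into TGGACTT and AAGGCCT, then into AGGGCAT, TAGCCCT and AGCACTT, and finally into the present-day sequences AGGGCAT, TAGCCCA, TAGACTT, AGCACAA and AGCGCTT.]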

55 Problem Definition. The Tree of Life: connecting all living organisms. All-encompassing. Find evolution from simple beginnings. Even smaller relations are tough. Impossible. Infer possible ancestral history.

56 So what? Genome sequencing provides the entire map of a species, so why link them? We can understand evolution. Viable drug testing and design. Predict the function of genes. Influenza evolution.

57 Why is that a problem? Over 8 million organisms. The optimization problems behind current methods are NP-hard. Computing a few hundred species takes years. Error is a very large factor.

58 What do we want? Input: a collection of nodes, such as taxa or protein strings, to compare in a tree. Output: a topological link to compare those nodes to each other. When do we want it? FAST!

59 Preparing the input. Create a distance matrix: collect all of the known distances into an n x n matrix, where n is the number of nodes or taxa. The distances are found with sequence comparison.

60 Distance Matrix. Take 5 separate DNA strings. A: GATCCATGA. B: GATCTATGC. C: GTCCCATTT. D: AATCCGATC. E: TCTCGATAG. Counting mismatched positions, the distance between A and B is 2 and the distance between A and C is 4. The distance is subjective; it depends on what your criteria are.
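The two distances quoted on slide 60 agree with a simple count of mismatched positions between equal-length strings. A minimal Python sketch under that assumption (the names `hamming` and `distance_matrix` are introduced here for illustration; the slide itself notes that the choice of distance is up to you):

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have equal length")
    return sum(1 for x, y in zip(s, t) if x != y)


def distance_matrix(seqs):
    """Full symmetric n x n matrix of pairwise mismatch counts."""
    return [[hamming(s, t) for t in seqs] for s in seqs]


seqs = ["GATCCATGA", "GATCTATGC", "GTCCCATTT", "AATCCGATC", "TCTCGATAG"]  # A..E
for label, row in zip("ABCDE", distance_matrix(seqs)):
    print(label, row)
```

The first row reproduces the slide's values: d(A, B) = 2 and d(A, C) = 4.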

61 Distance Matrix. Let's start with an example matrix:
           A    B    C    D    E
    A      0   63   94  111   67
    B           0   79   96   16
    C                0   47   83
    D                     0  100
    E                          0

62 Let's make it simple (constrain the input). Let's keep the distance between nodes within a certain limit. From F -> G: F and G have the largest distance; they are the most dissimilar of any nodes. This is called the diameter of the tree. Let's keep the length of the input (the length of the strings) polynomial.

63 ERROR?!?!!? All trees are inferred, so how do you ever know if you're right? How accurate do we have to be? We can create data sets to test the trees that we create, and assume that the method will then work in the real world.

64 Data Sets. JC (Jukes-Cantor) Model. Sites evolve independently. Sites change with the same probability. Changes are single-character changes, i.e., A -> G or T -> C. The expected number of changes on an edge e is a Poisson variable λ(e).

65 More Data Sets. K2P (Kimura 2-parameter) Model. Based on the JC Model. Allows for a ratio of transitions to transversions: it's more likely for A and G to switch and for C and T to switch (transitions) than for any other switch (transversions). Transitions are normally set to twice as likely.

66 Data Use. Using these models we can create our own evolution of data. Start with one "ancestor" and create evolutions. Plug the evolved sequences back in and see if you get the tree you started with.
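Generating test data from one "ancestor" can be sketched as below. This is a simplified Jukes-Cantor-style step, not the full model from the slides: instead of a Poisson number of changes per edge, each site is assumed to change with a fixed probability `p_change`, and the name `evolve_jc` is hypothetical.

```python
import random


def evolve_jc(seq, p_change, rng):
    """Return a mutated copy of seq.

    Each site changes independently with probability p_change, and a changed
    site becomes one of the three other bases with equal probability.
    """
    out = []
    for base in seq:
        if rng.random() < p_change:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


rng = random.Random(2004)          # fixed seed so runs are repeatable
ancestor = "AAGACTT"
children = [evolve_jc(ancestor, 0.2, rng) for _ in range(2)]
print(ancestor, children)
```

Applying the function repeatedly along the edges of a chosen tree gives leaf sequences whose true tree is known, which is what makes the accuracy tests later in the lecture possible.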

67 Aspects of Trees. Topology: the way in which nodes are connected to each other. "Are we really connected to apes directly, or just linked long before we could be considered mammals?" Distance: the sum of the weighted edges on the path from one node to another.

68 What can distance tell us? The distance between nodes IS the evolutionary distance between the nodes. The distance between an ancestor and a leaf (a present-day object) can be interpreted as an estimate of the number of evolutionary "steps" that occurred.

69 Current Techniques. Maximum Parsimony: minimize the total number of evolutionary events; find the tree that has the minimum number of changes from ancestors. Maximum Likelihood: probability based; which tree is most probable given the current data?

70 More Techniques. Neighbor Joining: repeatedly joins pairs of leaves (or subtrees) by rules of numerical optimization. It shrinks the distance matrix by treating two "neighbors" as one node.

71 Learning Neighbor Joining. Why will become apparent later on, but let's learn how to do Neighbor Joining (NJ).
         A  B  C  D  E
    A    0  3  3  4  3
    B       0  3  3  4
    C          0  3  3
    D             0  3
    E                0

72 NJ Part 1. First start with a "star tree". [Figure: star tree with leaves A, B, C, D, E.]

73 NJ Part 2. Combine the closest two nodes (from the distance matrix). In our case it is nodes A and B, at distance 3. [Figure: the tree after joining A and B; C, D, E remain attached to the center.]

74 NJ Part 3. Repeat this until you have added n-2 nodes (here, 3). n-2 nodes will make it a binary tree, so we only have to include one more node. [Figure: the resulting binary tree on A, B, C, D, E.]
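One join step can be sketched in Python as follows. Note the assumption: the slides describe the step informally as "combine the closest two nodes", while this sketch uses the standard neighbor-joining selection criterion Q(i, j) = (n - 2)·d(i, j) - r(i) - r(j), where r(i) is the sum of row i. The names `nj_pick_pair` and `nj_join` are hypothetical, and ties are broken by taking the first minimal pair.

```python
def nj_pick_pair(d):
    """Return (Q, i, j) for the pair minimizing the NJ criterion Q."""
    n = len(d)
    totals = [sum(row) for row in d]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * d[i][j] - totals[i] - totals[j]
            if best is None or q < best[0]:      # strict '<' keeps the first minimum
                best = (q, i, j)
    return best


def nj_join(d, i, j):
    """Shrink the matrix: merge nodes i and j into one new node (placed last)."""
    keep = [k for k in range(len(d)) if k not in (i, j)]
    reduced = [[d[a][b] for b in keep] for a in keep]
    to_new = [(d[i][k] + d[j][k] - d[i][j]) / 2 for k in keep]
    for row, dist in zip(reduced, to_new):
        row.append(dist)
    reduced.append(to_new + [0.0])
    return reduced


# The matrix from slide 71, written out as a full symmetric matrix (A..E).
d = [[0, 3, 3, 4, 3],
     [3, 0, 3, 3, 4],
     [3, 3, 0, 3, 3],
     [4, 3, 3, 0, 3],
     [3, 4, 3, 3, 0]]
q, i, j = nj_pick_pair(d)
print("join", "ABCDE"[i], "and", "ABCDE"[j], "with Q =", q)
print(nj_join(d, i, j))
```

On the slide-71 matrix several pairs tie, and the first-minimum rule picks A and B, the same pair the slides join first. Repeating pick-and-join until three nodes remain yields the binary tree.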

75 Are we done? ML and MP, even in heuristic form, take too long for large data sets. NJ has poor topological accuracy, especially for large-diameter trees. We need something that works for large-diameter trees and can be run fast.

76 Here's what we want. Our goal: an "Absolute Fast Converging" (afc) method. A method Φ is afc if, for all positive f, g, ε, on the model M, there is a polynomial p such that, for all (T, {λ(e)}) in the set M_{f,g}, on a set S of n sequences of length at least p(n) generated on T, we have Pr[Φ(S) = T] > 1 - ε. Simply: with sequences of polynomial length, recover the true tree within a degree of error.

77 A DCM* - NJ Solution. Two-phase construction of a final phylogenetic tree given a distance matrix d. Phase 1: create a set of plausible trees for the distance matrix. Phase 2: find the best-fitting tree.

78 Phase 1. For each q in {d_ij}, compute a tree t_q. Let T = { t_q : q in {d_ij} }.

79 Finding t_q. Step 1: Compute Thresh(d, q). Step 2: Triangulate Thresh(d, q). Step 3: Compute an NJ tree for each maximal clique. Step 4: Merge the subtrees into a supertree.

80 What does that mean? Breaking the problem up: create a threshold of diameters to break the problem into a bunch of smaller-diameter trees (cliques). Apply NJ to those cliques. Merge them back.

81 Finding t_q (terms). Threshold graph: Thresh(d, q) is the threshold graph where (i, j) is an edge if and only if d_ij <= q.

82 Threshold. Let's bring back our distance matrix and create a threshold with q equal to d_15, the distance between A and E. So q = 67.

83 Distance Matrix. Our old example matrix:
           A    B    C    D    E
    A      0   63   94  111   67
    B           0   79   96   16
    C                0   47   83
    D                     0  100
    E                          0

84 With q = d_15 = 67. [Figure: threshold graph on A, B, C, D, E with edges A-B (63), A-E (67), B-E (16), and C-D (47).]
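The threshold graph definition on slide 81 translates directly into code. A minimal Python sketch, where the name `threshold_graph` and the edge-list representation are assumptions made for this note:

```python
def threshold_graph(d, labels, q):
    """Edges (u, v, d_uv) of Thresh(d, q): all pairs whose distance is at most q."""
    n = len(d)
    return [(labels[i], labels[j], d[i][j])
            for i in range(n)
            for j in range(i + 1, n)
            if d[i][j] <= q]


# The example matrix from slides 61 and 83, written out symmetrically (A..E).
d = [[0, 63, 94, 111, 67],
     [63, 0, 79, 96, 16],
     [94, 79, 0, 47, 83],
     [111, 96, 47, 0, 100],
     [67, 16, 83, 100, 0]]
print(threshold_graph(d, "ABCDE", q=d[0][4]))   # q = d_15 = 67
```

With q = 67 this keeps exactly the four edges drawn on slide 84: A-B, A-E, B-E, and C-D.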

85 Triangulating. A graph is triangulated if any cycle with four or more vertices has a chord, that is, an edge joining two nonconsecutive vertices of the cycle. Our example is already triangulated, but let's look at another.

86 Triangulating. [Figure: a four-cycle on W, X, Y, Z whose four sides have length 5; the two diagonals have lengths 10 and 15.] Let's say this is for q = 5: the edges of length 10 and 15 would not be in the graph. To triangulate this graph, you add the edge of length 10.

87 Maximal Cliques. A maximal clique is a clique that cannot be enlarged by the addition of another vertex. Recall our original threshold graph, which is triangulated:

88 Triangulated Threshold Graph. Our old graph. [Figure: threshold graph on A, B, C, D, E with edges A-B (63), A-E (67), B-E (16), and C-D (47).]

89 Clique. Our maximal cliques would be {A, B, E} and {C, D}.
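For a graph this small, the maximal cliques can be listed with a basic Bron-Kerbosch search. This Python sketch is a generic enumeration chosen for illustration (the name `maximal_cliques` is hypothetical); it does not exploit the triangulated structure that makes the step polynomial in the method described in the lecture.

```python
def maximal_cliques(vertices, edges):
    """List all maximal cliques of an undirected graph (basic Bron-Kerbosch)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    cliques = []

    def expand(clique, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(sorted(clique))       # nothing can enlarge this clique
            return
        for v in sorted(candidates):
            expand(clique | {v}, candidates & adj[v], excluded & adj[v])
            candidates = candidates - {v}        # v has been fully explored
            excluded = excluded | {v}

    expand(set(), set(vertices), set())
    return sorted(cliques)


edges = [("A", "B"), ("A", "E"), ("B", "E"), ("C", "D")]   # threshold graph, q = 67
print(maximal_cliques("ABCDE", edges))
```

On the threshold graph for q = 67 this returns the two cliques named on slide 89, {A, B, E} and {C, D}.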

90 Create Trees for the Cliques. We have two maximal cliques, so we make two trees: {A, B, E} and {C, D}. How do we make these trees? Remember NJ?

91 Trees {A, B, E} and {C, D}. [Figure: one tree on the leaves A, B, E and one tree on the leaves C, D.]

92 Merge your separate trees together. Create one supertree. This is done by creating a minimum set of edges in the trees and calling that the "backbone". This is its own doctoral thesis, so let's do a little hand waving.

93 That sounds like NP-hard! Computing the threshold graph is polynomial. Minimally triangulating is NP-hard, but a triangulation can be obtained in polynomial time using a greedy heuristic without too much loss in performance. Finding the maximal cliques is only polynomial if the input graph is triangulated (which it is!). If all of the previous steps are done, creating a supertree can be done in polynomial time as well.

94 Where are we now? We now have a finalized phylogeny for each q, created from smaller trees in our matrix joined together. Remember, we started from all possible sizes of smaller trees.

95 Phase 2. Which one is right? Found using the SQS (Short Quartet Support) method. Let T be a tree in S (made in Phase 1). Break the data into sets of four taxa: {A, B, C, D}, {A, C, D, E}, {A, B, D, E}, etc. Reduce the larger tree to only hold one such set. These are called quartets.

96 SQS - A Guide. Q(T) is the set of trees induced by T on each set of four leaves. Let Q_w (a different Q) be the set of quartets with diameter less than or equal to w. Find the maximum w where the quartets are inclusive of the nodes of the tree. This w is the "support" of that tree.

97 SQS - Rephrased. Q_w is the set of quartet trees that have a diameter <= w. The support of T is the max w where Q_w is a subset of Q(T). Support is our "quality measure". What exactly are we measuring?

98 Q_w = [Figure: example quartet trees on leaves drawn from A, B, C, D, E.]

99 SQS Method. Return the tree whose support is the maximum. If more than one such tree exists, return the tree found first. This is the tree with the smallest original diameter (remember Phase 1).

100 How do we know we're right? Compare it to the data set we created. Look at Robinson-Foulds accuracy. Remove one edge in the tree we've created; we now have two trees. Is there any way to create the same two sets of leaves by removing one edge in the true tree from our data set? If not, add a "point" of error. Repeat this for all edges. When the value is not zero, the trees are not identical.
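The edge-removal test on slide 100 compares the leaf bipartitions (splits) of two trees. A minimal Python sketch, assuming trees are written as nested tuples of leaf names and using the hypothetical names `splits` and `robinson_foulds`. Note one difference from the slide: the function below returns the usual symmetric count (splits found in exactly one of the two trees), whereas the slide counts only the edges of the inferred tree that are missing from the true tree.

```python
def splits(tree):
    """Non-trivial leaf bipartitions of a tree given as nested tuples of leaf names."""
    all_leaves = set()
    clades = []

    def walk(node):
        if isinstance(node, str):                # a leaf
            all_leaves.add(node)
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        clades.append(leaves)                    # leaves below this internal edge
        return leaves

    walk(tree)
    result = set()
    for clade in clades:
        rest = frozenset(all_leaves) - clade
        if len(clade) > 1 and len(rest) > 1:     # skip splits that isolate one leaf
            result.add(frozenset([clade, rest]))
    return result


def robinson_foulds(t1, t2):
    """Number of splits present in exactly one of the two trees."""
    return len(splits(t1) ^ splits(t2))


true_tree = ((("A", "B"), "C"), ("D", "E"))
inferred = ((("A", "C"), "B"), ("D", "E"))
print(robinson_foulds(true_tree, inferred))
```

The one-sided error count described on the slide corresponds to `len(splits(inferred) - splits(true_tree))`; either count is zero when the two trees have the same splits.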

101 Performance of DCM* - NJ. Outperforms the NJ method at sequence lengths above 4000 and with more taxa. [Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM-NJ.]

102 Improvements. Improvement possibilities lie in Phase 2. Include a test of Maximum Parsimony (MP): try to minimize the overall size of the tree. Test using statistical evidence: Maximum Likelihood (ML).

103 Performance gains. Simply changing Phase 2 gives massive gains in accuracy! DCM-NJ+MP and DCM-NJ+ML are VERY accurate for data sets greater than 4000 and are NOT NP-hard. DCM-NJ+MP finished its analysis on a 107-taxon tree in under three minutes.

104 Comparing Improvements. [Figure: error rate (0 to 0.8) versus number of leaves (0 to 1600) for NJ, DCM-NJ+SQS, DCM-NJ+MP, and HGT-FP.]

