Published by Ruth O’Connor
1
Trees, Stars, and Multiple Biological Sequence Alignment
Jesse Wolfgang
CSE 497
February 19, 2004
2
Importance?
RNA folding (Trifonov, Bolshoi)
Gene regulation (Galas et al.)
Protein structure-function relationships (Wu, Kabat)
Molecular evolution (Dayhoff)
3
Introduction
Original sequence unknown
– Must consider all possible transformations
– Including insertions, deletions, and replacements
Choose the most likely set of transformations
– With a given model of protein evolution
4
Sequences and Alignments
K-sequence: sequence of k characters
An alignment of the sequences is written as a set of aligned sequences, all of the same length
Each aligned sequence is obtained from the corresponding original sequence
– Blanks are inserted in positions where some of the other sequences have a nonblank character
– At least one aligned sequence must be nonblank at each position
5
Alignments
Ex: sequences DQLF, DNVQ, QGL
Unaligned:
D Q L F
D N V Q
Q G L
Aligned:
D - - Q - L F
D N V Q - - -
- - - Q G L -
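The two conditions on an alignment (removing blanks recovers each original sequence, and no position is blank in every row) can be checked mechanically. The following is a minimal sketch; the function name `is_valid_alignment` and the use of `-` as the blank character are assumptions made for illustration.

```python
GAP = "-"  # assumed blank character

def is_valid_alignment(rows, originals):
    """Check the alignment conditions for a list of aligned rows."""
    # All aligned rows must share one length.
    if len({len(r) for r in rows}) != 1:
        return False
    # Removing blanks from each row must give back the original sequence.
    if [r.replace(GAP, "") for r in rows] != list(originals):
        return False
    # Every column must hold at least one nonblank character.
    return all(any(c != GAP for c in col) for col in zip(*rows))

rows = ["D--Q-LF", "DNVQ---", "---QGL-"]
print(is_valid_alignment(rows, ["DQLF", "DNVQ", "QGL"]))  # True
```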
6
Lattices and Paths
Lattice: Cartesian product of strings of squares
A path between the sequences is a set of connected line segments (connected broken line)
A lattice of N sequences
– Consists of N-dimensional hypercubes
– Forms an N-dimensional parallelepiped
7
Paths
2 dimensions: 3 possible paths out of each vertex
3 dimensions: 7 possible paths out of each vertex
n dimensions: 2^n - 1 = O(2^n) possible paths out of each vertex
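The counts 3 and 7 come from the fact that a step out of a lattice vertex advances along any nonempty subset of the axes. A short sketch that enumerates these steps as 0/1 vectors (the function name `steps` is an assumption for illustration):

```python
from itertools import product

def steps(n):
    """All nonzero 0/1 vectors of length n: the moves available at a vertex."""
    return [s for s in product((0, 1), repeat=n) if any(s)]

for n in (2, 3, 4):
    # The number of moves equals 2**n - 1.
    print(n, len(steps(n)), 2**n - 1)
```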
8
Paths
3-dimensional parallelepiped sublattice
Sequences DQLF, DNVQ, QGL
[Figure: path through the lattice; its segments are labeled with the alignment columns DD-, -N-, -V-, QQQ, --G, L-L, F--]
9
Paths and Sequence Length
Sequences: ABCD, ABD, BCD
Note: the length of the alignment is at least the length of the longest sequence and at most the sum of the sequence lengths
A B C D
A B - D
- B C D
10
Paths and Sequence Length
Sequences: ABCD, EFGH, IJK
Note: the length of the alignment is at least the length of the longest sequence and at most the sum of the sequence lengths
A B C D - - - - - - -
- - - - E F G H - - -
- - - - - - - - I J K
11
Projections
Sequences DQLF, DNVQ, QGL
The projection of a path onto a pair of sequences denotes an alignment of those two sequences
Ex: projection onto DQLF and QGL
D Q - L F
- Q G L -
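A projection can be computed directly from the aligned rows: keep the two chosen rows and drop every column in which both are blank. A minimal sketch, assuming `-` as the blank character and a hypothetical helper name `project`:

```python
def project(rows, i, j, gap="-"):
    """Project a multiple alignment onto rows i and j."""
    # Keep only the columns where at least one of the two rows is nonblank.
    cols = [(a, b) for a, b in zip(rows[i], rows[j]) if (a, b) != (gap, gap)]
    return "".join(a for a, _ in cols), "".join(b for _, b in cols)

rows = ["D--Q-LF", "DNVQ---", "---QGL-"]
print(project(rows, 0, 2))  # ('DQ-LF', '-QGL-')
```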
12
Optimal Paths
A measure is assigned to each path
– Measure of the similarity among the sequences, based upon a particular metric
For each measure there is at least one path at which the measure attains its minimum value: the optimal path
13
Calculating Optimal Paths
Each vertex in L is an end corner of a sublattice
First: compute the score of each of the possible paths for the cube that has a vertex at the original corner
Next: using this information, compute the minimum score to reach the vertices of the cubes adjacent to the original corner
[Figure: lattice with axes DQLF, DNVQ, QGL]
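The corner-by-corner computation can be sketched as a dynamic program over all lattice vertices. This is only an illustration under stated assumptions: the column score is an unweighted sum of pairs with cost 1 for two different letters and cost 2 for a letter against a blank, the names `pair_cost` and `sp_align_cost` are hypothetical, and the code returns the minimum score only, not the path. The table has one entry per vertex, so it is practical only for a few short sequences.

```python
from itertools import combinations, product

def pair_cost(a, b, mismatch=1, gap=2):
    """Assumed costs: 0 for equal symbols, 2 for letter vs blank, 1 otherwise."""
    if a == b:
        return 0  # identical letters, or two blanks
    if a == "-" or b == "-":
        return gap
    return mismatch

def sp_align_cost(seqs):
    """Minimum sum-of-pairs score over all paths through the lattice."""
    k = len(seqs)
    moves = [m for m in product((0, 1), repeat=k) if any(m)]
    best = {(0,) * k: 0}
    # product() yields vertices in lexicographic order, so every predecessor
    # of a vertex has already been scored when the vertex is reached.
    for v in product(*(range(len(s) + 1) for s in seqs)):
        if not any(v):
            continue  # the original corner is already scored
        candidates = []
        for m in moves:
            u = tuple(vi - mi for vi, mi in zip(v, m))
            if min(u) < 0:
                continue  # this move would start outside the lattice
            # The move consumes one letter from each sequence it advances.
            col = [s[vi - 1] if mi else "-" for s, vi, mi in zip(seqs, v, m)]
            step = sum(pair_cost(a, b) for a, b in combinations(col, 2))
            candidates.append(best[u] + step)
        best[v] = min(candidates)
    return best[tuple(len(s) for s in seqs)]

print(sp_align_cost(["DQLF", "DNVQ", "QGL"]))
```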
14
Problems with This Algorithm
Scores a path as a weighted sum of the scores of its projected pairwise alignments
– Called “Sum-of-the-Pairs” (SP)
Other methods fit biological intuition more closely
15
Tree-Alignment
Treat sequences as leaves of an evolutionary tree
Reconstruct ancestral sequences which minimize the cost of the tree
– Must assign sequences to internal nodes
Align the given and reconstructed sequences
Star-alignment: only one internal node
16
Tree-Alignment
Many different methods for calculating tree alignments
Discuss version used by ClustalX
17
Tree-Alignment in ClustalX
Three main parts
1. Perform pairwise alignment on all sequences to calculate a distance matrix
2. Use distance matrix to calculate a guide tree
3. Sequences are progressively aligned using the branching order in the guide tree
http://bimas.dcrt.nih.gov/clustalw/clustalw.html
18
Calculating Distance Matrix
Use standard dynamic programming to find the best alignment
– Gap penalties for opening a gap and continuing a gap (possibly different)
Divide number of matches by total number of residues compared (excluding gaps)
Convert to distances by dividing by 100 and subtracting from 1
Gives one entry in the n by n matrix
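The identity-to-distance step can be sketched as below. Two assumptions are made loudly here. First, the sequences are taken as already aligned and of equal length, so the dynamic-programming step is skipped. Second, identity is read as a percentage (3 matches out of 4 is 75%), so dividing by 100 and subtracting from 1 gives 0.25; the code works with the fraction directly, which is equivalent. Under this reading the results differ from the .9925 and .9975 values in the worked example, which divide the fraction 0.75 itself by 100. The function names are hypothetical.

```python
def identity_distance(a, b, gap="-"):
    """Distance = 1 - (matches / residues compared) for two aligned rows."""
    # Positions where either row has a gap are excluded from the comparison.
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 1.0  # nothing comparable: treat as maximally distant
    matches = sum(x == y for x, y in pairs)
    return 1.0 - matches / len(pairs)

def distance_matrix(seqs):
    """Symmetric n-by-n matrix of pairwise distances."""
    n = len(seqs)
    return [[identity_distance(seqs[i], seqs[j]) for j in range(n)]
            for i in range(n)]

for row in distance_matrix(["ATCG", "ATCT", "AGGC", "GCAA"]):
    print(row)
```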
19
Calculating Distance Matrix
Ex: sequences ATCG, ATCC, AGGC, AGCC
A T C G
A T C C
= 3/4 = .75/100 = 1 - .0075 = .9925
A T C G
A G G C
= 1/4 = .25/100 = 1 - .0025 = .9975
20
Calculating Distance Matrix
      ATCG   ATCT   AGGC   GCAA
ATCG  --
ATCT  .9925  --
AGGC  .9975         --
GCAA  1      1      1      --
21
Calculating a Guide Tree
Use the Neighbor-Joining method to group sequences
– Results in an unrooted tree
– Branch lengths proportional to estimated divergence
“Mid-point” method used to determine root
– Means of the branch lengths to each side of the root are equal (or approximately equal)
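As an illustration of the grouping step, here is a compact sketch of the neighbor-joining method of Saitou and Nei, which is the method the ClustalW documentation names for building its guide tree. It is a plain textbook version, not ClustalX's own code. The function name, the dictionary-of-frozensets input format, and the 5-taxon distance matrix are assumptions chosen for illustration and are not taken from the slides. The result is the unrooted join order with branch lengths; mid-point rooting is not shown.

```python
def neighbor_joining(names, d):
    """Return (joins, last_edge) for labels `names` and distances `d`.

    d maps frozenset({x, y}) to the distance between x and y.
    Each join is (a, b, branch_len_a, branch_len_b).
    """
    nodes = list(names)
    d = dict(d)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        total = {x: sum(d[frozenset((x, y))] for y in nodes if y != x)
                 for x in nodes}
        # Join the pair minimizing Q(x, y) = (n - 2) d(x, y) - total[x] - total[y].
        a, b = min(
            ((x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]),
            key=lambda p: (n - 2) * d[frozenset(p)] - total[p[0]] - total[p[1]],
        )
        dab = d[frozenset((a, b))]
        len_a = dab / 2 + (total[a] - total[b]) / (2 * (n - 2))
        len_b = dab - len_a
        new = f"({a},{b})"
        # Distance from the new internal node to every remaining node.
        for z in nodes:
            if z not in (a, b):
                d[frozenset((new, z))] = (
                    d[frozenset((a, z))] + d[frozenset((b, z))] - dab
                ) / 2
        nodes = [z for z in nodes if z not in (a, b)] + [new]
        joins.append((a, b, len_a, len_b))
    x, y = nodes
    return joins, (x, y, d[frozenset((x, y))])

# Illustrative distances only (not from the slides).
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {frozenset((names[i], names[j])): matrix[i][j]
        for i in range(5) for j in range(i + 1, 5)}
joins, last_edge = neighbor_joining(names, dist)
for join in joins:
    print(join)
print(last_edge)
```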
22
Calculating a Guide Tree
[Figure: guide tree with leaves ATCG, ATCT, AGGC, GCAA and internal nodes ATCG, AGCC, AGAA; branch labels .9925, .9975/2, .9975, 1/3, 1]
ATCG = 1.8245
ATCT = 1.8245
AGGC = 1.3308 1.6599
GCAA = 1
23
Calculating a Guide Tree
[Figure: guide tree with leaves ATCG, ATCT, AGGC, GCAA and internal nodes ATCG, AGCC, AGAA; branch labels .9925, 1, 1, .9975/2]
ATCG = 1.4911
ATCT = 1.4911 1.4911
AGGC = 1.4986
GCAA = 1.4986 1.4986
24
Progressive Alignment
Perform a series of pairwise alignments
– Slowly align larger and larger groups of sequences
Follow the branching order of the tree
– From leaves to root
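The leaves-to-root order can be read off a guide tree with a post-order walk. The sketch below only lists which groups would be aligned at each step; it does not perform the pairwise or group alignments themselves. The nested-tuple tree representation, the function name `merge_order`, and the particular tree shape are assumptions for illustration.

```python
def merge_order(tree):
    """Post-order walk of a nested-tuple guide tree: leaves first, root last."""
    order = []

    def visit(node):
        if isinstance(node, str):
            return [node]  # a leaf is a group of one sequence
        left, right = (visit(child) for child in node)
        order.append((left, right))  # align these two groups at this step
        return left + right

    visit(tree)
    return order

# Assumed tree shape: ATCG and ATCT join first, then AGGC, then GCAA.
guide = ((("ATCG", "ATCT"), "AGGC"), "GCAA")
for left, right in merge_order(guide):
    print(left, "+", right)
```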
25
Progressive Alignment
[Figure: guide tree with leaves ATCG, ATCT, AGGC, GCAA and internal nodes ATCG, AGCC, AGAA]
26
Alignment Costs
Method            Input seq      Reconstructed seq  Mismatches
Traditional (SP)  A, A, A, C, C  --                 6
Tree-Alignment    A, A, A, C, C  A, A, C            1
Star-Alignment    A, A, A, C, C  A                  2
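The three mismatch counts for the single column A, A, A, C, C can be reproduced with a few lines. The tree topology used here (three A leaves under internal nodes A and A, two C leaves under internal node C) is an assumption consistent with the reconstructed sequences A, A, C; the helper name `mismatches` is hypothetical.

```python
from itertools import combinations

leaves = ["A", "A", "A", "C", "C"]

def mismatches(pairs):
    """Count pairs whose two letters differ."""
    return sum(a != b for a, b in pairs)

# Sum-of-pairs: compare every pair of input letters.
sp_cost = mismatches(combinations(leaves, 2))

# Star: one reconstructed center letter compared with every leaf;
# take the center letter that gives the fewest mismatches.
star_cost = min(mismatches((c, x) for x in leaves) for c in set(leaves))

# Tree (assumed topology): count mismatches along the tree edges.
edges = [
    ("A", "A"), ("A", "A"),  # two A leaves to internal node 1 (A)
    ("A", "A"),              # internal node 1 (A) to internal node 2 (A)
    ("A", "A"),              # third A leaf to internal node 2 (A)
    ("A", "C"),              # internal node 2 (A) to internal node 3 (C)
    ("C", "C"), ("C", "C"),  # two C leaves to internal node 3 (C)
]
tree_cost = mismatches(edges)

print(sp_cost, tree_cost, star_cost)  # 6 1 2
```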
27
Alignment Inconsistencies
Different definitions of multiple alignments can yield different optimal alignments
Optimal tree-alignments minimize the number of mutations from theorized common ancestors
SP-alignments maximize the number of positions where aligned sequences agree
– Sometimes makes more biological sense, since certain regions of proteins are more likely to mutate
28
Alignment Inconsistencies
Ex: cost of 1 for aligning two different letters, cost of 2 for aligning a letter with a null
Sequences: ACC, ACC, TCT, ATCT
Traditional (SP)
Input sequences:
- A C C
- A C C
- T C T
A T C T
Reconstructed sequences: --
Star-Alignment
Input sequences:
A C C -
A C C -
T C T -
A T C T
Reconstructed sequence:
A C C -
29
ClustalX Demo
Multiple sequence alignment program
For more information on ClustalX
– http://www.at.embnet.org/embnet/progs/clustal/clustalx.htm