Trees, Stars, and Multiple Biological Sequence Alignment Jesse Wolfgang CSE 497 February 19, 2004
Importance?
– RNA folding (Trifonov, Bolshoi)
– Gene regulation (Galas et al.)
– Protein structure-function relationships (Wu, Kabat)
– Molecular evolution (Dayhoff)
Introduction
– Original sequence unknown
  – Must consider all possible transformations
  – Including insertions, deletions, and replacements
– Choose the most likely set of transformations
  – With a given model of protein evolution
Sequences and Alignments
– An alignment of k sequences is written as k aligned rows of equal length
– k-sequence: a sequence of k characters
– Each aligned row is obtained from its original sequence
  – Blanks are inserted in positions where some of the other sequences have a nonblank character
  – At least one row must be nonblank in each position
– The length of the aligned sequences is the number of positions (columns)
Alignments
– Ex: sequences DQLF, DNVQ, QGL
D - - Q - L F
D N V Q - - -
- - - Q G L -
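The alignment rules can be checked mechanically. Below is a minimal sketch, assuming rows are plain strings with `-` as the blank character; the function name `is_valid_alignment` is hypothetical and not part of the talk.

```python
def is_valid_alignment(sequences, rows, blank="-"):
    """Check the three alignment rules for a list of aligned rows."""
    # One aligned row per input sequence.
    if len(rows) != len(sequences):
        return False
    # All aligned rows must have the same length.
    if len({len(row) for row in rows}) != 1:
        return False
    # Removing the blanks from a row must give back the original sequence.
    if any(row.replace(blank, "") != seq for row, seq in zip(rows, sequences)):
        return False
    # Every column must contain at least one nonblank character.
    return all(any(ch != blank for ch in column) for column in zip(*rows))


sequences = ["DQLF", "DNVQ", "QGL"]
rows = ["D--Q-LF", "DNVQ---", "---QGL-"]
print(is_valid_alignment(sequences, rows))
```

An alignment such as `["A-", "A-"]` for the sequences `["A", "A"]` is rejected because its second column is entirely blank.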
Lattices and Paths
– Lattice L: Cartesian product of strings of squares
– A path between the sequences is a set of connected line segments (connected broken line)
– A lattice of k sequences with lengths n_1, ..., n_k
  – Consists of n_1 × ... × n_k k-dimensional hypercubes
  – Forms a k-dimensional parallelepiped
Paths
– 2 dimensions: 3 possible paths per cell
– 3 dimensions: 7 possible paths per cell
– n dimensions: 2^n - 1 = O(2^n) possible paths per cell
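The counts 3 and 7 follow from enumerating every way to advance in at least one dimension. This is a minimal sketch; `lattice_steps` is a hypothetical helper name, and each step is written as a tuple of 0/1 flags (1 = consume a character of that sequence, 0 = insert a blank).

```python
from itertools import product


def lattice_steps(n):
    """All moves out of a lattice vertex in n dimensions, excluding 'stay put'."""
    return [step for step in product((0, 1), repeat=n) if any(step)]


for n in (2, 3, 4):
    print(n, len(lattice_steps(n)))  # 2**n - 1 moves in n dimensions
```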
Paths
– Sequences DQLF, DNVQ, QGL
– [Figure: path through the 3-dimensional parallelepiped sublattice, with segments labeled DD-, -N-, QQQ, --G, L-L, F--, -V-]
Paths and Sequence Length
– Sequences: ABCD, ABD, BCD
– Note: the length of an alignment is at least the length of the longest sequence and at most the sum of the sequence lengths
A B C D
A B - D
- B C D
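Paths and Sequence Length
– Sequences: ABCD, EFGH, IJK
– Note: the length of an alignment is at least the length of the longest sequence and at most the sum of the sequence lengths
– [Figure: lattice with axes ABCD, EFGH, IJK and the corresponding alignment]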
Projections
– Sequences DQLF, DNVQ, QGL
– The projection of a path onto two of the sequences denotes an alignment of those two sequences
– Ex: projection onto DQLF and QGL
D Q - L F
- Q G L -
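A projection can be computed by keeping two rows of a multiple alignment and dropping the columns where both are blank. The sketch below assumes the three-row alignment of DQLF, DNVQ, QGL is `D--Q-LF / DNVQ--- / ---QGL-` (a reading of the example slides, not something the talk states in this form); `project` is a hypothetical helper name.

```python
def project(rows, i, j, blank="-"):
    """Pairwise alignment induced on rows i and j of a multiple alignment."""
    # Keep only the columns where at least one of the two rows is nonblank.
    kept = [(a, b) for a, b in zip(rows[i], rows[j])
            if not (a == blank and b == blank)]
    return "".join(a for a, _ in kept), "".join(b for _, b in kept)


rows = ["D--Q-LF", "DNVQ---", "---QGL-"]
print(project(rows, 0, 2))
```

With that assumed input, projecting onto the first and third rows gives `DQ-LF` over `-QGL-`, the pairwise alignment shown for DQLF and QGL.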
Optimal Paths
– A measure is assigned to each path
  – Measure of the similarity among the sequences based upon a particular metric
– For each measure there is at least one path at which the measure attains a minimum value: the optimal path
Calculating Optimal Paths
– Each vertex in L is an end corner of a sublattice
– First: compute the score of each of the possible paths for the cube that has a vertex at the original corner
– Next: using this information, compute the minimum score to reach the vertices of the cubes adjacent to the original corner
– [Figure: lattice for sequences DQLF, DNVQ, QGL]
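The corner-by-corner computation is a dynamic program over lattice vertices. The sketch below is one possible reading, not the talk's implementation: it assumes an unweighted sum-of-pairs column cost with mismatch cost 1 and letter-versus-blank cost 2 (the costs used in the talk's later example), and the names `sp_column_cost` and `optimal_sp_cost` are hypothetical. It is only practical for a few short sequences, since the lattice grows as the product of the sequence lengths.

```python
from functools import lru_cache
from itertools import combinations, product


def sp_column_cost(column, mismatch=1, gap=2, blank="-"):
    """Sum of pairwise costs within one alignment column."""
    cost = 0
    for a, b in combinations(column, 2):
        if a == b:  # identical letters, or blank against blank
            continue
        cost += gap if blank in (a, b) else mismatch
    return cost


def optimal_sp_cost(seqs, mismatch=1, gap=2):
    """Minimum SP cost of any path from the origin to the far corner."""
    k = len(seqs)
    steps = [s for s in product((0, 1), repeat=k) if any(s)]

    @lru_cache(maxsize=None)
    def best(pos):
        if not any(pos):  # origin corner of the lattice
            return 0
        candidates = []
        for step in steps:
            if any(d > p for d, p in zip(step, pos)):
                continue  # this step would start outside the lattice
            prev = tuple(p - d for p, d in zip(pos, step))
            column = tuple(seqs[i][pos[i] - 1] if step[i] else "-"
                           for i in range(k))
            candidates.append(best(prev) + sp_column_cost(column, mismatch, gap))
        return min(candidates)

    return best(tuple(len(s) for s in seqs))


print(optimal_sp_cost(("DQLF", "DNVQ", "QGL")))
```

Each vertex is scored once from its up to 2^k - 1 predecessors, which is where the exponential factor in the number of sequences comes from.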
Problems with This Algorithm
– Calculates a weighted sum of its projected pairwise alignments
  – Called “Sum-of-the-Pairs” (SP)
– Other methods fit biological intuition more closely
Tree-Alignment
– Treat sequences as leaves of an evolutionary tree
– Reconstruct ancestral sequences which minimize the cost of the tree
  – Must assign sequences to internal nodes
– Align the given and reconstructed sequences
– Star-alignment: only one internal node
Tree-Alignment
– Many different methods for calculating tree alignments
– This talk discusses the version used by ClustalX
Tree-Alignment in ClustalX
– Three main parts
  1. Perform pairwise alignment on all sequences to calculate a distance matrix
  2. Use the distance matrix to calculate a guide tree
  3. Progressively align the sequences using the branching order in the guide tree
Calculating Distance Matrix
– Use standard dynamic programming to find the best alignment
  – Gap penalties for opening a gap and continuing a gap (possibly different)
– Divide the number of matches by the total number of residues compared (excluding gaps) and express the result as a percent identity
– Convert to distances by dividing by 100 and subtracting from 1
– Gives one entry in the n by n matrix
Calculating Distance Matrix
– Ex: sequences ATCG, ATCC, AGGC, AGCC
– ATCG vs. ATCC: 3/4 matches = 75% identity; distance = 1 - 75/100 = 0.25
– ATCG vs. AGGC: 1/4 matches = 25% identity; distance = 1 - 25/100 = 0.75
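The score-to-distance conversion can be sketched as follows. This assumes the match fraction is first expressed as a percent identity (3 matches out of 4 is 75) before dividing by 100, and that the two rows are already aligned to equal length; the dynamic-programming alignment step itself is omitted. The names `identity_distance` and `distance_matrix` are hypothetical.

```python
def identity_distance(row_a, row_b, blank="-"):
    """Distance = 1 - (percent identity / 100), ignoring gapped positions."""
    compared = [(a, b) for a, b in zip(row_a, row_b)
                if a != blank and b != blank]
    if not compared:
        return 1.0  # nothing comparable: treat as maximally distant
    matches = sum(a == b for a, b in compared)
    percent_identity = 100.0 * matches / len(compared)
    return 1.0 - percent_identity / 100.0


def distance_matrix(rows):
    """Symmetric n by n matrix of pairwise distances."""
    return [[identity_distance(a, b) for b in rows] for a in rows]


for line in distance_matrix(["ATCG", "ATCC", "AGGC", "AGCC"]):
    print(line)
```

Under this reading, identical sequences have distance 0 and sequences with no matching residues have distance 1.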
Calculating Distance Matrix
       ATCG  ATCT  AGGC  GCAA
ATCG   --
ATCT
AGGC
GCAA   1     1     1     --
Calculating a Guide Tree
– Use the Neighbor-Joining method to group sequences
  – Results in an unrooted tree
  – Branch lengths proportional to estimated divergence
– “Mid-point” method used to determine root
  – Means of the branch lengths on each side of the root are equal (or approximately equal)
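ClustalW and ClustalX build the guide tree with the neighbor-joining method of Saitou and Nei. The sketch below is a textbook version of that method, not ClustalX's own code; the function name `neighbor_joining` and the five-taxon distance matrix are illustrative assumptions, and rooting by the mid-point method is not shown.

```python
def neighbor_joining(names, dist):
    """Return the joins as (node_a, node_b, branch_a, branch_b), in order.

    Joined nodes are represented as nested tuples of the original names.
    The final edge between the last two remaining nodes is not returned.
    """
    d = {a: {b: float(dist[i][j]) for j, b in enumerate(names)}
         for i, a in enumerate(names)}
    nodes = list(names)
    joins = []
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all other current nodes.
        r = {a: sum(d[a][b] for b in nodes) for a in nodes}
        pairs = [(x, y) for i, x in enumerate(nodes) for y in nodes[i + 1:]]
        # Join the pair that minimizes the neighbor-joining criterion.
        a, b = min(pairs,
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        branch_a = 0.5 * d[a][b] + (r[a] - r[b]) / (2 * (n - 2))
        branch_b = d[a][b] - branch_a
        new = (a, b)
        d[new] = {new: 0.0}
        for c in nodes:
            if c != a and c != b:
                d[new][c] = d[c][new] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        nodes = [c for c in nodes if c != a and c != b] + [new]
        joins.append((a, b, branch_a, branch_b))
    return joins


# Illustrative distances only; not taken from the talk.
names = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
for join in neighbor_joining(names, dist):
    print(join)
```

The result is unrooted: the joins only record which nodes become neighbors and the branch lengths on each side.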
Calculating a Guide Tree
– [Figure: guide tree with node labels ATCG, ATCT, ATCG, AGGC, AGCC, GCAA, AGAA; labeled values for ATCG, ATCT, AGGC, and GCAA = 1]
Calculating a Guide Tree
– [Figure: guide tree with node labels ATCG, ATCT, ATCG, AGGC, AGCC, GCAA, AGAA; labeled values for ATCG, ATCT, AGGC, GCAA]
Progressive Alignment
– Perform a series of pairwise alignments
  – Gradually align larger and larger groups of sequences
– Follow the branching order of the tree
  – From leaves to root
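The leaves-to-root order can be read off a guide tree with a post-order traversal. The sketch below only lists which groups would be aligned at each step; the group-to-group alignment itself is not shown. The nested-tuple tree and the name `merge_order` are assumptions made for illustration, and the topology is hypothetical rather than the one drawn in the talk.

```python
def merge_order(tree):
    """List the (left_group, right_group) merges from leaves to root."""
    steps = []

    def visit(node):
        if isinstance(node, str):  # a leaf holds one sequence
            return [node]
        left = visit(node[0])
        right = visit(node[1])
        steps.append((left, right))  # align the two groups below this node
        return left + right

    visit(tree)
    return steps


# Hypothetical rooted guide tree over four sequences.
guide_tree = (("ATCG", "ATCT"), ("AGGC", "GCAA"))
for left, right in merge_order(guide_tree):
    print(left, "+", right)
```

The closest pairs are aligned first, and the last step at the root aligns the two largest groups.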
Progressive Alignment
– [Figure: guide tree with node labels ATCG, ATCT, ATCG, AGCC, AGGC, GCAA, AGAA]
Alignment Costs
Method            Input seq      Reconstructed seq  Mismatches
Traditional (SP)  A, A, A, C, C  --                 6
Tree-Alignment    A, A, A, C, C  A, A, C            1
Star-Alignment    A, A, A, C, C  A                  2
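The SP and star counts for a single column can be reproduced in a few lines. This is a minimal sketch assuming unit cost per mismatch; the helper names are hypothetical, and the tree count is left out because it depends on the tree topology, which is not recoverable here.

```python
from collections import Counter
from itertools import combinations


def sp_mismatches(column):
    """Number of unordered pairs in the column that disagree."""
    return sum(a != b for a, b in combinations(column, 2))


def star_mismatches(column):
    """Best single center letter and the mismatches against it."""
    center, count = Counter(column).most_common(1)[0]
    return center, len(column) - count


column = "AAACC"
print(sp_mismatches(column))
print(star_mismatches(column))
```

For the column A, A, A, C, C this gives 6 disagreeing pairs under SP and 2 mismatches against the center A under the star definition.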
Alignment Inconsistencies
– Different definitions of multiple alignments can yield different optimal alignments
– Optimal tree-alignments minimize the number of mutations from theorized common ancestors
– SP-alignments maximize the number of positions where aligned sequences agree
  – Sometimes makes more biological sense, since certain regions of proteins are more likely to mutate
Alignment Inconsistencies
– Ex: cost of 1 for aligning two different letters, cost of 2 for aligning a letter with a null
– Sequences: ACC, ACC, TCT, ATCT
                         Traditional (SP)  Star-Alignment
Input sequences          - A C C           A C C -
                         - A C C           A C C -
                         - T C T           T C T -
                         A T C T           A T C T
Reconstructed sequences  --                A C C -
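Scoring both alignments under both definitions shows the disagreement. The sketch reads the two alignments as `-ACC / -ACC / -TCT / ATCT` (SP) and `ACC- / ACC- / TCT- / ATCT` (star, center `ACC-`); using `-ACC` as a center for the SP alignment is an assumption made only for the comparison, and the helper names are hypothetical.

```python
from itertools import combinations


def pair_cost(a, b, mismatch=1, gap=2, blank="-"):
    """Cost of aligning two characters (blank against blank is free)."""
    if a == b:
        return 0
    return gap if blank in (a, b) else mismatch


def sp_cost(rows):
    """Sum of pairwise costs over all pairs of rows."""
    return sum(pair_cost(a, b)
               for x, y in combinations(rows, 2)
               for a, b in zip(x, y))


def star_cost(rows, center):
    """Sum of costs between each row and a single center sequence."""
    return sum(pair_cost(c, a) for row in rows for c, a in zip(center, row))


sp_rows = ["-ACC", "-ACC", "-TCT", "ATCT"]
star_rows = ["ACC-", "ACC-", "TCT-", "ATCT"]

print(sp_cost(sp_rows), sp_cost(star_rows))
print(star_cost(sp_rows, "-ACC"), star_cost(star_rows, "ACC-"))
```

Under these readings the first alignment is cheaper by the SP measure (14 against 15), while the second is cheaper by the star measure (5 against 6), so the two definitions prefer different alignments of the same sequences.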
ClustalX Demo
– Multiple sequence alignment program
– For more information on ClustalX
  – stalx.htm