Multiple Sequence Comparison
/course/eleg667-01-f/Topic-2c
Outline
  Motivation
  Multiple Sequence Alignment using Dynamic Programming
  Multiple Sequence Alignment using Heuristics
  Star Alignments
  Tree Alignments (CLUSTAL W)
  PSI-BLAST and multiple sequence alignment
  Evaluation of Alignment Methods
  Summary
Pair-wise sequence comparison
  “In biomolecular sequences (DNA, RNA, protein), high sequence similarity usually implies significant functional or structural similarity.”
  This principle underlies the effectiveness of pair-wise sequence comparison and of biological database searching.
  Goal: find sequences that have common sub-patterns but were not previously known to be biologically related.
Multiple sequence comparison
  “Evolutionarily and functionally related molecular sequences can differ significantly at the sequence level and yet preserve similar function and/or structure.”
  This principle underlies the effectiveness of multiple sequence comparison.
  Goal: deduce unknown conserved patterns from a set of sequences already known to be biologically related.
Common MSA Applications
  Characterization and representation of protein families, and later identification of other potential members of the family.
  Identification and representation of conserved sequence features that correlate with structure and function.
  Deduction of evolutionary history.
Common MSA Applications (cont.)
  To detect or demonstrate homology between new sequences and existing families of sequences.
  To help predict the secondary and tertiary structures of new sequences.
  To suggest oligonucleotide primers for PCR.
Comparing Multiple Sequences
  We can compare multiple sequences by aligning them and assigning a score to each alignment. This is Multiple Sequence Alignment (MSA).
Application of MSA
  Homology search (e.g. BLAST) -> database top-scoring hits -> MSA -> conserved regions, evolution paths, ...
Definition of MSA
  A Multiple Sequence Alignment is obtained by inserting into each sequence a (possibly zero) number of gaps so that the resulting sequences are all of the same length and each column has at least one character different from '-' (gap).
  Input sequences:
    IMAGINABLE
    IMPRACTICABLE
    INFALLIBLE
  A valid alignment:
    IM--AG-INABLE
    IMPRACTICABLE
    IN-FALLI--BLE
  A variant whose third column contains only gaps, which the definition rules out:
    IM---AG-INABLE
    IM-PRACTICABLE
    IN--FALLI--BLE
How to score an alignment?
  The Sum-of-Pairs (SP) score: a multiple alignment implies a pair-wise alignment for each pair of sequences. SP defines the score of a multiple alignment as the sum of the scores of all implied pair-wise alignments.
  Scoring: match = 1, mismatch = 0, gap-character = -1, gap-gap = 0. Note: score(-,-) = 0.
  Example alignment:
    A A C G T A C G A T A
    A - C G T A - A A T G
    G T C G T A - - T T A
  Pair-wise scores: rows 1 and 2 = 5, rows 2 and 3 = 3, rows 1 and 3 = 4, so SP score = 5 + 3 + 4 = 12.
  Column by column: 1 - 2 + 3 + 3 + 3 + 3 - 2 - 2 + 1 + 3 + 1 = 12.
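The SP computation above can be checked with a few lines of Python. This is a minimal sketch: the function names `pair_score` and `sp_score` are illustrative choices, and the scoring values are the ones given on the slide.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=0, gap=-1):
    """Score one aligned pair of characters under the slide's scheme."""
    if a == "-" and b == "-":
        return 0  # gap against gap costs nothing
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_score(rows):
    """Sum-of-Pairs score of a multiple alignment given as equal-length strings."""
    if len({len(r) for r in rows}) != 1:
        raise ValueError("all rows must have the same length")
    return sum(
        pair_score(x, y)
        for r1, r2 in combinations(rows, 2)  # every implied pair-wise alignment
        for x, y in zip(r1, r2)              # every column of that pair
    )

alignment = ["AACGTACGATA", "A-CGTA-AATG", "GTCGTA--TTA"]
print(sp_score(alignment))  # 12
```

Summing over pairs first or over columns first gives the same total, which is why the slide can show both decompositions.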
MSA using dynamic programming
  For k sequences of size n: $O(n^k)$ space and $O(k^2 2^k n^k)$ time.
    $n^k$ cells
    $2^k - 1$ calculations per cell
    $k(k-1)/2$ calculations to compute the SP score
  For k = 3: $2^3 - 1 = 7$ calculations per cell.
  [Figure: three-dimensional DP lattice for three sequences, showing the candidate alignment columns that lead into one cell.]
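To make the cost terms concrete, here is a hedged sketch of the full k-dimensional DP that returns only the optimal SP score. It assumes the SP scoring from the previous slide (match 1, mismatch 0, gap -1, gap-gap 0); the names `predecessor_moves`, `column_score`, and `msa_dp_score` are illustrative. It is exponential in k and only meant for tiny inputs.

```python
from itertools import combinations, product

def predecessor_moves(k):
    """All 2^k - 1 ways to advance at least one of k sequences by one character."""
    return [m for m in product((0, 1), repeat=k) if any(m)]

def column_score(col, match=1, mismatch=0, gap=-1):
    """SP score of a single alignment column: k(k-1)/2 pair comparisons."""
    total = 0
    for a, b in combinations(col, 2):
        if a == "-" and b == "-":
            continue
        total += gap if "-" in (a, b) else (match if a == b else mismatch)
    return total

def msa_dp_score(seqs):
    """Optimal SP score over all multiple alignments of seqs."""
    k = len(seqs)
    moves = predecessor_moves(k)
    origin = (0,) * k
    best = {origin: 0}
    # product() yields cells in lexicographic order, so every predecessor
    # of a cell has already been filled in when the cell is reached.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell == origin:
            continue
        candidates = []
        for m in moves:
            prev = tuple(c - d for c, d in zip(cell, m))
            if min(prev) < 0:
                continue
            col = [s[p] if d else "-" for s, p, d in zip(seqs, prev, m)]
            candidates.append(best[prev] + column_score(col))
        best[cell] = max(candidates)
    return best[tuple(len(s) for s in seqs)]

print(len(predecessor_moves(3)))         # 7
print(msa_dp_score(["AG", "AC", "GC"]))  # 2
```

The dictionary `best` holds one entry per lattice cell, which is the $n^k$ space term; the loop over `moves` is the $2^k - 1$ factor, and `column_score` is the $k(k-1)/2$ factor.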
Recall the pair-wise case
  DP matrix for the sequences GA and AG, with the nine cells numbered 1 to 9:
           -    G    A
      -    0   -1   -2
      A   -1    0    0
      G   -2    0    0
  Question: from cell 1 to cell 9, how many paths are there?
  Question: when the DP comparison ends, how many possible distinct paths have been explored in total for this example?
  Answer: let us count. Total = 13.
Align Multiple Sequences
  Assume we have 3 sequences: AG, AC, GC. How do we do DP?
  The DP matrix becomes a three-dimensional lattice with one axis per sequence.
  Question: when the DP comparison ends, how many possible distinct paths have been explored in total?
  Answer: count!
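The counting can be done mechanically. The sketch below assumes that a "path" is a sequence of DP steps from the first cell to the last, where each step advances at least one sequence by one character; `count_paths` is an illustrative name.

```python
from itertools import product

def count_paths(dims):
    """Count lattice paths from the origin to dims where each step advances
    at least one coordinate by exactly 1 (2^k - 1 step types)."""
    k = len(dims)
    moves = [m for m in product((0, 1), repeat=k) if any(m)]
    paths = {}
    for cell in product(*(range(d + 1) for d in dims)):
        if not any(cell):
            paths[cell] = 1  # one way to be at the start: the empty path
            continue
        total = 0
        for m in moves:
            prev = tuple(c - d for c, d in zip(cell, m))
            if min(prev) >= 0:
                total += paths[prev]
        paths[cell] = total
    return paths[tuple(dims)]

print(count_paths((2, 2)))     # 13: two sequences of length 2
print(count_paths((2, 2, 2)))  # 409: three sequences of length 2
```

Going from two to three sequences of length 2 raises the number of paths from 13 to 409, which illustrates how quickly the search space grows with k.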
MSA Using DP with Heuristics
  How can we cut down the search space (the number of calculations) at each step?
  One way is to eliminate pair-wise projections that cannot contribute to the optimal alignment. This requires developing a test that identifies them.
Other MSA Methods Using Heuristics
  Star Alignment: build a multiple alignment based upon the pair-wise alignments between a fixed sequence, called the “center” of the input set, and all others.
  Tree Alignment: build a multiple alignment based upon the pair-wise alignments along the edges of a tree relating all the sequences.
Star Alignment
  Given k sequences:
  1. Pick one of the sequences as the center.
  2. Find optimal pair-wise alignments between the center sequence and each other sequence.
  3. Aggregate the pair-wise alignments (progressive alignment).
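Step 2 relies on an optimal pair-wise global alignment. The following is a minimal score-only sketch of that DP. The scoring values (match +1, mismatch -1, gap -2) are an assumption, chosen because they reproduce the score of 7 that the worked example in this deck reports for ATTGCCATT and ATGGCCATT; `global_alignment_score` is an illustrative name.

```python
def global_alignment_score(s, t, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of s and t, using O(len(t)) memory."""
    # Row 0: aligning the empty prefix of s against prefixes of t costs gaps only.
    prev = [j * gap for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]  # column 0: prefix of s against the empty prefix of t
        for j in range(1, len(t) + 1):
            diag = prev[j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = prev[j] + gap        # s[i-1] aligned to a gap
            left = curr[j - 1] + gap  # t[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]

print(global_alignment_score("ATTGCCATT", "ATGGCCATT"))  # 7
```

Recovering the alignment itself, not just the score, would require keeping the full matrix and tracing back from the last cell.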
Aggregate Step
  Use the center Sc as a guide.
  Start with one pair-wise alignment, say Sc and S1, and aggregate the remaining pairs one at a time.
  When adding a pair (Si, Sc), progressively insert gaps into Sc as needed to accommodate the new alignment, and never remove gaps.
Star Alignment (cont.)
  How should we select the center sequence?
  Build a table with the pair-wise similarity score for each pair of sequences.
  Choose the sequence with the highest sum of scores.
Star Alignment (cont.)
  Input sequences:
    S1 = ATTGCCATT
    S2 = ATGGCCATT
    S3 = ATCCAATTTT
    S4 = ATCTTCTT
    S5 = ACTGACC
  [Figure: DP matrix for S1 versus S2, with the first row and column initialized to 0, -2, -4, ..., -18. Score = 7.]
    S1 = ATTGCCATT
    S2 = ATGGCCATT
  Pair-wise score table:
           S1   S2   S3   S4   S5
      S1    -    7   -2    0   -3
      S2    7    -   -2    0   -4
      S3   -2   -2    -    0   -7
      S4    0    0    0    -   -3
      S5   -3   -4   -7   -3    -
  So S1 is picked as the center.
  For k sequences, each of size n: Time = $T_1 = O\left(\frac{k(k-1)}{2} n^2\right) = O(k^2 n^2)$.
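The center choice is a row-sum over the pair-wise score table. A small sketch, using the ten pair scores from this slide as input (the name `pick_center` and the dictionary layout are illustrative):

```python
def pick_center(names, pair_scores):
    """Return the sequence with the highest sum of pair-wise scores, plus all sums."""
    totals = {name: 0 for name in names}
    for (a, b), score in pair_scores.items():
        totals[a] += score  # each pair contributes to both of its members
        totals[b] += score
    center = max(names, key=lambda name: totals[name])
    return center, totals

names = ["S1", "S2", "S3", "S4", "S5"]
pair_scores = {
    ("S1", "S2"): 7, ("S1", "S3"): -2, ("S1", "S4"): 0, ("S1", "S5"): -3,
    ("S2", "S3"): -2, ("S2", "S4"): 0, ("S2", "S5"): -4,
    ("S3", "S4"): 0, ("S3", "S5"): -7,
    ("S4", "S5"): -3,
}
center, totals = pick_center(names, pair_scores)
print(center, totals)
```

The sums are 2 for S1, 1 for S2, -11 for S3, -3 for S4, and -17 for S5, so S1 wins narrowly over S2.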
Star Alignment (cont.)
  Pair-wise alignments with the center S1:
    S1 = ATTGCCATT
    S2 = ATGGCCATT

    S1 = ATTGCCATT--
    S3 = ATC-CAATTTT

    S1 = ATTGCCATT
    S4 = ATCTTC-TT

    S1 = ATTGCCATT
    S5 = ACTGACC--
  Stacked on the center, before padding:
    S1 = ATTGCCATT
    S2 = ATGGCCATT
    S3 = ATC-CAATTTT
    S4 = ATCTTC-TT
    S5 = ACTGACC--
  Final multiple alignment, after extending every row with the gaps added to S1:
    S1 = ATTGCCATT--
    S2 = ATGGCCATT--
    S3 = ATC-CAATTTT
    S4 = ATCTTC-TT--
    S5 = ACTGACC----
  “Once a gap, always a gap”
  For k sequences, each of size n, and an upper bound a on the alignment length:
    Time = $T_2 = O((k-1) n^2 + (k-1)^2 a)$
    $T_1 + T_2 = O(k^2 n^2 + k n^2 + (k-1)^2 a)$
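The aggregation can be sketched as a column-by-column merge that follows the "once a gap, always a gap" rule. This is one possible implementation, not necessarily the one used in the course; `merge_star` is an illustrative name, and it assumes every pair-wise alignment contains the same center sequence once gaps are removed.

```python
def merge_star(pairwise):
    """Merge (center_aligned, other_aligned) pairs into one multiple alignment.

    Row 0 of the result is the center. Gaps are only ever added, never removed.
    """
    center_row, first_other = pairwise[0]
    rows = [list(center_row), list(first_other)]
    for c_aln, s_aln in pairwise[1:]:
        old, master = rows, rows[0]
        new = [[] for _ in range(len(old) + 1)]
        i = j = 0
        while i < len(master) or j < len(c_aln):
            m = master[i] if i < len(master) else None
            c = c_aln[j] if j < len(c_aln) else None
            if m is not None and m == c:
                # Same center character (or same gap): reuse the existing column.
                for r, row in enumerate(old):
                    new[r].append(row[i])
                new[-1].append(s_aln[j])
                i += 1
                j += 1
            elif m == "-":
                # Gap already in the merged center: the new sequence gets a gap.
                for r, row in enumerate(old):
                    new[r].append(row[i])
                new[-1].append("-")
                i += 1
            elif c == "-":
                # New gap required in the center: pad every existing row.
                for r in range(len(old)):
                    new[r].append("-")
                new[-1].append(s_aln[j])
                j += 1
            else:
                raise ValueError("center sequences do not match")
        rows = new
    return ["".join(row) for row in rows]

pairs = [
    ("ATTGCCATT", "ATGGCCATT"),      # S1 with S2
    ("ATTGCCATT--", "ATC-CAATTTT"),  # S1 with S3
    ("ATTGCCATT", "ATCTTC-TT"),      # S1 with S4
    ("ATTGCCATT", "ACTGACC--"),      # S1 with S5
]
for row in merge_star(pairs):
    print(row)
```

Adding the i-th sequence touches every existing row once per column, which is where the $(k-1)^2 a$ term in $T_2$ comes from.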
Issues in Star Alignment
  How to select the best anchor (center)?
  How to determine the order of progression?
Tree Alignment
  Uses a clustering technique to order groups of related sequences in a hierarchical tree.
  Based on the tree hierarchy (in order from leaves to root), the multiple sequence alignment is generated by aligning and combining groups of sequences.
The Basic Idea of Tree Alignment
  [Figure: (a) a set of sequences S1 to S9; (c) a pair-wise distance matrix over S1 to S9.]
Question: In general, given pair-wise distances between a set S of objects (e.g. a distance matrix), how do we derive a weighted tree T in which each leaf corresponds to an object in S, and the distance between two leaves i and j corresponds to the distance between i and j in S?
Answer: This is an important problem in computational biology, and it has been studied by many authors using a variety of techniques.
Clustal W – A Tool of Progressive Multiple Sequence Alignment with Improved Sensitivity
CLUSTAL W (Cont.)
  1. All pairs of sequences are aligned separately in order to calculate a distance matrix giving the divergence of each pair of sequences.
  2. A guide tree is calculated from the distance matrix.
  3. The sequences are progressively aligned according to the branching order in the guide tree.
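As an illustration of step 1, the sketch below turns one pair-wise alignment into a divergence value. It assumes divergence is measured as 1 minus the fraction of identical residues over the columns where neither sequence has a gap; the slides do not spell out the exact formula, and `alignment_distance` is an illustrative name.

```python
def alignment_distance(a, b):
    """Divergence of two aligned sequences: 1 - identity over ungapped columns."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    # Keep only the columns where both sequences have a residue.
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 1.0  # nothing aligned: treat as maximally divergent
    identical = sum(x == y for x, y in pairs)
    return 1.0 - identical / len(pairs)

print(alignment_distance("ATTGCCATT--", "ATC-CAATTTT"))  # 0.25
```

Computing this value for every pair of sequences fills the distance matrix that the guide tree is built from.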
CLUSTAL W (Cont.)
  Distance matrix:
           S1    S2    S3    S4
      S1         D12   D13   D14
      S2               D23   D24
      S3                     D34
      S4
  [Figure: guide tree over the leaves S1, S3, S2, S4, derived from the distance matrix.]
CLUSTAL W (Cont.)
  Following the guide tree:
  1. Align the most similar pair (S2, S4), inserting gaps to optimize the alignment.
  2. Align the next most similar pair (S1, S3).
  3. Align the two alignments, preserving existing gaps. New gaps are introduced to optimize the alignment of (S1 S3) with (S2 S4).
Clustal W: Some Implementation Hints
Distance Matrix
  Initially all sequences are aligned pair-wise, for example d(S1, S2) = 7 and d(S1, S3) = 8.
           S1   S2   S3   S4   S5   S6   S7
      S2    7
      S3    8    5
      S4   11    8    5
      S5   13   10    7    8
      S6   16   13   10   11    5
      S7   13   10    7    8    6    9
      S8   17   14   11   12   10   13    8
Two Options for Pairwise Alignment
  Fast approximate method (Bashford, D., Chothia, C., 1987, J. Mol. Biol.)
    Allows a large number of sequences to be aligned, even on a microcomputer.
  Full dynamic programming alignments (Myers, E., Miller, W., 1988, CABIOS)
    Two gap penalties.
    Full weight matrix.
The Guide Tree
  Unrooted tree: calculated from the distance matrix (Neighbour-Joining Method).
  Rooted tree: calculated from the unrooted tree (Middle Point Method).
Unrooted Tree
  The Neighbor Joining Method provides not only the topology but also the branch lengths (Fitch, Margoliash) of the final tree.
  Each leaf node represents a sequence.
  Each path length represents the distance between two specific sequences.
Unrooted Tree - Example
  [Figure: unrooted tree with leaves S1 to S8, internal nodes A to F, and branch lengths L1 to L8.]
Neighbour Joining Method
  [Figure: a star tree over S1 to S8 with central node X, and the tree obtained after joining S1 and S2 through a new node Y connected to X.]
  S12 = sum of all branch lengths = f(D's)
NJ-Method Example
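A compact sketch of neighbor joining follows. It uses the commonly taught Q-criterion formulation, $Q(i,j) = (n-2)\,d(i,j) - r_i - r_j$ with $r_i$ the row sum of the distance matrix, which may differ in presentation from the sum-of-branch-lengths form on the previous slide. The input is the 8-sequence distance matrix from the Distance Matrix slide as reconstructed from its numbers; the function name and the internal node names N1, N2, ... are illustrative.

```python
def neighbor_joining(names, lower):
    """Return (joins, last_edge). Each join is (a, len_a, b, len_b, new_node)."""
    # Expand the lower triangle into a symmetric dict-of-dicts.
    dist = {a: {} for a in names}
    for i, a in enumerate(names):
        for j in range(i):
            b = names[j]
            dist[a][b] = dist[b][a] = lower[a][j]
    nodes, joins = list(names), []
    while len(nodes) > 2:
        n = len(nodes)
        r = {a: sum(dist[a][b] for b in nodes if b != a) for a in nodes}
        # Pick the pair that minimises the Q criterion.
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                a, b = nodes[x], nodes[y]
                q = (n - 2) * dist[a][b] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        len_a = dist[a][b] / 2 + (r[a] - r[b]) / (2 * (n - 2))
        len_b = dist[a][b] - len_a
        new = f"N{len(joins) + 1}"
        dist[new] = {}
        for c in nodes:
            if c not in (a, b):
                d = (dist[a][c] + dist[b][c] - dist[a][b]) / 2
                dist[new][c] = dist[c][new] = d
        nodes = [c for c in nodes if c not in (a, b)] + [new]
        joins.append((a, len_a, b, len_b, new))
    a, b = nodes
    return joins, (a, b, dist[a][b])

names = ["S1", "S2", "S3", "S4", "S5", "S6", "S7", "S8"]
lower = {  # lower[x][j] = distance between x and names[j]
    "S1": [],
    "S2": [7],
    "S3": [8, 5],
    "S4": [11, 8, 5],
    "S5": [13, 10, 7, 8],
    "S6": [16, 13, 10, 11, 5],
    "S7": [13, 10, 7, 8, 6, 9],
    "S8": [17, 14, 11, 12, 10, 13, 8],
}
joins, last_edge = neighbor_joining(names, lower)
print(joins[0])  # ('S1', 5.0, 'S2', 2.0, 'N1')
```

With this input the first pair joined is S1 and S2, with branch lengths 5 and 2 to the new internal node. The result is an unrooted tree; rooting it is a separate step.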
PSI-BLAST
  Observation: database searches using position-specific score matrices, also called profiles or motifs, are often much better able to detect weak relationships than database searches that use a simple sequence as the query.
PSI-BLAST (cont.)
  PSI-BLAST uses a procedure that constructs a position-specific score matrix automatically from the output of a BLAST run, together with a modified BLAST that operates on such a matrix in place of a simple query.
  The resulting PSI-BLAST program is often substantially more sensitive than the corresponding BLAST program.
PSI-BLAST and Multiple Sequence Alignment
  PSI-BLAST also produces a multiple sequence alignment, with the query sequence as a master template:
    Collect all hits with an E-value below a threshold, say 0.01.
    Do not include copies of sequences identical to the query.
    Retain one copy of each hit that is very similar to the query.
    Other details.
  The MSA constructed is used by PSI-BLAST to construct a scoring matrix.
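To show what "constructing a scoring matrix from the MSA" means in its simplest form, here is a deliberately simplified sketch. It assumes a DNA alphabet, a uniform background distribution, and a fixed pseudocount of 1 per letter; PSI-BLAST itself works on proteins and uses sequence weighting and more elaborate pseudocounts, none of which is modeled here. `build_pssm` is an illustrative name.

```python
import math
from collections import Counter

def build_pssm(rows, alphabet="ACGT", pseudocount=1.0):
    """One dict of log-odds scores per alignment column.

    rows[0] is the query (master template). Gap characters are ignored, so
    each column may be based on a different number of sequences.
    """
    background = 1.0 / len(alphabet)  # assumed uniform background frequency
    pssm = []
    for column in zip(*rows):
        counts = Counter(ch for ch in column if ch in alphabet)
        total = sum(counts.values()) + pseudocount * len(alphabet)
        pssm.append({
            letter: math.log2(((counts[letter] + pseudocount) / total) / background)
            for letter in alphabet
        })
    return pssm

rows = ["ATTGCCATT", "ATGGCCATT", "ATCTTC-TT"]  # query first
pssm = build_pssm(rows)
print({letter: round(score, 2) for letter, score in pssm[0].items()})
```

In the first column all three sequences have A, so A gets a positive score and every other letter a negative one. A database search would then score a candidate residue at position i with `pssm[i][residue]` instead of using a fixed substitution matrix.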
Where Does PSI-BLAST Differ from Other “True” MSA Methods?
  PSI-BLAST deals with local alignments, so each column of M (the multiple alignment) may involve a varying number of sequences.
  In fact, some columns may include only the query sequence itself.
Classification of Multiple Sequence Alignment Methods
  [Figure: classification tree of MSA methods. Category labels: Progressive, Iterative, (local), Global Alignment, Local Alignment. Methods shown: DALIGN, PIMA, HMM (HMMT), STAR, Tree, Genetic Algorithm (SAGA), MULTAL, MULTALIGN, PILEUP, CLUSTAL W, PSI-BLAST.]
How to Compare Alignment Software ?
CASA: A Server for the Critical Assessment of Protein Sequence Alignment Accuracy
  [Figure: remote users (User 1 to User 4) connect over a high-speed network to a benchmark server that hosts a FASTA protein database, CE structural alignments, and a web-interactive benchmarking program. The figure contrasts a sequence alignment with a structural alignment of the same pair of sequences.]
  Remote users:
  1. Download the FASTA sequences.
  2. Produce a set of sequence alignments.
  3. Submit the resulting alignments.
  4. The benchmarking program evaluates the parameters.
Methods of Discovery of Biological Sequence Homology
  [Figure: taxonomy of methods with two top-level branches, Alignment and Pattern Matching. Labels shown: Pair wise, MSA, Optimal, Heuristic, Global, Local, DP, FAST, BLAST, FLASH, Progressive, Iterative (see slide 41), Eventuation All and verify, Scan Seeds And ???, Combined, Discover, MOTIF/ASSET, PRATT, TEIRESIAS, ATGC.]