Protein Sequence Alignment and Database Searching
What is a protein sequence alignment? The equivalencing of residues in two different proteins. Alignment implies that the aligned residues in the proteins are performing similar roles in the two different proteins. Important to think of proteins as three-dimensional objects, not just strings of letters.
Barton, G. J. et al, (1992), "Human Platelet Derived Endothelial Cell Growth Factor is Homologous to E.coli Thymidine Phosphorylase", Prot. Sci., 1,
Immunoglobulin Variable Domains
Protein Sequence Alignment - How? Need scoring scheme for matching amino acid residues. Need to cope with insertions and deletions (gaps or indels). Need algorithm to find ‘best’ alignment. Need some way of judging if the alignment is likely to be correct.
Protein Scoring Schemes A table of scores for aligning each possible amino acid pair. The simplest scheme just scores 1 for identity and 0 for non-identity. Better schemes weight similarities in amino acid properties or observed substitutions. For example, the BLOSUM and PAM series.
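The two kinds of scheme can be sketched in Python as a function and a table lookup. This is only an illustration: the names identity_score, table_score and BLOSUM62_SUBSET are made up for this example, and the table holds just a handful of entries copied from the published BLOSUM62 matrix rather than the full set of pairs.

```python
def identity_score(a: str, b: str) -> int:
    """Simplest scheme: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


# A few entries from BLOSUM62 (illustrative subset, not the whole matrix).
BLOSUM62_SUBSET = {
    ("A", "A"): 4,
    ("W", "W"): 11,
    ("I", "L"): 2,
    ("D", "E"): 2,
    ("D", "L"): -4,
}


def table_score(a: str, b: str, table=BLOSUM62_SUBSET) -> int:
    """Look up a pair score; the table is symmetric, so try both orders.

    Raises KeyError for pairs that are not in the (partial) table.
    """
    if (a, b) in table:
        return table[(a, b)]
    return table[(b, a)]
```

Under identity scoring, isoleucine against leucine scores 0, whereas the table rewards the conservative substitution with a positive score and penalises leucine against aspartate.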
BLOSUM62 Matrix (figure: table of substitution scores with rows and columns labelled A R N D C Q E G H I L K M F P S T W Y V B Z X *)
Finding the ‘best’ alignment The mathematically best alignment is the one that gives the highest score when the amino acids of the two proteins are aligned. This alignment is not necessarily the one that is biologically meaningful.
Dot-Plot comparison of Human Annexin I with itself. Four repeats (domains?) are visible. Sequence Analysis of Annexin Domains. Program: DOTTER
Gap Penalties Score for aligning a residue or residues in one protein to a gap in the other. Most usual form: penalty = ul + v where l is the length of the gap and u and v are constants. u is often called the gap extension penalty, v, the gap creation penalty.
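The penalty formula can be written directly as a small Python function. The default values for u and v are arbitrary illustrative choices, not recommended settings.

```python
def gap_penalty(length: int, u: float = 1.0, v: float = 10.0) -> float:
    """Penalty for one gap of the given length: u * length + v.

    u is the gap extension penalty and v the gap creation penalty.
    A gap of length zero (no gap) costs nothing.
    """
    if length <= 0:
        return 0.0
    return u * length + v
```

With u = 1 and v = 10, one gap of length 4 costs 14 while two separate gaps of length 2 cost 24 in total, so this form favours a few long gaps over many short ones.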
Dynamic Programming Trick to avoid having to generate all possible alignments. First introduced in molecular biology by Needleman and Wunsch (1970). Many variations on the theme. Basis of (nearly) all sequence alignment programs. Finds the mathematically ‘best’ score for alignment of two sequences of length M and N in MN steps.
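A minimal sketch of the dynamic programming idea in Python, in the spirit of Needleman and Wunsch. It makes simplifying assumptions that are not from the slides: identity scoring by default, a fixed cost per gapped residue (the case v = 0 of the penalty ul + v), and it returns only the best score rather than the alignment itself. The function name is illustrative.

```python
def needleman_wunsch_score(seq_a: str, seq_b: str,
                           match: int = 1, mismatch: int = 0,
                           gap: int = -1) -> int:
    """Best global alignment score of seq_a against seq_b."""
    m, n = len(seq_a), len(seq_b)
    # score[i][j] = best score for aligning seq_a[:i] with seq_b[:j]
    score = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        score[i][0] = i * gap          # seq_a prefix against gaps only
    for j in range(1, n + 1):
        score[0][j] = j * gap          # seq_b prefix against gaps only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + s,   # align the two residues
                score[i - 1][j] + gap,     # residue of seq_a against a gap
                score[i][j - 1] + gap,     # residue of seq_b against a gap
            )
    return score[m][n]
```

Each of the M x N cells is filled once from three neighbouring cells, which is where the MN step count comes from.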
Is the alignment correct? Randomisation test (Monte-Carlo) can suggest if the sequences are similar enough to align accurately. A Z-score > 6 from the randomisation test suggests the alignment will be correct over most of its length.
What is a randomisation test? Align sequences by dynamic programming and record score S. Shuffle order of amino acids in the sequences and re-align the pair. Record the score for this alignment, repeat 100 times. Calculate mean and Standard Deviation (sd) of shuffled sequence comparison scores. Z= (S-mean)/sd
Mean x̄ (e.g. 0.0), Value V (e.g. 4.3), Standard Deviation sd (e.g. 1.8). Z-score = (Value – Mean)/(Standard Deviation) = (V – x̄)/sd, e.g. (4.3 – 0.0)/1.8 = 2.39
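The randomisation test can be sketched in Python as follows. Everything here is an illustrative assumption rather than the method of any particular program: the scorer is a compact global alignment with identity scoring and a fixed per-residue gap cost, both sequences are shuffled each round, the sample standard deviation is used, and the function names are made up.

```python
import random
import statistics


def global_score(a, b, match=1, mismatch=0, gap=-1):
    """Global alignment score by dynamic programming (keeps one row)."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def z_score(value, mean, sd):
    """Z = (value - mean) / sd."""
    return (value - mean) / sd


def randomisation_z(a, b, n_shuffles=100, seed=0):
    """Z-score of the real alignment score against shuffled sequences."""
    rng = random.Random(seed)          # fixed seed for repeatable runs
    real = global_score(a, b)
    shuffled_scores = []
    for _ in range(n_shuffles):
        sa, sb = list(a), list(b)
        rng.shuffle(sa)                # same composition, random order
        rng.shuffle(sb)
        shuffled_scores.append(global_score(sa, sb))
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    if sd == 0:
        raise ValueError("shuffled scores do not vary; Z is undefined")
    return z_score(real, mean, sd)
```

Shuffling keeps the length and amino acid composition of each sequence, so the shuffled scores estimate what unrelated sequences of the same composition would score.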
Why perform multiple alignment? Can help improve alignment accuracy between any pair of sequences. Prediction of functionally important residues. Sub-family analysis (not this lecture.) Prediction of secondary structure and buried residues (not this lecture.)
Single sequence N Q L E V F M D G E L A... physico-chemical properties of amino acids
Multiple sequences N Q L E V F M D G E L E A... N D E K V Y M E G D I Q V...
Multiple sequences N Q L E V F M D G E L E A... N D E K V Y M E G D I Q V... N S S Q V K I K G Q V D L... N N T N V A M R G K M N T... conserved positions with conserved hydrophobics
Multiple sequences help fit a sequence on a structure (threading) N Q L E V F M D G E L E A... N D E K V Y M E G D I Q V... N S S Q V K I K G Q V D L... N N T N V A M R G K M N T...
Multiple sequences help alignment itself N V A H G K M... N T N V I R G K M N T E V F D G E L... D E K V Y E G N I Q V
Multiple sequences help alignment itself (also pattern matching) E F M D L E A... K Y M E I Q V... Q K I V D L Q... N V A H G K M... Q L E V A D G E L E A D V K V L Y G D I Q V S V Q V K K G Q V D L N T N V I R G K M N T E V F D G E L... K V Y E G D I... Q V K K G Q V... N V A R G K M... Q L E F M D E W L E A D E K V Y E G N I Q V S S Q K I K Q A V D L N T N A M R K F M N T
Multiple sequences help alignment itself (also pattern matching) E F M D K Y M E Q K I V N V A H Q L E V A D G E L E A D V K V L Y G D I Q V S V Q V K K G Q V D L N T N V I R G K M N T E V F D G E L... K V Y E G D I... Q V K K G Q V... N V A R G K M... L E F M D E W L E A E K V Y E G N I Q V S Q K I K Q A V D L T N A M R K F M N T
Multiple Sequence Alignment - How? Alignment of more than 2 sequences. Can’t directly extend dynamic programming to more than 3 sequences due to memory and CPU limitations. Corner cutting can allow alignments up to around 10 sequences. Practical multiple alignment methods are HIERARCHICAL.
Hierarchical multiple alignment Compare all pairs of sequences. Generate a guide tree or dendrogram. Follow tree from leaves to root, building the alignment as you go. Most popular program is CLUSTAL. Others are AMPS, MULTAL and PileUp.
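The first two steps (compare all pairs, build a guide tree) can be sketched in Python. This is a toy under stated assumptions, not how CLUSTAL or the other programs work internally: the distance is simply the fraction of mismatched positions between equal-length sequences, clusters are joined by average linkage, and the alignment-building step is left out. All names are illustrative.

```python
from itertools import combinations


def mismatch_fraction(a: str, b: str) -> float:
    """Toy distance: fraction of differing positions (equal lengths assumed)."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def guide_tree(seqs: dict, distance=mismatch_fraction):
    """Return a guide tree as nested tuples of sequence names.

    seqs maps a name to a sequence. The closest pair of clusters is
    joined repeatedly, using the average distance between their members.
    """
    # Each cluster is (subtree, list of member names).
    clusters = [(name, [name]) for name in seqs]
    dist = {frozenset((p, q)): distance(seqs[p], seqs[q])
            for p, q in combinations(seqs, 2)}

    def average(c1, c2):
        total = sum(dist[frozenset((p, q))] for p in c1[1] for q in c2[1])
        return total / (len(c1[1]) * len(c2[1]))

    while len(clusters) > 1:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: average(clusters[ij[0]], clusters[ij[1]]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]
```

A hierarchical method would then walk this tree from the innermost tuples outwards, aligning the most similar sequences first and adding the more distant ones to the growing alignment.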
Protein Sequence Database Searching Take single sequence and look for similar sequences in a large database. - For a database of 2,300,000 sequences, needs 2,300,000 sequence comparisons. - Needs good statistics to evaluate quality of match. - Needs local alignment method.
A protein may have multiple domains and so only match in some regions. Local alignment methods (algorithms) overcome this problem. Smith & Waterman algorithm
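A minimal Python sketch of local alignment scoring in the style of Smith and Waterman. The scoring values are illustrative assumptions (a negative mismatch score and a fixed per-residue gap cost), and only the best local score is returned, not the aligned region.

```python
def smith_waterman_score(a: str, b: str,
                         match: int = 2, mismatch: int = -1,
                         gap: int = -2) -> int:
    """Best local alignment score between any segments of a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                0,                    # start a fresh local alignment here
                prev[j - 1] + s,      # extend by aligning two residues
                prev[j] + gap,        # residue of a against a gap
                cur[j - 1] + gap,     # residue of b against a gap
            )
            cur.append(cell)
            best = max(best, cell)    # the best cell may be anywhere
        prev = cur
    return best
```

The two differences from global alignment are the floor of zero in every cell and taking the maximum over the whole table, which together let a shared domain score well even when the flanking regions are unrelated.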
Ranking the results list Want proteins that are similar to rank above those that are not! No method does this perfectly.
Black bars - proteins related to query sequence. White bars - proteins that are unrelated to query. (a) - no separation (b) - partial separation (c) - full separation. (c) is the goal of searching, but this rarely happens...
Expectation Value For a sequence pair that scores S in a database search, the E-value is the number of sequences in the database that one would expect to see by chance with a score at least as high as S. E-values are usually estimated from the Extreme Value Distribution (EVD).
Expectation values If E=5 for a score of 200 in a database search, then one would expect to see 5 sequences with this score or higher by chance alone. If E is far below 1 for a score of 750, then one would not expect to see any sequence pair with this score by chance alone, so the pair are probably related.
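One way to turn a score into an E-value is sketched below in Python, assuming the chance scores follow an extreme value (Gumbel) distribution with location mu and scale parameter lam. These two parameters are hypothetical inputs that would have to be fitted to the scores of unrelated sequences, and multiplying the tail probability by the database size is one common convention; actual search programs differ in the details.

```python
import math


def evd_p_value(score: float, mu: float, lam: float) -> float:
    """P(chance score >= score) = 1 - exp(-exp(-lam * (score - mu)))."""
    # -expm1(-y) equals 1 - exp(-y) and stays accurate for tiny y.
    return -math.expm1(-math.exp(-lam * (score - mu)))


def e_value(score: float, mu: float, lam: float, db_size: int) -> float:
    """Expected number of database sequences scoring >= score by chance."""
    return db_size * evd_p_value(score, mu, lam)
```

Because the tail of this distribution falls off exponentially, a modest increase in score can lower the E-value by orders of magnitude.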
Database Searching Algorithms Can use dynamic programming to search. Slowest, but best method. Most commonly, HEURISTIC methods are used - e.g. BLAST and FASTA. These reduce the time for a search by taking shortcuts.
FASTA Algorithm Does fast lookup of identical matches. Then looks for runs of identity. Then builds alignment. Then estimates significance.
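The first two steps can be illustrated with a word lookup table and a count of hits per diagonal. This Python sketch is a loose illustration of the idea only, not the FASTA program's implementation; the function names and the word length k = 2 are assumptions made for the example.

```python
from collections import Counter, defaultdict


def ktuple_index(seq: str, k: int = 2) -> dict:
    """Map every word of length k in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def best_diagonal(query: str, target: str, k: int = 2):
    """Return (diagonal, hit_count) for the diagonal with most word hits.

    A diagonal is the offset i - j between a word at position i in the
    query and the identical word at position j in the target. Many hits
    on one diagonal indicate a run of identity. Returns None if the two
    sequences share no word of length k.
    """
    index = ktuple_index(query, k)
    diagonals = Counter()
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            diagonals[i - j] += 1
    if not diagonals:
        return None
    return diagonals.most_common(1)[0]
```

Only the neighbourhood of the best diagonals then needs a full dynamic programming alignment, which is where the time saving over an exhaustive search comes from.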
BLAST Algorithm Basic Local Alignment Search Tool. Applications to Protein-Protein, Protein-DNA, DNA-Protein and DNA-DNA comparisons.
More advanced searching Iterative searching - PSI-BLAST. Profile searching. Hidden Markov Models (HMMs). Combination of sequence information with other information.
Reading material for this lecture - look at BLAST service. - look at Tools, in particular SRS and CLUSTAL. Book chapter (online) Same information in PDF File:
The end