Alignment of biological sequences. Bioinformatics, UL, 2017, Juris Viksna
Topics. Short review of sequence comparison: biological motivation to compare sequences; sequence similarity criteria; basic DP algorithm for distance computation between sequences. Global and local sequence comparisons: similarity matrices and gap penalties; modified algorithms that use gap penalties; local sequence comparison. Similarity matrices: how to obtain them; relations between similarity matrices and sequence evolution; suitability of matrices for specific sequences.
Comparison of biological sequences. Two sequence comparisons (pairwise alignment): the formulation of the problem; DP algorithm (match = 1, mismatch = 1, gap = 2); global and local comparisons; affine gap penalties; similarity matrices. Multiple alignment: the formulation of the problem (SOP); Star alignment; relation with phylogenetic trees, progressive alignment. Sequence classification, profiles and motifs: profile matrices; HMM (Hidden Markov Models).
Why do we need to compare sequences? Genome is already sequenced (assume...). There are methods that predict DNA coding regions (genes). What are the biological functions of these genes? We can find out what protein (sequence) a gene encodes, but we still do not know what this protein does... However, we can search for known proteins with similar sequences whose functions are known. We want to find out something about proteins in humans: the best approach is “experimental”, but tricky with humans... But we can try to use a similar protein (e.g. in mice) and start our experiments with it.
Basic assumptions. We will consider proteins and RNA/DNA just as sequences in 20-letter and 4-letter alphabets, respectively. Aims of comparison: to find out how similar the sequences are (some similarity measure); to find a “common motif” of sequences (alignment). Regarding algorithmic complexity there are two distinct cases: comparison of two sequences (relatively easy); simultaneous comparison of n sequences (complexity grows exponentially with n). In this lecture we will consider the problem of comparison of two sequences.
Nucleotides and DNA [Watson, Crick 1953] For us DNA is a sequence in 4 letter alphabet [Adapted from Y.Guo]
Proteins For our purposes we will treat proteins as sequences in 20 symbol alphabet [Adapted from R.Shamir]
From DNA to proteins Each codon consists of 3 nucleotides Mutations: Substitution: (changes a single aa) Insertion / Deletion: “frame shift” (change all subsequent aa) NB! Insertion / Deletion might be a multiple of 3... “Silent mutation” – DNA changed, but not aa “Nonsense mutation” - creates “stop” codon
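The mutation types listed above can be tried out on a short coding sequence. The following is a minimal sketch, assuming the standard genetic code (packed here as a 64-letter string in T, C, A, G codon order); the function name `translate` and the example sequences are made up for illustration.

```python
# Standard genetic code packed as a 64-letter string; codons are ordered
# with bases T, C, A, G at each of the three positions ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a coding DNA string codon by codon, stopping at a stop codon."""
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        aa = CODON_TABLE[dna[pos:pos + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Hypothetical example sequences (not from the slides).
print(translate("ATGGCCTGGAAA"))    # original
print(translate("ATGGCTTGGAAA"))    # silent mutation: GCC -> GCT
print(translate("ATGGACTGGAAA"))    # substitution: GCC -> GAC changes one aa
print(translate("ATGGCCTGAAAA"))    # nonsense mutation: TGG -> TGA is a stop
print(translate("ATGAGCCTGGAAA"))   # 1-base insertion: frame shift
```

Comparing the printed proteins shows that the frame shift changes every amino acid after the insertion point, while a substitution changes at most one.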
Genetic code. Completely worked out by 1966
Evolution of sequences. Mutations are a natural process of DNA evolution. DNA replication errors: substitutions, insertions, deletions (the last two together: indels). Similarity between sequences: indicates their common ancestral origin; indicates similarity of biological functions. Well, this is of course a simplification: the change of protein function will determine whether the organism will have offspring and the changed gene will survive. Protein sequence similarity is closely associated with similarity of DNA coding regions.
Sequence evolution ggcatt agcatt agcata agcatg agccta aggatt gacatt
Sequence homology Homologs - evolved from the common ancestor Orthologs - the same function in different organisms Paralogs - similar function in the same organism
Orthologs vs paralogs [Adapted from R.Shamir]
How to compare sequences? Given two proteins: >sp|P69905|HBA_HUMAN Hemoglobin VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPT TKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDM PNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP AEFTPAVHASLDKFLASVSTVLTSKYR >tr|Q61287|Q61287_MOUSE Hemoglobin MVLSGEDKSNIKAAWGKIGGHGAEYVAEALERMFASFP TTKTYFPHFDVSHGSAQVKGHGKKVADALASAAGHLDD LPGALSALSDLHAHKLRVDPVNFKLLSHCLLVTLASHH PADFTPAVHASLDKFLASVSTVLTSKYR How to assess their similarity?
Sequence alignment - BLAST
Sequence alignment – the results we expect
Sequence alignment - SSEARCH
Sequence alignment - scores. Sequence similarity/identity (%): well-defined for aligned sequence parts. “Score”: usually very method-specific in absolute value. p-value: probability that an alignment with the given score or higher is found by chance; normally the given values are only approximations. Expect (E) value: a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size; the lower the better, but short similar sequences might have comparatively high values (E-value decreases exponentially with Score). Z-score: number of standard deviations from the mean value.
Z-score
How to align two sequences - BLAST. Find two exact similarity regions (usually 4 aa each); try to join and extend these matches until the score falls below a threshold. Anyway, how should we do this “correctly”?
The “Manhattan Tourist” problem Visit as many sights as possible starting from top-left corner and moving just down or right
Longest common subsequence. Given two sequences A and B, find a longest possible sequence C that is a subsequence of both A and B (such C does not need to be unique). Example: A = GGATATCGGGCGAT, B = ATTCCCCCGCCCTA, C = ATTCGCA or ATTCGCT. How can we find it?
LCS - dynamic programming solution
$A = a_1 a_2 \dots a_n$, $B = b_1 b_2 \dots b_m$; $c(i,k)$ is the length of an LCS of $a_1 a_2 \dots a_i$ and $b_1 b_2 \dots b_k$:
$$c(i,k) = \begin{cases} 0, & \text{if } i = 0 \text{ or } k = 0 \\ c(i-1,k-1) + 1, & \text{if } i,k > 0 \text{ and } a_i = b_k \\ \max\{c(i,k-1),\ c(i-1,k)\}, & \text{if } i,k > 0 \text{ and } a_i \ne b_k \end{cases}$$
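The recurrence for c(i,k) can be sketched directly in code. This is a minimal illustration rather than an optimised implementation; the function name `lcs` and the traceback tie-breaking rule are assumptions made here, so the returned subsequence is just one of possibly several longest ones.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    n, m = len(a), len(b)
    # c[i][k] = length of an LCS of a[:i] and b[:k]; row 0 and column 0 stay 0.
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            if a[i - 1] == b[k - 1]:
                c[i][k] = c[i - 1][k - 1] + 1
            else:
                c[i][k] = max(c[i][k - 1], c[i - 1][k])
    # Trace back from the bottom-right corner to recover one LCS.
    out = []
    i, k = n, m
    while i > 0 and k > 0:
        if a[i - 1] == b[k - 1]:
            out.append(a[i - 1])
            i -= 1
            k -= 1
        elif c[i - 1][k] >= c[i][k - 1]:
            i -= 1
        else:
            k -= 1
    return "".join(reversed(out))


print(lcs("GGATATCGGGCGAT", "ATTCCCCCGCCCTA"))
```

The table needs (n+1)(m+1) cells, and the traceback visits at most n+m of them.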
LCS – example A = GADTAMAWGRAMMA B = GAGAWKIAMM
LCS - example [Figure: the DP table for A = GADTAMAWGRAMMA and B = GAGAWKIAMM, filled in step by step until the LCS length 7 is reached]
LCS: GAAWAMM
Alignment:
GA-DTAMAW--GRAMMA
GAG----AWKI--AMM-
Edit distance (Levenshtein 1966). Minimal number of operations that transform one sequence into another: insert, delete, substitute (1 symbol each). Edit distance is 0 (sequences are identical) or positive. For example “AIMS” & “AMOS” (distance = 2 for all three solutions): AIMS AMOS AIM-S A-MOS AIMS AMOS [Adapted from D.Gilbert]
Edit distance. Given two sequences A and B, find the smallest possible number of insertion, deletion and substitution operations that changes A to B. Example: A = GGATATCGGGCGAT, B = ATTCCCCCGCCCTA, [G][G]AT[A]T[C][C][C][C]CG[G-C][G-C]C[G-C]C[A]T[A], ED = 12? How can we find it?
Edit distance
$A = a_1 a_2 \dots a_n$, $B = b_1 b_2 \dots b_m$; $e(i,k)$ is the edit distance between $a_1 a_2 \dots a_i$ and $b_1 b_2 \dots b_k$:
$$e(i,k) = \begin{cases} i, & \text{if } k = 0 \\ k, & \text{if } i = 0 \\ e(i-1,k-1), & \text{if } i,k > 0 \text{ and } a_i = b_k \\ \min\{e(i-1,k-1),\ e(i,k-1),\ e(i-1,k)\} + 1, & \text{if } i,k > 0 \text{ and } a_i \ne b_k \end{cases}$$
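A minimal sketch of the edit distance recurrence, filling the full (n+1) x (m+1) table; the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit cost per operation)."""
    n, m = len(a), len(b)
    e = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        e[i][0] = i            # delete all i symbols of a[:i]
    for k in range(m + 1):
        e[0][k] = k            # insert all k symbols of b[:k]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            if a[i - 1] == b[k - 1]:
                e[i][k] = e[i - 1][k - 1]
            else:
                e[i][k] = 1 + min(e[i - 1][k - 1],   # substitute
                                  e[i][k - 1],       # insert
                                  e[i - 1][k])       # delete
    return e[n][m]


print(edit_distance("AIMS", "AMOS"))
print(edit_distance("GGATATCGGGCGAT", "ATTCCCCCGCCCTA"))
```

The second call answers the "ED = 12?" question for the example sequences by computing the optimum rather than counting operations in one particular edit script.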
ED - modifications
$$e(i,0) = i, \qquad e(0,j) = j, \qquad e(i,j) = \min \begin{cases} e(i-1,j) + t \\ e(i,j-1) + t \\ e(i-1,j-1) + t(a_i,b_j) \end{cases}$$
(for a general gap cost $t$ the boundary values are $i \cdot t$ and $j \cdot t$)
For ED: $t = 1$, $t(a_i,b_j) = 0$ if $a_i = b_j$, $t(a_i,b_j) = 1$ if $a_i \ne b_j$
For «inverse» LCS (taking max instead of min): $t = 0$, $t(a_i,b_j) = 1$ if $a_i = b_j$, $t(a_i,b_j) = 0$ if $a_i \ne b_j$
If you are interested in the result only «up to a sign», it does not matter whether min or max is used. min is more natural for ED, max for LCS; max is also the usual choice for sequence comparison. $t_{ij}$: probability that aa $a_i$ changes to aa $b_j$.
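The point of this slide is that ED and LCS are the same DP with different parameters. A minimal sketch under stated assumptions: the function name `dp_score` and its parameters are hypothetical, the boundary values are taken as i*t and j*t (which gives i and j for ED), and LCS is obtained with max instead of min.

```python
def dp_score(a, b, t, pair_cost, opt=min):
    """Generic alignment DP: t is the per-symbol gap cost,
    pair_cost(x, y) the cost of aligning x with y, opt is min or max."""
    n, m = len(a), len(b)
    e = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        e[i][0] = i * t        # assumption: boundary scales with the gap cost
    for j in range(1, m + 1):
        e[0][j] = j * t
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e[i][j] = opt(e[i - 1][j] + t,
                          e[i][j - 1] + t,
                          e[i - 1][j - 1] + pair_cost(a[i - 1], b[j - 1]))
    return e[n][m]


def ed(a, b):
    # Edit distance: gaps and mismatches cost 1, minimise.
    return dp_score(a, b, t=1, pair_cost=lambda x, y: 0 if x == y else 1, opt=min)


def lcs_length(a, b):
    # LCS length: only matches score, gaps are free, maximise.
    return dp_score(a, b, t=0, pair_cost=lambda x, y: 1 if x == y else 0, opt=max)


print(ed("AIMS", "AMOS"), lcs_length("AIMS", "AMOS"))
```

Passing a substitution matrix lookup as `pair_cost` together with `opt=max` and a negative `t` turns the same loop into a global similarity computation.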
Substitution (similarity) matrices
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C   0 -3 -3 -3  8 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H  -2  0  1 -1 -3  0  0 -2  7 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  6 -1 -1 -4 -3 -2
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 10  2 -3
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  6 -1
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
Similarity matrix. Most popular: PAM, BLOSUM, Gonnet. The one shown is BLOSUM62 (almost :) "Traditional assumption": substitution score > 0 for substitutions that are more frequent than random ones, and < 0 for those less frequent than random ones.
Sequence similarity as the longest path problem. We can treat the matrix as a graph with weighted edges. The problem then translates to finding the path with the largest/smallest weight in a Directed Acyclic Graph.
Complexity of similarity computation. Size of matrix: nm. Computing the value for each cell: const. Total time: Θ(nm). Total memory: Θ(nm). Notice that if we want just the score, only two rows are needed; in this case the required memory is Θ(min(n, m)). However, what if we also need the alignment (and we usually do)?
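The "only two rows are needed" remark can be sketched as follows for edit distance. This is an illustration, not a reference implementation; the function name is made up, and the sequences are swapped so that the shorter one runs along the row, which is valid because edit distance is symmetric.

```python
def edit_distance_two_rows(a, b):
    """Edit distance using only the previous and the current DP row."""
    if len(b) > len(a):
        a, b = b, a                      # keep the shorter sequence along the row
    prev = list(range(len(b) + 1))       # row 0: e(0, k) = k
    for i, x in enumerate(a, start=1):
        cur = [i]                        # column 0: e(i, 0) = i
        for k, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[k - 1])
            else:
                cur.append(1 + min(prev[k - 1], cur[k - 1], prev[k]))
        prev = cur                       # the older row is no longer needed
    return prev[-1]


print(edit_distance_two_rows("AIMS", "AMOS"))
```

The score comes out the same as with the full table, but the information needed for a traceback is discarded, which is exactly the difficulty raised by the question about also needing the alignment.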
Edit distance in linear space?
Interpretation of comparison results Alignment grid (edit graph). Every alignment is a path from (0,0) to (n,m).
Interpretation of comparison results
Needleman-Wunsch algorithm
Global and local alignments
Global and local alignments. Using LCS, the best local alignment will have the same score as the best global alignment (however the alignment might be «better»). Using ED, the best local alignments are likely to be for sequences of length 1. Local comparison/alignment «starts to work» if scoring is somewhere between the two above: there are extra points for each match and penalty points for each mismatch.
Computing local alignments. Just allow a «free ride» (a zero-weight edge) from the top-left vertex to each node
Computing local alignments
Computing local alignments. In this case the global alignment has a better score, but misses the «conserved domain»
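In DP terms the «free ride» means that a cell value never drops below 0, and the best local score is the maximum over the whole matrix. A minimal sketch, assuming scores of +1 for a match, -1 for a mismatch and -2 per gap symbol (these values and the function name are assumptions made for this example):

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman style best local score and the cell where it ends."""
    best, best_end = 0, (0, 0)
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0,                    # free ride: start a new alignment here
                         prev[j - 1] + s,      # match / mismatch
                         prev[j] + gap,        # gap in b
                         cur[j - 1] + gap)     # gap in a
            if cur[j] > best:
                best, best_end = cur[j], (i, j)
        prev = cur
    return best, best_end


# Hypothetical sequences sharing the segment ACGTACG.
print(local_alignment_score("TTTTACGTACGTTTTT", "GGGACGTACGGGG"))
```

Only the score and its end position are kept here; recovering the aligned segments would need the full matrix and a traceback that stops at the first 0.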
Global and local comparisons. GLOBAL: best alignment of the entirety of both sequences. For optimum global alignment, we want the best score in the final row or final column. Are these sequences generally the same? Needleman-Wunsch: find the alignment in which total score is highest, perhaps at the expense of areas of great local similarity. LOCAL: best alignment of segments, without regard to the rest of the sequence. For optimum local alignment, we want the best score anywhere in the matrix. Do these two sequences contain high scoring subsequences? Smith-Waterman: find the alignment in which the highest scoring subsequences are identified, at the expense of the overall score. [Adapted from R Altman]
Alignment with gap penalties
Finding gapped alignments. Add yet another «gap» edge between each vertex and each of its «gap predecessors». However, there are Θ(nm(m+n)) of them.
Finding gapped alignments
Finding local gapped alignments [Adapted from M.Craven]
Gap penalties in general case. The computation requires O(n^3) time though... [Adapted from M.Craven]
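Where the cubic time comes from can be seen in a direct sketch: with an arbitrary gap cost function, every cell has to look at all earlier cells in its row and its column. The function names, the +1/-1 pair scores and the affine-looking cost gap(k) = -(2 + k) below are assumptions chosen for illustration only.

```python
def global_score_general_gap(a, b, pair_score, gap):
    """Best global alignment score when a gap of length k scores gap(k)."""
    n, m = len(a), len(b)
    neg_inf = float("-inf")
    f = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    f[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                best = f[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1])
            # "Gap edges": a gap of any length k that ends at cell (i, j).
            for k in range(1, i + 1):
                best = max(best, f[i - k][j] + gap(k))
            for k in range(1, j + 1):
                best = max(best, f[i][j - k] + gap(k))
            f[i][j] = best
    return f[n][m]


def simple_score(x, y):
    return 1 if x == y else -1


def toy_gap(k):
    return -(2 + k)     # hypothetical: opening costs 2, each symbol costs 1


print(global_score_general_gap("ACGT", "AGT", simple_score, toy_gap))
```

Each of the nm cells scans up to n + m predecessors, which matches the nm(m+n) count of gap edges. For gap costs of the affine form used in `toy_gap`, the inner scans can be avoided by keeping separate matrices for "gap open" states, which brings the time back to O(nm).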
Substitution matrices. Margaret Dayhoff (1925-1983). First woman in the field of Bioinformatics
Substitution matrices. Dayhoff, M.O., Schwartz, R. and Orcutt, B.C. (1978). "A model of Evolutionary Change in Proteins". Atlas of protein sequence and structure (volume 5, supplement 3 ed.). Nat. Biomed. Res. Found. pp. 345–358.
Frequencies (probabilities) of amino acids
aa   1978   1991
L   0.085  0.091
A   0.087  0.077
G   0.089  0.074
S   0.070  0.069
V   0.065  0.066
E   0.050  0.062
T   0.058  0.059
K   0.081  0.059
I   0.037  0.053
D   0.047  0.052
R   0.041  0.051
P   0.051  0.051
N   0.040  0.043
Q   0.038  0.041
F   0.040  0.040
Y   0.030  0.032
M   0.015  0.024
H   0.034  0.023
C   0.033  0.020
W   0.010  0.014
Substitution matrices as mutation probabilities Therefore: score 1 (or log score 0) the better score, the better alignment
Substitution matrices as mutation probabilities Currently we assume that there are no gaps in alignments [Adapted from M.Craven]
Substitution matrices as mutation probabilities [Adapted from M.Craven]
Substitution matrices as mutation probabilities This gives extra ”scoring points” for each matched symbol. [Adapted from M.Craven]
PAM matrices [Adapted from M.Craven]
PAM matrices [Adapted from M.Craven]
PAM matrices [Adapted from M.Craven]
PAM matrices. A PAM (Percent Accepted Mutation) is one accepted point mutation on the path between two sequences, per 100 residues. Most frequently used: PAM250, obtained from PAM1 by matrix multiplication...
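"Obtained from PAM1 by matrix multiplication" means raising the one-step mutation probability matrix to the 250th power. The sketch below shows only the mechanics on a made-up 3-letter alphabet; the numbers in `pam1_toy` are hypothetical and are not Dayhoff's data.

```python
def mat_mul(x, y):
    """Product of two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """m raised to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical one-step matrix: each row sums to 1 and each symbol
# stays unchanged with probability 0.99 (1 change per 100 residues).
pam1_toy = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.002, 0.008, 0.990],
]
pam250_toy = mat_pow(pam1_toy, 250)
for row in pam250_toy:
    print([round(v, 3) for v in row])
```

After 250 steps the rows are still probability distributions, but the diagonal is far below 0.99: many positions have changed, some of them more than once.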
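PAM matrices - problems. “The common ancestor” is actually unknown. [Figure: sequences Qh and Qm, each T years from their common ancestor]
PAM matrices - problems. Evolution distances will be different for different pairs... [Figure: an ancestor sequence and its descendants with their distances in PAMs]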
PAM-250
BLOSUM matrices. BLOSUM62 is the BLAST default [Adapted from M.Craven]
How to align two sequences - BLAST. Find two exact similarity regions (usually 4 aa each); try to join and extend these matches until the score falls below a threshold.
BLOSUM matrices
Relationship between PAM & BLOSUM
Substitution matrices. The best matrix depends on the evolutionary distance. In general the similarity score should be proportional to the logarithm of the probability of having a common ancestor. The exact «right procedure» for computation of matrices is not trivial.
Protein evolution rates. Mutation frequencies are fairly stable, but still could differ for different groups of proteins: fibrinopeptides > hemoglobin > cytochrome > histone. For longer proteins, mutation rates might be different in different sequence regions.
Protein evolution rates
Substitution matrices - problems. If we observe a substitution a → b between two sequences, this actually could mean: a → b; a → x → b; a → x → y → b; ... As a result the computed probabilities will not be “exactly right”...
Nucleotide substitution matrices. These probably describe mutation processes better, but not the mutations that could survive during evolution. They tend to be much simpler, since they cannot reflect the role of a specific nucleotide position.
What about alignment scores? p-value: probability that an alignment with the given score or higher is found by chance. E-value: average number of alignments with the given score or higher. Assuming all match probabilities to be equal to p, the p-value can be computed using the fact that the probability of having k matches out of n is equal to $\binom{n}{k} p^k (1-p)^{n-k}$. Such probabilities correspond to the binomial distribution. E-value can be derived from p-value and database size.
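Under this simplified model the p-value is a binomial tail sum. A minimal sketch, assuming the alignment is an ungapped comparison of n positions; the function names are made up, the values n = 30, k = 10, p = 1/20 are hypothetical, and the E-value line uses the simple approximation "expected hits = p-value x number of database sequences", which is an assumption of this example rather than the formula used by any particular tool.

```python
from math import comb


def binomial_p_value(n, k, p):
    """P(at least k matches out of n), each position matching with probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


def expected_random_hits(p_value, database_size):
    """Crude E-value: expected number of chance hits in database_size comparisons."""
    return p_value * database_size


# Hypothetical numbers: 10 identical residues in a 30 aa ungapped alignment,
# with all 20 amino acids assumed equally likely (p = 1/20).
p_val = binomial_p_value(n=30, k=10, p=1 / 20)
print(p_val, expected_random_hits(p_val, 1_000_000))
```

A tiny p-value can still give an E-value that is not negligible once it is multiplied by the size of a large database.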
Probability distributions. For larger n the binomial distribution can also be well approximated by the normal distribution. Both of them are easy to use to compute p-values. However...
Computing of p-values. However, the situation is made much more complex by: use of local, not global, alignments; use of similarity matrices with different probabilities for matches/mismatches; use of gap penalties. Computation of exact distributions becomes unrealistic, but good approximations exist that deal with all these additional requirements. Still, these rely on the assumption that probabilities of protein sequences are determined just by probabilities of amino acids. This may result in p-values being considerably lower than the «real probabilities» of reaching a specific alignment score.
P-Values. A P-value of .01 occurs at the score threshold S (392 below) where the score s from a random comparison is greater than this threshold 1% of the time: P(s > S) = .01. Likewise for P = .001 and so on. [Adapted from M.Gerstein]
ROC (Receiver Operating Characteristic) curves. We will consider proteins to be homologs if their similarity exceeds some threshold t: true positives (tp): s(a,b) ≥ t and a, b are homologs; false positives (fp): s(a,b) ≥ t and a, b are not homologs; true negatives (tn): s(a,b) < t and a, b are not homologs; false negatives (fn): s(a,b) < t and a, b are homologs. Sensitivity = tp/n = tp/(tp+fn). Specificity = tn/n = tn/(tn+fp).
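The four counts and the two ratios can be computed for any threshold t; sweeping t then gives the points of a ROC curve. A minimal sketch on made-up data: the function names and the list `toy_pairs` of (score, is_homolog) values are hypothetical.

```python
def confusion_counts(scored_pairs, t):
    """Count tp, fp, tn, fn for (score, is_homolog) pairs at threshold t."""
    tp = fp = tn = fn = 0
    for score, is_homolog in scored_pairs:
        if score >= t:
            if is_homolog:
                tp += 1
            else:
                fp += 1
        else:
            if is_homolog:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(scored_pairs, t):
    """Assumes the data contain at least one homolog and one non-homolog pair."""
    tp, fp, tn, fn = confusion_counts(scored_pairs, t)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical similarity scores with known homology labels.
toy_pairs = [(50, True), (40, True), (35, False),
             (30, True), (20, False), (10, False)]
for t in (10, 30, 40, 60):
    print(t, sensitivity_specificity(toy_pairs, t))
```

Lowering t raises sensitivity and lowers specificity; the ROC curve plots sensitivity against 1 - specificity over all thresholds.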
ROC curves
ROC curves [Figure: coverage (sensitivity = tp/n = tp/(tp+fn); roughly, the fraction of sequences that one confidently “says something” about, up to 100%) plotted against error rate (the fraction of the “statements” that are false positives; error rate = 1 - specificity = fp/n, with specificity = tn/n = tn/(tn+fp)) for different score thresholds (Thresh = 10, 20, 30), comparing two “methods”; red is more effective] [Adapted from M.Gerstein]
Homology (>10%) clusters of CATH 2
Multiple sequence alignments. Very similar DP recursions work, however in time O(n^N), where N is the number of sequences. Not tractable for usual requirements on N. In practice many heuristic methods are used that «work well» but do not guarantee the optimal result.
Time required for sequence comparisons. Smith-Waterman algorithm (1981): local comparisons; linear gap penalties; use of substitution matrices. Is Θ(n^2) time practically acceptable? Protein length: around 300 aa. Comparison of 2 proteins: 100000 op. (0.0001 sec at 1 GHz). Protein database: 1000000 entries. Database search: 10^11 op, 100 sec. Comparison of two databases: 10^17 op, about 3 years :(
Heuristic methods - FASTA
Heuristic methods - FASTA Hash table of short words in the query sequence Go through DB and look for matches in the query hash (linear in size of DB) K-tuple determines word size (k-tup 1 is single aa) Lipman & Pearson 1985 VLICT = _ VLICTAVLMVLICTAAAVLICTMSDFFD [Adapted from M.Gerstein]
The FASTA Algorithm. 4 steps: (1) use lookup table to find all identities at least ktup long, find regions of identities; (2) rescan 10 regions (diagonals) with highest density of identities using PAM250; (3) join regions if possible without decreasing score below threshold; (4) rescore à la Smith-Waterman 32 residues around initial region. (Note: doesn’t save alignment)
FASTA parameters. ktup = 2 for proteins, 6 for DNA. init1: score after rescanning with PAM250 (or other). initn: score after joining regions. opt: score after Smith-Waterman.
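The first FASTA step (lookup table of ktup-long words, then counting identities per diagonal) can be sketched as follows. This covers only the word lookup, not rescoring or joining; the function names are made up, and the example reuses the query VLICT and the database string from the hash table slide.

```python
from collections import Counter, defaultdict


def ktup_index(query, ktup):
    """Map every word of length ktup in the query to its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        index[query[i:i + ktup]].append(i)
    return index


def diagonal_hits(query, subject, ktup=2):
    """Count exact ktup word matches on each diagonal (subject pos - query pos)."""
    index = ktup_index(query, ktup)
    hits = Counter()
    for j in range(len(subject) - ktup + 1):
        for i in index.get(subject[j:j + ktup], ()):
            hits[j - i] += 1
    return hits


hits = diagonal_hits("VLICT", "VLICTAVLMVLICTAAAVLICTMSDFFD", ktup=2)
print(sorted(hits.items()))
```

The scan over the database is linear in its length, and diagonals with many hits are the candidate regions that the later steps rescan.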
FASTA algorithm [Fig.1 of Pearson and Lipman 1988]
FASTA algorithm [Adapted from D Brutlag]
Heuristic methods - BLAST. Altschul, S., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215, 403-410. Indexes query and DB. Starts with all overlapping words from query. Calculates “neighborhood” of each word using PAM matrix and probability threshold. Looks up all words and neighbors from query in database index. Extends High Scoring Pairs (HSPs) left and right to maximal length. Finds Maximal Segment Pairs (MSPs) between query and database. Blast 1 does not permit gaps in alignments. [Adapted from M.Gerstein]
Heuristic methods - BLAST
BLAST algorithm. Keyword search of all words of length w from the query (of length n) in a database of length m, keeping matches with score above threshold; w = 11 for nucleotide queries, 3 for proteins. Do local alignment extension for each found keyword. Extend result until longest match above threshold is achieved. Running time O(nm). [Adapted from S.Daudenarde]
BLAST algorithm
Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
Keyword: GVK
Neighborhood words with scores (neighborhood score threshold T = 13): GVK 18, GAK 16, GIK 16, GGK 14, GLK 13 | GNK 12, GRK 11, GEK 11, GDK 11
Extension to a High-scoring Pair (HSP):
Query:  22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
           +++DN +G +   IR L    G+K I+ L+ E+ RG++K
Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
[Adapted from S.Daudenarde]
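Building the neighborhood of a keyword amounts to scoring every word of the same length against it and keeping those at or above T. A minimal brute-force sketch: the function names, the reduced alphabet "GVKA", the toy scoring (+5 identical, -1 different) and the threshold 9 are all assumptions made for this example, so the scores differ from those on the slide, which come from a real substitution matrix.

```python
from itertools import product


def neighborhood(word, alphabet, score, threshold):
    """All words of len(word) over alphabet scoring >= threshold against word."""
    result = {}
    for cand in product(alphabet, repeat=len(word)):
        s = sum(score(x, y) for x, y in zip(word, cand))
        if s >= threshold:
            result["".join(cand)] = s
    return result


def toy_score(x, y):
    # Hypothetical scoring, standing in for a substitution matrix lookup.
    return 5 if x == y else -1


words = neighborhood("GVK", "GVKA", toy_score, threshold=9)
for w, s in sorted(words.items(), key=lambda item: -item[1]):
    print(w, s)
```

With the full 20-letter alphabet and w = 3 this enumerates 8000 candidates per keyword, which is small enough to do for every word of the query.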
Original BLAST. Dictionary: all words of length w. Alignment: ungapped extensions until score falls below some threshold. Output: all local alignments with score > statistical threshold. [Adapted from S.Daudenarde]
Original BLAST: Example. w = 4; exact keyword match of GGTC; extend diagonals with mismatches until score is under 50%. Output result: GTAAGGTCC / GTTAGGTCC [Figure: dot matrix with axis labels A C G A A G T A A G G T C C A G T and C T G A T C C T G G A T T G C G A] From lectures by Serafim Batzoglou (Stanford)
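The ungapped extension of a seed along its diagonal can be sketched as below. Several things here are assumptions: the two sequences are read off the axis labels of the dot matrix (the second one in reverse), the scores are +1/-1, and the stopping rule is a simple "drop more than `xdrop` below the best score so far" rule rather than the slide's 50% criterion; the function name is made up.

```python
def extend_ungapped(s1, s2, i, j, w, match=1, mismatch=-1, xdrop=2):
    """Extend the exact seed s1[i:i+w] == s2[j:j+w] left and right without gaps.
    Returns (start1, end1, start2, end2, score) with end indices exclusive."""
    def sc(x, y):
        return match if x == y else mismatch

    # Extend to the right, remembering the best-scoring end point.
    run = best_r = right = 0
    k = 0
    while i + w + k < len(s1) and j + w + k < len(s2):
        run += sc(s1[i + w + k], s2[j + w + k])
        k += 1
        if run > best_r:
            best_r, right = run, k
        elif best_r - run > xdrop:
            break
    # Extend to the left in the same way.
    run = best_l = left = 0
    k = 0
    while i - 1 - k >= 0 and j - 1 - k >= 0:
        run += sc(s1[i - 1 - k], s2[j - 1 - k])
        k += 1
        if run > best_l:
            best_l, left = run, k
        elif best_l - run > xdrop:
            break
    score = w * match + best_l + best_r
    return i - left, i + w + right, j - left, j + w + right, score


seq1 = "ACGAAGTAAGGTCCAGT"
seq2 = "AGCGTTAGGTCCTAGTC"
seed = "GGTC"
a1, b1, a2, b2, score = extend_ungapped(seq1, seq2, seq1.find(seed),
                                        seq2.find(seed), len(seed))
print(seq1[a1:b1], seq2[a2:b2], score)
```

Under these assumptions the extension tolerates the single mismatch to the left of the seed and stops where mismatches accumulate on both sides.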
Gapped BLAST
Gapped BLAST: Example. Original BLAST exact keyword search, THEN: extend with gaps in a zone around the ends of the exact match until score < threshold, then merge nearby alignments. Output result: GTAAGGTCC-AGT / GTTAGGTCCTAGT [Figure: dot matrix with axis labels A C G A A G T A A G G T C C A G T and C T G A T C C T G G A T T G C G A] From lectures by Serafim Batzoglou (Stanford)
BLAST - Programs. blastp: compares an amino acid query sequence against a protein sequence database. blastn: compares a nucleotide query sequence against a nucleotide sequence database. blastx: compares a nucleotide query sequence translated in all reading frames against a protein sequence database. tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames. tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx is extremely slow and cpu-intensive.
PSI-BLAST – position-specific matrices and transitive sequence comparison
PSSM – position-specific scoring matrix
Iterated PSI-BLAST
General Protein Search Principles. Choose between local or global search algorithms. Use the most sensitive search algorithm available: original BLAST for no gaps; Smith-Waterman for most sensitivity; FASTA with k-tuple 1 is a good compromise; gapped BLAST for well delimited regions; PSI-BLAST for families. Initially use BLOSUM62 and default gap penalties; if no significant results, use BLOSUM30 and lower gap penalties. FASTA cutoff of .01; BLAST cutoff of .0001. Examine results between exp. 0.05 and 10 for biological significance. Ensure expected score is negative. Beware of hits on long sequences or hits with unusual aa composition. Reevaluate results of borderline significance using limited query region. Segment long queries ≥ 300 amino acids. Segment around known motifs. (some text adapted from D Brutlag)
Links to databases and search tools. UniProt/Swiss-Prot, the “main” protein sequence database: http://www.uniprot.org/ (text queries, BLAST search, etc), https://www.ebi.ac.uk/uniprot (EBI site), http://pir.georgetown.edu/ (PIR site; text queries, SSearch, ClustalW). Search tools: http://www.ebi.ac.uk/Tools/sss/ (pair alignments: FASTA, BLAST, SSearch etc), http://www.ebi.ac.uk/Tools/msa/ (multiple alignments: ClustalW, T-Coffee etc), http://fasta.bioch.virginia.edu (FASTA, SSearch: searches and software downloads)
Links to databases and search tools Protein structure database (PDB): http://www.rcsb.org/pdb Ensembl genome browser: www.ensembl.org/