1
Alignment of biological sequences
Bioinformatics Alignment of biological sequences UL, 2017, Juris Viksna
2
Topics
Short review of sequence comparison:
- biological motivation for comparing sequences
- sequence similarity criteria
- basic DP algorithm for computing the distance between sequences
Global and local sequence comparisons:
- similarity matrices and gap penalties
- modified algorithms that use gap penalties
- local sequence comparison
Similarity matrices:
- how to obtain them
- relations between similarity matrices and sequence evolution
- suitability of matrices for specific sequences
3
Comparison of biological sequences
Two-sequence comparison (pairwise alignment):
- formulation of the problem
- DP algorithm (match = 1, mismatch = 1, gap = 2)
- global and local comparisons
- affine gap penalties
- similarity matrices
Multiple alignment:
- formulation of the problem (SOP, sum of pairs)
- star alignment
- relation to phylogenetic trees, progressive alignment
Sequence classification: profiles and motifs
- profile matrices
- HMMs (Hidden Markov Models)
4
Why do we need to compare sequences?
- The genome is already sequenced (assume...)
- There are methods that predict DNA coding regions (genes)
- What are the biological functions of these genes?
- We can find out which protein (sequence) a gene encodes
- But we still do not know what this protein does...
- However, we can search for known proteins with similar sequences whose functions are known

- We want to find out something about proteins in humans
- The best approach is “experimental”, but that is tricky with humans...
- But we can try to use a similar protein (e.g. in mice) and start our experiments with it
5
Basic assumptions
We will treat proteins and RNA/DNA just as sequences over 20-letter and 4-letter alphabets, respectively.
Aims of comparison:
- to find out how similar the sequences are (some similarity measure)
- to find a “common motif” of the sequences (alignment)
Regarding algorithmic complexity, there are two distinct cases:
- comparison of two sequences (relatively easy)
- simultaneous comparison of n sequences (complexity grows exponentially with n)
In this lecture we will consider the problem of comparing two sequences.
6
Nucleotides and DNA [Watson, Crick 1953]
For us, DNA is a sequence over a 4-letter alphabet [Adapted from Y.Guo]
7
Proteins
For our purposes we will treat proteins as sequences over a 20-symbol alphabet [Adapted from R.Shamir]
8
From DNA to proteins
Each codon consists of 3 nucleotides
Mutations:
- Substitution: changes a single aa
- Insertion / Deletion: “frame shift” (changes all subsequent aa). NB! The insertion / deletion length might be a multiple of 3...
- “Silent mutation”: DNA changed, but not the aa
- “Nonsense mutation”: creates a “stop” codon
9
Genetic code
Completely worked out by 1966
10
Evolution of sequences
Mutations are a natural process of DNA evolution
DNA replication errors:
- substitutions
- insertions and deletions (together: indels)
Similarity between sequences:
- indicates their common ancestral origin
- indicates similarity of biological functions
Well, this is of course a simplification: the change in protein function will determine whether the organism has offspring and whether the changed gene survives.
Protein sequence similarity is closely associated with the similarity of DNA coding regions.
11
Sequence evolution
Each codon consists of 3 nucleotides
Mutations:
- Substitution: changes a single aa
- Insertion / Deletion: “frame shift” (changes all subsequent aa). NB! The insertion / deletion length might be a multiple of 3...
- “Silent mutation”: DNA changed, but not the aa
- “Nonsense mutation”: creates a “stop” codon
12
Sequence evolution ggcatt agcatt agcata agcatg agccta aggatt gacatt
13
Sequence homology
Homologs: evolved from a common ancestor
Orthologs: homologs in different organisms, separated by speciation (typically the same function)
Paralogs: homologs in the same organism, arising from gene duplication (typically a similar function)
14
Orthologs vs paralogs [Adapted from R.Shamir]
15
How to compare sequences?
Given two proteins:
>sp|P69905|HBA_HUMAN Hemoglobin
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPT
TKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDM
PNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP
AEFTPAVHASLDKFLASVSTVLTSKYR
>tr|Q61287|Q61287_MOUSE Hemoglobin
MVLSGEDKSNIKAAWGKIGGHGAEYVAEALERMFASFP
TTKTYFPHFDVSHGSAQVKGHGKKVADALASAAGHLDD
LPGALSALSDLHAHKLRVDPVNFKLLSHCLLVTLASHH
PADFTPAVHASLDKFLASVSTVLTSKYR
How to assess their similarity?
16
Sequence alignment - BLAST
17
Sequence alignment - BLAST
18
Sequence alignment – the results we expect
19
Sequence alignment - SSEARCH
20
Sequence alignment - SSEARCH
21
Sequence alignment - scores
- Sequence similarity/identity (%): well-defined for aligned sequence parts
- “Score”: usually very method-specific in absolute value
- p-value: the probability that an alignment with the given score or higher is found by chance. Normally the given values are only approximations
- Expect (E)-value: a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. The lower the better, but short similar sequences might have comparatively high values (the E-value decreases exponentially with the score)
- Z-score: number of standard deviations from the mean value
22
Z-score
23
Z-score
24
How to align two sequences - BLAST
Find two exact similarity regions (usually 4 aa each)
Try to join and extend these matches until the score falls below a threshold
Anyway, how should we do this “correctly”?
25
The “Manhattan Tourist” problem
Visit as many sights as possible starting from top-left corner and moving just down or right
26
Longest common subsequence
Given two sequences A and B, find a longest possible sequence C that is a subsequence of both A and B (such a C does not need to be unique)
Example:
A = GGATATCGGGCGAT
B = ATTCCCCCGCCCTA
C = ATTCGCA or ATTCGCT
How can we find it?
27
LCS – dynamic programming solution
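$A = a_1 a_2 \dots a_n$, $B = b_1 b_2 \dots b_m$
$c(i,k)$: length of the LCS of $a_1 a_2 \dots a_i$ and $b_1 b_2 \dots b_k$
$$c(i,k) = \begin{cases} 0, & \text{if } i = 0 \text{ or } k = 0 \\ c(i-1,k-1) + 1, & \text{if } i,k > 0 \text{ and } a_i = b_k \\ \max\{c(i,k-1),\ c(i-1,k)\}, & \text{if } i,k > 0 \text{ and } a_i \ne b_k \end{cases}$$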
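The recurrence for c(i,k) can be turned into a table-filling procedure almost line by line. The following is a minimal sketch, not the lecture's own code: the function name `lcs` and the traceback tie-breaking rule are assumptions made for illustration, and since an LCS need not be unique, a different tie-breaking rule may return a different string of the same length.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (hypothetical helper)."""
    n, m = len(a), len(b)
    # c[i][k] = length of the LCS of a[:i] and b[:k]; row 0 and column 0 stay 0.
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            if a[i - 1] == b[k - 1]:
                c[i][k] = c[i - 1][k - 1] + 1
            else:
                c[i][k] = max(c[i][k - 1], c[i - 1][k])

    # Traceback from the bottom-right cell to recover one optimal subsequence.
    out = []
    i, k = n, m
    while i > 0 and k > 0:
        if a[i - 1] == b[k - 1]:
            out.append(a[i - 1])
            i -= 1
            k -= 1
        elif c[i - 1][k] >= c[i][k - 1]:
            i -= 1
        else:
            k -= 1
    return "".join(reversed(out))


if __name__ == "__main__":
    print(lcs("GGATATCGGGCGAT", "ATTCCCCCGCCCTA"))
```

Both the time and the memory used by this sketch are proportional to the table size, (n+1)(m+1) cells.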
28
LCS – example A = GADTAMAWGRAMMA B = GAGAWKIAMM
29
LCS - example
[Figure sequence (slides 29-41): the DP table for A and B is filled step by step; the cells on the matching path are numbered 1 to 7.]
LCS: GAAWAMM
Alignment:
GA-DTAMAW--GRAMMA
GAG----AWKI--AMM-
42
Edit distance Levenshtein 1966
The minimal number of operations that transform one sequence into another: insert, delete, substitute (1 symbol each)
The edit distance is 0 (sequences are identical) or positive
For example “AIMS” & “AMOS” (distance = 2 for all three solutions):
AIMS    AIM-S    AIMS
AMOS    A-MOS    AMOS
[Adapted from D.Gilbert]
43
Edit distance
Given two sequences A and B, find the smallest possible number of insertion, deletion and substitution operations that changes A to B
Example:
A = GGATATCGGGCGAT
B = ATTCCCCCGCCCTA
[G][G]AT[A]T[C][C][C][C]CG[G-C][G-C]C[G-C]C[A]T[A]
ED = 12?
How can we find it?
44
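Edit distance
$A = a_1 a_2 \dots a_n$, $B = b_1 b_2 \dots b_m$
$e(i,k)$: the edit distance between $a_1 a_2 \dots a_i$ and $b_1 b_2 \dots b_k$
$$e(i,k) = \begin{cases} i, & \text{if } k = 0 \\ k, & \text{if } i = 0 \\ e(i-1,k-1), & \text{if } i,k > 0 \text{ and } a_i = b_k \\ \min\{e(i-1,k-1),\ e(i,k-1),\ e(i-1,k)\} + 1, & \text{if } i,k > 0 \text{ and } a_i \ne b_k \end{cases}$$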
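As a rough illustration, the recurrence for e(i,k) can be evaluated with a full (n+1) x (m+1) table. This is a sketch under the unit-cost assumption used on the slide (each insertion, deletion or substitution costs 1); the function name `edit_distance` is hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Unit-cost (Levenshtein) edit distance between a and b."""
    n, m = len(a), len(b)
    e = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        e[i][0] = i          # delete all i symbols of a[:i]
    for k in range(m + 1):
        e[0][k] = k          # insert all k symbols of b[:k]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            if a[i - 1] == b[k - 1]:
                e[i][k] = e[i - 1][k - 1]
            else:
                e[i][k] = 1 + min(e[i - 1][k - 1],   # substitute
                                  e[i][k - 1],       # insert
                                  e[i - 1][k])       # delete
    return e[n][m]


if __name__ == "__main__":
    print(edit_distance("AIMS", "AMOS"))
    print(edit_distance("GGATATCGGGCGAT", "ATTCCCCCGCCCTA"))
```

Running it on the pair GGATATCGGGCGAT / ATTCCCCCGCCCTA is one way to check whether the hand-made edit script with 12 operations is actually optimal.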
45
ED - modifications
$$e(i,j) = \min \begin{cases} e(i-1,j) + t \\ e(i,j-1) + t \\ e(i-1,j-1) + t(a_i,b_j) \end{cases} \qquad e(i,0) = i, \quad e(0,j) = j$$
For ED: $t = 1$; $t(a_i,b_j) = 0$ if $a_i = b_j$ and $t(a_i,b_j) = 1$ if $a_i \ne b_j$.
For «inverse» LCS: $t = 0$; $t(a_i,b_j) = 1$ if $a_i = b_j$ and $t(a_i,b_j) = 0$ if $a_i \ne b_j$.
(The boundary values shown are those for ED; for LCS they are 0, as in the LCS recurrence.)
If you are interested in the result only «up to a sign», it does not matter whether min or max is used. min is more natural for ED, max for LCS; max is also the usual choice for sequence comparison.
$t_{ij}$: the probability that amino acid $a_i$ changes to amino acid $b_j$.
46
Substitution (similarity) matrices
[Figure: similarity matrix with rows and columns labeled A R N D C Q E G H I L K M F P S T W Y V]
Most popular: PAM, BLOSUM, Gonnet. The one shown is BLOSUM62 (almost :)
"Traditional assumption": the substitution score is > 0 for substitutions that are more frequent than random ones, and < 0 for those less frequent than random ones.
47
Sequence similarity as the longest path problem
We can treat the matrix as a graph with weighted edges. The problem then translates to finding the path with the largest/smallest weight in a Directed Acyclic Graph.
48
Complexity of similarity computation
Size of the matrix: nm
Computing the value of each cell: constant time
Total time: O(nm)
Total memory: O(nm)
Notice that if we want just the score, only two rows are needed. In this case the required memory is O(min(n, m)).
However, what if we also need the alignment (and we usually do)?
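The two-row observation can be sketched as follows for the unit-cost edit distance. This is an illustrative sketch, with the hypothetical function name `edit_distance_two_rows`; it returns only the score, so the alignment itself cannot be recovered from it.

```python
def edit_distance_two_rows(a: str, b: str) -> int:
    """Unit-cost edit distance keeping only the previous and the current row."""
    # Edit distance is symmetric, so let b be the shorter sequence:
    # each row then has min(n, m) + 1 cells.
    if len(b) > len(a):
        a, b = b, a
    prev = list(range(len(b) + 1))        # row 0: e(0, k) = k
    for i, x in enumerate(a, start=1):
        cur = [i] + [0] * len(b)          # e(i, 0) = i
        for k, y in enumerate(b, start=1):
            if x == y:
                cur[k] = prev[k - 1]
            else:
                cur[k] = 1 + min(prev[k - 1], cur[k - 1], prev[k])
        prev = cur                        # the older row is no longer needed
    return prev[len(b)]


if __name__ == "__main__":
    print(edit_distance_two_rows("AIMS", "AMOS"))
```

The running time is unchanged; only the memory drops from the full table to two rows.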
49
Edit distance in linear space?
50
Interpretation of comparison results
Alignment grid (edit graph). Every alignment is a path from (0,0) to (n,m).
51
Interpretation of comparison results
52
Needleman-Wunsch algorithm
53
Global and local alignments
54
Global and local alignments
Using LCS, the best local alignment will have the same score as the best global alignment (however, the alignment might be «better»).
Using ED, the best local alignments are likely to be between sequences of length 1.
Local comparison/alignment «starts to work» if the scoring is somewhere between the two above: there are extra points for each match and penalty points for each mismatch.
55
Computing local alignments
Just allow a «free ride» from the top-left vertex to each node (and from each node to the bottom-right vertex)
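In the recurrence, a zero-weight edge from the top-left vertex to every node shows up as an extra option 0 in the maximum, and a zero-weight edge from every node to the end shows up as taking the best value anywhere in the table. A minimal sketch of this (Smith-Waterman style) computation is given below; the scoring values match = 1, mismatch = -1, gap = -2 and the function name are assumptions chosen for illustration, not values prescribed by the slide.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score with a linear gap penalty (score only)."""
    m = len(b)
    prev = [0] * (m + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (m + 1)
        for k in range(1, m + 1):
            s = match if a[i - 1] == b[k - 1] else mismatch
            cur[k] = max(0,                  # "free ride" from the top-left vertex
                         prev[k - 1] + s,    # match / mismatch
                         prev[k] + gap,      # gap in b
                         cur[k - 1] + gap)   # gap in a
            best = max(best, cur[k])         # alignment may end at any cell
        prev = cur
    return best


if __name__ == "__main__":
    print(local_alignment_score("XXXACGTYYY", "ZZACGTZZ"))
```

To recover the aligned segments as well, the full table (or traceback pointers) would have to be kept instead of two rows.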
56
Computing local alignments
57
Computing local alignments
In this case the global alignment has a better score, but misses the «conserved domain»
58
Global and local comparisons
GLOBAL: best alignment of the entirety of both sequences
- For the optimum global alignment, we want the best score in the final row or final column
- Are these sequences generally the same?
- Needleman-Wunsch: find the alignment in which the total score is highest, perhaps at the expense of areas of great local similarity
LOCAL: best alignment of segments, without regard to the rest of the sequence
- For the optimum local alignment, we want the best score anywhere in the matrix
- Do these two sequences contain high-scoring subsequences?
- Smith-Waterman: find the alignment in which the highest-scoring subsequences are identified, at the expense of the overall score
[Adapted from R Altman]
59
Alignment with gap penalties
60
Alignment with gap penalties
61
Finding gapped alignments
Add yet another «gap» edge between each vertex and each of its «gap predecessors»
However, there are O(nm(m+n)) of them
62
Finding gapped alignments
63
Finding gapped alignments
64
Finding gapped alignments
65
Finding local gapped alignments
[Adapted from M.Craven]
66
Gap penalties in the general case
The computation, however, requires O(n^3) time... [Adapted from M.Craven]
67
Substitution matrices
Margaret Dayhoff: first woman in the field of bioinformatics
68
Substitution matrices
Dayhoff, M.O., Schwartz, R. and Orcutt, B.C. (1978). "A model of Evolutionary Change in Proteins". Atlas of protein sequence and structure (volume 5, supplement 3 ed.). Nat. Biomed. Res. Found. pp. 345–358.
69
Frequencies (probabilities) of amino acids
Amino acid   1978    1991
L            0.085   0.091
A            0.087   0.077
G            0.089   0.074
S            0.070   0.069
V            0.065   0.066
E            0.050   0.062
T            0.058   0.059
K            0.081   0.059
I            0.037   0.053
D            0.047   0.052
R            0.041   0.051
P            0.051   0.051
N            0.040   0.043
Q            0.038   0.041
F            0.040   0.040
Y            0.030   0.032
M            0.015   0.024
H            0.034   0.023
C            0.033   0.020
W            0.010   0.014
70
Substitution matrices as mutation probabilities
Therefore: score 1 (or log score 0) the better score, the better alignment
71
Substitution matrices as mutation probabilities
Currently we assume that there are no gaps in alignments [Adapted from M.Craven]
72
Substitution matrices as mutation probabilities
[Adapted from M.Craven]
73
Substitution matrices as mutation probabilities
This gives extra ”scoring points” for each matched symbol. [Adapted from M.Craven]
74
PAM matrices [Adapted from M.Craven]
75
PAM matrices [Adapted from M.Craven]
76
PAM matrices [Adapted from M.Craven]
77
PAM matrices
A PAM (Percent Accepted Mutation) is one accepted point mutation on the path between two sequences, per 100 residues.
Most frequently used: PAM250
Obtained from PAM1 by matrix multiplication...
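The step from PAM1 to PAM250 can be illustrated with a toy example. The sketch below raises a small row-stochastic mutation matrix to the 250th power by repeated squaring. The 3-letter alphabet and all numbers in `TOY_PAM1` are made up for illustration and are not Dayhoff's values; the real PAM1 is a 20 x 20 matrix, and the final scoring matrix additionally involves log-odds scaling, which is not shown here.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result


# Hypothetical "PAM1-like" matrix over a 3-letter alphabet:
# entry [i][j] is the probability that letter i is replaced by letter j
# after one unit of evolutionary distance (about 1% change per row).
TOY_PAM1 = [
    [0.990, 0.006, 0.004],
    [0.005, 0.990, 0.005],
    [0.002, 0.008, 0.990],
]

toy_pam250 = mat_pow(TOY_PAM1, 250)

if __name__ == "__main__":
    for row in toy_pam250:
        print([round(v, 3) for v in row])
```

Each row of the result still sums to 1, but the diagonal entries are much smaller than in the one-step matrix, reflecting the larger evolutionary distance.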
78
PAM matrices - problems
[Figure: sequences Qh and Qm, T years from their ancestor]
The common “ancestor” is actually unknown
79
PAM matrices - problems
[Figure: ancestor and descendant sequences with distances measured in PAMs]
Evolutionary distances will be different for different pairs...
80
PAM-250
81
BLOSUM matrices
BLOSUM62 is the BLAST default [Adapted from M.Craven]
82
How to align two sequences - BLAST
Find two exact similarity regions (usually 4 aa each)
Try to join and extend these matches until the score falls below a threshold
83
BLOSUM matrices
84
Relationship between PAM & BLOSUM
85
Substitution matrices
The best matrix depends on the evolutionary distance.
In general, the similarity score should be proportional to the logarithm of the probability of having a common ancestor.
The exact «right procedure» for computing the matrices is not trivial.
86
Protein evolution rates
Mutation frequencies are fairly stable, but can still differ between groups of proteins: fibrinopeptides > hemoglobin > cytochrome > histone
For longer proteins, mutation rates might differ between sequence regions.
87
Protein evolution rates
88
Substitution matrices - problems
If we observe a substitution a → b between two sequences, this could actually mean:
a → b
a → x → b
a → x → y → b
As a result, the computed probabilities will not be “exactly right"...
89
Nucleotide substitution matrices
These probably describe mutation processes better, but not the mutations that could survive during evolution. They tend to be much simpler, since they cannot reflect the role of a specific nucleotide position.
90
What about alignment scores?
p-value: the probability that an alignment with the given score or higher is found by chance
E-value: the average number of alignments with the given score or higher
Assuming all match probabilities to be equal to $p$, the p-value can be computed using the fact that the probability of having $k$ matches out of $n$ is equal to
$$\binom{n}{k} p^k (1-p)^{n-k}.$$
Such probabilities correspond to the binomial distribution. The E-value can be derived from the p-value and the database size.
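Under the simplified model on this slide (independent positions, each matching with the same probability p), the p-value for "k or more matches out of n" is a binomial tail sum. The sketch below computes it directly; the function names and the example numbers (n = 100, k = 40, p = 0.25, 10,000 database comparisons) are illustrative assumptions. The E-value here is taken simply as the expected count over independent comparisons, which is a rough model rather than the statistics used by real search tools.

```python
from math import comb


def binomial_p_value(n: int, k: int, p: float) -> float:
    """Probability of observing k or more matches among n positions."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


def expected_hits(p_value: float, num_comparisons: int) -> float:
    """Expected number of chance hits over independent comparisons (assumed model)."""
    return p_value * num_comparisons


if __name__ == "__main__":
    pv = binomial_p_value(n=100, k=40, p=0.25)
    print(pv, expected_hits(pv, 10_000))
```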
91
Probability distributions
For larger n the binomial distribution can also be well approximated by the normal distribution. Both are easy to use for computing p-values. However...
92
Computing p-values
However, the situation is made much more complex by:
- the use of local, not global, alignments
- the use of similarity matrices with different probabilities for matches/mismatches
- the use of gap penalties
Computing the exact distributions becomes unrealistic; still, good approximations exist that deal with all these additional requirements.
However, these rely on the assumption that the probabilities of protein sequences are determined just by the probabilities of amino acids. This may result in p-values being considerably lower than the «real probabilities» of reaching a specific alignment score.
93
P-Values
P(s > S) = .01
A P-value of .01 occurs at the score threshold S (392 below) at which the score s from a random comparison is greater than the threshold 1% of the time. Likewise for P = .001 and so on. [Adapted from M.Gerstein]
94
ROC (Receiver Operating Characteristic) curves
We will consider proteins to be homologs if their similarity s(a,b) reaches some threshold t
- true positives (tp): s(a,b) ≥ t and a, b are homologs
- false positives (fp): s(a,b) ≥ t and a, b are not homologs
- true negatives (tn): s(a,b) < t and a, b are not homologs
- false negatives (fn): s(a,b) < t and a, b are homologs
Sensitivity = tp/(tp+fn)
Specificity = tn/(tn+fp)
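The four counts and the two ratios can be computed directly from a list of scored pairs with known labels. The sketch below assumes the "≥ t" convention for a positive call; the function name `roc_point` and the example data are hypothetical. Sweeping t over a range of values and plotting sensitivity against 1 - specificity gives a ROC curve.

```python
def roc_point(scored_pairs, t):
    """Return (sensitivity, specificity) for score threshold t.

    scored_pairs: iterable of (score, is_homolog) tuples.
    """
    tp = fp = tn = fn = 0
    for score, is_homolog in scored_pairs:
        predicted_homolog = score >= t
        if predicted_homolog and is_homolog:
            tp += 1
        elif predicted_homolog:
            fp += 1
        elif is_homolog:
            fn += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    # Made-up scores and labels, for illustration only.
    data = [(50, True), (40, True), (35, False), (20, True), (10, False), (5, False)]
    for threshold in (10, 20, 30):
        print(threshold, roc_point(data, threshold))
```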
95
ROC curves
96
ROC curves
[Figure: curves for two “methods” (red is more effective); points along each curve correspond to different score thresholds (Thresh = 10, 20, 30).]
Axes:
- Coverage, up to 100% (roughly, the fraction of sequences that one confidently “says something” about); sensitivity = tp/(tp+fn)
- Error rate (the fraction of the “statements” that are false positives); error rate = 1 - specificity = fp/(tn+fp), where specificity = tn/(tn+fp)
[Adapted from M.Gerstein]
97
Homology (>10%) clusters of CATH 2
98
Multiple sequence alignments
Very similar DP recursions work, however in time O(n^N), where N is the number of sequences
Not tractable for the usual requirements on N
In practice many heuristic methods are used that «work well» but do not guarantee the optimal result
99
Time required for sequence comparisons
Smith-Waterman algorithm (1981):
- Local comparisons
- Linear gap penalties
- Use of substitution matrices
Is O(n^2) time practically acceptable?
- Protein length: around 300 aa
- Comparison of 2 proteins: ... op. (... sec at 1 GHz)
- Protein database: ... entries
- Database search: ... op, 100 sec
- Comparison of two databases: ... op, 25 years :(
100
Heuristic methods - FASTA
101
Heuristic methods - FASTA
Hash table of short words in the query sequence
Go through the DB and look for matches in the query hash (linear in the size of the DB)
The k-tuple determines the word size (k-tup 1 is a single aa)
Lipman & Pearson 1985
Example: query VLICT against VLICTAVLMVLICTAAAVLICTMSDFFD
[Adapted from M.Gerstein]
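The hash-table lookup can be sketched as follows: index every k-tuple of the query by its position, scan the database sequence once, and count hits per diagonal (database position minus query position). This is only an illustration of the lookup idea, with hypothetical function names and k = 2 chosen as in the protein default mentioned later; it is not the FASTA program's actual implementation and omits rescoring and joining of regions.

```python
from collections import Counter, defaultdict


def kmer_index(query: str, k: int):
    """Map each k-tuple of the query to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def diagonal_hits(query: str, db_seq: str, k: int = 2) -> Counter:
    """Count k-tuple identities per diagonal (db position - query position)."""
    index = kmer_index(query, k)
    hits = Counter()
    for j in range(len(db_seq) - k + 1):          # one linear pass over the DB
        for i in index.get(db_seq[j:j + k], ()):
            hits[j - i] += 1
    return hits


if __name__ == "__main__":
    hits = diagonal_hits("VLICT", "VLICTAVLMVLICTAAAVLICTMSDFFD", k=2)
    for diagonal, count in sorted(hits.items()):
        print(diagonal, count)
```

Diagonals with many hits are the candidate regions that the later steps rescan and try to join.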
102
The FASTA Algorithm 4 steps:
1. Use a lookup table to find all identities at least ktup long; find regions of identities
2. Rescan the 10 regions (diagonals) with the highest density of identities using PAM250
3. Join regions if possible without decreasing the score below a threshold
4. Rescore a la Smith-Waterman 32 residues around the initial region
(Note: doesn’t save alignment)
103
FASTA parameters
ktup = 2 for proteins, 6 for DNA
init1: score after rescanning with PAM250 (or other)
initn: score after joining regions
opt: score after Smith-Waterman
104
FASTA algorithm [Fig.1 of Pearson and Lipman 1988]
105
FASTA algorithm [Adapted from D Brutlag]
106
Heuristic methods - BLAST
Altschul, S., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215,
- Indexes query and DB
- Starts with all overlapping words from the query
- Calculates the “neighborhood” of each word using a PAM matrix and probability threshold
- Looks up all words and neighbors from the query in the database index
- Extends High Scoring Pairs (HSPs) left and right to maximal length
- Finds Maximal Segment Pairs (MSPs) between query and database
- BLAST 1 does not permit gaps in alignments
[Adapted from M.Gerstein]
107
Heuristic methods - BLAST
108
BLAST algorithm
- Keyword search for all words of length w from the query (of length n) in the database (of length m), with score above a threshold; w = 11 for nucleotide queries, 3 for proteins
- Do local alignment extension for each keyword found
- Extend the result until the longest match above the threshold is achieved
- Running time O(nm)
[Adapted from S.Daudenarde]
109
BLAST algorithm
Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
Keyword: GVK
Neighborhood words (with scores): GVK 18, GAK 16, GIK 16, GGK 14, GLK 13, GNK 12, GRK 11, GEK 11, GDK 11
Neighborhood score threshold: T = 13
Extension:
Query:  22 VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
           +++DN +G +   IR L    G+K I+ L+ E+ RG++K
Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
High-scoring Pair (HSP)
[Adapted from S.Daudenarde]
110
Original BLAST
Dictionary: all words of length w
Alignment: ungapped extensions until the score falls below some threshold
Output: all local alignments with score > statistical threshold
[Adapted from S.Daudenarde]
111
Original BLAST: Example
w = 4
Exact keyword match of GGTC
Extend diagonals with mismatches until the score is under 50%
Output result:
GTAAGGTCC
GTTAGGTCC
[Figure: comparison matrix of the two sequences with the extended diagonal marked]
From lectures by Serafim Batzoglou (Stanford)
112
Gapped BLAST
113
Gapped BLAST
114
Gapped BLAST: Example
Original BLAST exact keyword search, THEN:
- Extend with gaps in a zone around the ends of the exact match until score < threshold
- Then merge nearby alignments
Output result:
GTAAGGTCC-AGT
GTTAGGTCCTAGT
[Figure: comparison matrix of the two sequences with the gapped alignment path marked]
From lectures by Serafim Batzoglou (Stanford)
116
BLAST - Programs
- blastp: compares an amino acid query sequence against a protein sequence database
- blastn: compares a nucleotide query sequence against a nucleotide sequence database
- blastx: compares a nucleotide query sequence translated in all reading frames against a protein sequence database
- tblastn: compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
- tblastx: compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx is extremely slow and CPU-intensive
117
PSI-BLAST – position-specific matrices and transitive sequence comparison
118
PSSM – position-specific scoring matrix
119
Iterated PSI-BLAST
120
Iterated PSI-BLAST
121
General Protein Search Principles
- Choose between local or global search algorithms
- Use the most sensitive search algorithm available:
  - Original BLAST for no gaps
  - Smith-Waterman for most sensitivity
  - FASTA with k-tuple 1 is a good compromise
  - Gapped BLAST for well delimited regions
  - PSI-BLAST for families
- Initially BLOSUM62 and default gap penalties
- If no significant results, use BLOSUM30 and lower gap penalties
- FASTA cutoff of .01
- BLAST cutoff of .0001
- Examine results between exp and 10 for biological significance
- Ensure the expected score is negative
- Beware of hits on long sequences or hits with unusual aa composition
- Reevaluate results of borderline significance using a limited query region
- Segment long queries ≥ 300 amino acids
- Segment around known motifs
(some text adapted from D Brutlag)
122
Links to databases and search tools
UniProt/Swiss-Prot – “main” protein sequence database:
- Text queries, BLAST search, etc (EBI site)
- Text queries, SSearch, ClustalW (PIR site)
Search tools:
- (pair alignments) FASTA, BLAST, SSearch etc
- (multiple alignments) ClustalW, T-Coffee etc
FASTA, SSearch – searches and software downloads
123
Links to databases and search tools
Protein structure database (PDB): Ensembl genome browser: