Alignment of biological sequences

Bioinformatics Alignment of biological sequences UL, 2017, Juris Viksna

Topics
A short review of sequence comparison:
  biological motivation for comparing sequences
  sequence similarity criteria
  the basic DP algorithm for computing the distance between sequences
Global and local sequence comparisons:
  similarity matrices and gap penalties
  modified algorithms that use gap penalties
  local sequence comparison
Similarity matrices:
  how to obtain them
  relations between similarity matrices and sequence evolution
  suitability of matrices for specific sequences

Comparison of biological sequences
Comparison of two sequences (pairwise alignment):
  formulation of the problem
  DP algorithm (match = 1, mismatch = 1, gap = 2)
  global and local comparisons
  affine gap penalties
  similarity matrices
Multiple alignment:
  formulation of the problem (SOP)
  star alignment
  relation to phylogenetic trees, progressive alignment
Sequence classification: profiles and motifs
  profile matrices
  HMM (Hidden Markov Models)

Why do we need to compare sequences?
The genome is already sequenced (assume...)
There are methods that predict DNA coding regions (genes)
What are the biological functions of these genes?
  We can find out what protein (sequence) a gene encodes
  But we still do not know what this protein does...
  However, we can search for known proteins with similar sequences whose functions are known
We want to find out something about proteins in humans
  The best approach is “experimental”, but tricky with humans...
  But we can try to use a similar protein (e.g. in mice) and start our experiments with it

Basic assumptions
We will consider proteins and RNA/DNA just as sequences over 20- and 4-letter alphabets, respectively.
Aims of comparison:
  to find out how similar the sequences are (some similarity measure)
  to find a “common motif” of the sequences (alignment)
Regarding algorithmic complexity, there are two distinct cases:
  comparison of two sequences (relatively easy)
  simultaneous comparison of n sequences (complexity grows exponentially with n)
In this lecture we will consider the problem of comparing two sequences.

Nucleotides and DNA [Watson, Crick 1953]
For us, DNA is a sequence over a 4-letter alphabet [Adapted from Y.Guo]

Proteins
For our purposes we will treat proteins as sequences over a 20-symbol alphabet [Adapted from R.Shamir]

From DNA to proteins
Each codon consists of 3 nucleotides
Mutations:
  Substitution: changes a single aa
  Insertion / Deletion: “frame shift” (changes all subsequent aa). NB! The insertion / deletion length might be a multiple of 3...
  “Silent mutation” - DNA changed, but not the aa
  “Nonsense mutation” - creates a “stop” codon

Genetic code
Completely worked out in 1962

Evolution of sequences
Mutations are a natural process of DNA evolution.
DNA replication errors:
  substitutions
  insertions and deletions (indels)
Similarity between sequences:
  indicates their common ancestral origin
  indicates similarity of biological functions
Well, this is of course a simplification: the change in protein function determines whether the organism will have offspring and whether the changed gene will survive.
Protein sequence similarity is closely associated with the similarity of DNA coding regions.

Sequence evolution ggcatt agcatt agcata agcatg agccta aggatt gacatt

Sequence homology
Homologs - evolved from a common ancestor
  Orthologs - the same function in different organisms
  Paralogs - similar function in the same organism

Orthologs vs paralogs [Adapted from R.Shamir]

How to compare sequences?
Given two proteins:
>sp|P69905|HBA_HUMAN Hemoglobin
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPT
TKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDM
PNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP
AEFTPAVHASLDKFLASVSTVLTSKYR
>tr|Q61287|Q61287_MOUSE Hemoglobin
MVLSGEDKSNIKAAWGKIGGHGAEYVAEALERMFASFP
TTKTYFPHFDVSHGSAQVKGHGKKVADALASAAGHLDD
LPGALSALSDLHAHKLRVDPVNFKLLSHCLLVTLASHH
PADFTPAVHASLDKFLASVSTVLTSKYR
How to assess their similarity?

Sequence alignment - BLAST

Sequence alignment – the results we expect

Sequence alignment - SSEARCH

Sequence alignment - scores
sequence similarity/identity (%): well-defined for the aligned parts of the sequences
“Score”: usually very method-specific in absolute value
p-value: probability that an alignment with the given score or higher is found by chance; normally the reported values are only approximations
Expect (E)-value: a parameter that describes the number of hits one can "expect" to see by chance when searching a database of a particular size. The lower the better. But short similar sequences might have comparatively high values (E-value decreases exponentially with Score)
Z-score: number of standard deviations from the mean value

Z-score

How to align two sequences - BLAST
Find two regions of exact similarity (usually 4 aa each)
Try to join and extend these matches until the score falls below a threshold
Anyway, how should we do this “correctly”?

The “Manhattan Tourist” problem
Visit as many sights as possible starting from top-left corner and moving just down or right

Longest common subsequence
Given two sequences A and B, find a longest possible sequence C that is a subsequence of both A and B (such C need not be unique).
Example:
  A = GGATATCGGGCGAT
  B = ATTCCCCCGCCCTA
  C = ATTCGCA or ATTCGCT
How can we find it?

LCS – dynamic programming solution
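A = a_1 a_2 ... a_n, B = b_1 b_2 ... b_m
c(i,k) - length of an LCS of a_1 a_2 ... a_i and b_1 b_2 ... b_k

\[
c(i,k) =
\begin{cases}
0, & \text{if } i = 0 \text{ or } k = 0 \\
c(i-1,k-1) + 1, & \text{if } i, k > 0 \text{ and } a_i = b_k \\
\max\{c(i,k-1),\ c(i-1,k)\}, & \text{if } i, k > 0 \text{ and } a_i \ne b_k
\end{cases}
\]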

LCS – example A = GADTAMAWGRAMMA B = GAGAWKIAMM

LCS - example
[Figure: DP table for A and B, filled step by step; the LCS length along the traced path grows from 1 to 7]
LCS: GAAWAMM
Alignment:
  GA-DTAMAW--GRAMMA
  GAG----AWKI--AMM-
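The table-filling and traceback steps of the LCS example can be sketched in Python as follows. This is a minimal illustration of the recurrence for c(i,k), not code from the lecture; the function name `lcs` and the tie-breaking rule in the traceback are assumptions, so the subsequence it returns may differ from the one on the slide while having the same length.

```python
def lcs(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b."""
    n, m = len(a), len(b)
    # c[i][k] = length of an LCS of a[:i] and b[:k]; row 0 and column 0 stay 0.
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            if a[i - 1] == b[k - 1]:
                c[i][k] = c[i - 1][k - 1] + 1
            else:
                c[i][k] = max(c[i][k - 1], c[i - 1][k])
    # Trace back from the bottom-right cell to recover one optimal subsequence.
    out = []
    i, k = n, m
    while i > 0 and k > 0:
        if a[i - 1] == b[k - 1]:
            out.append(a[i - 1])
            i, k = i - 1, k - 1
        elif c[i - 1][k] >= c[i][k - 1]:  # arbitrary tie-breaking (assumption)
            i -= 1
        else:
            k -= 1
    return "".join(reversed(out))


result = lcs("GADTAMAWGRAMMA", "GAGAWKIAMM")
print(result, len(result))  # the length is 7
```

Both time and memory are proportional to the table size nm here, because the full table is kept for the traceback.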

Edit distance (Levenshtein 1966)
Minimal number of operations that transforms one sequence into another:
  insert, delete, substitute (1 symbol each)
Edit distance is 0 (sequences are identical) or positive.
For example, “AIMS” & “AMOS” (distance = 2 for all three solutions):
  AIMS    AIM-S    AIMS
  AMOS    A-MOS    AMOS
[Adapted from D.Gilbert]

Edit distance
Given two sequences A and B, find the smallest possible number of Insertion, Deletion and Substitution operations that changes A to B.
Example:
  A = GGATATCGGGCGAT
  B = ATTCCCCCGCCCTA
  [G][G]AT[A]T[C][C][C][C]CG[G-C][G-C]C[G-C]C[A]T[A]
ED = 12?
How can we find it?

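Edit distance
A = a_1 a_2 ... a_n, B = b_1 b_2 ... b_m
e(i,k) - edit distance between a_1 a_2 ... a_i and b_1 b_2 ... b_k

\[
e(i,k) =
\begin{cases}
i, & \text{if } k = 0 \\
k, & \text{if } i = 0 \\
e(i-1,k-1), & \text{if } i, k > 0 \text{ and } a_i = b_k \\
\min\{e(i-1,k-1),\ e(i,k-1),\ e(i-1,k)\} + 1, & \text{if } i, k > 0 \text{ and } a_i \ne b_k
\end{cases}
\]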
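The recurrence for e(i,k) translates almost line by line into code. The sketch below is one possible Python rendering (the function name `edit_distance` is an assumption), checked on the “AIMS” / “AMOS” pair from the previous slide.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b via the e(i,k) recurrence."""
    n, m = len(a), len(b)
    e = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        e[i][0] = i  # delete all i symbols of a[:i]
    for k in range(m + 1):
        e[0][k] = k  # insert all k symbols of b[:k]
    for i in range(1, n + 1):
        for k in range(1, m + 1):
            if a[i - 1] == b[k - 1]:
                e[i][k] = e[i - 1][k - 1]
            else:
                e[i][k] = 1 + min(e[i - 1][k - 1], e[i][k - 1], e[i - 1][k])
    return e[n][m]


print(edit_distance("AIMS", "AMOS"))  # 2
```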

ED - modifications
If you are interested in the result only «up to a sign», it does not matter whether min or max is used. min is more natural for ED, max for LCS. max is also the usual choice for sequence comparison.

\[
e(i,j) = \min
\begin{cases}
e(i-1,j) + t \\
e(i,j-1) + t \\
e(i-1,j-1) + t(a_i,b_j)
\end{cases}
\qquad e(i,0) = i \cdot t, \quad e(0,j) = j \cdot t
\]

Here t is the cost of a gap symbol (so e(i,0) = i and e(0,j) = j for ED), and t(a_i,b_j) is the cost of replacing a_i by b_j; in general it can be derived from t_ij, the probability that aa a_i changes to aa b_j.
For ED: t = 1, t(a_i,b_j) = 0 if a_i = b_j, t(a_i,b_j) = 1 if a_i ≠ b_j
For «inverse» LCS (with max instead of min): t = 0, t(a_i,b_j) = 1 if a_i = b_j, t(a_i,b_j) = 0 if a_i ≠ b_j
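To see that ED and LCS are two settings of the same scheme, the generalized recurrence can be written once with the gap cost t, the substitution function t(a,b) and the choice of min or max passed in as parameters. This is a sketch under assumptions: the name `dp_score` and its parameter names are made up here, and the boundary values are taken as i·t and j·t so that the t = 0 case also works.

```python
def dp_score(a, b, gap, sub, best=min):
    """Generalized DP: 'gap' is t, 'sub(x, y)' is t(x, y), 'best' is min or max."""
    n, m = len(a), len(b)
    e = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        e[i][0] = i * gap
    for j in range(1, m + 1):
        e[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e[i][j] = best(
                e[i - 1][j] + gap,                            # gap in b
                e[i][j - 1] + gap,                            # gap in a
                e[i - 1][j - 1] + sub(a[i - 1], b[j - 1]),    # (mis)match
            )
    return e[n][m]


# Edit distance: t = 1, t(x, y) = 0 for a match and 1 for a mismatch, minimise.
ed = dp_score("AIMS", "AMOS", gap=1, sub=lambda x, y: 0 if x == y else 1, best=min)

# LCS length: t = 0, t(x, y) = 1 for a match and 0 for a mismatch, maximise.
lcs_len = dp_score("GADTAMAWGRAMMA", "GAGAWKIAMM", gap=0,
                   sub=lambda x, y: 1 if x == y else 0, best=max)

print(ed, lcs_len)  # 2 7
```

Plugging in a table lookup for `sub` is how a substitution matrix would enter the same loop.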

Substitution (similarity) matrices
[Figure: 20 × 20 similarity matrix with rows and columns labelled A R N D C Q E G H I L K M F P S T W Y V]
Most popular: PAM, BLOSUM, Gonnet
The one shown is BLOSUM62 (almost :)
"Traditional assumption": substitution score > 0 for substitutions that are more frequent than random ones, and < 0 for those less frequent than random ones.

Sequence similarity as the longest path problem
We can treat the matrix as a graph with weighted edges. The problem then translates to finding a path with the largest/smallest weight in a Directed Acyclic Graph.

Complexity of similarity computation
Size of matrix: nm Computing of value for each cell: const Total time: (nm) Total memory: (nm) Notice that if we want just score only two rows are needed. In this case the required memory: (nm) However, if we also need the alignment (and we usually do)?

Edit distance in linear space?

Interpretation of comparison results
Alignment grid (edit graph). Every alignment is a path from (0,0) to (n,m).

Interpretation of comparison results

Needleman-Wunsch algorithm

Global and local alignments

Global and local alignments
Using LCS, the best local alignment will have the same score as the best global alignment (however, the alignment might be «better»).
Using ED, the best local alignments are likely to be for sequences of length 1.
Local comparison/alignment «starts to work» if the scoring is somewhere between the two above: there are extra points for each match and penalty points for each mismatch.

Computing local alignments
Just allow a «free ride» (a zero-weight edge) from the top-left vertex to each node

In this case global alignment has better score, but misses «conserved domain»

Global and local comparisons
GLOBAL: best alignment of the entirety of both sequences
  For optimum global alignment, we want best score in the final row or final column
  Are these sequences generally the same?
  Needleman-Wunsch: find the alignment in which the total score is highest, perhaps at the expense of areas of great local similarity
LOCAL: best alignment of segments, without regard to the rest of the sequence
  For optimum local alignment, we want best score anywhere in the matrix
  Do these two sequences contain high-scoring subsequences?
  Smith-Waterman: find the alignment in which the highest-scoring subsequences are identified, at the expense of the overall score
[Adapted from R Altman]
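The difference between the two modes shows up directly in the recurrences: the local version adds 0 as an extra option in every cell and reports the best cell anywhere in the matrix. The sketch below is an illustration under assumptions: match = +1, mismatch = -1, gap = -2 (the magnitudes are those quoted earlier in the lecture, the signs are assumed), the function names are made up, and the two test sequences are hypothetical.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch style score of the best end-to-end alignment."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
        prev = cur
    return prev[len(b)]


def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Smith-Waterman style score of the best alignment of any two segments."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # The extra 0 lets an alignment start afresh in any cell.
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best


# Hypothetical sequences sharing the segment GATTACA inside unrelated flanks.
x, y = "CCCGATTACACCC", "TTTGATTACATTT"
print(global_score(x, y), local_score(x, y))  # 1 7
```

With these sequences the global score is diluted by the mismatching flanks, while the local score is exactly that of the shared segment.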

Alignment with gap penalties

Finding gapped alignments
Add yet another «gap» edge between each vertex and each of its «gap predecessors»
However, there are Θ(nm(m+n)) of them

Finding gapped alignments

Finding local gapped alignments
[Adapted from M.Craven]

Gap penalties in the general case
The computation requires O(n³) time, though... [Adapted from M.Craven]
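The cubic running time comes from the fact that, with an arbitrary gap penalty, every cell has to look back over all possible gap lengths in its row and in its column. A minimal sketch of such a DP is given below; the function name `align_general_gap`, the parameter names, and the convention that `gap_cost(k)` returns a positive penalty for a gap of length k are all assumptions of this illustration.

```python
def align_general_gap(a, b, sub, gap_cost):
    """Global alignment score with an arbitrary gap penalty gap_cost(k) >= 0."""
    n, m = len(a), len(b)
    neg_inf = float("-inf")
    s = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    s[0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            best = neg_inf
            if i > 0 and j > 0:
                best = s[i - 1][j - 1] + sub(a[i - 1], b[j - 1])
            for k in range(1, i + 1):      # gap of length k in b
                best = max(best, s[i - k][j] - gap_cost(k))
            for k in range(1, j + 1):      # gap of length k in a
                best = max(best, s[i][j - k] - gap_cost(k))
            s[i][j] = best
    return s[n][m]


def simple_sub(x, y):
    return 1 if x == y else -1


# Hypothetical example: opening a gap costs 2, each gap symbol costs 1.
print(align_general_gap("AAATTTAAA", "AAAAAA", simple_sub, lambda k: 2 + k))  # 1
```

For the special case of affine penalties the inner loops over k are not needed, which is what brings the time back to being proportional to nm.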

Substitution matrices
Margaret Dayhoff ( ) First woman in the field of Bioinformatics

Substitution matrices
Dayhoff, M.O., Schwartz, R. and Orcutt, B.C. (1978). "A model of Evolutionary Change in Proteins". Atlas of protein sequence and structure (volume 5, supplement 3 ed.). Nat. Biomed. Res. Found. pp. 345–358.

Frequencies (probabilities) of amino acids
Amino acid   1978    1991
L            0.085   0.091
A            0.087   0.077
G            0.089   0.074
S            0.070   0.069
V            0.065   0.066
E            0.050   0.062
T            0.058   0.059
K            0.081   0.059
I            0.037   0.053
D            0.047   0.052
R            0.041   0.051
P            0.051   0.051
N            0.040   0.043
Q            0.038   0.041
F            0.040   0.040
Y            0.030   0.032
M            0.015   0.024
H            0.034   0.023
C            0.033   0.020
W            0.010   0.014

Substitution matrices as mutation probabilities
Therefore: score  1 (or log score  0) the better score, the better alignment

Currently we assume that there are no gaps in alignments [Adapted from M.Craven]

[Adapted from M.Craven]

This gives extra ”scoring points” for each matched symbol. [Adapted from M.Craven]

PAM matrices [Adapted from M.Craven]

PAM matrices [Adapted from M.Craven]

PAM matrices
A PAM (Percent Accepted Mutation) is one accepted point mutation on the path between two sequences, per 100 residues.
The most frequently used matrix is PAM250, obtained from PAM1 by matrix multiplication...
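The matrix-multiplication step can be illustrated on a toy example. The sketch below does not use Dayhoff's data: it assumes a hypothetical 4-letter alphabet and a made-up "PAM1-like" transition matrix in which each letter is kept with probability 0.99, and raises it to the 250th power with plain Python lists. Only the mechanics carry over to the real 20 × 20 case.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    size = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_power(m, n):
    """Return the n-th power of the square matrix m (n >= 0)."""
    size = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, m)
    return result


# Hypothetical "PAM1-like" matrix over a toy 4-letter alphabet: each letter
# is kept with probability 0.99 and changes to any other with 0.01 / 3.
ALPHABET = "ACGT"
pam1 = [[0.99 if i == j else 0.01 / 3 for j in range(4)] for i in range(4)]
pam250 = mat_power(pam1, 250)

for letter, row in zip(ALPHABET, pam250):
    print(letter, [round(p, 3) for p in row])
```

Each row still sums to 1 after multiplication, and the diagonal entries shrink toward the background frequency as the number of PAM units grows, which is why matrices for larger distances are more tolerant of substitutions.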

PAM matrices - problems
[Figure: sequences Qh and Qm, separated by T years from their ancestor]
The “common ancestor” is actually unknown

PAM matrices - problems
[Figure: an ancestor sequence and its descendants, with distances measured in PAMs]
Evolutionary distances will be different for different pairs...

PAM-250

BLOSUM matrices
BLOSUM62 is the BLAST default [Adapted from M.Craven]

How to align two sequences - BLAST
Find two regions of exact similarity (usually 4 aa each)
Try to join and extend these matches until the score falls below a threshold

BLOSUM matrices

Relationship between PAM & BLOSUM

Substitution matrices
The best matrix depends on the evolutionary distance.
In general, the similarity score should be proportional to the logarithm of the probability of having a common ancestor.
The exact «right procedure» for computing the matrices is not trivial.

Protein evolution rates
Mutation frequencies are fairly stable, but can still differ between groups of proteins: fibrinopeptides > hemoglobin > cytochrome > histone
For longer proteins, mutation rates might differ between sequence regions.

Protein evolution rates

Substitution matrices - problems
If we observe a substitution a → b between two sequences, this could actually mean:
  a → b
  a → x → b
  a → x → y → b
As a result, the computed probabilities will not be “exactly right”...

Nucleotide substitution matrices
These probably describe the mutation processes better, but not the mutations that could survive during evolution. They tend to be much simpler, since they cannot reflect the role of a specific nucleotide position.

What about alignment scores?
p-value - probability that an alignment with the given score or higher is found by chance
E-value - average number of alignments with the given score or higher
Assuming all match probabilities to be equal to p, the p-value can be computed using the fact that the probability of having k matches out of n is equal to

\[
\binom{n}{k} p^{k} (1-p)^{n-k}
\]

Such probabilities correspond to the binomial distribution.
The E-value can be derived from the p-value and the database size.
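Under the simplifying assumption of equal match probabilities, the p-value is just a binomial tail sum. The sketch below uses Python's `math.comb`; the function names are made up, the numbers in the example (100 aligned positions, p = 1/20, 20 matches, a database of one million sequences) are hypothetical, and multiplying the p-value by the database size is only one simple approximation of the E-value, assumed here for illustration.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k matches among n positions."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def p_value(k: int, n: int, p: float) -> float:
    """Probability of k or more matches among n positions by chance."""
    return sum(binom_pmf(j, n, p) for j in range(k, n + 1))


def e_value(pv: float, db_size: int) -> float:
    """Rough expected number of chance hits: p-value times database size (assumption)."""
    return pv * db_size


# Hypothetical numbers: 20 identical residues in 100 aligned positions,
# uniform 20-letter alphabet, database of one million sequences.
pv = p_value(20, 100, 1 / 20)
print(pv, e_value(pv, 1_000_000))
```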

Probability distributions
For larger n the binomial distribution can also be well approximated by the normal distribution. Both of them are easy to use for computing p-values. However...

Computing of p-values
However, the situation is made much more complex by:
  use of local, not global, alignments
  use of similarity matrices with different probabilities for matches/mismatches
  use of gap penalties
Computing the exact distributions becomes unrealistic; still, good approximations exist that deal with all these additional requirements.
However, these rely on the assumption that the probabilities of protein sequences are determined just by the probabilities of amino acids. This may result in p-values being considerably lower than the «real probabilities» of reaching a specific alignment score.

P-Values
P(s > S) = .01
A P-value of .01 occurs at the score threshold S (392 below) where the score s from a random comparison is greater than this threshold 1% of the time. Likewise for P = .001 and so on. [Adapted from M.Gerstein]

ROC (Receiver Operating Characteristic) curves
We will consider proteins to be homologs if their similarity exceeds some threshold t:
  true positives (tp) - s(a,b) ≥ t and a, b are homologs
  false positives (fp) - s(a,b) ≥ t and a, b are not homologs
  true negatives (tn) - s(a,b) < t and a, b are not homologs
  false negatives (fn) - s(a,b) < t and a, b are homologs
Sensitivity = tp/n = tp/(tp+fn), where n is the number of homologous pairs
Specificity = tn/n = tn/(tn+fp), where n is the number of non-homologous pairs
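The four counts and the two ratios can be computed directly from a list of scored pairs with known homology status. The sketch below is illustrative only: the function name, the data layout (score, is_homolog) and the six scored pairs are hypothetical. Sweeping the threshold t over such data yields the points of a ROC curve.

```python
def sensitivity_specificity(scored_pairs, t):
    """scored_pairs: iterable of (score, is_homolog); predict homolog if score >= t."""
    tp = sum(1 for s, h in scored_pairs if s >= t and h)
    fp = sum(1 for s, h in scored_pairs if s >= t and not h)
    tn = sum(1 for s, h in scored_pairs if s < t and not h)
    fn = sum(1 for s, h in scored_pairs if s < t and h)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Hypothetical similarity scores with known homology labels.
pairs = [(50, True), (40, True), (35, False), (30, True), (20, False), (10, False)]

for threshold in (30, 40):
    sens, spec = sensitivity_specificity(pairs, threshold)
    print(threshold, round(sens, 3), round(spec, 3))
```

Raising the threshold from 30 to 40 in this toy data trades sensitivity for specificity, which is the movement along a ROC curve.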

ROC curves

ROC curves
[Figure: coverage (up to 100%) plotted against error rate for two “methods” (red is more effective); points along each curve correspond to different score thresholds: Thresh=10, Thresh=20, Thresh=30]
Coverage: roughly, the fraction of sequences that one confidently “says something” about [sensitivity = tp/n = tp/(tp+fn)]
Error rate: the fraction of the “statements” that are false positives [specificity = tn/n = tn/(tn+fp); error rate = 1 - specificity = fp/n]
[Adapted from M.Gerstein]

Homology (>10%) clusters of CATH 2

Multiple sequence alignments
Very similar DP recursions work, however in time Θ(n^N), where N is the number of sequences
Not tractable for the usual requirements on N
In practice, many heuristic methods are used that «work well» but do not guarantee the optimal result

Time required for sequence comparisons
Smith-Waterman algorithm (1981):
  local comparisons
  linear gap penalties
  use of substitution matrices
Is Θ(n²) time practically acceptable?
  Protein length - around 300 aa
  Comparison of 2 proteins: op. ( sec at 1 GHz)
  Protein database entries
  Database search: op, 100 sec
  Comparison of two databases op, 25 years :(

Heuristic methods - FASTA

Heuristic methods - FASTA
Hash table of short words in the query sequence
Go through the DB and look for matches in the query hash (linear in the size of the DB)
K-tuple determines word size (k-tup 1 is a single aa)
Lipman & Pearson 1985
VLICT = _
VLICTAVLMVLICTAAAVLICTMSDFFD
[Adapted from M.Gerstein]
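The lookup step can be sketched as follows: index every k-tuple of the query by its position, scan the database sequence once, and count the hits on each diagonal (database position minus query position). The function names and the use of a dictionary as the hash table are assumptions of this sketch; the strings are taken from the slide, reading VLICT as the query and the longer string as the database sequence, which is itself an assumption.

```python
from collections import defaultdict


def ktup_index(query: str, k: int):
    """Map every k-tuple of the query to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)
    return index


def diagonal_hits(query: str, subject: str, k: int):
    """Count k-tuple identities per diagonal (subject position - query position)."""
    index = ktup_index(query, k)
    diagonals = defaultdict(int)
    for j in range(len(subject) - k + 1):
        for i in index.get(subject[j:j + k], ()):
            diagonals[j - i] += 1
    return dict(diagonals)


hits = diagonal_hits("VLICT", "VLICTAVLMVLICTAAAVLICTMSDFFD", k=2)
print(hits)  # {0: 4, 6: 1, 9: 4, 17: 4}
```

Diagonals with many hits (here 0, 9 and 17) are the candidate regions that would be rescanned with a substitution matrix in the next step.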

The FASTA Algorithm
4 steps:
  1. use a lookup table to find all identities at least ktup long, find regions of identities
  2. rescan 10 regions (diagonals) with the highest density of identities using PAM250
  3. join regions if possible without decreasing the score below a threshold
  4. rescore à la Smith-Waterman 32 residues around the initial region
(Note: doesn’t save alignment)

FASTA parameters
ktup = 2 for proteins, 6 for DNA
init1 - score after rescanning with PAM250 (or other)
initn - score after joining regions
opt - score after Smith-Waterman

FASTA algorithm [Fig.1 of Pearson and Lipman 1988]

FASTA algorithm [Adapted from D Brutlag]

Heuristic methods - BLAST
Altschul, S., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. (1990). Basic local alignment search tool. J. Mol. Biol. 215,
Indexes query and DB
Starts with all overlapping words from the query
Calculates the “neighborhood” of each word using a PAM matrix and a probability threshold
Looks up all words and neighbors from the query in the database index
Extends High Scoring Pairs (HSPs) left and right to maximal length
Finds Maximal Segment Pairs (MSPs) between query and database
Blast 1 does not permit gaps in alignments
[Adapted from M.Gerstein]

Heuristic methods - BLAST

BLAST algorithm
Keyword search for all words of length w from the query of length n in a database of length m, with score above threshold
  w = 11 for nucleotide queries, 3 for proteins
Do local alignment extension for each found keyword
  Extend the result until the longest match above threshold is achieved
Running time O(nm)
[Adapted from S.Daudenarde]

BLAST algorithm
Query: KRHRKVLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLKIFLENVIRD
Keyword: GVK
Neighborhood words (neighborhood score threshold T = 13):
  GVK 18
  GAK 16
  GIK 16
  GGK 14
  GLK 13
  GNK 12
  GRK 11
  GEK 11
  GDK 11
Extension:
  Query: 22  VLRDNIQGITKPAIRRLARRGGVKRISGLIYEETRGVLK 60
             +++DN +G +   IR L    G+K I+ L+ E+ RG++K
  Sbjct: 226 IIKDNGRGFSGKQIRNLNYGIGLKVIADLV-EKHRGIIK 263
High-scoring Pair (HSP)
[Adapted from S.Daudenarde]

Original BLAST
Dictionary: all words of length w
Alignment: ungapped extensions until score falls below some threshold
Output: all local alignments with score > statistical threshold
[Adapted from S.Daudenarde]

Original BLAST: Example
w = 4
Exact keyword match of GGTC
Extend diagonals with mismatches until score is under 50%
Output result:
  GTAAGGTCC
  GTTAGGTCC
[Figure: alignment grid with the letters of the two sequences along its axes]
From lectures by Serafim Batzoglou (Stanford)
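The seed-and-extend idea of the example can be sketched in a few lines. This is a simplification under stated assumptions, not BLAST itself: seeds are exact words of length w, scoring is +1 per match and -1 per mismatch, and extension in each direction stops once the running score drops more than `x_drop` below the best score seen (a stand-in for the "under 50%" rule on the slide). The function names and the subject string are hypothetical; the query is chosen to contain the segment shown in the output above.

```python
def find_seeds(query: str, subject: str, w: int):
    """Return (query_pos, subject_pos) for every exact word match of length w."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    return [(i, j)
            for j in range(len(subject) - w + 1)
            for i in words.get(subject[j:j + w], [])]


def extend_seed(q, s, qi, si, w, match=1, mismatch=-1, x_drop=2):
    """Ungapped extension of a seed; returns (score, query segment, subject segment)."""
    score = best = w * match
    best_r = r = 0
    while qi + w + r < len(q) and si + w + r < len(s):      # extend to the right
        score += match if q[qi + w + r] == s[si + w + r] else mismatch
        r += 1
        if score > best:
            best, best_r = score, r
        elif best - score > x_drop:
            break
    score = best
    best_l = left = 0
    while qi - left - 1 >= 0 and si - left - 1 >= 0:        # extend to the left
        score += match if q[qi - left - 1] == s[si - left - 1] else mismatch
        left += 1
        if score > best:
            best, best_l = score, left
        elif best - score > x_drop:
            break
    return best, q[qi - best_l: qi + w + best_r], s[si - best_l: si + w + best_r]


query = "ACGAAGTAAGGTCCAGT"
subject = "CCGTTAGGTCCTGA"          # hypothetical database sequence
seeds = find_seeds(query, subject, w=4)
print(extend_seed(query, subject, 9, 6, w=4))  # (7, 'GTAAGGTCC', 'GTTAGGTCC')
```

The seed GGTC sits at query position 9 and subject position 6; extending it without gaps recovers the two aligned segments, with one mismatch tolerated inside the extension.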

Gapped BLAST

Gapped BLAST: Example
Original BLAST exact keyword search, THEN:
  extend with gaps in a zone around the ends of the exact match until score < threshold
  then merge nearby alignments
Output result:
  GTAAGGTCC-AGT
  GTTAGGTCCTAGT
[Figure: alignment grid with the letters of the two sequences along its axes]
From lectures by Serafim Batzoglou (Stanford)


BLAST - Programs
blastp - compares an amino acid query sequence against a protein sequence database
blastn - compares a nucleotide query sequence against a nucleotide sequence database
blastx - compares a nucleotide query sequence translated in all reading frames against a protein sequence database
tblastn - compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames
tblastx - compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that tblastx is extremely slow and cpu-intensive

PSI-BLAST – position-specific matrices and transitive sequence comparison

PSSM – position-specific scoring matrix

Iterated PSI-BLAST

General Protein Search Principles
Choose between local or global search algorithms
Use the most sensitive search algorithm available:
  Original BLAST for no gaps
  Smith-Waterman for most sensitivity
  FASTA with k-tuple 1 is a good compromise
  Gapped BLAST for well delimited regions
  PSI-BLAST for families
Initially use BLOSUM62 and default gap penalties
  If no significant results, use BLOSUM30 and lower gap penalties
FASTA cutoff of .01
Blast cutoff of .0001
Examine results between exp and 10 for biological significance
Ensure expected score is negative
Beware of hits on long sequences or hits with unusual aa composition
Reevaluate results of borderline significance using limited query region
Segment long queries ≥ 300 amino acids
Segment around known motifs
(some text adapted from D Brutlag)

Links to databases and search tools
UniProt/Swiss-Prot - “main” protein sequence database:
  Text queries, BLAST search, etc
  (EBI site)
  (PIR site) Text queries, SSearch, ClustalW
Search tools:
  (pair alignments) FASTA, BLAST, SSearch etc
  (multiple alignments) ClustalW, T-Coffee etc
  FASTA, SSearch - searches and software downloads

Links to databases and search tools
Protein structure database (PDB): Ensembl genome browser:
