Pairwise Alignment Sándor Pongor

Sequence alignment I. Sándor Pongor With slides adapted from David Judge, Jack Leunissen and Christoph Sensen September 27, 2016 © 2001, Jack A.M. Leunissen

Last lectures Representations (unstructured, structured, mixed).
Core operations: Comparison gives 1) proximity measures (similarities, distances); 2) motifs (from pairwise and multiple alignment of sequences). Main distance and similarity measures. Aggregation of numbers, vectors, sequences (distance matrices, trees, heatmaps, multiple alignments). Projections onto sequences (1D plots).

This lecture (Part I-2) Edit distance (refresh)
Substitution matrices (PAM, BLOSUM, how to build your own). The two basic sequence alignment algorithms. Algorithm types according to how we do it (1): exhaustive and heuristic, global and local. Algorithm types according to what we compare (2): two sequences, sequence vs. database, sequence vs. genome, many sequences vs. genome, etc.

The tree of bioinformatics: core, branches, leaves
Applications Bioinfo algorithms Core data, core principles New branches and leaves every year…

Communication in bacteria
Application example: Communication in bacteria

The input of a sequence alignment algorithm
1) Two sequences 2) A scoring scheme (a score formula AND a scoring (substitution) matrix)* *This is applicable also to the comparison of 3D structures of macromolecules, or any other type of linear object descriptions (namely: macromolecules have a linear backbone)

The steps of sequence alignment
1) Find all possible alignments (mappings) between two sequences and find the best one according to some “quick” score.* 2) Calculate a final quantitative score** for the best alignment (matching).
* This sketch symbolizes local alignment to indicate that there are many possible mappings. The same is true for global alignments as well.
** Usually an approximate edit distance with a scoring matrix, which we discussed last time.

The results of sequence alignment
1) A score (similarity score or distance) 2) A motif (common subsequence, consensus description…) The human mind also describes similarity in terms of patterns and scores. But the patterns are stored in the human memory in a smart way…

Range of alignment or High Scoring Pair (HSP)
Reminder. Motif: AGACXTGA.CTGA. Sequence similarity score: the score S is a sum of costs assigned to identities and mismatches, minus a penalty for gaps. Costs are stored in the substitution matrix. The gap penalty is usually a sum of gap-opening and gap-extension costs.

Alignment score. Score = sum of the substitution-matrix costs over all aligned residue pairs, minus the sum of the gap penalties. (Gap) penalty = gap-opening cost + gap-extension cost.

Gap penalty functions Linear Affine Probabilistic Other functions
Linear: penalty rises monotonically with the length of the gap. Affine: penalty has a gap-opening component and a separate length-dependent component. Probabilistic: penalties may depend upon the neighboring residues. Other functions: penalty first rises fast, but levels off at greater length values. There are no dramatic differences between these choices; affine gaps are widely used.
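The linear and affine penalty functions named above can be sketched in a few lines of Python. This is only an illustration: the function names and the default costs are assumptions chosen for the example, and the affine form used here (opening cost plus extension cost times gap length) is one of several conventions in use.

```python
def linear_gap_penalty(length, per_residue=1.0):
    """Linear: the penalty grows in proportion to the gap length."""
    return per_residue * length


def affine_gap_penalty(length, gap_open=10.0, gap_extend=0.5):
    """Affine: a one-off opening cost plus a separate length component.

    Assumed convention: penalty = gap_open + gap_extend * length.
    Some programs charge the extension cost only for length - 1 residues.
    """
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


# Compare the two functions for short and long gaps.
for n in (1, 5, 20):
    print(n, linear_gap_penalty(n), affine_gap_penalty(n))
```

With costs like these, an affine penalty makes one long gap cheaper than many short gaps of the same total length, which is one reason affine gaps are widely used.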

A simple example (alignment without gaps):
For a match or a mismatch we look up the value in the substitution matrix. The matrix is a lookup table…
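A minimal sketch of the lookup-table idea for an alignment without gaps, assuming a toy DNA matrix. The match and mismatch values (+5 and -4) and the function name are illustrative assumptions, not values taken from this slide.

```python
# Toy substitution matrix stored as a dictionary keyed by residue pairs.
MATCH, MISMATCH = 5, -4          # illustrative values only
ALPHABET = "ACGT"
SUBST = {(a, b): (MATCH if a == b else MISMATCH)
         for a in ALPHABET for b in ALPHABET}


def ungapped_score(seq1, seq2, matrix=SUBST):
    """Sum the looked-up values over all aligned positions (no gaps)."""
    if len(seq1) != len(seq2):
        raise ValueError("ungapped comparison needs equal-length sequences")
    return sum(matrix[(a, b)] for a, b in zip(seq1, seq2))


# Three matches and one mismatch: 3 * 5 - 4 = 11
print(ungapped_score("ACGT", "ACCT"))
```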

Substitution matrices in details
The substitution matrix (also called scoring matrix) contains costs for amino acid identities and substitutions in an alignment. For amino acids, it is a 20x20 symmetrical matrix that can be constructed from pairwise alignments of related sequences. “Related” means either a) evolutionary relatedness described by an “approved” evolutionary tree (Dayhoff’s PAM matrices), or b) any sequence similarity as described in the PROSITE database (Henikoff’s BLOSUM matrices). Groups of related sequences can be organized into a multiple alignment for calculation of the matrix elements.

Substitution matrices (cost matrices)
Calculation of scoring matrices from multiple alignments. Matrix elements are calculated from the observed and expected frequencies (using a “log odds” principle). E.g. for S/T (indicated by red on the slide):
ASDESKLVV
ATDDATLSI
ASDSERITV
S/T denotes that S is aligned with T or T with S. In this example f(S/T) = 3, f(S) = 5, f(T) = 3. The values are calculated from many multiple alignments (not just one). The log odds values in the matrix are then normalized to a given range depending on the application (e.g. -5 to +15, for historical reasons; the range does not matter much).
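The counts in the S/T example can be reproduced with a short Python sketch. The counting follows the slide; the log-odds normalisation shown (observed pair frequency divided by the frequency expected from the residue abundances, log base 2) is one common textbook form and is an assumption here, as are all function names.

```python
from collections import Counter
from itertools import combinations
from math import log2

alignment = ["ASDESKLVV", "ATDDATLSI", "ASDSERITV"]

# f(X): how often each residue occurs in the alignment.
residue_counts = Counter("".join(alignment))

# f(X/Y): how often X is aligned with Y (in either order) within a column.
pair_counts = Counter()
for column in zip(*alignment):
    for a, b in combinations(column, 2):
        pair_counts[frozenset((a, b))] += 1


def pair_count(a, b):
    return pair_counts[frozenset((a, b))]


def log_odds(a, b):
    """log2(observed / expected) for the pair a/b, or None if never observed."""
    observed = pair_count(a, b) / sum(pair_counts.values())
    if observed == 0:
        return None
    total = sum(residue_counts.values())
    p_a, p_b = residue_counts[a] / total, residue_counts[b] / total
    expected = p_a * p_b * (1 if a == b else 2)   # unordered pair: two orders
    return log2(observed / expected)


print(pair_count("S", "T"), residue_counts["S"], residue_counts["T"])
print(log_odds("S", "T"))
```

A positive value means the pair is seen more often than expected by chance. A real matrix would pool counts from many alignments before taking the logarithm.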

The problem of making a substitution matrix
Problem: To make a matrix you need a multiple alignment, but to make a multiple alignment you need a matrix (a Münchausen problem). The first-generation solution: make multiple alignments by hand, using known proteins. Very tedious → this gives the so-called PAM matrix. The second-generation solution: make multiple alignments with a program using the PAM matrix, and then extract large-scale statistics from conserved regions → this is the so-called BLOSUM matrix. (Matrix figure: all entries × 10^4.)

PAM_1 = 1% of amino acids mutate
PAM_30 = (PAM_1)^30 (matrix multiplication). Figure: the PAM 250 matrix (the higher the number, the higher the divergence). Note: chemically similar amino acids are near each other in the matrix (groups labelled on the figure: small polar, …, basic, large aromatic).
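The matrix-power relation can be tried out on a toy example. A real PAM 1 matrix is 20x20; the two-letter alphabet and the 1% mutation probability below are hypothetical stand-ins chosen only to keep the sketch small, and the helper names are assumptions.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_pow(m, power):
    """Multiply a matrix by itself `power` times (power 0 gives the identity)."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(power):
        result = mat_mul(result, m)
    return result


# Toy "PAM 1": each residue keeps its identity with probability 0.99.
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam30 = mat_pow(toy_pam1, 30)

# The off-diagonal (mutation) probability grows with the PAM number.
print(toy_pam1[0][1], toy_pam30[0][1])
```

The off-diagonal entry after 30 steps is smaller than 30 x 1%, because repeated and back mutations at the same site are accounted for by the multiplication.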

Scoring Matrices used today
BLOSUM matrices (most often used): developed by Henikoff & Henikoff (1992); BLOcks SUbstitution Matrix; derived from the BLOCKS database. PAM matrices: developed by Schwartz and Dayhoff (1978); Point Accepted Mutation; derived from manual alignments of closely related proteins.

PAM versus BLOSUM
BLOSUM: Much later entry to the matrix “sweepstakes”. No evolutionary model is assumed. Built from sequence blocks taken from PROSITE (functionally similar segments of proteins). Uses a much larger, more diverse set of protein sequences (30% - 90% ID). PAM: First useful scoring matrix for proteins. Assumes a Markov model of evolution (i.e. all sites equally mutable and independent). Derived from small, closely related proteins with ~15% divergence.

PAM versus BLOSUM. BLOSUM: Lower BLOSUM numbers detect more remote sequence similarities; higher BLOSUM numbers detect high similarities. Sensitive to structural and functional substitution. Errors in BLOSUM arise from errors in alignment. PAM: Higher PAM numbers detect more remote sequence similarities; lower PAM numbers detect high similarities. 1 PAM ~ 1 million years of divergence. Errors in PAM 1 are scaled 250X in PAM 250.

PAM Matrices. PAM 40 - prepared by multiplying PAM 1 by itself for a total of 40 times; best for short alignments with high similarity. PAM 120 - prepared by multiplying PAM 1 by itself for a total of 120 times; best for general alignment. PAM 250 - prepared by multiplying PAM 1 by itself for a total of 250 times; best for detecting distant sequence similarity.

BLOSUM Matrices. BLOSUM 90 - prepared from BLOCKS sequences with >90% sequence ID; best for short alignments with high similarity. BLOSUM 62 - prepared from BLOCKS sequences with >62% sequence ID; best for general alignment (default). BLOSUM 30 - prepared from BLOCKS sequences with >30% sequence ID; best for detecting weak local alignments.

Database Searching: Scores
V D S – C Y
V E T L C F
BLOSUM62: +4 +2 +1 -12 +9 +3 = 7
Slide by David Landsman, NCBI
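The BLOSUM62 total of 7 in this example can be checked with a small sketch. Only the five pair values and the gap cost shown on the slide are entered; the dictionary is therefore a fragment, not the full BLOSUM62 matrix, and the names used are assumptions.

```python
# Pair values and gap cost as shown on the slide (fragment of BLOSUM62).
PAIR_SCORES = {("V", "V"): 4, ("D", "E"): 2, ("S", "T"): 1,
               ("C", "C"): 9, ("Y", "F"): 3}
GAP_COST = -12


def alignment_score(row1, row2):
    """Add up substitution values and gap costs column by column."""
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += GAP_COST
        else:
            total += PAIR_SCORES[(a, b)]
    return total


print(alignment_score("VDS-CY", "VETLCF"))  # 4 + 2 + 1 - 12 + 9 + 3 = 7
```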

Nucleic acid matrices
    A   C   G   T
A  10   0   0   0
C   0  10   0   0
G   0   0  10   0
T   0   0   0  10
The slide shows one A/C/G/T matrix for Needleman-Wunsch* and one for Smith-Waterman*. The magnitudes of the elements are relative and can be scaled. Other heuristic matrices can be easily constructed. Identity matrix: diagonal = 1, rest = 0. Or, one can penalize certain associations by assigning a large negative value to them, etc.
*These are the names of two classical algorithms, to be discussed in the next section.

A method of visualizing matching positions in biological sequences
Dot plots: visual comparison of sequences. This presentation was created by David and Paul Judge.

Graphical comparison of sequences using “Dotplots”. Basic Principles.
A dotplot provides a graphical comparison of two sequences (protein or DNA; the illustration uses DNA). First consider the sequences to be compared written out along Cartesian axes. 1) Write sequences of length n on two perpendicular axes. This defines an n x n matrix, called the dot matrix or alignment matrix. 2) Imagine that we now put a red dot at those positions where the nucleotides x(i) and y(j) are identical. 3) If the two sequences are identical, the diagonal will be red: x(i) = y(i) all along the sequences.
Example sequences:
ACCTGCCCTGTCCAGCTTACATGCATGCTTATAGGGGCATTTTACAT
ACCTGCCGATTCCATATTACGCATGCTTCTGGGTTACCGTTCAGGGCATTTTACATGTGCTG

Graphical comparison of sequences using “Dotplots”. Basic Principles.
4) If 10 nucleotides are deleted in sequence y at a certain position, but the two sequences are otherwise identical, then after the point of deletion y(i) = x(i + 10). We can view this in two ways: y(i) = x(i + 10), an insertion in x; or y(i - 10) = x(i), a deletion in y. (Figure: the diagonal is shifted by 10 nt at the point of deletion.)

Graphical comparison of sequences using “Dotplots”.
Basic Principles. The comparison uses a “word size” (11, say), a “scoring scheme” (1 for a match, 0 for a mismatch, say) and a “cut-off score” (8, say). Example: the words ATGCTTATAGG and ATGCTTCTGGG match at 9 of 11 positions (score = 9), which reaches the cut-off, so a dot is drawn. Diagonal runs of dots indicate similar regions. Summary: dotplots provide a comprehensive overview but NO detail.
Notes: The sequences are compared in fixed length sections (“words”, of 11 base pairs, say). Each word from the horizontal sequence is compared with each word of the vertical sequence. Words are compared by considering corresponding residues. Using a scoring scheme (+1 for a matched pair of bases, 0 for a mismatched pair, say), a similarity score is computed for each pair of compared words. Significant similarity is defined by the selection of a cut-off score (8, say, indicating 8 or more matched bases between words of size 11 to be significant). Significantly matching pairs of words are indicated by drawing a dot in a position representing the middles of the matched words. Once all possible words of the horizontal sequence have been compared to all possible words of the vertical sequence, a number of dots should have been plotted. Roughly diagonal runs of dots indicate roughly similar regions in the two sequences being compared. Dotplots provide a comprehensive overview; however, textual alignments are required to examine sequence similarities in detail.
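The word-by-word comparison described above can be sketched as follows. The function names are assumptions, and this brute-force version is meant to show the logic rather than to be fast.

```python
def word_score(word_x, word_y):
    """Score two equal-length words: +1 per matching position, 0 otherwise."""
    return sum(1 for a, b in zip(word_x, word_y) if a == b)


def dotplot(seq_x, seq_y, word=11, cutoff=8):
    """Return (x, y) dot positions, placed at the middles of matching words."""
    dots = []
    for i in range(len(seq_x) - word + 1):
        for j in range(len(seq_y) - word + 1):
            if word_score(seq_x[i:i + word], seq_y[j:j + word]) >= cutoff:
                dots.append((i + word // 2, j + word // 2))
    return dots


# The two example words from the slide match at 9 of 11 positions.
print(word_score("ATGCTTATAGG", "ATGCTTCTGGG"))

seq_x = "ACCTGCCCTGTCCAGCTTACATGCATGCTTATAGGGGCATTTTACAT"
seq_y = "ACCTGCCGATTCCATATTACGCATGCTTCTGGGTTACCGTTCAGGGCATTTTACATGTGCTG"
print(len(dotplot(seq_x, seq_y)), "dots at word size 11, cut-off 8")
```

Raising `cutoff` towards the word size leaves fewer dots, and lowering it adds dots that are more likely to be chance matches.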

Matching bit or character-strings
REMINDER: Matching bit or character strings. The Hamming distance is the number of exchanges necessary to turn one string of bits or characters into another one (in the figure, the number of positions not connected with a straight line). The two strings are of identical length and no alignment is done. The exchanges in character strings can have different costs, stored in a lookup table. In this case the value of the Hamming distance will be the sum of the costs, rather than the number of exchanges. WE USE THIS IN DOT PLOTS.
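A short sketch of both variants of the Hamming distance, plain and cost-weighted. The cost values in the usage example are made up for illustration, and the function name is an assumption.

```python
def hamming(a, b, cost=None):
    """Hamming distance between equal-length strings.

    Without a cost table, every exchange counts as 1.
    With a cost table (a dict keyed by character pairs), the distance
    is the sum of the looked-up costs of the exchanges.
    """
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of identical length")
    if cost is None:
        return sum(1 for x, y in zip(a, b) if x != y)
    return sum(cost[(x, y)] for x, y in zip(a, b) if x != y)


print(hamming("ATGCTTATAGG", "ATGCTTCTGGG"))                 # two exchanges
toy_cost = {("A", "C"): 2, ("A", "G"): 1}                    # illustrative costs
print(hamming("ATGCTTATAGG", "ATGCTTCTGGG", cost=toy_cost))  # 2 + 1
```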

Scoring Schemes. DNA: the simplest scheme is the identity matrix over A, T, G, C. More complex matrices can be used. For example, the default EMBOSS DNA scoring matrix scores +5 for each matched pair of bases and -4 for each mismatched pair. The use of negative numbers is only pertinent when these matrices are used for computing textual alignments. Using a wider spread of scores eases the expansion of the scoring matrix to sensibly include ambiguity codes.
Notes: DNA sequences are generally compared on the assumption that alphabetically matching bases indicate similarity and alphabetically mismatched bases indicate dissimilarity. This is clearly a simplification, particularly for coding DNA (mismatched bases in codon position 3 will often have identical “meaning”, for example). Nevertheless, this assumption requires only the simplest of scoring schemes (e.g. 1 point for each matched pair of bases, 0 for each mismatched pair of bases). Many packages use marginally more complex scoring schemes. The same scoring schemes are used by programs computing textual alignments. The use of negative numbers is really an issue for certain of the textual alignment programs (further discussion later).

Scoring Schemes. The scoring matrices on the slide cover the DNA codes A C G T S W R Y K M B V H D N U and the protein codes A B C D E F G H I K L M N P Q R S T V W Y Z.
IUB DNA Alphabet
Code  Meaning
A     A
C     C
G     G
T/U   T
M     aMino: A|C
R     puRine: A|G
W     Weak: A|T
S     Strong: C|G
Y     pYrimidine: C|T
K     Keto: G|T
V     not T: A|C|G
H     not G: A|C|T
D     not C: A|G|T
B     not A: C|G|T
N     aNy: A|C|G|T
Using a wider spread of scores eases the expansion of the scoring matrix to sensibly include ambiguity codes.
Notes (DNA): All commonly used bioinformatics packages use the IUB DNA alphabet to represent DNA sequences. This alphabet offers a code for every possible ambiguity that could occur in a DNA sequence (see illustration). The full IUB DNA alphabet is particularly useful when representing restriction enzyme cut-sites (where ambiguous positions are the norm) and incompletely sequenced DNA (N being the most commonly employed code in this context). Providing scoring schemes that allow the use of ambiguity codes is clearly a useful idea. The EMBOSS example (illustrated) tries to select scores for matches including ambiguities to reflect carefully the statistical probability of a “real” match. Other packages are less precise (e.g. GCG, where a match between N and N, which has only a 0.25 probability of representing a “real” match, is scored as if it were a certain match). However, irrespective of the precision with which ambiguity codes are treated in a scoring scheme, comparing DNA sequences that include more than a very few ambiguity codes will not work well. Without ambiguity codes there is a 0.25 probability that bases match by chance. Adding too many ambiguities will swiftly increase noise effects to a level beyond which useful comparisons cannot be expected. The inclusion of ambiguity codes in scoring schemes is to allow the comparison programs to work with sequences that contain a few such codes, not to align sensibly sequences containing many ambiguities.
Notes (Protein): For protein sequence dotplots more complex scoring schemes are required. Clearly, scores must reflect far more than alphabetic identity. Matched amino acids that are not identical but are known to be similar cannot just be regarded as a “mismatch”. As will be discussed in full later, the numbers for the most commonly used protein scoring schemes are derived from studies of aligned protein families. They reflect the frequency with which amino acid pairs are observed to successfully substitute for each other and the expected abundance of particular amino acids.
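The 0.25 figure quoted for an N/N match can be reproduced with a small sketch of the IUB alphabet. The probability model used here (each ambiguity code stands for one of its bases with equal probability, independently in the two sequences) is an assumption made for illustration; it is not claimed to be how EMBOSS or GCG derive their scores.

```python
# IUB DNA alphabet: each code mapped to the set of bases it can stand for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def match_probability(code1, code2):
    """Chance that two codes hide the same base, assuming equally likely bases."""
    bases1, bases2 = set(IUB[code1]), set(IUB[code2])
    return len(bases1 & bases2) / (len(bases1) * len(bases2))


print(match_probability("N", "N"))  # 0.25, as quoted in the notes
print(match_probability("A", "A"))  # 1.0, an unambiguous match
print(match_probability("R", "Y"))  # 0.0, purine versus pyrimidine
```

A scoring scheme could scale its match reward by such a probability, which is the general idea behind treating ambiguity codes more carefully than as certain matches.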

Faster plots for perfect matches. To detect perfectly matching words, a dotplot program has a choice of strategies:
1) Select a scoring scheme and a word size (11, say). For every pair of words, compute a word match score in the normal way. Only if the maximum possible score (11) is achieved, celebrate with a dot (e.g. ATGCTTATAGG vs ATGCTTATAGG, score = 11). If the maximum possible score is not achieved, do not celebrate with a dot (e.g. ATGCTTATAGG vs ATGCTTCTGGG, score = 9).

OR
2) For every pair of words, see if the letters are exactly the same. If they are (ATGCTTATAGG vs ATGCTTATAGG), celebrate with a dot. If they are not (ATGCTTATAGG vs ATGCTTCTGGG), do not celebrate with a dot. To detect exactly matching words, fast character string matching can replace laborious computation of match scores to be compared with a cut-off score.
Notes: If a dotplot is required only to detect words that are identical, it is possible to write code that will produce the dotplot picture much faster than for the general case. You could, of course, use a scoring scheme, compute a word match score for all word pairs, and compare that score to a cut-off score (which would be the maximum score it is possible to obtain). But why not just compare the words as character strings? Computers can do this much, much faster than they can compute word match scores and compare them to a cut-off score. Many packages include a dotplot option specifically for detecting exactly matching words, for example the program dottup in the EMBOSS package. Such programs are clearly less versatile than the general dotplot programs. However, they are very much faster. This is of particular advantage when looking for strong matches between very long DNA sequences. Technically, exact word matching can be applied to protein sequences, though it is much less useful. In general, the word size would need to be very small to work at all. Protein sequences are normally not long enough to make the speed gain significant. Note also that, using normal protein scoring schemes, exact word matching for proteins would not be a special case of a normal dotplot, as the maximum word match score is not constant: it is a function of the words being matched.
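One way to exploit plain string comparison for exact word matches is to index the words of one sequence in a dictionary and look up the words of the other. This is a sketch under that assumption; it is not claimed to be how dottup or any other package is implemented, and the function name is hypothetical.

```python
from collections import defaultdict


def exact_word_dots(seq_x, seq_y, word=11):
    """Return (i, j) start positions where seq_x and seq_y share an identical word."""
    # Index every word of seq_y by its text.
    index = defaultdict(list)
    for j in range(len(seq_y) - word + 1):
        index[seq_y[j:j + word]].append(j)
    # Look up every word of seq_x; no scores or cut-offs are needed.
    dots = []
    for i in range(len(seq_x) - word + 1):
        for j in index.get(seq_x[i:i + word], ()):
            dots.append((i, j))
    return dots


print(exact_word_dots("ACGTACGT", "TTACGTAA", word=4))
```

Each word of the first sequence is looked up once instead of being compared against every word of the second, which is where the speed gain for very long DNA sequences comes from.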

The scoring scheme. The cut-off score The word size
Graphical comparison of sequences using “Dotplots”. Dotplot parameters. Dotplots are very simple. There are only three parameters to consider: 1) the scoring scheme, 2) the cut-off score, 3) the word size (the size of the window). Here we will consider each in turn.

The Scoring scheme. DNA Protein
Graphical comparison of sequences using “Dotplots”. Dotplot parameters. The Scoring scheme. DNA: All commonly used DNA scoring schemes award a fixed reward for each matched pair of bases and a fixed penalty for each mismatched pair of bases. Typically, choosing between such scoring schemes only affects the choice of a sensible cut-off score and, marginally, the way ambiguity codes are treated. Protein: Protein scoring schemes differ in the evolutionary distance assumed between the proteins being compared. For the protein scoring schemes predominantly used currently, the choice is primarily determined by the estimated evolutionary distance between the two sequences being compared. I would suggest that the choice is rarely crucial for dotplot programs (further discussion later).

Dotplot parameters. The Cut-off score. The higher the cut-off score, the fewer dots will be plotted; however, each dot is more likely to be significant. The lower the cut-off score, the more dots will be plotted; however, each dot is more likely to indicate a chance match (i.e. be noise). There is a dotplot program called dotter (by Erik Sonnhammer) that allows the cut-off score to be effectively adjusted dynamically. Dots are plotted for all positions using a grey scale. The lighter the grey, the lower the word match score. The darker the dot, the higher the word match score. The user is able to dynamically adjust the ends of the “grey scale” (i.e. the word match scores represented by white and black). In this way, the dotplot can be viewed with varying cut-off specifications without the requirement of recomputing the plot.

Dotplot parameters. The Cut-off score. Scoring scheme: PAM 250, word size: 25, cut-off scores: 30, 20, 10 and 5. Figure captions: cut-off 30: 4 clear strong regions apparent. Cut-off 20: the 4 regions become clearer, some other weaker features appear. Cut-off 10: more “features”, probably noise, appear, obscuring the original 4 clear regions. Cut-off 5: cut-off now clearly too low; too much noise to see interesting regions.
Varying the Cut-off score: Illustrated is a dotplot between 2 flavodoxin proteins from the swissprot protein sequence database (flav_anasp and flav_desvh) using a constant word size of 25, the PAM 250 scoring scheme and a variety of cut-off scores. With a cut-off score of 30: 4 regions of promising similarity are clearly visible. There are a few other dots which could indicate interesting regions, but are not distinct enough to be sure that they are not noise. With a cut-off score of 20: The original 4 promising regions grow a little longer and look a little more convincing. Some new regions appear and some of the original small groups of dots grow a little. Still the best guess is that there are only 4 regions of real promise. With a cut-off score of 10: Things begin to get a bit messy. The original 4 matching regions are still apparent, but beginning to be obscured by dots that most probably represent nothing interesting. With a cut-off score of 5: Well, this just shows you can try for too many dots. The 4 regions that are probably “real” are now very difficult to see amongst the noise. Note: In general, there is no “correct” cut-off score for a dotplot. It is often necessary to try several values. This makes dotter (see previous slide) a particularly valuable tool, particularly as it differentiates between dots representing “strong” matches and dots representing “weak” matches by using lighter or darker dots.

The Word size. Smaller words pick up smaller features.
Graphical comparison of sequences using “Dotplots”. Dotplot parameters. The Word size. Smaller words pick up smaller features. Large words can miss small matches. The smallest “features” are often just “noise”. Varying the word size: Arguably, the word size is the most interesting dotplot parameter to vary. Features can be missed when plotting with a word size significantly larger than the feature. The smaller the word size, the smaller the feature that can be detected. Smaller word size gives more detailed plots. To try for too much detail by using a very small word size normally results in a very “noisy” picture. Very small “features” are usually not real (“noise”). Appropriate choice of cut-off score for each size of word is important and can reduce unnecessary feature loss when using larger word sizes. Spin (of the Staden package, used for all dotplots featured in this presentation) will automatically adjust the default cut-off score when a new word size is selected.

The Word size. For sequences with regions of small matching features.
Graphical comparison of sequences using “Dotplots”. Dotplot parameters. The Word size. For sequences with regions of small matching features. Small words pick small features Individually. Larger words show matching regions more clearly. Varying the word size: Selecting a small word size is analogous to asking the program to paint what it sees meticulously, from close range with a fine paint brush. The advantage of this is that individual features can often be identified. The drawbacks are that major regions become less apparent and there is often a build up of “noise” which makes the interpretation of the dotplot difficult. Using a larger word size is like asking the dotplot program to stand well back and paint the generality of what it sees with a broad brush. The detail of individual features may be lost, but the main matching regions are picked out with optimal clarity. “Noise” is reduced, but it is possible that some small “real” features might be missed. The lack of detail can be an advantage

The Word size. Graphical comparison of sequences using “Dotplots”.
Dotplot parameters. The Word size. Superimposing a plot with a smaller word size of 11 shows the emergence of extra dots. In this case probably all noise. Displaying the word 11 plot alone shows that major features are drawn in more “carefully”. Arguably, less usefully if a broad overview is the objective. Using a relatively large word size of 25, features are drawn with a broad brush. Detail can be missed Varying the word size: Here we investigate dotplots comparing 2 NADH-ubiquinone oxidoreductase chain 5 proteins, one Human (NU5M_HUMAN) and one from Tobacco (NU5C_TOBAC). Throughout, the PAM 250 scoring scheme and the default cut-off score for the chosen word size are employed. First, the sequences are compared with a word size of 25 (cut-off default, 28). 3 strongly matching regions are clearly defined, plus a couple of shorter features (one either end) that might be of interest. Secondly, we superimpose a dotplot using a word size of 11 (cut-off default, 24). There is no real effect to the 3 main regions or the two smaller features, but there is a considerable increase in the number of dots elsewhere. In this instance, it is probable that all the additional dots are just “noise” (it might have been more exciting to find small real features missed by the larger word size). Finally, we remove the original word size 25 plot and leave just the word size 11 plot. Now the 3 main regions and 2 small features are represented as 8 features. Whilst it could be argued that this is a more accurate representation of the similarity between these two sequences, it is not necessarily the most useful. Dotplots are for providing an overview showing the main regions of interesting similarity between two sequences. Other methods (primarily textual alignments) exist to study the detail.

Other uses of dotplots. Detection of Repeats. The detection of repeats: Repeat regions within a single sequence can be detected by computing a dotplot of that sequence compared to itself. A pair of matching repeat regions should generate a line of dots parallel to the leading diagonal, indicating their positions. The leading diagonal itself will, of course, be drawn, indicating that the whole sequence looks very similar to itself! The whole plot will, again of course, be symmetrical about the leading diagonal. Some dotplot programs (including spin from the Staden package) will detect that the 2 sequences being compared are identical and will then omit to plot the leading diagonal and all of the dotplot above the leading diagonal (as it will be identical to the plot below the leading diagonal).

Other uses of dotplots. Detection of Repeats The detection of repeats: Here we see an example of a dotplot revealing a single long repeat in the protein CARB_ARCFU. As can be seen from the sequence Feature Table (part of the annotation for this sequence), the second half of the protein is a repeat of the first half. This is clearly illustrated by the dotplot. Some dotplot programs (including spin from the Staden package) allow the aligned sequence to be viewed from selected positions of the plot. From the aligned sequence, the repeat is also clear.

Other uses of dotplots. Detection of Repeats The detection of repeats: Multiple repeats are detectable in a similar way. When there are many repeats, the plots can become quite involved.

Other uses of dotplots. Detection of Repeats The detection of repeats: Here we see an example of a human muskelin protein (MKLN_HUMAN) with 5 matching repeat regions. & 515

Detection of Stem Loops
Graphical comparison of sequences using “Dotplots”. Other uses of dotplots. Detection of Stem Loops: To detect stem loops using a dotplot one must compare the sequence containing the stem loop with its reverse and complement. In the illustration, we consider a DNA sequence including a stem loop, one arm of which is represented as a blue section, the other as a red section. By definition (of a stem loop), the blue section should be very similar to the reverse and complement of the red section and vice versa. The gap in between the red and blue stems is the portion of DNA that forms the loop. First consider the stem loop sequence compared with itself. A dotplot of any sequence with itself would include an unbroken line of dots up the leading diagonal. This line of dots would be conveying the message “Gosh! The whole of this sequence looks awesomely like itself!”. If we then reverse the vertical copy of the sequence, the dotplot will change to include an unbroken line of dots down the trailing diagonal. This line of dots would hold the message “Gosh! The vertical sequence is stunningly similar to the horizontal sequence … only backwards”. Finally, if we complement the reversed vertical copy of the sequence, the blue and red stems can be considered to effectively change positions (as the complement of the blue stem must be similar to the red stem and vice versa). So the resultant dotplot will now include two aligned series of dots straddling the trailing diagonal, indicating the position of the matching stems. This is the dotplot signature of a stem loop. More generally, inverted repeats will show up in a similar fashion, but as these may be further apart, they need not be so close to the trailing diagonal.
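The reverse-and-complement step can be illustrated with a made-up stem loop. The sequence below (a 7-base arm, a 4-base loop, and the reverse complement of the arm) is a hypothetical example constructed for this sketch, and the function name is an assumption.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


# Hypothetical stem loop: left arm, loop, right arm.
left_arm = "GACTTGC"
loop = "AAAA"
right_arm = reverse_complement(left_arm)   # "GCAAGTC"
stem_loop = left_arm + loop + right_arm

rc = reverse_complement(stem_loop)
print(stem_loop)
print(rc)

# The arms of the original reappear unchanged at the same positions in the
# reverse complement, while the loop in between does not.
print(rc[:7] == stem_loop[:7], rc[-7:] == stem_loop[-7:], rc[7:11] == loop)
```

A dotplot of `stem_loop` against `rc` would therefore show the two matching arms, with the unmatched loop between them.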

Detection of Stem Loops
Graphical comparison of sequences using “Dotplots”. Other uses of dotplots. Detection of Stem Loops Detection of Stem Loops: Here we show an example of a stem loop visualised using a dotplot. The EMBL entry TASVSSL (Tomato apical stunt viroid-S stem-loop RNA) has a stem loop between positions 84 and 111, as can be verified by looking at the Feature Table within the EMBL annotation. By computing a dotplot comparing the EMBL entry (Horizontal axis) with its reverse and complement (Vertical axis), this stem loop becomes visible as a line of dots straddling the trailing diagonal of the dotplot. Also illustrated is the sequence alignment of the stem loop.

What you should know
What a dot plot is. Parameters (scoring scheme, cut-off score, word size). Appearance of related regions (between sequences). Repeats within sequences. Palindromes (within sequences). Programs.


Example: A bacterial sensor protein binds the communication signal, then binds to DNA and initiates transcription. (Figure labels: signal, signal binding, DNA-binding, RNA polymerase, DNA.) The “normal” domain architecture of the sensor protein: signal-binding domain followed by DNA-binding domain.

Example: Normal and shuffled (inverted) bacterial sensor proteins can fulfill the same function. Do we see the difference by dot plot? (Figure: dot plots with the normal and the inverted sequences on the axes.)

The two fundamental sequence alignment algorithms
Global and local alignment

Pairwise alignment – the simplest case
Why? We have two (protein or DNA) sequences originating from a common ancestor. The purpose of an alignment is to line up all positions that originate from the same position in the ancestral sequence.

Why? The purpose of an alignment is to line up all residues in the two sequences that were derived from the same residue position in the ancestral gene or protein. A gap represents an insertion or deletion.

Types of algorithms according to how we do it: Global and local
How (1)?
Global: global similarities, e.g. between single domains…
Local: local similarities, e.g. between multidomain proteins…

Global alignment
Align two sequences from “head to toe”, i.e. from 5’ ends to 3’ ends, or from N-termini to C-termini.
Exhaustive algorithm published by: Needleman, S.B. and Wunsch, C.D. (1970) “A general method applicable to the search for similarities in the amino acid sequence of two proteins” J. Mol. Biol. 48:
“Exhaustive” means that all cases are tested, so the result (the alignment) is guaranteed to be optimal.

Local alignment
Locate region(s) with a high degree of similarity in two sequences.
Exhaustive algorithm published by: Smith, T.F. and Waterman, M.S. (1981) “Identification of common molecular subsequences” J. Mol. Biol. 147:

Global Alignment: simple rules
Match (i,j) = 1, if residue (i) = residue (j); else 0
Gap = 1
Score (i,j) = Maximum of
  Score (i+1,j+1) + Match (i,j)
  Score (i+1,j) + Match (i,j) - Gap
  Score (i,j+1) + Match (i,j) - Gap
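For orientation, here is a minimal Python sketch of global alignment with the same simple scoring (match 1, mismatch 0, gap 1). It is an assumption-laden illustration, not the slide's exact procedure: it uses the common textbook form of the recurrence, in which a gap step only subtracts the gap penalty, and it fills the matrix from the start of the sequences rather than from the end. Intermediate matrix values can therefore differ from those in the slide figures. The function name `needleman_wunsch` is chosen for this example.

```python
def needleman_wunsch(a, b, match=1, mismatch=0, gap=1):
    """Textbook global alignment with a linear gap penalty.
    Returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = -gap * i
    for j in range(1, m + 1):
        score[0][j] = -gap * j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                              score[i - 1][j] - gap,        # a[i-1] against a gap
                              score[i][j - 1] - gap)        # b[j-1] against a gap
    # Trace back from the final cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] - gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


total, row_a, row_b = needleman_wunsch("aacttgagc", "ctgagt")
print(total)
print(row_a)
print(row_b)
```

Several alignments can share the optimal score; which one is printed depends on the order in which the traceback tests the three moves.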

Global Alignment (worked example: the score matrix is filled step by step)
Horizontal sequence: a a c t t g a g c, followed by a gap column (-); vertical sequence: c t g a g t.
Step 1, gap column: c -6, t -5, g -4, a -3, g -2, t -1
Step 2, last row (t): 0 -1
Step 3, partly filled rows: g: 1 0 -3 -4; a: 2 0 -2 -3
Step 4, top row (c): 3 4 5 4 -2 -1 -1 -2 -4 -6
Resulting alignment (column positions as transcribed): a a c t t g a g c - / c t - g a g t

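Local Alignment: simple rules
Match (i,j) = 1, if residue (i) = residue (j); else 0
Gap = 1
Score (i,j) = Maximum of
  0
  Score (i+1,j+1) + Match (i,j)
  Score (i+1,j) + Match (i,j) - Gap
  Score (i,j+1) + Match (i,j) - Gap
(The extra option 0 is what distinguishes local from global alignment: a score never drops below zero, so an alignment can start and end anywhere.)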
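A minimal Python sketch of local alignment with the same simple scoring follows. As an assumption, it uses the usual textbook Smith-Waterman recurrence (scores floored at 0, a gap step only subtracts the gap penalty) and fills the matrix from the start of the sequences, so its matrix values need not match the slide figures. The function name `smith_waterman` is chosen for this example.

```python
def smith_waterman(a, b, match=1, mismatch=0, gap=1):
    """Textbook local alignment with a linear gap penalty.
    Returns (best_score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    h = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,                        # start a new local alignment
                          h[i - 1][j - 1] + sub,
                          h[i - 1][j] - gap,
                          h[i][j - 1] - gap)
            if h[i][j] > best:
                best, best_pos = h[i][j], (i, j)
    # Trace back from the best cell until the score drops to 0.
    out_a, out_b = [], []
    i, j = best_pos
    while i > 0 and j > 0 and h[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if h[i][j] == h[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] - gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


best, local_a, local_b = smith_waterman("aacttgagc", "ctgagt")
print(best)
print(local_a)
print(local_b)
```

Only the region around the shared c t (t) g a g stretch is reported; the unmatched ends of both sequences are simply left out, which is the practical difference from the global case.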

Local Alignment (worked example)
Horizontal sequence: a a c t t g a g c, followed by a gap column (-). Top row of the score matrix (c): 3 4 5 4 3 1 0 0 1 0
Two alternative local alignments read from the matrix:
c t t g a g
c t - g a g
and
c t t g a g
c - t g a g

[Figure: PAM250 substitution matrix. Rows and columns in the order C S T P A G N D E Q H R K M I L V F Y W; the first entry shown is C/C = 12.]

Global Alignment: advanced rules
Match (i,j) = W(i,j), where W is a score matrix, like PAM250
Gap = Gap_init + Gap_length × length_of_gap
Score (i,j) = Maximum of
  Score (i+1,j+1) + Match (i,j)
  Score (i+1,j) + Match (i,j) - Gap
  Score (i,j+1) + Match (i,j) - Gap
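The two ingredients of the advanced rules (a substitution matrix W and a gap cost that depends on gap length) can be illustrated by scoring an alignment that is already given. This is a sketch under assumptions: the tiny two-letter matrix below holds made-up values and is not PAM250, and `gap_cost`, `score_alignment`, `gap_init` and `gap_ext` are names chosen here, with `gap_ext` playing the role of the per-position term called Gap_length above.

```python
def gap_cost(length, gap_init, gap_ext):
    """Cost of one gap: Gap_init + per-position cost x length of the gap."""
    return gap_init + gap_ext * length


def score_alignment(row_a, row_b, W, gap_init, gap_ext):
    """Score a given gapped alignment (two rows of equal length, '-' = gap)."""
    total, run, run_side = 0, 0, None
    for x, y in zip(row_a, row_b):
        side = "a" if x == "-" else "b" if y == "-" else None
        if side != run_side and run:
            # The current gap run has ended: charge it once, as a whole.
            total -= gap_cost(run, gap_init, gap_ext)
            run = 0
        run_side = side
        if side is None:
            total += W[x][y]      # substitution score from the matrix
        else:
            run += 1
    if run:
        total -= gap_cost(run, gap_init, gap_ext)
    return total


# Hypothetical toy matrix for two residues only (illustrative values).
W = {"A": {"A": 2, "S": 1},
     "S": {"A": 1, "S": 2}}

print(score_alignment("ASSA", "A--A", W, gap_init=3, gap_ext=1))
```

With these numbers one gap of length 2 costs 3 + 1 × 2 = 5, whereas two separate gaps of length 1 would cost 4 each, so the scheme favours a few long gaps over many short ones.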

Local Alignment: advanced rules
Match (i,j) = W(i,j), where W is a score matrix, like PAM250
Gap = Gap_init + Gap_length × length_of_gap
Score (i,j) = Maximum of
  0
  Score (i+1,j+1) + Match (i,j)
  Score (i+1,j) + Match (i,j) - Gap
  Score (i,j+1) + Match (i,j) - Gap

Concepts learnt in this lecture
Alignments can be exhaustive or heuristic
Exhaustive (also called dynamic programming): usable when the problem does not need many resources (e.g. we have few sequences to align)
Heuristic: for realistic problems where time is an issue
Alignments can be global or local
Global: from beginning to end
Local: pinpoint highly similar regions (more realistic)

How (2)? What methods to select according to the time/resources we have?
If we have time/resources, we can try exhaustive algorithms. This is an option with supercomputers or GPU implementations… For realistic problems (and realistic resources) we need heuristic alignments that restrict the search space to a manageable size… at the price of losing some accuracy.

Alignment heuristics (examples)
Search space reduction 1: Pre-filter the sequences to be aligned. Rationale: comparing very different sequences makes no biological sense. Brute-force filtering is efficient.
Search space reduction 2: Filter out obviously useless alignments. This means leaving out the corners of the SW or NW search matrices; only the cells around the diagonal make sense.
[Figure: the excluded corners of the search matrix.]
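Both reductions can be sketched briefly in Python. The sketch makes several assumptions: the pre-filter is approximated by counting shared k-letter words, the corner cut-off is implemented as a fixed-width band around the main diagonal, scoring is match 1, mismatch 0, gap 1, and all names (`shared_kmers`, `banded_global_score`, `band`) are chosen for this example rather than taken from a specific tool.

```python
def shared_kmers(a, b, k=3):
    """Pre-filter: number of distinct k-letter words present in both sequences."""
    words_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    words_b = {b[i:i + k] for i in range(len(b) - k + 1)}
    return len(words_a & words_b)


def banded_global_score(a, b, band, match=1, mismatch=0, gap=1):
    """Global alignment score computed only for cells with |i - j| <= band,
    i.e. the corners of the matrix are never visited."""
    n, m = len(a), len(b)
    if abs(n - m) > band:
        raise ValueError("band too narrow to reach the final cell")
    neg = float("-inf")                      # marks cells outside the band
    score = [[neg] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0
    for i in range(n + 1):
        for j in range(max(0, i - band), min(m, i + band) + 1):
            if i == 0 and j == 0:
                continue
            best = neg
            if i > 0 and j > 0:
                sub = match if a[i - 1] == b[j - 1] else mismatch
                best = max(best, score[i - 1][j - 1] + sub)
            if i > 0:
                best = max(best, score[i - 1][j] - gap)
            if j > 0:
                best = max(best, score[i][j - 1] - gap)
            score[i][j] = best
    return score[n][m]


# Only run the (cheaper) banded alignment if the pair passes the pre-filter.
query, target = "aacttgagc", "ctgagt"
if shared_kmers(query, target, k=3) >= 1:
    print(banded_global_score(query, target, band=3))
```

The price mentioned in the text shows up here directly: if the best alignment needs to wander further from the diagonal than `band` allows, the banded score is lower than the exhaustive one.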

A bacterial sensor protein binds a communication signal, then binds to DNA and initiates transcription.
[Figure: sensor protein with signal-binding and DNA-binding domains, signal, RNA polymerase and DNA. The “normal” domain architecture of the sensor protein: signal binding, DNA-binding.]

Normal and shuffled (inverted) bacterial sensor proteins can fulfill the same function. Do we see the difference by simple (raw) pairwise alignment?
[Figure: normal vs. inverted sequence, aligned by local alignment (Smith-Waterman) and by global alignment (Needleman-Wunsch). Identical sequences match at each amino acid; we show them as a series of “|” symbols.]
Pairwise alignment by itself sees only the similarity of the larger domain; the smaller one is lost (empty line, no hits).

Normal and shuffled (inverted) bacterial sensor proteins can fulfill the same function. Do we see the difference by dot plot?
[Figure: dot plot; axes: normal sequence, inverted sequence.]

The distance matrix has the information; the alignment algorithm just does not pick it up!
[Figure: heat map of the Smith-Waterman matrix and dot plot; axes: normal sequence, inverted sequence.]
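This point can be reproduced with a toy example. The sketch below assumes two made-up "domains" (arbitrary letter strings of 12 and 8 residues, not real protein sequences) and a plain Smith-Waterman matrix with match 1, mismatch -1 and gap 1; the function name `sw_matrix` is chosen for this illustration.

```python
def sw_matrix(a, b, match=1, mismatch=-1, gap=1):
    """Smith-Waterman score matrix (rows: a, columns: b), linear gap penalty."""
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0,
                          h[i - 1][j - 1] + sub,
                          h[i - 1][j] - gap,
                          h[i][j - 1] - gap)
    return h


signal_domain = "ACDEFGHIKLMN"   # hypothetical larger (signal-binding) domain
dna_domain = "PQRSTVWY"          # hypothetical smaller (DNA-binding) domain
normal = signal_domain + dna_domain
inverted = dna_domain + signal_domain

h = sw_matrix(normal, inverted)
best = max(max(row) for row in h)
# Scores at the ends of the two diagonals seen in the heat map.
large_peak = h[len(signal_domain)][len(inverted)]
small_peak = h[len(normal)][len(dna_domain)]
print(best, large_peak, small_peak)
```

A single traceback starts from the highest cell and therefore reports only the larger domain. The second ridge, ending at the cell read into `small_peak`, is present in the matrix but is only found if one looks for further, non-overlapping local maxima.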

What have we learnt?
Sequence scoring matrices (PAM, BLOSUM, unitary, and how to make one’s own…)
Dot plots
The two basic algorithms: global alignment (Needleman-Wunsch), local alignment (Smith-Waterman)
Classifying alignment methods (how to align): exhaustive, heuristic, local, global
Global alignment, exhaustive: Needleman-Wunsch
Local alignment, exhaustive: Smith-Waterman
Heuristics: simple examples
