DNA, RNA and protein are an alien language

Presentation transcript:

DNA, RNA and protein are an alien language
We attack this language cryptographically: we want to decipher both its meaning and its history.

We do not have to understand the language to identify patterns
Fortunately, the genetic code is alphabetic, so it lends itself to string comparison and pattern recognition.
"klaatu barada nikto" (The Day the Earth Stood Still)
Miescher 1892

Pairwise Sequence Alignment

Pairwise Sequence Alignment
Principles of pairwise sequence comparison
- global / local alignments
- scoring systems
- gap penalties
Methods of pairwise sequence alignment
- window-based methods
- dynamic programming approaches
These two methods are generally used to obtain alignments. They serve as a basis for many other operations in computational biology. For homology searches in databases, both methods are combined. We will come to this later in our database search session.

Pairwise Sequence Alignment: How to?
[Figure: the two example sequences, A T T C A C A T A and T A C A T T A C G T A C, labeled Sequence 1 and Sequence 2.]

Dotplot: A dotplot gives an overview of all possible alignments
[Figure: dotplot of A T T C A C A T A against T A C A T T A C G T A C, axes labeled Sequence 1 and Sequence 2.]
In the following I will often use dotplots and alignment matrices to explain alignment algorithms.
The dotplot technique: a dotplot allows visual inspection of all possible alignments. The two sequences to be aligned are written out as the column and row headings of a so-called alignment matrix. Note that the vertical sequence is read from bottom to top. A dot is placed in the matrix wherever the symbols of the two sequences are identical.
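The dotplot construction described above can be sketched in a few lines of Python. This is only an illustration: the function names `dotplot` and `render` are made up here, and the choice of which example sequence goes on the horizontal axis is an assumption, since the slide figure did not survive.

```python
def dotplot(seq_x, seq_y):
    """Return a grid where grid[r][c] is True when seq_y[r] == seq_x[c]."""
    return [[a == b for a in seq_x] for b in seq_y]


def render(seq_x, seq_y):
    """Draw the dotplot as text, with the vertical sequence read bottom to top."""
    grid = dotplot(seq_x, seq_y)
    lines = ["  " + " ".join(seq_x)]
    # Print the last symbol of seq_y first so that seq_y reads from bottom to top.
    for r in range(len(seq_y) - 1, -1, -1):
        cells = " ".join("*" if hit else "." for hit in grid[r])
        lines.append(seq_y[r] + " " + cells)
    return "\n".join(lines)


if __name__ == "__main__":
    # Horizontal/vertical assignment is an assumption made for this sketch.
    print(render("TACATTACGTAC", "ATTCACATA"))
```

Each run of stars along a diagonal in the printed grid corresponds to a stretch of identical symbols in one ungapped alignment.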

Dotplot: In a dotplot each diagonal corresponds to a possible (ungapped) alignment
[Figure: the same dotplot of A T T C A C A T A against T A C A T T A C G T A C, with one diagonal highlighted.]
A dotplot gives an overview of all possible alignments of two sequences. Each diagonal represents one possible alignment.
One possible alignment pairs T A C A T T A C G T A C with A T A C A C T T A.

Window-based Approaches
- Word Size
- Window / Stringency
Window-based approaches are quick methods used for database searches. There are two different approaches:
- word size algorithm, searching for short identities
- window/stringency, searching for short similar regions, without gaps
Neither method uses gap penalties.

Word Size Algorithm
Word Size = 3
[Figure: dotplot of T A C G G T A T G against A C A G T A T C, with the word window sliding along the diagonals.]
A window with a user-defined word size slides across the aligned sequences. With a word size of 3, a dot is drawn only if three neighbouring nucleotides match. This amounts to a search for short identities across all possible alignments. Note that all items within a word must match and the word cannot be split by gaps. This causes problems when comparing protein sequences. The word algorithm is not very sensitive and is not suited to detecting weak homologies.
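As a rough sketch of the word size method, the following Python function reports every pair of start positions at which the two sequences share an identical word. The name `word_hits` and the 0-based position convention are assumptions made for this example, not part of the GCG programs.

```python
def word_hits(seq_x, seq_y, word_size):
    """Return (i, j) start positions where seq_x and seq_y share an exact word."""
    hits = []
    for i in range(len(seq_x) - word_size + 1):
        for j in range(len(seq_y) - word_size + 1):
            # Every symbol inside the word must match; no gaps are allowed.
            if seq_x[i:i + word_size] == seq_y[j:j + word_size]:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # The two example sequences from the slide, word size 3.
    print(word_hits("TACGGTATG", "ACAGTATC", 3))  # [(4, 3), (5, 4)]
```

For the slide's sequences only the words GTA and TAT are shared, which gives two neighbouring dots on one diagonal.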

Window / Stringency
Window = 5 / Stringency = 4
[Figure: dotplot of T A C G G T A T G against T C A G T A T C, with the window sliding along the diagonals.]
The sensitivity problem can be overcome by permitting mismatches within a word: one simply defines a window size and a minimal number of matches within that window. GCG programs call this minimum the stringency. Dotplots generated this way are more sensitive.
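The window/stringency rule can be sketched the same way: count the matches inside each window pair and keep the pair if the count reaches the stringency. The function name `window_hits` is hypothetical, and positions are 0-based window starts.

```python
def window_hits(seq_x, seq_y, window, stringency):
    """Return (i, j) window starts with at least `stringency` matching positions."""
    hits = []
    for i in range(len(seq_x) - window + 1):
        for j in range(len(seq_y) - window + 1):
            # Compare the two windows position by position; mismatches are tolerated.
            matches = sum(
                a == b
                for a, b in zip(seq_x[i:i + window], seq_y[j:j + window])
            )
            if matches >= stringency:
                hits.append((i, j))
    return hits


if __name__ == "__main__":
    # The two example sequences from the slide, window 5, stringency 4.
    print(window_hits("TACGGTATG", "TCAGTATC", 5, 4))  # [(2, 1), (3, 2), (4, 3)]
```

Setting the stringency equal to the window size reduces this to the word size method, which is one way to see why the window/stringency variant is the more sensitive of the two.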

Considerations
- The window/stringency method is more sensitive than the word size method (ambiguities are permitted).
- The smaller the window, the larger the weight of statistical (unspecific) matches.
- With large windows the sensitivity for short sequences is reduced.
- Insertions/deletions are not treated explicitly.

Insertions / Deletions in a Dotplot
T A C T G - T C A T
| | | | |   | | | |
T A C T G T T C A T
This alignment contains one gap. In the corresponding dotplot the alignment appears as two diagonal segments, the second shifted by one position relative to the first.
[Figure: dotplot of Sequence 1 against Sequence 2 showing the shifted diagonal.]

Dotplot (Window = 30 / Stringency = 9)
[Figure: dotplot of two hemoglobin chains, output of the programs Compare and DotPlot.]
With the programs Compare and DotPlot you can create a visual alignment. If you run Compare with the default parameters on very similar sequences, the dotplot gets very crowded. You can filter these results either by reducing the window size or by increasing the stringency.

Dotplot (Window = 18 / Stringency = 10)
[Figure: dotplot of two hemoglobin chains, output of the programs Compare and DotPlot.]
Here we changed the size of the window from 30 to 18 and the stringency from 9 to 10.

Pairwise Sequence Alignment
Principles of pairwise sequence comparison
- global / local alignments
- scoring systems
- gap penalties
Methods of pairwise sequence alignment
- window-based approaches
- dynamic programming approaches: Needleman and Wunsch, Smith and Waterman
Window-based approaches are quick methods for the identification of sequence similarities. However, to compute an optimal alignment of two sequences one has to use another approach: dynamic programming.

Dynamic Programming
- Automatic procedure that finds the best alignment with an optimal score depending on the chosen parameters.
- Recursive solutions: we solve smaller problems first, and use those solutions to solve larger problems.
- Intermediate solutions are stored in a tabular matrix.
As we have seen in the last section, the GCG program Gap uses the Needleman & Wunsch algorithm to compute a global alignment. The GCG programs Similarity and Bestfit compute local alignments. They use the Smith & Waterman algorithm to identify a region (or regions) of highest similarity.
The Needleman & Wunsch algorithm aligns a pair of sequences over their entire lengths, while the Smith & Waterman algorithm finds the best matching regions in the same pair of sequences. Global algorithms are often not effective for highly diverged sequences and do not reflect the biological reality that two sequences may share only limited regions of conserved sequence. Very often two sequences share only a single functional domain.

Basic principles of dynamic programming
There are three basic steps:
- Initialization of the alignment matrix (the scoring model) and creation of an alignment path matrix
- Stepwise calculation of score values
- Backtracking: evaluation of the optimal path

Initialization of Matrix (BLOSUM 50): A distance metric
      H   E   A   G   A   W   G   H   E   E
P    -2  -1  -1  -2  -1  -4  -2  -2  -1  -1
A    -2  -1   5   0   5  -3   0  -2  -1  -1
W    -3  -3  -3  -3  -3  15  -3  -3  -3  -3
H    10   0  -2  -2  -2  -3  -2  10   0   0
E     0   6  -1  -3  -1  -3  -3   0   6   6
The score matrix for the two example sequences, showing the BLOSUM50 values for each aligned residue pair. Positive scores are in bold.

Needleman and Wunsch (global alignment)
Sequence 1: H E A G A W G H E E
Sequence 2: P A W H E A E
Scoring parameters: BLOSUM50 matrix
Gap penalty: linear gap penalty of 8
First, we will take a closer look at the Needleman-Wunsch algorithm. We will align these two simple sequences. Because we introduced the scoring scheme as a log-odds ratio, the scores are additive and better alignments will have higher scores. For simplicity, we will use a linear gap penalty.

Creation of an alignment path matrix
Idea: build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences.
- Construct a matrix F indexed by i and j (one index for each sequence).
- F(i,j) is the score of the best alignment between the initial segment x1...xi of x and the initial segment y1...yj of y.
- Build F(i,j) recursively, beginning with F(0,0) = 0.
Optimal global alignment:
HEAGAWGHE-E
--P-AW-HEAE

Creation of an alignment path matrix
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1
Optimal global alignment:
HEAGAWGHE-E
--P-AW-HEAE

Creation of an alignment path matrix
F(i, j) = max { F(i-1, j-1) + s(xi, yj),
                F(i-1, j) - d,
                F(i, j-1) - d }
[Figure: cell F(i, j) is reached from F(i-1, j-1) with score s(xi, yj), and from F(i-1, j) or F(i, j-1) with penalty -d.]
HEAGAWGHE-E
--P-AW-HEAE

Creation of an alignment path matrix
If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known, we can calculate F(i,j). There are three possibilities:
- xi and yj are aligned: F(i,j) = F(i-1,j-1) + s(xi, yj)
- xi is aligned to a gap: F(i,j) = F(i-1,j) - d
- yj is aligned to a gap: F(i,j) = F(i,j-1) - d
The best score up to (i,j) is the largest of the three options.

Creation of an alignment path matrix
Boundary conditions:
F(i, 0) = -i d
F(0, j) = -j d
[Figure: the matrix with only the top row (0, -8, -16, ..., -80 along H E A G A W G H E E) and the left column (0, -8, -16, ..., -56 along P A W H E A E) filled in.]
To fill the top row and the left column we need boundary conditions. Top row: j = 0, so F(i,j-1) and F(i-1,j-1) do not exist. Since the F(i,0) values represent alignments of x1...xi to gaps only, we define F(i,0) = -id. Left column: i = 0, so likewise F(0,j) = -jd. When filling the matrix we keep a pointer in each cell back to the cell from which its F(i,j) was derived.

Stepwise calculation of score values
Filling the alignment path matrix step by step, using F(i, j) = max { F(i-1, j-1) + s(xi, yj), F(i-1, j) - d, F(i, j-1) - d } with d = 8.
Scores needed for the first four cells: s(H,P) = -2, s(E,P) = -1, s(H,A) = -2, s(E,A) = -1.
F(1,1) = max { F(0,0) + s(H,P) = 0 - 2 = -2,  F(0,1) - d = -8 - 8 = -16,  F(1,0) - d = -8 - 8 = -16 } = -2
F(2,1) = max { F(1,0) + s(E,P) = -8 - 1 = -9,  F(1,1) - d = -2 - 8 = -10,  F(2,0) - d = -16 - 8 = -24 } = -9
F(1,2) = max { F(0,1) + s(H,A) = -8 - 2 = -10,  F(0,2) - d = -16 - 8 = -24,  F(1,1) - d = -2 - 8 = -10 } = -10
F(2,2) = max { F(1,1) + s(E,A) = -2 - 1 = -3,  F(1,2) - d = -10 - 8 = -18,  F(2,1) - d = -9 - 8 = -17 } = -3
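The four cell computations on this slide can be checked with a tiny helper. The function name `fill_cell` and its argument names are invented for this sketch; the boundary values, scores and gap penalty are the ones given on the slides.

```python
def fill_cell(diag, x_gap, y_gap, s, d):
    """One step of the recurrence: best of match/mismatch and the two gap moves."""
    return max(diag + s, x_gap - d, y_gap - d)


d = 8  # linear gap penalty used throughout the example

# Boundary values F(i,0) = -i*d and F(0,j) = -j*d.
F00, F10, F20 = 0, -8, -16
F01, F02 = -8, -16

# Arguments: F(i-1,j-1), F(i-1,j), F(i,j-1), s(xi,yj), d.
F11 = fill_cell(F00, F01, F10, s=-2, d=d)  # H against P
F21 = fill_cell(F10, F11, F20, s=-1, d=d)  # E against P
F12 = fill_cell(F01, F02, F11, s=-2, d=d)  # H against A
F22 = fill_cell(F11, F12, F21, s=-1, d=d)  # E against A

if __name__ == "__main__":
    print(F11, F21, F12, F22)  # -2 -9 -10 -3
```

The order of evaluation matters only in that each cell needs its three neighbours first, which is why the matrix is filled row by row or column by column.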

Backtracking
          H    E    A    G    A    W    G    H    E    E
     0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P   -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A  -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W  -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H  -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E  -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A  -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E  -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1
Cells on the optimal path, from top left to bottom right: 0, -8, -16, -17, -25, -20, -5, -13, -3, 3, -5, 1.
The alignment path matrix is now filled completely. The value of the final cell F(10,7), at the bottom right corner, is by definition the best score for the global alignment of our two sequences. To find the alignment itself we must find the path of choices that led to this final value. The procedure to do this is called backtracking.
- Build the alignment in reverse, starting from the final cell and following the pointers that we stored when building the matrix.
- At each step, add a pair of symbols to the front end of the alignment.
Optimal global alignment:
HEAGAWGHE-E
--P-AW-HEAE
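The fill and backtracking steps can be put together in a short Python sketch. It is not the GCG Gap implementation: the function name `needleman_wunsch`, the dictionary `BLOSUM50_SUBSET` and the tie-breaking order (diagonal first, then a gap in sequence 2, then a gap in sequence 1) are choices made for this example. The scores are copied from the BLOSUM50 table on the earlier slide and cover only the residue pairs that occur in the example. Instead of storing pointers, the traceback re-derives each move from the matrix values.

```python
# BLOSUM50 scores for the residue pairs in the example only
# (outer key: residue of sequence 2, inner key: residue of sequence 1).
BLOSUM50_SUBSET = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 5, "G": 0, "W": -3},
    "W": {"H": -3, "E": -3, "A": -3, "G": -3, "W": 15},
    "H": {"H": 10, "E": 0, "A": -2, "G": -2, "W": -3},
    "E": {"H": 0, "E": 6, "A": -1, "G": -3, "W": -3},
}


def needleman_wunsch(x, y, score, d):
    """Global alignment of x and y with a linear gap penalty d."""
    n, m = len(x), len(y)
    # F[i][j] is the best score for aligning x[:i] with y[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                F[i - 1][j - 1] + score[y[j - 1]][x[i - 1]],
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
    # Backtracking: rebuild the alignment from the final cell to F[0][0].
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + score[y[j - 1]][x[i - 1]]:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(needleman_wunsch("HEAGAWGHEE", "PAWHEAE", BLOSUM50_SUBSET, 8))
```

With the slide's parameters this returns a score of 1 and the alignment HEAGAWGHE-E over --P-AW-HEAE. A different tie-breaking order could return another alignment with the same score.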

Smith and Waterman (local alignment)
F(i, j) = max { 0,
                F(i-1, j-1) + s(xi, yj),
                F(i-1, j) - d,
                F(i, j-1) - d }
With the Smith-Waterman algorithm we can look for the best alignment between subsequences of sequence x and sequence y. This situation arises, for example, when we suspect that two sequences share a common domain or when we compare extended stretches of genomic DNA. It is also the more sensitive method for detecting similarity between highly diverged sequences.
There are two differences from the Needleman and Wunsch algorithm:
1. An extra possibility of 0 is added to the equation. Taking the value 0 corresponds to starting a new alignment. As a consequence the top row and the left column are filled with 0.
2. An alignment can end anywhere in the matrix. So we look for the highest value over the whole matrix and start the backtracking from there. The traceback ends when a cell with value 0 is reached, which corresponds to the start of the alignment.
Example:
Sequence 1: H E A G A W G H E E
Sequence 2: P A W H E A E
Scoring parameters: log-odds ratios
Gap penalty: linear gap penalty of 8

Smith Waterman alignment
        H   E   A   G   A   W   G   H   E   E
    0   0   0   0   0   0   0   0   0   0   0
P   0   0   0   0   0   0   0   0   0   0   0
A   0   0   0   5   0   5   0   0   0   0   0
W   0   0   0   0   2   0  20  12   4   0   0
H   0  10   2   0   0   0  12  18  22  14   6
E   0   2  16   8   0   0   4  10  18  28  20
A   0   0   8  21  13   5   0   4  10  20  27
E   0   0   6  13  18  12   4   0   4  16  26
Cells on the optimal path: 5, 20, 12, 22, 28.
Our example sequences aligned with the Smith-Waterman algorithm.
Optimal local alignment:
AWGHE
AW-HE
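The two changes relative to the global algorithm (the extra 0 in the maximum, and starting the traceback at the highest cell) can be sketched as follows. The function name `smith_waterman` and the dictionary `BLOSUM50_SUBSET` are names chosen for this example; the scores are again the BLOSUM50 values from the slide's score table, restricted to the residue pairs that occur here.

```python
# BLOSUM50 scores for the residue pairs in the example only
# (outer key: residue of sequence 2, inner key: residue of sequence 1).
BLOSUM50_SUBSET = {
    "P": {"H": -2, "E": -1, "A": -1, "G": -2, "W": -4},
    "A": {"H": -2, "E": -1, "A": 5, "G": 0, "W": -3},
    "W": {"H": -3, "E": -3, "A": -3, "G": -3, "W": 15},
    "H": {"H": 10, "E": 0, "A": -2, "G": -2, "W": -3},
    "E": {"H": 0, "E": 6, "A": -1, "G": -3, "W": -3},
}


def smith_waterman(x, y, score, d):
    """Best local alignment of x and y with a linear gap penalty d."""
    n, m = len(x), len(y)
    # Top row and left column stay 0: an alignment may start anywhere.
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(
                0,
                F[i - 1][j - 1] + score[y[j - 1]][x[i - 1]],
                F[i - 1][j] - d,
                F[i][j - 1] - d,
            )
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Traceback from the highest cell until a cell with value 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + score[y[j - 1]][x[i - 1]]:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(smith_waterman("HEAGAWGHEE", "PAWHEAE", BLOSUM50_SUBSET, 8))
```

For the example sequences this gives a score of 28 with AWGHE aligned to AW-HE, matching the path 5, 20, 12, 22, 28 in the matrix.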

Extended Smith & Waterman
To get multiple local alignments:
- delete regions around the best path
- repeat backtracking

Extended Smith & Waterman
[Figure: the Smith-Waterman matrix for H E A G A W G H E E against P A W H E A E, with the region around the best path marked as removed: the second 5 in the first row A; 20, 12, 4 in row W; 12, 18, 22, 14, 6 in row H; 4, 10, 18, 28, 20 in the first row E; 4, 10, 20, 27 in the second row A; and 4, 16, 26 in the last row E.]

Extended Smith & Waterman
[Figure: the same matrix with the region around the best path removed; the remaining highest-scoring path runs through the cells 10, 16, 21.]
Second best local alignment:
HEA
HEA

Further Extensions of Dynamic Programming
- Overlap matches
- Alignment with affine gap scores
The dynamic programming algorithms can be extended to deal with overlap matches, e.g. when comparing genomic DNA fragments to one another. We can also include affine gap penalties. These are basically variations on the same theme. Anyone who wants to know more can dive into the literature.

Pairwise Sequence Alignment
Pairwise sequence comparison
- global / local alignments
- parameters: scoring systems, insertions / deletions
Methods of pairwise sequence alignment
- dotplot
- window-based methods
- dynamic programming
- algorithm complexity

End.of.pa.irwise..sequence | | | | | align.ment.cours.e

Multiple Alignment
Progressive Alignment, step 1: Methods of Pairwise Comparison
Programs perform global alignments:
- Needleman & Wunsch (Pileup, Tree, Clustal)
- Word Size Method (Clustal)
- X. Huang (MAlign) (modified N-W)

Multiple Alignment
Progressive Alignment, step 2: Construction of a Guide Tree
[Figure: similarity matrix with sequences 1 to 5 as row and column headings.]
Similarity matrix: displays the scores of all sequence pairs. The similarity matrix is transformed into a distance matrix.

Multiple Alignment
Progressive Alignment, step 2: Construction of a Guide Tree
[Figure: the distance matrix is converted into a guide tree with leaves 1, 5, 2, 3, 4.]
Neighbour-Joining Method or UPGMA (unweighted pair group method of arithmetic averages)

Multiple Alignment
Progressive Alignment, step 3: Multiple Alignment
[Figure: guide tree over sequences 1 to 5, with the alignment steps numbered along its nodes.]

Multiple Alignment
Progressive Alignment, step 3: Columns - once aligned - are never changed
Existing pairwise alignment:
G T C C G - C A G G
T T - C G C C - G G
Sequence to be added:
T T A C T T C C A G G
Result:
G T C C G - - C A G G
T T - C G C - C - G G
T T A C T T C C A G G
. . . and new gaps are inserted.

Multiple Alignment
Progressive Alignment, step 3: Columns - once aligned - are never changed
Existing alignments:
G T C C G - - C A G G
T T - C G C - C - G G
T T A C T T C C A G G
and
A T C T - - C A A T
C T G T C C C T A G
Result:
G T C C G - - C A G G
T T - C G C - C - G G
T T A C T T C C A G G
A T C - T - - C A A T
C T G - T C C C T A G

Sub-sequence alignments

A K-means like clustering problem

Clustering resulting model

Clustering predictions

Assignments
- Describe a pairwise alignment with a different gap penalization.
- Provide an example and perform a multiple global alignment. Describe the recipe.
- Provide an example and perform a multiple alignment of subsequences. Describe the recipe.
- Algorithms: order (polynomial, exponential, NP)

Algorithmic Complexity
How does an algorithm's performance in CPU time and required memory storage scale with the size of the problem?
Needleman & Wunsch:
- The algorithm stores (n+1) x (m+1) numbers.
- Each number costs a constant number of calculations to compute (three sums and a max).
- The algorithm therefore takes O(nm) memory and O(nm) time.
- Since n and m are usually comparable: O(n^2).
This is called big-O notation: the algorithm is of the order nm. With biological sequences and standard computers, O(n^2) algorithms are feasible but a little slow, while O(n^3) algorithms are only feasible for very short sequences.
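To make the quadratic growth concrete, the sketch below counts the cells of the alignment matrix and turns that count into a rough memory estimate. Both function names are invented for this example, and the figure of 4 bytes per stored number is an assumption, not something stated in the lecture.

```python
def dp_table_cells(n, m):
    """Number of entries in the (n+1) x (m+1) alignment matrix."""
    return (n + 1) * (m + 1)


def estimated_bytes(n, m, bytes_per_cell=4):
    """Rough memory estimate, assuming a fixed number of bytes per cell."""
    return dp_table_cells(n, m) * bytes_per_cell


if __name__ == "__main__":
    # The example from the slides: sequences of length 10 and 7.
    print(dp_table_cells(10, 7))  # 88
    # Doubling both lengths roughly quadruples the work and the storage.
    for length in (1000, 2000, 4000):
        print(length, dp_table_cells(length, length), estimated_bytes(length, length))
```

Under that assumption two sequences of 1,000 residues need about 4 MB, while two sequences of 100,000 residues would already need about 40 GB for the full matrix.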