4.2 - Algorithms
Sébastien Lemieux, Elitra Canada Ltd.

Slide 1: Objectives
– to understand a key class of algorithms in bioinformatics: dynamic programming;
– to apply this class of algorithms to a variety of biologically relevant problems;
– to understand the relation between the theoretical formulation of a model and its implementation.

Slide 2: Historical application
In bioinformatics, dynamic programming is best known for its application to sequence alignment:
– Smith-Waterman / Needleman-Wunsch.
It was extended to database searching and gene prediction. It has also been used extensively for RNA secondary structure prediction:
– mfold (Zuker);
– pknots (Rivas and Eddy).
Hidden Markov models (HMMs) are closely related to dynamic programming:
– Viterbi algorithm (see Durbin et al.).

Slide 3: Algorithmic description
The principle of optimality: "in an optimal sequence of choices, each subsequence must also be optimal."
– An example where it applies: sequence alignment.
– With protein folding, it does not apply!
When it applies, dynamic programming typically provides the fastest known exact algorithm.

Slide 4: Algorithmic description (cont.)
An example: What is the shortest path from Toronto to Montréal?

Slide 5: Algorithmic description (cont.)
With recursion:

$D(j) = \min_{i \in V_j} \big( D(i) + d_{ij} \big)$, with $D(\text{Montréal}) = 0$,

where $D(i)$ is the length of the shortest route from Montréal to city i, $d_{ij}$ is the distance between cities i and j, and $V_j$ is the neighborhood of city j.
With "memory functions":
– Create a table D with an entry for each city. Before computing D(j), check the table; if the value is already there, just return it.
Iteratively:
– In some situations, it is possible to order the entries in the table so that each entry depends only on the previous ones. Each entry can then be computed one after the other.
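A minimal sketch of the memory-function approach in Python. The road network, the distances and the eastbound-only restriction are all illustrative assumptions (the slide's map is not reproduced here); the one-way roads keep the subproblem graph acyclic, which this plain recursion requires:

```python
from functools import lru_cache

# Hypothetical eastbound roads: east_roads[city] maps each city reachable
# by a direct road to the length of that road (made-up distances).
east_roads = {
    "Toronto": {"Kingston": 263, "Ottawa": 400},
    "Kingston": {"Ottawa": 196, "Montréal": 290},
    "Ottawa": {"Montréal": 200},
}

@lru_cache(maxsize=None)  # the "memory function": cache each D(city)
def D(city: str) -> float:
    """Length of the shortest route between `city` and Montréal."""
    if city == "Montréal":
        return 0.0
    return min(d + D(nxt) for nxt, d in east_roads[city].items())

print(D("Toronto"))  # 553 = Toronto -> Kingston -> Montréal
```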

Slide 6: The sequence alignment problem
Given two sequences a and b, find the best alignment, considering that an insertion or deletion costs I and matching $a_i$ with $b_j$ costs $M(a_i, b_j)$.
– Alignment of ATGAGCGGG vs. GCGCTAGCG would return:

ATGAGC--GGG
--GCGCTAGCG

The score quantifies the similarity between the two sequences.

Slide 7: Needleman-Wunsch
Finds the best global alignment of two sequences using dynamic programming. There are two formulations, depending on the semantics of M and I: here we minimize the cost of the alignment, but we could maximize a score instead. Sometimes erroneously referred to as Smith-Waterman, which is in fact a variation of the Needleman-Wunsch algorithm that identifies the best local alignment.
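A minimal cost-minimizing sketch of Needleman-Wunsch in Python; the unit indel cost and the 0/1 mismatch cost are illustrative parameters, not values from the lecture:

```python
def needleman_wunsch(a: str, b: str, indel: int = 1,
                     mismatch: int = 1) -> int:
    """Return the minimum global alignment cost between a and b."""
    n, m = len(a), len(b)
    # D[i][j] = optimal cost of aligning the prefixes a[:i] and b[:j]
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = i * indel          # a[:i] aligned against gaps only
    for j in range(1, m + 1):
        D[0][j] = j * indel          # gaps aligned against b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if a[i-1] == b[j-1] else mismatch
            D[i][j] = min(D[i-1][j-1] + cost,   # match / mismatch
                          D[i-1][j] + indel,    # gap in b
                          D[i][j-1] + indel)    # gap in a
    return D[n][m]

print(needleman_wunsch("ATGAGCGGG", "GCGCTAGCG"))
```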

Slide 8: Needleman-Wunsch with affine gap costs
It was proposed that opening a new gap should cost more than extending an existing one; the standard affine form is

$C(x) = O + (x - 1)E$,

where $C(x)$ is the cost of a gap of length x, O is the gap-opening cost and E < O is the gap-extension cost. The use of three matrices becomes necessary (see the next slide).

Slide 9: Dynamic programming version
Each matrix stores the best cost achieved for each pair of prefixes, assuming the alignment ends with a match (M), an insertion (I) or a deletion (D):

$M_{i,j} = M(a_i, b_j) + \min(M_{i-1,j-1}, I_{i-1,j-1}, D_{i-1,j-1})$
$I_{i,j} = \min(M_{i,j-1} + O, I_{i,j-1} + E)$
$D_{i,j} = \min(M_{i-1,j} + O, D_{i-1,j} + E)$

Is the principle of optimality preserved?
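A sketch of the three-matrix recurrence in Python (the formulation often attributed to Gotoh); the cost values are illustrative assumptions:

```python
import math

def affine_align(a: str, b: str, mismatch: int = 1,
                 gap_open: int = 3, gap_extend: int = 1) -> float:
    """Minimum affine-gap alignment cost via the three-matrix recurrence."""
    n, m = len(a), len(b)
    INF = math.inf
    # M[i][j]: best cost ending in a match/mismatch at (i, j)
    # I[i][j]: best cost ending with a gap in a (consuming b[j-1])
    # D[i][j]: best cost ending with a gap in b (consuming a[i-1])
    M = [[INF] * (m + 1) for _ in range(n + 1)]
    I = [[INF] * (m + 1) for _ in range(n + 1)]
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for j in range(1, m + 1):   # leading gap in a: C(j) = O + (j-1)E
        I[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):   # leading gap in b
        D[i][0] = gap_open + (i - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 0 if a[i-1] == b[j-1] else mismatch
            M[i][j] = s + min(M[i-1][j-1], I[i-1][j-1], D[i-1][j-1])
            I[i][j] = min(M[i][j-1] + gap_open, I[i][j-1] + gap_extend)
            D[i][j] = min(M[i-1][j] + gap_open, D[i-1][j] + gap_extend)
    return min(M[n][m], I[n][m], D[n][m])

print(affine_align("ATGAGCGGG", "GCGCTAGCG"))
```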

Slide 10: HMM - crash course!
An example: red and blue balls in two jars. The R state represents the red jar, filled with 90% red balls, while the B state is the blue jar, filled with 90% blue balls. What is the most likely explanation of an observed sequence such as:

bbrbbbbbbrbbbbbrrrbrrrrrrrrrbrbbb

Slide 11: HMM - crash course! (cont.)
The Viterbi algorithm finds the sequence of states that is most likely to have generated the observed sequence. It is defined by the recurrence

$v_l(i+1) = e_l(x_{i+1}) \max_k \big( v_k(i)\, a_{kl} \big)$

where $v_k(i)$ is the probability of the most likely state path ending in state k after the first i observations, $e_k(x_i)$ is the probability of observing $x_i$ in state k, and $a_{kl}$ is the transition probability from state k to state l. We can get rid of the "nasty" multiplications by working with the log-probability of the sequence:

$V_l(i+1) = \log e_l(x_{i+1}) + \max_k \big( V_k(i) + \log a_{kl} \big)$

That's dynamic programming!

Slide 12: HMM - crash course! (cont.)
The Viterbi algorithm is dynamic programming: the table of $v_k(i)$ values is filled one observation at a time, each column depending only on the previous one.
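A minimal Viterbi sketch for the two-jar example; the transition and start probabilities are assumptions (the slides only specify the 90% emission mix in each jar):

```python
import math

states = ("R", "B")
trans = {"R": {"R": 0.9, "B": 0.1},   # assumed: jars are "sticky"
         "B": {"R": 0.1, "B": 0.9}}
emit = {"R": {"r": 0.9, "b": 0.1},    # red jar: 90% red balls
        "B": {"r": 0.1, "b": 0.9}}    # blue jar: 90% blue balls
start = {"R": 0.5, "B": 0.5}          # assumed uniform start

def viterbi(obs: str) -> str:
    """Most likely state path, computed with log-probabilities."""
    V = [{k: math.log(start[k]) + math.log(emit[k][obs[0]])
          for k in states}]
    back = []
    for x in obs[1:]:
        column, pointers = {}, {}
        for l in states:
            # best predecessor state k for state l at this position
            k = max(states, key=lambda k: V[-1][k] + math.log(trans[k][l]))
            column[l] = V[-1][k] + math.log(trans[k][l]) + math.log(emit[l][x])
            pointers[l] = k
        V.append(column)
        back.append(pointers)
    # trace back from the best final state
    path = [max(states, key=lambda k: V[-1][k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))

print(viterbi("bbrbbbbbbrbbbbbrrrbrrrrrrrrrbrbbb"))
```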

Slide 13: Alignment as an HMM
Assume the following alignment is the observed sequence produced by an HMM:

ATGAGC--GGG
--GCGCTAGCG

As far as the Viterbi algorithm is concerned, scores work the same way as log-probabilities.

Slide 14: Alignment as an HMM (cont.)
The sequence produced corresponds to the alignment itself, so it lives in a two-dimensional space. The Viterbi matrix therefore has three dimensions: position in sequence a, position in sequence b, and the three states.

Slide 15: Alignment as an HMM (cont.)
Does this remind you of something? The three states (M, I and D) and their transitions reproduce exactly the three-matrix recurrence of the affine-gap alignment on slide 9.

Slide 16: Alignment as an HMM (cont.)
What is so special about HMMs?
– They are very flexible. It is easy to represent a complex sequence analysis problem in HMM form. Examples: cDNA vs. gDNA alignment, DNA vs. protein alignment, gene prediction, intron prediction, etc.
– Their parameters can be learned. The most difficult task is often to parameterize the final algorithm; with an HMM, the parameters can be optimized from a set of observed sequences (see the Baum-Welch algorithm, Durbin et al., p. 63).
– Their mathematical "flavor" scares a lot of people!

Slide 17: Linking matches
Very fast algorithms are available to identify matches without insertions or deletions between two large sequences:
– Aho-Corasick automaton: à la BLAST.
– Dictionary lookup: à la FASTA.
– Suffix trees.
Linking these matches together to identify the most similar region is then performed using dynamic programming.

Slide 18: Linking matches (cont.)
(Figure only; not reproduced in the transcript.)

Slide 19: Linking matches (cont.)
$S(j)$: optimal score of a chain of matches ending with match j.
$T_{ij}$: score of the transition between matches i and j; should be negative, and can be $-\infty$ when linking i to j is impossible.
$M_j$: score of match j; should be positive! Why? Because a match with a non-positive score would simply be left out of the optimal chain.
The recurrence is $S(j) = M_j + \max\big(0, \max_i (S(i) + T_{ij})\big)$.
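A sketch of this chaining recurrence in Python. The function name, the toy match coordinates and the "minus one per unit of distance" transition score are all illustrative assumptions:

```python
import math

def chain_matches(match_scores, transition):
    """Best chain score over matches 0..n-1, assumed ordered
    (e.g. by start position along the sequence).

    match_scores[j]  -- M_j, the (positive) score of match j
    transition(i, j) -- T_ij, negative score (or -inf) for linking i to j
    """
    n = len(match_scores)
    S = [0.0] * n
    for j in range(n):
        best_prev = max((S[i] + transition(i, j) for i in range(j)),
                        default=0.0)
        # a chain may also start fresh at match j, hence the max with 0
        S[j] = match_scores[j] + max(0.0, best_prev)
    return max(S, default=0.0)

# Toy example: three matches with made-up coordinates and scores.
starts = [0, 18, 33]
ends = [10, 25, 40]
def T(i, j):
    gap = starts[j] - ends[i]
    return -gap if gap >= 0 else -math.inf  # forbid overlapping matches

print(chain_matches([12.0, 8.0, 9.0], T))  # links matches 1 and 2
```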

Slide 20: Some other classes of algorithms
Greedy algorithms:
– At each step the locally "best" choice is made; for problems with the greedy-choice property, optimality is still guaranteed.
Probabilistic algorithms:
– At each step, choices are made randomly.
– Sometimes used to avoid a frequent worst case.
Heuristic algorithms:
– Optimality of the solution is not guaranteed.
– Heuristics are often probabilistic: Monte Carlo methods.
– Very useful for "difficult" problems, or when the problem formulation is already an approximation.

Slide 21: References
R. Durbin, S. Eddy, A. Krogh and G. Mitchison, Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids, Cambridge University Press, 1998.
G. Brassard and P. Bratley, Fundamentals of Algorithmics, Prentice Hall, 1996.