Statistical Alignment: Computational Properties, Homology Testing and Goodness-of-Fit
J. Hein, C. Wiuf, B. Knudsen, M.B. Moller and G. Wibling


Main objective of the paper
To show how to accelerate the statistical alignment algorithms by several orders of magnitude, using the model of insertions and deletions introduced by Thorne, Kishino, and Felsenstein in 1991 (the TKF91 model).
To propose a new homology test based on the model.
To describe a goodness-of-fit test that allows testing the insertion-deletion process inherent to the model.

Why isn't statistical alignment popular?
Computationally VERY SLOW
– The authors of the paper accelerated the statistical alignment algorithms by several orders of magnitude compared with the TKF91 algorithm.
Lack of user-friendly software?
– Programs are usually written in Fortran or C, or the compiled program only runs in a UNIX environment, which many biologists are not familiar with.
– The authors of the paper have provided a web interface to the program.

Parsimony and similarity alignments
Parsimony strategy: minimizing the distance. [Example not preserved in the transcript]
Similarity strategy: maximizing the similarity score. For example: BLAST

TKF91 model of substitutions
Continuous-time Markov model on the state space of nucleotides or amino acids.
A rate matrix $Q$ is specified.
– It describes the intensity of the different substitution events over an infinitesimal time period.
– The probability that $i$ has changed to $j$ after time $t$ is $P_{ij}(t) = [e^{Qt}]_{ij}$.
The process is assumed to be time reversible: $\pi_i P_{ij}(t) = \pi_j P_{ji}(t)$, where $\pi$ is the equilibrium distribution.
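The relation between the rate matrix and the transition probabilities can be checked numerically. The sketch below is only an illustration: it assumes the Jukes-Cantor rate matrix (every substitution has the same rate `alpha`), which the slides do not specify, and the function names are hypothetical. It computes $e^{Qt}$ with a truncated Taylor series plus repeated squaring and compares the result with the known Jukes-Cantor closed form.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(q, t, squarings=10, terms=12):
    """Approximate P(t) = exp(Q t) by scaling and squaring.

    Q t is divided by 2**squarings, exponentiated with a truncated
    Taylor series, and the result is squared `squarings` times.
    """
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[q[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes a**k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def jukes_cantor_rate_matrix(alpha):
    """Assumed example model: all off-diagonal rates equal alpha."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]

def jukes_cantor_closed_form(alpha, t):
    """Known closed form: (probability of no change, probability of a given change)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay, 0.25 - 0.25 * decay

if __name__ == "__main__":
    p = transition_matrix(jukes_cantor_rate_matrix(0.1), 2.0)
    print(p[0][0], p[0][1], jukes_cantor_closed_form(0.1, 2.0))
```

Under Jukes-Cantor the equilibrium distribution is uniform, so time reversibility reduces to $P(t)$ being symmetric, which the numerical result satisfies up to rounding.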

TKF91 model of the indel process
Can be viewed as a Markov model with all sequences as possible states.
Indel part of the model:
– links connect the letters of the sequences
– each letter has a mortal link on its right
– the left end has an immortal link
– For example (● immortal link, ★ mortal link): ● A ★ G ★ G ★
If the type of the nucleotide is ignored, this can be represented as ● ★ ★ ★

TKF91 model
– a mortal link can give birth to a new mortal link or die out
– the immortal link can also give birth but never dies
Therefore, the rates can be written as (● immortal link, ★ mortal link):

  ●    A    ★        G    ★        G    ★
  I_0  S_1  I_1/D_1  S_2  I_2/D_2  S_3  I_3/D_3

where I is the birth rate, D is the death rate (D > I), and S is the substitution rate.
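For reference, the birth-death process on links has closed-form transition probabilities in the original TKF91 paper. They are restated below in the slide's notation, assuming a single birth rate $I$ and a single death rate $D$ shared by all links (the subscripts on the slide only index positions). The auxiliary function $\beta(t)$ is taken from that paper and does not appear on the slides.

```latex
\begin{align*}
\beta(t)  &= \frac{1 - e^{(I-D)t}}{D - I\,e^{(I-D)t}} \\
p_n(t)    &= e^{-Dt}\,[1 - I\beta(t)]\,[I\beta(t)]^{n-1}, & n \ge 1 \\
p'_n(t)   &= [1 - e^{-Dt} - D\beta(t)]\,[1 - I\beta(t)]\,[I\beta(t)]^{n-1}, & n \ge 1 \\
p'_0(t)   &= D\beta(t) \\
p''_n(t)  &= [1 - I\beta(t)]\,[I\beta(t)]^{n-1}, & n \ge 1
\end{align*}
```

Here $p_n(t)$ is the probability that a mortal link survives and has $n$ descendants counting itself, $p'_n(t)$ the probability that it dies and leaves $n$ descendants, and $p''_n(t)$ the probability that the immortal link has $n$ descendants counting itself. As a consistency check, $\sum_{n \ge 1} p_n(t) + \sum_{n \ge 0} p'_n(t) = 1$ and $\sum_{n \ge 1} p''_n(t) = 1$.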

TKF91 model
To calculate the probability of a particular alignment:

  s(1): ● A ★ T ★ -
  s(2): ● C ★ T ★ G ★

$P(s^{(1)}, s^{(2)}, \text{alignment}) = p''_1 \,(\pi_A\, p_1\, P_{AC})\,(\pi_T\, p_2\, P_{TT}\, \pi_G)$
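The product for this example alignment can be evaluated numerically. The sketch below is a hedged illustration, not the authors' program: it reads the factors as the immortal link leaving only itself ($p''_1$), the A surviving with one descendant and being substituted by C, and the T surviving with two descendants (itself and an inserted G). It assumes Jukes-Cantor substitutions with uniform frequencies and arbitrary example rates, none of which come from the slides, and it evaluates only the factors shown on the slide (the TKF91 equilibrium distribution of the ancestral sequence length is not included). All function names are hypothetical.

```python
import math

def beta(birth, death, t):
    """Auxiliary function beta(t) from the TKF91 paper."""
    e = math.exp((birth - death) * t)
    return (1.0 - e) / (death - birth * e)

def p_mortal_survives(n, birth, death, t):
    """p_n(t): a mortal link survives with n descendants, itself included (n >= 1)."""
    lb = birth * beta(birth, death, t)
    return math.exp(-death * t) * (1.0 - lb) * lb ** (n - 1)

def p_immortal(n, birth, death, t):
    """p''_n(t): the immortal link has n descendants, itself included (n >= 1)."""
    lb = birth * beta(birth, death, t)
    return (1.0 - lb) * lb ** (n - 1)

def jc_substitution(a, b, alpha, t):
    """Assumed Jukes-Cantor substitution probability P_ab(t)."""
    decay = math.exp(-4.0 * alpha * t)
    return 0.25 + 0.75 * decay if a == b else 0.25 - 0.25 * decay

def example_alignment_probability(birth, death, alpha, t):
    """Product of the slide's factors for s(1) = AT, s(2) = CTG."""
    pi = 0.25  # uniform equilibrium frequency (Jukes-Cantor assumption)
    immortal = p_immortal(1, birth, death, t)
    column_a = pi * p_mortal_survives(1, birth, death, t) * jc_substitution("A", "C", alpha, t)
    column_t = pi * p_mortal_survives(2, birth, death, t) * jc_substitution("T", "T", alpha, t) * pi
    return immortal * column_a * column_t

if __name__ == "__main__":
    # Example rates are arbitrary; the model requires death > birth.
    print(example_alignment_probability(birth=0.03, death=0.04, alpha=0.1, t=1.0))
```

Keeping the link probabilities and the substitution probabilities in separate functions mirrors the structure of the model, where the indel process and the substitution process are specified independently.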

Calculating the probability of two sequences
Without conditioning on the alignment, it is necessary to sum over all alignments, weighted by their probabilities according to the TKF91 process.
Confining the likelihood calculations to a band close to the similarity-based alignment allows an efficient numerical optimization algorithm for finding the maximum likelihood estimate.
The recursions originally presented by Thorne, Kishino and Felsenstein can be simplified.
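The banding idea can be illustrated independently of the TKF91 recursions. The sketch below is only an assumption about how such a band might be described, not the authors' implementation: given a reference alignment as two gapped strings, it derives the path through the dynamic programming matrix and keeps, for each row, the columns within `width` cells of that path. The function names and the `width` parameter are hypothetical.

```python
def path_from_alignment(row1, row2, gap="-"):
    """Turn two gapped alignment rows into a list of (i, j) matrix cells."""
    i = j = 0
    path = [(0, 0)]
    for a, b in zip(row1, row2):
        if a != gap:
            i += 1
        if b != gap:
            j += 1
        path.append((i, j))
    return path

def band_around_alignment(path, len1, len2, width):
    """For each row i, return the inclusive (lo, hi) column range in the band."""
    lo = [len2] * (len1 + 1)
    hi = [0] * (len1 + 1)
    for i, j in path:
        lo[i] = min(lo[i], j)
        hi[i] = max(hi[i], j)
    return [(max(0, lo[i] - width), min(len2, hi[i] + width))
            for i in range(len1 + 1)]

def cells_in_band(band):
    """Number of matrix cells a banded recursion would have to visit."""
    return sum(hi - lo + 1 for lo, hi in band)

if __name__ == "__main__":
    path = path_from_alignment("AT-", "CTG")
    band = band_around_alignment(path, len1=2, len2=3, width=1)
    full = (2 + 1) * (3 + 1)
    print(band, cells_in_band(band), "of", full, "cells")
```

A likelihood recursion restricted to these cells sums only over alignments that stay inside the band, so the result is a lower bound on the full sum that tightens as `width` grows.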