CPM '05 Sensitivity Analysis for Ungapped Markov Models of Evolution David Fernández-Baca Department of Computer Science Iowa State University (Joint work.

Presentation transcript:

CPM '05 Sensitivity Analysis for Ungapped Markov Models of Evolution David Fernández-Baca Department of Computer Science Iowa State University (Joint work with Balaji Venkatachalam)

Motivation
Alignment scoring schemes are often based on Markov models of evolution
Optimum alignment depends on evolutionary distance
Our goal: Understand how optimum alignments are affected by the choice of evolutionary distance

Ungapped local alignments
Only matches and mismatches, no gaps
An ungapped local alignment of sequences X and Y is a pair of equal-length substrings of X and Y
[Figure: sequences X and Y with a pair of aligned substrings]

Ungapped local alignments
P. Agarwal and D.J. States. Bayesian evolutionary distance. Journal of Computational Biology 3(1):1-17, 1996.
A: [match count missing from the transcript] matches, 2 mismatches
B: 34 matches, 11 mismatches

Which alignment is better?
Score = α ∙ #matches + β ∙ #mismatches
In practice, scoring schemes depend on evolutionary distance
[Figure: axis with a breakpoint at -11/9 separating the regions where score(A) and score(B) are larger]
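The comparison between A and B can be sketched in a few lines. This is only an illustration: the names `score` and `crossover_ratio` are hypothetical, alpha denotes the match weight and beta the mismatch weight, and the match count of A did not survive in the transcript. The value 23 used below is an assumption, chosen because it reproduces the -11/9 breakpoint if that breakpoint is read as the ratio beta/alpha.

```python
from fractions import Fraction


def score(matches, mismatches, alpha, beta):
    """Linear score: alpha per match plus beta per mismatch."""
    return alpha * matches + beta * mismatches


def crossover_ratio(a, b):
    """Ratio beta/alpha at which alignments a and b tie.

    Each alignment is a pair (matches, mismatches). Setting the two
    linear scores equal gives beta/alpha = (wb - wa) / (xa - xb).
    """
    (wa, xa), (wb, xb) = a, b
    return Fraction(wb - wa, xa - xb)


# ASSUMPTION: 23 matches for A (the count is missing in the transcript).
A = (23, 2)
B = (34, 11)

tie = crossover_ratio(A, B)
print("tie at beta/alpha =", tie)
print("beta/alpha = -1:", score(*A, 1, -1), "vs", score(*B, 1, -1))
print("beta/alpha = -2:", score(*A, 1, -2), "vs", score(*B, 1, -2))
```

Under these assumed counts, B wins for mild mismatch penalties and A wins once the penalty is steep enough, which is the kind of dependence on the scoring scheme the slide points at.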

Log-odds scoring
Let
– q_X = base frequency of nucleotide X
– m_{XY}(t) = Prob(X → Y mutation in t time units)
– A be an alignment
  X_1 X_2 X_3 … X_n
  Y_1 Y_2 Y_3 … Y_n
Then,
Log-odds score of A = [formula missing from the transcript]

Log-odds scoring
Simplest model:
– m_{XX}(t) = r(t) for all X
– m_{XY}(t) = s(t) for all X ≠ Y
– q_X = ¼ for all X
Log-odds score of alignment:
α(t) ∙ #matches + β(t) ∙ #mismatches
where
α(t) = 4 + log r(t)
β(t) = 4 + log s(t)

Scores depend nonlinearly on evolutionary distance

This talk
An efficient algorithm to compute optimum alignments for all evolutionary distances
Techniques
– Linearization
– Geometry
– Divide-and-conquer

Related Work
Combinatorial/linear scoring schemes:
– Waterman, Eggert, and Lander 1992: Problem definition
– Gusfield, Balasubramanian, and Naor 1994: Bounds on number of optimality regions for pairwise alignment
– F-B, Seppäläinen, and Slutzki 2004: Generalization to multiple and phylogenetic alignment
Sensitivity analysis for statistical models:
– P. Agarwal and D.J. States 1996
– L. Pachter and B. Sturmfels 2004a & b: connections between linear scoring and Markov models

A simple Markov model of evolution
Sites evolve independently through mutation according to a Markov process
For each site:
– Transition probability matrix: M = [m_{ij}], i, j ∈ {A, C, T, G}, where m_{ij} = Prob(i → j mutation in 1 time unit)
– Transition matrix for t time units is M(t)

Jukes-Cantor transition probability matrix
[Matrix and the definitions following "where" are missing from the transcript]

α versus β
[Figure: curve traced by (α(t), β(t)) from t = 0 to t = +∞]
α(t) = 4 + log r(t)
β(t) = 4 + log s(t)
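The curve traced by the match and mismatch weights can be tabulated numerically. The sketch below rests on two assumptions that are not in the transcript: it uses the textbook Jukes-Cantor probabilities with t measured in expected substitutions per site (the slide's own matrix did not survive), and it takes the logarithm to base 2. The function names are hypothetical.

```python
import math


def jukes_cantor(t, rate=1.0):
    """Return (r, s): probability of no change and of one specific change.

    ASSUMPTION: standard Jukes-Cantor form,
    r(t) = 1/4 + 3/4 * exp(-4*rate*t/3), s(t) = 1/4 - 1/4 * exp(-4*rate*t/3).
    expm1 keeps s accurate for very small t.
    """
    s = -0.25 * math.expm1(-4.0 * rate * t / 3.0)
    return 1.0 - 3.0 * s, s


def alpha_beta(t, rate=1.0):
    """Match and mismatch weights 4 + log r(t), 4 + log s(t); requires t > 0."""
    r, s = jukes_cantor(t, rate)
    # ASSUMPTION: base-2 logarithm; the slide does not state the base.
    return 4.0 + math.log2(r), 4.0 + math.log2(s)


def ratio(t, rate=1.0):
    """beta(t) / alpha(t), the single parameter of the linearized problem."""
    alpha, beta = alpha_beta(t, rate)
    return beta / alpha


for t in (1e-9, 0.01, 1.0, 10.0, 1000.0):
    a, b = alpha_beta(t)
    print(f"t={t:g}  alpha={a:.4f}  beta={b:.4f}  beta/alpha={b / a:.4f}")
```

Under these assumptions the ratio is very negative for tiny t and approaches 1 for large t, since r(t) and s(t) both tend to 1/4.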

Linearization
Allow α and β to vary arbitrarily, ignoring that they
– are functions of t and
– must satisfy laws of probability
Result is a linear parametric problem
Recall: Score(A) = α ∙ #matches + β ∙ #mismatches

Theorem
Let n be the length of the shorter sequence. Then,
(i) The number of distinct optimal solutions over all values of α and β is O(n^{2/3}).
(ii) The parameter space decomposition looks like this:
[Figure: decomposition of the (α, β) plane into optimality regions]

Re-introducing distance
The α vs. β curve intersects every boundary line with slope ∈ (-∞, +1]
The optimum solutions for t = 0 to +∞ are found by varying β/α from -∞ to 1
Non-linear problem in t reduces to a linear one-parameter problem in β/α

An algorithm
1. Start with a simple, but highly parallel, algorithm for the fixed-parameter problem
2. Lift the fixed-parameter algorithm
– Lifted algorithm runs simultaneously for all parameter values in the linearized problem
– Output: a decomposition of parameter space into optimality regions
3. Construct the solution to the original problem by finding the optimality regions intersected by the (α(t), β(t)) curve

A naïve dynamic programming algorithm
Let C be the matrix where C_{ij} = score of the optimum alignment ending at X_i and Y_j
Subdiagonals correspond to alignments
Diagonals are independent of each other
– Process each diagonal separately
– Pick best answer over all diagonals
Total time: O(nm)
[Figure: matrix C indexed by sequences X and Y; sequence labels caatttgtcacttttt... and aattcaattcaatc...]
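A minimal sketch of the fixed-parameter algorithm follows. The recurrence is not spelled out on the slide, so the one used here, C[i][j] = max(C[i-1][j-1], 0) + weight(X[i], Y[j]) with nonempty alignments, is an assumption, as is the function name. Here alpha is the match weight and beta the mismatch weight.

```python
def best_ungapped_local(x, y, alpha, beta):
    """Best score of a nonempty ungapped local alignment of x and y.

    Each diagonal (fixed offset d = j - i) is scanned independently, and
    only the previous cell on the same diagonal is needed, so the matrix C
    is never stored. Total work is O(len(x) * len(y)).
    """
    n, m = len(x), len(y)
    best = float("-inf")
    for d in range(-(n - 1), m):           # one pass per diagonal
        i = max(0, -d)
        j = i + d
        run = 0                            # score of best alignment ending at previous cell
        while i < n and j < m:
            step = alpha if x[i] == y[j] else beta
            run = max(run, 0) + step       # extend the previous alignment or start fresh
            best = max(best, run)
            i += 1
            j += 1
    return best


print(best_ungapped_local("GGACGTCC", "TTACGTAA", 1, -1))
```

Because the diagonals never interact, they could also be processed in parallel, which is the property the lifted algorithm relies on.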

Divide and conquer for diagonals
Split diagonal in half, solve each side recursively, and combine answers. E.g.:
[Figure: three cases for the optimum alignment on a diagonal split into halves X(1), X(2) and Y(1), Y(2)]
T(N) = 2 T(N/2) + O(1) ⇒ T(N) = O(N)
where N is the length of the diagonal and 2 is the number of subproblems
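The combine step can be sketched with the standard maximum-subarray bookkeeping. The exact quantities tracked on the slide are not recoverable from the transcript, so the four-value summary below is an assumption. The input is the list of per-position weights along one diagonal (the match weight for a match, the mismatch weight for a mismatch).

```python
def solve(scores, lo, hi):
    """Summarize scores[lo:hi] as (total, best_prefix, best_suffix, best).

    'best' is the best nonempty contiguous run, i.e. the best ungapped
    local alignment inside this stretch of the diagonal. Combining two
    halves takes O(1) work, giving T(N) = 2 T(N/2) + O(1) = O(N).
    """
    if hi - lo == 1:
        s = scores[lo]
        return (s, s, s, s)
    mid = (lo + hi) // 2
    lt, lp, ls, lb = solve(scores, lo, mid)
    rt, rp, rs, rb = solve(scores, mid, hi)
    return (
        lt + rt,                 # total weight of the whole stretch
        max(lp, lt + rp),        # best prefix: stays left, or spans into the right half
        max(rs, rt + ls),        # best suffix: stays right, or spans into the left half
        max(lb, rb, ls + rp),    # best run: left only, right only, or crossing the split
    )


diagonal = [-1, -1, 1, 1, 1, 1, -1, -1]
print(solve(diagonal, 0, len(diagonal)))
```

The three candidates for the best run correspond to an optimum that lies entirely in the first half, entirely in the second half, or across the split.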

Lifting
Run naïve DP algorithm for all parameter values by manipulating piecewise linear functions instead of numbers:
– "+" → "+" for piecewise linear functions
– "max" → "max" of piecewise linear functions

Adding piecewise linear functions
[Figure: piecewise linear functions f, g, and f + g]
Time = O(total number of segments)

Computing the maximum of piecewise linear functions
[Figure: piecewise linear functions f, g, and max(f, g)]
Time = O(total number of segments)
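Both operations on piecewise linear functions can be sketched as follows. The representation (a list of (x, y) vertices with increasing x over a shared interval) and the function names are assumptions made for illustration. For brevity this version sorts the breakpoints and looks segments up by binary search, so it is not the linear-time two-pointer merge that the slides' bound refers to.

```python
from bisect import bisect_right


def evaluate(f, x):
    """Value at x of the piecewise linear function with vertex list f."""
    xs = [p[0] for p in f]
    i = max(0, min(bisect_right(xs, x) - 1, len(f) - 2))
    (x0, y0), (x1, y1) = f[i], f[i + 1]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


def pl_add(f, g):
    """Vertices of f + g: the union of both breakpoint sets suffices."""
    xs = sorted({x for x, _ in f} | {x for x, _ in g})
    return [(x, evaluate(f, x) + evaluate(g, x)) for x in xs]


def pl_max(f, g):
    """Vertices of max(f, g): union of breakpoints plus crossing points."""
    xs = sorted({x for x, _ in f} | {x for x, _ in g})
    out = []
    for a, b in zip(xs, xs[1:]):
        da = evaluate(f, a) - evaluate(g, a)
        db = evaluate(f, b) - evaluate(g, b)
        out.append((a, max(evaluate(f, a), evaluate(g, a))))
        if da * db < 0:                      # f and g cross strictly inside (a, b)
            c = a + (b - a) * da / (da - db)
            out.append((c, evaluate(f, c)))
    last = xs[-1]
    out.append((last, max(evaluate(f, last), evaluate(g, last))))
    return out


f = [(0, 0), (2, 2)]    # y = x on [0, 2]
g = [(0, 2), (2, 0)]    # y = 2 - x on [0, 2]
print(pl_add(f, g))
print(pl_max(f, g))
```

In the lifted algorithm each number in the fixed-parameter computation would be replaced by such a function of the parameter, with these two routines standing in for "+" and "max".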

Analysis
Processing a diagonal:
– T(n) = 2 T(n/2) + O(n^{2/3}) ⇒ T(n) = O(n), where O(n^{2/3}) is the number of optimum solutions for a diagonal
Merging score functions for diagonals:
– O(n^{2/3}) line segments per function, m+n-1 diagonals
– Total time: O(mn + mn^{2/3} lg m)
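The claim that the per-diagonal recurrence solves to O(n) can be checked by unrolling it. In the sketch below, c is a hypothetical constant bounding the combine cost and n is assumed to be a power of two.

```latex
\begin{align*}
T(n) &= 2\,T(n/2) + c\,n^{2/3} \\
     &= \sum_{i=0}^{\log_2 n - 1} 2^{i}\, c \left(\frac{n}{2^{i}}\right)^{2/3} + n\,T(1) \\
     &= c\,n^{2/3} \sum_{i=0}^{\log_2 n - 1} 2^{i/3} + O(n) \\
     &= c\,n^{2/3} \cdot \frac{n^{1/3} - 1}{2^{1/3} - 1} + O(n) \\
     &= O(n).
\end{align*}
```

The geometric sum is dominated by its last term, so the cost is concentrated near the leaves of the recursion and the sublinear combine step does not change the linear bound.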

Further Results (1): Parametric ancestral reconstruction
Given a phylogeny, find most likely ancestors
[Figure: example phylogeny; node labels AATACTAGC, AAT, AAC]
Sensitive to edge lengths
Result: O(n) algorithm for uniform model (all edge lengths equal)

Further Results (2)
Bounds on number of regions for gapped alignment (indels are allowed)
– Lead to algorithms, but not as efficient as in the ungapped case

Open Problems
Tight bounds on size of parameter space decomposition
Evolutionary trees with different branch lengths
Efficient sensitivity analysis for gapped models
Evaluation of sensitivity to changes in structure and parameters
– Useful in branch-swapping

Thanks to
National Science Foundation
– CCR grants [grant numbers missing from the transcript]
– EF grant [grant number missing from the transcript]