Computational Genomics Lecture #3a

Slides:



Advertisements
Similar presentations
Everyday Multiplication Everyday Multiplication Created By: Group 1.
Advertisements

Maximal Independent Subsets of Linear Spaces. Whats a linear space? Given a set of points V a set of lines where a line is a k-set of points each pair.
CIS: Compound Importance Sampling for Binding Site p-value Estimation The Hebrew University, Jerusalem, Israel Yoseph Barash Gal Elidan Tommy Kaplan Nir.
Sequence Alignment I Lecture #2
Bioinformatics (4) Sequence Analysis. figure NA1: Common & simple DNA2: the last 5000 generations Sequence Similarity and Homology.
4.1 Introduction to Matrices
. Lecture #8: - Parameter Estimation for HMM with Hidden States: the Baum Welch Training - Viterbi Training - Extensions of HMM Background Readings: Chapters.
1 Lecture 5 PRAM Algorithm: Parallel Prefix Parallel Computing Fall 2008.
5 x4. 10 x2 9 x3 10 x9 10 x4 10 x8 9 x2 9 x4.
Parallel algorithms for expression evaluation Part1. Simultaneous substitution method (SimSub) Part2. A parallel pebble game.
Computational Facility Layout
1 Introduction to Sequence Analysis Utah State University – Spring 2012 STAT 5570: Statistical Bioinformatics Notes 6.1.
. Sequence Alignment I Lecture #2 This class has been edited from Nir Friedman’s lecture which is available at Changes made by.
. Sequence Alignment III Lecture #4 This class has been edited from Nir Friedman’s lecture which is available at Changes made by.
Markov Chains Lecture #5
. Maximum Likelihood (ML) Parameter Estimation with applications to inferring phylogenetic trees Comput. Genomics, lecture 7a Presentation partially taken.
. Class 5: Multiple Sequence Alignment. Multiple sequence alignment VTISCTGSSSNIGAG-NHVKWYQQLPG VTISCTGTSSNIGS--ITVNWYQQLPG LRLSCSSSGFIFSS--YAMYWVRQAPG.
. Hidden Markov Model Lecture #6 Background Readings: Chapters 3.1, 3.2 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
. Hidden Markov Models Lecture #5 Prepared by Dan Geiger. Background Readings: Chapter 3 in the text book (Durbin et al.).
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
. Hidden Markov Model Lecture #6 Background Readings: Chapters 3.1, 3.2 in the text book, Biological Sequence Analysis, Durbin et al., 2001.
Computational Genomics Lecture #3a Much of this class has been edited from Nir Friedman’s lecture which is available at Changes.
. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.
Defining Scoring Functions, Multiple Sequence Alignment Lecture #4
. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.
1 Markov Chains Algorithms in Computational Biology Spring 2006 Slides were edited by Itai Sharon from Dan Geiger and Ydo Wexler.
Multiple Sequence alignment Chitta Baral Arizona State University.
BNFO 602 Multiple sequence alignment Usman Roshan.
. Computational Genomics Lecture #3a (revised 24/3/09) This class has been edited from Nir Friedman’s lecture which is available at
. Comput. Genomics, Lecture 5b Character Based Methods for Reconstructing Phylogenetic Trees: Maximum Parsimony Based on presentations by Dan Geiger, Shlomo.
. Sequence Alignment Tutorial #3 © Ydo Wexler & Dan Geiger.
. Sequence Alignment II Lecture #3 This class has been edited from Nir Friedman’s lecture. Changes made by Dan Geiger, then by Shlomo Moran. Background.
Bioinformatics Workshop, Fall 2003 Algorithms in Bioinformatics Lawrence D’Antonio Ramapo College of New Jersey.
. Multiple Sequence Alignment Tutorial #4 © Ilan Gronau.
Incorporating Bioinformatics in an Algorithms Course Lawrence D’Antonio Ramapo College of New Jersey.
. Pairwise and Multiple Alignment Lecture #4 This class has been edited from Nir Friedman’s lecture which is available at Changes.
Presented by Liu Qi An introduction to Bioinformatics Algorithms Qi Liu
. Sequence Alignment II Lecture #3 This class has been edited from Nir Friedman’s lecture which is available at Changes made by.
Trees, Stars, and Multiple Biological Sequence Alignment Jesse Wolfgang CSE 497 February 19, 2004.
Multiple Sequence Alignment S 1 = AGGTC S 2 = GTTCG S 3 = TGAAC Possible alignment A-TA-T GGGGGG G--G-- TTATTA -TA-TA CCCCCC -G--G- AG-AG- GTTGTT GTGGTG.
Multiple Sequence Alignments
Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press.
Cpt S 471/571: Computational Genomics Spring 2015, 3 cr. Where: Sloan 9 When: M WF 11:10-12:00 Instructor weekly office hour for Spring 2015: Tuesdays.
. EM with Many Random Variables Another Example of EM Sequence Alignment via HMM Lecture # 10 This class has been edited from Nir Friedman’s lecture. Changes.
1 Data structure:Lookup Table Application:BLAST. 2 The Look-up Table Data Structure A k-mer is a string of length k. A lookup table is a table of size.
Multiple Sequence Alignments Craig A. Struble, Ph.D. Department of Mathematics, Statistics, and Computer Science Marquette University.
Multiple Sequence Alignment Kun-Mao Chao ( 趙坤茂 ) Department of Computer Science and Information Engineering National Taiwan University, Taiwan WWW:
Chapter 3 Computational Molecular Biology Michael Smith
. Correctness proof of EM Variants of HMM Sequence Alignment via HMM Lecture # 10 This class has been edited from Nir Friedman’s lecture. Changes made.
Multiple Alignment and Phylogenetic Trees Csc 487/687 Computing for Bioinformatics.
Multiple sequence comparison (MSC) Reading: Setubal/Meidanis, 3.4 Gusfield, Algorithms on Strings, Trees and Sequences, chapter 14.
AdvancedBioinformatics Biostatistics & Medical Informatics 776 Computer Sciences 776 Spring 2002 Mark Craven Dept. of Biostatistics & Medical Informatics.
Expected accuracy sequence alignment Usman Roshan.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
1 Repeats!. 2 Introduction  A repeat family is a collection of repeats which appear multiple times in a genome.  Our objective is to identify all families.
. Sequence Alignment Tutorial #3 © Ydo Wexler & Dan Geiger.
More on HMMs and Multiple Sequence Alignment BMI/CS 776 Mark Craven March 2002.
Multiple String Comparison – The Holy Grail. Why multiple string comparison? It is the most critical cutting-edge toοl for extracting and representing.
MTH108 Business Math I Lecture 20.
Bioinformatics Overview
Computational Genomics Lecture #2b
Computational Biology Lecture #6: Matching and Alignment
Computational Biology Lecture #6: Matching and Alignment
Cpt S 471/571: Computational Genomics
CSE 589 Applied Algorithms Spring 1999
CS100: Discrete structures
Multiple Sequence Alignment (I)
Multiple Sequence Alignment
Computational Genomics Lecture #3a
Multiple Sequence Alignment
Presentation transcript:

Computational Genomics Lecture #3a Multiple sequence alignment Background Readings: Chapters 2.5, 2.7 in the text book, Biological Sequence Analysis, Durbin et al., 2001. Chapters 3.5.1- 3.5.3, 3.6.2 in Introduction to Computational Molecular Biology, Setubal and Meidanis, 1997.  Chapter 15 in Gusfield’s book. p. 81 in Kanehisa’s book Much of this class has been edited from Nir Friedman’s lecture which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger, then Shlomo Moran, and finally Benny Chor.

Ladies and Gentlemen Boys and Girls the holy grail Multiple Sequence Alignment

Multiple Sequence Alignment S1=AGGTC Possible alignment A - T G C S2=GTTCG S3=TGAAC Possible alignment A G - T C

Multiple Sequence Alignment Aligning more than two sequences. Definition: Given strings S1, S2, …,Sk a multiple (global) alignment map them to strings S’1, S’2, …,S’k that may contain blanks, where: |S’1|= |S’2|=…= |S’k| The removal of spaces from S’i leaves Si

Multiple alignments We use a matrix to represent the alignment of k sequences, K=(x1,...,xk). We assume no columns consists solely of blanks. The common scoring functions give a score to each column, and set: score(K)= ∑i score(column(i)) x1 x2 x3 x4 M Q _ I L R - K P V For k=10, a scoring function has 2k -1 > 1000 entries to specify. The scoring function is symmetric - the order of arguments need not matter: score(I,_,I,V) = score(_,I,I,V).

SUM OF PAIRS M Q _ I L R - K P V A common scoring function is SP – sum of scores of the projected pairwise alignments: SPscore(K)=∑i<j score(xi,xj). M Q _ I L R - K P V Note that we need to specify the score(-,-) because a column may have several blanks (as long as not all entries are blanks). In order for this score to be written as ∑i score(column(i)), we set score(-,-) = 0. Why ? Because these entries appear in the sum of columns but not in the sum of projected pairwise alignments (lines).

SUM OF PAIRS M Q _ I L R - K P V Definition: The sum-of-pairs (SP) value for a multiple global alignment A of k strings is the sum of the values of all projected pairwise alignments induced by A where the pairwise alignment function score(xi,xj) is additive. M Q _ I L R - K P V

Example Consider the following alignment: a c - c d b - 3 3 +4 - c - a d b d 3 + 4 + 5 = 12 a - b c d a d Using the edit distance and for , this alignment has a SP value of

Multiple Sequence Alignment Given k strings of length n, there is a natural generalization of the dynamic programming algorithm that finds an alignment that maximizes SP-score(K) = ∑i<j score(xi,xj). Instead of a 2-dimensional table, we now have a k-dimensional table to fill. For each vector i =(i1,..,ik), compute an optimal multiple alignment for the k prefix sequences x1(1,..,i1),...,xk(1,..,ik). The adjacent entries are those that differ in their index by one or zero. Each entry depends on 2k-1 adjacent entries.

The idea via K=2 V[i,j] V[i+1,j] V[i,j+1] V[i+1,j+1] Recall the notation: and the following recurrence for V: V[i,j] V[i+1,j] V[i,j+1] V[i+1,j+1] Note that the new cell index (i+1,j+1) differs from previous indices by one of 2k-1 non-zero binary vectors (1,1), (1,0), (0,1).

Multiple Sequence Alignment Given k strings of length n, there is a generalization of the dynamic programming algorithm that finds an optimal SP alignment. Computational Cost: Instead of a 2-dimensional table we now have a k-dimensional table to fill. Each dimension’s size is n+1. Each entry depends on 2k-1 adjacent entries. Number of evaluations of scoring function : O(2knk)

Complexity of the DP approach Number of cells nk. Number of adjacent cells O(2k). Computation of SP score for each column(i,b) is o(k2) Total run time is O(k22knk) which is totally unacceptable ! Maybe one can do better?

But MSA is Intractable Not much hope for a polynomial algorithm because the problem has been shown to be NP complete (proof is quite Tricky and recent. Some previous proofs were bogus). Look at Isaac Elias presentation of NP completeness proof. Need heuristic or approximation to reduce time.

Multiple Sequence Alignment – Approximation Algorithm Now we will see an O(k2n2) multiple alignment algorithm for the SP-score that approximate the optimal solution’s score by a factor of at most 2(1-1/k) < 2.

Star-score(K) = ∑j>0score(S1,Sj). Star Alignments Rather then summing up all pairwise alignments, select a fixed sequence S1 as a center, and set Star-score(K) = ∑j>0score(S1,Sj). The algorithm to find optimal alignment: at each step, add another sequence aligned with S1, keeping old gaps and possibly adding new ones (i.e. keeping old alignment intact).

Multiple Sequence Alignment – Approximation Algorithm Polynomial time algorithm: assumption: the function δ is a distance function: (triangle inequality) Let D(S,T) be the value of the minimum global alignment between S and T.

Multiple Sequence Alignment – Approximation Algorithm (cont.) Polynomial time algorithm: The input is a set Γ of k strings Si. 1. Find “center string” S1 that minimizes 2. Call the remaining strings S2, …,Sk. 3. Add a string to the multiple alignment that initially contains only S1 as follows: Suppose S1, …,Si-1 are already aligned as S’1, …,S’i-1. Add Si by running dynamic programming algorithm on S’1 and Si to produce S’’1 and S’i. Adjust S’2, …,S’i-1 by adding gaps to those columns where gaps were added to get S’’1 from S’1. Replace S’1 by S’’1.

Multiple Sequence Alignment – Approximation Algorithm (cont.) Time analysis: Choosing S1 – running dynamic programming algorithm times – O(k2n2) When Si is added to the multiple alignment, the length of S1 is at most i* n, so the time to add all k strings is

Multiple Sequence Alignment – Approximation Algorithm (cont.) Performance analysis: M - The alignment produced by this algorithm. d(i,j) - the distance M induces on the pair Si,Sj. M* - optimal alignment. For all i, d(1,i)=D(S1,Si) (we performed optimal alignment between S’1 and Si and )

Multiple Sequence Alignment – Approximation Algorithm (cont.) Performance analysis: Triangle inequality Definition of S1

Multiple Sequence Alignment – Approximation Algorithm Algorithm relies heavily on scoring function being a distance. It produced an alignment whose SP score is at most twice the minimum. What if scoring function was similarity? Can we get an efficient algorithm whose score is half the maximum? Third of maximum? … We dunno !

Tree Alignments Assume that there is a tree T=(V,E) whose leaves are the input sequences. Want to associate a sequence in each internal node. Tree-score(K) = ∑(i,j)Escore(xi,xj). Finding the optimal assignment of sequences to the internal nodes is NP Hard. We will meet this problem again in the study of phylogenetic trees (it is related to the parsimony problem).

Multiple Sequence Alignment Heuristics Example - 4 sequences A, B, C, D. A. B D A C A B C D Perform all 6 pair wise alignments. Find scores. Build a “similarity tree”. distant similar B. Multiple alignment following the tree from A. B Align most similar pairs allowing gaps to optimize alignment. D A Align the next most similar pair. C Now, “align the alignments”, introducing gaps if necessary to optimize alignment of (BD) with (AC).

(modified from Speed’s ppt presentation, see p. 81 in Kanehisa’s book) The tree-based progressive method for multiple sequence alignment, used in practice (Clustal) (a) a tree (dendrogram) obtained by “cluster analysis” (b) pairwise alignment of sequences’ alignments. (a) (b) L W R D G R G A L Q L W R G G R G A A Q D W R - G R T A S G DEHUG3 DEPGG3 DEBYG3 DEZYG3 DEBSGF L R R - A R T A S A L - R G A R A A A E (modified from Speed’s ppt presentation, see p. 81 in Kanehisa’s book)

Visualization of Alignment