

4 - 1 Chap 4 The Sequence Alignment Problem

4 - 2 The Sequence Alignment Problem
Introduction
– What, Who, Where, Why, When, How
The Sequence Alignment Problem
The Local Alignment Problem
The Affine Gap Penalty

4 - 3 Introduction
What
– Input: Two (or more) sequences S1, S2, …, Sn, and a scoring function f.
– Output: The alignment of S1, S2, …, Sn that has the optimal score.
Who
– Biologists want to know the secrets of DNA sequences.
– Computer scientists take it as an interesting problem.

4 - 4 Introduction (Cont’)
Where
– Bioinformatics.
Why
– To determine how close two species are.
– Data compression.
When
– Constructing evolutionary trees.
How
– This is why we are here.

4 - 5 The Sequence Alignment Problem
S1 = GAACTG, S2 = GAGCTG
A scoring function f is
– +2 if the i-th character of S1 is aligned with the j-th character of S2 and the two characters are equal
– -1 otherwise.
GAACTG---
GA---GCTG
Score = 3 x (+2) + 6 x (-1) = 0
GAACTG
GAGCTG
Score = 5 x (+2) + 1 x (-1) = 9
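The two scores on this slide can be checked column by column. The following is a minimal sketch; the function name `alignment_score` and its parameter names are illustrative and not part of the slides. It assumes that a column containing a gap is scored like a mismatch (-1), which is how the slide counts them.

```python
def alignment_score(row1, row2, match=2, other=-1):
    """Score two equal-length alignment rows column by column."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for x, y in zip(row1, row2):
        # A column earns `match` only when two identical residues meet;
        # mismatches and columns containing a gap earn `other`.
        total += match if (x == y and x != "-") else other
    return total


print(alignment_score("GAACTG---", "GA---GCTG"))  # 3 matches, 6 gap columns
print(alignment_score("GAACTG", "GAGCTG"))        # 5 matches, 1 mismatch
```

Run as written, the two calls reproduce the slide's scores of 0 and 9.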

4 - 6 The Dynamic Programming Approach

4 - 7 The Dynamic Programming Approach(Cont’)
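The recurrence on these two slides did not survive in the transcript. As a hedged illustration only, the standard dynamic programming table for global alignment under the +2 / -1 scoring function of slide 4 - 5 can be sketched as follows. The name `global_alignment_score` and the table name `A` are assumptions for this sketch, not the authors' notation.

```python
def global_alignment_score(s1, s2, match=2, other=-1):
    """Optimal global alignment score by dynamic programming.

    A[i][j] holds the best score for aligning s1[:i] with s2[:j].
    """
    n, m = len(s1), len(s2)
    A = [[0] * (m + 1) for _ in range(n + 1)]
    # Aligning a prefix with the empty string costs one gap per character.
    for i in range(1, n + 1):
        A[i][0] = i * other
    for j in range(1, m + 1):
        A[0][j] = j * other
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = A[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else other)
            up = A[i - 1][j] + other    # s1[i-1] aligned with a gap
            left = A[i][j - 1] + other  # s2[j-1] aligned with a gap
            A[i][j] = max(diag, up, left)
    return A[n][m]


print(global_alignment_score("GAACTG", "GAGCTG"))
```

The table has (n+1) x (m+1) cells and each cell takes constant time, so the sketch runs in O(nm) time. For the pair from slide 4 - 5 it returns 9, the score of the gap-free alignment shown there.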

4 - 8 The Local Alignment Problem
Input: Two (or more) sequences S1, S2, …, Sn, and a scoring function f.
Output: Subsequences Si’ of Si (1 <= i <= n) such that the score obtained by aligning the Si’ is the highest among all possible subsequences of the Si.
S1 = abbbcc
S2 = adddcc
Score = 3 x 2 + 3 x (-1) = 3
S1’ = cc
S2’ = cc
Score = 2 x 2 = 4

4 - 9 The Local Alignment Problem(Cont’)
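The content of this slide is missing from the transcript. As a sketch under stated assumptions, the usual way to solve the local alignment problem is the Smith-Waterman variant of the dynamic programming table, in which a cell is never allowed to drop below 0. This finds the best-scoring pair of contiguous segments. The function name `local_alignment_score` is illustrative, and the scoring is the +2 / -1 function from slide 4 - 5.

```python
def local_alignment_score(s1, s2, match=2, other=-1):
    """Best local alignment score (Smith-Waterman style table)."""
    n, m = len(s1), len(s2)
    A = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = A[i - 1][j - 1] + (match if s1[i - 1] == s2[j - 1] else other)
            # The extra 0 lets an alignment restart at any cell, which is
            # what makes the alignment local rather than global.
            A[i][j] = max(0, diag, A[i - 1][j] + other, A[i][j - 1] + other)
            best = max(best, A[i][j])
    return best


print(local_alignment_score("abbbcc", "adddcc"))
```

For S1 = abbbcc and S2 = adddcc the sketch returns 4, which corresponds to aligning cc with cc as on slide 4 - 8.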

The Affine Gap Penalty
Consider the following two sequences:
– S1 = ACTTGATCC
– S2 = AGTTAGTAGTCC
An optimal alignment of this pair of sequences is as follows:
– S1 = ACTT-G-A-TCC
– S2 = AGTTAGTAGTCC
Original score = 12
A gap-concerned alignment is as follows:
– S1 = ACTT---GATCC
– S2 = AGTTAGTAGTCC
Original score = 6

The Affine Gap Penalty (Cont’)
A gap is caused by a mutational event that removed a sequence of residues.
A single mutational event is more likely than several events, so one long gap is often preferable to several short gaps.
An affine gap penalty is defined as Pg + k x Pe for a gap with k spaces (k >= 1), where Pg, Pe >= 0.

The Affine Gap Penalty (Cont’)
Using our previous scoring function, further let Pg = 4 and Pe = 1.
– S1 = ACTT-G-A-TCC
– S2 = AGTTAGTAGTCC
– Score = 8 x 2 - 1 - 3 x (4 + 1 x 1) = 16 - 1 - 15 = 0
– S1 = ACTT---GATCC
– S2 = AGTTAGTAGTCC
– Score = 6 x 2 - 3 x 1 - (4 + 3 x 1) = 12 - 3 - 7 = 2
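The arithmetic above can be reproduced by scoring a fixed alignment with an affine penalty. The sketch below is illustrative: `affine_alignment_score`, `p_open`, and `p_extend` are names chosen here for Pg and Pe. It assumes each maximal run of gap characters in one row counts as a single gap costing Pg + k x Pe.

```python
def affine_alignment_score(row1, row2, match=2, mismatch=-1, p_open=4, p_extend=1):
    """Score a given pairwise alignment with an affine gap penalty."""
    total = 0
    gap_row = None  # row holding the gap run we are currently inside, if any
    for x, y in zip(row1, row2):
        if x == "-" or y == "-":
            row = 1 if x == "-" else 2
            if row != gap_row:
                total -= p_open    # Pg, charged once when a gap run starts
                gap_row = row
            total -= p_extend      # Pe, charged for every space in the run
        else:
            gap_row = None
            total += match if x == y else mismatch
    return total


print(affine_alignment_score("ACTT-G-A-TCC", "AGTTAGTAGTCC"))  # three gaps of length 1
print(affine_alignment_score("ACTT---GATCC", "AGTTAGTAGTCC"))  # one gap of length 3
```

With Pg = 4 and Pe = 1 the first alignment scores 0 and the second scores 2, so the single long gap is now preferred even though it scored lower (6 versus 12) under the original function.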

The Multiple Sequence Alignment Problem
Consider the following case, where three sequences are involved.
S1 = ATTCGAT
S2 = TTGAG
S3 = ATGCT

In the two-sequence alignment problem.
In the three-sequence alignment problem.

A very good alignment of these three sequences is as follows.
S1 = ATTCGAT
S2 = -TT-GAG
S3 = AT--GCT
Note that the alignment between every pair of sequences is quite good.

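The Gusfield Approximation Algorithm for the Sum of Pairs Multiple Sequence Alignment Problem
We define d(x, y) = 0 if x = y and d(x, y) = 1 if x ≠ y, where x and y are characters or gaps.
The distance between two sequences Si and Sj induced by an alignment is defined as d(Si, Sj) = the sum of d(x, y) over all aligned columns (x, y) of Si and Sj.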

d(Si, Sj) has the following characteristics:
(1) d(Si, Si) = 0
(2) d(Si, Sj) + d(Si, Sk) >= d(Sj, Sk)
Given two sequences Si and Sj, the minimum induced distance is denoted as D(Si, Sj).

S1 = ATGCTC
S2 = AGAGC
S3 = TTCTG
S4 = ATTGCATGC
We align the four sequences in pairs.
S1 = ATGCTC
S2 = A-GAGC
D(S1, S2) = 3
S1 = ATGCTC
S3 = TT-CTG
D(S1, S3) = 3

S1 = AT-GC-T-C
S4 = ATTGCATGC
D(S1, S4) = 3
S2 = AGAGC
S3 = TTCTG
D(S2, S3) = 5
S2 = A--G-A-GC
S4 = ATTGCATGC
D(S2, S4) = 4

S3 = -TT-C-TG-
S4 = ATTGCATGC
D(S3, S4) = 4
D(S1, S2) + D(S1, S3) + D(S1, S4) = 9
D(S2, S1) + D(S2, S3) + D(S2, S4) = 12
D(S3, S1) + D(S3, S2) + D(S3, S4) = 12
D(S4, S1) + D(S4, S2) + D(S4, S3) = 11
Given a set S of k sequences, the center of this set is the sequence Si that minimizes the sum of D(Si, X) over all other sequences X in S. Here the center is S1, which has the smallest sum, 9.
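The pairwise distances and the choice of center can be recomputed with a short sketch. It assumes a unit-cost column distance (0 for equal symbols, 1 for a mismatch or a gap), which is consistent with every worked value above. Under that assumption D(Si, Sj) is the classic edit distance. The names `min_distance` and `find_center` are illustrative.

```python
def min_distance(a, b):
    """Minimum induced distance D(a, b) under unit costs (edit distance)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j - 1] + (x != y),  # match or mismatch column
                           prev[j] + 1,             # x aligned with a gap
                           cur[j - 1] + 1))         # y aligned with a gap
        prev = cur
    return prev[-1]


def find_center(seqs):
    """Return (index of the center, list of distance sums for every sequence)."""
    sums = [sum(min_distance(s, t) for j, t in enumerate(seqs) if j != i)
            for i, s in enumerate(seqs)]
    return sums.index(min(sums)), sums


seqs = ["ATGCTC", "AGAGC", "TTCTG", "ATTGCATGC"]
center, sums = find_center(seqs)
print(center, sums)
```

For the four sequences of this example the sums come out as 9, 12, 12, and 11, so index 0 (S1) is selected as the center.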

Align S2 with S1:
S1 = ATGCTC
S2 = A-GAGC
Add S3 by aligning S3 with S1:
S1 = ATGCTC
S3 = -TTCTG
=>
S1 = ATGCTC
S2 = A-GAGC
S3 = -TTCTG

Add S4 by aligning S4 with S1:
S1 = AT-GC-T-C
S4 = ATTGCATGC
=>
S1 = AT-GC-T-C
S2 = A--GA-G-C
S3 = -T-TC-T-G
S4 = ATTGCATGC
App <= 2 Opt.
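The merging step, in which every gap inserted into the center S1 is propagated to the rows already aligned, can be sketched as follows. This is an illustrative helper rather than the authors' code: `merge_on_center` is a hypothetical name, and it takes the pairwise alignments with the center as given input instead of computing them.

```python
def merge_on_center(pairs):
    """Merge pairwise alignments (center_row, other_row) that share one center.

    Returns the center row followed by one row per pair, in input order.
    """
    center = pairs[0][0].replace("-", "")
    n = len(center)
    inserts, aligned = [], []
    for c_row, o_row in pairs:
        ins = [""] * (n + 1)  # characters placed in gap columns before center[i]
        ali = [""] * n        # symbol aligned with center[i]
        i = 0
        for c, o in zip(c_row, o_row):
            if c == "-":
                ins[i] += o
            else:
                ali[i] = o
                i += 1
        inserts.append(ins)
        aligned.append(ali)
    # Each slot must be wide enough for the longest insertion of any pair.
    width = [max(len(ins[i]) for ins in inserts) for i in range(n + 1)]

    def build(ins, ali):
        out = []
        for i in range(n + 1):
            out.append(ins[i].ljust(width[i], "-"))  # pad short insertions with gaps
            if i < n:
                out.append(ali[i])
        return "".join(out)

    rows = [build([""] * (n + 1), list(center))]
    rows += [build(ins, ali) for ins, ali in zip(inserts, aligned)]
    return rows


pairs = [("ATGCTC", "A-GAGC"), ("ATGCTC", "-TTCTG"), ("AT-GC-T-C", "ATTGCATGC")]
for row in merge_on_center(pairs):
    print(row)
```

With the three pairwise alignments from the slides, the sketch prints the same four rows as the final alignment above. When several pairs insert characters at the same position, this sketch left-aligns them and pads on the right, which is one of several valid conventions.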

The Minimal Spanning Tree Preservation Approach for Multiple Sequence Alignment
S1 = ATGCTC
S2 = ATGAGC
S3 = TTCTG
S4 = ATGCATGC
Step 1 finds the pairwise distances optimally by the dynamic programming algorithm.
S1 = ATGCTC
S2 = ATGAGC
D(S1, S2) = 2

S1 = ATGCTC
S3 = TT-CTG
D(S1, S3) = 3
S1 = ATGC-T-C
S4 = ATGCATGC
D(S1, S4) = 2
S2 = ATGAGC
S3 = TTCTG-
D(S2, S3) = 4

S2 = ATG-A-GC
S4 = ATGCATGC
D(S2, S4) = 2
S3 = -TTC-TG-
S4 = ATGCATGC
D(S3, S4) = 4
Table: The Distance Matrix D
      S1  S2  S3  S4
S1     -   2   3   2
S2         -   4   2
S3             -   4
S4                 -

[Figure: a minimal spanning tree MST(D) over S1, S2, S3, S4]
For e(S1, S2):
S1 = ATGCTC
S2 = ATGAGC
For e(S2, S4):
S1 = (ATG-C-TC)
S2 = ATG-A-GC
S4 = ATGCATGC

For e(S1, S3):
S1 = ATG-C-TC
S2 = (ATG-A-GC)
S3 = TT--C-TG
S4 = (ATGCATGC)
Table: The Distance Matrix Dm

[Figure: a minimal spanning tree MST(Dm) over S1, S2, S3, S4]
Theorem: MST(D) is equal to MST(Dm).
Corollary: Let e(a, b) and e(c, d) be two edges on MST(D). If D(a, b) < D(c, d), then Dm(a, b) < Dm(c, d).
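The theorem can be checked numerically on this example. The sketch below recomputes Dm from the four rows of the final alignment, assuming the unit-cost column distance (0 for equal symbols, including two gaps; 1 otherwise) that matches the worked values. It then builds minimal spanning trees with Kruskal's algorithm. The names `induced_distance` and `kruskal` are illustrative, and sequences are indexed from 0, so S1 is index 0.

```python
from itertools import combinations


def induced_distance(row_a, row_b):
    """Distance between two rows of a multiple alignment under unit costs."""
    return sum(1 for x, y in zip(row_a, row_b) if x != y)


def kruskal(n, weights):
    """Minimal spanning tree edges for weights given as {(i, j): w} with i < j."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    for (i, j), _w in sorted(weights.items(), key=lambda kv: (kv[1], kv[0])):
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two components, so keep it
            parent[ri] = rj
            tree.append((i, j))
    return tree


# Pairwise optimal distances D taken from the slides.
D = {(0, 1): 2, (0, 2): 3, (0, 3): 2, (1, 2): 4, (1, 3): 2, (2, 3): 4}
# Rows of the final multiple alignment for S1..S4.
msa = ["ATG-C-TC", "ATG-A-GC", "TT--C-TG", "ATGCATGC"]
D_m = {(i, j): induced_distance(msa[i], msa[j]) for i, j in combinations(range(4), 2)}

tree_m = kruskal(4, D_m)
print(tree_m)
print(sum(D[e] for e in tree_m), sum(D[e] for e in kruskal(4, D)))
```

On this data the tree found for Dm uses the edges (S1, S2), (S2, S4), and (S1, S3), each with the same weight in D and Dm, and its total weight under D equals the weight of a minimal spanning tree of D. Because D contains three edges of weight 2, its minimal spanning tree is not unique, so the sketch compares weights rather than edge lists.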