Pairwise Sequence Alignment (I) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 22, 2005 ChengXiang Zhai Department of Computer Science University.


Pairwise Sequence Alignment (I) (Lecture for CS498-CXZ Algorithms in Bioinformatics) Sept. 22, 2005 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign Many slides are taken/adapted from

Comparing Genes in Two Genomes Small islands of similarity corresponding to similarities between exons Such comparisons are quite common in biology research

Alignment of sequences is one of the most basic and most important problems in bioinformatics…

Outline Defining the problem of alignment The longest common subsequence problem Dynamic programming algorithms for alignment

Aligning Two Strings
Given the strings:
  v = ATGTTAT
  w = ATCGTAC
One possible alignment of the strings:
  AT_GTTAT_
  ATCGT_A_C
1st row: string v with space symbols "_" inserted
2nd row: string w with space symbols "_" inserted

Aligning Two Strings (cont'd)
Another way to represent each row is to record the number of symbols of the sequence present up to each position. For example, the rows above can be represented as:
       A T _ G T T A T _
  v: 0 1 2 2 3 4 5 6 7 7
       A T C G T _ A _ C
  w: 0 1 2 3 4 5 5 6 6 7

Alignment Matrix
Both rows of the alignment, together with their symbol counts, can be represented in one matrix:
  0 1 2 2 3 4 5 6 7 7
    A T _ G T T A T _
    A T C G T _ A _ C
  0 1 2 3 4 5 5 6 6 7
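The count rows can be derived mechanically from the gapped rows. The following is a minimal sketch, assuming "_" is the space symbol; the helper name prefix_counts is hypothetical and not from the lecture.

```python
def prefix_counts(row, gap="_"):
    """Running count of non-gap symbols, starting at 0 before the first column."""
    counts = [0]
    for symbol in row:
        # A gap does not consume a symbol of the sequence, so the count stays put.
        counts.append(counts[-1] + (symbol != gap))
    return counts


v_row = "AT_GTTAT_"
w_row = "ATCGT_A_C"

v_counts = prefix_counts(v_row)  # [0, 1, 2, 2, 3, 4, 5, 6, 7, 7]
w_counts = prefix_counts(w_row)  # [0, 1, 2, 3, 4, 5, 5, 6, 6, 7]

# Pairing the two count rows column by column gives the vertices of the
# path (i, j) through the edit graph.
path = list(zip(v_counts, w_counts))
```

Pairing the counts this way yields (0,0), (1,1), (2,2), (2,3), ..., (7,7), which is the path walked through in the edit graph slides.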

Alignment as a Path in the Edit Graph
  v = A T _ G T T A T _
  w = A T C G T _ A _ C
Reading the alignment column by column traces a path through the edit graph: (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7)

Alignment as a Path in the Edit Graph Every path in the edit graph corresponds to an alignment:
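To make the path/alignment correspondence concrete, here is a small sketch that turns a list of (i, j) vertices back into the two gapped rows. It assumes i counts symbols of v and j counts symbols of w, as in the path listed on the earlier slides; the function name path_to_alignment is hypothetical.

```python
def path_to_alignment(v, w, path, gap="_"):
    """Convert a path of (i, j) vertices into the two rows of an alignment."""
    top, bottom = [], []
    for (i0, j0), (i1, j1) in zip(path, path[1:]):
        step = (i1 - i0, j1 - j0)
        if step == (1, 1):      # diagonal edge: a symbol from each string
            top.append(v[i0])
            bottom.append(w[j0])
        elif step == (1, 0):    # consume a symbol of v only: gap in w
            top.append(v[i0])
            bottom.append(gap)
        elif step == (0, 1):    # consume a symbol of w only: gap in v
            top.append(gap)
            bottom.append(w[j0])
        else:
            raise ValueError(f"not an edit-graph edge: {step}")
    return "".join(top), "".join(bottom)


path = [(0, 0), (1, 1), (2, 2), (2, 3), (3, 4),
        (4, 5), (5, 5), (6, 6), (7, 6), (7, 7)]
rows = path_to_alignment("ATGTTAT", "ATCGTAC", path)
# rows == ("AT_GTTAT_", "ATCGT_A_C")
```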

How to Score an Alignment?
Simplest:
 - Every match scores 1
 - Every mismatch scores 0
 - An alignment is scored by its number of common symbols
 - Leads to the longest common subsequence problem
More sophisticated:
 - ?
 - To be covered later

Alignments in Edit Graph (cont'd) Horizontal and vertical edges represent indels in v and w; each scores 0. Diagonal edges represent exact matches; each scores 1.

Alignments in Edit Graph (cont’d) The score of the alignment path in the graph is 5.
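The score of 5 can be checked directly from the two rows of the example alignment. A minimal sketch under the simple scoring above (match = 1, everything else = 0); the function name simple_score is hypothetical.

```python
def simple_score(v_row, w_row, gap="_"):
    """Count columns where both rows carry the same non-gap symbol."""
    if len(v_row) != len(w_row):
        raise ValueError("alignment rows must have equal length")
    return sum(1 for a, b in zip(v_row, w_row) if a == b and a != gap)


score = simple_score("AT_GTTAT_", "ATCGT_A_C")  # 5
```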

The Longest Common Subsequence (LCS) Problem
Find the longest subsequence common to two strings.
Input: Two strings, v and w.
Output: The longest common subsequence of v and w.
A subsequence is not necessarily consecutive. For example, with v = ATGTTAT and w = ATCGTAC:
  v = AT_GTTAT_
      || || |
  w = ATCGT_A_C
The longest common subsequence is "ATGTA".
The longest common subsequence corresponds to the best alignment.

How to solve the LCS problem efficiently?

Brute Force Approach Enumerate all sequences of length up to min(|v|,|w|). For each one, check whether it is a subsequence of both v and w. Very expensive. (How many sequences do we have to enumerate?)
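A rough sketch of the brute-force idea, to show why it is expensive. It assumes a DNA alphabet "ACGT" and, unlike the slide, tries the longest candidates first so it can stop at the first hit; the function names are hypothetical.

```python
from itertools import product


def is_subsequence(candidate, text):
    """True if candidate's symbols appear in text in order (not necessarily adjacent)."""
    remaining = iter(text)
    # "ch in remaining" advances the iterator, so order is respected.
    return all(ch in remaining for ch in candidate)


def brute_force_lcs(v, w, alphabet="ACGT"):
    """Try every string over the alphabet, longest first, until one fits both."""
    for length in range(min(len(v), len(w)), 0, -1):
        for candidate in product(alphabet, repeat=length):
            if is_subsequence(candidate, v) and is_subsequence(candidate, w):
                return "".join(candidate)
    return ""


def count_candidates(max_len, alphabet_size=4):
    """Number of strings of length 0..max_len over the alphabet."""
    return sum(alphabet_size ** k for k in range(max_len + 1))


lcs = brute_force_lcs("ATGTTAT", "ATCGTAC")  # "ATGTA"
total = count_candidates(7)                   # 21845 candidates for length <= 7
```

The candidate count grows as 4^k, so even moderately long sequences are out of reach for this approach.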

The Idea of Dynamic Programming Think of an alignment as a path in an edit graph We only need to keep track of the best alignment (i.e., the longest common subsequence) Score a longer alignment based on shorter alignments

Alignment as a Path in the Edit Graph v= AT_GTTAT_ w= ATCGT_A_C (0,0), (1,1), (2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6), (7,7) Use each cell to store the best alignment so far…

Alignment: Dynamic Programming
Use this scoring recurrence:
\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \\
s_{i-1,j} \\
s_{i,j-1}
\end{cases}
\]

Dynamic Programming Example There are no matches before the beginning of either sequence, so set the first column (s_{i,0}) and the first row (s_{0,j}) of the table to zero.

Dynamic Programming Example
\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{value from NW, plus 1, if } v_i = w_j \\
s_{i-1,j} & \text{value from North (top)} \\
s_{i,j-1} & \text{value from West (left)}
\end{cases}
\]
Keep track of the best alignment score and the path contributing to it.

Alignment: Backtracking Arrows show where each score originated: ↑ if from the top, ← if from the left, ↖ (diagonal) if v_i = w_j.

Dynamic Programming Example Continuing with the scoring algorithm gives this result.

LCS Algorithm
LCS(v, w)
  for i ← 0 to n
    s_{i,0} ← 0
  for j ← 1 to m
    s_{0,j} ← 0
  for i ← 1 to n
    for j ← 1 to m
      s_{i,j} ← max { s_{i-1,j},  s_{i,j-1},  s_{i-1,j-1} + 1 if v_i = w_j }
      b_{i,j} ← "↑" if s_{i,j} = s_{i-1,j}
                "←" if s_{i,j} = s_{i,j-1}
                "↖" if s_{i,j} = s_{i-1,j-1} + 1
  return (s_{n,m}, b)
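The pseudocode translates fairly directly into Python. This is one possible sketch, not the lecture's own code: it uses 0-based strings, so v_i becomes v[i-1], and stores the arrows as the strings "up", "left" and "diag" (a naming choice made here for illustration).

```python
def lcs_tables(v, w):
    """Fill the score table s and the backtracking table b for LCS."""
    n, m = len(v), len(w)
    # Row 0 and column 0 stay at zero: nothing is matched yet.
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best, move = s[i - 1][j], "up"
            if s[i][j - 1] > best:
                best, move = s[i][j - 1], "left"
            # The diagonal edge is only available when the symbols match.
            if v[i - 1] == w[j - 1] and s[i - 1][j - 1] + 1 > best:
                best, move = s[i - 1][j - 1] + 1, "diag"
            s[i][j], b[i][j] = best, move
    return s, b


s, b = lcs_tables("ATGTTAT", "ATCGTAC")
lcs_length = s[-1][-1]  # 5
```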

Now What? LCS(v,w) created the alignment grid Now we need a way to read the best alignment of v and w Follow the arrows backwards from the (|v|,|w|) cell
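Following the arrows backwards can be sketched as below. The block is self-contained, so it refills the tables before tracing back; it assumes "_" as the space symbol and, as a tie-breaking choice made here, always takes the diagonal when v_i = w_j. The function name lcs_align is hypothetical.

```python
def lcs_align(v, w, gap="_"):
    """Return (v_row, w_row, lcs) by filling the LCS tables and tracing back."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    b = [[""] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if v[i - 1] == w[j - 1]:
                s[i][j], b[i][j] = s[i - 1][j - 1] + 1, "diag"
            elif s[i - 1][j] >= s[i][j - 1]:
                s[i][j], b[i][j] = s[i - 1][j], "up"
            else:
                s[i][j], b[i][j] = s[i][j - 1], "left"

    # Walk from the (|v|, |w|) cell back to (0, 0), collecting columns in reverse.
    top, bottom, common = [], [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and b[i][j] == "diag":
            top.append(v[i - 1])
            bottom.append(w[j - 1])
            common.append(v[i - 1])
            i, j = i - 1, j - 1
        elif i > 0 and (j == 0 or b[i][j] == "up"):
            top.append(v[i - 1])
            bottom.append(gap)
            i -= 1
        else:
            top.append(gap)
            bottom.append(w[j - 1])
            j -= 1
    return ("".join(reversed(top)),
            "".join(reversed(bottom)),
            "".join(reversed(common)))


v_row, w_row, lcs = lcs_align("ATGTTAT", "ATCGTAC")
# lcs == "ATGTA"; v_row and w_row are one optimal alignment
```

Several alignments can share the best score, so the rows returned depend on how ties are broken; the LCS length does not.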

LCS Runtime Creating the n × m matrix of best scores from vertex (0,0) to all other vertices takes O(nm) time. Why O(nm)? The pseudocode nests one "for" loop (over j) inside another "for" loop (over i), and each of the nm cells is filled in constant time.

How do we improve the scoring of alignments? Can we still find an alignment efficiently? We’ll talk about these later…

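The LCS Recurrence Revisited
The formula can be rewritten by adding zero to the edges that come from an indel, since the penalty for an indel is 0:
\[
s_{i,j} = \max
\begin{cases}
s_{i-1,j-1} + 1 & \text{if } v_i = w_j \quad \text{(matching score)} \\
s_{i-1,j} + 0 & \text{(insertion/deletion score)} \\
s_{i,j-1} + 0 & \text{(insertion/deletion score)}
\end{cases}
\]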
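Writing the indel terms with an explicit "+ 0" suggests treating the scores as parameters. The sketch below does that; the parameter names match and indel are assumptions made for illustration, and with the defaults (1 and 0) it reduces to the LCS recurrence above. As in the LCS model, mismatched symbols are never placed in the same column.

```python
def alignment_score(v, w, match=1, indel=0):
    """Best score under the rewritten recurrence with explicit match/indel scores."""
    n, m = len(v), len(w)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # Border cells are reachable only through indel edges.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + indel
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(s[i - 1][j] + indel, s[i][j - 1] + indel)
            if v[i - 1] == w[j - 1]:
                best = max(best, s[i - 1][j - 1] + match)
            s[i][j] = best
    return s[n][m]


lcs_length = alignment_score("ATGTTAT", "ATCGTAC")  # 5, same as LCS
```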

What You Should Know
 - How an alignment corresponds to a path in an edit graph
 - How the LCS problem corresponds to alignment with a simple scoring method
 - How the dynamic programming algorithm solves the LCS problem (= simple alignment)