CS 5263 Bioinformatics Lecture 4: Global Sequence Alignment Algorithms.


Roadmap
– Review of last lecture
– More global sequence alignment algorithms

Given a scoring scheme,
– Match: m
– Mismatch: -s
– Gap: -d
we can easily compute an optimal alignment by dynamic programming.

In a completed alignment between a pair of sequences X = x_1 x_2 … x_M and Y = y_1 y_2 … y_N, any column of the alignment falls into one of only three cases:
– x_i is aligned to y_j
– x_i is aligned to a gap
– y_j is aligned to a gap

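Since the alignment score F(M, N) is a sum over all aligned columns, it can be broken down to:

\[
F(M, N) = \max \begin{cases}
F(M-1, N-1) + \sigma(x_M, y_N) \\
F(M-1, N) - d \\
F(M, N-1) - d
\end{cases}
\]

where \sigma(x_M, y_N) is m for a match and -s for a mismatch.

And recursively:

\[
F(i, j) = \max \begin{cases}
F(i-1, j-1) + \sigma(x_i, y_j) \\
F(i-1, j) - d \\
F(i, j-1) - d
\end{cases}
\]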
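The recurrence can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `nw_matrix`, the default scores (m = s = d = 1) and the boundary values F(i, 0) = -i·d, F(0, j) = -j·d are assumptions chosen to match the usual global-alignment setup.

```python
def nw_matrix(x, y, m=1, s=1, d=1):
    """Fill the global-alignment score matrix F (rows follow x, columns follow y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    # Boundary: a prefix aligned entirely against gaps.
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    # Each cell takes the best of the three possible last columns.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma,  # x_i aligned to y_j
                          F[i - 1][j] - d,          # x_i aligned to a gap
                          F[i][j - 1] - d)          # y_j aligned to a gap
    return F


if __name__ == "__main__":
    for row in nw_matrix("ATA", "AGTA"):
        print(row)
```

The bottom-right entry `F[M][N]` is the optimal global alignment score.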

[Figure: the dynamic programming matrix, from F(0,0) to F(M,N)]

Trace-back
[Figure: the filled matrix F(i, j) for the sequences AGTA and ATA, with the trace-back path that yields the alignment below]
AGTA
A-TA
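A trace-back can be sketched as follows. This is an illustrative version under assumed scores (match = 1, mismatch = gap = -1); the name `global_align` and the tie-breaking order (diagonal first, then a gap in y) are choices made for this sketch, so other equally optimal alignments may exist.

```python
def global_align(x, y, m=1, s=1, d=1):
    """Return (score, aligned_x, aligned_y) for an optimal global alignment."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0] = -i * d
    for j in range(1, N + 1):
        F[0][j] = -j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[i][j] = max(F[i - 1][j - 1] + sigma, F[i - 1][j] - d, F[i][j - 1] - d)

    # Walk back from (M, N) to (0, 0), re-deriving which move produced each cell.
    top, bottom = [], []
    i, j = M, N
    while i > 0 or j > 0:
        sigma = m if (i > 0 and j > 0 and x[i - 1] == y[j - 1]) else -s
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + sigma:
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    print(global_align("AGTA", "ATA"))
```

For AGTA and ATA this recovers the alignment AGTA / A-TA with score 2.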

Graph representation
[Figure: grid graph from (0,0) to (3,4) for the sequences AGTA and ATA]
– Edges along one axis: a gap in the 2nd sequence
– Edges along the other axis: a gap in the 1st sequence
– Diagonal edges: match / mismatch
– Values on vertical/horizontal edges: -d
– Values on diagonal edges: m or -s
Number of steps: length of the alignment
Path length: alignment score
Alignment: find the longest path from (0, 0) to (3, 4)
The general longest-path problem cannot be solved by DP. The longest path on this graph can be found by DP because no cycle is possible.

Question
If we change the scoring scheme, will the optimal alignment change?
– Original: match = 1, mismatch = gap = -1
– New: match = 2, mismatch = gap = 0
– New: match = 2, mismatch = gap = -2?

Number of alignments
Is equal to the number of distinct paths from (0, 0) to (m, n)
Example: A against BC has five alignments (top row / bottom row): A-/BC, A--/-BC, --A/BC-, -A-/B-C, -A/BC

How to count?
– Homework assignment
– Hint: dynamic programming
– Or analytically
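One way to read the dynamic-programming hint, offered as a sketch rather than as the intended homework solution: every path into (i, j) arrives from the diagonal, from above, or from the left, so the path counts obey the same three-way recurrence as the scores. The function name `count_alignments` is an assumption.

```python
def count_alignments(m, n):
    """Count distinct paths from (0, 0) to (m, n) using diagonal, down and right steps."""
    # Along either border there is exactly one path (all gaps).
    C = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            C[i][j] = C[i - 1][j - 1] + C[i - 1][j] + C[i][j - 1]
    return C[m][n]


if __name__ == "__main__":
    print(count_alignments(1, 2))  # the A vs. BC example
```

For sequences of length 1 and 2 this gives 5, matching the five alignments of A with BC.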

However
Biologically meaningful “distinct” alignments may be far fewer.
– A--/-BC, --A/BC- and -A-/B-C: all three may be considered equivalent
– In each of them A, B, and C are all aligned to gaps

Number of alignments
We only care about who is aligned to whom, not the gaps
For two sequences of length m, n, there may be k matches, k = 0 to min(m, n)
Number of alignments: [formula not preserved in the transcript]
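The formula itself did not survive in this transcript. One standard count, stated here as an assumption rather than as the slide's own formula: with k aligned pairs, choose which k letters of each sequence are paired (the pairing order is then forced), giving C(m, k)·C(n, k) choices, and sum over k. By Vandermonde's identity the sum equals C(m+n, m). The function name is hypothetical.

```python
from math import comb


def count_pairings(m, n):
    """Count alignments that differ in which letters are paired, ignoring gap order."""
    return sum(comb(m, k) * comb(n, k) for k in range(min(m, n) + 1))


if __name__ == "__main__":
    print(count_pairings(1, 2), comb(1 + 2, 1))
```

For A against BC this gives 3: A paired with B, A paired with C, or nothing paired.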

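Furthermore
[Figure: the alignments A-/BC and A--/-BC]
Alternating gaps are discouraged / prohibited.
With most scoring schemes, alternating gaps will never appear in an optimal alignment (as long as 2d > s): a gap in one sequence followed immediately by a gap in the other scores -d - d = -2d, while the single match/mismatch column that replaces them scores m or -s.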

[Figure: the five alignments of A with BC]
Special trick? No. With most scoring schemes this is achieved automatically, as long as 2d > s.

Number of alignments
Homework assignment
Dynamic programming
– Multiple matrices
– Three states:
  – Came from diagonal: can go in any of the three directions
  – Came from left: cannot go down
  – Came from above: cannot turn right
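The three-state idea can be sketched with one count matrix per state. This is an illustrative reading of the slide, not a reference solution: `D`, `H` and `V` (paths whose last step was diagonal, from the left, or from above) and the convention that the empty path at (0, 0) counts as a diagonal state are assumptions.

```python
def count_no_alternating_gaps(m, n):
    """Count paths (0,0) -> (m,n) in which a gap in one sequence never directly follows a gap in the other."""
    D = [[0] * (n + 1) for _ in range(m + 1)]  # last step was diagonal
    H = [[0] * (n + 1) for _ in range(m + 1)]  # last step came from the left
    V = [[0] * (n + 1) for _ in range(m + 1)]  # last step came from above
    D[0][0] = 1  # the empty path may continue in any direction
    for i in range(m + 1):
        for j in range(n + 1):
            if i > 0 and j > 0:
                # A diagonal step may follow any state.
                D[i][j] = D[i - 1][j - 1] + H[i - 1][j - 1] + V[i - 1][j - 1]
            if j > 0:
                # A step from the left may not follow a step from above.
                H[i][j] = D[i][j - 1] + H[i][j - 1]
            if i > 0:
                # A step from above may not follow a step from the left.
                V[i][j] = D[i - 1][j] + V[i - 1][j]
    return D[m][n] + H[m][n] + V[m][n]


if __name__ == "__main__":
    print(count_no_alternating_gaps(1, 2))
```

For A against BC only A-/BC and -A/BC remain, so the count drops from 5 to 2.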

Given two sequences of length M, N
Time: O(MN)
– ok
Space: O(MN)
– bad
– 1 Mb seq x 1 Mb seq = 10^12 matrix cells, about 1000 GB of memory even at one byte per cell
Can we do better?

In biology, this kind of alignment is unlikely to be meaningful:
[Figure: an alignment of abcde with vwxyz]

Good alignment should appear near the diagonal

Bounded Dynamic Programming
If we know that x and y are very similar
Assumption: # gaps(x, y) < k
Then x_i aligned to y_j implies | i - j | < k

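Bounded Dynamic Programming
Initialization: F(i, 0), F(0, j) undefined for i, j > k
Iteration:
For i = 1…M
  For j = max(1, i - k)…min(N, i + k)

\[
F(i, j) = \max \begin{cases}
F(i-1, j-1) + \sigma(x_i, y_j) \\
F(i, j-1) - d, & \text{if } j > i - k \\
F(i-1, j) - d, & \text{if } j < i + k
\end{cases}
\]

(\sigma(x_i, y_j) is m for a match and -s for a mismatch)
Termination: same
[Figure: the band of width k around the diagonal of the matrix, x_1 … x_M against y_1 … y_N]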
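A minimal sketch of the banded fill, under a few assumptions: cells outside the band are treated as minus infinity (which plays the role of the two "if" conditions), the band is stored in a dictionary so that only O(kM) cells exist, and the function name `banded_score` and the default scores are illustrative.

```python
NEG_INF = float("-inf")


def banded_score(x, y, k, m=1, s=1, d=1):
    """Best global alignment score using only cells with |i - j| <= k."""
    M, N = len(x), len(y)
    if abs(M - N) > k:
        raise ValueError("band too narrow: cell (M, N) lies outside it")
    F = {(0, 0): 0}
    # Borders are defined only inside the band.
    for j in range(1, min(N, k) + 1):
        F[(0, j)] = -j * d
    for i in range(1, min(M, k) + 1):
        F[(i, 0)] = -i * d
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            # Missing neighbours are outside the band and count as -infinity.
            F[(i, j)] = max(F.get((i - 1, j - 1), NEG_INF) + sigma,
                            F.get((i, j - 1), NEG_INF) - d,
                            F.get((i - 1, j), NEG_INF) - d)
    return F[(M, N)]


if __name__ == "__main__":
    print(banded_score("AGTA", "ATA", k=1))
```

The dictionary keeps the sketch short; an array of width 2k+1 indexed by j - i would be the more economical layout.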

Analysis
Time: O(kM) << O(MN)
Space: O(kM) with some tricks
[Figure: the diagonal band of width 2k and length M, stored as a rectangular array]

What if we don’t know k?
Iterate:
– For k = 2, 4, 8, 16, …
– For each k, we can compute an optimal bounded alignment with score S_k
– Stop when ((min(N, M) - k) * m - 2kd) < S_k, since we will not be able to get a higher score with a larger k
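The doubling loop can be sketched as below. The stopping test is taken from the slide as given; everything else is an assumption of this sketch: the compact banded scorer `banded_score`, starting at the first power of two that is at least |M - N| (so that cell (M, N) is inside the band), and the extra exit once the band covers the whole matrix.

```python
NEG_INF = float("-inf")


def banded_score(x, y, k, m, s, d):
    """Best score restricted to cells with |i - j| <= k (requires |M - N| <= k)."""
    M, N = len(x), len(y)
    F = {(0, 0): 0}
    for j in range(1, min(N, k) + 1):
        F[(0, j)] = -j * d
    for i in range(1, min(M, k) + 1):
        F[(i, 0)] = -i * d
    for i in range(1, M + 1):
        for j in range(max(1, i - k), min(N, i + k) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            F[(i, j)] = max(F.get((i - 1, j - 1), NEG_INF) + sigma,
                            F.get((i, j - 1), NEG_INF) - d,
                            F.get((i - 1, j), NEG_INF) - d)
    return F[(M, N)]


def score_with_unknown_k(x, y, m=1, s=1, d=1):
    """Double k until the slide's bound shows a wider band cannot score higher."""
    M, N = len(x), len(y)
    k = 2
    while k < abs(M - N):          # the band must contain cell (M, N)
        k *= 2
    while True:
        S_k = banded_score(x, y, k, m, s, d)
        bound = (min(M, N) - k) * m - 2 * k * d   # stopping test from the slide
        if bound < S_k or k >= max(M, N):         # or the band already covers everything
            return S_k, k
        k *= 2


if __name__ == "__main__":
    print(score_with_unknown_k("AGTA", "ATA"))
```

Because k doubles, the total work is dominated by the last band tried.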

Given two sequences of length M, N
Time: O(MN)
– ok
Space: O(MN)
– bad
– 1 Mb seq x 1 Mb seq = 10^12 matrix cells, about 1000 GB of memory even at one byte per cell
Can we do better?

Linear space algorithm
If all we need is the alignment score but not the alignment, easy!
We only need to keep two rows (if you are crafty enough, you only need one row)
But how do we get the alignment?
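The two-row idea can be sketched as follows, with assumed default scores (m = s = d = 1) and a hypothetical function name. Row i depends only on row i-1, so the rest of the matrix never needs to be stored.

```python
def nw_score_two_rows(x, y, m=1, s=1, d=1):
    """Optimal global alignment score using O(len(y)) memory."""
    prev = [-j * d for j in range(len(y) + 1)]      # row 0
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)              # column 0 of row i
        for j in range(1, len(y) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sigma,      # diagonal
                          prev[j] - d,              # from above
                          curr[j - 1] - d)          # from the left
        prev = curr                                 # discard the older row
    return prev[-1]


if __name__ == "__main__":
    print(nw_score_two_rows("AGTA", "ATA"))
```

The score is recovered, but the pointers needed for a trace-back are gone, which is the problem the following slides address.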

Linear space algorithm
When we finish, we know how we have aligned the ends of the sequences
Naïve idea: repeat on the smaller subproblem, e.g. F(M-1, N-1)
Time complexity: O((M+N)(MN))
[Figure: the last column of the alignment, involving x_M and y_N]

Hirschberg’s idea
Divide and conquer!
Forward algorithm: align x_1 x_2 … x_{M/2} with Y
F(M/2, k) represents the best alignment between x_1 x_2 … x_{M/2} and y_1 y_2 … y_k

Backward algorithm
Align reverse(x_{M/2+1} x_{M/2+2} … x_M) with reverse(Y)
B(M/2, k) represents the best alignment between reverse(x_{M/2+1} x_{M/2+2} … x_M) and reverse(y_{k+1} y_{k+2} … y_N)

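Lemma
F(M/2, k) + B(M/2, k) is the score of the best alignment under the constraint that the alignment path passes through (M/2, k)

\[
F(M, N) = \max_{k = 0 \ldots N} \bigl( F(M/2, k) + B(M/2, k) \bigr)
\]

[Figure: a path through (M/2, k*); F(M/2, k) covers the part before that point and B(M/2, k) the part after it]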

The longest path from (0, 0) to (6, 6) is max_k ( LP(0, 0, 3, k) + LP(3, k, 6, 6) )
[Figure: grid from (0,0) to (6,6) with candidate midpoints (3,0), (3,2), (3,4), (3,6)]

Linear-space alignment
Now, using 2 rows of space, we can compute F(M/2, k) and B(M/2, k) for k = 0…N

Linear-space alignment
Now, we can find k* maximizing F(M/2, k) + B(M/2, k)
Also, we can trace the path leaving row M/2 from k*
Conclusion: in O(NM) time and O(N) space, we have found where the optimal alignment path crosses row M/2
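Finding k* can be sketched with two linear-space passes. The helper `last_row`, the function `best_split` and the default scores are assumptions of this sketch; the backward scores are obtained by running the same forward pass on the reversed second half of x and reversed y, so that B(M/2, k) is entry N - k of that row.

```python
def last_row(x, y, m=1, s=1, d=1):
    """Last row of the score matrix for x against every prefix of y, in O(len(y)) space."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sigma, prev[j] - d, curr[j - 1] - d)
        prev = curr
    return prev


def best_split(x, y, m=1, s=1, d=1):
    """Return (k_star, score): where an optimal path crosses row len(x) // 2."""
    mid, N = len(x) // 2, len(y)
    fwd = last_row(x[:mid], y, m, s, d)                # F(M/2, k) for k = 0..N
    bwd = last_row(x[mid:][::-1], y[::-1], m, s, d)    # B(M/2, k) is bwd[N - k]
    totals = [fwd[k] + bwd[N - k] for k in range(N + 1)]
    k_star = max(range(N + 1), key=totals.__getitem__)
    return k_star, totals[k_star]


if __name__ == "__main__":
    print(best_split("AGTA", "ATA"))
```

For AGTA and ATA the path crosses row 2 at k* = 1, and the combined score equals the optimal score 2.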

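Linear-space alignment
Apply this procedure recursively to the two sub-problems!
[Figure: the matrix split at (M/2, k*) into a sub-problem of size k* x M/2 and one of size (N - k*) x M/2]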
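Putting the pieces together, the recursion can be sketched as below. This is an illustrative version, not the lecture's code: the helpers `last_row` and `align_single`, the base cases (an empty sequence, or a single letter of x) and the default scores are assumptions, and ties are broken by taking the smallest k*.

```python
def last_row(x, y, m, s, d):
    """Last row of the score matrix for x against every prefix of y."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [-i * d] + [0] * len(y)
        for j in range(1, len(y) + 1):
            sigma = m if x[i - 1] == y[j - 1] else -s
            curr[j] = max(prev[j - 1] + sigma, prev[j] - d, curr[j - 1] - d)
        prev = curr
    return prev


def align_single(c, y, m, s, d):
    """Base case: align the single letter c against the non-empty string y."""
    N = len(y)
    j = max(range(N), key=lambda t: m if y[t] == c else -s)  # best partner for c
    sigma = m if y[j] == c else -s
    if sigma - (N - 1) * d >= -(N + 1) * d:                  # pairing beats all-gaps
        return "-" * j + c + "-" * (N - 1 - j), y
    return c + "-" * N, "-" + y


def hirschberg(x, y, m=1, s=1, d=1):
    """Optimal global alignment of x and y in linear space (divide and conquer)."""
    if not x:
        return "-" * len(y), y
    if not y:
        return x, "-" * len(x)
    if len(x) == 1:
        return align_single(x, y, m, s, d)
    mid, N = len(x) // 2, len(y)
    fwd = last_row(x[:mid], y, m, s, d)
    bwd = last_row(x[mid:][::-1], y[::-1], m, s, d)
    k = max(range(N + 1), key=lambda t: fwd[t] + bwd[N - t])  # crossing point k*
    left_x, left_y = hirschberg(x[:mid], y[:k], m, s, d)
    right_x, right_y = hirschberg(x[mid:], y[k:], m, s, d)
    return left_x + right_x, left_y + right_y


if __name__ == "__main__":
    print(hirschberg("AGTA", "ATA"))
```

String slicing is used for readability; an implementation that passes index ranges instead would avoid copying the sequences at each level.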

Analysis
Memory: O(N) for computation, O(N+M) to store the optimal alignment
Time:
– MN for the first iteration
– k M/2 + (N-k) M/2 = MN/2 for the second
– …

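[Figure: sub-problems of total area MN, MN/2, MN/4, MN/8, …]

\[
MN + \frac{MN}{2} + \frac{MN}{4} + \frac{MN}{8} + \cdots = MN \left( 1 + \tfrac{1}{2} + \tfrac{1}{4} + \tfrac{1}{8} + \tfrac{1}{16} + \cdots \right) = 2MN = O(MN)
\]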