Introduction to Sequence Alignment PENCE Bioinformatics Research Group University of Alberta May 2001

©Duane Szafron Outline
–Sequence Alignment
–Full Matrix Algorithms
–Hirschberg’s Algorithm
–The FastLSA Algorithm
–Leading and Trailing Gaps

©Duane Szafron Sequence Alignment Sequence alignment reduces to the problem of matching two strings by introducing gaps to maximize a scoring function. The scoring function favors similar characters in the same position, penalizes dissimilar characters, and penalizes gaps.
AGTATGCA
ATTGATA

AGT-ATGCA
ATTGAT--A

©Duane Szafron Scoring Function There are many different scoring functions. Here is a simple one, suitable for illustration but not actually used in practice:
–Exact match: +2 points
–Different characters: -1 point
–Gap: -2 points
AGT-ATGCA
ATTGAT--A   Score = 3
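As a quick illustration (not part of the original slides), the scoring rule above can be applied to an already-gapped pair of strings. A minimal Python sketch, assuming the +2 / -1 / -2 scheme:

    def score_alignment(a, b, match=2, mismatch=-1, gap=-2):
        """Score two equal-length, already-gapped strings column by column."""
        assert len(a) == len(b)
        total = 0
        for x, y in zip(a, b):
            if x == '-' or y == '-':
                total += gap          # a gap in either sequence
            elif x == y:
                total += match        # exact match
            else:
                total += mismatch     # different characters
        return total

    print(score_alignment("AGT-ATGCA", "ATTGAT--A"))  # 3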

©Duane Szafron Scoring Ties There can be several optimal alignment solutions due to scoring ties. There are actually three optimal solutions in our example alignment:
AGT-ATGCA
ATTGAT--A   Score = 3

AGTATG-CA
A-T-TGATA   Score = 3

AGTATGC-A
A-T-TGATA   Score = 3

©Duane Szafron Alignment Algorithms The goal is to find an optimal alignment for a given scoring function as quickly as possible, using a minimum amount of storage. We will look at three different kinds of algorithms:
–Full Matrix algorithms like Needleman-Wunsch and Smith-Waterman
–The Hirschberg Algorithm
–Fast linear space alignment (FastLSA)

©Duane Szafron Matrix Representation A matrix is used to represent all possible alignments for a pair of sequences. There is a sequence along each axis. Each path from the top left corner to the bottom right corner represents an alignment solution.

©Duane Szafron Alignments as Matrix Paths
[Matrix figure: AGTATGCA runs along the top and ATTGATA down the side; the highlighted path corresponds to the alignment AGT-ATGCA / ATTGAT--A.]

©Duane Szafron Other Alignment Matrix Paths
[Matrix figure: a second optimal path, corresponding to the alignment AGTATG-CA / A-T-TGATA.]

©Duane Szafron Other Alignment Matrix Paths
[Matrix figure: the second and third optimal paths; the third corresponds to the alignment AGTATGC-A / A-T-TGATA.]

©Duane Szafron Matrix Alignment Algorithms A matrix algorithm uses a dynamic programming matrix to find an optimal solution. There are two phases to the algorithm:
–FindScore
–FindPath

©Duane Szafron FindScore Description The FindScore phase applies the scoring matrix to all paths from the upper left to the lower right. Values are propagated left-to-right and top-to-bottom. At the end, the lower right corner holds the optimal score.

©Duane Szafron FindScore Example
[Figure: the dynamic programming matrix for ATTGATA against AGTATGCA; the first row and column are filled with multiples of the gap penalty (0, -2, -4, ...) and the remaining cells are filled left-to-right, top-to-bottom.]
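A minimal sketch of the FindScore fill (a Python illustration, not the presenters' code), assuming the +2 / -1 / -2 scheme and a simple per-character gap penalty:

    def find_score(s, t, match=2, mismatch=-1, gap=-2):
        """Fill the dynamic programming matrix; the lower right cell holds the optimal score."""
        n, m = len(s), len(t)
        M = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            M[i][0] = i * gap                          # first column: s aligned against gaps
        for j in range(1, m + 1):
            M[0][j] = j * gap                          # first row: t aligned against gaps
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = match if s[i - 1] == t[j - 1] else mismatch
                M[i][j] = max(M[i - 1][j - 1] + sub,   # diagonal: align s[i-1] with t[j-1]
                              M[i - 1][j] + gap,       # up: gap in t
                              M[i][j - 1] + gap)       # left: gap in s
        return M

    M = find_score("ATTGATA", "AGTATGCA")
    print(M[-1][-1])  # 3, the optimal score for the example sequences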

©Duane Szafron FindPath Description The FindPath phase starts in the lower right corner. At each box, a direction is picked (up, left, or diagonal) based on the highest score that entered the box from those three directions. If two (or all three) directions have equal scores, each of them leads to an optimal path.

©Duane Szafron FindPath Example
[Figure: the same matrix with the traceback path from the lower right corner highlighted.]
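Continuing the sketch above (again an illustration, not the presenters' code), the FindPath phase can be written as a traceback over the filled matrix M; ties are broken here by preferring the diagonal, so only one of the optimal alignments is returned:

    def find_path(M, s, t, match=2, mismatch=-1, gap=-2):
        """Trace back from the lower right corner, recovering one optimal alignment."""
        i, j = len(s), len(t)
        a, b = [], []
        while i > 0 or j > 0:
            if i > 0 and j > 0 and \
               M[i][j] == M[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch):
                a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1   # diagonal
            elif i > 0 and M[i][j] == M[i - 1][j] + gap:
                a.append(s[i - 1]); b.append('-'); i -= 1                # up: gap in t
            else:
                a.append('-'); b.append(t[j - 1]); j -= 1                # left: gap in s
        return ''.join(reversed(a)), ''.join(reversed(b))

    print(find_path(M, "ATTGATA", "AGTATGCA"))  # ('A-T-TGATA', 'AGTATG-CA')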

©Duane Szafron Cost of Full Matrix Algorithms A full matrix algorithm maintains the entire matrix in memory during both phases (FindScore and FindPath) of the algorithm. For sequences of length n and m, this takes n×m entries in memory. FindScore takes n×m operations (time). FindPath takes m+n operations (time). If we want to align two sequences of length 10,000, the storage space is prohibitive (100,000,000 entries).

©Duane Szafron The Hirschberg Algorithm - 1 The Hirschberg algorithm is designed to take less space, but find the same optimal solutions. It splits one sequence into two and performs the FindScore algorithm on each half, working backwards on the second half. It does not store all of the results in memory, just the current row of each half matrix (2×n entries instead of m×n entries).

©Duane Szafron Hirschberg’s Algorithm At the end of the two FindScore computations, the final rows of each half matrix are used to find the optimal “crossing-point” of the two “half-alignments”. The complete algorithm is then called again on the two pairs of half sequences. This recursion continues until the lengths of the sequences being aligned are 1.
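A sketch of how the crossing-point can be found in linear space (a Python illustration of the idea, not the presenters' code): score the first half forwards and the second half backwards, keeping only one row at a time, then pick the column where the two final rows add up to the largest total.

    def last_row_scores(s, t, match=2, mismatch=-1, gap=-2):
        """Linear-space FindScore: return only the final row of the matrix."""
        prev = [j * gap for j in range(len(t) + 1)]
        for i in range(1, len(s) + 1):
            cur = [i * gap] + [0] * len(t)
            for j in range(1, len(t) + 1):
                sub = match if s[i - 1] == t[j - 1] else mismatch
                cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            prev = cur
        return prev

    def crossing_point(s, t):
        """Find the column where an optimal path crosses the middle row of the matrix."""
        mid = len(s) // 2
        forward = last_row_scores(s[:mid], t)                  # first half, forwards
        backward = last_row_scores(s[mid:][::-1], t[::-1])     # second half, backwards
        # the best split k maximizes forward[k] + backward[len(t) - k]
        return max(range(len(t) + 1), key=lambda k: forward[k] + backward[len(t) - k])

The recursion would then align s[:mid] against t[:k] and s[mid:] against t[k:], stopping when the pieces have length 1.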

©Duane Szafron Hirschberg FindScore Example
[Figure: FindScore run forward on the first half of the matrix for ATTGATA against AGTATGCA, keeping only the current row.]

©Duane Szafron Hirschberg FindScore Example
[Figure: FindScore run backwards on the second half; the two final rows meet in the middle and are combined to find the crossing-point.]

©Duane Szafron Hirschberg Example Sub-problems There are two optimal splits of the sequences, colored pink and blue. However, the blue split generates two different optimal solutions, blue and white.
[Figure: the sub-problem pairs produced by each split; one split pairs AGT with ATT and ATGCA with GATA, the other pairs AGTAT with ATT and GCA with GATA, yielding the three optimal alignments shown earlier.]

©Duane Szafron Hirschberg Recursion

©Duane Szafron Hirschberg’s Algorithm Hirschberg’s algorithm takes only linear space (2×n) instead of quadratic space (m×n). This means that aligning two sequences of length 10,000 would only require 20,000 entries instead of 100,000,000 entries. The disadvantage of this algorithm is that the time goes from m×n operations to about 2×m×n operations, since many matrix computations must be redone.

©Duane Szafron FastLSA Idea FastLSA improves on Hirschberg’s algorithm by reducing the number of re-computations that need to be done. This makes the algorithm faster. There are three improvements that reduce computation:
–Sequences are split on both axes, not just one.
–Sequences are not just bisected; they are cut into several smaller pieces.
–Scores on the splitting lines are maintained.

©Duane Szafron FastLSA - Algorithm The sequences are split on both axes. FindScore is called on a region consisting of 3 quadrants (excluding the lower right). Scores are kept only on the bisecting lines. FastLSA is called recursively on the lower right quadrant, and the optimal path through this quadrant is eventually returned. Recursive calls are then made on part of 1 or 2 of the other 3 quadrants, depending on the path returned from the lower right quadrant.

©Duane Szafron FastLSA - Stopping the Recursion When a block has size u×v smaller than some threshold B, stop the recursion and apply a full matrix algorithm to solve the block.

©Duane Szafron FastLSA - Using Bisection
FastLSA(DPM, rs, re, cs, ce)
  if ((re - rs) * (ce - cs) < B)
    FullMatrix(DPM, rs, re, cs, ce); return;
  rm = (rs + re) / 2; cm = (cs + ce) / 2;
  FindScores(DPM, rs, rm, re, cs, cm, ce);
  FastLSA(DPM, rm, re, cm, ce);            // lower right quadrant
  if (direction == diagonal)
    FastLSA(DPM, rs, rm, cs, cm);          // upper left quadrant
  else if (direction == side)
    re = path.end.row;
    FastLSA(DPM, rm, re, cs, cm);          // lower left quadrant
    if (direction == up)
      ce = path.end.column;
      FastLSA(DPM, rs, rm, cs, ce);
  else // direction == up
    ce = path.end.column;
    FastLSA(DPM, rs, rm, cm, ce);          // upper right quadrant
    if (direction == side)
      re = path.end.row;
      FastLSA(DPM, rs, re, cs, cm);

©Duane Szafron FastLSA - cuts (k) = 4

©Duane Szafron Using FastLSA If you don’t have enough memory to run a full-matrix algorithm, use FastLSA and pick your k-value based on your available memory. It will run faster than Hirschberg’s algorithm.

©Duane Szafron Aligning Sub-sequences Sometimes you are trying to align a sub-sequence with a large sequence. In this case there should be many leading and trailing gaps.
AGATCTGATCGTAAGTCATTCGCATAATGCGT
GTACGTC
Score = 25*(-2) + 1*(-1) + 6*2 = -39
AGATCTGATCGTAAGTCATTCGCATAATGCGT
GTA---C----G--T----C--...
Score = 25*(-2) + 7*2 = -36

©Duane Szafron Leading and Trailing Gaps To score this properly, we assign zero penalties to leading and trailing gaps.
AGATCTGATCGTAAGTCATTCGCATAATGCGT
GTACGTC
Score = 25*(0) + 1*(-1) + 6*2 = 11
AGATCTGATCGTAAGTCATTCGCATAATGCGT
GTA---C----G--T----C--...
Score = 12*(0) + 13*(-2) + 7*2 = -12
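One common way to implement this idea (often called semi-global or “fit” alignment) is to leave the first row of the matrix at zero and take the best score anywhere in the last row, so gaps before and after the short sequence cost nothing. This Python sketch is an illustration of that scheme, not the presenters' code:

    def semi_global_score(q, s, match=2, mismatch=-1, gap=-2):
        """Best score for embedding the short sequence q in the long sequence s,
        with leading and trailing gaps around q costing nothing."""
        n, m = len(q), len(s)
        M = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            M[i][0] = i * gap        # gaps that consume characters of q are still penalized
        # M[0][j] stays 0: skipping a prefix of s (leading gaps) is free
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = match if q[i - 1] == s[j - 1] else mismatch
                M[i][j] = max(M[i - 1][j - 1] + sub,
                              M[i - 1][j] + gap,
                              M[i][j - 1] + gap)
        return max(M[n])             # best cell in the last row: trailing gaps are free

    print(semi_global_score("GTACGTC", "AGATCTGATCGTAAGTCATTCGCATAATGCGT"))  # 11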

©Duane Szafron Implementing Leading Gaps
[Figure: the FindScore matrix for ATTGATA against AGTATGCA recomputed with leading gaps given zero penalty.]

©Duane Szafron New optimal path - same score
[Figure: with free leading gaps there is a new optimal path, corresponding to the alignment AGTAT-GC-A / ---ATTGATA, which has the same score as before.]

©Duane Szafron Implementing Trailing Gaps
[Figure: the FindScore matrix recomputed with trailing gaps also given zero penalty.]

©Duane Szafron New optimal paths - new score
[Figure: with both leading and trailing gaps free, new optimal paths appear and the optimal score changes.]