A Table-Driven, Full-Sensitivity Similarity Search Algorithm
Gene Myers and Richard Durbin
Presented by Wang, Jia-Nan and Huang, Yu-Feng



Outline
- Introduction
- Background
- Preliminary
- Method
- Experiment

Introduction
- Given a query and a database, perform local alignment
- Smith-Waterman: guaranteed to find all local alignments, but expensive
- BLAST, FASTA (heuristic alternatives)

Improvement
- Hardware: more investment in computers and CPUs
- Software: Phil Green's SWAT appeals to sparsity and some machine-level coding tricks
- 60% of the dynamic programming matrix has value 0
- Avoid computing most of these unproductive entries

- Focus on improving protein similarity searches
- This approach examines and computes only 4% of the underlying dynamic programming matrix

Recall
- Sequence alignment
  - Local sequence alignment
  - Global sequence alignment
- Goal: matching path with highest score
- Table-based computation and dynamic programming

Dynamic Programming
- Three basic components
  - Recurrence relation
  - Tabular computation
  - Traceback

Smith-Waterman Method
- Dynamic programming algorithm
- Finds the most similar subsequences of two sequences
- Problem
  - Lots of computation: the amount becomes astronomical (a googol)
  - Programmers: will go crazy and get excited
  - Why? How can it be accelerated?
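The three dynamic programming components (recurrence relation, tabular computation, traceback) can be sketched for Smith-Waterman as follows. This is a minimal sketch, assuming a constant per-symbol gap penalty; the function name `smith_waterman`, its parameters, and the default scores are illustrative assumptions, not code from the presentation or the paper.

```python
def smith_waterman(query, target, match=8, mismatch=-5, gap=-20):
    """Return (best_score, aligned_query, aligned_target) for one best local alignment.

    Hypothetical helper for illustration; scores are placeholder values.
    """
    rows, cols = len(query) + 1, len(target) + 1
    # Tabular computation: H[i][j] is the best score of a local alignment
    # ending at query[i-1] and target[j-1]. Row 0 and column 0 stay at 0.
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == target[j - 1] else mismatch
            # Recurrence relation: restart (0), substitute, or open a gap.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback: walk back from the best cell until a zero entry is reached.
    i, j = best_pos
    top, bottom = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        sub = match if query[i - 1] == target[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + sub:
            top.append(query[i - 1]); bottom.append(target[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            top.append(query[i - 1]); bottom.append('-')
            i -= 1
        else:
            top.append('-'); bottom.append(target[j - 1])
            j -= 1
    return best, ''.join(reversed(top)), ''.join(reversed(bottom))
```

Every one of the (P+1) x (N+1) cells is filled here, which is the cost that the table-driven approach in this presentation tries to avoid.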

Background
- Scoring system
  - Simple scoring scheme
  - Affine gap penalty scoring scheme
  - PAM120 (PAMn)
  - BLOSUM62 (BLOSUMn)

Simple Scoring Scheme
- Match (e.g. +8)
- Mismatch (e.g. -5)
- Constant gap penalty (e.g. -20)
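As a small worked example of the simple scheme, the sketch below scores a given alignment column by column using the example values from this slide. The function name `simple_score` and the use of '-' as the gap symbol are assumptions made for illustration.

```python
def simple_score(row_a, row_b, match=8, mismatch=-5, gap=-20):
    """Score two equal-length alignment rows; '-' marks a gap (assumed convention)."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    total = 0
    for x, y in zip(row_a, row_b):
        if x == '-' or y == '-':
            total += gap        # every gap symbol costs the same constant
        elif x == y:
            total += match
        else:
            total += mismatch
    return total
```

For instance, aligning ACGT with AC-T gives three matches and one gap: 3 x 8 - 20 = 4.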

Affine Gap Penalty Scoring Scheme
- Match (e.g. +8)
- Mismatch (e.g. -5)
- Gap symbol (e.g. -5)
- Gap open penalty (e.g. -10)
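Under the affine scheme, a run of k gap symbols is charged the open penalty once plus the gap-symbol penalty k times. The sketch below scores a given alignment this way using the example values from this slide; the function name `affine_score` and the '-' gap convention are assumptions for illustration.

```python
def affine_score(row_a, row_b, match=8, mismatch=-5, gap_symbol=-5, gap_open=-10):
    """Score two equal-length alignment rows under an affine gap penalty."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    total = 0
    prev_gap = None  # which row held the gap in the previous column, if any
    for x, y in zip(row_a, row_b):
        if x == '-' or y == '-':
            side = 'a' if x == '-' else 'b'
            if side != prev_gap:
                total += gap_open   # charged once per run of gaps
            total += gap_symbol     # charged for every gap symbol
            prev_gap = side
        else:
            total += match if x == y else mismatch
            prev_gap = None
    return total
```

Aligning ACGTT with A--TT gives 3 x 8 for the matches and -10 - 2 x 5 for the single gap of length 2, for a total of 4.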

PAM
- PAM: Percent Accepted Mutation
  - Dayhoff et al. (1978)
- PAM unit
  - Evolutionary time corresponding to an average of 1 mutation per 100 residues
  - 1% accepted
- PAMn
  - Relates to mutation probabilities over an evolutionary interval of n PAM units
Some information from:
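In the Dayhoff model, the mutation probabilities for n PAM units are obtained by multiplying the 1-PAM probability matrix by itself n times. The sketch below shows this on a toy two-letter alphabet; the matrix `toy_pam1` is made up for illustration and is not Dayhoff's data, and the helper names are hypothetical. The published PAMn scoring matrices are log-odds scores derived from such probabilities, a step not shown here.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def pam_n(pam1, n):
    """Mutation probabilities after n PAM units: the n-th power of the 1-PAM matrix."""
    size = len(pam1)
    # Start from the identity matrix (zero evolutionary time).
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, pam1)
    return result

# Toy 1-PAM matrix: each letter has a 1% chance of being replaced (made-up values).
toy_pam1 = [[0.99, 0.01],
            [0.01, 0.99]]
toy_pam120 = pam_n(toy_pam1, 120)
```

Each row of the result still sums to 1, while the probability of a letter remaining unchanged shrinks as n grows.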

PAM120 Source:

BLOSUM62
- BLOSUM: BLOcks SUbstitution Matrix
  - Steven Henikoff and Jorja G. Henikoff (1992)
  - Paper: Amino acid substitution matrices from protein blocks [PubMed]
- BLOSUMn
  - Relates to mutation probabilities observed between pairs of related proteins in aligned blocks, with sequences at least n% identical clustered together
Some information from:

BLOSUM62 (matrix figure; rows and columns in the order C S T P A G N D E Q H R K M I L V F Y W)

Preliminaries
- Σ: the alphabet over which sequences are composed
- |Σ| × |Σ| substitution matrix S giving the score of aligning each pair of letters
- Uniform gap penalty g > 0
- Query = q_1 q_2 ... q_P of P letters
- Target = t_1 t_2 ... t_N of N letters
- Threshold T > 0

Score Table → Edit Graph
Picture source:

Problem
- Find a high-scoring local alignment between Query and Target whose path score ≥ T
- Edit graph (Figure 1)
- Limit our attention to prefix-positive paths
- If there is a path of score T or greater in the edit graph, then there is a prefix-positive path of score T or greater
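The last claim can be illustrated by trimming a path at the last point where its running score is lowest. The sketch below assumes that "prefix-positive" means every non-empty prefix of the path has positive score (an assumption about the slide's definition), and represents a path simply as a list of edge scores; the function name is hypothetical.

```python
def prefix_positive_suffix(edge_scores):
    """Return (start, suffix), where suffix = edge_scores[start:] is prefix-positive.

    Assumes the total path score is positive. The cut is placed after the LAST
    position where the running score reaches its minimum (counting the empty
    prefix, whose score is 0).
    """
    running, lowest, start = 0, 0, 0
    for k, w in enumerate(edge_scores, start=1):
        running += w
        if running <= lowest:   # '<=' keeps the last position of the minimum
            lowest, start = running, k
    return start, edge_scores[start:]
```

Because the removed prefix has a score of at most 0, the remaining suffix scores at least as much as the whole path, so a path of score T or greater always yields a prefix-positive path of score T or greater.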

Definition
- A set P of index-value pairs { (i,v): i ∈ [0,P]

The Start and Extension Tables
- Consider a vertex x in row j of the edit graph of Query vs. Target

Start Trimming
- Limiting the dynamic programming to the startable vertices requires a table Start(w), where w ∈ Σ^{k_s} (one entry per word of k_s letters over Σ)

Start Trimming
- Worst case
- Let α be the expected percentage of vertices that are seeds

Extension Trimming
- A table that eliminates vertices that are not extendable
- (i,j) is an extendable vertex iff C(i,j) > Extend(i, Target[j+1…j+k_e])

Extension Trimming

A Table-Driven Scheme for DP
- Goal: restrict the SW computation to productive vertices
- Jump table: captures the effect of Advance and Delete over k_J > 0 rows
  - Space: unmanageably large
  - But only record those for which

- Jump table
- Start table
- Space-saving version for Jump and Start tables

Check for paths scoring T or more

Recall – Affine Gap Penalty
- Score
  - Match
  - Mismatch
  - Gap symbol: gsp
  - Gap open penalty: gop
- Affine cost of a gap of length k: g + kh, where g = gop and h = gsp

Diagram of Affine Gap Penalty (figure: each vertex split into C, I and D nodes, with edge weights -h, -g-h and δ(a_i, b_j))
Source: kmchao's lecture note

Recurrence system - Gotoh
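For reference, a standard statement of Gotoh's recurrence for local alignment is given here. It uses the gap cost g + kh and the edge weights -h, -g-h and δ(a_i, b_j) from the affine gap penalty slides. Reading D as "ends in a gap in b" and I as "ends in a gap in a" is an assumption about the diagram's orientation, and the 0 term assumes the local (Smith-Waterman) variant.

```latex
\begin{aligned}
C(i,j) &= \max\{\, 0,\; D(i,j),\; I(i,j),\; C(i-1,j-1) + \delta(a_i, b_j) \,\} \\
D(i,j) &= \max\{\, D(i-1,j) - h,\; C(i-1,j) - g - h \,\} \\
I(i,j) &= \max\{\, I(i,j-1) - h,\; C(i,j-1) - g - h \,\} \\
C(i,0) &= C(0,j) = 0, \qquad D(0,j) = I(i,0) = -\infty
\end{aligned}
```

Extending an existing gap costs h, while opening a new one from C costs g + h, so a gap of length k costs g + kh in total.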

The Case of Affine Gap Costs
- Simple scoring scheme → affine gap penalty scheme
- Affine edit graph and vertex structure
- Question: how should the equations defined above be modified?

Recurrence System for Affine Gap Costs
- Two observations
  - To compute the jth row from the (j-1)st requires knowing only the vectors of and values in row j-1, and not on the values in that row
  - If then the value at vertex need not be recorded as any maximal path through its will have score less than the maximal path passing through the corresponding

Recurrence System

Results

Experiment
- Method: edit-graph-based approach vs. SWAT
- Scoring matrix: PAM120
- Affine gap cost: 8+4n
- Database (target): 3 million residue subset of the PIR database
- Query
  - A periodic clock protein of length 173 (pcp)
  - A lactate dehydrogenase of length 319 (dehydro)
  - A cGMP kinase of length 670 (kinase)
  - A growth factor of length 1210 (g factor)

PAM120 & Gap Cost 8+4n

BLOSUM62 & Gap Cost 8+2n

Thanks for Your Attention Ending