ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases. Xiaochun Yang, Honglei Liu, Bin Wang. Northeastern University, China.

Presentation transcript:

2 Local Alignment Similar over short conserved regions Dissimilar over remaining regions Applications  Comparing long stretches of anonymous DNA  Searching for unknown domains or motifs within proteins from different families …

3 Related Work Smith-Waterman algorithm (1981)  An exact approach but very slow  Not used for search BLAST: an efficient but approximate approach OASIS: an exact approach and efficient only for short query sequences (less than 60 characters) BWT-SW: an exact approach but inefficient Our target  An efficient and exact approach: ALAE (Accelerating Local Alignment with affine gap Exactly)

4 Local Alignment Input: two sequences T and P, a similarity function, and a threshold H. Output: alignments of a substring of T with a substring of P whose score >= H.

5 Measure Similarity Scoring scheme
 An identical mapping: positive score s_a
 A mismatch: negative score s_b
 A gap of length r: negative score s_g + r×s_s (s_g: gap opening penalty, s_s: gap extension penalty)
Example with s_a = 1, s_b = -3, s_g = -2, s_s = -1:
S1: TGCGC-ATGGATTGACCGA
S2: TGCGCCATTGAT--ACCGA
sim(S1,S2) = 15×1 + (-3) + (-2-1) + (-2 + 2×(-1)) = 5
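
The arithmetic on this slide can be checked with a short Python sketch. It assumes the scoring values read off the example (match 1, mismatch -3, gap opening -2, gap extension -1); the function name `affine_score` and its parameter names are hypothetical and do not come from the slides.

```python
def affine_score(s1, s2, match=1, mismatch=-3, gap_open=-2, gap_ext=-1):
    """Score two equal-length gapped strings.

    A run of r consecutive '-' characters in one row costs gap_open + r * gap_ext.
    """
    if len(s1) != len(s2):
        raise ValueError("aligned strings must have equal length")
    score = 0
    gap_row = None  # row holding the current gap run: 1, 2, or None
    for a, b in zip(s1, s2):
        if a == "-" or b == "-":
            row = 1 if a == "-" else 2
            if row != gap_row:
                score += gap_open  # a new gap run starts here
            score += gap_ext       # every gap column pays the extension penalty
            gap_row = row
        else:
            score += match if a == b else mismatch
            gap_row = None
    return score


print(affine_score("TGCGC-ATGGATTGACCGA", "TGCGCCATTGAT--ACCGA"))  # prints 5
```

The 15 matches contribute 15, the single mismatch -3, the gap of length 1 contributes -3, and the gap of length 2 contributes -4, for a total of 5.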

6 A Basic Approach For a substring X of T, entry (i, j) of the matrix of X and P holds the best alignment score of X[1,i] and any substring of P ending at position j.

7 A DP Algorithm
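
The recurrence shown on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of the standard Smith-Waterman dynamic program with affine gaps (Gotoh's three-matrix formulation), which is the exact baseline ALAE accelerates. It is not the authors' recurrence; the function name, the matrix names H, E, F, and the default scoring values are assumptions for illustration.

```python
def local_align_affine(t, p, match=1, mismatch=-3, gap_open=-2, gap_ext=-1):
    """Return (best_score, end_in_t, end_in_p) with 1-based end positions.

    H[i][j]: best score of a local alignment ending at t[i-1], p[j-1].
    E[i][j]: best score ending with a gap that consumes p[j-1].
    F[i][j]: best score ending with a gap that consumes t[i-1].
    A gap of length r costs gap_open + r * gap_ext.
    """
    neg = float("-inf")
    n, m = len(t), len(p)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    E = [[neg] * (m + 1) for _ in range(n + 1)]
    F = [[neg] * (m + 1) for _ in range(n + 1)]
    best = (0, 0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend an existing gap or open a new one.
            E[i][j] = max(E[i][j - 1], H[i][j - 1] + gap_open) + gap_ext
            F[i][j] = max(F[i - 1][j], H[i - 1][j] + gap_open) + gap_ext
            diag = H[i - 1][j - 1] + (match if t[i - 1] == p[j - 1] else mismatch)
            H[i][j] = max(0, diag, E[i][j], F[i][j])
            if H[i][j] > best[0]:
                best = (H[i][j], i, j)
    return best


print(local_align_affine("AAAGCTA", "GCTAG"))  # prints (4, 7, 4)
```

This fills all n×m entries of each matrix, which is the cost the filtering and reuse techniques in the later slides aim to avoid.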

8 An Example of a DP Matrix P = GCTAG, T = AAAGCTA. Scoring scheme = Ga Gb

9 A Basic Approach [Figure: alignments of T and P ending at text position i = i_1 + t_1 = i_2 + t_2 and query position j]

10 Challenges Speed  Each matrix contains m ~ m×n entries  n matrices  How to avoid calculating most of the entries without impairing the accuracy of the alignment results? In-memory algorithm  Long sequences: both T and P are long

11 Contributions Speed  Prune unnecessary calculations  Avoid duplicate calculations In-memory algorithm  Use compressed suffix array Mathematical analysis

12 Outline Local filterings Global filtering Reusing calculations A hybrid algorithm

13 Local filterings Length Filtering Pruned

14 Local filterings Score Filtering Pruned

15 Local filterings q-Prefix Filtering Pruned Simpler function

16 Comparison of Calculating One Matrix P = GCTAAGCTAAGCTGC (15 characters), X = GCTAAGCTAGT (11 characters), threshold H = 3. [Figure: DP matrix of X (rows) against P (columns)]

17 Comparison of Calculating One Matrix P = GCTAAGCTAAGCTGC (15 characters), X = GCTAAGCTAGT (11 characters), threshold H = 3. [Figure: the same DP matrix of X (rows) against P (columns), showing its calculated entries]

18 Outline Local filterings Global filtering Reusing calculations A hybrid algorithm

19 Global Filtering [Figure: alignments ending at text position i = i_1 + t_1 = i_2 + t_2, with the pruned area marked]

20 Global Filtering Pruned fork areas. Using X’: alignment score >= S_a. It is unnecessary to calculate the fork area in the matrix of X and P. Question: how can calculations be safely avoided based on already calculated matrices?

21 Global Filtering Update and check unnecessary calculations on-the-fly with a Boolean matrix. (1) Space consuming: m×n space (2) Calculation order [Figure: matrices of X’ and X under the scoring scheme]

22 Global Filtering q-prefix domination: X’ dominates X

23 Global Filtering q-prefix domination: X’ dominates X Text T  Constructing dominations offline in O(n) time Query P  Check useless calculations on-the-fly. Calculation order is unnecessary.

24 Outline Local filterings Global filtering Reusing calculations A hybrid algorithm

25 Reusing score calculations for P Entries with a common prefix P_s can share alignment scores (reusable alignment entries).

26 Reusing score calculations for P If two forks have equivalent scores for their FGOEs, their entries with a common substring P_s can share alignment scores (reusable alignment entries).

27 Outline Local filterings Global filtering Reusing calculations A hybrid algorithm

28 A Hybrid Algorithm Row by row Column by column

29 Mathematical Analysis Upper bound on the number of calculated entries for representative scoring schemes specified by BLAST  DNA: 4.50mn^0.520 ~ 9.05mn^0.896  Proteins: 8.28mn^0.364 ~ 7.49mn^0.723

30 Experiments Data sets  Human genome data set Length of a text: 50 million ~ 1 billion.  Mouse genome data set Length of each query: 1 thousand ~ 1 million.  Protein data set Length of a text: 10 million ~ 50 million. Length of each query: 200 ~ 100,000. E-value: threshold Scoring scheme: the same parameters as BLAST Environment: GNU C++, Intel Core i7 2.93GHz quad-core CPU, 8GB memory, 500GB disk, running Ubuntu (Linux).

31 Alignment Time and Number of Results 76 times faster than BWT-SW 16 times faster than BWT-SW

32 Filtering Ratio

33 Reusing Ratio

34 Index Size

35 Conclusions High efficiency of ALAE  Improves BWT-SW significantly  Accelerates BLAST for most of the scoring schemes In-memory approach using compressed suffix array Mathematical analysis  Upper bound on calculated entries

36 Thank you! Source code to be available at

37 Simulating Searches Using Compressed Suffix Array Match a q-length substring in text  Identify forks Find occurrences of a substring in text  Calculate end positions of alignments Get all suffixes with the same prefix as X q

38 Review of Compressed Suffix Array
T = G1 C2 T3 A4 G5 C6, T’ = G1 C2 T3 A4 G5 C6 $7
Conceptual matrix (all rotations of T’): GCTAGC$, CTAGC$G, TAGC$GC, AGC$GCT, GC$GCTA, C$GCTAG, $GCTAGC
Sorted rotations, rows SA[0,6]:
0: $GCTAGC
1: AGC$GCT
2: C$GCTAG
3: CTAGC$G
4: GC$GCTA
5: GCTAGC$
6: TAGC$GC
BWT (last column) = CTGGA$C
X = GC: positions of GC in T
 SA[4] = 5
 SA[5] = 1
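
The example on this slide can be reproduced with a small Python sketch. It builds a plain (uncompressed) suffix array by sorting suffixes, derives the BWT from it, and runs the usual backward search, so it illustrates the idea rather than the compressed index ALAE uses. The function names are hypothetical, and rank queries are done by naive counting for brevity.

```python
from collections import Counter


def suffix_array(text):
    """0-based start positions of the suffixes of text, in sorted order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_sa(text, sa):
    """Last column of the sorted rotations: the character preceding each suffix."""
    return "".join(text[i - 1] for i in sa)  # i == 0 wraps around to the final '$'


def backward_search(bwt, pattern):
    """Half-open range [lo, hi) of suffix-array rows whose suffix starts with pattern."""
    counts = Counter(bwt)
    smaller, total = {}, 0
    for c in sorted(counts):
        smaller[c] = total  # number of characters in the text smaller than c
        total += counts[c]
    lo, hi = 0, len(bwt)
    for c in reversed(pattern):
        if c not in smaller:
            return (0, 0)
        lo = smaller[c] + bwt[:lo].count(c)
        hi = smaller[c] + bwt[:hi].count(c)
        if lo >= hi:
            return (0, 0)
    return (lo, hi)


text = "GCTAGC$"
sa = suffix_array(text)
bwt = bwt_from_sa(text, sa)
lo, hi = backward_search(bwt, "GC")
positions = sorted(sa[k] + 1 for k in range(lo, hi))  # 1-based, as on the slide
print(bwt, (lo, hi), positions)  # prints CTGGA$C (4, 6) [1, 5]
```

Rows 4 and 5 of the suffix array are the ones the slide lists, giving positions 5 and 1 of GC in T.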

39 Compressed Suffix Array – reverse T to T^-1
T = G1 C2 T3 A4 G5 C6, T’ = $0 G1 C2 T3 A4 G5 C6, T^-1 = C6 G5 A4 T3 C2 G1 $0
Conceptual matrix (all rotations of T^-1): CGATCG$, GATCG$C, ATCG$CG, TCG$CGA, CG$CGAT, G$CGATC, $CGATCG
Sorted rotations, rows SA[0,6]:
0: $CGATCG
1: ATCG$CG
2: CG$CGAT
3: CGATCG$
4: G$CGATC
5: GATCG$C
6: TCG$CGA
BWT (last column) = GGT$CCA
X = GC  X^-1 = CG: positions of CG in T^-1
 SA[2] = 2
 SA[3] = 6
Therefore, positions of GC in T
 SA[2]-|X|+1 = 1
 SA[3]-|X|+1 = 5
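
The position arithmetic for the reversed text can be sketched in Python as follows. The sketch labels each suffix of the reversed text with the original 1-based position of its first character (0 for the sentinel), as the slide does, and finds matching rows by a simple prefix test instead of a compressed index. The function names are hypothetical.

```python
def reversed_labeled_sa(t):
    """Suffix array of reversed t plus '$', labeled with original 1-based positions."""
    n = len(t)
    rev = t[::-1] + "$"
    order = sorted(range(len(rev)), key=lambda s: rev[s:])
    # Character rev[s] is t's character at 1-based position n - s; '$' gets label 0.
    return rev, order, [n - s for s in order]


def occurrences_via_reverse(t, x):
    """1-based start positions of x in t, found by matching reversed x in reversed t."""
    rev, order, labels = reversed_labeled_sa(t)
    x_rev = x[::-1]
    starts = []
    for s, label in zip(order, labels):
        if rev.startswith(x_rev, s):
            # label is where x ends in t, so x starts at label - |x| + 1.
            starts.append(label - len(x) + 1)
    return sorted(starts)


print(reversed_labeled_sa("GCTAGC")[2])       # prints [0, 4, 2, 6, 1, 5, 3]
print(occurrences_via_reverse("GCTAGC", "GC"))  # prints [1, 5]
```

Rows 2 and 3 carry labels 2 and 6, which give start positions 2-2+1 = 1 and 6-2+1 = 5, the same values as on the slide.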

40 Align Distinct Substring in T with P [Figure: a distinct substring X of T aligned with P, with entry (i, j) holding score v]

41 Alignment Time T = 50 million characters, P = 10 thousand characters. Smith-Waterman algorithm: 7.7 hours. ALAE: 25 ms.