Biological Sequence Comparison and Alignment Speaker: Yu-Hsiang Wang Advisor: Prof. Jian-Jung Ding Digital Image and Signal Processing Lab Graduate Institute.

Slides:



Advertisements
Similar presentations
Indexing DNA Sequences Using q-Grams
Advertisements

Parallel BioInformatics Sathish Vadhiyar. Parallel Bioinformatics  Many large scale applications in bioinformatics – sequence search, alignment, construction.
Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Hidden Markov Models (1)  Brief review of discrete time finite Markov Chain  Hidden Markov Model  Examples of HMM in Bioinformatics  Estimations Basic.
Presented By Dr. Shazzad Hosain Asst. Prof. EECS, NSU
BLAST Sequence alignment, E-value & Extreme value distribution.
BLAST, PSI-BLAST and position- specific scoring matrices Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and.
Inexact Matching of Strings General Problem –Input Strings S and T –Questions How distant is S from T? How similar is S to T? Solution Technique –Dynamic.
1 ALAE: Accelerating Local Alignment with Affine Gap Exactly in Biosequence Databases Xiaochun Yang, Honglei Liu, Bin Wang Northeastern University, China.
6/11/2015 © Bud Mishra, 2001 L7-1 Lecture #7: Local Alignment Computational Biology Lecture #7: Local Alignment Bud Mishra Professor of Computer Science.
Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.
Heuristic alignment algorithms and cost matrices
Fa05CSE 182 L3: Blast: Keyword match basics. Fa05CSE 182 Silly Quiz TRUE or FALSE: In New York City at any moment, there are 2 people (not bald) with.
Sequence Alignment Bioinformatics. Sequence Comparison Problem: Given two sequences S & T, are S and T similar? Need to establish some notion of similarity.
Sequence Alignment Variations Computing alignments using only O(m) space rather than O(mn) space. Computing alignments with bounded difference Exclusion.
Heuristic alignment algorithms; Cost matrices 2.5 – 2.9 Thomas van Dijk.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11: Core String Edits.
Heuristic Approaches for Sequence Alignments
Sequence analysis of nucleic acids and proteins: part 1 Based on Chapter 3 of Post-genome Bioinformatics by Minoru Kanehisa, Oxford University Press, 2000.
Bioinformatics Workshop, Fall 2003 Algorithms in Bioinformatics Lawrence D’Antonio Ramapo College of New Jersey.
Class 2: Basic Sequence Alignment
15-853:Algorithms in the Real World
Sequence alignment, E-value & Extreme value distribution
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
Sequence comparison: Local alignment
Case Study. DNA Deoxyribonucleic acid (DNA) is a nucleic acid that contains the genetic instructions used in the development and functioning of all known.
Speed Up DNA Sequence Database Search and Alignment by Methods of DSP
Sequence Alignment.
Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press.
BIONFORMATIC ALGORITHMS Ryan Tinsley Brandon Lile May 9th, 2014.
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
New Segmentation Methods Advisor : 丁建均 Jian-Jiun Ding Presenter : 蔡佳豪 Chia-Hao Tsai Date: Digital Image and Signal Processing Lab Graduate Institute.
BLAST: A Case Study Lecture 25. BLAST: Introduction The Basic Local Alignment Search Tool, BLAST, is a fast approach to finding similar strings of characters.
SISAP’08 – Approximate Similarity Search in Genomic Sequence Databases using Landmark-Guided Embedding Ahmet Sacan and I. Hakki Toroslu
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Computational Biology, Part 9 Efficient database searching methods Robert F. Murphy Copyright  1996, 1999, All rights reserved.
Hugh E. Williams and Justin Zobel IEEE Transactions on knowledge and data engineering Vol. 14, No. 1, January/February 2002 Presented by Jitimon Keinduangjun.
Alignment, Part I Vasileios Hatzivassiloglou University of Texas at Dallas.
Reference-Based Indexing of Sequence Databases (VLDB ’ 06) Jayendra Venkateswaran Deepak Lachwani Tamer Kahveci Christopher Jermaine Presented by Angela.
We want to calculate the score for the yellow box. The final score that we fill in the yellow box will be the SUM of two other scores, we’ll call them.
Image Enhancement [DVT final project]
Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College.
Color Image Segmentation Advisor : 丁建均 Jian-Jiun Ding Presenter : 蔡佳豪 Chia-Hao Tsai Date: Digital Image and Signal Processing Lab Graduate Institute.
A Table-Driven, Full-Sensitivity Similarity Search Algorithm Gene Myers and Richard Durbin Presented by Wang, Jia-Nan and Huang, Yu- Feng.
Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
. Sequence Alignment. Sequences Much of bioinformatics involves sequences u DNA sequences u RNA sequences u Protein sequences We can think of these sequences.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
A * Search A* (pronounced "A star") is a best first, graph search algorithm that finds the least-cost path from a given initial node to one goal node out.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Doug Raiford Phage class: introduction to sequence databases.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
Heuristic Alignment Algorithms Hongchao Li Jan
4.2 - Algorithms Sébastien Lemieux Elitra Canada Ltd.
Your friend has a hobby of generating random bit strings, and finding patterns in them. One day she come to you, excited and says: I found the strangest.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Sequence comparison: Local alignment
Homology Search Tools Kun-Mao Chao (趙坤茂)
Fast Sequence Alignments
Pairwise sequence Alignment.
Sequence Alignment with Traceback on Reconfigurable Hardware
Intro to Alignment Algorithms: Global and Local
CSE 589 Applied Algorithms Spring 1999
BCB 444/544 Lecture 7 #7_Sept5 Global vs Local Alignment
BIOINFORMATICS Fast Alignment
Bioinformatics Algorithms and Data Structures
Optimal Partitioning of Data Chunks in Deduplication Systems
Sequence alignment, E-value & Extreme value distribution
Presentation transcript:

Biological Sequence Comparison and Alignment Speaker: Yu-Hsiang Wang Advisor: Prof. Jian-Jung Ding Digital Image and Signal Processing Lab Graduate Institute of Communication Engineering National Taiwan University 1NTU - DISP Lab

Outline Introduction of biological sequence alignment Sequence alignment algorithm ◦ Dynamic programming ◦ FASTA & BLAST ◦ UDCR ◦ Less-redundant fast algorithm ◦ Algorithm for approximate string matching and improvement Conclusion Reference 2NTU - DISP Lab

Introduction of biological sequence alignment Two strings: S 1 : “caccba” and S 2 : “cabbab” Alignment: where 3NTU - DISP Lab

Introduction of biological sequence alignment The edit distance between two strings Scoring matrix of alphabet or Similarity 4NTU - DISP Lab

Dynamic programming The fundamental sequence alignment algorithm The recurrence relation Tabular computation The traceback 5NTU - DISP Lab

Dynamic programming The recurrence relation: ◦ Define D(i, j) to be the edit distance of S 1 [1..i] and S 2 [1..j] ◦ Base condition:  D(i, 0)=i (the first column)  D(0, j)=j (the first row) ◦ Recurrence relation:  D(i, j) = min[D(i-1, j) + d, D(i, j-1) + d, D(i-1, j-1) + t(i, j)]  where t(i, j)=e if S 1 (i)=S 2 (j); otherwise t(i, j)=r  (Assume d=1, r=1, and e=0 ) 6NTU - DISP Lab

Dynamic programming Tabular computation ◦ Initial table 7NTU - DISP Lab

Dynamic programming Tabular computation ◦ Finished table 8NTU - DISP Lab

Dynamic programming The traceback ◦ If D(i, j)=D(i, j-1)+1, set a pointer from (i, j) to (i, j-1), denote as “←” ◦ If D(i, j)=D(i-1, j)+1, set a pointer from (i, j) to (i-1, j), denote as “↑” ◦ If D(i, j)=D(i-1, j-1)+t(i, j) , set a pointer from (i, j) to (i-1, j-1) as “ ↖ ”  where t(i, j)=0 if S 1 (i)=S 2 (j); otherwise t(i, j)=1 9NTU - DISP Lab

Dynamic programming The traceback 10NTU - DISP Lab

Dynamic programming Alignment(1) S 1 :wri-t-ers S 2 :-vintner- 11NTU - DISP Lab

Dynamic programming Alignment(2) S 1 :wri-t-ers S 2 :v-intner- 12NTU - DISP Lab

Dynamic programming Alignment(3) S 1 :wri-t-ers S 2 :vintner- 13NTU - DISP Lab

FASTA The main idea: similar sequences probably share some short matches (word) Only search for the consecutive identities of length k (k-tuple word) 14NTU - DISP Lab

FASTA Step1:Select k to establish the lookup table  “ATAGTCAATCCG” and “TGAGCAATCAAG” 15NTU - DISP Lab

FASTA Step2: labels each k-tuple word as an “x” (the word hits sharing same offset are on a same diagonal) 16NTU - DISP Lab

FASTA Step3: Choose 10 best diagonal regions 17NTU - DISP Lab

FASTA Step4: If a region’s score is higher than the threshold, it can remain 18NTU - DISP Lab

FASTA Step5: Combine these remained regions into a longer high-scoring alignment (allow some spaces) 19NTU - DISP Lab

BLAST Similar to FASTA BLAST only care about the high-scoring words Establish a list to store those words 20NTU - DISP Lab

BLAST For example (3-mers): 21NTU - DISP Lab PEG PQA 15 12

BLAST For example, set D-score as 0 22NTU - DISP Lab

UDCR Use unitary mapping to represent the four types of nucleotide ◦ b x [τ] = 1 if x[τ] = ‘A’, ◦ b x [τ] = -1 if x[τ] = ‘T’, ◦ b x [τ] = j if x[τ] = ‘G’, ◦ b x [τ] = -j if x[τ] = ‘C’ 23NTU - DISP Lab

UDCR Calculate discrete correlations: ◦ For example  x = ‘GTAGCTGAACTGAAC’;  y = ‘AACTGAA’,  b x = [j,  1, 1, j,  j,  1, j, 1, 1,  j,  1, j, 1, 1,  j],  b y = [1, 1,  j,  1, j, 1, 1].  z 1 = [j,-1+j, 1,1+j, -j,-1-j,-3+j2, j3,6+j,1-j4,-4-j3, -4+j3,2+j5, 7,2-j5,-3-j3,-3+j2, 1+j3, 3, 1-j, -j],  z 2 = [  1, 0, 3,  2,  1, 0,  1, 1, 5,  5, 1, 1,  3, 7,  3, 0, 1,  2, 3, 0,  1] 24NTU - DISP Lab

UDCR Similarity ◦ since s[2]=6, we can know that the sequence {x[2],x[3],…,x[7]} is similar to y:  x =‘GTAGCTGAACTGAAC’, y = ‘AACTGAA’. 25NTU - DISP Lab

Less-redundant fast algorithm The movement ↘ is generalized and the movement → is removed New method to determine edit distance: ◦ D(i, j)=min [D(i-1, j 0 ) + j - j 0 – 1 + s(i, j, j 0 )]  where s(i, j, j) = 2, s(i, j, j 0 ) = 1, if j 0 ≦ j-1 and x(τ) ≠ y(j) for all τ in the range of j 0 +1 ≦ τ ≦ j s(i, j, j 0 ) = 0, if j 0 ≦ j-1 and x(τ) = y(j) for a τ in the range of j 0 +1 ≦ τ ≦ j 26NTU - DISP Lab

Less-redundant fast algorithm Slope rule ◦ If D(i, j a ) and D(i, j b ) on the same row satisfy D(i, j b ) - D(i, j a ) ≧ | j b - j a | then D(i, j b ) can be ignored. 27NTU - DISP Lab

Less-redundant fast algorithm Different entry rule(1) ◦ If (a) x(i) ≠ y(j) (b) D(i-1, j) is inactive (c) D(i-1. j-1) is active, then D(i, j) can be calculated by “D(i, j) = D(i-1, j-1) + 1” 28NTU - DISP Lab inactive active inactive

Less-redundant fast algorithm Different entry rule(2) ◦ If (a) x(i) ≠ y(j) (b) D(i-1, j) is active (c) x(i) ≠ y(j+1), then D(i, j) can be calculated by “D(i, j) = D(i-1, j) + 1” 29NTU - DISP Lab

Less-redundant fast algorithm Different entry rule(3) ◦ If (a) x(i) ≠ y(j) (b) D(i-1, j) is active (c) x(i) = y(j+1), then “D(i, j) can be ignored” 30NTU - DISP Lab

Less-redundant fast algorithm Same entry rule(1) ◦ If (a) x(i) = y(j) (b) D(i-1, j) is inactive (c) one of D(i-1, j 1 ), D(i-1, j 1 +1), …,D(i-1, j-1) is active, then D(i, j) can be calculated by “D(i, j) = D(i-1, j 2 ) + j - j 2 – 1” 31NTU - DISP Lab

Less-redundant fast algorithm Same entry rule(2) ◦ If (a) x(i) = y(j) (b) D(i-1, j) is active (c) none of D(i-1, j 1 ),D(i-1, j 1 +1),…,D(i-1, j-1) is active, then D(i, j) can be calculated by “D(i, j) = D(i-1, j 2 ) + 1” when x(i+1) ≠ y(j), “D(i, j) can be ignored” when x(i+1) = y(j). 32NTU - DISP Lab

Less-redundant fast algorithm Same entry rule(3) ◦ If (a) x(i) = y(j) (b) D(i-1, j) is active (c) one of D(i-1, j 1 ),D(i-1, j 1 +1),…,D(i-1, j-1) is active, then D(i, j) can be calculated by “D(i, j) = D(i-1, j 2 ) + 1” when x(i+1) ≠ y(j), “D(i, j) can be ignored” when x(i+1) = y(j). 33NTU - DISP Lab

Less-redundant fast algorithm Limitation rule ◦ First, Roughly estimate the upper bound of the edit distance H 1 by ◦ where k(x(i), y(j))=0 if x(i) = y(j) and k(x(i), y(j))=1 if x(i) ≠ y(j). 34NTU - DISP Lab

Less-redundant fast algorithm Limitation rule ◦ The diagram of H 1. (a) τ is 1. (b) No movement. (c) τ is NTU - DISP Lab

Less-redundant fast algorithm Limitation rule ◦ Set the upper bound of the edit distance as  where Max(N-i, M-j) is the maximal possible edit distances between x[i…N] and y[j…M]. 36NTU - DISP Lab

Less-redundant fast algorithm Limitation rule ◦ If then D(i, j) can be ignored. 37NTU - DISP Lab

Algorithm for approximate string matching Focus on diagonal. Define f kp = the largest index i ◦ k denote the diagonal’s number ◦ p denote the edit distance’s value 38NTU - DISP Lab

Algorithm for approximate string matching f 1,-1 =-∞, f 1,0 =-1, f 1,1 =2, f 1,2 =3, f 1,3 =4.  Computation range: -p ~ N-M+p 39NTU - DISP Lab

Improved algorithm for approximate string matching Use main diagonal N-M to separate table. Linking List New scoring scheme 40NTU - DISP Lab

Improved algorithm for approximate string matching Example of new scoring scheme 41NTU - DISP Lab

Improved algorithm for approximate string matching Score 0 iteration 42NTU - DISP Lab

Improved algorithm for approximate string matching Score 1 iteration 43NTU - DISP Lab

Improved algorithm for approximate string matching Score 2 iteration 44NTU - DISP Lab

Improved algorithm for approximate string matching Score 3 iteration 45NTU - DISP Lab

Improved algorithm for approximate string matching Score 4 iteration 46 Edit distance: iteration+(N-M) = 4+(10-7) = 7 NTU - DISP Lab

Improved algorithm for approximate string matching Experiments (complexity depend on s-|M-N|) 47 Strings on 4-letters alphabetStrings on 20-letters alphabet NTU - DISP Lab

Conclusion Traditional Dynamic programming costs too much computational time and memory. Less-redundant fast algorithm can save lots of time and space. Improved algorithm for approximate string matching can also save lots of time and space. (Although it cannot traceback to get the alignment) 48NTU - DISP Lab

Reference [1]D. Gusfield, Algorithms on Strings, Trees, and Sequence, Cambridge University Press, [2]S. C. Pei, J. J. Ding “Sequence Comparison and Alignment by Discrete Correlations, Unitary Mapping, and Number Theoretic Transforms” [3]K.H. Hsu, “Faster Algorithms for Computing Edit Distance and Alignment between DNA Strings”, [4]J. J. Ding, K. H. Hsu, “A Less-Redundant Fast Algorithm for Computing Edit Distances Exactly” [5]E. UKKONEN, “Algorithms for Approximate String Matching,” Information and Control 64, , 1985 [6]D. Papamichail and G. Papamichail, “Improved Algorithms for Approximate String Matching,” BMC Bioinformatics, Vol.10(Suppl.1):S10, January, NTU - DISP Lab