FA05CSE182 CSE 182-L2:Blast & variants I Dynamic Programming www.cse.ucsd.edu/classes/fa05/cse182 www.cse.ucsd.edu/~vbafna.

Presentation transcript:

FA05CSE182 CSE 182-L2:Blast & variants I Dynamic Programming

FA05CSE182 Searching Sequence databases

FA05CSE182 Query: >gi| |dbj|BAC | unnamed protein product [Mus musculus] MSSTKLEDSLSRRNWSSASELNETQEPFLNPTDYDDEEFLRYLWREYLHPKEYEWVLIAGY IIVFVVALIGNVLVCVAVWKNHHMRTVTNYFIVNLSLADVLVTITCLPATLVVDITETWFFGQSL CKVIPYLQTVSVSVSVLTLSCIALDRWYAICHPLMFKSTAKRARNSIVVIWIVSCIIMIPQAIVME CSSMLPGLANKTTLFTVCDEHWGGEVYPKMYHICFFLVTYMAPLCLMILAYLQIFRKLWCRQ IPGTSSVVQRKWKQQQPVSQPRGSGQQSKARISAVAAEIKQIRARRKTARMLMVVLLVFAIC YLPISILNVLKRVFGMFTHTEDRETVYAWFTFSHWLVYANSAANPIIYNFLSGKFREEFKAAF SCCLGVHHRQGDRLARGRTSTESRKSLTTQISNFDNVSKLSEHVVLTSISTLPAANGAGPLQ NWYLQQGVPSSLLSTWLEV What is the function of this sequence? Is there a human homolog? Which organelle does it work in? (Secreted/membrane bound) Idea: Search a database of known proteins to see if you can find similar sequences which have a known function

FA05CSE182 Querying with Blast

FA05CSE182 Blast Output The output (Blastp query) is a series of protein sequences, ranked according to similarity with the query Each database hit is aligned to a subsequence of the query

FA05CSE182 Blast Output (schematic figure: query aligned against db)

FA05CSE182 Blast Output (figure labels: Q beg, S beg, Q end, S end, S Id)

FA05CSE182 Interpreting Blast results How do we interpret these results? –Similar sequence in the 3 species implies that the common ancestor of the 3 had that sequence. –The sequence accumulates mutations over time. These mutations may be indels, or substitutions. –Hum and mus diverged more recently and so the sequences are more likely to be similar. (Figure: tree over hum, mus, and dros, with the hum/mus ancestor labeled hummus?)

FA05CSE182 How do we measure similarity between sequences? Percent identity? –Hard to compute without indels. Number of sequence edit operations? –Implies a notion of alignment. (Figure: the sequences A T C A A T C G and T C A A T G G T, shown twice.)
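
A minimal Python sketch of why percent identity depends on indels. It assumes the two sequences on the slide are ATCAATCG and TCAATGGT (the split is inferred from the transcript), and `percent_identity` is an illustrative helper, not a BLAST function.

```python
def percent_identity(a, b):
    """Fraction of positions holding the same letter in two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    same = sum(x == y for x, y in zip(a, b))
    return same / len(a)


s = "ATCAATCG"  # assumed first sequence from the slide
t = "TCAATGGT"  # assumed second sequence from the slide

# Compare position by position, with no indels allowed.
unshifted = percent_identity(s, t)

# Drop s[0] and t[-1], which is what one indel at each end would do.
shifted = percent_identity(s[1:], t[:-1])

print(unshifted, shifted)
```

Under these assumptions a single shift changes the identity from 1 of 8 positions to 6 of 7, which is why the measure only makes sense relative to an alignment.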

FA05CSE182 Computing alignments What is an alignment? A 2 x m table: each sequence is a row, with interspersed gaps, and the columns describe the edit operations. What is the score of an alignment? Score every column, and sum up the scores. Let C be the score function for the column. How do we compute the alignment with the best score? Example rows: AA-TCGGA ACTCG-A
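
The column-sum definition can be sketched in Python as follows. The scoring values (match 1, mismatch -1, indel -1) are borrowed from the TCAT/TGCAA example later in the deck, and the function names are illustrative assumptions.

```python
def column_score(x, y, match=1, mismatch=-1, indel=-1):
    """Score function C for one column; '-' denotes a gap."""
    if x == "-" or y == "-":
        return indel
    return match if x == y else mismatch


def alignment_score(row_s, row_t):
    """Sum the column scores of an alignment given as two equal-length rows."""
    if len(row_s) != len(row_t):
        raise ValueError("both rows of an alignment must have the same length")
    return sum(column_score(x, y) for x, y in zip(row_s, row_t))


# Two candidate alignments of TCAT and TGCAA.
print(alignment_score("TCAT-", "TGCAA"))  # gap at the end of s
print(alignment_score("T-CAT", "TGCAA"))  # gap after the first T of s
```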

FA05CSE182 Optimum scoring alignments, and score of optimum alignment Instead of computing an optimum scoring alignment, we attempt to compute the score of an optimal alignment. Later, we will show that the two are equivalent

FA05CSE182 Computing Optimal Alignment score Observations: The optimum alignment has nice recursive properties: –The alignment score is the sum of the scores of columns. –If we break off at cell k, the left part and right part must be optimal sub-alignments. –The left part contains prefixes s[1..i], and t[1..j]. (Figure: alignment of s and t split at cell k.)

FA05CSE182 Optimum prefix alignments Consider an optimum alignment of the prefixes s[1..i] and t[1..j]. Look at the last cell. It can only have 3 possibilities. (Figure: alignment of s and t with cells 1..k.)

FA05CSE182 3 possibilities for rightmost cell
1. s[i] is aligned to t[j]: preceded by an optimum alignment of s[1..i-1] and t[1..j-1]
2. s[i] is aligned to '-': preceded by an optimum alignment of s[1..i-1] and t[1..j]
3. t[j] is aligned to '-': preceded by an optimum alignment of s[1..i] and t[1..j-1]

FA05CSE182 Optimal score of an alignment Let S[i,j] be the score of an optimal alignment of the prefixes s[1..i] and t[1..j]. It must be one of 3 possibilities:
1. s[i] over t[j], after an optimum alignment of s[1..i-1] and t[1..j-1]: S[i,j] = C(s[i],t[j]) + S[i-1,j-1]
2. s[i] over '-', after an optimum alignment of s[1..i-1] and t[1..j]: S[i,j] = C(s[i],-) + S[i-1,j]
3. '-' over t[j], after an optimum alignment of s[1..i] and t[1..j-1]: S[i,j] = C(-,t[j]) + S[i,j-1]
S[i,j] is the maximum of these three values.

FA05CSE182 Optimal alignment score Which prefix pairs (i,j) should we use? For now, simply use all. If the strings are of length m, and n, respectively, what is the score of the optimal alignment?

FA05CSE182 An O(nm) algorithm for score computation
For i = 1 to n
  For j = 1 to m
    S[i,j] = max { S[i-1,j-1] + C(s[i],t[j]), S[i-1,j] + C(s[i],-), S[i,j-1] + C(-,t[j]) }
The iteration ensures that all values on the right are computed in earlier steps.

FA05CSE182 Initialization
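
The slide's initialization is not preserved in the transcript. A minimal Python sketch, assuming the usual choice for global alignment (row 0 and column 0 hold the cost of aligning a prefix against gaps only) and the example scoring of match 1, mismatch -1, indel -1; the function name `global_score` is illustrative.

```python
def global_score(s, t, match=1, mismatch=-1, indel=-1):
    """Fill the (n+1) x (m+1) table S and return S[n][m]."""
    n, m = len(s), len(t)
    S = [[0] * (m + 1) for _ in range(n + 1)]

    # Assumed initialization: a prefix aligned against gaps only.
    for i in range(1, n + 1):
        S[i][0] = S[i - 1][0] + indel
    for j in range(1, m + 1):
        S[0][j] = S[0][j - 1] + indel

    # Row-by-row fill: every cell on the right-hand side is already computed.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + sub,   # s[i] aligned to t[j]
                S[i - 1][j] + indel,     # s[i] aligned to '-'
                S[i][j - 1] + indel,     # t[j] aligned to '-'
            )
    return S[n][m]


print(global_score("TCAT", "TGCAA"))
```

Python strings are 0-indexed, so `s[i - 1]` plays the role of s[i] in the slides.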

FA05CSE182 A tableau approach Rows are indexed by positions i of s and columns by positions j of t. Cell (i,j) contains the score S[i,j]. Each cell only looks at 3 neighboring cells: S[i-1,j-1], S[i-1,j], and S[i,j-1].

FA05CSE182 An Example Align s=TCAT with t=TGCAA. Match score = 1, mismatch score = -1, indel score = -1. Score A1? Score A2?
A1: TCAT- over TGCAA
A2: T-CAT over TGCAA

FA05CSE182 Alignment Table (figure: table with rows s = T C A T and columns t = T G C A A)

FA05CSE182 Alignment Table (figure: table with rows s = T C A T and columns t = T G C A A) S[4,5] = 1 is the score of an optimum alignment. Therefore, A2 is an optimum alignment. We know how to obtain the optimum score. How do we get the best alignment?

FA05CSE182 Computing Optimum Alignment At each cell, we have 3 choices. We maintain additional information to record the choice at each step.
For i = 1 to n
  For j = 1 to m
    If (S[i,j] = S[i-1,j-1] + C(s[i],t[j])) M[i,j] = '\'
    If (S[i,j] = S[i-1,j] + C(s[i],-)) M[i,j] = '|'
    If (S[i,j] = S[i,j-1] + C(-,t[j])) M[i,j] = '--'

FA05CSE182 Computing Optimal Alignments (figure: table with rows s = T C A T and columns t = T G C A A, with the recorded choices M[i,j])

FA05CSE182 Retrieving Opt. Alignment (s = TCAT, t = TGCAA)
M[4,5] = '\' implies that S[4,5] = S[3,4] + C(T,A): the last column is T over A.
M[3,4] = '\' implies that S[3,4] = S[2,3] + C(A,A): the column before it is A over A.
Alignment so far: AT over AA.

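FA05CSE182 Retrieving Opt. Alignment (s = TCAT, t = TGCAA)
M[2,3] = '\' implies that S[2,3] = S[1,2] + C(C,C): the column is C over C.
M[1,2] = '--' implies that S[1,2] = S[1,1] + C(-,G): the column is - over G.
The remaining column is T over T, giving the alignment T-CAT over TGCAA.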

FA05CSE182 Algorithm to retrieve optimal alignment ('.' appends one column to an alignment)
RetrieveAl(i,j)
  if (i == 0 and j == 0) return the empty alignment
  if (M[i,j] == '\') return (RetrieveAl(i-1,j-1) . column s[i] over t[j])
  else if (M[i,j] == '|') return (RetrieveAl(i-1,j) . column s[i] over -)
  else if (M[i,j] == '--') return (RetrieveAl(i,j-1) . column - over t[j])
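
The score table and the retrieval step can be sketched together in Python. This is a sketch under stated assumptions, not the lecture's code: the border initialization, the tie-breaking order (diagonal, then up, then left), and the function name `align` are all illustrative, and the traceback is written as a loop rather than the recursive RetrieveAl.

```python
def align(s, t, match=1, mismatch=-1, indel=-1):
    """Return (score, row_s, row_t) for one optimal global alignment."""
    n, m = len(s), len(t)
    S = [[0] * (m + 1) for _ in range(n + 1)]
    M = [[None] * (m + 1) for _ in range(n + 1)]  # recorded choice per cell

    # Assumed borders: a prefix aligned against gaps only.
    for i in range(1, n + 1):
        S[i][0], M[i][0] = S[i - 1][0] + indel, "|"
    for j in range(1, m + 1):
        S[0][j], M[0][j] = S[0][j - 1] + indel, "--"

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = S[i - 1][j - 1] + (match if s[i - 1] == t[j - 1] else mismatch)
            up = S[i - 1][j] + indel
            left = S[i][j - 1] + indel
            S[i][j] = max(diag, up, left)
            # Record one choice that achieves the maximum.
            if S[i][j] == diag:
                M[i][j] = "\\"
            elif S[i][j] == up:
                M[i][j] = "|"
            else:
                M[i][j] = "--"

    # Walk the recorded choices back from (n, m), collecting columns in reverse.
    row_s, row_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if M[i][j] == "\\":
            row_s.append(s[i - 1]); row_t.append(t[j - 1]); i -= 1; j -= 1
        elif M[i][j] == "|":
            row_s.append(s[i - 1]); row_t.append("-"); i -= 1
        else:
            row_s.append("-"); row_t.append(t[j - 1]); j -= 1
    return S[n][m], "".join(reversed(row_s)), "".join(reversed(row_t))


score, top, bottom = align("TCAT", "TGCAA")
print(score)
print(top)
print(bottom)
```

When several choices tie, any of them leads to an optimal alignment; the order used here only decides which one is reported.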

FA05CSE182 Summary An optimal alignment of strings of length n and m can be computed in O(nm) time. There is a tight connection between computation of the optimal score and computation of the optimal alignment. –True for all DP based solutions

FA05CSE182 Global versus Local Alignment Consider s = ACCACCCCTT and t = ATCCCCACAT. (Figure: two alignments of s and t, ACCACCCCTT over A TCCCCACAT and ATCCCCACAT over ACCACCCCT T; exact gap placement is not preserved.)

FA05CSE182 Blast Outputs Local Alignment (schematic figure: query aligned against db)

FA05CSE182 Local Alignment Compute the maximum scoring alignment over all sub-intervals (a,b) of one sequence and (a',b') of the other. How can we compute this efficiently?

FA05CSE182 Local Alignment Trick How can we compute the local alignment itself?
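
The slide's formula is not preserved in the transcript. One standard way to get local alignment scores in O(nm) time is the Smith-Waterman recurrence, which adds 0 as a fourth option in each cell (start a fresh alignment here) and takes the maximum over the whole table. The Python sketch below assumes that recurrence and the example scoring of match 1, mismatch -1, indel -1; `local_score` is an illustrative name.

```python
def local_score(s, t, match=1, mismatch=-1, indel=-1):
    """Best score over all pairs of substrings of s and t."""
    n, m = len(s), len(t)
    # Row 0 and column 0 stay 0: an empty local alignment scores 0.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            H[i][j] = max(
                0,                      # drop everything before this cell
                H[i - 1][j - 1] + sub,  # extend with s[i] over t[j]
                H[i - 1][j] + indel,    # extend with s[i] over '-'
                H[i][j - 1] + indel,    # extend with '-' over t[j]
            )
            best = max(best, H[i][j])
    return best


print(local_score("TCAT", "TGCAA"))
```

To recover the local alignment itself, one would record the choice in each cell as in the global case, start the traceback at a cell holding the maximum, and stop at the first cell whose value is 0.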

FA05CSE182 Generalizing Gap Cost It is more likely for gaps to be contiguous The penalty for a gap of length l should be

FA05CSE182 Using affine gap penalties What is the time taken for this? What are the values that l can take? Can we get rid of the extra Dimension?

FA05CSE182 Affine gap penalties Define D[i,j]: score of the best alignment, given that the final column is a deletion (s[i] is aligned to a gap). Define I[i,j]: score of the best alignment, given that the final column is an insertion (t[j] is aligned to a gap).

FA05CSE182 O(nm) solution for affine gap costs
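
The slide's recurrences are not preserved in the transcript. The Python sketch below uses the common three-table formulation with D and I as defined on the previous slide. Assumptions: a gap of length l scores gap_open + l * gap_extend, with gap_open = -2 and gap_extend = -1 chosen only for illustration, and `affine_score` is a hypothetical name.

```python
NEG_INF = float("-inf")


def affine_score(s, t, match=1, mismatch=-1, gap_open=-2, gap_extend=-1):
    """Global alignment score where a gap of length l costs gap_open + l * gap_extend."""
    n, m = len(s), len(t)
    S = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # best overall
    D = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # last column: s[i] over '-'
    I = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # last column: '-' over t[j]

    S[0][0] = 0
    for i in range(1, n + 1):  # s[1..i] against one gap of length i
        D[i][0] = S[i][0] = gap_open + i * gap_extend
    for j in range(1, m + 1):  # t[1..j] against one gap of length j
        I[0][j] = S[0][j] = gap_open + j * gap_extend

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Either extend an existing gap or open a new one.
            D[i][j] = max(D[i - 1][j] + gap_extend,
                          S[i - 1][j] + gap_open + gap_extend)
            I[i][j] = max(I[i][j - 1] + gap_extend,
                          S[i][j - 1] + gap_open + gap_extend)
            sub = match if s[i - 1] == t[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + sub, D[i][j], I[i][j])
    return S[n][m]


print(affine_score("ACGT", "AT"))
```

Each cell of each table is computed in constant time from a fixed number of neighbors, so the total work stays O(nm) without an extra loop over gap lengths.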

FA05CSE182 Blast variants Blastn: DNA query, DNA database. Blastp: Protein query, protein database. Blastx, Tblastn, Tblastx –all require a 6 frame translation

FA05CSE182 The genetic code The DNA can be translated conceptually using triplets, and looking up the table. How many translations are possible? ATCGGATCGTGAT
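
A minimal Python sketch of the conceptual translation, applied to the slide's sequence. It assumes the standard genetic code (encoded compactly in T, C, A, G order, with `*` for stop) and an input containing only A, C, G, T; the function names are illustrative.

```python
BASES = "TCAG"
# Standard genetic code, codons enumerated in TCAG order; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Complement each base, then reverse, to read the opposite strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna, frame):
    """Translate complete triplets starting at offset frame (0, 1 or 2)."""
    codons = (dna[p:p + 3] for p in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


def six_frame(dna):
    """Three reading frames on the given strand, then three on its reverse complement."""
    rc = reverse_complement(dna)
    return [translate(strand, f) for strand in (dna, rc) for f in range(3)]


for protein in six_frame("ATCGGATCGTGAT"):
    print(protein)
```

Three starting offsets on each of the two strands give the six translations that Blastx, Tblastn, and Tblastx work with.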

FA05CSE182 Blast variants Blastx: DNA query, protein database Tblastn: Protein query, DNA database Tblastx: DNA query, DNA database –How is it different from Blastn?

FA05CSE182 Summary A critical part of BLAST is local sequence alignment. Optimum Local alignments between two sequences of length n and m can be computed in O(nm) time. The algorithm can be extended to deal with affine gap penalties. What is the memory requirement for this computation? Q: If one of the sequences was a large database, would it be feasible to compute these alignments? Q: What if the query and database were large?
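
On the memory question: storing the full table takes O(nm) space, but if only the score is needed, each row depends only on the row above it. A hedged Python sketch of that observation, shown for the global score with the example scoring (match 1, mismatch -1, indel -1); `global_score_two_rows` is an illustrative name, and recovering the alignment itself in small space needs additional techniques not shown here.

```python
def global_score_two_rows(s, t, match=1, mismatch=-1, indel=-1):
    """Global alignment score using O(len(t)) memory instead of a full table."""
    m = len(t)
    prev = [j * indel for j in range(m + 1)]  # row 0: t-prefix against gaps
    for i in range(1, len(s) + 1):
        curr = [i * indel] + [0] * m          # column 0: s-prefix against gaps
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + sub,    # s[i] over t[j]
                prev[j] + indel,      # s[i] over '-'
                curr[j - 1] + indel,  # '-' over t[j]
            )
        prev = curr                   # the finished row becomes the row above
    return prev[m]


print(global_score_two_rows("TCAT", "TGCAA"))
```

The running time is still O(nm), which is the cost that matters when one of the sequences is a large database.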