FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming www.cse.ucsd.edu/classes/fa09/cse182 www.cse.ucsd.edu/~vbafna.

Slides:



Advertisements
Similar presentations
Sequence Alignment I Lecture #2
Advertisements

Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.
Blast outputoutput. How to measure the similarity between two sequences Q: which one is a better match to the query ? Query: M A T W L Seq_A: M A T P.
Hidden Markov Models (1)  Brief review of discrete time finite Markov Chain  Hidden Markov Model  Examples of HMM in Bioinformatics  Estimations Basic.
Alignment methods Introduction to global and local sequence alignment methods Global : Needleman-Wunch Local : Smith-Waterman Database Search BLAST FASTA.
BLAST, PSI-BLAST and position- specific scoring matrices Prof. William Stafford Noble Department of Genome Sciences Department of Computer Science and.
Lecture 8 Alignment of pairs of sequence Local and global alignment
C E N T R F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U E Alignments 1 Sequence Analysis.
Sequence Alignment Storing, retrieving and comparing DNA sequences in Databases. Comparing two or more sequences for similarities. Searching databases.
Heuristic alignment algorithms and cost matrices
Multiple Sequence Alignment Algorithms in Computational Biology Spring 2006 Most of the slides were created by Dan Geiger and Ydo Wexler and edited by.
Fa05CSE 182 L3: Blast: Keyword match basics. Fa05CSE 182 Silly Quiz TRUE or FALSE: In New York City at any moment, there are 2 people (not bald) with.
Alignment methods and database searching April 14, 2005 Quiz#1 today Learning objectives- Finish Dotter Program analysis. Understand how to use the program.
1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.
Fa05CSE 182 L3: Blast: Local Alignment and other flavors.
Sequence Alignment Bioinformatics. Sequence Comparison Problem: Given two sequences S & T, are S and T similar? Need to establish some notion of similarity.
Fa05CSE 182 CSE182-L4: Scoring matrices, Dictionary Matching.
Pairwise Sequence Alignment Part 2. Outline Global alignments-continuation Local versus Global BLAST algorithms Evaluating significance of alignments.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez June 23, 2004.
Sequence similarity.
UNC Chapel Hill Lin/Manocha/Foskey Optimization Problems In which a set of choices must be made in order to arrive at an optimal (min/max) solution, subject.
Alignment methods June 26, 2007 Learning objectives- Understand how Global alignment program works. Understand how Local alignment program works.
Similar Sequence Similar Function Charles Yan Spring 2006.
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 20, 2003.
UNIVERSITY OF SOUTH CAROLINA College of Engineering & Information Technology Bioinformatics Algorithms and Data Structures Chapter 11: Core String Edits.
. Computational Genomics Lecture #3a (revised 24/3/09) This class has been edited from Nir Friedman’s lecture which is available at
FA05CSE182 CSE 182-L2:Blast & variants I Dynamic Programming
. Sequence Alignment II Lecture #3 This class has been edited from Nir Friedman’s lecture. Changes made by Dan Geiger, then by Shlomo Moran. Background.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Fa05CSE 182 CSE182-L5: Scoring matrices Dictionary Matching.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Protein Sequence Comparison Patrice Koehl
Developing Pairwise Sequence Alignment Algorithms Dr. Nancy Warter-Perez May 10, 2005.
Phylogenetic Tree Construction and Related Problems Bioinformatics.
FA05CSE182 CSE 182-L2:Blast & variants I Dynamic Programming
Class 2: Basic Sequence Alignment
Introduction to Bioinformatics From Pairwise to Multiple Alignment.
Alignment methods II April 24, 2007 Learning objectives- 1) Understand how Global alignment program works using the longest common subsequence method.
TM Biological Sequence Comparison / Database Homology Searching Aoife McLysaght Summer Intern, Compaq Computer Corporation Ballybrit Business Park, Galway,
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Developing Pairwise Sequence Alignment Algorithms
Sequence Analysis Alignments dot-plots scoring scheme Substitution matrices Search algorithms (BLAST)
BIOMETRICS Module Code: CA641 Week 11- Pairwise Sequence Alignment.
Pairwise & Multiple sequence alignments
Space-Efficient Sequence Alignment Space-Efficient Sequence Alignment Bioinformatics 202 University of California, San Diego Lecture Notes No. 7 Dr. Pavel.
Sequence Alignment Goal: line up two or more sequences An alignment of two amino acid sequences: …. Seq1: HKIYHLQSKVPTFVRMLAPEGALNIHEKAWNAYPYCRTVITN-EYMKEDFLIKIETWHKP.
Alignment methods April 26, 2011 Return Quiz 1 today Return homework #4 today. Next homework due Tues, May 3 Learning objectives- Understand the Smith-Waterman.
Pairwise Sequence Alignment BMI/CS 776 Mark Craven January 2002.
Sequence Analysis CSC 487/687 Introduction to computing for Bioinformatics.
Scoring Matrices April 23, 2009 Learning objectives- 1) Last word on Global Alignment 2) Understand how the Smith-Waterman algorithm can be applied to.
Alignment, Part I Vasileios Hatzivassiloglou University of Texas at Dallas.
Multiple sequence comparison (MSC) Reading: Setubal/Meidanis, 3.4 Gusfield, Algorithms on Strings, Trees and Sequences, chapter 14.
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College.
Intro to Alignment Algorithms: Global and Local Intro to Alignment Algorithms: Global and Local Algorithmic Functions of Computational Biology Professor.
Applied Bioinformatics Week 3. Theory I Similarity Dot plot.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
Sequence Alignments with Indels Evolution produces insertions and deletions (indels) – In addition to substitutions Good example: MHHNALQRRTVWVNAY MHHALQRRTVWVNAY-
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
Point Specific Alignment Methods PSI – BLAST & PHI – BLAST.
Optimization Problems In which a set of choices must be made in order to arrive at an optimal (min/max) solution, subject to some constraints. (There may.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
HomologyIf twp proteins are homologous, they have a common fold and a common ancestor If two proteins have >25% identity across their entire length, they.
Sequence Alignment. Assignment Read Lesk, Problem: Given two sequences R and S of length n, how many alignments of R and S are possible? If you.
Multiple String Comparison – The Holy Grail. Why multiple string comparison? It is the most critical cutting-edge toοl for extracting and representing.
Your friend has a hobby of generating random bit strings, and finding patterns in them. One day she come to you, excited and says: I found the strangest.
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Intro to Alignment Algorithms: Global and Local
Bioinformatics Algorithms and Data Structures
Presentation transcript:

FA08CSE182 CSE 182-L2:Blast & variants I Dynamic Programming

FA08CSE182 Notes Assignment 1 is online, due next Tuesday. Discussion section is optional. Use it as a resource. On the web-site, you’ll find some questions on lectures. Ideally, you should be able to answer the questions after attending these lectures (Not all of these are trivial, so please study them carefully).

FA08CSE182 Searching Sequence databases

FA08 CSE182 Query: >gi| |dbj|BAC | unnamed protein product [Mus musculus] MSSTKLEDSLSRRNWSSASELNETQEPFLNPTDYDDEEFLRYLWREYLHPKEYEWVLIAGY IIVFVVALIGNVLVCVAVWKNHHMRTVTNYFIVNLSLADVLVTITCLPATLVVDITETWFFGQSL CKVIPYLQTVSVSVSVLTLSCIALDRWYAICHPLMFKSTAKRARNSIVVIWIVSCIIMIPQAIVME CSSMLPGLANKTTLFTVCDEHWGGEVYPKMYHICFFLVTYMAPLCLMILAYLQIFRKLWCRQ IPGTSSVVQRKWKQQQPVSQPRGSGQQSKARISAVAAEIKQIRARRKTARMLMVVLLVFAIC YLPISILNVLKRVFGMFTHTEDRETVYAWFTFSHWLVYANSAANPIIYNFLSGKFREEFKAAF SCCLGVHHRQGDRLARGRTSTESRKSLTTQISNFDNVSKLSEHVVLTSISTLPAANGAGPLQ NWYLQQGVPSSLLSTWLEV What is the function of this sequence? Is there a human homolog? Which cellular organelle does it work in? (Secreted/membrane bound) Idea: Search a database of known proteins to see if you can find similar sequences which have a known function

FA08CSE182 Querying with Blast

FA08CSE182 Blast Output The output (Blastp query) is a series of protein sequences, ranked according to similarity with the query Each database hit is aligned to a subsequence of the query

FA08CSE182 Blast Output 1 query Schematic db Q beg S beg Q end S end S Id

FA08CSE182 Blast Output 2 (drosophila) Q beg S beg Q end S end S Id

FA08CSE182 The technological question How do we measure similarity between sequences? Percent identity? Number of sequence edit operations? Number of sequence edit operations? Implies a notion of alignment that includes indels Implies a notion of alignment that includes indels Technology question: Given two sequences, how do we compute a ‘good’ alignment? What is ‘good’? Technology question: Given two sequences, how do we compute a ‘good’ alignment? What is ‘good’? A T C A A C G T C A A T G G T A T C A A - C G - - T C A A T G G T

FA08CSE182 The biology question How do we interpret these results? –Similar sequence in the 3 species implies that the common ancestor of the 3 had an ancestral form of that sequence. –The sequence accumulates mutations over time. These mutations may be indels, or substitutions. A ‘good’ alignment might be one in which many residues are identical. However, –Hum and mus diverged more recently and so the sequences are more likely to be similar. –Paralogs can create big problems hum mus dros hummus? ?

FA08CSE182 Computing alignments What is an alignment? 2Xm table. Each sequence is a row, with interspersed gaps Columns describe the edit operations What is the score of an alignment? What is the score of an alignment? Score every column, and sum up the scores. Let C be the score function for the column Score every column, and sum up the scores. Let C be the score function for the column How do we compute the alignment with the best score? How do we compute the alignment with the best score? AA-TCGGA ACTCG-A

FA08CSE182 Optimum scoring alignments, and score of optimum alignment Instead of computing an optimum scoring alignment, we attempt to compute the score of an optimal alignment. Later, we will show that the two are equivalent

FA08CSE182 Computing Optimal Alignment score Observations: The optimum alignment has nice recursive properties: –The alignment score is the sum of the scores of columns. –If we break off at cell k, the left part and right part must be optimal sub-alignments. –The left part contains prefixes s[1..i], and t[1..j] for some i and some j (we don’t know the values of i and j) k t s

FA08CSE182 Optimum prefix alignments Consider an optimum alignment of the prefix s[1..i], and t[1..j] Look at the last cell, indexed by k. It can only have 3 possibilities. 1 k s t

FA08CSE182 3 possibilities for rightmost cell 1. s[i] is aligned to t[j] 2. s[i] is aligned to ‘-’ 3. t[j] is aligned to ‘-’ s[i] t[j] s[i] t[j] Optimum alignment of s[1..i-1], and t[1..j-1] Optimum alignment of s[1..i-1], and t[1..j] Optimum alignment of s[1..i], and t[1..j-1]

FA08CSE182 Optimal score of an alignment Let S[i,j] be the score of an optimal alignment of the prefix s[1..i], and t[1..j]. It must be one of 3 possibilities. s[i] t[j] Optimum alignment of s[1..i-1], and t[1..j-1] s[i] - Optimum alignment of s[1..i-1], and t[1..j] - Optimum alignment of s[1..i], and t[1..j-1] t[j] S[i,j] = C(s i,t j )+S(i-1,j-1) S[i,j] = C(s i,-)+S(i-1,j) S[i,j] = C(-,t j )+S(i,j-1)

FA08CSE182 Optimal alignment score Which prefix pairs (i,j) should we use? For now, simply use all. If the strings are of length m, and n, respectively, what is the score of the optimal alignment?

FA08CSE182 Sequence Alignment Recall: Instead of computing the optimum alignment, we are computing the score of the optimum alignment Let S[i,j] denote the score of the optimum alignment of the prefix s[1..i] and t [1..j]

FA08CSE182 An O(nm) algorithm for score computation The iteration ensures that all values on the right are computed in earlier steps. For i = 1 to n For j = 1 to m

FA08CSE182 Base case (Initialization)

FA08CSE182 A tableaux approach s n 1 i 1jn S[i,j-1]S[i,j] S[i-1,j] S[i-1,j-1] t Cell (i,j) contains the score S[i,j]. Each cell only looks at 3 neighboring cells

FA08CSE182 An Example Align s=TCAT with t=TGCAA Match Score = 1 Mismatch score = -1, Indel Score = -1 Score A1?, Score A2? T C A T - T G C A A T C A T T G C A A A1 A2

FA08CSE T G C A A T C A T Alignment Table

FA08CSE T G C A A T C A T Alignment Table S[4,5] = 1 is the score of an optimum alignment Therefore, A2 is an optimum alignment We know how to obtain the optimum Score. How do we get the best alignment?

FA07CSE182 Computing Optimum Alignment At each cell, we have 3 choices We maintain additional information to record the choice at each step. For i = 1 to n For j = 1 to m If (S[i,j]= S[i-1,j-1] + C(s i,t j )) M[i,j] = If (S[i,j]= S[i-1,j] + C(s i,-)) M[i,j] = If (S[i,j]= S[i,j-1] + C(-,t j ) ) M[i,j] = j-1 i-1 j i

FA07CSE182 T G C A A T C A T Computing Optimal Alignments

FA07CSE182 Retrieving Opt.Alignment T G C A A T C A T M[4,5]= Implies that S[4,5]=S[3,4]+C( A,T ) or A T M[3,4]= Implies that S[3,4]=S[2,3] +C( A,A ) or A T A A

FA08CSE182 Retrieving Opt.Alignment T G C A A T C A T M[2,3]= Implies that S[2,3]=S[1,2]+C( C,C ) or A T M[1,2]= Implies that S[1,2]=S[1,1] +C (-,G ) or A T A A A A C C C C - GT T

FA08CSE182 Algorithm to retrieve optimal alignment RetrieveAl(i,j) if (M[i,j] == `\’) return (RetrieveAl (i-1,j-1). ) else if (M[i,j] == `|’) return (RetrieveAl (i-1,j). ) sisi tjtj sisi - - tjtj else if (M[i,j] == `--’) return (RetrieveAl (i,j-1). ) return (RetrieveAl (i,j-1). )

FA08CSE182 Summary An optimal alignment of strings of length n and m can be computed in O(nm) time There is a tight connection between computation of optimal score, and computation of opt. Alignment –True for all DP based solutions

FA08CSE182 Global versus Local Alignment Consider s = ACCACCCCTT t = ATCCCCACAT ACCACCCCTT A TCCCCACAT ATCCCCACAT ACCACCCCT T Sometimes, this is preferable

FA08CSE182 Blast Outputs Local Alignment query Schematic db

FA08CSE182 Local Alignment Compute maximum scoring interval over all sub-intervals (a,b), and (a’,b’) How can we compute this efficiently? a b a’b’

FA08CSE182 Local Alignment Recall that in global alignment, we compute the best score for all prefix pairs s(1..i) and t(1..j). Instead, compute the best score for all sub- alignments that end in s(i) and t(j). What changes in the recurrence? a i a’j

FA08CSE182 Local Alignment The original recurrence still works, except when the optimum score S[i,j] is negative When S[i,j] <0, it means that the optimum local alignment cannot include the point (i,j). So, we must reset the score to 0. i i-1 j j-1 sisi tjtj

FA08CSE182 Local Alignment Trick (Smith-Waterman algorithm) How can we compute the local alignment itself?

FA07CSE182 Generalizing Gap Cost It is more likely for gaps to be contiguous The penalty for a gap of length l should be

End of Lecture 2 FA07CSE182