Fa07CSE 182 CSE182-L4: Database filtering. Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score.

Slides:



Advertisements
Similar presentations
Substitution matrices
Advertisements

Gapped BLAST and PSI-BLAST Altschul et al Presenter: 張耿豪 莊凱翔.
BLAST Sequence alignment, E-value & Extreme value distribution.
Gapped Blast and PSI BLAST Basic Local Alignment Search Tool ~Sean Boyle Basic Local Alignment Search Tool ~Sean Boyle.
Measuring the degree of similarity: PAM and blosum Matrix
Local alignments Seq X: Seq Y:. Local alignment  What’s local? –Allow only parts of the sequence to match –Results in High Scoring Segments –Locally.
Lecture outline Database searches
Heuristic alignment algorithms and cost matrices
Sequence similarity (II). Schedule Mar 23midterm assignedalignment Mar 30midterm dueprot struct/drugs April 6teams assignedprot struct/drugs April 13RNA.
March 2006Vineet Bafna Database Filtering. March 2006Vineet Bafna Project/Exam deadlines May 2 – Send to me with a title of your project May 9 –
Fa05CSE 182 CSE182-L4: Keyword matching. Fa05CSE 182 Backward scoring Defin S b [i,j] : Best scoring alignment of the suffixes s[i+1..n] and t[j+1..m]
Fa05CSE 182 L3: Blast: Keyword match basics. Fa05CSE 182 Silly Quiz TRUE or FALSE: In New York City at any moment, there are 2 people (not bald) with.
1 1. BLAST (Basic Local Alignment Search Tool) Heuristic Only parts of protein are frequently subject to mutations. For example, active sites (that one.
Fa05CSE 182 L3: Blast: Local Alignment and other flavors.
Heuristic alignment algorithms; Cost matrices 2.5 – 2.9 Thomas van Dijk.
Fa05CSE 182 CSE182-L4: Scoring matrices, Dictionary Matching.
Introduction to bioinformatics
Similar Sequence Similar Function Charles Yan Spring 2006.
Sequence Alignment III CIS 667 February 10, 2004.
Heuristic Approaches for Sequence Alignments
. Computational Genomics Lecture #3a (revised 24/3/09) This class has been edited from Nir Friedman’s lecture which is available at
Scoring matrices Identity PAM BLOSUM.
Bioinformatics Unit 1: Data Bases and Alignments Lecture 3: “Homology” Searches and Sequence Alignments (cont.) The Mechanics of Alignments.
Fa05CSE 182 CSE182-L5: Scoring matrices Dictionary Matching.
Alignment III PAM Matrices. 2 PAM250 scoring matrix.
Sequence similarity. Motivation Same gene, or similar gene Suffix of A similar to prefix of B? Suffix of A similar to prefix of B..Z? Longest similar.
Sequence alignment, E-value & Extreme value distribution
Alignment Statistics and Substitution Matrices BMI/CS 576 Colin Dewey Fall 2010.
Pairwise alignments Introduction Introduction Why do alignments? Why do alignments? Definitions Definitions Scoring alignments Scoring alignments Alignment.
An Introduction to Bioinformatics
Evolution and Scoring Rules Example Score = 5 x (# matches) + (-4) x (# mismatches) + + (-7) x (total length of all gaps) Example Score = 5 x (# matches)
Pairwise Sequence Alignment. The most important class of bioinformatics tools – pairwise alignment of DNA and protein seqs. alignment 1alignment 2 Seq.
Eric C. Rouchka, University of Louisville Sequence Database Searching Eric Rouchka, D.Sc. Bioinformatics Journal Club October.
Pairwise alignment of DNA/protein sequences I519 Introduction to Bioinformatics, Fall 2012.
BLAST Anders Gorm Pedersen & Rasmus Wernersson. Database searching Using pairwise alignments to search databases for similar sequences Database Query.
Comp. Genomics Recitation 3 The statistics of database searching.
Construction of Substitution Matrices
Sequence Alignment Csc 487/687 Computing for bioinformatics.
Chapter 3 Computational Molecular Biology Michael Smith
BLAST: Basic Local Alignment Search Tool Altschul et al. J. Mol Bio CS 466 Saurabh Sinha.
Sequence Comparison Algorithms Ellen Walker Bioinformatics Hiram College.
BLAST Slides adapted & edited from a set by Cheryl A. Kerfeld (UC Berkeley/JGI) & Kathleen M. Scott (U South Florida) Kerfeld CA, Scott KM (2011) Using.
Pairwise Local Alignment and Database Search Csc 487/687 Computing for Bioinformatics.
Biocomputation: Comparative Genomics Tanya Talkar Lolly Kruse Colleen O’Rourke.
Pairwise Sequence Alignment Part 2. Outline Summary Local and Global alignments FASTA and BLAST algorithms Evaluating significance of alignments Alignment.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Pairwise sequence alignment Lecture 02. Overview  Sequence comparison lies at the heart of bioinformatics analysis.  It is the first step towards structural.
Sequence Alignment.
Construction of Substitution matrices
Doug Raiford Phage class: introduction to sequence databases.
Bioinformatics Computing 1 CMP 807 – Day 2 Kevin Galens.
Step 3: Tools Database Searching
The statistics of pairwise alignment BMI/CS 576 Colin Dewey Fall 2015.
©CMBI 2005 Database Searching BLAST Database Searching Sequence Alignment Scoring Matrices Significance of an alignment BLAST, algorithm BLAST, parameters.
Dynamic programming with more complex models When gaps do occur, they are often longer than one residue.(biology) We can still use all the dynamic programming.
Your friend has a hobby of generating random bit strings, and finding patterns in them. One day she come to you, excited and says: I found the strangest.
Substitution Matrices and Alignment Statistics BMI/CS 776 Mark Craven February 2002.
Vineet Bafna. How can we compute the local alignment itself?
9/6/07BCB 444/544 F07 ISU Dobbs - Lab 3 - BLAST1 BCB 444/544 Lab 3 BLAST Scoring Matrices & Alignment Statistics Sept6.
Database Scanning/Searching FASTA/BLAST/PSIBLAST G P S Raghava.
Homology Search Tools Kun-Mao Chao (趙坤茂)
Homology Search Tools Kun-Mao Chao (趙坤茂)
BLAST Anders Gorm Pedersen & Rasmus Wernersson.
Homology Search Tools Kun-Mao Chao (趙坤茂)
Fast Sequence Alignments
Basic Local Alignment Search Tool
Homology Search Tools Kun-Mao Chao (趙坤茂)
BLAST Slides adapted & edited from a set by
Sequence alignment, E-value & Extreme value distribution
BLAST Slides adapted & edited from a set by
Presentation transcript:

Fa07CSE 182 CSE182-L4: Database filtering

Fa07CSE 182 Summary (through lecture 3) A2 is online We considered the basics of sequence alignment –Opt score computation –Reconstructing alignments –Local alignments –Affine gap costs –Space saving measures The local alignment algorithm (for biological strings) was given by Temple Smith, and Mike Waterman, and is called the Smith-Waterman algorithm (SW) Can we recreate Blast?

Fa07CSE 182 The birthday question True or False: 2 people in our class share a birthday? How large should the class be that you can answer this reliably

Fa07CSE 182 Basic Probability Q: Toss a coin many times: If it is HEADS, you get a dollar. Otherwise you lose a dollar. What is the money you expect to get (lose) after 100 tosses?

Fa07CSE 182 Basic Probability:Expectation Q: Toss a coin many times: If it is HEADS, and the previous time was HEADS too, you get a dollar. Otherwise you lose a dollar. What is the money you expect to get (lose) after 100 tosses?

Fa07CSE 182 Blast and local alignment Concatenate all of the database sequences to form one giant sequence. Do local alignment computation with the query.

Fa07CSE 182 Large database search Query (m) Database (n) Database size n=100M, Querysize m=1000. O(nm) = computations

Fa07CSE 182 Why is Blast Fast?

Fa07CSE 182 Observations For a typical query, there are only a few ‘real hits’ in the database. Much of the database is random from the query’s perspective Consider a random DNA string of length n. –Pr[A]=Pr[C] = Pr[G]=Pr[T]=0.25 Assume for the moment that the query is all A’s (length m). What is the probability that an exact match to the query can be found?

Fa07CSE 182 Basic probability Probability that there is a match starting at a fixed position i = 0.25 m What is the probability that some position i has a match. Dependencies confound probability estimates.

Fa07CSE 182 Basic Probability:Expectation Q: Toss a coin many times: If it is HEADS, and the previous time was HEADS too, you get a dollar. Otherwise you lose a dollar. What is the money you expect to get after n tosses? –Let X i be the amount earned in the i-th toss  Total money you expect to earn

Fa07CSE 182 Expected number of matches Expected number of matches can still be computed.  Let X i =1 if there is a match starting at position i, X i =0 otherwise  Expected number of matches = i

Fa07CSE 182 Expected number of exact Matches is small! Expected number of matches = n*0.25 m –If n=10 7, m=10, Then, expected number of matches = –If n=10 7, m=11 expected number of hits = 2.38 –n=10 7,m=12, Expected number of hits = 0.5 < 1 Bottom Line: An exact match to a substring of the query is unlikely just by chance.

Fa07CSE 182 Observation 2 What is the pigeonhole principle?  Suppose we are looking for a database string with greater than 90% identity to the query (length 100)  Partition the query into size 10 substrings. At least one much match the database string exactly

Fa07CSE 182 Why is this important? Suppose we are looking for sequences that are 80% identical to the query sequence of length 100. Assume that the mismatches are randomly distributed. What is the probability that there is no stretch of 10 bp, where the query and the subject match exactly? Rough calculations show that it is very low. Exact match of a short query substring to a truly similar subject is very high. –The above equation does not take dependencies into account –Reality is better because the matches are not randomly distributed

Fa07CSE 182 Just the Facts Consider the set of all substrings of the query string of fixed length W. –Prob. of exact match to a random database string is very low. –Prob. of exact match to a true homolog is very high. –Keyword Search (exact matches) is MUCH faster than sequence alignment

Fa07CSE 182 BLAST Consider all (m-W) query words of size W (Default = 11) Scan the database for exact match to all such words For all regions that hit, extend using a dynamic programming alignment. Can be many orders of magnitude faster than SW over the entire string Database (n)

Fa07CSE 182 Why is BLAST fast? Assume that keyword searching does not consume any time and that alignment computation the expensive step. Query m=1000, random Db n=10 7, no TP SW = O(nm) = 1000*10 7 = computations BLAST, W=11 E(#11-mer hits)= 1000* (1/4) 11 * 10 7 =2384 Number of computations = 2384*100*100=2.384*10 7 Ratio=10 10 /(2.384*10 7 )=420 Further speed improvements are possible

Fa07CSE 182 Keyword Matching How fast can we match keywords? Hash table/Db index? What is the size of the hash table, for m=11 Suffix trees? What is the size of the suffix trees? Trie based search. We will do this in class. AATCA 567

Fa07CSE 182 Silly Quiz Name a famous Bioinformatics Researcher Name a famous Bioinformatics Researcher who is a woman

Fa07CSE 182 Scoring DNA DNA has structure.

Fa07CSE 182 DNA scoring matrices So far, we considered a simple match/mismatch criterion. The nucleotides can be grouped into Purines (A,G) and Pyrimidines. Nucleotide substitutions within a group (transitions) are more likely than those across a group (transversions)

Fa07CSE 182 Scoring proteins Scoring protein sequence alignments is a much more complex task than scoring DNA –Not all substitutions are equal Problem was first worked on by Pauling and collaborators In the 1970s, Margaret Dayhoff created the first similarity matrices. –“One size does not fit all” –Homologous proteins which are evolutionarily close should be scored differently than proteins that are evolutionarily distant –Different proteins might evolve at different rates and we need to normalize for that

Fa07CSE 182 PAM 1 distance Two sequences are 1 PAM apart if they differ in 1 % of the residues. PAM 1 (a,b) = Pr[residue b substitutes residue a, when the sequences are 1 PAM apart] 1% mismatch

Fa07CSE 182 PAM1 matrix Align many proteins that are very similar –Is this a problem? 1 PAM evolutionary distance represents the time in which 1% of the residues have changed Estimate the frequency P b|a of residue a being substituted by residue b. PAM1(a,b) = P a|b = Pr(b will mutate to an a after 1 PAM evolutionary distance) Scoring matrix –S(a,b) = log 10 (P ab /P a P b ) = log 10 (P b|a /P b )

Fa07CSE 182 PAM 1 Top column shows original, and left column shows replacement residue = PAM1(a,b) = Pr(a|b)

Fa07CSE 182 PAM distance Two sequences are 1 PAM apart when they differ in 1% of the residues. When are 2 sequences 2 PAMs apart? 1 PAM 2 PAM

Fa07CSE 182 Higher PAMs PAM 2 (a,b) = ∑ c PAM 1 (a,c). PAM 1 (c,b) PAM 2 = PAM 1 * PAM 1 (Matrix multiplication) PAM 250 –= PAM 1 *PAM 249 –= PAM 1 250

Fa07CSE 182 Note: This is not the score matrix: What happens as you keep increasing the power?

Fa07CSE 182 END of L4