CSE182-L5: Scoring matrices Dictionary Matching

Presentation transcript:

CSE182-L5: Scoring matrices Dictionary Matching Fa05 CSE 182

Scoring DNA
DNA has structure.

DNA scoring matrices
So far, we considered a simple match/mismatch criterion.
The nucleotides can be grouped into purines (A, G) and pyrimidines (C, T).
Nucleotide substitutions within a group (transitions) are more likely than those across groups (transversions).
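
A minimal sketch of a DNA scoring function that treats transitions more leniently than transversions, as the slide above suggests. The function name `dna_score` and the numeric scores are assumptions chosen for illustration; the lecture does not give specific values.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def dna_score(a, b, match=1, transition=-1, transversion=-2):
    """Score one aligned nucleotide pair (score values are illustrative assumptions)."""
    if a == b:
        return match
    # A substitution within the same chemical group is a transition.
    same_group = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_group else transversion

print(dna_score("A", "G"))  # transition (purine to purine)
print(dna_score("A", "C"))  # transversion (purine to pyrimidine)
```

Any penalties work as long as the transition penalty is the milder one.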

Scoring proteins
Scoring protein sequence alignments is a much more complex task than scoring DNA.
Not all substitutions are equal.
The problem was first worked on by Pauling and collaborators.
In the 1970s, Margaret Dayhoff created the first similarity matrices.
“One size does not fit all”:
  Homologous proteins that are evolutionarily close should be scored differently from proteins that are evolutionarily distant.
  Different proteins might evolve at different rates, and we need to normalize for that.

PAM 1 distance
Two sequences are 1 PAM apart if they differ in 1% of the residues (1% mismatch).
$\mathrm{PAM}_1(a,b) = \Pr[\text{residue } b \text{ substitutes residue } a \text{ when the sequences are 1 PAM apart}]$

PAM1 matrix
Align many proteins that are very similar. Is this a problem?
The PAM1 matrix gives the probability of each substitution when 1% of the residues have changed.
Estimate the frequency $P_{b|a}$ of residue $a$ being substituted by residue $b$.
$S(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{P_{b|a}}{P_b}$
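
The log-odds formula above can be sketched in a few lines. The function name `log_odds_score` and the probabilities below are hypothetical, chosen only to show the sign of the score; they are not taken from a real PAM table.

```python
import math

def log_odds_score(p_b_given_a, p_b):
    """S(a,b) = log10(P(b|a) / P(b)); both arguments are probabilities in (0, 1]."""
    return math.log10(p_b_given_a / p_b)

# Hypothetical values for illustration only.
favoured = log_odds_score(0.10, 0.05)    # b replaces a more often than chance
disfavoured = log_odds_score(0.01, 0.05)  # b replaces a less often than chance
print(favoured > 0, disfavoured < 0)
```

A positive score means the substitution is seen more often than expected from the background frequency of b alone, and a negative score means it is seen less often.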

PAM 1

PAM distance
Two sequences are 1 PAM apart when they differ in 1% of the residues.
When are two sequences 2 PAMs apart?
[Figure labels: 1 PAM, 2 PAM]

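Higher PAMs
$\mathrm{PAM}_2(a,b) = \sum_c \mathrm{PAM}_1(a,c)\,\mathrm{PAM}_1(c,b)$
$\mathrm{PAM}_2 = \mathrm{PAM}_1 \cdot \mathrm{PAM}_1$ (matrix multiplication)
$\mathrm{PAM}_{250} = \mathrm{PAM}_1 \cdot \mathrm{PAM}_{249} = \mathrm{PAM}_1^{250}$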
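
A small sketch of how higher PAM matrices follow from repeated matrix multiplication. To stay self-contained it uses a toy two-letter alphabet with a made-up 1% substitution rate instead of the real 20 x 20 PAM1 matrix; `mat_mul` and `mat_pow` are illustrative helper names.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a non-negative integer power by repeated squaring."""
    n = len(m)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    base = m
    while power > 0:
        if power & 1:
            result = mat_mul(result, base)
        base = mat_mul(base, base)
        power >>= 1
    return result

# Toy "PAM1": each of two residues is replaced by the other with probability 0.01.
pam1 = [[0.99, 0.01],
        [0.01, 0.99]]
pam2 = mat_pow(pam1, 2)
pam250 = mat_pow(pam1, 250)
print(pam2[0][1], pam250[0][1])
```

In this toy model the off-diagonal entry of `pam2` is slightly below 0.02, because a residue that changes twice can change back.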

PAM250 based scoring matrix
$S_{250}(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{\mathrm{PAM}_{250}(a,b)}{P_b}$

Scoring using PAM matrices
Suppose we know that two sequences are 250 PAMs apart.
$S(a,b) = \log_{10}\frac{P_{ab}}{P_a P_b} = \log_{10}\frac{P_{b|a}}{P_b} = \log_{10}\frac{\mathrm{PAM}_{250}(a,b)}{P_b}$
How does it help?
  $S_{250}(A,V) \gg S_1(A,V)$
  Scoring human vs. Drosophila should use a higher PAM matrix than scoring human vs. mouse.
  An alignment with a smaller % identity could still have a higher score and be more significant.
[Figure labels: hum, mus, dros]

BLOSUM series of matrices
Henikoff & Henikoff: Sequence substitutions in evolutionarily distant proteins do not seem to follow the PAM distributions.
A more direct method, based on hand-curated multiple alignments of distantly related proteins from the BLOCKS database.
BLOSUM60: merge all proteins that have greater than 60% identity, then compute the substitution probabilities.
In practice, BLOSUM62 seems to work very well.

PAM vs. BLOSUM
What is the correspondence?
[Figure labels: PAM1, Blosum1, PAM2, Blosum2]

The last step in Blast
We have discussed:
  Alignments
  Db filtering using keywords
  E-values and P-values
  Scoring matrices
The last step: database filtering requires us to scan a large sequence quickly for matching keywords.

Dictionary Matching, R.E. matching, and position specific scoring

Keyword search
Recall: In BLAST, we get a collection of keywords from the query sequence, and identify all db locations with an exact match to a keyword.
Question: Given a collection of strings (keywords), find all occurrences in a database string where a keyword matches.

Dictionary Matching
Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE
Database: POTASTPOTATO
Q: Given k words ($s_i$ has length $l_i$) and a database of size n, find all matches to these words in the database string.
How fast can this be done?

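
As a baseline, the trivial dictionary-matching algorithm can be sketched as follows: try every word at every database position. The function name `naive_dictionary_match` is an assumption; hits are reported as (0-based start position, 1-based dictionary index).

```python
def naive_dictionary_match(words, db):
    """Compare every word against every database position: O((l1 + l2 + ...) * n)."""
    hits = []
    for index, word in enumerate(words, start=1):
        for start in range(len(db) - len(word) + 1):
            if db[start:start + len(word)] == word:
                hits.append((start, index))
    return sorted(hits)

dictionary = ["POTATO", "POTASSIUM", "TASTE"]
print(naive_dictionary_match(dictionary, "POTASTPOTATO"))  # [(6, 1)]
```

Only POTATO occurs in the example database, starting at 0-based position 6.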

Direct Algorithm
[Figure: pattern POTATO compared against database POTASTPOTATO]
Observations:
  When we mismatch, we (should) know something about where the next match will be.
  When there is a mismatch, we (should) know something about other patterns in the dictionary as well.

The Trie Automaton
Dictionary: 1:POTATO 2:POTASSIUM 3:TASTE
Construct an automaton A from the dictionary.
A[v,x] describes the transition from node v to a node w upon reading x.
  Example: A[u,’T’] = v, and A[u,’S’] = w
Special root node r.
Some nodes are terminal, and are labeled with the index of the dictionary word.
[Figure: keyword tree rooted at r, with terminal nodes 1, 2, 3 and nodes u, v, w marked]
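
One possible way to build the trie automaton is sketched below. The representation is an assumption made for illustration: nodes are integers (0 is the root r), `goto[v][x]` plays the role of A[v,x], and `terminal` maps a node to the index of the dictionary word that ends there.

```python
def build_trie(words):
    """Build a keyword tree; returns (goto, terminal)."""
    goto = [{}]     # goto[v][x] = w, the node reached from v on reading x
    terminal = {}   # terminal[v] = 1-based index of the word ending at node v
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in goto[v]:
                goto.append({})
                goto[v][x] = len(goto) - 1
            v = goto[v][x]
        terminal[v] = index
    return goto, terminal

goto, terminal = build_trie(["POTATO", "POTASSIUM", "TASTE"])
print(len(goto), sorted(goto[4]), terminal)
```

With this insertion order, node 4 spells POTA and branches on 'T' (towards POTATO) and 'S' (towards POTASSIUM), matching the node u in the slide.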

An $O(l_p n)$ algorithm for keyword matching
Start with the first position in the db, and the root node.
If successful transition
  Increment ‘current’ pointer
  Move to a new node
  If terminal node, “success”
Else
  Retract ‘current’ pointer
  Increment ‘start’ pointer
  Move to root & repeat
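
The steps above can be sketched as a restart-from-root scan over the keyword tree. This is an illustrative reading of the slide, not the lecture's own code; `build_trie` and `trie_search` are assumed names, and hits are (0-based start, 1-based dictionary index).

```python
def build_trie(words):
    """Keyword tree: goto[v][x] is the child of node v on symbol x; node 0 is the root."""
    goto, terminal = [{}], {}
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in goto[v]:
                goto.append({})
                goto[v][x] = len(goto) - 1
            v = goto[v][x]
        terminal[v] = index
    return goto, terminal

def trie_search(words, db):
    """For each start position, walk down from the root: O(lp * n) overall."""
    goto, terminal = build_trie(words)
    hits = []
    for start in range(len(db)):           # 'start' pointer
        v, current = 0, start              # retract 'current' and return to the root
        while current < len(db) and db[current] in goto[v]:
            v = goto[v][db[current]]       # successful transition
            current += 1
            if v in terminal:              # terminal node: report success
                hits.append((start, terminal[v]))
    return hits

print(trie_search(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO"))  # [(6, 1)]
```

Each start position costs at most lp transitions, which gives the O(lp n) bound.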

Illustration
[Figure: pointers l and c on database POTASTPOTATO, and the current node v in the keyword tree]

Idea for improving the time
Suppose we have partially matched pattern i (indicated by l and c), but fail subsequently.
If some other pattern j is to match, then a prefix of pattern j must equal a suffix of the first c-l characters of pattern i.
[Figure: database POTASTPOTATO with pointers l and c; pattern i = POTASSIUM, pattern j = TASTE; dictionary 1:POTATO 2:POTASSIUM 3:TASTE]

Improving speed of dictionary matching
Every node v corresponds to a string $s_v$ that is a prefix of some pattern.
Define F[v] to be the node u such that $s_u$ is the longest proper suffix of $s_v$ that is also a prefix of some pattern.
If we fail to match at v, we should jump to F[v], and commence matching from there.
Let lp[v] = $|s_u|$.
[Figure: keyword tree with nodes numbered 1 to 11]

An O(n) alg. for keyword matching
Start with the first position in the db, and the root node.
If successful transition
  Increment ‘current’ pointer
  Move to a new node
  If terminal node, “success”
Else (if at root)
  Increment ‘current’ pointer
  Move ‘start’ pointer
  Move to root
Else
  Move ‘start’ pointer forward
  Move to failure node
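
A compact sketch of the failure-link idea, written in the standard Aho-Corasick form. It differs from the slide in two labeled ways: it tracks only the current node and the database index rather than explicit 'start' and 'current' pointers, and it adds `out` lists (an addition not described on the slide) so that a word ending inside a longer partial match is still reported. All names are illustrative.

```python
from collections import deque

def build_automaton(words):
    """Keyword tree plus failure links; fail[v] corresponds to F[v] on the slide."""
    goto, fail, out = [{}], [0], [[]]
    for index, word in enumerate(words, start=1):
        v = 0
        for x in word:
            if x not in goto[v]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[v][x] = len(goto) - 1
            v = goto[v][x]
        out[v].append(index)
    queue = deque(goto[0].values())        # depth-1 nodes fail to the root
    while queue:                           # breadth-first, so shallower links exist first
        v = queue.popleft()
        for x, w in goto[v].items():
            u = fail[v]
            while u and x not in goto[u]:
                u = fail[u]
            fail[w] = goto[u].get(x, 0)
            out[w] = out[w] + out[fail[w]]  # inherit words that end at the suffix node
            queue.append(w)
    return goto, fail, out

def aho_corasick_search(words, db):
    """Report (0-based start, 1-based word index) for every occurrence in db."""
    goto, fail, out = build_automaton(words)
    hits, v = [], 0
    for c, x in enumerate(db):
        while v and x not in goto[v]:
            v = fail[v]                    # mismatch: jump to the failure node
        v = goto[v].get(x, 0)
        for index in out[v]:
            hits.append((c - len(words[index - 1]) + 1, index))
    return hits

print(aho_corasick_search(["POTATO", "POTASSIUM", "TASTE"], "POTASTPOTATO"))  # [(6, 1)]
```

On the example database the scan follows POTAS, falls back to the TAS node, then to T and the root, and finally matches POTATO without ever re-reading a database character.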

Illustration
[Figure: pointers l and c on database POTASTPOTATO, and the current node v in the keyword tree]

Time analysis
In each step, either c is incremented, or l is incremented.
Neither pointer is ever decremented (lp[v] < c-l).
l and c do not exceed n.
Total time <= 2n
[Figure: pointers l and c on database POTASTPOTATO]

Blast: Putting it all together
Input: Query of length m, database of size n
Select word-size, scoring matrix, gap penalties, E-value cutoff

Blast Steps
Generate an automaton of all query keywords.
Scan database using a “Dictionary Matching” algorithm (O(n) time). Identify all hits.
Extend each hit using a variant of the “local alignment” algorithm. Use the scoring matrix and gap penalties.
For each alignment with score S, compute the bit-score, E-value, and the P-value.
Sort according to increasing E-value until the cut-off is reached.
Output results.
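
A rough sketch of the first two steps (keyword generation and hit finding). It is a simplification under stated assumptions: keywords are just the exact w-mers of the query, and a hash-table lookup stands in for the dictionary-matching automaton. The names `query_words` and `find_hits` are hypothetical.

```python
def query_words(query, w):
    """Map each length-w word of the query to the list of its query positions."""
    words = {}
    for i in range(len(query) - w + 1):
        words.setdefault(query[i:i + w], []).append(i)
    return words

def find_hits(query, db, w):
    """Return (query position, database position) pairs where a word matches exactly."""
    words = query_words(query, w)
    hits = []
    for j in range(len(db) - w + 1):
        for i in words.get(db[j:j + w], []):
            hits.append((i, j))
    return hits

print(find_hits("POTATO", "POTASTPOTATO", 4))
```

Each hit would then be handed to the extension step; the hits at database positions 6, 7 and 8 lie on the same diagonal and belong to one alignment.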

Protein Sequence Analysis
What can you do if BLAST does not return a hit?
Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity.
A: Accept hits at a higher P-value. This increases the probability that the sequence similarity is a chance event.
How can we get around this paradox?
Reformulated Q: Suppose two sequences B and C have the same level of sequence similarity to sequence A. If A and B are related in function, can we assume that A and C are? If not, how can we distinguish?

Silly Quiz

Protein sequence motifs
Premise: The sequence of a protein gives clues about its structure and function.
Not all residues are equally important in determining function.
Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not.
How can we identify these key residues?

Prosite
“In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment. However, relationships can be revealed by the occurrence in its sequence of a particular cluster of residue types, which is variously known as a pattern, motif, signature or fingerprint. These motifs arise because specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity are conserved in both structure and sequence. These structural requirements impose very tight constraints on the evolution of this small but important portion(s) of a protein sequence. The use of protein sequence patterns or profiles to determine the function of proteins is becoming very rapidly one of the essential tools of sequence analysis. Many authors (3,4) have recognized this reality. Based on these observations, we decided in 1988, to actively pursue the development of a database of regular expression-like patterns, which would be used to search against sequences of unknown function.”
Kay Hofmann, Philipp Bucher, Laurent Falquet and Amos Bairoch, The PROSITE database, its status in 1999

Basic idea
It is a heuristic approach. Start with the following:
  A collection of sequences with the same function.
  Region/residues known to be significant for maintaining structure and function.
Develop a pattern of conserved residues around the residues of interest.
Iterate for appropriate sensitivity and specificity.

EX: Zinc Finger domain

Proteins containing zf domains
How can we find a motif corresponding to a zf domain?

From alignment to regular expressions
ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS
Pattern: ATH-[DE]
Search Swissprot with the resulting pattern.
Refine pattern to eliminate false positives.
Iterate.
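
As a quick check, the pattern ATH-[DE] can be written as the Python regular expression `ATH[DE]` and run against the three aligned sequences. This is only a sketch using Python's standard `re` module; searching Swissprot itself is not shown.

```python
import re

sequences = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
pattern = re.compile(r"ATH[DE]")   # A, T, H, then either D or E

# 0-based position of the motif in each sequence.
starts = [pattern.search(s).start() for s in sequences]
print(starts)  # [5, 5, 5]
```

All three sequences contain the motif at the same column, which is what the alignment shows.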

The sequence analysis perspective
Zinc Finger motif: C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
2 conserved C, and 2 conserved H.
How can we search a database using these motifs?
The motif is described using a regular expression. What is a regular expression?
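
One way to search with such a motif is to translate it into a standard regular expression. The converter below is a sketch that handles only the syntax elements appearing in the zinc finger pattern above (single residues, `x`, bracketed residue sets, and `(n)` or `(n,m)` repeat counts); the function name `prosite_to_regex` and the test sequence are made up for illustration.

```python
import re

ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern (restricted syntax) into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        m = ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = m.groups()
        atom = "." if residue == "x" else residue   # x means any residue
        if low is None:
            parts.append(atom)
        elif high is None:
            parts.append(f"{atom}{{{low}}}")         # exact repeat count
        else:
            parts.append(f"{atom}{{{low},{high}}}")  # repeat range
    return "".join(parts)

regex = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H")
print(regex)

# Synthetic sequence constructed to satisfy the pattern (not a real protein).
synthetic = "GG" + "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H" + "GG"
print(re.search(regex, synthetic) is not None)
```

Other PROSITE syntax, such as excluded residues or terminal anchors, is deliberately left out of this sketch.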

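Regular Expressions
Concise representation of a set of strings over alphabet Σ.
Described by a string over Σ and operator symbols such as +, · and *.
R is a r.e. if and only if
  R = {ε}, or
  R = {σ}, σ ∈ Σ, or
  R = R1 + R2, R = R1 · R2, or R = R1*, where R1 and R2 are r.e.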

Regular Expression
Q: Let Σ = {A,C,E}. Is (A+C)*EEC* a regular expression? *(A+C)? AC*..E?
Q: When is a string s in a regular expression? R = (A+C)*EEC*. Is CEEC in R? AEC? ACEE?
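
The membership questions above can be checked with Python's `re` module, writing the union operator + as `|`. This is a sketch for checking answers, assuming that a string is "in R" when the whole string matches.

```python
import re

# (A+C)*EEC* in the slide's notation; '+' (union) becomes '|' in Python syntax.
R = re.compile(r"(A|C)*EEC*")

def in_language(s):
    """True if the entire string s is matched by R."""
    return R.fullmatch(s) is not None

for s in ["CEEC", "AEC", "ACEE"]:
    print(s, in_language(s))
```

CEEC and ACEE are in R, while AEC is not, because R requires two consecutive E symbols.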

Regular Expression & Automata
Every R.E. can be expressed by an automaton (a directed graph) with the following properties:
  The automaton has a start and end node.
  Each edge is labeled with a symbol from Σ, or ε.
Suppose R is described by automaton A.
s ∈ R if and only if there is a path from start to end in A, labeled with s.

Examples: Regular Expression & Automata
(A+C)*EEC*
[Figure: automaton from start to end with edges labeled A, C, E, E, C]

Constructing automata from R.E
R = {ε}
R = {σ}, σ ∈ Σ
R = R1 + R2
R = R1 · R2
R = R1*
[Figure: the automaton built for each case, joined by ε edges]

Regular Expression Matching
Given a database D and a regular expression R, is a substring of D in R?
Is there a string D[l..c] that is accepted by the automaton of R?
Simpler Q: Is D[1..c] accepted by the automaton of R?

Alg. for matching R.E.
If D[1..c] is accepted by the automaton RA:
  There is a path labeled D[1]…D[c] that goes from START to END in RA.
[Figure: path with edges labeled ε, D[1], D[2], ..., D[c]]

Alg. for matching R.E.
If D[1..c] is accepted by the automaton RA:
  There is a path labeled D[1]…D[c] that goes from START to END in RA.
  There is a path labeled D[1]..D[c-1] from START to node u, and a path labeled D[c] from u to the END.
[Figure: path labeled D[1] .. D[c-1] to node u, followed by an edge labeled D[c]]

D.P. to match regular expression
Define:
  A[u,σ] = automaton node reached from u after reading σ.
  Eps(u) = set of all nodes reachable from node u using epsilon transitions.
  N[c] = subset of nodes reachable from START node after reading D[1..c].
Q: When is v ∈ N[c]?
[Figure: nodes u and v, with Eps(u)]

D.P. to match regular expression
Q: When is v ∈ N[c]?
A: If for some u ∈ N[c-1], w = A[u,D[c]], and v ∈ {w} ∪ Eps(w).

Algorithm

The final step
We have answered the question: Is D[1..c] accepted by R? Yes, if END ∈ N[c].
We need to answer: Is D[l..c] (for some l, and some c) accepted by R?
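
The recurrence for N[c] can be sketched as a set-based simulation of the automaton. The automaton below is a hand-built one for (A+C)*EEC*, with node numbers chosen for illustration. Two assumptions go beyond the slides: transitions map to sets of nodes rather than a single node, and the `anywhere` flag is one possible answer to the final question, re-adding the START nodes at every position so that a match may begin at any l.

```python
def eps_closure(nodes, eps):
    """All nodes reachable from `nodes` using only epsilon edges (includes `nodes`)."""
    closure, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for w in eps.get(u, ()):
            if w not in closure:
                closure.add(w)
                stack.append(w)
    return closure

def match_ends(db, delta, eps, start, end, anywhere=False):
    """Return every 1-based c such that D[1..c] (or some D[l..c] if anywhere) is accepted."""
    begin = eps_closure({start}, eps)
    current = set(begin)                                   # N[0]
    ends = []
    for c, symbol in enumerate(db, start=1):
        moved = {w for u in current for w in delta.get((u, symbol), ())}
        current = eps_closure(moved, eps)                  # N[c]
        if end in current:
            ends.append(c)
        if anywhere:
            current |= begin                               # allow a match to start at c+1
    return ends

# Hand-built automaton for (A+C)*EEC*: node 0 is START, node 4 is END.
delta = {(1, "A"): {1}, (1, "C"): {1}, (1, "E"): {2}, (2, "E"): {3}, (3, "C"): {3}}
eps = {0: {1}, 3: {4}}

print(match_ends("CEEC", delta, eps, 0, 4))                  # [3, 4]
print(match_ends("EAEEC", delta, eps, 0, 4, anywhere=True))  # [4, 5]
```

Each database symbol is processed once against a node set of bounded size, so the scan takes time proportional to n times the size of the automaton.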

Profiles versus regular expressions
Regular expressions are intolerant to an occasional mismatch.
The union operation (I+V+L) does not quantify the relative importance of I, V, L. It could be that V occurs in 80% of the family members.
Profiles capture some of these ideas.

Profiles
Start with an alignment of strings of length m, over an alphabet A.
Build an |A| x m matrix $F = (f_{ki})$.
Each entry $f_{ki}$ represents the frequency of symbol k in position i.
[Figure: example profile with entries 0.71, 0.14, 0.28, 0.14]
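
A minimal sketch of building the frequency matrix F from an ungapped alignment. The function name `build_profile` is an assumption, and the toy alignment is simply the four motif columns (ATH followed by D or E) from the earlier alignment slide.

```python
def build_profile(alignment, alphabet):
    """Return profile[k][i] = frequency of symbol k in column i of the alignment."""
    m = len(alignment[0])
    counts = {k: [0] * m for k in alphabet}
    for sequence in alignment:
        for i, symbol in enumerate(sequence):
            counts[symbol][i] += 1
    n = len(alignment)
    return {k: [count / n for count in row] for k, row in counts.items()}

profile = build_profile(["ATHD", "ATHD", "ATHE"], "ADEHT")
print(profile["D"][3], profile["E"][3])
```

Unlike the regular expression ATH-[DE], the profile records that D is twice as frequent as E in the last column.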

Scoring Profiles
[Figure: scoring matrix, profile entry $f_{ki}$ for symbol k at position i, and sequence s]

Psi-BLAST idea
Multiple alignments are important for capturing remote homology.
Profile based scores are a natural way to handle this.
Q: What if the query is a single sequence?
A: Iterate:
  Find homologs using Blast on query.
  Discard very similar homologs.
  Align, make a profile, search with profile.

Psi-BLAST speed
Two time-consuming steps:
  Multiple alignment of homologs
  Searching with profiles
Does the keyword search idea work?
Multiple alignment: use ungapped multiple alignments only.
Pigeonhole principle again:
  If a profile of length m must score >= T,
  then a sub-profile of length l must score >= lT/m.
  Generate all l-mers that score at least lT/m.
  Search using an automaton.
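
The pigeonhole step can be spelled out as a short derivation. It assumes, for simplicity, that l divides m and that the ungapped match is cut into m/l consecutive blocks whose scores $s_1, \dots, s_{m/l}$ add up to the total score.

```latex
% Assumption: l divides m; s_j is the score of the j-th length-l block.
T \;\le\; \sum_{j=1}^{m/l} s_j \;\le\; \frac{m}{l}\,\max_j s_j
\quad\Longrightarrow\quad
\max_j s_j \;\ge\; \frac{lT}{m}
```

So at least one length-l block of any match scoring T or more must itself score at least lT/m, which is why scanning for high-scoring l-mers cannot miss such a match.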

Databases of Motifs
Functionally related proteins have sequence motifs.
The sequence motifs can be represented in many ways, and different biological databases capture these representations:
  Collection of sequences (SMART)
  Multiple alignments (BLOCKS)
  Profiles (Pfam (HMMs) / Impala)
  Regular expressions (Prosite)
Different representations must be queried in different ways.

Databases of protein domains

Pfam
http://pfam.wustl.edu/
Also at Sanger

PROSITE
http://us.expasy.org/prosite/

BLOCKS
