Fa05CSE 182 CSE182-L4: Scoring matrices, Dictionary Matching.

Class Mailing List
To subscribe, send to
You can subscribe from the course web page.
Use the list for all course-related queries, discussions, …

Protein Sequence Analysis
What can you do if BLAST does not return a hit?
–Sometimes, homology (evolutionary similarity) exists at very low levels of sequence similarity.
A: Accept hits at a higher P-value.
–This increases the probability that the sequence similarity is a chance event.
–How can we get around this paradox?
–Reformulated Q: Suppose two sequences B and C have the same level of sequence similarity to sequence A. If A and B are related in function, can we assume that A and C are? If not, how can we distinguish?

Protein sequence motifs
Premise: The sequence of a protein gives clues about its structure and function.
Not all residues are equally important in determining function.
How can we identify these key residues?

Prosite
In some cases the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment. However, relationships can be revealed by the occurrence in its sequence of a particular cluster of residue types, which is variously known as a pattern, motif, signature or fingerprint. These motifs arise because specific region(s) of a protein which may be important, for example, for their binding properties or for their enzymatic activity are conserved in both structure and sequence. These structural requirements impose very tight constraints on the evolution of this small but important portion(s) of a protein sequence. The use of protein sequence patterns or profiles to determine the function of proteins is becoming very rapidly one of the essential tools of sequence analysis. Many authors (3,4) have recognized this reality. Based on these observations, we decided in 1988, to actively pursue the development of a database of regular expression-like patterns, which would be used to search against sequences of unknown function.
Kay Hofmann, Philipp Bucher, Laurent Falquet and Amos Bairoch, "The PROSITE database, its status in 1999"

Basic idea
It is a heuristic approach. Start with the following:
–A collection of sequences with the same function.
–Region/residues known to be significant for maintaining structure and function.
Develop a pattern of conserved residues around the residues of interest.
Iterate for appropriate sensitivity and specificity.

Zinc Finger domain

Proteins containing zf domains
How can we find a motif corresponding to a zf domain?

From alignment to regular expressions
ALRDFATHDDF
SMTAEATHDSI
ECDQAATHEAS
Resulting pattern: ATH-[DE]
Search Swissprot with the resulting pattern.
Refine the pattern to eliminate false positives.
Iterate.
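The step from the three aligned sequences to ATH-[DE] can be sketched in a few lines of Python. This is only an illustration under assumptions that are not in the slides: the alignment is ungapped, a column becomes a residue class when it holds at most `max_class` distinct residues (a hypothetical parameter), and wildcard columns at either end are trimmed. The function name `columns_to_pattern` is made up for this sketch.

```python
def columns_to_pattern(aligned, max_class=2):
    """Derive a PROSITE-style pattern from equal-length, ungapped aligned sequences."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("sequences must have equal length")
    elements = []
    for column in zip(*aligned):
        residues = sorted(set(column))
        if len(residues) == 1:
            elements.append(residues[0])                     # fully conserved column
        elif len(residues) <= max_class:
            elements.append("[" + "".join(residues) + "]")   # small residue class
        else:
            elements.append("x")                             # too variable: wildcard
    # Trim uninformative wildcards at both ends of the pattern.
    while elements and elements[0] == "x":
        elements.pop(0)
    while elements and elements[-1] == "x":
        elements.pop()
    return "-".join(elements)


alignment = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
pattern = columns_to_pattern(alignment)
print(pattern)  # A-T-H-[DE]
```

In practice the pattern would then be refined by hand against database hits, as the slide describes; this sketch covers only the first pass.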

The sequence analysis perspective
Zinc Finger motif
–C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
–2 conserved C, and 2 conserved H
How can we search a database using these motifs?
–The motif is described using a regular expression. What is a regular expression?
–How can we search for a match to a regular expression? Not allowed to use Perl :-)
The ‘regular expression’ motif is weak. How can we make it stronger?
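To see what the motif notation means operationally, and setting aside the no-Perl rule for a moment, the pattern can be translated into the syntax of Python's `re` module. This is a minimal sketch that handles only the notation used on the slide plus `{...}` exclusions; the function name `prosite_to_regex` and the test sequence are invented for illustration.

```python
import re

_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into Python `re` syntax."""
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":
            core = "."                          # x means any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"      # {AB} means anything except A or B
        if low is not None:                     # x(3) or x(2,4): repetition counts
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


ZINC_FINGER = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
zf_regex = prosite_to_regex(ZINC_FINGER)
print(zf_regex)  # C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H

# A made-up sequence containing one occurrence of the motif.
hit = re.search(zf_regex, "MKCPECGKSFSRSDELTRHIRTHTG")
print(hit.start(), hit.group())
```

A library regex engine hides exactly the matching machinery that the following slides build by hand.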

Profiles
Start with an alignment of strings of length m, over an alphabet A.
Build an $|A| \times m$ matrix $F = (f_{ki})$.
Each entry $f_{ki}$ represents the frequency of symbol k in position i.
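As an illustration of the definition, the frequency matrix F can be computed column by column. The sketch below assumes an ungapped alignment and stores F as a dictionary of rows, one per symbol; the function name `build_profile` and the toy alignment are not from the slides.

```python
from collections import Counter


def build_profile(aligned, alphabet):
    """Return F with F[k][i] = frequency of symbol k in column i of the alignment."""
    m = len(aligned[0])
    if any(len(s) != m for s in aligned):
        raise ValueError("sequences must have equal length")
    n = len(aligned)
    F = {k: [0.0] * m for k in alphabet}          # |A| rows, m columns
    for i, column in enumerate(zip(*aligned)):
        for k, count in Counter(column).items():
            F[k][i] = count / n
    return F


toy_alignment = ["ATHD", "ATHD", "ATHE"]
F = build_profile(toy_alignment, alphabet="ADEHT")
print(F["D"][3], F["E"][3])  # two thirds D and one third E in the last column
```

Real profile builders usually add pseudocounts and sequence weighting so that unseen residues do not get frequency zero; those refinements are left out here.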

Scoring Profiles
[Figure: column i of the profile, with frequency $f_{ki}$ for each symbol k, is scored against a symbol s using a scoring matrix]
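The exact formula on this slide did not survive in the transcript. One common choice, assumed here and not confirmed by the slides, is to score symbol s against profile column i as the frequency-weighted average of scoring-matrix entries, $\sum_k f_{ki}\, M[k, s]$. The toy matrix below (+1 for a match, -1 for a mismatch) is a stand-in for a real substitution matrix such as BLOSUM62.

```python
def column_score(F, M, i, s):
    """Assumed score of symbol s against profile column i: sum over k of f_ki * M[k][s]."""
    return sum(F[k][i] * M[k][s] for k in F)


def profile_score(F, M, seq):
    """Ungapped score of a sequence laid against the profile, position by position."""
    return sum(column_score(F, M, i, s) for i, s in enumerate(seq))


ALPHABET = "ADEHT"
# Toy profile: columns 0-2 are fully conserved (A, T, H); column 3 is 2/3 D and 1/3 E.
F = {
    "A": [1.0, 0.0, 0.0, 0.0],
    "T": [0.0, 1.0, 0.0, 0.0],
    "H": [0.0, 0.0, 1.0, 0.0],
    "D": [0.0, 0.0, 0.0, 2 / 3],
    "E": [0.0, 0.0, 0.0, 1 / 3],
}
# Hypothetical scoring matrix: +1 for identical symbols, -1 otherwise.
M = {k: {s: (1 if k == s else -1) for s in ALPHABET} for k in ALPHABET}

print(profile_score(F, M, "ATHD"))  # about 3.33
print(profile_score(F, M, "ATHE"))  # about 2.67
```

The sequence ending in D scores higher than the one ending in E because D is the more frequent residue in the last column.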

Psi-BLAST idea
Multiple alignments are important for capturing remote homology.
Profile-based scores are a natural way to handle this.
Q: What if the query is a single sequence?
A: Iterate:
–Find homologs using Blast on the query
–Discard very similar homologs
–Align, make a profile, search with the profile.

Psi-BLAST speed
Two time-consuming steps:
1. Multiple alignment of homologs
2. Searching with profiles
Searching with profiles: does the keyword search idea work?
Pigeonhole principle again:
–If a profile of length m must score >= T
–Then some sub-profile of length l must score >= lT/m
–Generate all l-mers that score at least lT/m
–Search using an automaton
Multiple alignment:
–Use ungapped multiple alignments only
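The pigeonhole step can be made concrete with a brute-force sketch. If the m profile positions are cut into m/l disjoint windows (assuming l divides m) and the whole match scores at least T, at least one window must contribute at least lT/m. The code below enumerates the l-mers that reach that bound for a given window. The position-specific scores, the names and the threshold values are all invented for illustration, and the enumeration is exponential in l, so it only suits short seeds.

```python
from itertools import product


def high_scoring_lmers(scores, start, l, threshold):
    """All l-mers whose score against profile positions start..start+l-1 is >= threshold.

    scores[i][s] is the score of symbol s at profile position i.
    """
    alphabet = sorted(scores[0])
    hits = []
    for lmer in product(alphabet, repeat=l):
        total = sum(scores[start + j][s] for j, s in enumerate(lmer))
        if total >= threshold:
            hits.append("".join(lmer))
    return hits


# Toy position-specific scores for a profile of length m = 4 over the alphabet {A, C}.
SCORES = [
    {"A": 2, "C": -1},
    {"A": -1, "C": 2},
    {"A": 2, "C": -1},
    {"A": -1, "C": 2},
]
T, l = 6, 2
m = len(SCORES)
threshold = l * T / m  # 3.0

# One seed list per disjoint window; these would be the keywords fed to the automaton.
seeds = {start: high_scoring_lmers(SCORES, start, l, threshold) for start in range(0, m, l)}
print(seeds)  # {0: ['AC'], 2: ['AC']}
```

Here the only full-length string scoring at least T is ACAC, and it does contain the seed AC in a window, as the pigeonhole argument predicts.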

CSE182-L6: Regular Expression Matching, Protein structure basics

The sequence analysis perspective
Zinc Finger motif
–C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H
–2 conserved C, and 2 conserved H
How can we search a database using these motifs?
–The motif is described using a regular expression. What is a regular expression?

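Regular Expressions
Concise representation of a set of strings over alphabet $\Sigma$.
Described by a string over $\Sigma$ and the operators +, · and *.
R is a r.e. if and only if it has one of the following forms:
–$R = \{\epsilon\}$
–$R = \{\sigma\}$, $\sigma \in \Sigma$
–$R = R_1 + R_2$
–$R = R_1 \cdot R_2$
–$R = R_1^*$
where $R_1$ and $R_2$ are themselves r.e.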

Regular Expression
Q: Let $\Sigma$ = {A,C,E}
–Is (A+C)*EEC* a regular expression?
–*(A+C)?
–AC*..E?
Q: When is a string s in a regular expression?
–R = (A+C)*EEC*
–Is CEEC in R?
–AEC?
–ACEE?
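The membership questions can be checked mechanically. In the sketch below the slide's `+` (union) is written as `|`, which is how Python's `re` module spells alternation, and `fullmatch` is used because the question is whether the whole string s belongs to R, not whether it contains a match.

```python
import re

# (A+C)*EEC* in the slide's notation; '+' (union) becomes '|' in Python syntax.
R = re.compile(r"(A|C)*EEC*")

membership = {s: R.fullmatch(s) is not None for s in ["CEEC", "AEC", "ACEE"]}
print(membership)  # {'CEEC': True, 'AEC': False, 'ACEE': True}
```

AEC fails because R requires two consecutive E symbols; ACEE succeeds with C* matching the empty string.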

Regular Expression & Automata
Every R.E. can be expressed by an automaton (a directed graph) with the following properties:
–The automaton has a start and an end node
–Each edge is labeled with a symbol from $\Sigma$, or $\epsilon$
Suppose R is described by automaton A.
$s \in R$ if and only if there is a path from start to end in A, labeled with s.

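Examples: Regular Expression & Automata
(A+C)*EEC*
[Figure: automaton for (A+C)*EEC*, with loops labeled A and C at the start node, two consecutive edges labeled E, and a loop labeled C at the end node]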

Constructing automata from R.E.
–$R = \{\epsilon\}$
–$R = \{\sigma\}$, $\sigma \in \Sigma$
–$R = R_1 + R_2$
–$R = R_1 \cdot R_2$
–$R = R_1^*$
[Figure: one automaton fragment per case, joined by $\epsilon$-labeled edges]
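The five cases map directly onto five small constructors. The slide's drawings are not preserved, so the wiring below follows the standard Thompson construction, which may differ in detail from the figures. An automaton is represented as a tuple (start, end, edges), with `None` as the label of an $\epsilon$-edge; all names are chosen for this sketch.

```python
from itertools import count

_fresh = count()  # supplies unused node ids


def epsilon():
    s, e = next(_fresh), next(_fresh)
    return s, e, [(s, None, e)]                      # R = {epsilon}


def symbol(a):
    s, e = next(_fresh), next(_fresh)
    return s, e, [(s, a, e)]                         # R = {sigma}


def union(f, g):                                     # R = R1 + R2
    s, e = next(_fresh), next(_fresh)
    return s, e, f[2] + g[2] + [(s, None, f[0]), (s, None, g[0]),
                                (f[1], None, e), (g[1], None, e)]


def concat(f, g):                                    # R = R1 . R2
    return f[0], g[1], f[2] + g[2] + [(f[1], None, g[0])]


def star(f):                                         # R = R1*
    s, e = next(_fresh), next(_fresh)
    return s, e, f[2] + [(s, None, f[0]), (f[1], None, e),
                         (s, None, e), (f[1], None, f[0])]


def accepts(nfa, string):
    """True if some start-to-end path is labeled with `string`."""
    start, end, edges = nfa

    def closure(nodes):
        seen, stack = set(nodes), list(nodes)
        while stack:
            u = stack.pop()
            for p, label, q in edges:
                if p == u and label is None and q not in seen:
                    seen.add(q)
                    stack.append(q)
        return seen

    current = closure({start})
    for ch in string:
        current = closure({q for p, label, q in edges if p in current and label == ch})
    return end in current


# (A+C)*EEC*
R = concat(concat(concat(star(union(symbol("A"), symbol("C"))), symbol("E")),
                  symbol("E")), star(symbol("C")))
print([accepts(R, s) for s in ["CEEC", "AEC", "ACEE"]])  # [True, False, True]
```

The resulting automaton has many more $\epsilon$-edges than a hand-drawn one, but its size stays linear in the length of the expression.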

Regular Expression Matching
Given a database D, and a regular expression R, is a substring of D in R?
Is there a string D[l..c] that is accepted by the automaton of R?
Simpler Q: Is D[1..c] accepted by the automaton of R?

Alg. for matching R.E.
If D[1..c] is accepted by the automaton $R_A$:
–There is a path labeled D[1]…D[c] that goes from START to END in $R_A$
–There is a path labeled D[1]..D[c-1] from START to some node u, and a path labeled D[c] from u to END
[Figure: a path from START to u labeled D[1]..D[c-1], followed by a path from u to END labeled D[c]]

D.P. to match regular expression
Define:
–A[u,$\sigma$] = automaton node reached from u after reading $\sigma$
–Eps(u): set of all nodes reachable from node u using epsilon transitions.
–N[c] = subset of nodes reachable from the START node after reading D[1..c]
–Q: when is $v \in N[c]$?
[Figure: node u, an edge labeled $\sigma$, node v, and the set Eps(u)]

D.P. to match regular expression
Q: when is $v \in N[c]$?
A: If for some $u \in N[c-1]$, $w = A[u, D[c]]$ and $v \in \{w\} \cup \mathrm{Eps}(w)$

Algorithm
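The recurrence for N[c] translates into a short loop. The sketch below is one possible rendering, not the slide's own pseudocode (which is not preserved). It stores A[u, σ] as a set of nodes to allow nondeterminism, and it uses a hand-built automaton for (A+C)*EEC* with one extra $\epsilon$-edge so that Eps is exercised. The node numbering and all names are assumptions.

```python
def eps_closure(nodes, eps):
    """The given nodes plus everything reachable from them by epsilon transitions."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def reachable_sets(D, A, eps, start):
    """N[c] = set of nodes reachable from START after reading D[1..c]; N[0] is the base case."""
    N = [eps_closure({start}, eps)]
    for ch in D:
        step = set()
        for u in N[-1]:                       # some u in N[c-1]
            step |= A.get((u, ch), set())     # w = A[u, D[c]]
        N.append(eps_closure(step, eps))      # v in {w} together with Eps(w)
    return N


# Hand-built automaton for (A+C)*EEC*: node 0 is START, node 3 is END.
A = {(0, "A"): {0}, (0, "C"): {0}, (0, "E"): {1}, (1, "E"): {2}, (2, "C"): {2}}
EPS = {2: {3}}
START, END = 0, 3

N = reachable_sets("CEEC", A, EPS, START)
print(N)             # [{0}, {0}, {1}, {2, 3}, {2, 3}]
print(END in N[-1])  # True: D[1..4] = CEEC is accepted
```

Each step touches every node in the previous set once, so the running time is proportional to the length of D times the size of the automaton.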

The final step
We have answered the question:
–Is D[1..c] accepted by R?
–Yes, if $\mathrm{END} \in N[c]$
We need to answer:
–Is D[l..c] (for some l, and some c) accepted by R?
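The slide leaves this question open. One standard answer, offered here as an assumption and not as the lecture's own solution, is to let a match begin at any position by adding the START node (with its epsilon closure) back into the reachable set after every character. The function then reports every end position c for which some non-empty D[l..c] is accepted. The automaton is the same hand-built one for (A+C)*EEC*, and all names are hypothetical.

```python
def eps_closure(nodes, eps):
    """The given nodes plus everything reachable from them by epsilon transitions."""
    seen, stack = set(nodes), list(nodes)
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def find_match_ends(D, A, eps, start, end):
    """1-based positions c such that some non-empty substring D[l..c] is accepted."""
    restart = eps_closure({start}, eps)
    current = set(restart)
    ends = []
    for c, ch in enumerate(D, start=1):
        step = set()
        for u in current:
            step |= A.get((u, ch), set())
        reached = eps_closure(step, eps)
        if end in reached:
            ends.append(c)
        current = reached | restart       # a new match may also start at position c + 1
    return ends


# Hand-built automaton for (A+C)*EEC*: node 0 is START, node 3 is END.
A = {(0, "A"): {0}, (0, "C"): {0}, (0, "E"): {1}, (1, "E"): {2}, (2, "C"): {2}}
EPS = {2: {3}}

print(find_match_ends("GGACEECGG", A, EPS, start=0, end=3))  # [6, 7]
```

Position 6 ends the match ACEE (or just EE) and position 7 ends ACEEC; recovering the start position l as well would need extra bookkeeping that this sketch omits.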

A structural view of proteins

CS view of a protein
>sp|P00974|BPT1_BOVIN Pancreatic trypsin inhibitor precursor (Basic protease inhibitor) (BPI) (BPTI) (Aprotinin) - Bos taurus (Bovine).
MKMSRLCLSVALLVLLGTLAASTPGCDTSNQAKAQ
RPDFCLEPPYTGPCKARIIRYFYNAKAGLCQTFVYGG
CRAKRNNFKSAEDCMRTCGGAIGPWENL

Protein structure basics

Side chains determine amino-acid type
The residues may have different properties.
Aspartic acid (D) and glutamic acid (E) are acidic residues.

Bond angles form structural constraints

Various constraints determine 3d structure
Constraints:
–Structural constraints due to physicochemical properties
–Constraints due to bond angles
–H-bond formation
Surprisingly, a few conformations are seen over and over again.

Alpha-helix
3.6 residues per turn.
H-bonds between residue i and residue i+4 stabilize the structure.
First described by Linus Pauling.

Beta-sheet
Each strand by itself has 2 residues per turn, and is not stable.
Adjacent strands hydrogen-bond to form stable beta-sheets, parallel or anti-parallel.
Beta-sheets have long-range interactions that stabilize the structure, while alpha-helices have local interactions.

Domains
The basic structures (helix, strand, loop) combine to form complex 3D structures.
Certain combinations are popular. Many sequences, but only a few folds.

3D structure
Predicting tertiary structure is an important problem in Bioinformatics.
Premise: Clues to structure can be found in the sequence.
While de novo tertiary structure prediction is hard, there are many intermediate, and tractable, goals.

Protein Domains
An important realization (in the last decade) is that proteins have a modular architecture of domains/folds.
Example: The zinc finger domain is a DNA-binding domain.
What is a domain?
–Part of a sequence that can fold independently, and is present in other sequences as well

Proteins containing zf domains
How can we find a motif corresponding to a zf domain?

Domain review
What is a domain?
How are domains expressed?
–Motifs (regular expressions & others)
–Multiple alignments
–Profiles
–Profile HMMs

Databases of protein domains

Also at Sanger

PROSITE

HMMER programs
hmmalign
–Align a sequence to an HMM
hmmbuild
–Build a model from a multiple alignment
hmmemit
–Emit a probabilistic sequence from an HMM
hmmpfam
–Search PFAM with a sequence query
hmmsearch
–Search a sequence database with an HMM query

Post-translational modification
Residues undergo modification, usually by addition of a chemical group.
Key mechanism for signal transduction, and many other cellular functions.
Some modifications might require single residues (e.g. phosphorylation).
Others might require a pattern.

Protein targeting
In 1970, Gunter Blobel showed that proteins have an N-terminal signal sequence which directs proteins to the membrane.
Proteins have to be transported to other organelles: nucleus, mitochondria, …
Can we computationally identify the ‘signal’ which distinguishes the cellular compartment?

For transmembrane proteins, can we predict the transmembrane, outer, and inner regions?

Multiple alignment tools

Tools for secondary structure prediction
Each residue must be given a state: Helix, Loop, Strand.
HMMs/neural networks are used to predict the state.

Next topic: Gene finding