Predicting Function (& location & post-translational modifications) from Protein Sequences June 15, 2015.

Outline. Usefulness of protein domain analysis; types of protein domain databases; InterProScan of multiple domain databases; using the SMART database; predicting post-translational modifications.

When annotation is NOT enough. You’ve got a list of genes, most of which have been annotated with Gene Ontology terms and a potential protein function. Why would you want to go on and look more specifically at the protein domains?

Limitations of annotation. Even in a model organism with a large amount of resources, most genes are still annotated by similarity. Often, the name given is based on the BEST match to a particular domain or known protein. But…

Limitations of BLAST. Likelihood of finding a homolog to a sequence: >80% bacteria, >70% yeast, ~60% animal. The rest are truly novel sequences. ~900/6500 proteins in yeast are without a known function. A name such as "Similar to yeast protein YAL7400" is not very informative.

Limitations of similarity. Proteins with more than one domain cause problems: numerous matches to one domain can mask matches to other domains. As protein databases grow, the number of related sequences rises and hits to less related sequences may be lost. Low-complexity regions can mask domain matches.
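
To make the low-complexity point concrete, here is a minimal sketch that flags windows of a sequence with low Shannon entropy. It is only an illustration: the window size, the entropy threshold and the toy sequence are assumptions chosen for this example, and this is not the filter that BLAST or any domain database actually applies.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(sequence: str, window: int = 12, threshold: float = 2.2):
    """Return 0-based start positions of windows whose entropy is below threshold.

    window and threshold are arbitrary illustrative values, not tuned settings.
    """
    hits = []
    for start in range(len(sequence) - window + 1):
        if shannon_entropy(sequence[start:start + window]) < threshold:
            hits.append(start)
    return hits


# Made-up sequence: two arbitrary flanks around a poly-Q stretch.
seq = "MKTAYIAKQRQISFVKSHFSRQ" + "QQQQQQQQQQQQ" + "LEERLGLIEVQAPILSRVGDGT"
print(low_complexity_windows(seq))
```

Windows that overlap the poly-Q stretch are reported; such regions could then be masked before a similarity search so that they do not dominate the hit list.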

Proteins are modular. Individual domains can and often do fold independently of other domains within the same protein. Domains can function as independent units (or truncation experiments would never work). Thus the identity of ALL protein domains within a sequence can provide further clues about the protein’s function.

Proteins can have >1 domain. The name "protein kinase receptor UFO" doesn’t necessarily tell you that this protein also contains IgG and fibronectin domains, or that it has a transmembrane domain.

Domains are not always functional. If a critical residue is missing in an active site, the domain is not likely to be functional. A similarity score won’t pick that up.
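
A small sketch of why a similarity score alone can miss this: given a pairwise alignment to a reference domain, one can look up the query residue aligned to each critical reference position. The function names, the toy sequences and the choice of position 5 as the "critical" residue are all hypothetical, invented for this illustration (loosely modelled on a catalytic-loop Asp in kinases).

```python
def residues_at_reference_positions(ref_aligned, query_aligned, ref_positions):
    """Map 1-based positions of the ungapped reference to the aligned query residue."""
    mapping = {}
    ref_index = 0
    for ref_res, query_res in zip(ref_aligned, query_aligned):
        if ref_res == "-":
            continue  # insertion in the query: no reference position is consumed
        ref_index += 1
        if ref_index in ref_positions:
            mapping[ref_index] = query_res
    return mapping


def missing_critical_residues(ref_aligned, query_aligned, critical):
    """critical: dict of 1-based reference position -> required residue.

    Returns {position: residue actually found} for every position that fails.
    """
    observed = residues_at_reference_positions(ref_aligned, query_aligned, critical)
    return {
        pos: observed.get(pos, "-")
        for pos, required in critical.items()
        if observed.get(pos, "-") != required
    }


# Hypothetical toy alignment: the two queries differ from each other at one position only.
ref = "LVHRDLKPEN"
active = "LVHRDLKPEN"
dead = "LVHRNLKPEN"   # D -> N at the assumed critical position
critical = {5: "D"}

print(missing_critical_residues(ref, active, critical))
print(missing_critical_residues(ref, dead, critical))
```

Both queries are at least 90% identical to the reference, yet only the first keeps the residue marked as critical, which is the distinction a bare similarity score does not make.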

Multiple protein domain databases

Protein signature databases. Identify domains or classify proteins into families to allow inference of function. Approaches include: regular expressions and profiles; position-specific scoring matrix-based fingerprints; automated sequence clustering; Hidden Markov Models (HMMs).

PROSITE. Regular expression patterns describing functional motifs, e.g. M-x-G-x(3)-[IV](2)-x(2)-{FWY}: enzyme catalytic sites, prosthetic group attachment sites, ligand or metal binding sites. A sequence either matches or not. Some families/domains are defined by co-occurrence of patterns. Syntax: x = any amino acid; [] = any of the amino acids within the square brackets; {} = any amino acid except those within the curly braces; a number in parentheses after an element gives the number of times that element is repeated.

Citrate synthase: G-[FYAV]-[GA]-H-x-[IV]-x(1,2)-[RKTQ]-x(2)-[DV]-[PS]-R
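
Because PROSITE patterns are essentially regular expressions, the citrate synthase pattern G-[FYAV]-[GA]-H-x-[IV]-x(1,2)-[RKTQ]-x(2)-[DV]-[PS]-R can be translated mechanically. The sketch below is a minimal converter written for this illustration: the function name prosite_to_regex and the toy sequence are assumptions, and it handles only x, [..], {..} and repeat counts, not the < and > anchors or other parts of the full syntax.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Split off an optional repeat count such as (3) or (1,2).
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            regex = "."                          # any amino acid
        elif core.startswith("{") and core.endswith("}"):
            regex = "[^" + core[1:-1] + "]"      # any amino acid except these
        else:
            regex = core                         # a single residue or a [..] set
        if low and high:
            regex += "{%s,%s}" % (low, high)
        elif low:
            regex += "{%s}" % low
        parts.append(regex)
    return "".join(parts)


pattern = "G-[FYAV]-[GA]-H-x-[IV]-x(1,2)-[RKTQ]-x(2)-[DV]-[PS]-R"
regex = prosite_to_regex(pattern)

# Made-up sequence containing one stretch that satisfies the pattern.
toy = "MSTGFGHAIAKAADPRLL"
match = re.search(regex, toy)
print(regex)
print(match.group(0) if match else "no match")
```

As on the slide, the result is binary: a sequence either contains a stretch that satisfies the pattern or it does not, with no score attached.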

PRINTS. Similar to PROSITE patterns. Multiple-motif approach using either identity or weight matrices as the basis. Groups of conserved motifs provide diagnostic protein family signatures. Can be created at super-family, family and sub-family level. http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php

Profile-HMMs. Models are generated from alignments of many homologues by counting the frequency of occurrence of each amino acid in each column of the alignment (the profile). Profile-HMMs turn these into probabilities of occurrence against a background evolutionary model that accounts for possible substitutions. This provides a convenient and powerful way of identifying homology between sequences, and can find domains in sequences that would never be found by BLAST alone.
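
The column-counting step can be sketched in a few lines. This is a simplified position-specific scoring profile, not a full profile-HMM: it has no insert or delete states, and the uniform background, the pseudocount and the four-sequence toy alignment are assumptions made for this illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background frequencies
    profile = []
    for column in zip(*alignment):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        profile.append(scores)
    return profile


def score(profile, segment):
    """Sum the per-column log-odds scores for an ungapped segment."""
    return sum(col[res] for col, res in zip(profile, segment))


# Made-up alignment of four short, equal-length sequences.
alignment = ["GFGHA", "GYGHS", "GFAHA", "GVGHT"]
profile = column_profile(alignment)

print(score(profile, "GFGHA"))  # a family-like segment
print(score(profile, "WWWWW"))  # an unrelated segment
```

Residues that are frequent in a column score above zero and residues never seen there score below zero, so a segment resembling the family outscores an unrelated one even when no single sequence in the alignment matches it closely.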

HMM domain databases (Pfam, SMART, TIGRFAMs, PANTHER). Classify novel sequences into protein domain profiles. Pfam: most comprehensive; >13,000 protein families (v26). SMART: signaling, extracellular and chromatin proteins; identification of catalytic site conservation for enzymes. TIGRFAMs: families of proteins from prokaryotes. PANTHER: classification based on function using literature evidence.

PFAM. >16,230 manually curated profiles. Can use a profile to search a genome for matches.

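Can submit a protein to PFAM. Limited to single protein submission. The output gives an E-value for each domain hit: the number of matches with an equally good score expected by chance, so a lower E-value is stronger evidence that the domain is there. It is up to you to determine whether the domain is functional. http://pfam.xfam.org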

Keyword search

PFAM Summary

PFAM Domain Organization

PFAM Interactions

SMART database. SMART: Simple Modular Architecture Research Tool. Focus on signaling, extracellular and chromatin-associated proteins. Curated models for >1200 domains. Use? I have several kinase domains in my protein list and want to know which ones are functional. What other domains are found in signaling proteins?

Search for matches: enter a UniProt or Ensembl protein accession number, or a protein sequence. Add other searches.

SMART output. Mouse over for information. Prediction of FUNCTIONAL catalytic activity.

Can browse the domains

InterProScan. Combines search methods from several protein databases. Uses tools provided by member databases. Uses threshold scores for profiles & motifs. InterPro is a convenient means of deriving a consensus among signature methods.

Define which domain databases to search

Example InterProScan search Submitting an olfactory receptor gene (member of the GPCR class of proteins) to InterPro

InterPro family 2nd InterPro family

Submitting a different human GPCR protein to InterPro

Same InterPro family New InterPro family

InterProScan Families

InterProScan annotation

SMART & PFAM search SMART DB results: PFAM DB results:

Are 2 proteins homologs? S. cerevisiae Ste3 is a GPCR pheromone receptor. Similarity to C. gattii protein: 25% identical, 45% similar, E-value 10^-25.
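
For reference, percent identity and percent similarity are simple counts over the columns of a pairwise alignment. The sketch below is an approximation for illustration only: BLAST counts "similar" positions with a substitution matrix such as BLOSUM62, whereas the residue groups here are a crude stand-in chosen for this example, and the two aligned sequences are made up.

```python
# Assumption: crude physico-chemical groups used in place of a substitution matrix.
SIMILAR_GROUPS = ["ILMV", "FWY", "KRH", "DE", "NQ", "ST", "AG"]


def identity_and_similarity(aligned_a: str, aligned_b: str):
    """Return (% identical, % similar) over the full alignment length."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    group_of = {res: i for i, group in enumerate(SIMILAR_GROUPS) for res in group}
    identical = similar = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            continue  # gap columns count towards the length but never match
        if a == b:
            identical += 1
            similar += 1
        elif a in group_of and group_of[a] == group_of.get(b):
            similar += 1
    length = len(aligned_a)
    return 100.0 * identical / length, 100.0 * similar / length


# Made-up aligned pair of length 10.
pct_id, pct_sim = identity_and_similarity("MKVLDESTAG", "MRVLEQSAAC")
print(pct_id, pct_sim)
```

Percentages like these describe the alignment only. Whether two proteins are homologs is better judged together with the E-value and with domain content and arrangement.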

Very similar domain content and arrangement

Advantage of InterProScan. InterPro integrates the different databases to create a protein family signature. Pfam/SMART/PANTHER/Gene3D & TIGRFAMs will find domain families. PROSITE can find very specific signature patterns. PRINTS can distinguish related members of the same protein family. Cannot change the statistical cut-off for what is considered a significant match.

Function from sequence. Membrane bound or secreted? GPI anchored? Cellular localization? Post-translational modification sites?

CBS prediction services. Protein sorting: SignalP, TargetP, others. Post-translational modification: acetylation, phosphorylation, glycosylation. Immunological features: epitopes, MHC allele binding, etc. Protein function & structure: transmembrane domains, co-evolving positions.

Transmembrane domain prediction

Phosphorylation prediction

O-glycosylation

EMBOSS. Open source software for molecular biology. Predict antigenic sites (useful if you want to design a peptide antibody). Look for specific motifs, even degenerate ones (e.g. known phosphorylation motifs). Find motifs in multiple sequences with one submission. Get stats on protein/nucleic acid sequences. Sequence manipulation of all kinds.
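
Scanning several sequences for a degenerate motif in one pass can be sketched with plain regular expressions. This is not EMBOSS itself: the function name find_motif, the two toy sequences and the motif (two basic residues, any residue, then Ser or Thr, resembling consensus sites reported for basophilic kinases such as PKA) are assumptions chosen for this illustration.

```python
import re


def find_motif(sequences, motif_regex):
    """Return {name: [1-based start positions]} for every match, overlaps included."""
    # A lookahead lets overlapping occurrences be reported as well.
    pattern = re.compile("(?=(" + motif_regex + "))")
    return {
        name: [m.start() + 1 for m in pattern.finditer(seq)]
        for name, seq in sequences.items()
    }


# Made-up sequences keyed by hypothetical identifiers.
sequences = {
    "protA": "MSRRASLGKRRLSV",
    "protB": "MAAAGGLLPV",
}

hits = find_motif(sequences, "[RK]{2}.[ST]")
for name, positions in hits.items():
    print(name, positions)
```

A motif match only marks a candidate site. Whether the residue is actually modified depends on context that a pattern alone does not capture.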

Today in lab. Tutorial on protein information sites. From a sublist generated using DAVID, generate a list of protein IDs and obtain the sequences. Obtain protein accession numbers for the cluster. Submit to the SMART database to characterize/analyze the domains. Pick 2 proteins for additional predictions.