Published by Sara Joseph. Modified over 8 years ago.
1
Protein families, domains and motifs in functional prediction May 31, 2016
2
Outline – Usefulness of protein domain analysis – Types of protein domain databases – InterPro integrated protein domain database – SMART database – Predicting post-translational modifications
3
Protein families Groups of homologous sequences (within and across species) that share similar functions and domains Examples: – Carbonic anhydrases (14 in humans) – Chitin synthases (8 in C. neoformans) – Ser/Thr kinases
4
Protein domains Conserved part of protein sequence that can evolve, function and exist independent of the rest of the protein chain Often independently stable and folded Can recombine or evolve from gene duplications into proteins with different combinations of domains
5
Protein motifs Short linear peptide sequences that serve a specific function for the protein, but will not be stable or fold independently of the rest of the chain Protein-protein interaction, ligand interactions, cleavage sites, targeting Examples: – 14-3-3: Interaction with kinases – KELCH: ubiquitin targeting – SUMO: site recognized for modification by SUMO
6
Predicting function for unknown proteins Do they belong (by sequence homology) to a protein family? Do they contain known protein domains? Do they have motifs that suggest a specific function?
7
When annotation is NOT enough You’ve got a list of genes, most of which have been annotated with gene ontology and a potential protein function Why would you want to go on and look more specifically at the protein domains?
8
Limitations of annotation Even in a model organism with a large amount of resources, most genes are still annotated by similarity Often, the name given is based on the BEST match to a particular domain or known protein But…
9
Limitations of BLAST Likelihood of finding a homolog to a sequence: – >80% bacteria – >70% yeast – ~60% animal Rest are truly novel sequences ~900/6500 proteins in yeast are without a known function A name such as "Similar to yeast protein YAL7400" is not very informative
10
Limitations of similarity Proteins with more than one domain cause problems. – Numerous matches to one domain can mask matches to other domains. Increased size of protein databases – The number of related sequences rises, and hits to less closely related sequences may be lost Low-complexity regions can mask domain matches
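The low-complexity point above can be illustrated with a small sketch. It flags windows whose amino-acid composition has low Shannon entropy. This is not the algorithm used by any particular masking tool; the function names, the window width and the entropy threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter


def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of one window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def low_complexity_windows(sequence, width=12, threshold=2.2):
    """Return 0-based start positions of windows with entropy below threshold.

    width and threshold are arbitrary illustrative defaults, not values
    taken from any published masking program.
    """
    return [
        i
        for i in range(len(sequence) - width + 1)
        if window_entropy(sequence[i:i + width]) < threshold
    ]


if __name__ == "__main__":
    # Hypothetical sequence with a poly-Q stretch in the middle.
    toy = "MKT" + "Q" * 12 + "LVE"
    print(low_complexity_windows(toy))
```

Regions flagged this way could be masked before a similarity search so that they do not dominate the hit list.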
11
Proteins are modular Individual domains can and often do fold independently of other domains within the same protein Domains can function as an independent unit (or truncation experiments would never work) Thus identity of ALL protein domains within a sequence can provide further clues about their function
12
Proteins can have >1 domain The name: protein kinase receptor UFO doesn’t necessarily tell you that this protein also contains immunoglobulin (Ig)-like and fibronectin domains or that it has a transmembrane domain
13
Domains are not always functional If a critical residue is missing in an active site, it’s not likely to be functional A similarity score won’t pick that up
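A minimal sketch of the point above: given a pairwise alignment of a query domain against a reference domain, look up the residues that are known to be critical in the reference and see what the query has at those columns. The function names and the toy sequences are hypothetical; the critical positions would have to come from curated annotation of the reference.

```python
def percent_identity(ref_aligned, query_aligned):
    """Percent identity over columns where neither sequence has a gap."""
    pairs = [
        (r, q)
        for r, q in zip(ref_aligned, query_aligned)
        if r != "-" and q != "-"
    ]
    return 100.0 * sum(r == q for r, q in pairs) / len(pairs)


def check_critical_residues(ref_aligned, query_aligned, critical_positions):
    """Report the query residue at each critical reference position.

    critical_positions holds 1-based positions in the UNGAPPED reference.
    Returns {position: (reference_residue, query_residue, conserved)}.
    """
    results = {}
    ref_pos = 0
    for ref_aa, query_aa in zip(ref_aligned, query_aligned):
        if ref_aa == "-":
            continue  # insertion in the query; no reference position here
        ref_pos += 1
        if ref_pos in critical_positions:
            results[ref_pos] = (ref_aa, query_aa, ref_aa == query_aa)
    return results


if __name__ == "__main__":
    # Toy aligned fragments (illustrative only): position 3 of the
    # reference is treated as the critical residue.
    ref, query = "HRDLKPEN", "HRNLKPEN"
    print(percent_identity(ref, query))
    print(check_critical_residues(ref, query, {3}))
```

In this toy case the two fragments are highly similar overall, yet the one residue treated as critical is substituted, which is exactly what a similarity score alone would not reveal.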
14
Protein signature databases Identify domains or classify proteins into families to allow inference of function Approaches include: – regular expressions and profiles – position-specific scoring matrix-based fingerprints – automated sequence clustering – Hidden Markov Models (HMMs)
15
PROSITE Regular expression patterns describing functional motifs, e.g. M-x-G-x(3)-[IV](2)-x(2)-{FWY} – Enzyme catalytic sites – Prosthetic group attachment sites – Ligand or metal binding sites Either matches or not Some families/domains defined by co-occurrence
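Because a PROSITE pattern is essentially a regular expression, it can be translated mechanically. The sketch below assumes standard PROSITE syntax (x for any residue, [..] for allowed residues, {..} for forbidden residues, repetition written in parentheses) and does not handle the N- and C-terminal anchors. The function name `prosite_to_regex` is made up for this example.

```python
import re

# One pattern element: a core (letter, x, [..] or {..}) plus optional (n) or (n,m).
_ELEMENT = re.compile(r"([^()]+)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"Cannot parse element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."  # any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # forbidden residues
        if low and high:
            core += "{" + low + "," + high + "}"
        elif low:
            core += "{" + low + "}"
        parts.append(core)
    return "".join(parts)


if __name__ == "__main__":
    regex = prosite_to_regex("M-x-G-x(3)-[IV](2)-x(2)-{FWY}")
    print(regex)
    # Hypothetical peptide used only to exercise the pattern.
    print(bool(re.search(regex, "MAGAAAIVAAC")))
```

The result is binary, as the slide says: a sequence either contains the pattern or it does not, with no score attached.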
16
Citrate synthase G-[FYAV]-[GA]-H-x-[IV]-x(1,2)-[RKTQ]-x(2)-[DV]-[PS]-R
17
PRINTS Similar to PROSITE patterns Multiple-motif approach using either identity or weight-matrix as basis Groups of conserved motifs provide diagnostic protein family signatures Can be created at super-family, family and sub-family level http://www.bioinf.manchester.ac.uk/dbbrowser/PRINTS/index.php
18
Profile-HMMs Models are generated from alignments of many homologues by counting the frequency of occurrence of each amino acid in each column of the alignment (profile). Profile-HMMs are used to create probabilities of occurrence against a background evolutionary model that accounts for possible substitutions. Provides a convenient and powerful way of identifying homology between sequences. Finds domains in sequences that would never be found by BLAST alone
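The column-counting step described above can be sketched as a simple log-odds profile. This is only the "profile" part: a real profile-HMM also models insertions and deletions with extra states and uses a realistic background and substitution model. The uniform background, the pseudocount and the toy alignment here are assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build per-column log-odds scores from an ungapped alignment.

    Assumes a uniform background frequency (1/20), which is a simplification.
    """
    background = 1.0 / len(AMINO_ACIDS)
    profile = []
    for col in range(len(alignment[0])):
        counts = Counter(seq[col] for seq in alignment)
        total = len(alignment) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile


def score(profile, segment):
    """Sum of per-column log-odds scores for a segment of profile length."""
    return sum(column[aa] for column, aa in zip(profile, segment))


def best_window(profile, sequence):
    """Return (score, 0-based start) of the best-scoring window."""
    width = len(profile)
    return max(
        (score(profile, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )


if __name__ == "__main__":
    # Hypothetical alignment of four short homologous segments.
    toy_alignment = ["GDSGG", "GDSGA", "GDSGG", "GNSGG"]
    profile = build_profile(toy_alignment)
    print(best_window(profile, "AAAGDSGGAAA"))
```

Unlike a pattern match, every window gets a score, so a sequence that differs from all aligned homologues at a tolerated position can still score well.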
19
HMM domain databases PFAM – Classify novel sequences into protein domain profiles – Most comprehensive; >16,000 protein families (v29) SMART – Signaling, extracellular and chromatin proteins – Identification of catalytic site conservation for enzymes TIGRFAMs – Families of proteins from prokaryotes PANTHER – Classification based on function using literature evidence
20
PFAM Manually curated profiles Matches are reported with an E-value: a statistical measure of the likelihood that an alignment occurred by chance alone The E-value does not indicate functionality
21
PFAM Summary
22
PFAM Domain Organization
23
SMART database SMART: Simple Modular Architecture Research Tool – Focus on signaling, extracellular and chromatin-associated proteins – Curated models for >1200 domains Use? – I have several kinase domains in my protein list and want to know which ones are functional. – What other domains are found in signaling proteins?
24
SMART: Search interface UniProt or Ensembl protein accession number Add other searches
25
SMART Output
26
InterPro Scan Combines search methods from several protein databases Uses tools provided by member databases – Uses threshold scores for profiles & motifs InterPro is a convenient means of deriving a consensus among signature methods InterPro records are integrated with UniProt. If you have a UniProt accession number, you can access the InterPro information from UniProt
27
MAPK14 – InterPro record
28
MAPK14 – UniProt record
29
Function from sequence Membrane bound or secreted? GPI anchored? Cellular localization? Post-translational modification sites?
30
CBS prediction services Protein sorting – SignalP, TargetP, others Post-translational modification – Acetylation, phosphorylation, glycosylation Immunological features – Epitopes, MHC allele binding, etc. Protein function & structure – Transmembrane domains, co-evolving positions
31
Transmembrane domain prediction
32
Phosphorylation prediction
33
O-glycosylation
34
EMBOSS Open source software for molecular biology Predict antigenic sites – Useful if you want to design a peptide antibody Look for specific motifs, even degenerate ones – Known phosphorylation motifs – Find motifs in multiple sequences with one submission Get stats on protein/nucleic acid sequences Sequence manipulation of all kinds
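The idea of scanning several sequences for a degenerate motif in one submission can be sketched in plain Python. This is not EMBOSS itself and does not reproduce its output format; the function names, the toy FASTA records and the motif `R[RK].[ST]` (a simplified basophilic kinase-style consensus) are assumptions used only for illustration.

```python
import re


def parse_fasta(text):
    """Parse FASTA-formatted text into {identifier: sequence}."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]  # assumes a non-empty header
            records[name] = ""
        elif name is not None:
            records[name] += line
    return records


def find_motif(records, regex):
    """Return {identifier: [(1-based start, matched text), ...]}.

    A lookahead is used so that overlapping matches are all reported.
    """
    compiled = re.compile(f"(?=({regex}))")
    return {
        name: [(m.start() + 1, m.group(1)) for m in compiled.finditer(seq)]
        for name, seq in records.items()
    }


if __name__ == "__main__":
    # Hypothetical input: two toy protein records.
    fasta = """>protA toy record
MKRRASLLE
>protB toy record
MGGGGG
"""
    print(find_motif(parse_fasta(fasta), r"R[RK].[ST]"))
```

A hit only shows that the motif is present in the sequence; whether the site is actually modified depends on context such as accessibility and localization.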
35
Today in lab Tutorial on protein information sites From a sublist generated using DAVID, generate a list of protein IDs and obtain the sequences Obtain protein accession numbers for the cluster Submit to SMART database to characterize/analyze the domains Pick 2 proteins to do additional predictions