Predicting Function (& location & post-tln modifications) from Protein Sequences June 15, 2015.


1 Predicting Function (& location & post-tln modifications) from Protein Sequences
June 15, 2015

2 Outline
Usefulness of protein domain analysis
Types of protein domain databases
InterPro scan of multiple domain DB
Using the SMART database
Predicting post-translational modifications

3 When annotation is NOT enough
You’ve got a list of genes, most of which have been annotated with gene ontology and a potential protein function. Why would you want to go on and look more specifically at the protein domains?

4 Limitations of annotation
Even in a model organism with a large amount of resources, most genes are still annotated by similarity. Often, the name given is based on the BEST match to a particular domain or known protein. But…

5 Limitations of BLAST
Likelihood of finding a homolog to a sequence:
>80% bacteria
>70% yeast
~60% animal
Rest are truly novel sequences
~900/6500 proteins in yeast without a known function
NAME: Similar to yeast protein YAL7400
Not very informative

6 Limitations of similarity
Proteins with more than one domain cause problems: numerous matches to one domain can mask matches to other domains.
Increased size of protein databases: the number of related sequences rises, and hits to less related sequences may be lost.
Low-complexity regions can mask domain matches.
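
To make the low-complexity point concrete, here is a minimal sketch that masks windows with low Shannon entropy before a similarity search. It is not the SEG algorithm or any tool named in these slides; the function names, the window width of 12, and the 2.2-bit threshold are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def window_entropy(window):
    """Shannon entropy (bits) of the residue composition of a window."""
    counts = Counter(window)
    n = len(window)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def mask_low_complexity(seq, width=12, threshold=2.2):
    """Replace every window whose entropy is below threshold with 'x'.

    width and threshold are illustrative assumptions, not tuned values.
    Entropy is always computed on the original, unmasked sequence.
    """
    masked = list(seq)
    for i in range(len(seq) - width + 1):
        if window_entropy(seq[i:i + width]) < threshold:
            masked[i:i + width] = "x" * width
    return "".join(masked)

# Hypothetical sequence with a poly-serine stretch in the middle.
example = "MKTAYIAKQRQISFVK" + "S" * 15 + "HFSRQLEERLGLIEVQ"
print(mask_low_complexity(example))
```

Masked residues can then be ignored by a search so that a repetitive stretch does not dominate the hit list.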

7 Proteins are modular
Individual domains can and often do fold independently of other domains within the same protein.
Domains can function as an independent unit (or truncation experiments would never work).
Thus the identity of ALL protein domains within a sequence can provide further clues about the protein's function.

8 Proteins can have >1 domain
The name: protein kinase receptor UFO doesn’t necessarily tell you that this protein also contains IgG and fibronectin domains or that it has a transmembrane domain

9 Domains are not always functional
If a critical residue is missing in an active site, the domain is not likely to be functional. A similarity score won’t pick that up.
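
One way to picture this check: after aligning a domain to its family model, look at the columns that hold catalytic residues and see whether the expected amino acids are present. The sketch below assumes the aligned domain string and the column-to-residue map are already available; the function name, the column indices, and the example sequences are hypothetical, and this is not how any particular database implements it.

```python
def missing_critical_residues(aligned_domain, required):
    """Report alignment columns where a required residue is absent.

    aligned_domain: domain sequence aligned to the family model ('-' for gaps).
    required: dict mapping 0-based alignment column -> set of allowed residues
              (hypothetical values supplied by the caller).
    Returns a list of (column, residue_found, allowed_residues) tuples.
    """
    problems = []
    for column, allowed in sorted(required.items()):
        residue = aligned_domain[column] if column < len(aligned_domain) else "-"
        if residue not in allowed:
            problems.append((column, residue, "".join(sorted(allowed))))
    return problems

# Hypothetical catalytic columns for an imaginary nine-column model.
required = {2: {"K"}, 5: {"D"}, 8: {"D"}}
print(missing_critical_residues("VAKGGDLLD", required))  # intact site
print(missing_critical_residues("VAKGGNLLD", required))  # Asp replaced by Asn
```

Both example sequences would score almost identically in a similarity search, yet only the first keeps all three required residues.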

10 Multiple protein domain databases

11 Protein signature databases
Identify domains or classify proteins into families to allow inference of function
Approaches include:
regular expressions and profiles
position-specific scoring matrix-based fingerprints
automated sequence clustering
Hidden Markov Models (HMMs)

12 PROSITE
Regular expression patterns describing functional motifs
M-x-G-x(3)-[IV](2)-x(2)-{FWY}
Enzyme catalytic sites
Prosthetic group attachment sites
Ligand or metal binding sites
Either matches or not
Some families/domains defined by co-occurrence
x: any amino acid
[]: any of the amino acids within the square brackets
{}: any amino acid except those within the curly braces
A number in parentheses after a pattern element indicates the number of times that element is repeated

13 Citrate synthase
G-[FYAV]-[GA]-H-x-[IV]-x(1,2)-[RKTQ]-x(2)-[DV]-[PS]-R
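
Because PROSITE patterns are a restricted form of regular expression, a pattern like the citrate synthase signature can be translated mechanically. The sketch below is a minimal converter, assuming repeat counts are written in parentheses, as in x(3) or x(1,2); it ignores the N- and C-terminal anchors (< and >) that full PROSITE syntax also allows. The function name and the test sequence are made up for illustration.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        # Separate an optional repeat count such as (3) or (1,2) from the element.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            core = "."                      # x: any amino acid
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"  # {..}: any residue except these
        # [..] classes and single letters are already valid regex.
        if low and high:
            core += "{%s,%s}" % (low, high)
        elif low:
            core += "{%s}" % low
        parts.append(core)
    return "".join(parts)

citrate = "G-[FYAV]-[GA]-H-x-[IV]-x(1,2)-[RKTQ]-x(2)-[DV]-[PS]-R"
regex = prosite_to_regex(citrate)
print(regex)
# Made-up sequence built to contain the motif, not a real citrate synthase.
print(bool(re.search(regex, "MSTGFGHAIARAADPRLLK")))
```

A match here is all-or-nothing: a single substitution at a fixed position makes the pattern fail, which is why patterns suit short, strictly conserved sites.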

14 PRINTS
Similar to PROSITE patterns
Multiple-motif approach using either identity or weight-matrix as basis
Groups of conserved motifs provide diagnostic protein family signatures
Can be created at super-family, family and sub-family level

15 Profile-HMMs
Models are generated from alignments of many homologues by counting the frequency of occurrence of each amino acid in each column of the alignment (profile).
Profile-HMMs turn these counts into probabilities of occurrence, scored against a background evolutionary model that accounts for possible substitutions.
Provides a convenient and powerful way of identifying homology between sequences.
Finds domains in sequences that would never be found by BLAST alone.
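
The column-counting step can be sketched in a few lines. This is a simplified position-specific scoring matrix, not a full profile-HMM: it has no insert or delete states, and it assumes a uniform background frequency and a flat pseudocount, both of which are simplifications. All function names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_profile(alignment, pseudocount=1.0):
    """Return one dict per alignment column mapping residue -> log-odds score."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    profile = []
    for column in zip(*alignment):
        # Gap characters and non-standard residues are skipped when counting.
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        profile.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return profile

def score(profile, segment):
    """Sum the column scores for an ungapped segment of the same length."""
    return sum(col[res] for col, res in zip(profile, segment))

def best_hit(profile, sequence):
    """Slide the profile along the sequence; return (best score, start index).

    The sequence must contain only the 20 standard residues.
    """
    width = len(profile)
    return max((score(profile, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))

# Toy alignment of four hypothetical homologous segments.
toy_alignment = ["GKST", "GKTT", "GRST", "GKSS"]
toy_profile = build_profile(toy_alignment)
print(best_hit(toy_profile, "AAAAGKSTAAAA"))
```

Because each column carries its own scores, a conserved position is weighted heavily while a variable one tolerates substitutions, which is what lets profile methods detect relatives that a single pairwise comparison would miss.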

16 HMM domain databases
Pfam: Classify novel sequences into protein domain profiles; most comprehensive, >13,000 protein families (v26)
SMART: Signaling, extracellular and chromatin proteins; identification of catalytic site conservation for enzymes
TIGRFAMs: Families of proteins from prokaryotes
PANTHER: Classification based on function using literature evidence

17 PFAM >16,230 manually curated profiles
Can use the profile to search a genome for matches

18 Can submit a protein to PFAM
Limited to single protein submission
Output gives you an E-value that estimates how likely the domain match is to have arisen by chance (lower is more significant)
Up to you to determine if the domain is functional

19 Keyword search

20 PFAM Summary

21 PFAM Domain Organization

22 PFAM Interactions

23 SMART database
SMART: Simple Modular Architecture Research Tool
Focus on signaling, extracellular and chromatin-associated proteins
Curated models for >1200 domains
Use? I have several kinase domains in my protein list and want to know which ones are functional. What other domains are found in signaling proteins?

24 Search for matches
UniProt or Ensembl protein accession number
Protein sequence
Add other searches

25 SMART Output
Mouse over for information
Prediction of FUNCTIONAL catalytic activity

26 Can browse the domains

27

28 InterProScan
Combines search methods from several protein databases
Uses tools provided by member databases
Uses threshold scores for profiles & motifs
InterPro is a convenient means of deriving a consensus among signature methods

29 Define which domain databases to search

30 Example InterProScan search
Submitting an olfactory receptor gene (member of the GPCR class of proteins) to InterPro

31 InterPro family 2nd InterPro family

32 Submitting a different human GPCR protein to InterPro

33 Same InterPro family New InterPro family

34 InterProScan Families

35 InterProScan annotation

36 SMART & PFAM search SMART DB results: PFAM DB results:

37 Are 2 proteins homologs?
S. cerevisiae Ste3 is a GPCR pheromone receptor
Similarity to C. gattii protein: 25% identical, 45% similar, E-value 10^-25
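
For reference, the "% identical" and "% similar" figures reported for a pairwise alignment can be computed as below. This is a rough sketch: real tools decide similarity with a substitution matrix such as BLOSUM62, whereas the residue groups here are a crude stand-in chosen for illustration, and the denominator (all aligned columns, gaps included) is one of several conventions in use.

```python
# Assumption: crude physicochemical groups standing in for a substitution matrix.
SIMILAR_GROUPS = ["ILVM", "FWY", "KRH", "DE", "ST", "NQ", "AG"]

def identity_and_similarity(aligned_a, aligned_b):
    """Return (percent identity, percent similarity) for two aligned sequences.

    Both strings must come from the same pairwise alignment ('-' marks gaps).
    Percentages are taken over all columns that are not gap-versus-gap.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    identical = similar = compared = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" and b == "-":
            continue
        compared += 1
        if a == "-" or b == "-":
            continue  # a gap counts toward the length but never as a match
        if a == b:
            identical += 1
            similar += 1
        elif any(a in group and b in group for group in SIMILAR_GROUPS):
            similar += 1
    if compared == 0:
        return 0.0, 0.0
    return 100.0 * identical / compared, 100.0 * similar / compared

# Hypothetical aligned fragments, not the actual Ste3 alignment.
print(identity_and_similarity("MKV-LST", "MRVALTT"))
```

At identity levels around 25%, percentages alone are weak evidence of homology, so the E-value and the shared domain architecture carry most of the weight.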

38 Very similar domain content and arrangement

39 Advantage of InterProScan
InterPro integrates the different databases to create a protein family signature.
Pfam/SMART/PANTHER/Gene3D & TIGRFAMs will find domain families
PROSITE can find very specific signature patterns
PRINTS can distinguish related members of the same protein family
Cannot change the statistical cut-off for what is considered a significant match

40 Function from sequence
Membrane bound or secreted? GPI anchored? Cellular localization? Post-translational modification sites?

41 CBS prediction services
Protein sorting: SignalP, TargetP, others
Post-translational modification: acetylation, phosphorylation, glycosylation
Immunological features: epitopes, MHC allele binding, etc.
Protein function & structure: transmembrane domains, co-evolving positions

42 Transmembrane domain prediction

43 Phosphorylation prediction

44 O-glycosylation

45 EMBOSS
Open source software for molecular biology
Predict antigenic sites: useful if you want to design a peptide antibody
Look for specific motifs, even degenerate ones, such as known phosphorylation motifs
Find motifs in multiple sequences with one submission
Get stats on protein/nucleic acid sequences
Sequence manipulation of all kinds
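
Searching several sequences for a degenerate motif in one pass can be sketched with plain regular expressions. This is not EMBOSS itself or any of its programs; the function name, the sequence names, and the sequences are hypothetical, and the motif used ([RK][RK].[ST], a basophilic kinase-like consensus) is only an illustrative choice.

```python
import re

def find_motif(sequences, motif_regex):
    """Return {name: [1-based start positions]} for every sequence.

    A lookahead is used so that overlapping occurrences are all reported.
    """
    pattern = re.compile("(?=(%s))" % motif_regex)
    return {name: [m.start() + 1 for m in pattern.finditer(seq)]
            for name, seq in sequences.items()}

# Hypothetical input sequences.
proteins = {
    "protein_1": "MARRASLKKRSTQ",
    "protein_2": "MGGGAAA",
}
print(find_motif(proteins, "[RK][RK].[ST]"))
```

A hit only shows that the motif is present in the sequence; whether the site is actually modified depends on context that a pattern match cannot see.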

46 Today in lab
Tutorial on protein information sites
From a sublist generated using DAVID, generate a list of protein IDs and obtain the sequences
Obtain protein accession numbers for the cluster
Submit to SMART database to characterize/analyze the domains
Pick 2 proteins to do additional predictions

