Published by Kory Jennings. Modified over 9 years ago.
1
Protein function
Where to find it. How to predict it. How to classify it.
Stuart Rison, Department of Biochemistry, UCL
rison@biochem.ucl.ac.uk
2
Outline
Collecting functional information:
  Small scale (single gene)
  Large scale (sets of genes)
Function annotation schemes
Problems with functional assignments
[Comparing current schemes]
3
Collecting information for single genes
  from 1° databases
  from 2° databases
  from Genome Databases (Model organisms)
  by homology
  not by homology
4
Annotation in databases: 1° and 2° databases
Some information can be found in 'primary' databases (sequence and structure databases)
  Usually limited, although sometimes quite informative (e.g. SwissProt)
  Core data: sequence, citation information and taxonomic data
  Annotation: protein function; post-translational modifications; domains and sites; associated diseases; sequence conflicts/variants
Most primary databases link to a number of value-added (2°) databases (e.g. motif databases or disease databases), which are often rich in information
5
Annotation in 1° databases: SwissProt
ID   HEM3_HUMAN     STANDARD;      PRT;   361 AA.
AC   P08397; P08396; Q16012; …
DE   PORPHOBILINOGEN DEAMINASE (EC 4.3.1.8) (HYDROXYMETHYLBILANE SYNTHASE)
DE   (HMBS) (PRE-UROPORPHYRINOGEN SYNTHASE) (PBG-D).
GN   HMBS OR PBGD.
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
…(literature references)…
CC   FUNCTION: TETRAPOLYMERIZATION OF THE MONOPYRROLE PBG INTO THE
CC   HYDROXYMETHYLBILANE PREUROPORPHYRINOGEN IN SEVERAL DISCRETE STEPS.
CC   CATALYTIC ACTIVITY: 4 PORPHOBILINOGEN + H(2)O = HYDROXYMETHYLBILANE + 4 NH(3).
CC   COFACTOR: BINDS A DIPYRROMETHANE COFACTOR TO WHICH PORPHOBILINOGEN SUBUNITS…
CC   PATHWAY: THIRD STEP IN PORPHYRIN BIOSYNTHESIS BY THE SHEMIN PATHWAY. INVOLVED…
CC   ALTERNATIVE PRODUCTS: THERE ARE TWO ISOZYMES OF THIS ENZYME IN MAMMALS; THEY
CC   ARE PRODUCED BY THE SAME GENE FROM ALTERNATIVE SPLICING…
CC   DISEASE: DEFECTS IN HMBS ARE THE CAUSE OF ACUTE INTERMITTENT PORPHYRIA (AIP); AN
CC   AUTOSOMAL DOMINANT DISEASE CHARACTERIZED BY ACUTE ATTACKS OF NEUROLOGICAL
CC   DYSFUNCTION…
CC   SIMILARITY: BELONGS TO THE HMBS FAMILY.
… (links to related databases - secondary databases) …
KW   Porphyrin biosynthesis; Heme biosynthesis; Lyase;
KW   Alternative splicing; Disease mutation.
… (Sequence variations/Sequence)
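The two-letter line codes in the entry above (ID, AC, CC, KW and so on) make the record easy to split into fields. The following is a minimal sketch, not an official parser: the function names `group_by_line_code` and `keywords` are invented for illustration, and the embedded record is a shortened excerpt of the HEM3_HUMAN entry with the elided parts left out.

```python
from collections import defaultdict

# Shortened excerpt of the HEM3_HUMAN entry shown on the slide.
# Each line starts with a two-letter line code followed by its content.
RECORD = """\
ID   HEM3_HUMAN     STANDARD;      PRT;   361 AA.
AC   P08397; P08396; Q16012;
GN   HMBS OR PBGD.
OS   Homo sapiens (Human).
CC   FUNCTION: TETRAPOLYMERIZATION OF THE MONOPYRROLE PBG INTO THE
CC   HYDROXYMETHYLBILANE PREUROPORPHYRINOGEN IN SEVERAL DISCRETE STEPS.
CC   SIMILARITY: BELONGS TO THE HMBS FAMILY.
KW   Porphyrin biosynthesis; Heme biosynthesis; Lyase;
KW   Alternative splicing; Disease mutation.
"""


def group_by_line_code(text):
    """Join the content of all lines that share a line code (hypothetical helper)."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 3:
            continue  # skip blank or malformed lines
        code, content = line[:2], line[2:].strip()
        fields[code].append(content)
    return {code: " ".join(parts) for code, parts in fields.items()}


def keywords(fields):
    """Split the joined KW field into individual keywords (hypothetical helper)."""
    raw = fields.get("KW", "").rstrip(".")
    return [kw.strip() for kw in raw.split(";") if kw.strip()]


fields = group_by_line_code(RECORD)
print(fields["ID"].split()[0])  # entry name
print(keywords(fields))         # keyword list
```

Grouping by line code keeps continuation lines together, so the multi-line CC and KW blocks each come back as a single string.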
6
Annotation in Motif databases: INTERPRO http://interpro.ebi.ac.uk/servlet/IEntry?ac=IPR000860
7
Genome databases
Some deal with single organisms (e.g. SubtiList for B. subtilis; Sanger Centre M. tuberculosis)
Some deal with multiple genomes (e.g. TIGR microbial genomes database)
The level of annotation can be extensive
Many are much more than sequence repositories: they extend the sequence with a wealth of additional information (e.g. mutants; strains; complementation plasmids etc.)
If you are working with a model organism, your chances of obtaining reliable functional annotations are improved
8
Genome database: YPD http://www.proteome.com/databases/YPD/reports/HEM3.html
9
Function assignment by homology I
If you just have a sequence
The most common bioinformatics procedure
Search your protein of interest against primary databases; if you find a high-identity homologue, chances are it performs a similar function
Many, many tools (BLAST, FASTA, S-W Search)
Beware of annotation by homology:
  the relationship between sequence similarity and function is not straightforward
  danger of propagating incorrect functional information
10
Function assignment by homology II
Consider databases which distinguish experimental function assignments from homology-based ones (e.g. YPD/WormPD, EcoCyc)
Or use databases which employ more rigorous automated annotation tools (e.g. HAMAP @ SwissProt):
“Among the peculiarities recognized by the programs are: size discrepancy, absence or mutation of regions involved in activity or binding (to metals, nucleotides, etc), presence of paralogs, contradiction with the biological context (i.e. if a protein belongs to a pathway supposed to be absent in a particular organism), etc. Such "problematic" proteins will not be automatically annotated.”
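One way to limit the propagation of incorrect annotations is to transfer a function only from hits whose own annotation is experimentally supported. The sketch below illustrates that idea under stated assumptions: the `Hit` record, the evidence labels, the example hits and the 40% identity cut-off are all invented for illustration and do not come from any particular database or tool.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One database hit for the query protein (hypothetical record layout)."""
    subject_id: str
    percent_identity: float
    function: str
    evidence: str  # assumed labels: "experimental" or "homology"


def transferable_hit(hits, min_identity=40.0):
    """Return the best experimentally annotated hit above the cut-off, or None.

    The 40% default is an arbitrary illustrative threshold, not a recommendation.
    """
    candidates = [
        h for h in hits
        if h.evidence == "experimental" and h.percent_identity >= min_identity
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h.percent_identity)


# Invented example hits.
hits = [
    Hit("hit_1", 98.0, "putative kinase", "homology"),
    Hit("hit_2", 62.0, "serine protease", "experimental"),
    Hit("hit_3", 35.0, "hydrolase", "experimental"),
]

best = transferable_hit(hits)
print(best.subject_id if best else "no safe transfer")
```

Here the closest hit is skipped because its annotation was itself inferred by homology; the function is taken from the more distant but experimentally characterised hit instead.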
11
Genome database: YPD http://www.proteome.com/databases/YPD/reports/HEM3.html
12
Functional assignment “without homology”
Novel functional assignment methods now exist which don’t make use of ‘direct’ homology searches
They exploit other relationships between proteins which are used as indicators of shared function:
  Phylogenetic profiles
  “Rosetta stone genes”
13
Phylogenetic profiles Pellegrini M et al., “Assigning protein functions by comparative genome analysis: protein phylogenetic profiles.” PNAS (1999) 96(8):4285-8
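In the phylogenetic profile approach, each protein is described by a presence/absence vector across a set of genomes, and proteins with identical or very similar vectors are inferred to be functionally linked. A minimal sketch of that bookkeeping follows; the genome names, protein names and presence data are invented placeholders.

```python
from collections import defaultdict

# Invented example data: which genomes contain a homologue of each protein.
GENOMES = ["genome_A", "genome_B", "genome_C", "genome_D"]
PRESENCE = {
    "P1": {"genome_A", "genome_C"},
    "P2": {"genome_A", "genome_C"},
    "P3": {"genome_B", "genome_C", "genome_D"},
    "P4": {"genome_A", "genome_B", "genome_C", "genome_D"},
}


def profile(protein):
    """Presence (1) / absence (0) of the protein in each genome, in GENOMES order."""
    return tuple(int(genome in PRESENCE[protein]) for genome in GENOMES)


def group_by_profile(proteins):
    """Collect proteins that share exactly the same profile."""
    groups = defaultdict(list)
    for protein in proteins:
        groups[profile(protein)].append(protein)
    return dict(groups)


def hamming(a, b):
    """Number of genomes in which two profiles differ."""
    return sum(x != y for x, y in zip(a, b))


groups = group_by_profile(sorted(PRESENCE))
for prof, members in groups.items():
    print(prof, members)
```

P1 and P2 share the profile (1, 0, 1, 0) and would be proposed as functionally linked; `hamming` can be used to relax the criterion from identical to nearly identical profiles.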
14
Rosetta Stone method
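The idea behind the Rosetta Stone method is that two proteins which are separate in one genome, but whose homologues are fused into a single chain in another genome, are likely to be functionally linked. The sketch below assumes the similarity search has already been run and works only on its output; the alignment tuples, gene names and the helper `rosetta_links` are invented for illustration.

```python
from itertools import combinations

# Invented alignments: (query protein, candidate fused protein, start, end),
# where start/end are the aligned positions on the candidate fused protein.
ALIGNMENTS = [
    ("geneA", "fusedAB", 1, 210),
    ("geneB", "fusedAB", 230, 480),
    ("geneC", "fusedAB", 150, 300),
    ("geneA", "other", 5, 200),
]


def overlaps(region_1, region_2):
    """True if two (start, end) regions share at least one position."""
    return region_1[0] <= region_2[1] and region_2[0] <= region_1[1]


def rosetta_links(alignments):
    """Pairs of distinct queries matching non-overlapping regions of the same protein."""
    by_composite = {}
    for query, composite, start, end in alignments:
        by_composite.setdefault(composite, []).append((query, (start, end)))

    links = set()
    for composite, hits in by_composite.items():
        for (q1, r1), (q2, r2) in combinations(hits, 2):
            if q1 != q2 and not overlaps(r1, r2):
                links.add((tuple(sorted((q1, q2))), composite))
    return links


print(rosetta_links(ALIGNMENTS))
```

Requiring non-overlapping regions is what separates a fusion of two different proteins from two homologues of the same domain; a real analysis would also have to handle promiscuous domains, which this sketch ignores.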
15
More methods…
Marcotte EM, et al., Nature (1999) 402:83-86
Enright AJ, et al., Nature (1999) 402:86-90
16
Functional assignment “without homology”
17
Some access over the WWW, but experimental and only for certain organisms (Yeast, E. coli, M. tuberculosis)
Many proprietary methods
Considered one of the most promising solutions for preliminary annotation of “unknown function” proteins in genome sequencing projects
18
Collecting information for many genes
Usually for “large-scale biology” (e.g. micro-array experiments)
  Genome Databases
  Functional classification schemes
19
Genome Databases
Genome sequencing projects are now the primary driving force for extensive functional annotation
We have the genes (ORFs), we want the functions: FUNCTIONAL GENOMICS
20
(… more ’omes)
21
Functional classification schemes I
Dealing with large sets of genes: functional classification schemes
Tentative schemes as early as 1983; use driven by genome sequencing projects
First extensive scheme published in 1993 by Monica Riley [regularly updated (GenProtEC; EcoCyc)]
The majority of current schemes are heavily influenced by the ‘Riley scheme’
‘2nd generation’ schemes are now being developed
22
Functional classification schemes II
Most schemes can be thought of as trees
Progression along the tree (root to leaves) represents increasingly specific functions
ORFs are generally associated with leaf nodes (but of course, they are also associated with intermediary nodes)
Examples of use:
  create gene sets linked by functionality (e.g. to detect functional motifs)
  validate a functional connection between genes (e.g. gene expression studies)
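A tree-shaped scheme of this kind can be sketched as nested nodes, with ORFs attached to the leaves and gene sets obtained by collecting everything beneath a chosen node. The category names below are taken from the example scheme in this presentation; the `Node` class and the ORF identifiers are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One functional category; ORFs are normally attached to leaf nodes."""
    name: str
    orfs: list = field(default_factory=list)
    children: list = field(default_factory=list)

    def all_orfs(self):
        """ORFs on this node plus those on every node beneath it."""
        collected = list(self.orfs)
        for child in self.children:
            collected.extend(child.all_orfs())
        return collected

    def find(self, name):
        """Depth-first search for a category by name; None if absent."""
        if self.name == name:
            return self
        for child in self.children:
            hit = child.find(name)
            if hit is not None:
                return hit
        return None


# ORF identifiers are placeholders, not real genes.
scheme = Node("Metabolism of small molecules", children=[
    Node("Amino Acids", children=[
        Node("Alanine", orfs=["orf1", "orf2"]),
    ]),
    Node("Energy Metabolism", children=[
        Node("Aerobic respiration", orfs=["orf3"]),
        Node("Fermentation", orfs=["orf4", "orf5"]),
        Node("Glycolysis", orfs=["orf6"]),
    ]),
])

energy_set = scheme.find("Energy Metabolism").all_orfs()
print(energy_set)
```

Choosing a node nearer the root gives a larger, less specific gene set; choosing a leaf gives a small, tightly defined one.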
23
An example scheme… (GenProtEC)
[Figure: classification tree. Root: Metabolism of small molecules. Branches: Amino Acids; Central Intermediary Metabolism; Energy Metabolism; etc. Leaf categories shown: Alanine; Amino sugars; Aerobic respiration; Fermentation; Glycolysis; etc. ORF counts shown on the nodes: (900 ORFs), (112 ORFs), 2 ORFs, 8 ORFs, 32 ORFs, 18 ORFs, 22 ORFs.]
24
Issues
Functions: Apples and Oranges
Multi-dimensionality
Multi-functionality
25
Issues: Apples and Oranges
Function is an umbrella, catch-all term
Schemes do not distinguish between aspects of function
Most commonly they mix gene product type (T), activity (A) and cellular role (R):
  Cell division (R) : DNA replication (A)
  Osmotic adaptation (R) : Ion channel (T,A)
26
Issues - Multi-dimensionality I
Human trypsin functions:
  Biochemical: peptide bond hydrolysis
  Molecular: proteolytic enzyme
  Cellular: protein degradation
  Physiological: digestion
One could conceive of a number of other dimensions:
  Cellular location
  Regulation
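The trypsin example suggests storing each dimension of function as its own field rather than as a single label. The sketch below shows how the same proteins group differently depending on the dimension queried; the `Annotation` record is an invented layout, and the second entry is a made-up protein included only to give the grouping something to compare.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """One protein described along four dimensions of function (hypothetical layout)."""
    protein: str
    biochemical: str
    molecular: str
    cellular: str
    physiological: str


ANNOTATIONS = [
    # Values for human trypsin are those listed on the slide.
    Annotation("human trypsin", "peptide bond hydrolysis", "proteolytic enzyme",
               "protein degradation", "digestion"),
    # Invented entry, for illustration only.
    Annotation("hypothetical protease X", "peptide bond hydrolysis", "proteolytic enzyme",
               "protein degradation", "hypothetical other role"),
]


def group_by(dimension, annotations=ANNOTATIONS):
    """Group protein names by their value along one dimension."""
    groups = defaultdict(list)
    for annotation in annotations:
        groups[getattr(annotation, dimension)].append(annotation.protein)
    return dict(groups)


print(group_by("molecular"))      # both proteins fall in one group
print(group_by("physiological"))  # the same proteins are now separated
```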
27
Issues - Multi-dimensionality II Why differentiate function and process? Figure of cell cycle-dependent Yeast gene expression clusters (Pat Brown lab - Stanford)
28
Issues - Multi-functionality
Inherent: e.g. lac repressor; carbohydrate metabolism and osmoprotection
Multi-subunit: e.g. succinic dehydrogenase; whole - enzyme in TCA; subunit 1 - electron transport chain; subunit 2 - cell structure
Circumstantial: e.g. acetate kinase; acetate only environment - acetate metabolism; acetate absent - fermentation enzyme
29
Gene Ontology - a collaboration
Drosophila (fruit fly) - FlyBase
Saccharomyces Genome Database (SGD)
Mus (mouse) - Mouse Genome Database (MGD)
30
Gene Ontology - the next generation
Multi-dimensional:
  functional primitive: “a capability that a physical gene product (or gene product group) carries as a potential” (e.g. transporter or adenylate cyclase)
  process: “a biological objective accomplished via one or more ordered assemblies of functions” (e.g. cell growth and maintenance or purine metabolism)
  cellular component
Extensive: depth 11; nearly 4000 terms
More complex organisation: away from tree structure
Theoretically applicable to all species (designed for multicellular eukaryotes)
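Moving "away from tree structure" means a term may have more than one parent, so the scheme becomes a directed acyclic graph and a term's ancestors must be found by following every parent link. The sketch below illustrates this; the term names echo the examples on the slide, but the parent links between them are invented and are not the real Gene Ontology structure.

```python
# Invented parent links: each term maps to the list of its parents.
# "purine metabolism" is given two parents to show the non-tree case.
PARENTS = {
    "process": [],
    "metabolism": ["process"],
    "cell growth and maintenance": ["process"],
    "purine metabolism": ["metabolism", "cell growth and maintenance"],
}


def ancestors(term, parents=PARENTS):
    """All terms reachable by following parent links upwards from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


print(sorted(ancestors("purine metabolism")))
```

A gene product annotated to a term is implicitly annotated to every ancestor, so in a graph with multiple parents one annotation can place a gene in several branches at once.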
31
Gene Ontology - Process
32
Gene Ontology - current status http://www.geneontology.org/
33
Where to look for functional information - single protein
With 1 or a few genes:
  Primary databases (e.g. SwissProt)
  Model organism databases (e.g. GenProtEC; SGD; WormPD)
  Metabolic/Pathway databases (e.g. KEGG)
  Value-added databases (e.g. Motif databases; Disease databases)
  By homology
  Not by homology
34
Where to look for functional information - protein sets
Need some sort of functional classification scheme:
  Tree-like schemes (e.g. TIGR, GenProtEC)
  Gene Ontology (FlyBase, MGD, SGD)
For comparative genomics, need schemes applied to multiple organisms (e.g. PEDANT, TIGR)
Currently, the greatest genome coverage is provided by PEDANT (but it is not manually curated)
35
Conclusions
Functional information is available but it is rarely centralised
Function is a very broad term; it is hard to know if the information you need will be available at the level you need it
New schemes (e.g. GO) are emerging which try to cope with functional annotation better
And new automated functional annotation tools are emerging (‘intelligent systems’; non-homology based)
You still need to validate predictions experimentally
36
A survey of (some) current schemes
1) EcoCyc/GenProtEC: E. coli scheme (Riley scheme, MBL)
2) SubtiList: Bacillus subtilis scheme (Institut Pasteur)
3) MIPS/PEDANT: yeast scheme (applied to other organisms in PEDANT) (Munich Information Center for Protein Sequences)
4) TIGR: microbial genomes scheme (The Institute for Genomic Research)
5) KEGG: multi-organism scheme (metabolic and regulatory pathways) (Kyoto Encyclopedia of Genes and Genomes)
6) WIT: multi-organism scheme (metabolic reconstruction) (What Is There; ANL)
7) Gene Ontology: a 2nd generation functional classification scheme (EBI; FlyBase; MGD; SGD)
37
FuncWheel for the Combination Scheme
39
Conclusions - Scheme comparison I
Similar in their coverage of function (although with widely varying ‘granularity’)
...yet different enough that direct comparison is complex
Essentially deal with unicellular microbial organisms (MIPS is tackling this)
Certain ‘niche’ schemes (e.g. WIT/KEGG)
...or user community tailored schemes (e.g. SubtiList)
40
WWW sites I
Primary databases (Sequence):
  SwissProt: http://www.expasy.ch/sprot
  PIR: http://www-nbrf.Georgetown.edu/
  NCBI databases: http://www.ncbi.nlm.nih.gov/Database/index.html
Primary databases (Structure):
  Protein Data Bank: http://www.rcsb.org/
  Macromolecular Structure Database: http://msd.ebi.ac.uk/
Value added:
  INTERPRO: http://interpro.ebi.ac.uk/
41
WWW sites II
Single genome databases:
  SubtiList: http://genolist.pasteur.fr/SubtiList/
  Saccharomyces Genome Database: http://genome-www.stanford.edu/Saccharomyces/
  EcoCyc: http://ecocyc.pangeasystems.com/
  GenProtEC: http://genprotec.mdbl.edu/
  FlyBase: http://flybase.bio.indiana.edu/
  Mouse Genome Database (MGD): http://www.informatics.jax.org/
  Yeast Protein Database (YPD) and WormPD: http://www.proteome.com/
42
WWW sites III
Multiple genome databases:
  The Institute for Genomic Research: http://www.tigr.org/microbialdb
  MIPS/PEDANT: http://pedant.mips.biochem.mpg.de/
  HAMAP: http://www.expasy.ch/sprot/hamap/
Pathway databases:
  KEGG: http://www.genome.ad.jp/kegg/
  WIT: http://igweb.integratedgenomics.com/IGwit/
Non-homology based function prediction:
  Mycobacterium tuberculosis: http://www.doe-mbi.ucla.edu/people/sergio/TB/tb.html
  Yeast: http://www.doe-mbi.ucla.edu/people/marcotte/yeast.html
A relevant paper:
  http://www.biochem.ucl.ac.uk/~rison/Publications/index.html