Protein function Where to find it. How to predict it. How to classify it. Stuart Rison Department of Biochemistry, UCL



Outline
- Collecting functional information:
  - Small scale (single gene)
  - Large scale (sets of genes)
- Function annotation schemes
- Problems with functional assignments
- [Comparing current schemes]

Collecting information for single genes
- from 1° databases
- from 2° databases
- from Genome Databases (Model organisms)
- by homology
- not by homology

Annotation in databases: 1° and 2° databases
- Some information can be found in 'primary' databases (sequence and structure databases)
  - Usually limited, although it can sometimes be quite informative (e.g. SwissProt)
  - Core data: sequence, citation information and taxonomic data
  - Annotation: protein function; post-translational modifications; domains and sites; associated diseases; sequence conflicts/variants
- Most primary databases link to a number of value-added (2°) databases (e.g. motif databases or disease databases), which are often rich in information

Annotation in 1° databases: SwissProt

ID HEM3_HUMAN STANDARD; PRT; 361 AA.
AC P08397; P08396; Q16012; …
DE PORPHOBILINOGEN DEAMINASE (EC ) (HYDROXYMETHYLBILANE SYNTHASE)
DE (HMBS) (PRE-UROPORPHYRINOGEN SYNTHASE) (PBG-D).
GN HMBS OR PBGD.
OS Homo sapiens (Human).
OC Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
…(literature references)…
CC FUNCTION: TETRAPOLYMERIZATION OF THE MONOPYRROLE PBG INTO THE
CC HYDROXYMETHYLBILANE PREUROPORPHYRINOGEN IN SEVERAL DISCRETE STEPS.
CC CATALYTIC ACTIVITY: 4 PORPHOBILINOGEN + H(2)O = HYDROXYMETHYLBILANE + 4 NH(3).
CC COFACTOR: BINDS A DIPYRROMETHANE COFACTOR TO WHICH PORPHOBILINOGEN SUBUNITS…
CC PATHWAY: THIRD STEP IN PORPHYRIN BIOSYNTHESIS BY THE SHEMIN PATHWAY. INVOLVED…
CC ALTERNATIVE PRODUCTS: THERE ARE TWO ISOZYMES OF THIS ENZYME IN MAMMALS; THEY
CC ARE PRODUCED BY THE SAME GENE FROM ALTERNATIVE SPLICING…
CC DISEASE: DEFECTS IN HMBS ARE THE CAUSE OF ACUTE INTERMITTENT PORPHYRIA (AIP); AN
CC AUTOSOMAL DOMINANT DISEASE CHARACTERIZED BY ACUTE ATTACKS OF NEUROLOGICAL
CC DYSFUNCTION…
CC SIMILARITY: BELONGS TO THE HMBS FAMILY.
… (links to related databases - secondary databases) …
KW Porphyrin biosynthesis; Heme biosynthesis; Lyase;
KW Alternative splicing; Disease mutation.
… (Sequence variations/Sequence)
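The two-letter line codes in the entry above (ID, AC, CC, KW, ...) make this kind of record easy to read programmatically. The following is a minimal sketch, not an official parser: the function names are hypothetical, the sample text is an abridged copy of the slide's entry, and the rule used to spot a comment topic (an upper-case label followed by a colon) is an assumption that fits the layout shown here.

```python
import re
from collections import defaultdict

# Abridged copy of the HEM3_HUMAN entry from the slide (elided parts left out).
ENTRY = """\
ID   HEM3_HUMAN STANDARD; PRT; 361 AA.
AC   P08397; P08396; Q16012;
GN   HMBS OR PBGD.
OS   Homo sapiens (Human).
CC   FUNCTION: TETRAPOLYMERIZATION OF THE MONOPYRROLE PBG INTO THE
CC   HYDROXYMETHYLBILANE PREUROPORPHYRINOGEN IN SEVERAL DISCRETE STEPS.
CC   SIMILARITY: BELONGS TO THE HMBS FAMILY.
KW   Porphyrin biosynthesis; Heme biosynthesis; Lyase;
KW   Alternative splicing; Disease mutation.
"""


def parse_entry(text):
    """Group the lines of one entry by their two-letter line code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if len(line) < 2:
            continue  # skip blank or truncated lines
        code, value = line[:2], line[2:].strip()
        fields[code].append(value)
    return fields


def comment_topics(fields):
    """Split CC lines into {topic: text}, joining continuation lines.

    Assumption: a topic starts with an upper-case label followed by ': '.
    """
    topics = {}
    current = None
    for value in fields.get("CC", []):
        match = re.match(r"([A-Z][A-Z ]*): (.*)", value)
        if match:
            current = match.group(1)
            topics[current] = match.group(2)
        elif current is not None:
            topics[current] += " " + value
    return topics


def keywords(fields):
    """Return the KW lines as a flat list of keywords."""
    joined = " ".join(fields.get("KW", []))
    return [kw.strip().rstrip(".") for kw in joined.split(";") if kw.strip()]


if __name__ == "__main__":
    parsed = parse_entry(ENTRY)
    print(parsed["ID"][0].split()[0])
    print(comment_topics(parsed)["FUNCTION"])
    print(keywords(parsed))
```

For real work, an established library parser would be a safer choice than this hand-rolled sketch, since genuine entries contain more line types and edge cases than the excerpt shows.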

Annotation in Motif databases: INTERPRO

Genome databases
- Some deal with single organisms (e.g. SubtiList for B. subtilis; Sanger Centre M. tuberculosis)
- Some deal with multiple genomes (e.g. TIGR microbial genomes database)
- The level of annotation can be extensive
- Many are much more than sequence repositories, extending the sequence with tons of information (e.g. mutants; strains; complementation plasmids etc.)
- If you are working with a model organism, chances of obtaining reliable functional annotations are improved

Genome database: YPD

Function assignment by homology I
- If you just have a sequence
- The most common bioinformatics procedure
- Search your protein of interest against primary databases; if you find a homologue with high identity, chances are it performs a similar function
- Many, many tools (BLAST, FASTA, S-W Search)
- Beware of annotation by homology
  - the relationship between sequence similarity and function is not straightforward
  - danger of propagating incorrect functional information
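To make the "S-W Search" entry concrete, here is a minimal sketch of the Smith-Waterman local alignment score. It is a simplification under stated assumptions: the match, mismatch and gap values are arbitrary illustrative numbers, the gap penalty is linear, and real search tools score residues with substitution matrices rather than a simple match/mismatch rule.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Uses the standard dynamic-programming recurrence, keeping only two
    rows of the score matrix. Scoring values are illustrative assumptions.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    # A shared core ("ACDE") embedded in unrelated flanking residues.
    print(smith_waterman_score("XXACDEXX", "YYACDEYY"))
```

A high score only says the sequences are similar; as the slide warns, that is not proof of shared function.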

Function assignment by homology II
- Consider databases which distinguish experimental function assignments from homology-based ones (e.g. YPD/WormPD, EcoCyc)
- Or use databases which employ more rigorous automated annotation tools (e.g. SwissProt)

“Among the peculiarities recognized by the programs are: size discrepancy, absence or mutation of regions involved in activity or binding (to metals, nucleotides, etc), presence of paralogs, contradiction with the biological context (i.e. if a protein belongs to a pathway supposed to be absent in a particular organism), etc. Such "problematic" proteins will not be automatically annotated.”

Genome database: YPD

Functional assignment “without homology”
- Novel functional assignment methods now exist which don’t make use of ‘direct’ homology searches
- They exploit other relationships between proteins which are used as indicators of shared function
  - Phylogenetic profiles
  - “Rosetta stone genes”

Phylogenetic profiles Pellegrini M et al., “Assigning protein functions by comparative genome analysis: protein phylogenetic profiles.” PNAS (1999) 96(8):4285-8
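The idea behind phylogenetic profiles can be sketched in a few lines: each protein gets a presence/absence vector across a set of genomes, and proteins with matching vectors are candidates for a functional link. Everything in the example is hypothetical (genome names, protein names, presence data), and exact profile matching is used only for simplicity; it is not claimed to be the procedure of the cited paper.

```python
from collections import defaultdict

# Hypothetical reference genomes and presence data, for illustration only.
GENOMES = ["genome_A", "genome_B", "genome_C", "genome_D"]
PRESENCE = {
    "P1": {"genome_A", "genome_C"},
    "P2": {"genome_A", "genome_C"},
    "P3": {"genome_B", "genome_C", "genome_D"},
    "P4": {"genome_A", "genome_B", "genome_C", "genome_D"},
}


def profile(protein):
    """Return the presence (1) / absence (0) vector of a protein."""
    return tuple(int(genome in PRESENCE[protein]) for genome in GENOMES)


def group_by_profile(proteins):
    """Group proteins sharing an identical profile; drop singletons."""
    groups = defaultdict(list)
    for protein in proteins:
        groups[profile(protein)].append(protein)
    return {prof: members for prof, members in groups.items() if len(members) > 1}


if __name__ == "__main__":
    for prof, members in group_by_profile(PRESENCE).items():
        print(prof, members)
```

Note that P4, present in every genome, carries little information: a profile of all ones matches any other ubiquitous protein regardless of function.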

Rosetta Stone method
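As a rough illustration of the Rosetta Stone idea: if two separate proteins in one genome both match different, non-overlapping parts of a single fused protein in another genome, the two are predicted to be functionally linked. The sketch below assumes the similarity hits have already been computed; all names and coordinates are hypothetical, and the non-overlap test is a simplification.

```python
from itertools import combinations

# Hypothetical hits: (query protein, candidate fused protein, start, end).
# Coordinates are positions on the candidate fused protein.
HITS = [
    ("geneA", "fusedX", 1, 200),
    ("geneB", "fusedX", 230, 480),
    ("geneC", "fusedY", 10, 150),
    ("geneD", "fusedY", 100, 300),
]


def rosetta_links(hits):
    """Return (protein1, protein2, fused protein) triples.

    Two query proteins are linked when they hit non-overlapping regions
    of the same candidate fused protein.
    """
    by_target = {}
    for query, target, start, end in hits:
        by_target.setdefault(target, []).append((query, start, end))

    links = set()
    for target, regions in by_target.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            non_overlapping = e1 < s2 or e2 < s1
            if q1 != q2 and non_overlapping:
                links.add((min(q1, q2), max(q1, q2), target))
    return links


if __name__ == "__main__":
    print(rosetta_links(HITS))
```

Here geneC and geneD are not linked because their hits overlap on fusedY, which suggests they are homologous to the same region rather than to two fused components.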

More methods…
Marcotte EM, et al., Nature (1999) 402:83-86
Enright AJ, et al., Nature (1999) 402:86-90

Functional assignment “without homology”
- Some access over the WWW
  - but experimental
  - and only for certain organisms (Yeast, E. coli, M. tuberculosis)
  - many proprietary methods
- Considered one of the most promising solutions for preliminary annotation of “unknown function” proteins in genome sequencing projects

Collecting information for many genes
- Usually for “large-scale biology” (e.g. micro-array experiments)
- Genome Databases
- Functional classification schemes

Genome Databases
- Genome sequencing projects are now the primary driving force for extensive functional annotation
- We have the genes (ORFs), we want the functions: FUNCTIONAL GENOMICS

(… more ’omes)

Functional classification schemes I
- Dealing with large sets of genes: functional classification schemes
- Tentative schemes as early as 1983; use driven by genome sequencing projects
- First extensive scheme published in 1993 by Monica Riley [regularly updated (GenProtEC; EcoCyc)]
- The majority of current schemes are heavily influenced by the ‘Riley scheme’
- ‘2nd generation’ schemes are now being developed

Functional classification schemes II
- Most schemes can be thought of as trees
- Progression along the tree (root to leaves) represents increasingly specific functions
- ORFs are generally associated with leaf nodes (and therefore, implicitly, with the intermediary nodes above them)
- Examples of use:
  - create gene sets linked by functionality (e.g. to detect functional motifs)
  - validate a functional connection between genes (e.g. gene expression studies)
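The tree view of a classification scheme can be sketched as a child-to-parent map, with ORFs attached to leaves and inherited by every node on the path to the root. This is a toy illustration: the category names are borrowed from the example scheme in this talk, but their arrangement here, and the ORF identifiers, are hypothetical.

```python
# Child -> parent map for a hypothetical mini scheme (None marks the root).
SCHEME = {
    "Metabolism of small molecules": None,
    "Energy metabolism": "Metabolism of small molecules",
    "Glycolysis": "Energy metabolism",
    "Fermentation": "Energy metabolism",
    "Amino acids": "Metabolism of small molecules",
    "Alanine": "Amino acids",
}

# Hypothetical ORF identifiers assigned to leaf categories.
LEAF_ORFS = {
    "Glycolysis": ["orf1", "orf2"],
    "Fermentation": ["orf3"],
    "Alanine": ["orf4"],
}


def path_to_root(node):
    """Return the node followed by all of its ancestors, leaf first."""
    path = []
    while node is not None:
        path.append(node)
        node = SCHEME[node]
    return path


def orfs_per_node():
    """Propagate leaf ORFs upward so each node lists all ORFs beneath it."""
    members = {node: set() for node in SCHEME}
    for leaf, orfs in LEAF_ORFS.items():
        for node in path_to_root(leaf):
            members[node].update(orfs)
    return members


if __name__ == "__main__":
    for node, orfs in orfs_per_node().items():
        print(f"{node}: {len(orfs)} ORFs")
```

Cutting the tree at a chosen depth then yields gene sets of the desired specificity, which is how such schemes are used to group genes by function.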

An example scheme… (GenProtEC)
[Figure: tree diagram. Top node: Metabolism of small molecules. Below it: Amino Acids; Central Intermediary Metabolism; Energy Metabolism; etc. Leaf categories include Alanine; Amino sugars; Aerobic respiration; Fermentation; Glycolysis; etc. ORF counts shown on the figure: (900 ORFs), (112 ORFs), 2, 8, 32, 18 and 22 ORFs.]

Issues
- Functions: Apples and Oranges
- Multi-dimensionality
- Multi-functionality

Issues: Apples and Oranges
- Function is an umbrella catch-all term
- Schemes do not distinguish between aspects of function
- Most commonly they mix gene product type (T), activity (A) and cellular role (R):
  - Cell division (R) : DNA replication (A)
  - Osmotic adaptation (R) : Ion channel (T,A)

Issues - Multi-dimensionality I
- Human trypsin functions:
  - Biochemical: peptide bond hydrolysis
  - Molecular: proteolytic enzyme
  - Cellular: protein degradation
  - Physiological: digestion
- Could conceive a number of other dimensions
  - Cellular location
  - Regulation

Issues - Multi-dimensionality II
- Why differentiate function and process?
- Figure of cell cycle-dependent Yeast gene expression clusters (Pat Brown lab - Stanford)

Issues - Multi-functionality
- Inherent: e.g. lac repressor; carbohydrate metabolism and osmoprotection
- Multi-subunit: e.g. succinic dehydrogenase; whole - enzyme in TCA; subunit 1 - electron transport chain; subunit 2 - cell structure
- Circumstantial: e.g. acetate kinase; acetate only environment - acetate metabolism; acetate absent - fermentation enzyme

Gene Ontology - a collaboration
- Drosophila (fruit fly) - FlyBase
- Saccharomyces Genome Database (SGD)
- Mus (mouse) - Mouse Genome Database (MGD)

Gene Ontology - the next generation
- Multi-dimensional:
  - functional primitive: “a capability that a physical gene product (or gene product group) carries as a potential” (e.g. transporter or adenylate cyclase)
  - process: “a biological objective accomplished via one or more ordered assemblies of functions” (e.g. cell growth and maintenance or purine metabolism)
  - cellular component
- Extensive: depth 11; nearly 4000 terms
- More complex organisation: away from tree structure
- Theoretically applicable to all species (designed for multicellular eukaryotes)
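Moving "away from tree structure" means a term may have more than one parent, so the scheme becomes a directed acyclic graph and a gene product inherits every ancestor reachable along any path. The sketch below illustrates this with a hypothetical mini graph: a few term names are borrowed from the examples on this slide, but the parent links and the term "hypothetical_term" are invented for illustration and do not reproduce the actual ontology.

```python
# Term -> list of parent terms. Links are hypothetical, for illustration only.
PARENTS = {
    "process": [],
    "metabolism": ["process"],
    "cell growth and maintenance": ["process"],
    "purine metabolism": ["metabolism"],
    # A term with two parents: impossible in a strict tree.
    "hypothetical_term": ["purine metabolism", "cell growth and maintenance"],
}


def ancestors(term):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


if __name__ == "__main__":
    print(sorted(ancestors("hypothetical_term")))
```

The `seen` set matters here: in a graph with shared ancestors, the same term can be reached by several routes and should be counted once.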

Gene Ontology - Process

Gene Ontology - current status

Where to look for functional information - single protein
- With 1 or a few genes:
  - Primary databases (e.g. SwissProt)
  - Model organism databases (e.g. GenProtEC; SGD; WormPD)
  - Metabolic/Pathway databases (e.g. KEGG)
  - Value-added databases (e.g. Motif databases; Disease databases)
  - By homology
  - Not by homology

Where to look for functional information - protein sets
- Need some sort of functional classification scheme:
  - Tree-like schemes (e.g. TIGR, GenProtEC)
  - Gene Ontology (FlyBase, MGD, SGD)
- For comparative genomics, need schemes applied to multiple organisms (e.g. PEDANT, TIGR)
- Currently, the greatest genome coverage is provided by PEDANT (but it is not manually curated)

Conclusions
- Functional information is available but it is rarely centralised
- Function is a very broad term; it is hard to know if the information you need will be available at the level you need it
- New schemes (e.g. GO) are emerging which try to cope with functional annotation better
- And new automated functional annotation tools are emerging (‘intelligent systems’; non-homology based)
- You still need to validate predictions experimentally

A survey of (some) current schemes
1) EcoCyc/GenProtEC: E. coli scheme (Riley scheme, MBL)
2) SubtiList: Bacillus subtilis scheme (Institut Pasteur)
3) MIPS/PEDANT: yeast scheme (applied to other organisms in PEDANT) (Munich Information Center for Protein Sequences)
4) TIGR: microbial genomes scheme (The Institute for Genomic Research)
5) KEGG: multi-organism scheme (metabolic and regulatory pathways) (Kyoto Encyclopedia of Genes and Genomes)
6) WIT: multi-organism scheme (metabolic reconstruction) (What is There; ANL)
7) Gene Ontology: a 2nd generation functional classification scheme (EBI; FlyBase; MGD; SGD)

FuncWheel for the Combination Scheme

Conclusions - Scheme comparison I
- Similar in the coverage of function (although with widely varying ‘granularity’)...
- ...yet different enough that direct comparison is complex
- Essentially deal with unicellular microbial organisms (MIPS is tackling this)
- Certain ‘niche’ schemes (e.g. WIT/KEGG)...
- ...or user community tailored schemes (e.g. SubtiList)

WWW sites I
- Primary databases (Sequence):
  - SwissProt:
  - PIR:
  - NCBI databases:
- Primary databases (Structure):
  - Protein Data Bank:
  - Macromolecular Structure Database:
- Value added:
  - INTERPRO:

WWW sites II
- Single genome databases:
  - SubtiList:
  - Saccharomyces Genome Database:
  - EcoCyc:
  - GenProtEC:
  - FlyBase:
  - Mouse Genome Database (MGD):
  - Yeast Protein Database (YPD) and WormPD:

WWW sites III
- Multiple genome databases:
  - The Institute for Genomic Research:
  - MIPS/PEDANT:
  - HAMAP:
- Pathway databases:
  - KEGG:
  - WIT:
- Non-homology based function prediction:
  - Mycobacterium tuberculosis: mbi.ucla.edu/people/sergio/TB/tb.html
  - Yeast:
- A relevant paper