Bioinformatics. Analysis of proteomic data. Dr Richard J Edwards 28 August 2009; CALMARO workshop. ©Gary Larson (In not much detail)


Bioinformatic analysis of proteomic data
- Improving sequence identifications
  - Dealing with redundancy
  - Annotating protein hits
- Adding value to protein lists
  - Accession number mapping & data integration
  - Gene Ontology analysis
  - Protein interaction networks
- Example: identifying E. huxleyi proteins with multi-species and EST sequence databases
- Open Discussion

Improving identifications: dealing with redundancy.

Identifying redundancy
- Choice of database affects redundancy identification
  - SwissProt/IPI indicate splice variants
  - EnsEMBL peptides map back onto non-redundant gene IDs
  - Poor annotation makes it hard to differentiate variant/error/family
Figure: Copyright ©2005 American Society for Biochemistry and Molecular Biology. Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4:

Identifying redundancy
- Sometimes, identification cannot be conclusive
Figure: Example: alpha tubulin protein family. Copyright ©2005 American Society for Biochemistry and Molecular Biology. Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4:

Identifying redundancy
- Sometimes, identification cannot be conclusive
- Different scenarios can present different problems
- How important is it to study?
- Might need to identify protein(s) through further experiments
Figure: Basic peptide grouping scenarios. Copyright ©2005 American Society for Biochemistry and Molecular Biology. Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4:

Identifying redundancy
- Final protein list:
  - Conclusive IDs
  - Protein groups
  - Inconclusive IDs
- Are inconclusive/group hits redundant?
  - Same protein from different species
  - Splice variants
- Does it matter?
  - Inflated numbers
  - Biased analyses
  - Comparisons between experiments
Figure: A simplified example of a protein summary list (peptide categories: Unique to protein; Unique to group; No unique). Copyright ©2005 American Society for Biochemistry and Molecular Biology. Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4:
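The three peptide categories on this slide (unique to protein, unique to group, no unique) can be turned into a simple labelling of a protein list. The following is a minimal sketch only, not the method used in the cited paper: the function name `classify_proteins`, the labels, and the rule that a "group" means proteins with identical peptide sets are all assumptions made for illustration.

```python
from collections import defaultdict


def classify_proteins(protein_peptides):
    """Label proteins by the peptide evidence that supports them.

    protein_peptides: dict of protein ID -> set of identified peptides.
    Returns a dict of protein ID -> 'conclusive', 'group' or 'inconclusive'.
    (Hypothetical helper; the labels mirror the slide, not a published tool.)
    """
    # Invert the mapping: which proteins could each peptide have come from?
    peptide_proteins = defaultdict(set)
    for protein, peptides in protein_peptides.items():
        for peptide in peptides:
            peptide_proteins[peptide].add(protein)

    # Assumption: proteins with identical peptide sets cannot be told apart,
    # so they are treated as one protein group.
    groups = defaultdict(set)
    for protein, peptides in protein_peptides.items():
        groups[frozenset(peptides)].add(protein)

    labels = {}
    for protein, peptides in protein_peptides.items():
        group = groups[frozenset(peptides)]
        if any(peptide_proteins[p] == {protein} for p in peptides):
            labels[protein] = "conclusive"    # has a peptide unique to the protein
        elif len(group) > 1 and any(peptide_proteins[p] == group for p in peptides):
            labels[protein] = "group"         # has a peptide unique to its group
        else:
            labels[protein] = "inconclusive"  # every peptide is shared elsewhere
    return labels


# Made-up example data.
hits = {
    "A": {"p1", "p2"},
    "B": {"p3", "p4"},
    "C": {"p3", "p4"},
    "D": {"p2", "p4"},
}
labels = classify_proteins(hits)
```

With this made-up input, A is conclusive (p1 is unique to it), B and C form a group (p3 maps only to them), and D is inconclusive. Counting only conclusive IDs plus one entry per group is one way to avoid the inflated numbers mentioned above.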

Homology groupings
- Can use BLAST to identify groups of related proteins
  - Help identify possible redundancies
  - Need to look at peptides
- Particularly useful for “off-species” identifications
  - Tendency for many hits to same protein in different species
Figure: Clustering proteins by %identity
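Grouping proteins by %identity can be sketched as single-linkage clustering over pairwise BLAST results. This is an illustrative assumption rather than the clustering actually used for the slide: the input is taken to be (query, subject, %identity) tuples already parsed from a BLAST run, and the 90% threshold, function name, and protein IDs are hypothetical.

```python
def cluster_by_identity(proteins, hits, min_identity=90.0):
    """Single-linkage clustering of proteins from pairwise %identity values.

    proteins: iterable of protein IDs in the list.
    hits: iterable of (query, subject, percent_identity) tuples.
    min_identity: hypothetical threshold; choose it to suit the data.
    Returns a sorted list of clusters, each a sorted list of IDs.
    """
    parent = {p: p for p in proteins}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, identity in hits:
        if identity >= min_identity and query in parent and subject in parent:
            parent[find(query)] = find(subject)  # merge the two clusters

    clusters = {}
    for p in parent:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(sorted(members) for members in clusters.values())


# Made-up IDs: the same tubulin hit in three species plus an unrelated protein.
proteins = ["tubA_sp1", "tubA_sp2", "tubA_sp3", "actin_sp1"]
hits = [
    ("tubA_sp1", "tubA_sp2", 99.0),
    ("tubA_sp2", "tubA_sp3", 98.5),
    ("tubA_sp1", "actin_sp1", 25.0),
]
clusters = cluster_by_identity(proteins, hits)
```

Single linkage is deliberately permissive, so clusters still need checking at the peptide level, as the slide notes.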

Improving identifications: annotating protein hits.

Protein annotation
- Poorly (un)annotated proteins
  - Real proteins or database noise?
  - Reliable annotation?
Figure labels: Database; Protein List; NOISE

Where does the data come from?
- Most of our protein data comes from DNA sequences
  - PDB: 53,660 structures = 3D
  - SwissProt: 392,667 = Curated
  - TrEMBL: >6 million & UniParc: >16 million = Most inferred from DNA
- Most annotation inferred through sequence analysis
- Protein data from translated DNA
  - Lots of errors!
  - Sequence errors
  - Annotation errors
Figure labels: Annotation; Translation

Protein annotation  Use standard sequence analysis tools  Manual guidance/care = better than automated databases!  Homology searching  BLAST vs. UniProtKB  Protein domain searches, e.g. PFam  Conservation analysis  Multiple sequence alignment with homologues  Are functionally important sites conserved?  Phylogenetic analysis  Evolutionary relationships can help distinguish function  Assignment to protein subfamily etc.  Useful where BLAST hits have competing annotation

Beyond proteomics: adding value to protein lists.

What Bioinformatics cannot (usually) do
- Magic
- Replace hypothesis-driven research
  - Directed analysis is always better than “fishing” (e.g. GO)
- Provide a definitive answer
  - Ranking/prioritising better

Follow-up analyses  Many possibilities  What was the aim of the study?  What resources are available for your organism?  Imitation is the sincerest form of flattery  Find a good study and copy the best bits  Easier to describe  Easier to justify to reviewers  Hypothesis-driven analysis is best  Many tools facilitate hypothesis generation (data exploration)  Be aware of risk of testing a hypothesis on data used to generate it  Be aware of multiple testing issues

Follow-up analyses  EBI and NCBI both provide many useful tools  EBI run many good courses at Hinxton

Seek collaborations
- Find a tame bioinformatician to help if needed
- Good collaboration = Trade
  - Papers / Grants / improving the bioinformatics
  - E.g. adding your organism/database to an online resource
Figure labels: Time / Energy; Reward; Bioinformatics. Image ©Gary Larson

Accession number mapping
- Other databases may contain better/specific annotation
  - UniProtKB, OMIM etc.
- Results from searches against older databases may need updating
- EBI tool: PICR [Protein Identifier Cross-Reference Service]
- BioMart: Query & Xref tool for many databases
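Once a cross-reference table has been exported from a mapping service, applying it to a protein list is straightforward. The sketch below assumes a tab-separated export with a header row; the column names (`old_id`, `new_id`), the function names, and the accessions are all made up for illustration and do not reflect the output format of PICR or BioMart.

```python
import csv
import io


def load_mapping(handle, source_col, target_col):
    """Read a tab-separated cross-reference table into source -> {targets}.

    handle: open text file (or file-like object) with a header row.
    source_col / target_col: column names, assumed here for illustration.
    """
    mapping = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        if row[source_col] and row[target_col]:
            mapping.setdefault(row[source_col], set()).add(row[target_col])
    return mapping


def map_accessions(accessions, mapping):
    """Return (mapped, unmapped): mapped is accession -> sorted target IDs."""
    mapped = {acc: sorted(mapping[acc]) for acc in accessions if acc in mapping}
    unmapped = [acc for acc in accessions if acc not in mapping]
    return mapped, unmapped


# Made-up table: one old accession maps to two current entries.
table = io.StringIO("old_id\tnew_id\nOLD1\tNEW1\nOLD1\tNEW2\nOLD2\tNEW3\n")
mapping = load_mapping(table, "old_id", "new_id")
mapped, unmapped = map_accessions(["OLD1", "OLD3"], mapping)
```

Keeping the unmapped accessions as a separate list matters: identifiers that silently drop out of the mapping are one source of bias in any downstream analysis.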

BioMart

Gene Ontology analysis  Gene Ontology [GO] = gene annotation project  Controlled vocabulary allows standardisation & comparisons

Gene Ontology analysis  Many Gene Ontology exploration tools  AmiGO, GOA, FatiGO, DAVID etc.  Depend on source databases  May need to map IDs using PICR first  GO enrichment  Assess frequency of GO terms in your list against expectation  Often a big multiple testing issue  Be aware of biases – how is expectation derived  E.g. Abundant, conserved proteins more likely to be annotated & more likely to be identified in a proteomics experiment  Best if hypothesis-driven or used for data confirmation  E.g. Enrichment of certain subcellular fraction
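To make "frequency against expectation" and the multiple testing issue concrete, here is a minimal sketch of one common approach: a one-sided hypergeometric test per term, followed by Benjamini-Hochberg adjustment. The slide does not prescribe a particular test, so both choices are assumptions, as are the function names and the toy data. The `background` argument is where the expectation is defined.

```python
from math import comb


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n proteins are drawn from N, of which K carry the term."""
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return hits / total


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonic adjusted values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def go_enrichment(study, background, annotations):
    """Test each term for over-representation in the study list.

    study: proteins identified; background: proteins that could have been
    identified; annotations: dict of term -> set of annotated proteins.
    Returns a list of (term, count_in_study, p_value, adjusted_p_value).
    """
    background = set(background)
    study = set(study) & background
    results = []
    for term, members in annotations.items():
        members = set(members) & background
        k = len(members & study)
        p = hypergeom_upper_tail(k, len(members), len(study), len(background))
        results.append((term, k, p))
    adjusted = benjamini_hochberg([p for _, _, p in results])
    return [(t, k, p, q) for (t, k, p), q in zip(results, adjusted)]


# Toy data with a made-up term name.
results = go_enrichment(
    study={"p1", "p2"},
    background={"p1", "p2", "p3", "p4"},
    annotations={"term_A": {"p1", "p2"}},
)
```

Using the whole proteome as `background` would build in exactly the bias described above; a background restricted to proteins that are plausibly detectable in the experiment is usually a fairer expectation.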

Protein interaction networks
- Can be useful for identifying protein complexes in data
- E.g. STRING

Example: identifying E. huxleyi proteins with multi-species and EST sequence databases

Combined search strategy
- Genome unavailable (for download & searching)
Figure labels: dbEST; Thalassiosira pseudonana; Taxa-limited Database; 90,000 E hux ESTs; Protein List; :Rhodophyta: :Stramenopiles: :Haptophyceae: :Alveolata: :Cryptophyta:

Figure labels (EST search pipeline): EST dataset; BLAST database; MS/MS data; MASCOT hits; Translated to 6RFs; RFs and MASCOT peptides filtered; FIESTA consensus & annotation; Final protein identifications; BUDAPEST CORE; Poor quality RFs removed; OPTIONAL (MANUAL or AUTOMATED); 90,000 E hux ESTs; 173 ESTs; RFs; Taxa-limited Database; 117; Cons; 287
- 173 EST hits (728 peptides)
- 83 Consensus sequences
  - 40 Clusters by homology (variants/isoforms)
- 287 Peptides
  - 239 Unique to one consensus
  - 48 Shared within one cluster
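The "Translated to 6RFs" step can be illustrated with a short six-frame translation. This is a sketch only and not the BUDAPEST implementation: it assumes the standard genetic code, writes any codon containing an ambiguous base as X, and uses hypothetical function names and a toy sequence.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: the
# standard code is appropriate for the ESTs being translated).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse-complement a DNA sequence; non-ACGT symbols are left as they are."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translations(seq):
    """Return a dict of frame name ('+1'..'+3', '-1'..'-3') -> translation."""
    seq = seq.upper()
    frames = {}
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(strand_seq[offset:])
    return frames


def frames_with_peptide(seq, peptide):
    """List the reading frames whose translation contains the peptide."""
    return [
        name
        for name, protein in six_frame_translations(seq).items()
        if peptide in protein
    ]


# Toy sequence, not a real EST.
frames = six_frame_translations("ATGGCCTAA")
```

For the toy sequence, frame +1 reads MA* and frame -1 reads LGH. In a real pipeline the translations would be split at stop codons and short or low-quality reading frames discarded before matching peptides.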

Annotating EST Consensus Sequences
- Homology searching & phylogenetics
Figure labels: Sequence Database; Consensus; UniProt; Taxa-limited Database; Alignment

Protein family identification

Redundancy/Variants

Combined search strategy
- Genome unavailable (for download & searching)
Figure labels: dbEST; Thalassiosira pseudonana; Taxa-limited Database; 90,000 E hux ESTs; 173 Hits; 83 Consensus; 40+ Proteins; 96 Hits; 26+ Proteins; :Rhodophyta: :Stramenopiles: :Haptophyceae: :Alveolata: :Cryptophyta:; 64+ Proteins (12 Common)

Conclusions.

Summary  Extra analysis of raw protein lists adds value  False positives vs. Real proteins  Annotation of uncharacterised hits  Numerous tools for mining protein lists  Data exploration and/or hypothesis testing  Community/Organism dependent  Worth contacting bioinformaticians for further development  Development of customised bioinformatics solutions can greatly increase power of study  Increased availability of high throughput technologies  Poor annotation & high error rates  Increased need for bioinformatics post-processing to improve quality

Open Discussion