Bioinformatics. Strategies for proteomics: which database? Dr Richard J Edwards 27 August 2009; CALMARO workshop.


Bioinformatic Strategies for proteomics: which database?
- Sequence databases
- The importance of sequence databases in proteomics
- Database options
- Proteomics search strategies
- Completeness vs. Redundancy
- Local vs. online databases
- Species-specific vs. multi-species
- EST libraries
- Which database?
- Decoy databases
- Open Discussion

Sequence databases.

The importance of the database
[Figure labels: Database, Protein List]

The importance of the database
Idealised assumptions for matching spectra:
- Protein sample is pure
- All peaks correspond to protein fragments
- Protein fragmentation is complete and perfect
- Can predict peaks from sequence
- Every peak is present & precise
- Peaks can be matched exactly
- Exact protein will be present in database
- Every peak should find in silico match
Reality:
- Impurities
- Incomplete digestion
- Measurement error
- Post-translational modifications
[Figure labels: NOISE, Incomplete]
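The idealised assumption that peaks can be predicted from sequence can be illustrated with a minimal sketch of an in silico tryptic digest. It assumes cleavage after K or R except before P, no missed cleavages, no modifications, and neutral monoisotopic masses; the function names and the 10 ppm default tolerance are illustrative choices, not part of the talk.

```python
# Monoisotopic residue masses (Da) for the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added per free peptide


def tryptic_digest(sequence):
    """Split a protein after K or R, unless the next residue is P."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        at_end = i + 1 == len(sequence)
        if aa in "KR" and (at_end or sequence[i + 1] != "P"):
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def match_peaks(observed, predicted, tol_ppm=10.0):
    """Pair observed and predicted masses that agree within tol_ppm."""
    return [
        (obs, pred)
        for obs in observed
        for pred in predicted
        if abs(obs - pred) / pred * 1e6 <= tol_ppm
    ]
```

The tolerance in `match_peaks` is where the "Reality" items enter: measurement error forces a non-zero tolerance, and a sequence missing from the database, or a modified residue, produces peaks with no predicted partner at all.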

Quality vs. Quantity
- Trade-off in any analysis
[Figure: Quantity (Completeness) against Quality (Accuracy); labels: Ideal, Trade-off, Assumptions, Database, Reality]

The importance of the database
Idealised database for matching spectra:
- Every protein in sample is present in database
- 100% coverage
- No unnecessary duplication of data
- Non-redundant
- Sequence present in database matches sequence in sample
- Every peak should find exact in silico match
Reality:
- Incomplete proteome annotation
- Proteins missing
- Duplicate entries (redundancy)
- Incomplete protein annotation
- Truncated sequences
- Sequencing errors
- Missing sequence variants
[Figure labels: NOISE, Incomplete]

The importance of the database
[Figure labels: Database, Protein List, NOISE, Incomplete]

Database options.

Common proteomics databases

| Database | Description; source databases | Organisms | Update frequency |
|---|---|---|---|
| UniProt/Swiss-Prot [EBI] | Expertly curated; high level of annotation; minimum level of redundancy; high level of integration with other databases. | Many | Release every 4 months; updates every 2 weeks |
| UniProt/TrEMBL [EBI] | Computer-annotated supplement to UniProt/Swiss-Prot. Contains translated coding sequences from the GenBank nucleotide database, and protein sequences extracted from the literature or submitted to UniProt/Swiss-Prot but not yet manually curated. | Many | Release every 4 months; updates every 2 weeks |
| RefSeq [NCBI] | Ongoing curation by NCBI staff; non-redundant; explicitly linked nucleotide and protein sequences; stable reference; high level of integration with other databases. | Many | Release every 3 months |
| Ensembl [EBI] | Created using automated genome annotation pipeline; eukaryotic genomes only; explicitly linked nucleotide and protein sequences; stable reference; high level of integration with other databases. Peptides identified by MS/MS can be mapped to the genome via the Ensembl Protein database and visualized using the Ensembl Genome Browser. | Several | Every 1–2 months |
| IPI [EBI] | Good balance between degree of redundancy and completeness; references to the primary data sources; attempts to maintain stable identifiers (with incremental versioning), but still in flux. Assembled from UniProt (Swiss-Prot + TrEMBL), RefSeq, Ensembl, H-Invitational database. | A few | Monthly |
| Entrez Protein (NCBInr) [NCBI] | More complete with regard to sequence polymorphisms and splice forms; annotations extracted from curated databases; high degree of sequence redundancy makes interpretation difficult. Assembled from GenBank and RefSeq coding sequence translations, Protein Information Resource (PIR), Protein Data Bank (PDB), UniProt/Swiss-Prot, Protein Research Foundation (PRF). | Many | Frequent updates |

Which database?
- What resources are available?
- What is most important for your analysis?
[Figure: Quantity (Completeness) against Quality (Accuracy); labels: Database, Trade-off, Protein List, NOISE, Incomplete]

Generic Protein Databases
- UniProtKB / NCBI
- Advantages:
  - Ease of access
  - Completeness
  - Updated
- Disadvantages:
  - Redundancy
  - High noise levels
  - Inappropriate species etc.
  - In silico annotation
  - Not made with proteomics in mind
  - May not have relevant variants
[Figure: UniProt data sources and data flow; labels: Database sources, INSDC (incl. WGS, Env.), Ensembl, RefSeq, VEGA, PDB, WormBase, FlyBase, Patent Data, Sub/Peptide Data, Proteome Sets, IPI, UniParc, UniProtKB, UniMES, UniRef100, UniRef90, UniRef50, UniSave]

Genome databases
- Ensembl/FlyBase/WormBase etc.
- Advantages:
  - Organism-specific information
  - Potentially high (~100%) coverage
  - Potentially very low redundancy/noise
- Disadvantages:
  - Very dependent on annotation level/quality
  - Poor annotation = low completeness
  - May need other databases to interpret results
- Best database to use if well-annotated genome available

EST libraries
- NCBI dbEST / Organism-specific EST Projects
- Generate your own!
- Advantages:
  - Reasonable coverage of high expression proteins
  - Matches proteomics bias
  - Species-specific = more accurate matches
  - Enables identification of species-specific proteins
  - Not so reliant on annotation
  - Sequence variants
  - Transcripts without known homology/function
- Disadvantages:
  - Often very poor annotation = extra work
  - High levels of redundancy = extra work
  - Sequence fragments = missed identifications
- Search as DNA in six reading frames or annotate proteins first
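Searching ESTs "as DNA in six reading frames" means translating each sequence in three frames on each strand. A minimal sketch follows, assuming the standard genetic code and writing ambiguous codons as `X`; the function names and the `+1` to `-3` frame labels are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of a DNA string (A/C/G/T only are swapped)."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")  # X for codons with N etc.
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return the six translations keyed by '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames
```

For a coding EST at most one of the six translations is the real protein fragment, so the other five (plus any UTR sequence) add random sequence to the search space.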

Proteomics search strategies.

Proteomics search strategies
- No universal “best” strategy
- Trade-offs: best depends on focus
- Common trade-offs:
  - Completeness vs. Redundancy
  - Local vs. Online
  - Translated vs. Untranslated ESTs
  - Species-specific vs. Generic database
- Why are you doing the experiment?
- What do you want to identify?

Completeness vs. Redundancy
- Redundancy
  - Multiple identifications of essentially the same protein
  - Sequence variants within a species
  - Same protein (family) in different species
- More redundancy = extra work
  - Which hits are unique? (Different peptides)

Redundancy: identification issues
[Figure: Sequences of identified peptides often do not allow discrimination between different protein isoforms]
Source: Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: (Copyright ©2005 American Society for Biochemistry and Molecular Biology)
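The question "which hits are unique?" can be made concrete by mapping each identified peptide back to the proteins that contain it. The sketch below assumes the search results are already available as a mapping from protein identifier to the set of peptides matched to it; that input layout and the function names are illustrative, not a search engine's real output format.

```python
from collections import defaultdict


def classify_peptides(protein_peptides):
    """Split peptides into unique (one protein) and shared (several)."""
    peptide_to_proteins = defaultdict(set)
    for protein, peptides in protein_peptides.items():
        for peptide in peptides:
            peptide_to_proteins[peptide].add(protein)
    unique = {
        pep: next(iter(prots))
        for pep, prots in peptide_to_proteins.items()
        if len(prots) == 1
    }
    shared = {
        pep: sorted(prots)
        for pep, prots in peptide_to_proteins.items()
        if len(prots) > 1
    }
    return unique, shared


def indistinguishable_groups(protein_peptides):
    """Group proteins supported by exactly the same set of peptides."""
    groups = defaultdict(list)
    for protein, peptides in protein_peptides.items():
        groups[frozenset(peptides)].append(protein)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# Hypothetical hits: three isoforms sharing most of their peptides.
hits = {
    "isoform1": {"PEPTIDEA", "PEPTIDEC"},
    "isoform2": {"PEPTIDEA", "PEPTIDEC"},
    "isoform3": {"PEPTIDEA", "PEPTIDED"},
}
unique, shared = classify_peptides(hits)
groups = indistinguishable_groups(hits)
```

Here only `isoform3` has a unique peptide; `isoform1` and `isoform2` cannot be told apart from these data and would normally be reported as one group.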

Completeness vs. Redundancy
- Redundancy
  - Multiple identifications of essentially the same protein
  - Sequence variants within a species
  - Same protein (family) in different species
- More redundancy = extra work
  - Which hits are unique? (Different peptides)
  - Poor annotation
  - Splice variant vs. Protein family?
- More redundancy = lower sensitivity
  - Larger databases = more random hits = stricter score thresholds
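The link between database size and score threshold can be sketched with simple arithmetic. Assume, as a simplification, that each of N candidate peptides has the same independent probability p of matching a spectrum by chance, and that the score has the form -10·log10(p), a convention used by some search engines. Neither assumption comes from the talk. The expected number of random hits is then N·p, and keeping it below a chosen level alpha requires p ≤ alpha/N.

```python
import math


def expected_random_hits(n_candidates, p_random):
    """Expected chance matches when every candidate has probability p."""
    return n_candidates * p_random


def score_threshold(n_candidates, alpha=0.05):
    """Smallest -10*log10(p) score keeping expected random hits <= alpha."""
    return -10.0 * math.log10(alpha / n_candidates)


small = score_threshold(1_000)    # about 43.0
large = score_threshold(10_000)   # about 53.0
```

Under these assumptions every ten-fold increase in candidates raises the threshold by 10 score units, so a genuine match scoring between the two values would be accepted against the small database and rejected against the large one.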

Completeness vs. Redundancy
[Figure: Protein sequence databases differ in terms of their completeness and the degree of sequence redundancy; axes: Quantity (Completeness) against Quality (Non-Redundancy); labels: Trade-off, Database]
Source: Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: (Copyright ©2005 American Society for Biochemistry and Molecular Biology)

NCBI Redundancy
[Figure]
Source: Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: (Copyright ©2005 American Society for Biochemistry and Molecular Biology)

Completeness vs. Redundancy

| Large, Redundant Database | Small, Non-Redundant Database |
|---|---|
| Every major protein and specific variant is important | Happy to group similar hits to a single protein/family (e.g. HSP90) |
| Willing and able to perform extensive post-identification analysis | Want to minimise need for additional cleanup/data analysis |

- Try to maximise both quality and quantity
- Quality genome, IPI or custom-built search database
[Figure: Quantity (Completeness) against Quality (Non-Redundancy); points: dbEST, Genome X, IPI]

[Figure: An example of a protein family: alpha tubulins. Inconclusive identification]
Source: Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: (Copyright ©2005 American Society for Biochemistry and Molecular Biology)

Local vs. Online
- Standard databases plug in to search engines
- Can also search “Local” databases, stored on own machine

| Standard Online Database | Local Database |
|---|---|
| Regularly updated: latest sequence data & annotation | More control of content: customise species & sequences |
| Easy to describe & reference | Stable database for multiple searches |
| Don’t have to worry about sequence formats/naming etc. | Eases comparisons & redundancy removal across multiple experiments |
| | Eases generation of decoy database |
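Customising the species and sequences of a local database can be as simple as filtering a FASTA file. The sketch below assumes the species can be recognised by a substring in the header (for example an `OS=` field, as in UniProt-style headers) and treats only exact duplicate sequences as redundant; both choices, and the function names, are illustrative.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_local_database(records, species_tag):
    """Keep records for one species, dropping exact duplicate sequences."""
    seen, kept = set(), []
    for header, sequence in records:
        if species_tag not in header or sequence in seen:
            continue
        seen.add(sequence)
        kept.append((header, sequence))
    return kept


# Hypothetical input; a real file would be opened and passed line by line.
example = [
    ">p1 OS=Homo sapiens", "MKT", "AA",
    ">p2 OS=Mus musculus", "MKTAA",
    ">p3 OS=Homo sapiens", "MKTAA",
    ">p4 OS=Homo sapiens", "MLL",
]
local_db = build_local_database(read_fasta(example), "OS=Homo sapiens")
```

Saving the filtered records once and reusing the same file for every search is what gives the "stable database for multiple searches" listed above.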

EST libraries
- Can search untranslated in six reading frames (RFs)
- Or assemble, annotate & search proteins

| Untranslated | Assembled & Annotated |
|---|---|
| Quick and easy preparation | Time-consuming and difficult (unless already done!) |
| Suffers from short fragments | Longer sequences = more chance of multiple peptides |
| Potential to detect SNPs/isoforms | Assembly may incorrectly assimilate/remove variants |
| Low coverage: less robust to sequencing error | High coverage: more robust to sequencing error |
| Large quantity of random translations (UTRs + 5 incorrect RFs) | Smaller, higher quality dataset = fewer false positives |
| Detect novel proteins (no homology to known proteins) | ORFs without homology to known proteins may be removed |
| Need to annotate hits | |

Search species
- Best to use species-specific data where possible
  - More chance of identical peptide sequences
  - More chance of family member/isoform discrimination
  - Can identify taxa-specific proteins
- Can search in other species and infer
  - Only works for conserved proteins
  - Increases noise and False Positive rate
- Compromise
  - Subset of well-annotated & closely-related species
  - Maximise completeness, minimise noise

Poor genome vs. Wrong species
- Annotation generally based on homology to known proteins
  - Proteins similar enough to be found by proteomics will be easy to find & annotate in genome
- What is the genome coverage?
  - High coverage = little to be gained by additional search
  - Low coverage = may be many conserved proteins missing
  - Multi-species search may find extra proteins
- Compromise:
  - Search available species data (genome/EST)
  - Second search against selected taxa (UniProtKB)
- Bacteria: can search genome in 6RF! (No introns)

Which database? How to choose.

How to choose?
- What do you want to do?
- Priority
  - Reproducibility/Comparability/Hypothesis testing
    - Fewer, high quality identifications
    - Smaller, more focused database
  - Hypothesis generation
    - More identifications, more potential false positives
    - Larger/multiple search databases
- How much post-identification analysis?
  - Not much
    - Higher quality, lower numbers
  - Detailed manual analysis
    - More identifications, more potential false positives
- Always sensible to look for probable identifications
  - Are sequences missing from your search database?

Experimental design
- Protein separation
  - Better discrimination of variants
  - Cope with more redundancy
- Shotgun (no separation)
  - Less discrimination
Source: Nesvizhskii, A. I. (2005) Mol. Cell. Proteomics 4: (Copyright ©2005 American Society for Biochemistry and Molecular Biology)

Decoy databases.

False positives: NOISE
[Figure labels: Database, Protein List, NOISE, Incomplete]

False positives: NOISE
- How much noise?
- How many False Positives due to noise?
[Figure labels: Database, Decoy Database, Protein List, NOISE, Incomplete, RANDOM]
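One common way to answer "how many false positives are due to noise?" is to search a decoy database of reversed sequences alongside the real one and count the decoy hits. The sketch below assumes a combined target-plus-decoy search and estimates the false positive rate among target hits as decoys divided by targets. The `DECOY_` prefix and the function names are illustrative, and other decoy constructions (shuffled or randomised sequences) and estimators exist.

```python
def make_decoys(records, prefix="DECOY_"):
    """Build a decoy entry for each (header, sequence) by reversal."""
    return [(prefix + header, sequence[::-1]) for header, sequence in records]


def estimate_fdr(hit_ids, prefix="DECOY_"):
    """Estimate the false positive rate among target hits as decoys/targets."""
    decoys = sum(1 for hit in hit_ids if hit.startswith(prefix))
    targets = len(hit_ids) - decoys
    return decoys / targets if targets else 0.0


# Hypothetical example: one decoy hit among four target hits.
decoy_db = make_decoys([("p1", "MKTAA")])
fdr = estimate_fdr(["p1", "p2", "DECOY_p3", "p4", "p5"])
```

Reversal keeps the size and amino acid composition of the database unchanged, so random matches to decoys should occur about as often as random matches to real entries.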

Conclusions.

Summary
- Selection of database is very important for quality of results
- Primarily a trade-off between completeness & noise
- Choice of database depends on:
  - Availability of data
  - Experimental design
  - Aims/objectives of study
  - Priorities of analysis
- Well annotated genomes (proteomes!) best
- Poorly annotated genomes & ESTs can be supplemented with searching related taxa
- Local databases give more control/repeatability
- Decoy databases can help estimate false positive rates

Open Discussion