Proteomics Informatics – Databases, data repositories and standardization (Week 7)

Slides:



Advertisements
Similar presentations
PSI Mass Spectrometry Standards Working Group Summary HUPO PSI MS Standards Working Group.
Advertisements

Genomes and Proteomes genome: complete set of genetic information in organism gene sequence contains recipe for making proteins (genotype) proteome: complete.
David Campbell 1,, Eric Deutsch 1, Henry Lam 1, Hamid Mirzaei 1, Paola Picotti 2, Jeff Ranish 1, Ning Zhang 1, and Ruedi Aebersold 1,2,3 1.Institute for.
Proteomics Examination Yvonne (Bonnie) Eyler Technology Center 1600 Art Unit 1646 (703)
UC Mass Spectrometry Facility & Protein Characterization for Proteomics Core Proteomics Capabilities: Examples of Protein ID and Analysis of Modified Proteins.
Creating NCBI The late Senator Claude Pepper recognized the importance of computerized information processing methods for the conduct of biomedical research.
Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.
MS-Viewer – A Web Based Spectral Viewer For Database Search Results Peter R. Baker 1, Alma L. Burlingame 1 and Robert J. Chalkley 1 1 Mass Spectrometry.
EBI Proteomics Services Team – Standards, Data, and Tools for Proteomics Henning Hermjakob European Bioinformatics Institute SME forum 2009 Vienna.
Proteomics Informatics – Databases, data repositories and standardization (Week 6)
Basic Genomic Characteristic  AIM: to collect as much general information as possible about your gene: Nucleotide sequence Databases ○ NCBI GenBank ○
Kelly Ruggles, Ph.D. Proteomics Informatics Week 9
Introduction to Proteomics. First issue of Proteomics- Jan. 1, 2001.
Lecture 2.21 Retrieving Information: Using Entrez.
Introduction to Bioinformatics - Tutorial no. 5 MEME – Discovering motifs in sequences MAST – Searching for motifs in databanks TRANSFAC – The Transcription.
Proteomics: A Challenge for Technology and Information Science CBCB Seminar, November 21, 2005 Tim Griffin Dept. Biochemistry, Molecular Biology and Biophysics.
CISC667, F05, Lec24, Liao1 CISC 667 Intro to Bioinformatics (Fall 2005) DNA Microarray, 2d gel, MSMS, yeast 2-hybrid.
Modeling Functional Genomics Datasets CVM Lesson 1 13 June 2007Bindu Nanduri.
Kelly Ruggles, Ph.D. Proteomics Informatics March 31, 2015
Proteomics Informatics – Overview of Mass spectrometry (Week 2) Ion Source Mass Analyzer Detector mass/charge intensity.
Proteomics Informatics (BMSC-GA 4437) Course Director David Fenyö Contact information
A combination of the words Proteomics and Genomics. Proteogenomics commonly refer to studies that use proteomic information, often derived from mass spectrometry,
Claire O’Donovan EMBL-EBI. In UniProtKB, we aim to provide… o A high quality protein sequence database A non redundant protein database, with maximal.
Proteomics Informatics (BMSC-GA 4437) Course Director David Fenyö Contact information
Enzymatic Function Module (KEGG, MetaCyc, and EC Numbers)
Proteomics Informatics Workshop Part II: Protein Characterization David Fenyö February 18, 2011 Top-down/bottom-up proteomics Post-translational modifications.
Proteomics Informatics – Overview of Mass spectrometry (Week 2)
Proteome.
New Tools Samifier: A tool which converts results from protein tandem mass spectrometry into SAM format. This enables co-visualization of genomics, transcriptomics,
Introduction The GPM project (The Global Proteome Machine Organization) Salvador Martínez de Bartolomé Bioinformatics support –
GENOME-CENTRIC DATABASES Daniel Svozil. NCBI Gene Search for DUT gene in human.
Biological Databases By : Lim Yun Ping E mail :
Finish up array applications Move on to proteomics Protein microarrays.
Karl Clauser Proteomics and Biomarker Discovery Breast Cancer Proteomics and the use of TCGA Mutational Data - Broad Institute update/issues Karl Clauser.
Verna Vu & Timothy Abreo
Data Standards Submission 1 st CHr-16 Workshop. Miraflores de la Sierra August, 28 th -29 th 2012 Alberto Medina.
Sackler Medical School
Lecture 9. Functional Genomics at the Protein Level: Proteomics.
Biological databases Exercises. Discovery of distinct sequence databases using ensembl.
Genomics II: The Proteome Using high-throughput methods to identify proteins and to understand their function.
Software Project MassAnalyst Roeland Luitwieler Marnix Kammer April 24, 2006.
XML Standards for Proteomics Data Andrew Jones, Dr Jonathan Wastling and Dr Ela Hunt Department of Computing Science and the Institute of Biomedical and.
Protein Sequence Analysis - Overview - NIH Proteomics Workshop 2007 Raja Mazumder Scientific Coordinator, PIR Research Assistant Professor, Department.
Johannes Griss PSI Meeting Heidelberg, April 2011 EBI is an Outstation of the European Molecular Biology Laboratory. mzTab Proposal for.
Central dogma: the story of life RNA DNA Protein.
Bioinformatics and Computational Biology
Proteogenomic Novelty in 105 TCGA Breast Tumors
Primary vs. Secondary Databases Primary databases are repositories of “raw” data. These are also referred to as archival databases. -This is one of the.
EBI is an Outstation of the European Molecular Biology Laboratory. UniProtKB Sandra Orchard.
Proteomics Informatics (BMSC-GA 4437) Instructor David Fenyö Contact information
Central hub for biological data UniProtKB/Swiss-Prot is a central hub for biological data: over 120 databases are cross-referenced (EMBL/DDBJ/GenBank,
Peptide-assisted annotation of the Mlp genome Philippe Tanguay Nicolas Feau David Joly Richard Hamelin.
UCSC Genome Browser Zeevik Melamed & Dror Hollander Gil Ast Lab Sackler Medical School.
Accessing and visualizing genomics data
High throughput biology data management and data intensive computing drivers George Michaels.
Using public resources to understand associations Dr Luke Jostins Wellcome Trust Advanced Courses; Genomic Epidemiology in Africa, 21 st – 26 th June 2015.
Proteomics Informatics (BMSC-GA 4437) Course Directors David Fenyö Kelly Ruggles Beatrix Ueberheide Contact information
Lydie Lane, HUPO meeting 2013, Yokohama Integration of proteomics data in.
Proteomics Informatics – Overview of Mass spectrometry (Week 2)
Creation of assays using repositories
Annotation: linking literature to gene products
There are four levels of structure in proteins
Searching the NCBI Databases
Proteomics Informatics David Fenyő
A perspective on proteomics in cell biology
Annotation Presentation
Proteomics Informatics –
Part II SeqViewer AraCyc Help
Proteomics Informatics David Fenyő
Protein Identification David Fenyő
Presentation transcript:

Proteomics Informatics – Databases, data repositories and standardization (Week 7)

Protein Sequence Databases

RefSeq Distinguishing Features of the RefSeq collection include: non-redundancy explicitly linked nucleotide and protein sequences updates to reflect current knowledge of sequence data and biology data validation and format consistency ongoing curation by NCBI staff and collaborators, with reviewed records indicated

Ensembl genome information for sequenced chordate genomes. evidenced-based gene sets for all supported species large-scale whole genome multiple species alignments across vertebrates variation data resources for 17 species and regulation annotations based on ENCODE and other data sets.

UniProt The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information.

Species-Centric Consortia For some organisms, there are consortia that provide high-quality databases: Yeast ( Fly ( Arabidopsis (

FASTA RefSeq: >gi| |ref|NP_ | zinc finger protein 683 [Homo sapiens] MKEESAAQLGCCHRPMALGGTGGSLSPSLDFQLFRGDQVFSACRPLPDMVDAHGPSCASWLCPLPLAPGRSALLACLQDL DLNLCTPQPAPLGTDLQGLQEDALSMKHEPPGLQASSTDDKKFTVKYPQNKDKLGKQPERAGEGAPCPAFSSHNSSSPPP LQNRKSPSPLAFCPCPPVNSISKELPFLLHAFYPGYPLLLPPPHLFTYGALPSDQCPHLLMLPQDPSYPTMAMPSLLMMV NELGHPSARWETLLPYPGAFQASGQALPSQARNPGAGAAPTDSPGLERGGMASPAKRVPLSSQTGTAALPYPLKKKNGKI LYECNICGKSFGQLSNLKVHLRVHSGERPFQCALCQKSFTQLAHLQKHHLVHTGERPHKCSVCHKRFSSSSNLKTHLRLH SGARPFQCSVCRSRFTQHIHLKLHHRLHAPQPCGLVHTQLPLASLACLAQWHQGALDLMAVASEKHMGYDIDEVKVSSTS QGKARAVSLSSAGTPLVMGQDQNN Ensembl: >ENSMUSP pep:known supercontig:NCBIM37:NT_166407:104574:105272:-1 gene:ENSMUSG transcript:ENSMUST MFSLMKKRRRKSSSNTLRNIVGCRISHCWKEGNEPVTQWKAIVLGQLPTNPSLYLVKYDGIDSIYGQELYSDDRILNLKVL PPIVVFPQVRDAHLARALVGRAVQQKFERKDGSEVNWRGVVLAQVPIMKDLFYITYKKDPALYAYQLLDDYKEGNLHMIPD TPPAEERSGGDSDVLIGNWVQYTRKDGSKKFGKVVYQVLDNPSVFFIKFHGDIHIYVYTMVPKILEVEKS UniProt: >sp|Q16695|H31T_HUMAN Histone H3.1t OS=Homo sapiens GN=HIST3H3 PE=1 SV=3 MARTKQTARKSTGGKAPRKQLATKVARKSAPATGGVKKPHRYRPGTVALREIRRYQKSTELLIRKLPFQRLMREIAQDFK TDLRFQSSAVMALQEACESYLVGLFEDTNLCVIHAKRVTIMPKDIQLARRIRGERA

PEFF - PSI Extended Fasta Format >sp:P06748 \ID=NPM_HUMAN \Pname=(Nucleophosmin) (NPM) (Nucleolar phosphoprotein B23) (Numatrin) (Nucleolar protein NO38) \NcbiTaxId=9606 \ModRes=(125|MOD:00046)(199|MOD:00047) \Length=294 >sp:P00761 \ID=TRYP_PIG \Pname=(Trypsin precursor) (EC ) \NcbiTaxId=9823 \Variant=(20|20|V) \Processed=(1|8|PROPEP)(9|231|CHAIN) \Length=231

Sample-specific protein sequence databases Protein DB Identified and quantified peptides and proteins MS Samples Peptides

Sample-specific protein sequence databases Next-generation sequencing of the genome and transcriptome Sample-specific Protein DB Identified and quantified peptides and proteins MS Samples Peptides

Sample-specific protein sequence databases Next-generation sequencing of the genome and transcriptome Sample-specific Protein DB Identified and quantified peptides and proteins MS Samples Peptides

Proteomics and Transcriptomics of Breast Tumors , , , , , , , , , ,000 ABI 5600 Triple TOF Primary Breast tumor Xenograft tumor Illumina HiSeq RNA-Seq MS/MS

Germline and Somatic Variants The frequency of proteins as a function of the number of amino acid changes due to germline and somatic variants for the basal and luminal breast tumor xenografts

Alternative Splicing The number of exon/exon junctions as a function of the number of RNA-Seq reads for the basal breast tumor xenograft.

Protein identification using sample-specific sequence databases Protein DB + germline / somatic variants Tumor genome sequence Germline variants Somatic variants Potentially novel peptides Tumor RNA-Seq 1114 Potentially novel peptides Spans splice site 70

Data Repositories

ProteomeExchange

PRIDE

PeptideAtlas

The Global Proteome Machine Databases (GPMDB)

Comparison with GPMDB Most proteins show very reproducible peptide patterns

Comparison with GPMDB Query Spectrum Best match In GPMDB Second best match In GPMDB

GPMDB usage last month

GPMDB Data Crowdsourcing Any lab performs experiments Raw data sent to public repository (TRANCHE, PRIDE) Data imported by GPMDB Data analyzed & accepted/rejected General community uses information and inspects data Accepted information loaded into public collection

Information for including a data set in GPMDB 1.MS/MS data (required) 1.MS raw data files 2.ASCII files: mzXML, mzML, MGF, DTA, etc. 3.Analysis files: DAT, MSF, BIOML 2.Sample Information (supply if possible) 1.Species : human, yeast 2.Cell/tissue type & subcellular localization 3.Reagents: urea, formic acid, etc. 4.Quantitation: SILAC, iTRAQ 5.Proteolysis agent: trypsin, Lys-C 3.Project information (suggested) 1.Project name 2.Contact information

How to characterize the evidence in GPMDB for a protein? High confidence Medium confidence Low confidence No observation

Star t EndN SkewKurt Statistical model for 212 observations of TP53

Statistical model for observations of DNAH2

Statistical model for observations of GRAP2

DNA Repair

TP53BP1:p, tumor protein p53 binding protein 1

Sequence Annotations

TP53BP1:p, tumor protein p53 binding protein 1

Peptide observations, catalase Peptide SequenceObservations FSTVAGESGSADTVR2633 FNTANDDNVTQVR2432 AFYVNVLNEEQR1722 LVNANGEAVYCK1701 GPLLVQDVVFTDEMAHFDR1637 LSQEDPDYGIR1560 LFAYPDTHR1499 NLSVEDAAR1400 FYTEDGNWDLVGNNTPIFFIR1386 ADVLTTGAGNPVGDK1338

Peptide Sequenceω FSTVAGESGSADTVR0.08 FNTANDDNVTQVR0.07 AFYVNVLNEEQR0.05 LVNANGEAVYCK0.05 GPLLVQDVVFTDEMAHFDR0.05 LSQEDPDYGIR0.04 LFAYPDTHR0.04 NLSVEDAAR0.04 FYTEDGNWDLVGNNTPIFFIR0.04 ADVLTTGAGNPVGDK0.04 Peptide frequency (ω), catalase

ω Peptide sequences Global frequency of observation (ω), catalase

For any set peptides observed in an experiment assigned to a particular protein (1 to j ): Omega (Ω) value for a protein identification

Protein IDΩ (z=2)Ω (z=3) SERPINB SNRPD CFL SNRPE PPIA CSTA PFN CAT GLRX CALM FABP Protein Ω’s for a set of identifications

Retention Time Distribution

Mass Accuracy

GO Cellular Processes

KEGG Pathways

Open-Source Resources

ProteoWizard

Protein Prospector

PROWL

Proteogenomics - PGx

UCSC Genome Browser

Slice - Scalable Data Sharing for Remote Mass Informatics Most mass spectrometry data is acquired in discovery mode, meaning that the data is amenable to open-ended analysis as our understanding of the target biochemistry increases. In this sense, mass spectrometry based discovery work is more akin to an astronomical survey, where the full list of object-types being imaged has not yet been fully elucidated, as opposed to e.g. micro-array work, where the list of probes spotted onto the slide is finite and well understood. slice.ionomix.com Developed by Manor Askenazi

Standardization

Standardization - MIAPE

Standardization – MIAPE-MSI

Standardization – XML Formats mzML - experimental results obtained by mass spectrometric analysis of biomolecular compounds mzIdentML - describe the outputs of proteomics search engines TraML - exchange and transmission of transition lists for selected reaction monitoring (SRM) experiments mzQuantML - describe the outputs of quantitation software for proteomics mzTab - defines a tab delimited text file format to report proteomics and metabolomics results. MIF - decribes the molecular interaction data exchange format. GelML - describes the processing and separations of proteins in samples using gel electrophoresis, within a proteomics experiment.

Standardization - mzML

Standardization - mzIdentML

Proteomics Informatics – Databases, data repositories and standardization (Week 7)