Proteomics Informatics –

Slides:

Advertisements

Similar presentations

PSI Mass Spectrometry Standards Working Group Summary HUPO PSI MS Standards Working Group.

Advertisements

David Campbell 1,, Eric Deutsch 1, Henry Lam 1, Hamid Mirzaei 1, Paola Picotti 2, Jeff Ranish 1, Ning Zhang 1, and Ruedi Aebersold 1,2,3 1.Institute for.

Prof. Carolina Ruiz Computer Science Department Bioinformatics and Computational Biology Program WPI WELCOME TO BCB4003/CS4803 BCB503/CS583 BIOLOGICAL.

EBI Proteomics Services Team – Standards, Data, and Tools for Proteomics Henning Hermjakob European Bioinformatics Institute SME forum 2009 Vienna.

MIAPE Extractor Tutorial SHPP meeting, 28 Aug 2012 La Cristalera, Miraflores de la Sierra, Madrid Salvador Martínez de Bartolomé Izquierdo CNB-CSIC / ProteoRed.

Proteomics Informatics – Databases, data repositories and standardization (Week 6)

Basic Genomic Characteristic  AIM: to collect as much general information as possible about your gene: Nucleotide sequence Databases ○ NCBI GenBank ○

Lecture 2.21 Retrieving Information: Using Entrez.

Introduction to Bioinformatics - Tutorial no. 5 MEME – Discovering motifs in sequences MAST – Searching for motifs in databanks TRANSFAC – The Transcription.

Proteomics: A Challenge for Technology and Information Science CBCB Seminar, November 21, 2005 Tim Griffin Dept. Biochemistry, Molecular Biology and Biophysics.

Modeling Functional Genomics Datasets CVM Lesson 1 13 June 2007Bindu Nanduri.

Proteomics Informatics – Overview of Mass spectrometry (Week 2) Ion Source Mass Analyzer Detector mass/charge intensity.

Proteomics Informatics (BMSC-GA 4437) Course Director David Fenyö Contact information

Proteomics Informatics (BMSC-GA 4437) Course Director David Fenyö Contact information

Proteomics Informatics Workshop Part II: Protein Characterization David Fenyö February 18, 2011 Top-down/bottom-up proteomics Post-translational modifications.

Proteomics Informatics – Overview of Mass spectrometry (Week 2)

Data standards from the Proteomics Standards Initiative Andy Jones University of Liverpool.

Introduction The GPM project (The Global Proteome Machine Organization) Salvador Martínez de Bartolomé Bioinformatics support –

How to assure MIAPE compliance of the data using the ProteoRed MIAPE Extractor tool HUPO-PSI meeting - Liverpool (15th April 2013) Salvador Martínez-Bartolomé.

GENOME-CENTRIC DATABASES Daniel Svozil. NCBI Gene Search for DUT gene in human.

BASys: A Web Server for Automated Bacterial Genome Annotation Gary Van Domselaar †, Paul Stothard, Savita Shrivastava, Joseph A. Cruz, AnChi Guo, Xiaoli.

Managing Data Modeling GO Workshop 3-6 August 2010.

Regulatory Genomics Lab Saurabh Sinha Regulatory Genomics Lab v1 | Saurabh Sinha1 Powerpoint by Casey Hanson.

Finish up array applications Move on to proteomics Protein microarrays.

Data Standards Submission 1 st CHr-16 Workshop. Miraflores de la Sierra August, 28 th -29 th 2012 Alberto Medina.

Proteomics Informatics – Databases, data repositories and standardization (Week 7)

Sackler Medical School

Genomics II: The Proteome Using high-throughput methods to identify proteins and to understand their function.

Software Project MassAnalyst Roeland Luitwieler Marnix Kammer April 24, 2006.

XML Standards for Proteomics Data Andrew Jones, Dr Jonathan Wastling and Dr Ela Hunt Department of Computing Science and the Institute of Biomedical and.

Regulatory Genomics Lab Saurabh Sinha Regulatory Genomics | Saurabh Sinha | PowerPoint by Casey Hanson.

Johannes Griss PSI Meeting Heidelberg, April 2011 EBI is an Outstation of the European Molecular Biology Laboratory. mzTab Proposal for.

ID Mapping to accessions from different databases. COST Functional Modeling Workshop April, Helsinki.

Bioinformatics and Computational Biology

Primary vs. Secondary Databases Primary databases are repositories of “raw” data. These are also referred to as archival databases. -This is one of the.

EBI is an Outstation of the European Molecular Biology Laboratory. UniProtKB Sandra Orchard.

Proteomics Informatics (BMSC-GA 4437) Instructor David Fenyö Contact information

Peptide-assisted annotation of the Mlp genome Philippe Tanguay Nicolas Feau David Joly Richard Hamelin.

UCSC Genome Browser Zeevik Melamed & Dror Hollander Gil Ast Lab Sackler Medical School.

Accessing and visualizing genomics data

High throughput biology data management and data intensive computing drivers George Michaels.

Proteomics Informatics (BMSC-GA 4437) Course Directors David Fenyö Kelly Ruggles Beatrix Ueberheide Contact information

Using Scaffold OHRI Proteomics Core Facility. This presentation is intended for Core Facility internal training purposes only.

Democratization of ‘Omics Data Availability and Review Robert Chalkley UCSF Data Management Editor - MCP.

Web Resources for Genomics Kei Cheung, Ph.D. Assistant Professor Yale Center for Medical Informatics (MBB 452a Genomics & Bioinformatics) Oct. 8, 2003.

Introduction to Genes and Genomes with Ensembl

Getting GO annotation for your dataset

Algorithms and Computation: Bottom-Up Data Analysis Workflows

Proteomics Informatics – Overview of Mass spectrometry (Week 2)

Jarrett Egertson, Ph.D. MacCoss Lab

Bottom-Up Proteomics Data collection

Regulatory Genomics Lab

Protein Identification via Database searching

Using ArrayExpress.

Creation of assays using repositories

Modified from slides from Jim Hu and Suzi Aleksander Spring 2016

Annotation: linking literature to gene products

There are four levels of structure in proteins

Proteomics Informatics David Fenyő

A perspective on proteomics in cell biology

Annotation Presentation

Regulatory Genomics Lab

Problems from last section

Session 1: WELCOME AND INTRODUCTIONS

(A) Design of the PhosphoPep database.

Part II SeqViewer AraCyc Help

Proteomics Informatics David Fenyő

Protein Identification David Fenyő

Regulatory Genomics Lab

Presentation transcript:

Proteomics Informatics – Databases, data repositories and standardization (Week 8)

Protein Sequence Databases

http://www.ncbi.nlm.nih.gov/books/NBK21091/ RefSeq Distinguishing Features of the RefSeq collection include: non-redundancy explicitly linked nucleotide and protein sequences updates to reflect current knowledge of sequence data and biology data validation and format consistency ongoing curation by NCBI staff and collaborators, with reviewed records indicated http://www.ncbi.nlm.nih.gov/books/NBK21091/

http://www.ensembl.org/ Ensembl genome information for sequenced chordate genomes. evidenced-based gene sets for all supported species large-scale whole genome multiple species alignments across vertebrates variation data resources for 17 species and regulation annotations based on ENCODE and other data sets. http://www.ensembl.org/

http://www.uniprot.org/ UniProt The mission of UniProt is to provide the scientific community with a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. http://www.uniprot.org/

Species-Centric Consortia For some organisms, there are consortia that provide high-quality databases: Yeast (http://yeastgenome.org/) Fly (http://flybase.org/) Arabidopsis (http://arabidopsis.org/)

http://en.wikipedia.org/wiki/FASTA_format FASTA RefSeq: >gi|168693669|ref|NP_001108231.1| zinc finger protein 683 [Homo sapiens] MKEESAAQLGCCHRPMALGGTGGSLSPSLDFQLFRGDQVFSACRPLPDMVDAHGPSCASWLCPLPLAPGRSALLACLQDL DLNLCTPQPAPLGTDLQGLQEDALSMKHEPPGLQASSTDDKKFTVKYPQNKDKLGKQPERAGEGAPCPAFSSHNSSSPPP LQNRKSPSPLAFCPCPPVNSISKELPFLLHAFYPGYPLLLPPPHLFTYGALPSDQCPHLLMLPQDPSYPTMAMPSLLMMV NELGHPSARWETLLPYPGAFQASGQALPSQARNPGAGAAPTDSPGLERGGMASPAKRVPLSSQTGTAALPYPLKKKNGKI LYECNICGKSFGQLSNLKVHLRVHSGERPFQCALCQKSFTQLAHLQKHHLVHTGERPHKCSVCHKRFSSSSNLKTHLRLH SGARPFQCSVCRSRFTQHIHLKLHHRLHAPQPCGLVHTQLPLASLACLAQWHQGALDLMAVASEKHMGYDIDEVKVSSTS QGKARAVSLSSAGTPLVMGQDQNN Ensembl: >ENSMUSP00000131420 pep:known supercontig:NCBIM37:NT_166407:104574:105272:-1 gene:ENSMUSG00000092057 transcript:ENSMUST00000167991 MFSLMKKRRRKSSSNTLRNIVGCRISHCWKEGNEPVTQWKAIVLGQLPTNPSLYLVKYDGIDSIYGQELYSDDRILNLKVL PPIVVFPQVRDAHLARALVGRAVQQKFERKDGSEVNWRGVVLAQVPIMKDLFYITYKKDPALYAYQLLDDYKEGNLHMIPD TPPAEERSGGDSDVLIGNWVQYTRKDGSKKFGKVVYQVLDNPSVFFIKFHGDIHIYVYTMVPKILEVEKS UniProt: >sp|Q16695|H31T_HUMAN Histone H3.1t OS=Homo sapiens GN=HIST3H3 PE=1 SV=3 MARTKQTARKSTGGKAPRKQLATKVARKSAPATGGVKKPHRYRPGTVALREIRRYQKSTELLIRKLPFQRLMREIAQDFK TDLRFQSSAVMALQEACESYLVGLFEDTNLCVIHAKRVTIMPKDIQLARRIRGERA http://en.wikipedia.org/wiki/FASTA_format

PEFF - PSI Extended Fasta Format >sp:P06748 \ID=NPM_HUMAN \Pname=(Nucleophosmin) (NPM) (Nucleolar phosphoprotein B23) (Numatrin) (Nucleolar protein NO38) \NcbiTaxId=9606 \ModRes=(125|MOD:00046)(199|MOD:00047) \Length=294 >sp:P00761 \ID=TRYP_PIG \Pname=(Trypsin precursor) (EC 3.4.21.4) \NcbiTaxId=9823 \Variant=(20|20|V) \Processed=(1|8|PROPEP)(9|231|CHAIN) \Length=231 http://www.psidev.info/node/363

Sample-specific protein sequence databases Identified and quantified Peptides MS Protein DB Identified and quantified peptides and proteins

Sample-specific protein sequence databases Next-generation sequencing of the genome and transcriptome Samples Peptides MS Sample-specific Protein DB Identified and quantified peptides and proteins

Sample-specific protein sequence databases Next-generation sequencing of the genome and transcriptome Samples Peptides MS Sample-specific Protein DB Identified and quantified peptides and proteins

Data Repositories

ProteomeExchange http://www.proteomeexchange.org/

PRIDE http://www.ebi.ac.uk/pride/

PeptideAtlas http://www.peptideatlas.org/

Chorus Key Aspects: Upload and share raw data with collaborators Analyze data with available tools and workflows Create projects and experiments Select from public files and (re-)analyze/visualize Download selected files

MassIVE Key Aspects: Upload files Spectra and Spectrum libraries, Analysis Results, Sequence Databases, Methods and Protocol) Perform analysis using available tools Browse public datasets Download data

The Global Proteome Machine Databases (GPMDB) http://gpmdb.thegpm.org

Comparison with GPMDB Most proteins show very reproducible peptide patterns

Comparison with GPMDB Query Spectrum Best match In GPMDB Second

GPMDB Data Crowdsourcing Any lab performs experiments Raw data sent to public repository (TRANCHE, PRIDE) Data imported by GPMDB Data analyzed & accepted/rejected Accepted information loaded into public collection General community uses information and inspects data

Information for including a data set in GPMDB MS/MS data (required) MS raw data files ASCII files: mzXML, mzML, MGF, DTA, etc. Analysis files: DAT, MSF, BIOML Sample Information (supply if possible) Species : human, yeast Cell/tissue type & subcellular localization Reagents: urea, formic acid, etc. Quantitation: SILAC, iTRAQ Proteolysis agent: trypsin, Lys-C Project information (suggested) Project name Contact information

How to characterize the evidence in GPMDB for a protein? High confidence Medium confidence Low confidence No observation

Statistical model for 212 observations of TP53 Start End N -2 -3 -4 -5 -6 -7 -8 -9 -10 -11 Skew Kurt 214 248 539 0.15 0.18 0.22 0.17 0.07 0.03 0.01 0.00 -0.01 -2.01 249 267 1010 0.04 0.09 0.13 0.16 0.14 0.06 0.05 -0.08 -1.89 182 196 832 0.20 0.19 -0.12 -1.84 250 4 0.25 0.48 -2.28 1 24 269 0.10 0.12 -0.33 -0.88 65 51 0.08 0.02 0.47 -1.62 66 101 334 0.11 -1.21 273 60 0.45 -1.36 242 10 0.30 0.54 -1.39 239 32 -0.99 111 120 117 0.26 0.29 0.62 251 16 0.24 -0.60 241 14 0.21 0.87 -0.97 159 174 100 0.31 0.99 -1.07 68 0.86 -0.91 235 30 0.23 0.81 -0.82

Statistical model for observations of DNAH2

Statistical model for observations of GRAP2

DNA Repair

DNA Repair

TP53BP1:p, tumor protein p53 binding protein 1

TP53BP1:p, tumor protein p53 binding protein 1

Sequence Annotations

TP53BP1:p, tumor protein p53 binding protein 1

TP53BP1:p, tumor protein p53 binding protein 1

Peptide observations, catalase Peptide Sequence Observations FSTVAGESGSADTVR 2633 FNTANDDNVTQVR 2432 AFYVNVLNEEQR 1722 LVNANGEAVYCK 1701 GPLLVQDVVFTDEMAHFDR 1637 LSQEDPDYGIR 1560 LFAYPDTHR 1499 NLSVEDAAR 1400 FYTEDGNWDLVGNNTPIFFIR 1386 ADVLTTGAGNPVGDK 1338

Peptide frequency (ω), catalase Peptide Sequence ω FSTVAGESGSADTVR 0.08 FNTANDDNVTQVR 0.07 AFYVNVLNEEQR 0.05 LVNANGEAVYCK GPLLVQDVVFTDEMAHFDR LSQEDPDYGIR 0.04 LFAYPDTHR NLSVEDAAR FYTEDGNWDLVGNNTPIFFIR ADVLTTGAGNPVGDK

Global frequency of observation (ω), catalase Peptide sequences

Omega (Ω) value for a protein identification For any set peptides observed in an experiment assigned to a particular protein (1 to j ):

Protein Ω’s for a set of identifications Protein ID Ω (z=2) Ω (z=3) SERPINB1 0.88 0.82 SNRPD1 0.59 CFL1 0.81 0.87 SNRPE 0.8 PPIA 0.79 0.64 CSTA 0.36 PFN1 0.76 0.61 CAT 0.71 0.78 GLRX 0.66 CALM1 0.62 FABP5 0.57 0.17

Retention Time Distribution

Mass Accuracy

GO Cellular Processes

KEGG Pathways

Open-Source Resources

ProteoWizard http://proteowizard.sourceforge.net

Protein Prospector http://prospector.ucsf.edu/

PROWL http://prowl.rockefeller.edu/

Proteogenomics - PGx http://pgx.fenyolab.org/

UCSC Genome Browser http://genome.ucsc.edu/

Slice - Scalable Data Sharing for Remote Mass Informatics Developed by Manor Askenazi openslice.fenyolab.org Most mass spectrometry data is acquired in discovery mode, meaning that the data is amenable to open-ended analysis as our understanding of the target biochemistry increases. In this sense, mass spectrometry based discovery work is more akin to an astronomical survey, where the full list of object-types being imaged has not yet been fully elucidated, as opposed to e.g. micro-array work, where the list of probes spotted onto the slide is finite and well understood.

Standardization

Standardization - MIAPE

Standardization – MIAPE-MSI

Standardization – XML Formats mzML - experimental results obtained by mass spectrometric analysis of biomolecular compounds mzIdentML - describe the outputs of proteomics search engines TraML - exchange and transmission of transition lists for selected reaction monitoring (SRM) experiments mzQuantML - describe the outputs of quantitation software for proteomics mzTab - defines a tab delimited text file format to report proteomics and metabolomics results. MIF - decribes the molecular interaction data exchange format. GelML - describes the processing and separations of proteins in samples using gel electrophoresis, within a proteomics experiment.

Standardization - mzML

Standardization - mzIdentML

Proteomics Informatics – Databases, data repositories and standardization (Week 8)