Data standards from the Proteomics Standards Initiative Andy Jones University of Liverpool.


1 Data standards from the Proteomics Standards Initiative Andy Jones andrew.jones@liv.ac.uk University of Liverpool

2 Overview
HUPO-PSI background
Data formats
– Protein and peptide separations: GelML, spML
– Mass spectrometry and proteomics informatics: mzML, mzIdentML, mzQuantML

3 HUPO-PSI background
HUPO was founded in 2001 with several objectives:
– Consolidate worldwide proteome organisations
– Assist in the coordination of public proteome initiatives
– Engage in scientific and educational activities
Tissue proteome projects and other initiatives:
– Plasma, Liver, Brain, Glyco and Antibody initiative
– Proteomics Standards Initiative (PSI)
HUPO-PSI: “The HUPO Proteomics Standards Initiative (PSI) defines community standards for data representation in proteomics to facilitate data comparison, exchange and verification.”
Main outputs are:
– Minimum reporting guidelines (MIAPE modules)
– Data exchange formats (usually in XML)
– Ontologies or controlled vocabularies

4 PSI main outputs
MIAPE: minimum information about a proteomics experiment
– Information that should be recorded about a proteomics experiment (Taylor et al. Nature Biotechnology 25, 887-893; 2007)
– Modules: gel electrophoresis, gel image informatics, capillary electrophoresis, column chromatography, mass spectrometry, mass spectrometry informatics and molecular interactions
Data formats for:
– molecular interactions
– mass spectrometry
– protein identifications
– gel electrophoresis and other separation methods
Plus supporting controlled vocabularies for each format
All outputs must pass a stringent standardisation process:
– Specifications are reviewed by public comment and anonymous review
– The PSI editor will not sign off a specification until the reviewers’ comments have been satisfied

5 PSI data formats
Protein separation
– GelML: 2007-01-18 GelML 1.0; current: GelML 1.1 (no formal release yet)
– spML: 2007 milestone 2; no active development
Mass spectrometry
– mzML (mass spec): 2008-06-01 mzML 1.0.0 released; 2009-06-01 mzML 1.1.0 released
– Previous / related standards: mzData v1.0.5 (PSI), mzXML (from ISB)
Proteomics informatics
– mzIdentML (protein identifications): 2009-08-20 mzIdentML 1.0.0
– mzQuantML (protein quantifications): early drafting only
MI (molecular interactions)
– Version 2.5

6 GelML
Data format for exchanging protocols and image data resulting from gel electrophoresis; an extension of FuGE
Contents:
– Models of 1D and 2D separation, electrophoresis protocol and detection, including DIGE
Status:
– v1.0 was built by extending the complete FuGE model; v1.1 extends from “FuGElight”
– v1.1 simplifies protocols, e.g. for electrophoresis (free text, not parameterized)
– v1.1 shares the same CV structure as mzML and mzIdentML
– v1.1 is implemented in the ProteoRed MIAPE database, with a beta implementation in MIAPEGelDB (SIB)

7 spML
Data exchange format for non-gel-based separations; an extension of FuGE
Contents:
– Multi-dimensional chromatography, plus a generic model for other types of separation (capillary electrophoresis, rotofors, centrifugation etc.)
Status:
– Milestone 2 extended from FuGE
– Some work has been done to convert this to the same structure as GelML v1.1
– No active development for some time; a decision on the community requirement for the format is to be taken at the next PSI meeting

8 mzML History
[Timeline figure; labels in the order shown]
Predecessors: mzData 1.05, mzXML 3.0
Early development: mzML 0.90; SFO 2006-05; dataXML 0.6; DC 2006-09; ISB 2006-11; Lyon 2007-04; EBI 2007-06; mzML 0.91
Final development: PSI Doc Proc 2007-11; mzML 0.99 RC; Toledo 2008-04; mzML 1.0.0 release 2008-06; mzML 1.1.0RC5; Turku 2009-04; mzML 1.1.0 release 2009-06

9 mzML
[Schema diagram; element names shown]
mzML
– cvList
– referenceableParamGroupList
– sampleList
– instrumentConfigurationList
– softwareList
– dataProcessingList
– acquisitionSettingsList
– run
  – spectrumList
    – spectrum: spectrumDescription, precursorList, scan, binaryDataArray
  – chromatogramList
    – chromatogram: binaryDataArray
Each spectrum contains a header with scan information and optionally precursor information, followed by two or more base64-encoded binary data arrays.
Chromatograms may be encoded in mzML in a special element that contains cvParams to describe the type of chromatogram, followed by two base64-encoded binary data arrays.
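As an illustration of the base64-encoded binary data arrays mentioned on this slide, the following Python sketch round-trips an m/z array and an intensity array through base64. It is a minimal sketch, not a parser for real mzML files: the function names encode_array and decode_array are hypothetical, the example values are invented, and the precision and compression settings are passed in as plain arguments, whereas a real file declares them through cvParams on each binaryDataArray. Little-endian byte order is assumed.

```python
import base64
import struct
import zlib


def encode_array(values, precision=64, compress=False):
    """Pack floats as little-endian binary, optionally zlib-compress, then base64-encode."""
    code = "d" if precision == 64 else "f"
    raw = struct.pack("<%d%s" % (len(values), code), *values)
    if compress:
        raw = zlib.compress(raw)
    return base64.b64encode(raw).decode("ascii")


def decode_array(text, precision=64, compressed=False):
    """Reverse encode_array: base64-decode, optionally decompress, unpack floats."""
    raw = base64.b64decode(text)
    if compressed:
        raw = zlib.decompress(raw)
    width = 8 if precision == 64 else 4
    count = len(raw) // width
    code = "d" if precision == 64 else "f"
    return list(struct.unpack("<%d%s" % (count, code), raw[:count * width]))


# Invented example values: one m/z array and one intensity array,
# mirroring the "two or more arrays per spectrum" layout.
mz = [445.12, 446.12, 447.13]
intensity = [1200.0, 310.5, 45.25]

mz_text = encode_array(mz, precision=64)
intensity_text = encode_array(intensity, precision=32, compress=True)

mz_back = decode_array(mz_text, precision=64)
intensity_back = decode_array(intensity_text, precision=32, compressed=True)
```

The two arrays are decoded independently and paired by position, so mz_back[i] and intensity_back[i] describe the same peak.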

10 mzML implementations

11 mzIdentML overview
Various software packages for searching:
– MASCOT, SEQUEST, X!Tandem, Omssa, Inspect...
– Each piece of software has its own output format
– User interacts with results formatted as web pages
– Not easy to submit to databases or re-analyse results
mzIdentML:
– Standard format for results of searches with mass spec data
– Can capture results from PMF and tandem MS
– Flexible model of peptide and protein identifications
– Captures search engine parameters, scores and modifications using controlled vocabulary terms

12 mzIdentML Schema overview
mzIdentML
– cvList
– AnalysisSoftwareList: software packages
– AnalysisSampleCollection: biological samples
– SequenceCollection: DB entries of protein / peptide sequences
– AnalysisCollection
  – SpectrumIdentification: inputs = external spectra (1..n), output = SpectrumIdentificationList (1)
  – ProteinDetection: inputs = SpectrumIdentificationLists, output = ProteinDetectionList
– AnalysisProtocolCollection
  – SpectrumIdentificationProtocol: AdditionalSearchParams, ModificationParams, Enzymes, DatabaseFilters
  – ProteinDetectionProtocol
– DataCollection
  – Inputs: the database searched and the input file converted to mzIdentML
  – AnalysisData
    – SpectrumIdentificationList
      – SpectrumIdentificationResult: all identifications made from searching one spectrum
      – SpectrumIdentificationItem: one (poly)peptide-spectrum match
    – ProteinDetectionList
      – ProteinAmbiguityGroup: a set of related protein identifications, e.g. conflicting peptide-protein assignments
      – ProteinDetectionHypothesis: a single protein identification

13 mzIdentML Peptide identifications
SequenceCollection
– DBSequence: Accession = “HSP7D_MANSE”, Seq = “MAKAPAVGIDLGTTYSCVGVF...”
– DBSequence: Accession = “HSP70_ECHGR”, Seq = “MMSKGPAVGIDLGTTFSCVGV...”
– Peptide: Seq = “DAGMISGLNVLR”, Mod = Methionine oxidation (pos 4)
SpectrumIdentificationList 1
– SpectrumIdentificationResult 1 (external data spectrum)
  – SpectrumIdentificationItem 1_1: Score = 67.2, E-value = 0.000867, Rank = 1
    – PeptideEvidence 1_1_A: start=161 end=172 pre=K post=I
    – PeptideEvidence 1_1_B: start=160 end=171 pre=K post=L
  – SpectrumIdentificationItem 1_2: Score = 54.4, E-value = 0.026, Rank = 2
    – PeptideEvidence 1_2_A: start=54 end=65 pre=K post=T
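The nesting on this slide (one result per spectrum, several ranked items per result, one or more peptide evidence entries per item) can be sketched as a small in-memory model. This is an illustrative Python sketch only: the class names echo the mzIdentML element names, but the fields are a simplified assumption and do not reproduce the schema's attributes. The values are the ones shown on the slide.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PeptideEvidence:
    id: str
    start: int   # first residue of the peptide within the protein sequence
    end: int     # last residue of the peptide within the protein sequence
    pre: str     # residue before the peptide
    post: str    # residue after the peptide


@dataclass
class SpectrumIdentificationItem:
    id: str
    rank: int
    score: float
    e_value: float
    evidence: List[PeptideEvidence] = field(default_factory=list)


@dataclass
class SpectrumIdentificationResult:
    id: str
    items: List[SpectrumIdentificationItem] = field(default_factory=list)

    def top_item(self):
        """Return the best-ranked peptide-spectrum match (lowest rank number)."""
        return min(self.items, key=lambda item: item.rank)


result = SpectrumIdentificationResult(
    id="1",
    items=[
        SpectrumIdentificationItem(
            id="1_2", rank=2, score=54.4, e_value=0.026,
            evidence=[PeptideEvidence("1_2_A", 54, 65, "K", "T")],
        ),
        SpectrumIdentificationItem(
            id="1_1", rank=1, score=67.2, e_value=0.000867,
            evidence=[
                PeptideEvidence("1_1_A", 161, 172, "K", "I"),
                PeptideEvidence("1_1_B", 160, 171, "K", "L"),
            ],
        ),
    ],
)

top = result.top_item()
evidence_ids = [ev.id for ev in top.evidence]
# Each evidence span covers 12 residues, the length of "DAGMISGLNVLR".
span_lengths = [ev.end - ev.start + 1 for ev in top.evidence]
```

Item 1_1 carries two PeptideEvidence entries because the same peptide-spectrum match maps onto two database sequences, which is what later feeds the protein ambiguity groups.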

14 mzIdentML Protein identifications
SequenceCollection
– DBSequence: Accession = “HSP7D_MANSE”, Seq = “MAKAPAVGIDLGTTYSCVGVF...”
– DBSequence: Accession = “HSP70_ECHGR”, Seq = “MMSKGPAVGIDLGTTFSCVGV...”
SpectrumIdentificationList
– SpectrumIdentificationResult 1
  – SpectrumIdentificationItem 1_1: PeptideEvidence 1_1_A, PeptideEvidence 1_1_B
– SpectrumIdentificationResult 2
  – SpectrumIdentificationItem 2_1: PeptideEvidence 2_1_A
– SpectrumIdentificationResult 3
  – SpectrumIdentificationItem 3_1: PeptideEvidence 3_1_A, PeptideEvidence 3_1_B
ProteinDetectionList
– ProteinAmbiguityGroup 1
  – ProteinDetectionHypothesis 1_1: Score = 141, Peptide coverage = 17%, E-value = 0.0034
    – PeptideHypothesis (1_1_A), PeptideHypothesis (2_1_A), PeptideHypothesis (3_1_A)
  – ProteinDetectionHypothesis 1_2: Score = 85, Peptide coverage = 12%, E-value = 0.055
    – PeptideHypothesis (1_1_B), PeptideHypothesis (3_1_B)
– ProteinAmbiguityGroup 2
  – ProteinDetectionHypothesis 2_1
Protein ambiguity group: groups proteins that share the same set of peptides (protein inference problem)
Protein detection hypothesis: one potential protein hit supported by peptide evidence

15 mzIdentML Protein identifications
[Same example diagram as slide 14]
ProteinDetectionHypothesis 1_1 has 3 peptides:
– ESTLHLVLR
– TLSDYNIQK
– TITLEVEPSDTIENVK
ProteinDetectionHypothesis 1_2 has 2 peptides:
– ESTLHLVLR
– TLSDYNIQK
There is stronger evidence supporting hypothesis 1_1, but both hypotheses are placed within the same ambiguity group.
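The grouping described on slides 14 and 15 can be illustrated with a short Python sketch that places protein hypotheses sharing at least one peptide into the same ambiguity group. This is only one simple grouping rule, chosen here for illustration; the slides do not say how a search engine should build its groups. The function name ambiguity_groups is hypothetical, and the peptide assigned to hypothesis_2_1 is invented because the slides do not list it.

```python
def ambiguity_groups(protein_peptides):
    """Group proteins that are linked, directly or indirectly, by shared peptides."""
    groups = []  # list of (set of proteins, set of peptides) pairs
    for protein, peptides in protein_peptides.items():
        peptides = set(peptides)
        merged_proteins = {protein}
        merged_peptides = set(peptides)
        untouched = []
        for group_proteins, group_peptides in groups:
            if group_peptides & peptides:
                # The new protein shares a peptide with this group: merge them.
                merged_proteins |= group_proteins
                merged_peptides |= group_peptides
            else:
                untouched.append((group_proteins, group_peptides))
        untouched.append((merged_proteins, merged_peptides))
        groups = untouched
    return sorted(sorted(proteins) for proteins, _ in groups)


hypotheses = {
    # Peptides as listed on slide 15.
    "hypothesis_1_1": ["ESTLHLVLR", "TLSDYNIQK", "TITLEVEPSDTIENVK"],
    "hypothesis_1_2": ["ESTLHLVLR", "TLSDYNIQK"],
    # Invented peptide, for illustration only.
    "hypothesis_2_1": ["LVNELTEFAK"],
}

groups = ambiguity_groups(hypotheses)

# Within a group, the hypothesis with the most peptides has the strongest support.
best_in_first_group = max(groups[0], key=lambda name: len(hypotheses[name]))
```

Here hypothesis_1_1 and hypothesis_1_2 end up in one group because the peptides of 1_2 are a subset of those of 1_1, while hypothesis_2_1 forms a group of its own.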

16 mzIdentML export will be available from Mascot in the next release

17 Sequest converter produced by MPC (Germany) as part of the ProDac consortium: http://www.medizinisches-proteom-center.de
– Thermo is also working on an “official” exporter
– Basic scripts are available for converting other search engine formats (X!Tandem, Omssa, pepXML)
– Export in the next version of Scaffold
– Database implementation in PRIDE is coming...

18 mzQuantML
Format to capture proteins quantified from MS data
– Very early drafting
Many methods of quantification:
– Label/tag based: stable isotopes (SILAC); tags (ICAT / iTRAQ)
– Label-free: extracted ion chromatogram (align parallel runs); spectral counting
Methods still in flux
– New methods are reported frequently in the literature
Will need to reference back to spectra (+ chromatograms) and identifications
– Needs more community input. Please offer to help!
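Of the label-free methods listed on this slide, spectral counting is the simplest to sketch: the number of identified spectra assigned to each protein is used as a rough measure of abundance and compared between runs. The Python example below is a toy illustration under stated assumptions. The function name spectral_counts, the scan labels and the accessions P1 and P2 are all invented, and real workflows normally normalise the counts (for example by protein length), which is omitted here.

```python
from collections import Counter


def spectral_counts(psms):
    """Count identified spectra per protein from (spectrum, protein) pairs."""
    return Counter(protein for _spectrum, protein in psms)


# Invented peptide-spectrum matches for two parallel runs.
run_a = [("scan1", "P1"), ("scan2", "P1"), ("scan3", "P2"), ("scan4", "P1")]
run_b = [("scan1", "P1"), ("scan2", "P2"), ("scan3", "P2")]

counts_a = spectral_counts(run_a)
counts_b = spectral_counts(run_b)

# One row per protein: (count in run A, count in run B).
proteins = sorted(set(counts_a) | set(counts_b))
table = {protein: (counts_a[protein], counts_b[protein]) for protein in proteins}
```

Each count is derived from identified spectra, which is why a quantification format needs to reference back to both the spectra and the identifications.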

19 Acknowledgements
PSI workgroups:
– Protein separation. Chair: Juan-Pablo Albar (ProteoRed)
– Mass spectrometry. Chair: Eric Deutsch (ISB)
– Proteomics Informatics. Chair: Andy Jones (Liverpool); Co-Chair: David Creasy (Matrix Science)
– Molecular interactions. Chair: Henning Hermjakob (also chair of PSI)
and many developers worldwide...
See: http://www.psidev.info/

