Open source tools for data analysis


Achim Treumann

Parameter / Category / Information / Limitation

Category: RawData

- Spectral count MS1
  Information: Should be within a known range for an external standard. Sudden increases could indicate that an insufficient number of MS2 spectra is acquired.
- Spectral count MS2
  Information: Should be within a known range for an external standard. Sudden decreases could indicate sensitivity problems (clean MS, check sample).
- FileSize
- Median injection time
  Information: Should be within a known range for an external standard. Sudden increases indicate sensitivity problems (clean MS, check sample).
  Limitation: Only relevant for ion trap mass spectrometers (Paul trap or Orbitrap).
- # features in LCMS experiment
  Information: Indication of the complexity of the sample. Should be within a known range for an external standard.
  Limitation: External standards contain only information about the machine performance, not about the samples to be analysed.
- Intensity m/z 421
  Information: Indicator of the ratio of sample/trypsin.
  Limitation: Dependent on the batch of trypsin and on the complexity of the sample.
- Unexpected peaks (peptides or contaminants) present in external standard
  Information: Indicator of carryover issues. Clean or possibly change trap and/or analytical column.
  Limitation: Tricky to check automatically, but often obvious on manual checks by eye.

Category: ExtStd

- RT specific peptide (external) Std
  Information: Should be within a known range. Deviations indicate chromatography problems that could be due to the HPLC setup (column, trap, dead volumes).
- FWHM specific peptide (external) Std
- Peak intensity specific peptide (external) Std
  Information: Should be within a known range. Deviations indicate sensitivity problems (clean MS, check sample, check MS method).
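Most of the parameters above come down to "should be within a known range for an external standard". A minimal sketch of such a check is shown here, assuming the known range is derived from earlier external-standard runs as mean plus or minus three standard deviations. That rule, the function names and the example counts are illustrative assumptions, not something stated on the slides.

```python
from statistics import mean, stdev


def control_limits(history, n_sd=3.0):
    """Return (low, high) limits derived from earlier external-standard runs.

    Assumption: the 'known range' is mean +/- n_sd standard deviations.
    """
    centre = mean(history)
    spread = stdev(history)
    return centre - n_sd * spread, centre + n_sd * spread


def flag_run(value, history, n_sd=3.0):
    """Classify a new QC value as 'low', 'high' or 'ok' against the limits."""
    low, high = control_limits(history, n_sd)
    if value < low:
        return "low"
    if value > high:
        return "high"
    return "ok"


# Hypothetical MS2 spectral counts from five earlier external-standard runs.
ms2_history = [41000, 42500, 40800, 41900, 42100]

for new_count in (41500, 30500, 45000):
    print(new_count, flag_run(new_count, ms2_history))
```

For spectral count MS2 a "low" flag would point towards the sensitivity problems named above, whereas for median injection time it is the "high" flag that matters.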

Parameter / Category / Information / Limitation

Category: IntStd

- RT specific peptide (internal) Std
  Information: Should be within a known range. Deviations indicate chromatography problems that could either be due to the HPLC setup (column, trap, dead volumes) or to the presence of contaminants in the sample that mess up chromatography.
  Limitation: Dependent on internal standard (likely to involve costs). Could cause problems with the acquisition of the 'real' sample, if not carefully chosen.
- FWHM specific peptide (internal) Std
  Limitation: Dependent on internal standard (likely to involve costs). Potentially difficult to extract from the data.
- Peak intensity specific peptide (internal) Std
  Information: Should be within a known range but could also be related to the complexity of the sample. Deviations might indicate sensitivity problems (clean MS, check sample, check MS method).
  Limitation: Dependent on internal standard (likely to involve costs).

Category: SearchRes

- # ID'ed peptides
  Information: Should be within a known range for an external standard. Sudden decreases could indicate sensitivity problems (clean MS, check sample).
  Limitation: External standards contain only information about the machine performance, not about the samples to be analysed.
- # ID'ed proteins
- Search engine score of highest confidence protein
- # missed cleavages
  Information: Indication of possible digestion problems.
  Limitation: Possibly difficult to extract.
- Ratio of # 2+ / 3+ peptides
  Information: Indication of the ionisation conditions in the source. DMSO in HPLC solvents increases the number of +2 ions.
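Two of the search-result parameters above can be computed directly from a list of identified peptides. The sketch below assumes a simple trypsin rule (cleavage after K or R, except before P) and that each identification carries a peptide sequence and a precursor charge. The function names and example values are hypothetical.

```python
def missed_cleavages(peptide):
    """Count internal K/R residues that are not followed by P.

    The C-terminal residue is skipped because that is where the
    peptide was actually cleaved.
    """
    count = 0
    for i, residue in enumerate(peptide[:-1]):
        if residue in "KR" and peptide[i + 1] != "P":
            count += 1
    return count


def charge_ratio(charges):
    """Return the ratio of 2+ to 3+ identifications, or None if there is no 3+."""
    doubly = sum(1 for z in charges if z == 2)
    triply = sum(1 for z in charges if z == 3)
    if triply == 0:
        return None
    return doubly / triply


# Hypothetical identifications: (sequence, precursor charge).
psms = [("SAMPLEK", 2), ("AKDERLK", 3), ("SAMKPLER", 2), ("PEPTIDER", 2), ("LKR", 3)]

print([missed_cleavages(seq) for seq, _ in psms])
print(charge_ratio([z for _, z in psms]))
```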

General Workflow

1. Download all data from MassIVE.
2. Convert to mzML using msconvert.
3. Perform MSGF+ searches using OpenMS (KNIME version). The data are now in mzIdentML and mzTab format.
4. Using R, convert the mzTab results into a more accessible peptide.tsv (thanks to Julianus Pfeuffer).
5. Using R, perform a full join of the q-values for all files into one large table.
6. Using Perseus, generate a heatmap.

msconvert

- Part of the ProteoWizard suite: http://proteowizard.sourceforge.net/tools.shtml
- Default parameters were used to convert all files into mzML.
- This retains most (all?) of the information in the files, including metadata about data acquisition.
- ProteoWizard uses libraries that have been supplied by the MS manufacturers.
- Files increase in size between 1.5- and 6-fold (file sizes for HeLa digests were between 0.5 GByte and 8 GByte). This can be avoided by specifying the number of peaks per MS/MS spectrum (600 is sensible) in the conversion process.
- 42 files were converted.
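The peak limit mentioned above could be passed to msconvert roughly as follows. This is a sketch, not the command used for the 42 files. It assumes msconvert's `threshold count <n> most-intense` peak filter, and the file and directory names are hypothetical. As written, the filter applies to every spectrum rather than only to MS/MS spectra, so check `msconvert --help` for the optional MS-level argument before relying on it.

```python
def msconvert_command(raw_file, out_dir, max_peaks=600):
    """Build an msconvert call that writes mzML and keeps only the
    max_peaks most intense peaks of each spectrum (assumed filter syntax)."""
    return [
        "msconvert", str(raw_file),
        "--mzML",
        "-o", str(out_dir),
        "--filter", f"threshold count {max_peaks} most-intense",
    ]


# Hypothetical input file and output directory.
cmd = msconvert_command("hela_run01.raw", "mzml_out")
print(" ".join(cmd))

# To actually run the conversion (requires ProteoWizard on the PATH):
# import subprocess
# subprocess.run(cmd, check=True)
```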

OpenMS

- A platform that allows you to do almost everything with your MS data (particularly within proteomics).
- Works with data from all manufacturers.
- Tutorial: http://open-ms.sourceforge.net/wp-content/uploads/2016/01/handout1.pdf
- Used to search all datasets and calculate the FDR.

OpenMS

- Workflows are constructed within KNIME (v 3.3.1).
- Each workflow node can have many parameters that can be set (e.g. for a search).
- Default parameters do not always work and need to be tuned, but it is possible to generate a workflow that produces results for all datasets.

mzTab to tsv conversion

- The default output for search results is mzIdentML, a format that is great for computers and contains all metadata, but is not very human-readable or usable.
- A more usable output standard is mzTab.
- mzTab contains protein lists and peptide lists in one mixed table, which is not good for further processing.
- Julianus Pfeuffer and Lars Nilse (OpenMS team) have written an R script that I have modified to generate a table of only peptide results, discarding q-values > 0.01.
- This script is called make_tsv.R and it generates files called psm.tsv (one file for each dataset).
- This step is no longer necessary: the OpenMS team has developed an improved mzTab exporter.
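The filtering step performed by make_tsv.R can be illustrated in a few lines. The original script is in R and is not reproduced here. This is a Python sketch of the same idea, assuming a tab-separated table with one row per identification. The column names `sequence` and `q_value` are hypothetical.

```python
import csv
import io


def filter_psms(rows, q_column="q_value", cutoff=0.01):
    """Keep rows whose q-value is at most the cutoff.

    Rows with a missing or non-numeric q-value are dropped.
    """
    kept = []
    for row in rows:
        try:
            q = float(row[q_column])
        except (KeyError, TypeError, ValueError):
            continue
        if q <= cutoff:
            kept.append(row)
    return kept


# Hypothetical psm.tsv content; a real file would be opened from disk.
example = io.StringIO(
    "sequence\tq_value\n"
    "PEPTIDEK\t0.001\n"
    "SAMPLER\t0.05\n"
    "ANOTHERK\t0.01\n"
    "BROKENK\tnull\n"
)
rows = list(csv.DictReader(example, delimiter="\t"))
kept = filter_psms(rows)
print([row["sequence"] for row in kept])
```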

Summarise all data

- All psm.tsv files were pulled together into one large file that contains all identified peptides (q < 0.01).
- Using the dplyr library in R, we performed a full join of all individual tables and extracted only the q-value for each dataset (as a measure of identification confidence).
- We then used Perseus to visualise the data in a heatmap.
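The full join was done with dplyr in R. The logic is sketched below in plain Python rather than R, assuming each dataset has already been reduced to one q-value per peptide (for example the lowest). The dataset names and values are hypothetical.

```python
def full_join_q_values(tables):
    """Full outer join of per-dataset q-values on the peptide sequence.

    tables maps dataset name -> {peptide: q-value}.
    Returns (dataset_names, {peptide: [q-value or None per dataset]}).
    """
    datasets = sorted(tables)
    peptides = sorted(set().union(*(t.keys() for t in tables.values())))
    joined = {p: [tables[d].get(p) for d in datasets] for p in peptides}
    return datasets, joined


# Hypothetical filtered results from two datasets.
tables = {
    "lab_A": {"PEPTIDEK": 0.001, "SAMPLER": 0.004},
    "lab_B": {"PEPTIDEK": 0.002, "ANOTHERK": 0.009},
}
datasets, joined = full_join_q_values(tables)
print(datasets)
for peptide, q_values in joined.items():
    print(peptide, q_values)
```

Peptides that were not identified in a dataset end up as None, which corresponds to the missing values in the heatmap.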

Visualise results

- Results could be visualised using an R script, but I did not have time, so we used Perseus (not open source, but free for academics, and several papers have been published).
- Perseus tutorials and lectures on YouTube: http://www.coxdocs.org/doku.php?id=perseus:user:tutorials

Heatmap of peptide identifications

- Red codes for high-confidence identifications, blue for lower confidence.
- Grey cells are missing values.
- Clustering was performed with a Euclidean distance function.
- I think that this heatmap shows reasonable reproducibility.
- I do not yet know for sure how to get the best interpretation.
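The distance underlying the clustering can be written out explicitly. The slides do not say how Perseus treats the grey (missing) cells when computing distances, so the sketch below makes its own assumption: positions where either dataset has no value are skipped. The example vectors are hypothetical.

```python
from math import sqrt


def euclidean(a, b):
    """Euclidean distance over positions where both vectors have a value.

    Assumption: positions with a missing value (None) in either vector
    are ignored. Returns None if the vectors share no observed positions.
    """
    pairs = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not pairs:
        return None
    return sqrt(sum((x - y) ** 2 for x, y in pairs))


# Hypothetical per-peptide values for two datasets; None marks a missing value.
dataset_a = [0.0, 3.0, None]
dataset_b = [4.0, 0.0, 1.0]
print(euclidean(dataset_a, dataset_b))
```

Skipping missing positions makes distances between sparsely overlapping datasets look smaller than they are, which is one reason the interpretation of such a heatmap needs care.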

Conclusions

- We have learnt a great deal about improving our QC experiments and procedures.
- Cross-platform data analysis for QC is difficult, but can be implemented.
- Commercial standards (external or internal) cost money, but are important (cross-laboratory reproducibility).
- ID-based and non-ID-based QC parameters are very complementary.
- For phase II we want to produce a generally applicable data analysis workflow that can be distributed to all participants (providing qcML output).

Thank you