Open source tools for data analysis

Achim Treumann

Parameter | Category | Information | Limitation
Spectral count MS1 | RawData | Should be within a known range for an external standard. Sudden increases could indicate that an insufficient number of MS2 spectra is acquired. |
Spectral count MS2 | RawData | Should be within a known range for an external standard. Sudden decreases could indicate sensitivity problems (clean MS, check sample). |
FileSize | RawData | |
Median injection time | RawData | Should be within a known range for an external standard. Sudden increases indicate sensitivity problems (clean MS, check sample). | Only relevant for ion trap mass spectrometers (Paul trap or Orbitrap).
# features in LCMS experiment | RawData | Indication of the complexity of the sample. Should be within a known range for an external standard. | External standards contain only information about the machine performance, not about the samples to be analysed.
Intensity m/z 421 | RawData | Indicator of the ratio of sample/trypsin. | Dependent on the batch of trypsin and on the complexity of the sample.
Unexpected peaks (peptides or contaminants) present in external standard | RawData | Indicator of carryover issues. Clean or possibly change trap and/or analytical column. | Tricky to check automatically, but often obvious on manual checks by eye.
RT specific peptide (external) Std | ExtStd | Should be within a known range. Deviations indicate chromatography problems that could be due to the HPLC setup (column, trap, dead volumes). |
FWHM specific peptide (external) Std | ExtStd | |
Peak intensity specific peptide (external) Std | ExtStd | Should be within a known range. Deviations indicate sensitivity problems (clean MS, check sample, check MS method). |
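
Most rows above come down to the same test: is the value for the external standard inside its known range? A minimal Python sketch of such a check is given here. The parameter names and range limits are hypothetical placeholders (each laboratory would derive its own from historical runs of its standard), and this is not the workflow used in the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QcRange:
    """Acceptable range for one QC parameter (limits are lab-specific)."""
    low: float
    high: float


def check_qc(values, ranges):
    """Return a status per parameter: 'ok', 'low', 'high' or 'missing'."""
    status = {}
    for name, rng in ranges.items():
        value = values.get(name)
        if value is None:
            status[name] = "missing"   # parameter was not extracted for this run
        elif value < rng.low:
            status[name] = "low"
        elif value > rng.high:
            status[name] = "high"
        else:
            status[name] = "ok"
    return status


# Hypothetical ranges for an external standard, for illustration only.
example_ranges = {
    "ms2_spectral_count": QcRange(30000, 45000),
    "median_injection_time_ms": QcRange(5, 40),
}
example_run = {"ms2_spectral_count": 21000, "median_injection_time_ms": 12}
example_status = check_qc(example_run, example_ranges)
```

In this example the MS2 spectral count would be flagged as "low", which the table associates with possible sensitivity problems.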

Parameter | Category | Information | Limitation
RT specific peptide (internal) Std | IntStd | Should be within a known range. Deviations indicate chromatography problems that could either be due to the HPLC setup (column, trap, dead volumes) or to the presence of contaminants in the sample that mess up chromatography. | Dependent on internal standard (likely to involve costs). Could cause problems with the acquisition of the 'real' sample, if not carefully chosen.
FWHM specific peptide (internal) Std | IntStd | | Dependent on internal standard (likely to involve costs). Potentially difficult to extract from the data.
Peak intensity specific peptide (internal) Std | IntStd | Should be within a known range but could also be related to the complexity of the sample. Deviations might indicate sensitivity problems (clean MS, check sample, check MS method). | Dependent on internal standard (likely to involve costs).
# ID'ed peptides | SearchRes | Should be within a known range for an external standard. Sudden decreases could indicate sensitivity problems (clean MS, check sample). | External standards contain only information about the machine performance, not about the samples to be analysed.
# ID'ed proteins | SearchRes | |
Search engine score of highest confidence protein | SearchRes | |
# missed cleavages | SearchRes | Indication of possible digestion problems. | Possibly difficult to extract.
Ratio of # 2+ / 3+ peptides | SearchRes | Indication of the ionisation conditions in the source. DMSO in HPLC solvents increases the number of 2+ ions. |
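
Two of the search-result parameters above can be computed directly from a list of identified peptides. The Python sketch below assumes the simplest tryptic rule (cleavage after K or R unless the next residue is P) and assumes that the peptide sequence and precursor charge are available for each identification. The function names are illustrative and do not come from the original workflow.

```python
def count_missed_cleavages(peptide):
    """Count internal K/R residues not followed by P (simple trypsin rule)."""
    missed = 0
    # The C-terminal residue is the expected cleavage site, so it is skipped.
    for i, residue in enumerate(peptide[:-1]):
        if residue in "KR" and peptide[i + 1] != "P":
            missed += 1
    return missed


def charge_ratio_2_to_3(charges):
    """Ratio of 2+ to 3+ identifications; None if there are no 3+ peptides."""
    n2 = sum(1 for z in charges if z == 2)
    n3 = sum(1 for z in charges if z == 3)
    return n2 / n3 if n3 else None
```

For instance, `count_missed_cleavages("AAKGGR")` counts one missed cleavage (the internal K), while `count_missed_cleavages("AAKPGGR")` counts none because the K is followed by P.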

General Workflow
Download all data from MassIVE
Using msconvert, convert to mzML
Using OpenMS (KNIME version), perform MS-GF+ searches
Data will now be mzIdentML and mzTab
Using R, convert mzTab results into a more accessible peptide.tsv (thanks to Julianus Pfeuffer)
Using R, perform a full join of the q-values for all files into one large table
Using Perseus, generate a heatmap

msconvert
Part of the ProteoWizard suite
Used default parameters for conversion of all files into mzML
This retains most (all?) of the information in the files, including metadata about data acquisition
ProteoWizard uses libraries that have been supplied by the MS manufacturers
Files increase in size between 1.5- and 6-fold (file sizes for HeLa digests between 0.5 GByte and 8 GByte)
This can be avoided by specifying the number of peaks per MSMS spectrum (600 is sensible) in the conversion process
42 files converted
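
The peak limit mentioned above could be applied with msconvert's threshold filter. The Python sketch below only builds the command line and does not run it. The file and directory names are placeholders, and the exact filter string (in particular the trailing `2-` that restricts it to MS2 and higher) is an assumption that should be checked against `msconvert --help` for the installed ProteoWizard version. The study itself used default parameters.

```python
def msconvert_command(raw_file, out_dir, max_msms_peaks=600):
    """Build an msconvert argument list for conversion to mzML.

    With max_msms_peaks=None the defaults are used, as in the study.
    """
    cmd = ["msconvert", raw_file, "--mzML", "-o", out_dir]
    if max_msms_peaks is not None:
        # Keep only the N most intense peaks in MS2 (and higher) spectra.
        # Assumed syntax: threshold <type> <threshold> <orientation> [<mslevels>]
        cmd += ["--filter", f"threshold count {max_msms_peaks} most-intense 2-"]
    return cmd
```

With msconvert on the PATH, the returned list could be passed to `subprocess.run(cmd, check=True)`, once per vendor file.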

OpenMS
Platform that allows you to do almost everything with your MS data (particularly within proteomics)
Works with data from all manufacturers
Tutorial is here:
Used this to search all datasets and calculate FDR

OpenMS
Workflows are constructed within KNIME (v 3.3.1)
Each node can have many parameters that can be set (e.g. for a search)
Default parameters do not always work and need to be tuned, but it is possible to generate a workflow that produces results for all datasets

mzTab to tsv conversion
The default output for search results is mzIdentML, a format that is great for computers and contains all metadata, but is not very human-readable or usable
A more usable output standard is mzTab
mzTab contains protein lists and peptide lists in one mixed table, which is not good for further processing
Julianus Pfeuffer and Lars Nilse (OpenMS team) have written an R script that I have modified to generate a table of only peptide results, discarding q-values > 0.01
This script is called make_tsv.R and it generates files that are called psm.tsv (one file for each dataset)
This is no longer necessary: the OpenMS team has developed an improved mzTab exporter
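
The extraction step can be illustrated in a few lines. The sketch below is a Python stand-in, not the make_tsv.R script. It relies on the mzTab convention that every line starts with a section prefix (`PSH` for the PSM header, `PSM` for PSM rows) and that fields are tab-separated. The name of the q-value column varies between exporters, so it is passed in as a parameter; the `opt_global_q-value` name used in the example is an assumption.

```python
def extract_psms(mztab_text, qvalue_column, max_q=0.01):
    """Return (header, rows) for PSM lines whose q-value is <= max_q."""
    header = None
    q_index = None
    rows = []
    for line in mztab_text.splitlines():
        fields = line.split("\t")
        if fields[0] == "PSH":
            header = fields[1:]                    # column names of the PSM section
            q_index = header.index(qvalue_column)
        elif fields[0] == "PSM" and header is not None:
            values = fields[1:]
            try:
                q = float(values[q_index])
            except ValueError:
                continue                           # 'null' or non-numeric q-value
            if q <= max_q:
                rows.append(values)
    return header, rows


# Tiny hand-made mzTab fragment; the column name is an assumed example.
example_mztab = (
    "MTD\tmzTab-version\t1.0.0\n"
    "PSH\tsequence\tcharge\topt_global_q-value\n"
    "PSM\tPEPTIDEK\t2\t0.001\n"
    "PSM\tAAAK\t2\t0.05\n"
    "PSM\tGGGR\t3\tnull\n"
    "PRT\tP12345\n"
)
example_header, example_rows = extract_psms(example_mztab, "opt_global_q-value")
```

Protein (`PRT`) and metadata (`MTD`) lines are skipped, which removes the mixed-table problem described above. The header and rows can then be written out with `"\t".join(...)` to obtain one tsv file per dataset.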

Summarise all data
All psm.tsv files were pulled together in one large file that contains all identified peptides (q<0.01)
Using the dplyr library in R, we performed a full join of all individual tables and extracted for each dataset only the q-value (as a measure of identification confidence)
Then we used Perseus to visualise the data in a heatmap
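
A full join keeps every peptide seen in any dataset and leaves a gap where a dataset did not identify it. Those gaps are the missing values in the heatmap. The Python sketch below illustrates the idea with plain dictionaries instead of dplyr. It assumes each dataset has already been reduced to one q-value per peptide sequence, and the dataset names are hypothetical.

```python
def full_join_qvalues(tables):
    """Full outer join of per-dataset {peptide: q-value} tables.

    Returns (header, rows); a peptide missing from a dataset gets None.
    """
    datasets = list(tables)
    # Union of all peptides seen in any dataset, sorted for a stable order.
    peptides = sorted(set().union(*(t.keys() for t in tables.values())))
    header = ["peptide"] + datasets
    rows = [[pep] + [tables[d].get(pep) for d in datasets] for pep in peptides]
    return header, rows


# Hypothetical example with two datasets.
example_tables = {
    "lab_A": {"PEPTIDEK": 0.001, "AAAK": 0.004},
    "lab_B": {"PEPTIDEK": 0.002, "GGGR": 0.009},
}
example_header, example_rows = full_join_qvalues(example_tables)
```

Here `PEPTIDEK` has a q-value in both columns, while `AAAK` and `GGGR` each have one `None` entry.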

Visualise results
Results could be visualised using an R script, but I did not have time, so we used Perseus (not open source, but free for academics and with several papers published)
Perseus tutorials and lectures on YouTube:

Heatmap of peptide identifications
Red colour codes for high confidence identifications, blue for lower confidence
Grey cells are missing values
Clustering was performed with a Euclidean distance function
I think that this heatmap does show reasonable reproducibility
Don't know yet for sure how to get the best interpretation

Conclusions
We have learnt a great deal about improving our QC experiments and procedures
Cross-platform data analysis for QC is difficult, but can be implemented
Commercial standards (external or internal) cost money, but are important (cross-laboratory reproducibility)
ID-based and non-ID-based QC parameters are very complementary
For phase II we want to produce a generally applicable data analysis workflow that can be distributed to all participants (providing qcML output)

Thank you
