Feature extraction and alignment for LC/MS data Tianwei Yu Department of Biostatistics and Bioinformatics Rollins School of Public Health Emory University April 25, 2019
LC/MS Liquid chromatography retention time Take “slices” in retention time, send to MS Liquid chromatography Mass-to-charge ratio (m/z)
Here is an example of LC/MS data. (a) Original data; (b) square root-transformed data to show smaller peaks; (c) A portion of the data showing details.
LC/MS
Some computational issues in LC/MS: noise reduction and feature detection; modeling peaks and feature quantification; retention time correction; feature alignment; grouping multiple features from one molecule caused by (1) isotopes and (2) multiple charge states; mapping MS2 for identification.
Some notations of the data
Feature detection
Feature detection Katajamaa & Oresic (2007) J Chr. A 1158:318
Feature detection: XCMS matched filter. The filter coefficients are equal to a second-derivative Gaussian function. The filtered chromatogram crosses the x-axis roughly at the peak inflection points.
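The matched-filter idea can be illustrated with a minimal sketch. This is not the XCMS implementation; the function names, the kernel half-width of four standard deviations, and the sign convention (negated second derivative, so that a peak gives a positive response) are assumptions made for illustration.

```python
import numpy as np

def second_derivative_gaussian(sigma, half_width=None):
    """Negated second derivative of a Gaussian, sampled on integer offsets.

    The sign is flipped so that a chromatographic peak produces a positive
    filter response (an illustrative convention, not taken from XCMS).
    """
    if half_width is None:
        half_width = int(4 * sigma)  # assumed truncation point
    t = np.arange(-half_width, half_width + 1, dtype=float)
    return (1.0 - (t / sigma) ** 2) * np.exp(-t ** 2 / (2.0 * sigma ** 2)) / sigma ** 2

def matched_filter(chromatogram, sigma):
    """Convolve an extracted ion chromatogram with the kernel above."""
    kernel = second_derivative_gaussian(sigma)
    return np.convolve(chromatogram, kernel, mode="same")

def zero_crossings(filtered):
    """Indices i where the filtered signal changes sign between i and i + 1."""
    negative = np.signbit(filtered)
    return np.where(negative[:-1] != negative[1:])[0]

# Synthetic chromatogram: one Gaussian peak centred at scan 50.
scans = np.arange(101)
chrom = np.exp(-(scans - 50.0) ** 2 / (2.0 * 4.0 ** 2))
filtered = matched_filter(chrom, sigma=4.0)
crossings = zero_crossings(filtered)
```

In this sketch the response is positive around the peak apex and changes sign a few scans to either side, which is the property the slide uses to delimit a peak.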
Feature detection: XCMS centWave. First, directly scan for regions where at least pmin centroids occur with a deviation of less than μ ppm. Then detect peaks on multiple scales using the Continuous Wavelet Transform (CWT), which reliably detects chromatographic peaks of differing width.
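The region-scanning step can be sketched roughly as follows. This is a simplified illustration, not the centWave code: the function name `find_rois`, the dictionary layout, the rule of matching against the running mean m/z, and the requirement that a region be contiguous in scans are all assumptions. The CWT peak-detection step is not shown.

```python
def find_rois(scans, ppm, p_min):
    """Collect regions of at least p_min consecutive centroids within ppm.

    scans: list of lists of centroid m/z values, one list per scan.
    Returns a list of dicts with keys "mzs" and "scans".
    """
    open_rois = []
    finished = []
    for scan_idx, mzs in enumerate(scans):
        still_open = []
        used = set()
        for roi in open_rois:
            mean_mz = sum(roi["mzs"]) / len(roi["mzs"])
            tol = mean_mz * ppm * 1e-6
            # Closest unused centroid in this scan within the ppm tolerance.
            best = None
            for j, mz in enumerate(mzs):
                if j in used or abs(mz - mean_mz) > tol:
                    continue
                if best is None or abs(mz - mean_mz) < abs(mzs[best] - mean_mz):
                    best = j
            if best is not None:
                used.add(best)
                roi["mzs"].append(mzs[best])
                roi["scans"].append(scan_idx)
                still_open.append(roi)
            elif len(roi["mzs"]) >= p_min:
                finished.append(roi)  # region ended and is long enough
        # Centroids that extended no region start new candidate regions.
        for j, mz in enumerate(mzs):
            if j not in used:
                still_open.append({"mzs": [mz], "scans": [scan_idx]})
        open_rois = still_open
    finished.extend(r for r in open_rois if len(r["mzs"]) >= p_min)
    return finished

example_scans = [
    [100.0000, 200.0],
    [100.0005, 300.0],
    [100.0010],
    [100.0002],
    [150.0],
]
rois = find_rois(example_scans, ppm=10, p_min=3)
```

Here only the trace near m/z 100 survives, because it is the only one that persists for at least three scans within 10 ppm.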
Feature detection Adaptive binning
Feature detection apLCMS run filter Subject to:
Peak modeling and quantification In high-resolution LC/MS data, every peak is a thin slice in m/z, so there is no need to model the MS dimension. Modeling the LC dimension is important for quantification. Models developed for traditional LC data can be applied here. Most empirical peak shape models were derived from the Gaussian model, with changes made to account for asymmetry in the peak shape.
Peak modeling and quantification Generalized exponential function Data Analysis and signal processing in chromatography. A. Felinger
Peak modeling and quantification Log-normal function. Data Analysis and signal processing in chromatography. A. Felinger
Peak modeling and quantification The bi-Gaussian model: Data Analysis and signal processing in chromatography. A. Felinger
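As an illustration, the bi-Gaussian shape is commonly written as two half-Gaussians that share an apex and a height but have different standard deviations on the left and right. The sketch below uses that common form; the parameter names are assumptions and the parametrisation may differ from the one shown on the slide.

```python
import numpy as np

def bi_gaussian(t, apex, sigma_left, sigma_right, height):
    """Bi-Gaussian peak: a different Gaussian width on each side of the apex."""
    t = np.asarray(t, dtype=float)
    sigma = np.where(t < apex, sigma_left, sigma_right)
    return height * np.exp(-(t - apex) ** 2 / (2.0 * sigma ** 2))

def bi_gaussian_area(sigma_left, sigma_right, height):
    """Closed-form area: the sum of two half-Gaussian integrals."""
    return height * np.sqrt(np.pi / 2.0) * (sigma_left + sigma_right)

# A tailing peak: narrow on the left, wide on the right.
rt = np.arange(0.0, 200.0, 0.01)
intensity = bi_gaussian(rt, apex=60.0, sigma_left=3.0, sigma_right=8.0, height=1000.0)
numeric_area = intensity.sum() * 0.01
```

The closed-form area is what makes the model convenient for quantification: once the four parameters are fitted, the feature intensity follows directly.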
Peak modeling and quantification Some peaks share m/z and partially overlap in RT. Some heuristic methods (require low noise): Data Analysis and signal processing in chromatography. A. Felinger
Peak modeling and quantification Statistical approach: (1) select a set of smoother window sizes; (2) for each window size, run the smoother and an EM-like algorithm to fit the data, and find the corresponding BIC value; (3) choose the result with the minimum BIC value.
Peak modeling and quantification Bi-Gaussian mixture Gaussian mixture
Retention time correction With every run, the LC dimension data has some fluctuation. Identify “reliable” peaks in both samples, use non-linear curve fitting to adjust the retention time. Anal Chem. 2006 Feb 1;78(3):779-87.
Retention time correction Select a sample as the reference; every other sample is corrected against the reference. To correct a sample: (1) pair peaks in the two samples that are close enough in m/z and for which there are no multiple peaks at the same m/z; (2) fit a nonlinear curve through their RT values; (3) correct based on the nonlinear fit.
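The pairing and correction steps above can be sketched as follows. The function names and the m/z tolerance are hypothetical, the uniqueness check is applied to both samples as an assumption, and a quadratic polynomial stands in for the nonlinear curve; the source does not say which curve-fitting method is used.

```python
import numpy as np

def pair_peaks(ref, sample, mz_tol):
    """Pair peaks that match uniquely in m/z.

    ref, sample: arrays of shape (n, 2) with columns (m/z, RT).
    Returns an array of (sample RT, reference RT) pairs.
    """
    pairs = []
    for mz, rt in sample:
        ref_hits = ref[np.abs(ref[:, 0] - mz) <= mz_tol]
        sample_hits = sample[np.abs(sample[:, 0] - mz) <= mz_tol]
        # Skip this m/z if either sample has more than one peak there.
        if len(ref_hits) == 1 and len(sample_hits) == 1:
            pairs.append((rt, ref_hits[0, 1]))
    return np.array(pairs)

def fit_rt_shift(pairs, degree=2):
    """Fit the RT deviation (reference minus sample) as a polynomial in sample RT."""
    return np.polyfit(pairs[:, 0], pairs[:, 1] - pairs[:, 0], degree)

def correct_rt(rt, coefficients):
    """Apply the fitted deviation to retention times of the sample."""
    rt = np.asarray(rt, dtype=float)
    return rt + np.polyval(coefficients, rt)

# Synthetic example: the sample drifts from the reference by a smooth curve.
sample_rt = np.array([10.0, 20.0, 30.0, 40.0, 50.0])
ref_rt = sample_rt + 1.0 + 0.001 * sample_rt ** 2
mz = np.array([100.0, 150.0, 200.0, 250.0, 300.0])
ref = np.vstack([np.column_stack([mz, ref_rt]),
                 [[400.000, 61.0], [400.002, 63.0]]])  # ambiguous m/z in reference
sample = np.vstack([np.column_stack([mz, sample_rt]), [[400.001, 60.0]]])

pairs = pair_peaks(ref, sample, mz_tol=0.005)
coefficients = fit_rt_shift(pairs)
corrected = correct_rt(sample_rt, coefficients)
```

The peak near m/z 400 is left out of the fit because the reference has two candidates for it, which is the kind of unreliable match the pairing rule is meant to exclude.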
Peak alignment
Peak alignment Dynamic programming. BMC Bioinformatics 2007, 8:419
Peak alignment First align the m/z dimension by binning. Then, within each m/z bin, use kernel density estimation to find “meta-peaks”. Anal Chem. 2006 Feb 1;78(3):779-87.
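A rough sketch of the kernel density step, applied to the retention times of peaks from several samples that already fall in one m/z bin. The Gaussian kernel, the bandwidth, the grid step, and the rule of assigning each peak to the nearest density maximum are illustrative assumptions, not details taken from the cited paper.

```python
import numpy as np

def rt_density(rts, grid, bandwidth):
    """Gaussian kernel density estimate of retention times on a grid."""
    z = (grid[:, None] - rts[None, :]) / bandwidth
    norm = len(rts) * bandwidth * np.sqrt(2.0 * np.pi)
    return np.exp(-0.5 * z ** 2).sum(axis=1) / norm

def find_meta_peaks(rts, bandwidth, step=0.1):
    """Locate density maxima ("meta-peaks") and assign each peak to the nearest one."""
    rts = np.asarray(rts, dtype=float)
    grid = np.arange(rts.min() - 3 * bandwidth, rts.max() + 3 * bandwidth + step, step)
    density = rt_density(rts, grid, bandwidth)
    # Interior grid points that are higher than both neighbours.
    is_max = (density[1:-1] > density[:-2]) & (density[1:-1] >= density[2:])
    centers = grid[1:-1][is_max]
    labels = np.abs(rts[:, None] - centers[None, :]).argmin(axis=1)
    return centers, labels

# Retention times (seconds) of peaks from several samples in one m/z bin.
bin_rts = [100.0, 101.0, 99.5, 200.0, 201.0, 199.0]
centers, labels = find_meta_peaks(bin_rts, bandwidth=5.0)
```

Peaks that share a label would be treated as the same feature across samples, giving one row of the final data matrix per meta-peak.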
An example of the overall strategy in LC/MS metabolomics Anal Chem. 2006 Feb 1;78(3):779-87.
On real data
Semi-supervised detection
Semi-supervised detection To reduce false-positives
Semi-supervised detection Example of features found by hybrid approach only.
Semi-supervised detection
The data matrix