Min Zhang, MD, PhD, Purdue University. Joint work with Yanzhu Lin and Dabao Zhang.
Outline
- Data Summary
- Methods
- Data Analysis Procedure
- Preliminary Results
- Preprocessing GC x GC-MS Data: Methods
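CCE Data Summary
Phenotype summary of the currently available data for the CCE project.
Table columns: Healthy, Colon Cancer, Rectal Cancer, Polyp, NA, Total.
Table rows: Lipidomics (Lipid), GProteomics (GP), NMR, Teac, Comet.
(The cell counts are not preserved in this text.)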
Summary of Overlap
Overlap between any 2 data sets: matrix of Lipid, GP, NMR, Teac, Comet against Lipid, GP, NMR, Teac, Comet (counts not preserved in this text).
Overlap among any 3 data sets:
- Lipid & GP & Teac: 41
- Lipid & GP & Comet: 37
- Lipid & Teac & Comet: 41
- GP & NMR & Teac: 16
- GP & Teac & Comet: 43
- NMR & Teac & Comet: 2
Overlap among any 4 data sets:
- Lipid & GP & Teac & Comet: 37
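Overlap counts of this kind can be computed by intersecting the sample ID sets of each combination of data sets. The sketch below is only an illustration: the function name `overlap_counts` and the sample IDs are hypothetical and are not taken from the CCE data.

```python
from itertools import combinations


def overlap_counts(sample_ids, k):
    """Count the samples shared by every combination of k data sets.

    sample_ids maps a data set name to the set of sample IDs it contains.
    """
    counts = {}
    for names in combinations(sorted(sample_ids), k):
        shared = set.intersection(*(sample_ids[name] for name in names))
        counts[names] = len(shared)
    return counts


# Hypothetical sample IDs, for illustration only.
sample_ids = {
    "Lipid": {"s1", "s2", "s3", "s4"},
    "GP": {"s2", "s3", "s4", "s5"},
    "Teac": {"s3", "s4", "s6"},
}

pairwise = overlap_counts(sample_ids, 2)
triple = overlap_counts(sample_ids, 3)
```

The keys of each result are tuples of data set names in alphabetical order, so the same function covers the 2-, 3- and 4-way summaries by changing `k`.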
Overlap of Different Omics Data
Methods for Integrating Omics
Common methods:
- Principal Component Analysis (Jolliffe, I., 1986)
- Co-Inertia Analysis (Doledec, S. and Chessel, D., 1994)
- Partial Least Squares (Wold, H., 1966)
- Bayesian analysis method (Webb-Robertson et al., 2009)
Our method: we fit a model to each individual data set with the iteratively weighted partial least squares (IWPLS) method, then use a Bayesian method to integrate the results from the individual data sets.
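The slide does not spell out the Bayesian integration step. One simple possibility, shown here purely as an assumption, is to combine the class posterior probabilities from each data set's model under a conditional independence (naive Bayes) assumption. The function name `combine_posteriors` and all probabilities below are hypothetical.

```python
def combine_posteriors(posteriors, prior):
    """Combine per-data-set class posteriors into one posterior.

    Assumes the data sets are conditionally independent given the class,
    so P(c | x1, ..., xm) is proportional to
    P(c) * product over data sets of P(c | xi) / P(c).
    """
    unnormalized = {}
    for c in prior:
        score = prior[c]
        for post in posteriors:
            score *= post[c] / prior[c]
        unnormalized[c] = score
    total = sum(unnormalized.values())
    return {c: unnormalized[c] / total for c in unnormalized}


# Hypothetical posteriors for one sample, for illustration only.
nmr = {"Healthy": 0.6, "Polyp": 0.4}
gp = {"Healthy": 0.2, "Polyp": 0.8}
prior = {"Healthy": 0.5, "Polyp": 0.5}

combined = combine_posteriors([nmr, gp], prior)
```

In this sketch a confident model outweighs a weakly opposed one, which is one way an integrated classifier can outperform the weaker of two individual models.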
Overlap Between NMR and G-Proteomics
NMR: 53 samples
Global Proteomics: 65 samples
Overlap: 17 samples
- One sample: without phenotype information
- One sample: from blood draw 2
- 15 samples: all from blood draw 1, with phenotype either “Healthy Control” or “Polyp”
Data Analysis Procedure
- Metabolomics (NMR) data: preprocessing, ending with 1824 variables, then the IWPLS method
- Global Proteomics data: preprocessing, ending with 5407 variables, then the IWPLS method
- Integrate the results of the two IWPLS fits
Analysis Results Our method:
Analysis Results (cont.)
Summary (classification rate by data set):
- GProteomics: 100%
- NMR: 85.7%
- Integrated NMR and GProteomics: 100%
Other methods tried:
- PLS: ending with 0 components
- Univariate t-test: no variable is significant
Example: Overlap of Three Data Sets
For the overlap among three data sets, we focus on Lipidomics, Teac and Comet.
Data summary:
- Phenotype summary: sample size for Healthy, Polyp, Colon, Rectal and Total (counts not preserved in this text)
- Variable summary: number of variables for Lipidomics, Teac and Comet (the values are run together as "5212" in this text)
Data analysis: we group the colon cancer and rectal cancer patients together as one cancer group, while keeping the other two groups. Then we try the following methods:
- Method 1: POCRE
- Method 2: ANOVA test
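For the ANOVA method, a one-way F statistic can be computed for each variable across the three groups (healthy, polyp, cancer). The sketch below computes only the F statistic in plain Python; the function name `one_way_anova_f` and the measurements are hypothetical. A p-value would additionally need the F distribution, for example via `scipy.stats.f_oneway`.

```python
def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Between-group sum of squares: group sizes times squared mean offsets.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Within-group sum of squares: squared deviations from each group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Hypothetical measurements of one variable, for illustration only.
healthy = [1.0, 2.0, 3.0]
polyp = [2.0, 3.0, 4.0]
cancer = [5.0, 6.0, 7.0]  # colon and rectal cancer pooled into one group

f_stat = one_way_anova_f([healthy, polyp, cancer])
```

Running the test once per variable and keeping those with a large F statistic gives a simple univariate screen, in contrast to a multivariate method such as POCRE.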
Results
Misclassification rate:
- POCRE: 17%
- ANOVA: 39%
Variables identified:
- POCRE: Lipids: (list not preserved); Teac: TEAC_mM
- ANOVA: Lipids: (list not preserved); Teac: TEAC_mM
Preprocessing GC x GC-MS Methods
How to choose the reference sample for alignment?
- Choose the chromatogram in the middle of the run sequence, or the chromatogram containing the highest number of common chemical constituents (i.e. peaks).
- Choose the chromatogram that is most similar to the loading of the first principal component in a PCA model on the unaligned data, or simply to the mean of all chromatograms.
Similarity index method for choosing the reference sample: a similarity index is defined for each chromatogram (the formula and its symbol definitions are not preserved in this text). The chromatogram with the maximum similarity index is chosen as the reference sample.
Ref: Skov, T. et al., Automated Alignment of Chromatographic Data, Journal of Chemometrics, Vol. 20, Issue 11-12, 2007.
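The similarity index formula is missing from this slide. The sketch below assumes it is the product of the absolute Pearson correlation coefficients between a chromatogram and every other chromatogram, which is how the index in the cited paper is commonly described; treat that definition as an assumption. The function names and the synthetic Gaussian peaks are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def similarity_index(chromatograms, i):
    """Assumed index: product of |r| between chromatogram i and all others."""
    index = 1.0
    for j, other in enumerate(chromatograms):
        if j != i:
            index *= abs(pearson(chromatograms[i], other))
    return index


def choose_reference(chromatograms):
    """Return the position of the chromatogram with the maximum index."""
    scores = [similarity_index(chromatograms, i) for i in range(len(chromatograms))]
    return max(range(len(scores)), key=scores.__getitem__), scores


def peak(center, width=3.0, length=21):
    """Synthetic single-peak chromatogram, for illustration only."""
    return [math.exp(-0.5 * ((t - center) / width) ** 2) for t in range(length)]


# Three hypothetical runs of the same peak with small retention time shifts.
runs = [peak(9), peak(10), peak(11)]
reference, scores = choose_reference(runs)
```

In this toy case the middle run is selected because it is the closest to both of the others, which matches the intent of picking the most representative chromatogram as the alignment target.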
Results