A new R package statTarget Hemi Luan hemi.luan@gmail.com; hm-luan@msn.com Hong Kong Baptist University.

Slides:



Advertisements
Similar presentations
I OWA S TATE U NIVERSITY Department of Animal Science Using Basic Graphical and Statistical Procedures (Chapter in the 8 Little SAS Book) Animal Science.
Advertisements

CountrySTAT Team-I November 2014, ECO Secretariat,Teheran.
Linear regression models
1 Statistics & R, TiP, 2011/12 Linear Models & Smooth Regression  Linear models  Diagnostics  Robust regression  Bootstrapping linear models  Scatterplot.
Genomic Profiles of Brain Tissue in Humans and Chimpanzees II Naomi Altman Oct 06.
Latent Growth Curve Modeling In Mplus:
Datamining and statistical learning - lecture 9 Generalized linear models (GAMs)  Some examples of linear models  Proc GAM in SAS  Model selection in.
1 Peter Fox Data Analytics – ITWS-4963/ITWS-6965 Week 12a, April 21, 2015 Revisiting Regression – local models, and non-parametric…
Introduction to Predictive Learning
Chapter 11 Multiple Regression.
Assumption of Homoscedasticity
Analyzing Metabolomic Datasets Jack Liu Statistical Science, RTP, GSK
Calibration & Curve Fitting
Hydrologic Statistics
Midterm Review. 1-Intro Data Mining vs. Statistics –Predictive v. experimental; hypotheses vs data-driven Different types of data Data Mining pitfalls.
How to Analyze Data? Aravinda Guntupalli. SPSS windows process Data window Variable view window Output window Chart editor window.
2007 GeneSpring MS GeneSpring for Metabolite BioMarker Analysis using Mass Spectrometry data Agilent Q-TOF VIP Visit Jan 16-17, 2007 Santa Clara, CA Thon.
Climate Predictability Tool (CPT) Ousmane Ndiaye and Simon J. Mason International Research Institute for Climate and Society The Earth.
1 Experimental Statistics - week 4 Chapter 8: 1-factor ANOVA models Using SAS.
ArrayCluster: an analytic tool for clustering, data visualization and module finder on gene expression profiles 組員:李祥豪 謝紹陽 江建霖.
Last lecture summary. Basic terminology tasks – classification – regression learner, algorithm – each has one or several parameters influencing its behavior.
The Examination of Residuals. Examination of Residuals The fitting of models to data is done using an iterative approach. The first step is to fit a simple.
ANOVA: Analysis of Variance.
Bias and Variance of the Estimator PRML 3.2 Ethem Chp. 4.
BOĞAZİÇİ UNIVERSITY DEPARTMENT OF MANAGEMENT INFORMATION SYSTEMS MATLAB AS A DATA MINING ENVIRONMENT.
Innovative Paths to Better Medicines Design Considerations in Molecular Biomarker Discovery Studies Doris Damian and Robert McBurney June 6, 2007.
Mr. Magdi Morsi Statistician Department of Research and Studies, MOH
Lecture 6: Point Interpolation
Random Forests Ujjwol Subedi. Introduction What is Random Tree? ◦ Is a tree constructed randomly from a set of possible trees having K random features.
Bias and Variance of the Estimator PRML 3.2 Ethem Chp. 4.
Data Analysis, Presentation, and Statistics
Tutorial I: Missing Value Analysis
Competition II: Springleaf Sha Li (Team leader) Xiaoyan Chong, Minglu Ma, Yue Wang CAMCOS Fall 2015 San Jose State University.
Applying MetaboAnalyst
Univariate Point Estimation Confidence Interval Estimation Bivariate: Linear Regression Multivariate: Multiple Regression 1 Chapter 4: Statistical Approaches.
Strategies for Metabolomic Data Analysis Dmitry Grapov, PhD.
EQUIPMENT and METHOD VALIDATION
Model Selection and the Bias–Variance Tradeoff All models described have a smoothing or complexity parameter that has to be considered: multiplier of the.
Chapter 4: Basic Estimation Techniques
SPSS: Using statistical software — a primer
Using Galaxy for Metabolomics
BINARY LOGISTIC REGRESSION
Regression Analysis AGEC 784.
Multiple Imputation using SOLAS for Missing Data Analysis
Scale Invariant Feature Transform (SIFT)
Fig. 1. proFIA approach for peak detection and quantification
Bias and Variance of the Estimator
Diagnostics and Transformation for SLR
Statistical Methods For Engineers
SEG 4630 E-Commerce Data Mining — Final Review —
Qc2 Development
What is Regression Analysis?
Concepts and Applications of Kriging
Qc2 Development
Getting the numbers comparable
Bootstrapping Jackknifing
Statistical Learning Dong Liu Dept. EEIS, USTC.
Softberry Mass Spectra (SMS) processing tools
HYDROLOGY Lecture 12 Probability
1. Homework #2 (not on posted slides) 2. Inferential Statistics 3
Basis Expansions and Generalized Additive Models (2)
15.1 The Role of Statistics in the Research Process
Parametric Methods Berlin Chen, 2005 References:
Feature extraction and alignment for LC/MS data
Diagnostics and Remedial Measures
Diagnostics and Transformation for SLR
Defining the Products: ‘GSICS Correction’
Clinical prediction models
Lecture 16. Classification (II): Practical Considerations
Statistics Review (It’s not so scary).
Presentation transcript:

A new R package statTarget Hemi Luan hemi.luan@gmail.com; hm-luan@msn.com Hong Kong Baptist University

statTarget An easy to use tool provide a graphical user interface for quality control based signal correction, integration of metabolomic data from multiple batches, and the comprehensive statistic analysis for non-targeted and targeted approaches.

Work flow of large scale metabolomcis Samples and QC samples Aliquot 1 (LC-MS-ESI+) Aliquot 2 (LC-MS-ESI-) Aliquot 3 (GC-MS-EI) Sample preparation (proteins precipitation with methanol) Sample preparation (proteins precipitation with methanol) Methyl chloroformate (MCF) derivatization Samples preparation LC-MS-ESI+ analysis Batch1 LC-MS-ESI+-analysis Batch1 GC-MS-EI analysis Batch1 Batch2 Batch2 Batch2 Batch3 Batch3 Batch3 Batch4 Batch4 Batch4 Batch5 Batch5 Batch5 Signal correction (QC-RLSC algorithm ) Signal correction (QC-RLSC algorithm ) Signal correction (QC-RLSC algorithm ) Quality control and quality assurance strategy Data analysis Data analysis

The robust QC based quality control procedures Bank (5) QC(5) True samples (5~10) QC True samples (5~10) QC QC Repeat (n) Experimental design QC-RLSC algorithm: The measured data of QC samples is smoothed by the Loess method. The coefficient values between QC samples are interpolated by the cubic-spline. the entire datasets is aligned to the spline result. Loplot *QC: Quality samples

The pipeline of statTarget Meta info. Metabolite Profile Metabolite Profile No No 80% rule Nosie 80% rule Multiple imputations Multiple imputations Missing value Missing value QC-RLSC QC based Signal Correction Variance stabilization and normalization Comprehensive statistic analysis PCA ROC & AUC PLS-DA P- value Box plot Random Forest Volcano plot Fold changes Descriptive statistics Odd. ratio

Main - GUI Component 1 – Shift Correction statTarget A graphical user interface, easy to use tool provide quality control based signal correction, integration of metabolomic data from multiple batches, and the comprehensive statistic analysis for non-targeted and targeted approaches. Component 2 Statistical Analysis

Parameters of Shift Correction Meta info. with CSV format Expression data with CSV format Removing peaks with more than 80% missing values (NA or 0) in each group. (Default: 0.8) The smoothing parameter which controls the bias-variance tradeoff. if the QCspan (0.20~0.75) is set at '0', the generalised cross-validation will be performed to avoid overfitting the observed data. (Default: 0) The parameter for imputation method.(i.e., nearest neighbor averaging, "KNN"; minimum values for imputed variables, "min", median values for imputed variables (Group dependent) "median” (Default: KNN) Lets you specify local constant regression (i.e., the Nadaraya-Watson estimator, degree=0), local linear regression (degree=1), or local polynomial fits (degree=2). (Default: 2)

Parameters for Statistical Analysis Expression data with CSV format Removing peaks with more than 80% missing values (NA or 0) in each group. (Default: 0) The parameter for imputation method.(i.e., nearest neighbor averaging, "KNN"; minimum values for imputed variables, "min", median values for imputed variables (Group dependent) "median” (Default: KNN) Generalised logarithm (glog) transformation for Variance stabilization (Default: TRUE) Scaling method before statistic analysis (PCA or PLS). Pareto can be used for specifying the Pareto scaling. Auto can be used for specifying the Auto scaling (or unit variance scaling). Vast can be used for specifying the vast scaling. Range can be used for specifying the Range scaling. (Default: Pareto) The number of permutation times for PLS-DA model (Default: 20) Multiple statistical analysis and univariate analysis (Default: TRUE) Principal components in PCA-PLS model for the x or y-axis (Default: 1 and 2) The number of variables in Gini plot of Randomforest model (=< 100). (Default: 20)

Data inputs for statTarget transX() is to generate statTarget inputs from Mass Spectrometry Data software, like XCMS. transX() directly read the .tsv file from diffreport function in XCMS software. datpath <- system.file("extdata",package = "statTarget") data <- paste(datpath,"xcmsOutput.tsv", sep="/") transX(data,"xcms") transCode(data,"xcms") #statTarget Version >= 1.5.6

Data input for shift correction Saved as .csv Meta information (Pheno File) Do not change the column name Class: The QC should be labeled as NA. Order : Injection sequence Batch: the analysis blocks or batches. Sample name should be consistent in Pheno file and Profile file. sample batch class order batch01_QC01 1 NA batch01_QC02 2 batch01_QC03 3 batch01_C05 4 batch01_S07 5 batch01_C10 6 batch01_QC04 7 Metabolites ID Sample name Expression data (Profile File) name batch01_QC03 batch01_QC01 batch01_QC02 batch01_C05 78.02055 14023 13071 15270 452.00345 22455 10737 27397 16898 138.96337 6635.4 8062.3 6294.6 6380.5 73.53838 26493 26141 25944 14949 385.12885 57625 56964 59045 78490 237.02815 105490 90166 92315 90442 222.02744 34676 34025 30253 30470 399.33298 42457 44250 41942 209710 394.23541 36120 29848 36707 14463 446.13947 7789.3 9748.6 8932.4

Data input for statistical analysis Saved as .csv --Stat File Sample name Class Factor The sample class or group should be matched with ordinal number, e.g., 1,2,3,4,5… Metabolites ID Name Group 1 2 3 4 5 21 0.0269394 0.212481308 0.421938647 0.072731082 0.476999498 71 0.004056653 0.104999071 0.014279731 0.024105957 0.167094773 50 0.570556469 0.171406702 0.02019163 0.061832702 14 0.014602305 4.650099858 0.022114445 0.007105695 0.156819962 74 0.045739012 0.36876884 0.019088228 0.006108264 0.278668387 9 0.008098186 0.235025541 0.182280343 0.025622084 0.046807751 32 0.019498917 0.683527006 0.035527196 0.014959372 0.430006029 20 0.060402839 0.015232417 0.007500019 0.149098523 168 0.015402529 1.036678761 0.155572884 0.022977097 0.152309431 28 0.036884932 6.041892945 0.103977726 0.017530557 0.299275534 62 0.031621389 6.028034401 0.352036582 0.086246923 0.935893131 Concentration

Output for Shift Correction

Output for Statistical Analysis

Typical plot in for Shift Correction function loplot Distribution of RSD

Typical plot in for Statistical Analysis PCA score plot PLS-DA score plot Volcano plot

Permutation for PLS-DA Random Forest MDSplot Random Forest Gini index plot