Interval selection complexity
Reduced spectra (Xr), size(Xr) = [m, k]
  1st interval  …  k-th interval
  V1,1  V1,2  …  V1,k
  V2,1  V2,2  …  V2,k
  ..
  Vm,1  Vm,2  …  Vm,k
Solution to be found: I1 I2 … Ik + width, concerning k intervals.
Each interval Ij can contain any number of variables, with a width less than the number of variables in the spectra, n. Complexity for that case: $C = n^{k+1}$
Polynomial complexity (C, approx.):
  n        k = 5    k = 10
  100      10^11    10^21
  1000     10^16    10^31
  10000             10^41
In interval notation, the complexity of the problem is defined by this polynomial equation. Therefore, the task for one thousand variables is much less complex than in binary notation.
WSC-10, Russia, Samara, 29 Feb. – 4 Mar. 2016
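To make the comparison on this slide concrete, the sketch below evaluates the interval-notation search space n^(k+1) next to a binary-notation search space. The 2^n size for binary notation (every variable switched on or off) is an assumption made here, not a figure from the slide, and the function names are hypothetical. The results are order-of-magnitude estimates only and need not match the approximate values in the slide's table digit for digit.

```python
import math


def interval_complexity(n: int, k: int) -> int:
    """Search-space size for k interval positions plus one shared width: n^(k+1)."""
    return n ** (k + 1)


def binary_complexity(n: int) -> int:
    """Assumed search-space size when each of n variables is either in or out: 2^n."""
    return 2 ** n


def order_of_magnitude(value: int) -> float:
    """Base-10 logarithm; math.log10 accepts arbitrarily large Python integers."""
    return math.log10(value)


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        for k in (5, 10):
            interval = order_of_magnitude(interval_complexity(n, k))
            binary = order_of_magnitude(binary_complexity(n))
            print(f"n={n:>5} k={k:>2}  interval ~ 10^{interval:.0f}  binary ~ 10^{binary:.0f}")
```

The point of the comparison is the growth rate: the interval form grows polynomially in n for a fixed number of intervals, while the assumed binary form grows exponentially.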
Idea
The idea of the JVSPO algorithm
Search for the optimal solution as a combination of intervals, in practice of their centers
Simultaneously optimize the interval width (one for all intervals)
Simultaneously optimize the preprocessing, chosen from a list of candidates
Place no restrictions on interval arrangement:
  Intervals may lie on the spectral boundaries
  Intervals may overlap
Perform the optimization using any appropriate routine (GA, SA, PSO, MC, etc.)
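The interval encoding described on this slide can be illustrated with a small decoder that turns interval centers and one shared width into a set of selected variable indices. This is a minimal sketch, not the JVSPO implementation: the function name, the integer index convention, and the way the width is split around the center are all assumptions. It only shows how intervals on the spectral boundaries are clipped and how overlapping intervals merge.

```python
from typing import Iterable, List


def decode_intervals(centers: Iterable[int], width: int, n_variables: int) -> List[int]:
    """Return the sorted, unique variable indices covered by the intervals.

    Every interval has the same width. Intervals that reach past either end
    of the spectrum are clipped, and overlapping intervals are merged.
    """
    half = width // 2
    selected = set()
    for center in centers:
        start = max(0, center - half)                   # clip at the left boundary
        stop = min(n_variables, center - half + width)  # clip at the right boundary
        selected.update(range(start, stop))
    return sorted(selected)


if __name__ == "__main__":
    # One interval on the left boundary and two overlapping intervals.
    print(decode_intervals([0, 5, 6], width=3, n_variables=10))  # [0, 1, 4, 5, 6, 7]
```

Because clipping and merging happen in the decoder, the optimizer is free to propose any combination of centers without extra constraints.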
Genetic algorithm based optimization
The scheme of the genetic algorithm we use for optimization looks like this. The intervals and data preprocessing are applied inside the fitness function.
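A generic genetic algorithm loop of the kind referred to here could look like the sketch below. It is not the scheme from the slide: the operators (tournament selection, one-point crossover, small random mutations, elitism), all constants, and all names are assumptions chosen for illustration. The fitness function is a toy stand-in that rewards covering a hidden set of "informative" variables; in a real application it would apply the decoded intervals and preprocessing to the spectra and return a model error such as a cross-validated RMSE.

```python
import random

N_VARS = 200  # hypothetical number of spectral variables
K = 3         # hypothetical number of intervals
INFORMATIVE = set(range(40, 60)) | set(range(150, 165))  # toy target variables


def decode(chrom):
    """Chromosome = K interval centers followed by one shared interval width."""
    *centers, width = chrom
    half = width // 2
    selected = set()
    for c in centers:
        selected.update(range(max(0, c - half), min(N_VARS, c - half + width)))
    return selected


def fitness(chrom):
    """Toy error (lower is better): missed informative plus half of the useless variables."""
    selected = decode(chrom)
    return len(INFORMATIVE - selected) + 0.5 * len(selected - INFORMATIVE)


def random_chrom(rng):
    return [rng.randrange(N_VARS) for _ in range(K)] + [rng.randint(3, 41)]


def mutate(chrom, rng, rate=0.2):
    out = list(chrom)
    for i in range(K):  # shift some centers by a few variables
        if rng.random() < rate:
            out[i] = min(N_VARS - 1, max(0, out[i] + rng.randint(-10, 10)))
    if rng.random() < rate:  # occasionally change the shared width
        out[K] = min(41, max(3, out[K] + rng.randint(-4, 4)))
    return out


def crossover(a, b, rng):
    cut = rng.randint(1, K)  # one-point crossover
    return a[:cut] + b[cut:]


def tournament(pop, rng, size=3):
    return min(rng.sample(pop, size), key=fitness)


def run_ga(generations=60, pop_size=40, seed=0):
    rng = random.Random(seed)
    pop = [random_chrom(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        best = min(pop, key=fitness)
        history.append(fitness(best))
        children = [best]  # elitism: the best individual always survives
        while len(children) < pop_size:
            child = crossover(tournament(pop, rng), tournament(pop, rng), rng)
            children.append(mutate(child, rng))
        pop = children
    best = min(pop, key=fitness)
    history.append(fitness(best))
    return best, history


if __name__ == "__main__":
    best, history = run_ga()
    print("best chromosome:", best, "toy error:", fitness(best))
```

Keeping the whole pipeline inside `fitness` means the same loop works unchanged when the chromosome is extended with preprocessing or modeling parameters.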
Examples of parameters for optimization
Interval wavelength selection only:
  I1 .. In | IW
Interval selection and preprocessing optimization:
  SNV, MSC, AS | I1 .. In | IW
  SG.W SG.P SG.D | I1 .. In | IW
Interval selection, preprocessing optimization and modeling metaparameters:
  SG.W SG.P SG.D | I1 .. In | IW | nLV
Generalized structure:
  Data pretreatment part | Variable selection part | Modeling part
These are the chromosome structures used in JVSPO. In general, our parameter set consists of three parts: data preprocessing, intervals, and modeling algorithm metaparameters. It can be tempting to optimize the maximum number of available parameters in one task. However, this strategy does not seem optimal, because it may lead to excessive complexity, which reduces the probability of finding the global optimum. Therefore, we propose keeping the chromosome reasonably compact. We suggest choosing a few modeling strategies and trying them instead of optimizing everything at once.
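The generalized three-part chromosome can be pictured as a flat gene list that is split into named fields. The sketch below is only an illustration: the class, the field names, and the flat-list layout are hypothetical, and reading SG.W, SG.P and SG.D as Savitzky-Golay window width, polynomial order and derivative order is an assumption based on the abbreviations.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Chromosome:
    sg_window: int                # SG.W (assumed: Savitzky-Golay window width)
    sg_polyorder: int             # SG.P (assumed: polynomial order)
    sg_deriv: int                 # SG.D (assumed: derivative order)
    centers: List[int]            # I1 .. In, the interval positions
    interval_width: int           # IW, one width shared by all intervals
    n_lv: Optional[int] = None    # nLV, present only when the modeling part is optimized


def split_chromosome(genes: Sequence[int], n_intervals: int, with_nlv: bool = False) -> Chromosome:
    """Split a flat gene list into pretreatment, variable selection and modeling parts."""
    expected = 3 + n_intervals + 1 + (1 if with_nlv else 0)
    if len(genes) != expected:
        raise ValueError(f"expected {expected} genes, got {len(genes)}")
    window, polyorder, deriv = genes[:3]
    centers = list(genes[3:3 + n_intervals])
    width = genes[3 + n_intervals]
    n_lv = genes[-1] if with_nlv else None
    return Chromosome(window, polyorder, deriv, centers, width, n_lv)


if __name__ == "__main__":
    print(split_chromosome([19, 2, 1, 120, 340, 610, 25, 5], n_intervals=3, with_nlv=True))
```

Making the modeling part optional mirrors the advice in the notes: the chromosome stays compact unless there is a specific reason to extend it.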
Dataset 1
The optimization was tested on a previously published dataset of 1000 spectra of raw milk in the visible and short-wave near infrared region (400–1100 nm), used to predict fat and total protein content by means of PLS regression in the presence of seasonal and geographical variability. Milk samples were collected and analyzed at a large dairy (Danone-Unimilk, Samara, Russia) over one year, from October 2013 to September 2014, from dairy farms in the district of Samara and surrounding regions.
Reference: Melenteva A., Galyanin V., Savenkova E., Bogomolov A., Building global models for fat and total protein content in raw milk based on historical spectroscopic data in the visible and short-wave near infrared range. Food Chemistry 203 (2016) 190–198
Calibration and validation stats for Dataset 1
Columns: Algorithm, nLV, then RMSE and R2 for each of Calibration, CV and Prediction
Fat content
  raw data 5 0.163 0.882 0.172 0.867 0.165 0.881
  iPLS 0.096 0.959 0.102 0.953 0.097 0.956
  SG1D2.19+int. 0.093 0.961 0.091
  SG1D2.19 0.154 0.893 0.167 0.875 0.160
Protein content
  raw data 6 0.115 0.691 0.125 0.636 0.122 0.682
  iPLS 0.101 0.760 0.107 0.730 0.104 0.776
  SG1D2.11+int. 0.797 0.758 0.805
  SG1D2.11 0.121 0.657 0.135 0.571 0.677
We compared several regression methods for determining fat and protein content. JVSPO gave the best results with a Savitzky-Golay derivative as preprocessing. The derivative parameters and the interval width were included in the optimization. Without variable selection, the same preprocessing gives no noticeable improvement compared to the raw data. Another interval method, iPLS, also showed good model performance, but it is still worse than our JVSPO.
Reference: Melenteva A., Galyanin V., Savenkova E., Bogomolov A., Building global models for fat and total protein content in raw milk based on historical spectroscopic data in the visible and short-wave near infrared range. Food Chemistry 203 (2016) 190–198
Models for Dataset 1
[Figure: models for fat and protein; panels: full spectra models, JVSPO models]
Dataset 2, IDRC 2014 shootout
[Figure: PCA, panels a), b), c)]
Another dataset came from the software shootout at IDRC 2014. Using our JVSPO, we won 2nd prize.
Validation stats for Dataset 2
         Participant 1      Participant 2   Participant 3   Participant 4
         (us, 2nd prize)                    (winner)
  RMSEP  0.119              0.567           0.105           0.202
  SEP    0.112              0.523           0.099           0.193
  R2     0.935              0.002           0.984           0.921
  Bias   -0.039             0.220           -0.035          -0.059
The winner used a local regression approach, building several models for different sample groups.
Reference: Benoit Igne, Andrey Bogomolov, Dongsheng Bu, Pierre Dardenne, Vladislav Galyanin and Peter Tillmann, Summary of the 2014 IDRC software shoot-out, NIR News 26 (2015) 8-14
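The three error statistics in this table are related, which the sketch below illustrates. It assumes the common chemometric definitions (bias as the mean of predicted minus reference, SEP as the bias-corrected standard deviation of the residuals with N - 1 in the denominator, RMSEP as the root mean squared residual); the shootout may have used slightly different conventions, and the function names and sample values are hypothetical. Under these definitions RMSEP^2 = bias^2 + (N - 1)/N * SEP^2, so for large N the RMSEP is close to sqrt(SEP^2 + bias^2).

```python
import math
from typing import Sequence, Tuple


def prediction_stats(reference: Sequence[float], predicted: Sequence[float]) -> Tuple[float, float, float]:
    """Return (rmsep, sep, bias) under the definitions assumed above."""
    n = len(reference)
    if n < 2 or n != len(predicted):
        raise ValueError("need two sequences of equal length with at least 2 values")
    residuals = [p - r for r, p in zip(reference, predicted)]
    bias = sum(residuals) / n
    rmsep = math.sqrt(sum(e * e for e in residuals) / n)
    sep = math.sqrt(sum((e - bias) ** 2 for e in residuals) / (n - 1))
    return rmsep, sep, bias


if __name__ == "__main__":
    # Made-up values, only to exercise the function.
    ref = [3.1, 3.5, 4.0, 4.2]
    pred = [3.0, 3.6, 3.9, 4.0]
    rmsep, sep, bias = prediction_stats(ref, pred)
    print(f"RMSEP={rmsep:.3f} SEP={sep:.3f} bias={bias:.3f}")

    # Large-N approximation applied to the Participant 1 column of the table.
    print(f"sqrt(SEP^2 + bias^2) = {math.hypot(0.112, -0.039):.3f}")
```

Applied to the Participant 1 column, the large-N approximation agrees with the reported RMSEP of 0.119 to within rounding.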
Memento
JVSPO is available online
JVSPO was performed in TPT-cloud, a web-based chemometrics software
Both milk models are available online to registered users:
  fat: http://tptcloud.com/model/graph/369
  protein: http://tptcloud.com/model/graph/370
JVSPO and interval selection can be performed online in a few clicks:
  http://tptcloud.com/workspace
http://tptcloud.com
Conclusion
JVSPO is a very efficient algorithm for optimizing multivariate models.
Typically, JVSPO works better than either interval selection or optimal preprocessing determined individually.
JVSPO is especially advantageous when analyzing spectroscopic data with multicollinearity.
The selected intervals and preprocessing may provide useful information for data interpretation.
Thanks for your attention!
v.galyanin@gmail.com
That’s all. Thank you for your attention.