Subset Selection Problem
Oxana Rodionova & Alexey Pomerantsev
Semenov Institute of Chemical Physics, Russian Chemometric Society, Moscow

Outline
- Introduction: what is a representative subset?
- Training set and test set
- Influential subset selection: boundary subset, Kennard-Stone subset, comparison of models
- Conclusions

What is a representative subset?
[Diagram: a model Y(X) is calibrated on one part of the data (X_I, Y_I) and used to predict the remaining parts (X_II, Y_II) and (X_III, Y_III).]

Influential Subset
Training set: X(n × m), Y(n × k) → Model I (A factors).
Influential subset: X(l × m), Y(l × k), with l < n → Model II (A factors).
Is Model II equivalent to Model I? The quality of prediction is compared through RMSEP_1 (Model I) and RMSEP_2 (Model II).

Training and Test Sets
The entire data set (K objects) is split into a training set (N objects) and a test set (K − N objects).

Statistical Tests
D. Jouan-Rimbaud, D.L. Massart, C.A. Saby, C. Puel, "Characterisation of the representativity of selected sets of samples in multivariate calibration and pattern recognition", Analytica Chimica Acta 350 (1997).
A subset is representative of the full set when the two data clouds agree in orientation, in dispersion around their means, and in position in space; the tests used are a generalization of Bartlett's test and the Hotelling T²-test.
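The Hotelling T²-test compares the mean positions of two data clouds. A minimal NumPy sketch of the two-sample statistic (an illustration, not the slides' own code; in practice it would be run in score space, since with 118 wavelengths and far fewer samples the pooled covariance matrix is singular):

    import numpy as np
    from scipy import stats

    def hotelling_t2(X1, X2):
        """Two-sample Hotelling T^2 test: do two clouds share a mean?"""
        n1, p = X1.shape
        n2 = X2.shape[0]
        diff = X1.mean(axis=0) - X2.mean(axis=0)
        # pooled within-group covariance matrix
        S = ((n1 - 1) * np.cov(X1, rowvar=False)
             + (n2 - 1) * np.cov(X2, rowvar=False)) / (n1 + n2 - 2)
        t2 = n1 * n2 / (n1 + n2) * diff @ np.linalg.solve(S, diff)
        # exact F transformation of T^2
        f = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
        p_value = stats.f.sf(f, p, n1 + n2 - p - 1)
        return t2, p_value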

Influential subset → boundary samples.

Whole Wheat Samples (Data Description)
X: NIR spectra of whole wheat, 118 wavelengths; Y: moisture content. Entire set: N = 139 objects, data pre-processed.
Models: PLS with 4 PCs; SIC modeling with b_sic = 1.5.
Split: training set = 99 objects, test set = 40 objects.
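A minimal sketch of such a PLS calibration in Python (the file name and column layout are assumptions for illustration):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    # hypothetical layout: 118 NIR absorbance columns, then moisture
    data = np.loadtxt("wheat_nir.csv", delimiter=",")
    X, y = data[:, :118], data[:, 118]

    # first 99 objects as the training set, the rest as the test set
    pls = PLSRegression(n_components=4).fit(X[:99], y[:99])
    y_hat = pls.predict(X[99:]).ravel()
    rmsep = np.sqrt(np.mean((y[99:] - y_hat) ** 2))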

Boundary Subset
From the training set (an n × m matrix, n = 99) Model 1 identifies l = 19 boundary samples; the remaining n − l = 80 objects form the 'redundant subset'.
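The boundary samples are identified here via SIC modeling (b_sic = 1.5), whose internals the slides do not reproduce. As an illustrative stand-in only, one can rank training objects by their leverage in PLS score space and take the most extreme ones:

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression

    def leverage_boundary_subset(X, y, n_components=4, l=19):
        """Proxy for boundary-sample selection: the l training objects
        with the largest leverage in PLS score space (the talk itself
        uses SIC modeling for this step)."""
        pls = PLSRegression(n_components=n_components).fit(X, y)
        T = pls.transform(X)                   # n x A score matrix
        H = T @ np.linalg.inv(T.T @ T) @ T.T   # hat matrix in score space
        boundary = np.argsort(np.diag(H))[-l:]
        redundant = np.setdiff1d(np.arange(X.shape[0]), boundary)
        return boundary, redundant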

Boundary Subset
Model 1 is built on the training set (n = 99) and Model 2 on the boundary subset (l = 19), both with 4 PLS components and b_sic = 1.5; both are applied to the same test set.

SIC prediction
Model 1 (training set) → test set; Model 2 (boundary subset) → test set.

Quality of Prediction (PLS models)
Model 1 (training set) → test set: RMSEC = 0.303, RMSEP = 0.337, mean calibration leverage = 0.051, maximum = 0.25.
Model 2 (boundary set) → test set: RMSEC = 0.461, RMSEP = 0.357, mean calibration leverage = 0.26, maximum = 0.45.
Although Model 2 is calibrated on only 19 objects, its prediction error on the test set (RMSEP) is practically the same as Model 1's.
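These diagnostics are simple to compute; a minimal sketch (the function names are illustrative):

    import numpy as np

    def rmse(y_true, y_pred):
        """Root-mean-square error: RMSEC when evaluated on the
        calibration objects, RMSEP when evaluated on the test set."""
        return np.sqrt(np.mean((y_true - y_pred) ** 2))

    def calibration_leverages(T):
        """Leverages of the calibration objects, i.e. the diagonal of
        the hat matrix built from the n x A score matrix T."""
        return np.diag(T @ np.linalg.inv(T.T @ T) @ T.T)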

Kennard-Stone Method
Aim: select samples that are uniformly distributed over the predictor space. Objects are chosen sequentially in X (or score, T) space. Let d_jr, j = 1, ..., k, be the squared Euclidean distance from candidate object r to the k objects already in the subset; at each step the candidate with the largest minimal distance, max_r min_j d_jr, is added.
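A minimal NumPy sketch of this selection rule (seeded, as is conventional, with the two most mutually distant objects):

    import numpy as np

    def kennard_stone(X, l):
        """Select l objects from X (n x m) by the Kennard-Stone rule."""
        n = X.shape[0]
        # pairwise squared Euclidean distances
        D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
        # start with the two most mutually distant objects
        selected = list(np.unravel_index(np.argmax(D), D.shape))
        while len(selected) < l:
            remaining = [i for i in range(n) if i not in selected]
            # minimal distance from each candidate to the subset
            d_min = D[np.ix_(remaining, selected)].min(axis=1)
            selected.append(remaining[int(np.argmax(d_min))])
        return np.array(selected)

For the wheat data this would be called as kennard_stone(X_train, 19) to mirror the 19-object subsets above.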

Kennard-Stone Subset
From the training set (n = 99, Model 1, 4 PLS components) a K-S subset of l = 19 objects is selected and used to build Model 3, to be compared with Model 2 built on the boundary subset.

Boundary Subset & K-S Subset (SIC prediction)

Boundary Subset & K-S Subset (PLS models)
Model 2 (boundary set) → test set: RMSEC = 0.461, RMSEP = 0.357, mean calibration leverage = 0.26, maximum = 0.45.
Model 3 (K-S set) → test set: RMSEC = 0.229, RMSEP = 0.368, mean calibration leverage = 0.26, maximum = 0.73.

'Redundant Samples'
Training set: N = 99, Model 1 (PLS, 4 components, b_sic = 1.5); test set: N1 = 40.
Boundary set (L = 19, Model 2) leaves the redundant set RS_2, N − L = 80 objects.
Kennard-Stone set (L = 19, Model 3) leaves the redundant set RS_3, N − L = 80 objects.

Prediction of Redundant Sets
Model 2 (boundary set) → RS_2: RMSEP = 0.267. Model 3 (K-S set) → RS_3: RMSEP = 0.338.

Model Comparison
The entire data set (139 objects) is split at random into a training set (99 objects) and a test set (40 objects); the split is repeated 10 times and the results are averaged.
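A minimal sketch of such a repeated-split comparison (assuming X and y are loaded as above):

    import numpy as np
    from sklearn.cross_decomposition import PLSRegression
    from sklearn.model_selection import train_test_split

    def mean_rmsep(X, y, n_repeats=10, n_train=99, n_components=4):
        """Average test-set RMSEP over repeated random 99/40 splits."""
        rng = np.random.RandomState(0)
        errors = []
        for _ in range(n_repeats):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, train_size=n_train, random_state=rng)
            pls = PLSRegression(n_components=n_components).fit(X_tr, y_tr)
            y_hat = pls.predict(X_te).ravel()
            errors.append(np.sqrt(np.mean((y_te - y_hat) ** 2)))
        return np.mean(errors)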

Conclusions
1. The model constructed on the Boundary Subset predicts all other samples with an accuracy no worse than the calibration error evaluated on the whole data set.
2. The Boundary Subset is indeed significantly smaller than the whole Training Set.

Questions
1. Prediction ability: how should it be evaluated?
2. Representativity: how can it be verified?