The significance of the structure of data on PLS predictions of protein involving both natural and human experimental design
Åsmund Rinnan, Lars Munck
Three data sets of barley
A: Natural. B: Simulated. C: DoE.
B + C: The major substances protein, starch, cellulose, beta-glucan, fat and water are weighted to represent biological composition.
All measured on NIR 6500 from ... nm with 2 nm intervals.
Sample groups: normal barley, protein mutants, carbohydrate mutants.
Outline, Rinnan: Dataset, Preprocessing, PCA, iPCA, PLS, Biology, PLS - again, Summary.
Outline, Munck: Permutation, Mutants, Diff spec, Data structure, Genetics, Conclusion.
Pre-processing of spectra
Moving Window SNV with a 130 nm window.
The ... nm spectral region shows the smallest differences between the three data sets.
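The moving-window SNV step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the truncation of the window at the spectrum edges, and the choice of 65 points (130 nm at the stated 2 nm interval) are all assumptions.

```python
from statistics import mean, stdev

def moving_window_snv(spectrum, window_points=65):
    """Scale each point by the mean and standard deviation of a window centred on it.

    Assumption: 65 points correspond to 130 nm at 2 nm intervals, and the
    window is simply truncated near the two ends of the spectrum.
    """
    half = window_points // 2
    n = len(spectrum)
    out = []
    for i in range(n):
        local = spectrum[max(0, i - half):min(n, i + half + 1)]
        if len(local) < 2:
            out.append(0.0)  # not enough points to estimate a spread
            continue
        s = stdev(local)
        # A flat window has no spread; return 0 rather than divide by zero.
        out.append((spectrum[i] - mean(local)) / s if s > 0 else 0.0)
    return out
```

Because every point is centred and scaled locally, an additive offset or a positive multiplicative factor applied to the whole spectrum leaves the result unchanged, which is the property SNV is used for.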
PCA, ... nm
Interval PCA selects ... nm, giving the least differences between datasets.
Predicting protein using the three datasets
Columns: Nat, Sim, DoE. Rows: RMSE, r2, r, nLV (5, 2, 5), intercept, slope.
Regression coefficients
PLS diagnostics (to protein)
A. Simple correlation coefficients: wavelength absorption to protein content.
B. PLS regression coefficients.
Panels: Natural, Simulated, DoE.
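Diagnostic A can be sketched as one Pearson correlation per wavelength between absorbance and protein content across samples. The function names and the row-per-sample layout are assumptions made for illustration only.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def correlation_spectrum(spectra, protein):
    """spectra: one list of absorbances per sample, all on the same wavelength axis.

    Returns one correlation coefficient per wavelength.
    """
    return [pearson(list(column), protein) for column in zip(*spectra)]
```

Plotting this vector against wavelength gives a curve that can be compared with the PLS regression coefficients of diagnostic B.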
Isolating the chemical and biological components of the data sets.
A: Natural. B: Simulated. C: DoE.
Components: Chemistry, SimBiology, RestBiology.
SimBiology = B - C
RestBiology = (A - C) - (B - C)
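The two difference formulas can be sketched as element-wise subtraction of spectra. This assumes that the A, B and C spectra are matched sample by sample on a common wavelength axis, which the slide does not state explicitly; the function names are hypothetical.

```python
def subtract(u, v):
    """Element-wise difference of two spectra on the same wavelength axis."""
    return [a - b for a, b in zip(u, v)]

def decompose(a_natural, b_simulated, c_doe):
    """Return (SimBiology, RestBiology) for one matched triple of spectra.

    SimBiology  = B - C
    RestBiology = (A - C) - (B - C)
    """
    sim_biology = subtract(b_simulated, c_doe)
    rest_biology = subtract(subtract(a_natural, c_doe), sim_biology)
    return sim_biology, rest_biology
```

Note that C cancels in the second formula, so RestBiology reduces algebraically to A - B.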
Predicting protein by PLS: Chemistry and non-simulated (rest) biology show high contributions, while that of simulated biology is low.
Columns: Chemistry, SimBio, RestBio. Rows: RMSE, R, nLV (3, 1, 3), intercept, slope.
Normalized regression coefficients
Back to data, selected wavelengths
Full PLS, Correlation-PLS
Wavelengths abs to protein Assignment PLS Phil Williams
Quick comparison
Results: Summary
Interpretation: We are working by "Permutation science":
1. By mathematical validation of models: permutation of data in chemometrics, i.e. cross-validation.
"Permutation science":
2. Design of Experiments (DoE): permutation of data through experiments by human design.
"Permutation science":
1. By mathematical validation of models: permutation of data in chemometrics, i.e. cross-validation.
2. Design of Experiments (DoE): permutation of data through experiments by human design.
3. Natural design: permutation by selection of unique natural states where nature reveals its principles in data.
Question: In chemometrics, why not combine them all rather than focusing on mathematical permutation alone?
All three permutation approaches are at the heart of chemometric validation of models! Why not use them together, as we have done here? They are complementary.
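The first kind of permutation, cross-validation, can be sketched as below. To keep the example self-contained, a one-variable least-squares line stands in for the PLS model, and the interleaved ("venetian blinds") segment assignment is an assumption; neither is taken from the slides.

```python
from math import sqrt

def fit_line(x, y):
    """Least-squares line y = intercept + slope * x (stand-in for a PLS model)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sxx if sxx else 0.0
    return my - slope * mx, slope

def rmsecv(x, y, n_segments=5):
    """Root mean square error of cross-validation over interleaved segments."""
    n = len(x)
    squared_error = 0.0
    for seg in range(n_segments):
        held_out = set(range(seg, n, n_segments))  # every n_segments-th sample
        train = [i for i in range(n) if i not in held_out]
        intercept, slope = fit_line([x[i] for i in train], [y[i] for i in train])
        # Predict only the samples the model has not seen.
        squared_error += sum((y[i] - (intercept + slope * x[i])) ** 2 for i in held_out)
    return sqrt(squared_error / n)
```

Each sample is predicted exactly once by a model fitted without it, so the returned value estimates prediction error rather than fit error.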
Principles of natural processes are reflected in data
The solar eclipse reveals solar eruptions.
The NIR barley endosperm mutant model, developed since 1965 with expression control of genetics and environment. *)
Two types of mutants: regulative protein mutants (P) and carbohydrate (starch) mutants (C); normal barley (N).
*) J. Chemometrics 24: (2010)
How were the mutants found?
By a bivariate plot of % protein against mmol DBC (dye-binding capacity by Acilane Orange).
The Dye-Binding Capacity (DBC) instrument for basic amino acids (lysine).
Background: Development of screening methods for improving lysine and nutritional quality in barley. LM at the nutritional laboratory of the Swedish Seed Association, Svalöf, in ...
Plot labels: high-lysine mutation, mutation recombinants, normal recombinants. Axes: DBC, % protein.
Selecting endosperm mutants
J. Chemometrics 24: (2010)
Panels: No data; Vitamin E profile; A/P vs. β-glucan.
Any chemical (bi-)plot can select any mutant.
Conclusion: Each mutant produces a unique chemical fingerprint for each individual gene in a controlled genetic background (Bomi). The fingerprint is summarized at the level of chemical bonds by NIR spectroscopy. Cellular computation is soft, like a PCA.
There are deterministic differential NIR spectra for each mutant relative to the gene background Bomi, revealing a spectral absorption reproducibility as high as ... MSC log 1/R for the P mutant lys3.a (blue) and the C mutant lys5.g (brown).
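A differential spectrum against the Bomi background can be sketched as MSC followed by subtraction. The code below uses the standard MSC formulation (regress each spectrum on a reference, then remove the fitted offset and slope); the function names and the choice of reference spectrum are assumptions, not details given on the slide.

```python
def msc(spectra, reference):
    """Multiplicative scatter correction of each spectrum against `reference`.

    Assumes the reference is not flat and that each fitted slope is non-zero.
    """
    p = len(reference)
    ref_mean = sum(reference) / p
    ref_ss = sum((r - ref_mean) ** 2 for r in reference)
    corrected = []
    for s in spectra:
        s_mean = sum(s) / p
        # Least-squares fit: s = offset + slope * reference
        slope = sum((r - ref_mean) * (v - s_mean) for r, v in zip(reference, s)) / ref_ss
        offset = s_mean - slope * ref_mean
        corrected.append([(v - offset) / slope for v in s])
    return corrected

def difference_spectrum(mutant, background):
    """Element-wise difference between a corrected mutant and background spectrum."""
    return [m - b for m, b in zip(mutant, background)]
```

In use, the mutant and Bomi spectra (log 1/R) would be corrected against the same reference before `difference_spectrum` is applied, so that only differences beyond offset and scatter remain.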
Data structure is superordinate to chemometric analysis
Plot labels: 3.2, 3c, 3a
The 3a and 3c P mutants are differentiated in this PCA. However, spectral differences in the ... nm area represent a much more finely tuned and informative change in β-glucan, from 3.1% in 3a to 6.4% in 3c.
How is the chemical composition of the cell decided?
Through soft modeling of intercellular dynamics of the whole cell by quantum and chemical cross-talk, as revealed by the movements of chromosomes at mitosis.
Cell emergence is like music, as directed by the whole chemical orchestra of the cell.
Biological macro data are basically deterministic, calculated in situ by "set probability" controlled by the whole cell.
Holistic analysis is limited by uncertainty, specified as irreducibility "top down" and indeterminacy "bottom up".
The structure of data is the king that rules mathematical modeling by data inspection.
Because of the determinism demonstrated here, the development of gentle data models (such as MSC) and data inspection software is of essential importance in avoiding a reduction of information.
Chemometrics is excellent for overviews, but the results have to be checked by data inspection.