Data Mining Applied to Chemistry and Chemical Engineering
Lu Wencong (陆文聪)
Department of Chemistry, College of Sciences, Shanghai University, P. R. China
1 Introduction
1.1 Concept
Data mining is an analytic process designed to explore data in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data.
1.2 Main Focuses
(1) Materials design: how to find the best preparation conditions or the structure-property relationships of materials, in order to design experiments for preparing new materials or to predict the physico-chemical properties of unknown material systems.
(2) Molecular design: how to find the structure-activity relationships of molecules, in order to design new compounds with the desired biological activities or to predict the physico-chemical properties of unknown molecules.
(3) Industrial optimization: how to find optimized processing conditions, in order to achieve good results in industrial production.
2 Methods in MASTER
(1) Optimal map recognition (OMR)
The projection map with the best separability is selected according to its classification accuracy.
Fig.1 Comparison of OMR with PCA: (a) classification diagram using Optimal Map Recognition (OMR); (b) classification diagram using Principal Component Analysis (PCA)
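The selection rule behind OMR, keeping the map on which the classes are classified most accurately, can be sketched as follows. This is only an illustration under simplifying assumptions: the candidate maps are plain pairs of original features and separability is scored with a nearest-centroid classifier, neither of which is claimed to be what MASTER does. The names `centroid_accuracy` and `best_projection` and the toy data are hypothetical.

```python
from itertools import combinations

def centroid_accuracy(points, labels):
    """Fraction of 2-D points that lie closest to their own class centroid."""
    centroids = {}
    for c in set(labels):
        members = [p for p, l in zip(points, labels) if l == c]
        centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    hits = 0
    for p, l in zip(points, labels):
        nearest = min(
            centroids,
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
        )
        hits += nearest == l
    return hits / len(points)

def best_projection(X, labels):
    """Try every feature pair as a 2-D map and keep the most separable one."""
    best = None
    for i, j in combinations(range(len(X[0])), 2):
        acc = centroid_accuracy([(row[i], row[j]) for row in X], labels)
        if best is None or acc > best[0]:
            best = (acc, (i, j))
    return best

# Toy data (made up): features 0 and 2 separate the classes, feature 1 is noise.
X = [[0.1, 5.0, 0.2], [0.2, 1.0, 0.1], [0.0, 3.0, 0.3],
     [1.0, 4.0, 1.1], [0.9, 2.0, 0.9], [1.1, 0.5, 1.2]]
y = [0, 0, 0, 1, 1, 1]
accuracy, pair = best_projection(X, y)
```

On this toy set the map spanned by features 0 and 2 is selected, because it is the only pair on which every sample falls nearest to its own class centroid.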
(2) Hyper-polyhedron (HP)
The HP model is built so that the optimal zone is expressed by a series of inequalities that describe the boundary between the two types of samples.
Fig.2 Conceptual HP model
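As a rough illustration of an optimal zone described by inequalities, the sketch below uses the simplest possible hyper-polyhedron, an axis-aligned box around the "good" samples. This is an assumption made for illustration: the HP method in MASTER may well use oblique bounding hyperplanes, and `fit_box`, `in_zone` and the sample values are hypothetical.

```python
def fit_box(good_samples):
    """Return per-feature (low, high) bounds enclosing all 'good' samples."""
    return [(min(col), max(col)) for col in zip(*good_samples)]

def in_zone(x, bounds):
    """A point is in the zone if every inequality low <= x_k <= high holds."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(x, bounds))

# Made-up samples with two features each.
good = [[1.0, 10.0], [1.5, 12.0], [1.2, 11.0]]
bad = [[0.5, 11.0], [1.3, 15.0]]

bounds = fit_box(good)                       # one (low, high) pair per feature
flags = [in_zone(x, bounds) for x in bad]    # both 'bad' samples fall outside
candidate_ok = in_zone([1.1, 10.5], bounds)  # a new candidate inside the zone
```

Each `(low, high)` pair is a pair of inequalities, so the whole zone is a conjunction of inequalities that a new candidate either satisfies or violates.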
(3) Optimal projection regression (OPR)
OPR is a quantitative modeling method that fuses regression with Optimal Map Recognition (OMR). It uses the classification information of the data set to select the most appropriate features for regression.
Fig.3 Conceptual OPR model: projection from hyperspace to 2-dimensional space (axes X1, X2)
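The idea of letting class information guide feature selection before regression can be illustrated with a deliberately simple stand-in. It is not the OPR algorithm itself: here features are ranked by a Fisher-type separation score between "good" and "bad" samples, and a straight line is then fitted on the top-ranked feature. All names, the threshold of 7.0 and the data are assumptions.

```python
def fisher_score(values, labels):
    """Class separation of one feature: (m1 - m0)^2 / (var0 + var1)."""
    g0 = [v for v, l in zip(values, labels) if l == 0]
    g1 = [v for v, l in zip(values, labels) if l == 1]
    m0, m1 = sum(g0) / len(g0), sum(g1) / len(g1)
    v0 = sum((v - m0) ** 2 for v in g0) / len(g0)
    v1 = sum((v - m1) ** 2 for v in g1) / len(g1)
    return (m1 - m0) ** 2 / (v0 + v1 + 1e-12)

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

# Made-up data: the target depends on feature 0 only; feature 1 is noise.
X = [[1.0, 3.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0], [5.0, 3.5], [6.0, 1.5]]
target = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
labels = [1 if t > 7.0 else 0 for t in target]   # 'good' vs 'bad' samples

columns = list(zip(*X))
scores = [fisher_score(col, labels) for col in columns]
best = scores.index(max(scores))                 # feature chosen for regression
intercept, slope = fit_line(columns[best], target)
```

The noise feature receives a much lower separation score, so only the informative feature enters the regression.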
(4) Inverse projection
Fig.4 Projection from 2-dimensional space to hyperspace
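One simple way to picture inverse projection is to assume that the 2-dimensional map is a linear projection onto two orthonormal directions `w1` and `w2` about a centre `mean`. These names and the linear form are assumptions, not MASTER's documented procedure. A point picked on the map is then carried back to hyperspace as `mean + z1*w1 + z2*w2`. Many hyperspace points share the same map position, so this sketch returns only the one lying in the projection plane.

```python
def project(x, mean, w1, w2):
    """Map a hyperspace point to 2-D map coordinates (z1, z2)."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    return (sum(a * b for a, b in zip(d, w1)),
            sum(a * b for a, b in zip(d, w2)))

def inverse_project(z, mean, w1, w2):
    """Return the point in the projection plane whose map coordinates are z."""
    return [m + z[0] * a + z[1] * b for m, a, b in zip(mean, w1, w2)]

# Hypothetical 3-feature example with orthonormal directions.
mean = [1.0, 2.0, 3.0]
w1 = [1.0, 0.0, 0.0]
w2 = [0.0, 0.6, 0.8]

target = (0.5, 2.0)                          # point picked on the 2-D map
x_new = inverse_project(target, mean, w1, w2)
z_back = project(x_new, mean, w1, w2)        # projecting back recovers target
```

In practice the recovered point would still have to be checked against physical constraints on the features before being used as an experimental proposal.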
(5) Hierarchical projection model
Fig.5 Conceptual hierarchical projection
(6) Support Vector Machine (SVM)
Support Vector Classification (SVC):
Support Vector Regression (SVR):
[Figure labels: regression hyperplane, support vectors, support vector hypersurfaces, insensitive tube]
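Support Vector Regression is usually formulated with the ε-insensitive loss, which ignores residuals that fall inside a tube of half-width ε around the regression function and penalizes only the part of a residual that lies beyond it. A minimal sketch of that loss, with an illustrative function name and made-up values:

```python
def eps_insensitive_loss(y_true, y_pred, eps):
    """Standard SVR loss: max(0, |y - f(x)| - eps)."""
    return max(0.0, abs(y_true - y_pred) - eps)

inside = eps_insensitive_loss(10.0, 10.3, eps=0.5)    # residual inside the tube
outside = eps_insensitive_loss(10.0, 11.0, eps=0.5)   # residual beyond the tube
```

Samples whose residuals reach or exceed the tube boundary are the ones that become support vectors in the fitted model.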
3 Examples of Application
3.1 Applications in Materials Design
(1) Optimization of high-temperature superconductor
A nonlinear function based on 5 terms, with a PRESS value of [value missing], was obtained. By using inverse projection and the OPR method, the critical temperature was raised from 116 K to 121 K.
Inverse projection result for the high-temperature superconductor
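PRESS, the statistic quoted for the superconductor model, is the sum of squared prediction errors obtained when each sample is left out in turn and predicted from a model fitted to the rest. The sketch below shows the computation for a straight-line fit. The line model, the function names and the data are assumptions for illustration; the model in the talk is a 5-term nonlinear function.

```python
def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return my - b * mx, b

def press(x, y):
    """Sum of squared leave-one-out prediction errors."""
    total = 0.0
    for i in range(len(x)):
        # Refit without sample i, then predict the held-out sample.
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        total += (y[i] - (a + b * x[i])) ** 2
    return total

# Made-up data, not the superconductor data set.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y_exact = [3.0, 5.0, 7.0, 9.0, 11.0]    # lies exactly on y = 1 + 2x
y_noisy = [3.0, 5.0, 7.5, 9.0, 11.0]    # one perturbed observation

press_exact = press(x, y_exact)
press_noisy = press(x, y_noisy)
```

A lower PRESS indicates better predictive ability, which is why it is commonly used to compare candidate models fitted to the same data.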
(2) Composition design of rare-earth-containing phosphors
By extrapolation we obtained a series of new compositions lying outside the scope of the German patents. Our experimental work confirmed that the brightness of these newly designed phosphors was higher than that declared in the German patents.
Importance of features
Classification diagram using Fisher method
(3) Optimization of VPTC ceramic semiconductors
Using MASTER, new compositions and technological conditions were proposed for VPTC materials that gave much better results: the ratio of the electrical resistance at 273 K to the minimum resistance rose from 20 to 27.3.
Partial Least Squares (PLS) result for VPTC ceramic semiconductors
(4) Composition design of cathode materials for Ni/H batteries
Using Support Vector Machine (SVM), mathematical models with strong predictive ability were built, and new formulations were predicted and confirmed by experiments.
Calculated vs. experimental values of C400/C0
(5) Formation conditions for the amorphous phase of ternary fluorides
The inequalities obtained with the OMR method were used to predict whether a new ternary fluoride can form an amorphous phase. The predictions agreed with the experimental results.
OMR result for the formation conditions of the amorphous phase of ternary fluorides
(6) Formation conditions of ternary intermetallic compounds
Using 2400 known phase diagrams as the training set, regularities in the formation conditions of ternary intermetallic compounds were found. A series of newly discovered ternary intermetallic compounds were “predicted” in this way with good results.
OMR result for the formation conditions of ternary intermetallic compounds
3.2 Applications in Molecular Design
(1) Molecular screening of guanidine compounds
The Hyper-polyhedron (HP) and Support Vector Classification (SVC) methods were used for computer-aided molecular screening of guanidine compounds. HP and SVC gave better predictions than PCA, KNN, FDV and other methods.
(2) Structure-activity relationship of antagonists
SVC was used to investigate the structure-activity relationship (SAR) of 26 antagonist compounds. Leave-one-out cross-validation showed that the predictive ability of SVC was better than that of PCA, KNN, FDV and other methods.
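Leave-one-out cross-validation, the check used to compare the methods here, holds out each compound once, predicts it from the remaining ones and reports the fraction predicted correctly. The sketch below uses a 1-nearest-neighbour classifier as a stand-in because it needs no library. The classifier choice, the function names and the descriptors are assumptions, not the models or data from the study.

```python
def nearest_label(x, train_X, train_y):
    """1-nearest-neighbour prediction by squared Euclidean distance."""
    dists = [sum((a - b) ** 2 for a, b in zip(x, t)) for t in train_X]
    return train_y[dists.index(min(dists))]

def loo_accuracy(X, y):
    """Hold out each sample once, predict it from the rest, count the hits."""
    hits = 0
    for i in range(len(X)):
        pred = nearest_label(X[i], X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += pred == y[i]
    return hits / len(X)

# Made-up descriptors for six compounds (1 = active, 0 = inactive).
X = [[0.1, 0.2], [0.2, 0.1], [0.15, 0.9],
     [0.9, 0.8], [1.0, 1.0], [0.8, 0.95]]
y = [0, 0, 0, 1, 1, 1]
acc = loo_accuracy(X, y)
```

On this toy set five of the six compounds are predicted correctly; the third one sits closer to the active group once it is held out, which is exactly the kind of borderline case that fitting and testing on the same data would hide.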
(3) Molecular screening of triazole compounds
(1) An OMR model was used for the molecular screening of new triazole compounds likely to have higher antifungal activities.
(2) The predictions of SVC were better than those of PCA, KNN, FDV and other methods.
(4) Structure-property relationship of azo dyestuffs
Support Vector Regression (SVR) was employed to predict the maximum absorption wavelength of 37 azo dyestuff molecules. The mean relative error was 4.22% for the training set and 4.52% for the prediction set.
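The talk does not spell out how the mean relative error was computed. Assuming the common definition, the average of |predicted - observed| / |observed| expressed in percent, it can be calculated as in this sketch. The function name and the wavelengths are made up for illustration.

```python
def mean_relative_error(y_true, y_pred):
    """Average of |predicted - observed| / |observed|, in percent."""
    errors = [abs(p - t) / abs(t) for t, p in zip(y_true, y_pred)]
    return 100.0 * sum(errors) / len(errors)

# Made-up wavelengths in nm, not the azo dyestuff data set.
observed = [400.0, 500.0, 450.0]
predicted = [420.0, 490.0, 450.0]
mre = mean_relative_error(observed, predicted)   # (5% + 2% + 0%) / 3
```

Reporting the value separately for the training set and the prediction set, as done above, shows whether the model generalizes or merely fits the compounds it was trained on.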
3.3 Applications in Industrial Optimization
(1) Optimization of nitriding technique for crankshaft production
The surface hardness of crankshaft products at the Wuxi Diesel Engine factory was too low. An “optimal zone” was found in the multidimensional feature space. After optimization, the rejection rate decreased from 1.7% to 0.3%.
(2) Springback prediction in sheet metal forming
MASTER combined with FEA software (ANSYS/LS-DYNA 5.71) was used to predict the springback in V-type sheet steel forming. The relative error of the predicted springback could be kept within 10% of the experimental values.
4 Conclusion
(1) The MASTER software package is a comprehensive system comprising orthogonal design, statistical analysis, data visualization, pattern recognition, regression analysis, artificial neural networks (ANN), support vector machines (SVM), etc.
(2) MASTER can be used to optimize formulations and technological conditions, predict biological activities and physico-chemical properties, improve product quality, and analyze faults in production processes.
Thank you