Published by Augustus Glenn. Modified over 9 years ago.
1
Knowledge Discovery in Databases
MIS 637
Professor Mahmoud Daneshmand
Fall 2012
Final Project: Red Wine Recipe Data Mining
By Jorge Madrazo
2
Profound Questions
What basic properties make up the formula for a good wine?
– Wine making is believed to be an art. But is there a formula for a quality wine?
– The providers of the data set published a paper, “Modeling wine preferences by Data Mining”. How do my results compare with the paper’s?
3
Procedure
Follow a data mining process
Use SAS and SAS Enterprise Miner to execute the process
SAS Enterprise Miner is modeled on SEMMA, the data mining process defined by the SAS Institute
– Sample, Explore, Modify, Model, Assess
SEMMA is similar to the CRISP-DM process
4
Sample
1,599 records
Set up a data partition
– Training 40%
– Validation 30%
– Test 30%
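The 40/30/30 partition can be illustrated outside Enterprise Miner with a short Python sketch. This is not the Data Partition node itself: the function name `partition`, the fixed seed, and the choice to give the rounding remainder to the test set are all assumptions made for illustration, and Enterprise Miner may stratify or round differently.

```python
import random

def partition(n_records, fractions=(0.4, 0.3, 0.3), seed=0):
    """Shuffle record indices and cut them into train/validation/test.

    Hypothetical helper: simple random partition, no stratification.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    n_train = round(n_records * fractions[0])
    n_valid = round(n_records * fractions[1])
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]  # whatever remains after rounding
    return train, valid, test

train, valid, test = partition(1599)
print(len(train), len(valid), len(test))
```

Fixing the seed keeps the three sets reproducible across runs, which matters when two models are later compared on the same validation data.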
5
Explore: Data Background
Data source
– UCI Machine Learning Repository. Wine Quality Data Set.
– There is a red wine data set and a white wine data set. I focused on the red wine set only.
– There are 11 input variables and one target variable.
» fixed acidity
» volatile acidity
» citric acid
» residual sugar
» chlorides
» free sulfur dioxide
» total sulfur dioxide
» density
» pH
» sulphates
» alcohol
» Output variable (based on sensory data): quality (score between 0 and 10)
6
Explore: Target=Quality
Quality
– People rated the quality of different wines on a scale of 0-10. The ratings actually observed range from 3 to 8.
– An ordinal target
7
Explore: Inputs
Correlation Analysis
– Some correlation, but not enough to discard inputs

ods graphics on;
ods select MatrixPlot;
/* Scatter-plot matrix with histograms on the diagonal */
proc corr data=wino.red plots(maxpoints=100000)=matrix(histogram nvar=all);
  var quality alcohol ph fixed_acidity density volatile_acidity sulphates citric_acid;
run;
8
Explore: Correlation Graphs
9
Explore: Chi-Square Statistics of Inputs
10
Explore: Worth of Inputs
11
Explore: Worth Graph
The worth tracks closely with the chi-square statistic
12
Modify
At this stage, no modifications are made
13
Model: Selection
Because I want to list the important elements in what is considered a quality wine, I choose a Decision Tree
Configuration
– The Splitting Rule is Entropy
– Maximum Branch is set to 5
Therefore a C4.5-type algorithm is being implemented
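To make the entropy splitting rule concrete, here is a minimal Python sketch of entropy and information gain for one candidate split. The class counts are hypothetical, not taken from the wine data, and the sketch shows plain information gain; C4.5 proper normalizes this by the split information (gain ratio), and how Enterprise Miner scores splits internally is not shown in the slides.

```python
from math import log2

def entropy(counts):
    """Shannon entropy (in bits) of a node given its class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

def information_gain(parent_counts, branches):
    """Parent entropy minus the size-weighted entropy of the branches."""
    total = sum(parent_counts)
    weighted = sum(sum(b) / total * entropy(b) for b in branches)
    return entropy(parent_counts) - weighted

# Hypothetical node: 8 wines of one class and 8 of another,
# split into two branches by some cutoff on an input.
parent = [8, 8]
branches = [[6, 2], [2, 6]]
gain = information_gain(parent, branches)
print(round(entropy(parent), 4), round(gain, 4))
```

A splitting rule of this kind evaluates every candidate split of a node and keeps the one with the largest gain; allowing up to 5 branches per split is one reason the resulting tree can become bushy.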
14
Assess: Initial Results
A bushy tree. The resulting tree is too intricate for a simple recommendation.
– Over 20 leaf nodes.
15
Modify: Target
Change the target to a binary variable. The new variable in the model is called isGood.
Any rating over 6 (that is, 7 or higher) is categorized as isGood.
– SAS Code:

/* Derive the binary target isgood from quality */
data wino.xx;
  set wino.red;
  if (quality > 6) then isgood = 1;
  else isgood = 0;
run;

/* List the new data set to check the derived column */
proc print data=wino.xx;
  title 'xx';
run;
16
Explore: Target = isGood
17
Model Strategy for isGood
Model with a Decision Tree in the hope of more descriptive results.
Also model with a Neural Network to aid assessment and provide a comparison.
18
Model: Decision Tree
ProbF splitting criterion at Significance Level 0.2
Maximum Branch size = 5
19
Assess: Decision Tree Results
Much simpler tree
20
Assess: Decision Tree Results 2
Leaf Statistics
21
Assess: Variable Importance

Variable Name         Splitting Rules  Surrogate Rules  Importance   Validation Importance  Ratio of Validation to Training Importance
alcohol               1                0                1            1                      1
density               0                1                0.77055175                          1
volatile_acidity      0                1                0.728868987                         1
sulphates             1                0                0.671675628  0.477710505            0.711222032
fixed_acidity         0                1                0.553719729  0.393817671            0.711222032
citric_acid           0                1                0.549750361  0.390994569            0.711222032
free_sulfur_dioxide   0                0                0            0                      NaN
pH                    0                0                0            0                      NaN
chlorides             0                0                0            0                      NaN
total_sulfur_dioxide  0                0                0            0                      NaN
residual_sugar        0                0                0            0                      NaN

Event Classification Table (Target=isgood)

Data Role  False Negative  True Negative  False Positive  True Positive
TRAIN      53              539            14              34
VALIDATE   43              403            12              21
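The Event Classification Table can be turned into the usual rates with a few lines of Python. This sketch assumes the run-together digits in the table split as FN=53, TN=539, FP=14, TP=34 for TRAIN and FN=43, TN=403, FP=12, TP=21 for VALIDATE; the helper name `rates` is hypothetical.

```python
def rates(fn, tn, fp, tp):
    """Summary rates for one row of a binary event classification table."""
    total = fn + tn + fp + tp
    return {
        "n": total,
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # share of good wines that were found
        "specificity": tn / (tn + fp),   # share of other wines correctly rejected
        "precision": tp / (tp + fp),     # share of "good" predictions that were right
    }

# Counts as read from the table above (assumed digit split).
train = rates(fn=53, tn=539, fp=14, tp=34)
validate = rates(fn=43, tn=403, fp=12, tp=21)

for name, r in (("TRAIN", train), ("VALIDATE", validate)):
    print(name, {k: round(v, 3) for k, v in r.items()})
```

Under this reading the tree is accurate overall on validation (about 0.885) mainly because most wines are not isGood; its sensitivity is much lower (about 0.33), so it misses roughly two out of three good wines.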
22
Model: Neural Network
Positive – better at predicting
Negative – hard to interpret the model
Configured with 3 Hidden Nodes
23
Modify: Input Variables to NN
Because of the complexity of the NN, it is recommended to prune the input variables before running the network.
24
Modify: R² Filter

Variable Name         Role      Measurement Level  Reasons for Rejection
alcohol               INPUT     INTERVAL
chlorides             INPUT     INTERVAL
citric_acid           REJECTED  INTERVAL           Varsel:Small R-square value
density               INPUT     INTERVAL
fixed_acidity         INPUT     INTERVAL
free_sulfur_dioxide   INPUT     INTERVAL
pH                    REJECTED  INTERVAL           Varsel:Small R-square value
residual_sugar        REJECTED  INTERVAL           Varsel:Small R-square value
sulphates             INPUT     INTERVAL
total_sulfur_dioxide  REJECTED  INTERVAL           Varsel:Small R-square value
volatile_acidity      INPUT     INTERVAL
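The idea behind the filter can be sketched in Python as a screen on the squared correlation between each input and the target. This is a simplification, not the Variable Selection node's actual procedure: the cutoff `min_r2=0.005`, the helper names, and the toy values are all assumptions for illustration.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between one input and the target."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no information
    return sxy * sxy / (sxx * syy)

def filter_inputs(columns, target, min_r2=0.005):
    """Assign INPUT or REJECTED to each column (hypothetical cutoff)."""
    return {
        name: "INPUT" if r_squared(values, target) >= min_r2 else "REJECTED"
        for name, values in columns.items()
    }

# Hypothetical toy values, not rows from the wine data set.
isgood = [0, 0, 0, 1, 1, 1]
columns = {
    "alcohol": [9.4, 9.8, 10.0, 11.5, 12.0, 12.8],
    "pH": [3.3, 3.5, 3.4, 3.4, 3.5, 3.3],
}
roles = filter_inputs(columns, isgood)
print(roles)
```

In the toy data alcohol rises with isgood and is kept, while pH is unrelated to it and is rejected, which mirrors the kind of decision shown in the table.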
25
Model: NN
Specify 3 Hidden Units in the Hidden Layer
26
Assess: NN Results
Hard to interpret results to formulate a recipe

The NEURAL Procedure
Optimization Results
Parameter Estimates

N  Parameter                 Estimate    Gradient Objective Function
1  alcohol_H11                3.679818   -0.001411
2  chlorides_H11              0.520190   -0.000479
3  density_H11               -2.171623    0.000883
4  fixed_acidity_H11         -0.055929    0.000179
5  free_sulfur_dioxide_H11    0.403412    0.000139
6  sulphates_H11             -4.954290   -0.000224
7  volatile_acidity_H11       2.686209    0.000205
8  alcohol_H12               -0.313005    0.001209
9  chlorides_H12              0.200973    0.000759
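A small Python sketch shows why these estimates do not read as a recipe. It evaluates one hidden unit using the seven H11 weights listed in the output. Several things here are assumptions, because the slide does not show them: the tanh activation, the bias of 0.0, and inputs expressed as standardized values (0 meaning the average wine).

```python
from math import tanh

# Input-to-hidden weights for unit H11, copied from the parameter estimates.
H11_WEIGHTS = {
    "alcohol": 3.679818,
    "chlorides": 0.520190,
    "density": -2.171623,
    "fixed_acidity": -0.055929,
    "free_sulfur_dioxide": 0.403412,
    "sulphates": -4.954290,
    "volatile_acidity": 2.686209,
}

def hidden_unit(inputs, weights, bias=0.0):
    """Activation of one hidden unit (tanh and bias=0.0 are assumed)."""
    z = bias + sum(weights[name] * inputs[name] for name in weights)
    return tanh(z)

# Hypothetical standardized wines: all inputs at their mean ...
average = {name: 0.0 for name in H11_WEIGHTS}
# ... and the same wine with alcohol one standard deviation higher.
stronger = dict(average, alcohol=1.0)

print(hidden_unit(average, H11_WEIGHTS), hidden_unit(stronger, H11_WEIGHTS))
```

Each weight only describes an input's contribution to one hidden unit before the nonlinearity. The prediction combines three such units through further weights, so no single number says "more alcohol makes a better wine" the way a tree split does.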
27
Assess: Comparative Results
Receiver Operating Characteristic (ROC) chart for NN vs Decision Tree
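For readers unfamiliar with the chart, the following Python sketch shows how ROC points and the area under the curve are obtained from predicted probabilities. The labels and scores are hypothetical toy values, not output from either model, and the sketch assumes all scores are distinct (tied scores would need to be grouped).

```python
def roc_points(labels, scores):
    """(false positive rate, true positive rate) as the cutoff is lowered."""
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Hypothetical toy data: 1 = isGood, score = predicted probability of isGood.
labels = [1, 1, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
points = roc_points(labels, scores)
print(points[-1], round(auc(points), 4))
```

A curve that bows further toward the top-left corner (larger area) ranks good wines above the rest more reliably, which is the basis for comparing the two models on one chart.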
28
Assess: Comparative Results
Cumulative Lift for NN vs Decision Tree
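Cumulative lift at a given depth is the rate of good wines among the top-scored cases divided by the overall rate. The Python sketch below uses hypothetical toy labels and scores, not output from either model; the function name and the rounding of the depth to a whole number of cases are assumptions.

```python
def cumulative_lift(labels, scores, depth):
    """Lift among the top `depth` fraction of cases, ranked by score."""
    ranked = [label for _, label in sorted(zip(scores, labels), reverse=True)]
    k = max(1, round(len(ranked) * depth))   # number of cases in the top slice
    top_rate = sum(ranked[:k]) / k
    base_rate = sum(ranked) / len(ranked)
    return top_rate / base_rate

# Hypothetical toy data: 10 wines, 3 of them isGood.
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]

for depth in (0.2, 0.5, 1.0):
    print(depth, round(cumulative_lift(labels, scores, depth), 3))
```

Lift always falls to 1 at full depth, so the comparison between models is about how high the curve starts and how slowly it decays.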
29
Assess: Comparison with Reference Paper
The paper used R-Miner
Support Vector Machine (SVM) and Neural Network models were used
The authors applied techniques to extract the relative importance of variables
The authors attempted to predict every quality level
They noted the importance of alcohol and sulphates. “An increase in sulphates might be related to the fermenting nutrition, which is very important to improve the wine aroma.”
30
Assess: Paper Variable Importance
31
Overall Project in SAS EM
32
References
UCI Machine Learning Repository. http://archive.ics.uci.edu/ml/datasets/Wine
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553, 2009. http://www3.dsi.uminho.pt/pcortez/wine5.pdf