Knowledge Discovery in Databases MIS 637 Professor Mahmoud Daneshmand Fall 2012 Final Project: Red Wine Recipe Data Mining By Jorge Madrazo.

Profound Questions
What basic properties make up the formula for a good wine?
– Wine making is believed to be an art. But is there a formula for a quality wine?
– The provider of the data set submitted a paper on “Modeling wine preferences by Data Mining”. How do my results compare with the paper’s?

Procedure
– Follow a data mining process.
– Use SAS and SAS Enterprise Miner to execute the process.
– The SAS Enterprise Miner tool is modeled on SEMMA, the data mining process defined by SAS Institute: Sample, Explore, Modify, Model, Assess.
– SEMMA is similar to the CRISP-DM process.

Sample
– 1,599 records
– Set up a data partition: Training 40%, Validation 30%, Test 30%
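
Outside of the Enterprise Miner Data Partition node, a 40/30/30 split could be approximated in a plain DATA step. This is only a sketch: the seed value and the output data set names (train, validate, test) are my own assumptions and do not come from the project, and the resulting proportions are approximate rather than exact.

```sas
/* Sketch of a 40/30/30 random partition of the red wine table.     */
/* Assumptions: the seed 637 and the data set names work.train,     */
/* work.validate and work.test are illustrative only.               */
data work.train work.validate work.test;
    set wino.red;
    if _n_ = 1 then call streaminit(637);   /* fix the random stream once */
    u = rand('uniform');                    /* uniform draw in (0, 1)     */
    if u < 0.40 then output work.train;          /* about 40% of rows */
    else if u < 0.70 then output work.validate;  /* about 30% of rows */
    else output work.test;                       /* remaining rows    */
    drop u;
run;
```

Because each row is assigned independently, the three data sets will only be close to 40%, 30% and 30% of the 1,599 records.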

Explore: Data Background
Data source
– UCI Machine Learning Repository, Wine Quality Data Set.
– There are red and white wine data sets. I focused on the red wine set only.
– There are 11 input variables and one target variable.
  » fixed acidity
  » volatile acidity
  » citric acid
  » residual sugar
  » chlorides
  » free sulfur dioxide
  » total sulfur dioxide
  » density
  » pH
  » sulphates
  » alcohol
  » Output variable (based on sensory data): quality (score between 0 and 10)

Explore: Target=Quality
Quality
– People gave a quality assessment of different wines on a scale of 0 to 10. The actual range in the data is 3 to 8.
– An ordinal target
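
One simple way to confirm the actual range and distribution of the target is a frequency table. This is a minimal sketch and assumes the red wine table is available as wino.red with a numeric column named quality, as in the other SAS code in this deck.

```sas
/* Sketch: count how many wines received each quality score. */
proc freq data=wino.red;
    tables quality;
run;
```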

Explore: Inputs
Correlation Analysis
– Some correlation, but not enough to discard inputs

    ods graphics on;
    ods select MatrixPlot;
    proc corr data=wino.red PLOTS(MAXPOINTS= ) plots=matrix(histogram nvar=all);
        var quality alcohol ph fixed_acidity density volatile_acidity sulphates citric_acid;
    run;

Explore: Correlation Graphs

Explore: Chi-Square Statistics of Inputs

Explore: Worth of Inputs

Explore: Worth Graph
The Worth tracks closely with the Chi-Square statistic.

Modify At this stage, no modifications are done

Model: Selection
Because I want to list the important elements in what is considered a quality wine, I choose a Decision Tree.
Configuration
– The Splitting Rule is Entropy
– Maximum Branch is set to 5
Therefore a C4.5 type of algorithm is being implemented.
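
For reference, an entropy splitting rule is usually understood as scoring each candidate split by how much it reduces the entropy of the target. The notation below is my own and is not taken from the Enterprise Miner documentation: S is the set of records at a node, p_k is the share of records in S with quality class k, and S_1, ..., S_m are the branches of the split (here m is at most 5).

```latex
H(S) = -\sum_{k} p_k \log_2 p_k
\qquad
\mathrm{Gain}(S) = H(S) - \sum_{j=1}^{m} \frac{|S_j|}{|S|}\, H(S_j),
\quad m \le 5
```

A split with a larger gain leaves purer branches, so it is preferred.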

Assess: Initial Results
A bushy tree. The resulting tree is too intricate for a simple recommendation.
– Over 20 leaf nodes.

Modify: Target
Change the target so that it becomes binary. The new variable in the model is called isGood. Any rating over 6 is categorized as isGood.
– SAS Code:

    data wino.xx;
        set wino.red;
        if (quality > 6) then isgood = 1;
        else isgood = 0;
    run;

    proc print data=wino.xx;
        title 'xx';
    run;

Explore: Target = isGood
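
A quick check of the new target is to cross-tabulate it against the original score. This is a sketch that assumes the wino.xx data set holds both quality and isgood; every score of 7 or 8 should fall under isgood=1 and every lower score under isgood=0.

```sas
/* Sketch: verify the recoding and look at the class balance of isgood. */
proc freq data=wino.xx;
    tables quality*isgood / norow nocol nopercent;  /* counts only */
    tables isgood;                                   /* share of good wines */
run;
```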

Model Strategy for isGood
– Model with a Decision Tree in the hope of more descriptive results.
– Also model with a Neural Network to aid in assessment and for comparison.

Model: Decision Tree
– ProbF splitting criterion at Significance Level 0.2
– Maximum Branch size = 5

Assess: Decision Tree Results Much simpler Tree

Assess: Decision Tree Results 2 Leaf Statistics

Assess: Variable Importance

Variable Name         Label  Number of Splitting Rules  Number of Surrogate Rules  Importance  Validation Importance  Ratio of Validation to Training Importance
alcohol                      1                          0                          1           1                      1
density
volatile_acidity
sulphates
fixed_acidity
citric_acid
free_sulfur_dioxide          0                          0                          0           0                      NaN
pH                           0                          0                          0           0                      NaN
chlorides                    0                          0                          0           0                      NaN
total_sulfur_dioxide         0                          0                          0           0                      NaN
residual_sugar               0                          0                          0           0                      NaN

Event Classification Table

Data Role  Target  False Negative  True Negative  False Positive  True Positive
TRAIN      isgood
VALIDATE   isgood
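
The four counts in an event classification table are commonly summarized with the standard measures below. TP, TN, FP and FN stand for the True Positive, True Negative, False Positive and False Negative counts for a given data role; these are textbook definitions rather than figures reported in this project.

```latex
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\qquad
\mathrm{Sensitivity} = \frac{TP}{TP + FN}
\qquad
\mathrm{Specificity} = \frac{TN}{TN + FP}
```

With a binary target like isgood, where good wines are the minority class, sensitivity is often more informative than accuracy alone.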

Model: Neural Network
– Positive: better at predicting
– Negative: hard to interpret the model
– Configured with 3 Hidden Nodes

Modify: Input Variables to NN Because of the complexity of the NN, it is recommended to prune variables prior to running the network.

Modify: R² Filter

Variable Name         Role      Measurement Level  Reasons for Rejection
alcohol               INPUT     INTERVAL
chlorides             INPUT     INTERVAL
citric_acid           REJECTED  INTERVAL           Varsel:Small R-square value
density               INPUT     INTERVAL
fixed_acidity         INPUT     INTERVAL
free_sulfur_dioxide   INPUT     INTERVAL
pH                    REJECTED  INTERVAL           Varsel:Small R-square value
residual_sugar        REJECTED  INTERVAL           Varsel:Small R-square value
sulphates             INPUT     INTERVAL
total_sulfur_dioxide  REJECTED  INTERVAL           Varsel:Small R-square value
volatile_acidity      INPUT     INTERVAL
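
A rough analogue of an R-square based filter can be run outside Enterprise Miner with forward selection in PROC REG. This is only a sketch and is not the algorithm used by the Variable Selection node. It assumes the column names match those in the table above, and the entry level of 0.05 is my own illustrative choice.

```sas
/* Sketch: forward selection of inputs for isgood.                    */
/* Assumptions: column names as listed in the filter table;           */
/* slentry=0.05 is illustrative, not a setting from the project.      */
proc reg data=wino.xx;
    model isgood = alcohol chlorides citric_acid density fixed_acidity
                   free_sulfur_dioxide ph residual_sugar sulphates
                   total_sulfur_dioxide volatile_acidity
          / selection=forward slentry=0.05;
run;
quit;
```

The selection summary reports the partial R-square added at each step, which gives a feel for which inputs contribute little.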

Model: NN Specify 3 Hidden Units in the Hidden Layer
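
A network with one hidden layer of 3 units can be written as below. The notation is mine, and the activation functions (tanh in the hidden layer, logistic at the output) are an assumption about the configuration, not something stated in the project. Here x_i are the inputs that survive the filter, w_{ij} are input-to-hidden weights, and v_j are hidden-to-output weights.

```latex
h_j = \tanh\!\Big(b_j + \sum_{i} w_{ij}\, x_i\Big), \quad j = 1, 2, 3
\qquad
P(\mathrm{isGood} = 1 \mid x) = \frac{1}{1 + \exp\!\big(-(b_0 + \sum_{j=1}^{3} v_j h_j)\big)}
```

Every input feeds every hidden unit, so no single weight maps cleanly to the effect of one ingredient. That is why the model is hard to read as a recipe.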

Assess: NN Results
Hard to interpret the results to formulate a recipe.

The NEURAL Procedure
Optimization Results
Parameter Estimates

N  Parameter               Estimate  Gradient Objective Function
1  alcohol_H
   chlorides_H
   density_H
   fixed_acidity_H
   free_sulfur_dioxide_H
   sulphates_H
   volatile_acidity_H
   alcohol_H
   chlorides_H

Assess: Comparative Results Receiver Operating Characteristics (ROC) Chart for NN vs Decision Tree

Assess: Comparative Results Cumulative Lift for NN vs Decision Tree
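
For reading a cumulative lift chart, the usual definition is as follows. The notation is mine: y_i is the observed isGood value of record i, n is the number of records, and T_d is the set of records in the top d percent when sorted by predicted probability.

```latex
\mathrm{Lift}(d) =
\frac{\dfrac{1}{|T_d|} \sum_{i \in T_d} y_i}
     {\dfrac{1}{n} \sum_{i=1}^{n} y_i}
```

A lift of 1 means the model does no better than picking wines at random. The curve always returns to 1 at a depth of 100 percent.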

Assess: Comparison with Reference Paper
– Used R-Miner
– Used Support Vector Machine (SVM) and Neural Network models
– He applied techniques to extract the relative importance of variables
– He attempted to predict every quality level
– He noted the importance of alcohol and sulphates: “An increase in sulphates might be related to the fermenting nutrition, which is very important to improve the wine aroma.”

Assess: Paper Variable Importance

Overall Project in SAS EM

References
– UCI Machine Learning Repository
– P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4).