Advisor: Prof. 黃三益. Students: M954020031 陳聖現, M954020033 王啟樵, M954020042 呂佳如.

Presentation transcript:


Background & Motivation (1/2)
We all like cars. Car sales in Taiwan in 2005 were the highest in ten years (IEK-ITIS), driven by heavy promotion: new and renewed models, favorable loans and installment payments, and free gifts.

Background & Motivation (2/2)
The price of daily goods, the price of gasoline, and the greenhouse effect all weigh on buyers. So what does a good-selling car look like?

Dataset and Data Mining Techniques
Dataset: the Car Evaluation Database from the UCI Machine Learning Repository.
Technique: classification with the ID3 learning algorithm.

Step One: Translate the Business Problem into a Data Mining Problem
Business problem: what kinds of cars get a good evaluation? We take evaluation as the target attribute.
Data mining problem: find the rules that predict evaluation from the other attributes.

Step Two: Select Appropriate Data (1/4)
What Is Available? The dataset comes from the UCI Machine Learning Repository at the University of California, Irvine, Donald Bren School of Information and Computer Sciences. It was donated by Marko Bohanec and Blaz Zupan.

Step Two: Select Appropriate Data (2/4)
How Much Data Is Enough? Data mining techniques discover useful knowledge from volumes of data, so the more data we use, the better the result we can usually get. Some scholars argue, however, that a great deal of data does not guarantee a better result than a small amount: resources are limited, and a larger sample means a heavier processing load and more exceptional cases during the mining task. The dataset our team chose has 1728 instances.

Step Two: Select Appropriate Data (3/4)
How Many Variables? The dataset consists of 1728 instances, and each record contains seven attributes: buying price, maintenance price, number of doors, passenger capacity, size of the luggage boot, estimated safety, and car acceptability. Car acceptability is the class label, indicating the degree to which customers accept the car; the other six attributes are the predictive variables.

Step Two: Select Appropriate Data (4/4)
What Must the Data Contain?

Attribute name | Description                  | Domain
b_price        | Buying price                 | v-high / high / med / low
m_price        | Maintenance price            | v-high / high / med / low
door           | Number of doors              | 2 / 3 / 4 / 5-more
person         | Passenger capacity           | 2 / 4 / more
size           | Luggage boot capacity        | small / med / big
safety         | Safety evaluation            | low / med / high
class          | Level of customer acceptance | unacc / acc / good / vgood
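To make the table above concrete, here is a minimal Weka sketch, not part of the original slides, that loads the data and prints each attribute's domain. The file name car.arff and the attribute names from the table are assumptions about how the team prepared the data.

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class InspectCar {
    public static void main(String[] args) throws Exception {
        // Assumed file name: the UCI car data converted to ARFF as "car.arff"
        Instances data = DataSource.read("car.arff");
        data.setClassIndex(data.numAttributes() - 1); // class is the last attribute
        for (int i = 0; i < data.numAttributes(); i++) {
            // Attribute.toString() prints the name and the nominal domain
            System.out.println(data.attribute(i));
        }
        System.out.println(data.numInstances() + " instances"); // expect 1728
    }
}
```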

Step Three: Get to Know the Data (1/4)
Examine Distributions. [Charts of the value distributions of the six predictive attributes, grouped as Car Price (b_price, m_price), Comfort (door, person, size), and Safety (safety).]
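The distribution charts themselves did not survive transcription, but the counts they plotted are easy to recompute. A hedged sketch using Weka's AttributeStats, under the same car.arff assumption as above:

```java
import weka.core.AttributeStats;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Distributions {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("car.arff"); // assumed file name
        // For every (nominal) attribute, print how often each value occurs
        for (int i = 0; i < data.numAttributes(); i++) {
            AttributeStats stats = data.attributeStats(i);
            StringBuilder line = new StringBuilder(data.attribute(i).name() + ":");
            for (int j = 0; j < data.attribute(i).numValues(); j++) {
                line.append(" ").append(data.attribute(i).value(j))
                    .append("=").append(stats.nominalCounts[j]);
            }
            System.out.println(line);
        }
    }
}
```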

Step Three: Get to Know the Data (2/4)
Compare Values with Descriptions

Step Three: Get to Know the Data (3/4)
Validate Assumptions. For most of the six attributes, even the worst category is still acceptable to some customers; for example, even when the luggage boot (size) is small, an instance can still be classified as good. Two attributes, however, are special: person and safety. Whenever person is 2, the class is always unacc, and whenever safety is low, the class is likewise always unacc. We may therefore suppose that these two attributes are very important to customers choosing a car.
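This supposition is easy to verify programmatically. A sketch, assuming the attribute names from Step Two, that tallies the class labels of all two-seater instances; if the slide's observation holds, every count except unacc is zero (the same check works for safety = low):

```java
import weka.core.Attribute;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CheckAssumptions {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("car.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);
        Attribute person = data.attribute("person");  // assumed attribute name
        Attribute cls = data.classAttribute();
        int[] counts = new int[cls.numValues()];
        for (int i = 0; i < data.numInstances(); i++) {
            Instance inst = data.instance(i);
            if (inst.stringValue(person).equals("2")) {
                counts[(int) inst.classValue()]++; // tally class of two-seaters
            }
        }
        for (int j = 0; j < cls.numValues(); j++) {
            System.out.println(cls.value(j) + ": " + counts[j]);
        }
    }
}
```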

Step Three: Get to Know the Data (4/4)
Ask Lots of Questions. From the above we know these two attributes matter to customers, who do not compromise on them. The reason might be that customers find a car with only two seats not functional enough, and that they pay close attention to safety. After all, the value of life is beyond the value of money.

Step Four: Create a Model Set
Creating a Model Set for Prediction. We separated the dataset into two parts: a training set used to build the prediction model, and a test set used to measure the model's accuracy. We used cross-validation, so every instance in the dataset serves in both the training set and the test set across the folds, as in the sketch below.
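As an illustration of how cross-validation builds the model set, a sketch using Weka's trainCV/testCV; the fold count of 10 matches Step Seven, and car.arff is again an assumption:

```java
import java.util.Random;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ModelSet {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("car.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);
        Instances randomized = new Instances(data);   // copy before shuffling
        randomized.randomize(new Random(1));
        int folds = 10;
        for (int f = 0; f < folds; f++) {
            Instances train = randomized.trainCV(folds, f); // ~9/10 for training
            Instances test  = randomized.testCV(folds, f);  // ~1/10 for testing
            // Every instance lands in the test set of exactly one fold,
            // so all 1728 records are used for both training and testing.
            System.out.println("fold " + f + ": train=" + train.numInstances()
                    + " test=" + test.numInstances());
        }
    }
}
```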

Step Five: Fix Problems with the Data
- Categorical variables with too many values
- Numeric variables with skewed distributions and outliers
- Missing values
- Values with meanings that change over time
- Inconsistent data encoding

Step Six: Transform Data to Bring Information to the Surface
- Capture trends
- Create ratios and other combinations of variables
- Convert counts to proportions

Step Seven: Build Models
The data mining method we used to build the model is classification. We chose weka.classifiers.trees.Id3 as our classifier, since it showed the best result, and evaluated it with 10-fold cross-validation; a sketch follows.
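A sketch of the whole modeling step as described on the slide: build an Id3 tree and score it with 10-fold cross-validation. One caveat: in recent Weka releases Id3 is not in the core distribution but in the optional simpleEducationalLearningSchemes package, so the import below assumes an older Weka.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.Id3;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BuildModel {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("car.arff"); // assumed file name
        data.setClassIndex(data.numAttributes() - 1);
        Id3 tree = new Id3(); // all attributes are nominal, as Id3 requires
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(tree, data, 10, new Random(1)); // 10-fold CV
        System.out.println(eval.toSummaryString());       // Step Eight (1/3)
        System.out.println(eval.toClassDetailsString());  // Step Eight (2/3)
        System.out.println(eval.toMatrixString());        // Step Eight (3/3)
        tree.buildClassifier(data); // final tree on the full data, for the rules
        System.out.println(tree);
    }
}
```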

Step Eight: Assess Models (1/3)
=== Summary ===
Correctly Classified Instances            %
Incorrectly Classified Instances          %
Kappa statistic
Mean absolute error
Root mean squared error
Relative absolute error                   %
Root relative squared error               %
UnClassified Instances                    %
Total Number of Instances          1728

Step Eight: Assess Models (2/3)
=== Detailed Accuracy By Class ===
TP Rate   FP Rate   Precision   Recall   F-Measure   Class
                                                     unacc
                                                     acc
                                                     vgood
                                                     good

Step Eight: Assess Models (3/3)
=== Confusion Matrix ===
   a    b    c    d   <-- classified as
                      |  a = unacc
                      |  b = acc
                      |  c = vgood
                      |  d = good

Step Nine: Deploy Models
Because we do not have a scoring set on which to apply the model, we skip this step.

Step Ten: Assess Results
Although 61 instances were classified wrongly, the confusion matrix shows that most of them were placed in a class adjacent to their actual class; 44 instances fell into the class right next to their true one. Overall the model performs quite well on all the evaluation measures, and the misclassifications are mostly mild, so we believe the result is reliable.

Conclusions (1/2)
Many rules can be read off the decision tree, so we discuss a selection of them. As mentioned above, safety is very important: if safety is low, the instance falls directly into unacceptable (unacc). And whatever the value of safety, if person is 2 the instance likewise falls directly into unacceptable.
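These two rules can be written down directly from the tree. A sketch in Java encoding only the two rules stated above; the rest of the tree is not reproduced here and would depend on the remaining attributes:

```java
// Hypothetical helper, not from the original slides: the two dominant rules.
public class DominantRules {
    static String evaluate(String safety, String person) {
        if (safety.equals("low")) return "unacc"; // low safety is never accepted
        if (person.equals("2"))   return "unacc"; // two-seaters are never accepted
        return "depends on the remaining attributes";
    }

    public static void main(String[] args) {
        System.out.println(evaluate("low", "4"));  // unacc
        System.out.println(evaluate("high", "2")); // unacc
        System.out.println(evaluate("high", "4")); // undetermined by these rules
    }
}
```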

Conclusions (2/2)
Among the six attributes, customers care least about door; in most cases this attribute has little effect on acceptance. Perhaps because cars are high-priced products, customers will not give a good or v-good evaluation to a car with just a single outstanding attribute. As a result, the conditions that lead to good and v-good evaluations are numerous and not easily met. The following are the rules under which customers give good or v-good evaluations.