Final Project: Mining Mushroom World

Agenda
- Motivation and Background
- Determine the Data Set (2)
- 10 DM Methodology Steps (19)
- Conclusion

Motivation and Background
- To distinguish edible mushrooms from poisonous ones by their appearance
- To know whether a mushroom is safe to eat, so we can survive in the wild
- To survive outside the computer world

Determine the Data Set (1/2)
- Source of data: UCI Machine Learning Repository, Mushroom Database, drawn from the Audubon Society Field Guide
- Documentation: complete, but missing statistical information
- Instances are described in terms of physical characteristics
- Classification: poisonous or edible
- All attributes are nominal-valued
- Large database: 8,124 instances (2,480 missing values for attribute #11, stalk-root)
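
As a quick way to work with the same data outside Weka, here is a minimal Python/pandas loading sketch; the URL is the standard UCI mirror and the column names follow the UCI documentation, so both are assumptions layered on top of the slides:

```python
import pandas as pd

# Standard UCI location of the mushroom data (assumed still available).
URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "mushroom/agaricus-lepiota.data")

# Attribute names per the UCI documentation; the class label comes first.
COLUMNS = [
    "class", "cap-shape", "cap-surface", "cap-color", "bruises", "odor",
    "gill-attachment", "gill-spacing", "gill-size", "gill-color",
    "stalk-shape", "stalk-root", "stalk-surface-above-ring",
    "stalk-surface-below-ring", "stalk-color-above-ring",
    "stalk-color-below-ring", "veil-type", "veil-color", "ring-number",
    "ring-type", "spore-print-color", "population", "habitat",
]

df = pd.read_csv(URL, header=None, names=COLUMNS)
print(df.shape)  # expected (8124, 23): 8124 instances, class + 22 attributes
```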

Determine the Data Set (2/2)
1. Past usage
- Schlimmer, J. S. (1987). Concept Acquisition Through Representational Adjustment (Technical Report 87-19).
- Iba, W., Wogulis, J., & Langley, P. (1988). ICML.
2. No other comparable mushroom data set

10 DM Methodology Steps
Step 1. Translate the Business Problem into a Data Mining Problem
a. Data mining goal: separate edible mushrooms from poisonous ones
b. How will the results be used: to increase the survival rate
c. How will the results be delivered: as Decision Tree, Naïve Bayes, Ripper, and NeuralNet models

10 DM Methodology Steps
Step 2. Select Appropriate Data
a. Data source
- The Audubon Society Field Guide to North American Mushrooms (1981). G. H. Lincoff (Pres.), New York: Alfred A. Knopf
- Jeff Schlimmer donated the data on April 27th, 1987
b. Volumes of data
- 8,124 instances in total
- 4,208 (51.8%) edible; 3,916 (48.2%) poisonous
- 2,480 (30.5%) missing values in attribute "stalk-root"
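
These volume figures are easy to verify in code; a short check, assuming the `df` loaded in the earlier sketch:

```python
# Class balance: should match 4208 edible (51.8%) vs. 3916 poisonous (48.2%).
print(df["class"].value_counts())
print(df["class"].value_counts(normalize=True).round(3))

# '?' is the missing-value marker; only stalk-root should be affected.
missing = (df == "?").sum()
print(missing[missing > 0])  # expected: stalk-root 2480
```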

10 DM Methodology Steps
Step 2. Select Appropriate Data (cont.)
c. How many variables: 22 attributes (cap-shape, cap-color, odor, population, habitat, and so on)
d. How much history is required: none, the data have no seasonality (as long as we can eat them when we see them)

10 DM Methodology Steps
Step 3. Get to Know the Data
a. Examine distributions: use Weka to visualize all 22 attributes with histograms
b. Class: edible = e, poisonous = p
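
Equivalent histograms can be drawn outside Weka as well; a sketch with pandas and matplotlib, again assuming the `df` from the loading example:

```python
import matplotlib.pyplot as plt

# One bar chart of value counts per attribute, roughly mimicking
# Weka's "Visualize All" panel for nominal data.
attrs = [c for c in df.columns if c != "class"]
fig, axes = plt.subplots(5, 5, figsize=(20, 16))
for ax, col in zip(axes.flat, attrs):
    df[col].value_counts().plot.bar(ax=ax, title=col)
for ax in axes.flat[len(attrs):]:
    ax.set_visible(False)  # hide the unused panels
fig.tight_layout()
plt.show()
```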

Step 3. Get to Know the Data
1. Examine distributions: there are two types of histograms
- First type: every defined value appears in the data
- Example, (attribute 21) population: abundant = a, clustered = c, numerous = n, scattered = s, several = v, solitary = y

Step 3. Get to Know the Data
1. Examine distributions: there are two types of histograms
- Second type: only some of the defined values appear in the data
- Example, (attribute 7) gill-spacing: close = c, crowded = w, distant = d, yet only close and crowded actually occur

Step 3. Get to Know the Data
1. Examine distributions: there are exceptions
- Exception 1: missing values in an attribute
- (Attribute 11) stalk-root: bulbous = b, club = c, cup = u, equal = e, rhizomorphs = z, rooted = r, missing = ?
- 2,480 of the 8,124 instances are missing this attribute

Step 3. Get to Know the Data
1. Examine distributions: there are exceptions
- Exception 2: an attribute with no discriminating power
- (Attribute 16) veil-type: partial = p, universal = u, yet every instance in the data has the value partial

Step 3. Get to Know the Data
2. Compare values with descriptions
- No unexpected values, except for the missing ones

10 DM Methodology Steps
Step 4. Create a Model Set
- Creating a balanced sample: 75% (6,093 instances) as training data, 25% (2,031 instances) as test data
- RapidMiner's "cross-validation" operator: of k folds, k-1 are used for training and 1 for testing
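
A scikit-learn sketch of the same model-set construction; whether the original 75/25 split preserved class proportions is not stated on the slides, so the `stratify` argument here is an assumption:

```python
from sklearn.model_selection import StratifiedKFold, train_test_split

X = df.drop(columns="class")
y = df["class"]

# 75% / 25% split: 6093 training instances, 2031 test instances.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42)

# RapidMiner-style k-fold cross-validation: each fold serves once as the
# test set while the remaining k-1 folds are used for training.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
```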

10 DM Methodology Steps
Step 5. Fix Problems with the Data
- Dealing with missing values: the attribute "stalk-root" has 2,480 missing values
- Since the attribute is nominal, we replace every missing value with its most frequent value
- We replaced '?' with the modal value 'b' (bulbous)
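
The same modal imputation in pandas; computing the mode on the training partition only is a small methodological tightening not mentioned on the slides:

```python
# 'b' (bulbous) is the most frequent stalk-root value, so this reproduces
# the slide's replacement of '?' with 'b'.
mode = X_train.loc[X_train["stalk-root"] != "?", "stalk-root"].mode()[0]
X_train["stalk-root"] = X_train["stalk-root"].replace("?", mode)
X_test["stalk-root"] = X_test["stalk-root"].replace("?", mode)
```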

10 DM Methodology Steps
Step 6. Transform Data to Bring Information to the Surface
- All attributes are nominal, so no numerical transformation is needed in this step
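
Weka and RapidMiner consume nominal values directly, which is why the slides need no transformation here; reproducing the models below in scikit-learn does require encoding the nominal attributes numerically. This sketch is therefore an implementation detail of the Python reproduction, not part of the original workflow:

```python
from sklearn.preprocessing import OneHotEncoder

# Expand each nominal attribute into one 0/1 indicator column per value.
enc = OneHotEncoder(handle_unknown="ignore")
X_train_enc = enc.fit_transform(X_train)
X_test_enc = enc.transform(X_test)
```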

10 DM Methodology Steps
Step 7. Build Model
1. Decision Tree performance
- Accuracy: 99.11%
- Lift: %

                True p    True e    Class precision
  Pred. p                                %
  Pred. e                                %
  Class recall  98.16%    100.00%
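
The figures above come from RapidMiner; a rough scikit-learn equivalent of training and scoring the tree is sketched below (exact numbers will differ across tools, parameters, and splits):

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train_enc, y_train)
pred = tree.predict(X_test_enc)

print("accuracy:", accuracy_score(y_test, pred))
# Rows are true classes, columns are predictions, in the order ['e', 'p'].
print(confusion_matrix(y_test, pred, labels=["e", "p"]))
print(classification_report(y_test, pred))  # per-class precision and recall
```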

10 DM Methodology Steps
Step 7. Build Model
2. Naïve Bayes performance
- Accuracy: 95.77%
- Lift: %

                True p    True e    Class precision
  Pred. p                                %
  Pred. e                                %
  Class recall  92.13%    99.14%

10 DM Methodology Steps
Step 7. Build Model
3. Ripper performance
- Accuracy: 100%
- Lift: %

                True p    True e    Class precision
  Pred. p                                %
  Pred. e                                %
  Class recall  100.00%   100.00%
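
RIPPER has no scikit-learn implementation; the third-party `wittgenstein` package provides one, so the sketch below leans on an external dependency whose API is an assumption on top of the slides:

```python
import wittgenstein as lw  # assumed third-party package: pip install wittgenstein

ripper = lw.RIPPER()
# Treat 'p' (poisonous) as the positive class the rules should cover.
ripper.fit(X_train, y_train, pos_class="p")
print("accuracy:", ripper.score(X_test, y_test))
ripper.out_model()  # print the induced rule list
```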

10 DM Methodology Steps
Step 7. Build Model
4. NeuralNet performance
- Accuracy: 91.04%
- Lift: %

                True p    True e    Class precision
  Pred. p                                %
  Pred. e                                %
  Class recall  92.65%    89.54%

10 DM Methodology Steps
Step 8. Assess Models
- Accuracy: Ripper and the Decision Tree have the best performance

10 DM Methodology Steps
Step 8. Assess Models (cont.)
- Lift (used to compare the performance of different classification models): Ripper and the Decision Tree have the highest lifts
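
The slides use lift without defining it; under the usual definition (precision on a class divided by that class's base rate), a small sketch of the computation looks like this, reusing `y_test` and `pred` from the decision-tree example:

```python
def lift(y_true, y_pred, target="p"):
    """Precision on `target` divided by the target's base rate."""
    y_true, y_pred = list(y_true), list(y_pred)
    hits = [t for t, p in zip(y_true, y_pred) if p == target]
    precision = sum(t == target for t in hits) / len(hits)
    base_rate = sum(t == target for t in y_true) / len(y_true)
    return precision / base_rate

print("lift:", lift(y_test, pred))  # values above 1 beat random targeting
```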

10 DM Methodology Steps
Step 9. Deploy Models
- We haven't gone out to find real mushrooms yet

Step 10. Assess Results

Conclusion and Questions
- Ripper and decision trees may be better models for nominal data
- Open question: how does RapidMiner separate the training data from the test data?