Introduction to Clementine Tutors: Cecia Chan & Gabriel Fung Data Mining Tutorial.


A Brief Review of Data Mining (I)
Data mining is…
– A process of extracting previously unknown, valid, and actionable knowledge from large databases
A rule of thumb:
– If we know clearly the shape and likely content of what we are looking for, we are probably not dealing with data mining

A Brief Review of Data Mining (II)
Therefore, data mining is not…
– SQL queries against any number of disparate databases or data warehouses
– SQL queries in a parallel or massively parallel environment
– Information retrieval, for example, through intelligent agents
– Multidimensional database analysis (MDA)
– OLAP
– Exploratory data analysis (EDA)
– Graphical visualization
– Traditional statistical processing against a data warehouse
However, they are all related to data mining.

Data Mining Process
1. Business objective(s) determination
– What is your goal?
2. Data collection
– You can learn nothing without data…
3. Data preprocessing (or data preparation)
– Remove outliers / filter noise / modify fields / etc.
4. Modeling
– The core part of data mining
5. Evaluation
– See what you have learned!

Data Mining Software
Existing data mining software:
– Clementine from SPSS (we have this software)
– Enterprise Miner from SAS (we have this software)
– Intelligent Miner from IBM (we have this software)
– MineSet from Silicon Graphics
– K-wiz from Compression Sciences Ltd.
– DBMiner from DBMiner Tech. Inc.
– PolyAnalyst from Megaputer Intelligence
– StatServer from Mathsoft
– …

Problem Statement
Situation:
– You are a researcher compiling data for a medical study
– You have collected data about a set of patients, all of whom suffered from the same illness
– Each patient responded to one of five drug treatments

Step 1: Business objective
Figure out which drug might be appropriate for a future patient with the same illness.
Here are the data collected:
– Age
– Sex (M or F)
– BP (Blood pressure: High, normal, or low)
– Weight (The weight of the patient)
– Cholesterol (Blood cholesterol: Normal or high)
– Na (Blood sodium concentration)
– K (Blood potassium concentration)
– Drug (Drug to which the patient responded)

Using Clementine (1)
Clementine is located in…
– Start → All Programs → Clementine
[Screenshot callouts: Models, Nodes, Work-Space]

Using Clementine (2)
Nodes in the workspace represent different objects and actions. You connect the nodes to form streams, which, when executed, let you visualize relationships and draw conclusions.

Step 2: Data Collection (1)
[Screenshot callouts: "Nodes for inputting the collected data"; "Double Click"]

Data Collection (2)
[Screenshot callouts:]
– Location of your file
– How many columns to use from the file
– Whether the first row specifies the names of the fields
– Other details
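
The settings listed above (file location, number of columns, header row) have a rough analogue outside Clementine. The following is only a sketch in plain Python, not how Clementine reads files; the function `load_records`, its parameters, and the sample rows are all hypothetical.

```python
import csv
import io

# Hypothetical sample rows; the tutorial's own data file is not reproduced here.
SAMPLE = """Age,Sex,BP,Cholesterol,Na,K,Drug
23,F,HIGH,HIGH,0.79,0.03,drugY
47,M,LOW,HIGH,0.74,0.06,drugC
"""


def load_records(text, has_header=True, use_columns=None):
    """Parse comma-delimited text into a list of dicts.

    has_header: whether the first row holds the field names.
    use_columns: how many leading columns to keep (None keeps all).
    """
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    if has_header:
        names, data = rows[0], rows[1:]
    else:
        # No header row: fall back to generated names field1, field2, ...
        names = [f"field{i + 1}" for i in range(len(rows[0]))]
        data = rows
    if use_columns is not None:
        names = names[:use_columns]
    # zip stops at the shorter sequence, so extra columns are dropped.
    return [dict(zip(names, row)) for row in data]


records = load_records(SAMPLE, has_header=True, use_columns=3)
print(records[0])
```

With a real file, the text would come from `open(path).read()` instead of the inline sample.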

Step 3: Data Preparation – Explore the Data (1)
Nodes for exploration/visualization:
– Table (in the Output panel)
– Plot (in the Graphs panel)
– Histogram (in the Graphs panel)
– Distribution (in the Graphs panel)
– Web (in the Graphs panel)
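
As a rough illustration of what a distribution view summarizes, the sketch below counts how often each category of a field occurs and prints a text bar per category. It is plain Python rather than a Clementine node, and the `drug_values` list is made up for the example.

```python
from collections import Counter

# Hypothetical values of the Drug field for a handful of patients.
drug_values = ["drugY", "drugC", "drugY", "drugX", "drugY", "drugX"]


def distribution(values):
    """Count occurrences of each category, most frequent first."""
    return Counter(values).most_common()


# Print one text bar per category.
for category, count in distribution(drug_values):
    print(f"{category:<6} {'#' * count} {count}")
```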

Step 3: Data Preparation – Explore the Data (2)
Connect the nodes:
Note: Connect the nodes by clicking and dragging with the middle mouse button.
[Screenshot callout: Double Click]

Step 3: Data Preparation – Explore the Data (3)
Execution:
Note: Right-click on the Table node to display the menu shown on the slide.

Step 3: Data Preparation – Explore the Data (4)
Other nodes (please try the other nodes yourself):
– Histogram [screenshot]

Step 3: Data Preparation – Modify the Data (1)
Replacing values:
– Use the Filler node:
» Suppose we want to replace every weight with its log value (note: we usually log-transform a variable only when it is highly skewed).
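
The log replacement described above can be sketched in plain Python. This is not the Filler node's own expression syntax. The slide does not say which logarithm base is used, so the natural log here is an assumption, and the weights are made-up values.

```python
import math

# Hypothetical weights; the tutorial's own data is not reproduced here.
weights = [52.0, 61.5, 70.0, 148.0]


def log_transform(values):
    """Replace each value with its natural logarithm (base is an assumption)."""
    if any(v <= 0 for v in values):
        raise ValueError("log is undefined for non-positive values")
    return [math.log(v) for v in values]


log_weights = log_transform(weights)
print([round(v, 3) for v in log_weights])
```

The transform compresses large values more than small ones, which is why it is mainly useful for right-skewed fields.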

Step 3: Data Preparation – Modify the Data (2)
Derive a new value:
– Use the Derive node:
» Suppose we want to combine Na and K.
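
One way to combine the two fields is as a ratio, which is what the field name Na_Over_K used later in the tutorial suggests. The sketch below shows that derivation in plain Python; the helper `derive_na_over_k` and the patient values are hypothetical.

```python
def derive_na_over_k(record):
    """Return a copy of the record with a Na_Over_K field added."""
    derived = dict(record)  # copy, so the input record is left untouched
    derived["Na_Over_K"] = record["Na"] / record["K"]
    return derived


# Hypothetical patient; values chosen only to keep the arithmetic simple.
patient = {"Age": 23, "Na": 0.5, "K": 0.25}
print(derive_na_over_k(patient))
```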

Step 3: Data Preparation – Modify the Data (3)
Remove some fields:
– Use the Filter node:
» Suppose we have derived a new field Na_Over_K; now we need to remove the fields Na and K.
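
Dropping fields that a derived field has made redundant can be sketched as follows. This is a plain-Python illustration of the idea, not the Filter node itself; `drop_fields` and the sample record are hypothetical.

```python
def drop_fields(record, fields_to_remove):
    """Return a copy of the record without the named fields."""
    return {k: v for k, v in record.items() if k not in fields_to_remove}


# Hypothetical record that already carries the derived ratio.
patient = {"Age": 23, "Na": 0.5, "K": 0.25, "Na_Over_K": 2.0}
print(drop_fields(patient, {"Na", "K"}))
```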

Step 4: Modeling – Define fields
Define the fields:
– Use the Type node [screenshot]

Step 4: Modeling – Build a Model (1)
It is the core part of data mining.
Supervised Learning:
– Train Net (Neural Network)
– C5.0 (C5.0 Decision Tree)
– Linear Reg. (Linear regression)
– C & R Tree (Classification and Regression Tree, CART)
Unsupervised Learning:
– Train Kohonen (Self-Organizing Map, SOM)
– Train KMeans (K-means Clustering)
– TwoStep (A kind of Hierarchical Clustering)
Others:
– GRI (Association Rule mining)
– Apriori (Association Rule mining)
– Factor / PCA (Factor analysis, attribute selection technique)

Step 4: Modeling – Build a Model (2)
Build what model?
– Recall that our objective is to determine which drug is suitable for a specific patient.
– Thus, it is a classification problem (supervised learning).
In this tutorial, we use:
– C5.0 and C & R Tree
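
To give a feel for how a decision tree picks a field to split on, the sketch below computes plain information gain for a categorical field. It is not the C5.0 or C & R Tree algorithm as implemented in Clementine. Tree learners in the C4.5/C5.0 family use entropy-based measures of this kind, while CART-style trees typically use other impurity measures. The patient records are made up.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(
        (n / total) * math.log2(n / total) for n in Counter(labels).values()
    )


def information_gain(records, field, target):
    """Reduction in target entropy obtained by splitting on `field`."""
    groups = defaultdict(list)
    for r in records:
        groups[r[field]].append(r[target])
    before = entropy([r[target] for r in records])
    # Weighted average entropy of the groups produced by the split.
    after = sum(len(g) / len(records) * entropy(g) for g in groups.values())
    return before - after


# Hypothetical patients: BP separates the drugs perfectly, Sex not at all.
patients = [
    {"BP": "HIGH", "Sex": "F", "Drug": "drugA"},
    {"BP": "HIGH", "Sex": "M", "Drug": "drugA"},
    {"BP": "LOW", "Sex": "F", "Drug": "drugC"},
    {"BP": "LOW", "Sex": "M", "Drug": "drugC"},
]

for field in ("BP", "Sex"):
    print(field, information_gain(patients, field, "Drug"))
```

A field with higher gain separates the target classes better, so a tree learner of this kind would prefer it for the split.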

Step 4: Modeling – Build a Model (3)
Note:
– There are many complex settings for each model
– In this tutorial, we use the default settings
– Fine-tuning a model requires solid experience in data mining

Step 5: Evaluation (1)
Having learned SOMETHING means NOTHING until the knowledge we have learned is ACTIONABLE and VALID.
Remember:
– The training and testing data sets are ALWAYS different (why?)
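
A model scored on the same records it was trained on tends to look better than it really is, which is one reason for keeping the two sets apart. The sketch below shows a simple random holdout split in plain Python. The function name, the 30% test fraction, and the fixed seed are illustrative assumptions, not settings from the tutorial.

```python
import random


def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and split it into (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical record ids standing in for patient rows.
records = list(range(10))
train, test = train_test_split(records)
print(len(train), len(test))
```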

Step 5: Evaluation (2)
Create the following flow [screenshot].
Note: The flow must be the same as in the training stage.

Step 5: Evaluation (3)
Different results:
– Different models can yield completely different results
– Choosing and tuning a good model is a difficult job
– In this tutorial, we introduce only the process of data mining
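
One simple way to see that two models give different results is to compare their accuracy on the same test records. The sketch below does this in plain Python. The labels and both prediction lists are invented for illustration and do not come from the tutorial's models.

```python
def accuracy(actual, predicted):
    """Fraction of records whose predicted label matches the actual one."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("cannot score an empty test set")
    correct = sum(a == p for a, p in zip(actual, predicted))
    return correct / len(actual)


# Hypothetical test labels and predictions from two hypothetical models.
actual = ["drugA", "drugC", "drugA", "drugX"]
model_1 = ["drugA", "drugC", "drugA", "drugA"]
model_2 = ["drugA", "drugA", "drugC", "drugA"]

print("model 1:", accuracy(actual, model_1))
print("model 2:", accuracy(actual, model_2))
```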

Assignment 1

Assignment 1 – Problem Statement
Situation:
– You are a financial analyst of a bank
– You have to predict whether a customer is Good or Bad based on some demographic information
Data Set:
– A data set about your past customers has been collected
– Each customer is either Good or Bad

Assignment 1 – Field definitions
VARIABLE   ROLE    DEFINITION  DESCRIPTION
CHECKING   input   Nominal     Checking account status
HISTORY    input   Nominal     Credit history
AMOUNT     input   Interval    Amount in Bank
SAVINGS    input   Nominal     No. of Savings (bonds, stocks, etc)
EMPLOYED   input   Nominal     Employment Type (Gov., private, etc)
INSTALLP   input   Nominal     Type of installment rate
MARITAL    input   Nominal     Marital status
PROPERTY   input   Nominal     Type of Property
AGE        input   Interval    Age in years
OTHER      input   Nominal     Type of other installment plan
HOUSING    input   Nominal     Type of House
EXISTCR    input   Interval    Number of existing credits
JOB        input   Nominal     Job Nature
FOREIGN    input   Binary      Foreign worker or Local worker
GOOD_BAD   output  Binary      Good or bad credit rating

Assignment 1 – Data Mining Process
Data Collection
– Please download the CreditRisk data set from
– Two data sets: (i) creditRisk1.csv is for training; (ii) creditRisk2.csv is for testing
Data Preprocessing
– Please explore the data and think critically about whether any data preprocessing is necessary
» Hint: Two of the interval variables are highly skewed
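
Besides inspecting a histogram, skew can be checked numerically. The sketch below computes the moment-based skewness coefficient (third central moment divided by the 1.5 power of the second) in plain Python. The two value lists are made up and are not taken from the CreditRisk data.

```python
def skewness(values):
    """Moment-based skewness: m3 / m2**1.5 (0.0 for constant data)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    if m2 == 0:
        return 0.0
    return m3 / m2 ** 1.5


# Hypothetical columns: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 2, 2, 3, 50]

print(skewness(symmetric))
print(skewness(right_skewed))
```

A value near zero suggests a roughly symmetric field, while a large positive value suggests a long right tail that a log transform may reduce.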

Assignment 1 – Data Mining Process
Modeling
– Please build a prediction model using the default settings:
» C5.0 Decision Tree
Model Assessment
– Please use the testing data set to evaluate the performance of the prediction model
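
For a two-class target such as GOOD_BAD, a confusion table shows which kinds of mistakes a model makes, not just how many. The sketch below tallies (actual, predicted) pairs in plain Python. The label lists are invented and do not come from creditRisk2.csv.

```python
from collections import Counter


def confusion_counts(actual, predicted):
    """Count each (actual, predicted) label pair."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))


# Hypothetical test labels and model predictions.
actual = ["Good", "Good", "Bad", "Bad", "Good"]
predicted = ["Good", "Bad", "Bad", "Good", "Good"]

counts = confusion_counts(actual, predicted)
for a in ("Good", "Bad"):
    for p in ("Good", "Bad"):
        print(f"actual={a:<4} predicted={p:<4} count={counts[(a, p)]}")
```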

Assignment 1 – Submission
Save the stream as “id.str”
– E.g., str
Upload your stream to the course account
Deadline:
– 4 April 2004
This is an individual assignment
Note: We strongly encourage you to submit this assignment during the class!