Data Mining – Clustering and Classification


• Review Questions
  ◦ Question 1: Clustering and Classification
• Algorithm Questions
  ◦ Question 2: K-Means Clustering
  ◦ Question 3: Classification Tree

What is Clustering?

Clustering is finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. Clustering can be used to “understand” data (e.g. grouping related documents for browsing, grouping genes and proteins that have similar functionality, or grouping stocks with similar price fluctuations) or to “summarise” data (e.g. reducing the size of a large data set, such as precipitation records for Australia).

What is Classification?

Classification is the task of finding, given a collection of records (known as a training set), a model for the class attribute as a function of the values of the other attributes. The goal is to assign previously unseen records a class as accurately as possible; a test set is used to measure the accuracy of the model. Examples include predicting tumor cells as benign or malignant, classifying credit card transactions as legitimate or fraudulent, classifying structures of proteins as alpha-helix, beta-sheet, or random coil, and categorising news stories as finance, weather, entertainment, sports, etc.

How do the two differ?

The difference between the two approaches is that clustering finds groups of similar or related records in a data set without using predefined classes, whereas classification uses a model, learned as a function of the values of the other attributes, to predict the class attribute of each record. Clustering returns groups of records; classification returns a class label for each record in the data set.


Consider the following set of two-dimensional records:

RID  Dimension1  Dimension2
1    8           4
2    5           4
3    2           4
4    2           6
5    2           8
6    8           6

Use the k-means algorithm to cluster the data. Assume there are 3 clusters and that records 1, 3 and 5 are used as the initial centroids (means).

Calculate the distance from each remaining record to the three initial centroids (record 1 is the centroid for C1, record 3 for C2 and record 5 for C3):

RID  Distance to 1                    Distance to 3                    Distance to 5
2    sqrt((8-5)^2 + (4-4)^2) = 3      sqrt((2-5)^2 + (4-4)^2) = 3      sqrt((2-5)^2 + (8-4)^2) = 5
4    sqrt((8-2)^2 + (4-6)^2) = 6.3    sqrt((2-2)^2 + (4-6)^2) = 2      sqrt((2-2)^2 + (8-6)^2) = 2
6    sqrt((8-8)^2 + (4-6)^2) = 2      sqrt((2-8)^2 + (4-6)^2) = 6.3    sqrt((2-8)^2 + (8-6)^2) = 6.3
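The distance step can be checked with a few lines of Python. This is a minimal sketch rather than part of the original slides: the record coordinates are read off the distance formulas in this walkthrough, and the names `RECORDS`, `euclidean` and `distance_table` are made up for illustration.

```python
import math

# Coordinates inferred from the distance formulas in this walkthrough (an assumption).
RECORDS = {1: (8, 4), 2: (5, 4), 3: (2, 4), 4: (2, 6), 5: (2, 8), 6: (8, 6)}
INITIAL_CENTROID_IDS = [1, 3, 5]  # records used as the starting centroids


def euclidean(p, q):
    """Euclidean distance between two 2-D points."""
    return math.sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)


def distance_table(records, centroid_ids):
    """Distance from every non-centroid record to each centroid, rounded to 1 decimal."""
    table = {}
    for rid, point in records.items():
        if rid in centroid_ids:
            continue  # a centroid record is at distance 0 from itself
        table[rid] = [round(euclidean(point, records[c]), 1) for c in centroid_ids]
    return table


for rid, row in distance_table(RECORDS, INITIAL_CENTROID_IDS).items():
    print(rid, row)
```

Each printed row lists the distances to centroids 1, 3 and 5 in that order, so it can be compared directly with the hand-computed values.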

Assigning each record to the cluster with the nearest centroid produces the following clusters (record 2 is equidistant from centroids 1 and 3, and record 4 from centroids 3 and 5; each tie is resolved here in favour of the lower-numbered cluster):
C1: 1, 2, 6
C2: 3, 4
C3: 5
Hence the means of the clusters are now:
C1: D1 = (8 + 5 + 8)/3 = 7, D2 = (4 + 4 + 6)/3 = 4.7
C2: D1 = (2 + 2)/2 = 2, D2 = (4 + 6)/2 = 5
C3: D1 = 2, D2 = 8

Now calculate the distances based on the new coordinates of the cluster centroids:

RID  Distance to C1 (7, 4.7)            Distance to C2 (2, 5)            Distance to C3 (2, 8)
1    sqrt((7-8)^2 + (4.7-4)^2) = 1.2    sqrt((2-8)^2 + (5-4)^2) = 6.1    sqrt((2-8)^2 + (8-4)^2) = 7.2
2    sqrt((7-5)^2 + (4.7-4)^2) = 2.1    sqrt((2-5)^2 + (5-4)^2) = 3.2    sqrt((2-5)^2 + (8-4)^2) = 5
3    sqrt((7-2)^2 + (4.7-4)^2) = 5.0    sqrt((2-2)^2 + (5-4)^2) = 1      sqrt((2-2)^2 + (8-4)^2) = 4
4    sqrt((7-2)^2 + (4.7-6)^2) = 5.2    sqrt((2-2)^2 + (5-6)^2) = 1      sqrt((2-2)^2 + (8-6)^2) = 2
5    sqrt((7-2)^2 + (4.7-8)^2) = 6.0    sqrt((2-2)^2 + (5-8)^2) = 3      sqrt((2-2)^2 + (8-8)^2) = 0
6    sqrt((7-8)^2 + (4.7-6)^2) = 1.6    sqrt((2-8)^2 + (5-6)^2) = 6.1    sqrt((2-8)^2 + (8-6)^2) = 6.3

Assigning each record to its nearest centroid again produces the following clusters:
C1: 1, 2, 6
C2: 3, 4
C3: 5
Since no record has changed cluster, the algorithm has converged and these are the final clusters.
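The whole iteration (assign, recompute means, repeat until nothing moves) can be sketched in Python as follows. This is an illustrative sketch, not the slides' own code: the coordinates are inferred from the distance formulas in the walkthrough above, ties are assumed to go to the lower-numbered cluster, and the helper names are hypothetical.

```python
import math

# Coordinates inferred from the distance formulas in the walkthrough (an assumption).
RECORDS = {1: (8, 4), 2: (5, 4), 3: (2, 4), 4: (2, 6), 5: (2, 8), 6: (8, 6)}


def nearest(point, centroids):
    """Index of the closest centroid; ties go to the lowest index."""
    distances = [math.dist(point, c) for c in centroids]
    return distances.index(min(distances))


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def k_means(records, centroids):
    """Alternate assignment and mean updates until no record changes cluster."""
    assignment = None
    while True:
        new_assignment = {rid: nearest(p, centroids) for rid, p in records.items()}
        if new_assignment == assignment:
            return centroids, assignment
        assignment = new_assignment
        updated = []
        for k, old in enumerate(centroids):
            members = [records[rid] for rid, c in assignment.items() if c == k]
            # Keep the old centroid if a cluster ends up empty (a choice made for this sketch).
            updated.append(mean(members) if members else old)
        centroids = updated


# Records 1, 3 and 5 serve as the initial centroids, as in the question.
start = [RECORDS[1], RECORDS[3], RECORDS[5]]
final_centroids, assignment = k_means(RECORDS, start)
clusters = {k + 1: [rid for rid, c in assignment.items() if c == k] for k in range(3)}
print(clusters)
print(final_centroids)
```

Under these assumptions the loop stops after the second assignment pass, with the same three clusters as the hand calculation.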

Apply Hunt’s classification algorithm to build a decision tree for the following training data set (where Loanworthy is the class and RID is the record ID), assuming we test the attributes from left to right.

RID  Married  Salary    Acct_Balance  Age    Loanworthy
1    no       >=50K     <5K           >=25   yes
2             >=50K     >=5K          >=25   yes
3             20K…50K   <5K           <25    no
4             <20K      >=5K          <25    no
5             <20K      <5K           >=25   no
6    yes      20K…50K   >=5K          >=25   yes
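The recursive structure of Hunt's algorithm can be sketched in Python. This is only a sketch under stated assumptions: the Married column is left out because its values for records 2 to 5 are not present in the table, so the attributes are tested in the order Salary, Acct_Balance, Age, and the resulting tree is therefore not the answer to the exercise as posed. The function name `hunt` and the nested-dictionary tree representation are choices made for illustration.

```python
from collections import Counter

# Attribute order used by this sketch; Married is omitted (values missing in the table).
ATTRIBUTES = ["Salary", "Acct_Balance", "Age"]

TRAINING = [
    {"RID": 1, "Salary": ">=50K", "Acct_Balance": "<5K", "Age": ">=25", "Loanworthy": "yes"},
    {"RID": 2, "Salary": ">=50K", "Acct_Balance": ">=5K", "Age": ">=25", "Loanworthy": "yes"},
    {"RID": 3, "Salary": "20K…50K", "Acct_Balance": "<5K", "Age": "<25", "Loanworthy": "no"},
    {"RID": 4, "Salary": "<20K", "Acct_Balance": ">=5K", "Age": "<25", "Loanworthy": "no"},
    {"RID": 5, "Salary": "<20K", "Acct_Balance": "<5K", "Age": ">=25", "Loanworthy": "no"},
    {"RID": 6, "Salary": "20K…50K", "Acct_Balance": ">=5K", "Age": ">=25", "Loanworthy": "yes"},
]


def hunt(rows, attributes, target="Loanworthy"):
    """Build a tree by splitting on attributes in the given order.

    A leaf is a class label; an internal node is {attribute: {value: subtree}}.
    """
    labels = [row[target] for row in rows]
    if len(set(labels)) == 1:
        return labels[0]  # all records share one class: make a leaf
    if not attributes:
        return Counter(labels).most_common(1)[0][0]  # no tests left: majority class
    attribute, remaining = attributes[0], attributes[1:]
    partitions = {}
    for row in rows:
        partitions.setdefault(row[attribute], []).append(row)
    return {
        attribute: {
            value: hunt(subset, remaining, target)
            for value, subset in partitions.items()
        }
    }


tree = hunt(TRAINING, ATTRIBUTES)
print(tree)
```

With this attribute order the root tests Salary, and only the 20K…50K branch needs a further test on Acct_Balance, because the other two salary ranges are already pure.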
