K-Means Algorithm
Each cluster is represented by the mean value of the objects in the cluster.
Input: a set of n objects and the number of clusters k.
Output: a set of k clusters.
Algorithm:
Randomly select k samples and mark them as the initial cluster centers (means).
Repeat:
  Assign/reassign each sample to the cluster to which it is most similar, based on the cluster mean.
  Update each cluster's mean.
Until no change.
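The loop above can be sketched directly in Python with NumPy. This is a minimal illustration of the slide's procedure, not a production implementation; the function and variable names are my own.

```python
import numpy as np

def k_means(X, k, max_iter=100, seed=0):
    """Minimal k-means sketch: X is an (n, d) float array, k the number of clusters."""
    rng = np.random.default_rng(seed)
    # Randomly select k samples as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assign each sample to the nearest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # no change: assignments are stable
        labels = new_labels
        # Update each cluster's mean.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers

# Tiny usage example with made-up points:
labels, centers = k_means(np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 9.0]]), k=2)
```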

K-Means (graphical procedure)
Step 1: Form k centroids, randomly.
Step 2: Calculate the distance between the centroids and each object. Use the Euclidean distance to determine the minimum distance: d(A,B) = sqrt((x2 - x1)^2 + (y2 - y1)^2).
Step 3: Assign each object to the cluster whose centroid is at minimum distance.
Step 4: Recalculate the centroid of each cluster using C = ((x1 + x2 + ... + xn)/n, (y1 + y2 + ... + yn)/n).
Go to Step 2. Repeat until there is no change in the centroids.
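As a small concrete check of these two formulas (the points below are invented purely for illustration):

```python
import math

# Two illustrative 2-D points.
A = (1.0, 2.0)
B = (4.0, 6.0)

# Euclidean distance: sqrt((4-1)^2 + (6-2)^2) = sqrt(9 + 16) = 5.0
d = math.sqrt((B[0] - A[0]) ** 2 + (B[1] - A[1]) ** 2)

# Centroid of a cluster: mean of each coordinate.
cluster = [(1.0, 2.0), (4.0, 6.0), (1.0, 1.0)]
cx = sum(p[0] for p in cluster) / len(cluster)   # (1 + 4 + 1) / 3 = 2.0
cy = sum(p[1] for p in cluster) / len(cluster)   # (2 + 6 + 1) / 3 = 3.0
print(d, (cx, cy))   # 5.0 (2.0, 3.0)
```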

K-Medoids (PAM)
Also called Partitioning Around Medoids.
Step 1: Choose k medoids.
Step 2: Assign all points to the closest medoid.
Step 3: Form a distance matrix for each cluster and choose the next best medoid, i.e., the point closest to all other points in the cluster. Go to Step 2.
Repeat until there is no change in any medoid.
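The variant described on this slide updates each cluster's medoid to the member with the smallest total distance to the other members. The sketch below follows that description; it is not the full PAM swap procedure, and the function names are mine.

```python
import numpy as np

def k_medoids(X, k, max_iter=100, seed=0):
    """Medoid-update sketch following the slide: X is an (n, d) array."""
    rng = np.random.default_rng(seed)
    medoid_idx = rng.choice(len(X), size=k, replace=False)
    for _ in range(max_iter):
        # Assign every point to its closest medoid.
        dists = np.linalg.norm(X[:, None, :] - X[medoid_idx][None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_idx = medoid_idx.copy()
        for j in range(k):
            members = np.flatnonzero(labels == j)
            if len(members) == 0:
                continue
            # Next best medoid: the point closest to all other points in the cluster.
            within = np.linalg.norm(X[members][:, None, :] - X[members][None, :, :], axis=2)
            new_idx[j] = members[within.sum(axis=1).argmin()]
        if np.array_equal(np.sort(new_idx), np.sort(medoid_idx)):
            break  # no change in any medoid
        medoid_idx = new_idx
    return labels, X[medoid_idx]
```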

What are Hierarchical Methods?
They group data objects into a tree of clusters.
Classified as:
Agglomerative (bottom-up)
Divisive (top-down)
Once a merge or split decision is made, it cannot be backtracked.

Types of hierarchical clustering
Agglomerative (bottom-up), e.g. AGNES: places each object into its own cluster and merges these atomic clusters into larger and larger clusters. Variants differ in the definition of inter-cluster similarity.
Divisive (top-down), e.g. DIANA: all objects are initially in one cluster, which is subdivided into smaller and smaller pieces until each object forms a cluster of its own or some termination condition is satisfied.
In both of the above methods the termination condition is the number of clusters.

Dendrogram (figure showing the nested clusters at levels 0 through 4)
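A dendrogram like the one on this slide can be produced with SciPy; the five points and the labels A–E below are made up just to have something to plot.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

# Five made-up 2-D points, standing in for the slide's example objects.
X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0], [5.5, 5.0], [9.0, 1.0]])

# Agglomerative (bottom-up) merging with single linkage; Z records each merge.
Z = linkage(X, method="single", metric="euclidean")

dendrogram(Z, labels=["A", "B", "C", "D", "E"])
plt.title("Dendrogram of nested merges")
plt.show()
```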

Measures of Distance (between clusters)
Minimum distance: nearest neighbor, single linkage; related to the minimum spanning tree.
Maximum distance: farthest neighbor, complete linkage.
Mean distance: distance between the cluster means; avoids the outlier sensitivity problem.
Average distance: average of all pairwise distances; can handle categorical as well as numeric data.
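These measures can be written down directly; the small helper functions below are an illustrative sketch (the function names are mine).

```python
import numpy as np

def pairwise(ci, cj):
    """All pairwise Euclidean distances between two clusters (arrays of points)."""
    return np.linalg.norm(ci[:, None, :] - cj[None, :, :], axis=2)

def d_min(ci, cj):    # minimum distance: single linkage, nearest pair
    return pairwise(ci, cj).min()

def d_max(ci, cj):    # maximum distance: complete linkage, farthest pair
    return pairwise(ci, cj).max()

def d_mean(ci, cj):   # distance between the cluster means
    return np.linalg.norm(ci.mean(axis=0) - cj.mean(axis=0))

def d_avg(ci, cj):    # average of all pairwise distances
    return pairwise(ci, cj).mean()
```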

Euclidean Distance: d(A,B) = sqrt((x2 - x1)^2 + (y2 - y1)^2)

Agglomerative Algorithm
Step 1: Make each object a cluster of its own.
Step 2: Calculate the Euclidean distance from every point to every other point, i.e., construct a distance matrix.
Step 3: Identify the two clusters with the shortest distance and merge them. Go to Step 2.
Repeat until all objects are in one cluster.
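A minimal from-scratch version of this loop, using single linkage (closest pair of points between clusters), might look like the following sketch; it returns the sequence of merges rather than a drawn dendrogram.

```python
import numpy as np

def agglomerative_single_link(X):
    """Merge clusters until one remains; returns the list of merges (a sketch)."""
    clusters = [[i] for i in range(len(X))]        # Step 1: every object is a cluster
    merges = []
    while len(clusters) > 1:
        best = (None, None, np.inf)
        # Steps 2-3: find the pair of clusters with the shortest single-link distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, d = best
        merges.append((clusters[a], clusters[b], d))
        clusters[a] = clusters[a] + clusters[b]    # merge the two closest clusters
        del clusters[b]
    return merges
```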

Agglomerative Algorithm Approaches
Single link: quite simple, not very efficient, and suffers from the chaining effect.
Complete link: produces clusters that are more compact than those found using the single-link technique.
Average link: a compromise between the two.
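With SciPy, switching between these criteria is just a change of the `method` argument; the toy data here is invented for illustration, and each tree is cut into two flat clusters for comparison.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [5.0, 5.0], [5.5, 5.0]])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")   # cut into 2 flat clusters
    print(method, labels)
```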

Simple Example: a distance matrix over the items E, A, C, B, D (numeric values omitted in the transcript).

Another Example: use the single-link technique to find the clusters in the given database of (X, Y) points (data table omitted in the transcript).

Plot given data

Identify two nearest clusters

Repeat the process until all objects are in the same cluster

Average link Average distance matrix

Construct a distance matrix

Divisive Clustering
All items are initially placed in one cluster.
The clusters are repeatedly split in two until every item is in its own cluster (illustrated on the slide with items A, B, C, D, E).
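One simple way to realize this top-down scheme is to repeatedly split the largest remaining cluster with a tiny 2-means. This is a bisecting sketch of my own, not the DIANA splinter procedure, and the function names are assumptions.

```python
import numpy as np

def bisecting_split(X, max_iter=50, seed=0):
    """Split one cluster into two with a small 2-means; returns a boolean mask."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=2, replace=False)].astype(float)
    for _ in range(max_iter):
        labels = np.linalg.norm(X[:, None] - centers[None], axis=2).argmin(axis=1)
        for j in (0, 1):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels == 0

def divisive(X):
    """Top-down: keep splitting the largest cluster until every item is alone."""
    clusters = [list(range(len(X)))]
    while max(len(c) for c in clusters) > 1:
        big = max(clusters, key=len)
        clusters.remove(big)
        mask = bisecting_split(X[np.asarray(big)])
        left = [i for i, m in zip(big, mask) if m]
        right = [i for i, m in zip(big, mask) if not m]
        if not left or not right:
            left, right = big[:1], big[1:]   # degenerate fallback: peel off one item
        clusters += [left, right]
    return clusters
```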

Difficulties in Hierarchical Clustering
Difficulty in selecting merge or split points: this decision is critical because further merge or split decisions are based on the newly formed clusters.
The method does not scale well.
Hierarchical methods are therefore integrated with other clustering techniques to form multiple-phase clustering.

Types of hierarchical clustering techniques
BIRCH: Balanced Iterative Reducing and Clustering using Hierarchies.
ROCK: Robust Clustering using Links; explores the concept of links.
CHAMELEON: a hierarchical clustering algorithm using dynamic modeling.

Outlier Analysis
Outliers are data objects that are different from or inconsistent with the remaining set of data.
Outliers can be caused by measurement or execution error, or can be the result of inherent data variability.
Outlier detection can be used in fraud detection.
Outlier detection and analysis is referred to as outlier mining.

Applications of outlier mining
Fraud detection.
Customized marketing: identifying the spending behavior of customers with extremely low or high incomes.
Medical analysis: finding unusual responses to various medical treatments.

What is outlier mining?
Given a set of n data points or objects and k, the expected number of outliers, find the top k objects that are most dissimilar, exceptional, or inconsistent with respect to the remaining data.
There are two subproblems:
Define what data can be considered inconsistent in a given data set.
Find a method to mine the outliers.
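One simple reading of this top-k formulation is a distance-based score: rank every object by its average distance to all other objects and report the k most remote. The snippet below is such a sketch; the scoring rule is one of several possibilities, not the chapter's prescribed method, and the data is made up.

```python
import numpy as np

def top_k_outliers(X, k):
    """Score each object by its mean distance to all other objects (a simple sketch)."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = dists.sum(axis=1) / (len(X) - 1)     # mean distance to the other points
    return np.argsort(scores)[-k:][::-1]          # indices of the k most remote objects

X = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.0], [8.0, 8.0]])
print(top_k_outliers(X, k=1))   # flags the far-away point at index 4
```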

Methods of outlier detection
Statistical approach
Distance-based approach
Density-based local outlier approach
Deviation-based approach

Statistical Distribution Approach
Identifies outliers with respect to a discordancy test.
A discordancy test examines a working hypothesis (the object comes from the assumed distribution F) against an alternative hypothesis.
It verifies whether an object oi is significantly large (or small) in relation to the distribution F; this leads to accepting the working hypothesis or rejecting it in favor of an alternative distribution.
Kinds of alternative distributions: inherent alternative distribution, mixture alternative distribution, slippage alternative distribution.
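As an illustration only (not one of the formal discordancy tests from the literature), a Gaussian working hypothesis can be screened with a simple z-score rule; the 3.0 cut-off below is an arbitrary assumption.

```python
import numpy as np

def z_score_outliers(x, threshold=3.0):
    """Flag values far from the sample mean under an assumed normal distribution F."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std(ddof=1)    # standardized deviation from the mean
    return np.flatnonzero(np.abs(z) > threshold)

rng = np.random.default_rng(0)
values = np.append(rng.normal(10.0, 0.5, size=30), 25.0)  # 30 inliers plus one extreme value
print(z_score_outliers(values))   # expected to flag the appended extreme value (index 30)
```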

Procedures for detecting outliers
Block procedures: all suspect objects are treated as outliers, or all of them are accepted as consistent.
Consecutive procedures: the object that is least likely to be an outlier is tested first. If it is found to be an outlier, then all of the more extreme values are also considered outliers; otherwise the next most extreme object is tested, and so on.
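The consecutive idea can be sketched on top of the z-score screen above; again the cut-off is an arbitrary assumption, the real procedures use proper critical values, and for brevity the test is applied to all objects rather than a pre-selected suspect set.

```python
import numpy as np

def consecutive_outliers(x, threshold=3.0):
    """Test the least extreme value first; once one fails, all more extreme values fail too."""
    x = np.asarray(x, dtype=float)
    z = np.abs((x - x.mean()) / x.std(ddof=1))
    order = np.argsort(z)                  # indices from least to most extreme
    for pos, idx in enumerate(order):
        if z[idx] > threshold:
            return order[pos:]             # this value and everything more extreme are outliers
    return np.array([], dtype=int)
```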

Questions in Clustering