Slide 1 EE3J2 Data Mining Lecture 16 Unsupervised Learning Ali Al-Shahib.

Presentation transcript:

Slide 1 EE3J2 Data Mining Lecture 16 Unsupervised Learning Ali Al-Shahib

Slide 2 Lectures on WebCT

Slide 3 Today  Unsupervised Learning  Clustering  K-means

Slide 4 What is Clustering?  Clustering is an unsupervised learning method (there are no predefined classes; in our buy_computer example, there is no 'yes' or 'no' label).  Imagine you are given a set of data objects for analysis; unlike in classification, the class label of each example is not known.  Clustering is the process of grouping the data into classes, or clusters, so that examples within a cluster are highly similar to one another but very dissimilar to examples in other clusters.  Dissimilarity is assessed using the attribute values describing the examples.  Often, distance measures are used.

Slide 5 Clustering Note: you do not know which type each star is; the stars are unlabelled. You just use the information given in the attributes (or features) of each star

Slide 6 EE3J2 Data Mining Structure of data  Typical real data is not uniformly distributed  It has structure  Variables might be correlated  The data might be grouped into natural ‘clusters’  The purpose of cluster analysis is to find this underlying structure automatically

Slide 7 Data Structures Clustering algorithms typically operate on either:  Data matrix – represents n objects (a.k.a. examples, e.g. persons) with p variables (e.g. age, height, gender, etc.). It is an n examples × p variables matrix  Dissimilarity matrix – stores a collection of distances between examples; d(x, y) = the difference or dissimilarity between examples x and y. How can the dissimilarity d(x, y) be assessed?
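As an illustration of the two structures, here is a minimal Python sketch that builds a dissimilarity matrix from a small data matrix (the data values and function names are invented for the example):

```python
import math

# Toy data matrix: n = 4 examples (rows) x p = 2 variables (columns),
# e.g. (age, height) for four persons.
data = [
    [25, 1.75],
    [30, 1.80],
    [22, 1.60],
    [40, 1.90],
]

def euclidean(x, y):
    """Euclidean distance between two examples."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Dissimilarity matrix: entry [i][j] holds d(x_i, x_j).
# It is symmetric, with zeros on the diagonal.
n = len(data)
dissim = [[euclidean(data[i], data[j]) for j in range(n)] for i in range(n)]

for row in dissim:
    print(["%.3f" % d for d in row])
```

Note that the dissimilarity matrix is n × n rather than n × p: it records pairwise distances, not raw attribute values.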

Slide 8 EE3J2 Data Mining Clusters and centroids  In other words……  If we assume that the clusters are spherical, then they are determined by their centres  The cluster centres are called centroids  How many centroids do we need?  Where should we put them? (Figure: two centroids, with the distance d(x, y) marked between points x and y)

Slide 9 Measuring dissimilarity (or similarity)  To measure similarity, a distance function d is often used  It measures the "dissimilarity" between pairs of objects x and y A small distance d(x, y): objects x and y are more similar A large distance d(x, y): objects x and y are less similar

Slide 10 Properties of the distance function  A function d(x, y) defined on pairs of points x and y is called a distance (d) if it satisfies: d(x, y) ≥ 0: distance is a non-negative number d(x, x) = 0: the distance of an object to itself is 0 d(x, y) = d(y, x) for all points x and y (d is symmetric) d(x, z) ≤ d(x, y) + d(y, z) for all points x, y and z (this is called the triangle inequality)
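These four properties can be spot-checked numerically for the Euclidean distance. The sketch below is a sanity check on randomly generated sample points, not a proof (the point set and tolerances are invented for the example):

```python
import math
import itertools
import random

def d(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

random.seed(0)
points = [tuple(random.uniform(-5, 5) for _ in range(3)) for _ in range(10)]

# Check all four metric properties on every triple of sample points.
for x, y, z in itertools.product(points, repeat=3):
    assert d(x, y) >= 0                          # non-negativity
    assert d(x, x) == 0                          # identity: d(x, x) = 0
    assert abs(d(x, y) - d(y, x)) < 1e-12        # symmetry
    assert d(x, z) <= d(x, y) + d(y, z) + 1e-9   # triangle inequality
print("all four properties hold on the sample points")
```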

Slide 11 Euclidean Distance  The most popular distance measure is Euclidean distance.  If x = (x_1, x_2, …, x_N) and y = (y_1, y_2, …, y_N) then: d(x, y) = √((x_1 − y_1)² + (x_2 − y_2)² + … + (x_N − y_N)²)  This corresponds to the standard notion of distance in Euclidean space

Slide 12 EE3J2 Data Mining Distortion  Distortion is a measure of how well a set of centroids models a set of data  Suppose we have: data points y_1, y_2, …, y_T and centroids c_1, …, c_M  For each data point y_t, let c_i(t) be the closest centroid  In other words: d(y_t, c_i(t)) = min_m d(y_t, c_m)

Slide 13 EE3J2 Data Mining Distortion  The distortion for the centroid set C = c_1, …, c_M is defined by: Dist(C) = Σ_t d(y_t, c_i(t))  In other words, the distortion is the sum of the distances between each data point and its nearest centroid  The task of clustering is to find a centroid set C such that the distortion Dist(C) is minimised
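The distortion definition above can be sketched directly in Python. The data points and centroid positions below are made up for illustration; the example just shows that centroids near the natural groups give a smaller distortion than centroids far from them:

```python
import math

def dist(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distortion(points, centroids):
    """Dist(C): sum over all data points of the distance to the nearest centroid."""
    return sum(min(dist(y, c) for c in centroids) for y in points)

# Two natural groups of points, near (0, 0) and near (9.5, 9).
points = [(0.0, 0.0), (1.0, 0.0), (9.0, 9.0), (10.0, 9.0)]
good = [(0.5, 0.0), (9.5, 9.0)]   # centroids placed near the two groups
bad = [(5.0, 5.0), (5.0, 4.0)]    # centroids placed far from both groups

print(distortion(points, good))   # prints 2.0 (each point is 0.5 from its centroid)
print(distortion(points, bad))    # much larger
```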

Slide 14 The K-Means Clustering Method  Given k, the k-means algorithm is implemented in 4 steps: 1. Partition the objects into k non-empty subsets 2. Compute seed points as the centroids of the clusters of the current partition (the centroid is the centre, i.e. the mean point, of the cluster) 3. Assign each object to the cluster with the nearest seed point 4. Go back to Step 2; stop when no assignments change
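The four steps above can be sketched in plain Python as follows. This is a minimal illustration, not the lecture's own code: the toy data, the random initialisation, and all names are assumptions made for the example.

```python
import math
import random

def dist(x, y):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 1: arbitrarily choose k objects as the initial centroids.
    centroids = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: dist(p, centroids[j]))
            clusters[i].append(p)
        # Step 2: recompute each centroid as the mean point of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(v) / len(cluster) for v in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep the centroid of an empty cluster
        # Step 4: stop when no centroid moves (no assignments change).
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two well-separated groups of three points each.
points = [(0, 0), (1, 0), (0, 1), (9, 9), (10, 9), (9, 10)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
```

On this toy data the algorithm converges to one centroid per group, the mean of each group's three points.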

Slide 15 The K-Means Clustering Method Example

Slide 16 Let's watch an animation!

Slide 17 K-means Clustering  Suppose that we have decided how many centroids we need - denote this number by K  Suppose that we have an initial estimate of suitable positions for our K centroids  K-means clustering is an iterative procedure for moving these centroids to reduce distortion

Slide 18 K-means clustering - notation  Suppose there are T data points, denoted by: y_1, y_2, …, y_T  Suppose that the initial K centroids are denoted by: c⁰_1, …, c⁰_K  One iteration of K-means clustering will produce a new set of centroids c¹_1, …, c¹_K such that Dist(C¹) ≤ Dist(C⁰)

Slide 19 K-means clustering (1)  For each data point y_t let c⁰_i(t) be the closest centroid  In other words: d(y_t, c⁰_i(t)) = min_m d(y_t, c⁰_m)  Now, for each centroid c⁰_k define: Y⁰_k = { y_t : i(t) = k }  In other words, Y⁰_k is the set of data points which are closer to c⁰_k than to any other centroid

Slide 20 K-means clustering (2)  Now define a new k-th centroid c¹_k by: c¹_k = (1 / |Y⁰_k|) Σ_{y ∈ Y⁰_k} y where |Y⁰_k| is the number of samples in Y⁰_k  In other words, c¹_k is the average value of the samples which were closest to c⁰_k

Slide 21 K-means clustering (3)  Now repeat the same process starting with the new centroids c¹_1, …, c¹_K to create a new set of centroids c²_1, …, c²_K, and so on until the process converges  Each new set of centroids has distortion no larger than the previous set's (the distortion never increases)

Slide 22 Comments on the K-Means Method  Strength Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally k, t << n. Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms  Weakness Applicable only when a mean is defined; what about categorical data? Need to specify k, the number of clusters, in advance Unable to handle noisy data and outliers Not suitable for discovering clusters with non-convex shapes

Slide 23 Conclusions  Unsupervised Learning  Clustering  Distance metrics  k-means clustering algorithm

Slide 24 On Tuesday  Sequence Analysis

Slide 25 Some References and Acknowledgments  Data Mining: Concepts and Techniques. J. Han and M. Kamber  J. Han's slides, University of Illinois at Urbana-Champaign