Unsupervised Learning
Compiled by: Raj Gaurang Tiwari, Assistant Professor, SRMGPC, Lucknow

CS583, Bing Liu, UIC
Supervised learning vs. unsupervised learning
Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute. These patterns are then used to predict the values of the target attribute in future data instances.
Unsupervised learning: the data have no target attribute. We want to explore the data to find some intrinsic structure in them.

Unsupervised Learning
Unsupervised algorithms aim to create groups, or subsets, of the data in which points belonging to a cluster are as similar to each other as possible, while the differences between clusters are as large as possible. As a simple example, imagine clustering customers by their demographics: the algorithm may help you discover distinct groups of customers by region, age range, gender, and other attributes, in such a way that you can develop targeted marketing programs.

What is Clustering?
Clustering is a technique for finding similarity groups in data, called clusters. That is, it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different from (far away from) each other into different clusters. In other words, it is the process of grouping a set of patterns into classes of similar objects: patterns within a cluster should be similar, and patterns from different clusters should be dissimilar.

Hard vs. soft clustering
Hard clustering: each pattern belongs to exactly one cluster. This is the more common case and easier to do.
Soft clustering: a pattern can belong to more than one cluster.

Aspects of clustering
A clustering algorithm: partitional clustering, hierarchical clustering, etc.
A distance (similarity, or dissimilarity) function.
Clustering quality: inter-cluster distance should be maximized, intra-cluster distance should be minimized.
The quality of a clustering result depends on the algorithm, the distance function, and the application.
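The two quality criteria above can be made concrete with a small sketch (an illustration, not from the slides; the "average distance to the centroid" and "distance between centroids" definitions are one common choice among several):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def centroid(cluster):
    n = len(cluster)
    return tuple(sum(p[i] for p in cluster) / n for i in range(len(cluster[0])))

# Two toy clusters in 2-D
c1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
c2 = [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)]

m1, m2 = centroid(c1), centroid(c2)

# Intra-cluster distance: average distance of points to their own centroid
intra = sum(euclidean(p, m1) for p in c1) / len(c1)
# Inter-cluster distance: distance between the two centroids
inter = euclidean(m1, m2)

# A good clustering separates clusters far more than it spreads them
print(intra < inter)  # True
```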

How do we define “similarity”? Recall that the goal is to group together “similar” data – but what does this mean? No single answer – it depends on what we want to find or emphasize in the data; this is one reason why clustering is an “art” The similarity measure is often more important than the clustering algorithm used – don’t overlook this choice!

(Dis)similarity measures Instead of talking about similarity measures, we often equivalently refer to dissimilarity measures (I’ll give an example of how to convert between them in a few slides…) Jagota defines a dissimilarity measure as a function f(x,y) such that f(x,y) > f(w,z) if and only if x is less similar to y than w is to z This is always a pair-wise measure Think of x, y, w, and z as gene expression profiles (rows or columns)
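The promised conversion between the two can be sketched as follows. The specific rule d = 1 - s (for similarities scaled to [0, 1]) is one common convention, assumed here rather than taken from the slides:

```python
def sim_to_dissim(s):
    """Convert a similarity in [0, 1] to a dissimilarity in [0, 1].

    Convention assumed here: identical objects (s = 1) get dissimilarity 0,
    and maximally different objects (s = 0) get dissimilarity 1.
    """
    if not 0.0 <= s <= 1.0:
        raise ValueError("similarity must lie in [0, 1]")
    return 1.0 - s

# The defining property of a dissimilarity measure holds after conversion:
# higher similarity implies lower dissimilarity.
assert sim_to_dissim(0.9) < sim_to_dissim(0.2)
```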

Euclidean distance
d(x, y) = sqrt( (x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2 )
Here n is the number of dimensions in the data vector.
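This distance can be computed directly in Python, with x and y as plain lists of n numbers:

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two n-dimensional vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

print(euclidean_distance([0, 0], [3, 4]))  # 5.0
```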

Desirable Properties of a Clustering Algorithm
- Ability to deal with different data types
- Minimal requirements for domain knowledge to determine input parameters
- Ability to deal with noise and outliers
- Insensitivity to the order of input records
- Incorporation of user-specified constraints
- Interpretability and usability

K-means clustering
K-means is a partitional clustering algorithm. Let the set of data points (or instances) D be {x1, x2, ..., xn}, where xi = (xi1, xi2, ..., xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data. The k-means algorithm partitions the given data into k clusters. Each cluster has a cluster center, called the centroid. k is specified by the user.

K-means algorithm
Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to step 2.
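The four steps above can be sketched in plain Python. This is a minimal illustration using Euclidean distance and random seeds; in practice you would use a library implementation such as scikit-learn's KMeans:

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=0):
    """Minimal k-means sketch: returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # step 1: random seeds
    assignments = [0] * len(points)
    for _ in range(max_iters):                     # step 4: loop until converged
        new_assignments = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points                        # step 2: closest centroid
        ]
        if new_assignments == assignments:         # convergence: no re-assignment
            break
        assignments = new_assignments
        for j in range(k):                         # step 3: recompute centroids
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                centroids[j] = tuple(
                    sum(c) / len(members) for c in zip(*members)
                )
    return centroids, assignments

# Two well-separated toy clusters
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centroids, assignments = kmeans(points, k=2)
```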

Stopping/convergence criterion
1) No (or minimal) re-assignment of data points to different clusters,
2) no (or minimal) change of centroids, or
3) a fixed number of iterations has been reached.

Strengths and Weaknesses
Strengths:
Simple: easy to understand and to implement.
Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.

Weaknesses
The user needs to specify k.
The algorithm is sensitive to outliers. Outliers are data points that are very far away from other data points; they could be errors in the data recording or special data points with very different values.
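The sensitivity to outliers is easy to see in the centroid itself, since the mean is dragged toward any far-away point (a toy 1-D illustration, not from the slides):

```python
def mean(values):
    """Arithmetic mean -- this is exactly how a k-means centroid is computed."""
    return sum(values) / len(values)

cluster = [1.0, 2.0, 3.0, 2.0, 1.0]   # a tight 1-D cluster, mean 1.8
with_outlier = cluster + [100.0]      # one recording error / extreme value

print(mean(cluster))       # 1.8
print(mean(with_outlier))  # about 18.17: the centroid lands far from every real cluster point
```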

Hierarchical Clustering
Produces a nested sequence of clusters, a tree, also called a dendrogram.

Hierarchical Clustering (cont.)
This produces a binary tree, or dendrogram. The final cluster is the root and each data item is a leaf. The height of the bars indicates how close the items are.

Types of hierarchical clustering
Agglomerative (bottom-up) clustering: builds the dendrogram (tree) from the bottom level, merging the most similar (or nearest) pair of clusters at each step; it stops when all the data points are merged into a single cluster (i.e., the root cluster).
Divisive (top-down) clustering: starts with all data points in one cluster, the root; splits the root into a set of child clusters, and each child cluster is recursively divided further; it stops when only singleton clusters of individual data points remain, i.e., each cluster contains only a single point.

Agglomerative clustering
It is more popular than divisive methods. At the beginning, each data point forms a cluster (also called a node). Merge the nodes/clusters that have the least distance, and go on merging; eventually all nodes belong to one cluster.

Agglomerative clustering algorithm
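The merge loop described above can be sketched in Python. This is a minimal single-linkage illustration (single linkage is one assumed choice of cluster distance; real implementations such as scipy.cluster.hierarchy.linkage are far more efficient):

```python
import math

def single_linkage_distance(c1, c2):
    """Distance between two clusters = distance of their closest pair of points."""
    return min(math.dist(p, q) for p in c1 for q in c2)

def agglomerative(points, target_clusters=1):
    """Minimal bottom-up clustering sketch.

    Starts with each point as its own cluster and repeatedly merges the
    pair of clusters with the least distance, recording each merge
    (the merge order is what a dendrogram visualizes).
    """
    clusters = [[p] for p in points]   # each data point forms a cluster
    merges = []
    while len(clusters) > target_clusters:
        # find the closest pair of clusters
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: single_linkage_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]
    return clusters, merges

# Two obvious groups: the loop should merge within groups first
points = [(0, 0), (0, 1), (5, 5), (5, 6)]
clusters, merges = agglomerative(points, target_clusters=2)
```

Running the loop all the way to target_clusters=1 reproduces the root cluster described in the divisive/agglomerative slide above.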