Unsupervised Learning

CS583, Bing Liu, UIC
Supervised learning vs. unsupervised learning
Supervised learning: discover patterns in the data that relate data attributes with a target (class) attribute. These patterns are then used to predict the values of the target attribute in future data instances.
Unsupervised learning: the data have no target attribute. We want to explore the data to find some intrinsic structures in them.

Clustering
Clustering is a technique for finding similarity groups in data, called clusters. That is, it groups data instances that are similar to (near) each other into one cluster, and data instances that are very different from (far away from) each other into different clusters.
Clustering is often called an unsupervised learning task, as no class values denoting an a priori grouping of the data instances are given, as is the case in supervised learning.
For historical reasons, clustering is often considered synonymous with unsupervised learning. In fact, association rule mining is also unsupervised.
This lecture focuses on clustering.

An illustration
The data set has three natural groups of data points, i.e., 3 natural clusters.

What is clustering for?
Let us see some real-life examples.
Example 1: group people of similar sizes together to make "small", "medium" and "large" T-shirts.
- Tailor-made for each person: too expensive.
- One-size-fits-all: does not fit all.
Example 2: in marketing, segment customers according to their similarities, to do targeted marketing.

What is clustering for? (cont…)
Example 3: given a collection of text documents, we want to organize them according to their content similarities, to produce a topic hierarchy.
In fact, clustering is one of the most utilized data mining techniques. It has a long history and has been used in almost every field, e.g., medicine, psychology, botany, sociology, biology, archeology, marketing, insurance, libraries, etc. In recent years, due to the rapid increase of online documents, text clustering has become important.

Aspects of clustering
A clustering algorithm:
- Partitional clustering
- Hierarchical clustering
- …
A distance (similarity, or dissimilarity) function.
Clustering quality:
- Inter-cluster distance: maximized
- Intra-cluster distance: minimized
The quality of a clustering result depends on the algorithm, the distance function, and the application.

K-means clustering
K-means is a partitional clustering algorithm.
Let the set of data points (or instances) D be {x1, x2, …, xn}, where xi = (xi1, xi2, …, xir) is a vector in a real-valued space X ⊆ R^r, and r is the number of attributes (dimensions) in the data.
The k-means algorithm partitions the given data into k clusters:
- Each cluster has a cluster center, called the centroid.
- k is specified by the user.

K-means algorithm
Given k, the k-means algorithm works as follows:
1) Randomly choose k data points (seeds) to be the initial centroids (cluster centers).
2) Assign each data point to the closest centroid.
3) Re-compute the centroids using the current cluster memberships.
4) If a convergence criterion is not met, go to 2).
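The four steps can be sketched in plain Python. This is a minimal illustration, not the lecture's own code; the function names (`kmeans`, `dist2`) and the choice of "no re-assignments" as the convergence test are my own.

```python
import random

def kmeans(points, k, max_iter=50, seed=0):
    """Partition `points` (lists of floats) into k clusters.
    Returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 1: randomly choose k data points as the initial centroids.
    centroids = [p[:] for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        # Step 2: assign each data point to the closest centroid.
        new_assignments = [
            min(range(k), key=lambda j: dist2(p, centroids[j]))
            for p in points
        ]
        # Step 4 (criterion): stop when no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: re-compute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, assignments

def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))
```

For example, clustering the four points (0,0), (0,1), (10,10), (10,11) with k = 2 separates the two tight pairs into two clusters.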

K-means algorithm – (cont …)

Stopping/convergence criterion
1. No (or minimal) re-assignment of data points to different clusters,
2. no (or minimal) change of centroids, or
3. minimal decrease in the sum of squared error (SSE):

   SSE = Σ_{j=1}^{k} Σ_{x ∈ C_j} dist(x, m_j)²   (1)

where C_j is the jth cluster, m_j is the centroid of cluster C_j (the mean vector of all the data points in C_j), and dist(x, m_j) is the distance between data point x and centroid m_j.
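Equation (1) can be evaluated directly from a clustering. A small sketch (the function name and the list-based representation of clusters are my own; Euclidean distance is assumed):

```python
def sse(points, centroids, assignments):
    """Sum of squared errors, Eq. (1): for each point x assigned to
    cluster j, add the squared Euclidean distance dist(x, m_j)^2."""
    total = 0.0
    for p, j in zip(points, assignments):
        total += sum((a - b) ** 2 for a, b in zip(p, centroids[j]))
    return total
```

Monitoring this value across iterations gives criterion 3: stop when SSE barely decreases.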

An example

An example (cont …)

An example distance function
A commonly used example is the Euclidean distance: dist(xi, xj) = sqrt((xi1 − xj1)² + (xi2 − xj2)² + … + (xir − xjr)²).
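The Euclidean distance is a one-liner in Python (a sketch; the function name is my own):

```python
def euclidean(x, y):
    """dist(x, y) = sqrt(sum over attributes i of (x_i - y_i)^2)."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5
```

For instance, the distance between (0, 0) and (3, 4) is 5.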

A disk version of k-means
K-means can be implemented with the data on disk:
- In each iteration, it scans the data once,
- as the centroids can be computed incrementally.
It can be used to cluster large data sets that do not fit in main memory.
We need to control the number of iterations: in practice, a limit is set (e.g., < 50).
This is not the best method; there are other scale-up algorithms, e.g., BIRCH.
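The key idea, computing centroids incrementally, means one iteration only needs per-cluster running sums and counts, so a single sequential scan of the (possibly disk-resident) data suffices. A sketch of one such scan (function name and representation are my own assumptions):

```python
def centroids_one_scan(stream, centroids):
    """One pass over a data stream: accumulate per-cluster sums and
    counts, then return the re-computed centroids. The data never
    needs to be held in memory, only the k sums and counts."""
    k = len(centroids)
    r = len(centroids[0])
    sums = [[0.0] * r for _ in range(k)]
    counts = [0] * k
    for p in stream:  # `stream` could be rows read from a file on disk
        # Assign p to its closest centroid (squared Euclidean distance).
        j = min(range(k),
                key=lambda c: sum((a - b) ** 2
                                  for a, b in zip(p, centroids[c])))
        counts[j] += 1
        for i, a in enumerate(p):
            sums[j][i] += a
    # Mean per cluster; keep the old centroid if a cluster received no points.
    return [[s / counts[j] for s in sums[j]] if counts[j] else centroids[j][:]
            for j in range(k)]
```

Each k-means iteration then becomes one call to this function, i.e., one scan of the data.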

A disk version of k-means (cont …)

Strengths of k-means
Strengths:
- Simple: easy to understand and to implement.
- Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations.
- Since both k and t are usually small, k-means is considered a linear algorithm.
K-means is the most popular clustering algorithm.
Note that it terminates at a local optimum if SSE is used; the global optimum is hard to find due to complexity.

Weaknesses of k-means
The algorithm is only applicable if the mean is defined.
- For categorical data, k-modes can be used: the centroid is represented by the most frequent value of each attribute.
The user needs to specify k.
The algorithm is sensitive to outliers:
- Outliers are data points that are very far away from other data points.
- Outliers could be errors in the data recording or some special data points with very different values.
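To make the k-modes idea concrete, the "most frequent value per attribute" centroid of a categorical cluster can be computed like this (a sketch; the function name is my own, and ties are broken arbitrarily):

```python
from collections import Counter

def mode_centroid(members):
    """Centroid of a cluster of categorical records: for each
    attribute (column), take the most frequent value (the mode),
    as in the k-modes variant of k-means."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*members)]
```

For the records ("red","S"), ("red","M"), ("blue","M") the centroid is ("red","M"), since "red" and "M" are the most frequent values of their attributes.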

Weaknesses of k-means: Problems with outliers

Weaknesses of k-means: dealing with outliers
One method is to remove, during the clustering process, data points that are much farther away from the centroids than other data points.
- To be safe, we may want to monitor these possible outliers over a few iterations and only then decide to remove them.
Another method is to perform random sampling: since sampling chooses only a small subset of the data points, the chance of selecting an outlier is very small.
- Assign the remaining data points to the clusters by distance or similarity comparison, or by classification.
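The first method needs some rule for "much farther away than other data points". One simple possibility is to flag points whose distance to their centroid exceeds a multiple of the average such distance; the slide does not prescribe a threshold, so both the function and the `factor` cutoff below are illustrative assumptions:

```python
def flag_outliers(points, centroids, assignments, factor=3.0):
    """Flag points whose distance to their own centroid is more than
    `factor` times the average point-to-centroid distance. These are
    the candidate outliers to monitor over a few iterations."""
    d = [sum((a - b) ** 2 for a, b in zip(p, centroids[j])) ** 0.5
         for p, j in zip(points, assignments)]
    avg = sum(d) / len(d)
    return [dist > factor * avg for dist in d]
```

Points flagged repeatedly across iterations would then be candidates for removal.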

Weaknesses of k-means (cont …)
The algorithm is sensitive to initial seeds.

Weaknesses of k-means (cont …)
If we use different seeds, we may get good results.
There are methods to help choose good seeds, e.g., selecting initial centroids that are far apart from each other.