Unsupervised Learning. Supervised learning vs. unsupervised learning.


Unsupervised Learning

Supervised learning vs. unsupervised learning

Adapted from Andrew Moore,

K-means clustering algorithm
Input: k (the number of clusters), D (the data set)
1. Choose k points as initial centroids (cluster centers).
2. Repeat the following until the stopping criterion is met:
   – For each data point x ∈ D, compute the distance from x to each centroid and assign x to the closest centroid.
   – Re-compute centroids as means of current cluster memberships.
Adapted from Bing Liu, UIC
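The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the slides: the function name `kmeans`, the `max_iter` cap, the random choice of initial centroids from the data, and the handling of empty clusters are all assumptions made for the example.

```python
import numpy as np

def kmeans(D, k, max_iter=100, seed=0):
    """Plain k-means on an (n, d) array D; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Choose k distinct data points as the initial centroids (an assumption;
    # the slides only say "choose k points").
    centroids = D[rng.choice(len(D), size=k, replace=False)].astype(float)
    labels = np.full(len(D), -1)
    for _ in range(max_iter):
        # Distance from every point to every centroid, shape (n, k).
        dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stopping criterion used here: no point changed cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Re-compute each centroid as the mean of its current members.
        for j in range(k):
            members = D[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two small, well-separated groups of points.
D = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
centroids, labels = kmeans(D, k=2)
```

Which group receives label 0 and which receives label 1 depends on the random initial centroids, so only the grouping itself is meaningful.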

Demo

Stopping/convergence criterion
1. no (or minimum) re-assignments of data points to different clusters,
2. no (or minimum) change of centroids, or
3. minimum decrease in the sum of squared error (SSE),

$$SSE = \sum_{j=1}^{k} \sum_{x \in C_j} dist(x, m_j)^2 \qquad (1)$$

where $C_j$ is the jth cluster, $m_j$ is the centroid of cluster $C_j$ (the mean vector of all the data points in $C_j$), and $dist(x, m_j)$ is the distance between data point x and centroid $m_j$.
Adapted from Bing Liu, UIC
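As a rough illustration of the third criterion, the sketch below computes the SSE for a given clustering and checks whether the decrease between two iterations is small. It assumes Euclidean distance; the names `sse` and `should_stop` and the tolerance `tol` are hypothetical choices for this example.

```python
import numpy as np

def sse(D, centroids, labels):
    """Sum of squared Euclidean distances from each point to its centroid."""
    diffs = D - centroids[labels]  # row i is x_i minus the centroid of its cluster
    return float((diffs ** 2).sum())

def should_stop(prev_sse, curr_sse, tol=1e-4):
    """True when the SSE decreased by less than tol (tol is an assumed threshold)."""
    return (prev_sse - curr_sse) < tol

D = np.array([[0, 0], [2, 0], [10, 0], [12, 0]], dtype=float)
centroids = np.array([[1, 0], [11, 0]], dtype=float)
labels = np.array([0, 0, 1, 1])
total = sse(D, centroids, labels)  # every point is at distance 1 from its centroid
```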

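Example distance functions
Let $x_i = (a_{i1}, \ldots, a_{in})$ and $x_j = (a_{j1}, \ldots, a_{jn})$.
– Euclidean distance: $dist(x_i, x_j) = \sqrt{\sum_{r=1}^{n} (a_{ir} - a_{jr})^2}$
– Manhattan (city block) distance: $dist(x_i, x_j) = \sum_{r=1}^{n} |a_{ir} - a_{jr}|$
CS583, Bing Liu, UIC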
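Both distance functions are short enough to write directly. The sketch below uses only the standard library; the function names `euclidean` and `manhattan` are chosen for this example.

```python
import math

def euclidean(xi, xj):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xi, xj)))

def manhattan(xi, xj):
    """Sum of the absolute coordinate differences (city block distance)."""
    return sum(abs(a - b) for a, b in zip(xi, xj))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7
```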

Distance function for text documents
A text document consists of a sequence of sentences, and each sentence consists of a sequence of words. To simplify, a document is usually considered a “bag” of words in document clustering.
– Sequence and position of words are ignored.
A document is represented with a vector just like a normal data point. The similarity between two documents is the cosine of the angle between their corresponding feature vectors; a distance can be derived from it, for example as 1 minus the cosine similarity.
Adapted from Bing Liu, UIC
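A small sketch of the bag-of-words idea follows. It assumes the simplest possible tokenization (lowercasing and splitting on whitespace) and raw word counts as vector entries; real systems usually add weighting such as TF-IDF, which the slide does not discuss. The function names are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Word-count vector; word order and position are discarded."""
    return Counter(text.lower().split())

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty documents
    return dot / (norm_a * norm_b)

d1 = bag_of_words("the cat sat")
d2 = bag_of_words("sat the cat")  # same words, different order
d3 = bag_of_words("dogs bark")    # no words in common with d1

same = cosine_similarity(d1, d2)
different = cosine_similarity(d1, d3)
```

Because word order is ignored, `d1` and `d2` have identical vectors and a similarity of 1 (up to floating-point rounding), while `d1` and `d3` share no words and have a similarity of 0.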

Example from Clustering Map of Biomedical Articles

Example: Image segmentation by k-means clustering by color
[Figures: segmentation results shown in pairs, K=5 and K=10, each in RGB space]
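Color-based segmentation treats every pixel as a 3-dimensional data point in RGB space, clusters the pixels, and then paints each pixel with the color of its centroid. The sketch below shows one way this might look; the function name `segment_by_color`, the fixed iteration count, and the tiny synthetic image are assumptions for illustration and are not taken from the original example.

```python
import numpy as np

def nearest_centroid(pixels, centroids):
    """Index of the closest centroid for each pixel (Euclidean distance in RGB)."""
    dists = np.linalg.norm(pixels[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def segment_by_color(image, k, n_iter=20, seed=0):
    """Cluster the pixels of an (h, w, 3) image into k colors with k-means."""
    h, w, _ = image.shape
    pixels = image.reshape(-1, 3).astype(float)
    rng = np.random.default_rng(seed)
    # Start from k randomly chosen pixels (an assumed initialization).
    centroids = pixels[rng.choice(len(pixels), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_centroid(pixels, centroids)
        for j in range(k):
            if np.any(labels == j):  # leave empty clusters where they are
                centroids[j] = pixels[labels == j].mean(axis=0)
    labels = nearest_centroid(pixels, centroids)
    # Replace every pixel by the color of its cluster centroid.
    segmented = centroids[labels].reshape(h, w, 3).round().astype(np.uint8)
    return segmented, labels.reshape(h, w)

# Synthetic 4x4 image: left half reddish, right half bluish.
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[:, :2] = (250, 10, 10)
img[:, 2:] = (10, 10, 250)
seg, lab = segment_by_color(img, k=2)
```

A larger K keeps more distinct colors in the output, which is the difference the K=5 and K=10 panels illustrate.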

Weaknesses of k-means
The algorithm is only applicable if the mean is defined.
– For categorical data, k-modes: the centroid is represented by the most frequent values.
The user needs to specify k.
The algorithm is sensitive to outliers.
– Outliers are data points that are very far away from other data points.
– Outliers could be errors in the data recording or some special data points with very different values.
k-means is sensitive to the initial random centroids.
Adapted from Bing Liu, UIC

Weaknesses of k-means: Problems with outliers
Adapted from Bing Liu, UIC

How to deal with outliers/noise in clustering?

Dealing with outliers
One method is to remove some data points in the clustering process that are much further away from the centroids than other data points.
– To be safe, we may want to monitor these possible outliers over a few iterations and then decide to remove them.
Another method is to perform random sampling. Since in sampling we only choose a small subset of the data points, the chance of selecting an outlier is very small.
– Assign the rest of the data points to the clusters by distance or similarity comparison, or classification.
Adapted from Bing Liu, UIC
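The two ideas above could be sketched as follows. The slides do not say how "much further away" should be measured, so the rule used here (a distance larger than `factor` times the median distance to the assigned centroid) is purely an assumption, as are the function names. The second helper shows the assignment step that follows clustering on a random sample.

```python
import numpy as np

def flag_outliers(D, centroids, labels, factor=3.0):
    """Mark points whose distance to their centroid exceeds factor * median distance."""
    dists = np.linalg.norm(D - centroids[labels], axis=1)
    threshold = factor * np.median(dists)  # assumed rule, not from the slides
    return dists > threshold

def assign_to_nearest(D, centroids):
    """Assign points (for example, those left out of a sample) to the closest centroid."""
    dists = np.linalg.norm(D[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

D = np.array([[0, 0], [0, 2], [2, 0], [2, 2], [20, 20]], dtype=float)
centroids = np.array([[1, 1]], dtype=float)
labels = np.zeros(len(D), dtype=int)
flags = flag_outliers(D, centroids, labels)

# Centroids found on a sample can then be used to label the remaining points.
rest = np.array([[0.5, 0.5], [9.0, 9.0]])
rest_labels = assign_to_nearest(rest, np.array([[1.0, 1.0], [10.0, 10.0]]))
```

In line with the slide's advice, flagged points could be tracked over several iterations before any of them is actually removed.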

Weaknesses of k-means (cont …)
The algorithm is sensitive to initial seeds.
Adapted from Bing Liu, UIC

Weaknesses of k-means (cont …)
If we use different seeds: good results.
There are some methods to help choose good seeds.
Adapted from Bing Liu, UIC
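The slides do not name a particular seeding method. One simple heuristic, shown here only as an illustration, is farthest-first selection: start from one point and repeatedly pick the point that is farthest from all seeds chosen so far. The function name and the `first` parameter are assumptions for this sketch; other approaches, such as running k-means several times with different random seeds and keeping the result with the lowest SSE, are also common.

```python
import numpy as np

def farthest_first_seeds(D, k, first=0):
    """Pick k seeds from D: each new seed is the point farthest from the chosen ones."""
    seeds = [first]
    # Distance from each point to its nearest chosen seed so far.
    nearest = np.linalg.norm(D - D[first], axis=1)
    while len(seeds) < k:
        nxt = int(nearest.argmax())
        seeds.append(nxt)
        nearest = np.minimum(nearest, np.linalg.norm(D - D[nxt], axis=1))
    return D[seeds]

D = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
seeds = farthest_first_seeds(D, k=2)
```

Because this heuristic always reaches for the most distant point, it tends to select outliers as seeds, so it interacts with the outlier weakness discussed earlier.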

Weaknesses of k-means (cont …)
The k-means algorithm is not suitable for discovering clusters that are not hyper-ellipsoids (or hyper-spheres).
Adapted from Bing Liu, UIC

k-means summary
Despite its weaknesses, k-means is still the most popular clustering algorithm because of its simplicity and efficiency, and because other clustering algorithms have their own lists of weaknesses.
There is no clear evidence that any other clustering algorithm performs better in general, although some may be more suitable for specific types of data or applications.
Adapted from Bing Liu, UIC