CSC 4510 – Machine Learning Dr. Mary-Angela Papalaskari Department of Computing Sciences Villanova University Course website: www.csc.villanova.edu/~map/4510/

11: Unsupervised Learning – Clustering

Some of the slides in this presentation are adapted from: Prof. Frank Klassner's ML class at Villanova, the University of Manchester ML course, and the Stanford online ML course.

Supervised learning: training set of labeled examples. [figure from the Stanford online ML course]

Unsupervised learning: training set with no labels. [figure from the Stanford online ML course]

Unsupervised Learning
Learning “what normally happens”; no output labels.
Clustering: grouping similar instances.
Example applications:
– Customer segmentation
– Image compression: color quantization
– Bioinformatics: learning motifs

Clustering Algorithms
K-means
Hierarchical – bottom-up or top-down
Probabilistic – Expectation Maximization (EM)

Clustering Algorithms
Partitioning method: construct a partition of n examples into a set of K clusters.
Given: a set of examples and the number K.
Find: the partition into K clusters that optimizes the chosen partitioning criterion.
– Globally optimal: exhaustively enumerate all partitions.
– Effective heuristic method: the K-means algorithm.
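To see why exhaustive enumeration is hopeless (a count not shown on the slide): the number of ways to partition n examples into K non-empty clusters is the Stirling number of the second kind,

S(n, K) = \frac{1}{K!} \sum_{j=0}^{K} (-1)^j \binom{K}{j} (K - j)^n \approx \frac{K^n}{K!},

which grows exponentially in n; already S(25, 3) exceeds 10^{11}.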

K-Means
Assumes instances are real-valued vectors.
Clusters are represented by centroids: the center of gravity (mean) of the points in a cluster c.
Reassignment of instances to clusters is based on distance to the current cluster centroids.
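In symbols (matching the update s_j = \mu(c_j) on the algorithm slide below), the centroid of a cluster c is simply the mean of its members:

\mu(c) = \frac{1}{|c|} \sum_{x \in c} x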

K-means intuition
Randomly choose k points as seeds, one per cluster.
Form initial clusters based on these seeds.
Iterate: repeatedly reallocate instances to the nearest centroid and recompute the cluster centroids, improving the overall clustering.
Stop when the clustering converges or after a fixed number of iterations.

[Slides 9–17: step-by-step illustrations of K-means iterations, from the Stanford online ML course]

K-Means Algorithm
Let d be the distance measure between instances.
Select k random points {s1, s2, …, sk} as seeds.
Until the clustering converges (or another stopping criterion is met):
– For each instance xi: assign xi to the cluster cj such that d(xi, sj) is minimal.
– Update the seeds to the centroid of each cluster: for each cluster cj, sj = μ(cj).
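A minimal NumPy sketch of the algorithm above (the function name kmeans, the convergence test, and the empty-cluster handling are choices made here, not taken from the slides):

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=None):
    """Cluster the rows of X into k groups using Euclidean distance as d."""
    rng = np.random.default_rng(seed)
    # Select k random instances as the initial seeds (centroids).
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    assignments = None
    for _ in range(max_iters):
        # Assign each instance x_i to the cluster c_j with minimal d(x_i, s_j).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_assignments = dists.argmin(axis=1)
        if assignments is not None and np.array_equal(new_assignments, assignments):
            break  # converged: no instance changed cluster
        assignments = new_assignments
        # Update each seed to the centroid of its cluster: s_j = mu(c_j).
        for j in range(k):
            members = X[assignments == j]
            if len(members) > 0:  # keep the old centroid if a cluster becomes empty
                centroids[j] = members.mean(axis=0)
    return assignments, centroids
```

Calling kmeans(X, k=3) returns a cluster label for every row of X together with the three final centroids.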

Distance measures
– Euclidean distance
– Manhattan distance
– Hamming distance
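The slide names the measures only; for reference, their standard definitions for two vectors x and y of length n (Hamming is normally used for discrete or binary attributes) are:

d_{Euclidean}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}
d_{Manhattan}(x, y) = \sum_{i=1}^{n} |x_i - y_i|
d_{Hamming}(x, y) = |\{\, i : x_i \neq y_i \,\}|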

Orange schema [workflow screenshot]

Clusters aren’t always separated…

K-means for non-separated clusters
T-shirt sizing [scatter plot of Weight vs. Height, from the Stanford online ML course]
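For data of this kind, a quick way to try k-means in practice is scikit-learn's KMeans (a sketch with invented height/weight values; scikit-learn and these numbers are not part of the slides):

```python
import numpy as np
from sklearn.cluster import KMeans

# Invented (height cm, weight kg) measurements, purely for illustration.
X = np.array([[150, 50], [158, 55], [160, 58],
              [168, 68], [170, 70], [175, 74],
              [185, 88], [188, 90], [190, 95]], dtype=float)

# k = 3 clusters, e.g. one per T-shirt size (S / M / L).
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(model.labels_)           # cluster assignment for each person
print(model.cluster_centers_)  # mean (height, weight) of each cluster
```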

Weaknesses of k-means
The algorithm is only applicable to numeric data.
The user needs to specify k.
The algorithm is sensitive to outliers.
– Outliers are data points that are very far away from the other data points.
– Outliers could be errors in data recording, or special data points with very different values.
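As a one-line illustration of the outlier problem (numbers invented): if a cluster contains nine points at x = 0 and a single outlier at x = 1000, its centroid is pulled to (9 · 0 + 1000) / 10 = 100, far from where almost all of the cluster's points actually lie.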

Strengths of k-means
– Simple: easy to understand and to implement.
– Efficient: time complexity is O(tkn), where n is the number of data points, k is the number of clusters, and t is the number of iterations. Since both k and t are usually small, k-means is considered a (roughly) linear algorithm.
K-means is the most popular clustering algorithm.