Data Mining CSCI 307, Spring 2019 Lecture 24

Data Mining CSCI 307, Spring 2019, Lecture 24: Clustering

Clustering

Clustering techniques apply when there is no class to be predicted. The aim is to divide the instances into "natural" groups. As we've seen, clusters can be:
  - disjoint vs. overlapping
  - deterministic vs. probabilistic
  - flat vs. hierarchical

We look at a classic clustering algorithm: k-means. The clusters k-means produces are disjoint, deterministic, and flat.

For contrast, a probabilistic clustering assigns each instance a membership probability for every cluster, for example:

         Cluster 1   Cluster 2   Cluster 3
    a       0.4         0.1         0.5
    b       0.1         0.8         0.1

(A hierarchical clustering instead organizes the clusters into a tree.)

The k-means algorithm

How to choose k (i.e., the number of clusters) is saved for Chapter 5, after we learn how to evaluate the success of machine learning.

To cluster the data into k groups (k is a predefined number of clusters):
  0. Choose k cluster centers (at random).
  1. Assign each instance to the cluster whose center is closest.
  2. Compute the centroids (i.e., the mean of all the instances in each cluster); these become the new cluster centers.
  3. Go to step 1; repeat until convergence (the assignments no longer change).
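The following is a minimal NumPy sketch of these steps (Lloyd's algorithm), assuming Euclidean distance; the function and variable names are illustrative, not taken from the textbook:

    import numpy as np

    def kmeans(X, k, max_iters=100, seed=0):
        """Basic k-means on an (n, d) array X. Returns (centers, labels)."""
        rng = np.random.default_rng(seed)
        # Step 0: choose k cluster centers at random (here: k distinct instances)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iters):
            # Step 1: assign each instance to its nearest center
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Step 2: recompute each centroid as the mean of its instances
            # (keep the old center if a cluster ends up empty)
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            # Step 3: stop once the centers no longer move
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return centers, labels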

Discussion

The algorithm minimizes the total squared distance from the instances to their cluster centers. The result can vary significantly based on the initial choice of seeds, and the algorithm can get trapped in a local minimum; the figure on the next slide shows an example of unlucky initial cluster centers.

To increase the chance of finding the global optimum:
  - restart with different random seeds and keep the best result
  - use k-means++, a procedure for finding better seeds and ultimately improving the clustering (sketched below)
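A sketch of k-means++ seeding (Arthur & Vassilvitskii, 2007), again assuming NumPy and Euclidean distance; the helper name kmeans_pp_seeds is my own:

    import numpy as np

    def kmeans_pp_seeds(X, k, seed=0):
        """Pick the first center uniformly at random; pick each subsequent
        center with probability proportional to its squared distance from
        the nearest center chosen so far."""
        rng = np.random.default_rng(seed)
        centers = [X[rng.integers(len(X))]]
        for _ in range(k - 1):
            # squared distance from each instance to its nearest chosen center
            d2 = ((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
            centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
        return np.array(centers)

These seeds would then replace the random initialization in step 0 of the k-means loop above.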

[Figure: a set of instances shown with two different choices of initial cluster centers, illustrating how an unlucky initialization leads k-means to a local minimum.]

Need ==> Faster Distance Calculations

Use kD-trees or ball trees to speed up the process of finding the closest cluster center:
  - First, build the tree for all the data points; it remains static across iterations.
  - At each node, store the number of instances and the sum of all instances below it.
  - In each iteration, descend the tree and find which cluster each node belongs to.
  - Stop descending as soon as a node is found to belong entirely to a particular cluster.
  - Use the statistics stored at the nodes to compute the new cluster centers. (A simplified tree-based sketch follows.)
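A full implementation of this pruning scheme is more than a quick sketch, but the core idea of tree-accelerated nearest-center lookup can be illustrated with SciPy's cKDTree. Note the simplification: this builds a small tree over the k centers each iteration, rather than the static tree over the data points (with per-node counts and sums) that the slide describes:

    import numpy as np
    from scipy.spatial import cKDTree

    def assign_with_kdtree(X, centers):
        """Nearest-center assignment for all instances in roughly O(n log k)
        time, instead of the O(n k) brute-force distance matrix used in the
        earlier kmeans() sketch."""
        tree = cKDTree(centers)        # small tree over the k centers
        _, labels = tree.query(X)      # index of the nearest center per instance
        return labels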

[Figure: a kD-tree whose "frontier" nodes each lie entirely within a single cluster. No nodes need to be examined below the frontier.]

Choosing the number of clusters

Big question in practice: what is the right number of clusters, i.e., what is the right value for k?

We cannot simply optimize the squared distance on the training data to choose k, because the squared distance decreases monotonically as k increases. We need some measure that balances distance against the complexity of the model, e.g., one based on the MDL principle (covered later).

Finding the right-size model using MDL becomes easier when applying a recursive version of k-means (bisecting k-means):
  - Compute A: the information required to store the data centroid and the location of each instance with respect to this centroid.
  - Split the data into two clusters using 2-means.
  - Compute B: the information required to store the two new cluster centroids and the location of each instance with respect to these two.
  - If A > B, split the data and recurse (just like in other tree learners). A sketch of this recursion follows.
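A sketch of the bisecting recursion, reusing the kmeans() function from the earlier sketch. The encoding_cost() below is a crude, clearly assumed stand-in for the MDL cost (bits for the centroid plus bits that grow with the average spread of the residuals); it is not the textbook's exact formula:

    import numpy as np

    def encoding_cost(X, center):
        """Illustrative proxy for 'information to store the centroid and each
        instance's location relative to it' -- NOT the book's MDL formula."""
        n, d = X.shape
        sse = ((X - np.asarray(center)) ** 2).sum()
        centroid_bits = 32 * d                       # e.g., one 32-bit float per coordinate
        residual_bits = n * np.log2(sse / n + 1.0)   # grows with the cluster's spread
        return centroid_bits + residual_bits

    def bisecting_kmeans(X, min_size=4):
        """Split a cluster with 2-means whenever encoding the two halves (B)
        is cheaper than encoding the whole (A); otherwise keep it intact."""
        A = encoding_cost(X, X.mean(axis=0))
        if len(X) < min_size:
            return [X]
        centers, labels = kmeans(X, 2)               # split the data with 2-means
        halves = [X[labels == 0], X[labels == 1]]
        if any(len(h) == 0 for h in halves):
            return [X]                               # degenerate split; keep whole
        B = sum(encoding_cost(h, c) for h, c in zip(halves, centers))
        if A > B:                                    # splitting is cheaper: recurse
            return (bisecting_kmeans(halves[0], min_size)
                    + bisecting_kmeans(halves[1], min_size))
        return [X]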