IAT 355 Clustering


IAT 355 Clustering | SCHOOL OF INTERACTIVE ARTS + TECHNOLOGY [SIAT] | WWW.SIAT.SFU.CA

Lecture outline
- Distance/similarity between data objects
- Data objects as geometric data points
- Clustering problems and algorithms: k-means, k-median

What is clustering?
A grouping of data objects such that the objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups.
- Inter-cluster distances are maximized
- Intra-cluster distances are minimized

Outliers
Outliers are objects that do not belong to any cluster, or that form clusters of very small cardinality. In some applications we are interested in discovering the outliers rather than the clusters (outlier analysis).

What is clustering?
Clustering: the process of grouping a set of objects into classes of similar objects.
- Documents within a cluster should be similar.
- Documents from different clusters should be dissimilar.
The commonest form of unsupervised learning. Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of the examples is given. A common and important task that finds many applications in information retrieval and other places.

Why do we cluster?
Clustering: given a collection of data objects, group them so that they are
- similar to one another within the same cluster
- dissimilar to the objects in other clusters
Clustering results are used:
- As a stand-alone tool to get insight into the data distribution; visualizing clusters may unveil important information
- As a preprocessing step for other algorithms; efficient indexing or compression often relies on clustering

Applications of clustering
- Image processing: cluster images based on their visual content
- Web: cluster groups of users based on their access patterns on webpages; cluster webpages based on their content
- Bioinformatics: cluster similar proteins together (similarity w.r.t. chemical structure and/or functionality, etc.)
- Many more…

The Yahoo! hierarchy isn't clustering, but it is the kind of output you want from clustering. (Figure: a sample of the www.yahoo.com/Science directory tree, with branches such as agriculture, biology, physics, CS, and space, and leaves such as dairy, botany, cell, AI, courses, crops, craft, and magnetism.) Final data set: the Yahoo Science hierarchy, consisting of the text of web pages pointed to by Yahoo, gathered in the summer of 1997; 264 classes, only a sample shown here.

The clustering task
Group the observations so that observations belonging to the same group are similar, whereas observations in different groups are different.
Basic questions:
- What does "similar" mean?
- What is a good partition of the objects? I.e., how is the quality of a solution measured?
- How do we find a good partition of the observations?

Observations to cluster
- Real-valued attributes/variables, e.g., salary, height
- Binary attributes, e.g., gender (M/F), has_cancer (T/F)
- Nominal (categorical) attributes, e.g., religion (Christian, Muslim, Buddhist, Hindu, etc.)
- Ordinal/ranked attributes, e.g., military rank (soldier, sergeant, lieutenant, captain, etc.)
- Variables of mixed types: multiple attributes of various types

Observations to cluster
Usually data objects consist of a set of attributes, also known as dimensions (e.g., J. Smith, 20, 200K).
- If all d dimensions are real-valued, we can visualize each data object as a point in a d-dimensional space.
- If all d dimensions are binary, we can think of each data object as a binary vector.

Distance functions
The distance d(x, y) between two objects x and y is a metric if:
- d(x, y) ≥ 0 (non-negativity)
- d(x, x) = 0 (isolation)
- d(x, y) = d(y, x) (symmetry)
- d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality) [Why do we need it?]
The definitions of distance functions are usually different for real, boolean, categorical, and ordinal variables. Weights may be associated with different variables based on the application and the data semantics.

Data structures
Two standard representations: a data matrix (n cases/objects as rows, d attributes/dimensions as columns) and a distance matrix (n × n, holding the pairwise distances between objects).
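To make the two structures concrete, here is a small numpy sketch (the data values are made up for illustration) that builds a data matrix and derives the corresponding distance matrix:

```python
import numpy as np

# Data matrix: n objects (rows) x d attributes (columns).
# Hypothetical values: [age, salary in $K] for three people.
X = np.array([[20.0, 200.0],
              [35.0, 120.0],
              [50.0,  90.0]])

# Distance matrix: n x n, entry (i, j) = Euclidean distance
# between object i and object j. Symmetric, zero diagonal.
diff = X[:, None, :] - X[None, :, :]   # shape (n, n, d)
D = np.sqrt((diff ** 2).sum(axis=-1))  # shape (n, n)

print(D)  # D[i, i] == 0 and D[i, j] == D[j, i]
```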

Distance functions for binary vectors
Jaccard similarity between binary vectors X and Y: JSim(X, Y) = |X ∩ Y| / |X ∪ Y|, the number of attributes where both are 1 divided by the number of attributes where at least one is 1.
Jaccard distance between binary vectors X and Y: Jdist(X, Y) = 1 − JSim(X, Y).
Example (the slide's table of X and Y over attributes Q1–Q6): JSim = 1/6, Jdist = 5/6.
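A minimal Python sketch of both measures; the slide's exact X and Y rows did not survive extraction, so the vectors below are illustrative ones chosen to reproduce JSim = 1/6:

```python
def jaccard_sim(x, y):
    """Jaccard similarity: |intersection| / |union| of the 1-attributes."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0  # convention: two all-zero vectors are identical
    return len(x & y) / len(x | y)

def jaccard_dist(x, y):
    return 1.0 - jaccard_sim(x, y)

# Represent each binary vector by the set of attributes where it is 1.
X = {"Q1", "Q2"}
Y = {"Q2", "Q3", "Q4", "Q5", "Q6"}
print(jaccard_sim(X, Y))   # 1/6: intersection {Q2}, union {Q1..Q6}
print(jaccard_dist(X, Y))  # 5/6
```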

Distance functions for real-valued vectors
Lp norms, or Minkowski distance: Lp(x, y) = (Σi |xi − yi|^p)^(1/p), where p is a positive integer.
If p = 1, L1 is the Manhattan (or city-block) distance: L1(x, y) = Σi |xi − yi|.

Distance functions for real-valued vectors If p = 2, L2 is the Euclidean distance: Also one can use weighted distance: Very often Lpp is used instead of Lp (why?)

Partitioning algorithms: basic concept
- Construct a partition of a set of n objects into a set of k clusters.
- Each object belongs to exactly one cluster.
- The number of clusters k is given in advance.

The k-means problem
Given a set X of n points in a d-dimensional space and an integer k, the task is to choose a set of k points {c1, c2, …, ck} in the d-dimensional space to form clusters {C1, C2, …, Ck} such that the cost
    Σi=1..k Σx∈Ci ||x − ci||²
is minimized. Some special cases: k = 1, k = n.

The k-means algorithm
One way of solving the k-means problem:
1. Randomly pick k cluster centers {c1, …, ck}.
2. For each i, set the cluster Ci to be the set of points in X that are closer to ci than to any cj with j ≠ i.
3. For each i, let ci be the center of cluster Ci (the mean of the vectors in Ci).
4. Repeat steps 2-3 until convergence.
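A minimal numpy sketch of the three steps above (this is Lloyd's algorithm; the function name and parameters are our own choices, not from the slides):

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Lloyd's algorithm. X: (n, d) array. Returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly pick k of the input points as initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its closest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each center as the mean of its cluster
        # (keeping the old center if a cluster ends up empty).
        new_centers = np.array([
            X[labels == i].mean(axis=0) if (labels == i).any() else centers[i]
            for i in range(k)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels
```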

K-means in action: step-by-step animation (thanks, Wikipedia!).

Properties of the k-means algorithm
- Finds a local optimum.
- Often converges quickly (but not always).
- The choice of initial points can have a large influence on the result.

Two different k-means clusterings of the same points (figure: the original points, an optimal clustering, and a sub-optimal clustering).

Discussion of the k-means algorithm
- Finds a local optimum.
- Often converges quickly (but not always).
- The choice of initial points can have a large influence.
- Has trouble with clusters of different densities and clusters of different sizes.
- Outliers can also cause a problem. (Example?)

Some alternatives to random initialization of the centers
- Multiple runs: helps, but probability is not on your side.
- Select the initial set of points by methods other than uniform random sampling, e.g., pick points that are far apart from each other as cluster centers (the idea behind the k-means++ algorithm).
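A sketch of the "pick distant points" idea as a deterministic farthest-first traversal; note that k-means++ proper randomizes this step, sampling each new center with probability proportional to its squared distance from the already-chosen centers:

```python
import numpy as np

def farthest_first_centers(X, k, seed=0):
    """Start from a random point, then repeatedly take the point
    farthest from all centers chosen so far."""
    rng = np.random.default_rng(seed)
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance of each point to its nearest chosen center.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d2.argmax()])
    return np.array(centers)
```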

The k-median problem
Given a set X of n points in a d-dimensional space and an integer k, the task is to choose a set of k points {c1, c2, …, ck} from X and form clusters {C1, C2, …, Ck} such that the cost
    Σi=1..k Σx∈Ci d(x, ci)
is minimized.

k-means vs. k-median
- k-means: the centers ci may be arbitrary points in the space (the cluster means), and the objective sums squared distances.
- k-median: the centers {c1, c2, …, ck} must be chosen from X itself, and the objective sums (unsquared) distances.
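The two objectives side by side as short cost functions (function names are ours):

```python
import numpy as np

def kmeans_cost(X, centers, labels):
    """k-means objective: sum of squared Euclidean distances to centers."""
    return sum(((X[labels == i] - c) ** 2).sum()
               for i, c in enumerate(centers))

def kmedian_cost(X, centers, labels):
    """k-median objective: sum of plain distances; each center
    must itself be one of the points of X."""
    return sum(np.sqrt(((X[labels == i] - c) ** 2).sum(axis=1)).sum()
               for i, c in enumerate(centers))
```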

Text: notion of similarity/distance
- Ideal: semantic similarity. Practical: term-statistical similarity.
- We will use cosine similarity, with docs represented as vectors.
- For many algorithms it is easier to think in terms of a distance (rather than a similarity) between docs.
- We will mostly speak of Euclidean distance, but real implementations use cosine similarity.
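A minimal sketch of cosine similarity over term-weight vectors; the comment records why speaking of Euclidean distance while implementing cosine similarity is consistent for length-normalized vectors:

```python
import numpy as np

def cosine_similarity(d1, d2):
    """Cosine of the angle between two term-weight vectors."""
    d1, d2 = np.asarray(d1, float), np.asarray(d2, float)
    return d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2))

def cosine_distance(d1, d2):
    # For unit-length vectors, ||x - y||^2 = 2 - 2*cos(x, y),
    # so Euclidean distance and cosine similarity rank pairs identically.
    return 1.0 - cosine_similarity(d1, d2)
```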

Hierarchical clustering
Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents. One approach: recursive application of a partitional clustering algorithm. (Example taxonomy: animal splits into vertebrate (fish, reptile, amphibian, mammal) and invertebrate (worm, insect, crustacean).)

Dendrogram: hierarchical clustering
A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Hierarchical Agglomerative Clustering (HAC) [Sec. 17.1]
- Starts with each item in a separate cluster, then repeatedly joins the closest pair of clusters, until there is only one cluster.
- The history of merging forms a binary tree or hierarchy.
- Note: the resulting clusters are still "hard" and induce a partition.

Closest pair of clusters [Sec. 17.2]
There are many variants of "closest pair of clusters" (all four are available in standard libraries; see the sketch below):
- Single-link: similarity of the most similar pair of points
- Complete-link: similarity of the "furthest" points, i.e., the least similar pair
- Centroid: clusters whose centroids (centers of gravity) are the most similar
- Average-link: average distance between pairs of elements
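For concreteness, a small usage sketch on made-up data using scipy's hierarchical clustering, which implements all four variants; scipy works with distances rather than similarities, so "most similar" becomes "smallest distance":

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.default_rng(0).normal(size=(20, 2))  # toy data

# method can be "single", "complete", "centroid", or "average",
# matching the four cluster-distance definitions listed above.
Z = linkage(X, method="complete")

# Cut the resulting dendrogram into (at most) 3 flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)
```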

Single-link agglomerative clustering [Sec. 17.2]
Use the maximum similarity of pairs: sim(ci, cj) = max over x ∈ ci, y ∈ cj of sim(x, y). Can result in long and thin clusters due to the chaining effect. After merging ci and cj, the similarity of the resulting cluster to another cluster ck is: sim(ci ∪ cj, ck) = max(sim(ci, ck), sim(cj, ck)).

Single-link example [Sec. 17.2] (figure).

Complete-link agglomerative clustering [Sec. 17.2]
Use the minimum similarity of pairs: sim(ci, cj) = min over x ∈ ci, y ∈ cj of sim(x, y). Makes "tighter," more spherical clusters that are typically preferable. After merging ci and cj, the similarity of the resulting cluster to another cluster ck is: sim(ci ∪ cj, ck) = min(sim(ci, ck), sim(cj, ck)).

Complete-link example [Sec. 17.2] (figure).

What is a good clustering? [Sec. 16.3]
Internal criterion: a good clustering will produce high-quality clusters in which
- the intra-class (that is, within-cluster) similarity is high
- the inter-class (between-cluster) similarity is low
The measured quality of a clustering depends on both the item (document) representation and the similarity measure used.

External criteria for clustering quality [Sec. 16.3]
- Quality is measured by the clustering's ability to discover some or all of the hidden patterns or latent classes in gold-standard data.
- Assessing a clustering with respect to ground truth requires labeled data.
- Assume the documents have C gold-standard classes, while our clustering algorithm produces K clusters ω1, ω2, …, ωK, with ni members each.

Purity
A simple measure: purity, the ratio between the count of the most common class in cluster ωi and the size of cluster ωi:
    purity(ωi) = (1/|ωi|) max over j of |ωi ∩ cj|
Purity is biased: having n singleton clusters maximizes it. Alternatives are the entropy of the classes within clusters (or the mutual information between classes and clusters).

Purity example [Sec. 16.3]
Three clusters over 17 documents drawn from three classes (x, o, ⋄):
- Cluster I (red): purity = (1/6) max(5, 1, 0) = 5/6
- Cluster II (blue): purity = (1/6) max(1, 4, 1) = 4/6
- Cluster III (green): purity = (1/5) max(2, 0, 3) = 3/5
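A short sketch that reproduces this example; the cluster contents are reconstructed from the max(...) counts above, with "d" standing in for the diamond class:

```python
from collections import Counter

def purity(clusters):
    """clusters: one list of gold-standard class labels per cluster.
    Returns overall purity: correctly 'majority-labeled' items / total."""
    correct = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    total = sum(len(c) for c in clusters)
    return correct / total

clusters = [
    ["x"] * 5 + ["o"],        # Cluster I:   purity 5/6
    ["o"] * 4 + ["x", "d"],   # Cluster II:  purity 4/6
    ["d"] * 3 + ["x"] * 2,    # Cluster III: purity 3/5
]
print(purity(clusters))  # (5 + 4 + 3) / 17 ≈ 0.71
```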

The Rand Index measures agreement between pair decisions [Sec. 16.3]. For the purity example above, the C(17, 2) = 136 point pairs break down as:

                                    Same cluster    Different clusters
    Same class in ground truth           20                24
    Different classes in ground truth    20                72

Here RI = (20 + 72) / 136 ≈ 0.68.

Rand index [Sec. 16.3]
RI = (TP + TN) / (TP + FP + FN + TN), i.e., the fraction of point pairs on which the clustering and the ground truth agree (TP: same class and same cluster; TN: different classes and different clusters).
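A brute-force sketch over all point pairs, checked against the example above:

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """RI = (TP + TN) / (TP + FP + FN + TN): the fraction of pairs
    on which clustering and ground truth agree."""
    agree = total = 0
    for (t1, p1), (t2, p2) in combinations(zip(labels_true, labels_pred), 2):
        total += 1
        if (t1 == t2) == (p1 == p2):  # a TP or TN pair
            agree += 1
    return agree / total

# Labels from the purity example: clusters I/II/III vs. classes x/o/d.
true = ["x"]*5 + ["o"] + ["o"]*4 + ["x", "d"] + ["d"]*3 + ["x"]*2
pred = [1]*6 + [2]*6 + [3]*5
print(round(rand_index(true, pred), 2))  # 0.68
```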

Thanks to: Evimaria Terzi (Boston University); Pandu Nayak and Prabhakar Raghavan (Stanford University).