Download presentation
Presentation is loading. Please wait.
1
Semi-Supervised Clustering
Jieping Ye Department of Computer Science and Engineering Arizona State University
2
Outline Overview of clustering and classification
What is semi-supervised learning? Semi-supervised clustering Semi-supervised classification What is semi-supervised clustering? Why semi-supervised clustering? Semi-supervised clustering algorithms
3
Supervised classification versus unsupervised clustering
Unsupervised clustering Group similar objects together to find clusters Minimize intra-class distance Maximize inter-class distance Supervised classification Class label for each training sample is given Build a model from the training data Predict class label on unseen future data points
4
What is clustering? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups Inter-cluster distances are maximized Intra-cluster distances are minimized
5
What is Classification?
6
Clustering algorithms
K-Means Hierarchical clustering Graph based clustering (Spectral clustering) Bi-clustering
7
Classification algorithms
K-Nearest-Neighbor classifiers Naïve Bayes classifier Linear Discriminant Analysis (LDA) Support Vector Machines (SVM) Logistic Regression Neural Networks
8
Supervised Classification Example
. . . .
9
Supervised Classification Example
. . . . . . . . . . . . . . . . . . . .
10
Supervised Classification Example
. . . . . . . . . . . . . . . . . . . .
11
Unsupervised Clustering Example
. . . . . . . . . . . . . . . . . . . .
12
Unsupervised Clustering Example
. . . . . . . . . . . . . . . . . . . .
13
Semi-Supervised Learning
Combines labeled and unlabeled data during training to improve performance: Semi-supervised classification: Training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier. Semi-supervised clustering: Uses small amount of labeled data to aid and bias the clustering of unlabeled data. Unsupervised clustering Semi-supervised learning Supervised classification
14
Semi-Supervised Classification Example
. . . . . . . . . . . . . . . . . . . .
15
Semi-Supervised Classification Example
. . . . . . . . . . . . . . . . . . . .
16
Semi-Supervised Classification
Algorithms: Semisupervised EM [Ghahramani:NIPS94,Nigam:ML00]. Co-training [Blum:COLT98]. Transductive SVM’s [Vapnik:98,Joachims:ICML99]. Graph based algorithms Assumptions: Known, fixed set of categories given in the labeled data. Goal is to improve classification of examples into these known categories.
17
Semi-Supervised Clustering Example
. . . . . . . . . . . . . . . . . . . .
18
Semi-Supervised Clustering Example
. . . . . . . . . . . . . . . . . . . .
19
Second Semi-Supervised Clustering Example
. . . . . . . . . . . . . . . . . . . .
20
Second Semi-Supervised Clustering Example
. . . . . . . . . . . . . . . . . . . .
21
Semi-supervised clustering: problem definition
Input: A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical) A small amount of domain knowledge Output: A partitioning of the objects into k clusters (possibly with some discarded as outliers) Objective: Maximum intra-cluster similarity Minimum inter-cluster similarity High consistency between the partitioning and the domain knowledge
22
Why semi-supervised clustering?
Why not clustering? The clusters produced may not be the ones required. Sometimes there are multiple possible groupings. Why not classification? Sometimes there are insufficient labeled data. Potential applications Bioinformatics (gene and protein clustering) Document hierarchy construction News/ categorization Image categorization
23
Semi-Supervised Clustering
Domain knowledge Partial label information is given Apply some constraints (must-links and cannot-links) Approaches Search-based Semi-Supervised Clustering Alter the clustering algorithm using the constraints Similarity-based Semi-Supervised Clustering Alter the similarity measure based on the constraints Combination of both
24
Search-Based Semi-Supervised Clustering
Alter the clustering algorithm that searches for a good partitioning by: Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99]. Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]. Use the labeled data to initialize clusters in an iterative refinement algorithm (kMeans,) [Basu:ICML02].
25
Overview of K-Means Clustering
K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters. Algorithm: Initialize K cluster centers randomly. Repeat until convergence: Cluster Assignment Step: Assign each data point x to the cluster Xl, such that L2 distance of x from (center of Xl) is minimum Center Re-estimation Step: Re-estimate each cluster center as the mean of the points in that cluster
26
K-Means Objective Function
Locally minimizes sum of squared distance between the data points and their corresponding cluster centers: Initialization of K cluster centers: Totally random Random perturbation from global mean Heuristic to ensure well-separated centers
27
K Means Example
28
K Means Example Randomly Initialize Means
29
K Means Example Assign Points to Clusters
30
K Means Example Re-estimate Means
31
K Means Example Re-assign Points to Clusters
32
K Means Example Re-estimate Means
33
K Means Example Re-assign Points to Clusters
34
K Means Example Re-estimate Means and Converge
35
Semi-Supervised K-Means
Partial label information is given Seeded K-Means Constrained K-Means Constraints (Must-link, Cannot-link) COP K-Means
36
Semi-Supervised K-Means for partially labeled data
Seeded K-Means: Labeled data provided by user are used for initialization: initial center for cluster i is the mean of the seed points having label i. Seed points are only used for initialization, and not in subsequent steps. Constrained K-Means: Labeled data provided by user are used to initialize K-Means algorithm. Cluster labels of seed data are kept unchanged in the cluster assignment steps, and only the labels of the non-seed data are re-estimated.
37
Seeded K-Means Use labeled data to find the initial centroids and
then run K-Means. The labels for seeded points may change.
38
Seeded K-Means Example
39
Seeded K-Means Example Initialize Means Using Labeled Data
40
Seeded K-Means Example Assign Points to Clusters
41
Seeded K-Means Example Re-estimate Means
42
Seeded K-Means Example Assign points to clusters and Converge
the label is changed x
43
Exercise Compute the clustering using seeded Kmeans
44
Constrained K-Means Use labeled data to find the initial centroids and
then run K-Means. The labels for seeded points will not change.
45
Constrained K-Means Example
46
Constrained K-Means Example Initialize Means Using Labeled Data
47
Constrained K-Means Example Assign Points to Clusters
48
Constrained K-Means Example Re-estimate Means and Converge
49
Exercise Compute the clustering using constrained Kmeans
50
COP K-Means COP K-Means [Wagstaff et al.: ICML01] is K-Means with must-link (must be in same cluster) and cannot-link (cannot be in same cluster) constraints on data points. Initialization: Cluster centers are chosen randomly, but as each one is chosen any must-link constraints that it participates in are enforced (so that they cannot later be chosen as the center of another cluster). Algorithm: During cluster assignment step in COP-K-Means, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort.
51
COP K-Means Algorithm
52
Illustration Determine its label Must-link x x Assign to the red class
53
Illustration Determine its label Cannot-link Assign to the red class x
54
Illustration Determine its label Must-link Cannot-link
x x Cannot-link The clustering algorithm fails
55
Similarity-based semi-supervised clustering
Alter the similarity measure based on the constraints Paper: From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering. D. Klein et al. Two types of constraints: Must-links and Cannot-links Clustering algorithm: Hierarchical clustering
56
Constraints
57
Overview of Hierarchical Clustering Algorithm
Agglomerative versus Divisive Basic algorithm of Agglomerative clustering Compute the distance matrix Let each data point be a cluster Repeat Merge the two closest clusters Update the distance matrix Until only a single cluster remains Key operation is the update of the distance between two clusters
58
How to Define Inter-Cluster Distance
p1 p3 p5 p4 p2 . . . . Distance? MIN MAX Group Average Distance Between Centroids distance matrix
59
Must-link constraints
Distance between must-links pair to zero. Derive a new metric by running an all-pairs-shortest distances algorithm. It is still a metric Faithful to the original metric Computational complexity: O(N2 C) C: number of points involved in must-link constraints N: total number of points
60
New distance matrix based on must-link constraints
p1 p3 p5 p4 p2 . . . . Hierarchical clustering can be carried out based on the new distance matrix. What is missing? New distance matrix
61
Cannot-link constraint
Run hierarchical clustering with complete link (MAX) The distance between two clusters is determined by the largest distance. Set the distance between cannot-link pair to be The new distance matrix does not define a metric. Work very well in practice
62
Constrained complete-link clustering algorithm
Derive a new distance Matrix based on both Types of constraints
63
Illustration 2 1 4 3 5 Initial distance matrix 0.2 0.5 0.1 0.8 0.4 0.6
0.2 0.5 0.1 0.8 0.4 0.6 0.3 2 1 4 3 5 Initial distance matrix
64
New distance matrix Must-links: 1—2, 3—4 Cannot-links: 2--3 0.9 0.2
0.2 0.5 0.1 0.8 0.4 0.6 0.3 0.1 0.8 0.2 0.6 Must-links: 1—2, 3—4 Cannot-links: 2--3
65
Hierarchical clustering
1 and 2 form a cluster, and 3 and 4 form another cluster 1,2 3,4 5 0.9 0.8 0.2 1,2 3,4 5
66
Summary Seeded and Constrained K-Means: partially labeled data
COP K-Means: constraints (Must-link and Cannot-link) Constrained K-Means and COP K-Means require all the constraints to be satisfied. May not be effective if the seeds contain noise. Seeded K-Means use the seeds only in the first step to determine the initial centroids. Less sensitive to the noise in the seeds. Semi-supervised hierarchical clustering
Similar presentations
© 2024 SlidePlayer.com. Inc.
All rights reserved.