DB Seminar Series: Semi-supervised Projected Clustering By: Kevin Yip (4th May 2004)

Outline Introduction –Projected clustering –Semi-supervised clustering Our problem Our new algorithm Experimental results Future work and extensions

Projected Clustering Where are the clusters?

Projected Clustering Where are the clusters?

Projected Clustering Pattern-based projected cluster:

Projected Clustering Goal: to discover clusters and their relevant dimensions that optimize a certain objective function. Previous approaches: –Partitional: PROCLUS, ORCLUS –One cluster at a time: DOC, FastDOC, MineClus –Hierarchical: HARP

Projected Clustering Limitations of these approaches: –Cannot detect clusters of extremely low dimensionality (clusters with a low percentage of relevant dimensions, e.g. only 5% of the input dimensions are relevant) –Require input parameter values that are hard for users to supply –Performance is sensitive to the parameter values –High time complexity

Semi-supervised Clustering In some applications, there is usually a small amount of domain knowledge available (e.g. the functions of 5% of the genes probed on a microarray). The knowledge may not be suitable/sufficient for carrying out classification. Clustering algorithms make little use of external knowledge.

Semi-supervised Clustering The idea of semi-supervised clustering: –Use the model implicitly assumed by a clustering algorithm (e.g. compact hyperspheres for k-means, density-connected irregular regions for DBSCAN) –Use external knowledge to guide the tuning of the model parameters (e.g. the locations of cluster centers)

Semi-supervised Clustering Why not clustering? –The clusters produced may not be the ones required. –There could be multiple possible groupings. –There is no way to utilize the domain knowledge that is accessible (active learning vs. passive validation). (Guha et al., 1998)

Semi-supervised Clustering Why not classification? –There is insufficient labeled data: Objects are not labeled. The number of labeled objects is statistically insignificant. The labeled objects do not cover all classes. The labeled objects of a class do not cover all cases (e.g. they all lie on one side of the class). –It is not always possible to find a classification method with an underlying model that fits the data (e.g. pattern-based similarity).

Our Problem Data Model: –The input dataset has n objects and d dimensions –The dataset contains k disjoint clusters, and possibly some outlier objects –Each cluster is associated with a set of relevant dimensions –If a dimension is relevant to a cluster, the projections of the cluster members on the dimension are random samples of a local Gaussian distribution –Other projections are random samples of a global distribution (e.g. uniform distribution or Gaussian distribution with a standard deviation much larger than those of the local distributions)
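
This data model is easy to instantiate for testing. Below is a minimal sketch that generates data following it, assuming a uniform global distribution on [0, 10] and unit local standard deviations; the function name and all parameter defaults are illustrative, not part of the seminar.

```python
import numpy as np

def make_projected_clusters(n=1000, d=100, k=5, rel_frac=0.1,
                            local_sd=1.0, lo=0.0, hi=10.0, seed=0):
    """Generate data following the described model: each cluster has its own set of
    relevant dimensions drawn from local Gaussians; every other projection is a
    sample of the global uniform distribution."""
    rng = np.random.default_rng(seed)
    sizes = np.full(k, n // k)
    sizes[: n % k] += 1                       # distribute the remainder among clusters
    X = rng.uniform(lo, hi, size=(n, d))      # global (irrelevant) projections
    labels = np.repeat(np.arange(k), sizes)
    rel_dims = []
    start = 0
    for size in sizes:
        dims = rng.choice(d, size=max(1, int(rel_frac * d)), replace=False)
        centers = rng.uniform(lo, hi, size=dims.size)
        # overwrite the relevant dimensions of this cluster with local Gaussians
        X[start:start + size, dims] = rng.normal(centers, local_sd,
                                                 size=(size, dims.size))
        rel_dims.append(set(dims.tolist()))
        start += size
    return X, labels, rel_dims
```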

Our Problem Resulting data: if a dimension is relevant to a cluster, the projections of its members on that dimension will be close to each other (the within-cluster variance is much smaller than on the irrelevant dimensions). Example: C1: N(5, 1); C2: X = N(8, 1), Y = N(6, 1), Z = U(0, 10); C3: U(0, 10)

Our Problem Problem definition: –Inputs: The dataset D The target number of clusters k A (possibly empty) set I_o of labeled objects (object ID, class label), which may or may not cover all classes A (possibly empty) set I_v of labeled relevant dimensions (dimension ID, class label), which may or may not cover all classes. A single dimension can be specified as relevant to multiple clusters

Our Problem Problem definition (cont’d): –Outputs: A set of k disjoint projected clusters with a (locally) optimal objective score A (possibly empty) set of outlier objects
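
For concreteness, the inputs and outputs listed above could be held in structures such as the following; this is only an illustrative sketch, and all field names are assumptions.

```python
from dataclasses import dataclass, field
import numpy as np

@dataclass
class SemiSupervisedInput:
    """Inputs of the problem as listed above."""
    D: np.ndarray                                            # n x d data matrix
    k: int                                                   # target number of clusters
    labeled_objects: dict = field(default_factory=dict)      # I_o: object ID -> class label
    labeled_dimensions: dict = field(default_factory=dict)   # I_v: dimension ID -> set of class labels

@dataclass
class ProjectedClustering:
    """Outputs: k disjoint projected clusters plus an outlier list."""
    members: list        # k sets of object IDs
    dimensions: list     # k sets of selected (relevant) dimension IDs
    outliers: set        # objects assigned to no cluster
    score: float         # value of the objective function
```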

Our Problem Assumptions made in this study: –There is a primary clustering target (cf. biclustering) –Disjoint, axis-parallel clusters (cf. subspace clustering and ORCLUS) –Distance-based similarity –One cluster per class (cf. decision trees) –All inputs are correct (but can be biased, i.e., the projections on the relevant dimensions may deviate from the cluster center)

Our New Algorithm Basic idea: k-medoid/median 1. Determine the potential medoids (seeds) and relevant dimensions of each cluster 2. Assign every object to the cluster (or to the outlier list) that gives the greatest improvement to the objective score 3. Decide which medoids are good/bad –A good medoid: replace it by the cluster median, refine the selected dimensions –A bad medoid: replace it by another seed 4. Repeat 2 and 3 until no improvement is obtained within a certain number of iterations
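
The following is a simplified, runnable sketch of this iteration. It is not the authors' implementation: objects are assigned to the nearest medoid measured only on each cluster's selected dimensions (a simplification of the score-improvement rule above), dimensions are selected with the variance threshold discussed later, and there is no seed-group or outlier handling.

```python
import numpy as np

def select_dims(points, global_var, m=0.5):
    """Select dimensions whose local variance is below m times the global variance."""
    local_var = points.var(axis=0)
    return np.where(local_var < m * global_var)[0]

def projected_kmedoid(X, k, m=0.5, n_iter=30, seed=0):
    """Projected k-medoid/median sketch: assign, then refine medoids (by medians)
    and selected dimensions, for a fixed number of iterations. X is assumed to be
    a float array."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    global_var = X.var(axis=0)
    medoids = X[rng.choice(n, size=k, replace=False)].astype(float)
    dims = [np.arange(d) for _ in range(k)]          # start with all dimensions
    for _ in range(n_iter):
        # assignment step: mean absolute distance over each cluster's selected dimensions
        dist = np.stack([np.abs(X[:, dims[i]] - medoids[i][dims[i]]).mean(axis=1)
                         for i in range(k)])
        labels = dist.argmin(axis=0)
        # refinement step: new medoid (median) and new dimensions per cluster
        for i in range(k):
            members = X[labels == i]
            if len(members) == 0:
                medoids[i] = X[rng.integers(n)]      # restart an empty cluster
                continue
            medoids[i] = np.median(members, axis=0)
            sel = select_dims(members, global_var, m)
            dims[i] = sel if sel.size else np.arange(d)
    return labels, medoids, dims
```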

Our New Algorithm Issues to consider: –Design of the objective function –Selection of relevant dimensions for a cluster –Determination of seeds and the relevant dimensions of the corresponding potential clusters –Replacement of medoids

Our New Algorithm Design goals of the objective function: –Should not have a trivial best score (e.g. when each cluster selects only one dimension) –Should not be ruined by the selection of a small number of irrelevant dimensions –Should be robust (clustering accuracy should not degrade seriously when the input parameter values are not very accurate)

Our New Algorithm The objective function: –Overall score –Score component of cluster C_i –Contribution of a selected dimension v_j to the score component of C_i –m: normalization factor
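
The formulas on this slide did not survive in the transcript. A reconstruction that is consistent with the selection rule and the chi-square threshold on the later slides is shown below; the exact form, the size weighting and the symbol names are assumptions, not the slide's notation.

```latex
% Reconstruction (assumed): sigma_ij^2 is the local variance of cluster C_i on
% dimension v_j, sigma_j^2 the global variance of v_j, m the normalization
% factor, and D_i the set of dimensions selected for C_i.
\varphi_{ij} = 1 - \frac{\sigma_{ij}^2}{m\,\sigma_j^2}
\qquad
\varphi_i = \frac{|C_i|}{n} \sum_{v_j \in D_i} \varphi_{ij}
\qquad
\varphi = \frac{1}{k} \sum_{i=1}^{k} \varphi_i
```

Under this form a dimension contributes positively exactly when σ_ij² < m·σ_j², which matches the selection threshold quoted on the following slides.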

Our New Algorithm Characteristics of the objective function: –Higher score => better clustering –No trivial best score when each cluster selects only one dimension or selects all dimensions –Relevant dimensions (dimensions with smaller local variance σ_ij²) contribute more to the objective score –Robust? (To be discussed soon…)

Our New Algorithm Dimension selection: –In order to maximize the score, all dimensions with a positive contribution should be selected. –Appropriate values of …: should be at least σ_j², the global variance of dimension v_j. Scheme 1: … Scheme 1b: …, but only dimensions with … are selected => easier to compare the results obtained with different values of m

Our New Algorithm Scheme 2: estimate the probability for an irrelevant dimension to be selected (the global distribution needs to be known). If the global distribution is Gaussian… –If n_i values are randomly sampled from the global distribution of an irrelevant dimension v_j, the random variable (n_i - 1)·σ_ij²/σ_j² has a chi-square distribution with n_i - 1 degrees of freedom. –Suppose we want the probability of selecting an irrelevant dimension to be p; then, from the cumulative chi-square distribution, the corresponding value of m can be computed.
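
Scheme 2 therefore reduces to reading off a chi-square quantile. A minimal sketch (SciPy assumed; the function name and the example probability are illustrative):

```python
from scipy.stats import chi2

def variance_threshold_factor(n_i, p):
    """Choose m so that an irrelevant dimension (whose n_i projections come from the
    global Gaussian) is selected with probability p. Since
    (n_i - 1) * s_ij^2 / sigma_j^2 follows a chi-square distribution with n_i - 1
    degrees of freedom, P(s_ij^2 <= m * sigma_j^2) = p gives
    m = chi2.ppf(p, n_i - 1) / (n_i - 1)."""
    return chi2.ppf(p, df=n_i - 1) / (n_i - 1)

# e.g. n_i = 30 with p around 0.09 gives a quantile of roughly 19 and m of about
# 0.66, consistent with the example on the next slide.
print(variance_threshold_factor(30, 0.09))
```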

Our New Algorithm Probability density function and cumulative distribution (n_i = 30): (30 - 1)·σ_ij²/σ_j² ≤ 19 => σ_ij² ≤ 0.66·σ_j² (= m·σ_j²)

Our New Algorithm Robustness of the algorithm: –A good value of m should be… Large enough to tolerate local variances Small enough to distinguish local variances from global variances –The best value to use is data-dependent, but provided the difference between local and global variances is large, there is usually a wide range of values that lead to results with acceptable performance (e.g. 0.3 < m < 0.7)

Our New Algorithm Determination of seeds and the relevant dimensions of the corresponding potential clusters: –Traditional approach: One seed pool Seeds determined randomly, by the max-min distance method, or by preclustering (e.g. hierarchical) Relevant dimensions of each cluster determined by a set of objects near the medoid (in the input space)

Our New Algorithm Our proposal – seed groups: –Seeds are stored in separate seed groups; each seed group contains a small number (e.g. 5) of seeds –One private seed group for each cluster with some inputs –A number of public seed groups are shared by all clusters without external inputs –The seeds of the cluster with the largest number of inputs are initialized first (as we are most confident in their correctness), then those with fewer inputs, and so on. Finally, the public seed groups are initialized.

Our New Algorithm Our proposal – seed selection: –Based on low-dimensional histograms (grids) –Relevant dimension => small variance => high density –Procedure: Determine a starting point Hill-climbing –=> Need to determine both the dimensions used in constructing the grid and the starting point
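
A compact illustration of the grid-based hill-climbing idea is sketched below; the bin count, the neighbourhood definition and the way the final seed is picked are assumptions, not the seminar's exact procedure.

```python
import numpy as np
from itertools import product

def hill_climb_seed(X, grid_dims, start, bins=10):
    """Hill-climb on a low-dimensional histogram: starting from the cell that
    contains `start` (a full-dimensional point), repeatedly move to the densest
    neighbouring cell until no neighbour is denser, then return the data point
    closest to that cell's centre in the grid dimensions."""
    sub = X[:, grid_dims]
    edges = [np.linspace(sub[:, j].min(), sub[:, j].max(), bins + 1)
             for j in range(len(grid_dims))]
    hist, _ = np.histogramdd(sub, bins=edges)
    cell = tuple(min(np.searchsorted(edges[j], start[grid_dims[j]]) - 1, bins - 1)
                 for j in range(len(grid_dims)))
    cell = tuple(max(c, 0) for c in cell)
    while True:
        neighbours = [tuple(np.clip(np.add(cell, step), 0, bins - 1))
                      for step in product((-1, 0, 1), repeat=len(grid_dims))]
        best = max(neighbours, key=lambda c: hist[c])
        if hist[best] <= hist[cell]:
            break                                   # reached a local density maximum
        cell = best
    centre = np.array([(edges[j][cell[j]] + edges[j][cell[j] + 1]) / 2
                       for j in range(len(grid_dims))])
    return X[np.argmin(np.abs(sub - centre).sum(axis=1))]
```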

Our New Algorithm Determining the grid-constructing dimensions and the starting point: –Case 1: a cluster with both labeled objects and labeled relevant dimensions –Case 2: a cluster with only labeled objects –Case 3: a cluster with only labeled relevant dimensions –Case 4: a cluster with no inputs

Our New Algorithm Case 1: both kinds of inputs are available 1. Form a seed cluster from the input objects 2. Rank all dimensions by their score contribution 3. All dimensions with a positive contribution or in the input set I_v are candidate dimensions for constructing the histograms 4. Relative chance of being selected: proportional to the contribution if dimension v_j is not in I_v, 1 otherwise 5. The starting point is the median of the seed cluster

Our New Algorithm Example: cluster 2 – X: 0.68 – Y: 0.83 – Z: … – The hill-climbing mechanism fixes errors due to biased inputs

Our New Algorithm Case 2: labeled objects only –Similar to Case 1, but the chance for each dimension to be selected is based on the contribution only Case 3: labeled dimensions only –Similar to Case 1, but with no starting point, i.e., all cells are examined, and the one with the highest density is returned

Our New Algorithm Case 4: no inputs –The tentative seed is the object with the maximum projected distance to the closest already-selected seed (a modified max-min distance method) –For each dimension, a one-dimensional histogram is constructed to determine the density of objects around the projection of the tentative seed –The chance of each dimension being selected to construct the grid is based on this density –The tentative seed is used as the starting point
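
The two ingredients of Case 4 can be sketched as follows; plain Euclidean distance stands in for the projected distance mentioned on the slide, and the bin count and normalization are assumptions.

```python
import numpy as np

def maxmin_tentative_seed(X, chosen):
    """Modified max-min distance: pick the object whose distance to its closest
    already-selected seed is largest. `chosen` is a non-empty list of indices of
    the seeds selected so far."""
    d = np.min(np.linalg.norm(X[:, None, :] - X[chosen][None, :, :], axis=2), axis=1)
    return int(np.argmax(d))

def dimension_densities(X, seed_idx, bins=10):
    """For each dimension, the count of the 1-D histogram bin that contains the
    tentative seed's projection; higher density suggests the dimension is more
    likely relevant to the seed's cluster. Returned as selection probabilities."""
    dens = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        counts, edges = np.histogram(X[:, j], bins=bins)
        cell = min(np.searchsorted(edges, X[seed_idx, j]) - 1, bins - 1)
        dens[j] = counts[max(cell, 0)]
    return dens / dens.sum()
```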

Our New Algorithm Medoid drawing/replacement: –The medoid of each cluster is initially drawn from… The corresponding private seed group, if available A unique public seed group, otherwise –After assignment, the medoid of cluster C_i is likely to be a bad one if… Its score relative to the best cluster score is small – the cluster has a low quality compared to the other clusters Its score relative to the maximum possible score is small – the cluster has a low quality compared to a perfect cluster The cluster is very similar to another cluster

Our New Algorithm Medoid drawing/replacement (cont'd): –Each time, only one potentially bad medoid is replaced, since the probability of simultaneously correcting multiple medoids is low –The target bad medoid is replaced by a seed from the corresponding private seed group or a new public seed group –The medoids of the other clusters are replaced by their cluster medians, and the relevant dimensions are reselected –The algorithm keeps track of the best set of medoids and relevant dimensions
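
A hedged sketch of the bad-medoid test described above; the thresholds and the similarity measure are illustrative placeholders, not the seminar's values.

```python
import numpy as np

def find_bad_medoid(scores, max_possible, similarity,
                    rel_thresh=0.5, abs_thresh=0.3, sim_thresh=0.9):
    """Return the index of one potentially bad medoid (or None), following the
    three criteria above: low score relative to the best cluster, low score
    relative to a perfect cluster, or high similarity to another cluster."""
    scores = np.asarray(scores, dtype=float)
    best = scores.max()
    for i, s in enumerate(scores):
        too_weak = (s / best < rel_thresh) or (s / max_possible < abs_thresh)
        too_close = any(similarity[i][j] > sim_thresh
                        for j in range(len(scores)) if j != i)
        if too_weak or too_close:
            return i          # replace only one medoid per iteration
    return None
```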

Experimental Results Dataset 1: n = 1000, d = 100, k = 5, l_real = 5-40 (5-40% of d) No external inputs Algorithms: –HARP –PROCLUS –SSPC –CLARANS (non-projected control)

Experimental Results Best performance (results with the best Adjusted Rand Index (ARI) values):
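
For reference, the Adjusted Rand Index compares a produced clustering with the generating class labels and is invariant to label permutations; a minimal example with scikit-learn (not part of the original experiments):

```python
from sklearn.metrics import adjusted_rand_score

# 1.0 means a perfect match; values near 0 indicate random-like agreement.
true_labels = [0, 0, 1, 1, 2, 2]
found_labels = [1, 1, 0, 0, 2, 2]   # a permutation of the labels does not matter
print(adjusted_rand_score(true_labels, found_labels))   # 1.0
```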

Experimental Results Best performance vs. average performance:

Experimental Results Robustness (l_real = 10):

Experimental Results Dataset 2: n = 150, d = 3000, k = 5, l_real = 30 (1% of d) Inputs: –I_o size, I_v size = 1-9 –4 combinations: both, labeled objects only, labeled relevant dimensions only, none –Coverage: 1-5 clusters (20-100%)

Experimental Results Increasing input size (100% coverage):

Experimental Results Increasing coverage (input size=3):

Experimental Results Increasing coverage (input size=6):

Future Work and Extensions Other required experiments: –Biased inputs –Multiple labeling methods for a single dataset –Scalability –Real data –Imperfect data with artificial outliers and errors –Searching for the best k

Future Work and Extensions To be considered in the future: –Other input types (e.g. must-link and cannot-link constraints) –Wrong or inconsistent inputs –Pattern-based and range-based similarity –Non-disjoint clusters

References Projected clustering: –HARP: A Hierarchical Algorithm with Automatic Relevant Attribute Selection for Projected Clustering (DB Seminar on 20 Sep 2002) Semi-supervised clustering: –The Semi-supervised Clustering Problem (DB Seminar on 2 Jan 2004)