DB Seminar Series: The Subspace Clustering Problem By: Kevin Yip (17 May 2002)


Presentation Outline Problem definition Different approaches Focus: the projective clustering approach

Problem Definition – Traditional Clustering Traditional clustering problem: To divide data points into disjoint groups such that the value of an objective function is optimized. Objective function: to minimize intra-cluster distance and maximize inter-cluster distance. Distance function: defined over all dimensions, which may be numeric or categorical.

Problem Definition – Traditional Clustering Example Problem: clustering points in 2-D space. Distance function: Euclidean distance, dist(p, q) = sqrt((p1 − q1)² + … + (pd − qd)²) (d: no. of dimensions, 2 in this case).
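As a quick illustration, here is a minimal sketch of this full-dimensional Euclidean distance; the plain-tuple point representation is just an assumption made for the example.

```python
import math

def euclidean(p, q):
    """Distance over all d dimensions: square root of the sum of squared differences."""
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Two points in 2-D space (d = 2).
print(euclidean((1.0, 2.0), (4.0, 6.0)))  # 5.0
```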

Problem Definition – Traditional Clustering Example (source: CURE, SIGMOD 1998)

Problem Definition – Distance Function Problem Observation: distance measures defined over all dimensions are sometimes inappropriate. Example (source: DOC, SIGMOD 2002): cluster C1 is bounded in dimensions (x1, x2), C2 in (x2, x3), and C3 in (x1, x3).

Problem Definition – Distance Function Problem As the number of noise dimensions increases, the distance functions become less and less accurate. => For each cluster, besides the set of data points, we also need to find the set of “related dimensions” (“bounded attributes”).

Problem Definition – The Subspace Clustering Problem Formal Definition: Given a dataset of N data points and d dimensions, we want to divide the points into k disjoint clusters, each relating to a subset of dimensions, such that an objective function is optimized. Objective function: usually intra-cluster distance, where each cluster uses its own set of dimensions in the distance calculation.

Problem Definition – The Subspace Clustering Problem Observation: normal distance functions (Manhattan, Euclidean, etc.) give a smaller value when fewer dimensions are involved. => 1. Use a normalized distance function. => 2. Should also try to maximize the number of selected dimensions. Example (DOC): score(C, D) = |C| · (1/β)^|D|, where C is the set of points in a cluster, D is the set of relating attributes, and β is a constant.
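To make the trade-off concrete, here is a small sketch of such a score function, modelled on the DOC formula quoted above; the default value of β and the assumption 0 < β < 1 (so that each additional bounded dimension raises the score) are illustrative choices.

```python
def doc_score(num_points, num_dims, beta=0.25):
    """DOC-style quality function: |C| * (1/beta)^|D|.

    With 0 < beta < 1 (an assumption for this sketch), adding a bounded
    dimension multiplies the score by 1/beta, so the function trades
    cluster size against the number of relevant dimensions.
    """
    return num_points * (1.0 / beta) ** num_dims

# A smaller cluster bounded in more dimensions can score as well as a
# larger cluster bounded in fewer dimensions.
print(doc_score(num_points=100, num_dims=2))  # 1600.0
print(doc_score(num_points=400, num_dims=1))  # 1600.0
```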

Different Approaches – Overview Grid-based dimension selection Association rule hypergraph partitioning Context-specific Bayesian clustering Projective clustering (Focus)

Different Approaches – Grid-Based Dimension Selection CLIQUE (98), ENCLUS (99), MAFIA (99), etc. Basic idea: A cluster is a region with high density. – Divide the domain of each dimension into units. – For each dimension, find all dense units – units with many points. – Merge neighboring dense units into “clusters”. – After finding all 1-d clusters, find 2-d dense units. – Repeat with higher dimensions.
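As a rough sketch of the 1-d step of this idea, the snippet below bins values into fixed-width units, keeps the dense ones, and merges neighbouring dense units; the unit width and density threshold are arbitrary illustrative choices, not the parameters of any particular algorithm.

```python
from collections import Counter

def dense_units_1d(values, unit_width=1.0, density_threshold=3):
    """Map each value to a unit index and keep units with enough points."""
    counts = Counter(int(v // unit_width) for v in values)
    return sorted(u for u, c in counts.items() if c >= density_threshold)

def merge_neighbours(units):
    """Merge consecutive dense units into 1-d "clusters" (runs of unit indices)."""
    clusters, current = [], []
    for u in units:
        if current and u == current[-1] + 1:
            current.append(u)
        else:
            if current:
                clusters.append(current)
            current = [u]
    if current:
        clusters.append(current)
    return clusters

xs = [0.1, 0.4, 0.7, 1.2, 1.5, 1.8, 5.1, 5.2, 5.3, 9.9]
dense = dense_units_1d(xs)      # units 0, 1 and 5 are dense
print(merge_neighbours(dense))  # [[0, 1], [5]]
```

The same counting idea extends to 2-d by binning on pairs of unit indices, which is where the exponential dependency on the number of dimensions comes from.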

Different Approaches – Grid-Based Dimension Selection A 2-D dataset for illustration:

Different Approaches – Grid-Based Dimension Selection 1. Divide the domain of each dimension into sub-units.

Different Approaches – Grid-Based Dimension Selection 2. Find all dense units – units with many points. (assume density threshold = 3 points)

Different Approaches – Grid-Based Dimension Selection 3. Merge neighboring dense units into “clusters”.

Different Approaches – Grid-Based Dimension Selection 4. Find 2-d dense units. Merge neighboring dense units, if any.

Different Approaches – Grid-Based Dimension Selection 5. Repeat with higher dimensions.

Different Approaches – Grid-Based Dimension Selection Results: the 1-d clusters and 2-d clusters found (the actual intervals were shown graphically on the original slide).

Different Approaches – Grid-Based Dimension Selection Problems with the grid-based dimension selection approach: – Non-disjoint clusters. – Exponential dependency on the number of dimensions.

Different Approaches – Association Rule Hypergraph Partitioning (1997) Basic idea: cluster related items (attribute values) using association rules, and cluster related transactions (data points) using the clusters of items.

Different Approaches – Association Rule Hypergraph Partitioning Procedures: 1. Find all frequent itemsets in the dataset. 2. Construct a hypergraph with each item as a vertex, and each hyperedge corresponding to a frequent itemset. (If {A, B, C} is a frequent itemset, there is a hyperedge connecting the vertices of A, B, and C.)

Different Approaches – Association Rule Hypergraph Partitioning Procedures: 3. Each hyperedge is assigned a weight equal to a function of the confidences of all the association rules between the connecting items. (If there are association rules {A}=>{B,C} (c. 0.8), {A,B}=>{C} (c. 0.4), {A,C}=>{B} (c. 0.6), {B}=>{A,C} (c. 0.4), {B,C}=>{A} (c. 0.8) and {C}=>{A,B} (c. 0.6), then the weight of the hyperedge ABC can be the average of the confidences, i.e. 0.6)

Different Approaches – Association Rule Hypergraph Partitioning Procedures: 4. Use a hypergraph partitioning algorithm (e.g. HMETIS, 97) to divide the hypergraph into k partitions, so that the sum of the weights of the hyperedges that straddle partitions is minimized. Each partition forms a cluster with a different subset of items. 5. Assign each transaction to a cluster, based on a scoring function (e.g. percentage of matched items).
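A minimal sketch of steps 3 and 5, using the confidence-averaging weight from the example above and a matched-items scoring function; the data structures (sets of item labels) are assumptions made for the illustration.

```python
def hyperedge_weight(confidences):
    """Weight of a hyperedge = average confidence of the association rules
    among the items it connects (the function used in the example above)."""
    return sum(confidences) / len(confidences)

def assign_transaction(transaction, item_clusters):
    """Assign a transaction to the item cluster with the highest fraction of matched items."""
    best_cluster, best_score = None, -1.0
    for cluster_id, items in item_clusters.items():
        score = len(transaction & items) / len(items)
        if score > best_score:
            best_cluster, best_score = cluster_id, score
    return best_cluster, best_score

# Hyperedge {A, B, C} weighted by the six rule confidences from the slide.
print(hyperedge_weight([0.8, 0.4, 0.6, 0.4, 0.8, 0.6]))  # ~0.6

clusters = {0: {"A", "B", "C"}, 1: {"D", "E"}}
print(assign_transaction({"A", "C", "E"}, clusters))  # cluster 0: two of its three items matched
```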

Different Approaches – Association Rule Hypergraph Partitioning Problems with the association rule hypergraph partitioning approach: – In reality, an item can be related to multiple clusters, but each item is assigned to only one partition. – May not be applicable to numeric attributes.

Different Approaches – Context-Specific Bayesian Clustering Naïve-Bayesian classification: given a training set with classes Ci (i = 1..k), a data point with attribute values x1, x2, …, xd is classified by P(C=Ci | x1, x2, …, xd) = P(x1, x2, …, xd | C=Ci) P(C=Ci) / P(x1, x2, …, xd) ∝ P(x1, x2, …, xd | C=Ci) P(C=Ci) = P(x1|C=Ci) P(x2|C=Ci) … P(xd|C=Ci) P(C=Ci)
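The factored posterior can be evaluated directly in log space, as in the small sketch below; the toy class priors and conditional probabilities are made up purely for illustration.

```python
import math

def naive_bayes_score(x, class_prior, cond_prob):
    """Unnormalized log-posterior: log P(C) + sum_i log P(x_i | C).

    cond_prob[i] maps a value of attribute i to P(x_i | C) for this class.
    """
    return math.log(class_prior) + sum(
        math.log(cond_prob[i][v]) for i, v in enumerate(x)
    )

# Toy two-class, two-attribute model (all probabilities are illustrative).
model = {
    "C1": (0.6, [{"a": 0.9, "b": 0.1}, {"y": 0.7, "n": 0.3}]),
    "C2": (0.4, [{"a": 0.2, "b": 0.8}, {"y": 0.5, "n": 0.5}]),
}
x = ("a", "n")
scores = {c: naive_bayes_score(x, prior, cp) for c, (prior, cp) in model.items()}
print(max(scores, key=scores.get))  # C1 – the class with the highest posterior
```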

Different Approaches – Context-Specific Bayesian Clustering A RECOMB 2001 paper. Context-specific independence (CSI) model: each attribute Xi depends only on the classes in a set Li. E.g. if k=5 and L1 = {1, 4}, then P(X1|C=C2) = P(X1|C=C3) = P(X1|C=C5) = P(X1|C=Cdef)

Different Approaches – Context-Specific Bayesian Clustering A CSI model M contains: k – the number of classes; G – the set of attributes that depend on some classes; Li – the “local structures” of the attributes. Parameters for a CSI model, θM: P(C=Ci), P(Xi|Li=Cj)

Different Approaches – Context-Specific Bayesian Clustering Recall that P(C=Ci | x1, x2, …, xd) ∝ P(x1|C=Ci) P(x2|C=Ci) … P(xd|C=Ci) P(C=Ci); in the CSI model, this equals P(x1|L1=Ci) P(x2|L2=Ci) … P(xd|Ld=Ci) P(C=Ci). So, for a dataset (without class labels), if we can guess a CSI model and its parameters, then we can assign each data point to a class => clustering.

Different Approaches – Context-Specific Bayesian Clustering Searching for the best model and parameters: – Define a score to rank the current model and parameters (BIC(M, θM) or CS(M, θM)). – Randomly pick a model and a set of parameters and calculate the score. – Try modifying the model (e.g. add an attribute to a local structure), and recalculate the score. – If the score is better, keep the change and then try modifying a parameter.

Different Approaches – Context-Specific Bayesian Clustering Repeat until a stopping criterion is reached (e.g. using simulated annealing): M1, θM1 -> M2, θM1 -> M2, θM2 -> M3, θM2 -> …
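A simplified sketch of this alternating search is given below as a greedy hill climb; the score function and the two move generators are placeholders, not the BIC/CS scores or the exact simulated-annealing schedule of the paper.

```python
import random

def stochastic_search(score, perturb_model, perturb_params,
                      model, params, steps=1000, seed=0):
    """Alternately modify the model and its parameters, keeping a change only
    when it improves the score (a simplified, greedy variant of the search)."""
    rng = random.Random(seed)
    best = score(model, params)
    for step in range(steps):
        if step % 2 == 0:  # even steps: try a model move
            cand_model, cand_params = perturb_model(model, rng), params
        else:              # odd steps: try a parameter move
            cand_model, cand_params = model, perturb_params(params, rng)
        s = score(cand_model, cand_params)
        if s > best:
            model, params, best = cand_model, cand_params, s
    return model, params, best

# Toy usage: the "model" is a set of attribute indices, the "parameters" a single float.
def toy_score(m, p):
    return len(m) - (p - 1.0) ** 2

def flip_attribute(m, rng):
    return m ^ {rng.randrange(5)}  # add or remove a random attribute index

def nudge_param(p, rng):
    return p + rng.uniform(-0.1, 0.1)

print(stochastic_search(toy_score, flip_attribute, nudge_param, set(), 0.0))
```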

Different Approaches – Context-Specific Bayesian Clustering The scoring functions, BIC(M, θM) and CS(M, θM) (just to have a taste; the formulas were shown on the original slide).

Different Approaches – Context-Specific Bayesian Clustering Problems with the context-specific Bayesian clustering approach: – Cluster quality and execution time are not guaranteed. – Easily gets stuck in a local optimum.

Focus: The Projective Clustering Approach PROCLUS (99), ORCLUS (00), etc. K-medoid partitional clustering. Basic idea: use a set of sample points to determine the relating dimensions for each cluster, assign points to the clusters according to the dimension sets, throw away some bad medoids and repeat.

Focus: The Projective Clustering Approach Algorithm (3 phases): Initialization phase – Input k: target number of clusters. – Input l: average number of dimensions in a cluster. – Draw Ak samples randomly from the dataset, where A is a constant. – Use max-min algorithm to draw Bk points from the sample, where B is a constant < A. Call this set of points M.
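A minimal sketch of a greedy max-min (farthest-first) selection that could serve this step; using the Manhattan distance here is an assumption made to match the iterative phase.

```python
import random

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def max_min_select(points, count, seed=0):
    """Greedy farthest-first traversal: start from a random point, then repeatedly
    add the point whose distance to the already chosen set is largest."""
    rng = random.Random(seed)
    chosen = [rng.choice(points)]
    while len(chosen) < count:
        farthest = max(points,
                       key=lambda p: min(manhattan(p, m) for m in chosen))
        chosen.append(farthest)
    return chosen

# Pick Bk = 3 well-spread candidate medoids out of Ak = 5 sampled points.
sample = [(0, 0), (0, 1), (10, 10), (10, 11), (5, 5)]
print(max_min_select(sample, 3))
```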

Focus: The Projective Clustering Approach Iterative Phase – Draw k medoids from M. – For each medoid mi, calculate the Manhattan distance δi (involving all dimensions) to the nearest medoid. – Find all points in the whole dataset that are within a distance δi from mi.
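A small sketch of this locality computation, again assuming plain tuples and the Manhattan distance over all dimensions.

```python
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def localities(medoids, data):
    """For each medoid, delta_i = full-dimensional distance to its nearest other
    medoid; the locality of the medoid is every data point within delta_i of it."""
    result = []
    for i, m in enumerate(medoids):
        delta = min(manhattan(m, other)
                    for j, other in enumerate(medoids) if j != i)
        result.append([p for p in data if manhattan(p, m) <= delta])
    return result

medoids = [(0, 0), (10, 10)]
data = [(1, 1), (2, 3), (9, 9), (25, 25)]
print([len(locality) for locality in localities(medoids, data)])
```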

Focus: The Projective Clustering Approach Finding the set of surrounding points for a medoid (figure on the original slide: medoids A, B and C, each with its radius δ).

Focus: The Projective Clustering Approach – The average distance between the points and the medoid along each dimension is calculated. – Among all kd dimensions (k medoids × d dimensions), select the kl dimensions with exceptionally small average distances. An extra restriction is that each medoid must pick at least 2 dimensions. – Whether the average distance along a particular dimension is “exceptionally small” in a cluster is determined by its standard score: Z(Ci, Dj) = (Xij − mean of the average distances in Ci) / (standard deviation of the average distances in Ci), where Xij is the average distance along dimension Dj in cluster Ci.

Focus: The Projective Clustering Approach Scoring dimensions (illustration on the original slide for medoids A, B and C).

Focus: The Projective Clustering Approach Example: – In cluster C1, the average distances from the medoid are 10 along dimension D1, 15 along D2 and 13 along D3. In cluster C2, the average distances are 7, 6 and 12. – Mean(C1) = (10 + 15 + 13) / 3 = 12.67. – S.D.(C1) = 2.52. – Z(C1D1) = (10 − 12.67) / 2.52 = −1.06. – Similarly, Z(C1D2) = 0.93, Z(C1D3) = 0.13, Z(C2D1) = −0.41, Z(C2D2) = −0.73, Z(C2D3) = 1.14. – So the order to pick the dimensions will be C1D1 -> C2D2 -> C2D1 -> C1D3 -> C1D2 -> C2D3.
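The same computation as a short sketch; using the sample standard deviation reproduces the values in the example above.

```python
import statistics

def dimension_z_scores(avg_dists):
    """Standard score of each per-dimension average distance within one cluster."""
    mean = statistics.mean(avg_dists)
    sd = statistics.stdev(avg_dists)  # sample standard deviation
    return [(x - mean) / sd for x in avg_dists]

print([round(z, 2) for z in dimension_z_scores([10, 15, 13])])  # [-1.06, 0.93, 0.13]
print([round(z, 2) for z in dimension_z_scores([7, 6, 12])])    # [-0.41, -0.73, 1.14]
```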

Focus: The Projective Clustering Approach Iterative Phase (cont’d) – Now each medoid has a related set of dimensions. Assign every point in the whole dataset to the medoid closest to it (using a normalized distance function involving only the selected dimensions). – Calculate the overall score of the clustering. Record the cluster definitions (relating attributes and assigned points) if the score is the new best one. – Throw away medoids with too few points, and replace them with some points remaining in M.
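A minimal sketch of this assignment step; the per-medoid distance used here is the Manhattan distance averaged over that medoid's selected dimensions, which is one way to normalize for the differing numbers of dimensions.

```python
def segmental_distance(p, q, dims):
    """Manhattan distance averaged over the selected dimensions only."""
    return sum(abs(p[i] - q[i]) for i in dims) / len(dims)

def assign_points(data, medoids, medoid_dims):
    """Assign each point to the medoid closest to it under that medoid's own
    dimension set; returns one list of points per medoid."""
    clusters = [[] for _ in medoids]
    for p in data:
        best = min(range(len(medoids)),
                   key=lambda i: segmental_distance(p, medoids[i], medoid_dims[i]))
        clusters[best].append(p)
    return clusters

medoids = [(0, 0, 0), (10, 10, 10)]
dims = [[0, 1], [1, 2]]  # each medoid has its own selected dimensions
data = [(1, 2, 50), (9, 11, 9), (0, 1, 99)]
print([len(c) for c in assign_points(data, medoids, dims)])  # [2, 1]
```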

Focus: The Projective Clustering Approach Refinement Phase – After determining the best set of medoids, use the assigned points to re-determine the sets of dimensions, and reassign all points. – If the distance between a point and its medoid is longer than the distance between the medoid and its closest medoid, the point is marked as an outlier.

Focus: The Projective Clustering Approach Experiment: – Dataset: synthetic, 100,000 points, 20 dimensions. – Set 1: 5 clusters, each with 7 dimensions. – Set 2: 5 clusters, with 2-7 dimensions. – Machine: 233-MHz IBM RS/6000, 128MB RAM, running AIX. Dataset stored on a 2GB SCSI drive. – Comparison: CLIQUE (grid-based)

Focus: The Projective Clustering Approach Result accuracy (set 1):
Actual clusters:
– A: dimensions 3, 4, 7, 9, 14, 16, …
– B: dimensions 3, 4, 7, 12, 13, 14, …
– C: dimensions 4, 6, 11, 13, 14, 17, …
– D: dimensions 4, 7, 9, 13, 14, 16, …
– E: dimensions 3, 4, 9, 12, 14, 16, …
– Outliers: 5000 points
PROCLUS results:
– 1: dimensions 4, 6, 11, 13, 14, 17, …
– 2: dimensions 3, 4, 7, 9, 14, 16, …
– 3: dimensions 3, 4, 7, 12, 13, 14, …
– 4: dimensions 4, 7, 9, 13, 14, 16, …
– 5: dimensions 3, 4, 9, 12, 14, 16, …
– Outliers: 2396 points
(The remaining dimension of each cluster and the per-cluster point counts were shown on the original slide.)

Focus: The Projective Clustering Approach Result accuracy (set 1): Confusion Matrix of PROCLUS, with the input clusters A–E and outliers as columns and the output clusters (plus outliers) as rows (the counts were shown on the original slide).

Focus: The Projective Clustering Approach Result accuracy (set 1): Confusion Matrix of CLIQUE, with the input clusters A–E and outliers as columns and the output clusters as rows (the counts were shown on the original slide).

Focus: The Projective Clustering Approach Result accuracy (set 2):
Actual clusters:
– A: dimensions 2, 3, 4, 9, 11, 14, …
– B: dimensions 2, 3, …
– C: dimensions 2, …
– D: dimensions 2, 3, 4, 12, 13, …
– E: dimensions 2, …
– Outliers: 5000 points
PROCLUS results:
– 1: dimensions 2, 3, …
– 2: …
– 3: dimensions 3, 4, 12, 13, …
– 4: …
– 5: dimensions 2, 3, 4, 9, 11, 14, …
– Outliers: 5294 points
(The remaining dimensions and the per-cluster point counts were shown on the original slide.)

Focus: The Projective Clustering Approach Result accuracy (set 2): Confusion Matrix of PROCLUS, with the input clusters A–E and outliers as columns and the output clusters (plus outliers) as rows (the counts were shown on the original slide).

Focus: The Projective Clustering Approach Scalability (with dataset size):

Focus: The Projective Clustering Approach Scalability (with average dimensionality):

Focus: The Projective Clustering Approach Scalability (with space dimensionality):

Focus: The Projective Clustering Approach Problems with the projective clustering approach: – Need to know l, the average number of dimensions per cluster. – A cluster with a very small number of selected dimensions will absorb the points of other clusters. – Using a distance measure over the whole dimension space to select the sets of dimensions may not be accurate, especially when the number of noise attributes is large.

Summary The subspace clustering problem: given a dataset of N data points and d dimensions, we want to divide the points into k disjoint clusters, each relating to a subset of dimensions, such that an objective function is optimized. Grid-based dimension selection Association rule hypergraph partitioning Context-specific Bayesian clustering Projective clustering

References Grid-based dimension selection: – “Automatic Subspace Clustering of High Dimensional Data for Data Mining Applications” (SIGMOD 1998) – “Entropy-based Subspace Clustering for Mining Numerical Data” (SIGKDD 1999) – “MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets” (Technical Report, Northwestern University, 1999) Association rule hypergraph partitioning: – “Clustering Based On Association Rule Hypergraphs” (Clustering Workshop 1997)

References – “Multilevel Hypergraph Partitioning: Application in VLSI Domain” (DAC 1997) Context-specific Bayesian clustering: – “Context-Specific Bayesian Clustering for Gene Expression Data” (RECOMB 2001) Projective clustering: – “Fast Algorithms for Projected Clustering” (SIGMOD 1999) – “Finding Generalized Projected Clusters in High Dimensional Spaces” (SIGMOD 2000) – “A Monte Carlo Algorithm for Fast Projective Clustering” (SIGMOD 2002)