Published by Bathsheba Lester. Modified over 9 years ago.
1
Intelligent Database Systems Lab
國立雲林科技大學 National Yunlin University of Science and Technology
Advisor: Dr. Hsu
Presenter: Keng-Wei Chang
Authors: Yehuda Koren and David Harel
A Two-Way Visualization Method for Clustered Data
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
2
Intelligent Database Systems Lab N.Y.U.S.T. I. M.
Outline
Motivation
Objective
Introduction
Basic Notions
Computing The x-Coordinates
Computing The y-Coordinates
Result
Related Work
Conclusions
Personal Opinion
3
Motivation
A number of technological developments have led to an explosion of raw data that has to be analyzed.
We are especially interested in two families of tools in this domain: clustering algorithms and data visualization methods.
4
Objective
In this paper, we integrate the two approaches: hierarchical clustering, depicted as a dendrogram, and low-dimensional embedding.
5
Introduction
A number of technological developments have led to an explosion of raw data that has to be analyzed.
We are especially interested in two families of tools in this domain: clustering algorithms and data visualization methods.
Clustering methods can be broadly classified as hierarchical or partitional.
6
Introduction
Our main interest here is hierarchical clustering.
The clustering hierarchy is often visualized as a dendrogram, a full binary tree.
The dendrogram has a significant disadvantage: it does not provide exploratory visual representations of the data itself.
Another issue is that of cluster validity.
7
Introduction
We are particularly interested in methods for achieving a low-dimensional embedding of data: principal component analysis (PCA), multidimensional scaling (MDS), and force-directed placement.
These methods solve some limitations of the dendrogram, but they cannot utilize external clustering information.
8
Introduction
A demonstration of the relative merits of the two approaches: a dendrogram vs. a low-dimensional embedding.
9
Introduction
In this paper, we integrate the two approaches: hierarchical clustering, depicted as a dendrogram, and low-dimensional embedding.
10
Basic Notions
We are given data about n elements {1, …, n}.
Relationships between pairs of elements are described by distances d_ij ≥ 0 or similarities w_ij ≥ 0.
A 2-dimensional embedding of the data is defined by two vectors x, y ∈ R^n.
The coordinates of element i are (x_i, y_i).
11
Computing The x-Coordinates
The embedding must place each element exactly below its corresponding leaf in the dendrogram.
This means that the x-coordinate of each element must equal that of its corresponding leaf in the dendrogram.
We therefore face the problem of computing x-coordinates for the dendrogram leaves that preserve the relationships among the data as much as possible.
12
Computing The x-Coordinates
Having considered the existing methods, we opt for a twofold process:
First, find the best orientation of the dendrogram. This step determines the ordering of the leaves.
Second, decide on the exact gaps between consecutive leaves in the ordering.
13
Dendrogram orientation
A dendrogram with n leaves has 2^(n-1) different orientations, since each of its n-1 internal nodes can be flipped independently.
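The count of orientations can be checked on a small case. The sketch below is illustrative only: it assumes a dendrogram is stored as nested 2-tuples with integer leaves, and the function name `orientations` is hypothetical, not something taken from the slides.

```python
def orientations(tree):
    """Yield every left-to-right leaf ordering obtainable by flipping internal nodes.

    Assumed representation: a leaf is any non-tuple value,
    an internal node is a 2-tuple (left_subtree, right_subtree).
    """
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    for left_order in orientations(left):
        for right_order in orientations(right):
            yield left_order + right_order   # node kept as drawn
            yield right_order + left_order   # node flipped


tree = ((0, 1), (2, 3))          # n = 4 leaves, 3 internal nodes
orders = list(orientations(tree))
print(len(orders))               # 2^(4-1) orderings
for order in orders:
    print(order)
```

Every ordering keeps the leaves of each subtree contiguous, which is what it means for an ordering to agree with the dendrogram.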
14
Dendrogram orientation
One way of formally defining what should be considered a “good” ordering is to associate a cost function with the dendrogram, such that finding the best ordering is equivalent to optimizing this function.
A candidate is the classical minimum linear arrangement problem, which minimizes the similarity-weighted sum of the distances between elements in the ordering.
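The formula on this slide did not survive extraction. For reference, the minimum linear arrangement objective is usually written as below. The notation is an assumption chosen to match the Basic Notions slide: w_ij are the similarities and x_i is the position of element i in the ordering.

```latex
\min_{x} \; \sum_{i<j} w_{ij}\,\lvert x_i - x_j \rvert ,
\qquad x \text{ a permutation of } \{1,\dots,n\}
```

Highly similar pairs contribute a large penalty when placed far apart, so a minimizer tends to keep them adjacent.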
15
Dendrogram orientation
In our particular problem we are also faced with an ordering task: finding a permutation of {1, …, n}.
However, here we should not consider all n! possible permutations, but only the 2^(n-1) that agree with the dendrogram’s structure.
Using dynamic programming, the running time is exponential in the dendrogram’s height, not in its size.
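To make the restricted search space concrete, the sketch below scores every dendrogram-compatible ordering by brute force on a toy example. This is deliberately not the dynamic programming algorithm mentioned on the slide; it only illustrates what is being optimized. The nested-tuple tree format, the function names, and the similarity matrix `w` are assumptions made for this example.

```python
def orientations(tree):
    """Yield all leaf orderings that agree with the dendrogram (nested 2-tuples)."""
    if not isinstance(tree, tuple):
        yield [tree]
        return
    left, right = tree
    for left_order in orientations(left):
        for right_order in orientations(right):
            yield left_order + right_order
            yield right_order + left_order


def arrangement_cost(order, w):
    """Similarity-weighted sum of position differences (linear arrangement cost)."""
    pos = {leaf: p for p, leaf in enumerate(order)}
    leaves = sorted(pos)
    return sum(
        w[i][j] * abs(pos[i] - pos[j])
        for a, i in enumerate(leaves)
        for j in leaves[a + 1:]
    )


# Toy data: elements 0 and 2 sit in different clusters but are fairly similar.
w = [
    [0, 5, 4, 0],
    [5, 0, 0, 0],
    [4, 0, 0, 5],
    [0, 0, 5, 0],
]
tree = ((0, 1), (2, 3))

best = min(orientations(tree), key=lambda order: arrangement_cost(order, w))
best_cost = arrangement_cost(best, w)
print(best, best_cost)
```

The best orientation places elements 0 and 2 next to each other across the cluster boundary, which no ordering outside the 2^(n-1) candidates is needed to achieve.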
16
Dendrogram orientation
We introduce an additional form of the cost function, one that is maximized rather than minimized.
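The formula for this maximized cost is missing from the transcript. One plausible reading, stated here as an assumption and not confirmed by the slide, is the distance-based counterpart of the linear arrangement cost, using the distances d_ij from the Basic Notions slide:

```latex
\max_{x} \; \sum_{i<j} d_{ij}\,\lvert x_i - x_j \rvert ,
\qquad x \text{ an ordering that agrees with the dendrogram}
```

Under this reading, elements that are far apart in the data are pushed far apart in the ordering.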
17
Dendrogram orientation
We are given an ordered dendrogram T and a node v.
Leaves(v) is the set of leaves in the subtree rooted at v.
Let x be the ordering on the leaves.
Let S be Leaves(v), L the set of leaves to the left of S, and R the set of leaves to the right of S.
If |L| = l and |S| = s, we have x(L) = {1, …, l}, x(S) = {l+1, …, l+s}, x(R) = {l+s+1, …, n}.
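A small sketch of how L, S, and R and their position ranges follow from an ordered dendrogram. The nested-tuple representation, the helper names `leaves` and `split_around`, and the requirement that leaf labels be unique are assumptions made for this example.

```python
def leaves(tree):
    """Left-to-right list of leaves of a nested 2-tuple dendrogram."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def split_around(tree, node):
    """Return (L, S, R) and their 1-based position ranges for a subtree `node` of `tree`.

    Assumes leaf labels are unique and `node` is a subtree of `tree`,
    so its leaves occupy a contiguous block of the ordering.
    """
    order = leaves(tree)
    s_leaves = leaves(node)
    n, s = len(order), len(s_leaves)
    l = order.index(s_leaves[0])          # number of leaves to the left of S
    L, S, R = order[:l], order[l:l + s], order[l + s:]
    x_L = list(range(1, l + 1))           # {1, ..., l}
    x_S = list(range(l + 1, l + s + 1))   # {l+1, ..., l+s}
    x_R = list(range(l + s + 1, n + 1))   # {l+s+1, ..., n}
    return (L, S, R), (x_L, x_S, x_R)


v = ("c", "d")
T = (("a", "b"), (v, "e"))
(L, S, R), (x_L, x_S, x_R) = split_around(T, v)
print(L, S, R)
print(x_L, x_S, x_R)
```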
18
Dendrogram orientation
A key concept of the algorithm is the local arrangement cost of a node.
Recall that if |L| = l and |S| = s, we have x(L) = {1, …, l}, x(S) = {l+1, …, l+s}, x(R) = {l+s+1, …, n}.
19
Dendrogram orientation
Two additional related terms will be used.
Another term will also be used in the algorithm.
21
Determining coordinates of the leaves
The remaining step is computing the exact gap between each pair of consecutive leaves.
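The slide's example did not survive extraction. As a baseline illustration only, the sketch below assumes the simplest possible rule: the gap between two consecutive leaves equals the distance d_ij between them. This rule, the function name `leaf_coordinates`, and the toy distance matrix are assumptions for the example, not the method stated on the slides.

```python
def leaf_coordinates(order, d):
    """Assign x-coordinates by accumulating gaps between consecutive leaves.

    Assumed rule (for illustration): gap(a, b) = d[a][b] for neighbours a, b.
    """
    xs = [0.0]
    for a, b in zip(order, order[1:]):
        xs.append(xs[-1] + d[a][b])
    return dict(zip(order, xs))


d = [
    [0, 2, 5],
    [2, 0, 6],
    [5, 6, 0],
]
order = [1, 0, 2]                 # a leaf ordering chosen in the previous step
coords = leaf_coordinates(order, d)
print(coords)
```

This rule looks only at adjacent pairs, so it ignores how far apart non-adjacent leaves should be.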
22
Determining coordinates of the leaves
A better approach is to take a weighted average over all influenced leaf pairs.
23
Computing The y-Coordinates
Principal component analysis
Classical multidimensional scaling
Eigen-projection
Stress minimization
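As an illustration of the first option, the sketch below computes a one-dimensional PCA projection that could serve as a y-coordinate for each element. It is a minimal standard-library version using power iteration. The slide does not say whether the authors constrain the y-axis relative to the x-coordinates, so none is applied here, and the function name and toy data are assumptions.

```python
import math


def first_principal_component(points, iterations=200):
    """Project points onto their first principal component via power iteration."""
    n, dim = len(points), len(points[0])
    means = [sum(p[k] for p in points) / n for k in range(dim)]
    centered = [[p[k] - means[k] for k in range(dim)] for p in points]
    # Covariance matrix of the centered data.
    cov = [
        [sum(c[a] * c[b] for c in centered) / n for b in range(dim)]
        for a in range(dim)
    ]
    # Starting vector; power iteration fails only if it is exactly
    # orthogonal to the leading eigenvector.
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(dim)) for a in range(dim)]
        norm = math.sqrt(sum(t * t for t in w))
        if norm == 0.0:
            break
        v = [t / norm for t in w]
    projections = [sum(c[k] * v[k] for k in range(dim)) for c in centered]
    return projections, v


points = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # lie on the line y = 2x
ys, direction = first_principal_component(points)
print(direction)
print(ys)
```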
24
Result
Odors dataset: consists of 30 volatile odorous pure chemicals.
Contains 262 elements; natural clusters: 30.
UPGMA agglomerative clustering is used to construct the dendrogram.
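For readers unfamiliar with UPGMA (average-linkage agglomerative clustering), here is a minimal sketch of how a dendrogram can be built from a distance matrix. It is a naive standard-library version intended for small inputs, not the implementation used in the experiments. The nested-tuple output format and the toy matrix are assumptions.

```python
def upgma(d):
    """Build a dendrogram (nested 2-tuples) from a symmetric distance matrix.

    At each step the two clusters with the smallest average pairwise
    distance between their members are merged.
    """
    clusters = {i: (i, [i]) for i in range(len(d))}   # id -> (subtree, members)
    next_id = len(d)
    while len(clusters) > 1:
        ids = sorted(clusters)
        best = None
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                members_a, members_b = clusters[a][1], clusters[b][1]
                avg = sum(d[i][j] for i in members_a for j in members_b) / (
                    len(members_a) * len(members_b)
                )
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        tree_a, members_a = clusters.pop(a)
        tree_b, members_b = clusters.pop(b)
        clusters[next_id] = ((tree_a, tree_b), members_a + members_b)
        next_id += 1
    return next(iter(clusters.values()))[0]


d = [
    [0.0, 1.0, 5.0, 5.0],
    [1.0, 0.0, 5.0, 5.0],
    [5.0, 5.0, 0.0, 1.5],
    [5.0, 5.0, 1.5, 0.0],
]
dendrogram = upgma(d)
print(dendrogram)
```

The resulting tree is a full binary tree over the elements, which is the input the orientation step expects.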
25
Result
Iris dataset: an example of discriminant analysis.
Contains 150 elements; natural clusters: 3.
26
Result
Gene expression data (CDC15-synchronized cell cycle): a much larger dataset of gene-expression data.
Contains 6113 elements.
27
Related Work
TreeView: a dendrogram over a color-coded matrix.
28
Discussion
The method succeeds in integrating two key methods in exploratory data analysis: cluster analysis and low-dimensional embedding.
It has two unique properties: guaranteed separation between any kind of given clusters, and the ability to deal with a predefined hierarchical clustering.
29
Personal Opinion
Advantages
─ Successfully integrates the two methods.
─ More intuitive for analysis.
Application
─ Clustering and analyzing real data.
─ May address the lack of clustering information.
Limitations
─ Cannot show the real shape of clusters.