Published by Ariel Hood. Modified over 9 years ago.
Turning Clusters into Patterns: Rectangle-based Discriminative Data Description. Byron J. Gao and Martin Ester, IEEE ICDM 2006. Adviser: Koh Jia-Ling. Speaker: Liu Yu-Jiun. Date: 2006/11/8
2 Introduction The goal of data mining is to discover useful knowledge. Clustering presents clusters as sets of points. The task is to interpret the clusters as human-comprehensible patterns. Previous work considered only the length of the patterns and described the cluster C directly.
3 SOR description Sum of Rectangles (SOR) is the canonical format for cluster descriptions. SOR±: either SOR+ or SOR-. Black: cluster C (R1 and R2). Red: other cluster (R1′). Green: Bc. SOR+ description: R1 + R2. SOR- description: Bc – R1′.
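A minimal sketch of how the two descriptions on this slide, "R1 + R2" and "Bc – R1′", could decide whether a point belongs to the described cluster. It assumes axis-parallel rectangles stored as per-dimension (low, high) bounds and treats Bc as a bounding rectangle. The function names and all coordinates are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Rect = Sequence[Tuple[float, float]]  # one (low, high) pair per dimension
Point = Sequence[float]


def covers(rect: Rect, point: Point) -> bool:
    """True if the point lies inside the axis-parallel rectangle."""
    return all(lo <= x <= hi for (lo, hi), x in zip(rect, point))


def in_sum(rects: List[Rect], point: Point) -> bool:
    """Additive form (R1 + R2): the point is in at least one rectangle."""
    return any(covers(r, point) for r in rects)


def in_difference(bounding: Rect, excluded: List[Rect], point: Point) -> bool:
    """Subtractive form (Bc - R1'): inside Bc but in none of the excluded rectangles."""
    return covers(bounding, point) and not in_sum(excluded, point)


# Hypothetical coordinates, not taken from the slides.
r1 = [(0, 2), (0, 2)]
r2 = [(3, 5), (0, 2)]
bc = [(0, 10), (0, 10)]
r1_prime = [(6, 10), (0, 10)]

point = (1, 1)
print(in_sum([r1, r2], point), in_difference(bc, [r1_prime], point))
```

The two forms need not agree on every point: a point in the gap between R1 and R2 is rejected by the additive form but accepted by the subtractive one, which is where accuracy and length can differ between them.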
4 Notations
5 Example (figure: rectangles R1, R2, R3, R4, R5 and R2′, R3′)
6 Problems Maximum Description Accuracy (MDA). Minimum Description Length (MDL). A novel description format is also introduced.
7 Accuracy Formula Two additional measures: 1. Recall at fixed precision (precision fixed at 1). 2. Precision at fixed recall (recall fixed at 1).
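The precision and recall named on this slide can be illustrated with a short sketch. It assumes the usual definitions: recall is the fraction of cluster points covered by the description, and precision is the fraction of covered points that belong to the cluster. The slide's own accuracy formula is not reproduced here, and the function names and sample points are hypothetical.

```python
def covers(rect, point):
    """True if the point lies inside the axis-parallel rectangle."""
    return all(lo <= x <= hi for (lo, hi), x in zip(rect, point))


def precision_recall(rects, cluster, others):
    """Precision and recall of a sum-of-rectangles description of `cluster`."""
    covered_in = sum(1 for p in cluster if any(covers(r, p) for r in rects))
    covered_out = sum(1 for p in others if any(covers(r, p) for r in rects))
    covered = covered_in + covered_out
    # Assumption: both measures default to 0.0 when their denominator is empty.
    precision = covered_in / covered if covered else 0.0
    recall = covered_in / len(cluster) if cluster else 0.0
    return precision, recall


# Hypothetical data: one rectangle, three cluster points, two other points.
rects = [[(0, 3), (0, 3)]]
cluster = [(1, 1), (2, 2), (4, 4)]
others = [(1.5, 1.5), (8, 8)]

precision, recall = precision_recall(rects, cluster, others)
print(precision, recall)
```

Fixing precision at 1 corresponds to allowing no point from `others` inside any rectangle, and fixing recall at 1 corresponds to requiring every point of `cluster` to be covered.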
8 Three Heuristic Algorithms Learn2Cover (MDL): approximating max length. Length of rectangle. DesTree (MDA): approximating the Pareto front. FindClans: transforms the output of DesTree into a shorter final description.
9 Learn2Cover The algorithm processes the points of Bc one at a time, always taking the next point in the sorted order.
10 Cost of Learn2Cover The cost is defined in terms of the length of rectangle R along dimension Dj and of R′, the rectangle R expanded to cover the new point.
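The two quantities named on this slide can be made concrete with a small sketch: the length of a rectangle along each dimension, and the expanded rectangle R′ obtained when R grows to cover a new point. This does not reproduce the slide's cost formula, which is not available here. It only computes the inputs such a cost would use, and the function names and values are hypothetical.

```python
def side_lengths(rect):
    """Length of the rectangle along each dimension."""
    return [hi - lo for lo, hi in rect]


def expand_to_cover(rect, point):
    """Smallest axis-parallel rectangle containing both `rect` and `point`."""
    return [(min(lo, x), max(hi, x)) for (lo, hi), x in zip(rect, point)]


# Hypothetical rectangle and point.
r = [(0.0, 2.0), (1.0, 3.0)]
r_expanded = expand_to_cover(r, (3.0, 2.0))

# Per-dimension growth caused by the expansion.
growth = [new - old for new, old in zip(side_lengths(r_expanded), side_lengths(r))]
print(r_expanded, growth)
```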
11 DesTree DesTree takes one of the two rectangle sets output by Learn2Cover as input. The tree is built bottom-up: child nodes are merged into parent nodes until a single node is left. Each node represents a rectangle. The higher in the tree we cut, the shorter the description and the lower the accuracy.
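The bottom-up construction described on this slide can be sketched as repeated pairwise merging of rectangles into their bounding box until one rectangle remains. The merge criterion below (smallest merged volume) is an assumption made for illustration, not the criterion used by DesTree, and `build_merge_levels` is a hypothetical name.

```python
from itertools import combinations


def bounding_box(a, b):
    """Smallest axis-parallel rectangle containing rectangles a and b."""
    return [(min(al, bl), max(ah, bh)) for (al, ah), (bl, bh) in zip(a, b)]


def volume(rect):
    """Product of the side lengths of the rectangle."""
    v = 1.0
    for lo, hi in rect:
        v *= hi - lo
    return v


def build_merge_levels(rects):
    """Return the rectangle sets from the leaves (level 0) up to the root."""
    current = [list(r) for r in rects]
    levels = [list(current)]
    while len(current) > 1:
        # ASSUMPTION: merge the pair whose bounding box has the smallest volume.
        i, j = min(
            combinations(range(len(current)), 2),
            key=lambda ij: volume(bounding_box(current[ij[0]], current[ij[1]])),
        )
        merged = bounding_box(current[i], current[j])
        current = [r for k, r in enumerate(current) if k not in (i, j)] + [merged]
        levels.append(list(current))
    return levels


# Hypothetical input: two adjacent rectangles and one far away.
leaves = [[(0, 1), (0, 1)], [(1, 2), (0, 1)], [(8, 9), (8, 9)]]
levels = build_merge_levels(leaves)
print([len(level) for level in levels])
```

Each level plays the role of a cut: higher levels contain fewer rectangles, so the description is shorter, but the merged rectangles cover more space and may include points from other clusters.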
12 merge
13 FindClans FindClans takes a cut from DesTree as input and outputs a description.
14 Algorithm -- FindClans
15 Experiments Comparisons with CART and BP. Real datasets from the UCI repository were used, where data records with the same class label were treated as a cluster.
16 Comparisons with CART Both the MDA and MDL problems are considered.
17 DesTree vs. CART (figure: accuracy and length)
18 Comparisons with BP BP addresses the MDL problem only. Synthetic datasets were used. The length reduction gained is 20% to 50%. Learn2Cover runs without violation checking, so it is faster than BP.
19 Conclusions The novel description format provides enhanced expressive power. MDA allows trading accuracy for interpretability. A paradigm for query-based “second-generation” database mining systems.