Turning Clusters into Patterns: Rectangle-based Discriminative Data Description Byron J. Gao and Martin Ester IEEE ICDM 2006 Adviser: Koh Jia-Ling Speaker: Liu Yu-Jiun Date: 2006/11/8
2 Introduction The goal of data mining is to discover useful knowledge. Clusters are usually presented as sets of points. The aim is to interpret the clusters as human-comprehensible patterns. Previous work considered only the length of the patterns and described the cluster C directly.
3 SOR description Sum of Rectangles (SOR) is the canonical format for cluster descriptions. SOR±: either SOR or SOR−. Figure legend: black: cluster C (R1 and R2); red: other cluster (R1’); green: Bc. SOR description: R1 + R2. SOR− description: Bc – R1’.
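The two descriptions on this slide, R1 + R2 and Bc – R1’, can be read as membership tests. The following is a minimal sketch, assuming closed axis-parallel rectangles stored as (lower corner, upper corner) tuples. The function names `in_rect`, `sum_covers`, and `box_minus_covers` are hypothetical and do not come from the paper.

```python
from typing import Sequence, Tuple

Point = Tuple[float, ...]
Rect = Tuple[Point, Point]  # (lower corner, upper corner); assumed closed and axis-parallel


def in_rect(p: Point, r: Rect) -> bool:
    """True if p lies inside the closed axis-parallel rectangle r."""
    lo, hi = r
    return all(l <= x <= h for l, x, h in zip(lo, p, hi))


def sum_covers(p: Point, rects: Sequence[Rect]) -> bool:
    """R1 + R2 style: p is covered if at least one rectangle contains it."""
    return any(in_rect(p, r) for r in rects)


def box_minus_covers(p: Point, bounding: Rect, others: Sequence[Rect]) -> bool:
    """Bc - R1' style: p is inside the bounding box but outside every
    rectangle that describes the other cluster."""
    return in_rect(p, bounding) and not sum_covers(p, others)


# Illustrative data only (not taken from the slides).
R1 = ((0.0, 0.0), (2.0, 2.0))
R2 = ((3.0, 0.0), (5.0, 2.0))
Bc = ((0.0, 0.0), (5.0, 2.0))
R1_other = ((2.2, 0.0), (2.8, 2.0))
```

With this data, a point of C such as (1.0, 1.0) is accepted by both descriptions, while (2.5, 1.0), which lies in the other cluster's rectangle, is rejected by both.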
4 Notations
5 Example [Figure: rectangles R1, R2, R3, R4, R5 and R2’, R3’]
6 Problems Two description problems: Maximum Description Accuracy (MDA) and Minimum Description Length (MDL). A novel description format is also introduced.
7 Accuracy Formula Two additional measures: 1. Recall at fixed precision (precision fixed at 1). 2. Precision at fixed recall (recall fixed at 1).
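Recall and precision of a rectangle-based description can be computed by counting covered points. The sketch below uses the standard definitions and assumes points and rectangles are plain tuples. The helper names `covered` and `recall_precision` are hypothetical, and the slide's own accuracy formula is not reproduced here.

```python
from typing import Sequence, Tuple

Point = Tuple[float, ...]
Rect = Tuple[Point, Point]  # (lower corner, upper corner); assumed closed


def covered(p: Point, rects: Sequence[Rect]) -> bool:
    """True if any rectangle in rects contains p."""
    return any(
        all(l <= x <= h for l, x, h in zip(lo, p, hi)) for lo, hi in rects
    )


def recall_precision(
    cluster: Sequence[Point], others: Sequence[Point], rects: Sequence[Rect]
) -> Tuple[float, float]:
    """Recall: fraction of cluster points covered by the description.
    Precision: fraction of covered points that belong to the cluster."""
    tp = sum(covered(p, rects) for p in cluster)  # cluster points covered
    fp = sum(covered(p, rects) for p in others)   # other points covered
    recall = tp / len(cluster) if cluster else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Illustrative data only.
cluster = [(1.0, 1.0), (4.0, 1.0), (6.0, 1.0)]
others = [(2.5, 1.0)]
rects = [((0.0, 0.0), (2.0, 2.0)), ((3.0, 0.0), (5.0, 2.0))]
recall, precision = recall_precision(cluster, others, rects)
```

In this example the description covers no point of the other cluster, so it is a "precision fixed at 1" description, and the measure of interest is its recall.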
8 Three Heuristic Algorithms Learn2Cover: addresses the MDL problem, approximating max length (length of rectangle). DesTree: addresses the MDA problem, approximating the Pareto front. FindClans: transforms the output of DesTree into a shorter final description.
9 Learn2Cover is the next point from Bc in the sorted order.
10 Cost of Learn2Cover : the length of rectangle R along dimension Dj. R ’ : the expanded R in covering
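The cost on this slide is defined through the side lengths of R and of its expansion R’. The slide's exact formula is not recoverable, so the sketch below only illustrates the ingredients. Expanding a rectangle to cover a point follows directly from the slide, while the cost used here (total growth in side length over all dimensions) is an assumption, not the paper's definition. All function names are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, ...]
Rect = Tuple[Point, Point]  # (lower corner, upper corner)


def expand(rect: Rect, p: Point) -> Rect:
    """Smallest axis-parallel rectangle containing both rect and p (R')."""
    lo, hi = rect
    new_lo = tuple(min(l, x) for l, x in zip(lo, p))
    new_hi = tuple(max(h, x) for h, x in zip(hi, p))
    return new_lo, new_hi


def side_lengths(rect: Rect) -> List[float]:
    """Length of rect along each dimension Dj."""
    lo, hi = rect
    return [h - l for l, h in zip(lo, hi)]


def expansion_cost(rect: Rect, p: Point) -> float:
    """ASSUMED cost: sum over dimensions of the growth in side length
    when rect is expanded to cover p."""
    before = side_lengths(rect)
    after = side_lengths(expand(rect, p))
    return sum(a - b for a, b in zip(after, before))
```

Under this assumed cost, a point already inside R costs nothing, and a point is cheapest to add to the rectangle that needs the least stretching to reach it.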
11 DesTree DesTree takes the output of Learn2Cover, a set of rectangles, as input. It builds the tree bottom-up, merging child nodes into parent nodes until a single node is left. Each node represents a rectangle. The higher in the tree we cut, the shorter the description and the lower the accuracy.
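The bottom-up construction can be sketched as repeated pairwise merging of rectangles. This is a simplified stand-in rather than the paper's algorithm. It assumes the pair to merge is the one whose bounding rectangle has the smallest volume, which is a merge criterion chosen only for illustration. The names `bounding`, `volume`, and `build_levels` are hypothetical.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]
Rect = Tuple[Point, Point]  # (lower corner, upper corner)


def bounding(a: Rect, b: Rect) -> Rect:
    """Smallest axis-parallel rectangle enclosing both a and b."""
    (alo, ahi), (blo, bhi) = a, b
    lo = tuple(min(x, y) for x, y in zip(alo, blo))
    hi = tuple(max(x, y) for x, y in zip(ahi, bhi))
    return lo, hi


def volume(r: Rect) -> float:
    """Product of the side lengths of r."""
    lo, hi = r
    v = 1.0
    for l, h in zip(lo, hi):
        v *= h - l
    return v


def build_levels(rects: Sequence[Rect]) -> List[List[Rect]]:
    """Return the cuts of the merge tree, from the leaves (input
    rectangles) up to a single root rectangle."""
    level = list(rects)
    levels = [level]
    while len(level) > 1:
        # ASSUMED criterion: merge the pair with the smallest bounding volume.
        i, j = min(
            combinations(range(len(level)), 2),
            key=lambda ij: volume(bounding(level[ij[0]], level[ij[1]])),
        )
        merged = bounding(level[i], level[j])
        level = [r for k, r in enumerate(level) if k not in (i, j)] + [merged]
        levels.append(level)
    return levels
```

Each entry of the returned list is one possible cut. Later entries contain fewer rectangles, so they give shorter descriptions, but the merged rectangles enclose more space and may cover points of other clusters.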
12 merge
13 FindClans FindClans takes a cut from DesTree as input and outputs a description.
14 Algorithm -- FindClans
15 Experiments Comparison with CART and BP. Real datasets from the UCI repository were used, where data records with the same class label were treated as a cluster.
16 Comparisons with CART The comparison concerns both MDA and MDL.
17 DesTree vs. CART [Figure: accuracy, length]
18 Comparisons with BP BP addresses the MDL problem only. Synthetic datasets were used. A length reduction of 20% to 50% was obtained. Learn2Cover runs without violation checking, so it is faster than BP.
19 Conclusions The novel description format provides enhanced expressive power. MDA allows trading accuracy for interpretability. A paradigm for query-based “second-generation” database mining systems.