Published by Paula Boyd. Modified over 6 years ago.
1
Efficient Cluster Detection by Ordered Neighborhoods
Emin Aksehirli, Emmanuel Müller, and Bart Goethals
2
High Dimensional Data
The curse of dimensionality is only the tip of the iceberg. Objects are not similar in all dimensions. Subspace clustering tries to find pairs of object sets and attribute sets: groups of objects that are similar within a subset of the attributes.
3
High Dimensional Data
Real datasets are never that simple. What if we add boolean data?
6
High Dimensional Data
Different and unstructured views on the data. Images! This is the era of Big Data and big tables: they collect all the data, and we are expected to mine it.
7
Neighborhood and Cluster
8
Neighborhood and Cluster
Co-occurrence and shared nearest neighbors (SNN)
9
Problem Setting
Preserve local neighborhoods, in order to find real similarities.
Combine different views on the data. There can be more than one view on the data, and our method should be able to combine them.
Produce explainable results. PCA and SVD transform the data into a space that nobody can interpret, so they do not help us understand the results. If the results are not good, it is very hard to debug them.
Some of the subspace clustering methods are very good, but they take ages to complete.
10
Cartification
Without loss of generality, assume Euclidean similarity on a single dimension and the 3 nearest neighbors of each object.
[Figure: objects 1, 2, 3, ...]
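The setting on this slide can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: it assumes that cartification creates, for each object and each dimension, one "cart" holding the object's k nearest neighbors along that dimension (the object itself included). The function names `cartify_dimension` and `cartify`, the tie-breaking by object id, and the toy values are assumptions made for this example.

```python
def cartify_dimension(values, k):
    """Return one cart (set of object ids) per object: its k nearest
    neighbors along a single dimension, the object itself included."""
    n = len(values)
    carts = []
    for i in range(n):
        # Rank all objects by distance to object i; ties are broken by id.
        ranked = sorted(range(n), key=lambda j: (abs(values[j] - values[i]), j))
        carts.append(frozenset(ranked[:k]))
    return carts


def cartify(data, k):
    """Cartify every dimension of a row-major dataset (a list of rows)."""
    n_dims = len(data[0])
    return [cartify_dimension([row[d] for row in data], k) for d in range(n_dims)]


# Toy data (made up): six objects on one dimension, two obvious groups.
values = [1.0, 1.2, 1.1, 5.0, 5.3, 5.1]
carts = cartify_dimension(values, k=3)
for obj, cart in enumerate(carts):
    print(obj, sorted(cart))
```

With these toy values, objects 0 to 2 all receive the cart {0, 1, 2} and objects 3 to 5 all receive {3, 4, 5}, so objects that are close on this dimension co-occur in many carts.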
11
Cartification
12
Cartification
13
Cartification
14
Cartification
15
Cartification
16
Frequent Itemset Mining
[Diagram: Original DB, Cartified DB, FIs]
Mine the frequent itemsets (FIs) of the cartified DB. They are frequent, but what do they mean in the original DB?
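One way to read the question on this slide: an itemset that is frequent in the cartified DB is a set of objects that appear together in many neighborhoods, which makes it a candidate cluster in the original DB. The sketch below shows this with a deliberately naive level-wise search. The function name `frequent_itemsets`, its parameters, and the toy carts are hypothetical, and a real miner (Apriori, Eclat, FP-growth) would be far more efficient.

```python
from itertools import combinations


def frequent_itemsets(carts, min_support, max_size=3):
    """Naive level-wise search: return {itemset: support} for every set of
    objects that is contained in at least min_support carts."""
    items = sorted({obj for cart in carts for obj in cart})
    result = {}
    for size in range(1, max_size + 1):
        found = False
        for candidate in combinations(items, size):
            cand = frozenset(candidate)
            support = sum(1 for cart in carts if cand <= cart)
            if support >= min_support:
                result[cand] = support
                found = True
        if not found:
            # No frequent set of this size, so no larger set can be frequent.
            break
    return result


# Hypothetical cartified DB: one cart per object, 3 nearest neighbors each.
carts = [frozenset(c) for c in
         ([0, 1, 2], [0, 1, 2], [0, 1, 2], [3, 4, 5], [3, 4, 5], [3, 4, 5])]
fis = frequent_itemsets(carts, min_support=3)
largest = [sorted(s) for s in fis if len(s) == 3]
print(largest)  # [[0, 1, 2], [3, 4, 5]]
```

The two largest frequent itemsets are exactly the two groups of objects that share their neighborhoods, while a mixed set such as {2, 3} never reaches the support threshold.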
17
Transformation
[Diagram: CARTIFICATION maps a High Dimensional DB to an Itemset (Transaction) DB. Clustering the former gives Subspace Clusters; FIM on the latter gives Frequent Patterns.]
On the one hand...
18
Cartification
Frequent itemset mining solves our problem: CartiClus.
“Cartification: A Neighborhood Preserving Transformation for Mining High Dimensional Data” by Emin Aksehirli, Bart Goethals, Emmanuel Müller, and Jilles Vreeken, in Data Mining, ICDM 2013.
However, it is not scalable.
19
Take 2
20
Ordered Neighborhoods
- Storage
- Continuous
- Intersection
- Both for objects and neighborhoods
- Fast
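A possible reading of these bullets, sketched in Python: once the objects of a dimension are sorted by value, the k nearest neighbors of any object form a contiguous run in that order, so a neighborhood can be stored as a pair of ranks and two neighborhoods can be intersected in constant time. This is an interpretation of the slide, not the CLON source code. The names `ordered_neighborhoods` and `intersect` and the toy values are assumptions for illustration.

```python
def ordered_neighborhoods(values, k):
    """For one dimension, return (order, windows): the object ids sorted by
    value, and for each rank i the inclusive rank interval (lo, hi) that
    holds the k nearest neighbors of the object at rank i (itself included)."""
    order = sorted(range(len(values)), key=lambda j: values[j])
    sorted_vals = [values[j] for j in order]
    n = len(sorted_vals)
    windows = []
    for i in range(n):
        lo = hi = i
        while hi - lo + 1 < min(k, n):
            # Grow towards whichever side has the closer next object.
            left_gap = sorted_vals[i] - sorted_vals[lo - 1] if lo > 0 else float("inf")
            right_gap = sorted_vals[hi + 1] - sorted_vals[i] if hi < n - 1 else float("inf")
            if left_gap <= right_gap:
                lo -= 1
            else:
                hi += 1
        windows.append((lo, hi))
    return order, windows


def intersect(a, b):
    """Intersect two rank intervals in O(1); None if they do not overlap."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


# Toy data (made up): the same two groups as before, in shuffled order.
values = [5.0, 1.0, 1.2, 5.3, 1.1, 5.1]
order, windows = ordered_neighborhoods(values, k=3)
print(order)
print(windows)
```

Storing two integers per neighborhood instead of k object ids is what makes both storage and intersection cheap in this sketch; the object ids of an interval are recovered as `order[lo:hi + 1]`.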
21
Neighborhood Matrix
22
Neighborhood Matrix
23
Neighborhood DB
24
Effect of Order
25
Uniform vs. Clusters
26
High Dimensionality
27
Running example - CLON ✓ Attr 1
28
Running example - CLON ✓ Attr 1 Attr 2 ?
29
Running example - CLON ✓ Attr 1 Attr 2 Attr 3 Attr 4 ✗ ? ✗ ? ✓ ?
30
Cluster Detection
31
Size Scale
32
Dimension Scale
33
Noise Detection
34
Irrelevant Dimensions
35
Real World – Gene Expression
Datasets (columns): Alon, Nutt
Our method 0.78
PROCLUS 0.46 0.49
FIRES 0.52 0.55
SUBCLU 0.58 n/a
STATPC
CartiClus
# of Objects 62 50
# of Dims 2000 1377
36
Real World – MovieLens
Star Wars: A New Hope (a.k.a. Star Wars) (1977)
Star Wars: The Empire Strikes Back (1980)
Star Wars: Return of the Jedi (1983)
LotR: The Fellowship of the Ring, The (2001)
LotR: The Two Towers, The (2002)
LotR: The Return of the King, The (2003)
Back to the Future (1985)
Terminator, The (1984)
Terminator 2: Judgment Day (1991)
Die Hard (1988)
Usual Suspects, The (1995)
Pulp Fiction (1994)
Silence of the Lambs, The (1991)
37
Real World – MovieLens
Star Wars: A New Hope (1977)
Star Wars: The Empire Strikes Back (1980)
Star Wars: Return of the Jedi (1983)
LotR: The Fellowship of the Ring, The (2001)
LotR: The Two Towers, The (2002)
LotR: The Return of the King, The (2003)
Brazil (1985)
Dr. Strangelove (1964)
Clockwork Orange, A (1971)
2001: A Space Odyssey (1968)
Blade Runner (1982)
Alien (1979)
Chinatown (1974)
Rear Window (1954)
North by Northwest (1959)
Vertigo (1958)
Psycho (1960)
Silence of the Lambs, The (1991)
Third Man, The (1949)
Citizen Kane (1941)
Godfather: Part II, The (1974)
Godfather, The (1972)
Taxi Driver (1976)
38
Conclusion
Preserves neighborhood information
Combines different similarity measures gracefully
Finds relevant features and discards noise
Fast
Produces explainable results
Transformation... Debuggable
→ Code and the data are available at our website.
Thank you!
39
Application
40
More Experiments
41
More Experiments