Efficient Cluster Detection by Ordered Neighborhoods

1 Efficient Cluster Detection by Ordered Neighborhoods
Emin Aksehirli, Emmanuel Müller, and Bart Goethals

2 High Dimensional Data
The curse of dimensionality is only the tip of the iceberg. Objects are not similar in all dimensions. Subspace clustering tries to find object and attribute pairs.

3 High Dimensional Data
Real datasets are never that simple. What if we add Boolean data?

4 High Dimensional Data
Real datasets are never that simple. What if we add Boolean data?

5 High Dimensional Data
Real datasets are never that simple. What if we add Boolean data?

6 High Dimensional Data
Different and unstructured views on the data. Images! This is the era of Big Data and big tables. They collect all the data, and we are expected to mine them.

7 Neighborhood and Cluster

8 Neighborhood and Cluster
Co-occurrence and SNN

9 Problem Setting
Preserve local neighborhoods
Combine different views on the data
Produce explainable results
The goal is to find real similarities. There can be more than one view on the data, and our method should be able to combine them. PCA and SVD transform the data into a space that nobody can interpret, so they do not help us understand the results. If the results are not good, it is very hard to debug them. Some subspace clustering methods are very good, but they take ages to complete.

10 Cartification
Without loss of generality: Euclidean similarity on a single dimension, 3 nearest neighbors.
Objects 1 2 3,
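The setting on this slide (one dimension at a time, a fixed number of nearest neighbors) can be sketched in a few lines of Python. This is only an illustration under assumptions, not the authors' implementation: the function name `cartify`, the list-of-rows input layout, and the tie-breaking by object id are all hypothetical choices made for the example.

```python
from typing import List, Set


def cartify(data: List[List[float]], k: int) -> List[List[Set[int]]]:
    """For every dimension, build one 'cart' per object: the ids of its
    k nearest objects (itself included) by absolute difference on that
    single dimension. Hypothetical helper, for illustration only."""
    n = len(data)
    n_dims = len(data[0]) if data else 0
    carts_per_dim = []
    for d in range(n_dims):
        column = [row[d] for row in data]
        carts = []
        for i in range(n):
            # Rank all objects by distance to object i on dimension d;
            # ties are broken by object id (an assumption of this sketch).
            ranked = sorted(range(n), key=lambda j: (abs(column[j] - column[i]), j))
            carts.append(set(ranked[:k]))
        carts_per_dim.append(carts)
    return carts_per_dim


# Six objects, one attribute, two obvious groups.
data = [[1.0], [1.2], [1.1], [5.0], [5.3], [5.1]]
carts = cartify(data, k=3)[0]
```

With this toy input, every object in the first group gets the same cart, and likewise for the second group, which is what makes the carts interesting as transactions.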

11 Cartification

12 Cartification

13 Cartification

14 Cartification

15 Cartification

16 Frequent Itemset Mining
Cartified DB ? Original DB
Mine the FIs (frequent itemsets) on the cartified DB. They are frequent, but what do they mean on the original DB?
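To make the question on this slide concrete, here is a minimal level-wise (Apriori-style) miner run on a toy cartified DB. It is a sketch under assumptions: the name `frequent_itemsets`, the toy transactions, and the absolute support threshold are hypothetical, and any standard frequent itemset miner could be used instead.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set


def frequent_itemsets(transactions: List[Set[int]],
                      min_support: int) -> Dict[FrozenSet[int], int]:
    """Return every itemset contained in at least min_support transactions,
    mapped to its support count. Hypothetical helper, for illustration."""
    def support(itemset: FrozenSet[int]) -> int:
        return sum(1 for t in transactions if itemset <= t)

    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items]
    result: Dict[FrozenSet[int], int] = {}
    while level:
        frequent = {}
        for candidate in level:
            count = support(candidate)
            if count >= min_support:
                frequent[candidate] = count
        result.update(frequent)
        keys = list(frequent)
        size = len(keys[0]) + 1 if keys else 0
        # Join frequent itemsets of size m into candidates of size m + 1.
        candidates = {a | b for a, b in combinations(keys, 2) if len(a | b) == size}
        # Apriori pruning: every size-m subset of a candidate must be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, size - 1))]
    return result


# Toy cartified DB: each transaction is the neighborhood (cart) of one object.
carts = [{0, 1, 2}, {0, 1, 2}, {0, 1, 2}, {3, 4, 5}, {3, 4, 5}, {3, 4, 5}]
found = frequent_itemsets(carts, min_support=3)
largest = [s for s in found if len(s) == 3]
```

In this toy case the largest frequent itemsets are the two object groups themselves: objects that keep appearing together in many neighborhoods are candidates for a cluster in the original DB.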

17 Transformation
High Dimensional DB → CARTIFICATION → Itemset (Transaction) DB
Clustering → Subspace Clusters
FIM → Frequent Patterns
On the one hand...

18 Cartification
Frequent Itemset Mining solves our problem.
CartiClus: “Cartification: A Neighborhood Preserving Transformation for Mining High Dimensional Data” by Emin Aksehirli, Bart Goethals, Emmanuel Müller, and Jilles Vreeken in Data Mining, ICDM 2013.
It is not scalable.

19 Take 2

20 Ordered Neighborhoods
- Storage
- Continuous
- Intersection
- Both for objects and neighborhoods
- Fast
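One way to read these bullets: if the objects are sorted on an attribute, each object's nearest neighbors on that attribute form a contiguous run in the sorted order, so a neighborhood can be stored as a (start, end) pair and two neighborhoods can be intersected with two comparisons. The Python sketch below illustrates that reading only; it is an assumption about the idea, not the CLON code, and the names `ordered_neighborhoods` and `intersect` are hypothetical.

```python
from typing import List, Optional, Tuple

Window = Tuple[int, int]  # inclusive (start, end) positions in the sorted order


def ordered_neighborhoods(values: List[float], k: int) -> Tuple[List[int], List[Window]]:
    """Sort object ids by value, then describe the k nearest neighbors of the
    object at each rank as a contiguous window of ranks."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    sorted_vals = [values[i] for i in order]
    windows = []
    for r in range(n):
        lo = hi = r
        while hi - lo + 1 < min(k, n):
            # Grow the window toward whichever adjacent object is closer.
            left_gap = sorted_vals[r] - sorted_vals[lo - 1] if lo > 0 else float("inf")
            right_gap = sorted_vals[hi + 1] - sorted_vals[r] if hi < n - 1 else float("inf")
            if left_gap <= right_gap:
                lo -= 1
            else:
                hi += 1
        windows.append((lo, hi))
    return order, windows


def intersect(a: Window, b: Window) -> Optional[Window]:
    """Intersection of two contiguous windows is a window or empty (None)."""
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None


values = [1.0, 1.2, 1.1, 5.0, 5.3, 5.1]
order, windows = ordered_neighborhoods(values, k=3)
```

Storing two integers per neighborhood instead of a full set of object ids is what makes storage and intersection cheap in this representation.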

21 Neighborhood Matrix

22 Neighborhood Matrix

23 Neighborhood DB

24 Effect of Order

25 Uniform vs. Clusters

26 High Dimensionality

27 Running example - CLON Attr 1

28 Running example - CLON Attr 1 Attr 2 ?

29 Running example - CLON Attr 1 Attr 2 Attr 3 Attr 4 ? ? ?

30 Cluster Detection

31 Size Scale

32 Dimension Scale

33 Noise Detection

34 Irrelevant Dimensions

35 Real World – Gene Expression
Alon Nutt
Our method 0.78
PROCLUS 0.46 0.49
FIRES 0.52 0.55
SUBCLU 0.58 n/a
STATPC
CartiClus
# of Objects 62 50
# of Dims 2000 1377

36 Real World – MovieLens
Star Wars: A New Hope (a.k.a. Star Wars) (1977)
Star Wars: The Empire Strikes Back (1980)
Star Wars: Return of the Jedi (1983)
LotR: The Fellowship of the Ring, The (2001)
LotR: The Two Towers, The (2002)
LotR: The Return of the King, The (2003)
Back to the Future (1985)
Terminator, The (1984)
Terminator 2: Judgment Day (1991)
Die Hard (1988)
Usual Suspects, The (1995)
Pulp Fiction (1994)
Silence of the Lambs, The (1991)

37 Real World – MovieLens
Star Wars: A New Hope (1977)
Star Wars: The Empire Strikes Back (1980)
Star Wars: Return of the Jedi (1983)
LotR: The Fellowship of the Ring, The (2001)
LotR: The Two Towers, The (2002)
LotR: The Return of the King, The (2003)
Brazil (1985)
Dr. Strangelove (1964)
Clockwork Orange, A (1971)
2001: A Space Odyssey (1968)
Blade Runner (1982)
Alien (1979)
Chinatown (1974)
Rear Window (1954)
North by Northwest (1959)
Vertigo (1958)
Psycho (1960)
Silence of the Lambs, The (1991)
Third Man, The (1949)
Citizen Kane (1941)
Godfather: Part II, The (1974)
Godfather, The (1972)
Taxi Driver (1976)

38 Conclusion
Preserves neighborhood information
Combines different similarity measures gracefully
Finds relevant features and discards noise
Fast
Produces explainable results
Transformation... Debuggable
Code and the data are available at our website.
Thank you!

39 Application

40 More Experiments

41 More Experiments

