Efficient Cluster Detection by Ordered Neighborhoods

Emin Aksehirli, Emmanuel Müller, and Bart Goethals

High Dimensional Data
The curse of dimensionality is only the tip of the iceberg. Objects are not similar in all dimensions. Subspace clustering tries to find pairs of object sets and attribute sets.

High Dimensional Data
Real datasets are never that simple. What if we add boolean data?

High Dimensional Data
Different and unstructured views on the data. Images! This is the era of Big Data and big tables. They collect all the data, and we are expected to mine it.

Neighborhood and Cluster

Neighborhood and Cluster
Co-occurrence and SNN
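
The link between co-occurrence and shared nearest neighbors (SNN) can be illustrated with a small sketch. It assumes the common definition of SNN similarity as the number of neighbors two objects have in common. The function name and the toy neighbor lists are hypothetical and not taken from the slides.

```python
def snn_similarity(knn, a, b):
    """Number of neighbors that objects a and b share (assumed SNN definition)."""
    return len(set(knn[a]) & set(knn[b]))


# Hypothetical k-nearest-neighbor lists (k = 3), keyed by object id.
knn = {
    0: [1, 2, 3],
    1: [0, 2, 3],
    2: [0, 1, 3],
    3: [4, 5, 2],
}

close = snn_similarity(knn, 0, 1)  # objects 0 and 1 share neighbors 2 and 3
far = snn_similarity(knn, 0, 3)    # objects 0 and 3 share only neighbor 2
```

Objects that co-occur in many of the same neighborhoods get a high score, which is the intuition the slide title points at.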

Problem Setting
Preserve local neighborhoods
Combine different views on the data
Produce explainable results
The goal is to find real similarities. There can be more than one view on the data, and our method should be able to combine them. PCA and SVD transform the data into a space that nobody can interpret, so they do not help us understand the results. If the results are not good, they are very hard to debug. Some subspace clustering methods are very good, but they take ages to complete.

Cartification
Without loss of generality: Euclidean similarity on one dimension, 3 nearest neighbors.
Objects 1, 2, 3
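
A minimal sketch of the transformation described on this slide, under stated assumptions: for each dimension, every object produces one "cart" (transaction) holding its k nearest objects on that dimension, measured by absolute difference. Whether an object belongs to its own cart is not stated on the slide; here it does, because its distance to itself is zero. The function name `cartify` and the toy data are hypothetical.

```python
def cartify(data, k):
    """Return, per dimension, one cart per object: the set of the k objects
    closest to it on that dimension (the object itself included)."""
    n_dims = len(data[0])
    carts_per_dim = []
    for d in range(n_dims):
        values = [row[d] for row in data]
        carts = []
        for v in values:
            # Rank all objects by 1-D distance to the current object;
            # sorted() is stable, so ties are broken by object index.
            ranked = sorted(range(len(values)), key=lambda j: abs(values[j] - v))
            carts.append(frozenset(ranked[:k]))
        carts_per_dim.append(carts)
    return carts_per_dim


# Toy data: 6 objects, 2 dimensions (values are illustrative only).
data = [
    [1.0, 10.0],
    [1.1, 50.0],
    [1.2, 11.0],
    [9.0, 51.0],
    [9.1, 12.0],
    [9.3, 52.0],
]
carts = cartify(data, k=3)
```

On the first dimension, objects 0, 1, and 2 end up in each other's carts. On the second dimension the grouping is different, which is why each dimension gets its own cart database in this sketch.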

Cartification

Frequent Itemset Mining
Mine the frequent itemsets (FIs) on the cartified DB. They are frequent, but what do they mean on the original DB?
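
One way to read the question: an itemset that is frequent in the cartified DB is a set of objects that appear together in many neighborhoods, so those objects are close to each other in the original data. The sketch below counts support by brute force on a hypothetical cart database. It is an illustration only, not the mining algorithm used by the authors, and the name `frequent_itemsets` is an assumption.

```python
from itertools import combinations


def frequent_itemsets(carts, min_support, size):
    """Brute force: itemsets of the given size that are contained in at
    least min_support carts, mapped to their support."""
    items = sorted(set().union(*carts))
    result = {}
    for candidate in combinations(items, size):
        support = sum(1 for cart in carts if set(candidate) <= cart)
        if support >= min_support:
            result[candidate] = support
    return result


# Hypothetical cartified DB for one dimension: one cart per object.
carts = [
    {0, 1, 2}, {0, 1, 2}, {0, 1, 2},
    {3, 4, 5}, {3, 4, 5}, {3, 4, 5},
]
fis = frequent_itemsets(carts, min_support=3, size=3)
```

Here the two frequent itemsets correspond to the two groups of mutually close objects, which is the sense in which a frequent itemset maps back to a cluster candidate.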

Transformation
Cartification: High Dimensional DB → Itemset (Transaction) DB
On the one hand...
Clustering: Subspace Clusters
FIM: Frequent Patterns

Cartification
Frequent itemset mining solves our problem: CartiClus.
“Cartification: A Neighborhood Preserving Transformation for Mining High Dimensional Data” by Emin Aksehirli, Bart Goethals, Emmanuel Müller, and Jilles Vreeken, in Data Mining, ICDM 2013.
However, it is not scalable.

Take 2

Ordered Neighborhoods
- Storage
- Continuous
- Intersection
- Both for objects and neighborhoods
- Fast
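
One possible reading of these keywords, offered as an assumption rather than as the authors' implementation: if objects are sorted by value on a dimension, a neighborhood is a contiguous run in that order. It can then be stored as two indices and intersected by comparing endpoints. The sliding-window definition of a neighborhood and all function names below are hypothetical.

```python
def sorted_order(values):
    """Object ids arranged in ascending order of their value."""
    return sorted(range(len(values)), key=lambda i: values[i])


def sliding_neighborhoods(n_objects, k):
    """All windows of k consecutive positions in the sorted order,
    stored as half-open (start, end) pairs."""
    return [(s, s + k) for s in range(n_objects - k + 1)]


def intersect(a, b):
    """Intersection of two half-open intervals, or None if it is empty."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


values = [9.0, 1.0, 1.2, 9.3, 1.1, 9.1]   # one dimension, illustrative only
order = sorted_order(values)
windows = sliding_neighborhoods(len(values), k=3)

shared = intersect(windows[0], windows[1])      # overlapping windows
members = [order[p] for p in range(*shared)]    # object ids in the overlap
disjoint = intersect(windows[0], windows[3])    # windows that do not overlap
```

Under this reading, storage is two integers per neighborhood and each intersection takes constant time, regardless of how many objects the neighborhoods contain.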

Neighborhood Matrix

Neighborhood DB

Effect of Order

Uniform vs. Clusters

High Dimensionality

Running example - CLON
[Figure sequence: the running example is extended one attribute at a time, from Attr 1 to Attr 4.]

Cluster Detection

Size Scale

Dimension Scale

Noise Detection

Irrelevant Dimensions

Real World – Gene Expression
Method: Alon, Nutt
Our method: 0.78
PROCLUS: 0.46, 0.49
FIRES: 0.52, 0.55
SUBCLU: 0.58, n/a
STATPC:
CartiClus:
# of Objects: 62, 50
# of Dims: 2000, 1377

Real World – MovieLens
Star Wars: A New Hope (a.k.a. Star Wars) (1977)
Star Wars: The Empire Strikes Back (1980)
Star Wars: Return of the Jedi (1983)
LotR: The Fellowship of the Ring, The (2001)
LotR: The Two Towers, The (2002)
LotR: The Return of the King, The (2003)
Back to the Future (1985)
Terminator, The (1984)
Terminator 2: Judgment Day (1991)
Die Hard (1988)
Usual Suspects, The (1995)
Pulp Fiction (1994)
Silence of the Lambs, The (1991)

Real World – MovieLens
Star Wars: A New Hope (1977)
Star Wars: The Empire Strikes Back (1980)
Star Wars: Return of the Jedi (1983)
LotR: The Fellowship of the Ring, The (2001)
LotR: The Two Towers, The (2002)
LotR: The Return of the King, The (2003)
Brazil (1985)
Dr. Strangelove (1964)
Clockwork Orange, A (1971)
2001: A Space Odyssey (1968)
Blade Runner (1982)
Alien (1979)
Chinatown (1974)
Rear Window (1954)
North by Northwest (1959)
Vertigo (1958)
Psycho (1960)
Silence of the Lambs, The (1991)
Third Man, The (1949)
Citizen Kane (1941)
Godfather: Part II, The (1974)
Godfather, The (1972)
Taxi Driver (1976)

Conclusion
Preserves neighborhood information
Combines different similarity measures gracefully
Finds relevant features and discards noise
Fast
Produces explainable results
Transformation... Debuggable
→ Code and the data are available at our website.
Thank you!

Application

More Experiments
