Living under the Curse of Dimensionality Dave Abel CSIRO.

Roadmap Why spend time on high dimensional data? Non-trivial … Some explanations A different approach Which leads to …

The Subtext: data engineering Solution techniques are compositions of algorithms for fundamental operations; Algorithms assume certain contexts; There can be gains of orders of magnitude in using ‘good’ algorithms that are suited to the context; Sometimes better algorithms need to be built for new contexts.

COTS Database technology ‘Simple’ data is handled well, even for very large databases and high transaction volumes, by relational database; Geospatial data (2d, 2.5d, 3d) is handled reasonably well; But pictures, series, sequences, …, are poorly supported.

For example … Find the 10 days for which trading on the LSX was most similar to today’s, and the pattern for the following day; Find the 20 sequences from SwissProt that are most similar to this one; If I hum the first few bars, can you fetch the song from the music archive?

Dimensionality? It's all in the modelling; k-d means that the important relationships and operations on these objects involve a certain set of k attributes as a bloc; 1d: a list; key properties flow from the value of a single attribute (position in the list); 2d: points on a plane; key properties and relationships flow from position on the plane; 3d and 4d: …

All in the modelling … Take a set of galaxies: Some physical interactions deal with galaxies as points in 3d (spatial) space; Or analyses based on the colours of galaxies could consider them as points in (say) 5d (colour) space;

All in the modelling (>5d)… Complex data types (pictures, graphs, etc) can be modelled as kd points using well-known tricks: –A blinking star could be modelled by the histogram of its brightness; –A photo could be represented as a histogram of brightness x colour (3x3) of its pixels (i.e. as a point in 9d space); –A sonar echo could be modelled by the intensity every 10 ms after the first return.
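A toy illustration of the histogram trick, in a simplified variant: a plain 9-bin brightness histogram rather than the 3x3 brightness-by-colour grid of the slide. The function name and the sample "photo" are made up for the sketch.

```python
# Sketch: turning a complex object (a tiny greyscale "photo" given as a
# 2-D list of pixel brightnesses in [0, 1]) into a k-d point via a
# 9-bin brightness histogram.

def brightness_histogram(pixels, bins=9):
    """Map an image to a point in 9-d space: the normalised
    histogram of its pixel brightnesses."""
    counts = [0] * bins
    flat = [p for row in pixels for p in row]
    for p in flat:
        # min() clamps brightness 1.0 into the last bin.
        idx = min(int(p * bins), bins - 1)
        counts[idx] += 1
    total = len(flat)
    return [c / total for c in counts]

photo = [[0.1, 0.5, 0.9],
         [0.2, 0.5, 0.8],
         [0.1, 0.6, 0.9]]
point = brightness_histogram(photo)
print(point)   # a point in 9-d space; coordinates sum to 1
```

Any distance function over these 9-d points then stands in for "similarity of photos".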

Access Methods Access methods structure a data set for “efficient” search; The standard components of a method are: –Reduction of the data set to a set of sub-sets (partitions); –Definition of a directory (index) of partitions to allow traversal; –Definition of a search algorithm that traverses intelligently.

Only a few variants on the theme Space-based –Cells derived by a regular decomposition of the data space, s.t. cells have ‘nice’ properties; –Points assigned to cells; Data-based –Decomposition of the data set to sub-sets, s.t. the sub-sets have ‘nice’ properties; –Incremental or bulk load. Efficiency comes through pruning: the index supports discovery of the partitions that need not be accessed.

kd an extension of 2d? Extensive r&d on (geo)spatial databases; Surely kd is just a generalisation of the problems in 2d and 3d? Analogues of 2d methods ran out of puff at about 8d, sometimes earlier; Why was this? Did it matter?

The Curse of Dimensionality Named by Bellman (1961); Creep in applicability, to generally include the “not commonsense” effects that become increasingly awkward as the dimensionality rises; And the non-linearity of costs with dimensionality (often exponential); Two examples.

CofD: Example 1 Sample the space [0,1]^d by a grid with a spacing of 0.1: –1d: 10 points –2d: 100 points –3d: 1000 points; –… –10d: 10^10 points;
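The blow-up above can be made concrete in a few lines: with spacing 0.1 there are 10 sample values per axis, so a full grid in [0,1]^d needs 10^d points.

```python
from itertools import product

spacing = 0.1
per_axis = round(1 / spacing)           # 10 sample values per axis
for d in (1, 2, 3):
    grid = list(product(range(per_axis), repeat=d))
    print(d, len(grid))                 # 10, 100, 1000
# For d = 10 the grid would hold 10**10 points -- far too many to build.
print(10, per_axis ** 10)
```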

CofD: Example 2 Determine the mean number of points within a hypersphere of radius r, placed randomly within the unit hypercube with a density of a. Let’s assume r << 1. Trivial if we ignore edge effects; But that would be misleading …

Edge effects? P(edge effect) = 2r (1d) = 4r − 4r^2 (2d) = 6r − 12r^2 + 8r^3 (3d)
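The three expressions above are the low-d expansions of P(edge) = 1 − (1 − 2r)^d: the chance that a radius-r ball centred uniformly in the unit hypercube pokes through some face. A short script (function name illustrative) makes the growth with d concrete:

```python
# P(edge) = 1 - (1 - 2r)^d.  Expanding:
#   d = 1: 2r
#   d = 2: 4r - 4r^2
#   d = 3: 6r - 12r^2 + 8r^3
# matching the slide.

def p_edge(r, d):
    return 1 - (1 - 2 * r) ** d

r = 0.05
for d in (1, 2, 3, 10, 100):
    print(d, p_edge(r, d))
# Even for small r, P(edge) -> 1 as d grows: almost every randomly
# placed ball clips a face in high dimensions.
```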

Which means … If it’s a uniform random distribution, a point is likely to be near a face (or edge) in high- dimensional space; Analyses quickly end up in intractable expressions; Usually, interesting behaviour is lost when models are simplified to permit neat analyses.

Early rumbles … Weber et al [1998]: assertions that tree-based indexes will fail to prune in high-d; Circumstantial evidence; Relied on ‘well-known’ comparative costs for disk and CPU (too generous); Not a welcome report!

Theorem of Instability Reported by Beyer et al [1999], formalised & extended by Shaft & Ramakrishnan [2005]; For many data distributions, as dimensionality grows, all pairs of points tend towards the same distance apart.
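A quick empirical illustration of the instability result (sample sizes and the `contrast` name are arbitrary choices for the sketch): for uniform random data, the ratio of the farthest to the nearest distance from a query point shrinks towards 1 as d grows.

```python
import math
import random

def contrast(d, n=2000, seed=42):
    """Farthest/nearest distance ratio from a random query point
    to n uniform random points in [0,1]^d."""
    rng = random.Random(seed)
    q = [rng.random() for _ in range(d)]
    dists = [math.dist(q, [rng.random() for _ in range(d)])
             for _ in range(n)]
    return max(dists) / min(dists)

for d in (2, 10, 100, 1000):
    print(d, round(contrast(d), 2))   # ratio falls towards 1 with d
```

When the ratio is near 1, "nearest" carries almost no information, which is exactly why contracting-region searches stop pruning.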

Contrast plot, 3 Gaussian sets

Which means … Any search method based on a contracting search region must fall to the performance of a naive (sequential) method, sooner or later; This covers (arguably) all approaches devised to date; So we need to think boldly (or change our interests)...

Target Problems In high-d, operations most commonly are framed in terms of neighbourhoods: –K Nearest Neighbours (kNN) query; –kNN join; –RkNN query. In low-d, operations are most commonly framed in terms of ranges for attributes.

kNN Query For this query point q, retrieve the 10 objects most similar to it. Which requires that we define similarity, conventionally by a distance function; The query type in high-d: almost ubiquitous; Formidable literature.
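A minimal (naive, sequential) kNN query for reference: scan every point, keep the k smallest distances. This is the baseline any index must beat.

```python
import heapq
import math

def knn(points, q, k):
    """Return the k points nearest to q under Euclidean distance."""
    return heapq.nsmallest(k, points, key=lambda p: math.dist(p, q))

data = [(0, 0), (1, 1), (2, 2), (5, 5), (0.5, 0.4)]
print(knn(data, (0, 0), 2))   # [(0, 0), (0.5, 0.4)]
```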

kNN Join For each object of a set, determine the k most similar points from the set; Encountered in data mining, classification, compression, ….; A little care provides a big reward; Not a lot of investigation.
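A naive kNN self-join, purely as a sketch of the operation: for every point of the set, its k nearest other points. This quadratic loop is the cost that specialist join algorithms (the "little care" above) batch and prune.

```python
import heapq
import math

def knn_join(points, k):
    """Map each point's index to its k nearest other points."""
    result = {}
    for i, p in enumerate(points):
        others = [q for j, q in enumerate(points) if j != i]
        result[i] = heapq.nsmallest(k, others,
                                    key=lambda o: math.dist(p, o))
    return result

pts = [(0, 0), (1, 0), (0, 1), (10, 10)]
print(knn_join(pts, 1))   # each point's single nearest neighbour
```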

RkNN Query If a new object q appears, for what objects will it be a k Nearest Neighbour? Eg a chain of bookstores knows where its stores are and where its frequent-buyers live. It is about to open a new store in Stockbridge. For which frequent-buyers will the new store be closer than the current stores? Even less investigation. High costs inhibit use.
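A naive RkNN evaluation straight from the definition, as a sketch only (names and data invented): a point p is in the answer iff q would rank among p's k nearest neighbours once q is added. Re-running that test per point is exactly the cost the slides say inhibits use.

```python
import math

def rknn(points, q, k):
    """Return the points for which q would be a k nearest neighbour."""
    hits = []
    for i, p in enumerate(points):
        others = [x for j, x in enumerate(points) if j != i] + [q]
        others.sort(key=lambda o: math.dist(p, o))
        if q in others[:k]:
            hits.append(p)
    return hits

# Toy version of the bookstore example: buyers' homes, candidate store.
buyers = [(0, 0), (1, 1), (9, 9)]
print(rknn(buyers, (8, 8), 1))   # buyers for whom the new store is nearest
```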

Optimised Partitioning: the bet If we have a simple index structure and a simple search method, we can frame partitioning of the data set as an optimisation (assignment) problem; Although it’s NP-hard, we can probably solve it, well enough, using an iterative method; And it might be faster.

Which requires A. We devise the access method; B. Formal statement of the problem: Objective function; Constraints; C. Solution technique; D. Evaluate.

Partitioning as the core concept Reduce the data set to subsets (partitions). Partitions contain a variable number of points, with an upper limit. Partitions have a Minimum Bounding Box.

Index The index is a list of the partitions’ MBBs; In no particular order; Held in RAM (and so we should impose an upper limit on the number of partitions). I = {id, {low, high}^d}

Mindist Search Discipline Fetch and scan the partitions (in a certain sequence), maintaining a list of the k candidates; To scan a partition, –Evaluate dist from each member to the query point; –If better than the current k’th candidate, place it in the list of candidates.

The Sequence: mindist Can simply evaluate the minimum distance from a query point to any point within an MBB (the mindist for a partition); If we fetch in ascending mindist, we can stop when a mindist is greater than the distance to the current k’th candidate; Conveniently, this is the optimum in terms of partitions fetched.
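The discipline of the last two slides can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; representing a partition as a `(low, high, points)` triple is an assumption consistent with the index slide.

```python
import heapq
import math

def mindist(q, low, high):
    """Minimum distance from q to any point in the MBB [low, high]^d.
    Per axis: 0 if q is inside the interval, else distance to the
    nearer end."""
    s = 0.0
    for qi, lo, hi in zip(q, low, high):
        d = max(lo - qi, 0.0, qi - hi)
        s += d * d
    return math.sqrt(s)

def knn_search(partitions, q, k):
    """partitions: list of (low, high, points). Fetch in ascending
    mindist; stop when mindist exceeds the current k'th candidate."""
    order = sorted(partitions, key=lambda p: mindist(q, p[0], p[1]))
    best = []   # max-heap of (-dist, point): current k candidates
    for low, high, points in order:
        if len(best) == k and mindist(q, low, high) > -best[0][0]:
            break   # no unscanned partition can improve the answer
        for p in points:
            d = math.dist(q, p)
            if len(best) < k:
                heapq.heappush(best, (-d, p))
            elif d < -best[0][0]:
                heapq.heapreplace(best, (-d, p))
    return [p for _, p in sorted(best, reverse=True)]

parts = [((0, 0), (1, 1), [(0.2, 0.2), (0.9, 0.8)]),
         ((5, 5), (6, 6), [(5.5, 5.5)])]
print(knn_search(parts, (0, 0), 2))
```

Note the second partition is never scanned: its mindist already exceeds the distance to the second candidate, which is the pruning that makes the ascending-mindist order optimal in partitions fetched.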

For example [worked example; figure lost: partitions A, B and C around a query point Q, fetched in ascending mindist order until the next mindist exceeds the distance to the current k’th candidate]

Objective Function Minimise the total elapsed time of performing a large set of queries Which requires that we have a representative set of queries, from an historical record or a generator. And we have the solutions for those queries.

The Formal Statement Where A(B) is the cost of fetching a partition of B points, and C(B) is the cost of scanning a partition of B points.

Unit costs acquired empirically We can plug in costs for different environments.

Constraints All points allocated one (and only one) partition; Upper limit on points in a partition; Upper limit on number of partitions used.
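The slide deck shows the formal statement as an image that has not survived. A hedged reconstruction from the surrounding text, with assumed symbols: x_ij = 1 iff point i is assigned to partition j, B_j the size of partition j, F_q the partitions fetched for query q in the representative workload, and A(·), C(·) the fetch and scan costs defined two slides earlier.

```latex
\begin{align*}
\min \quad & \sum_{q} \sum_{j \in F_q} \bigl( A(B_j) + C(B_j) \bigr) \\
\text{s.t.} \quad & \sum_{j} x_{ij} = 1 \quad \forall i
    && \text{each point in exactly one partition} \\
& B_j = \sum_{i} x_{ij} \le B_{\max} \quad \forall j
    && \text{upper limit on points per partition} \\
& \bigl|\{\, j : B_j > 0 \,\}\bigr| \le P_{\max}
    && \text{upper limit on number of partitions}
\end{align*}
```

The assignments x_ij are then the only decision variables, as the next slide notes.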

Finally … Which leaves us with the assignments of points to partitions as the only decision variables.

The Solution Technique Applies a conventional iterative refinement to an Initial Feasible Solution; The problem seems to be fairly placid; Acceptable load times for data sets trialled to date.

How to assess? Not hard to generate meaningless performance data; Basic behaviour: synthetic data (N, d, k, distribution); Comparative: real data sets; Benchmarks: naive method and best-previously-reported; Careful implementation of a naive method can be rewarding.

Response with N of points

Response with Dimensionality

What does it mean? Can reduce times by a factor of 3, below the cutoff; The cutoff depends on the dataset size; Some conjectures drawn from the Theorems are based on an unrealistic model and are probably quantitatively wrong; Times for kNN queries have apparently fallen from 50 ms to 0.5 ms, with part of the gain attributable to system caching.

Join? RkNN? Work in progress! Specialist kNN Join algorithms are well worthwhile; Optimised Partitioning for RkNN works well; Falls in query costs from 5 sec (or so) to 5 ms (or so); Query + join + reverse is a nice package.

Which all suggests (Part 1) Neighbourhood operations are used only in a few, specialised geospatial apps; Specific data structures are used; A more general view of “neighbourhood” might open up more apps; Eg finding clusters of galaxies from catalogues: –Large groups of galaxies that are bound gravitationally; –Available definitions are not helpful in “seeing” clusters. The core element is high density; –Search by neighbourhoods, rather than an arbitrary grid, to find high-density regions.

Which all suggests (Part 2) Algorithms using kNN as a basic operation can be accelerated by (apparently) x100; RkNN is apparently much cheaper than we expected (and …); Designer data structures appear possible (eg design such that no more than 5% of transactions take more than 50 ms).

And which shows … There are many interesting, open problems out there, for data engineers; Using Other People’s Techniques can be quite profitable; Data Engineers can be useful eScience team members.

More?