Scalability, from a database systems perspective Dave Abel.


Roadmap Scale of what? A case study: 2D to kD; Some algorithms for kD similarity joins; So …

Size matters (for joins) Number of sources; Number of points; Number of dimensions. Let's use eAstronomy as an example.

Number of Sources Key issues: Heterogeneity (despite standards); The added sophistication of a more general solution. Optimisation typically flounders through an inability to reliably estimate the sizes of interim sets; But does it really matter?

Number of points "Massive" usually means that the data set is too large to fit in real memory; 10^7 seems to define massive in the database world. Usually target O(log N + k) for queries and O(N log N + k) for joins, in disk I/O.

Number of dimensions Most database access methods are aimed at a single attribute/dimension; the QEP deals with multiple atomic operations. Relatively recent interest in search and joins in high-dimensional space: data mining, image databases, complex objects. Surprises for migrants from geospatial databases. The curse of dimensionality (which the mathematicians have known about all along).

Some simple algebra For N points in the unit d-cube, a sub-cube of side ε containing n points satisfies N·ε^d = n, or ε = (n/N)^(1/d). So ε approaches 1 as d increases. The traditional approaches of restricting the search space fail.
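
The algebra above is easy to check numerically. A minimal sketch (my own illustrative code, not from the talk; the function name is an assumption):

```python
# For N points uniformly spread over the unit d-cube, a sub-cube of side
# eps holding n points satisfies N * eps**d = n, so eps = (n/N)**(1/d).
def cube_side(n, N, d):
    return (n / N) ** (1.0 / d)

# With N = 1,000,000 points and a target of n = 10 neighbours, the
# required search radius per dimension approaches 1 as d grows:
for d in (2, 10, 100):
    print(d, round(cube_side(10, 1_000_000, d), 3))
```

At d = 100 the sub-cube side is about 0.89 of the whole data space, so any method that prunes by restricting a per-dimension range inspects nearly everything.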

But 2D is still interesting Location is often significant: Geospatial Information Systems (aka Geographic Information Systems) are well-established; Many Astronomy challenges deal with 2D databases (although the coordinate system has its tricks). Issues of sheer size make it worthwhile to consider solutions specific to 2D.

The Sweep Algorithms for Key Operations Neighbour finding, aka fixed-radius all-neighbours, aka similarity join; Catalogue matching, aka fuzzy join; Nearest Neighbour; k-Nearest Neighbours.

The sweep algorithm for neighbour finding/similarity join [Figure: sweep line moving across the points, with an active list of candidates within ε of the line]
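
The sweep idea can be sketched as follows (a minimal illustrative version of the general technique, not the speaker's implementation): sort the points by x, sweep left to right, keep an active list of points within ε in x of the sweep line, and test distances only against that list.

```python
def similarity_join(points, eps):
    """All pairs of 2D points within distance eps, via a sweep on x."""
    pts = sorted(points)          # sort by x (then y, via tuple order)
    active, pairs = [], []
    for x, y in pts:
        # Drop active points more than eps behind the sweep line.
        active = [(ax, ay) for ax, ay in active if x - ax <= eps]
        # Only the active list needs distance tests.
        for ax, ay in active:
            if (x - ax) ** 2 + (y - ay) ** 2 <= eps ** 2:
                pairs.append(((ax, ay), (x, y)))
        active.append((x, y))
    return pairs

pairs = similarity_join([(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (3.4, 3.0)], eps=1.0)
print(pairs)  # each qualifying pair is reported exactly once
```

The x-filter is what makes this cheap in 2D, and it is exactly the filter that weakens as d grows, since ε must grow towards 1.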

Extend to kNN 1. Find an upper bound on the distance to the NN; 2. Determine lower bounds on the active list; 3. Determine the NNs.
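
Whatever bounding and pruning steps 1–3 use, the answer must match the brute-force definition of kNN. A tiny baseline against which a sweep-based method can be checked (my own sketch, not from the talk):

```python
import heapq

def knn(points, q, k):
    """The k nearest neighbours of query q, by brute force.

    Steps 1-3 of the sweep method only prune work; any correct
    implementation must return the same set as this baseline.
    """
    return heapq.nsmallest(
        k, points,
        key=lambda p: sum((a - b) ** 2 for a, b in zip(p, q)))

print(knn([(0, 0), (1, 0), (5, 5)], (0, 0), 2))
```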

WIP: preliminaries SDSS/Personal: 155K points, 12 seconds; Tycho2: 2.4M points; k = 10, 1000 seconds; k = 4, 700 seconds. ?? For large data sets. High dependence on density of points. But it will be dismal for high-dimensional problems.

Why Dismal? The active list is a (d-1)-dimensional data set; The epsilon for the active list is high, so the list is large; We have reduced a join to a nasty nested-loop with a query innermost.

kD Similarity Joins & kNN Bounding boxes (bad news after d = 8!); Quadtree techniques; Epsilon Grid Order; Gorder: EGO + dimensionality reduction + some tweaks on selectivity.

Epsilon Grid Order [Figure: points laid over an ε-grid, each labelled with its cell coordinates, e.g. cell (2, 3, 2)]
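
The core of Epsilon Grid Order is that points are sorted lexicographically by their ε-grid cell coordinates (such as the (2, 3, 2) cell in the diagram); only points in nearby cells of that order can be within ε of each other. A minimal sketch of the ordering key (illustrative code, not the authors' implementation):

```python
def ego_key(point, eps):
    """The epsilon-grid cell of a point; lexicographic order over these
    tuples is the epsilon grid order used to linearise the join."""
    return tuple(int(c // eps) for c in point)

pts = [(0.9, 2.1), (0.1, 0.2), (1.5, 0.3)]
print(sorted(pts, key=lambda p: ego_key(p, 1.0)))
```

Candidate pairs are then restricted to runs of the sorted sequence whose cells differ by at most one in each coordinate.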

The lessons Disk I/O optimisation is almost separate from CPU optimisation; Selectivity is critical (i.e. avoidance of distance computations); High data dependence: reliance on the non-uniform distributions of real data sets; How generally applicable are the results?

Best Practice? Gorder from Nat Univ Singapore: 0.58M points, d = 10: t = 1800 seconds, S = 0.07; 30K points, d = 64: t = 150, S = 0.3. Probably about 10x better than brute-force nested loops; Effects of dimensionality are low.

Final Thoughts Where is the split between the memory-resident and disk-based families? Does the pure form of the problem ignore the Physics or other underlying models? kNN is inherently expensive. Is it a classical problem? Parallelisation (with fresh approaches)? Are we near a plateau for similarity join and kNN with large data sets?