Hagenberg - Linz - Prague - Vienna, iiWAS 2002, 10-12 September, Bandung, Indonesia. ε-ISA: AN INCREMENTAL LOWER BOUND APPROACH FOR EFFICIENTLY FINDING APPROXIMATE NEAREST NEIGHBOR OF COMPLEX VAGUE QUERIES

Presentation transcript:

Page 1: ε-ISA: AN INCREMENTAL LOWER BOUND APPROACH FOR EFFICIENTLY FINDING APPROXIMATE NEAREST NEIGHBOR OF COMPLEX VAGUE QUERIES
DANG Tran Khanh, KÜNG Josef, WAGNER Roland
Institute for Applied Knowledge Processing (FAW), Johannes Kepler University of Linz, Austria

Page 2: OUTLINE
- Complex Vague Queries in the Vague Query System (VQS)
- Similarity search problem of the VQS in conventional DBMSs
- Incremental hyper-Sphere Approach (ISA)
- Overcoming the shortcomings of the Incremental hyper-Cube Approach (ICA)
- ε-ISA: Finding Approximate Nearest Neighbors of Complex Vague Queries
- The issue of the dimensionality curse
- The issue of increasing the query condition number
- Experimental Results
- Conclusions

Page 3: COMPLEX VAGUE QUERIES IN THE VAGUE QUERY SYSTEM
The VQS:
- Introduced by Kueng and Palkoska, 1997
- Supports similarity search in conventional DBMSs: it returns to users the records that are semantically close to a given query
- One of the basic ideas of the VQS: NCR-Tables (Numeric-Coordinate-Representation Tables), which keep numeric semantic information for non-numeric attributes

Page 4: COMPLEX VAGUE QUERIES IN THE VAGUE QUERY SYSTEM
NCR-Tables: an example
[Figure: an NCR-table, with labels: fuzzy field, NCR-key, NCR-columns]
SELECT FROM Car WHERE Col IS dark blue INTO myResultTable;
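A minimal sketch of how an NCR-Table could resolve a vague condition such as "Col IS dark blue". It assumes a hypothetical table that maps each colour name (the fuzzy field, used as NCR-key) to RGB coordinates (the NCR-columns). The names `NCR_COLOUR` and `nearest_ncr_values` and all coordinate values are illustrative assumptions, not taken from the VQS.

```python
import math

# Hypothetical NCR-Table: NCR-key (value of the fuzzy field) -> NCR-columns.
# The RGB coordinates are assumptions made for this illustration only.
NCR_COLOUR = {
    "dark blue": (0, 0, 139),
    "blue": (0, 0, 255),
    "navy": (0, 0, 128),
    "red": (255, 0, 0),
    "green": (0, 128, 0),
}


def nearest_ncr_values(table, query_key):
    """Return all NCR-keys ordered by Euclidean distance to the query key."""
    q = table[query_key]
    return sorted(table, key=lambda k: math.dist(table[k], q))


ranking = nearest_ncr_values(NCR_COLOUR, "dark blue")
print(ranking)
```

With these assumed coordinates, a car whose colour is stored as "navy" ranks directly after an exact "dark blue" match, which is the kind of semantically close answer a vague query is meant to return.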

Page 5: COMPLEX VAGUE QUERIES IN THE VAGUE QUERY SYSTEM
Complex Vague Queries in VQS: a simplified view of the problem
[Figure: a vague query processing module over a query relation with attributes Attribute 1 … Attribute n (values Value_11 … Value_nk), together with NCR-Table 1 … NCR-Table n and Index 1 … Index n]

Page 6: COMPLEX VAGUE QUERIES IN THE VAGUE QUERY SYSTEM
The issue of the dimensionality curse [Weber et al 1998; Beyer et al 1999]
NCR-Tables with high-dimensional data:
- The probability of overlaps between a query and data regions is very high, so the performance of multidimensional access methods (MAMs) decreases significantly
- A linear scan over the whole data set would perform better than MAMs
Approximate nearest neighbor problem: P is a (1+ε)-approximate nearest neighbor of Q if
$dist(Q, P) \le (1+\varepsilon)\, dist(Q, P^{*})$ (1)
where P* denotes the exact nearest neighbor of Q.
Existing work is almost entirely for single data sets: single-feature nearest neighbor (S-FNN) queries [Arya et al 1998, Kleinberg 1997, Amato et al 2000, Ciaccia and Patella 2000, etc.]
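Condition (1) can be checked directly on a small in-memory data set. The sketch below assumes Euclidean distance and finds the exact nearest neighbor by a linear scan; the function name `is_approx_nn` and the sample points are hypothetical.

```python
import math


def is_approx_nn(query, candidate, points, eps):
    """True if `candidate` is a (1+eps)-approximate nearest neighbor of `query`.

    The exact nearest-neighbor distance is found by a linear scan over `points`.
    """
    exact = min(math.dist(query, p) for p in points)
    return math.dist(query, candidate) <= (1.0 + eps) * exact


points = [(0.0, 1.0), (0.0, 1.5), (3.0, 4.0)]
query = (0.0, 0.0)
print(is_approx_nn(query, (0.0, 1.5), points, eps=0.5))
print(is_approx_nn(query, (3.0, 4.0), points, eps=0.5))
```

Here the exact nearest neighbor lies at distance 1.0, so with ε = 0.5 any point within distance 1.5 is an acceptable answer.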

Page 7: COMPLEX VAGUE QUERIES IN THE VAGUE QUERY SYSTEM
Solving Complex Vague Queries in VQS: random access [Fagin 1996] is impossible
Query relation:
Attr1  Attr2
x1     y1
x1     y2
x2     y1
…      …
NCR-Table (Attr1, Domain1 [Values]): x1 …, x2 …, …
NCR-Table (labelled "Attr1, Domain1 [Values]" on the slide): y1 …, y2 …, …

Page 8: COMPLEX VAGUE QUERIES IN THE VAGUE QUERY SYSTEM
Incremental hyper-Cube Approach (ICA) [Kueng and Palkoska 1999]
Issues with the ICA (see [Dang et al 2002a, Dang et al 2002b] for details):
- How to determine the initial hyper-cubes?
- How to extend the hyper-cubes when necessary?
- Unnecessary disk pages and objects are accessed
- Disk accesses are repeated
- Only the best-match record is returned (not the top-k records)

Page 9: INCREMENTAL HYPER-SPHERE APPROACH (ISA)
Input:
- A query relation/view S
- A complex vague query Q with n query conditions q_i (i = 1, 2, …, n)
- Assume each feature space (or NCR-Table) related to Q is managed by a multidimensional index structure F_i
Output: the best-match record/tuple T_min for Q, T_min ∈ S. Ties are broken arbitrarily.
Step 1: Search each F_i for the corresponding q_i using the adapted incremental algorithm for hyper-sphere range queries.
Step 2: Combine the search results from all q_i to find at least one appropriate record in S, i.e. a record that contains the returned NCR-Values for every query condition. If no appropriate record is found, go back to step 1.
Step 3: Compute the total distances/scores of the found records using formula 2 below and find a record T_min with the minimum total distance TD_cur. Ties are broken arbitrarily.

Page 10: INCREMENTAL HYPER-SPHERE APPROACH (ISA)

Page 11: INCREMENTAL HYPER-SPHERE APPROACH (ISA)
Step 4: Compute the maximum searching radius for each q_i with respect to TD_cur using formula 3 and continue the search as in steps 1, 2 and 3 until one of the following two conditions holds:
(a) the current searching radius of each q_i is greater than or equal to its maximum searching radius;
(b) a new appropriate record T_new is found with total distance TD_new < TD_cur.
Step 5: If condition (a) holds, return T_min as the best match for Q. Otherwise, i.e. condition (b) holds, replace T_min with T_new (so TD_cur is replaced with the smaller value TD_new) and go back to step 4.
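The five ISA steps can be sketched in memory as follows. Formulas 2 and 3 are not reproduced in this transcript, so the sketch makes two loudly labelled assumptions: the total distance is the plain sum of the per-condition Euclidean distances, and the maximum searching radius of every q_i is therefore TD_cur itself (a record with any single distance above TD_cur cannot have a smaller sum). Each index F_i is replaced by a dictionary, the incremental range query by a radius that grows by a fixed `step`, and the relation is rescanned in every round, which the real approach is designed to avoid. All names (`isa_best_match`, `spaces`, `relation`) are hypothetical.

```python
import math


def isa_best_match(relation, spaces, query, step=1.0):
    """Return (best record, total distance) for a multi-condition query.

    relation: list of records, each a tuple with one NCR-key per condition
    spaces:   one dict per condition, NCR-key -> coordinates (stand-in for F_i)
    query:    one query point per condition
    """
    n = len(query)
    # Distance from q_i to every NCR-value in feature space i.
    dist = [{v: math.dist(c, query[i]) for v, c in spaces[i].items()}
            for i in range(n)]
    # Until a record is found, a radius never needs to exceed the farthest value.
    limit = [max(d.values()) for d in dist]
    radius = [0.0] * n
    best, td_cur = None, math.inf
    while True:
        # Step 1: enlarge every hyper-sphere that has not reached its limit.
        for i in range(n):
            if radius[i] < limit[i]:
                radius[i] = min(radius[i] + step, limit[i])
        # Steps 2 and 3: combine results, keep the record with minimum distance.
        for rec in relation:
            if all(dist[i][rec[i]] <= radius[i] for i in range(n)):
                td = sum(dist[i][rec[i]] for i in range(n))  # ASSUMED formula 2
                if td < td_cur:
                    best, td_cur = rec, td
        # Step 4: maximum searching radius, ASSUMED equal to TD_cur (sum case).
        limit = [min(limit[i], td_cur) for i in range(n)]
        # Step 5, condition (a): every radius has reached its maximum.
        if all(radius[i] >= limit[i] for i in range(n)):
            return best, td_cur


spaces = [
    {"a": (0.0,), "b": (2.0,), "c": (5.0,)},
    {"x": (1.0,), "y": (3.0,), "z": (9.0,)},
]
query = [(0.0,), (1.0,)]
relation = [("a", "z"), ("b", "y"), ("c", "x")]
print(isa_best_match(relation, spaces, query))
```

In this toy data the per-condition nearest values "a" and "x" never occur together in one record, so the search has to keep expanding until the record ("b", "y") is covered and the radii reach the bound derived from its total distance.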

Page 12: INCREMENTAL HYPER-SPHERE APPROACH (ISA)
Modifying ISA to retrieve top-k records: see [Dang et al 2002b]
With high-dimensional feature spaces and/or an increasing number of query conditions, ISA performance decreases

Page 13: ε-ISA: FINDING APPROXIMATE NEAREST NEIGHBORS OF COMPLEX VAGUE QUERIES
CVQ = M-FNN (Multi-Feature Nearest Neighbor) query
Using the lower bound total distance (LBTD)

Page 14: ε-ISA: FINDING APPROXIMATE NEAREST NEIGHBORS OF COMPLEX VAGUE QUERIES
Input:
- A query relation/view S
- A complex vague query Q with n query conditions q_i (i = 1, 2, …, n)
- Assume each feature space (or NCR-Table) related to Q is managed by a multidimensional index structure F_i
- A real ε > 0 used as the error tolerance
Output: a (1+ε)-approximate NN record/tuple T_app for Q, T_app ∈ S. Ties are broken arbitrarily.
Step 1: Search each F_i for the corresponding q_i using the adapted incremental algorithm for hyper-sphere range queries.
Step 2: Combine the search results from all q_i to find at least one appropriate record in S, i.e. a record that contains the returned NCR-Values for every query condition. If no appropriate record is found, go back to step 1.
Step 3: Compute the total distances/scores of the found records using formula 2 and find a record T_app with the minimum total distance TD_cur. Ties are broken arbitrarily.

Page 15: ε-ISA: FINDING APPROXIMATE NEAREST NEIGHBORS OF COMPLEX VAGUE QUERIES
Step 4: Let d_i be the distance from query condition q_i to the last NCR-Value returned in the corresponding feature space, which is managed by F_i. Compute LBTD as follows:
$LBTD = \min\{TD_{cur}, d_i\},\quad i = 1, 2, \dots, n$ (5)
Step 5: If TD_cur ≤ (1+ε)·LBTD, return T_app as a (1+ε)-approximate NN record for Q. Otherwise, go to step 6.
Step 6: Compute the maximum searching radius for each q_i with respect to TD_cur using formula 3 and continue the search as in steps 1 to 5 until the algorithm stops at step 5. If the current searching radius of a certain q_i is greater than or equal to its maximum searching radius, searching on F_i is stopped.
See next slide
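Steps 4 and 5 reduce to a small stopping test. The sketch below takes formula (5) literally as it appears on the slide, i.e. the minimum over TD_cur and all d_i; the function names and the sample numbers are hypothetical, and any weighting that the full paper may apply is not modelled.

```python
def lower_bound_total_distance(td_cur, last_distances):
    """Formula (5) as given on the slide: LBTD = min{TD_cur, d_i}, i = 1..n."""
    return min([td_cur, *last_distances])


def can_stop(td_cur, last_distances, eps):
    """Step 5: stop as soon as TD_cur <= (1 + eps) * LBTD."""
    lbtd = lower_bound_total_distance(td_cur, last_distances)
    return td_cur <= (1.0 + eps) * lbtd


# Current best total distance 4.0; the last values returned lie at distance 2.0.
print(can_stop(4.0, [2.0, 2.0], eps=0.2))
print(can_stop(4.0, [2.0, 2.0], eps=1.0))
# The search frontier has moved further out, so a small eps is already enough.
print(can_stop(4.0, [3.5, 5.0], eps=0.2))
```

A larger ε lets the search stop while the frontier distances d_i are still small, which is where the savings in accessed objects and disk pages come from; as the d_i grow towards TD_cur the test succeeds for any ε.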

Page 16: ε-ISA: FINDING APPROXIMATE NEAREST NEIGHBORS OF COMPLEX VAGUE QUERIES
Lower Bound Total Distance: an example
[Figure: example with query relation QR (Attr1, Attr2), objects A, B, C, D and query conditions q1, q2]

Page 17: ε-ISA: FINDING APPROXIMATE NEAREST NEIGHBORS OF COMPLEX VAGUE QUERIES
Approximate k-nearest neighbors
See our paper for more details

Page 18: EXPERIMENTAL RESULTS
Data sets:
- Uniformly distributed: 2, 4, and 8 dimensions (100K objects each)
- Real: 9 and 16 dimensions (more than 64K feature vectors of images, URL: …)
The SH-tree [Dang et al 2001a] is used to manage the multidimensional data
Page size: 8KB
100 query points were randomly selected from each corresponding data set...

Page 19: EXPERIMENTAL RESULTS
[Figure: 2-condition (4-d and 8-d) NN queries, different ε values]

Page 20: EXPERIMENTAL RESULTS
[Figure: 2-condition (4-d) k-NN queries, ε = 0.2]

Page 21: EXPERIMENTAL RESULTS
3-condition (2-d) NN queries, different ε values
2-condition NN queries (9-d and 16-d real data sets), ε = 1
- ε = 1 means that an error of up to 100% is tolerated
- ε-ISA saved about 4.5 % and 1% of the affected objects and disk accesses, respectively, for the 16-d data set, while the accuracy remained at 71%
- One notable fact here is that the effective epsilon, calculated as introduced in (Arya et al. 1998), is quite low, only … This is a very promising result.

Page 22: CONCLUSIONS
- ε-ISA: An Incremental Lower Bound Approach for Efficiently Finding Approximate Nearest Neighbor of Multi-Feature Queries in VQS
- ε-ISA is one of the vanguard solutions to this problem
- ε-ISA is very useful for application domains in which the returned results need not be exact but only similar or approximately similar (within a certain error tolerance) to a given query. The experimental results confirm this.
- With a suitable ε value, ε-ISA can save a very high percentage of the costs, both IO-cost and CPU-cost, while still preserving a very high accuracy of the returned results
- ε-ISA is applicable not only to numeric domains such as NCR-Tables, but also to any ranked input
- Application areas: TIS (tourist information systems), GIS, digital libraries, multimedia systems, etc.

Page 23: More information
URL: {khanh, jkueng,

Page 24: Research related to dealing with complex vague queries
The A_0 algorithm [Fagin 1996] (there are some improvements of Fagin's algorithm; see the paper for more details):
- Finds the top-k matches for a user query involving several multimedia attributes
- Problem: the algorithm assumes that random access is possible in the system. This assumption is correct only if the following three conditions hold:
  1. there is at least one key for each subsystem,
  2. there is a mapping between the keys,
  3. the mapping is one-to-one.
- In VQS: condition (1) is always satisfied (each fuzzy field is the key of the corresponding NCR-table), but there is no one-to-one mapping between the fuzzy fields
- Therefore the algorithm cannot be applied to our problem

Page 25: Research related to dealing with complex vague queries (cont.)
Other approaches for multimedia databases: [Ortega et al 1997, Chaudhuri et al 1996, Boehm K. et al 2001] (see our paper)
Chaudhuri et al introduced a solution that translates a top-k multi-feature query into a range query that a conventional DBMS can process. This approach employs information in the histograms kept by a relational system …

Page 26: ISA and J* algorithm
The ISA:
- The input is ranked with the support of the incremental algorithm adapted for range queries
- Reduces the database access cost first; this cost and the processed states are reduced by taking into account the hyper-sphere range queries and computing the maximum searching radii
- Derived from the ICA, which had been introduced earlier and had the same overall goals as the J* algorithm
The J* algorithm:
- Assumes that the ranked input is available and does not show how to obtain it
- Reduces the processed states first; the database access cost is alleviated by the iterative deepening technique (S. Russell and P. Norvig: Artificial Intelligence: A Modern Approach. Prentice Hall, Inc., 1995)
- Claimed to be the first algorithm that can process joins of ranked input and multi-level joins