Scalable and Distributed Similarity Search in Metric Spaces
Michal Batko, Claudio Gennaro, Pavel Zezula

Presentation transcript:

2 Presentation contents
- Motivation
- Metric spaces and similarity searching
- GHT*
  - Concepts
  - Generalized Hyperplane Tree
  - Distributed architecture
- Experimental results
- Conclusions and future work

3 Motivation
- Searching is a fundamental problem
- Traditional search
  - Numbers or strings
  - Based on total linear order of keys
- New approach
  - Free text, images, audio, video, etc.
  - Impossible to structure in keys and records

4 Alternative
- Metric spaces
- Similarity searching

5 Metric space
- Set of objects (A)
  - any class of objects that allows distance computation
  - for example text, audio, or video files
- Metric function (d)
  - positive: d(x, y) ≥ 0
  - reflexive: d(x, x) = 0
  - symmetric: d(x, y) = d(y, x)
  - triangle inequality: d(x, z) ≤ d(x, y) + d(y, z)
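The four postulates can be checked mechanically on a finite sample. The sketch below is only an illustration: the helper name `satisfies_metric_postulates`, the tolerance `tol`, and the sample points are assumptions, and passing the check on a sample does not prove that a function is a metric in general.

```python
import itertools
import math


def satisfies_metric_postulates(objects, d, tol=1e-9):
    """Check the four postulates on every pair and triple of a finite sample."""
    for x, y in itertools.product(objects, repeat=2):
        if d(x, y) < 0:                        # positive (non-negative)
            return False
        if x == y and abs(d(x, y)) > tol:      # reflexive: d(x, x) = 0
            return False
        if abs(d(x, y) - d(y, x)) > tol:       # symmetric
            return False
    for x, y, z in itertools.product(objects, repeat=3):
        if d(x, z) > d(x, y) + d(y, z) + tol:  # triangle inequality
            return False
    return True


def squared_euclidean(x, y):
    """Not a metric: squaring the distance breaks the triangle inequality."""
    return math.dist(x, y) ** 2


# Hypothetical sample points in the plane.
sample = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0), (1.0, -2.0)]
print(satisfies_metric_postulates(sample, math.dist))
print(satisfies_metric_postulates(sample, squared_euclidean))
```

The Euclidean distance passes, while the squared Euclidean distance fails on the collinear points (0, 0), (3, 4), (6, 8), which is why it cannot be used directly with metric indexes.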

6 Similarity searching
- Range search: objects at distance at most r from the query object Q
- k-nearest neighbor search: the k objects nearest to the query object Q
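Both query types can be stated precisely as a linear scan over the data set. This is a baseline sketch, not the indexed search the slides go on to describe; the function names and the sample data are assumptions made for illustration.

```python
import heapq
import math


def range_search(objects, d, q, r):
    """Return every object whose distance from query q is at most r."""
    return [o for o in objects if d(q, o) <= r]


def knn_search(objects, d, q, k):
    """Return the k objects closest to query q, nearest first."""
    return heapq.nsmallest(k, objects, key=lambda o: d(q, o))


# Hypothetical data set of 2-D points under the Euclidean distance.
data = [(0, 0), (1, 1), (2, 2), (5, 5), (9, 9)]
print(range_search(data, math.dist, (0, 0), 3.0))
print(knn_search(data, math.dist, (4, 4), 2))
```

A scan computes one distance per stored object for every query, which is the cost an index structure tries to avoid.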

7 GHT* – concepts
- Data distributed among servers
  - Multiple buckets with limited capacity
- Clients perform updates and search
- Bucket location algorithm
  - Based on DDH and DST algorithms
  - Exploits Generalized Hyperplane Tree

8 Generalized Hyperplane Tree
- Single-site metric space indexing structure
- Allows similarity searching and is scalable
- Binary search tree
  - Data stored in leaf nodes
  - Inner nodes for routing
  - Two “pivots” per node
[Figure: example Generalized Hyperplane Tree over objects p1 to p14]
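A minimal single-site sketch of such a tree follows. Several details are assumptions that the slides do not specify: the pivot choice (the first bucket object and the first object at non-zero distance from it), the bucket capacity, and the class and method names. Objects go to the side of the closer pivot, and a range query prunes a subtree only when the triangle inequality guarantees that it cannot hold a match.

```python
import math


class Node:
    def __init__(self):
        self.bucket = []    # objects, used only while the node is a leaf
        self.pivots = None  # (p1, p2) once the node becomes an inner node
        self.left = None    # subtree of objects at least as close to p1
        self.right = None   # subtree of objects strictly closer to p2


class GHT:
    def __init__(self, d, capacity=4):
        self.d = d
        self.capacity = capacity  # assumed bucket capacity
        self.root = Node()

    def _side(self, node, obj):
        p1, p2 = node.pivots
        return node.left if self.d(obj, p1) <= self.d(obj, p2) else node.right

    def insert(self, obj):
        node = self.root
        while node.pivots is not None:  # inner nodes only route
            node = self._side(node, obj)
        node.bucket.append(obj)
        if len(node.bucket) > self.capacity:
            self._split(node)

    def _split(self, node):
        p1 = node.bucket[0]  # assumed pivot selection, not from the slides
        p2 = next((o for o in node.bucket if self.d(p1, o) > 0), None)
        if p2 is None:
            return  # all objects coincide, so the bucket stays overfull
        objects, node.bucket = node.bucket, []
        node.pivots, node.left, node.right = (p1, p2), Node(), Node()
        for o in objects:
            self._side(node, o).bucket.append(o)

    def range_search(self, q, r):
        result, stack = [], [self.root]
        while stack:
            node = stack.pop()
            if node.pivots is None:
                result.extend(o for o in node.bucket if self.d(q, o) <= r)
                continue
            d1 = self.d(q, node.pivots[0])
            d2 = self.d(q, node.pivots[1])
            if d1 - r <= d2 + r:  # the left side may contain a match
                stack.append(node.left)
            if d1 + r >= d2 - r:  # the right side may contain a match
                stack.append(node.right)
        return result


# Hypothetical data: a 10 x 10 grid of points under the Euclidean distance.
points = [(float(x), float(y)) for x in range(10) for y in range(10)]
tree = GHT(math.dist, capacity=4)
for p in points:
    tree.insert(p)
print(sorted(tree.range_search((4.5, 4.5), 1.0)))
```

Because p1 always falls on the left and p2 on the right, every split leaves both children non-empty and within capacity.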

9 GHT* – distributed architecture
- GHT is used as search structure
- Leaf node represents a server
  - unique server identifier
  - servers extend the tree with leaf nodes for their local buckets
- Inner nodes store routing information
- GHT is replicated
  - GHT can be inaccurate
  - Update (image adjustment) messages
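The interplay of a replicated, possibly outdated tree and image adjustment messages can be illustrated with a toy simulation. Everything here is an assumption made for the sketch: the class names, a one-dimensional space with absolute difference as the metric, and an adjustment that replaces the client's whole view with the forwarding server's view, whereas a real system might send only the relevant part of the tree.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class Leaf:
    server_id: int  # leaf of the search structure: a server identifier


@dataclass
class Inner:
    p1: float       # routing information: two pivots
    p2: float
    left: "Tree"
    right: "Tree"


Tree = Union[Leaf, Inner]


def locate(tree, obj, d):
    """Follow the closer pivot at each inner node and return a server id."""
    while isinstance(tree, Inner):
        tree = tree.left if d(obj, tree.p1) <= d(obj, tree.p2) else tree.right
    return tree.server_id


@dataclass
class Server:
    view: Tree                      # this server's replica of the tree
    stored: list = field(default_factory=list)


@dataclass
class Client:
    view: Tree                      # may lag behind the servers' replicas


def insert(client, servers, obj, d):
    """Insert obj and return the number of forwarding hops it needed."""
    hops = 0
    target = locate(client.view, obj, d)
    while True:
        server = servers[target]
        correct = locate(server.view, obj, d)
        if correct == target:
            server.stored.append(obj)
            return hops
        # Misaddressed: forward the request and adjust the client's image.
        client.view = server.view
        target, hops = correct, hops + 1


def distance(a, b):
    return abs(a - b)


# Server 0 has split into servers 0 and 1; the client still sees one server.
current = Inner(0.0, 10.0, Leaf(0), Leaf(1))
servers = {0: Server(current), 1: Server(current)}
client = Client(Leaf(0))
first = insert(client, servers, 9.0, distance)   # forwarded once
second = insert(client, servers, 8.0, distance)  # addressed directly
print(first, second)
```

The first insert is misaddressed and forwarded, after which the client's view is current and the second insert reaches the right server directly. The loop terminates here because both servers hold the same view; inconsistent server views would need extra handling.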

10 GHT* – distributed architecture

11 Experimental results – inserting
- Preliminary phase
- Tests for vector space with Euclidean distance function

objects                 min     max     avg
Occupied buckets
Occupied servers
Overall bucket load
Maximal tree depth
Replication             3.9%    5.9%    5%

12 Experimental results – searching
- 20 range queries with radius 50 points (match approx. 3 objects)

13 Conclusions
- First structure for scalable distributed similarity search
- Satisfies properties of SDDS
  - Scalability – can expand to new servers through autonomous splits
  - No hot-spot – all clients use as precise addressing as possible and learn from misaddressing
  - Updates are local and never require updates to multiple clients
- Client performs only a few distance computations to locate servers

14 Future work
- More experiments
  - Different metric spaces
  - More complex evaluation
  - Additional evaluated properties
- Nearest neighbor search
  - Algorithm for parallel processing to better utilize distributed structure
  - Experimental evaluation

Questions?