Parallel dynamic batch loading in the M-tree
Jakub Lokoč
Department of Software Engineering, Charles University in Prague, FMP


Presentation outline
• M-tree
  ◦ The original structure
  ◦ Simple parallel construction
  ◦ Concurrent parallel construction
• Parallel batch loading
• Experimental results

Motivation
• CPU development is trending toward multi-core architectures, so we need scalable algorithms, e.g., for index construction
• Faster indexing: applications
  ◦ A user wants to upload many new objects
  ◦ More sophisticated indexing methods
  ◦ Re-indexing
• Scientists can run many more tests

M-tree (metric tree)
• dynamic, balanced, and paged tree structure (like, e.g., the B+-tree or R-tree)
• the leaves are clusters of indexed objects O_j (ground objects)
• routing entries in the inner nodes represent hyper-spherical metric regions (O_i, r_Oi), recursively bounding the object clusters in the leaves
• the triangle inequality allows irrelevant M-tree branches (i.e., metric regions) to be discarded during query evaluation
[Figure: range query Q in Euclidean 2D space]
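The pruning rule on this slide can be illustrated with a minimal sketch. The dictionary-based node layout, the function name `range_query`, and the Euclidean distance are assumptions made for the example, not the structure used in the talk. A branch with routing object `center` and covering `radius` is skipped when `d(q, center) > r + radius`, because the triangle inequality then guarantees that no object below it can lie within `r` of the query.

```python
import math

def range_query(node, q, r, dist=math.dist):
    """Return all objects within distance r of q (hypothetical node layout)."""
    if node["leaf"]:
        return [obj for obj in node["objects"] if dist(q, obj) <= r]
    results = []
    for center, radius, child in node["entries"]:
        # Triangle inequality: any object o in this region satisfies
        # d(q, o) >= d(q, center) - radius, so the branch cannot contain
        # a hit when d(q, center) > r + radius.
        if dist(q, center) > r + radius:
            continue
        results.extend(range_query(child, q, r, dist))
    return results

# Two leaves far apart; only the first can intersect the query ball.
leaf_a = {"leaf": True, "objects": [(0.0, 0.0), (1.0, 0.0)]}
leaf_b = {"leaf": True, "objects": [(10.0, 10.0), (11.0, 10.0)]}
root = {"leaf": False,
        "entries": [((0.5, 0.0), 0.5, leaf_a), ((10.5, 10.0), 0.5, leaf_b)]}

hits = range_query(root, (0.0, 0.5), 1.0)
```

Here the second branch is discarded after a single distance computation, which is the saving the slide refers to.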

Parallel M-tree construction
• Reading disk pages in parallel (I/O)
  ◦ Prediction: just one branch can be selected
  ◦ Using cache vs. data declustering
  ◦ SSD disks: a solution to the problem?
• Parallel distance computation (CPU)
  ◦ Processing objects in a node (limited by capacity)
  ◦ Node splitting
  ◦ Concurrent processing of multiple new objects

Simple parallel construction
1) Inserting starts in the root node
2) A routing item is selected using a heuristic (a limited number of distances is evaluated in parallel)
3) The radius of the routing item can be updated
4) The object is delegated to the child node (nodes are processed sequentially)
5) If the current node is a leaf, insert the new object; otherwise go to step 2
6) If the leaf node is overfull, split the node
   a) Compute the distance matrix
   b) Promote new routing items
   c) Redistribute objects and set links
[Figure: a new object descending a tree of height h whose nodes hold up to m entries]
• The number of distance evaluations during one insertion is bounded by h x m
• Using m (or more) cores, we still have to wait until h distances are evaluated one after another, one per level
• More than m cores can be exploited only for splitting (up to m x (m - 1) / 2 distances)
• Acceptable for one object, but we usually need to insert many objects: n x h !!!
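Step 2 and the bounds above can be sketched as follows. This is an illustration under assumptions: the selection heuristic is simplified to "nearest routing object", and the names `choose_routing_item` and `insertion_bounds` are hypothetical. A thread pool stands in for the m cores; in CPython, threads only pay off when the distance function releases the interpreter lock, so the sketch shows the structure rather than a measured speed-up.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def choose_routing_item(entries, obj, pool, dist=math.dist):
    """Evaluate the distances to all routing objects of one node in parallel
    and return (index, distance) of the nearest one (simplified heuristic)."""
    distances = list(pool.map(lambda entry: dist(obj, entry[0]), entries))
    best = min(range(len(entries)), key=distances.__getitem__)
    return best, distances[best]

def insertion_bounds(h, m):
    """Upper bounds from the slide for tree height h and node capacity m."""
    return {
        "distances_per_insert": h * m,           # m distances on each of h levels
        "sequential_rounds": h,                  # levels are processed one by one
        "split_matrix_distances": m * (m - 1) // 2,
    }

# entries are (routing object, covering radius) pairs
entries = [((0.0, 0.0), 1.0), ((5.0, 5.0), 1.0)]
with ThreadPoolExecutor(max_workers=2) as pool:
    best, d = choose_routing_item(entries, (4.0, 4.0), pool)

bounds = insertion_bounds(h=3, m=25)
```

With h = 3 and m = 25, one insertion needs at most 75 distances in 3 sequential rounds, while a split offers up to 300 independent distance computations.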

Concurrent inserting
• One insertion is an atomic operation: less parallel overhead
• Parallelism is not limited by the node capacity
• The complexity of insertions is almost the same (small differences depend on node utilization)
• An ideal task for parallelism
• Simple definition of the problem
• Simple work distribution between tasks
• Inserted objects have shared access to inner nodes: no blocking
• However, traditional inserting has to be extended with synchronization

Synchronisation problems
• Objects can't simply be inserted in parallel
• Routing items have to be updated (radius)
  ◦ One routing item can be changed by two threads
  ◦ Easy to solve using locks
• Updated leaf nodes must be locked
  ◦ Similar to routing items
• Splitting
  ◦ A split may change the tree hierarchy significantly
  ◦ It is complicated to synchronize multiple concurrent splits
  ◦ Locking during splitting may decrease the speed-up of concurrent inserting
  ◦ Is it necessary to perform concurrent splits???
• Splitting can be postponed!
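The "easy to solve using locks" case for routing items might look like the sketch below. The class `RoutingItem` and its `cover` method are hypothetical names, and one lock per routing item is an assumption; the slides do not specify the locking granularity. The distance is computed outside the lock so that only the short compare-and-update of the radius is serialized.

```python
import math
import threading

class RoutingItem:
    """Routing object with a covering radius that several threads may enlarge."""

    def __init__(self, center, radius=0.0):
        self.center = center
        self.radius = radius
        self._lock = threading.Lock()

    def cover(self, obj, dist=math.dist):
        d = dist(self.center, obj)   # expensive part, done without the lock
        with self._lock:             # short critical section
            if d > self.radius:
                self.radius = d
        return d

# 100 threads try to enlarge the same routing item concurrently.
item = RoutingItem((0.0, 0.0))
points = [(float(i), 0.0) for i in range(1, 101)]
threads = [threading.Thread(target=item.cover, args=(p,)) for p in points]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Whatever the interleaving, the final radius equals the largest distance seen, because the check and the update happen under the same lock.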

Postponed reinserting
• To avoid a split, the most distant object is removed from the overfull node and the node's radius is decreased
• The M-tree hierarchy is improved
• Used to avoid synchronization problems
• The removed object is inserted later
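A minimal sketch of this idea, assuming a leaf represented as a dictionary with a routing object `center`, a covering `radius`, and a list of `objects`; the function name and the layout are hypothetical. Instead of splitting an overfull leaf, the object farthest from the routing object is moved to a `postponed` list and the radius shrinks to cover only what remains.

```python
import math

def insert_with_postponement(leaf, obj, capacity, postponed, dist=math.dist):
    """Insert obj into leaf; on overflow, evict the farthest object
    instead of splitting, and tighten the covering radius."""
    leaf["objects"].append(obj)
    if len(leaf["objects"]) > capacity:
        farthest = max(leaf["objects"], key=lambda o: dist(leaf["center"], o))
        leaf["objects"].remove(farthest)
        postponed.append(farthest)   # to be inserted later
    # Smallest radius that still covers every object kept in the leaf.
    leaf["radius"] = max(
        (dist(leaf["center"], o) for o in leaf["objects"]), default=0.0
    )

leaf = {"center": (0.0, 0.0), "radius": 5.0,
        "objects": [(1.0, 0.0), (2.0, 0.0), (5.0, 0.0)]}
postponed = []
insert_with_postponement(leaf, (3.0, 0.0), capacity=3, postponed=postponed)
```

In the example the leaf is full, so inserting (3, 0) evicts (5, 0) and the radius drops from 5 to 3, which is the tightening of the hierarchy the slide mentions.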

Parallel dynamic batch loading
1. Aggregation
2. Parallel batch loading
   Not all objects are inserted during the second step. Moreover, some objects are removed from the tree and stored. Some of them are inserted in the traditional way to perform several splits.
3. Traditional inserting
Not inserted objects:
• Postponed: will be inserted during the next batch
• "Split generating": will be inserted in the traditional way (exploiting limited parallelism)
To find scalability bottlenecks we measured:
• Parallel batch loading time: PI
• Time of traditional inserts causing a split: ICS
• Time of traditional inserts not causing a split: INCS

Parallel dynamic batch loading
• Which objects should be inserted in the traditional way?
  a) Randomly select several objects
  b) Postpone the "furthest" objects
Not inserted objects (objects assigned to the same leaf node, i.e., the same routing item, during concurrent inserting):
• Postponed: will be inserted during the next batch
• "Split generating": will be inserted in the traditional way (exploiting limited parallelism)
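The two selection options can be sketched as a single function over the objects that did not fit into one leaf. Everything here is illustrative: the function name `split_overflow`, the parameter `n_split` (how many objects are handed to traditional inserting), and the reading of option b as "keep the nearest objects as split generating, postpone the rest" are assumptions, since the slides do not give these details.

```python
import math
import random

def split_overflow(center, overflow, n_split, strategy="furthest",
                   rng=None, dist=math.dist):
    """Divide the not-inserted objects of one leaf into
    (split_generating, postponed).

    strategy="random":   pick n_split objects at random (option a)
    strategy="furthest": postpone the objects farthest from the routing
                         object, keep the n_split nearest (option b)
    """
    if strategy == "random":
        rng = rng or random.Random(0)   # fixed seed only for reproducibility
        candidates = list(overflow)
        rng.shuffle(candidates)
    elif strategy == "furthest":
        candidates = sorted(overflow, key=lambda o: dist(center, o))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return candidates[:n_split], candidates[n_split:]

overflow = [(3.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
split_generating, postponed = split_overflow((0.0, 0.0), overflow, n_split=1)
```

The split-generating objects would then go through traditional inserting, while the postponed ones wait for the next batch.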

Experimental results
Two datasets:
• CoPhIR (MPEG7 image features)
  ◦ feature vectors
  ◦ 76 dimensions (12 color layout + 64 color structure)
  ◦ L distance
• Polygons
  ◦ 250,000 2D polygons
  ◦ 5-15 vertices
  ◦ Hausdorff distance
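For readers unfamiliar with the Hausdorff distance used on the Polygons dataset, here is a minimal sketch. It assumes the distance is taken over the polygon vertex sets with a Euclidean ground distance; the slides do not state which variant was used in the experiments.

```python
import math

def directed_hausdorff(a, b):
    """Largest distance from a point of a to its nearest point in b."""
    return max(min(math.dist(p, q) for q in b) for p in a)

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two non-empty point sets."""
    return max(directed_hausdorff(a, b), directed_hausdorff(b, a))

square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
shifted = [(x + 3.0, y) for x, y in square]   # same square moved 3 units right

d = hausdorff(square, shifted)
```

Each evaluation costs a number of point distances proportional to the product of the two vertex counts, which is one reason distance computations dominate index construction on such data.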

Experimental results (win)
[Figure: construction time on Polygons and CoPhIR, CLASSIC vs. Batch]

Experimental results (win)
[Figure: DC by range queries on Polygons and CoPhIR, CLASSIC vs. Batch]

Experimental results (linux)
Setup: CoPhIR, dimension 76, L distance, inner/leaf node size 24 / 25, 512MB cache size
Table columns: Method, Cores, Time (s), Utilization (%)
  M-tree, 16 cores: (5.2 x)
  Batch, 16 cores: (9.7 x)
Table columns: Method, PB time (s), ICS time (s), INCS time (s)
  Batch, 16 cores: (14 x !!!)

Thank you for your attention!
References:
• P. Ciaccia, M. Patella, and P. Zezula. M-tree: An Efficient Access Method for Similarity Search in Metric Spaces. In VLDB'97.
• J. Lokoč and T. Skopal. On Reinsertions in M-tree. In SISAP '08: Proceedings of the First International Workshop on Similarity Search and Applications (SISAP 2008), pages 121-128, Washington, DC, USA. IEEE Computer Society.
• P. Zezula, P. Savino, F. Rabitti, G. Amato, and P. Ciaccia. Processing M-tree with Parallel Resources. In Proceedings of the 6th EDBT International Conference, 1998.