1 Query Sampling Based High Dimensional Hybrid Index Junqi Zhang, Xiangdong Zhou Fudan University.

Presentation transcript:

1 Query Sampling Based High Dimensional Hybrid Index Junqi Zhang, Xiangdong Zhou Fudan University

2 Nearest Neighbors Query (figure labels: Dims, Overlap, Accessed)

3 Cluster Partitioning Based B+-tree

4 Index Structure. What is the optimal extent of partitioning? iDistance: determined by experiments. Ours: predicted by a cost model.
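The slide names iDistance as the baseline. As a minimal sketch of how an iDistance-style scheme maps a high-dimensional point to a one-dimensional B+-tree key, the code below uses the standard mapping key = i * c + dist(p, O_i), where O_i is the nearest reference point. The names `nearest_center`, `idistance_key`, `ring_of` and the parameter `ring_width` are hypothetical, and the fixed-width ring assignment is an assumption; the deck does not say how rings are delimited.

```python
import math

def nearest_center(point, centers):
    """Return (cluster_id, distance) for the closest reference point."""
    dists = [math.dist(point, c) for c in centers]
    cid = min(range(len(centers)), key=dists.__getitem__)
    return cid, dists[cid]

def idistance_key(point, centers, c):
    """One-dimensional key: cluster offset plus distance to the cluster center."""
    cid, d = nearest_center(point, centers)
    if d >= c:
        raise ValueError("stretch constant c must exceed every cluster radius")
    return cid * c + d

def ring_of(point, centers, ring_width):
    """Assumed ring assignment: concentric rings of equal width around the center."""
    cid, d = nearest_center(point, centers)
    return cid, int(d // ring_width)
```

For example, with centers (0, 0) and (10, 0) and c = 100, the point (9, 0) belongs to cluster 1 at distance 1, so its key is 101.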

5 Objective of Cluster Partitioning: Lowest Query Cost. Appropriate M: Distribute M to each cluster. Overall number of clusters:

6 Curse of Dimensionality. dim>10: tree<scan<VA-file. dim scan>VA-file. Non-uniform: tree VA-file. VA-file defect. How to improve tree performance?

7 Tree and scan: which is better? Tree, advantage: it filters data instead of linearly scanning the whole file. Tree, disadvantage: the positioning cost for each data item is the height of the intermediate nodes, which is higher than for a scan. Scan, advantage: the positioning cost for each data item is 0. Scan, disadvantage: it linearly scans the whole file.

8 Cost as viewed from each point. (C<1): the tree is useful compared with a scan. (C>=1): the tree is useless compared with a scan.
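The formula for C did not survive in the transcript. The sketch below is therefore only an illustration of the comparison, under an assumed cost model: a ring stored in the tree costs the tree height plus its pages, but only when a query actually touches it, while a ring in a sequential file is always read in full. The function names and the parameters `access_prob`, `tree_height` and `ring_pages` are hypothetical and not taken from the slides.

```python
def cost_ratio(access_prob, tree_height, ring_pages):
    """Assumed ratio C = expected tree cost / linear-scan cost for one ring."""
    if ring_pages <= 0:
        raise ValueError("ring_pages must be positive")
    # Tree: pay the descent (height) plus the ring's pages, weighted by access probability.
    expected_tree_cost = access_prob * (tree_height + ring_pages)
    # Scan: no positioning cost, but every page of the ring is always read.
    scan_cost = ring_pages
    return expected_tree_cost / scan_cost

def tree_is_useful(access_prob, tree_height, ring_pages):
    """Apply the slide's rule: the tree helps only when C < 1."""
    return cost_ratio(access_prob, tree_height, ring_pages) < 1.0
```

Under this assumed model, a ring of 10 pages under a tree of height 3 gives C = 0.65 when half of the queries touch it, and C = 1.3 when every query touches it.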

9 Data distribution and index performance. Known work: index data in a single index. DIMS tree. Real image data set: non-uniform. Non-uniform data aggregate: tree FAST.

10 Data type. Sparse data: tree<scan. Dense data: tree>scan.

11 Hybrid data type: hybrid index. Sequential file: sparse data (tree<scan). B+-tree: dense data (tree>scan).

12 How to differentiate the data types? Each data item as a unit: difficult. Each cluster ring as a unit: easier.

13 Cluster partitioning. To what extent?

14 Cluster partitioning based B+-tree

15 Cluster partitioning based image retrieval system. Outer rings of clusters are often accessed.

16 Some rings of clusters are often accessed. Treat outer rings as sparse rings?

17 Frequency of being accessed for each ring

18

19 Hybrid index: cut branches (according to the contribution of each ring to the query cost). Expected cost. Cost by linear scan.

20 Criterion for cutting rings: Index Capability. IC (index capability): Question: how to determine ?

21 Estimate: query sampling. Question: for a large database, a large number of queries is expensive. Objective: given confidence a%, make minimum

22 Threshold for cutting rings. When IC equals 0: Rule: when the probability of a ring being accessed by queries is lower than this threshold, the ring should remain in the tree; otherwise, it should be cut into the sequence file for linear scan.
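The rule on this slide can be sketched as a simple partition of rings. The threshold expression itself is missing from the transcript, so the sketch below takes it as an input rather than deriving it. The function `split_rings` and the dictionary layout are hypothetical.

```python
def split_rings(access_prob, threshold):
    """Partition rings by the slide's rule.

    access_prob: dict mapping ring id -> estimated probability of being accessed.
    threshold:   cut threshold (assumed given; its formula is not in the transcript).
    Returns (tree_rings, scan_rings).
    """
    tree_rings, scan_rings = [], []
    for ring_id, p in access_prob.items():
        if p < threshold:
            tree_rings.append(ring_id)   # rarely accessed: keep in the B+-tree
        else:
            scan_rings.append(ring_id)   # often accessed: cut into the sequence file
    return tree_rings, scan_rings
```

With a threshold of 0.5, a ring accessed by 10% of queries stays in the tree, while a ring accessed by 90% of queries is moved to the sequence file.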

23 Query sampling algorithm: When or or , stop sampling. The user can balance the accuracy and efficiency of sampling by tuning the confidence a%, and the complexity of this algorithm is less than.
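The stopping conditions on this slide were lost in extraction. As a rough illustration of the idea only, the sketch below estimates one ring's access probability by sampling queries until an Agresti-Coull confidence interval is narrow enough. This interval is a swapped-in standard technique, not the authors' stopping rule, and `estimate_access_prob`, `ring_hit`, `draw_query`, `half_width`, `min_samples` and `max_samples` are all hypothetical names.

```python
from statistics import NormalDist

def estimate_access_prob(ring_hit, draw_query, confidence=0.95, half_width=0.05,
                         min_samples=30, max_samples=100_000):
    """Sample queries until the interval half-width drops below `half_width`.

    ring_hit(query) -> bool says whether the query touches the ring.
    draw_query()    -> one sample query.
    Returns (estimated probability, number of queries sampled).
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    hits = 0
    n = 0
    for n in range(1, max_samples + 1):
        if ring_hit(draw_query()):
            hits += 1
        # Agresti-Coull adjustment keeps the margin sensible when hits is 0 or n.
        n_adj = n + z * z
        p_adj = (hits + z * z / 2.0) / n_adj
        margin = z * (p_adj * (1.0 - p_adj) / n_adj) ** 0.5
        if n >= min_samples and margin <= half_width:
            break
    return (hits / n if n else 0.0), n
```

Raising `confidence` or lowering `half_width` generally requires more sampled queries, which mirrors the accuracy versus efficiency trade-off the slide describes.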

24 Query algorithm of hybrid index: linearly scan the sequence file for the sparse data; retrieve the dense data from the B+-tree.
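The two-part query can be sketched as follows. This is a minimal illustration, not the authors' implementation: a sorted Python list stands in for the B+-tree leaves, keys follow the iDistance mapping (cluster offset plus distance to the center), and a range query is used instead of a nearest-neighbor query to keep the example short. The class `HybridIndex` and all of its members are hypothetical.

```python
import bisect
import math

class HybridIndex:
    def __init__(self, centers, c):
        self.centers = centers  # cluster reference points
        self.c = c              # stretch constant, larger than any cluster radius
        self.tree = []          # sorted (key, point) pairs: stand-in for a B+-tree
        self.seq_file = []      # sparse points: scanned linearly

    def _key(self, point):
        dists = [math.dist(point, o) for o in self.centers]
        cid = min(range(len(dists)), key=dists.__getitem__)
        return cid * self.c + dists[cid]

    def insert(self, point, sparse):
        if sparse:
            self.seq_file.append(point)
        else:
            bisect.insort(self.tree, (self._key(point), point))

    def range_query(self, q, r):
        # Part 1: linear scan of the sequence file (sparse data).
        result = [p for p in self.seq_file if math.dist(p, q) <= r]
        # Part 2: one key-range search per cluster (dense data).
        for cid, o in enumerate(self.centers):
            d = math.dist(q, o)
            lo = cid * self.c + max(0.0, d - r)
            hi = cid * self.c + d + r
            next_cluster = (cid + 1) * self.c
            i = bisect.bisect_left(self.tree, (lo,))
            while i < len(self.tree):
                key, p = self.tree[i]
                if key > hi or key >= next_cluster:
                    break
                if math.dist(p, q) <= r:  # discard false positives from the key range
                    result.append(p)
                i += 1
        return result
```

The key range per cluster follows from the triangle inequality: a point within distance r of the query must lie between d - r and d + r from the cluster center, where d is the distance from the query to that center.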

25 Thanks!