Published by Arline Lawrence. Modified over 9 years ago.
1 Query Sampling Based High Dimensional Hybrid Index. Junqi Zhang, Xiangdong Zhou, Fudan University
2 Nearest Neighbors Query (figure labels: Dims, Overlap, Accessed)
3 Cluster Partitioning Based B+-tree
4 Index Structure. What is the optimal extent of partitioning? iDistance: determined by experiments. Ours: predicted by a cost model.
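As a minimal sketch of the kind of structure being partitioned, the snippet below forms a one-dimensional B+-tree key with the iDistance mapping key = i * c + dist(p, O_i), and assigns each point to a ring around its cluster center. The ring width `ring_width`, the constant `c`, and all function names are illustrative assumptions; the slides do not give them.

```python
import math

def nearest_center(point, centers):
    """Return (cluster_id, distance) for the center closest to point."""
    cid = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return cid, math.dist(point, centers[cid])

def idistance_key(point, centers, c):
    """iDistance-style key: cluster id scaled by c, plus distance to the center.

    c is assumed to be larger than any in-cluster distance, so the key
    ranges of different clusters never overlap.
    """
    cid, d = nearest_center(point, centers)
    return cid * c + d

def ring_of(point, centers, ring_width):
    """Return (cluster_id, ring_index); ring 0 is the innermost ring (assumed layout)."""
    cid, d = nearest_center(point, centers)
    return cid, int(d // ring_width)

centers = [(0.0, 0.0), (10.0, 0.0)]
key_a = idistance_key((3.0, 4.0), centers, c=100.0)   # cluster 0, distance 5
key_b = idistance_key((10.0, 3.0), centers, c=100.0)  # cluster 1, distance 3
ring_a = ring_of((3.0, 4.0), centers, ring_width=2.0)
```

A smaller `ring_width` gives a finer partition of each cluster; choosing that extent is the question the slide raises.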
5 Objective of Cluster Partitioning: Lowest Query Cost. Appropriate M: […] Distribute M to each cluster. Overall number of clusters: […]
6 Curse of Dimensionality. dim > 10: tree < scan < VA-file. dim […]: […] scan > VA-file. Non-uniform data: tree […] VA-file. VA-file defect. How can tree performance be improved?
7 Tree and scan: which is better? Tree advantage: filters data instead of linearly scanning the whole file. Tree disadvantage: the positioning cost for each data point is the height of the intermediate nodes, which is higher than for a scan. Scan advantage: the positioning cost for each data point is 0. Scan disadvantage: linearly scans the whole file.
8 Cost viewed from each point. C < 1: the tree is useful compared with a scan. C >= 1: the tree is useless compared with a scan.
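The per-point comparison can be illustrated with a toy model. This is not the authors' cost model, whose formula does not survive in the transcript. It assumes a scan reads every point at cost 1 with no positioning cost, while the tree pays `tree_height` node visits plus 1 read, but only for points that a query actually accesses. The names `access_prob`, `tree_height`, and both functions are hypothetical.

```python
def relative_cost(access_prob, tree_height):
    """Toy per-point cost of the tree, relative to a scan cost of 1.

    Assumption: an accessed point costs tree_height node visits to position
    plus 1 read; a point that is filtered out costs nothing.
    """
    return access_prob * (tree_height + 1)

def tree_is_useful(access_prob, tree_height):
    """True when the toy relative cost C is below 1, i.e. the tree beats the scan."""
    return relative_cost(access_prob, tree_height) < 1

rarely_accessed = tree_is_useful(access_prob=0.1, tree_height=3)
often_accessed = tree_is_useful(access_prob=0.5, tree_height=3)
```

Under these assumptions a point that queries rarely touch is cheaper to keep in the tree, and a point that most queries touch is cheaper to scan.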
9 Data distribution and index performance. Known work: index all data in a single index. DIMS tree. Real image data set: non-uniform. Non-uniform data aggregate: the tree is fast.
10 Data type. Sparse data: tree < scan. Dense data: tree > scan.
11 Hybrid data type, hybrid index. Sparse data (tree < scan): sequential file. Dense data (tree > scan): B+-tree.
12 How to differentiate the data type? Each data point as a unit: difficult. Each cluster ring as a unit: easier.
13 Cluster partitioning: to what extent?
14 Cluster partitioning based B+-tree
15 Cluster partitioning based image retrieval system. Outer rings of clusters are often accessed.
16 Some rings of clusters are often accessed. Treat outer rings as sparse rings?
17 Frequency of access for each ring
19 Hybrid index: cut branches (according to the contribution of each ring to the query cost). Expected cost. Cost by linear scan.
20 Criterion for cutting rings: Index Capability. IC (index capability): […] Question: how to determine […]?
21 Estimation by query sampling. Question: for a large database, a large number of queries is expensive. Objective: given confidence a%, minimize […].
22 Threshold for cutting rings. When IC equals 0: […] Rule: when the probability of a ring being accessed by queries is lower than this threshold, the ring should remain in the tree; otherwise, it should be cut into the sequential file for linear scan.
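The rule can be sketched as a simple split over per-ring access probabilities. The threshold formula itself is missing from the transcript, so `threshold` is passed in as a plain number here; `split_rings` and the `(cluster, ring)` ids are hypothetical names used for illustration.

```python
def split_rings(access_prob, threshold):
    """Split rings between the tree and the sequential file.

    access_prob maps a ring id to its estimated probability of being
    accessed by a query. Rings below the threshold stay in the tree;
    all other rings are cut into the sequential file.
    """
    tree_rings, seq_rings = [], []
    for ring, prob in sorted(access_prob.items()):
        if prob < threshold:
            tree_rings.append(ring)
        else:
            seq_rings.append(ring)
    return tree_rings, seq_rings

# Ring ids are (cluster, ring) pairs; the probabilities are made-up inputs.
probs = {(0, 0): 0.05, (0, 1): 0.20, (0, 2): 0.70}
tree_rings, seq_rings = split_rings(probs, threshold=0.25)
```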
23 Query sampling algorithm: stop sampling when […] or […] or […]. The user can balance the accuracy and efficiency of sampling by tuning the confidence a%, and the complexity of this algorithm is less than […].
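The stopping conditions are lost in the transcript, so the following is only one plausible reading, not the authors' algorithm. It estimates a ring's access probability from sampled queries and stops once a Wilson score interval at the chosen confidence lies entirely below or entirely above the cut threshold, or once a sample budget is exhausted. The names `is_accessed`, `draw_query`, and `max_samples`, and the choice of interval, are assumptions.

```python
from math import sqrt
from statistics import NormalDist

def estimate_access_prob(is_accessed, draw_query, threshold,
                         confidence=0.95, max_samples=10_000):
    """Return (estimated probability, number of sampled queries).

    is_accessed(query) reports whether the ring is touched by the query;
    draw_query() returns one sampled query. Both are caller-supplied.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    hits = 0
    p = 0.0
    for n in range(1, max_samples + 1):
        hits += bool(is_accessed(draw_query()))
        p = hits / n
        # Wilson score interval for a binomial proportion.
        denom = 1 + z * z / n
        center = (p + z * z / (2 * n)) / denom
        half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
        if center + half < threshold or center - half > threshold:
            return p, n  # the side of the threshold is decided
    return p, max_samples  # budget exhausted without a decision
```

Raising `confidence` widens the interval, so more queries are sampled before a ring is classified; lowering it trades accuracy for speed, which matches the tuning knob the slide describes.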
24 Query algorithm of the hybrid index: linearly scan the sequential file for sparse data; retrieve the dense data from the B+-tree.
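A minimal sketch of the two-part query, written as a range query of radius `r`. A sorted list searched with `bisect` stands in for the B+-tree leaf level, keys follow the iDistance mapping, and `c` is assumed to exceed every in-cluster distance. All function and parameter names are illustrative.

```python
import bisect
import math

def nearest(point, centers):
    """Return (cluster_id, distance) for the center closest to point."""
    cid = min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))
    return cid, math.dist(point, centers[cid])

def build_tree_part(dense_points, centers, c):
    """Sorted (key, point) list standing in for the B+-tree leaves."""
    entries = []
    for p in dense_points:
        cid, d = nearest(p, centers)
        entries.append((cid * c + d, p))
    entries.sort()
    return entries

def range_query(q, r, tree_entries, seq_file, centers, c, eps=1e-9):
    """Return all points within distance r of q from both parts of the index."""
    # Sparse data: linear scan of the sequential file.
    hits = [p for p in seq_file if math.dist(q, p) <= r]
    # Dense data: one key-range lookup per cluster. By the triangle
    # inequality, a match in cluster cid has a key within r of dist(q, center).
    keys = [k for k, _ in tree_entries]
    for cid, center in enumerate(centers):
        dq = math.dist(q, center)
        lo = cid * c + max(0.0, dq - r) - eps
        hi = cid * c + dq + r + eps
        i = bisect.bisect_left(keys, lo)
        j = bisect.bisect_right(keys, hi)
        j = min(j, bisect.bisect_left(keys, (cid + 1) * c))  # stay inside cluster cid
        hits += [p for _, p in tree_entries[i:j] if math.dist(q, p) <= r]
    return hits

centers = [(0.0, 0.0), (10.0, 0.0)]
tree = build_tree_part([(1.0, 0.0), (0.0, 2.0), (10.0, 1.0), (9.0, 0.0)], centers, c=100.0)
seq = [(5.0, 5.0), (5.0, 0.5)]
found = sorted(range_query((4.0, 0.0), 3.5, tree, seq, centers, c=100.0))
```

A k-nearest-neighbor query could be layered on top by growing `r` until k results are found, which is how iDistance-style indexes usually proceed.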
25 Thanks!