Preferential top-k search over local data
dissertation thesis
RNDr. Martin Šumák
supervisor: doc. RNDr. Stanislav Krajči, PhD.
consultant: RNDr. Peter Gurský, PhD.

Outline
Top-k search
– motivation and example
– restrictions and assumptions
R-tree-based solution
– normalization of data
– R++-tree
Grid file-based solution
Experiments
– comparison with the B+-tree-based solution, table scan, etc.

Top-k search
Example
– find the top 20 apartments with 3 or 4 rooms, not on the first floor, with a price around a target amount and not exceeding a limit in euro
– moreover, price is the most important attribute and floor is the least important attribute

Top-k query
k = 20
preferences on attribute values
– fuzzy functions
importance of attributes
– weights: w_price = 3, w_rooms = 2, w_floor = 1

Top-k query
Overall value of object O is
3*f_price(O_price) + 2*f_rooms(O_rooms) + 1*f_floor(O_floor)
In general
c(f_price(O_price), f_rooms(O_rooms), f_floor(O_floor))
Function c has to be monotone!

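The weighted sum from this slide can be sketched in Python as follows. The weights 3, 2 and 1 come from the slides; the shapes of the fuzzy functions and the price thresholds (60000 and 80000) are illustrative assumptions only, since the slides do not give them.

```python
from typing import Callable, Dict

# Hypothetical fuzzy preference functions: each maps an attribute value to [0, 1].
def f_price(price: float) -> float:
    # Assumed shape: fully preferred up to 60000, falling linearly to 0 at 80000.
    if price <= 60000:
        return 1.0
    if price >= 80000:
        return 0.0
    return (80000 - price) / 20000

def f_rooms(rooms: float) -> float:
    # Assumed crisp preference: only 3 or 4 rooms are acceptable.
    return 1.0 if rooms in (3, 4) else 0.0

def f_floor(floor: float) -> float:
    # Assumed crisp preference: anything but the first floor.
    return 0.0 if floor == 1 else 1.0

# Weights as given on the slides.
WEIGHTS: Dict[str, float] = {"price": 3.0, "rooms": 2.0, "floor": 1.0}
PREFS: Dict[str, Callable[[float], float]] = {
    "price": f_price,
    "rooms": f_rooms,
    "floor": f_floor,
}

def overall_value(obj: Dict[str, float]) -> float:
    # Weighted sum of preference degrees; with non-negative weights this
    # combining function is monotone in every preference degree.
    return sum(WEIGHTS[a] * PREFS[a](obj[a]) for a in WEIGHTS)

best = overall_value({"price": 50000, "rooms": 3, "floor": 2})
weak = overall_value({"price": 70000, "rooms": 2, "floor": 1})
```

Any other combining function c could replace the weighted sum here, as long as it stays monotone in each argument.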
The goal of top-k search
to find the top-k objects efficiently
– by processing the minimum amount of data
restrictions and assumptions
– all the data is accessible locally
– all attributes are numerical

R-tree-based solution
object
– a vector of n numbers
– a point in n-dimensional space
– R-tree, R*-tree, R+-tree, R++-tree

From kNN to top-k search
k nearest neighbours (kNN)
– known incremental algorithm
– distance from “query point Z” is the measure of “closeness”

From kNN to top-k search
top-k search
– overall value (h) is the measure of “goodness”
– by replacing distance with overall value and reversing the order, we change the result from kNN to top-k

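The switch from kNN to top-k can be sketched as a best-first traversal with a priority queue. This is a minimal illustration under stated assumptions, not the thesis implementation: the `Node` class is a hypothetical stand-in for an R-tree node, nodes are assumed non-empty, and the point coordinates are assumed to be preference degrees already, so that for a monotone function c the value at a rectangle's upper corner bounds the value of every point inside it.

```python
import heapq
from dataclasses import dataclass, field
from itertools import count, islice
from typing import Callable, Iterator, List, Sequence, Tuple

Point = Tuple[float, ...]

@dataclass
class Node:
    # Hypothetical index node: inner nodes hold children, leaves hold points.
    children: List["Node"] = field(default_factory=list)
    points: List[Point] = field(default_factory=list)

    def upper_corner(self) -> Point:
        # Upper corner of the bounding rectangle of everything below this node.
        # A real index would store the rectangle instead of recomputing it.
        corners = [ch.upper_corner() for ch in self.children] + list(self.points)
        return tuple(max(col) for col in zip(*corners))

def weighted_sum(weights: Sequence[float]) -> Callable[[Point], float]:
    # A monotone combining function (for non-negative weights).
    return lambda p: sum(w * x for w, x in zip(weights, p))

def top_k_incremental(root: Node, c: Callable[[Point], float]) -> Iterator[Tuple[float, Point]]:
    # Max-heap via negated values; the counter breaks ties without comparing items.
    tie = count()
    heap = [(-c(root.upper_corner()), next(tie), root)]
    while heap:
        neg_value, _, item = heapq.heappop(heap)
        if isinstance(item, Node):
            # Expand the node: children enter with their upper bound,
            # points enter with their exact overall value.
            for child in item.children:
                heapq.heappush(heap, (-c(child.upper_corner()), next(tie), child))
            for p in item.points:
                heapq.heappush(heap, (-c(p), next(tie), p))
        else:
            # A point at the top of the heap cannot be beaten by anything
            # still in the heap, so it is the next best object.
            yield -neg_value, item

def top_k(root: Node, c: Callable[[Point], float], k: int) -> List[Tuple[float, Point]]:
    return list(islice(top_k_incremental(root, c), k))

leaf1 = Node(points=[(0.9, 0.1), (0.2, 0.3)])
leaf2 = Node(points=[(0.5, 0.5), (1.0, 1.0)])
root = Node(children=[leaf1, leaf2])
result = top_k(root, weighted_sum((3.0, 2.0)), 2)
```

Because the generator is incremental, asking for one more object continues from the current heap state instead of restarting the search, which mirrors the incremental kNN algorithm.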
Analogy of kNN and top-k search
Correctness
Efficiency
[Figure: panels labeled top-k and kNN]

Disproportion of attribute values
floor, area, price
– very different ranges
– solution: normalization
– linear transformation of attribute values to the interval [0; 1]
Another disproportion comes from weights

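The linear transformation to [0; 1] can be sketched as a per-attribute min-max rescaling. The function name and the sample apartments below are illustrative assumptions; constant columns are mapped to 0.0 by convention in this sketch.

```python
from typing import List, Sequence, Tuple

def normalize_columns(rows: Sequence[Sequence[float]]) -> List[Tuple[float, ...]]:
    # Compute the minimum and maximum of every attribute (column).
    cols = list(zip(*rows))
    lows = [min(col) for col in cols]
    highs = [max(col) for col in cols]

    def scale(v: float, lo: float, hi: float) -> float:
        # Linear map of [lo, hi] onto [0, 1]; a constant column maps to 0.0.
        return 0.0 if hi == lo else (v - lo) / (hi - lo)

    return [
        tuple(scale(v, lo, hi) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]

# Hypothetical apartments: (floor, area, price).
apartments = [(1, 40.0, 50000.0), (5, 90.0, 100000.0), (3, 65.0, 75000.0)]
normalized = normalize_columns(apartments)
```

After this step every attribute spans the same range, so no single attribute dominates the shape of the bounding rectangles merely because of its units.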
Normalization applicability
Useful for
– R*-tree
Meaningless for
– R-tree (proven for the quadratic split method)
– R+-tree, R++-tree
– Grid file

Why the R++-tree
Zero overlaps combined with minimum bounding rectangles may cause a problem when adding a new object
R+-tree avoids overlaps at the price of larger (non-minimal) rectangles

The R++-tree idea
Zero overlaps combined with minimum bounding rectangles may cause a problem when adding a new object
R++-tree keeps two rectangles for each node
– the minimum one and the parent covering one

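One possible reading of the two-rectangle idea is sketched below. Everything here is an assumption made for illustration: the `Entry` class, the rule that sibling covering rectangles do not overlap and jointly cover the parent, and the insertion routine are not taken from the thesis.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A rectangle is a pair (lower corner, upper corner).
Rect = Tuple[Tuple[float, ...], Tuple[float, ...]]
Point = Tuple[float, ...]

def contains(rect: Rect, p: Point) -> bool:
    lo, hi = rect
    return all(l <= x <= h for l, x, h in zip(lo, p, hi))

def extend(rect: Rect, p: Point) -> Rect:
    # Smallest rectangle containing both the old rectangle and the point.
    lo, hi = rect
    return (
        tuple(min(l, x) for l, x in zip(lo, p)),
        tuple(max(h, x) for h, x in zip(hi, p)),
    )

@dataclass
class Entry:
    cover: Rect  # assumed: region assigned to this child, disjoint from its siblings
    mbr: Rect    # minimum bounding rectangle of the objects actually stored below

def insert_point(entries: List[Entry], p: Point) -> Entry:
    # Route by the covering rectangle, so a new point always has a target
    # and sibling regions never need to overlap. Points on a shared border
    # go to the first matching entry in this sketch.
    for entry in entries:
        if contains(entry.cover, p):
            # Only the minimum rectangle grows; it stays inside the cover.
            entry.mbr = extend(entry.mbr, p)
            return entry
    raise ValueError("point lies outside every covering rectangle")

entries = [
    Entry(cover=((0.0, 0.0), (5.0, 10.0)), mbr=((1.0, 1.0), (2.0, 2.0))),
    Entry(cover=((5.0, 0.0), (10.0, 10.0)), mbr=((6.0, 6.0), (7.0, 7.0))),
]
target = insert_point(entries, (4.0, 9.0))
```

Under this reading, the covering rectangle decides where a new object goes, while the minimum rectangle remains available for tight bounds during search.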
The R++-tree properties
Height-balanced
Zero overlaps
Overflow nodes at leaf level only
Minimum node occupancy is 1
For the purposes of top-k search, attribute values can be strings or any other comparable values (not just numbers)

Top-k search over Grid file
Grid file is a spatial index for point data
We used a static Grid file without an extra directory

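To illustrate how a static grid assigns point data to cells, here is a minimal sketch. The class name `StaticGrid`, the fixed split positions, and the in-memory dictionary of buckets are assumptions for illustration and say nothing about the storage layout used in the thesis.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import DefaultDict, List, Sequence, Tuple

Point = Tuple[float, ...]
Cell = Tuple[int, ...]

class StaticGrid:
    def __init__(self, boundaries: Sequence[Sequence[float]]) -> None:
        # boundaries[d] is the sorted list of split positions in dimension d;
        # they are fixed up front, which is what makes the grid static here.
        self.boundaries = [list(b) for b in boundaries]
        self.cells: DefaultDict[Cell, List[Point]] = defaultdict(list)

    def cell_of(self, p: Point) -> Cell:
        # Binary search in every dimension gives the cell coordinates directly,
        # so no separate directory lookup is needed.
        return tuple(bisect_right(b, x) for b, x in zip(self.boundaries, p))

    def insert(self, p: Point) -> None:
        self.cells[self.cell_of(p)].append(p)

grid = StaticGrid([[10.0, 20.0], [0.5]])
for point in [(5.0, 0.2), (10.0, 0.5), (25.0, 0.1)]:
    grid.insert(point)
```

Each cell is an axis-aligned rectangle, so the same upper-bound reasoning used for tree nodes applies to cells when searching for the top-k objects.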
Top-k search over Grid file
We have proven both correctness and efficiency for the Grid file as well