An Improved Indexing Scheme for Range Queries
Yvonne Yao
Adviser: Professor Huiping Guo
Database-as-a-Service
Business organizations handle large amounts of data (terabytes).
Managing and maintaining this data on site is expensive.
DAS: outsourcing the DBMS.
Clients rely on service providers for data management and maintenance.
Costs are much lower. But…
Database-as-a-Service
The security of the data is not guaranteed.
Service providers are untrusted.
Store only an encrypted form of the data on the remote server.
Only users with the correct key(s) can access it.
How, then, can we query the encrypted data?
Retrieve and decrypt the entire table, then apply SQL statements to it. Too expensive!
A more realistic approach was found.
Database-as-a-Service
Bucketization
Various approaches to building meta-data: B+-tree based, hash-based, and bucket-based.
What is bucketization?
  Partition the attribute data into several buckets.
  Each bucket is identified by an ID.
  Bucket IDs are stored, along with the encrypted data, on the remote server.
  The client keeps the partition information as meta-data.
General bucketization approaches:
  Equi-width
  Equi-depth
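The two general approaches can be sketched as follows. This is a minimal illustration, not the presenter's implementation: the function names `equi_width` and `equi_depth` are hypothetical, and the sketch assumes the usual meanings of the terms (equal-sized value ranges versus roughly equal numbers of data points per bucket).

```python
def equi_width(lo, hi, m):
    """Split the domain [lo, hi] into m ranges of equal width."""
    width = (hi - lo) / m
    return [(lo + i * width, lo + (i + 1) * width) for i in range(m)]


def equi_depth(sorted_values, m):
    """Split sorted data into at most m buckets holding roughly equal counts.

    Each bucket is reported as (smallest value, largest value).
    Note: equal values may end up in adjacent buckets; a real scheme
    would have to decide how to handle that (assumption: ignored here).
    """
    n = len(sorted_values)
    buckets = []
    for i in range(m):
        start, end = i * n // m, (i + 1) * n // m
        if start < end:  # skip empty slices when m > n
            chunk = sorted_values[start:end]
            buckets.append((chunk[0], chunk[-1]))
    return buckets


width_buckets = equi_width(0, 4, 4)
depth_buckets = equi_depth([1, 1, 2, 3, 5, 8, 13, 21], 4)
```

Equi-width only needs the domain bounds, while equi-depth needs the data itself, which is why the client builds the partition before encrypting and uploading.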
Example 1
Partition      ID
[0.0 ~ 1.0]    Bucket_1
[1.1 ~ 2.0]    Bucket_2
[2.1 ~ 3.0]    Bucket_3
[3.1 ~ 4.0]    Bucket_4
Example 1
User query: SELECT * FROM grades WHERE gpa < 3.0
Q_server: SELECT * FROM egrades WHERE gpaID = 'Bucket_1' OR gpaID = 'Bucket_2' OR gpaID = 'Bucket_3'
The returned superset contains 29 tuples, 7 of which are false positives.
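The translation from the user query to Q_server can be sketched on the client side as below. The partition table is the one from Example 1; the function names `buckets_for_less_than` and `server_query` are hypothetical, and only the `<` predicate is handled.

```python
# Client-side meta-data from Example 1: (low, high, bucket ID).
PARTITIONS = [
    (0.0, 1.0, "Bucket_1"),
    (1.1, 2.0, "Bucket_2"),
    (2.1, 3.0, "Bucket_3"),
    (3.1, 4.0, "Bucket_4"),
]


def buckets_for_less_than(partitions, bound):
    """Return IDs of every bucket that may contain a value < bound."""
    return [bucket_id for low, _high, bucket_id in partitions if low < bound]


def server_query(table, column, bucket_ids):
    """Build the rewritten query that runs on the encrypted table."""
    if not bucket_ids:
        return f"SELECT * FROM {table} WHERE 1 = 0"  # nothing can match
    condition = " OR ".join(f"{column} = '{b}'" for b in bucket_ids)
    return f"SELECT * FROM {table} WHERE {condition}"


ids = buckets_for_less_than(PARTITIONS, 3.0)
q_server = server_query("egrades", "gpaID", ids)
```

Bucket_3 is included because it overlaps the range, even though it also holds the value 3.0. The client must therefore decrypt the returned rows and discard those with gpa >= 3.0; these are the false positives.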
Query Optimal Bucketization
General idea: choose bucket boundaries that minimize the bucket costs.
Input:
  V = { v_1, v_2, v_3, …, v_n }, where v_1 < v_2 < v_3 < … < v_n
  F = frequency of each value
  M = number of buckets to fill
Output: a matrix indicating the boundaries of each bucket.
Query Optimal Bucketization
QOB finds optimal solutions to two smaller sub-problems:
  One contains the leftmost M-1 buckets, covering the (n-i) smallest points.
  The other contains the rightmost single bucket, covering the remaining i points.
V = { v_1, v_2, v_3, v_4, v_5, v_6, …, v_{n-3}, v_{n-2}, v_{n-1}, v_n }
The first n-i points go to the first M-1 buckets; the last i points go to the last bucket.
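The decomposition above is a dynamic program, sketched here. The slides do not give the bucket cost formula, so this sketch assumes a cost of (high - low + 1) × (total frequency in the bucket), a form commonly used in the bucketization literature. The names `qob` and `bucket_cost` are illustrative, and the function returns bucket ranges directly in place of the boundary matrix.

```python
from itertools import accumulate


def bucket_cost(values, prefix, lo, hi):
    """ASSUMED cost of one bucket covering values[lo..hi] (inclusive)."""
    total_freq = prefix[hi + 1] - prefix[lo]
    return (values[hi] - values[lo] + 1) * total_freq


def qob(values, freqs, m):
    """Return (buckets, total_cost) for sorted distinct values."""
    n = len(values)
    m = min(m, n)  # cannot fill more buckets than there are values
    prefix = [0] + list(accumulate(freqs))
    inf = float("inf")
    # best[k][j]: minimum cost of covering the first j values with k buckets
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    cut = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0
    for k in range(1, m + 1):
        for j in range(k, n + 1):
            # The last bucket covers values[start..j-1];
            # the first k-1 buckets cover the `start` smallest values.
            for start in range(k - 1, j):
                total = best[k - 1][start] + bucket_cost(values, prefix, start, j - 1)
                if total < best[k][j]:
                    best[k][j], cut[k][j] = total, start
    # Walk the cut table backwards to recover the bucket boundaries.
    buckets, j = [], n
    for k in range(m, 0, -1):
        start = cut[k][j]
        buckets.append((values[start], values[j - 1]))
        j = start
    return buckets[::-1], best[m][n]


buckets, cost = qob([1, 2, 3, 10, 11, 12], [1, 1, 1, 1, 1, 1], 2)
```

On this toy input the two clusters are separated, since any bucket spanning the gap between 3 and 10 is charged for the empty range it covers.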
Example 2
Partition      ID
[0.7 ~ 1.2]    Bucket_1
[1.5 ~ 2.5]    Bucket_2
[2.8 ~ 3.0]    Bucket_3
[3.5 ~ 4.0]    Bucket_4
Example 2
Q_server: SELECT * FROM egrades WHERE gpaID = 'Bucket_1' OR gpaID = 'Bucket_2' OR gpaID = 'Bucket_3'
This is the same query as with the general bucketization method.
QOB outperforms the conventional bucketization strategy in most cases, but not always.
Deviation Bucketization
Built upon QOB; takes the same parameters.
Has two levels of buckets:
  First level: the same buckets as those produced by QOB.
  Second level: a bucketization of the deviation values, where a deviation is the difference between a value and the average of its bucket.
Each first-level bucket has at most M second-level buckets.
QOB has at most M buckets, while DB has at most M^2 buckets.
Deviation Bucketization
DB
  Run QOB(D, M)
  Construct the first-level buckets from the boundary matrix
  For each first-level bucket B_i
    Initialize empty datasets V_i' and F_i'
    For each value v in B_i
      V_i' = V_i' ∪ { v - avg(B_i) }
      F_i' = F_i' ∪ { 1 }
    Create a new dataset d_i = (V_i', F_i')
    Run QOB(d_i, M)
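The DB procedure can be sketched in Python as below. Several points are assumptions and not taken from the slides: the bucket cost inside `qob` is (high - low + 1) × (total frequency), the bucket average is weighted by frequency, and buckets are returned as index ranges in place of a boundary matrix. All function names are illustrative.

```python
from itertools import accumulate


def qob(values, freqs, m):
    """Return (lo, hi) index ranges of at most m buckets (ASSUMED cost model)."""
    n = len(values)
    m = min(m, n)
    prefix = [0] + list(accumulate(freqs))

    def cost(lo, hi):
        return (values[hi] - values[lo] + 1) * (prefix[hi + 1] - prefix[lo])

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(m + 1)]
    cut = [[0] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0
    for k in range(1, m + 1):
        for j in range(k, n + 1):
            for start in range(k - 1, j):
                total = best[k - 1][start] + cost(start, j - 1)
                if total < best[k][j]:
                    best[k][j], cut[k][j] = total, start
    ranges, j = [], n
    for k in range(m, 0, -1):
        start = cut[k][j]
        ranges.append((start, j - 1))
        j = start
    return ranges[::-1]


def deviation_bucketization(values, freqs, m):
    """Return [(first-level range, average, second-level ranges), ...]."""
    result = []
    for lo, hi in qob(values, freqs, m):
        vals, fs = values[lo:hi + 1], freqs[lo:hi + 1]
        # ASSUMPTION: the bucket average is weighted by frequency.
        avg = sum(v * f for v, f in zip(vals, fs)) / sum(fs)
        deviations = [v - avg for v in vals]  # stays sorted ascending
        ones = [1] * len(vals)                # every deviation gets frequency 1
        second = [(vals[a], vals[b]) for a, b in qob(deviations, ones, m)]
        result.append(((vals[0], vals[-1]), avg, second))
    return result


levels = deviation_bucketization([1, 2, 3, 10, 11, 12, 20, 21, 22], [1] * 9, 3)
```

With M = 3 and three values per first-level bucket, each value ends up in its own second-level bucket, giving 9 = M^2 buckets in total, which is the upper bound stated for DB.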
Example 3
First-level buckets:
Partition      ID            Avg
[0.7 ~ 1.2]    Bucket_1      0.93
[1.5 ~ 2.5]    Bucket_2      1.84
[2.8 ~ 3.0]    Bucket_3      2.93
[3.5 ~ 4.0]    Bucket_4      3.67

Second-level buckets:
Partition      ID            Avg
…              …             …
[2.8 ~ 2.8]    Bucket_3_1    2.8
[2.9 ~ 2.9]    Bucket_3_2    2.9
[3.0 ~ 3.0]    Bucket_3_3    3.0
…              …             …
Example 3
Q_server: SELECT * FROM egrades WHERE gpaID = 'Bucket_1' OR gpaID = 'Bucket_2' OR gpaID = 'Bucket_3_1' OR gpaID = 'Bucket_3_2'
In this case, no false positives are returned.
In general, false positives are still returned, but far fewer of them.
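One way to arrive at the bucket list in this Q_server is sketched below, under stated assumptions. A first-level bucket that lies entirely inside the query range is referenced by its first-level ID, and a bucket that straddles the bound is replaced by its overlapping second-level buckets. The slides do not spell out this rule; it is inferred from the example. Only the second-level buckets of Bucket_3 are listed because the others are elided in the slides, and the function name is hypothetical.

```python
# First-level meta-data from Example 3: (ID, low, high).
FIRST_LEVEL = [
    ("Bucket_1", 0.7, 1.2),
    ("Bucket_2", 1.5, 2.5),
    ("Bucket_3", 2.8, 3.0),
    ("Bucket_4", 3.5, 4.0),
]

# Second-level meta-data; only Bucket_3's entries appear in the slides.
SECOND_LEVEL = {
    "Bucket_3": [
        ("Bucket_3_1", 2.8, 2.8),
        ("Bucket_3_2", 2.9, 2.9),
        ("Bucket_3_3", 3.0, 3.0),
    ],
}


def bucket_ids_less_than(first_level, second_level, bound):
    """Pick bucket IDs for the predicate `value < bound` (ASSUMED rule)."""
    ids = []
    for bucket_id, low, high in first_level:
        if high < bound:
            ids.append(bucket_id)  # whole bucket satisfies the predicate
        elif low < bound:
            # Boundary bucket: refine with second-level buckets if known.
            subs = second_level.get(bucket_id)
            if subs is None:
                ids.append(bucket_id)
            else:
                ids.extend(sid for sid, slow, _shigh in subs if slow < bound)
    return ids


ids = bucket_ids_less_than(FIRST_LEVEL, SECOND_LEVEL, 3.0)
```

Bucket_3_3 holds only the value 3.0 and is skipped, which is why this particular query returns no false positives.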
Experiments
Two datasets:
  Synthetic dataset: 10^5 integers from [0, 999]
  Real dataset: 10^3 data points from the Aspect column of the Forest CoverType database in UCI's KDD Archive
Two sets of queries:
  Q_syn
  Q_real
Experiment 1
Experiment 2
Thank You