1
Joining Massive High-Dimensional Datasets
Tamer Kahveci, Christian A. Lang, Ambuj K. Singh. Department of Computer Science, University of California at Santa Barbara. 11/19/2018 Kahveci, Lang, Singh
2
Motivation: Sample Queries
Join is a fundamental database primitive. Spatial join: find all hotels in California that are within three miles of a recreation area. Sequence join: find all pairs of companies from the New York Exchange and the Tokyo Exchange that have similar closing prices for one month.
3
Motivation
Datasets: recreation areas (n) and hotels (m). We assume limited buffer space. Joining two datasets is expensive: I/O cost and CPU cost are O(mn). [Figure labels: Buffer, Hotels, Recreation areas]
4
The Naive Solution: NLJ
Dataset 1 has m pages and Dataset 2 has n pages; buffer size B = 4 pages. NLJ needs min{m,n} + mn/(B-1) page reads and mn page comparisons. We do not need to compare all page pairs!
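The cost formulas on this slide can be checked with a few lines of Python. This is only a sketch: the function name `nlj_cost` and the sample sizes (m = 6, n = 9) are assumptions made for illustration; only the formulas and B = 4 come from the slide.

```python
def nlj_cost(m: int, n: int, buffer_pages: int) -> tuple[float, int]:
    """Return (page_reads, page_comparisons) for nested-loop join.

    Uses the slide's formulas: min{m,n} + mn/(B-1) page reads
    and mn page comparisons.
    """
    if buffer_pages < 2:
        raise ValueError("the formula needs at least 2 buffer pages")
    page_reads = min(m, n) + m * n / (buffer_pages - 1)
    page_comparisons = m * n
    return page_reads, page_comparisons


# Hypothetical dataset sizes; B = 4 as on the slide.
reads, comparisons = nlj_cost(m=6, n=9, buffer_pages=4)
print(reads, comparisons)  # 24.0 54
```

Both terms grow with the product mn, which is why the following slides try to avoid touching every page pair.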
5
Outline
Reducing search space: Prediction Matrix. Minimizing I/O cost by clustering: Square Cluster, Cost Cluster. Maximizing buffer reuse. Experimental results.
6
PM-NLJ
Predict the candidate page pairs using the plane sweep method on an index structure. [Figure: Dataset 1 vs. Dataset 2]
7
Prediction of Join
8
Prediction of Join
9
PM-NLJ
Predict the candidate page pairs using the plane sweep method on an index structure. The final estimate is called the Prediction Matrix (PM). Restrict NLJ to the marked entries of the PM. We call this method PM-NLJ.
10
PM-NLJ
The number of marked entries = e. Performance improvement rate = mn/e. Is there a better read schedule?
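A minimal sketch of how a prediction matrix could be built and how the improvement rate mn/e follows from it. It assumes each page is summarized by a 1-D bounding interval and uses a brute-force check over page summaries instead of the plane sweep named on the slides; `prediction_matrix`, the intervals, and `eps` are all hypothetical.

```python
def prediction_matrix(pages1, pages2, eps):
    """Mark page pairs whose bounding intervals lie within eps of each other.

    pages1, pages2: lists of (low, high) intervals, one per page.
    Brute force over page summaries (an assumption for brevity);
    the slides use a plane sweep on an index structure instead.
    """
    marked = set()
    for i, (lo1, hi1) in enumerate(pages1):
        for j, (lo2, hi2) in enumerate(pages2):
            # Distance between the two intervals; 0 if they overlap.
            gap = max(lo1 - hi2, lo2 - hi1, 0.0)
            if gap <= eps:
                marked.add((i, j))
    return marked


# Hypothetical page summaries for two small datasets.
pages1 = [(0.0, 1.0), (2.0, 3.0), (5.0, 6.0)]
pages2 = [(0.5, 1.5), (3.2, 4.0), (9.0, 10.0), (5.5, 5.8)]

marked = prediction_matrix(pages1, pages2, eps=0.5)
e = len(marked)
improvement = len(pages1) * len(pages2) / e  # mn / e
print(sorted(marked), e, improvement)
```

Only the marked pairs need to be read and compared, so the fewer entries the matrix marks, the larger the gain over plain NLJ.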
11
Outline
Reducing search space: Prediction Matrix. Minimizing I/O cost by clustering: Square Cluster, Cost Cluster. Maximizing buffer reuse. Experimental results.
12
Minimizing Number of I/O: Square Clustering
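Let B = 6. For a cluster of the PM that spans m' pages of Dataset 1 and n' pages of Dataset 2 and contains e' marked entries, PM-NLJ reads min{m',n'} + e' = 9 pages, whereas m' + n' = 6 page reads suffice. Savings = e' - max{m',n'}. Goal: maximize e' and minimize max{m',n'} subject to m' + n' = B, which gives m' = n' = B/2.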
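The arithmetic on this slide can be reproduced with a short sketch. The slide gives only the totals (9 reads for PM-NLJ, 6 for the cluster, B = 6); the concrete values m' = n' = 3 and e' = 6 used below are an assumption that is consistent with those totals, and the function names are hypothetical.

```python
def pm_nlj_reads(m_c: int, n_c: int, e_c: int) -> int:
    """Page reads of PM-NLJ on a cluster: min{m', n'} + e'."""
    return min(m_c, n_c) + e_c


def cluster_reads(m_c: int, n_c: int, buffer_pages: int) -> int:
    """Page reads when the whole cluster fits in the buffer: m' + n'."""
    if m_c + n_c > buffer_pages:
        raise ValueError("cluster does not fit in the buffer")
    return m_c + n_c


def savings(m_c: int, n_c: int, e_c: int) -> int:
    """Difference between the two read counts: e' - max{m', n'}."""
    return e_c - max(m_c, n_c)


# Assumed cluster shape: 3 x 3 pages with 6 marked entries, B = 6.
print(pm_nlj_reads(3, 3, 6))   # 9
print(cluster_reads(3, 3, 6))  # 6
print(savings(3, 3, 6))        # 3
```

For a fixed m' + n' = B, max{m',n'} is smallest when the two sides are equal, which is where the square shape m' = n' = B/2 comes from.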
13
Minimizing Number of I/O: Square Clustering
O(e) space and time complexity. Can we reduce the total I/O cost by reducing the number of random seeks?
14
Minimizing Random Seek Cost: Cost Clustering
The location of the pages matters as well as their number!
15
Minimizing Random Seek Cost: Cost Clustering
O(e) space complexity, O(e^(3/2)) time complexity.
16
Outline
Reducing search space: Prediction Matrix. Minimizing I/O cost by clustering: Square Cluster, Cost Cluster. Maximizing buffer reuse. Experimental results.
17
Maximizing Cache Reuse
B = 5 pages. Cluster page counts (C1 C2 C3 C4 C5, sum): 5 4 2 21. Scenario 1: cluster order = (C4, C1, C3, C5, C2), 19 page reads.
18
Maximizing Cache Reuse
Cluster page counts (C1 C2 C3 C4 C5, sum): 5 4 2 21. Scenario 1 = 19 page reads. Scenario 2: cluster order = (C4, C2, C1, C3, C5), 15 page reads. What is the best schedule?
19
Sharing Graph (SG)
[Figure: clusters C1, C2, C3, C4 over Dataset 1 and Dataset 2, and their sharing graph with edge weights 2, 1, 3.]
20
Finding Best Schedule
Each schedule is a path on the SG. Cache reuse = sum of the weights of the edges of the corresponding path on the SG. Finding the best schedule is equivalent to TSP, which is NP-complete, so we use a greedy heuristic to find a good path. [Figure: sharing graph over C1 to C5 with edge weights 2, 1, 1, 1, 3.]
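One way to picture the scheduling step is a nearest-neighbour style greedy walk over the sharing graph. The slides do not say which greedy heuristic is used, so this is an assumed variant, and the edge weights below are hypothetical rather than read off the figure. The helper `page_reads` encodes the relation visible in the two scenarios (total cluster pages minus reuse).

```python
def reuse_of(order, weights):
    """Cache reuse of a schedule: sum of edge weights along its path."""
    return sum(weights.get(frozenset(pair), 0)
               for pair in zip(order, order[1:]))


def greedy_schedule(clusters, weights):
    """Try each start cluster; always move to the unvisited cluster
    that shares the most pages with the current one. Heuristic only:
    the result is not guaranteed to be the best schedule."""
    best_order = list(clusters)
    best_reuse = reuse_of(best_order, weights)
    for start in clusters:
        order = [start]
        remaining = [c for c in clusters if c != start]
        while remaining:
            last = order[-1]
            nxt = max(remaining,
                      key=lambda c: weights.get(frozenset((last, c)), 0))
            order.append(nxt)
            remaining.remove(nxt)
        current = reuse_of(order, weights)
        if current > best_reuse:
            best_order, best_reuse = order, current
    return best_order, best_reuse


def page_reads(total_cluster_pages, reuse):
    """Pages read for a schedule: pages shared by consecutive clusters
    stay in the buffer and are not read again."""
    return total_cluster_pages - reuse


# Hypothetical sharing graph (weights = number of shared pages).
clusters = ["C1", "C2", "C3", "C4", "C5"]
weights = {
    frozenset(("C1", "C2")): 2, frozenset(("C1", "C3")): 1,
    frozenset(("C3", "C5")): 3, frozenset(("C2", "C4")): 1,
    frozenset(("C1", "C4")): 1, frozenset(("C4", "C5")): 1,
}
order, reuse = greedy_schedule(clusters, weights)
print(order, reuse)
```

With the totals from the slides, `page_reads(21, 2)` gives 19 for Scenario 1 and `page_reads(21, 6)` gives 15 for Scenario 2.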
21
Outline
Reducing search space: Prediction Matrix. Minimizing I/O cost by clustering: Square Cluster, Cost Cluster. Maximizing buffer reuse. Experimental results.
22
Experimental Setup: Datasets
Low-dimensional data: 2-D road intersections of Long Beach (LBeach) and Montgomery County (MGcounty), 53K and 39K vectors. High-dimensional data: 60-D feature vectors for a satellite image database (landsat), 275K vectors. Sequence data: human chromosome 18 (HChr18) and mouse chromosome 18 (MChr18), 4.2M and 2.3M nucleotides.
23
Experimental Setup: Compared Techniques
NLJ; Epsilon Grid Order (EGO) [BBKK’01]; BFRJ [HJR’97]; PM-NLJ; Random-SC; SC; CC.
24
Experimental Setup
Three optimizations tested: OPT 1: reducing the search space by using the PM. OPT 2: clustering. OPT 3: cluster scheduling.
25
Itemized Cost Analysis
[Figure: itemized cost of opt1, opt2 and opt3 for the join on MGCounty & LBeach.]
26
Total Cost Analysis of Various Optimizations
[Figure: self-join on HChr18; x-axis: Buffer Size (num pages).]
27
Comparison of SC & CC
28
Total Cost Analysis
[Figure: join on landsat data; x-axis: Buffer Size (num pages).]
29
Scalability Analysis
[Figure: join on landsat data; x-axis: Database Size (num vectors per database).]
30
Discussion
We proposed three optimizations for the join operator: prediction matrix, clustering, and buffer recycling. SC is 2 to 86 times faster than competing techniques for spatial databases, and 13 to 133 times faster than competing techniques for sequence databases. SC is very close to the optimal technique (CC).
31
Future Directions
The solution can be generalized to multi-way joins. Similar optimizations can be applied to NN queries. The approach can also be applied to biological data.
32
Related Work
Join without index: Arge et al. 1998; Blasgen et al. 1977; Bohm et al. 2001; Chan et al. 1997; Graefe 1994; Koudas et al. 1997; Koudas et al. 2000; Orenstein 1986; Patel et al. 1996; Shim et al. 2002; Xiao et al. 2001. Join with index: Bercken et al. 2000; Bohm et al. 2001; Brinkhoff et al. 1993; Gurret et al. 2000; Hjaltason et al. 1998; Huang et al. 1997; Lo et al. 1994; Lo et al. 1996. THANK YOU
33
Using Sharing Graph to Determine Cache Reuse
[Figure: sharing graphs for Scenario 1 and Scenario 2 over C1 to C5, edge weights 2, 1, 1, 1, 3.] Scenario 1: Reuse = 1+1 = 2. Scenario 2: Reuse = 6.
34
Spatial Join Example
[Figure: recreation areas and hotels.]
35
Spatial Join Example
[Figure: hotels and recreation areas.]
36
The Naive Solution: NLJ
Dataset 1 has m pages and Dataset 2 has n pages; buffer size B = 4 pages. NLJ needs min{m,n} + mn/(B-1) page reads and mn page comparisons. We do not need to compare all page pairs!
37
Reading Pages in a Better Order
1 seek + 4 page transfers versus 3 seeks + 3 page transfers.
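The seek and transfer counts above can be sketched as follows, assuming that a run of consecutive page numbers costs one seek and that every page costs one transfer. The page numbers and the `seek_ms` and `transfer_ms` values are hypothetical, chosen only to show why more transfers with fewer seeks can be cheaper.

```python
def io_cost(page_ids, seek_ms=10.0, transfer_ms=1.0):
    """Return (seeks, transfers, total_ms) for reading a set of pages.

    Assumes pages are read in ascending order and that each maximal
    run of consecutive page numbers needs exactly one seek.
    seek_ms and transfer_ms are illustrative, not measured values.
    """
    pages = sorted(set(page_ids))
    seeks = sum(1 for k, p in enumerate(pages)
                if k == 0 or p != pages[k - 1] + 1)
    transfers = len(pages)
    return seeks, transfers, seeks * seek_ms + transfers * transfer_ms


# One contiguous run of four pages: 1 seek + 4 page transfers.
print(io_cost([10, 11, 12, 13]))  # (1, 4, 14.0)

# Three scattered pages: 3 seeks + 3 page transfers.
print(io_cost([10, 13, 17]))      # (3, 3, 33.0)
```

Under these assumed timings the contiguous read is cheaper even though it transfers one more page.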