Slide 1: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries
Yufei Tao (U. Hong Kong), Christos Faloutsos (CMU), Dimitris Papadias (Hong Kong UST)
Slide 2: Roadmap
- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions
Slide 3: Target query types
DB = a set of m-dimensional points.
- Range search (RS)
- k nearest neighbor (KNN)
- Regional distance (self-)join (RDJ): in Louisiana, find all pairs of music stores closer than 1 mi to each other
Slide 4: Target problem
Estimate
- query selectivity
- query (I/O) cost
- for any Lp metric
- using a single method
Slide 5: Target Problem
- for any Lp metric
- using a single method

        RS        KNN       RDJ
Sel.              XXXX
I/O
Slide 6: Roadmap
- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions
Slide 7: Older query estimation approaches
- Vast literature: sampling, kernel estimation, singular value decomposition, compressed histograms, sketches, maximal independence, the Euler formula, etc.
- BUT: they target specific cases (mostly range-search selectivity under the L∞ norm), and their extensions to other problems are unclear
Slide 8: Main competitors
- Local methods
  - Representative methods: histograms
- Global methods
  - Provide a single estimate corresponding to the average selectivity/cost of all queries, independently of their locations
  - Representative methods: fractals and power laws
Slide 9: Rationale and problems of histograms
- Partition the data space into a set of buckets and assume (local) uniformity
- Problems:
  - the uniformity assumption
  - tricky/slow estimations for all but the L∞ norm
Slide 10: Roadmap
- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions
Slide 11: Inherent defect of histograms
- Density trap – what is the density in the vicinity of q?
  - diameter = 10: 10/100 = 0.1
  - diameter = 100: 100/10,000 = 0.01
- Q: What is going on?
[figure: points lying on a line around the query point q]
Slide 12: Inherent defect of histograms
- Density trap – what is the density in the vicinity of q?
  - diameter = 10: 10/100 = 0.1
  - diameter = 100: 100/10,000 = 0.01
- Q: What is going on?
- A: we are asking a silly question, roughly "what is the area of a line?"
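A minimal numeric illustration of the trap (not from the slides; the spacing of one point per unit of length is assumed so that the numbers above are reproduced): the count/area "density" around q keeps shrinking as the measurement window grows.

```python
# hypothetical sketch: points spaced 1 unit apart on a line, measured in a square window
def density(diameter: float, spacing: float = 1.0) -> float:
    points_inside = diameter / spacing   # points of the line falling in the window
    area = diameter ** 2                 # 2-D area of the window
    return points_inside / area

print(density(10))    # 0.1  -- matches the slide
print(density(100))   # 0.01 -- a 10x larger window gives a 10x smaller "density"
```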
Slide 13: "Density Trap"
- Not caused by a mathematical oddity like the Hilbert curve, but by a line, a perfectly well-behaved Euclidean object!
- This 'trap' will appear for any non-uniform dataset
- Almost ALL real point-sets are non-uniform -> the trap is real
Slide 14: "Density Trap"
- In short: 'density' (count / area) is meaningless here
- What should we do instead?
Slide 15: "Density Trap"
- In short: 'density' (count / area) is meaningless here
- What should we do instead?
- A: look at log(count_of_neighbors) vs. log(area)
Slide 16: Local power law
- In more detail, the 'local power law' (LPL): nb_p(r) = c_p · r^(n_p)
  - nb_p(r): # neighbors of point p within radius r
  - c_p: the 'local constant'
  - n_p: the 'local exponent' (= local intrinsic dimensionality)
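A minimal sketch (not from the slides) of how the LPL coefficients of one point could be fitted, assuming the nb_p(r) = c_p · r^(n_p) form above: count neighbors at a few radii and fit a straight line in log-log space.

```python
import numpy as np

def fit_local_power_law(data, p, radii, metric=np.inf):
    """Fit nb_p(r) ~= c_p * r**n_p for one point p by log-log regression.

    data   : (N, d) array of data points
    p      : (d,) point whose local coefficients we want
    radii  : increasing radii at which neighbors are counted
    metric : order of the Lp norm used for distances (np.inf = L-infinity)
    """
    dists = np.linalg.norm(data - np.asarray(p), ord=metric, axis=1)
    counts = np.array([(dists <= r).sum() for r in radii])

    # least-squares line in log-log space: log nb = log c_p + n_p * log r
    n_p, log_c_p = np.polyfit(np.log(radii), np.log(np.maximum(counts, 1)), deg=1)
    return np.exp(log_c_p), n_p   # (c_p, n_p)
```

A fitted n_p near 1 says the data around p is locally one-dimensional (a curve); a value near 2 says it locally fills the plane.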
Slide 17: Local power law
- Intuitively: to avoid the 'density trap', use
  - n_p, the local intrinsic dimensionality,
  - instead of density
Slide 18: Does LPL make sense?
- For point q (on the line dataset of the earlier example): the LPL gives nb_q(r) = r^1
- No need for 'density', nor for uniformity
[figure: the same line dataset, whose "density" drops from 0.1 (diameter 10) to 0.01 (diameter 100)]
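A quick, self-contained check of that claim (an illustrative sketch only, with hypothetical data): for points spread along a line, the fitted local exponent comes out near 1 regardless of how large the surrounding area is.

```python
import numpy as np

# hypothetical data: 10,000 points evenly spaced on a line segment in 2-D
line = np.column_stack([np.linspace(0.0, 100.0, 10_000), np.zeros(10_000)])
q = np.array([50.0, 0.0])                      # query point in the middle of the line

radii = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
dists = np.linalg.norm(line - q, ord=np.inf, axis=1)
counts = np.array([(dists <= r).sum() for r in radii])

# slope of the log-log line = local exponent n_q; expected ~1 for a line
n_q, _ = np.polyfit(np.log(radii), np.log(counts), 1)
print(round(float(n_q), 2))                    # ~1.0
```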
Slide 19: Local power law and Lx
- If a point obeys the LPL under L∞, it obeys it under any other Lx metric, with the same 'local exponent'
- -> the LPL works easily for ANY Lx metric
Slide 20: Examples
- p1 has a higher 'local exponent' (= 'local intrinsic dimensionality') than p2
[figure: plot of #neighbors(<= r) vs. radius for p1 and p2]
Slide 21: Examples
Slide 22: Roadmap
- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions
Slide 23: Proposed method
- Main idea: if we know (or can approximate) the c_p and n_p of every point p, we can solve all the problems:
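To make the idea concrete, here is an illustrative sketch only (the slides do not reproduce the paper's formulas, which are in Theorems 3.1-3.5): under the LPL, the expected number of points within distance r of a query point q is roughly c_q · r^(n_q), so a range-search selectivity estimate could be formed as that count divided by the dataset cardinality.

```python
def range_selectivity_estimate(c_q: float, n_q: float, r: float, cardinality: int) -> float:
    """Rough range-search selectivity at query point q with radius r.

    Assumes the local power law nb_q(r) ~= c_q * r**n_q for the expected number
    of qualifying points; illustrative only, not the paper's exact Theorem 3.1.
    """
    return min(1.0, (c_q * r ** n_q) / cardinality)
```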
Slide 24: Target Problem
- for any Lp metric
- using a single method

        RS        KNN       RDJ
Sel.              XXXX
I/O
Slide 25: Target Problem
- for any Lp metric (Lemma 3.2)
- using a single method

        RS         KNN        RDJ
Sel.    Thm 3.1    XXXX       Thm 3.2
I/O     Thm 3.3    Thm 3.4    Thm 3.5
Slide 26: Theoretical results
- Interesting observation (Thm 3.4): the cost of a kNN query q depends
  - only on the 'local exponent',
  - and NOT on the 'local constant',
  - nor on the cardinality of the dataset
Slide 27: Implementation
- Given a query point q, we need its local exponent and constant to perform estimation
- But: too expensive to store them for every point. Q: What to do?
Slide 28: Implementation
- Given a query point q, we need its local exponent and constant to perform estimation
- But: too expensive to store them for every point. Q: What to do?
- A: exploit locality:
Slide 29: Implementation
- Nearby points usually have similar local constants and exponents. Thus, one solution:
- 'Anchors': pre-compute the LPLaw for a set of representative points (anchors); use the anchor nearest to q
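A minimal sketch of the anchor idea (illustrative, with hypothetical names; it presumes the coefficients were fitted offline, e.g. with a routine like the one sketched earlier): store (anchor, c, n) triples and answer a query with the coefficients of its nearest anchor.

```python
import numpy as np

class AnchorTable:
    """Toy nearest-anchor lookup for precomputed local power-law coefficients."""

    def __init__(self, anchors, coeffs):
        self.anchors = np.asarray(anchors, dtype=float)  # (k, d) anchor coordinates
        self.coeffs = list(coeffs)                       # k pairs (c_p, n_p), fitted offline

    def lookup(self, q):
        """Return the (c, n) pair of the anchor nearest to query point q (L-infinity)."""
        dists = np.linalg.norm(self.anchors - np.asarray(q, dtype=float), ord=np.inf, axis=1)
        return self.coeffs[int(np.argmin(dists))]

# usage (hypothetical variables): c_q, n_q = AnchorTable(anchor_points, anchor_coeffs).lookup(q)
```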
Slide 30: Implementation
- Choose the anchors with sampling, DBS, or any other method.
Slide 31: Implementation
- (In addition to 'anchors', we also tried 'patches' of near-constant c_p and n_p; this gave similar accuracy at the price of a more complicated implementation)
Slide 32: Experiments - Settings
- Datasets:
  - SC: 40k points representing the coast lines of Scandinavia
  - LB: 53k points corresponding to locations in Long Beach county
- Structure: R*-tree
- Compare the Power method to Minskew (histogram-based) and to the global method (fractal)
Slide 33: Experiments - Settings
- The LPLaw coefficients of each anchor point are computed using L∞ 0.05-neighborhoods
- Queries: biased (following the data distribution); each query workload contains 500 queries
- We report the average error: Σ_i |act_i − est_i| / Σ_i act_i
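For concreteness, the reported error metric amounts to the following tiny helper (an illustrative sketch, not code from the paper):

```python
def workload_error(actual, estimated):
    """Average relative error over a workload: sum(|act - est|) / sum(act)."""
    assert len(actual) == len(estimated)
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / sum(actual)

# e.g. workload_error([100, 40, 10], [90, 50, 10]) -> 20/150 ~= 0.133
```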
Slide 34: Target Problem
- for any Lp metric (Lemma 3.2)
- using a single method

        RS         KNN        RDJ
Sel.    Thm 3.1    XXXX       Thm 3.2
I/O     Thm 3.3    Thm 3.4    Thm 3.5
Slide 35: Range search selectivity
- The LPL method wins
Slide 36: Target Problem
- for any Lp metric (Lemma 3.2)
- using a single method

        RS         KNN        RDJ
Sel.    Thm 3.1    XXXX       Thm 3.2
I/O     Thm 3.3    Thm 3.4    Thm 3.5
Slide 37: Regional distance join selectivity
- No known global method for this case
- The LPL method wins, by an even larger margin
Slide 38: Target Problem
- for any Lp metric (Lemma 3.2)
- using a single method

        RS         KNN        RDJ
Sel.    Thm 3.1    XXXX       Thm 3.2
I/O     Thm 3.3    Thm 3.4    Thm 3.5
Slide 39: Range search query cost
Slide 40: k nearest neighbor cost
Slide 41: Regional distance join cost
Slide 42: Conclusions
- We spot the "density trap" problem of the local uniformity assumption (<- histograms)
- We show how to resolve it, using the 'local intrinsic dimension' instead (-> the 'Local Power Law')
- And we solve all the posed problems:
Slide 43: Conclusions – cont'd
- for any Lp metric
- using a single method

        RS        KNN       RDJ
Sel.              XXXX
I/O
Slide 44: Conclusions – cont'd
- for any Lp metric (Lemma 3.2)
- using a single method (the LPL & 'anchors')

        RS         KNN        RDJ
Sel.    Thm 3.1    XXXX       Thm 3.2
I/O     Thm 3.3    Thm 3.4    Thm 3.5