The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries Yufei Tao U. Hong Kong Christos Faloutsos CMU Dimitris Papadias Hong.


1 The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries
Yufei Tao (U. Hong Kong), Christos Faloutsos (CMU), Dimitris Papadias (Hong Kong UST)

2 Roadmap
- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions

3 Target query types
DB = set of m-d points.
- Range search (RS)
- k nearest neighbor (KNN)
- Regional distance (self-) join (RDJ)
  - e.g., in Louisiana, find all pairs of music stores closer than 1 mi to each other

4 Target problem
Estimate
- Query selectivity
- Query (I/O) cost
- for any Lp metric
- using a single method

5 Target Problem
- for any Lp metric
- using a single method

        RS        KNN       RDJ
Sel.              XXXX
I/O

6 Roadmap
- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions

7 Older query estimation approaches
- Vast literature
  - Sampling, kernel estimation, singular value decomposition, compressed histograms, sketches, maximal independence, Euler formula, etc.
  - BUT: they target specific cases (mostly range search selectivity under the L∞ norm), and their extensions to other problems are unclear

8 Main competitors
- Local method
  - Representative methods: histograms
- Global method
  - Provides a single estimate corresponding to the average selectivity/cost of all queries, independently of their locations
  - Representative methods: fractal and power law

9 Rationale and problems of histograms
- Partition the data space into a set of buckets and assume (local) uniformity
- Problems
  - uniformity
  - tricky/slow estimations, for all but the L∞ norm

10 Roadmap
- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions

11 Inherent defect of histograms
- Density trap – what is the density in the vicinity of q?
  - diameter = 10: 10/100 = 0.1
  - diameter = 100: 100/10,000 = 0.01
Q: What is going on?

12 Inherent defect of histograms
- Density trap – what is the density in the vicinity of q?
  - diameter = 10: 10/100 = 0.1
  - diameter = 100: 100/10,000 = 0.01
Q: What is going on?
A: We ask a silly question, roughly “what is the area of a line?”
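
The arithmetic on this slide can be reproduced with a small sketch. It assumes a hypothetical dataset (not from the slides) of points spaced one unit apart along a straight line, and it measures the neighborhood as an axis-aligned square whose side plays the role of the slide's "diameter". The names `count_in_square` and `density` are illustrative only.

```python
import numpy as np

# Hypothetical dataset: 1,000 points on a horizontal line, one per unit length.
xs = np.arange(-500, 500) + 0.5
points = np.column_stack([xs, np.zeros_like(xs)])
q = np.array([0.0, 0.0])  # query location on the line


def count_in_square(points, q, side):
    """Count points inside the axis-aligned square of the given side centred at q."""
    half = side / 2.0
    inside = np.all(np.abs(points - q) <= half, axis=1)
    return int(inside.sum())


def density(points, q, side):
    """Naive density: neighbor count divided by the area of the square."""
    return count_in_square(points, q, side) / side**2


c10 = count_in_square(points, q, 10)
c100 = count_in_square(points, q, 100)
print(c10, density(points, q, 10))     # 10 0.1
print(c100, density(points, q, 100))   # 100 0.01

# The count grows linearly with the side, so the log-log slope is 1
# even though the "density" changes by a factor of 10.
slope = np.log(c100 / c10) / np.log(100 / 10)
print(slope)
```

The density depends on the size of the neighborhood, while the slope of log(count) against log(side) stays fixed, which is the quantity the following slides build on.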

13 “Density Trap”
- Not caused by a mathematical oddity like the Hilbert curve, but by a line, a perfectly behaving Euclidean object!
- This ‘trap’ will appear for any non-uniform dataset
- Almost ALL real point-sets are non-uniform -> the trap is real

14 “Density Trap”
In short: ‘density’ (count of neighbors / area) is meaningless
- What should we do instead?

15 “Density Trap”
In short: ‘density’ (count of neighbors / area) is meaningless
- What should we do instead?
- A: log(count_of_neighbors) vs log(area)

16 Local power law
- In more detail, the ‘local power law’: nb_p(r) = c_p · r^{n_p}
  - nb_p(r): # neighbors of point p, within radius r
  - c_p: ‘local constant’
  - n_p: ‘local exponent’ (= local intrinsic dimensionality)
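
One way to obtain the local constant and local exponent of a point is a straight-line fit in log-log space. The sketch below assumes the law has the form nb_p(r) = c_p · r^{n_p}, uses the L∞ distance, counts p itself as a neighbor (so the count is never zero), and fits over a hand-picked list of radii. These are illustrative assumptions, and `fit_local_power_law` is a hypothetical helper, not a routine from the presentation.

```python
import numpy as np


def fit_local_power_law(points, p, radii):
    """Least-squares fit of log nb_p(r) = log c_p + n_p * log r; returns (c_p, n_p)."""
    dist = np.max(np.abs(points - p), axis=1)               # L-infinity distance to p
    counts = np.array([(dist <= r).sum() for r in radii])   # includes p itself
    n_p, log_c = np.polyfit(np.log(radii), np.log(counts), 1)
    return float(np.exp(log_c)), float(n_p)


# Hypothetical data 1: points along a line, one per unit length.
xs = np.arange(-500, 500) + 0.5
line = np.column_stack([xs, np.zeros_like(xs)])
c_line, n_line = fit_local_power_law(line, line[500], [5, 10, 20, 40, 80])

# Hypothetical data 2: a filled 101 x 101 integer grid.
g = np.arange(101)
grid = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2).astype(float)
c_grid, n_grid = fit_local_power_law(grid, np.array([50.0, 50.0]), [5, 10, 20, 40])

print(n_line)  # close to 1: the line is locally one-dimensional
print(n_grid)  # close to 2: the grid is locally two-dimensional
```

The fitted exponents come out slightly below 1 and 2 because the point itself is included in every count, which matters most at small radii.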

17 Local power law
Intuitively: to avoid the ‘density trap’, use
- n_p: local intrinsic dimensionality
- instead of density

18 Does LPL make sense?
- For point ‘q’: LPL gives nb_q(r) = c_q · r^1 (no need for ‘density’, nor uniformity)
  - diameter = 10: 10/100 = 0.1
  - diameter = 100: 100/10,000 = 0.01

19 Local power law and Lx
If a point obeys the L.P.L. under L∞, ditto for any other Lx metric, with the same ‘local exponent’
-> LPL works easily, for ANY Lx metric
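
A short norm-comparison argument suggests why the exponent carries over between metrics. It is a sketch under stated assumptions (points in R^d, x ≥ 1, L∞ taken as the reference metric, and an exact power law nb_p(r) = c_p · r^{n_p} under L∞) and is not the proof given by the authors. The superscript on nb marks the metric used to count neighbors.

```latex
% Assumption: nb^{(x)}_p(r) is the number of neighbors of p within L_x radius r, in R^d.
\[
\|v\|_\infty \le \|v\|_x \le d^{1/x}\,\|v\|_\infty
\quad\Longrightarrow\quad
nb^{(\infty)}_p\!\left(d^{-1/x}\, r\right) \;\le\; nb^{(x)}_p(r) \;\le\; nb^{(\infty)}_p(r).
\]
\[
nb^{(\infty)}_p(r) = c_p\, r^{n_p}
\quad\Longrightarrow\quad
c_p\, d^{-n_p/x}\, r^{n_p} \;\le\; nb^{(x)}_p(r) \;\le\; c_p\, r^{n_p}.
\]
```

Both bounds are power laws in r with the same exponent n_p, so switching metrics can only change the constant, by a factor between d^{-n_p/x} and 1.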

20 Examples
p1 has higher ‘local exponent’ = ‘local intrinsic dimensionality’ than p2
[Figure: #neighbors(<=r) versus radius, for points p1 and p2]

21 Examples

22 Roadmap
- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions

23 Proposed method
- Main idea: if we know (or can approximate) the c_p and n_p of every point p, we can solve all the problems:
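
To see why knowing the local constant and exponent is enough, the two simplest estimates can be read directly off a power law of the assumed form nb_q(r) = c_q · r^{n_q}. The sketch below is only that direct reading. It does not reproduce the theorems listed on the following slides, and the function names and the `cardinality` parameter are hypothetical.

```python
def estimate_neighbors(c_q, n_q, r):
    """Estimated number of points within radius r of q under the assumed power law."""
    return c_q * r ** n_q


def estimate_range_selectivity(c_q, n_q, r, cardinality):
    """Fraction of the dataset expected inside a range query of radius r around q."""
    return min(1.0, estimate_neighbors(c_q, n_q, r) / cardinality)


def estimate_knn_radius(c_q, n_q, k):
    """Radius at which the power law predicts k neighbors (inverse of the law)."""
    return (k / c_q) ** (1.0 / n_q)


# Hypothetical coefficients for a query point on a line-like region.
print(estimate_neighbors(2.0, 1.0, 5.0))                 # 10.0
print(estimate_range_selectivity(2.0, 1.0, 5.0, 1000))   # 0.01
print(estimate_knn_radius(2.0, 1.0, 10))                 # 5.0
```

The I/O cost estimates additionally depend on the index structure, so they cannot be derived from the power law alone in this way.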

24 Target Problem
- for any Lp metric
- using a single method

        RS        KNN       RDJ
Sel.              XXXX
I/O

25 Target Problem
- for any Lp metric (Lemma 3.2)
- using a single method

        RS        KNN       RDJ
Sel.    Thm 3.1   XXXX      Thm 3.2
I/O     Thm 3.3   Thm 3.4   Thm 3.5

26 Theoretical results
Interesting observation (Thm 3.4): the cost of a kNN query q depends
- only on the ‘local exponent’
- and NOT on the ‘local constant’,
- nor on the cardinality of the dataset

27 Implementation
- Given a query point q, we need its local exponent and constant to perform estimation
- but: too expensive to store, for every point.
  - Q: What to do?

28 Implementation
- Given a query point q, we need its local exponent and constant to perform estimation
- but: too expensive to store, for every point.
  - Q: What to do?
  - A: exploit locality:

29 Implementation
- Nearby points usually have similar local constants and exponents. Thus, one solution:
- ‘anchors’: pre-compute the LPLaw for a set of representative points (anchors), and use the nearest ‘anchor’ to q
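
The anchor idea can be sketched as a nearest-neighbor lookup followed by a power-law evaluation. This is a minimal illustration that assumes the coefficients of each anchor are already available as (c, n) pairs, that the law has the form c · r^n, and that the nearest anchor is found by brute force under Euclidean distance. The names `nearest_anchor` and `estimate_count`, and the sample values, are hypothetical.

```python
import numpy as np


def nearest_anchor(anchors, q):
    """Index of the anchor closest to q (Euclidean distance, brute force)."""
    d2 = np.sum((anchors - q) ** 2, axis=1)
    return int(np.argmin(d2))


def estimate_count(anchors, coeffs, q, r):
    """Estimate nb_q(r) by borrowing (c, n) from the anchor nearest to q."""
    c, n = coeffs[nearest_anchor(anchors, q)]
    return c * r ** n


# Hypothetical anchors with pre-computed (local constant, local exponent).
anchors = np.array([[0.0, 0.0], [10.0, 10.0]])
coeffs = [(2.0, 1.0), (3.0, 2.0)]

print(estimate_count(anchors, coeffs, np.array([1.0, 1.0]), 5.0))  # 10.0, from anchor 0
print(estimate_count(anchors, coeffs, np.array([9.0, 8.0]), 2.0))  # 12.0, from anchor 1
```

Only the anchors and their coefficient pairs need to be stored, which is what keeps the space cost independent of the dataset size.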

30 Implementation
- Choose anchors: with sampling, DBS, or any other method.

31 Implementation
- (In addition to ‘anchors’, we also tried to use ‘patches’ of near-constant c_p and n_p. It gave similar accuracy, for a more complicated implementation.)

32 Experiments - Settings
- Datasets
  - SC: 40k points representing the coastlines of Scandinavia
  - LB: 53k points corresponding to locations in Long Beach county
- Structure: R*-tree
- Compare the Power method to
  - Minskew
  - Global method (fractal)

33 Experiments - Settings
- The LPLaw coefficients of each anchor point are computed using L∞ 0.05-neighborhoods
- Queries: biased (following the data distribution)
  - A query workload contains 500 queries
  - We report the average error Σ_i |act_i − est_i| / Σ_i act_i
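
The reported error metric is the sum of absolute errors over the workload divided by the sum of the actual values. A small sketch follows. The function name `workload_error` and the sample numbers are made up for illustration and are not results from the experiments.

```python
def workload_error(actual, estimated):
    """Sum of |act_i - est_i| over the workload, divided by the sum of act_i."""
    if len(actual) != len(estimated):
        raise ValueError("actual and estimated must have the same length")
    total_actual = sum(actual)
    if total_actual == 0:
        raise ValueError("sum of actual values is zero; the error is undefined")
    absolute_error = sum(abs(a - e) for a, e in zip(actual, estimated))
    return absolute_error / total_actual


# Made-up workload of three queries: (2 + 2 + 3) / 60
print(workload_error([10, 20, 30], [12, 18, 33]))
```

Because the normalization uses the total of the actual values, queries with large results dominate the metric, unlike a per-query average of relative errors.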

34 Target Problem
- for any Lp metric (Lemma 3.2)
- using a single method

        RS        KNN       RDJ
Sel.    Thm 3.1   XXXX      Thm 3.2
I/O     Thm 3.3   Thm 3.4   Thm 3.5

35 Range search selectivity
- The LPL method wins

36 Target Problem
- for any Lp metric (Lemma 3.2)
- using a single method

        RS        KNN       RDJ
Sel.    Thm 3.1   XXXX      Thm 3.2
I/O     Thm 3.3   Thm 3.4   Thm 3.5

37 Regional distance join selectivity
- No known global method in this case
- The LPL method wins, with a higher margin

38 Target Problem
- for any Lp metric (Lemma 3.2)
- using a single method

        RS        KNN       RDJ
Sel.    Thm 3.1   XXXX      Thm 3.2
I/O     Thm 3.3   Thm 3.4   Thm 3.5

39 Range search query cost

40 k nearest neighbor cost

41 Regional distance join cost

42 Conclusions
- We spot the “density trap” problem of the local uniformity assumption (<- histograms)
- We show how to resolve it, using the ‘local intrinsic dimension’ instead (-> ‘Local Power Law’)
- and we solved all posed problems:

43 Conclusions – cont’d
- for any Lp metric
- using a single method

        RS        KNN       RDJ
Sel.              XXXX
I/O

44 Conclusions – cont’d
- for any Lp metric (Lemma 3.2)
- using a single method (LPL & ‘anchors’)

        RS        KNN       RDJ
Sel.    Thm 3.1   XXXX      Thm 3.2
I/O     Thm 3.3   Thm 3.4   Thm 3.5

