Slide 1: The Power-Method: A Comprehensive Estimation Technique for Multi-Dimensional Queries
Yufei Tao (U. Hong Kong), Christos Faloutsos (CMU), Dimitris Papadias (Hong Kong UST)
Slide 2: Roadmap
- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions
Slide 3: Target query types
DB = a set of m-dimensional points.
- Range search (RS)
- k nearest neighbor (KNN)
- Regional distance (self-)join (RDJ): in Louisiana, find all pairs of music stores closer than 1 mi to each other
Slide 4: Target problem
Estimate
- query selectivity
- query (I/O) cost
- for any Lp metric
- using a single method
Slide 5: Target Problem
- for any Lp metric
- using a single method

        RS        KNN       RDJ
Sel.              XXXX
I/O
Slide 6: Roadmap
- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions
Slide 7: Older query estimation approaches
- Vast literature: sampling, kernel estimation, singular value decomposition, compressed histograms, sketches, maximal independence, the Euler formula, etc.
- BUT: they target specific cases (mostly range-search selectivity under the L∞ norm), and their extensions to other problems are unclear
Slide 8: Main competitors
- Local methods
  - Representative methods: histograms
- Global methods
  - Provide a single estimate corresponding to the average selectivity/cost of all queries, independently of their locations
  - Representative methods: fractals and power laws
Slide 9: Rationale and problems of histograms
- Partition the data space into a set of buckets and assume (local) uniformity
- Problems:
  - the uniformity assumption
  - tricky/slow estimations for all but the L∞ norm
Slide 10: Roadmap
- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions
Slide 11: Inherent defect of histograms
- Density trap – what is the density in the vicinity of q?
  - diameter = 10: 10/100 = 0.1
  - diameter = 100: 100/10,000 = 0.01
- Q: What is going on?
[figure: points lying on a line around the query point q]
Slide 12: Inherent defect of histograms
- Density trap – what is the density in the vicinity of q?
  - diameter = 10: 10/100 = 0.1
  - diameter = 100: 100/10,000 = 0.01
- Q: What is going on?
- A: we are asking a silly question, roughly "what is the area of a line?"
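A minimal numeric illustration of the trap (not from the slides; the spacing of one point per unit of length is assumed so that the numbers above are reproduced): the count/area "density" around q keeps shrinking as the measurement window grows.

```python
# hypothetical sketch: points spaced 1 unit apart on a line, measured in a square window
def density(diameter: float, spacing: float = 1.0) -> float:
    points_inside = diameter / spacing   # points of the line falling in the window
    area = diameter ** 2                 # 2-D area of the window
    return points_inside / area

print(density(10))    # 0.1  -- matches the slide
print(density(100))   # 0.01 -- a 10x larger window gives a 10x smaller "density"
```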
Slide 13: "Density Trap"
- Not caused by a mathematical oddity like the Hilbert curve, but by a line, a perfectly well-behaved Euclidean object!
- This 'trap' will appear for any non-uniform dataset
- Almost ALL real point-sets are non-uniform -> the trap is real
Slide 14: "Density Trap"
- In short: 'density' (count / area) is meaningless here
- What should we do instead?
Slide 15: "Density Trap"
- In short: 'density' (count / area) is meaningless here
- What should we do instead?
- A: look at log(count_of_neighbors) vs. log(area)
Slide 16: Local power law
- In more detail, the 'local power law' (LPL): nb_p(r) = c_p · r^(n_p)
  - nb_p(r): # neighbors of point p within radius r
  - c_p: the 'local constant'
  - n_p: the 'local exponent' (= local intrinsic dimensionality)
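A minimal sketch (not from the slides) of how the LPL coefficients of one point could be fitted, assuming the nb_p(r) = c_p · r^(n_p) form above: count neighbors at a few radii and fit a straight line in log-log space.

```python
import numpy as np

def fit_local_power_law(data, p, radii, metric=np.inf):
    """Fit nb_p(r) ~= c_p * r**n_p for one point p by log-log regression.

    data   : (N, d) array of data points
    p      : (d,) point whose local coefficients we want
    radii  : increasing radii at which neighbors are counted
    metric : order of the Lp norm used for distances (np.inf = L-infinity)
    """
    dists = np.linalg.norm(data - np.asarray(p), ord=metric, axis=1)
    counts = np.array([(dists <= r).sum() for r in radii])

    # least-squares line in log-log space: log nb = log c_p + n_p * log r
    n_p, log_c_p = np.polyfit(np.log(radii), np.log(np.maximum(counts, 1)), deg=1)
    return np.exp(log_c_p), n_p   # (c_p, n_p)
```

A fitted n_p near 1 says the data around p is locally one-dimensional (a curve); a value near 2 says it locally fills the plane.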
Slide 17: Local power law
- Intuitively: to avoid the 'density trap', use
  - n_p, the local intrinsic dimensionality,
  - instead of density
Slide 18: Does LPL make sense?
- For point q (on the line dataset of the earlier example): the LPL gives nb_q(r) = r^1
- No need for 'density', nor for uniformity
[figure: the same line dataset, whose "density" drops from 0.1 (diameter 10) to 0.01 (diameter 100)]
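A quick, self-contained check of that claim (an illustrative sketch only, with hypothetical data): for points spread along a line, the fitted local exponent comes out near 1 regardless of how large the surrounding area is.

```python
import numpy as np

# hypothetical data: 10,000 points evenly spaced on a line segment in 2-D
line = np.column_stack([np.linspace(0.0, 100.0, 10_000), np.zeros(10_000)])
q = np.array([50.0, 0.0])                      # query point in the middle of the line

radii = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
dists = np.linalg.norm(line - q, ord=np.inf, axis=1)
counts = np.array([(dists <= r).sum() for r in radii])

# slope of the log-log line = local exponent n_q; expected ~1 for a line
n_q, _ = np.polyfit(np.log(radii), np.log(counts), 1)
print(round(float(n_q), 2))                    # ~1.0
```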
Slide 19: Local power law and Lx
- If a point obeys the LPL under L∞, it obeys it under any other Lx metric, with the same 'local exponent'
- -> the LPL works easily for ANY Lx metric
Slide 20: Examples
- p1 has a higher 'local exponent' (= 'local intrinsic dimensionality') than p2
[figure: plot of #neighbors(<= r) vs. radius for p1 and p2]
Slide 21: Examples
Slide 22: Roadmap
- Problem – motivation
- Survey
- Proposed method – main idea
- Proposed method – details
- Experiments
- Conclusions
Slide 23: Proposed method
- Main idea: if we know (or can approximate) the c_p and n_p of every point p, we can solve all the problems:
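To make the idea concrete, here is an illustrative sketch only (the slides do not reproduce the paper's formulas, which are in Theorems 3.1-3.5): under the LPL, the expected number of points within distance r of a query point q is roughly c_q · r^(n_q), so a range-search selectivity estimate could be formed as that count divided by the dataset cardinality.

```python
def range_selectivity_estimate(c_q: float, n_q: float, r: float, cardinality: int) -> float:
    """Rough range-search selectivity at query point q with radius r.

    Assumes the local power law nb_q(r) ~= c_q * r**n_q for the expected number
    of qualifying points; illustrative only, not the paper's exact Theorem 3.1.
    """
    return min(1.0, (c_q * r ** n_q) / cardinality)
```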
Slide 24: Target Problem
- for any Lp metric
- using a single method

        RS        KNN       RDJ
Sel.              XXXX
I/O
Slide 25: Target Problem
- for any Lp metric (Lemma 3.2)
- using a single method

        RS         KNN        RDJ
Sel.    Thm 3.1    XXXX       Thm 3.2
I/O     Thm 3.3    Thm 3.4    Thm 3.5
Slide 26: Theoretical results
- Interesting observation (Thm 3.4): the cost of a kNN query q depends
  - only on the 'local exponent',
  - and NOT on the 'local constant',
  - nor on the cardinality of the dataset
Slide 27: Implementation
- Given a query point q, we need its local exponent and constant to perform estimation
- But: too expensive to store them for every point. Q: What to do?
Slide 28: Implementation
- Given a query point q, we need its local exponent and constant to perform estimation
- But: too expensive to store them for every point. Q: What to do?
- A: exploit locality:
Slide 29: Implementation
- Nearby points usually have similar local constants and exponents. Thus, one solution:
- 'Anchors': pre-compute the LPLaw for a set of representative points (anchors); use the anchor nearest to q
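A minimal sketch of the anchor idea (illustrative, with hypothetical names; it presumes the coefficients were fitted offline, e.g. with a routine like the one sketched earlier): store (anchor, c, n) triples and answer a query with the coefficients of its nearest anchor.

```python
import numpy as np

class AnchorTable:
    """Toy nearest-anchor lookup for precomputed local power-law coefficients."""

    def __init__(self, anchors, coeffs):
        self.anchors = np.asarray(anchors, dtype=float)  # (k, d) anchor coordinates
        self.coeffs = list(coeffs)                       # k pairs (c_p, n_p), fitted offline

    def lookup(self, q):
        """Return the (c, n) pair of the anchor nearest to query point q (L-infinity)."""
        dists = np.linalg.norm(self.anchors - np.asarray(q, dtype=float), ord=np.inf, axis=1)
        return self.coeffs[int(np.argmin(dists))]

# usage (hypothetical variables): c_q, n_q = AnchorTable(anchor_points, anchor_coeffs).lookup(q)
```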
Slide 30: Implementation
- Choose the anchors with sampling, DBS, or any other method.
Slide 31: Implementation
- (In addition to 'anchors', we also tried 'patches' of near-constant c_p and n_p; this gave similar accuracy at the price of a more complicated implementation)
Slide 32: Experiments - Settings
- Datasets:
  - SC: 40k points representing the coast lines of Scandinavia
  - LB: 53k points corresponding to locations in Long Beach county
- Structure: R*-tree
- Compare the Power method to Minskew (histogram-based) and to the global method (fractal)
Slide 33: Experiments - Settings
- The LPLaw coefficients of each anchor point are computed using L∞ 0.05-neighborhoods
- Queries: biased (following the data distribution); each query workload contains 500 queries
- We report the average error: Σ_i |act_i − est_i| / Σ_i act_i
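For concreteness, the reported error metric amounts to the following tiny helper (an illustrative sketch, not code from the paper):

```python
def workload_error(actual, estimated):
    """Average relative error over a workload: sum(|act - est|) / sum(act)."""
    assert len(actual) == len(estimated)
    return sum(abs(a - e) for a, e in zip(actual, estimated)) / sum(actual)

# e.g. workload_error([100, 40, 10], [90, 50, 10]) -> 20/150 ~= 0.133
```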
Slide 34: Target Problem
- for any Lp metric (Lemma 3.2)
- using a single method

        RS         KNN        RDJ
Sel.    Thm 3.1    XXXX       Thm 3.2
I/O     Thm 3.3    Thm 3.4    Thm 3.5
Slide 35: Range search selectivity
- The LPL method wins
Slide 36: Target Problem
- for any Lp metric (Lemma 3.2)
- using a single method

        RS         KNN        RDJ
Sel.    Thm 3.1    XXXX       Thm 3.2
I/O     Thm 3.3    Thm 3.4    Thm 3.5
Slide 37: Regional distance join selectivity
- No known global method for this case
- The LPL method wins, by an even larger margin
Slide 38: Target Problem
- for any Lp metric (Lemma 3.2)
- using a single method

        RS         KNN        RDJ
Sel.    Thm 3.1    XXXX       Thm 3.2
I/O     Thm 3.3    Thm 3.4    Thm 3.5
Slide 39: Range search query cost
Slide 40: k nearest neighbor cost
Slide 41: Regional distance join cost
Slide 42: Conclusions
- We spot the "density trap" problem of the local uniformity assumption (<- histograms)
- We show how to resolve it, using the 'local intrinsic dimension' instead (-> the 'Local Power Law')
- And we solve all the posed problems:
Slide 43: Conclusions – cont'd
- for any Lp metric
- using a single method

        RS        KNN       RDJ
Sel.              XXXX
I/O
Slide 44: Conclusions – cont'd
- for any Lp metric (Lemma 3.2)
- using a single method (the LPL & 'anchors')

        RS         KNN        RDJ
Sel.    Thm 3.1    XXXX       Thm 3.2
I/O     Thm 3.3    Thm 3.4    Thm 3.5