Placing Skips Optimally in Expectation. Flavio Chierichetti, Silvio Lattanzi, Federico Mari, Alessandro Panconesi. Supported by.


1 Placing Skips Optimally in Expectation. Flavio Chierichetti, Silvio Lattanzi, Federico Mari, Alessandro Panconesi. Supported by

2 Problem Statement

3 Answering conjunctive queries. Query: latte macchiato. Latte: 3 7 9 14 19 23 41 47. Macchiato: 2 4 7 11 19 41 57 62

4 Answering conjunctive queries: compute the intersection of 2 sorted lists. Query: latte macchiato. Latte: 3 7 9 14 19 23 41 47. Macchiato: 2 4 7 11 19 41 57 62

5 Merging. Latte: 3 7 9 14 19 23 41 47. Macchiato: 2 4 7 9 19 41 57 62

6 Merging. Latte: 3 7 9 14 19 23 41 47. Macchiato: 2 4 7 9 19 41 57 62

7 Merging. Latte: 3 7 9 14 19 23 41 47. Macchiato: 2 4 7 9 19 41 57 62

8 Merging. Latte: 3 7 9 14 19 23 41 47. Macchiato: 2 4 7 9 19 41 57 62

9 Merging. Latte: 3 7 9 14 19 23 41 47. Macchiato: 2 4 7 9 19 41 57 62
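
The merge on slides 5 to 9 can be sketched in Python as a two-pointer walk over the sorted posting lists. This is a minimal illustration, not code from the talk: the function name `intersect` is made up here, and the two lists are the document IDs as they can best be read off the slide.

```python
def intersect(a, b):
    """Intersect two sorted lists of document IDs with a linear merge."""
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # Document contains both terms.
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1  # a is behind: advance it
        else:
            j += 1  # b is behind: advance it
    return common


latte = [3, 7, 9, 14, 19, 23, 41, 47]
macchiato = [2, 4, 7, 9, 19, 41, 57, 62]
print(intersect(latte, macchiato))  # [7, 9, 19, 41]
```

Every element of both lists may be read in the worst case, which is the cost that skips try to reduce.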

10 Skips. Latte: 1 2 3 4 5 6 7 8. Macchiato: 7 8 17 18 19 41 57 62

11 Skips. Latte: 1 2 3 4 5 6 7 8. Macchiato: 7 8 17 18 19 41 57 62

12 Skips. Latte: 1 2 3 4 5 6 7 8. Macchiato: 7 8 17 18 19 41 57 62

13 Conventional WSDM (list t): skips are placed every √N positions
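
As a rough Python sketch of the conventional placement, the code below puts a skip every ⌊√N⌋ positions and follows a skip only when its target does not overshoot the current ID of the other list. That rule is the usual textbook one and is assumed here rather than taken from the slides; `skip_targets` and `intersect_with_skips` are illustrative names.

```python
import math


def skip_targets(n):
    """Map position -> skip target, with one skip every floor(sqrt(n)) positions."""
    step = max(1, math.isqrt(n))
    # Only keep skips whose target is still inside the list.
    return {i: i + step for i in range(0, n - step, step)}


def intersect_with_skips(a, b):
    """Intersect two sorted lists, taking a skip whenever it cannot miss a match."""
    skips_a, skips_b = skip_targets(len(a)), skip_targets(len(b))
    i = j = 0
    common = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            common.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            t = skips_a.get(i)
            # Safe to jump only if the landing ID is still <= the ID we look for.
            i = t if t is not None and a[t] <= b[j] else i + 1
        else:
            t = skips_b.get(j)
            j = t if t is not None and b[t] <= a[i] else j + 1
    return common


latte = [1, 2, 3, 4, 5, 6, 7, 8]
macchiato = [7, 8, 17, 18, 19, 41, 57, 62]
print(intersect_with_skips(latte, macchiato))  # [7, 8]
```

On this input the first list is traversed through positions 0, 2, 4, 6 instead of one element at a time.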

14 Question If we know the query distribution, can we place skips better?

15 Problem statement If we know the query distribution, can we place skips in order to minimize the expected time of a merge?

16 Problem statement If we know the query distribution, can we place skips in order to minimize the expected time of a merge? Is the assumption realistic?

17 The Power of the Law

18 The query distribution contains a lot of information. Can we provably take advantage of it?

19 Algorithms to follow work with any query distribution whatsoever

20 Algorithms to follow work with any query distribution whatsoever ... and can be extended to deal with soft conjunctions

21 Outline: Skip placement policies; A matter of definitions; Algorithms; Experiments

22 Skip Placement Policies

23 Spaghetti Skips

24 t

25 Simple Skips t

26 t This is the most interesting case

27 A Matter of Definitions

28 Useful Documents: 1. q: world cup. world: 3 7 9 14 19 23 41 47. cup: 2 4 7 9 19 41 57 62. Relevant docs are “useful”

29 But usefulness does not coincide with relevance

30 Useful Documents: 2. q: world cup. world: 14 15 ? 18 47 ? ? ? cup: 13 19 41 43 62 ? ? ? Is the skip useful for q?

31 Useful Documents: 2. q: world cup. world: 14 15 ? 18 47 ? ? ? cup: 13 19 41 43 62 ? ? ? Is the skip useful for q?

32 Useful Documents: 2. q: world cup. world: 14 15 ? 18 47 ? ? ? cup: 13 19 41 43 62 ? ? ? Is the skip useful for q?

33 Useful Documents: 2. q: world cup. world: 14 15 ? 18 47 ? ? ? cup: 13 19 41 43 62 ? ? ? The skip is useful

34 Useful Documents: 2. q: world cup. world: 14 15 ? ? 47 ? ? ? cup: 13 19 41 43 62 ? ? ?

35 Useful Documents: 2. q: world cup. world: 14 15 ? ? 47 ? ? ? cup: 13 19 41 43 62 ? ? ?

36 Useful Documents: 2. q: world cup. world: 14 15 16 ? 47 ? ? ? cup: 13 19 41 43 62 ? ? ?

37 Useful Documents: 2. q: world cup. world: 14 15 16 18 47 ? ? ? cup: 13 19 41 43 62 ? ? ?

38 Useful Documents: 2. q: world cup. world: 14 15 16 18 47 ? ? ? cup: 13 19 41 43 62 ? ? ?

39 Useful Documents: 2. q: world cup. world: 14 15 16 18 47 ? ? ? cup: 13 19 41 43 62 ? ? ?

40 Useful Documents: 2. q: world cup. world: 14 15 16 18 47 ? ? ? cup: 13 19 41 43 62 ? ? ? The skip is useless

41 Useful Documents: 2. q: world cup. world: 14 15 16 18 47 ? ? ? cup: 13 19 41 43 62 ? ? ? The skip is useless: 18 cannot be skipped

42 Useful Documents: 2 Useful documents are those that cannot be avoided during a merge

43 Induced Distributions The query distribution induces another distribution on the postings. platypus: postings 1 2 3 … i … j … k … n, with probabilities p_1, p_2, …, p_i, …, p_n

44 Induced Distributions The query distribution induces another distribution on the postings. platypus: postings 1 2 3 … i … j … k … n, with probabilities p_1, p_2, …, p_i, …, p_n. p_i = Pr(i useful for q | platypus ∈ q)

45 Induced Distributions The query distribution induces another distribution on the postings. platypus: postings 1 2 3 … i … j … k … n, with probabilities p_1, p_2, …, p_i, …, p_n. We will assume this distribution to be known

46 Induced Distributions In practice these probabilities can be approximated using a small sample of the query universe
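
One way such an approximation could look in Python is an empirical frequency over sampled queries that contain the term. Everything named here is hypothetical: `estimate_usefulness` is an illustrative helper, and the usefulness test is passed in by the caller because the transcript does not spell out a formal definition. The toy predicate below uses plain relevance (the document matches every query term) only as a stand-in; the slides note that relevant documents are useful but that the two notions do not coincide.

```python
import random


def estimate_usefulness(postings, queries, term, is_useful, sample_size, seed=0):
    """Estimate p_i = Pr(i useful for q | term in q) from a sample of queries."""
    rng = random.Random(seed)
    containing = [q for q in queries if term in q]
    sample = rng.sample(containing, min(sample_size, len(containing)))
    if not sample:
        return {doc: 0.0 for doc in postings}
    counts = {doc: 0 for doc in postings}
    for q in sample:
        for doc in postings:
            if is_useful(doc, q):
                counts[doc] += 1
    return {doc: counts[doc] / len(sample) for doc in postings}


# Toy index and query log (made up for illustration).
index = {"platypus": [1, 2, 3, 4], "egg": [2, 4], "bill": [4, 5]}
queries = [("platypus", "egg"), ("platypus", "bill"), ("egg", "bill")]


def is_relevant(doc, query):
    # Stand-in predicate: the document appears in every term's posting list.
    return all(doc in index[t] for t in query)


p = estimate_usefulness(index["platypus"], queries, "platypus", is_relevant, sample_size=10)
print(p)  # {1: 0.0, 2: 0.5, 3: 0.0, 4: 1.0}
```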

47 Making Life Simple Events like “a is useful” and “b is useful” are not independent ... but from now on we will assume that they are

48 Making Life Simple Events like “a is useful” and “b is useful” are not independent ... but from now on we will assume that they are. This simplifying assumption will be vindicated by our experiments

49 Algorithms

50 Input: a list with, for each doc, the probability that it is useful Output: skip placement that minimizes the expected time to merge

51 Algorithms Input: a list with, for each doc, the probability that it is useful Output: skip placement that minimizes the expected time to merge cost of a merge = #elements read in posting list

52 Algorithms O(nt) algorithm for spaghetti skips, where t is the average length of a skip; O(n log n) for simple skips

53 Algorithms O(nt) algorithm for spaghetti skips, where t is the average length of a skip; O(n log n) for simple skips. The O(n log n) algorithm for simple skips is by far the most interesting

54 Simple Skips t: d_1 d_2 … d_i … d_n and p_1 p_2 … p_n. Build the solution from left to right. M(i) is the best placement for the prefix d_1 … d_i

55 Computing M(i) In computing M(i) we have two choices. We either place a skip landing at position i or we do not

56 Computing M(i) If we place no skip to i then M(i) = M(i-1)

57 Computing M(i) (figure: a skip from j landing at i, labelled M(j) and G(j,i))

58 Computing M(i) M(i) = max { M(i-1), max_j [M(j) + G(j,i)] }. Want the inner max_j [M(j) + G(j,i)] in O(log n)

59 Computing M(i) max_j [M(j) + G(j,i)] = M(T(i)) + G(T(i),i), where T(i) is the maximizing j
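
The recurrence M(i) = max { M(i-1), max_j [M(j) + G(j,i)] } can be evaluated directly in O(n²) time, which is useful as a reference before any speed-up. The Python sketch below does that under a loudly assumed gain function, since the transcript does not define G: a skip from j to i is taken to save the i-j-1 postings strictly between them whenever none of them is useful (using the independence assumption), minus a fixed `skip_cost`. Positions are 0-based here, whereas the slides count from 1, and `place_simple_skips` is an illustrative name.

```python
def place_simple_skips(p, skip_cost=1.0):
    """Quadratic DP for M(i) = max(M(i-1), max_j M(j) + G(j, i)).

    p[l] is the assumed probability that posting l is useful.
    G(j, i) is a HYPOTHETICAL gain: (i - j - 1) * Pr(no posting strictly
    between j and i is useful) - skip_cost.
    Returns (best expected gain, list of (start, landing) skips).
    """
    n = len(p)
    if n == 0:
        return 0.0, []
    M = [0.0] * n        # M[i]: best gain for the prefix 0..i
    start = [None] * n   # start[i]: j if the best prefix solution has a skip j -> i
    for i in range(1, n):
        M[i] = M[i - 1]  # choice 1: no skip lands at i
        none_useful = 1.0  # Pr(postings j+1 .. i-1 are all not useful)
        for j in range(i - 1, -1, -1):  # choice 2: a skip j -> i
            gain = (i - j - 1) * none_useful - skip_cost
            if M[j] + gain > M[i]:
                M[i], start[i] = M[j] + gain, j
            none_useful *= 1.0 - p[j]
    # Walk back through the choices to list the skips.
    skips, i = [], n - 1
    while i > 0:
        if start[i] is None:
            i -= 1
        else:
            skips.append((start[i], i))
            i = start[i]
    skips.reverse()
    return M[n - 1], skips


print(place_simple_skips([1.0, 0.0, 0.0, 0.0, 1.0]))  # (2.0, [(0, 4)])
```

With the first and last postings always useful and the middle three never useful, one skip over the middle saves three reads at a cost of one.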

60 Monotonicity of T(i) (figure: T(i), i and i+1)

61 Monotonicity of T(i): T(i) ≤ T(i+1)

62 Monotonicity of T(i,k) T(i,k) is the best jump to i under the additional constraint that it must start no later than k

63 Monotonicity of T(i,k) Key lemma: T(i,k) ≤ T(i+1,k)

64 Monotonicity of T(i,k) Let î be the smallest index i such that T(i,k) = k. Then T(j,k) = T(j,k-1) if j < î, and T(j,k) = k if j ≥ î

65 Updating T(i,k) T(i,k-1): 1 1 1 1 1 i_1 i_1 i_1 i_1 i_2 i_2 i_2 i_2 i_3 i_3 i_3 i_4 i_4 i_4 i_4

66 Updating T(i,k) T(i,k-1): 1 1 1 1 1 i_1 i_1 i_1 i_1 i_2 i_2 i_2 i_2 i_3 i_3 i_3 i_4 i_4 i_4 i_4, with 1 < i_1 < i_2 < i_3 < i_4 < k-1

67 Updating T(i,k) T(i,k-1): 1 1 1 1 1 i_1 i_1 i_1 i_1 i_2 i_2 i_2 i_2 i_3 i_3 i_3 i_4 i_4 i_4 i_4. The best skip to reach j starts at position i_1

68 Updating T(i,k) T(i,k-1): 1 1 1 1 1 i_1 i_1 i_1 i_1 i_2 i_2 i_2 i_2 i_3 i_3 i_3 i_4 i_4 i_4 i_4. T(i,n) gives the optimal placement

69 Updating T(i,k) T(i,k-1): 1 1 1 1 1 i_1 i_1 i_1 i_1 i_2 i_2 i_2 i_2 i_3 i_3 i_3 i_4 i_4 i_4 i_4. T(i,n) gives the optimal placement. T(·,1) → T(·,2) → … → T(·,k) → … → T(·,n)

70 Updating T(i,k) T(i,k-1): 1 1 1 1 1 i_1 i_1 i_1 i_1 i_2 i_2 i_2 i_2 i_3 i_3 i_3 i_4 i_4 i_4 i_4. Find min {i: T(i,k) = k}

71 Updating T(i,k) T(i,k-1): 1 1 1 1 1 i_1 i_1 i_1 i_1 i_2 i_2 i_2 i_2 i_3 i_3 i_3 i_4 i_4 i_4 i_4. T(i,k): 1 1 1 1 1 i_1 i_1 i_1 i_1 i_2 i_2 k k k k k k k k k, where the run of k starts at min {i: T(i,k) = k}
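
The update from T(·,k-1) to T(·,k) can be sketched in Python by storing the row as runs of equal values and overwriting a suffix with k. This is only an illustration of the update step, not the authors' implementation: `update_runs` and the predicate `k_is_best` are hypothetical names, and the predicate is assumed to be monotone (false up to some index, true from there on), which is what the key lemma is taken to guarantee.

```python
def update_runs(runs, k, lo, hi, k_is_best):
    """Turn the run-encoded row T(., k-1) into T(., k).

    runs: list of (first_index, value) pairs sorted by first_index; each run
          extends up to the next run's first_index (the last one up to hi).
    k_is_best(i): True if the best jump to i starts at k; ASSUMED monotone in i.
    """
    # Binary search for i_hat = min {i in [lo, hi] : k_is_best(i)}.
    left, right = lo, hi + 1
    while left < right:
        mid = (left + right) // 2
        if k_is_best(mid):
            right = mid
        else:
            left = mid + 1
    i_hat = left
    if i_hat > hi:
        return list(runs)  # k is never the best start: row unchanged
    new_runs = list(runs)
    # Runs starting at or after i_hat are entirely overwritten by k.
    while new_runs and new_runs[-1][0] >= i_hat:
        new_runs.pop()
    new_runs.append((i_hat, k))
    return new_runs


# Shape of the slide's example, with numbers standing in for 1, i_1, ..., i_4 and k.
row = [(10, 1), (15, 2), (19, 3), (23, 4), (26, 5)]
print(update_runs(row, 6, 10, 29, lambda i: i >= 21))
# [(10, 1), (15, 2), (19, 3), (21, 6)]
```

Each run is appended once and removed at most once, so the run maintenance is cheap overall and the binary search contributes the logarithmic factor per step.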

72 The resulting algorithm takes O(N log N) time, where N is the length of the list

73 Experiments

74 Space

75 Time to merge

76 Build up time

77 Size of query sample for spaghetti skips

78 Size of query sample for simple skips 1/256

79 The Bottom Line Simple skips are the solution of choice (for power law distributions): they merge as fast as spaghetti skips (the general case); they occupy less space; build time is much faster; they need a smaller sample to collect statistics on document usefulness

80 Summing up First attempt to exploit knowledge of the distribution in a rigorous way. Much work remains to be done, but results are encouraging

81 Extensions Taking the cache into account; taking dependencies into account; comparing against skip lists and other data structures. Thanks for your attention

