Fast Indexes and Algorithms for Set Similarity Selection Queries. M. Hadjieleftheriou, A. Chandel, N. Koudas, D. Srivastava.

1 Fast Indexes and Algorithms for Set Similarity Selection Queries. M. Hadjieleftheriou, A. Chandel, N. Koudas, D. Srivastava

2 Strings as sets. $s_1$ = “Main St. Maine”: words ‘Main’ ‘St.’ ‘Maine’, or 3-grams ‘Mai’ ‘ain’ ‘in ’ ‘n S’ ‘ St’ ‘St.’ ‘t. ’ … $s_2$ = “Main St. Main”: ‘Main’ ‘St.’ ‘Main’. How similar are $s_1$ and $s_2$?
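
The decomposition on this slide can be sketched in a few lines of Python. This is only an illustration: the helper names `word_tokens` and `qgrams` are made up here, and the sketch treats each string as a plain set, so the repeated ‘Main’ in the second string collapses to a single token.

```python
def word_tokens(s):
    """Whitespace-separated words; a set drops repeated words."""
    return set(s.split())


def qgrams(s, q=3):
    """All overlapping substrings of length q (3-grams by default)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


s1 = "Main St. Maine"
s2 = "Main St. Main"

print(sorted(word_tokens(s1)))
print(sorted(word_tokens(s2)))
print(sorted(qgrams(s1)))
```

Either token set can then be fed to a set similarity measure; q-grams tolerate spelling mistakes better than whole words, at the price of many more tokens per string.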

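3 TF/IDF weighted similarity. Inverse Document Frequency (idf): ‘Main’ is common, ‘Maine’ is not; $\mathrm{idf}(t) = \log_2[1 + N / \mathrm{df}(t)]$, where $N$ is the number of strings and $\mathrm{df}(t)$ the number of strings containing $t$. Term Frequency (tf): ‘Main’ appears twice in $s_2$. Similarity: inner product.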
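
A small sketch of the idf formula, assuming N is the number of strings in the collection and df(t) the number of strings that contain token t. The four-string corpus is invented purely for illustration.

```python
import math
from collections import Counter


def idf_table(sets):
    """idf(t) = log2(1 + N / df(t)) for every token in the collection."""
    n = len(sets)
    df = Counter(t for s in sets for t in set(s))
    return {t: math.log2(1 + n / df[t]) for t in df}


# Hypothetical toy collection: 'Main' is in every string, 'Maine' in one.
corpus = [
    {"Main", "St."},
    {"Main", "Ave."},
    {"Main", "St.", "Maine"},
    {"Main", "Rd."},
]
idf = idf_table(corpus)
print(idf["Main"], idf["Maine"])
```

The common token gets the smallest weight and the rare token the largest, which is the behaviour the slide describes for ‘Main’ versus ‘Maine’.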

4 Is TF important? Information retrieval: given a query string, retrieve relevant documents. Relational databases: given a query string, retrieve relevant strings. In practice TF is small in many applications.

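5 IDF similarity. Query $q = \{t_1, \dots, t_n\}$; set $s = \{r_1, \dots, r_m\}$. Length: $\mathrm{len}(s) = \big(\sum_{t \in s} \mathrm{idf}(t)^2\big)^{1/2}$. Similarity: $I(q, s) = \sum_{t \in s \cap q} \mathrm{idf}(t)^2 / (\mathrm{len}(s)\,\mathrm{len}(q))$. IDF is as good as TF/IDF in practice!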
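
The two definitions translate directly into code. The sketch below assumes the idf weights are already available in a dictionary; the weights shown are made-up round numbers so that the arithmetic is easy to follow.

```python
import math


def length(s, idf):
    """len(s) = sqrt(sum of idf(t)^2 over the tokens of s)."""
    return math.sqrt(sum(idf[t] ** 2 for t in s))


def idf_similarity(q, s, idf):
    """I(q, s) = sum of idf(t)^2 over shared tokens / (len(s) * len(q))."""
    shared = sum(idf[t] ** 2 for t in q & s)
    return shared / (length(q, idf) * length(s, idf))


# Hypothetical weights, chosen only for readability.
idf = {"Main": 1.0, "St.": 2.0, "Maine": 3.0}
q = {"Main", "St."}
s = {"Main", "St.", "Maine"}
print(idf_similarity(q, s, idf))
```

Here the shared mass is 1 + 4 = 5, len(q) = sqrt(5) and len(s) = sqrt(14), so the score is 5 / sqrt(70); a set compared with itself scores 1 and disjoint sets score 0.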

6 How can I build an index? Let $w(t, s) = \mathrm{idf}(t) / \mathrm{len}(s)$. Then $I(q, s) = \sum_{t \in q \cap s} w(t, s)\, w(t, q)$. So: decompose strings into tokens; compute the idf of each token; create one inverted list per token. Sort lists by string id: do a merge join. Sort lists by $w$: run TA/NRA.
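
The sort-by-id variant can be sketched as follows. This is an assumption-laden illustration rather than the authors' implementation: the function names are invented, the index lives in memory, and the multi-way merge join is done with `heapq.merge` over the id-sorted lists of the query tokens.

```python
import heapq
import math
from collections import defaultdict
from itertools import groupby


def build_index(sets, idf):
    """One inverted list per token: (string id, w(t, s)), sorted by id."""
    index = defaultdict(list)
    for sid, s in enumerate(sets):
        ln = math.sqrt(sum(idf[t] ** 2 for t in s))
        for t in s:
            index[t].append((sid, idf[t] / ln))  # ids arrive in order
    return index


def select_by_merge(q, index, idf, tau):
    """Merge the query's lists by id and sum w(t, s) * w(t, q) per id."""
    len_q = math.sqrt(sum(idf[t] ** 2 for t in q))
    streams = [
        [(sid, w * idf[t] / len_q) for sid, w in index.get(t, [])]
        for t in q
    ]
    result = []
    for sid, group in groupby(heapq.merge(*streams), key=lambda e: e[0]):
        if sum(part for _, part in group) >= tau:
            result.append(sid)
    return result


# Hypothetical weights and strings.
idf = {"Main": 1.0, "St.": 2.0, "Maine": 3.0, "Ave.": 2.0}
sets = [{"Main", "St."}, {"Main", "St.", "Maine"}, {"Main", "Ave."}]
q = {"Main", "St."}
index = build_index(sets, idf)
print(select_by_merge(q, index, idf, 0.5))
```

The merge join reads every list of the query in full, which is exactly the cost the weight-sorted strategies on the following slides try to avoid.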

7 Example: Sort by id

8 Example: Sort by w. NRA: round-robin list accesses; main-memory hash table; computes lower and upper bounds per entry.

9 Semantic properties of IDF. Order Preservation: for all $t_1 \neq t_2$, if $w(t_1, s) < w(t_1, r)$ then $w(t_2, s) < w(t_2, r)$. Length Boundedness: for query $q$, set $s$ and threshold $\tau$, $I(q, s) \ge \tau \Rightarrow \tau\,\mathrm{len}(q) \le \mathrm{len}(s) \le \mathrm{len}(q) / \tau$.
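
Length Boundedness follows from the definition of $I(q, s)$ in a few steps. The derivation below is a reconstruction from the formulas on the earlier slides, writing the threshold as $\tau$; it is not taken from the slides themselves.

```latex
\begin{align*}
\sum_{t \in q \cap s} \mathrm{idf}(t)^2
  &\le \min\big(\mathrm{len}(q)^2,\ \mathrm{len}(s)^2\big)
  && \text{since } q \cap s \subseteq q \text{ and } q \cap s \subseteq s \\
I(q, s)
  &\le \frac{\min(\mathrm{len}(q), \mathrm{len}(s))}
            {\max(\mathrm{len}(q), \mathrm{len}(s))}
  && \text{dividing by } \mathrm{len}(q)\,\mathrm{len}(s) \\
I(q, s) \ge \tau
  &\implies \frac{\mathrm{len}(s)}{\mathrm{len}(q)} \ge \tau
     \ \text{ and } \ \frac{\mathrm{len}(q)}{\mathrm{len}(s)} \ge \tau \\
  &\implies \tau\,\mathrm{len}(q) \le \mathrm{len}(s) \le \mathrm{len}(q) / \tau
\end{align*}
```

Order Preservation is simpler still: $w(t, s) = \mathrm{idf}(t) / \mathrm{len}(s)$, so within any one list the entries are ordered by $\mathrm{len}(s)$ alone, and that order is the same in every list.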

10 Improved NRA (iNRA). Order Preservation determines whether a given set appears in a list or not: in list $t_i$ we encounter $s_1$, then $s_2$; if in list $t_k$ we encounter $s_2$ first, then $s_1$ cannot be in $t_k$. Length Boundedness restricts the search to a small portion of the lists.

11 Something surprising. Lemma: NRA reads arbitrarily more elements than iNRA. Lemma: NRA reads arbitrarily more elements than any algorithm that uses the Length Boundedness property.

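12 Any other strategies? NRA style is breadth-first. Try depth-first: sort the query lists in decreasing idf order, i.e. let $q = \{t_1, \dots, t_n\}$ with $\mathrm{idf}(t_1) > \mathrm{idf}(t_2) > \dots > \mathrm{idf}(t_n)$. Let $\lambda_i$ be the maximum length a set $s$ in list $t_i$ can have such that $I(q, s) \ge \tau$, assuming that $s$ appears in every list $t_k$ with $k \ge i$: $\lambda_i = \sum_{i \le k \le n} \mathrm{idf}(t_k)^2 / (\tau\,\mathrm{len}(q))$. $\lambda_i$ is a natural cutoff point, and $\lambda_1 > \lambda_2 > \dots > \lambda_n$.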
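
The per-list cutoffs can be computed with one pass over the query's idf values. In this sketch the cutoff for list i is called `lam[i]` and the threshold `tau`; both names, and the sample weights, are assumptions made for illustration.

```python
import math


def cutoffs(query_idfs, tau):
    """Cutoff per list, for lists taken in decreasing idf order.

    lam[i] = (sum of idf(t_k)^2 for k >= i) / (tau * len(q))
    """
    idfs = sorted(query_idfs, reverse=True)
    len_q = math.sqrt(sum(v * v for v in idfs))
    lam, suffix = [], 0.0
    for v in reversed(idfs):          # build suffix sums from the last list
        suffix += v * v
        lam.append(suffix / (tau * len_q))
    return lam[::-1]


# Hypothetical query with three tokens.
lam = cutoffs([3.0, 2.0, 1.0], tau=0.5)
print(lam)
```

The first cutoff equals len(q) / tau, the Length Boundedness upper limit, and every later cutoff is strictly smaller because one more squared idf term has dropped out of the sum.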

13 Shortest-First (SF). Sort $q = \{t_1, \dots, t_n\}$ in decreasing idf order. Let $C$ be the candidate set. For $1 \le i \le n$: skip to the first entry of $t_i$ with $\mathrm{len}(s) \ge \tau\,\mathrm{len}(q)$; compute $\lambda_i$; let $\mu_i = \min(\lambda_i, \mathrm{len}(q) / \tau)$; repeat { $s$ = pop next element from $t_i$; maintain lower/upper bounds of the entries in $C$ } until $\mathrm{len}(s) > \max(\mathrm{max\_len}(C), \mu_i)$.
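
A compact sketch of the Shortest-First loop, under several simplifying assumptions that are not in the slides: lists are in-memory Python lists of `(len(s), id)` sorted by length, the "skip" step is a binary search, and the lower/upper bound bookkeeping is reduced to an upper-bound prune after each list. All function and variable names are hypothetical.

```python
import bisect
import math
from collections import defaultdict


def build_length_sorted_index(sets, idf):
    """One list per token holding (len(s), id), shortest sets first."""
    index = defaultdict(list)
    for sid, s in enumerate(sets):
        ln = math.sqrt(sum(idf[t] ** 2 for t in s))
        for t in s:
            index[t].append((ln, sid))
    for postings in index.values():
        postings.sort()
    return index


def shortest_first(q, index, idf, tau):
    toks = sorted(q, key=lambda t: idf[t], reverse=True)
    len_q = math.sqrt(sum(idf[t] ** 2 for t in toks))
    cand = {}  # id -> [len(s), sum of idf(t)^2 over lists where s was seen]
    for i, t in enumerate(toks):
        lam = sum(idf[u] ** 2 for u in toks[i:]) / (tau * len_q)
        mu = min(lam, len_q / tau)
        # Read until entries are longer than every candidate and the cutoff.
        stop = max(max((v[0] for v in cand.values()), default=0.0), mu)
        postings = index.get(t, [])
        start = bisect.bisect_left(postings, (tau * len_q,))  # skip short sets
        for ln, sid in postings[start:]:
            if ln > stop:
                break
            if sid in cand:
                cand[sid][1] += idf[t] ** 2
            elif ln <= lam:  # a new set longer than lam cannot reach tau
                cand[sid] = [ln, idf[t] ** 2]
        # Drop candidates that cannot reach tau even if in all later lists.
        rest = sum(idf[u] ** 2 for u in toks[i + 1:])
        cand = {sid: v for sid, v in cand.items()
                if (v[1] + rest) / (v[0] * len_q) >= tau}
    return sorted(cand)


# Hypothetical weights and strings.
idf = {"Main": 1.0, "St.": 2.0, "Maine": 3.0, "Ave.": 2.0}
sets = [{"Main", "St."}, {"Main", "St.", "Maine"}, {"Main", "Ave."}, {"Maine"}]
q = {"Main", "St."}
index = build_length_sorted_index(sets, idf)
print(shortest_first(q, index, idf, 0.5))
```

After the last list the remaining mass is zero, so the final prune keeps exactly the sets whose score reaches the threshold. Comparisons at the threshold are done in floating point, so a set whose score equals the threshold exactly may fall on either side.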

14 Comparison with NRA. Lemma: let $q = \{t_1, \dots, t_n\}$ and let $d$ be the maximum depth to which SF descends over all lists. In the worst case iNRA will read $(d - 1)(n - 1)$ elements more than SF. But surprisingly

15 A hybrid strategy. Run iNRA normally. Use $\lambda_i$ and $\mathrm{max\_len}(C)$ to stop reading from a particular list. This guarantees that iNRA stops with or before SF. Drawback of NRA variants: very high bookkeeping cost compared to SF.

16 Experiments. DBLP, IMDB and YellowPages datasets: actors, movies, authors, businesses, etc. Vary threshold, query size, query strings and mistakes. Test wall-clock time and pruning power. Algorithms: NRA, TA, iNRA, iTA, SF, Hybrid, Sort-by-id, improved SQL-based.

17 Wall-clock time vs. Threshold

18 Wall-clock time vs. Query size (plot series: TA, NRA, Sort-by-id, iTA, SF)

19 Space

20 Conclusion. Proposed a simplified TF/IDF measure. Identified strong monotonicity properties. Used the properties to design efficient algorithms. SF works best overall in practice: it achieves sub-second answers in most practical cases.

21 Q&A

22 Pruning power vs. Threshold

23 Pruning power vs. Query size (plot series: NRA, TA, iTA)