Published by Liliana Louise Cannon. Modified over 9 years ago.
1
Fast Indexes and Algorithms for Set Similarity Selection Queries
M. Hadjieleftheriou, A. Chandel, N. Koudas, D. Srivastava
2
Strings as sets
$s_1$ = “Main St. Maine”:
Words: ‘Main’ ‘St.’ ‘Maine’
3-grams: ‘Mai’ ‘ain’ ‘in ’ ‘n S’ ‘ St’ ‘St.’ ‘t. ’ …
$s_2$ = “Main St. Main”:
Words: ‘Main’ ‘St.’ ‘Main’
How similar are $s_1$ and $s_2$?
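The decomposition on this slide can be sketched in a few lines of Python. This is only an illustration: the helper names `words` and `qgrams` are made up here, and the slide does not say whether word tokens and 3-grams are combined into one set or used as alternative tokenizations, so the sketch keeps them separate.

```python
def words(s: str) -> list:
    """Split a string into whitespace-delimited word tokens."""
    return s.split()


def qgrams(s: str, q: int = 3) -> list:
    """Return all overlapping character q-grams of s, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


s1 = "Main St. Maine"
s2 = "Main St. Main"

s1_words = words(s1)    # word tokens of s1
s1_grams = qgrams(s1)   # 3-grams of s1, spaces included
s2_words = words(s2)    # 'Main' occurs twice here
```

Duplicates are kept in the lists on purpose, because the next slide counts how often a token occurs in a string.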
3
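TF/IDF weighted similarity
Inverse Document Frequency (idf): ‘Main’ is common, ‘Maine’ is not
$\mathrm{idf}(t) = \log_2\left[1 + N / \mathrm{df}(t)\right]$, where $N$ is the number of sets and $\mathrm{df}(t)$ is the number of sets containing $t$
Term Frequency (tf): ‘Main’ appears twice in $s_2$
Similarity: inner product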
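A minimal sketch of the two statistics, assuming N is the number of sets in the collection and df(t) is the number of sets containing t. The three-string collection and the function name `idf_table` are invented for illustration.

```python
import math
from collections import Counter


def idf_table(collection):
    """idf(t) = log2(1 + N / df(t)) for every token in the collection."""
    n = len(collection)
    # df counts each token once per set, however often it repeats inside it.
    df = Counter(t for tokens in collection for t in set(tokens))
    return {t: math.log2(1 + n / d) for t, d in df.items()}


# Hypothetical collection of tokenized strings.
collection = [
    ["Main", "St.", "Maine"],
    ["Main", "St.", "Main"],
    ["Main", "Ave."],
]
idf = idf_table(collection)

# Term frequency of each token inside the second string.
tf_s2 = Counter(collection[1])
```

In this toy collection ‘Main’ occurs in every set and gets the lowest idf, while ‘Maine’ occurs in one set and gets a higher one.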
4
Is TF important?
Information retrieval: given a query string, retrieve relevant documents
Relational databases: given a query string, retrieve relevant strings
In practice, TF is small in many applications
5
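IDF similarity
Query $q = \{t_1, \dots, t_n\}$
Set $s = \{r_1, \dots, r_m\}$
Length: $\mathrm{len}(s) = \left(\sum_{t \in s} \mathrm{idf}(t)^2\right)^{1/2}$
Similarity: $I(q, s) = \sum_{t \in s \cap q} \mathrm{idf}(t)^2 / \left(\mathrm{len}(s)\,\mathrm{len}(q)\right)$
IDF is as good as TF/IDF in practice!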
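The two formulas translate directly into code. The sketch below assumes tokens are treated as sets (duplicates ignored) and uses made-up idf values. `length` and `idf_similarity` are illustrative names.

```python
import math


def length(tokens, idf):
    """len(s): square root of the sum of squared idf values of s's tokens."""
    return math.sqrt(sum(idf[t] ** 2 for t in set(tokens)))


def idf_similarity(q, s, idf):
    """I(q, s): squared idf mass of the shared tokens, normalized by both lengths."""
    shared = set(q) & set(s)
    return sum(idf[t] ** 2 for t in shared) / (length(q, idf) * length(s, idf))


# Hypothetical idf values, chosen only to keep the arithmetic easy to follow.
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0}
q = ["Main", "St.", "Maine"]
s = ["Main", "St."]

sim_qs = idf_similarity(q, s, idf)   # 2 / (sqrt(6) * sqrt(2))
sim_qq = idf_similarity(q, q, idf)   # a set compared with itself
```

Missing the rare token ‘Maine’ costs much more similarity than missing a common token would, which is the intended effect of the idf weighting.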
6
How can I build an index?
Let $w(t, s) = \mathrm{idf}(t) / \mathrm{len}(s)$
Then $I(q, s) = \sum_{t \in q \cap s} w(t, s)\, w(t, q)$
So:
Decompose strings into tokens
Compute the idf of each token
Create one inverted list per token
Sort lists by string id: do a merge join
Sort lists by w: run TA/NRA
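The sort-by-id option can be sketched as follows. This is an in-memory illustration under assumed data, not the authors' implementation: `build_lists_by_id` and `merge_join` are hypothetical names, and the idf values are invented.

```python
import heapq
import math
from itertools import groupby


def build_lists_by_id(sets, idf):
    """One inverted list per token, holding (set_id, w(t, s)) in set-id order."""
    lengths = {sid: math.sqrt(sum(idf[t] ** 2 for t in toks))
               for sid, toks in sets.items()}
    lists = {}
    for sid in sorted(sets):
        for t in sets[sid]:
            lists.setdefault(t, []).append((sid, idf[t] / lengths[sid]))
    return lists


def merge_join(query, lists, idf, tau):
    """Merge the query's id-sorted lists and sum w(t, s) * w(t, q) per set id."""
    len_q = math.sqrt(sum(idf[t] ** 2 for t in query))
    streams = [
        [(sid, w_ts * idf[t] / len_q) for sid, w_ts in lists.get(t, [])]
        for t in query
    ]
    result = {}
    # heapq.merge keeps the combined stream ordered by set id, so all
    # contributions for one set arrive consecutively.
    for sid, group in groupby(heapq.merge(*streams), key=lambda e: e[0]):
        score = sum(contrib for _, contrib in group)
        if score >= tau:
            result[sid] = score
    return result


# Hypothetical data: three sets and made-up idf values.
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0, "Ave.": 2.0}
sets = {1: {"Main", "St.", "Maine"}, 2: {"Main", "St."}, 3: {"Main", "Ave."}}
lists = build_lists_by_id(sets, idf)
answers = merge_join(["Main", "St.", "Maine"], lists, idf, tau=0.5)
```

The merge join always reads every entry of every query list, which is the cost the later slides try to avoid.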
7
Example: Sort by id
8
Example: Sort by w
NRA:
Round-robin list accesses
Main-memory hash table
Computes lower and upper bounds per entry
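A simplified NRA-style loop for a threshold query might look like the sketch below. It assumes each list is sorted by w in decreasing order, that query tokens are distinct, and that a set can be reported as soon as its lower bound reaches the threshold, so returned scores are lower bounds rather than exact values. All names and data are illustrative.

```python
import math


def build_lists_by_w(sets, idf):
    """One inverted list per token: (set_id, w(t, s)), sorted by w descending."""
    lengths = {sid: math.sqrt(sum(idf[t] ** 2 for t in toks))
               for sid, toks in sets.items()}
    lists = {}
    for sid, toks in sets.items():
        for t in toks:
            lists.setdefault(t, []).append((sid, idf[t] / lengths[sid]))
    for lst in lists.values():
        lst.sort(key=lambda e: -e[1])
    return lists


def nra(query, lists, idf, tau):
    """Return ({set_id: lower bound >= tau}, number of list entries read)."""
    len_q = math.sqrt(sum(idf[t] ** 2 for t in query))
    wq = {t: idf[t] / len_q for t in query}
    pos = {t: 0 for t in query}
    # frontier[t]: largest contribution an unread entry of list t can still make.
    frontier = {t: wq[t] * lists[t][0][1] if lists.get(t) else 0.0 for t in query}
    seen = {}   # hash table: set id -> {token: contribution read so far}
    reads = 0
    while True:
        progressed = False
        for t in query:                       # one round-robin pass
            lst = lists.get(t, [])
            if pos[t] < len(lst):
                sid, w = lst[pos[t]]
                pos[t] += 1
                reads += 1
                seen.setdefault(sid, {})[t] = w * wq[t]
                frontier[t] = w * wq[t] if pos[t] < len(lst) else 0.0
                progressed = True
        undecided = False
        for parts in seen.values():
            lower = sum(parts.values())
            upper = sum(parts.get(t, frontier[t]) for t in query)
            if lower < tau <= upper:          # could still go either way
                undecided = True
                break
        # Stop when no unseen set can reach tau and every seen set is decided.
        if not progressed or (sum(frontier.values()) < tau and not undecided):
            break
    hits = {sid: sum(p.values()) for sid, p in seen.items()
            if sum(p.values()) >= tau}
    return hits, reads


# Hypothetical data: three sets and made-up idf values.
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0, "Ave.": 2.0}
sets = {1: {"Main", "St.", "Maine"}, 2: {"Main", "St."}, 3: {"Main", "Ave."}}
lists = build_lists_by_w(sets, idf)
hits, reads = nra(["Main", "St.", "Maine"], lists, idf, tau=0.5)
```

Recomputing the bounds of every hash-table entry after each pass is the bookkeeping cost that NRA-style algorithms pay for reading fewer list entries.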
9
Semantic properties of IDF
Order Preservation: for all $t_1 \ne t_2$: if $w(t_1, s) < w(t_1, r)$, then $w(t_2, s) < w(t_2, r)$
Length Boundedness: for query $q$, set $s$ and threshold $\tau$: $I(q, s) \ge \tau \Rightarrow \tau\,\mathrm{len}(q) \le \mathrm{len}(s) \le \mathrm{len}(q) / \tau$
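Length Boundedness follows in two steps from the definitions of len and I on the earlier slides. The derivation below is a reconstruction, writing $\tau$ for the threshold; it is not taken from the slides.

```latex
\begin{align*}
\sum_{t \in q \cap s} \mathrm{idf}(t)^2
  &\le \min\left(\mathrm{len}(q)^2,\ \mathrm{len}(s)^2\right)
  && \text{since } q \cap s \subseteq q \text{ and } q \cap s \subseteq s \\
I(q, s)
  &\le \min\left(\frac{\mathrm{len}(s)}{\mathrm{len}(q)},\
                 \frac{\mathrm{len}(q)}{\mathrm{len}(s)}\right)
  && \text{dividing by } \mathrm{len}(s)\,\mathrm{len}(q) \\
I(q, s) \ge \tau
  &\implies \tau\,\mathrm{len}(q) \le \mathrm{len}(s) \le \frac{\mathrm{len}(q)}{\tau}
\end{align*}
```

Order Preservation is immediate from $w(t, s) = \mathrm{idf}(t) / \mathrm{len}(s)$: within any list, sorting by $w$ is the same as sorting by $1 / \mathrm{len}(s)$, so two sets appear in the same relative order in every list that contains both.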
10
Improved NRA
Order Preservation determines whether a given set appears in a list or not
List $t_i$: encounter $s_1$, then $s_2$
List $t_k$: encounter $s_2$ first (so $s_1$ is not in $t_k$)
Length Boundedness restricts the search to a small portion of each list
11
Something surprising
Lemma: NRA reads arbitrarily more elements than iNRA
Lemma: NRA reads arbitrarily more elements than any algorithm that uses the Length Boundedness property
12
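Any other strategies?
NRA style is breadth-first
Try depth-first: sort query lists in decreasing idf order
– Let $q = \{t_1, \dots, t_n\}$ and $\mathrm{idf}(t_1) > \mathrm{idf}(t_2) > \dots > \mathrm{idf}(t_n)$
Let $\lambda_i$ be the maximum length a set $s$ in list $t_i$ can have such that $I(q, s) \ge \tau$, assuming that $s$ exists in all lists $t_k$ with $k > i$
– $\lambda_i = \sum_{i \le k \le n} \mathrm{idf}(t_k)^2 / \left(\tau\,\mathrm{len}(q)\right)$
$\lambda_i$ is a natural cutoff point
$\lambda_1 > \lambda_2 > \dots > \lambda_n$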
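The cutoff points can be computed with one pass over suffix sums. In this sketch the cutoff for list i is taken to be the sum of squared idf values of tokens i through n, divided by the threshold times len(q). The function name `cutoffs`, the symbol names and the idf values are assumptions made for illustration.

```python
import math


def cutoffs(query_idfs, tau):
    """Per-list length cutoffs, for query tokens taken in decreasing idf order."""
    idfs = sorted(query_idfs, reverse=True)
    len_q = math.sqrt(sum(v * v for v in idfs))
    lams = []
    suffix = 0.0
    # Walk from the last (lowest-idf) token back to the first, accumulating
    # the squared idf mass that is still available from list i onwards.
    for v in reversed(idfs):
        suffix += v * v
        lams.append(suffix / (tau * len_q))
    lams.reverse()
    return lams


# Hypothetical query with idf values 2, 1 and 1, and threshold 0.5.
lams = cutoffs([2.0, 1.0, 1.0], tau=0.5)
```

The first cutoff equals len(q) divided by the threshold, which matches the upper limit given by Length Boundedness, and each later cutoff is smaller because less idf mass remains.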
13
Shortest-First
Sort $q = \{t_1, \dots, t_n\}$ in decreasing idf order
Let $C$ be the candidate set (initially empty)
For $1 \le i \le n$
  Skip to first entry with $\mathrm{len}(s) \ge \tau\,\mathrm{len}(q)$
  Compute $\lambda_i$
  Let $\lambda_i = \min(\lambda_i, \mathrm{len}(q) / \tau)$
  Repeat
  – $s$ = pop next element from $t_i$
  – Maintain lower/upper bounds of entries in $C$
  Until $\mathrm{len}(s) > \max(\text{max len } C, \lambda_i)$
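One way to read the Shortest-First loop is sketched below. It is a reconstruction under several assumptions: lists are sorted by set length in increasing order, the skip target is the threshold times len(q), a candidate is dropped once its upper bound falls below the threshold, and a small `eps` guards floating-point comparisons. Names and data are illustrative.

```python
import math
from bisect import bisect_left


def build_lists_by_len(sets, idf):
    """One inverted list per token: (len(s), set_id), shortest sets first."""
    lengths = {sid: math.sqrt(sum(idf[t] ** 2 for t in toks))
               for sid, toks in sets.items()}
    lists = {}
    for sid, toks in sets.items():
        for t in toks:
            lists.setdefault(t, []).append((lengths[sid], sid))
    for lst in lists.values():
        lst.sort()
    return lists


def shortest_first(query, lists, idf, tau, eps=1e-12):
    tokens = sorted(query, key=lambda t: idf[t], reverse=True)
    len_q = math.sqrt(sum(idf[t] ** 2 for t in tokens))
    # suffix[i] = squared idf mass of tokens[i:], used for cutoffs and bounds.
    suffix = [0.0] * (len(tokens) + 1)
    for i in range(len(tokens) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + idf[tokens[i]] ** 2
    cands = {}  # set id -> [length, score accumulated so far (lower bound)]
    for i, t in enumerate(tokens):
        lst = lists.get(t, [])
        lam = min(suffix[i] / (tau * len_q), len_q / tau)
        max_len_c = max((c[0] for c in cands.values()), default=0.0)
        # Skip entries too short to reach the threshold.
        start = bisect_left(lst, (tau * len_q - eps,))
        for ls, sid in lst[start:]:
            if ls > max(max_len_c, lam) + eps:
                break                          # nothing useful further down
            contrib = idf[t] ** 2 / (ls * len_q)
            if sid in cands:
                cands[sid][1] += contrib
            elif ls <= lam + eps:              # new sets must fit the cutoff
                cands[sid] = [ls, contrib]
                max_len_c = max(max_len_c, ls)
        # Upper bound: assume the candidate appears in every remaining list.
        cands = {sid: c for sid, c in cands.items()
                 if c[1] + suffix[i + 1] / (c[0] * len_q) >= tau - eps}
    return {sid: c[1] for sid, c in cands.items() if c[1] >= tau - eps}


# Hypothetical data: three sets and made-up idf values.
idf = {"Main": 1.0, "St.": 1.0, "Maine": 2.0, "Ave.": 2.0}
sets = {1: {"Main", "St.", "Maine"}, 2: {"Main", "St."}, 3: {"Main", "Ave."}}
lists = build_lists_by_len(sets, idf)
answers = shortest_first(["Main", "St.", "Maine"], lists, idf, tau=0.5)
```

Each list is scanned once from its short end, and the only state kept between lists is the candidate table, which is much lighter than recomputing bounds after every round-robin pass.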
14
Comparison with NRA
Lemma: Let $q = \{t_1, \dots, t_n\}$ and let $d$ be the maximum depth to which SF descends over all lists. In the worst case, iNRA will read $(d - 1)(n - 1)$ more elements than SF
But surprisingly
15
A hybrid strategy
Run iNRA normally
Use $\lambda_i$ and max len C to stop reading from a particular list
This guarantees that iNRA stops with or before SF
Drawback of NRA variants: very high bookkeeping cost compared to SF
16
Experiments
DBLP, IMDB and YellowPages datasets
Actors, movies, authors, businesses, etc.
Vary threshold, query size, query strings and mistakes
Test wall-clock time, pruning power
Algorithms: NRA, TA, iNRA, iTA, SF, Hybrid, Sort-by-id, Improved SQL based
17
Wall-clock time vs. Threshold
18
Wall-clock time vs. Query size (plot; series: TA, NRA, Sort-by-id, iTA, SF)
19
Space
20
Conclusion
Proposed a simplified TF/IDF measure
Identified strong monotonicity properties
Used the properties to design efficient algorithms
SF works best overall in practice
Achieves sub-second answers in most practical cases
21
Q&A
22
Pruning power vs. Threshold
23
Pruning power vs. Query size (plot; series: NRA, TA, iTA)