Fast Indexes and Algorithms for Set Similarity Selection Queries
M. Hadjieleftheriou, A. Chandel, N. Koudas, D. Srivastava
Strings as sets
- s₁ = “Main St. Maine”: ‘Main’ ‘St.’ ‘Maine’ ‘Mai’ ‘ain’ ‘in ’ ‘n S’ ‘ St’ ‘St.’ ‘t. ’ …
- s₂ = “Main St. Main”: ‘Main’ ‘St.’ ‘Main’
- How similar are s₁ and s₂?
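The decomposition above (word tokens plus overlapping 3-grams) can be sketched as follows; `tokens` is an illustrative helper, not from the paper:

```python
def tokens(s, q=3):
    """Decompose a string into word tokens plus overlapping q-grams (q=3 here)."""
    words = s.split()
    grams = [s[i:i + q] for i in range(len(s) - q + 1)]
    return words + grams

s1 = "Main St. Maine"
print(tokens(s1)[:3])   # word tokens: ['Main', 'St.', 'Maine']
print(tokens(s1)[3:6])  # first 3-grams: ['Mai', 'ain', 'in ']
```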
TF/IDF weighted similarity
- Inverse Document Frequency (idf): ‘Main’ is common, ‘Maine’ is not
  – idf(t) = log₂[1 + N / df(t)]
- Term Frequency (tf): ‘Main’ appears twice in s₂
- Similarity: inner product
Is TF important?
- Information retrieval: given a query string, retrieve relevant documents
- Relational databases: given a query string, retrieve relevant strings
- In practice TF is small in many applications
IDF similarity
- Query q = {t₁, …, tₙ}, set s = {r₁, …, rₘ}
- Length: len(s) = (Σ_{t∈s} idf(t)²)^½
- I(q, s) = Σ_{t∈s∩q} idf(t)² / (len(s) · len(q))
- IDF is as good as TF/IDF in practice!
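A minimal sketch of the measure, assuming a toy document-frequency table `df` and corpus size `N` (illustrative values, not from the paper):

```python
import math

df = {'Main': 90, 'St.': 80, 'Maine': 2}  # hypothetical document frequencies
N = 100                                   # hypothetical corpus size

def idf(t):
    return math.log2(1 + N / df[t])

def length(s):
    # len(s) = (sum of idf(t)^2 over the distinct tokens of s)^(1/2)
    return math.sqrt(sum(idf(t) ** 2 for t in s))

def I(q, s):
    # I(q, s) = sum over t in q ∩ s of idf(t)^2 / (len(s) * len(q))
    return sum(idf(t) ** 2 for t in q & s) / (length(s) * length(q))

q = {'Main', 'St.', 'Maine'}
print(I(q, {'Main', 'St.', 'Maine'}))  # identical sets score 1 (up to fp rounding)
print(I(q, {'Main', 'St.'}))           # missing the rare token 'Maine' hurts a lot
```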
How can I build an index?
- Let w(t, s) = idf(t) / len(s)
- Then I(q, s) = Σ_{t∈q∩s} w(t, s) · w(t, q)
- So:
  – Decompose strings into tokens
  – Compute the idf of each token
  – Create one inverted list per token
  – Sort lists by string id: do a merge join
  – Sort lists by w: run TA/NRA
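A sketch of the index: one id-sorted inverted list per token, carrying precomputed weights w(t, s). The scoring loop below accumulates w(t, s) · w(t, q) per id in a hash table; a real merge join would scan the id-sorted lists in lockstep instead. All names and sample data are illustrative:

```python
import math
from collections import defaultdict

def weights(toks, df, N):
    """w(t, s) = idf(t) / len(s) for each distinct token t of s."""
    idf = {t: math.log2(1 + N / df[t]) for t in toks}
    ln = math.sqrt(sum(v ** 2 for v in idf.values()))
    return {t: v / ln for t, v in idf.items()}

def build_index(sets, df, N):
    """One inverted list per token; entries (string id, w(t, s)), sorted by id."""
    index = defaultdict(list)
    for sid in sorted(sets):
        for t, w in weights(sets[sid], df, N).items():
            index[t].append((sid, w))
    return index

def query(q, index, df, N):
    """Score every id touched by a query token: sum of w(t, s) * w(t, q)."""
    scores = defaultdict(float)
    for t, w_tq in weights(q, df, N).items():
        for sid, w_ts in index.get(t, []):
            scores[sid] += w_ts * w_tq
    return dict(scores)
```

With the toy data below, the exact match scores 1 and dominates the partial match.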
Example: Sort by id
Example: Sort by w
- NRA: round-robin list accesses
- Main-memory hash table
- Computes lower and upper bounds per entry
Semantic properties of IDF
- Order Preservation: for all t₁, t₂: if w(t₁, s) < w(t₁, r), then w(t₂, s) < w(t₂, r)
- Length Boundedness: for query q, set s, threshold θ:
  I(q, s) ≥ θ ⇒ θ · len(q) ≤ len(s) ≤ len(q) / θ
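Length Boundedness turns the threshold directly into a pruning window on set lengths; a one-function sketch (θ written as `theta`):

```python
def length_window(len_q, theta):
    """If I(q, s) >= theta, then len(s) must lie in [theta*len(q), len(q)/theta]."""
    return theta * len_q, len_q / theta

lo, hi = length_window(2.0, 0.5)
print(lo, hi)  # 1.0 4.0 -- sets outside this length range can be skipped
```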
Improved NRA (iNRA)
- Order Preservation determines whether a given set appears in a list or not
  – In tᵢ: encounter s₁, then s₂; in tₖ: encounter s₂ first ⇒ s₁ does not appear in tₖ
- Length Boundedness restricts the search to a small portion of each list
Something surprising
- Lemma: NRA reads arbitrarily more elements than iNRA
- Lemma: NRA reads arbitrarily more elements than any algorithm that uses the Length Boundedness property
Any other strategies?
- NRA style is breadth-first; try depth-first:
- Sort query lists in decreasing idf order
  – Let q = {t₁, …, tₙ} with idf(t₁) > idf(t₂) > … > idf(tₙ)
- Let λᵢ be the maximum length a set s in tᵢ can have s.t. I(q, s) ≥ θ, assuming that s exists in all tₖ, k > i
  – λᵢ = Σ_{i ≤ k ≤ n} idf(tₖ)² / (θ · len(q))
- λᵢ is a natural cutoff point: λ₁ > λ₂ > … > λₙ
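The cutoffs can be computed in one backward pass over the idf-sorted tokens, since each λᵢ is a suffix sum of squared idfs; a sketch with illustrative values:

```python
def cutoffs(idfs, len_q, theta):
    """lambda_i = sum_{i <= k <= n} idf(t_k)^2 / (theta * len(q)),
    for idfs given in decreasing order."""
    lam, suffix = [], 0.0
    for v in reversed(idfs):      # accumulate suffix sums of idf^2
        suffix += v * v
        lam.append(suffix / (theta * len_q))
    return lam[::-1]              # lambda_1 > lambda_2 > ... > lambda_n

lam = cutoffs([2.0, 1.0], len_q=1.0, theta=0.5)
print(lam)  # [10.0, 2.0] -- decreasing, as claimed on the slide
```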
Shortest-First
- Sort q = {t₁, …, tₙ} in decreasing idf order
- Let C be the candidate set
- For 1 ≤ i ≤ n:
  – Skip to the first entry with len(s) ≥ θ · len(q)
  – Compute λᵢ; let λᵢ = min(λᵢ, len(q) / θ)
  – Repeat
    – s = pop next element from tᵢ
    – Maintain lower/upper bounds of entries in C
  – Until len(s) > max(maxlen(C), λᵢ)
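A simplified sketch of the two pruning rules, not the full SF algorithm: instead of maintaining lower/upper bounds on C and stopping early, it scans each length-sorted list with the skip rules and scores the surviving candidates exactly. List entries are assumed to carry (len(s), id, w(t, s)); tokens are ordered by decreasing idf and `wq[i]` = w(tᵢ, q):

```python
def sf_prune(lists, wq, len_q, theta):
    """lists[i]: entries (len_s, sid, w_ts) for token t_i, sorted by len_s."""
    # lambda_i from suffix sums of idf(t_k)^2, capped by len(q)/theta
    lam, suffix = [0.0] * len(wq), 0.0
    for i in range(len(wq) - 1, -1, -1):
        suffix += (wq[i] * len_q) ** 2      # idf(t_i) = w(t_i, q) * len(q)
        lam[i] = min(suffix / (theta * len_q), len_q / theta)
    scores = {}
    for i, lst in enumerate(lists):
        for len_s, sid, w_ts in lst:
            if len_s < theta * len_q:       # Length Boundedness: too short
                continue
            if len_s > lam[i] and sid not in scores:
                continue                    # beyond the cutoff and not yet a candidate
            scores[sid] = scores.get(sid, 0.0) + w_ts * wq[i]
    return {sid: sc for sid, sc in scores.items() if sc >= theta}
```

Known candidates are still scored past λᵢ so their totals stay exact; only never-seen sets are dropped by the cutoff.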
Comparison with NRA
- Lemma: let q = {t₁, …, tₙ} and d the maximum depth SF descends over all lists. In the worst case iNRA will read (d – 1)(n – 1) elements more than SF
- But surprisingly …
A hybrid strategy
- Run iNRA normally
- Use λᵢ and the maximum length in C to stop reading from a particular list
- This guarantees that iNRA stops with or before SF
- Drawback of NRA variants: very high bookkeeping cost compared to SF
Experiments
- DBLP, IMDB and YellowPages datasets: actors, movies, authors, businesses, etc.
- Vary threshold, query size, query strings and mistakes
- Measure wall-clock time and pruning power
- Algorithms: NRA, TA, iNRA, iTA, SF, Hybrid, Sort-by-id, Improved SQL-based
Wall-clock time vs. Threshold
Wall-clock time vs. Query size (curves: TA, NRA, Sort-by-id, iTA, SF)
Space
Conclusion
- Proposed a simplified TF/IDF measure
- Identified strong monotonicity properties
- Used the properties to design efficient algorithms
- SF works best overall in practice: sub-second answers in most practical cases
Q&A
Pruning power vs. Threshold
Pruning power vs. Query size (curves: NRA, TA, iTA)