Fast Indexes and Algorithms For Set Similarity Selection Queries
M. Hadjieleftheriou, A. Chandel, N. Koudas, D. Srivastava


Strings as sets
$s_1$ = “Main St. Maine”
Word tokens: ‘Main’, ‘St.’, ‘Maine’
3-grams: ‘Mai’, ‘ain’, ‘in ’, ‘n S’, ‘ St’, ‘St.’, ‘t. ’, …
$s_2$ = “Main St. Main”
Word tokens: ‘Main’, ‘St.’, ‘Main’
How similar are $s_1$ and $s_2$?
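
As a minimal sketch of the decomposition above, the following Python splits a string into word tokens and overlapping 3-grams. The function names `word_tokens` and `qgrams` are illustrative assumptions, not names from the presentation.

```python
def word_tokens(s):
    """Split a string on whitespace into word tokens."""
    return s.split()


def qgrams(s, q=3):
    """Return all overlapping substrings of length q, in order of position."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


s1 = "Main St. Maine"
words = word_tokens(s1)   # word-level tokens
grams = qgrams(s1)        # character-level 3-grams, spaces included
print(words)
print(grams[:7])
```

Either token type can be used as the elements of the set; the slides that follow do not depend on which one is chosen.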

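TF/IDF weighted similarity
Inverse Document Frequency (idf): ‘Main’ is common, ‘Maine’ is not.
$\mathrm{idf}(t) = \log_2\left[1 + N / \mathrm{df}(t)\right]$, where $N$ is the number of sets in the collection and $\mathrm{df}(t)$ is the number of sets that contain $t$.
Term Frequency (tf): ‘Main’ appears twice in $s_2$.
Similarity: inner product.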
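
The idf formula can be sketched as follows. The three-string collection is a made-up example, and `idf_table` is a hypothetical helper name; only the formula itself comes from the slide.

```python
import math
from collections import Counter


def idf_table(collection):
    """Map each token to log2(1 + N / df(t)) over a list of token lists."""
    n = len(collection)
    # df(t): number of sets containing t, so count each token once per set
    df = Counter(t for tokens in collection for t in set(tokens))
    return {t: math.log2(1 + n / d) for t, d in df.items()}


# Hypothetical collection of word-tokenized strings
collection = [s.split() for s in ("Main St. Maine", "Main St. Main", "Main Ave")]
idf = idf_table(collection)
tf = Counter(collection[1])   # term frequencies within "Main St. Main"
print(idf["Main"], idf["Maine"], tf["Main"])
```

In this toy collection ‘Main’ occurs in every string and so gets the lowest idf, while ‘Maine’ occurs in only one and gets a higher one.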

Is TF important?
Information retrieval: given a query string, retrieve relevant documents.
Relational databases: given a query string, retrieve relevant strings.
In practice, TF is small in many applications.

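IDF similarity
Query $q = \{t_1, \ldots, t_n\}$
Set $s = \{r_1, \ldots, r_m\}$
Length: $\mathrm{len}(s) = \left(\sum_{t \in s} \mathrm{idf}(t)^2\right)^{1/2}$
Similarity: $I(q, s) = \sum_{t \in s \cap q} \mathrm{idf}(t)^2 / \left(\mathrm{len}(s)\,\mathrm{len}(q)\right)$
IDF is as good as TF/IDF in practice!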
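
A direct, index-free evaluation of the IDF similarity might look like the sketch below. The idf weights are invented for illustration, and `length` and `idf_similarity` are hypothetical names.

```python
import math


def length(tokens, idf):
    """len(s): square root of the sum of squared idf weights of the distinct tokens."""
    return math.sqrt(sum(idf[t] ** 2 for t in set(tokens)))


def idf_similarity(q, s, idf):
    """I(q, s): squared idf mass of the shared tokens, normalized by both lengths."""
    common = set(q) & set(s)
    return sum(idf[t] ** 2 for t in common) / (length(q, idf) * length(s, idf))


# Hypothetical idf weights, not taken from any dataset
idf = {"Main": 1.0, "St.": 1.5, "Maine": 2.0}
q = ["Main", "St.", "Maine"]
s = ["Main", "St.", "Main"]   # the duplicate is ignored: tf plays no role
sim = idf_similarity(q, s, idf)
self_sim = idf_similarity(q, q, idf)
print(sim, self_sim)
```

Because term frequency is dropped, a string compared with itself always scores 1 (up to rounding), and the score depends only on which distinct tokens are shared.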

How can I build an index?
Let $w(t, s) = \mathrm{idf}(t) / \mathrm{len}(s)$.
Then $I(q, s) = \sum_{t \in q \cap s} w(t, s)\, w(t, q)$.
So:
Decompose strings into tokens.
Compute the idf of each token.
Create one inverted list per token.
Sort lists by string id: do a merge join.
Sort lists by $w$: run TA/NRA.
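
The construction steps above can be sketched in a few lines. This is an in-memory illustration under simple assumptions (whitespace tokens, string ids equal to list positions); `build_index` is a hypothetical name.

```python
import math
from collections import defaultdict


def build_index(strings):
    """Return (idf, lengths, lists); lists[t] holds (sid, w(t, s)) in id order."""
    token_sets = [set(s.split()) for s in strings]
    n = len(token_sets)
    df = defaultdict(int)
    for tokens in token_sets:
        for t in tokens:
            df[t] += 1
    idf = {t: math.log2(1 + n / d) for t, d in df.items()}
    lengths = [math.sqrt(sum(idf[t] ** 2 for t in tokens)) for tokens in token_sets]
    lists = defaultdict(list)
    for sid, tokens in enumerate(token_sets):   # ids ascend, so lists stay id-sorted
        for t in tokens:
            lists[t].append((sid, idf[t] / lengths[sid]))
    return idf, lengths, lists


# Hypothetical three-string collection
idf, lengths, lists = build_index(["Main St. Maine", "Main St.", "Maine Ave"])
by_id = lists["Main"]                                        # for a merge join
by_w = sorted(lists["Main"], key=lambda e: e[1], reverse=True)  # for TA/NRA
print(by_id)
print(by_w)
```

Within one list idf(t) is constant, so sorting by decreasing w is the same as sorting by increasing len(s).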

Example: Sort by id
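
A minimal sketch of the merge join over id-sorted lists, assuming each list holds `(sid, w(t, s))` pairs and the query weights `w(t, q)` are given. The weights in the example are arbitrary illustrative numbers, and `merge_by_id` is a hypothetical name.

```python
import heapq
from itertools import groupby


def merge_by_id(query_lists, query_weights, tau):
    """Merge id-sorted lists and return (sid, score) for every score >= tau."""
    # Turn each list into a stream of (sid, w(t, s) * w(t, q)) contributions
    streams = [
        [(sid, w * query_weights[t]) for sid, w in lst]
        for t, lst in query_lists.items()
    ]
    merged = heapq.merge(*streams, key=lambda e: e[0])   # multiway merge on sid
    results = []
    for sid, group in groupby(merged, key=lambda e: e[0]):
        score = sum(part for _, part in group)           # complete score of sid
        if score >= tau:
            results.append((sid, score))
    return results


# Hypothetical lists and weights
query_lists = {"a": [(0, 0.5), (2, 1.0)], "b": [(0, 0.5), (1, 1.0)]}
query_weights = {"a": 0.6, "b": 0.8}
answers = merge_by_id(query_lists, query_weights, tau=0.65)
print(answers)
```

The merge join always reads every entry of every query list, which is the cost the weight-sorted algorithms try to avoid.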

Example: Sort by w
NRA: round-robin list accesses.
A main-memory hash table holds the entries seen so far.
Lower and upper bounds are computed per entry.
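
The round-robin scheme can be sketched as below, adapted to a threshold (selection) query. This is an illustrative simplification, not the authors' implementation: lists hold `(sid, w(t, s))` sorted by decreasing w, the example weights are arbitrary, and `nra` is a hypothetical name.

```python
def nra(query_lists, query_weights, tau):
    """Return the ids whose score is >= tau, reading lists in round-robin order."""
    tokens = list(query_lists)
    pos = {t: 0 for t in tokens}
    # frontier[t]: upper bound on the contribution of any unread entry of list t
    frontier = {
        t: (query_lists[t][0][1] * query_weights[t] if query_lists[t] else 0.0)
        for t in tokens
    }
    lower = {}   # hash table: sid -> sum of contributions seen so far
    seen = {}    # sid -> tokens whose lists already produced sid

    def upper(sid):
        # Lists that have not produced sid may still add at most their frontier
        return lower[sid] + sum(frontier[t] for t in tokens if t not in seen[sid])

    while True:
        for t in tokens:                      # one sorted access per list
            lst = query_lists[t]
            if pos[t] < len(lst):
                sid, w = lst[pos[t]]
                pos[t] += 1
                part = w * query_weights[t]
                lower[sid] = lower.get(sid, 0.0) + part
                seen.setdefault(sid, set()).add(t)
                frontier[t] = part if pos[t] < len(lst) else 0.0
        exhausted = all(pos[t] >= len(query_lists[t]) for t in tokens)
        unseen_can_qualify = sum(frontier.values()) >= tau
        undecided = any(lower[s] < tau <= upper(s) for s in lower)
        if exhausted or (not unseen_can_qualify and not undecided):
            break
    return sorted(sid for sid, low in lower.items() if low >= tau)


# Hypothetical lists, sorted by decreasing w
query_lists = {"a": [(2, 1.0), (0, 0.5)], "b": [(1, 1.0), (0, 0.5)]}
query_weights = {"a": 0.6, "b": 0.8}
ids = nra(query_lists, query_weights, tau=0.65)
print(ids)
```

The loop stops once no unseen set can reach the threshold and every entry in the hash table is decided, that is, its lower bound has reached the threshold or its upper bound has fallen below it.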

Semantic properties of IDF
Order Preservation: for all $t_1 \neq t_2$, if $w(t_1, s) < w(t_1, r)$, then $w(t_2, s) < w(t_2, r)$.
Length Boundedness: for a query $q$, set $s$ and threshold $\tau$, $I(q, s) \geq \tau \Rightarrow \tau\,\mathrm{len}(q) \leq \mathrm{len}(s) \leq \mathrm{len}(q) / \tau$.
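
Length Boundedness turns into a simple range restriction when a list is ordered by length. The sketch below assumes a list of set lengths in ascending order, which by Order Preservation is the same order as decreasing w; `length_window` is a hypothetical name and the numbers are illustrative.

```python
from bisect import bisect_left, bisect_right


def length_window(lengths_sorted, len_q, tau):
    """Return (lo, hi) so that lengths_sorted[lo:hi] lies in [tau*len_q, len_q/tau]."""
    lo = bisect_left(lengths_sorted, tau * len_q)    # first length >= tau * len(q)
    hi = bisect_right(lengths_sorted, len_q / tau)   # one past last length <= len(q) / tau
    return lo, hi


# Hypothetical lengths of the sets in one inverted list
lengths_sorted = [1.0, 2.0, 3.0, 4.0, 8.0, 9.0]
lo, hi = length_window(lengths_sorted, len_q=4.0, tau=0.5)
print(lengths_sorted[lo:hi])
```

Entries outside the window cannot reach the threshold, so they never need to be read.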

Improved NRA (iNRA)
Order Preservation determines whether a given set appears in a list or not.
Example: list $t_i$ encounters $s_1$, then $s_2$. If list $t_k$ encounters $s_2$ first, then $s_1$ cannot appear in $t_k$.
Length Boundedness restricts the search to a small portion of each list.

Something surprising
Lemma: NRA can read arbitrarily more elements than iNRA.
Lemma: NRA can read arbitrarily more elements than any algorithm that uses the Length Boundedness property.

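Any other strategies?
NRA style is breadth-first.
Try depth-first: sort the query lists in decreasing idf order.
Let $q = \{t_1, \ldots, t_n\}$ with $\mathrm{idf}(t_1) > \mathrm{idf}(t_2) > \ldots > \mathrm{idf}(t_n)$.
Let $\lambda_i$ be the maximum length a set $s$ in list $t_i$ can have such that $I(q, s) \geq \tau$, assuming that $s$ also appears in all lists $t_k$, $k > i$:
$\lambda_i = \sum_{i \leq k \leq n} \mathrm{idf}(t_k)^2 / \left(\tau\,\mathrm{len}(q)\right)$
$\lambda_i$ is a natural cutoff point: $\lambda_1 > \lambda_2 > \ldots > \lambda_n$.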
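
The per-list cutoffs are suffix sums of squared idf weights divided by the threshold times len(q). A small sketch with made-up idf values follows; `cutoffs` is a hypothetical name.

```python
import math


def cutoffs(query_idfs, tau):
    """Given query idfs in decreasing order, return the cutoff for each list."""
    len_q = math.sqrt(sum(w ** 2 for w in query_idfs))
    result = [0.0] * len(query_idfs)
    suffix = 0.0
    for i in range(len(query_idfs) - 1, -1, -1):
        suffix += query_idfs[i] ** 2          # sum of idf(t_k)^2 for k >= i
        result[i] = suffix / (tau * len_q)
    return result


# Hypothetical idf weights of a three-token query, already sorted
lam = cutoffs([3.0, 2.0, 1.0], tau=0.5)
print(lam)
```

For the first list the suffix sum is len(q) squared, so the first cutoff equals len(q) divided by the threshold, which is exactly the upper limit given by Length Boundedness.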

Shortest-First (SF)
Sort $q = \{t_1, \ldots, t_n\}$ in decreasing idf order.
Let $C$ be the candidate set.
For $1 \leq i \leq n$:
    Skip to the first entry with $\mathrm{len}(s) \geq \tau\,\mathrm{len}(q)$.
    Compute $\lambda_i$.
    Let $\mu_i = \min(\lambda_i, \mathrm{len}(q) / \tau)$.
    Repeat:
        $s$ = pop next element from $t_i$.
        Maintain lower/upper bounds of entries in $C$.
    Until $\mathrm{len}(s) > \max(\max_{c \in C} \mathrm{len}(c), \mu_i)$.
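
One way to turn the pseudocode into runnable Python is sketched below. It is an interpretation under stated assumptions, not the authors' code: lists hold `(len(s), sid)` pairs sorted by increasing length, a small tolerance `eps` guards floating-point comparisons, candidates are pruned once their upper bound falls below the threshold, and the idf weights and sets are invented.

```python
import math
from bisect import bisect_left


def shortest_first(query, lists, idf, tau, eps=1e-12):
    """Return ids of sets with I(q, s) >= tau, scanning one list at a time."""
    q = sorted(set(query), key=lambda t: idf[t], reverse=True)
    sq = [idf[t] ** 2 for t in q]
    len_q = math.sqrt(sum(sq))
    suffix = [0.0] * (len(q) + 1)             # suffix[i] = sum of sq[k] for k >= i
    for i in range(len(q) - 1, -1, -1):
        suffix[i] = suffix[i + 1] + sq[i]
    cand = {}                                 # sid -> [len(s), lower bound]
    for i, t in enumerate(q):
        lst = lists.get(t, [])
        mu = min(suffix[i] / (tau * len_q), len_q / tau)
        stop = max(max((c[0] for c in cand.values()), default=0.0), mu)
        pos = bisect_left(lst, (tau * len_q - eps,))   # skip entries that are too short
        while pos < len(lst) and lst[pos][0] <= stop:
            len_s, sid = lst[pos]
            pos += 1
            part = sq[i] / (len_s * len_q)    # w(t, s) * w(t, q)
            if sid in cand:
                cand[sid][1] += part
            elif len_s <= mu:                 # longer sets cannot become answers
                cand[sid] = [len_s, part]
        for sid in list(cand):                # upper bound: present in all later lists
            len_s, low = cand[sid]
            if low + suffix[i + 1] / (len_s * len_q) < tau - eps:
                del cand[sid]
    return sorted(sid for sid, (_, low) in cand.items() if low >= tau - eps)


# Hypothetical idf weights and data
idf = {"a": 3.0, "b": 2.0, "c": 1.0}
sets = [{"a", "b", "c"}, {"a", "b"}, {"c"}, {"a"}, {"b", "c"}]
lists = {}
for sid, tokens in enumerate(sets):
    len_s = math.sqrt(sum(idf[t] ** 2 for t in tokens))
    for t in tokens:
        lists.setdefault(t, []).append((len_s, sid))
for lst in lists.values():
    lst.sort()                                # increasing length
result = shortest_first(["a", "b", "c"], lists, idf, tau=0.8)
print(result)
```

Each list is scanned once, front to back, and the only per-candidate state is a length and a running lower bound.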

Comparison with NRA
Lemma: let $q = \{t_1, \ldots, t_n\}$ and let $d$ be the maximum depth to which SF descends over all lists. In the worst case, iNRA will read $(d - 1)(n - 1)$ more elements than SF.
But surprisingly

A hybrid strategy
Run iNRA normally.
Use $\lambda_i$ and the maximum length in $C$ to stop reading from a particular list.
This guarantees that iNRA stops with or before SF.
Drawback of NRA variants: very high bookkeeping cost compared to SF.

Experiments
Datasets: DBLP, IMDB and YellowPages (actors, movies, authors, businesses, etc.).
Vary threshold, query size, query strings and mistakes.
Measure wall-clock time and pruning power.
Algorithms: NRA, TA, iNRA, iTA, SF, Hybrid, Sort-by-id, improved SQL-based.

Wall-clock time vs. Threshold

Wall-clock time vs. Query size (plot; legend: TA, NRA, Sort-by-id, iTA, SF)

Conclusion
Proposed a simplified TF/IDF measure.
Identified strong monotonicity properties.
Used the properties to design efficient algorithms.
SF works best overall in practice.
Achieves sub-second answers in most practical cases.

Pruning power vs. Threshold

Pruning power vs. Query size (plot; legend: NRA, TA, iTA)
