+ Efficient network aware search in collaborative tagging Sihem Amer Yahia, Michael Benedikt, Laks V.S. Lakshmanan, Julia Stoyanovich Presented by: Ashish Chawla CSE 6339 Spring 2009
+ Overview. Opportunity: explore keyword search in a setting where query results are determined by the opinions of the network of taggers related to a seeker, so that social behavior is incorporated into query processing. In network-aware search, results are determined by the opinion of the seeker's network. Existing top-k algorithms are too space-intensive in this setting because scores depend on the seeker's network. The paper investigates clustering seekers based on the behavior of their networks. Experiments use del.icio.us datasets.
+ Introduction. What is network-aware search? Examples: Flickr, YouTube, del.icio.us, photo tagging on Facebook. Users contribute content; annotate items (photos, videos, URLs, …) with tags; form social networks (friends/family, interest-based); and need help discovering relevant content. What is the relevance of an item?
+ What is Network-Aware Search?
+ Claims. Define network-aware search. Adapt top-k algorithms to network-aware search using score upper-bounds and the EXACT strategy. Refine score upper-bounds based on the user's network and tagging behavior.
+ Data Model. Tagged(user u, item i, tag t), for example: (Roger, i1, music), (Roger, i3, music), (Roger, i5, sports), …, (Hugo, i1, music), (Hugo, i22, music), …, (Minnie, i2, sports), …, (Linda, i2, football), (Linda, i28, news), … Link(user u, user v): a directed edge from u to v. Taggers = π_u(Tagged); Seekers = π_u(Link). Network(u) = { v | Link(u, v) }: for a seeker u1 ∈ Seekers, Network(u1) is the set of neighbors of u1.
+ What are Scores? A query is a set of tags Q = {t1, t2, …, tn}, for example: fashion, www, sports, artificial intelligence. Score per tag, for a seeker u, a tag t, and an item i: score(i, u, t) = f(|Network(u) ∩ {v | Tagged(v, i, t)}|). Overall score of the query: score(i, u, Q) = g(score(i, u, t1), score(i, u, t2), …, score(i, u, tn)). f and g are monotone; here f = COUNT and g = SUM.
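The scoring definitions can be sketched in a few lines of Python. This is a minimal illustration, assuming f = COUNT and g = SUM as stated on the slide; the `LINK` relation and the helper names (`network`, `score_tag`, `score_query`) are hypothetical and not taken from the paper.

```python
# Hypothetical sample of the Tagged(user, item, tag) relation,
# loosely following the Data Model slide.
TAGGED = [
    ("Roger", "i1", "music"), ("Roger", "i3", "music"), ("Roger", "i5", "sports"),
    ("Hugo", "i1", "music"), ("Hugo", "i22", "music"),
    ("Minnie", "i2", "sports"),
    ("Linda", "i2", "football"), ("Linda", "i28", "news"),
]

# Hypothetical Link(u, v) relation: seeker u is connected to tagger v.
LINK = [("Jane", "Roger"), ("Jane", "Hugo"), ("Jane", "Minnie"), ("Ann", "Linda")]


def network(u, link=LINK):
    """Network(u) = { v | Link(u, v) }."""
    return {v for (src, v) in link if src == u}


def score_tag(item, u, tag, tagged=TAGGED, link=LINK):
    """score(i, u, t) with f = COUNT: taggers of (item, tag) inside u's network."""
    taggers = {v for (v, i, t) in tagged if i == item and t == tag}
    return len(network(u, link) & taggers)


def score_query(item, u, query, tagged=TAGGED, link=LINK):
    """score(i, u, Q) with g = SUM over the per-tag scores."""
    return sum(score_tag(item, u, t, tagged, link) for t in query)


print(score_query("i1", "Jane", ["music", "sports"]))
```

The same item gets a different score for a different seeker, which is why a single precomputed list per tag cannot hold exact scores.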
+ Problem Statement. Given a user query Q = t1 … tn and a number k, efficiently determine the top-k items, i.e., the k items with the highest overall score.
+ Standard Top-k Processing. Q = {t1, t2, …, tn}. One inverted list per tag, IL1, IL2, …, ILn, each sorted on score. score(i) = g(score(i, IL1), score(i, IL2), …, score(i, ILn)). Intuition: high-scoring items are close to the top of most lists. Fagin-style processing, NRA (no random access): access all lists sequentially in parallel; maintain a heap sorted on partial scores; stop when the partial score of the k-th item > best-case score of unseen/incomplete items.
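The NRA steps above can be sketched as follows. This is a simplified illustration, not an optimized implementation: it recomputes all bounds after every round instead of maintaining a heap, and it stops on >= rather than strict >. The lists follow the running NRA example of the slides; the first entry of each list is inferred from the candidate bounds shown there and should be treated as an assumption.

```python
def nra(lists, k, eps=1e-9):
    """No-random-access top-k with g = SUM.

    lists: inverted lists, each a list of (item, score) sorted by score, descending.
    Returns [(item, worst_score, best_score)] for the top-k items.
    """
    m = len(lists)
    seen = {}          # item -> {list index: score read so far}
    high = [0.0] * m   # last score read from each list (bound for unread entries)
    top, worst, best = [], {}, {}
    max_depth = max((len(lst) for lst in lists), default=0)
    for depth in range(max_depth):
        # One sorted access per list, in parallel.
        for j, lst in enumerate(lists):
            if depth < len(lst):
                item, s = lst[depth]
                seen.setdefault(item, {})[j] = s
                high[j] = s
            else:
                high[j] = 0.0  # list exhausted
        # Worst score: what we have seen. Best score: add the bound of missing lists.
        worst = {i: sum(p.values()) for i, p in seen.items()}
        best = {i: worst[i] + sum(high[j] for j in range(m) if j not in p)
                for i, p in seen.items()}
        top = sorted(worst, key=worst.get, reverse=True)[:k]
        if len(top) < k:
            continue
        min_topk = worst[top[-1]]
        threshold = sum(high)  # best case for a completely unseen item
        rest = max((best[i] for i in seen if i not in top), default=0.0)
        if min_topk + eps >= max(threshold, rest):
            break
    return [(i, worst[i], best[i]) for i in top]


# Example lists; the first row is an assumption inferred from the slide's bounds.
LISTS = [
    [("item 25", 0.6), ("item 78", 0.5), ("item 83", 0.4), ("item 17", 0.3),
     ("item 21", 0.2), ("item 91", 0.1), ("item 44", 0.1)],
    [("item 17", 0.6), ("item 38", 0.6), ("item 14", 0.6), ("item 5", 0.6),
     ("item 83", 0.5), ("item 21", 0.3)],
    [("item 83", 0.9), ("item 17", 0.7), ("item 61", 0.3), ("item 81", 0.2),
     ("item 65", 0.1), ("item 10", 0.1)],
]

result = nra(LISTS, k=2)
print(result)
```

With these lists the top-2 items are item 83 and item 17, and the intermediate thresholds (2.1, 1.8, 1.3, 1.1, 0.8) match the values shown on the example slides.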
+ NRA example (k = 2): state after the first sorted access on each list.
Entries not yet read (item, score):
List 1: item 78 (0.5), item 83 (0.4), item 17 (0.3), item 21 (0.2), item 91 (0.1), item 44 (0.1)
List 2: item 38 (0.6), item 14 (0.6), item 5 (0.6), item 83 (0.5), item 21 (0.3)
List 3: item 17 (0.7), item 61 (0.3), item 81 (0.2), item 65 (0.1), item 10 (0.1)
Candidates [worst score, best score]: item 83 [0.9, 2.1], item 17 [0.6, 2.1], item 25 [0.6, 2.1]
Min top-2 score: 0.6. Threshold (max score of unseen tuples): 2.1.
Candidate pruning test: min top-2 < best score of candidate. Stopping condition: threshold < min top-2?
+ NRA example (continued). Candidates [worst score, best score]: item 17 [1.3, 1.8], item 83 [0.9, 2.0], item 25 [0.6, 1.9], item 38 [0.6, 1.8], item 78 [0.5, 1.8]. Min top-2 score: 0.9. Threshold (max score of unseen tuples): 1.8. Candidate pruning test: min top-2 < best score of candidate. Stopping condition: threshold < min top-2?
+ NRA example (continued). Candidates [worst score, best score]: item 83 [1.3, 1.9], item 17 [1.3, 1.9], item 25 [0.6, 1.5], item 78 [0.5, 1.4]. Min top-2 score: 1.3. Threshold (max score of unseen tuples): 1.3. Candidate pruning test: min top-2 < best score of candidate. Stopping condition: threshold < min top-2? No new items can get into the top-2, but extra candidates are left in the queue.
+ NRA example (continued). Candidates [worst score, best score] include item 83 [1.3, 1.9] and item 25 [0.6, 1.4]. Min top-2 score: 1.3. Threshold (max score of unseen tuples): 1.1. Candidate pruning test: min top-2 < best score of candidate. Stopping condition: threshold < min top-2? No new items can get into the top-2, but extra candidates are left in the queue.
+ NRA example (final state). Min top-2 score: 1.6. Threshold (max score of unseen tuples): 0.8. Candidate pruning test: min top-2 < best score of candidate. Two candidates remain.
+ NRA. NRA performs only sorted accesses (SA), with no random access. A random access (RA) looks up the actual (final) score of an item and is often very useful. Problems with NRA: high bookkeeping overhead; for "high" values of k, the gain in access cost is not significant.
+ TA. [Figure: List 1, List 2, List 3, each sorted by score; (a1, a2, a3).]
+ TA algorithm, round 1. Read one item from every list (sorted access), then look up its remaining scores by random access. Min top-2 score: 1.6. Maximum score for unseen items: 2.1. [Figure: List 1, List 2, List 3 sorted by score, with the current candidates.]
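A threshold algorithm (TA) round can be sketched as below, as a contrast to NRA: every newly seen item is completed immediately by random access, so only full scores are kept. This is a minimal sketch assuming g = SUM; the lists reuse the NRA example, and the first entry of each list is an assumption inferred from the bounds on the slides.

```python
def ta(lists, k, eps=1e-9):
    """Threshold algorithm with g = SUM.

    lists: inverted lists of (item, score), sorted by score, descending.
    Returns (top-k list of (item, full_score), number of rounds executed).
    """
    index = [dict(lst) for lst in lists]  # per-list lookup used for random access
    full = {}                             # item -> exact full score
    top, rounds = [], 0
    max_depth = max((len(lst) for lst in lists), default=0)
    while rounds < max_depth:
        threshold = 0.0
        for lst in lists:
            if rounds < len(lst):
                item, s = lst[rounds]     # sorted access
                threshold += s
                if item not in full:      # random access to every list
                    full[item] = sum(ix.get(item, 0.0) for ix in index)
        rounds += 1
        top = sorted(full.items(), key=lambda kv: kv[1], reverse=True)[:k]
        # Stop once the k-th full score reaches the best score an unseen item could have.
        if len(top) == k and top[-1][1] + eps >= threshold:
            break
    return top, rounds


# Example lists; the first row is an assumption inferred from the slide's bounds.
LISTS = [
    [("item 25", 0.6), ("item 78", 0.5), ("item 83", 0.4), ("item 17", 0.3),
     ("item 21", 0.2), ("item 91", 0.1), ("item 44", 0.1)],
    [("item 17", 0.6), ("item 38", 0.6), ("item 14", 0.6), ("item 5", 0.6),
     ("item 83", 0.5), ("item 21", 0.3)],
    [("item 83", 0.9), ("item 17", 0.7), ("item 61", 0.3), ("item 81", 0.2),
     ("item 65", 0.1), ("item 10", 0.1)],
]

top2, rounds = ta(LISTS, k=2)
print(top2, rounds)
```

After round 1 this sketch has a min top-2 score of 1.6 against a threshold of 2.1, as on the slide; it stops in round 3, when the threshold drops to 1.3.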
+ Computing Exact Scores: Naïve. Typical: maintain a single inverted list per (seeker, tag), with items ordered by score. [Figure: separate item/score lists for seekers Jane and Ann, for tag = photos and tag = music; the item order differs per seeker.] Pro: can use standard top-k algorithms. Con: high space overhead.
+ Computing Score Upper-Bounds. A space-saving strategy: maintain entries of the form (item, itemTaggers), where itemTaggers are all the taggers who tagged the item with the tag. Every item is then stored at most once per tag. Which score should be stored with each entry? The maximum score that the item can have across all possible seekers; this is the Global Upper-Bound strategy. Limitation: exact scores must be computed dynamically at query time, which costs time.
+ Score Upper-Bounds. Global Upper-Bound (GUB): one list per tag, shared by all seekers. [Figure: list for tag = music with columns item, taggers, upper-bound; items in the order i6, i1, i2, i3, i5, i4, i9, i7, i8, with taggers such as Miguel, Kath, Sam, Peter, Jane, Mary.] Each entry stores the maximum score the item can have across all possible seekers. Pro: low space overhead. Con: item upper-bounds, and even the list order, may differ from EXACT for most users. Con: exact scores must be computed dynamically at query time. How do we do top-k processing with score upper-bounds?
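Building a Global Upper-Bound list can be sketched as follows. This is a minimal illustration under the COUNT scoring function; the function name `build_gub_lists`, the sample relations, and the tie-breaking by item name are assumptions made for the example.

```python
from collections import defaultdict


def build_gub_lists(tagged, link):
    """Return {tag: [(item, itemTaggers, upper_bound), ...]} sorted by upper bound.

    upper_bound = max over all seekers u of |Network(u) ∩ itemTaggers|.
    """
    networks = defaultdict(set)
    for u, v in link:
        networks[u].add(v)

    item_taggers = defaultdict(set)  # (tag, item) -> taggers of that item with that tag
    for v, i, t in tagged:
        item_taggers[(t, i)].add(v)

    lists = defaultdict(list)
    for (t, i), vs in item_taggers.items():
        ub = max((len(net & vs) for net in networks.values()), default=0)
        lists[t].append((i, vs, ub))
    for t in lists:
        # Highest upper bound first; ties broken by item name (an arbitrary choice).
        lists[t].sort(key=lambda e: (-e[2], e[0]))
    return dict(lists)


# Hypothetical sample data.
TAGGED = [("Roger", "i1", "music"), ("Hugo", "i1", "music"),
          ("Roger", "i3", "music"), ("Hugo", "i22", "music")]
LINK = [("Jane", "Roger"), ("Jane", "Hugo"), ("Ann", "Roger")]

gub = build_gub_lists(TAGGED, LINK)
print(gub["music"])
```

Each item appears once per tag regardless of how many seekers exist, which is where the space saving comes from; the price is that the stored score is only an upper bound for any particular seeker.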
+ Top-k with Score Upper-Bounds. gNRA ("generalized no random access"): access all lists sequentially in parallel; maintain a heap with partial exact scores; stop when the partial exact score of the k-th item > highest possible score of unseen/incomplete items (computed using the current list upper-bounds).
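The gNRA idea can be sketched by changing two things in NRA: the partial score of an entry is computed exactly from its tagger set and the seeker's network, while the bound for unread entries comes from the stored upper bounds. This is a simplified sketch (no heap, stop on >=); the entry layout `(item, taggers, upper_bound)` and the sample lists are assumptions made for illustration.

```python
def gnra(lists, seeker_network, k):
    """Generalized NRA over upper-bound lists, with f = COUNT and g = SUM.

    lists: one list per query tag, entries (item, taggers, upper_bound),
           sorted by upper_bound, descending.
    Returns [(item, exact_score)] for the top-k items of this seeker.
    """
    m = len(lists)
    partial = {}     # item -> {list index: exact per-tag score for this seeker}
    high = [0] * m   # upper bound of the last entry read from each list
    top, worst = [], {}
    max_depth = max((len(lst) for lst in lists), default=0)
    for depth in range(max_depth):
        for j, lst in enumerate(lists):
            if depth < len(lst):
                item, taggers, ub = lst[depth]
                # Exact partial score, computed at query time.
                partial.setdefault(item, {})[j] = len(seeker_network & taggers)
                high[j] = ub
            else:
                high[j] = 0
        worst = {i: sum(p.values()) for i, p in partial.items()}
        best = {i: worst[i] + sum(high[j] for j in range(m) if j not in p)
                for i, p in partial.items()}
        top = sorted(worst, key=worst.get, reverse=True)[:k]
        if len(top) < k:
            continue
        min_topk = worst[top[-1]]
        rest = max((best[i] for i in partial if i not in top), default=0)
        if min_topk >= max(sum(high), rest):
            break
    return [(i, worst[i]) for i in top]


# Hypothetical upper-bound lists for the tags "music" and "sports".
MUSIC = [("i1", {"a", "b", "x"}, 3), ("i2", {"x", "y"}, 2), ("i3", {"c"}, 1)]
SPORTS = [("i2", {"a", "y", "z"}, 3), ("i1", {"c"}, 1), ("i4", {"z"}, 1)]

answer = gnra([MUSIC, SPORTS], seeker_network={"a", "b", "c"}, k=1)
print(answer)
```

The looser the upper bounds are relative to the seeker's exact scores, the longer the stopping test takes to succeed, which motivates the clustering strategies that follow.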
+ gNRA – NRA Generalization
+ gTA – TA Generalization
+ Performance of Global Upper-Bound (GUB) and Exact. Space overhead: total number of entries in all inverted lists. Query processing time: number of cursor moves.
                       GUB     Exact
space (IL entries)     74K     63M
time (cursor moves)    … K     …
GUB is the space baseline; Exact is the time baseline.
+ Clustering and Query Processing. Goal: reduce the distance between the score upper-bound and the exact score; the greater the distance, the more processing may be required. Core idea: cluster users into groups and compute an upper-bound per group. Intuition: group users whose behavior is similar.
+ Clustering Seekers. Cluster the seekers based on similarity in their scores (the score of an item depends on the seeker's network). Form an inverted list IL(t, C) for every tag t and cluster C, where the score of an item is its maximum score over all seekers in the cluster. Query processing for Q = t1 … tn and seeker u: first find the cluster C(u), then perform aggregation over the collection of lists IL(t1, C(u)), …, IL(tn, C(u)). Global Upper-Bound (GUB) is the special case where all seekers fall into the same cluster.
+ Clustering Seekers. Assign each seeker to a cluster; compute an inverted list per cluster. ub(i, t, C) = max over u ∈ C of |Network(u) ∩ {v | Tagged(v, i, t)}|. Pro: tighter bounds; item order is usually closer to the EXACT order than in Global Upper-Bound. Con: space overhead still high (trade-off).
Example of clusters (columns: item, taggers, upper-bound):
Global Upper-Bound: chanel (Miguel, …), puma (Kath, …), gucci (Sam, …), adidas (Miguel, …), diesel (Peter, …), versace (Jane, …), nike (Mary, …), prada (Chris, …)
C1, seekers Bob & Alice: gucci (Bob, …), versace (Peter, …), chanel (Mary, …), prada (Chris, …), puma (Alice, …)
C2, seekers Sam & Miguel: puma (Miguel, …), adidas (Sam, …), diesel (Miguel, …), nike (Jane, …), gucci (Kath, …)
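The per-cluster upper bound can be sketched as follows. This is a minimal illustration of ub(i, t, C) under COUNT scoring; the sample relations, cluster assignment, and function name are hypothetical, and dropping items whose bound is 0 for a cluster is an assumption of this sketch.

```python
from collections import defaultdict


def build_cluster_lists(tagged, link, clusters, tag):
    """Return {cluster id: [(item, upper_bound), ...]} for one tag.

    upper_bound = max over seekers u in the cluster of
                  |Network(u) ∩ {v | Tagged(v, item, tag)}|.
    """
    networks = defaultdict(set)
    for u, v in link:
        networks[u].add(v)

    taggers = defaultdict(set)  # item -> taggers who used `tag` on it
    for v, i, t in tagged:
        if t == tag:
            taggers[i].add(v)

    result = {}
    for cid, seekers in clusters.items():
        entries = []
        for item, vs in taggers.items():
            ub = max((len(networks.get(u, set()) & vs) for u in seekers), default=0)
            if ub > 0:  # assumption: items no seeker in the cluster can score are omitted
                entries.append((item, ub))
        entries.sort(key=lambda e: (-e[1], e[0]))
        result[cid] = entries
    return result


# Hypothetical sample data.
TAGGED = [("Peter", "gucci", "fashion"), ("Mary", "gucci", "fashion"),
          ("Chris", "prada", "fashion"), ("Jane", "puma", "fashion"),
          ("Kath", "puma", "fashion"), ("Jane", "nike", "fashion")]
LINK = [("Bob", "Peter"), ("Bob", "Mary"), ("Alice", "Chris"), ("Alice", "Peter"),
        ("Sam", "Jane"), ("Sam", "Kath"), ("Miguel", "Jane")]
CLUSTERS = {"C1": {"Bob", "Alice"}, "C2": {"Sam", "Miguel"}}

cluster_lists = build_cluster_lists(TAGGED, LINK, CLUSTERS, "fashion")
print(cluster_lists)
```

Putting all seekers into one cluster reproduces the Global Upper-Bound list, so the number of clusters is the knob that trades space for tighter bounds.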
+ How do we cluster seekers? Finding a clustering that minimizes the worst-case or the average computation time of top-k algorithms is NP-hard. The proofs are by reduction from the independent task scheduling problem and the minimum sum of squares problem. The authors present some heuristics: they use a form of Normalized Discounted Cumulative Gain (NDCG), which measures the quality of a clustered list for a given seeker and keyword. The metric compares the ideal (exact score) order in the inverted lists with the actual (score upper-bound) order.
+ NDCG Example. [Table with columns i, docID, log2(i), ranking, ranking / log2(i), ideal ranking, ideal ranking / log2(i); six rows, the first with log2(i) = N/A and ranking value 3.] Discounted cumulative gain (DCG): 8.10. Ideal DCG: 8.69. Normalized DCG (nDCG) = 8.10 / 8.69 ≈ 0.93.
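The nDCG arithmetic can be sketched as follows, using the variant in which the first position is not discounted and position i > 1 is divided by log2(i). The relevance values `[3, 2, 3, 0, 1, 2]` are an assumption: the per-row values are not legible in the slide text, and this vector was chosen because it reproduces the DCG and ideal DCG totals shown.

```python
import math


def dcg(rels):
    """DCG = rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    return sum(r if i == 1 else r / math.log2(i)
               for i, r in enumerate(rels, start=1))


def ndcg(rels):
    """DCG of the given order divided by DCG of the ideal (descending) order."""
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0


# Assumed relevance values in ranked order (not recoverable from the slide).
RELS = [3, 2, 3, 0, 1, 2]

print(round(dcg(RELS), 2), round(dcg(sorted(RELS, reverse=True)), 2), round(ndcg(RELS), 2))
```

In the clustering setting, the "ideal" order is the exact-score order for a seeker and the "actual" order is the upper-bound order of the clustered list, so a value close to 1 means the clustered list ranks items almost as the exact list would.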
+ Clustering Taggers. For each tag t, partition the taggers into separate clusters. Form an inverted list per cluster; an item i in the list for cluster C gets the score max over u ∈ Seekers of |Network(u) ∩ C ∩ {v | Tagged(v, i, t)}|. How to cluster taggers? Build a graph with taggers as nodes, where an edge exists between nodes v1 and v2 iff |Items(v1, t) ∩ Items(v2, t)| ≥ threshold.
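The tagger graph can be sketched as below. The edge rule follows the slide; grouping taggers by connected components is only a stand-in chosen for this illustration, since the slide does not say which graph clustering method is applied. The sample data and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def tagger_graph(tagged, tag, threshold):
    """Nodes are taggers of `tag`; edge (a, b) iff |Items(a, t) ∩ Items(b, t)| >= threshold."""
    items = defaultdict(set)
    for v, i, t in tagged:
        if t == tag:
            items[v].add(i)
    edges = [(a, b) for a, b in combinations(sorted(items), 2)
             if len(items[a] & items[b]) >= threshold]
    return set(items), edges


def connected_components(nodes, edges):
    """Stand-in clustering: each connected component becomes one tagger cluster."""
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, comps = set(), []
    for n in sorted(nodes):
        if n in seen:
            continue
        stack, comp = [n], set()
        while stack:
            x = stack.pop()
            if x in comp:
                continue
            comp.add(x)
            stack.extend(adj[x] - comp)
        seen |= comp
        comps.append(comp)
    return comps


# Hypothetical sample data.
TAGGED = [("Roger", "i1", "music"), ("Roger", "i3", "music"),
          ("Hugo", "i1", "music"), ("Hugo", "i3", "music"), ("Hugo", "i22", "music"),
          ("Minnie", "i9", "music"), ("Minnie", "i2", "sports")]

nodes, edges = tagger_graph(TAGGED, "music", threshold=2)
clusters = connected_components(nodes, edges)
print(edges, clusters)
```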
+ Clustering Seekers: Metrics. Space: Global Upper-Bound has the lowest overhead; Ratio Association (ASC) and Normalized Cut (NCT) achieve an order of magnitude improvement in space overhead over Exact. Time: with both gNRA and gTA, the clustered lists outperform Global Upper-Bound. ASC outperforms NCT on both sequential and total accesses in all cases for gTA, and in all cases except one for gNRA. Reasons: inverted lists are shorter, and the score upper-bound order is similar to the exact score order for many users. Average improvement over Global Upper-Bound: Normalized Cut 38-72%, Ratio Association 67-87%.
+ Clustering Seekers. Cluster-Seekers improves query execution time over GUB by at least an order of magnitude, for all queries and all users.
+ Clustering Taggers. Space: overhead is significantly lower than that of Exact and of Cluster-Seekers. Time: in the best case, all taggers relevant to a seeker reside in a single cluster; in the worst case, all taggers reside in separate clusters. Idea: cluster taggers based on overlap in tagging; assign each tagger to a cluster; compute cluster upper-bounds: ub(i, t, C) = max over u ∈ Seekers of |Network(u) ∩ C ∩ {v | Tagged(v, i, t)}|.
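The tagger-cluster upper bound differs from the seeker-cluster one in that the maximum ranges over all seekers while the tagger set is restricted to the cluster. A minimal sketch, with hypothetical sample data and a hypothetical function name:

```python
from collections import defaultdict


def tagger_cluster_ub(item, tag, cluster, tagged, link):
    """ub(i, t, C) = max over seekers u of |Network(u) ∩ C ∩ {v | Tagged(v, i, t)}|."""
    networks = defaultdict(set)
    for u, v in link:
        networks[u].add(v)
    taggers = {v for (v, i, t) in tagged if i == item and t == tag}
    in_cluster = taggers & set(cluster)  # only taggers that belong to cluster C
    return max((len(net & in_cluster) for net in networks.values()), default=0)


# Hypothetical sample data.
TAGGED = [("Roger", "i1", "music"), ("Hugo", "i1", "music"),
          ("Minnie", "i1", "music"), ("Roger", "i3", "music")]
LINK = [("Jane", "Roger"), ("Jane", "Minnie"), ("Ann", "Roger"), ("Ann", "Hugo")]
C1, C2 = {"Roger", "Hugo"}, {"Minnie"}

print(tagger_cluster_ub("i1", "music", C1, TAGGED, LINK),
      tagger_cluster_ub("i1", "music", C2, TAGGED, LINK))
```

A seeker whose network is spread over many tagger clusters has to combine lists from all of them, which matches the best-case and worst-case remarks on the slide.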
+ Clustering Taggers
+ Conclusion and Next Steps. Cluster-Taggers worked best for seekers whose network fell into at most 3 * #tags clusters; for these seekers, Cluster-Taggers outperformed Cluster-Seekers in all cases. For other seekers, query execution time degraded because of the number of inverted lists that had to be processed. Cluster-Taggers outperforms Global Upper-Bound by %, in all cases. The work extended traditional top-k algorithms and achieved a balance between time and space consumption.
+ Questions? WebCT/ Thank You!