Published by Junior Stanley. Modified over 9 years ago.
1
Measure-driven Keyword-query Expansion
Nikos Sarkas, Univ. of Toronto
Nilesh Bansal, Univ. of Toronto
Gautam Das, Univ. of Texas at Arlington
Nick Koudas, Univ. of Toronto
2
Web evolution
The web is constantly evolving and expanding
New content with unique characteristics
Most recently: active user participation in content generation
Blogs, tweets, online social networks, wikis, …
3
User (re)views and opinions
“Explicit” user reviews: star ratings
“Implicit” user reviews: sentiment analysis
4
Limitations of web search (based on a true story)
Query: delta airlines
700 ordered results
5
Query expansion and result refinement
Reviews
Services discussed in reviews:
delta airlines customer service
delta airlines airmiles
delta airlines safety
6
Query expansion and result refinement
Groups: services discussed in reviews with high average ratings; services discussed in reviews with low average ratings
Expansions shown: delta airlines food, delta airlines legroom, delta airlines customer service, delta airlines delays atlanta, delta airlines connections jfk, delta airlines fees luggage
7
Faceted search?
Faceted search: refine the query result using predefined, static hierarchies (facets)
Search-engine query expansion: use query logs to suggest frequent expansions
Our approach:
Dynamically compute “facets” for each query
Based on data characteristics (text + ratings)
8
Outline
Problem definition
Basic framework
Improved framework
Experimental results
9
Formal problem statement
Query: $q_1,\dots,q_l$ (e.g. delta airlines)
Top-k query expansions, r words each (e.g. delta airlines connections jfk):
$q_1,\dots,q_l,x_{l+1},\dots,x_r$
$q_1,\dots,q_l,y_{l+1},\dots,y_r$
…
$q_1,\dots,q_l,z_{l+1},\dots,z_r$
Efficiently compute the k expansions that max or min $F(q_1,\dots,q_l,x_{l+1},\dots,x_r)$
10
Scoring functions
$\mathrm{Surprise}(q_1,\dots,q_l,x_{l+1},\dots,x_r) = \dfrac{\#\text{ of docs containing all words}}{\text{expected } \#\text{ of docs containing all words, assuming independence}}$
$\mathrm{Avg}(q_1,\dots,q_l,x_{l+1},\dots,x_r) = \text{average rating of documents containing all words}$
$F(q_1,\dots,q_l,x_{l+1},\dots,x_r) = F(\#\text{ of documents of rating } b \text{ containing all words})$
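To make the two named scoring functions concrete, here is a minimal sketch on a toy corpus. The document representation (a dict with a `words` set and a `rating`), the helper names, and the data are all assumptions made for illustration; they do not come from the slides.

```python
def matching(docs, words):
    """Documents that contain every word of the expansion."""
    return [d for d in docs if set(words) <= d["words"]]

def surprise(docs, words):
    """Observed co-occurrence count over the count expected under independence."""
    n = len(docs)
    observed = len(matching(docs, words))
    expected = n
    for w in words:
        # Multiply by the fraction of documents containing each single word.
        expected *= sum(1 for d in docs if w in d["words"]) / n
    return observed / expected if expected else 0.0

def avg_rating(docs, words):
    """Average rating of the documents containing all words (None if no match)."""
    hits = matching(docs, words)
    return sum(d["rating"] for d in hits) / len(hits) if hits else None

# Toy corpus (hypothetical data).
docs = [
    {"words": {"delta", "airlines", "delays"}, "rating": 0},
    {"words": {"delta", "airlines", "food"}, "rating": 2},
    {"words": {"delta", "airlines", "delays", "atlanta"}, "rating": 1},
    {"words": {"airlines", "food"}, "rating": 2},
]
expansion = ("delta", "airlines", "delays")
s = surprise(docs, expansion)
a = avg_rating(docs, expansion)
```

Both functions only need counts of documents that contain all words of the expansion (per rating value for the average), which is why word co-occurrence counts are the central quantity in the rest of the deck.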
11
Outline
Problem definition
Basic framework
Improved framework
Experimental results
12
Computing top-k expansions
Query $Q = q_1,\dots,q_l$
Compute top-k expansions $q_1,\dots,q_l,x_{l+1},\dots,x_r$
Enumerate all candidate expansions, compute score
$F(q_1,\dots,q_l,x_{l+1},\dots,x_r) = F(\#\text{ of documents containing all words})$
Challenge: compute $c(q_1,\dots,x_r)$ (word co-occurrence) for all candidates
13
Computing word co-occurrences
Pre-compute and store all possible word co-occurrences
Assume 4-word co-occurrences
A document with 50 distinct words has 230K 4-word sets
Information from 1M documents: 230B 4-word sets
Infeasible
Compute co-occurrences on the fly
Inefficient
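The figures on this slide follow from a binomial coefficient. A short check, assuming every one of the 1M documents has 50 distinct words, so the total is an upper bound that counts a word set once per document in which it occurs:

```python
import math

# Number of 4-word sets in one document with 50 distinct words: C(50, 4).
sets_per_doc = math.comb(50, 4)

# Upper bound over 1M such documents (a set is counted once per document).
total_sets = sets_per_doc * 1_000_000

print(sets_per_doc)  # 230300, i.e. about 230K
print(total_sets)    # 230300000000, i.e. about 230B
```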
14
Estimating word co-occurrences
Low-order co-occurrences
Word occurrences: delta 20000, airlines 45000, delays 30000
Two-word co-occurrences: delta airlines 10000, delta delays 3000, airlines delays 5000
High-order co-occurrence
delta airlines delays 2000
15
Query-expansion framework
Query $q_1,\dots,q_l$
For each candidate expansion $q_1,\dots,q_l,x_{l+1},\dots,x_r$
    Use $c(w_i)$, $c(w_i,w_j)$ to estimate $c(q_1,\dots,q_l,x_{l+1},\dots,x_r)$
    Compute expansion score
    Update top-k heap
End For
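The loop above can be sketched with a bounded min-heap. This is an illustrative skeleton only: the function name, the `score` callable (which stands in for "estimate the co-occurrence, then compute the expansion score"), and the toy scores are assumptions, not the authors' implementation.

```python
import heapq
from itertools import combinations

def top_k_expansions(query, vocabulary, r, k, score):
    """Enumerate all r-word expansions of `query` and keep the k best by `score`."""
    extra = r - len(query)
    candidates = sorted(set(vocabulary) - set(query))
    heap = []  # min-heap of (score, expansion); the root is the current k-th best
    for added in combinations(candidates, extra):
        expansion = tuple(query) + added
        s = score(expansion)
        if len(heap) < k:
            heapq.heappush(heap, (s, expansion))
        elif s > heap[0][0]:
            # Better than the current k-th best: replace the root.
            heapq.heapreplace(heap, (s, expansion))
    return sorted(heap, reverse=True)

# Hypothetical per-word scores standing in for a real scoring function.
toy_scores = {"food": 3.0, "delays": 1.0, "legroom": 2.0}
result = top_k_expansions(
    query=("delta", "airlines"),
    vocabulary=["delta", "airlines", "food", "delays", "legroom"],
    r=3,
    k=2,
    score=lambda expansion: toy_scores[expansion[-1]],
)
```

The root of the heap is the top-k score threshold: a candidate that cannot beat it has no effect on the result, which is the property the improved framework later exploits for pruning.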
16
Outline
Problem definition
Basic framework
    Maximum entropy estimation
Improved framework
Experimental results
17
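Maximum entropy estimation
Estimate $c(w_1,w_2,w_3)$ through $p(w_1,w_2,w_3)$: $c(w_1,w_2,w_3) = p(w_1,w_2,w_3)\,c(\bullet)$
Unknown probabilities: $p_1, p_2, \dots, p_8$
Constraints from the known low-order counts, e.g.:
$p_7 + p_8 = p(w_1,w_2) = c(w_1,w_2)/c(\bullet)$
$p_5 + p_6 + p_7 + p_8 = p(w_1) = c(w_1)/c(\bullet)$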
18
Maximum entropy estimation
$p = [p_1\ p_2\ p_3\ p_4\ p_5\ p_6\ p_7\ p_8]^T$
$\max H(p) = -\sum_i p_i \log p_i$ subject to $Ap = c$, $p \ge 0$
(Unique) maximum entropy distribution
Computed using the Iterative Proportional Fitting (IPF) algorithm
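As an illustration of how IPF reaches the maximum entropy distribution for three words, here is a minimal NumPy sketch. It uses the single-word and two-word counts from the delta/airlines/delays example earlier in the deck; the corpus size `N` is a hypothetical value (the slides do not state $c(\bullet)$), and the function names are made up for this sketch.

```python
from itertools import combinations
import numpy as np

N = 100_000                                     # hypothetical corpus size c(.)
SINGLE = [20_000, 45_000, 30_000]               # delta, airlines, delays
PAIR = {(0, 1): 10_000, (0, 2): 3_000, (1, 2): 5_000}

def pair_target(ci, cj, cij, n):
    """2x2 marginal distribution of words i and j; index 1 means 'word present'."""
    t = np.empty((2, 2))
    t[1, 1] = cij
    t[1, 0] = ci - cij
    t[0, 1] = cj - cij
    t[0, 0] = n - ci - cj + cij
    return t / n

def ipf(single, pair, n, sweeps=500):
    """Fit a 2x2x2 joint distribution to all two-word marginals, from uniform."""
    p = np.full((2, 2, 2), 1 / 8)
    targets = {
        (i, j): pair_target(single[i], single[j], pair[(i, j)], n)
        for i, j in combinations(range(3), 2)
    }
    for _ in range(sweeps):
        for (i, j), target in targets.items():
            k = 3 - i - j                       # the axis not in this pair
            current = p.sum(axis=k, keepdims=True)
            # Rescale so that the (i, j) marginal matches its target exactly.
            p = p * np.expand_dims(target, axis=k) / current
    return p

p = ipf(SINGLE, PAIR, N)
estimate = p[1, 1, 1] * N                       # estimated three-word count
```

Matching the two-word marginals also matches the single-word ones, since each pair table is built from the corresponding single counts. Starting from the uniform table is what makes the fixed point the maximum entropy solution rather than just any distribution satisfying the constraints.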
19
Query-expansion framework using the IPF algorithm
[Figure: a candidate expansion gives the constraints $Ap = c$; IPF (iterations 10, 20, 30) produces the ME distribution $p_1, p_2, \dots, p_n$, from which the score is computed and compared with the top-k score threshold of the already considered expansions]
20
Outline
Problem definition
Basic framework
Improved framework
Experimental results
21
Entropy maximization
$\max H(p)$ subject to $Ap = c$, $p \ge 0$, where $p = [p_1\ p_2\ p_3\ \dots\ p_{n-1}\ p_n]^T$
Can we save work?
We only require a single probability ($p_n$)
We need to compute top-k expansions: a bound around $p_n$ could help us prune the expansion being considered
Not possible using IPF
Introduce ElliMax
Determine $p_n$ by progressively bounding it
22
Improved query-expansion framework using the ElliMax algorithm
[Figure: a candidate expansion gives the constraints $Ap = c$; ElliMax (iterations 5, 10, 15, 20) progressively bounds $p_n$ and hence the score, which is compared with the top-k score threshold of the already considered expansions]
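The pruning idea can be sketched independently of how the bounds are produced. In the sketch below, `bounds` is a hypothetical sequence of progressively tighter intervals around $p_n$ (in the deck these would come from ElliMax, which is not implemented here), and the score is assumed to be non-decreasing in $p_n$, as is the case for a score proportional to the co-occurrence count.

```python
def score_with_pruning(bounds, threshold, score_of, tol=1e-3):
    """Return (score, iterations), or (None, iterations) if the candidate is pruned.

    bounds    -- iterable of (lo, hi) intervals around p_n, each tighter than the last
    threshold -- current top-k score threshold
    score_of  -- maps a value of p_n to a score; assumed non-decreasing
    """
    iteration, lo, hi = 0, None, None
    for iteration, (lo, hi) in enumerate(bounds, start=1):
        if score_of(hi) < threshold:
            return None, iteration           # even the best case misses the top-k
        if hi - lo <= tol:
            break                            # interval tight enough: stop refining
    if lo is None:
        return None, 0                       # no bounds were supplied
    return score_of((lo + hi) / 2), iteration

# Hypothetical bound sequence for one candidate expansion.
intervals = [(0.0, 0.5), (0.1, 0.3), (0.15, 0.2)]
pruned = score_with_pruning(intervals, threshold=0.25, score_of=lambda p: p)
kept = score_with_pruning(intervals, threshold=0.1, score_of=lambda p: p, tol=0.06)
```

With the higher threshold the candidate is discarded as soon as its upper bound falls below the threshold, without waiting for the estimate to converge; this early exit is the work that a point estimator such as IPF cannot save.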
23
Outline
Problem definition
Basic framework
Improved framework
    ElliMax algorithm
Experimental results
24
ElliMax algorithm: Ellipsoid method principles
$\max F(x)$ subject to $Qx \ge r$
[Figure: ellipsoid and optimum $x^*$ at iterations 0, 5 and 10]
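For readers unfamiliar with the ellipsoid method, the following is a generic central-cut sketch for $\max F(x)$ subject to $Qx \ge r$ with a concave, differentiable $F$. It is the textbook method, not ElliMax itself; the function name, the test problem, and the starting ball are assumptions chosen for illustration.

```python
import numpy as np

def ellipsoid_maximize(f, grad_f, Q, r, center, radius, iters=200):
    """Central-cut ellipsoid method; returns the best feasible point and its value."""
    n = len(center)
    x = np.asarray(center, dtype=float)
    P = (radius ** 2) * np.eye(n)        # ellipsoid {z : (z-x)^T P^-1 (z-x) <= 1}
    best_x, best_val = None, -np.inf
    for _ in range(iters):
        violated = np.flatnonzero(Q @ x < r)
        if violated.size:
            g = -Q[violated[0]]          # feasibility cut: keep the side where Q_i z >= Q_i x
        else:
            val = f(x)
            if val > best_val:
                best_x, best_val = x.copy(), val
            g = -grad_f(x)               # objective cut: keep points that may improve F
        gPg = g @ P @ g
        if gPg <= 0:
            break                        # zero gradient or numerically degenerate ellipsoid
        g_tilde = g / np.sqrt(gPg)
        Pg = P @ g_tilde
        # Smallest ellipsoid containing the half of the old one kept by the cut.
        x = x - Pg / (n + 1)
        P = (n * n / (n * n - 1.0)) * (P - (2.0 / (n + 1)) * np.outer(Pg, Pg))
    return best_x, best_val

# Hypothetical problem: maximize -(x0-1)^2 - (x1+2)^2 subject to x0 + x1 >= 0.
f = lambda x: -((x[0] - 1) ** 2 + (x[1] + 2) ** 2)
grad_f = lambda x: np.array([-2 * (x[0] - 1), -2 * (x[1] + 2)])
Q = np.array([[1.0, 1.0]])
r = np.array([0.0])
x_best, f_best = ellipsoid_maximize(f, grad_f, Q, r, center=[0.0, 0.0], radius=10.0)
```

Every iteration keeps an ellipsoid that is guaranteed to contain the optimum, so at any point its extent along one coordinate yields a valid interval for that coordinate. This is the property that makes an ellipsoid-style method attractive when only a bound on a single quantity is needed.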
25
ElliMax algorithm
1) Transform problem: $\max H(p)$, $Ap = c$, $p \ge 0$ (p-space) becomes $\max H'(\lambda)$, $U\lambda \ge -q$ (λ-space)
2) Starting ellipsoid
3) Back to the p-space
[Figure: p-space with $p_1$, $p_2$ and $p_n$; λ-space with $\lambda_1$, $\lambda_2$ and the optimum $\lambda^*$]
26
Outline
Problem definition
Basic solution
Improved solution
Experimental results
27
Experimental Results (Performance)
Time spent in entropy maximization
Basic framework (Algorithm Direct): IPF algorithm
Improved framework (Algorithm Bound): ElliMax algorithm
Synthetic and real data
28
Direct vs Bound (Surprise)
Top-10 expansions, 100k synthetic candidates
[Figure: two panels, "Expansion size 3" and "Expansion size 4"]
29
Direct vs Bound (Avg. Rating)
Top-10 expansions, 100k synthetic candidates, ratings 0, 1 and 2
[Figure: two panels, "Expansion size 3" and "Expansion size 4"]
30
Experimental Results (Quality)
31
Thank you!