Nikos Sarkas, Univ. of Toronto Nilesh Bansal, Univ. of Toronto Gautam Das, Univ. of Texas at Arlington Nick Koudas, Univ. of Toronto Measure-driven Keyword-query.


1 Measure-driven Keyword-query Expansion. Nikos Sarkas, Univ. of Toronto; Nilesh Bansal, Univ. of Toronto; Gautam Das, Univ. of Texas at Arlington; Nick Koudas, Univ. of Toronto.

2 Web evolution. The web is constantly evolving and expanding, with new content that has unique characteristics. Most recently: active user participation in content generation (blogs, tweets, online social networks, wikis, …).

3 User (re)views and opinions. “Explicit” user reviews: star ratings. “Implicit” user reviews: sentiment analysis.

4 Limitations of web search (based on a true story). The query "delta airlines" returns 700 ordered results.

5 Query expansion and result refinement. [Figure: reviews matching the query "delta airlines" are refined through expansions naming services discussed in the reviews: delta airlines customer service, delta airlines airmiles, delta airlines safety.]

6 Query expansion and result refinement. [Figure: expansions grouped into services discussed in reviews with high average ratings and services discussed in reviews with low average ratings. Expansions shown: delta airlines food, delta airlines legroom, delta airlines customer service, delta airlines delays atlanta, delta airlines connections jfk, delta airlines fees luggage.]

7 Faceted search? Faceted search refines the query result using predefined, static hierarchies (facets). Search-engine query expansion uses query logs to suggest frequent expansions. Our approach: dynamically compute “facets” for each query, based on data characteristics (text + ratings).

8 Outline: Problem definition; Basic framework; Improved framework; Experimental results.

9 Formal problem statement. Given a query $q_1,\dots,q_l$ (e.g. delta airlines), efficiently compute the top-$k$ query expansions of $r$ words each, $q_1,\dots,q_l,x_{l+1},\dots,x_r$ (e.g. delta airlines connections jfk), that maximize or minimize $F(q_1,\dots,q_l,x_{l+1},\dots,x_r)$.

10 Scoring functions.
$\mathrm{Surprise}(q_1,\dots,q_l,x_{l+1},\dots,x_r) = \dfrac{\#\text{ of docs containing all words}}{\text{expected }\#\text{ of docs containing all words, assuming independence}}$
$\mathrm{Avg}(q_1,\dots,q_l,x_{l+1},\dots,x_r) = \text{average rating of documents containing all words}$
$F(q_1,\dots,q_l,x_{l+1},\dots,x_r) = \#\text{ of documents of rating } b \text{ containing all words}$
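The Surprise and Avg definitions above can be illustrated on a tiny in-memory corpus. This is a minimal sketch, not the authors' implementation: the corpus, the function names (`count_containing`, `surprise`, `avg_rating`) and the ratings are invented for illustration, and documents are assumed to be (word set, rating) pairs.

```python
from math import prod

def count_containing(docs, words):
    """Number of documents whose word set contains every word in `words`."""
    return sum(1 for text, _ in docs if set(words) <= text)

def surprise(docs, words):
    """Observed co-occurrence count divided by the count expected if the
    words occurred independently of each other."""
    n = len(docs)
    observed = count_containing(docs, words)
    expected = n * prod(count_containing(docs, [w]) / n for w in words)
    return observed / expected if expected > 0 else 0.0

def avg_rating(docs, words):
    """Average rating of the documents containing all words (None if no match)."""
    ratings = [rating for text, rating in docs if set(words) <= text]
    return sum(ratings) / len(ratings) if ratings else None

# Hypothetical toy corpus: (set of words in the review, rating in {0, 1, 2}).
docs = [
    ({"delta", "airlines", "food"}, 2),
    ({"delta", "airlines", "delays"}, 0),
    ({"delta", "airlines", "delays", "atlanta"}, 0),
    ({"united", "airlines", "food"}, 1),
]

delays_surprise = surprise(docs, ["delta", "airlines", "delays"])
delays_avg = avg_rating(docs, ["delta", "airlines", "delays"])
food_avg = avg_rating(docs, ["delta", "airlines", "food"])
```

Every quantity here reduces to counting documents that contain a given word set, which is why the rest of the talk concentrates on obtaining those co-occurrence counts cheaply.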

11 Outline: Problem definition; Basic framework; Improved framework; Experimental results.

12 Computing top-k expansions. Given a query $Q = q_1,\dots,q_l$, compute the top-$k$ expansions $q_1,\dots,q_l,x_{l+1},\dots,x_r$: enumerate all candidate expansions and compute the score of each. With $F(q_1,\dots,q_l,x_{l+1},\dots,x_r) = \#$ of documents containing all words, the challenge is to compute $c(q_1,\dots,x_r)$ (word co-occurrence) for all candidates.

13 Computing word co-occurrences. Option 1: pre-compute and store all possible word co-occurrences. Assume 4-word co-occurrences: a document with 50 distinct words has 230K 4-word sets, so 1M such documents produce 230B 4-word sets. Infeasible. Option 2: compute co-occurrences on the fly. Inefficient.
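The figures on this slide follow from a binomial coefficient. The sketch below only checks the arithmetic; the variable names are ours, and the total counts 4-word sets with repetition across documents, since the slide does not say how many are distinct.

```python
from math import comb

words_per_doc = 50       # distinct words in one document (from the slide)
set_size = 4             # size of the co-occurrence sets (from the slide)
num_docs = 1_000_000     # number of documents (from the slide)

# Number of 4-word subsets of a 50-word document: C(50, 4).
sets_per_doc = comb(words_per_doc, set_size)

# Total across all documents, counted with repetition.
total_sets = sets_per_doc * num_docs

print(sets_per_doc)  # 230300, the "230K" on the slide
print(total_sets)    # 230300000000, the "230B" on the slide
```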

14 Estimating word co-occurrences. [Figure: low-order co-occurrences are used to estimate a high-order co-occurrence. Word occurrences: delta 20000, airlines 45000, delays 30000. Two-word co-occurrences: delta airlines 10000, delta delays 3000, airlines delays 5000. High-order co-occurrence: delta airlines delays 2000.]

15 Query-expansion framework
Query $q_1,\dots,q_l$
For each candidate expansion $q_1,\dots,q_l,x_{l+1},\dots,x_r$
    Use $c(w_i)$, $c(w_i,w_j)$ to estimate $c(q_1,\dots,q_l,x_{l+1},\dots,x_r)$
    Compute expansion score
    Update top-k heap
End For
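The loop above can be sketched in Python with a bounded min-heap. This is an illustrative skeleton only: `top_k_expansions`, `estimate_count` and `score` are hypothetical names, and the estimation step is replaced here by a plain dictionary lookup rather than the estimation from low-order counts that the talk describes.

```python
import heapq

def top_k_expansions(query, candidates, estimate_count, score, k):
    """Return the k highest-scoring expansions as (score, expansion) pairs."""
    heap = []  # min-heap: heap[0] is the weakest of the current top-k
    for extra_words in candidates:
        expansion = tuple(query) + tuple(extra_words)
        count = estimate_count(expansion)   # estimated co-occurrence count
        s = score(expansion, count)         # expansion score
        if len(heap) < k:
            heapq.heappush(heap, (s, expansion))
        elif s > heap[0][0]:
            # Replace the current k-th best with the better candidate.
            heapq.heapreplace(heap, (s, expansion))
    return sorted(heap, reverse=True)

# Hypothetical stand-in for the estimator: a table of made-up counts.
toy_counts = {
    ("delta", "airlines", "delays"): 2000,
    ("delta", "airlines", "food"): 500,
    ("delta", "airlines", "fees"): 1200,
}

top2 = top_k_expansions(
    query=("delta", "airlines"),
    candidates=[("delays",), ("food",), ("fees",)],
    estimate_count=lambda expansion: toy_counts.get(expansion, 0),
    score=lambda expansion, count: count,  # score = count, for illustration
    k=2,
)
```

To minimize instead of maximize the scoring function, the same loop can be run on negated scores.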

16 Outline: Problem definition; Basic framework (Maximum entropy estimation); Improved framework; Experimental results.

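17 Maximum entropy estimation. The unknown joint distribution over the words $w_1,w_2,w_3$ has eight probabilities $p_1,\dots,p_8$. The known low-order counts constrain it, e.g. $p_7+p_8 = p(w_1,w_2) = c(w_1,w_2)/c(\bullet)$ and $p_5+p_6+p_7+p_8 = p(w_1) = c(w_1)/c(\bullet)$. The required count is then $c(w_1,w_2,w_3) = p(w_1,w_2,w_3)\,c(\bullet)$.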

18 Maximum entropy estimation. With $p = [p_1\ p_2\ p_3\ p_4\ p_5\ p_6\ p_7\ p_8]^T$, solve $\max H(p) = -\sum_i p_i \log p_i$ subject to $Ap = c$, $p \ge 0$. The maximum entropy distribution is unique and is computed using the Iterative Proportional Fitting (IPF) algorithm.
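A small iterative-scaling sketch may help make the estimation step concrete. It assumes each constraint fixes the probability that all words in a subset occur together, starts from the uniform distribution, and rescales the matching and non-matching cells in turn. The function names, the target probabilities and the total document count are made up for illustration; this is not the authors' code.

```python
from itertools import product

def ipf_max_entropy(n_words, targets, sweeps=500):
    """Iterative proportional fitting over the 2**n_words presence/absence cells.

    `targets` maps a tuple of word indices to the required probability that
    all of those words occur together (values strictly between 0 and 1).
    """
    cells = list(product((0, 1), repeat=n_words))
    p = {cell: 1.0 / len(cells) for cell in cells}  # uniform start
    for _ in range(sweeps):
        for subset, target in targets.items():
            mass = sum(p[c] for c in cells if all(c[i] for i in subset))
            for c in cells:
                if all(c[i] for i in subset):
                    p[c] *= target / mass                   # cells inside the constraint
                else:
                    p[c] *= (1.0 - target) / (1.0 - mass)   # remaining cells
    return p

def marginal(p, subset):
    """Probability that all words in `subset` occur together."""
    return sum(v for cell, v in p.items() if all(cell[i] for i in subset))

# Made-up one-word and two-word probabilities, i.e. c(w)/c(●) and c(w_i,w_j)/c(●).
toy_targets = {
    (0,): 0.5, (1,): 0.4, (2,): 0.3,
    (0, 1): 0.25, (0, 2): 0.2, (1, 2): 0.15,
}
toy_p = ipf_max_entropy(3, toy_targets)

total_docs = 100_000  # hypothetical c(●)
estimated_count = toy_p[(1, 1, 1)] * total_docs  # estimate of c(w1, w2, w3)
```

As a sanity check, when the pairwise targets equal the products of the single-word targets, the fitted distribution is the independent one, so the three-word probability is simply the product of the three marginals.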

19 Query-expansion framework using the IPF algorithm. [Figure: a candidate expansion defines the constraints $Ap = c$; IPF iterates (shown at iterations 10, 20 and 30) to the ME distribution $p_1, p_2, \dots, p_n$, from which the score is computed and compared with the top-k score threshold of the already considered expansions.]

20 Outline: Problem definition; Basic framework; Improved framework; Experimental results.

21 Entropy maximization. The problem is $\max H(p)$ subject to $Ap = c$, $p \ge 0$, with $p = [p_1\ p_2\ p_3\ \dots\ p_{n-1}\ p_n]^T$. Can we save work? We only require a single probability ($p_n$), and since we only need the top-k expansions, a bound around $p_n$ could help us prune the expansion being considered. This is not possible using IPF. Introduce ElliMax: determine $p_n$ by progressively bounding it.
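The pruning idea can be sketched independently of how the bounds are produced. The code below is not ElliMax: it assumes some procedure yields progressively tighter (lower, upper) bounds on a candidate's score and shows how a candidate can be abandoned once its upper bound drops below the current top-k threshold. All names and numbers are hypothetical.

```python
def evaluate_with_pruning(bound_steps, threshold):
    """Consume progressively tighter (lower, upper) score bounds.

    Returns (last_interval, steps_used, pruned). The candidate is pruned as
    soon as its upper bound falls below the top-k score threshold, because it
    can then no longer enter the top-k.
    """
    steps = 0
    interval = None
    for interval in bound_steps:
        steps += 1
        _, upper = interval
        if upper < threshold:
            return interval, steps, True
    return interval, steps, False

# Made-up sequence of bounds that tighten with each iteration.
toy_bounds = [(0.0, 1.0), (0.1, 0.6), (0.2, 0.35), (0.25, 0.3)]

# With a threshold of 0.5 the candidate is discarded after three steps.
pruned_result = evaluate_with_pruning(toy_bounds, threshold=0.5)

# With a threshold of 0.2 it survives and is refined to the end.
kept_result = evaluate_with_pruning(toy_bounds, threshold=0.2)
```

The saving comes from the steps that are never executed: a weak candidate needs only enough iterations to prove it cannot beat the threshold.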

22 Improved query-expansion framework using the ElliMax algorithm. [Figure: a candidate expansion defines the constraints $Ap = c$; ElliMax progressively bounds $p_n$ and hence the score (shown at iterations 5, 10, 15 and 20), which is compared with the top-k score threshold of the already considered expansions.]

23 Outline: Problem definition; Basic framework; Improved framework (ElliMax algorithm); Experimental results.

24 ElliMax algorithm: Ellipsoid method principles. Problem: $\max F(x)$ subject to $Qx \ge r$. [Figure: ellipsoids containing the optimum $x^*$ at iterations 0, 5 and 10.]
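For readers unfamiliar with the ellipsoid method, the following is a minimal two-dimensional central-cut sketch for $\max F(x)$ subject to a single linear constraint $q \cdot x \ge r$. It illustrates the general principle only and is not ElliMax; the objective, the constraint, the starting ball and all names are chosen for illustration, and $F$ is assumed concave and differentiable.

```python
import math

def ellipsoid_maximize(F, grad_F, q, r, center, radius, iterations=100):
    """Central-cut ellipsoid method in 2D for max F(x) s.t. q.x >= r.

    The ellipsoid {z : (z - x)^T P^{-1} (z - x) <= 1} always contains the
    optimum; each step cuts it in half through its center and encloses the
    remaining half in a smaller ellipsoid.
    """
    x = list(center)
    P = [[radius ** 2, 0.0], [0.0, radius ** 2]]
    best_x, best_val = None, -math.inf
    for _ in range(iterations):
        if q[0] * x[0] + q[1] * x[1] >= r:
            val = F(x)
            if val > best_val:
                best_x, best_val = list(x), val
            gF = grad_F(x)
            g = [-gF[0], -gF[1]]      # discard the half where F is lower
        else:
            g = [-q[0], -q[1]]        # discard the infeasible half
        Pg = [P[0][0] * g[0] + P[0][1] * g[1], P[1][0] * g[0] + P[1][1] * g[1]]
        gPg = g[0] * Pg[0] + g[1] * Pg[1]
        if gPg <= 0.0:                # zero gradient or numerical breakdown
            break
        s = math.sqrt(gPg)
        v = [Pg[0] / s, Pg[1] / s]
        # Standard update for dimension n = 2: step 1/(n+1), scale n^2/(n^2-1).
        x = [x[0] - v[0] / 3.0, x[1] - v[1] / 3.0]
        P = [[(4.0 / 3.0) * (P[i][j] - (2.0 / 3.0) * v[i] * v[j])
              for j in range(2)] for i in range(2)]
    return best_x, best_val

# Illustrative problem: maximize -(x0 - 1)^2 - (x1 + 2)^2 subject to x0 + x1 >= 0.
F = lambda x: -((x[0] - 1.0) ** 2) - ((x[1] + 2.0) ** 2)
grad_F = lambda x: [-2.0 * (x[0] - 1.0), -2.0 * (x[1] + 2.0)]

best_x, best_val = ellipsoid_maximize(F, grad_F, q=[1.0, 1.0], r=0.0,
                                      center=[0.0, 0.0], radius=10.0)
```

For this toy problem the constrained optimum is at (1.5, -1.5) with value -0.5, and the best feasible center approaches it as the ellipsoids shrink.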

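25 ElliMax algorithm. 1) Transform the problem from the p-space ($\max H(p)$ subject to $Ap = c$, $p \ge 0$) to the λ-space ($\max H'(\lambda)$ subject to $U\lambda \ge -q$). 2) Starting ellipsoid. 3) Back to the p-space. [Figure: p-space with axes $p_1$, $p_2$ and the target $p_n$; λ-space with axes $\lambda_1$, $\lambda_2$ and the optimum $\lambda^*$.]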

26 Outline: Problem definition; Basic solution; Improved solution; Experimental results.

27 Experimental Results (Performance). Measured: time spent in entropy maximization. Basic framework (Algorithm Direct): IPF algorithm. Improved framework (Algorithm Bound): ElliMax algorithm. Synthetic and real data.

28 Direct vs Bound (Surprise). Top-10 expansions, 100k synthetic candidates. [Figure panels: Expansion size 3; Expansion size 4.]

29 Direct vs Bound (Avg. Rating). Top-10 expansions, 100k synthetic candidates, ratings 0, 1 and 2. [Figure panels: Expansion size 3; Expansion size 4.]

30 Experimental Results (Quality)

31 Thank you!

