1
Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results
Sandeep Pandey (1), Sourashis Roy (2), Christopher Olston (1), Junghoo Cho (2), Soumen Chakrabarti (3)
(1) Carnegie Mellon, (2) UCLA, (3) IIT Bombay
2
@ Carnegie Mellon Databases
Popularity as a Surrogate for Quality
- Search engines want to measure the “quality” of pages
- Quality is hard to define and measure
- Various “popularity” measures are used in ranking
  - e.g., in-links, PageRank, user traffic
[Figure: ranked list of search results]
3
Relationship Between Popularity and Quality
- Popularity: depends on the number of users who “like” a page
  - relies on both awareness and quality of the page
- Popularity is correlated with quality
  - when awareness is large
4
Problem
- Popularity/quality correlation is weak for young pages
  - Even a high-quality page may not (yet) be popular due to lack of user awareness
- Plus, the process of gaining popularity is inhibited by the “entrenchment effect”
5
Entrenchment Effect
- Search engines show entrenched (already-popular) pages at the top
- Users discover pages via search engines and tend to focus on top results
[Figure: ranked result list; entrenched pages occupy the top positions, where user attention is concentrated]
6
Outline
- Problem introduction
- Evidence of entrenchment effect
- Key idea: mitigate entrenchment by introducing randomness into ranking
  - Model of ranking and popularity evolution
  - Evaluation
- Summary
7
Evidence of Entrenchment
- More news, less diversity - New York Times
- Do search engines suppress controversy? - Susan L. Gerhart
- Googlearchy
- Distinction of retrievability and visibility
- Bias on the Web - Comm. of the ACM
- Are search engines biased? - Chris Sherman
- The politics of search engines - IEEE Computer
- The political economy of linking on the Web - ACM Conf. on Hypertext & Hypermedia
8
Quantification of Entrenchment Effect
- Impact of Search Engines on Page Popularity
  - Real Web study by Cho et al. [WWW’04]
  - Pages downloaded every week from 154 sites
  - Partitioned into 10 groups based on initial link popularity
  - After 7 months, 70% of new links went to the top 20% of pages
- Decrease in PageRank for the bottom 50% of pages
9
Alternative Approaches to Counteract Entrenchment Effect
- Weight links to young pages more
  - [Baeza-Yates et al., SPIRE ’02]
  - Proposed an age-based variant of PageRank
- Extrapolate quality based on increase in popularity
  - [Cho et al., SIGMOD ’05]
  - Proposed an estimate of quality based on the derivative of popularity
10
Our Approach: Randomized Rank Promotion
- Select random (young) pages to promote to good rank positions
- Rank position to promote to is chosen at random
[Figure: ranked list 1, 2, 3, ..., 500, 501 before promotion and 1, 500, 2, 499, 501, ..., 3 after promotion]
11
Our Approach: Randomized Rank Promotion
- Consequence: users visit promoted pages, which improves the quality estimate
- Compared with previous approaches:
  - Does not rely on temporal measurements (+)
  - Sub-optimal (-)
12
Exploration/Exploitation Tradeoff
- Exploration/exploitation tradeoff
  - exploit known high-quality pages by assigning good rank positions
  - explore quality of new pages by promoting them in rank
- Existing search engines only exploit (to our knowledge)
13
Possible Objectives for Rank Promotion
- Fairness
  - Give each page an equal chance to become popular
  - Incentive for search engines to be fair?
- Quality (our choice)
  - Maximize quality of search results seen by users (in aggregate)
  - Quality of page p: extent to which users “like” p
  - Q(p) ∈ [0,1]
14
Model of the Web
- Web = collection of multiple disjoint topic-specific communities (e.g., “Linux”, “Squash”, etc.)
- A community is made up of a set of pages, interested users and related queries
[Figure: two disjoint communities, Squash and Linux]
15
Model of the Web
- Users visit pages only by issuing queries to the search engine
  - Mixed surfing and searching is considered in the paper
- Query answer = ordered list containing all pages in the corresponding community
- A single ranked list is associated with each community
  - Since queries within a community are very similar
16
Model of the Web
- Consequence: each community evolves independently of the other communities
[Figure: separate ranked result lists for the community on Squash and the community on Linux]
17
Quality-Per-Click Metric (QPC)
- V(p,t): number of visits to page p at time t
- QPC: average quality of pages viewed by users, amortized over time
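The QPC formula itself is not preserved in this transcript. One plausible reading of "average quality of pages viewed by users, amortized over time" is a visit-weighted mean of Q(p) over an observation window. The sketch below follows that reading; the function name `quality_per_click` and the input layout (one dict of visit counts per time step) are assumptions made for illustration, not taken from the slides.

```python
def quality_per_click(visits, quality):
    """Visit-weighted average quality over a window of time steps.

    visits:  list with one dict per time step, mapping page -> V(p, t)
    quality: dict mapping page -> Q(p) in [0, 1]
    (Both layouts are illustrative assumptions.)
    """
    total_visits = sum(v for step in visits for v in step.values())
    if total_visits == 0:
        raise ValueError("QPC is undefined when there are no visits")
    # Each visit contributes the quality of the page that was viewed.
    weighted_quality = sum(
        v * quality[page] for step in visits for page, v in step.items()
    )
    return weighted_quality / total_visits
```

A longer window averages out short-term fluctuations, which is one way to interpret "amortized over time".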
18
Outline
- Problem introduction
- Evidence of entrenchment effect
- Key idea: mitigate entrenchment by introducing randomness into ranking
  - Model of ranking and popularity evolution
  - Evaluation
- Summary
19
Desiderata for Randomized Rank Promotion
- Want ability to:
  - Control exploration/exploitation tradeoff
  - “Select” certain pages as candidates for promotion
  - “Protect” certain pages from demotion
[Figure: ranked list 1, 2, 3, ..., 500, 501 before promotion and 1, 500, 2, 499, 501, ..., 3 after promotion]
20
Randomized Rank Promotion Scheme
[Figure: the set of pages W is split into a promotion pool W_m and the remainder W - W_m. The promotion pool is placed in random order to form list L_m; the remainder is ordered by popularity to form list L_d.]
21
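Randomized Rank Promotion Scheme
[Figure: merging L_d and L_m, shown for k = 3 and r = 0.5. The first k-1 positions are taken from the remainder list L_d; each later position is taken from the promotion list L_m with probability r and from L_d with probability 1-r.]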
22
Parameters
- Promotion pool (W_m)
  - Uniform rank promotion: give an equal chance to each page
  - Selective rank promotion: exclusively target zero-awareness pages
- Start rank (k)
  - rank to start randomization from
- Degree of randomization (r)
  - controls the tradeoff between exploration and exploitation
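The scheme and its three parameters can be sketched in a few lines of Python. This is a minimal illustration reconstructed from the slide labels (promotion pool, random ordering, order by popularity, k-1, r, 1-r), not the authors' implementation. The function name, argument layout, and the handling of an exhausted list are assumptions.

```python
import random


def randomized_rank_promotion(pages, popularity, pool, k=1, r=0.1, rng=None):
    """Return a ranked list that mixes popularity order with random promotion.

    pages:      iterable of page ids (the set W)
    popularity: dict mapping page -> popularity score
    pool:       set of pages eligible for promotion (W_m)
    k:          start rank; ranks 1..k-1 are never randomized
    r:          degree of randomization in [0, 1]
    """
    if k < 1 or not 0.0 <= r <= 1.0:
        raise ValueError("need k >= 1 and 0 <= r <= 1")
    rng = rng or random.Random()
    pool = set(pool)

    # L_m: promotion pool in random order.
    promoted = [p for p in pages if p in pool]
    rng.shuffle(promoted)
    # L_d: all other pages, most popular first.
    remainder = sorted(
        (p for p in pages if p not in pool),
        key=lambda p: popularity[p],
        reverse=True,
    )

    # The top k-1 positions come straight from L_d.
    result = remainder[: k - 1]
    remainder = remainder[k - 1:]

    i = j = 0
    while i < len(promoted) or j < len(remainder):
        # With probability r take the next promoted page; otherwise take the
        # next page by popularity. If one list is empty, drain the other
        # (an assumption; the slides do not cover this case).
        take_promoted = j >= len(remainder) or (
            i < len(promoted) and rng.random() < r
        )
        if take_promoted:
            result.append(promoted[i])
            i += 1
        else:
            result.append(remainder[j])
            j += 1
    return result
```

Under this sketch, uniform promotion corresponds to passing every page as `pool`, and selective promotion to passing only the zero-awareness pages.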
23
Tuning the Parameters
- Objective: maximize quality-per-click (QPC)
- Entrenchment in a community depends on many factors
  - Number of pages and users
  - Page lifetimes
  - Visits per user
- Two ways to tune
  - set parameters per community
  - one parameter setting for all communities
24
Outline
- Problem introduction
- Evidence of entrenchment effect
- Key idea: mitigate entrenchment by introducing randomness into ranking
  - Model of ranking and popularity evolution
  - Evaluation
- Summary
25
Popularity Evolution Cycle
[Diagram: cycle linking Popularity P(p,t) → Rank R(p,t) → Visit rate V(p,t) → Awareness A(p,t) → Popularity P(p,t)]
26
Popularity to Rank Relationship
- Rank of a page under the randomized rank promotion scheme
  - determined by a combination of popularity and randomness
- Deterministic popularity-based ranking is a special case
  - i.e., r = 0
- Unknown function F_PR: rank as a function of the popularity of page p under a given randomized scheme
  - R(p,t) = F_PR(P(p,t))
27
Viewing Likelihood
- Depends primarily on rank in list [Joachims KDD’02]
- From AltaVista data [Lempel et al. WWW’03]: F_RV(r) ∝ r^(-1.5)
[Figure: probability of viewing versus rank]
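To get a feel for how steeply a rank^(-1.5) law concentrates attention, the sketch below turns it into per-rank shares of user views. Normalizing the shares so that they sum to 1 over a list of n results is an assumption made here for illustration; the slide only states the proportionality. The function name is hypothetical.

```python
def view_shares(n, exponent=1.5):
    """Share of views received by each rank 1..n under a power-law falloff.

    The weight of a rank is rank ** -exponent; shares are those weights
    normalized to sum to 1 (the normalization is an assumption).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    weights = [rank ** -exponent for rank in range(1, n + 1)]
    total = sum(weights)
    return [w / total for w in weights]
```

With the exponent 1.5, rank 1 receives 4^1.5 = 8 times the views of rank 4, whatever the list length.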
28
Visit to Awareness Relationship
- Awareness A(p,t): fraction of users who have visited page p at least once by time t
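The definition of awareness translates directly into bookkeeping over distinct visitors. The sketch below only illustrates that definition; the class name `AwarenessTracker` and its methods are hypothetical and do not reproduce the closed-form relationship from the talk, which is not preserved in this transcript.

```python
from collections import defaultdict


class AwarenessTracker:
    """Tracks A(p, t): fraction of users who have visited page p at least once."""

    def __init__(self, num_users):
        if num_users <= 0:
            raise ValueError("num_users must be positive")
        self.num_users = num_users
        self._visitors = defaultdict(set)  # page -> set of distinct user ids

    def record_visit(self, page, user):
        # Repeat visits by the same user do not raise awareness.
        self._visitors[page].add(user)

    def awareness(self, page):
        return len(self._visitors.get(page, ())) / self.num_users
```

Because only distinct visitors count, awareness can never decrease and saturates at 1 once every user has seen the page.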
29
Awareness to Popularity Relationship
- Quality Q(p): extent to which users like page p (contribute towards its popularity)
- Popularity P(p,t): [equation not preserved in this transcript]
30
Popularity Evolution Cycle
- Rank from popularity: R(p,t) = F_PR(P(p,t))
- Visit rate from rank: V(p,t) = F_RV(R(p,t))
- Awareness from visit rate: A(p,t) = F_VA(V(p,t))
- Popularity from awareness: P(p,t) = F_AP(A(p,t))
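The four-step cycle can be exercised with a small deterministic simulation. Every functional form in the sketch below is an assumption chosen for illustration: popularity is taken as awareness times quality (consistent with the earlier statement that popularity relies on both), ranking is purely by popularity (the r = 0 case), visits fall off as rank^(-1.5), and each visit is assumed to come from a uniformly random user so that expected awareness can be updated in closed form. None of this is claimed to match the exact formulas in the paper.

```python
def simulate_cycle(quality, num_users, visits_per_step, steps):
    """Iterate popularity -> rank -> visit rate -> awareness for a few steps.

    quality: list of Q(p) values, one per page.
    Returns the expected awareness of each page after `steps` iterations.
    """
    n = len(quality)
    awareness = [0.0] * n
    weights = [rank ** -1.5 for rank in range(1, n + 1)]
    total_weight = sum(weights)

    for _ in range(steps):
        # F_AP (assumed): popularity = awareness * quality.
        popularity = [a * q for a, q in zip(awareness, quality)]
        # F_PR with r = 0: rank strictly by popularity (ties keep page order).
        order = sorted(range(n), key=lambda p: popularity[p], reverse=True)
        for rank, page in enumerate(order, start=1):
            # F_RV: share of this step's visits going to this rank.
            visits = visits_per_step * weights[rank - 1] / total_weight
            # F_VA (assumed): chance that a given unaware user is still
            # unaware after `visits` visits by uniformly random users.
            still_unaware = (1.0 - 1.0 / num_users) ** visits
            awareness[page] = 1.0 - (1.0 - awareness[page]) * still_unaware
    return awareness
```

Running it with two pages of equal quality shows the entrenchment effect in miniature: whichever page happens to be ranked first keeps the top slot and accumulates awareness faster.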
31
Deriving Popularity Evolution Curve
- Next step: derive a formula for the popularity evolution curve
- Derive it using the awareness distribution of pages
[Figure: popularity P(p,t) versus time t]
32
Deriving Popularity Evolution Curve
- Assumptions
  - Number of pages is constant
  - Pages are created and retired according to a Poisson process with a given rate parameter
  - Quality distribution of pages is stationary
- In the steady state, both the popularity and awareness distributions of the pages are stationary
33
Popularity Evolution Curve and Awareness Distribution
- Popularity evolution curve E(x,q): time duration for which a page of quality q has popularity value x
- Awareness distribution: fraction of pages of quality q whose awareness is i / (#users)
- Next: derive the popularity evolution curve using the awareness distribution
34
Popularity Evolution Curve and Awareness Distribution
- Interpret the awareness distribution as the probability that a page of quality q has awareness a_i at any point in time
- [Equations relating the popularity evolution curve to the awareness distribution are not preserved in this transcript]
35
Deriving Awareness Distribution
- Awareness distribution: fraction of pages of quality q whose awareness is i / (#users)
- A steady-state analysis gives the awareness distribution [equation not preserved in this transcript]
- But remember that we do not know F_PR yet: R(p,t) = F_PR(P(p,t))
36
Deriving Awareness Distribution
- Good news: since rank is a combination of popularity and randomness, we can derive F_PR given the awareness distribution
- Start with an initial form of F_PR; iterate until convergence
37
Summary of Where We Stand
- Formalized the popularity evolution cycle
  - Relationship between popularity evolution and awareness distribution
  - Derived the awareness distribution
- Next step: tune parameters
- Recall, the goal is to obtain a scheme that:
  1. achieves high QPC (quality per click)
  2. is robust across a wide range of community types
38
Tuning the Promotion Scheme
- Parameters: k, r and W_m
- Objective: maximize QPC
- Influential factors:
  - Number of pages and users
  - Page lifetimes
  - Visits per user
39
Default Community Setting
- Number of pages = 10,000 (*)
- Number of users = 1000
- Visits per user = 1000 visits per day
- Page lifetimes = 1.5 years [Ntoulas et al., WWW’04]
(*) How Much Information? SIMS, Berkeley, 2003
40
Tuning: W_m parameter
[Figure: comparison of no promotion, uniform promotion, and selective promotion, with k=1 and r=0.2]
41
Tuning: k and r
- Optimal r lies in (0,1)
- Optimal r increases with increasing k
- Based on simulation (reason: the analysis is only accurate for small values of r)
42
Tuning: k and r
- Deciding k and r:
  - k >= 2 for “feeling lucky”
  - Minimize amount of “junk” perceived
  - Maximize QPC
43
Final Parameter Settings
- Promotion pool (W_m): zero-awareness pages
- Start rank (k): 1 or 2
- Randomization (r): 0.1
44
Tuning the Promotion Scheme
- Parameters: k, r and W_m
- Objective: maximize QPC
- Influential factors:
  - Number of pages and users
  - Page lifetimes
  - Visits per user
45
Influence of Number of Pages and Users
46
Influence of Page Lifetime and Visit Rate
47
Influence of Visit Rate
[Figure annotation: 1000 visits/day per user]
48
Summary
- Entrenchment effect hurts search result quality
- Solution: randomized rank promotion
- Model of Web evolution and QPC metric
  - Used to tune and evaluate randomized rank promotion
- Initial results
  - Significantly increases QPC
  - Robust across a wide range of Web communities
- More study required
49
THE END
Paper available at: www.cs.cmu.edu/~spandey