Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results
Sandeep Pandey (1), Sourashis Roy (2), Christopher Olston (1), Junghoo Cho (2), Soumen Chakrabarti (3)
(1) Carnegie Mellon, (2) UCLA, (3) IIT Bombay

@ Carnegie Mellon Databases
Popularity as a Surrogate for Quality
Search engines want to measure the “quality” of pages
Quality hard to define and measure
Various “popularity” measures are used in ranking
– e.g., in-links, PageRank, user traffic

Relationship Between Popularity and Quality
Popularity: depends on the number of users who “like” a page
– relies on both awareness and quality of the page
Popularity correlated with quality
– when awareness is large

Problem
Popularity/quality correlation weak for young pages
– Even if of high quality, may not (yet) be popular due to lack of user awareness
Plus, process of gaining popularity inhibited by “entrenchment effect”

Entrenchment Effect
Search engines show entrenched (already-popular) pages at the top
Users discover pages via search engines; tend to focus on top results
[Figure: entrenched pages and user attention]

Outline
Problem introduction
Evidence of entrenchment effect
Key idea: Mitigate entrenchment by introducing randomness into ranking
– Model of ranking and popularity evolution
– Evaluation
Summary

Evidence of Entrenchment
More news, less diversity - New York Times
Do search engines suppress controversy? - Susan L. Gerhart
Googlearchy
Distinction of retrievability and visibility
Bias on the Web - Comm. of the ACM
Are search engines biased? - Chris Sherman
The politics of search engines - IEEE Computer
The political economy of linking on the Web - ACM conf. on Hypertext & Hypermedia

Quantification of Entrenchment Effect
Impact of Search Engines on Page Popularity
– Real Web study by Cho et al. [WWW’04]
– Pages downloaded every week from 154 sites
– Partitioned into 10 groups based on initial link popularity
– After 7 months, 70% of new links went to the top 20% of pages
Decrease in PageRank for the bottom 50% of pages

Alternative Approaches to Counteract Entrenchment Effect
Weight links to young pages more
– [Baeza-Yates et al. SPIRE ’02]
– Proposed an age-based variant of PageRank
Extrapolate quality based on increase in popularity
– [Cho et al. SIGMOD ’05]
– Proposed an estimate of quality based on the derivative of popularity

Our Approach: Randomized Rank Promotion
Select random (young) pages to promote to good rank positions
Rank position to promote to is chosen at random

Our Approach: Randomized Rank Promotion
Consequence: Users visit promoted pages; improves quality estimate
Compared with previous approaches:
– Does not rely on temporal measurements (+)
– Sub-optimal (-)

Exploration/Exploitation Tradeoff
Exploration/Exploitation tradeoff
– exploit known high-quality pages by assigning good rank positions
– explore quality of new pages by promoting them in rank
Existing search engines only exploit (to our knowledge)

Possible Objectives for Rank Promotion
Fairness
– Give each page an equal chance to become popular
– Incentive for search engines to be fair?
Quality (our choice)
– Maximize quality of search results seen by users (in aggregate)
– Quality of page p: extent to which users “like” p
– Q(p) ∈ [0,1]

Model of the Web
[Figure: example communities “Squash” and “Linux”]
Web = collection of multiple disjoint topic-specific communities (e.g., “Linux”, “Squash”, etc.)
A community is made up of a set of pages, interested users and related queries

Model of the Web
Users visit pages only by issuing queries to search engine
– Mixed surfing & searching considered in the paper
Query answer = ordered list containing all pages in the corresponding community
A single ranked list associated with each community
– Since queries within a community are very similar

Model of the Web
Consequence: Each community evolves independently of the other communities
[Figure: community on Squash, community on Linux]

Quality-Per-Click Metric (QPC)
V(p,t): number of visits to page p at time t
QPC: average quality of pages viewed by users, amortized over time
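
The slide gives QPC only in words. A minimal sketch of one reading, assuming QPC over a finite horizon is the visit-weighted mean of Q(p), i.e. the sum of V(p,t)·Q(p) over all pages and time steps divided by the total number of visits. The function name and the list-of-lists layout are illustrative choices, not taken from the talk.

```python
from typing import Sequence


def quality_per_click(visits: Sequence[Sequence[float]],
                      quality: Sequence[float]) -> float:
    """Visit-weighted average quality over a finite horizon.

    visits[t][p] plays the role of V(p, t); quality[p] plays the role of Q(p).
    Assumption: QPC = sum(V * Q) / sum(V), taken over all pages and time steps.
    """
    weighted_quality = 0.0
    total_visits = 0.0
    for row in visits:
        if len(row) != len(quality):
            raise ValueError("each time step needs one visit count per page")
        for v, q in zip(row, quality):
            weighted_quality += v * q
            total_visits += v
    if total_visits == 0:
        raise ValueError("no visits recorded")
    return weighted_quality / total_visits
```

With two pages of quality 1.0 and 0.0 that receive the same total number of visits, this reading gives a QPC of 0.5.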

Desiderata for Randomized Rank Promotion
Want ability to:
– Control exploration/exploitation tradeoff
– “Select” certain pages as candidates for promotion
– “Protect” certain pages from demotion

Randomized Rank Promotion Scheme
[Figure: the page set W is split into the promotion pool W_m, placed in random order to form list L_m, and the remainder W − W_m, ordered by popularity to form list L_d]

Randomized Rank Promotion Scheme
[Figure: merging the remainder list L_d with the promotion list L_m; labels: k−1, r, 1−r; example with k = 3, r = 0.5]

Parameters
Promotion pool (W_m)
– Uniform rank promotion: give an equal chance to each page
– Selective rank promotion: exclusively target zero-awareness pages
Start rank (k)
– rank to start randomization from
Degree of randomization (r)
– controls the tradeoff between exploration and exploitation
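
A minimal sketch of how the three parameters could combine into one result list. It assumes the following merge rule, which is one reading of the scheme rather than a confirmed specification: the promotion pool is shuffled into L_m, the remaining pages are sorted by popularity into L_d, ranks 1 to k−1 are filled from L_d, and every later rank takes the next page of L_m with probability r and the next page of L_d otherwise. The function and argument names are hypothetical.

```python
import random
from typing import Dict, List, Optional, Set


def randomized_rank_promotion(
    popularity: Dict[str, float],
    promotion_pool: Set[str],
    k: int,
    r: float,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Merge a shuffled promotion list into a popularity-ordered list."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    if rng is None:
        rng = random.Random()

    # L_m: promotion pool in random order.
    l_m = [p for p in popularity if p in promotion_pool]
    rng.shuffle(l_m)
    # L_d: remaining pages in decreasing order of popularity.
    l_d = sorted(
        (p for p in popularity if p not in promotion_pool),
        key=lambda p: popularity[p],
        reverse=True,
    )

    merged: List[str] = []
    i_d = i_m = 0
    while i_d < len(l_d) or i_m < len(l_m):
        if i_m >= len(l_m):
            take_promoted = False          # promotion list exhausted
        elif i_d >= len(l_d):
            take_promoted = True           # popularity list exhausted
        elif len(merged) < k - 1:
            take_promoted = False          # ranks before k are not randomized
        else:
            take_promoted = rng.random() < r
        if take_promoted:
            merged.append(l_m[i_m])
            i_m += 1
        else:
            merged.append(l_d[i_d])
            i_d += 1
    return merged
```

Under this reading, r = 0 places every non-pool page, in popularity order, ahead of the promotion pool, and a larger k shields more of the top positions from randomization.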

Tuning the Parameters
Objective: maximize quality-per-click (QPC)
Entrenchment in a community depends on many factors
– Number of pages and users
– Page lifetimes
– Visits per user
Two ways to tune
– set parameters per community
– one parameter setting for all communities

Popularity Evolution Cycle
[Figure: cycle linking popularity P(p,t), rank R(p,t), visit rate V(p,t) and awareness A(p,t)]

Popularity to Rank Relationship
Rank of a page under randomized rank promotion scheme
– determined by a combination of popularity and randomness
Deterministic popularity-based ranking is a special case
– i.e., r=0
Unknown function F_PR: rank as a function of the popularity of page p under a given randomized scheme
R(p,t) = F_PR(P(p,t))

Viewing Likelihood
[Figure: probability of viewing versus rank]
Depends primarily on rank in list [Joachims KDD’02]
From AltaVista data [Lempel et al. WWW’03]: F_RV(r) ∝ r^{-1.5}
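
To make the power law concrete, the sketch below turns it into a probability distribution over the ranks of a list of n results. The normalization (probabilities summing to 1, as if each visit goes to exactly one result) is an assumption added here for illustration; the slide only states the proportionality.

```python
from typing import List


def view_probabilities(n: int, exponent: float = 1.5) -> List[float]:
    """Probability that a visit lands on rank 1..n, proportional to rank^-exponent."""
    if n < 1:
        raise ValueError("n must be at least 1")
    weights = [rank ** -exponent for rank in range(1, n + 1)]
    total = sum(weights)
    # Assumed normalization: one visit is spread over the n ranks.
    return [w / total for w in weights]
```

Under this law, rank 1 is viewed 4^1.5 = 8 times as often as rank 4, which shows how sharply attention falls off down the list.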

Visit to Awareness Relationship
Awareness A(p,t): fraction of users who have visited page p at least once by time t
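
One way to relate visit counts to awareness, sketched under an assumption that is not stated on the slide: every visit to a page is made by a user drawn uniformly and independently from the m users of the community. A given user then misses all n visits with probability (1 − 1/m)^n, so the expected awareness after n visits is 1 − (1 − 1/m)^n.

```python
def expected_awareness(total_visits: int, num_users: int) -> float:
    """Expected fraction of users who have seen a page after total_visits visits.

    Assumption: each visit comes from a uniformly random user, independently.
    """
    if num_users < 1:
        raise ValueError("num_users must be at least 1")
    if total_visits < 0:
        raise ValueError("total_visits must be non-negative")
    return 1.0 - (1.0 - 1.0 / num_users) ** total_visits
```

The curve rises almost linearly while awareness is small and flattens as repeat visitors become common, so early visits matter most to a new page.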

Awareness to Popularity Relationship
Quality Q(p): extent to which users like page p (contribute towards its popularity)
Popularity P(p,t) :

Popularity Evolution Cycle
Rank R(p,t) = F_PR(P(p,t))
Visit rate V(p,t) = F_RV(R(p,t))
Awareness A(p,t) = F_VA(V(p,t))
Popularity P(p,t) = F_AP(A(p,t))
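
The four relationships can be chained into a toy simulation of one community under deterministic popularity-based ranking (r = 0). This is a sketch under assumptions that are not in the slides: popularity is taken to be awareness times quality, visits are distributed over ranks in proportion to rank^-1.5, each visit comes from a uniformly random user, and pages never retire. The function name and the discrete time steps are illustrative.

```python
import random
from typing import List, Set


def simulate_cycle(quality: List[float], num_users: int,
                   visits_per_step: int, steps: int, seed: int = 0) -> List[float]:
    """Iterate popularity -> rank -> visits -> awareness -> popularity."""
    rng = random.Random(seed)
    n = len(quality)
    aware: List[Set[int]] = [set() for _ in range(n)]  # users who saw each page
    weights = [rank ** -1.5 for rank in range(1, n + 1)]
    popularity = [0.0] * n
    for _ in range(steps):
        # Popularity -> rank: sort pages by decreasing popularity
        # (stable sort, so ties keep their original index order).
        order = sorted(range(n), key=lambda p: popularity[p], reverse=True)
        # Rank -> visits: sample the visited pages by rank-dependent weight.
        visited = rng.choices(order, weights=weights, k=visits_per_step)
        # Visits -> awareness: each visit is made by a random user.
        for p in visited:
            aware[p].add(rng.randrange(num_users))
        # Awareness -> popularity (assumed form: P = A * Q).
        popularity = [len(aware[p]) / num_users * quality[p] for p in range(n)]
    return popularity
```

Because ties are broken by original order and attention concentrates on the top ranks, pages that start near the top of this toy model tend to accumulate awareness first, which is the entrenchment effect in miniature.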

Deriving Popularity Evolution Curve
[Figure: popularity P(p,t) versus time t]
Next step: derive formula for popularity evolution curve
Derive it using the awareness distribution of pages

Deriving Popularity Evolution Curve
Assumptions
– Number of pages constant
– Pages are created and retired according to a Poisson process with rate parameter
– Quality distribution of pages is stationary
In the steady state, both popularity and awareness distribution of the pages are stationary

Popularity Evolution Curve and Awareness Distribution
Popularity evolution curve E(x,q): time duration for which a page of quality q has popularity value x
Next: derive popularity evolution curve using the awareness distribution
Awareness distribution: fraction of pages of quality q whose awareness is i / (#users)

Popularity Evolution Curve and Awareness Distribution
Awareness distribution: interpret it as the probability that a page of quality q has awareness a_i at any point in time
We know that :
Hence,

Deriving Awareness Distribution
Awareness distribution: fraction of pages of quality q whose awareness is i / (#users)
but remember that we do not know F_PR yet
R(p,t) = F_PR(P(p,t))
Doing the steady state analysis, we get

Deriving Awareness Distribution
Start with an initial form of F_PR; iterate till convergence
Good news: rank is a combination of popularity and randomness, we can derive F_PR given. (ex. below)

Summary of Where We Stand
Formalized the popularity evolution cycle
– Relationship between popularity evolution and awareness distribution
– Derived the awareness distribution
Next step: tune parameters
Recall, goal is to obtain scheme that:
1. achieves high QPC (quality per click)
2. is robust across a wide range of community types

Tuning the Promotion Scheme
Parameters: k, r and W_m
Objective: maximize QPC
Influential factors:
– Number of pages and users
– Page lifetimes
– Visits per user

Default Community Setting
Number of pages = 10,000 *
Number of users = 1000
Visits per user = 1000 visits per day
Page lifetimes = 1.5 years [Ntoulas et al., WWW’04]
* How Much Information? SIMS, Berkeley, 2003

Tuning: W_m parameter
[Figure: comparison of no promotion, uniform promotion and selective promotion, with k=1 and r=0.2]

Tuning: k and r
Optimal r: (0,1)
Optimal r increases with increasing k
Based on simulation (reason: analysis only accurate for small values of r)

Tuning: k and r
Deciding k & r:
– k >= 2 for “feeling lucky”
– Minimize amount of “junk” perceived
– Maximize QPC

Final Parameter Settings
Promotion pool (W_m): zero-awareness pages
Start rank (k): 1 or 2
Randomization (r): 0.1

Influence of Number of Pages and Users

Influence of Page Lifetime and Visit Rate

Influence of Visit Rate
1000 visits/day per user

Summary
Entrenchment effect hurts search result quality
Solution: Randomized rank promotion
Model of Web evolution and QPC metric
– Used to tune & evaluate randomized rank promotion
Initial results
– Significantly increases QPC
– Robust across wide range of Web communities
More study required

THE END
Paper available at :