Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results

Presentation transcript:

Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results
Sandeep Pandey (1), Sourashis Roy (2), Christopher Olston (1), Junghoo Cho (2), Soumen Chakrabarti (3)
(1) Carnegie Mellon, (2) UCLA, (3) IIT Bombay

@ Carnegie Mellon Databases
Popularity as a Surrogate for Quality
- Search engines want to measure the "quality" of pages
- Quality is hard to define and measure
- Various "popularity" measures are used in ranking
  - e.g., in-links, PageRank, user traffic

Relationship Between Popularity and Quality
- Popularity: depends on the number of users who "like" a page
  - relies on both awareness and quality of the page
- Popularity is correlated with quality
  - when awareness is large

Problem
- Popularity/quality correlation is weak for young pages
  - Even if of high quality, a page may not (yet) be popular due to lack of user awareness
- Plus, the process of gaining popularity is inhibited by the "entrenchment effect"

Entrenchment Effect
- Search engines show entrenched (already-popular) pages at the top
- Users discover pages via search engines and tend to focus on the top results
- [Figure labels: entrenched pages, user attention]

Outline
- Problem introduction
- Evidence of entrenchment effect
- Key idea: mitigate entrenchment by introducing randomness into ranking
  - Model of ranking and popularity evolution
  - Evaluation
- Summary

Evidence of Entrenchment
- More news, less diversity - New York Times
- Do search engines suppress controversy? - Susan L. Gerhart
- Googlearchy
- Distinction of retrievability and visibility
- Bias on the Web - Comm. of the ACM
- Are search engines biased? - Chris Sherman
- The politics of search engines - IEEE Computer
- The political economy of linking on the Web - ACM Conf. on Hypertext & Hypermedia

Quantification of Entrenchment Effect
- Impact of Search Engines on Page Popularity
  - Real Web study by Cho et al. [WWW'04]
  - Pages downloaded every week from 154 sites
  - Partitioned into 10 groups based on initial link popularity
  - After 7 months, 70% of new links went to the top 20% of pages
- Decrease in PageRank for the bottom 50% of pages

Alternative Approaches to Counteract the Entrenchment Effect
- Weight links to young pages more
  - [Baeza-Yates et al., SPIRE '02]
  - Proposed an age-based variant of PageRank
- Extrapolate quality based on increase in popularity
  - [Cho et al., SIGMOD '05]
  - Proposed an estimate of quality based on the derivative of popularity

Our Approach: Randomized Rank Promotion
- Select random (young) pages to promote to good rank positions
- The rank position to promote to is chosen at random

Our Approach: Randomized Rank Promotion
- Consequence: users visit promoted pages, which improves the quality estimate
- Compared with previous approaches:
  - Does not rely on temporal measurements (+)
  - Sub-optimal (-)

Exploration/Exploitation Tradeoff
- Exploration/exploitation tradeoff
  - exploit known high-quality pages by assigning good rank positions
  - explore quality of new pages by promoting them in rank
- Existing search engines only exploit (to our knowledge)

Possible Objectives for Rank Promotion
- Fairness
  - Give each page an equal chance to become popular
  - Incentive for search engines to be fair?
- Quality (our choice)
  - Maximize quality of search results seen by users (in aggregate)
  - Quality of page p: extent to which users "like" p
  - Q(p) ∈ [0,1]

Model of the Web
- Web = collection of multiple disjoint topic-specific communities (e.g., "Linux", "Squash", etc.)
- A community is made up of a set of pages, interested users, and related queries
- [Figure labels: Squash, Linux]

Model of the Web
- Users visit pages only by issuing queries to the search engine
  - Mixed surfing & searching is considered in the paper
- Query answer = ordered list containing all pages in the corresponding community
- A single ranked list is associated with each community
  - Since queries within a community are very similar

Model of the Web
- Consequence: each community evolves independently of the other communities
- [Figure labels: Community on Squash, Community on Linux]

Quality-Per-Click Metric (QPC)
- V(p,t): number of visits to page p at time t
- QPC: average quality of pages viewed by users, amortized over time
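
The slide's QPC formula did not survive in this transcript. As a rough illustration only, one reading of "average quality of pages viewed by users, amortized over time" is a visit-weighted average of Q(p) over a finite observation window. The function name, the data layout, and the finite window are assumptions made for this sketch, not the authors' definition.

```python
def quality_per_click(visits, quality):
    """Visit-weighted average quality over a finite window (illustrative only).

    visits  -- list with one dict per time step, mapping page -> V(p, t)
    quality -- dict mapping page -> Q(p), each value in [0, 1]
    """
    weighted_quality = 0.0
    total_visits = 0
    for step in visits:
        for page, count in step.items():
            weighted_quality += count * quality[page]
            total_visits += count
    if total_visits == 0:
        raise ValueError("no visits recorded")
    return weighted_quality / total_visits


# Two time steps, two pages: most early clicks go to the high-quality page "a".
example_visits = [{"a": 3, "b": 1}, {"a": 1, "b": 3}]
example_quality = {"a": 1.0, "b": 0.5}
print(quality_per_click(example_visits, example_quality))
```

Under this reading, a ranking scheme raises QPC whenever it shifts clicks toward pages with higher Q(p).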

Outline
- Problem introduction
- Evidence of entrenchment effect
- Key idea: mitigate entrenchment by introducing randomness into ranking
  - Model of ranking and popularity evolution
  - Evaluation
- Summary

Desiderata for Randomized Rank Promotion
- Want ability to:
  - Control exploration/exploitation tradeoff
  - "Select" certain pages as candidates for promotion
  - "Protect" certain pages from demotion

Randomized Rank Promotion Scheme
- [Figure: the page set W is split into the promotion pool W_m and the remainder W - W_m; the promotion pool is put in random order to form L_m, and the remainder is ordered by popularity to form L_d]

Randomized Rank Promotion Scheme
- [Figure: merging the remainder list L_d and the promotion list L_m into the final result list; labels: k-1, r, 1-r; example shown for k = 3, r = 0.5]

Parameters
- Promotion pool (W_m)
  - Uniform rank promotion: give an equal chance to each page
  - Selective rank promotion: exclusively target zero-awareness pages
- Start rank (k)
  - rank to start randomization from
- Degree of randomization (r)
  - controls the tradeoff between exploration and exploitation
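
The scheme and its three parameters can be sketched in a few lines of Python. This is one plausible reading of the two figure slides, not the authors' code: positions above the start rank k are filled from the popularity-ordered list L_d, and every later position is filled from the shuffled promotion list L_m with probability r, otherwise from L_d. The function name, the argument layout, and the handling of an exhausted list are assumptions.

```python
import random


def randomized_rank_promotion(pages, popularity, pool, k, r, rng=None):
    """Merge a shuffled promotion list into a popularity ranking (illustrative).

    pages      -- all pages in the community (W)
    popularity -- dict mapping page -> current popularity
    pool       -- pages eligible for promotion (W_m)
    k          -- start rank (>= 1); ranks 1..k-1 are never randomized
    r          -- degree of randomization, in [0, 1]
    """
    rng = rng or random.Random()
    pool_set = set(pool)

    # L_m: promotion pool in random order.
    promoted = [p for p in pages if p in pool_set]
    rng.shuffle(promoted)

    # L_d: remaining pages, most popular first.
    remainder = sorted(
        (p for p in pages if p not in pool_set),
        key=lambda p: popularity[p],
        reverse=True,
    )

    # Ranks 1..k-1 come straight from L_d.
    result = remainder[: k - 1]
    remainder = remainder[k - 1:]

    i = j = 0
    while i < len(promoted) or j < len(remainder):
        if i >= len(promoted):
            take_promoted = False          # L_m exhausted
        elif j >= len(remainder):
            take_promoted = True           # L_d exhausted
        else:
            take_promoted = rng.random() < r
        if take_promoted:
            result.append(promoted[i])
            i += 1
        else:
            result.append(remainder[j])
            j += 1
    return result


pages = ["a", "b", "c", "d", "e"]
popularity = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.0, "e": 0.0}
# Selective promotion of the two zero-popularity pages, as in the k = 3, r = 0.5 example.
print(randomized_rank_promotion(pages, popularity, ["d", "e"], k=3, r=0.5))
```

With r = 0 the sketch falls back to plain popularity ordering with the pool pages at the bottom. Passing every page as the pool corresponds to uniform promotion, while passing only zero-awareness pages corresponds to selective promotion.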

Tuning the Parameters
- Objective: maximize quality-per-click (QPC)
- Entrenchment in a community depends on many factors
  - Number of pages and users
  - Page lifetimes
  - Visits per user
- Two ways to tune
  - set parameters per community
  - one parameter setting for all communities

Outline
- Problem introduction
- Evidence of entrenchment effect
- Key idea: mitigate entrenchment by introducing randomness into ranking
  - Model of ranking and popularity evolution
  - Evaluation
- Summary

Popularity Evolution Cycle
- [Figure: cycle linking Popularity P(p,t), Rank R(p,t), Visit rate V(p,t), and Awareness A(p,t)]

Popularity to Rank Relationship
- Rank of a page under the randomized rank promotion scheme
  - determined by a combination of popularity and randomness
- Deterministic popularity-based ranking is a special case
  - i.e., r = 0
- Unknown function F_PR: rank as a function of the popularity of page p under a given randomized scheme
  - R(p,t) = F_PR(P(p,t))

Viewing Likelihood
- Probability of viewing depends primarily on rank in the list [Joachims, KDD'02]
- From AltaVista data [Lempel et al., WWW'03]: F_RV(r) ∝ r^{-1.5}, where r here denotes rank
- [Figure: probability of viewing vs. rank]
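
To get a feel for how steep the r^{-1.5} law is, the sketch below turns it into a share of views per rank position. Normalizing the weights so they sum to 1 over a list of n results is an assumption added for illustration, as are the function and parameter names.

```python
def view_probabilities(num_results, exponent=1.5):
    """Share of views per rank position under a power law rank ** -exponent."""
    weights = [rank ** -exponent for rank in range(1, num_results + 1)]
    total = sum(weights)
    return [w / total for w in weights]


shares = view_probabilities(10)
for rank, share in enumerate(shares, start=1):
    print(f"rank {rank:2d}: {share:.3f}")
```

Under this law, rank 1 receives 4^{1.5} = 8 times the views of rank 4, which is why pages that start at the bottom gain awareness so slowly.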

Visit to Awareness Relationship
- Awareness A(p,t): fraction of users who have visited page p at least once by time t

Awareness to Popularity Relationship
- Quality Q(p): extent to which users like page p (contributes towards its popularity)
- Popularity P(p,t): [formula not preserved in the transcript]

Popularity Evolution Cycle
- [Figure: the cycle with each edge labeled by its function]
  - Popularity to rank: R(p,t) = F_PR(P(p,t))
  - Rank to visit rate: V(p,t) = F_RV(R(p,t))
  - Visit rate to awareness: A(p,t) = F_VA(V(p,t))
  - Awareness to popularity: P(p,t) = F_AP(A(p,t))
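
The cycle can be stepped through numerically. The sketch below is a toy deterministic version (no rank promotion, i.e. r = 0) built on assumptions that the transcript does not confirm: popularity is taken as awareness times quality, visits are split across ranks by the normalized rank^{-1.5} law, and each visit is assumed to come from a uniformly random user, so an expected fraction 1 - (1 - 1/m)^v of unaware users becomes aware after v visits among m users. All names are hypothetical.

```python
def simulate_cycle(quality, num_users, visits_per_day, days, exponent=1.5):
    """Iterate popularity -> rank -> visits -> awareness for a fixed page set."""
    n = len(quality)
    weights = [rank ** -exponent for rank in range(1, n + 1)]
    total = sum(weights)
    share = [w / total for w in weights]       # assumed F_RV, normalized

    awareness = [0.0] * n
    for _ in range(days):
        # Assumed F_AP: popularity = awareness * quality.
        popularity = [a * q for a, q in zip(awareness, quality)]
        # F_PR with r = 0: rank strictly by popularity (ties keep input order).
        order = sorted(range(n), key=lambda i: popularity[i], reverse=True)
        for rank_index, page in enumerate(order):
            visits = visits_per_day * share[rank_index]
            # Assumed F_VA: expected newly aware fraction of unaware users.
            newly_aware = 1.0 - (1.0 - 1.0 / num_users) ** visits
            awareness[page] += (1.0 - awareness[page]) * newly_aware
    return awareness


# Page 0 is mediocre but listed first; page 1 is better but starts below it.
print(simulate_cycle([0.4, 0.9], num_users=1000, visits_per_day=100, days=30))
```

Because ties are broken by input order, the page that happens to be listed first collects most of the early visits, a small-scale picture of the entrenchment the deck describes.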

Deriving Popularity Evolution Curve
- Next step: derive a formula for the popularity evolution curve
- Derive it using the awareness distribution of pages
- [Figure: popularity P(p,t) vs. time t]

Deriving Popularity Evolution Curve
- Assumptions
  - Number of pages is constant
  - Pages are created and retired according to a Poisson process with a given rate parameter
  - Quality distribution of pages is stationary
- In the steady state, both the popularity and awareness distributions of the pages are stationary

Popularity Evolution Curve and Awareness Distribution
- Popularity evolution curve E(x,q): time duration for which a page of quality q has popularity value x
- Awareness distribution: fraction of pages of quality q whose awareness is i / (#users)
- Next: derive the popularity evolution curve using the awareness distribution

Popularity Evolution Curve and Awareness Distribution
- Awareness distribution: interpret it as the probability that a page of quality q has awareness a_i at any point in time
- We know that: [equation not preserved in the transcript]
- Hence: [equation not preserved in the transcript]

Deriving Awareness Distribution
- Awareness distribution: fraction of pages of quality q whose awareness is i / (#users)
- Doing the steady-state analysis, we get: [equation not preserved in the transcript]
- But remember that we do not know F_PR yet
  - R(p,t) = F_PR(P(p,t))

Deriving Awareness Distribution
- Good news: since rank is a combination of popularity and randomness, we can derive F_PR given [symbol not preserved in the transcript] (example below)
- Start with an initial form of F_PR; iterate until convergence

Summary of Where We Stand
- Formalized the popularity evolution cycle
  - Relationship between popularity evolution and awareness distribution
  - Derived the awareness distribution
- Next step: tune parameters
- Recall, the goal is to obtain a scheme that:
  1. achieves high QPC (quality per click)
  2. is robust across a wide range of community types

Tuning the Promotion Scheme
- Parameters: k, r and W_m
- Objective: maximize QPC
- Influential factors:
  - Number of pages and users
  - Page lifetimes
  - Visits per user

Default Community Setting
- Number of pages = 10,000 *
- Number of users = 1000
- Visits per user = 1000 visits per day
- Page lifetimes = 1.5 years [Ntoulas et al., WWW'04]
* How Much Information? SIMS, Berkeley, 2003

Tuning: W_m Parameter
- [Figure: comparison of no promotion, uniform promotion, and selective promotion, with k = 1 and r = 0.2]

Tuning: k and r
- Optimal r lies in (0,1)
- Optimal r increases with increasing k
- Based on simulation (reason: the analysis is only accurate for small values of r)

Tuning: k and r
- Deciding k and r:
  - k >= 2 for "feeling lucky"
  - Minimize amount of "junk" perceived
  - Maximize QPC

Final Parameter Settings
- Promotion pool (W_m): zero-awareness pages
- Start rank (k): 1 or 2
- Randomization (r): 0.1

Tuning the Promotion Scheme
- Parameters: k, r and W_m
- Objective: maximize QPC
- Influential factors:
  - Number of pages and users
  - Page lifetimes
  - Visits per user

Influence of Number of Pages and Users

Influence of Page Lifetime and Visit Rate

Influence of Visit Rate
- [Figure annotation: 1000 visits/day per user]

Summary
- Entrenchment effect hurts search result quality
- Solution: randomized rank promotion
- Model of Web evolution and QPC metric
  - Used to tune & evaluate randomized rank promotion
- Initial results
  - Significantly increases QPC
  - Robust across a wide range of Web communities
- More study required

THE END
Paper available at: