Page Quality: In Search of an Unbiased Web Ranking. Seminar on Databases and the Internet, Hebrew University of Jerusalem, Winter 2008. Ofir Cooper


References: "Page Quality: In Search of Unbiased Web Ranking", Junghoo Cho, Sourashis Roy, Robert E. Adams, UCLA Computer Science Department (June 2005). "Impact of Search Engines on Page Popularity", Junghoo Cho, Sourashis Roy, UCLA Computer Science Department (May 2004).

Overview: The current ranking algorithm used by search engines. Motivation to improve it. The proposed method. Implementation. Experimental results. Conclusions (problems & future work).

Search engines today Search engines today use a variant of the PageRank rating system to sort relevant results. PageRank (PR) tries to measure the “importance” of a page, by measuring its popularity.

What is PageRank ? Based on the random-surfer web-user model: A person starts surfing the web at a random page. The person advances by clicking on links in the page (selected at random). At each step, there is a small chance the person will jump to a new, random page. This model does not take into account search engines.

What is PageRank? PageRank (PR) measures the probability that the random surfer is at page p at any given time. It is computed by the formula PR(p) = (1-d)/N + d · Σ_{q links to p} PR(q)/L(q), where d is the damping factor, N is the total number of pages, and L(q) is the number of out-links on page q.
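As a side note (not part of the original slides), the random-surfer computation can be sketched as a small power iteration over a hypothetical four-page link graph; the graph, damping factor, and iteration count below are all made up for illustration:

```python
# Minimal power-iteration sketch of PageRank on a tiny hypothetical
# link graph; d is the damping factor (the "small chance" of a random
# jump in the slide's surfer model).
def pagerank(links, d=0.85, iters=100):
    n = len(links)
    pr = {p: 1.0 / n for p in links}          # start uniform
    for _ in range(iters):
        new = {p: (1 - d) / n for p in links}  # random-jump component
        for p, outs in links.items():
            if outs:                           # spread rank over out-links
                share = d * pr[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:                              # dangling page: spread evenly
                for q in links:
                    new[q] += d * pr[p] / n
        pr = new
    return pr

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(ranks)
```

Note that page "d", which nothing links to, ends up with the minimum rank (1-d)/N plus nothing: exactly the "doomed to obscurity" effect discussed next.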

The problem with PR: the PageRank rating creates a "rich-get-richer" phenomenon. Popular pages become even more popular over time, while unpopular pages are hardly ever visited because they remain unknown to users. They are doomed to obscurity.

The problem with PR: This was observed in an experiment. The Open Directory was sampled twice within a seven-month period, and the change in the number of incoming links to each page was recorded. Pages were divided into popularity groups, and the results follow.

The bias against low-PageRank pages

In their study, Cho and Roy show that in a search-dominant world, the discovery time of new pages rises by a factor of 66!

What can be done to remedy the situation? Ranking should reflect the quality of pages. Popularity is not a good-enough measure of quality, because there are many good yet unknown pages. We want to give new pages an equal opportunity (if they are of high quality).

How to define "quality"? Quality is a very subjective notion, but let's try to define it anyway. Page quality: the probability that an average user will like the page when visiting it for the first time.

How to estimate quality? Quality could be measured exactly if we showed the page to all users and asked their opinion, but it is impractical to ask users about every page they visit. (PageRank would be a good measure of quality if all pages had been given an equal opportunity to be discovered; that is no longer the case.)

How to estimate quality? We want to estimate quality from measurable quantities. We can talk about these quantities: Q(p) – page quality: the probability that a user will like page p when exposed to it for the first time. P(p,t) – page popularity: the fraction of users who like p at time t. V(p,t) – visit popularity: the number of visits page p receives during a unit time interval at time t. A(p,t) – page awareness: the fraction of web users who are aware of page p at time t.

Estimating quality. Lemma 1: P(p,t) = A(p,t) · Q(p). Proof: follows directly from the definitions. This is not sufficient on its own, because we cannot measure awareness; we can only measure page popularity, P(p,t). How do we estimate Q(p) from P(p,t) alone?

Estimating quality. Observation 1: popularity (as measured by incoming links) measures quality well for pages of the same age. Observation 2: the popularity of new high-quality pages increases faster than the popularity of new low-quality pages. In other words, the time derivative of popularity is also a measure of quality.

Estimating quality We need a web-user model to link popularity and quality. We start with these two assumptions: 1. Visit popularity is proportional to popularity; V(p,t) = rP(p,t) 2. Random visit hypothesis: a visit to page p can be from any user with equal probability.

Estimating quality. Lemma 2: A(p,t) can be computed from past popularity: A(p,t) = 1 − e^(−(r/n) ∫₀ᵗ P(p,s) ds), where n is the number of web users.

Estimating quality. Proof: by time t, page p has been visited r ∫₀ᵗ P(p,s) ds times (by assumption 1). We compute the probability that some user u is not aware of p after p has been visited k times.

Estimating quality. Pr(the i-th visitor to p is not u) = 1 − 1/n, so Pr(u has not visited p | p was visited k times) = (1 − 1/n)^k ≈ e^(−k/n). Substituting k = r ∫₀ᵗ P(p,s) ds gives the expression for A(p,t).

Estimating quality. We can combine Lemmas 1 and 2 to get popularity as a function of time. Theorem: P(p,t) = Q(p) / [1 + (Q(p)/P(p,0) − 1) e^(−(r/n) Q(p) t)]. The proof is a bit long, so we won't go into it (available in hard copy to those interested).
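To see the shape of the theorem's curve, here is a small numeric sketch of the closed-form popularity function; the values chosen for Q, P(p,0), r, and n are hypothetical and do not come from the slides:

```python
import math

# Sketch of the closed-form popularity curve from the theorem,
# P(p,t) = Q / (1 + (Q/P(p,0) - 1) * exp(-(r/n)*Q*t)).
# All parameter values below are made up for illustration.
def popularity(t, Q=0.8, P0=0.01, r=1000.0, n=10000.0):
    return Q / (1.0 + (Q / P0 - 1.0) * math.exp(-(r / n) * Q * t))

# Popularity starts at P(p,0) and converges to Q over time
# (the logistic S-curve shape shown on the next slide).
print(popularity(0.0))    # ≈ 0.01
print(popularity(200.0))  # ≈ 0.8
```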

Estimating quality This is popularity vs. time, as predicted by our formula: (This trend was seen in practice, by companies such as NetRatings)

Estimating quality Important fact: Popularity converges to quality over a long period of time. We will use this fact to check estimates about quality later.

Estimating quality. Lemma 3: dP(p,t)/dt = (r/n) [Q(p) − P(p,t)] P(p,t). Proof: differentiate the equation P(p,t) = A(p,t)Q(p) with respect to time, plug in the expression we found for A(p,t) in Lemma 2, and that's it.

Estimating quality. We define the "relative popularity increase function": I(p,t) = (n/r) · (dP(p,t)/dt) / P(p,t).


Theorem: Q(p) = I(p,t) + P(p,t) at all times. Proof: by Lemma 3, (n/r) · (dP(p,t)/dt) / P(p,t) = Q(p) − P(p,t), i.e. I(p,t) = Q(p) − P(p,t); rearranging gives the result.
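A quick numeric sanity check of this theorem (not from the slides): evaluate I(p,t) + P(p,t) on the closed-form popularity curve, approximating the derivative by a finite difference. All parameter values are hypothetical:

```python
import math

# Check numerically that Q = I(p,t) + P(p,t), where
# I(p,t) = (n/r) * P'(p,t) / P(p,t). Parameters are made up.
Q_true, P0, r, n = 0.8, 0.01, 1000.0, 10000.0

def P(t):
    return Q_true / (1.0 + (Q_true / P0 - 1.0) * math.exp(-(r / n) * Q_true * t))

t, dt = 30.0, 1e-4
dPdt = (P(t + dt) - P(t - dt)) / (2 * dt)   # central finite difference
I = (n / r) * dPdt / P(t)                   # relative popularity increase
Q_est = I + P(t)
print(Q_est)  # ≈ 0.8, the true quality
```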

Estimating quality. We can now estimate the quality of a page by measuring only its popularity. But what happens if quality changes over time? Is our estimate still good?

Quality change over time. In reality, the quality of pages changes: web pages change, and the expectations of users rise as better pages appear all the time. Will the model handle changing quality well?

Quality change over time. Theorem: if quality changed at time T (from Q1 to Q2), then for t > T the estimate for quality still holds: Q2(p) = I(p,t) + P(p,t).

Quality change over time. Proof: after time T, we divide users into three groups: (1) users who visited the page before T (group u1); (2) users who visited the page after T (group u2); (3) users who never visited the page.

Quality change over time. We want the fraction of users who like the page at time t > T. After time T, group u2 expands while u1 remains the same, so we will have to compute |u2(t)|.

Quality change over time. From the proof of Lemma 2 (the calculation of awareness at time t), it is easy to see that |u2(t)| = n (1 − e^(−(r/n) ∫_T^t P(p,s) ds)).

Quality change over time. Size of |u1 − u2(t)|: the fraction of users in the intersection of u1 and u2 is the product of their fractions, because the two groups are independent (by the random-visit hypothesis, the probability that a user visits page p at time t is independent of his past visit history). Hence |u1 ∩ u2(t)| = |u1| · |u2(t)| / n, and |u1 − u2(t)| = |u1| (1 − |u2(t)|/n).


Q.E.D.

Implementation

The implementation of a quality-estimator system is very simple: 1. "Sample" the web at different times. 2. Compute the popularity (PageRank) of each page, and its change over time. 3. Estimate the quality of each page.
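As a rough sketch only (this is not the authors' code), the three steps could look like the following. The two PageRank snapshots, the sampling interval, and the constant n/r are all assumed values, and the time derivative is approximated by a finite difference between snapshots:

```python
# Sketch of the quality-estimator pipeline: given two popularity
# (PageRank) snapshots taken dt apart, estimate
# Q(p) ~= P(p) + (n/r) * dP/dt / P(p) for each page.
# n_over_r and the snapshot values below are hypothetical.
def estimate_quality(p1, p2, dt, n_over_r=1.0):
    quality = {}
    for page, pop in p2.items():
        dpdt = (pop - p1.get(page, 0.0)) / dt   # finite-difference derivative
        quality[page] = pop + n_over_r * dpdt / pop
    return quality

snap1 = {"a": 0.05, "b": 0.05}   # popularity at time t
snap2 = {"a": 0.20, "b": 0.06}   # popularity at time t + dt
est = estimate_quality(snap1, snap2, dt=1.0)
print(est)
```

Page "a" grew much faster from the same starting popularity, so it gets the higher quality estimate, matching Observation 2 above.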

Implementation. But there are problems with this implementation: 1. Approximation error: we sample at discrete time points rather than continuously. 2. Quality change between samples makes the estimate inaccurate. 3. There is a time lag, so the quality estimate will never be up to date.

Implementation. Examining the approximation error: simulation with Q = 0.5 and ∆t = 1 (time units not specified).

Implementation. Examining a slow change in quality: simulation with quality drifting linearly, Q(p,t) = 0.5 + ct.

Implementation Examining rapid change in quality

The Experiment

Evaluating a web metric such as quality is difficult: quality is subjective, there is no standard corpus, and doing a user survey is not practical.

The Experiment The experiment is based on the observation that popularity converges to quality (assuming quality is constant). If we estimate quality of pages, and wait some time, we can check our estimates against the eventual popularity.

The Experiment. The test was done on 154 web sites obtained from the Open Directory. All pages of these web sites were downloaded (~5 million pages).

Four snapshots were taken over a period of time; the first three were used to estimate quality, and the fourth was used to check the prediction.

The Experiment. Only "stable" pages were included (pages whose quality estimates did not change much). This turned out to be most pages.

The Experiment. The "true" quality is taken to be PR at t=4. The quality estimator is measured against simply using PR at t=3 as a predictor.

The Experiment The results:

The quality-estimator metric seems better than PageRank; its average error is smaller: average error of the Q3 estimator = 45%, average error of the P3 estimator = 74%. The distribution of error is also better.
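For concreteness, here is a sketch of how such an average-relative-error comparison can be computed; the page names and numbers below are made up and do not reproduce the experiment's data:

```python
# Sketch of the evaluation metric: average relative error of a
# predictor against the "ground truth" final popularity (PR at the
# last snapshot). All values below are hypothetical.
def avg_relative_error(estimates, truth):
    errs = [abs(estimates[p] - truth[p]) / truth[p] for p in truth]
    return sum(errs) / len(errs)

truth  = {"a": 0.10, "b": 0.40}   # stand-in for PR(t=4)
q_est  = {"a": 0.12, "b": 0.44}   # stand-in for the quality estimator
p3_est = {"a": 0.05, "b": 0.60}   # stand-in for raw PR(t=3)
print(avg_relative_error(q_est, truth))   # 0.15
print(avg_relative_error(p3_est, truth))  # 0.5
```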

Summary & Conclusions

Summary. We saw the bias created by search engines. A more desirable ranking would rank pages by quality, not popularity.

Summary. We can estimate quality from the link structure of the web (popularity and its evolution). Implementation is feasible, and only slightly different from the current PageRank system.

Summary. Experimental results show that the quality estimator is better than PageRank.

Conclusions. Problems & future work: statistical noise is not negligible for pages with low popularity, and the experiment was done on a small scale; it should be tried on a large scale.

Conclusions. Problems & future work: can we use the number of "visits" to pages to estimate popularity increase, instead of the number of incoming links? Also, the theory is based on a web-user model that does not take search engines into account, which is unrealistic in this day and age.

Follow-up suggestions. Many more interesting publications can be found on Junghoo Cho's website, such as: Estimating Frequency of Change; Shuffling the Deck: Randomizing Search Results; Automatic Identification of User Interest for Personalized Search.

Other algorithms for ranking. Extra material can be found at: popularity-algorithms/ Algorithms such as: Hubs and Authorities (HITS), HUBAVG.