
CS246 Search Engine Bias

Junghoo "John" Cho (UCLA Computer Science)2 Motivation “If you are not indexed by Google, you do not exist on the Web” --- news.com article  People “discover” pages through search engines  Top results: many users  Bottom results: no new users  Are we biased by search engines?

Junghoo "John" Cho (UCLA Computer Science)3 Research issues  Are we biased by search engines?  Impact of Search Engines on Page Popularity  Can we avoid search engine bias?  Page Quality: In Search of an Unbiased Web Ranking  Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results

Junghoo "John" Cho (UCLA Computer Science)4 Questions to Address  Are the rich getting richer?  Web popularity evolution experiments  How much bias do search engines introduce?  Web user models and popularity evolution analysis  Any potential solution to the problem?  Less biased ranking metric  Introducing randomness to search results

Junghoo "John" Cho (UCLA Computer Science)5 Web Evolution Experiment  Collect Web history data  Is “rich-get-richer” happening?  From Oct until Oct  154 sites monitored  Top sites from each category of Open Directory  Pages downloaded every week  All pages in each site  About 4M pages on average every week (65GB)

Junghoo "John" Cho (UCLA Computer Science)6 “Rich-Get-Richer” Problem  Construct weekly Web-link graph  From the downloaded data  Partition pages into 10 groups  Based on initial link popularity  Top 10% group, 10%-20% group, etc.  How many new links to each group after a month?  Rich-get-richer  More new links to top groups
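As a hypothetical sketch of this measurement, the grouping and new-link counting might look like the following (the snapshot names and the dict-of-sets graph representation are illustrative, not from the paper):

```python
# Hypothetical sketch of the "rich-get-richer" measurement described above.
# Assumes two link-graph snapshots taken a month apart, each a dict mapping
# page -> set of pages linking to it. All names are illustrative.

def new_links_by_popularity_group(snapshot_t0, snapshot_t1, n_groups=10):
    """Partition pages into groups by initial in-link count, then count
    how many new in-links each group gained between the two snapshots."""
    # Rank pages by initial link popularity (in-link count), highest first.
    ranked = sorted(snapshot_t0, key=lambda p: len(snapshot_t0[p]), reverse=True)
    group_size = max(1, len(ranked) // n_groups)
    gains = [0] * n_groups
    for i, page in enumerate(ranked):
        group = min(i // group_size, n_groups - 1)
        # Links present in the later snapshot but not the earlier one.
        new_links = snapshot_t1.get(page, set()) - snapshot_t0[page]
        gains[group] += len(new_links)
    return gains  # gains[0] = new links to the top-10% group, etc.
```

A skewed `gains` list (most new links landing in `gains[0]`) is what the "rich-get-richer" result on the next slide reports.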

Junghoo "John" Cho (UCLA Computer Science)7 Result: Simple Link Count  After 7 months  70% of new links to the top 20% group  No new links to bottom 60% groups

Junghoo "John" Cho (UCLA Computer Science)8 Result: PageRank  After 7 months  Decrease in PageRank for bottom 50% pages  Due to normalization of PageRank

Junghoo "John" Cho (UCLA Computer Science)9 Impact of Search Engines  Yes, the rich seem to get richer, but is it because of search engines?  Even further, is it really a “bias”?  A study of search-engine bias is necessary

Junghoo "John" Cho (UCLA Computer Science)10 Search Engine Bias  What do we mean by bias?  What is the ideal ranking? How do search engines rank pages?

Junghoo "John" Cho (UCLA Computer Science)11 What is the Ideal Ranking?  Rank by intrinsic “quality” of a page?  Very subjective notion  Different quality judgment on the same page  Can there be an “objective” definition?

Junghoo "John" Cho (UCLA Computer Science)12 Page Quality Q(p)  The probability that an average Web user will like page p if he looks at it  In principle, we can measure Q(p) by 1. showing p to all Web users and 2. counting how many people like it  p1: 10,000 people, 8,000 liked it, Q(p1) = 0.8  p2: 10,000 people, 2,000 liked it, Q(p2) = 0.2  Democratic measure of quality  When consensus is hard to reach, pick the one that more people like

Junghoo "John" Cho (UCLA Computer Science)13 PageRank: Practical Ranking  A page is “important” if many pages link to it  Not every link is equal  A link from an “important” page matters more than others  PR(pi) = (1 - d) + d [PR(p1)/c1 + · · · + PR(pm)/cm]  Random-Surfer Model  When users follow links randomly, PR(pi) is the probability to reach pi
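A minimal power-iteration sketch of the formula on this slide (the graph, damping factor d = 0.85, and iteration count are illustrative; the code assumes every page has at least one out-link):

```python
# Power-iteration sketch of the PageRank formula on the slide:
# PR(p) = (1 - d) + d * sum(PR(q)/c_q for each page q linking to p),
# where c_q is the number of out-links of q.

def pagerank(links, d=0.85, iters=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}  # start every page at PR = 1
    for _ in range(iters):
        new = {}
        for p in pages:
            # Contribution from every page q that links to p.
            incoming = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - d) + d * incoming
        pr = new
    return pr

# Usage: a 3-page cycle is perfectly symmetric, so every page ends at PR = 1.
scores = pagerank({"a": ["b"], "b": ["c"], "c": ["a"]})
```

This is the non-normalized variant on the slide (scores sum to the number of pages, not to 1), which is why the random-surfer reading is "probability to reach p" only after rescaling.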

Junghoo "John" Cho (UCLA Computer Science)14 PageRank vs. Quality  PageRank ~ Page quality if everyone is given equal chance  High PageRank  high quality  To obtain high PageRank, many people should look at the page and like it.  Low PageRank  low quality?  PageRank is biased against new pages  How much bias for low PageRank pages?

Junghoo "John" Cho (UCLA Computer Science)15 Measuring Search Engine Bias  Ideal experiment: Divide the world into two groups  The users who do not use search engines  The users who use search engines very heavily  Compare popularity evolution  Problem: Difficult to conduct in practice

Junghoo "John" Cho (UCLA Computer Science)16 Theoretical Web-User Model  Let us do theoretical experiments!  Random-surfer model  Users follow links randomly  Never use search engines  Search-dominant model  Users always start with a search engine  Only visit pages returned by the search engine  Compare popularity evolution

Junghoo "John" Cho (UCLA Computer Science)17 Basic Definitions  Simple popularity P(p,t)  Fraction of Web users who like p at time t  E.g., 100,000 users, 10,000 like p, P(p,t)=0.1  Visit popularity V(p,t)  # users that visit p in a unit time  Awareness A(p,t)  Fraction of Web users who are aware of p  E.g., 100,000 users, 30,000 aware of p, A(p,t)=0.3  P(p,t) = Q(p) A(p,t)

Junghoo "John" Cho (UCLA Computer Science)18 Random-Surfer Model  Popularity-equivalence hypothesis  V(p,t) = r P(p,t) (r: proportionality constant)  Rationale: PageRank is visit popularity under the random-surfer model  Random-visit hypothesis  Each visit is made by a user chosen with equal probability  Simplifying assumption

Junghoo "John" Cho (UCLA Computer Science)19 Random-Surfer Model: Analysis  Current popularity P(p,t)  Number of visitors from V(p,t) = r P(p,t)  Awareness increase ∆A(p,t)  Popularity increase ∆P(p,t)  New popularity P(p,t+1)  Formal analysis: Differential equation
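The analysis steps above can be sketched as a discrete-time simulation under the two hypotheses from the previous slide (V = r·P and random visits); all parameter values here are illustrative:

```python
# Hedged discrete-time sketch of the random-surfer analysis:
#   visits per step: V = r * P (popularity-equivalence hypothesis)
#   each visit reaches a uniformly random user (random-visit hypothesis)
#   popularity: P = Q * A
# so awareness grows by (r/n) * P * (1 - A) per unit time.

def simulate_random_surfer(Q, P0, r_over_n=1.0, steps=40, dt=1.0):
    """Return the popularity trajectory P(p,t) of a single page."""
    A = P0 / Q  # initial awareness, since P = Q * A
    trajectory = [P0]
    for _ in range(steps):
        P = Q * A
        # Fraction of this step's r*P visits that reach still-unaware users.
        A = min(1.0, A + r_over_n * P * (1 - A) * dt)
        trajectory.append(Q * A)
    return trajectory

# Usage with the slide's parameters: Q(p)=1, P(p,0)=1e-8, r/n=1.
traj = simulate_random_surfer(Q=1.0, P0=1e-8)
```

The trajectory is the S-curve of the logistic differential equation mentioned on the slide: a long flat "expansion" phase while awareness is tiny, then rapid growth, then saturation at Q(p).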

Junghoo "John" Cho (UCLA Computer Science)20 Random-Surfer Model: Result  The popularity of page p evolves over time as  Q(p): quality of p  P(p,0): initial popularity of p at time zero  n: total number of Web users  r: proportionality constant
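The equation itself appeared as an image and did not survive the transcript. From the hypotheses above (V = rP, random visits, P = Q·A), awareness follows the logistic equation n·dA/dt = r·Q·A(1−A), whose solution gives the following reconstruction (worth checking against the original paper):

```latex
P(p,t) \;=\; \frac{Q(p)}{1 + \left( \dfrac{Q(p)}{P(p,0)} - 1 \right) e^{-\frac{r\,Q(p)}{n}\,t}}
```

At t = 0 this reduces to P(p,0), and as t → ∞ it converges to Q(p), matching the S-shaped popularity-evolution curve on the next slide.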

Junghoo "John" Cho (UCLA Computer Science)21 Random-Surfer Model: Popularity Evolution (Plot parameters: Q(p) = 1, P(p,0) = 10^-8, r/n = 1)

Junghoo "John" Cho (UCLA Computer Science)22 Search-Dominant Model  How does V(p,t) relate to P(p,t)?  For the i-th result, how many clicks?  For PageRank P(p,t), what ranking?  Empirical measurements  New visit-popularity hypothesis: V(p,t) = r P(p,t)^{9/4}  Random-visit hypothesis
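Plugging the superlinear visit law into the same awareness update lets one compare how long a page takes to approach its final popularity under each model. This is a sketch under the stated hypotheses; the parameters are illustrative, and the exact 66× figure on the next slide depends on the paper's own parameter choices.

```python
# Compare time-to-popularity under the two user models:
#   random surfer:   V = r * P        (exponent 1)
#   search dominant: V = r * P^(9/4)  (empirical superlinear visit law)

def time_to_popularity(Q, P0, exponent, target=0.99, r_over_n=1.0,
                       dt=0.01, max_steps=1_000_000):
    """Simulated time until popularity reaches target*Q (or inf)."""
    A = P0 / Q  # initial awareness, since P = Q * A
    for step in range(max_steps):
        if A >= target:
            return step * dt
        P = Q * A
        # Visits this step: r * P^exponent; only unaware users add awareness.
        A = min(1.0, A + r_over_n * (P ** exponent) * (1 - A) * dt)
    return float("inf")

# A page starting at low popularity takes far longer to take off when
# visits scale as P^(9/4) than as P.
t_random = time_to_popularity(1.0, 0.01, exponent=1.0)
t_search = time_to_popularity(1.0, 0.01, exponent=9 / 4)
```

The superlinear exponent starves low-popularity pages of visits, which is exactly the "expansion stage is nonexistent" observation on the comparison slide.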

Junghoo "John" Cho (UCLA Computer Science)23 Search-Dominant Model: Popularity Evolution (Same parameters as before)

Junghoo "John" Cho (UCLA Computer Science)24 Comparison of Two Models  Time to reach final popularity: 66 times longer under the search-dominant model!  Expansion stage  Random surfer: 12 time units  Search dominant: nonexistent

Junghoo "John" Cho (UCLA Computer Science)25 Reducing the Bias?  Many possibilities!  Can we measure quality?  Will randomness help?  Show some random pages in search results  Give a new page a chance

Junghoo "John" Cho (UCLA Computer Science)26 Measuring Quality: Basic Idea  Quality: probability of link creation by a new visitor  Assuming the same number of visitors  Q(p) → # of new links (or popularity increase) Quality estimator: Q(p) = ΔP(p)

Junghoo "John" Cho (UCLA Computer Science)27 Measuring Quality: Problem (1)  Different number of visitors to each page  More visitors to more popular pages  How to account for # of visitors? Quality estimator: Q(p) = ΔP(p) / P(p)  Idea: PageRank = # of visitors  Divide by current PageRank

Junghoo "John" Cho (UCLA Computer Science)28 Measuring Quality: Problem (2)  No more new links to very popular pages  Everyone already knows them  ΔP(p) / P(p) ≈ 0 for well-known pages  How to account for well-known pages? Quality estimator: Q(p) = ΔP(p) / P(p) + C · P(p)  Idea: P(p) = Q(p) when everyone knows p  Use P(p) to measure Q(p) for well-known pages
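A hedged sketch of the full estimator assembled over the last three slides (the constant C and the PageRank values below are illustrative, not from the paper):

```python
# Quality estimator from the slides: Q(p) = dP(p)/P(p) + C * P(p),
# i.e. relative popularity growth plus a correction term so that
# already-saturated, well-known pages are not penalized for flat growth.

def quality_estimate(pr_now, pr_later, C=1.0):
    """Estimate page quality from two PageRank (popularity) snapshots."""
    growth = (pr_later - pr_now) / pr_now  # dP/P: growth per current visitor
    return growth + C * pr_now             # C*P: floor for well-known pages

# A small new page that doubled its PageRank scores higher than a
# larger page whose PageRank stayed flat.
q_new = quality_estimate(pr_now=0.001, pr_later=0.002)  # growth-dominated
q_old = quality_estimate(pr_now=0.010, pr_later=0.010)  # saturation term only
```

The division by `pr_now` is the Problem (1) fix (normalize by current visitors); the `C * pr_now` term is the Problem (2) fix (use P(p) itself once everyone knows the page).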

Junghoo "John" Cho (UCLA Computer Science)29 Quality Estimator: Theory  Under the random-surfer model, Q(p) is  Essentially the same as the previous formula Q(p) = ΔP(p) / P(p) + C · P(p)

Junghoo "John" Cho (UCLA Computer Science)30 Is Quality Estimator Effective?  How to measure its effectiveness?  Implement it in a major search engine?  Any other alternatives?  Idea  Pages eventually obtain their deserved popularity (however long it may take…)  “Future” PageRank ~ Q(p)

Junghoo "John" Cho (UCLA Computer Science)31 Quality Estimator: Evaluation  Q(p) as a predictor of future PageRank PR’(p)  Compare the correlations of  Current Q(p) with future PageRank  Current PageRank PR(p) with future PageRank  Does Q(p) predict future PageRank better?  Experiments  Download the Web multiple times with a long interval

Junghoo "John" Cho (UCLA Computer Science)32 Quality Estimator: Evaluation  Compare relative error  Result  For Q(p): err(p) = 0.45  For PR(p): err(p) = 0.74  Q(p) is significantly better than PR(p)

Junghoo "John" Cho (UCLA Computer Science)33 Quality Estimator: Detail

Junghoo "John" Cho (UCLA Computer Science)34 Randomization  Let us give new pages a chance to prove themselves  Introduce randomness in search results  Say, 10% of results are randomly selected from new pages  Why is randomization good?  New high-quality pages will be promoted quickly  But is it really important?  Counter argument  Most new pages are bad  Why should users bother looking at them?
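The 10%-randomization idea above could be sketched as follows (the function, list names, and the 10% default are illustrative, not from any real engine):

```python
# Sketch of partially randomized ranking: reserve a slice (say 10%) of the
# top-k result list for randomly chosen new pages, giving them a chance to
# be discovered and prove themselves.
import random

def randomized_results(ranked_pages, new_pages, k=10, random_fraction=0.1):
    """Return k results: top-ranked pages plus a random sample of new pages."""
    n_random = int(k * random_fraction)
    sampled = random.sample(new_pages, min(n_random, len(new_pages)))
    # Fill the remaining slots with the best-ranked pages not already chosen.
    top = [p for p in ranked_pages if p not in sampled][: k - len(sampled)]
    results = top + sampled
    random.shuffle(results)  # avoid always burying the random picks last
    return results
```

Whether the final shuffle is desirable is a design choice: mixing positions gives the random picks real exposure, at the cost of occasionally demoting a high-ranked result.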

Junghoo "John" Cho (UCLA Computer Science)35 Average Quality Per Click  Bottom line: User’s satisfaction  Make sure users like the pages they click  Tradeoff of randomization  Positive: High-quality new pages will become popular more quickly  Improvement in search quality  Negative: Randomly selected pages are likely to be of low quality  Decrease in search quality

Junghoo "John" Cho (UCLA Computer Science)36 Exploration/Exploitation Tradeoff

Junghoo "John" Cho (UCLA Computer Science)37 Joke Experiments (1)  Ranked list of jokes  Users click on a link and read a joke  Provide positive or negative feedback  “Simulated search”  Two ranked lists and user groups 1. Popularity-based: ranked by # of positive votes 2. Popularity + randomization  1000 users participated

Junghoo "John" Cho (UCLA Computer Science)38 Joke Experiments (2)  Ranking determines the popularity evolution of jokes  Compare the evolution and evaluate  Evaluation metric  Fraction of positive user votes  Result  Popularity only: 0.2  Popularity + randomization: 0.35

Junghoo "John" Cho (UCLA Computer Science)39 More Analytical Study  Based on search-dominant user model  But pages get created and deleted over time  In most cases, 10-20% randomization is helpful  Optimal randomness depends on exact parameter settings

Junghoo "John" Cho (UCLA Computer Science)40 Summary  Search engine bias  Do search engines make popular pages more popular?  Experimental and analytical study  Strong possibility  Possible solutions  Less biased ranking metric  Randomization in search results