1 Page Quality: In Search of an Unbiased Web Ranking Presented by: Arjun Dasgupta Adapted from slides by Junghoo Cho and Robert E. Adams SIGMOD 2005.

Presentation transcript:

2 Abstract. 1
• This research is motivated by the dominance of the Google search engine and the bias that it may introduce to users' perception of the Web.
• According to a recent study, 75% of keyword searches on the Web are handled by Google [17].

3 Abstract. 2
• [Figure: Top 15 Search Destinations, Home & Work Users, December 2004]

4 Abstract. 3
• Is there a better way of measuring the “quality” of a page than using the “popularity” of the page?
• The paper proposes a new definition of page quality based on the evolution of the Web link structure.
• The goal is to alleviate the “rich-get-richer” phenomenon and help new, high-quality pages get the attention that they deserve.

5 Outline
• Introduction
• Quality and PageRank
• Quality Estimator: Intuition
• Author’s User-Visitation Model
• Experiments
• Conclusion

7 INTRODUCTION. 1
• A page that is not indexed by Google, or that is ranked poorly by Google, is not likely to be viewed by many Web users.
• Our research is motivated by this dominance of Google and the bias that it may introduce.
• Is people's perception of the Web influenced by Google?
• What kind of bias does it introduce, and how much?
• Is there a way to reduce this bias?

8 INTRODUCTION. 2
• In this paper we first try to clarify the notion of page quality and introduce a formal definition of page quality.
• PAGE QUALITY: we define the quality of a page as the probability that a Web user will like the page enough to create a link to it when he reads the page.

9 Outline
• Introduction
• Quality and PageRank
• Quality Estimator: Intuition
• Author’s User-Visitation Model
• Experiments
• Conclusion

10 Quality and PageRank
• A problem with PageRank is that it is inherently biased against unpopular pages.
• Page quality can be a very subjective notion.
• What do we mean by page quality? How can we quantify it?
• We quantify the quality of a page as the probability that a random Web user will like the page enough to create a link to it once that user discovers the page for the first time.

11 Quality and PageRank
• Definition 1 (Page quality): the quality of a page p, Q(p), is the conditional probability that a user likes p and links to it, given that the user has just become aware of it: Q(p) = Pr(L_p | A_p).
• A_p represents the event that the user becomes newly aware of the page p.
• L_p represents the event that the user likes the page and creates a link to p.
• # For example, assuming the total number of Web users is 100, if 90 of them like page p after they read it, its quality Q(p) is 0.9.
• How do we measure the quality of page p?
• In the next section, we discuss how we can measure the quality of a page using the evolution of the Web link structure.
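
As a minimal sketch of Definition 1, quality can be read as a ratio of two counts: the users who liked and linked to the page, over the users who became aware of it. The function name `estimate_quality` and the counts are illustrative assumptions that mirror the example on the slide; they are not taken from the paper.

```python
def estimate_quality(num_aware: int, num_liked: int) -> float:
    """Estimate Q(p) as the fraction of aware users who liked (and linked to) the page."""
    if num_aware <= 0:
        raise ValueError("quality is undefined until at least one user is aware of the page")
    if not 0 <= num_liked <= num_aware:
        raise ValueError("num_liked must lie between 0 and num_aware")
    return num_liked / num_aware


# Slide example: 100 users read page p and 90 of them like it.
print(estimate_quality(num_aware=100, num_liked=90))  # 0.9
```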

12 Outline
• Introduction
• Quality and PageRank
• Quality Estimator: Intuition
• Author’s User-Visitation Model
• Experiments
• Conclusion

13 Quality Estimator: Intuition. 1
• Our main idea for quality measurement is as follows:
  (1) The quality of a page reflects how many users will like the page and create a link to it when they discover it.
  (2) The increase in the link count (or popularity) is directly proportional to the quality of the page.
• Thus, by measuring the increase in popularity, not the current popularity, we may estimate the page quality more accurately.
• We will use PageRank for the purposes of this paper because of its success as a popularity metric, but we could just as easily substitute the number of links.

14 Quality Estimator: Intuition. 2
• There are two problems with this approach:
  [1] Pages are not visited by the same number of people. To accommodate this fact, we need to divide the popularity increase by the number of visitors to the page, where number of visitors ≈ current PageRank (popularity).
  [2] The popularity of an already well-known page may not increase very much, because the page is already known to most Web users.
• These values are measured by taking multiple snapshots of the Web.
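
The normalization described for problem [1] can be sketched as follows: take the PageRank of a page in two snapshots and divide the increase by the earlier PageRank, which stands in for the number of visitors. This is only an illustration of that one step. The function name and the PageRank values are hypothetical, and the sketch does not address problem [2], so a page that is already known to most users will still show a small value here.

```python
def relative_popularity_increase(pr_earlier: float, pr_later: float) -> float:
    """Popularity increase between two snapshots, divided by the earlier PageRank.

    The earlier PageRank is used as a proxy for the number of visitors,
    following the slide's approximation (visitors ~ current PageRank).
    """
    if pr_earlier <= 0:
        raise ValueError("pr_earlier must be positive")
    return (pr_later - pr_earlier) / pr_earlier


# Hypothetical snapshot values, chosen only to show the effect of the division.
new_page = relative_popularity_increase(pr_earlier=0.001, pr_later=0.002)
popular_page = relative_popularity_increase(pr_earlier=0.100, pr_later=0.101)

# Both pages gained 0.001 in absolute PageRank, but the new page's gain is
# far larger relative to the number of visitors it had.
print(new_page > popular_page)  # True
```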

15 Outline
• Introduction
• Quality and PageRank
• Quality Estimator: Intuition
• Author’s User-Visitation Model
• Experiments
• Conclusion

16 User-Visitation Model. 1
• Basic definitions: first, we introduce two notions of popularity: (simple) popularity and visit popularity.
• Definition 2 (Popularity): We define the popularity of page p at time t, P(p, t), as the fraction of Web users who like the page.
• # For example, if 100,000 users (out of one million) like page p, its P(p, t) = 0.1.

17 User-Visitation Model. 2
• Definition 3 (Visit popularity): We define the visit popularity of a page p at time t, V(p, t), as the number of “visits” or “page views” the page gets within a unit time interval at time t.
• # For example, if 100 users visit page p1 in the unit time interval from t, and if 200 users visit page p2 in the same time period, V(p2, t) is twice as large as V(p1, t).

18 User-Visitation Model. 3
• Definition 4 (User awareness): We define the user awareness of page p at time t, A(p, t), as the fraction of Web users who are aware of p at time t (whether they like it or not).
• # For example, if 100,000 users (say, out of one million) have visited the page p1 so far and are aware of the page, its user awareness, A(p1, t), is 0.1.

19 User-Visitation Model. 4
• Lemma 1: The popularity of p at time t, P(p, t), is equal to the fraction of Web users who are aware of p at t, A(p, t), times the quality of p:
  P(p, t) = A(p, t) · Q(p)
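
Lemma 1 can be checked with a small numeric sketch. The helper names are hypothetical, the awareness figure reuses the example from Definition 4, and the quality value 0.9 reuses the example from Definition 1; combining them for one page is an assumption made only for illustration.

```python
def popularity(awareness: float, quality: float) -> float:
    """Lemma 1: P(p, t) = A(p, t) * Q(p)."""
    return awareness * quality


def quality_from(popularity_value: float, awareness: float) -> float:
    """Lemma 1 rearranged: Q(p) = P(p, t) / A(p, t), defined only when awareness > 0."""
    if awareness <= 0:
        raise ValueError("awareness must be positive")
    return popularity_value / awareness


total_users = 1_000_000
aware_users = 100_000
awareness = aware_users / total_users   # A(p, t) = 0.1
quality = 0.9                           # Q(p), assumed for this example

p = popularity(awareness, quality)
print(round(p, 3))  # 0.09
```

Read the other way, the lemma says that a page with low popularity is not necessarily of low quality: it may simply be a page few users are aware of yet.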

20 User-Visitation Model. 5
• User-visitation model: two hypotheses.
• First, note that the PageRank of page p is equivalent to the probability that a user will visit the page when the user randomly surfs the Web.
• Proposition 1 (Popularity-equivalence hypothesis): The number of visits to page p within a unit time interval at time t is proportional to how many people like the page. That is,
  V(p, t) = r · P(p, t)
  where r is a normalization constant common to all pages.

21 User-Visitation Model. 6
• Proposition 2 (Random-visit hypothesis): All Web users will visit a particular page with equal probability.
• # That is, if there exist n Web users and a page p was just visited by a user, the visit was made by any given Web user with probability 1/n.
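
To see how the two hypotheses interact with Lemma 1, here is a hedged discrete-time sketch. It is an expected-value approximation and not a formula stated on the slides: in each time step the page receives V = r · P visits (Proposition 1), each visit comes from a uniformly random user (Proposition 2), so a fraction 1 − A of those visits reach users who were not yet aware of the page. The function name, the parameter names, and all numeric values are assumptions, and repeat visits by the same user within one step are ignored.

```python
def simulate_popularity(quality, initial_awareness, visits_per_unit_popularity,
                        num_users, steps):
    """Return [P(p, 0), ..., P(p, steps)] under the two hypotheses (expected values)."""
    awareness = initial_awareness
    history = []
    for _ in range(steps + 1):
        pop = awareness * quality                    # Lemma 1: P = A * Q
        history.append(pop)
        visits = visits_per_unit_popularity * pop    # Proposition 1: V = r * P
        # Proposition 2: visitors are uniformly random users, so the expected
        # share of visits made by not-yet-aware users is (1 - awareness).
        newly_aware = visits * (1.0 - awareness)
        awareness = min(1.0, awareness + newly_aware / num_users)
    return history


history = simulate_popularity(
    quality=0.8,                          # hypothetical Q(p)
    initial_awareness=0.001,              # hypothetical A(p, 0)
    visits_per_unit_popularity=500_000,   # hypothetical r
    num_users=1_000_000,                  # hypothetical n
    steps=100,
)
print(history[0], history[-1])
```

Under these assumptions popularity starts near zero, grows as awareness spreads, and levels off near Q(p), which is consistent with the earlier observation that an already well-known page shows little further increase.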

22 Outline
• Introduction
• Quality and PageRank
• Quality Estimator: Intuition
• Author’s User-Visitation Model
• Experiments
• Conclusion

23 Experiment. 1
• If the estimator is sound, our quality estimate should be a better “predictor” of the future PageRank than the current PageRank is.
• Based on this idea, we capture multiple snapshots of the Web, compute page quality, and compare today's quality value with the PageRank values in the future.
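
The comparison described above can be sketched as an error computation between each predictor and the future PageRank. The slides do not say which error metric is used, so the mean relative error below is an assumption, as is the premise that quality estimates are scaled to be comparable with PageRank values. All numbers are made-up toy values that only show the mechanics; they are not experimental results.

```python
def mean_relative_error(predicted, actual):
    """Average of |predicted - actual| / actual over all pages."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    if any(a <= 0 for a in actual):
        raise ValueError("actual values must be positive")
    return sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)


# Toy values for three hypothetical pages (NOT data from the paper).
future_pagerank = [0.004, 0.020, 0.100]
current_pagerank = [0.001, 0.015, 0.100]
quality_estimate = [0.003, 0.022, 0.095]

error_pagerank = mean_relative_error(current_pagerank, future_pagerank)
error_quality = mean_relative_error(quality_estimate, future_pagerank)
print(error_pagerank, error_quality)
```

The predictor with the lower error is the better forecast of future PageRank on that set of pages.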

24 Outline
• Introduction
• Quality and PageRank
• Quality Estimator: Intuition
• Author’s User-Visitation Model
• Experiments
• Conclusion

25 Conclusion
• The notion of page quality was discussed.
• This idea is used on top of the PageRank algorithm to yield better-ranked results.