Overview of Web Ranking Algorithms: HITS and PageRank


Overview of Web Ranking Algorithms: HITS and PageRank April 6, 2006 Presented by: Bill Eberle

Overview: Problem; Web as a Graph; HITS; PageRank; Comparison

Problem Specific queries (scarcity problem). Broad-topic queries (abundance problem). Goal: to find the smallest set of “authoritative” sources.

Web as a Graph Web pages as nodes of a graph. Links as directed edges. [Figure: example graph whose nodes are "my page", www.uta.edu, and www.google.com]

Link Structure of the Web Forward links (out-edges). Backward links (in-edges). Approximation of importance/quality: a page may be of high quality if it is referred to by many other pages, and by pages of high quality.

HITS HITS (Hyperlink-Induced Topic Search) “Authoritative Sources in a Hyperlinked Environment”, Jon Kleinberg, Cornell University. 1998.

Authorities and Hubs An authority is a page that has relevant information about the topic. A hub is a page that has a collection of links to pages about that topic. [Figure: a hub h and authorities a1 to a4]

Authorities and Hubs (cont.) Good hubs are the ones that point to good authorities. Good authorities are the ones that are pointed to by good hubs. [Figure: hubs h1 to h5 and authorities a1 to a6]

Finding Authorities and Hubs First, construct a focused sub-graph of the WWW. Second, compute hubs and authorities from the sub-graph.

Construction of Sub-graph [Figure: diagram with the labels Topic, Search Engine, Rootset Pages, Crawler, Forward link pages, and Expanded set Pages]

Root Set and Base Set Use the query term to collect a root set of pages from a text-based search engine (AltaVista).

Root Set and Base Set (cont.) Expand the root set into a base set by including (up to a designated size cut-off): all pages linked to by pages in the root set, and all pages that link to a page in the root set.
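
The expansion of a root set into a base set can be sketched in a few lines of Python. This is only an illustration: the function name `expand_root_set`, the dictionary-based link tables, and the choice to apply the cut-off to the in-linking pages of each root page are assumptions, not details taken from the slides.

```python
def expand_root_set(root, out_links, in_links, cutoff=50):
    """Grow a root set into a base set (illustrative sketch).

    out_links maps a page to the pages it links to.
    in_links maps a page to the pages that link to it.
    cutoff caps how many in-linking pages are added per root page
    (hypothetical default; the slides only mention "a designated size cut-off").
    """
    base = set(root)
    for page in root:
        # All pages linked to by a page in the root set.
        base.update(out_links.get(page, ()))
        # Pages that link to a page in the root set, capped by the cut-off.
        parents = sorted(in_links.get(page, ()))
        base.update(parents[:cutoff])
    return base


# Toy data, invented for the example.
toy_out = {"r1": ["x", "y"], "r2": []}
toy_in = {"r1": ["p1", "p2", "p3"], "r2": ["p4"]}
print(sorted(expand_root_set(["r1", "r2"], toy_out, toy_in, cutoff=2)))
```

Capping only the in-links reflects the fact that a popular page can have a very large number of pages pointing to it, while its own out-links are bounded by the page itself.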

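Hubs & Authorities Calculation Iterative algorithm on Base Set: authority weights a(p), and hub weights h(p). Set authority weights a(p) = 1, and hub weights h(p) = 1 for all p. Repeat the following two operations (and then re-normalize a and h to have unit norm): $a(p) = \sum_{q \to p} h(q)$, the sum of the hub weights of all pages q that point to p; and $h(p) = \sum_{p \to q} a(q)$, the sum of the authority weights of all pages q that p points to. [Figure: pages v1, v2, v3 pointing to p, so a(p) = h(v1) + h(v2) + h(v3); p pointing to v1, v2, v3, so h(p) = a(v1) + a(v2) + a(v3)]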
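
A minimal sketch of this iteration follows, using the standard HITS updates from Kleinberg's paper: each authority weight becomes the sum of the hub weights of the pages linking in, and each hub weight becomes the sum of the authority weights of the pages linked to. The function names, the edge-list representation, and the fixed iteration count are assumptions made for the example.

```python
import math


def normalize(weights):
    """Scale a weight dictionary to unit Euclidean norm."""
    norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
    return {k: v / norm for k, v in weights.items()}


def hits(edges, iterations=50):
    """Run HITS on a list of (source, target) links; returns (authority, hub)."""
    nodes = sorted({n for edge in edges for n in edge})
    auth = {n: 1.0 for n in nodes}
    hub = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # a(p) = sum of h(q) over all pages q that point to p.
        new_auth = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_auth[dst] += hub[src]
        # h(p) = sum of a(q) over all pages q that p points to.
        new_hub = {n: 0.0 for n in nodes}
        for src, dst in edges:
            new_hub[src] += new_auth[dst]
        auth = normalize(new_auth)
        hub = normalize(new_hub)
    return auth, hub


# Toy graph, invented for the example: h1 links to a1 and a2, h2 links to a1.
authority, hubs = hits([("h1", "a1"), ("h1", "a2"), ("h2", "a1")])
print(max(authority, key=authority.get), max(hubs, key=hubs.get))
```

In the toy graph, a1 ends up as the strongest authority because both hubs point to it, and h1 as the strongest hub because it points to both authorities.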

Example [Figure: four-node graph; every node is labeled Hub 0.45, Authority 0.45]

Example (cont.) [Figure: four-node graph; the nodes are labeled (hub, authority) = (0.45, 0.9), (1.35, 0.9), (0.9, 0.45), and (0.45, 0.9)]

Algorithmic Outcome Applying iterative multiplication (power iteration) to any “non-degenerate” initial vector converges to the principal eigenvector. Hubs and authorities are the outcome of this process. The largest entries of the principal eigenvector correspond to the strongest hubs and authorities.

Results Although HITS is only link-based (it completely disregards page content), results are quite good in many tested queries. When the authors tested the query “search engines”, the algorithm returned Yahoo!, Excite, Magellan, Lycos, and AltaVista. However, none of these pages described themselves as a “search engine” (at the time of the experiment).

Issues Starting from a narrow topic, HITS tends to drift toward a more general one. Hub pages are a particular risk: their many links can cause the algorithm to drift, because they can point to authorities on different topics. Pages from a single domain / website can dominate the result if they all point to one page, which is not necessarily a good authority.

Possible Enhancements Use weighted sums for link calculation. Take advantage of “anchor text”, the text surrounding the link itself. Break hubs into smaller pieces and analyze each piece separately, instead of treating the whole hub page as one. Disregard or minimize the influence of links inside one domain. IBM expanded HITS into Clever, which was not seen as a viable real-time search engine.

PageRank “The PageRank Citation Ranking: Bringing Order to the Web”, Lawrence Page and Sergey Brin, Stanford University. 1998.

Basic Idea Back-links coming from important pages convey more importance to a page. For example, if a web page has a link from the Yahoo home page, it may be just one link but it is a very important one. A page has high rank if the sum of the ranks of its back-links is high. This covers both the case when a page has many back-links and when a page has a few highly ranked back-links.

Definition My page’s rank is equal to the sum of the ranks of all the pages pointing to me, with each pointing page's rank divided evenly among its outgoing links.

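Simplified PageRank Example $Rank(u) = c \sum_{v \in B_u} Rank(v) / N_v$, where Rank(u) is the rank of page u, $B_u$ is the set of pages pointing to u, $N_v$ is the number of outbound links of v, and c is a normalization constant (c < 1 to cover for pages with no outgoing links).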

Expanded Definition $R(u) = c \sum_{v \in B_u} R(v) / N_v + c E(u)$, where R(u): page rank of page u; c: factor used for normalization (<1); $B_u$: set of pages pointing to u; $N_v$: number of outbound links of v; R(v): page rank of page v that points to u; E(u): distribution over web pages to which a random surfer periodically jumps (set to 0.15).

Problem 1 - Rank Sink A cycle of pages that is pointed to by some incoming link but has no links leading out. The loop will accumulate rank but never distribute it. Rank inside the loop increases, but does not affect any rank outside.

Problem 2 - Dangling Links In general, many Web pages do not have either back links or forward links. Dangling links are links that point to a page with no outgoing links. They do not affect the ranking of any other page directly, so they are removed until all the PageRanks are calculated.

Random Surfer Model PageRank corresponds to the standing probability distribution of a random walk on the web graph.

Solution: Escape Term The escape term E(u) can be thought of as modeling a random surfer who periodically gets bored and jumps to a different page, rather than staying in the loop forever. E is a vector over all the web pages that gives each page’s escape probability (a user-defined parameter).

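PageRank Computation
- $R_0 \leftarrow S$: initialize vector over web pages
- Loop:
- $R_{i+1} \leftarrow A R_i$: new ranks are the sum of normalized backlink ranks
- $d \leftarrow \|R_i\|_1 - \|R_{i+1}\|_1$: compute normalizing factor
- $R_{i+1} \leftarrow R_{i+1} + dE$: add escape term
- $\delta \leftarrow \|R_{i+1} - R_i\|_1$: control parameter
- While $\delta > \epsilon$: stop when converged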
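
The loop above can be sketched in Python as follows. Several details are assumptions rather than anything stated in the slides: the escape distribution E is taken to be uniform, the starting vector is set equal to E, and backlink contributions are scaled by `1 - escape` (a common damped formulation), so that a fixed share of rank is redistributed through the escape term on every pass. The names `pagerank`, `escape`, and `eps` are hypothetical.

```python
def pagerank(out_links, escape=0.15, eps=1e-8, max_iter=1000):
    """Iterate PageRank until the L1 change between passes drops below eps."""
    pages = sorted(set(out_links) | {u for vs in out_links.values() for u in vs})
    n = len(pages)
    e = {p: 1.0 / n for p in pages}  # escape distribution E (uniform: an assumption)
    rank = dict(e)                   # initial vector over web pages
    for _ in range(max_iter):
        # New ranks: sum of normalized backlink ranks (damped by 1 - escape).
        new = {p: 0.0 for p in pages}
        for v in pages:
            targets = out_links.get(v, ())
            for u in targets:
                new[u] += (1.0 - escape) * rank[v] / len(targets)
        # Normalizing factor: the rank mass that was not passed along links.
        d = sum(rank.values()) - sum(new.values())
        # Add the escape term.
        for p in pages:
            new[p] += d * e[p]
        # Control parameter: L1 distance between successive rank vectors.
        delta = sum(abs(new[p] - rank[p]) for p in pages)
        rank = new
        if delta < eps:
            break
    return rank


# Toy graph, invented for the example.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print({p: round(r, 3) for p, r in ranks.items()})
```

Because the lost mass `d` is always added back through E, the ranks keep summing to 1, and pages without outgoing links do not drain rank from the system.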

Matrices A is a square matrix whose rows and columns correspond to web pages u and v. Given the matrix A and a vector R over all the Web pages, PageRank is the dominant eigenvector of A, the one associated with the maximal eigenvalue.

Example $A^T$ = [matrix omitted]

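Example (cont.) $R = cAR = MR$, with $M = cA$, so R is an eigenvector of A (with eigenvalue 1/c) and an eigenvector of M with eigenvalue 1. In general, $Ax = \lambda x$, so $(A - \lambda I)x = 0$, which has a non-zero solution x only when $|A - \lambda I| = 0$. [Numeric values of A, R, and normalized R omitted]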

Implementation 1. URL -> id 2. Store each hyperlink in a database. 3. Sort link structure by Parent id. 4. Remove dangling links. 5. Calculate the PR giving each page an initial value. 6. Iterate until convergence. 7. Add the dangling links.
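
Steps 1 to 4 can be sketched as below. The function names and the in-memory lists are assumptions for the example (the slides store links in a database), and repeating the dangling-link removal until none remain is one possible policy rather than the one the slides prescribe.

```python
def assign_ids(url_links):
    """Steps 1-3: map each URL to an integer id and sort links by parent id."""
    ids = {}
    id_links = []
    for src, dst in url_links:
        for url in (src, dst):
            ids.setdefault(url, len(ids))
        id_links.append((ids[src], ids[dst]))
    return ids, sorted(id_links)


def remove_dangling(links):
    """Step 4: drop links whose target has no outgoing links.

    Removing a link can leave another page without outgoing links,
    so the pass is repeated until nothing changes (an assumption).
    Returns (kept, removed) so the removed links can be added back in step 7.
    """
    kept, removed = list(links), []
    while True:
        parents = {parent for parent, _ in kept}
        dangling = [(p, c) for p, c in kept if c not in parents]
        if not dangling:
            return kept, removed
        removed.extend(dangling)
        kept = [(p, c) for p, c in kept if c in parents]


# Toy link list, invented for the example.
ids, links = assign_ids([("a", "b"), ("b", "a"), ("b", "c"), ("c", "d")])
kept, removed = remove_dangling(links)
print(kept, removed)
```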

Example Which of these three has the highest page rank? Page A Page B Page C

Example (cont.) Page A Page B Page C

Example (cont.) Re-write the system of equations as a matrix-vector product. The PageRank vector is simply an eigenvector (scalar*vector = matrix*vector) of the coefficient matrix.

Example (cont.) Page A: PageRank = 0.4. Page B: PageRank = 0.2. Page C: PageRank = 0.4.
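
The link structure of this example is not preserved in the transcript. One three-page graph that reproduces the values 0.4, 0.2, and 0.4 is A linking to B and C, B linking to C, and C linking to A; this graph is an assumption chosen for illustration. The sketch below applies the simplified rank formula (no escape term, c = 1) repeatedly, and the name `simple_pagerank` is hypothetical.

```python
def simple_pagerank(out_links, iterations=100):
    """Repeatedly apply Rank(u) = sum over v in B_u of Rank(v) / N_v."""
    pages = sorted(out_links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new = {p: 0.0 for p in pages}
        for v, targets in out_links.items():
            for u in targets:
                # Page v splits its rank evenly among its outgoing links.
                new[u] += rank[v] / len(targets)
        rank = new
    return rank


# Assumed link structure: A -> B, A -> C, B -> C, C -> A.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = simple_pagerank(graph)
print({p: round(r, 3) for p, r in ranks.items()})
# → {'A': 0.4, 'B': 0.2, 'C': 0.4}
```

The same result follows by hand: B receives half of A's rank, C receives the other half plus all of B's rank (so C equals A), and the three ranks sum to 1.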

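Example (cont.) [Figure: graph of pages A, B, and C, computed with d = 0.5. Table: columns Iteration, PR(A), PR(B), PR(C); rows for iterations 1 to 12; values omitted]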

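Convergence The number of iterations the PageRank computation needs to converge grows roughly as O(log(|V|)); each iteration still requires a pass over all the links.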

Other Applications Help user decide if a site is trustworthy. Estimate web traffic. Spam detection and prevention. Predict citation counts.

Issues Users are not random walkers. Starting point distribution (actual usage data as starting vector). Bias towards main pages. Linkage spam. No query-specific rank.

PageRank vs. HITS
PageRank (Google): computed for all web pages stored in the database prior to the query; computes authorities only; trivial and fast to compute.
HITS (CLEVER): performed on the set of retrieved web pages for each query; computes authorities and hubs; easy to compute, but real-time execution is hard.

References “Authoritative Sources in a Hyperlinked Environment”, Jon Kleinberg, Cornell University. 1998. “The PageRank Citation Ranking: Bringing Order to the Web”, Lawrence Page and Sergey Brin, Stanford University. 1998.