PageRank Random Surfers on the Web Transition Matrix of the Web Dead Ends and Spider Traps Topic-Specific PageRank Hubs and Authorities Jeffrey D. Ullman.

Slides:

Advertisements

Similar presentations

Advertisements

Matrices, Digraphs, Markov Chains & Their Use by Google Leslie Hogben Iowa State University and American Institute of Mathematics Leslie Hogben Iowa State.

Link Analysis: PageRank and Similar Ideas. Recap: PageRank Rank nodes using link structure PageRank: – Link voting: P with importance x has n out-links,

CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.

Link Analysis: PageRank

Our purpose Giving a query on the Web, how can we find the most authoritative (relevant) pages?

CS345 Data Mining Page Rank Variants. Review Page Rank  Web graph encoded by matrix M N £ N matrix (N = number of web pages) M ij = 1/|O(j)| iff there.

How PageRank Works Ketan Mayer-Patel University of North Carolina January 31, 2011.

11 - Markov Chains Jim Vallandingham.

DATA MINING LECTURE 12 Link Analysis Ranking Random walks.

Introduction to PageRank Algorithm and Programming Assignment 1 CSC4170 Web Intelligence and Social Computing Tutorial 4 Tutor: Tom Chao Zhou

Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.

Page Rank.  Intuition: solve the recursive equation: “a page is important if important pages link to it.”  Maximailly: importance = the principal eigenvector.

Lexicon/dictionary DIC Inverted Index Allows quick lookup of document ids with a particular word Stanford UCLA MIT … PL(Stanford) PL(UCLA)

Link Analysis, PageRank and Search Engines on the Web

Scaling Personalized Web Search Glen Jeh, Jennfier Widom Stanford University Presented by Li-Tal Mashiach Search Engine Technology course (236620) Technion.

1 Evaluating the Web PageRank Hubs and Authorities.

1 Evaluating the Web PageRank Hubs and Authorities.

CS345 Data Mining Link Analysis Algorithms Page Rank Anand Rajaraman, Jeffrey D. Ullman.

PageRank Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata October 27, 2014.

CS345 Data Mining Link Analysis 2: Topic-Specific Page Rank Hubs and Authorities Spam Detection Anand Rajaraman, Jeffrey D. Ullman.

Link Analysis HITS Algorithm PageRank Algorithm.

CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:

Motivation When searching for information on the WWW, user perform a query to a search engine. The engine return, as the query’s result, a list of Web.

Cloud and Big Data Summer School, Stockholm, Aug., 2015 Jeffrey D. Ullman.

Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα

The PageRank Citation Ranking: Bringing Order to the Web Presented by Aishwarya Rengamannan Instructor: Dr. Gautam Das.

Piyush Kumar (Lecture 2: PageRank) Welcome to COT5405.

1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:

CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.

Web Intelligence Web Communities and Dissemination of Information and Culture on the www.

1 Page Rank uIntuition: solve the recursive equation: “a page is important if important pages link to it.” uIn technical terms: compute the principal eigenvector.

Lecture 2: Extensions of PageRank

PageRank. s1s1 p 12 p 21 s2s2 s3s3 p 31 s4s4 p 41 p 34 p 42 p 13 x 1 = p 21 p 34 p 41 + p 34 p 42 p 21 + p 21 p 31 p 41 + p 31 p 42 p 21 / Σ x 2 = p 31.

Ch 14. Link Analysis Padmini Srinivasan Computer Science Department

Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 

CS345 Data Mining Link Analysis 2: Topic-Specific Page Rank Hubs and Authorities Spam Detection Anand Rajaraman, Jeffrey D. Ullman.

- Murtuza Shareef Authoritative Sources in a Hyperlinked Environment More specifically “Link Analysis” using HITS Algorithm.

1 CS 430: Information Discovery Lecture 5 Ranking.

Link Analysis Algorithms Page Rank Slides from Stanford CS345, slightly modified.

Ljiljana Rajačić. Page Rank Web as a directed graph  Nodes: Web pages  Edges: Hyperlinks 2 / 25 Ljiljana Rajačić.

Google's Page Rank. Google Page Ranking “The Anatomy of a Large-Scale Hypertextual Web Search Engine” by Sergey Brin and Lawrence Page

CS 440 Database Management Systems Web Data Management 1.

CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.

GRAPH AND LINK MINING 1. Graphs - Basics 2 Undirected Graphs Undirected Graph: The edges are undirected pairs – they can be traversed in any direction.

1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.

Web Mining Link Analysis Algorithms Page Rank. Ranking web pages  Web pages are not equally “important” v  Inlinks.

Jeffrey D. Ullman Stanford University. 3  Mutually recursive definition:  A hub links to many authorities;  An authority is linked to by many hubs.

Jeffrey D. Ullman Stanford University.  Web pages are important if people visit them a lot.  But we can’t watch everybody using the Web.  A good surrogate.

Motivation Modern search engines for the World Wide Web use methods that require solving huge problems. Our aim: to develop multiscale techniques that.

Combatting Web Spam Dealing with Non-Main-Memory Web Graphs SimRank

15-499:Algorithms and Applications

Search Engines and Link Analysis on the Web

PageRank Random Surfers on the Web Transition Matrix of the Web Dead Ends and Spider Traps Topic-Specific PageRank Jeffrey D. Ullman Stanford University.

Link Analysis 2 Page Rank Variants

7CCSMWAL Algorithmic Issues in the WWW

DTMC Applications Ranking Web Pages & Slotted ALOHA

Degree and Eigenvector Centrality

Lecture 22 SVD, Eigenvector, and Web Search

Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

CS 440 Database Management Systems

Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

Solving Linear Equations

Junghoo “John” Cho UCLA

Graph and Link Mining.

Junghoo “John” Cho UCLA

Lecture 22 SVD, Eigenvector, and Web Search

Lecture 22 SVD, Eigenvector, and Web Search

Link Analysis II Many slides are borrowed from Stanford Data Mining Class taught by Drs Anand Rajaraman, Jeffrey D. Ullman, and Jure Leskovec.

Link Analysis Many slides are borrowed from Stanford Data Mining Class taught by Drs Anand Rajaraman, Jeffrey D. Ullman, and Jure Leskovec.

Presentation transcript:

PageRank Random Surfers on the Web Transition Matrix of the Web Dead Ends and Spider Traps Topic-Specific PageRank Hubs and Authorities Jeffrey D. Ullman Stanford University

Intuition – (1) Web pages are important if people visit them a lot. But we can’t watch everybody using the Web. A good surrogate for visiting pages is to assume people follow links randomly. Leads to random surfer model: Start at a random page and follow random out-links repeatedly, from whatever page you are at. PageRank = limiting probability of being at a page.

Intuition – (2) Solve the recursive equations: “importance of a page = its share of the importance of each of its predecessor pages.” Equivalent to the random-surfer definition of PageRank. Technically, importance = the principal eigenvector of the transition matrix of the Web. A few fixups needed.

Transition Matrix of the Web Number the pages 1, 2,… . Page i corresponds to row and column i. M [i, j] = 1/n if page j links to n pages, including page i ; 0 if j does not link to i. M [i, j] is the probability a surfer will next be at page i if it is now at page j. Or it is the share of j’s importance that i receives.

Example: Transition Matrix Suppose page j links to 3 pages, including i but not x. j i 1/3 x Called a stochastic matrix = “all columns sum to 1.”

Random Walks on the Web Suppose v is a vector whose i th component is the probability that a random surfer is at page i at a certain time. If a surfer chooses a successor page from page i at random, the probability distribution for surfers is then given by the vector Mv.

Random Walks – (2) Starting from any vector u, the limit M (M (…M (M u ) …)) is the long-term distribution of the surfers. The math: limiting distribution = principal eigenvector of M = PageRank. Note: If v is the limit of MM…Mu, then v satisfies the equation v = Mv, so v is an eigenvector of M with eigenvalue 1.

Example: The Web in 1839 y a m y 1/2 1/2 0 a 1/2 0 1 m 0 1/2 0 Yahoo Amazon M’soft

Solving The Equations Because there are no constant terms, the equations v = Mv do not have a unique solution. Example: doubling each component of solution v yields another solution. In Web-sized examples, we cannot solve by Gaussian elimination anyway; we need to use relaxation (= iterative solution).

Simulating a Random Walk Start with the vector u = [1, 1,…, 1] representing the idea that each Web page is given one unit of importance. Note: it is more common to start with each vector element = 1/N, where N is the number of Web pages and to keep the sum of the elements at 1. Question for thought: Why such small values? Repeatedly apply the matrix M to u, allowing the importance to flow like a random walk. About 50 iterations is sufficient to estimate the limiting solution.

Example: Iterating Equations Equations v = Mv: y = y /2 + a /2 a = y /2 + m m = a /2 Note: “=” is really “assignment.” y a = m 1 1 3/2 1/2 5/4 1 3/4 9/8 11/8 1/2 6/5 3/5 . . .

The Surfers Yahoo Amazon M’soft

The Surfers Yahoo Amazon M’soft

The Surfers Yahoo Amazon M’soft

The Surfers Yahoo Amazon M’soft

In the Limit … Yahoo Amazon M’soft

Dead Ends Spider Traps Taxation Policies The Web Is More Complex Than That Dead Ends Spider Traps Taxation Policies We’re now going to forget whether the sets we deal with come from shingled documents or any other source, and concentrate on the sets themselves. We’ll learn the formal definition of “similarity” that is commonly used for sets – this notion is called “Jaccard similarity.” We’ll then learn how to construct signatures from sets using the minhashing technique, and we’ll prove the strong relationship between the similarity of the signatures and the sets they represent.

Real-World Problems Some pages are dead ends (have no links out). Such a page causes importance to leak out, or surfers to disappear. Other groups of pages are spider traps (all out-links are within the group). Eventually spider traps absorb all importance; all surfers get stuck in the trap.

Microsoft Becomes Dead End y 1/2 1/2 0 a 1/2 0 0 m 0 1/2 0 y a m Yahoo A substochastic matrix = “all columns sum to at most 1.” Amazon M’soft

Example: Effect of Dead Ends Equations v = Mv: y = y /2 + a /2 a = y /2 m = a /2 y a = m 1 1 1/2 3/4 1/2 1/4 5/8 3/8 1/4 . . .

Microsoft Becomes a Dead End Yahoo Amazon M’soft

Microsoft Becomes a Dead End Yahoo Amazon M’soft

Microsoft Becomes a Dead End Yahoo Amazon M’soft

Microsoft Becomes a Dead End Yahoo Amazon M’soft

In the Limit … Yahoo Amazon M’soft

M’soft Becomes Spider Trap y 1/2 1/2 0 a 1/2 0 0 m 0 1/2 1 y a m Yahoo Amazon M’soft

Example: Effect of Spider Trap Equations v = Mv: y = y /2 + a /2 a = y /2 m = a /2 + m y a = m 1 1 1/2 3/2 3/4 1/2 7/4 5/8 3/8 2 3 . . .

Microsoft Becomes a Spider Trap Yahoo Amazon M’soft

Microsoft Becomes a Spider Trap Yahoo Amazon M’soft

Microsoft Becomes a Spider Trap Yahoo Amazon M’soft

In the Limit … Yahoo Amazon M’soft

PageRank Solution to Traps, Etc. “Tax” each page a fixed percentage at each iteration. Add a fixed constant to all pages. Optional but useful: add exactly enough to balance the loss (tax + PageRank of dead ends). Models a random walk with a fixed probability of leaving the system, and a fixed number of new surfers injected into the system at each step. Divided equally among all pages.

Example: Microsoft is a Spider Trap; 20% Tax Equations v = 0.8(Mv) + 0.2: y = 0.8(y/2 + a/2) + 0.2 a = 0.8(y/2) + 0.2 m = 0.8(a/2 + m) + 0.2 Note: amount injected is chosen to balance the tax. If we started with 1/3 for each rather than 1, the 0.2 would be replaced by 0.0667. y a = m 1 1.00 0.60 1.40 0.84 0.60 1.56 0.776 0.536 1.688 7/11 5/11 21/11 . . .

Topic-Specific PageRank Focusing on Specific Pages Teleport Sets Interpretation as a Random Walk We’re now going to forget whether the sets we deal with come from shingled documents or any other source, and concentrate on the sets themselves. We’ll learn the formal definition of “similarity” that is commonly used for sets – this notion is called “Jaccard similarity.” We’ll then learn how to construct signatures from sets using the minhashing technique, and we’ll prove the strong relationship between the similarity of the signatures and the sets they represent.

Topic-Specific Page Rank Goal: Evaluate Web pages not just by popularity, but also by relevance to a particular topic, e.g. “sports” or “history.” Allows search queries to be answered based on interests of the user. Example: Search query [jaguar] wants different pages depending on whether you are interested in automobiles, nature, or sports. Might discover interests by browsing history, bookmarks, e.g.

Teleport Sets Assume each surfer has a small probability of “teleporting” at any tick. Teleport can go to: Any page with equal probability. As in the “taxation” scheme. A set of “relevant” pages (teleport set). For topic-specific PageRank. Note: can also inject surfers to compensate for surfers lost at dead ends. Or imagine a surfer always teleports from a dead end.

Example: Topic = Software Only Microsoft is in the teleport set. Assume 20% “tax.” I.e., probability of a teleport is 20%.

Only Microsoft in Teleport Set Yahoo Dr. Who’s phone booth. Amazon M’soft

Only Microsoft in Teleport Set Yahoo Amazon M’soft

Only Microsoft in Teleport Set Yahoo Amazon M’soft

Only Microsoft in Teleport Set Yahoo Amazon M’soft

Only Microsoft in Teleport Set Yahoo Amazon M’soft

Only Microsoft in Teleport Set Yahoo Amazon M’soft

Only Microsoft in Teleport Set Yahoo Amazon M’soft

Picking the Teleport Set One option is to choose the pages belonging to the topic in Open Directory. Another option is to “learn,” from a training set (which could be Open Directory), the typical words in pages belonging to the topic; use pages heavy in those words as the teleport set.

Application: Link Spam Spam farmers create networks of millions of pages designed to focus PageRank on a few undeserving pages. We’ll discuss this technology shortly. To minimize their influence, use a teleport set consisting of trusted pages only. Example: home pages of universities.

Hubs Authorities Solving the Implied Recursion HITS Hubs Authorities Solving the Implied Recursion We’re now going to forget whether the sets we deal with come from shingled documents or any other source, and concentrate on the sets themselves. We’ll learn the formal definition of “similarity” that is commonly used for sets – this notion is called “Jaccard similarity.” We’ll then learn how to construct signatures from sets using the minhashing technique, and we’ll prove the strong relationship between the similarity of the signatures and the sets they represent.

Hubs and Authorities (“HITS”) Mutually recursive definition: A hub links to many authorities; An authority is linked to by many hubs. Authorities turn out to be places where information can be found. Example: course home pages. Hubs tell where the authorities are. Example: departmental course-listing page.

Transition Matrix A HITS uses a matrix A[i, j] = 1 if page i links to page j, 0 if not. AT, the transpose of A, is similar to the PageRank matrix M, but AT has 1’s where M has fractions. Also, HITS uses column vectors h and a representing the degrees to which each page is a hub or authority, respectively. Computation of h and a is similar to the iterative way we compute PageRank.

Example: H&A Transition Matrix y 1 1 1 a 1 0 1 m 0 1 0 y a m Yahoo A = Amazon M’soft

Using Matrix A for HITS Powers of A and AT have elements whose values grow exponentially with the exponent, so we need scale factors λ and μ. Let h and a be column vectors measuring the “hubbiness” and authority of each page. Equations: h = λAa; a = μAT h. Hubbiness = scaled sum of authorities of successor pages (out-links). Authority = scaled sum of hubbiness of predecessor pages (in-links).

Consequences of Basic Equations From h = λAa; a = μAT h we can derive: h = λμAAT h a = λμATA a Compute h and a by iteration, assuming initially each page has one unit of hubbiness and one unit of authority. Technically, these equations let you solve for λμ as well as h and a. In practice, you don’t fix λμ, but rather scale the result at each iteration. Example: scale to keep largest value at 1.

Scale Doesn’t Matter Remember: it is only the direction of the vectors, or the relative hubbiness and authority of Web pages that matters. As for PageRank, the only reason to worry about scale is so you don’t get overflows or underflows in the values as you iterate.

Example: Iterating H&A a = λμATA a; h = λμAAT h 1 1 1 A = 1 0 1 0 1 0 1 1 0 AT = 1 0 1 3 2 1 AAT= 2 2 0 1 0 1 2 1 2 ATA= 1 2 1 . . . a(yahoo) a(amazon) a(m’soft) = 1 5 4 24 18 114 84 1+3 2 h(yahoo) = 1 h(amazon) = 1 h(microsoft) = 1 6 4 2 28 20 8 132 96 36 . . . 1.000 0.735 0.268

A Better Way to Solve HITS Start with h = [1,1,…,1]; multiply by AT to get first a; scale so largest component = 1; then multiply by A to get next h, and repeat until approximate convergence. You may be tempted to compute AAT and ATA first, then iterate multiplication by these matrices, as for PageRank. Question for thought: Why was the separate calculations of h and a actually less efficient than the method suggested above.