Evaluating the Web: PageRank, Hubs and Authorities

Presentation transcript:

1 Evaluating the Web: PageRank, Hubs and Authorities

2 Page Rank
- Intuition: solve the recursive equation "a page is important if important pages link to it."
- In high-falutin' terms: importance = the principal eigenvector of the stochastic matrix of the Web.
  - A few fixups are needed.

3 Stochastic Matrix of the Web
- Enumerate the pages. Page i corresponds to row and column i.
- M[i, j] = 1/n if page j links to n pages, including page i; M[i, j] = 0 if j does not link to i.
  - This seems backwards, but it allows multiplication by M on the left to represent "follow a link."
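As an illustration, here is a minimal Python sketch of how such a matrix might be built from an adjacency list; the function name and data layout are assumptions, and the three-page graph is the Yahoo/Amazon/Microsoft example used on the later slides.

    def stochastic_matrix(out_links, pages):
        """out_links maps each page to the list of pages it links to."""
        index = {p: k for k, p in enumerate(pages)}
        size = len(pages)
        M = [[0.0] * size for _ in range(size)]
        for j, targets in out_links.items():
            if not targets:              # a dead end leaves its column all zeros
                continue
            share = 1.0 / len(targets)
            for i in targets:
                M[index[i]][index[j]] = share
        return M

    pages = ["y", "a", "m"]
    out_links = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}
    M = stochastic_matrix(out_links, pages)
    # M == [[0.5, 0.5, 0.0], [0.5, 0.0, 1.0], [0.0, 0.5, 0.0]]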

4 Example
[Figure: page j links to 3 pages, one of which is page i, so M[i, j] = 1/3.]

5 Random Walks on the Web
- Suppose v is a vector whose i-th component is the probability that we are at page i at a certain time.
- If we follow a link from i chosen at random, the probability distribution for the page we are at next is given by the vector Mv.

6 Random Walks --- (2)
- Starting from any vector v, the limit M(M(...M(Mv)...)) is the distribution of page visits during a random walk.
- Intuition: pages are important in proportion to how often a random walker would visit them.
- The math: limiting distribution = principal eigenvector of M = PageRank.

7 Example: The Web in 1839
Yahoo links to itself and to Amazon; Amazon links to Yahoo and to Microsoft; Microsoft links only to Amazon.

              y    a    m
        y    1/2  1/2   0
   M =  a    1/2   0    1
        m     0   1/2   0

8 Simulating a Random Walk
- Start with the vector v = [1, 1, ..., 1], representing the idea that each Web page is given one unit of "importance."
- Repeatedly apply the matrix M to v, allowing the importance to flow like a random walk.
- The limit exists, but about 50 iterations is sufficient to estimate the final distribution.
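A minimal Python sketch of this simulation, assuming M is stored as a nested list as in the earlier sketch; the function name is illustrative.

    def simulate_walk(M, iterations=50):
        size = len(M)
        v = [1.0] * size                  # one unit of importance per page
        for _ in range(iterations):       # repeatedly apply M: v <- M v
            v = [sum(M[i][j] * v[j] for j in range(size)) for i in range(size)]
        return v

    # With the 1839 matrix above, simulate_walk(M) approaches
    # [1.2, 1.2, 0.6], i.e. (6/5, 6/5, 3/5), as the next slide shows.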

9 Example
Equations v = Mv:
  y = y/2 + a/2
  a = y/2 + m
  m = a/2

Iterating from v = [1, 1, 1]:
  y:  1    1    5/4   9/8   ...  6/5
  a:  1   3/2    1   11/8   ...  6/5
  m:  1   1/2   3/4   1/2   ...  3/5
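A quick Python check that the claimed limit is indeed a fixed point of v = Mv:

    y, a, m = 6/5, 6/5, 3/5
    assert abs(y - (y/2 + a/2)) < 1e-12
    assert abs(a - (y/2 + m)) < 1e-12
    assert abs(m - a/2) < 1e-12
    assert abs((y + a + m) - 3) < 1e-12   # total importance stays at 3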

10 Solving The Equations
- Because there are no constant terms, these 3 equations in 3 unknowns do not have a unique solution.
- Add in the fact that y + a + m = 3 to solve.
- In Web-sized examples, we cannot solve by Gaussian elimination; we need to use relaxation (= iterative solution).

11 Real-World Problems
- Some pages are "dead ends" (they have no links out).
  - Such a page causes importance to leak out.
- Other (groups of) pages are spider traps (all out-links are within the group).
  - Eventually a spider trap absorbs all the importance.

12 Microsoft Becomes Dead End
Microsoft's out-link is removed, so its column becomes all zeros.

              y    a    m
        y    1/2  1/2   0
   M =  a    1/2   0    0
        m     0   1/2   0

13 Example
Equations v = Mv:
  y = y/2 + a/2
  a = y/2
  m = a/2

Iterating from v = [1, 1, 1], the importance leaks away to 0:
  y:  1    1    3/4   5/8   ...  0
  a:  1   1/2   1/2   3/8   ...  0
  m:  1   1/2   1/4   1/4   ...  0

14 M'soft Becomes Spider Trap
Microsoft now links only to itself.

              y    a    m
        y    1/2  1/2   0
   M =  a    1/2   0    0
        m     0   1/2   1

15 Example
Equations v = Mv:
  y = y/2 + a/2
  a = y/2
  m = a/2 + m

Iterating from v = [1, 1, 1], Microsoft absorbs all the importance:
  y:  1    1    3/4   5/8   ...  0
  a:  1   1/2   1/2   3/8   ...  0
  m:  1   3/2   7/4    2    ...  3

16 Google Solution to Traps, Etc.
- "Tax" each page a fixed percentage at each iteration.
- Add the same constant to all pages.
- This models a random walk in which the surfer has a fixed probability of abandoning the current search and going to a random page next.

17 Ex: Previous with 20% Tax
Equations v = 0.8(Mv) + 0.2:
  y = 0.8(y/2 + a/2) + 0.2
  a = 0.8(y/2) + 0.2
  m = 0.8(a/2 + m) + 0.2

Solution: y = 7/11, a = 5/11, m = 21/11.
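A quick Python check that the stated solution satisfies the taxed equations:

    y, a, m = 7/11, 5/11, 21/11
    assert abs(y - (0.8 * (y/2 + a/2) + 0.2)) < 1e-12
    assert abs(a - (0.8 * (y/2) + 0.2)) < 1e-12
    assert abs(m - (0.8 * (a/2 + m) + 0.2)) < 1e-12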

18 General Case
- In this example, because there are no dead ends, the total importance remains at 3.
- In examples with dead ends, some importance leaks out, but the total remains finite.

19 Solving the Equations
- Because there are constant terms, we can expect to solve small examples by Gaussian elimination.
- Web-sized examples still need to be solved by relaxation.

20 Search-Engine Architecture
- All search engines, including Google, select pages that contain the words of your query.
- More weight is given to words appearing in the title, headers, etc.
- Inverted indexes speed the discovery of pages containing given words.
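A minimal sketch of an inverted index in Python; the document texts and function name are illustrative, not from the slides.

    from collections import defaultdict

    def build_inverted_index(docs):
        """docs maps a document id to its text; returns word -> sorted doc ids."""
        postings = defaultdict(set)
        for doc_id, text in docs.items():
            for word in text.lower().split():
                postings[word].add(doc_id)
        return {word: sorted(ids) for word, ids in postings.items()}

    docs = {1: "Stanford course listings", 2: "UCLA and Stanford home pages"}
    index = build_inverted_index(docs)
    # index["stanford"] == [1, 2]; a query intersects the posting lists of its words.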

21 Google Anti-Spam Devices
- Early search engines relied on the words on a page to tell what it is about.
  - This led to "tricks" in which pages attracted attention by placing false words in the background color of the page.
- Google also trusts the words in anchor text.
  - It relies on others telling the truth about your page, rather than relying on you.

22 Use of Page Rank
- Pages are ordered by many criteria, including PageRank and the appearance of query words.
  - "Important" pages are more likely to be what you want.
- PageRank is also an anti-spam device.
  - Creating bogus links to yourself doesn't help if you are not an important page.

23 Speeding Convergence
- Newton-like prediction of where components of the principal eigenvector are heading.
- Take advantage of locality in the Web.
- Each technique can reduce the number of iterations by 50%.
  - Important --- computing PageRank can take days.

24 Predicting Component Values
- Three consecutive values for the importance of a page suggest where the limit might be.
[Figure: a curve through the three most recent values, extrapolated to a guess for the next round.]
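The slide does not give the prediction formula; one natural choice, assuming roughly geometric convergence, is Aitken's delta-squared extrapolation, sketched here with an illustrative function name.

    def predict_limit(x0, x1, x2):
        """Guess the limit of a sequence from three consecutive iterates."""
        d1, d2 = x1 - x0, x2 - x1
        if abs(d2 - d1) < 1e-15:          # differences equal: treat as converged
            return x2
        return x2 - d2 * d2 / (d2 - d1)   # Aitken's delta-squared formula

    # For the Yahoo component 1, 5/4, 9/8 from the earlier example,
    # predict_limit(1.0, 1.25, 1.125) gives about 1.17, jumping ahead of the
    # current iterate toward the true limit 6/5 = 1.2.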

25 Exploiting Substructure
- Pages from a particular domain, host, or path, like stanford.edu or www-db.stanford.edu/~ullman, tend to have a higher density of links among themselves.
- Initialize PageRank using ranks within your local cluster, then rank the clusters themselves.

26 Strategy
- Compute local PageRanks (possibly in parallel).
- Use the local weights to establish intercluster weights on edges.
- Compute PageRank on the graph of clusters.
- The initial rank of a page is the product of its local rank and the rank of its cluster.
- "Clusters" are appropriately sized regions with a common domain or lower-level detail.
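A rough sketch of this strategy in Python, assuming the clustering and the cluster-level transition matrix are already given; the function names and data layout are assumptions, and the intercluster weighting details are simplified away.

    def power_iterate(M, steps=50):
        n = len(M)
        v = [1.0] * n
        for _ in range(steps):
            v = [sum(M[i][j] * v[j] for j in range(n)) for i in range(n)]
            total = sum(v) or 1.0
            v = [x / total for x in v]    # keep the vector normalized
        return v

    def blockrank_init(local_matrices, cluster_matrix):
        """local_matrices: one stochastic matrix per cluster (links inside it).
        cluster_matrix: stochastic matrix of the graph of clusters."""
        # 1. Local PageRank within each cluster (independent, so parallelizable).
        local_ranks = [power_iterate(M) for M in local_matrices]
        # 2-3. PageRank on the graph of clusters.
        cluster_ranks = power_iterate(cluster_matrix)
        # 4. Initial rank of a page = its local rank times the rank of its cluster.
        return [[r * cluster_ranks[c] for r in local_ranks[c]]
                for c in range(len(local_matrices))]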

27 In Pictures
[Figure: local ranks computed within each cluster, intercluster edge weights, and ranks of the clusters are combined into the initial eigenvector estimate.]

28 Hubs and Authorities
- Mutually recursive definition:
  - A hub links to many authorities.
  - An authority is linked to by many hubs.
- Authorities turn out to be places where information can be found.
  - Example: course home pages.
- Hubs tell where the authorities are.
  - Example: the CS department's course-listing page.

29 Transition Matrix A
- H&A uses a matrix A where A[i, j] = 1 if page i links to page j, and 0 if not.
- A^T, the transpose of A, is similar to the PageRank matrix M, but A^T has 1's where M has fractions.

30 Example
Yahoo links to Yahoo, Amazon, and M'soft; Amazon links to Yahoo and M'soft; M'soft links only to Amazon.

             y  a  m
        y    1  1  1
   A =  a    1  0  1
        m    0  1  0

31 Using Matrix A for H&A
- Powers of A and A^T diverge in size, so we need scale factors.
- Let h and a be vectors measuring the "hubbiness" and authority of each page.
- Equations: h = λ A a;  a = μ A^T h.
  - Hubbiness = scaled sum of the authorities of linked pages.
  - Authority = scaled sum of the hubbiness of predecessor pages.

32 Consequences of Basic Equations
- From h = λ A a and a = μ A^T h we can derive:
  - h = λμ A A^T h
  - a = λμ A^T A a
- Compute h and a by iteration, assuming initially each page has one unit of hubbiness and one unit of authority.
- Pick an appropriate value of λμ.
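A minimal sketch of this iteration in Python, assuming A is a nested list; the scale factors λ and μ are handled by rescaling each round, and the function name is illustrative.

    def hits(A, iterations=50):
        n = len(A)
        h = [1.0] * n                     # one unit of hubbiness per page
        a = [1.0] * n                     # one unit of authority per page
        for _ in range(iterations):
            h = [sum(A[i][j] * a[j] for j in range(n)) for i in range(n)]   # h <- A a
            a = [sum(A[i][j] * h[i] for i in range(n)) for j in range(n)]   # a <- A^T h
            h_max, a_max = max(h) or 1.0, max(a) or 1.0
            h = [x / h_max for x in h]    # rescale so the largest value is 1
            a = [x / a_max for x in a]
        return h, a

With the matrix A from slide 30, the hub vector approaches roughly (1, 0.73, 0.27) for (Yahoo, Amazon, M'soft), matching the values derived on slides 35-37.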

33 Example

            1 1 1              1 1 0
   A   =    1 0 1      A^T =   1 0 1
            0 1 0              1 1 0

            3 2 1              2 1 2
  A A^T =   2 2 0     A^T A =  1 2 1
            1 0 1              2 1 2

Iterating from a(yahoo) = a(amazon) = a(m'soft) = 1 and h(yahoo) = h(amazon) = h(m'soft) = 1, the hub values approach the ratios 1 : 0.732 : 0.268 for Yahoo : Amazon : M'soft, as derived on the next slides.
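A quick numpy check of the matrix products shown above:

    import numpy as np

    A = np.array([[1, 1, 1],
                  [1, 0, 1],
                  [0, 1, 0]])
    print(A @ A.T)    # [[3 2 1], [2 2 0], [1 0 1]]
    print(A.T @ A)    # [[2 1 2], [1 2 1], [2 1 2]]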

34 Solving the Equations
- Solving even small examples is tricky, because the value of λμ is one of the unknowns.
- Each equation, like y = λμ(3y + 2a + m), lets us solve for λμ in terms of y, a, m; then equate the expressions for λμ.
- As for PageRank, we need to solve big examples by relaxation.

35 Details for h
  y = λμ(3y + 2a + m)
  a = λμ(2y + 2a)
  m = λμ(y + m)

Solve for λμ:
  λμ = y/(3y + 2a + m) = a/(2y + 2a) = m/(y + m)

36 Details for h --- (2)
- Assume y = 1. Then
  λμ = 1/(3 + 2a + m) = a/(2 + 2a) = m/(1 + m)
- Cross-multiply the second and third: a + am = 2m + 2am, so a = 2m/(1 - m).
- Cross-multiply the first and third: 1 + m = 3m + 2am + m², so a = (1 - 2m - m²)/(2m).

37 Details for h --- (3)
- Equate the formulas for a: 2m/(1 - m) = (1 - 2m - m²)/(2m).
- Cross-multiply: 4m² = 1 - 2m - m² - m + 2m² + m³, i.e., m³ - 3m² - 3m + 1 = 0.
- Solve for m: m = 2 - √3 ≈ 0.268.
- Solve for a: a = 2m/(1 - m) = √3 - 1 ≈ 0.732.
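A quick numerical check of this solution, assuming numpy:

    import numpy as np

    roots = np.roots([1, -3, -3, 1])      # m^3 - 3m^2 - 3m + 1 = 0
    m = next(r.real for r in roots if 0 < r.real < 1)
    a = 2 * m / (1 - m)
    print(round(m, 3), round(a, 3))       # 0.268 0.732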