Presentation is loading. Please wait.

Presentation is loading. Please wait.

Link analysis and Page Rank Algorithm

Similar presentations


Presentation on theme: "Link analysis and Page Rank Algorithm"— Presentation transcript:

1 Link analysis and Page Rank Algorithm
Slides from Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze

2 Why This Result and Order ?

3 Today’s lecture – hypertext and links
We look beyond the content of documents We begin to look at the hyperlinks between them Address questions like Do the links represent a conferral of authority to some pages? Is this useful for ranking? How likely is it that a page pointed to by the CERN home page is about high energy physics CERN, the European Organization for Nuclear Research, is one of the world's largest and most respected centers for scientific research

4 Links are everywhere Powerful sources of authenticity and authority
Mail spam – which accounts are spammers? Host quality – which hosts are “bad”? Phone call logs The Good, The Bad and The Unknown ? Good Bad

5 Example 1: Good/Bad/Unknown
The Good, The Bad and The Unknown Good nodes won’t point to Bad nodes All other combinations plausible ? Good ? ? Bad ?

6 Simple iterative logic
Good nodes won’t point to Bad nodes If you point to a Bad node, you’re Bad If a Good node points to you, you’re Good ? Good ? ? Bad ?

7 Simple iterative logic
Good nodes won’t point to Bad nodes If you point to a Bad node, you’re Bad If a Good node points to you, you’re Good Good ? Bad ?

8 Simple iterative logic
Good nodes won’t point to Bad nodes If you point to a Bad node, you’re Bad If a Good node points to you, you’re Good Good Bad

9 The Web as a Directed Graph
Sec. 21.1 The Web as a Directed Graph Page A hyperlink Page B Anchor Hypothesis 1: A hyperlink between pages denotes a conferral of authority (quality signal) Hypothesis 2: The text in the anchor of the hyperlink on page A describes the target page B

10 Assumption 1: reputed sites

11 Assumption 2: annotation of target

12 Anchor Text WWW Worm - McBryan [Mcbr94]
Sec Anchor Text WWW Worm - McBryan [Mcbr94] For ibm how to distinguish between: IBM’s home page (mostly graphical) IBM’s copyright page (high term freq. for ‘ibm’) Rival’s spam page (arbitrarily high term freq.) “IBM home page” “ibm.com” “ibm” A million pieces of anchor text with “ibm” send a strong signal

13 Sec Indexing anchor text When indexing a document D, include (with some weight) anchor text from links pointing to D. Armonk, NY-based computer giant IBM announced today Big Blue today announced record profits for the quarter Joe’s computer hardware links Sun HP IBM

14 Google bombs Indexing anchor text can have unexpected side effects: Google bombs. whatelse does not have side effects? A Google bomb is a search with “bad” results due to maliciously manipulated anchor text Google introduced a new weighting function in January 2007 that fixed many Google bombs

15 Google bomb example

16 Sec Indexing anchor text Can score anchor text with weight depending on the authority of the anchor page’s website E.g., if we were to assume that content from cnn.com or yahoo.com is authoritative, then trust (more) the anchor text from them Increase the weight of off-site anchors (non-nepotistic scoring)

17 Link Analysis: PageRank

18 Google Pagerank System
Google was developed by Sergey Brin and Larry Page This is the method that Larry Page developed to rank and order the pages. Hence, the Pagerank. PageRank is a trademark of Google. The PageRank process has been patented.

19 Citation Analysis The impact factor of a journal = A/B
A is the number of current year citations to articles appearing in the journal during previous two years. B is the number of articles published in the journal during previous two years. Journal Title (AI) Impact Factor (2004) J. Mach. Learn. Res. 5.952 IEEE T. Pattern Anal. 4.352 IEEE T. Evolut. Comp. 3.688 Artif. Intell. 3.570 Mach. Learn. 3.258

20 Co-Citation A and B are co-cited by C, implying that
they are related or associated. The strength of co-citation between A and B is the number of times they are co-cited.

21 Co-Citation An alternate citation-based measure of similarity introduced by Small in 1973. Number of documents that cite both A and B. A B

22 Bibliographic Coupling
Measure of similarity of documents introduced by Kessler in 1963. The bibliographic coupling of two documents A and B is the number of documents cited by both A and B. Size of the intersection of their bibliographies. Maybe want to normalize by size of bibliographies? A B

23 Clusters from Co-Citation Graph (Larson 96)

24 The web isn’t scholarly citation
Millions of participants, each with self interests Spamming is widespread Once search engines began to use links for ranking (roughly 1998), link spam grew You can join a link farm – a group of websites that heavily link to one another

25 How would you order these site?
Suppose each of the nodes at right have the links shown in the directed graph. Which node is most important and should appear first?

26

27 The Basic Idea PageRank is a numeric value that represents how important a page is on the web. Google figures that when one page links to another page, it is effectively casting a vote for the other page. The more votes that are cast for a page, the more important the page must be. Also, the importance of the page that is casting the vote determines how important the vote itself is.

28 Simplified PageRank algorithm
Assume four web pages: A, B,C and D. Let each page would begin with an estimated PageRank of 0.25. L(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: C A D B C A D B

29 PageRank algorithm including damping factor
Assume page A has pages B, C, D ..., which point to it. The parameter d is a damping factor which can be set between 0 and 1. Usually set d to The PageRank of a page A is given as follows:

30 Pagerank scoring Imagine a user doing a random walk on web pages:
Sec. 21.2 Pagerank scoring Imagine a user doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably “In the long run” each page has a long-term visit rate - use this as the page’s score. 1/3

31 Not quite enough The web is full of dead-ends.
Sec. 21.2 Not quite enough The web is full of dead-ends. Random walk can get stuck in dead-ends. Makes no sense to talk about long-term visit rates. ??

32 Teleporting At a dead end, jump to a random web page.
Sec. 21.2 Teleporting At a dead end, jump to a random web page. At any non-dead end, with probability 10%, jump to a random web page. With remaining probability (90%), go out on a random link. 10% - a parameter.

33 Result of teleporting Now cannot get stuck locally.
Sec. 21.2 Result of teleporting Now cannot get stuck locally. There is a long-term rate at which any page is visited How do we compute this visit rate?

34 Intuitive Justification
A "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back“, but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. The d damping factor is the probability at each page the "random surfer" will get bored and request another random page. A page can have a high PageRank If there are many pages that point to it Or if there are some pages that point to it, and have a high PageRank.

35 Compute the Page Rank of A and B
Each page has one outgoing link. So that means C(A) = 1 and C(B) = 1. PR(A) = PR(B) =1

36 What is the PageRank of A, B, C
d=0.5 Can compute value in fractions, for example: 10/13

37 The Characteristics of PageRank™
We regard a small web consisting of three pages A, B and C, whereby page A links to the pages B and C, page B links to page C and page C links to page A. According to Page and Brin, the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to 0.5. PR(A) = PR(C) PR(B) = (PR(A) / 2) PR(C) = (PR(A) / 2 + PR(B)) We get the following PageRank™ values for the single pages: PR(A) = 14/13 = PR(B) = 10/13 = PR(C) = 15/13 = The sum of all pages' PageRanks is 3 and thus equals the total number of web pages.

38 We focus on two web pages A and B, with no hyperlink from A to B.
(There are many other web pages of course.) Let Old(B) be the pagerank of B. We now add a hyperlink from A to B. Should the new pagerank (New(B)) of B be higher than Old(B). Justify your answer

39 We're adding a link from A to B. Now suppose that there was a page
C such that A points to C, which in turn points to B, giving B relatively high pagerank from A through C to B. The introduction of the new link from A to B results in less pagerank flowing from A to C, and in turn into B through C. Thus, the extra contribution to B's pagerank from the new link out of A may be swamped by the diminished contribution from C to B. Hence, it's not obvious in general if the pagerank of B is monotone in links into B.

40 Hyperlink-Induced Topic Search (HITS)
In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages: Hub pages are good lists of links to pages answering the information need. e.g., “Bob’s list of cancer-related links.” Authority pages are direct answers to the information need occur recurrently on good hubs for the subject. Most approaches to search do not make the distinction between the two sets

41 HITS – Hubs and Authorities - Hyperlink-Induced Topic Search
A on the left is an authority A on the right is a hub

42 Hubs and Authorities Thus, a good hub page for a topic points to many authoritative pages for that topic. A good authority page for a topic is pointed to by many good hubs for that topic. Circular definition - will turn this into an iterative computation.

43 Hubs and Authorities Together they tend to form a bipartite graph:


Download ppt "Link analysis and Page Rank Algorithm"

Similar presentations


Ads by Google