Link analysis and Page Rank Algorithm

Slides:



Advertisements
Similar presentations
Lecture 18: Link analysis
Advertisements

Link Analysis Mark Levene (Follow the links to learn more!)
Topic-Sensitive PageRank Presented by : Bratislav V. Stojanović University of Belgrade School of Electrical Engineering Page 1/29.
Crawling, Ranking and Indexing. Organizing the Web The Web is big. Really big. –Over 3 billion pages, just in the indexable Web The Web is dynamic Problems:
1 The PageRank Citation Ranking: Bring Order to the web Lawrence Page, Sergey Brin, Rajeev Motwani and Terry Winograd Presented by Fei Li.
How PageRank Works Ketan Mayer-Patel University of North Carolina January 31, 2011.
DATA MINING LECTURE 12 Link Analysis Ranking Random walks.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
CSE 522 – Algorithmic and Economic Aspects of the Internet Instructors: Nicole Immorlica Mohammad Mahdian.
1 CS 430 / INFO 430: Information Retrieval Lecture 16 Web Search 2.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Zdravko Markov and Daniel T. Larose, Data Mining the Web: Uncovering Patterns in Web Content, Structure, and Usage, Wiley, Slides for Chapter 1:
Link Analysis, PageRank and Search Engines on the Web
ISP 433/633 Week 7 Web IR. Web is a unique collection Largest repository of data Unedited Can be anything –Information type –Sources Changing –Growing.
1 COMP4332 Web Data Thanks for Raymond Wong’s slides.
PageRank Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata October 27, 2014.
Link Analysis HITS Algorithm PageRank Algorithm.
Chapter 8 Web Structure Mining Part-1 1. Web Structure Mining Deals mainly with discovering the model underlying the link structure of the web Deals with.
Google and the Page Rank Algorithm Székely Endre
S eminar on Page Ranking Techniques In Search Engines Phapale Gaurav S. [05 IT 6010] Guide: Prof. A. Gupta.
CS246 Link-Based Ranking. Problems of TFIDF Vector  Works well on small controlled corpus, but not on the Web  Top result for “American Airlines” query:
Motivation When searching for information on the WWW, user perform a query to a search engine. The engine return, as the query’s result, a list of Web.
Λ14 Διαδικτυακά Κοινωνικά Δίκτυα και Μέσα
Presented By: - Chandrika B N
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 21: Link Analysis.
Adversarial Information Retrieval The Manipulation of Web Content.
1 Announcements Research Paper due today Research Talks –Nov. 29 (Monday) Kayatana and Lance –Dec. 1 (Wednesday) Mark and Jeremy –Dec. 3 (Friday) Joe and.
X-Informatics Web Search; Text Mining B 2013 Geoffrey Fox Associate Dean for.
The Technology Behind. The World Wide Web In July 2008, Google announced that they found 1 trillion unique webpages! Billions of new web pages appear.
Graph-based Algorithms in Large Scale Information Retrieval Fatemeh Kaveh-Yazdy Computer Engineering Department School of Electrical and Computer Engineering.
Web Search. Structure of the Web n The Web is a complex network (graph) of nodes & links that has the appearance of a self-organizing structure  The.
1 University of Qom Information Retrieval Course Web Search (Link Analysis) Based on:
When Experts Agree: Using Non-Affiliated Experts To Rank Popular Topics Meital Aizen.
CS315 – Link Analysis Three generations of Search Engines Anchor text Link analysis for ranking Pagerank HITS.
1 CS 430: Information Discovery Lecture 9 Term Weighting and Ranking.
The PageRank Citation Ranking: Bringing Order to the Web Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd Presented by Anca Leuca, Antonis Makropoulos.
How Does a Search Engine Work? Part 2 Dr. Frank McCown Intro to Web Science Harding University This work is licensed under a Creative Commons Attribution-NonCommercial-
Overview of Web Ranking Algorithms: HITS and PageRank
CS349 – Link Analysis 1. Anchor text 2. Link analysis for ranking 2.1 Pagerank 2.2 Pagerank variants 2.3 HITS.
Link Analysis Rong Jin. Web Structure  Web is a graph Each web site correspond to a node A link from one site to another site forms a directed edge 
Ranking Link-based Ranking (2° generation) Reading 21.
COMP4210 Information Retrieval and Search Engines Lecture 9: Link Analysis.
1 1 COMP5331: Knowledge Discovery and Data Mining Acknowledgement: Slides modified based on the slides provided by Lawrence Page, Sergey Brin, Rajeev Motwani.
Information Retrieval and Web Search Link analysis Instructor: Rada Mihalcea (Note: This slide set was adapted from an IR course taught by Prof. Chris.
“In the beginning -- before Google -- a darkness was upon the land.” Joel Achenbach Washington Post.
1 CS 430: Information Discovery Lecture 5 Ranking.
CS 440 Database Management Systems Web Data Management 1.
CS 540 Database Management Systems Web Data Management some slides are due to Kevin Chang 1.
1 CS 430 / INFO 430: Information Retrieval Lecture 20 Web Search 2.
WEB SPAM.
Quality of a search engine
HITS Hypertext-Induced Topic Selection
Search Engines and Link Analysis on the Web
Lecture #11 PageRank (II)
Link-Based Ranking Seminar Social Media Mining University UC3M
PageRank and Markov Chains
The Anatomy of a Large-Scale Hypertextual Web Search Engine
An Efficient method to recommend research papers and highly influential authors. VIRAJITHA KARNATAPU.
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Lecture 22 SVD, Eigenvector, and Web Search
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
CS 440 Database Management Systems
Information retrieval and PageRank
Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"
Junghoo “John” Cho UCLA
Description of PageRank
Junghoo “John” Cho UCLA
Lecture 22 SVD, Eigenvector, and Web Search
PageRank PAGE RANK (determines the importance of webpages based on link structure) Solves a complex system of score equations PageRank is a probability.
CS276 Information Retrieval and Web Search
Presentation transcript:

Link analysis and Page Rank Algorithm Slides from Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze

Why This Result and Order ?

Today’s lecture – hypertext and links We look beyond the content of documents We begin to look at the hyperlinks between them Address questions like Do the links represent a conferral of authority to some pages? Is this useful for ranking? How likely is it that a page pointed to by the CERN home page is about high energy physics CERN, the European Organization for Nuclear Research, is one of the world's largest and most respected centers for scientific research

Links are everywhere Powerful sources of authenticity and authority Mail spam – which email accounts are spammers? Host quality – which hosts are “bad”? Phone call logs The Good, The Bad and The Unknown ? Good Bad

Example 1: Good/Bad/Unknown The Good, The Bad and The Unknown Good nodes won’t point to Bad nodes All other combinations plausible ? Good ? ? Bad ?

Simple iterative logic Good nodes won’t point to Bad nodes If you point to a Bad node, you’re Bad If a Good node points to you, you’re Good ? Good ? ? Bad ?

Simple iterative logic Good nodes won’t point to Bad nodes If you point to a Bad node, you’re Bad If a Good node points to you, you’re Good Good ? Bad ?

Simple iterative logic Good nodes won’t point to Bad nodes If you point to a Bad node, you’re Bad If a Good node points to you, you’re Good Good Bad

The Web as a Directed Graph Sec. 21.1 The Web as a Directed Graph Page A hyperlink Page B Anchor Hypothesis 1: A hyperlink between pages denotes a conferral of authority (quality signal) Hypothesis 2: The text in the anchor of the hyperlink on page A describes the target page B

Assumption 1: reputed sites

Assumption 2: annotation of target

Anchor Text WWW Worm - McBryan [Mcbr94] Sec. 21.1.1 Anchor Text WWW Worm - McBryan [Mcbr94] For ibm how to distinguish between: IBM’s home page (mostly graphical) IBM’s copyright page (high term freq. for ‘ibm’) Rival’s spam page (arbitrarily high term freq.) “IBM home page” “ibm.com” “ibm” A million pieces of anchor text with “ibm” send a strong signal www.ibm.com

Sec. 21.1.1 Indexing anchor text When indexing a document D, include (with some weight) anchor text from links pointing to D. Armonk, NY-based computer giant IBM announced today www.ibm.com Big Blue today announced record profits for the quarter Joe’s computer hardware links Sun HP IBM

Google bombs Indexing anchor text can have unexpected side effects: Google bombs. whatelse does not have side effects? A Google bomb is a search with “bad” results due to maliciously manipulated anchor text Google introduced a new weighting function in January 2007 that fixed many Google bombs

Google bomb example

Sec. 21.1.1 Indexing anchor text Can score anchor text with weight depending on the authority of the anchor page’s website E.g., if we were to assume that content from cnn.com or yahoo.com is authoritative, then trust (more) the anchor text from them Increase the weight of off-site anchors (non-nepotistic scoring)

Link Analysis: PageRank

Google Pagerank System Google was developed by Sergey Brin and Larry Page This is the method that Larry Page developed to rank and order the pages. Hence, the Pagerank. PageRank is a trademark of Google. The PageRank process has been patented.

Citation Analysis The impact factor of a journal = A/B A is the number of current year citations to articles appearing in the journal during previous two years. B is the number of articles published in the journal during previous two years. Journal Title (AI) Impact Factor (2004) J. Mach. Learn. Res. 5.952 IEEE T. Pattern Anal. 4.352 IEEE T. Evolut. Comp. 3.688 Artif. Intell. 3.570 Mach. Learn. 3.258

Co-Citation A and B are co-cited by C, implying that they are related or associated. The strength of co-citation between A and B is the number of times they are co-cited.

Co-Citation An alternate citation-based measure of similarity introduced by Small in 1973. Number of documents that cite both A and B. A B

Bibliographic Coupling Measure of similarity of documents introduced by Kessler in 1963. The bibliographic coupling of two documents A and B is the number of documents cited by both A and B. Size of the intersection of their bibliographies. Maybe want to normalize by size of bibliographies? A B

Clusters from Co-Citation Graph (Larson 96)

The web isn’t scholarly citation Millions of participants, each with self interests Spamming is widespread Once search engines began to use links for ranking (roughly 1998), link spam grew You can join a link farm – a group of websites that heavily link to one another

How would you order these site? Suppose each of the nodes at right have the links shown in the directed graph. Which node is most important and should appear first?

The Basic Idea PageRank is a numeric value that represents how important a page is on the web. Google figures that when one page links to another page, it is effectively casting a vote for the other page. The more votes that are cast for a page, the more important the page must be. Also, the importance of the page that is casting the vote determines how important the vote itself is. http://www.webworkshop.net/pagerank.html

Simplified PageRank algorithm Assume four web pages: A, B,C and D. Let each page would begin with an estimated PageRank of 0.25. L(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows: C A D B C A D B

PageRank algorithm including damping factor Assume page A has pages B, C, D ..., which point to it. The parameter d is a damping factor which can be set between 0 and 1. Usually set d to 0.85. The PageRank of a page A is given as follows:

Pagerank scoring Imagine a user doing a random walk on web pages: Sec. 21.2 Pagerank scoring Imagine a user doing a random walk on web pages: Start at a random page At each step, go out of the current page along one of the links on that page, equiprobably “In the long run” each page has a long-term visit rate - use this as the page’s score. 1/3

Not quite enough The web is full of dead-ends. Sec. 21.2 Not quite enough The web is full of dead-ends. Random walk can get stuck in dead-ends. Makes no sense to talk about long-term visit rates. ??

Teleporting At a dead end, jump to a random web page. Sec. 21.2 Teleporting At a dead end, jump to a random web page. At any non-dead end, with probability 10%, jump to a random web page. With remaining probability (90%), go out on a random link. 10% - a parameter.

Result of teleporting Now cannot get stuck locally. Sec. 21.2 Result of teleporting Now cannot get stuck locally. There is a long-term rate at which any page is visited How do we compute this visit rate?

Intuitive Justification A "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back“, but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. The d damping factor is the probability at each page the "random surfer" will get bored and request another random page. A page can have a high PageRank If there are many pages that point to it Or if there are some pages that point to it, and have a high PageRank.

Compute the Page Rank of A and B Each page has one outgoing link. So that means C(A) = 1 and C(B) = 1. PR(A) = PR(B) =1

What is the PageRank of A, B, C d=0.5 Can compute value in fractions, for example: 10/13

The Characteristics of PageRank™ We regard a small web consisting of three pages A, B and C, whereby page A links to the pages B and C, page B links to page C and page C links to page A. According to Page and Brin, the damping factor d is usually set to 0.85, but to keep the calculation simple we set it to 0.5. PR(A) = 0.5 + 0.5 PR(C) PR(B) = 0.5 + 0.5 (PR(A) / 2) PR(C) = 0.5 + 0.5 (PR(A) / 2 + PR(B)) We get the following PageRank™ values for the single pages: PR(A) = 14/13 = 1.07692308 PR(B) = 10/13 = 0.76923077 PR(C) = 15/13 = 1.15384615 The sum of all pages' PageRanks is 3 and thus equals the total number of web pages.

We focus on two web pages A and B, with no hyperlink from A to B. (There are many other web pages of course.) Let Old(B) be the pagerank of B. We now add a hyperlink from A to B. Should the new pagerank (New(B)) of B be higher than Old(B). Justify your answer

We're adding a link from A to B. Now suppose that there was a page C such that A points to C, which in turn points to B, giving B relatively high pagerank from A through C to B. The introduction of the new link from A to B results in less pagerank flowing from A to C, and in turn into B through C. Thus, the extra contribution to B's pagerank from the new link out of A may be swamped by the diminished contribution from C to B. Hence, it's not obvious in general if the pagerank of B is monotone in links into B.

Hyperlink-Induced Topic Search (HITS) In response to a query, instead of an ordered list of pages each meeting the query, find two sets of inter-related pages: Hub pages are good lists of links to pages answering the information need. e.g., “Bob’s list of cancer-related links.” Authority pages are direct answers to the information need occur recurrently on good hubs for the subject. Most approaches to search do not make the distinction between the two sets

HITS – Hubs and Authorities - Hyperlink-Induced Topic Search A on the left is an authority A on the right is a hub

Hubs and Authorities Thus, a good hub page for a topic points to many authoritative pages for that topic. A good authority page for a topic is pointed to by many good hubs for that topic. Circular definition - will turn this into an iterative computation.

Hubs and Authorities Together they tend to form a bipartite graph: