The College of Saint Rose CSC 460 / CIS 560 – Search and Information Retrieval David Goldschmidt, Ph.D. from Search Engines: Information Retrieval in Practice, 1st edition by Croft, Metzler, and Strohman, Pearson, 2010, ISBN
The Internet (1969) is a network that’s global, decentralized, redundant, and made up of many different types of machines. How many machines make up the Internet?
from Fluency with Information Technology, 4th edition by Lawrence Snyder, Addison-Wesley, 2010, ISBN
Sir Tim Berners-Lee
The World Wide Web (or just Web) is global, decentralized, redundant (sometimes), and made up of Web pages and interactive Web services. How many Web pages are on the Web?
Links are useful to us humans for navigating Web sites and finding things. Links are also useful to search engines. A link has two parts: the anchor text (e.g. "Latest News") and the destination link (URL).
How does anchor text apply to ranking? Anchor text describes the content of the destination page. Anchor text is short, descriptive, and often coincides with query text. Anchor text is typically written by a non-biased third party.
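To make the two parts of a link concrete, here is a minimal sketch that pulls (URL, anchor text) pairs out of an HTML fragment using Python's standard `html.parser` module. The `AnchorExtractor` class and the example.com URL are hypothetical, chosen only for illustration; a real crawler would need to handle many more cases.

```python
from html.parser import HTMLParser


class AnchorExtractor(HTMLParser):
    """Collect (destination URL, anchor text) pairs from <a href="..."> tags."""

    def __init__(self):
        super().__init__()
        self.links = []    # finished (url, anchor_text) pairs
        self._href = None  # URL of the <a> tag currently open, if any
        self._text = []    # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


# Hypothetical page fragment; the URL is made up for this example.
html = '<p>Read the <a href="http://example.com/news.html">Latest News</a> today.</p>'
parser = AnchorExtractor()
parser.feed(html)
links = parser.links
```

A search engine could then index the anchor text "Latest News" as a description of the destination page rather than of the page containing the link.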
We often represent Web pages as vertices and links as edges in a webgraph
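One possible way to hold a webgraph in memory is an adjacency mapping from each page to the set of pages it links to. The sketch below assumes a tiny hypothetical graph of three pages (A, B, C); the page names, the links, and the `build_webgraph` helper are illustrative only.

```python
# Hypothetical (source, destination) link pairs, for illustration only.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A")]


def build_webgraph(edge_list):
    """Map each page (vertex) to the set of pages it links to (directed edges)."""
    graph = {}
    for source, target in edge_list:
        graph.setdefault(source, set()).add(target)
        graph.setdefault(target, set())  # keep pages that have no outlinks
    return graph


webgraph = build_webgraph(links)
```

Using a set for each page's outlinks means duplicate links between the same two pages collapse into a single edge.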
An example:
Links may be interpreted as describing a destination Web page in terms of its popularity, importance, and authority, measured by its incoming link count. We focus on incoming links (inlinks) and use them for ranking matching documents. The drawback is obtaining incoming link data.
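As a rough sketch of ranking by inlinks, the code below counts incoming links per page and orders a set of matching pages by that count. The link data and both function names are hypothetical. It also hints at the drawback: the count for a page can only be computed after collecting links from every other page.

```python
from collections import Counter

# Hypothetical (source, destination) pairs; ("A", "C") appears twice on purpose.
links = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("A", "C")]


def inlink_counts(edge_list):
    """Count distinct incoming links per page, ignoring duplicates and self-links."""
    counts = Counter()
    for source, target in set(edge_list):
        if source != target:
            counts[target] += 1
    return counts


def rank_by_inlinks(matching_pages, counts):
    """Order pages by descending inlink count, breaking ties alphabetically."""
    return sorted(matching_pages, key=lambda page: (-counts[page], page))


counts = inlink_counts(links)
ranking = rank_by_inlinks(["A", "B", "C"], counts)
```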
PageRank is a link analysis algorithm. PageRank is credited to Sergey Brin and Lawrence Page (the Google guys!) The original PageRank paper: ▪
Browse the Web as a random surfer: choose a random number r between 0 and 1; if r < λ, go to a random page; otherwise, follow a random link from the current page; repeat! The PageRank of page A (noted PR(A)) is the probability that this “random surfer” will be looking at that page.
Jumping to a random page avoids getting stuck in: pages that have no links, pages that only have broken links, and pages that loop back to previously visited pages.
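The random-surfer procedure can be simulated directly. The sketch below is an illustration under assumptions, not the method from the original paper: the three-page graph is hypothetical, and the function name, step count, and fixed seed are arbitrary choices. A page with no outlinks always triggers a random jump, which is one simple way to avoid getting stuck.

```python
import random


def random_surfer(graph, lam=0.15, steps=100_000, seed=0):
    """Estimate PageRank as the fraction of steps the surfer spends on each page."""
    rng = random.Random(seed)
    pages = sorted(graph)
    visits = dict.fromkeys(pages, 0)
    current = rng.choice(pages)
    for _ in range(steps):
        visits[current] += 1
        outlinks = graph[current]
        if rng.random() < lam or not outlinks:
            current = rng.choice(pages)     # jump to a random page
        else:
            current = rng.choice(outlinks)  # follow a random link
    return {page: count / steps for page, count in visits.items()}


# Hypothetical webgraph: page -> list of pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
estimate = random_surfer(graph)
```

The estimates are visit frequencies, so they sum to 1 and become steadier as `steps` grows.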
PageRank of page C is the probability that a random surfer is viewing page C, based on its inlinks: PR(C) = PR(A) / 2 + PR(B) / 1. We assume PageRank is initially distributed evenly across all pages (so 0.33 each for A, B, and C), which gives PR(C) = 0.33 / 2 + 0.33 / 1 ≈ 0.50.
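The arithmetic for PR(C) can be checked with a few lines of code. The slide only implies that A has two outlinks and B has one, both including C; the remaining links in the graph below (A to B, C to A) are assumptions added to complete the example.

```python
# Hypothetical webgraph consistent with PR(C) = PR(A) / 2 + PR(B) / 1.
outlinks = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

# Start with PageRank spread evenly over the three pages.
pr = {page: 1 / len(outlinks) for page in outlinks}

# Each page that links to C passes on an equal share of its own PageRank.
pr_c = sum(
    pr[v] / len(set(outlinks[v]))
    for v in outlinks
    if "C" in outlinks[v]
)
```

With exact thirds instead of the rounded 0.33, `pr_c` comes out to 0.5.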
More generally:
$$PR(u) = \sum_{v \in B_u} \frac{PR(v)}{L_v}$$
where $B_u$ is the set of pages that point to $u$ and $L_v$ is the number of outgoing links from page $v$ (not counting duplicate links).
We can account for the “random jumps” by incorporating the constant λ into the equation:
$$PR(u) = \frac{\lambda}{N} + (1 - \lambda) \sum_{v \in B_u} \frac{PR(v)}{L_v}$$
where N is the number of pages. Typically, λ is low (e.g. λ = 0.15).
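A minimal sketch of computing PageRank with random jumps is to start from an even distribution and repeatedly re-apply the update: each page receives λ / N plus (1 − λ) times the shares passed along its inlinks. The graph, the `pagerank` function name, the fixed iteration count, and the handling of pages with no outlinks (their rank is spread evenly over all pages) are assumptions made for this example.

```python
def pagerank(graph, lam=0.15, iterations=50):
    """Iteratively apply the PageRank update with random-jump constant lam."""
    pages = list(graph)
    n = len(pages)
    pr = {page: 1.0 / n for page in pages}  # start evenly distributed
    for _ in range(iterations):
        new_pr = {page: lam / n for page in pages}  # random-jump term
        for v in pages:
            targets = set(graph[v])  # duplicate links count once
            if targets:
                share = (1 - lam) * pr[v] / len(targets)
                for u in targets:
                    new_pr[u] += share
            else:
                # Assumption: a page with no outlinks spreads its rank evenly.
                for u in pages:
                    new_pr[u] += (1 - lam) * pr[v] / n
        pr = new_pr
    return pr


# Hypothetical webgraph: page -> list of pages it links to.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
```

For this graph, C ends up with the highest score, since it receives links from both A and B, and the scores sum to 1.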
A cycle tends to negate the effectiveness of the PageRank algorithm
Read and study Chapter 4.5