COMP4332 Web Data (thanks to Raymond Wong for the slides)
Web Databases (Raymond Wong)
COMP5331: How to rank the web pages?
Ranking Methods: HITS Algorithm, PageRank Algorithm
HITS Algorithm: HITS is a ranking algorithm that ranks "hubs" and "authorities".
HITS Algorithm: each page v has two weights: 1. an authority weight a(v), and 2. a hub weight h(v).
HITS Algorithm: each vertex v has two weights. The authority weight is a(v) = Σ_{u → v} h(u), the sum of the hub weights of the pages that link to v. The hub weight is h(v) = Σ_{v → u} a(u), the sum of the authority weights of the pages that v links to. A good authority has many incoming edges from good hubs; a good hub has many outgoing edges to good authorities.
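The two update rules above can be sketched as one iteration step in plain Python (the three-page graph and its page names here are made up for illustration):

```python
# One HITS update on a toy graph. edges[u] lists the pages that u links to.
edges = {"p": ["q", "r"], "q": ["r"], "r": ["p"]}

def hits_step(edges, a, h):
    # a(v) = sum of h(u) over all pages u that link to v
    new_a = {v: 0.0 for v in edges}
    for u, outs in edges.items():
        for v in outs:
            new_a[v] += h[u]
    # h(v) = sum of a(u) over all pages u that v links to
    new_h = {v: sum(new_a[u] for u in outs_v) for v, outs_v in edges.items()}
    return new_a, new_h

a = {v: 1.0 for v in edges}
h = {v: 1.0 for v in edges}
a, h = hits_step(edges, a, h)
```

Note that the hub update uses the freshly computed authority weights, matching the mutual-reinforcement idea in the rules above.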
HITS Algorithm: HITS involves two major steps. Step 1: the sampling step. Step 2: the iteration step.
Step 1 – Sampling Step: given a user query with several terms, collect a set of highly relevant pages, called the base set. How do we find the base set? First, retrieve all web pages that contain the query terms; this set of pages is called the root set. Next, find the linked pages: pages that either have a hyperlink to some page in the root set, or are linked to by some page in the root set. All pages found (the root set plus the linked pages) form the base set.
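The base-set construction can be sketched as follows, assuming the link structure is available as a dictionary (the page names and links are hypothetical):

```python
# Build the base set: the root set, plus pages linked from it,
# plus pages that link into it.
def base_set(root, links):
    base = set(root)
    for page, outs in links.items():
        if page in root:
            base.update(outs)              # pages the root set links to
        elif any(t in root for t in outs):
            base.add(page)                 # pages linking into the root set
    return base

# Hypothetical link structure: page -> pages it links to.
links = {"a": ["b"], "b": ["c"], "d": ["a"], "e": ["f"]}
root = {"a", "b"}   # pages that contain the query terms
```

Here "c" enters the base set because the root set links to it, and "d" enters because it links into the root set; "e" and "f" stay out.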
Step 2 – Iteration Step: the goal is to find, among the base-set pages, those that are good hubs and good authorities.
Step 2 – Iteration Step: an example with three pages, N (Netscape), MS (Microsoft), and A (Amazon.com). N links to N, MS, and A; MS links to A; A links to N and MS. The hub updates are:

h(N) = a(N) + a(MS) + a(A)
h(MS) = a(A)
h(A) = a(N) + a(MS)

In matrix form, h = M a, where M is the adjacency matrix (rows and columns ordered N, MS, A; M[i][j] = 1 if page i links to page j):

M = [ 1 1 1 ]
    [ 0 0 1 ]
    [ 1 1 0 ]
Step 2 – Iteration Step: the authority updates for the same example are:

a(N) = h(N) + h(A)
a(MS) = h(N) + h(A)
a(A) = h(N) + h(MS)

In matrix form, a = M^T h, where M^T is the transpose of the adjacency matrix.
Step 2 – Iteration Step: we have h = M a and a = M^T h. Substituting one into the other, we derive h = (M M^T) h and a = (M^T M) a, so the hub and authority vectors are eigenvectors of M M^T and M^T M, respectively.
Step 2 – Iteration Step: for this example (rows and columns ordered N, MS, A),

M = [ 1 1 1 ]      M^T = [ 1 0 1 ]
    [ 0 0 1 ]            [ 1 0 1 ]
    [ 1 1 0 ]            [ 1 1 0 ]

M M^T = [ 3 1 2 ]      M^T M = [ 2 2 1 ]
        [ 1 1 0 ]              [ 2 2 1 ]
        [ 2 0 2 ]              [ 1 1 2 ]
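These matrix products can be checked with a short computation (plain Python, no libraries):

```python
# Adjacency matrix for the example: rows/columns ordered N, MS, A;
# M[i][j] = 1 if page i links to page j.
M = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]

def transpose(X):
    return [list(col) for col in zip(*X)]

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]

MT = transpose(M)
MMT = matmul(M, MT)   # drives the hub iteration:       h <- (M M^T) h
MTM = matmul(MT, M)   # drives the authority iteration: a <- (M^T M) a
```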
Step 2 – Iteration Step: start with the hub vector h = (1, 1, 1) and repeatedly apply h ← (M M^T) h, normalizing after each iteration so that the sum of all elements in the vector is 3. The normalized hub vector converges after a few iterations to approximately Hub = (1.50, 0.40, 1.10) for (N, MS, A).
Step 2 – Iteration Step: similarly, start with the authority vector a = (1, 1, 1) and repeatedly apply a ← (M^T M) a, again normalizing so that the sum of all elements is 3. The normalized authority vector converges to approximately Authority = (1.10, 1.10, 0.80) for (N, MS, A).
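The iteration itself can be sketched as a power iteration with the sum-to-3 normalization used above:

```python
# Power iteration for the Netscape/Microsoft/Amazon example.
MMT = [[3, 1, 2], [1, 1, 0], [2, 0, 2]]   # M * M^T (hub update)
MTM = [[2, 2, 1], [2, 2, 1], [1, 1, 2]]   # M^T * M (authority update)

def iterate(A, x, steps=50):
    for _ in range(steps):
        x = [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]
        s = sum(x)
        x = [3.0 * v / s for v in x]   # normalize so the entries sum to 3
    return x

hub = iterate(MMT, [1.0, 1.0, 1.0])         # order: N, MS, A
authority = iterate(MTM, [1.0, 1.0, 1.0])
```

After 50 iterations, hub is roughly (1.50, 0.40, 1.10) and authority is roughly (1.10, 1.10, 0.80), matching the converged vectors described above.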
How to Rank: there are many ways. Rank in descending order of hub weight only; rank in descending order of authority weight only; or rank in descending order of a value computed from both (e.g., the sum of the hub weight and the authority weight), using the converged Hub and Authority vectors from the iteration step.
Ranking Methods: HITS Algorithm, PageRank Algorithm
PageRank Algorithm (Google): a disadvantage of HITS is that, since there are two concepts, hubs and authorities, it is not clear which concept is more important for ranking. An advantage of PageRank is that it involves only one concept for ranking: a single score per page.
PageRank Algorithm (Google): PageRank uses a stochastic approach, a random walk on the web graph, to rank the pages.
Link Structure of the Web: roughly 150 million web pages and 1.7 billion links. Backlinks and forward links: if pages A and B link to page C, then A and B are C's backlinks, and C is a forward link of A and B. Intuitively, a web page is important if it has many backlinks. But what if a web page has only one backlink, yet that backlink comes from a very important page?
A Simple Version of PageRank:

R(u) = c Σ_{v ∈ B_u} R(v) / N_v

where u is a web page; B_u is the set of u's backlinks; N_v is the number of forward links of page v; and c is the normalization factor chosen so that ||R||_{L1} = 1, with ||R||_{L1} = |R_1| + … + |R_n|.
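One update of this formula can be sketched as follows (the three-page graph is hypothetical):

```python
# Simplified PageRank update: R(u) = c * sum over backlinks v of R(v) / N_v
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # page -> forward links

def pagerank_step(edges, R):
    new_R = {u: 0.0 for u in edges}
    for v, outs in edges.items():
        for u in outs:
            new_R[u] += R[v] / len(outs)   # N_v = number of forward links of v
    s = sum(new_R.values())
    return {u: r / s for u, r in new_R.items()}   # c normalizes ||R||_1 to 1

R = {u: 1.0 / 3 for u in edges}
R = pagerank_step(edges, R)
```

Starting from the uniform vector, one step gives R(a) = 1/3, R(b) = 1/6, R(c) = 1/2: page c gains rank because both a and b link to it.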
An Example of Simplified PageRank: PageRank calculation over successive iterations (first iteration, second iteration, and convergence after some iterations). [figures omitted]
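Convergence can be observed by repeating the update; on the small hypothetical graph below the ranks settle after a few dozen iterations:

```python
# Iterate simplified PageRank to (approximate) convergence.
edges = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
R = {u: 1.0 / 3 for u in edges}
for _ in range(100):
    new_R = {u: 0.0 for u in edges}
    for v, outs in edges.items():
        for u in outs:
            new_R[u] += R[v] / len(outs)
    R = new_R
# Every page here has forward links, so the total rank stays 1
# and no extra normalization is needed in this particular graph.
```

The fixed point satisfies R(a) = R(c), R(b) = R(a)/2, R(c) = R(a)/2 + R(b), giving (0.4, 0.2, 0.4).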
A Problem with Simplified PageRank: a loop (a "rank sink"). During each iteration, the loop accumulates rank but never distributes any rank to other pages!
An Example of the Problem. [figure omitted]
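The rank-sink behaviour is easy to reproduce. In the hypothetical graph below, b and c form a loop and nothing links back to a, so the loop ends up holding all the rank:

```python
# Rank sink: the b <-> c loop absorbs all rank; a's rank drops to zero.
edges = {"a": ["b"], "b": ["c"], "c": ["b"]}
R = {u: 1.0 / 3 for u in edges}
for _ in range(50):
    new_R = {u: 0.0 for u in edges}
    for v, outs in edges.items():
        for u in outs:
            new_R[u] += R[v] / len(outs)
    R = new_R
```

After the first iteration a's rank is already zero, and all the rank circulates inside the loop forever.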
Random Walks in Graphs: The Random Surfer Model. The simplified model corresponds to the standing (stationary) probability distribution of a random walk on the graph of the web, where the surfer simply keeps clicking successive links at random. The Modified Model: the "random surfer" keeps clicking successive links at random, but periodically "gets bored" and jumps to a random page chosen according to the distribution E.
Modified Version of PageRank:

R'(u) = c ( Σ_{v ∈ B_u} R'(v) / N_v + E(u) )

where E(u) is a distribution of rank over web pages that "users" jump to when they "get bored" with following successive links at random.
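A sketch of the modified model on the rank-sink graph from before. The uniform E and the damping factor d = 0.85 are the usual assumptions, not values taken from these slides:

```python
# Modified PageRank sketch: each page gets a baseline (1 - d)/n from the
# uniform "gets bored" jump E, plus d times the rank flowing along links.
edges = {"a": ["b"], "b": ["c"], "c": ["b"]}   # same rank-sink graph
d = 0.85                                       # damping; 1 - d = jump probability
n = len(edges)
R = {u: 1.0 / n for u in edges}
for _ in range(100):
    new_R = {u: (1 - d) / n for u in edges}    # uniform E(u)
    for v, outs in edges.items():
        for u in outs:
            new_R[u] += d * R[v] / len(outs)
    R = new_R
```

Unlike the simplified model, page a now keeps a nonzero rank of (1 - d)/n = 0.05, and the total rank still sums to 1.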
An Example of Modified PageRank. [figure omitted]
Dangling Links: links that point to pages with no outgoing links. Most such pages are pages that have not been downloaded yet. They affect the model, since it is not clear where their weight should be distributed, but they do not directly affect the ranking of any other page. They can simply be removed before the PageRank calculation and added back afterwards.
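Removing dangling links can be sketched as iterative pruning; removing one dangling page can create new ones, so we repeat until the graph is stable (the link data below is hypothetical):

```python
def remove_dangling(edges):
    """Drop pages with no forward links (and the links into them), repeatedly."""
    edges = {u: list(vs) for u, vs in edges.items()}
    while True:
        known = set(edges)
        dangling = {u for u, vs in edges.items() if not vs}
        pruned = {u: [v for v in vs if v in known and v not in dangling]
                  for u, vs in edges.items() if u not in dangling}
        if pruned == edges:
            return pruned
        edges = pruned

# "x" has no entry (e.g., not yet downloaded) and "c" has no out-links,
# so both are pruned along with the links pointing to them.
cleaned = remove_dangling({"a": ["b", "x"], "b": ["a", "c"], "c": []})
```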
PageRank Implementation: convert each URL into a unique integer and store each hyperlink in a database, using the integer IDs to identify pages. Sort the link structure by ID. Remove all the dangling links from the database. Make an initial assignment of ranks and start the iteration (choosing a good initial assignment can speed up the PageRank computation). Finally, add the dangling links back.