COMP4332 Web Data. Thanks to Raymond Wong for the slides.

Presentation transcript:

Web Databases (Raymond Wong)

COMP5331: How to rank the webpages?

Ranking Methods
1. HITS Algorithm
2. PageRank Algorithm

HITS Algorithm: HITS is a ranking algorithm that ranks “hubs” and “authorities”.

HITS Algorithm: Each page v has two weights.
1. Authority weight a(v)
2. Hub weight h(v)

HITS Algorithm: Each vertex v has two weights, an authority weight a(v) and a hub weight h(v):
$$a(v) = \sum_{u \to v} h(u), \qquad h(v) = \sum_{v \to u} a(u)$$
A good authority has many incoming edges from good hubs. A good hub has many outgoing edges to good authorities.
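The two update rules can be sketched in a few lines of Python. This is a minimal illustration, not the slides' own code: the function name `hits_update`, the edge-list representation and the three-page graph are assumptions, and the sketch computes the authority weights first and then the hub weights from them, which is one common ordering.

```python
def hits_update(edges, hub):
    """Apply both HITS update rules once to a directed edge list."""
    nodes = {n for edge in edges for n in edge}
    # a(v) = sum of h(u) over all edges u -> v
    authority = {v: 0.0 for v in nodes}
    for u, v in edges:
        authority[v] += hub[u]
    # h(v) = sum of a(u) over all edges v -> u
    new_hub = {v: 0.0 for v in nodes}
    for v, u in edges:
        new_hub[v] += authority[u]
    return authority, new_hub


# Hypothetical graph: p1 and p2 both link to p3.
edges = [("p1", "p3"), ("p2", "p3")]
authority, hub = hits_update(edges, {"p1": 1.0, "p2": 1.0, "p3": 1.0})
```

Here p3 ends up as the only page with a non-zero authority weight, and p1 and p2 as the only pages with non-zero hub weights.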

HITS Algorithm: HITS involves two major steps.
Step 1: Sampling Step
Step 2: Iteration Step

Step 1 – Sampling Step: Given a user query with several terms, collect a set of pages that are highly relevant to it, called the base set.
How to find the base set?
1. Retrieve all webpages that contain the query terms. This set of webpages is called the root set.
2. Find the link pages: pages that have a hyperlink to some page in the root set, or that some page in the root set has a hyperlink to.
3. All pages found (the root set together with the link pages) form the base set.
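The sampling step can be sketched as follows. This is a rough illustration under stated assumptions: pages are held in an in-memory dictionary, a page matches when it contains every query term as a whole word, and the names `build_base_set`, `pages` and `links` are hypothetical.

```python
def build_base_set(pages, links, query_terms):
    """pages: dict url -> text; links: iterable of (src, dst) hyperlinks."""
    terms = [t.lower() for t in query_terms]
    # Root set: pages whose text contains all query terms (an assumption;
    # a real engine would use its index and relevance scoring here).
    root = {url for url, text in pages.items()
            if all(t in text.lower().split() for t in terms)}
    base = set(root)
    for src, dst in links:
        if dst in root:      # src has a hyperlink to a root page
            base.add(src)
        if src in root:      # a root page has a hyperlink to dst
            base.add(dst)
    return root, base


# Hypothetical four-page collection.
pages = {
    "a": "cheap flights to tokyo",
    "b": "travel blog",
    "c": "airline home page",
    "d": "cooking recipes",
}
links = [("b", "a"), ("a", "c"), ("d", "b")]
root, base = build_base_set(pages, links, ["flights", "tokyo"])
```

Page d links only to b, which is not in the root set, so d stays outside the base set.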

Step 2 – Iteration Step: The goal is to find the pages in the base set that are good hubs and good authorities.

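Step 2 – Iteration Step: Example with three pages, N: Netscape, MS: Microsoft, A: Amazon.com.
[Figure: link graph among N, MS and A]
h(N) = a(N) + a(MS) + a(A)
h(MS) = a(A)
h(A) = a(N) + a(MS)
In matrix form, with the adjacency matrix M (rows and columns ordered N, MS, A):
$$\begin{pmatrix} h(N) \\ h(MS) \\ h(A) \end{pmatrix} = M \begin{pmatrix} a(N) \\ a(MS) \\ a(A) \end{pmatrix}, \qquad M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$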

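Step 2 – Iteration Step: For the same three pages (N: Netscape, MS: Microsoft, A: Amazon.com):
a(N) = h(N) + h(A)
a(MS) = h(N) + h(A)
a(A) = h(N) + h(MS)
In matrix form, with the transpose of the adjacency matrix M (rows and columns ordered N, MS, A):
$$\begin{pmatrix} a(N) \\ a(MS) \\ a(A) \end{pmatrix} = M^T \begin{pmatrix} h(N) \\ h(MS) \\ h(A) \end{pmatrix}, \qquad M^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$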
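The matrix form can be checked with a small sketch. The edge list below is read off the hub equations of the example (N links to N, MS and A; MS links to A; A links to N and MS); the helper names and the all-ones starting authority vector are assumptions made for illustration.

```python
PAGES = ["N", "MS", "A"]
EDGES = [("N", "N"), ("N", "MS"), ("N", "A"),
         ("MS", "A"),
         ("A", "N"), ("A", "MS")]


def adjacency(pages, edges):
    """Build M with M[i][j] = 1 when page i links to page j."""
    index = {p: i for i, p in enumerate(pages)}
    m = [[0] * len(pages) for _ in pages]
    for src, dst in edges:
        m[index[src]][index[dst]] = 1
    return m


def transpose(m):
    return [list(col) for col in zip(*m)]


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


M = adjacency(PAGES, EDGES)
a = [1, 1, 1]                      # assumed starting authority weights
h = matvec(M, a)                   # h = M a
a_next = matvec(transpose(M), h)   # a = M^T h
```

Each entry of `h` is the sum of the authority weights of the pages that the corresponding page links to, exactly as in the per-page equations.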

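Step 2 – Iteration Step: Writing h and a for the vectors of hub and authority weights, we have
$$h = M a, \qquad a = M^T h$$
Substituting one into the other, we derive
$$h = M M^T h, \qquad a = M^T M a$$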

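Step 2 – Iteration Step: With rows and columns ordered N, MS, A:
$$M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \qquad M^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$$
$$M M^T = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix}, \qquad M^T M = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}$$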

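Step 2 – Iteration Step (hub weights): The hub vector over N, MS and A is repeatedly multiplied by
$$M M^T = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix}$$
and normalized after each iteration so that the sum of all elements in the vector is 3.
[Tables: hub weights of N, MS and A per iteration, non-normalized and normalized; resulting Hub vector]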

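Step 2 – Iteration Step (authority weights): The authority vector over N, MS and A is repeatedly multiplied by
$$M^T M = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}$$
and normalized after each iteration so that the sum of all elements in the vector is 3.
[Tables: authority weights of N, MS and A per iteration, non-normalized and normalized; resulting Hub and Authority vectors]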
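The iteration with normalization can be sketched as below. The starting vector of all ones and the fixed number of iterations are assumptions, since the per-iteration tables did not survive in this transcript; the helper names are hypothetical.

```python
M = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]   # rows and columns ordered N, MS, A


def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def normalize(x, total):
    """Rescale x so that its elements sum to `total`."""
    s = sum(x)
    return [total * v / s for v in x]


def iterate(matrix, iterations=50):
    x = [1.0] * len(matrix)          # assumed initial weights
    for _ in range(iterations):
        x = normalize(matvec(matrix, x), total=len(matrix))
    return x


MT = transpose(M)
hub = iterate(matmul(M, MT))         # h <- M M^T h, sum kept at 3
authority = iterate(matmul(MT, M))   # a <- M^T M a, sum kept at 3
```

Under these assumptions N comes out as the strongest hub, and N and MS share the highest authority weight.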

How to Rank: There are many ways.
1. Rank in descending order of hub weight only.
2. Rank in descending order of authority weight only.
3. Rank in descending order of a value computed from both hub and authority (e.g., the sum of the hub value and the authority value).
[Final Hub and Authority vectors for N, MS and A]
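The three ranking options can be compared with a short sketch. The scores below are illustrative values only, roughly what the three-page example converges to when started from all-ones vectors; they are not taken from the slides. Breaking ties alphabetically is also an assumption.

```python
def rank(scores):
    """Pages in descending order of score; ties broken by page name."""
    return sorted(scores, key=lambda page: (-scores[page], page))


# Illustrative scores (assumed, rounded).
hub = {"N": 1.50, "MS": 0.40, "A": 1.10}
authority = {"N": 1.10, "MS": 1.10, "A": 0.80}
combined = {page: hub[page] + authority[page] for page in hub}

by_hub = rank(hub)
by_authority = rank(authority)
by_sum = rank(combined)
```

The three orderings differ, which is the ambiguity that the choice of ranking rule has to resolve.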

PageRank Algorithm (Google)
Disadvantage of HITS: Since there are two concepts, namely hubs and authorities, we do not know which concept is more important for ranking.
Advantage of PageRank: PageRank involves only one concept for ranking.

PageRank Algorithm (Google): PageRank uses a stochastic approach to rank the pages.

Link Structure of the Web
- 150 million web pages, 1.7 billion links
- Backlinks and forward links: A and B are C’s backlinks; C is A and B’s forward link
- Intuitively, a webpage is important if it has a lot of backlinks.
- What if a webpage has only one link off

A Simple Version of PageRank
$$R(u) = c \sum_{v \in B_u} \frac{R(v)}{N_v}$$
- u: a web page
- B_u: the set of u’s backlinks
- N_v: the number of forward links of page v
- c: the normalization factor that makes $\|R\|_{L1} = 1$, where $\|R\|_{L1} = |R_1 + \dots + R_n|$
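With these definitions, the simplified rank of u is c times the sum of R(v)/N_v over the backlinks v of u, as in the original PageRank paper. A minimal sketch of iterating that rule is given below; the function name, the dictionary representation (every page must appear as a key) and the three-page graph are assumptions for illustration.

```python
def simple_pagerank(links, iterations=100):
    """links: dict page -> list of pages it links to (its forward links)."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            for u in targets:
                # v passes an equal share R(v) / N_v to each forward link u
                new_rank[u] += rank[v] / len(targets)
        total = sum(new_rank.values())
        # c = 1 / total keeps the L1 norm of R equal to 1
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


# Hypothetical graph: A -> B, A -> C, B -> C, C -> A.
rank = simple_pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
```

For this graph the ranks settle near 0.4 for A and C and 0.2 for B: B only receives half of A's rank, while C receives the other half plus all of B's.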

An example of Simplified PageRank PageRank Calculation: first iteration

An example of Simplified PageRank PageRank Calculation: second iteration

An example of Simplified PageRank Convergence after some iterations

A Problem with Simplified PageRank A loop: During each iteration, the loop accumulates rank but never distributes rank to other pages!

An example of the Problem
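The effect of such a loop can be reproduced with a small sketch. The four-page graph is a hypothetical one, not the slide's figure: A and B link to each other, A also links into the loop formed by C and D, and nothing links back out of the loop.

```python
def simplified_pagerank(links, iterations=100):
    """Simplified PageRank without any random jump."""
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: 0.0 for p in pages}
        for v, targets in links.items():
            for u in targets:
                new_rank[u] += rank[v] / len(targets)
        rank = new_rank
    return rank


# C and D only link to each other, so rank that enters the loop never leaves.
links = {"A": ["B", "C"], "B": ["A"], "C": ["D"], "D": ["C"]}
rank = simplified_pagerank(links)
outside = rank["A"] + rank["B"]
inside = rank["C"] + rank["D"]
```

After enough iterations essentially all of the rank sits on C and D, and A and B are left with almost nothing.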

Random Walks in Graphs
The Random Surfer Model (the simplified model): the rank corresponds to the standing probability distribution of a random walk on the graph of the web, in which the surfer simply keeps clicking successive links at random.
The Modified Model: the “random surfer” simply keeps clicking successive links at random, but periodically “gets bored” and jumps to a random page chosen according to the distribution E.

Modified Version of PageRank: E(u) is a distribution of ranks over the web pages that “users” jump to when they “get bored” of following successive links at random.
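The slide's own equation is not part of this transcript, so the sketch below uses the common damping-factor form as an assumption: with probability d the surfer follows a link, and with probability 1 - d jumps to a page drawn from E. The names `modified_pagerank` and `d`, the value 0.85 and the uniform default for E are all illustrative choices.

```python
def modified_pagerank(links, E=None, d=0.85, iterations=100):
    """links: dict page -> list of forward links; E: dict page -> jump probability."""
    pages = list(links)
    if E is None:
        E = {p: 1.0 / len(pages) for p in pages}   # assumed uniform E
    rank = dict(E)
    for _ in range(iterations):
        # Share received from "bored" surfers jumping according to E
        new_rank = {p: (1.0 - d) * E[p] for p in pages}
        for v, targets in links.items():
            for u in targets:
                # Share received from surfers following a link on v
                new_rank[u] += d * rank[v] / len(targets)
        total = sum(new_rank.values())
        rank = {p: r / total for p, r in new_rank.items()}
    return rank


# Hypothetical graph with a loop (C and D) that traps rank in the simplified model.
links = {"A": ["B", "C"], "B": ["A"], "C": ["D"], "D": ["C"]}
rank = modified_pagerank(links)
```

With the jump included, A and B keep a clearly positive rank even though the loop between C and D still collects the largest share.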

An example of Modified PageRank

Dangling Links
- Links that point to any page with no outgoing links
- Most are pages that have not been downloaded yet
- Affect the model since it is not clear where their weight should be distributed
- Do not affect the ranking of any other page directly
- Can simply be removed before the PageRank calculation and added back afterwards
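Removing dangling links before the calculation can be sketched as follows. Repeating the removal until no dangling link remains is an assumption of this sketch (removing one link can leave another page without outgoing links); the function name and the example graph are hypothetical.

```python
def remove_dangling(links):
    """links: set of (src, dst). Returns (kept links, removed links in order)."""
    kept = set(links)
    removed = []
    while True:
        sources = {src for src, _ in kept}
        # A link is dangling when its target has no outgoing links
        dangling = {(s, d) for s, d in kept if d not in sources}
        if not dangling:
            return kept, removed
        removed.extend(sorted(dangling))
        kept -= dangling


# Hypothetical graph: D has no outgoing links, so C -> D is dangling;
# once it is removed, C has no outgoing links and B -> C becomes dangling.
links = {("A", "B"), ("B", "A"), ("B", "C"), ("C", "D")}
kept, removed = remove_dangling(links)
```

The `removed` list keeps the links so that they can be added back after the ranks of the remaining pages have been computed.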

PageRank Implementation
1. Convert each URL into a unique integer and store each hyperlink in a database, using the integer IDs to identify pages.
2. Sort the link structure by ID.
3. Remove all the dangling links from the database.
4. Make an initial assignment of ranks and start the iteration. Choosing a good initial assignment can speed up the PageRank computation.
5. Add the dangling links back.
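The first two implementation steps can be sketched as below. An in-memory dictionary and list stand in for the database, which is an assumption of this sketch; the function name and the example.com URLs are hypothetical.

```python
def index_links(url_links):
    """url_links: iterable of (src_url, dst_url).

    Returns (ids, id_links): a URL -> integer ID mapping and the hyperlinks
    as (src_id, dst_id) pairs sorted by ID.
    """
    ids = {}

    def url_id(url):
        # Assign the next unused integer the first time a URL is seen
        if url not in ids:
            ids[url] = len(ids)
        return ids[url]

    id_links = sorted((url_id(src), url_id(dst)) for src, dst in url_links)
    return ids, id_links


url_links = [
    ("http://x.example.com/", "http://y.example.com/"),
    ("http://z.example.com/", "http://x.example.com/"),
    ("http://x.example.com/", "http://z.example.com/"),
]
ids, id_links = index_links(url_links)
```

Working with small integers instead of URL strings keeps the link structure compact and makes sorting and lookups cheap during the iteration.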