COMP5331 Web databases Prepared by Raymond Wong


COMP5331 Web databases Prepared by Raymond Wong Presented by Raymond Wong raywong@cse COMP5331

Web Databases
Raymond Wong

How to rank the webpages?

Ranking Methods
HITS Algorithm
PageRank Algorithm

HITS Algorithm
HITS is a ranking algorithm which ranks “hubs” and “authorities”.

HITS Algorithm
Each page v has two weights:
Authority weight a(v)
Hub weight h(v)

HITS Algorithm
Each vertex has two weights: an authority weight and a hub weight.
A good hub has many outgoing edges to good authorities.
A good authority has many incoming edges from good hubs.
Authority weight: $a(v) = \sum_{u \to v} h(u)$
Hub weight: $h(v) = \sum_{v \to u} a(u)$
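The two update rules can be sketched directly on an edge list. This is a minimal illustration, not the slides' own code: the function names and the starting weights of 1 are assumptions, and the edges are those of the Netscape / Microsoft / Amazon example used in the iteration step of these slides.

```python
# Directed links (source, target) of the Netscape / Microsoft / Amazon example.
EDGES = [("N", "N"), ("N", "MS"), ("N", "A"),
         ("MS", "A"),
         ("A", "N"), ("A", "MS")]


def update_authority(edges, hub):
    """a(v) = sum of h(u) over every link u -> v."""
    auth = {page: 0.0 for page in hub}
    for source, target in edges:
        auth[target] += hub[source]
    return auth


def update_hub(edges, auth):
    """h(v) = sum of a(u) over every link v -> u."""
    hub = {page: 0.0 for page in auth}
    for source, target in edges:
        hub[source] += auth[target]
    return hub


start_hub = {"N": 1.0, "MS": 1.0, "A": 1.0}   # assumed starting weights
auth = update_authority(EDGES, start_hub)
hub = update_hub(EDGES, auth)
print(auth)
print(hub)
```

Each rule only reads the other weight, so the two updates can be alternated until the weights stop changing.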

HITS Algorithm
HITS involves two major steps.
Step 1: Sampling Step
Step 2: Iteration Step

Step 1 – Sampling Step
Given a user query with several terms, collect a set of pages that are very relevant to it, called the base set.
How to find the base set?
First, retrieve all webpages that contain the query terms. This set of webpages is called the root set.
Next, find the link pages: pages that have a hyperlink to some page in the root set, or that some page in the root set has a hyperlink to.
The root set together with the link pages forms the base set.
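The sampling step can be sketched as follows. This is only an illustration under stated assumptions: `page_text` and `links` are hypothetical in-memory stand-ins for a search index and a crawled link graph, and "contains the query terms" is read as "contains every term as a whole word".

```python
def build_base_set(query_terms, page_text, links):
    """Return (root_set, base_set) for a query.

    page_text: dict mapping page id -> page text (hypothetical index)
    links:     iterable of (source, target) hyperlinks (hypothetical graph)
    """
    terms = [term.lower() for term in query_terms]

    # Root set: pages whose text contains all query terms.
    root = set()
    for page, text in page_text.items():
        words = set(text.lower().split())
        if all(term in words for term in terms):
            root.add(page)

    # Link pages: pages pointing to a root page, or pointed to by one.
    link_pages = set()
    for source, target in links:
        if target in root:
            link_pages.add(source)
        if source in root:
            link_pages.add(target)

    return root, root | link_pages


# Toy data, invented for this example.
page_text = {
    "p1": "cheap flights to paris",
    "p2": "paris hotels",
    "p3": "flights paris deals",
    "p4": "cooking recipes",
    "p5": "travel blog",
}
links = [("p4", "p1"), ("p3", "p5"), ("p2", "p4")]

root, base = build_base_set(["flights", "paris"], page_text, links)
print(sorted(root), sorted(base))
```

Here `p4` and `p5` contain none of the query terms but still enter the base set through their links with root pages.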

Step 2 – Iteration Step
Goal: to find the base pages that are good hubs and good authorities

Step 2 – Iteration Step
N: Netscape, MS: Microsoft, A: Amazon.com
Adjacency matrix (rows and columns in the order N, MS, A):
$M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$
h(N) = a(N) + a(MS) + a(A)
h(MS) = a(A)
h(A) = a(N) + a(MS)
In matrix form:
$\begin{pmatrix} h(N) \\ h(MS) \\ h(A) \end{pmatrix} = M \begin{pmatrix} a(N) \\ a(MS) \\ a(A) \end{pmatrix}$

Step 2 – Iteration Step
With the same adjacency matrix M:
a(N) = h(N) + h(A)
a(MS) = h(N) + h(A)
a(A) = h(N) + h(MS)
In matrix form:
$\begin{pmatrix} a(N) \\ a(MS) \\ a(A) \end{pmatrix} = M^T \begin{pmatrix} h(N) \\ h(MS) \\ h(A) \end{pmatrix}$
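The matrix form of both updates can be checked with a few lines of plain Python. This is a sketch only: the adjacency matrix is written out from the six equations on these two slides (row = linking page, column = linked page, order N, MS, A), and the helper names and the starting vector of ones are assumptions.

```python
# Adjacency matrix: M[i][j] = 1 if page i links to page j (order N, MS, A).
M = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


def matvec(matrix, vector):
    """Multiply a matrix by a column vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


a = [1, 1, 1]                        # assumed starting authority weights
h = matvec(M, a)                     # hub update:       h = M a
a_next = matvec(transpose(M), h)     # authority update: a = M^T h
print(h, a_next)
```

Row N of M has three ones, which is why h(N) sums all three authority weights; column A of M has ones in rows N and MS, which is why a(A) = h(N) + h(MS).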

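Step 2 – Iteration Step
Write $\vec{h} = (h(N), h(MS), h(A))^T$ and $\vec{a} = (a(N), a(MS), a(A))^T$.
We have $\vec{h} = M \vec{a}$ and $\vec{a} = M^T \vec{h}$.
We derive $\vec{h} = M M^T \vec{h}$ and $\vec{a} = M^T M \vec{a}$.

Step 2 – Iteration Step
With rows and columns in the order N, MS, A:
$M = \begin{pmatrix} 1 & 1 & 1 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$, $M^T = \begin{pmatrix} 1 & 0 & 1 \\ 1 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}$
$M M^T = \begin{pmatrix} 3 & 1 & 2 \\ 1 & 1 & 0 \\ 2 & 0 & 2 \end{pmatrix}$, $M^T M = \begin{pmatrix} 2 & 2 & 1 \\ 2 & 2 & 1 \\ 1 & 1 & 2 \end{pmatrix}$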
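The two products used by the iteration can be recomputed with a small helper. This is an illustrative sketch; `matmul` and `transpose` are hypothetical helper names, and M is the adjacency matrix implied by the hub and authority equations of the example (order N, MS, A).

```python
M = [[1, 1, 1],
     [0, 0, 1],
     [1, 1, 0]]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


def matmul(left, right):
    """Plain triple-loop matrix product, written as a comprehension."""
    columns = list(zip(*right))
    return [[sum(l * r for l, r in zip(row, column)) for column in columns]
            for row in left]


MT = transpose(M)
MMT = matmul(M, MT)   # drives the hub iteration:       h = (M M^T) h
MTM = matmul(MT, M)   # drives the authority iteration: a = (M^T M) a
print(MMT)
print(MTM)
```

Both products are symmetric, which is expected for any matrix multiplied by its own transpose.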

Step 2 – Iteration Step
Iterate $\vec{h} \leftarrow M M^T \vec{h}$, starting from all weights equal to 1. Each non-normalized column is $M M^T$ applied to the previous normalized column.

Hub (non-normalized)
Iteration No.   1   2   3   4       5       6       7
N               1   6   7   7.071   7.091   7.096   7.098
MS              1   2   2   1.929   1.909   1.904   1.902
A               1   4   5   5.143   5.182   5.192   5.195

Hub (normalized; the sum of all elements in the vector = 3)
Iteration No.   1   2     3       4       5       6       7
N               1   1.5   1.5     1.5     1.5     1.5     1.5
MS              1   0.5   0.429   0.409   0.404   0.402   0.402
A               1   1     1.071   1.091   1.096   1.098   1.098

Result: Hub = (h(N), h(MS), h(A)) = (1.5, 0.402, 1.098)

Step 2 – Iteration Step
Iterate $\vec{a} \leftarrow M^T M \vec{a}$, starting from all weights equal to 1. Each non-normalized column is $M^T M$ applied to the previous normalized column.

Authority (non-normalized)
Iteration No.   1   2   3       4       5       6       7
N               1   5   5.143   5.182   5.192   5.195   5.196
MS              1   5   5.143   5.182   5.192   5.195   5.196
A               1   4   3.857   3.818   3.808   3.805   3.804

Authority (normalized; the sum of all elements in the vector = 3)
Iteration No.   1   2       3       4       5       6       7
N               1   1.071   1.091   1.096   1.098   1.098   1.098
MS              1   1.071   1.091   1.096   1.098   1.098   1.098
A               1   0.857   0.818   0.808   0.805   0.804   0.804

Result: Hub = (1.5, 0.402, 1.098), Authority = (a(N), a(MS), a(A)) = (1.098, 1.098, 0.804)
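The hub and authority tables can be reproduced with a short normalized iteration. This is a sketch rather than the slides' own code: the function names are hypothetical, and the two matrices are the products $M M^T$ and $M^T M$ of the example's adjacency matrix (order N, MS, A).

```python
MMT = [[3, 1, 2], [1, 1, 0], [2, 0, 2]]   # M M^T, used for hubs
MTM = [[2, 2, 1], [2, 2, 1], [1, 1, 2]]   # M^T M, used for authorities


def matvec(matrix, vector):
    """Multiply a matrix by a column vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def normalise(vector, total):
    """Rescale so that the elements sum to `total`."""
    scale = total / sum(vector)
    return [scale * x for x in vector]


def iterate(matrix, steps):
    """Start from all ones, then repeat: multiply, normalise to sum n."""
    weights = [1.0] * len(matrix)
    for _ in range(steps):
        weights = normalise(matvec(matrix, weights), total=len(matrix))
    return weights


# Iteration no. 1 is the starting vector, so iteration no. 7 is 6 steps.
hub = [round(x, 3) for x in iterate(MMT, 6)]
authority = [round(x, 3) for x in iterate(MTM, 6)]
print(hub)
print(authority)
```

Normalizing after every step keeps the numbers from growing without bound and does not change the relative order of the pages.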

How to Rank
Hub = (h(N), h(MS), h(A)) = (1.5, 0.402, 1.098)
Authority = (a(N), a(MS), a(A)) = (1.098, 1.098, 0.804)
Many ways:
Rank in descending order of hub only
Rank in descending order of authority only
Rank in descending order of a value computed from both hub and authority (e.g., the sum of the hub value and the authority value)
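The three ranking options can be compared on the final weights of the example. This is a small sketch; the helper name `rank_by` is hypothetical, and the sum is only one possible way of combining the two weights.

```python
hub = {"N": 1.5, "MS": 0.402, "A": 1.098}
authority = {"N": 1.098, "MS": 1.098, "A": 0.804}


def rank_by(score):
    """Page names in descending order of the given score dictionary."""
    return sorted(score, key=score.get, reverse=True)


by_hub = rank_by(hub)
by_authority = rank_by(authority)   # N and MS are tied here
by_sum = rank_by({page: hub[page] + authority[page] for page in hub})

print(by_hub)
print(by_authority)
print(by_sum)
```

Microsoft ranks last by hub weight but ties for first by authority weight, so the choice of criterion changes the result.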

Ranking Methods
HITS Algorithm
PageRank Algorithm

PageRank Algorithm (Google)
Disadvantage of HITS: since there are two concepts, namely hubs and authorities, we do not know which concept is more important for ranking.
Advantage of PageRank: PageRank involves only one concept for ranking.

PageRank Algorithm (Google)
The PageRank algorithm uses a stochastic approach to rank the pages.

PageRank Algorithm (Google)
N: Netscape, MS: Microsoft, A: Amazon.com
Stochastic matrix (rows and columns in the order N, MS, A):
$M = \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 0 & 1/2 \\ 1/2 & 1 & 0 \end{pmatrix}$

PageRank Algorithm (Google)
$\begin{pmatrix} N \\ MS \\ A \end{pmatrix} = M \begin{pmatrix} N \\ MS \\ A \end{pmatrix}$

Page Rank
Iteration No.   1   2     3      4       5       ..   33
N               1   1     1.25   1.125   1.25    …    1.20
MS              1   0.5   0.75   0.5     0.688   …    0.60
A               1   1.5   1      1.375   1.063   …    1.20

Microsoft (MS) is quite upset with this result.
Microsoft decides to link only to itself from now on.
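The iteration can be sketched in plain Python. The link structure below is an assumption: it is inferred from the first columns of the iteration table (N links to itself and A, MS links to A, A links to N and MS), and the function names are hypothetical.

```python
PAGES = ["N", "MS", "A"]
LINKS = {"N": ["N", "A"], "MS": ["A"], "A": ["N", "MS"]}   # inferred, see above


def stochastic_matrix(pages, links):
    """Column j spreads page j's importance equally over the pages it links to."""
    index = {page: i for i, page in enumerate(pages)}
    matrix = [[0.0] * len(pages) for _ in pages]
    for source, targets in links.items():
        for target in targets:
            matrix[index[target]][index[source]] = 1.0 / len(targets)
    return matrix


def iterate(matrix, steps):
    """Start from all ones and repeatedly multiply by the matrix."""
    n = len(matrix)
    rank = [1.0] * n
    for _ in range(steps):
        rank = [sum(matrix[i][j] * rank[j] for j in range(n)) for i in range(n)]
    return rank


M = stochastic_matrix(PAGES, LINKS)
print(iterate(M, 3))                              # iteration no. 4 of the table
print([round(x, 2) for x in iterate(M, 100)])     # long-run values
```

Because every column of M sums to 1, the total importance stays at 3 throughout, so no normalization step is needed.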

PageRank Algorithm (Google)
Stochastic matrix after Microsoft links only to itself (rows and columns in the order N, MS, A):
$M = \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 1/2 \\ 1/2 & 0 & 0 \end{pmatrix}$

PageRank Algorithm (Google)
$\begin{pmatrix} N \\ MS \\ A \end{pmatrix} = M \begin{pmatrix} N \\ MS \\ A \end{pmatrix}$

Page Rank
Iteration No.   1   2     3      4       5       ..   40
N               1   1     0.75   0.625   0.5     …    0
MS              1   1.5   1.75   2       2.188   …    3
A               1   0.5   0.5    0.375   0.313   …    0

Microsoft (MS) is happy. It is the most important now.
The others are not happy.

PageRank Algorithm (Google)
Spider trap: a group of one or more pages that have no links out of the group. Such a group will eventually accumulate all the importance of the web.
Microsoft has become a spider trap.
How to solve it?
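The definition can be turned into a simple check, followed by a few iteration steps that show the accumulation. This is a sketch under assumptions: the link structure is inferred from the example's iteration values (with Microsoft now linking only to itself), every page is assumed to have at least one outgoing link, and the function names are hypothetical.

```python
LINKS = {"N": ["N", "A"], "MS": ["MS"], "A": ["N", "MS"]}   # inferred example graph


def is_spider_trap(links, group):
    """True if the group is non-empty and none of its links leaves the group."""
    group = set(group)
    return bool(group) and all(
        target in group for page in group for target in links[page]
    )


def step(links, rank):
    """One PageRank step: each page splits its importance over its links."""
    new_rank = {page: 0.0 for page in rank}
    for page, targets in links.items():
        share = rank[page] / len(targets)
        for target in targets:
            new_rank[target] += share
    return new_rank


print(is_spider_trap(LINKS, {"MS"}))        # Microsoft alone
print(is_spider_trap(LINKS, {"N", "A"}))    # A still links out to MS

rank = {"N": 1.0, "MS": 1.0, "A": 1.0}
for _ in range(4):                          # reaches iteration no. 5
    rank = step(LINKS, rank)
print(rank)
```

Importance that flows into MS never flows back out, so its share can only grow from one step to the next.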

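PageRank Algorithm (Google)
$\begin{pmatrix} N \\ MS \\ A \end{pmatrix} = 0.8 \begin{pmatrix} 1/2 & 0 & 1/2 \\ 0 & 1 & 1/2 \\ 1/2 & 0 & 0 \end{pmatrix} \begin{pmatrix} N \\ MS \\ A \end{pmatrix} + \begin{pmatrix} 0.2 \\ 0.2 \\ 0.2 \end{pmatrix}$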

PageRank Algorithm (Google)

Page Rank
Iteration No.   1   2     3      4       5       ..   20
N               1   1     0.84   0.776   0.725   …    0.636
MS              1   1.4   1.56   1.688   1.765   …    1.909
A               1   0.6   0.6    0.536   0.510   …    0.455

We have a more reasonable distribution of importance than before.
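The values in this table are consistent with scaling each step by 0.8 and then giving every page a fixed 0.2. The sketch below assumes exactly that update; the factor 0.8, the name `taxed_pagerank`, and the matrix (the spider-trap example, order N, MS, A) are inferred from the numbers rather than stated in the transcript.

```python
# Stochastic matrix of the spider-trap example (MS links only to itself).
M = [[0.5, 0.0, 0.5],
     [0.0, 1.0, 0.5],
     [0.5, 0.0, 0.0]]


def taxed_pagerank(matrix, steps, keep=0.8):
    """Repeat rank = keep * M * rank + (1 - keep), starting from all ones."""
    n = len(matrix)
    rank = [1.0] * n
    for _ in range(steps):
        rank = [
            keep * sum(matrix[i][j] * rank[j] for j in range(n)) + (1.0 - keep)
            for i in range(n)
        ]
    return rank


fifth = [round(x, 3) for x in taxed_pagerank(M, 4)]      # iteration no. 5
limit = [round(x, 3) for x in taxed_pagerank(M, 200)]    # long-run values
print(fifth)
print(limit)
```

Microsoft still ends up with the largest value, but the fixed share returned to every page keeps Netscape and Amazon from dropping to zero.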