
Link Analysis
CSE 454 Advanced Internet Systems, University of Washington

Ranking Search Results
- TF / IDF calculation
- Tag information: title, headers, font size / capitalization
- Anchor text on other pages
- Link analysis: HITS (hubs and authorities), PageRank

Authority and Hub Pages (1)
A page is a good authority (with respect to a given query) if it is pointed to by many good hubs (with respect to that query). A page is a good hub if it points to many good authorities (for the query). Good authorities and good hubs reinforce one another.

Authority and Hub Pages (2)
Authorities and hubs for a query tend to form a bipartite subgraph of the web graph, with hubs on one side linking to authorities on the other. A page can be both a good authority and a good hub.
[Figure: bipartite graph with hubs on one side pointing to authorities on the other]

Stability
Stability means that small changes to the graph should produce only small changes to the computed weights.
Conclusion: HITS is not stable, but PageRank is quite stable. Details in a few slides.

PageRank Intuition
Think of the Web as a big directed graph, and suppose a surfer keeps randomly clicking on links. The importance of a page is the probability that the surfer is on that page. Derive the transition matrix from the adjacency matrix: if page P has N forward (outgoing) links, the probability that the surfer clicks any particular one of them is 1/N.
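A minimal Python sketch of this random-surfer intuition on a hypothetical three-page graph (the graph and names are illustrative, not taken from the slides): the long-run visit frequency of each page approximates its importance.

```python
import random

# Hypothetical three-page graph: page -> pages it links to (illustrative only).
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}

def simulate(steps=100_000, start="A"):
    """Estimate page importance as the long-run visit frequency of a random surfer."""
    visits = {page: 0 for page in links}
    page = start
    for _ in range(steps):
        visits[page] += 1
        page = random.choice(links[page])   # each of the N out-links has probability 1/N
    return {page: count / steps for page, count in visits.items()}

print(simulate())   # visit frequencies approximate the pages' importance
```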

Matrix Representation
Let M be an N×N matrix with
  m[u][v] = 1/Nv if page v has a link to page u (Nv = number of out-links of v)
  m[u][v] = 0    if there is no link from v to u
Let R0 be the initial rank vector and Ri the N×1 rank vector for the ith iteration. Then Ri = M · Ri-1.

Example: pages A, B, C, D with links A→B, B→C, C→D, D→A, D→C.

            A  B  C  D
      A   [ 0  0  0  ½ ]
  M = B   [ 1  0  0  0 ]        R0 = [ ¼  ¼  ¼  ¼ ]ᵀ
      C   [ 0  1  0  ½ ]
      D   [ 0  0  1  0 ]
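A short NumPy sketch of the iteration Ri = M · Ri-1 for the example graph above; the uniform R0 and the 30 iterations are illustrative choices.

```python
import numpy as np

# M for the example graph above: m[u][v] = 1/N_v if v links to u, else 0
# (rows and columns ordered A, B, C, D).
M = np.array([[0, 0, 0, 0.5],
              [1, 0, 0, 0.0],
              [0, 1, 0, 0.5],
              [0, 0, 1, 0.0]])

R = np.full(4, 0.25)        # R0: uniform initial rank vector
for _ in range(30):
    R = M @ R               # R_i = M . R_{i-1}
print(R)                    # rank vector after 30 iterations
```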

Problem: PageRank Sinks
Sinks = sets of nodes with no out-edges. Why is this a problem? Once the random surfer enters such a set there is no way out, so in the limit the sink soaks up all of the rank mass.
[Figure: example graph with pages A, B, C illustrating a sink]

Solution to Sink Nodes
Let (1-c) denote the chance of a random transition (this is how the surfer escapes a sink), let N be the number of pages, and let K be the N×N matrix with every entry equal to 1/N. Then:
  M* = cM + (1-c)K
  Ri = M* · Ri-1
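A sketch of the same iteration with the random-jump fix folded in. The value c = 0.8 is an assumption chosen to match the 0.05 entries of M* in the next slide's example; M is the example matrix from the previous sketch.

```python
import numpy as np

# Example transition matrix M from the earlier slide (pages A, B, C, D).
M = np.array([[0, 0, 0, 0.5],
              [1, 0, 0, 0.0],
              [0, 1, 0, 0.5],
              [0, 0, 1, 0.0]])

c = 0.8                           # assumed damping; (1 - c) is the random-jump probability
N = M.shape[0]
K = np.full((N, N), 1.0 / N)      # K: every entry equals 1/N
M_star = c * M + (1 - c) * K      # M* = cM + (1-c)K

R = np.full(N, 1.0 / N)
for _ in range(30):
    R = M_star @ R                # R_i = M* . R_{i-1}
print(R)                          # random jumps keep rank from being trapped in sinks
```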

Computing PageRank - Example
Using the same graph and M as on the previous slide, with R0 = [ ¼  ¼  ¼  ¼ ]ᵀ and M* = cM + (1-c)K for c = 0.8 (every 0 entry of M becomes 0.05, every ½ becomes 0.45, and every 1 becomes 0.85):

              A     B     C     D
        A  [ 0.05  0.05  0.05  0.45 ]
  M* =  B  [ 0.85  0.05  0.05  0.05 ]
        C  [ 0.05  0.85  0.05  0.45 ]
        D  [ 0.05  0.05  0.85  0.05 ]

Iterating Ri = M* · Ri-1 from R0, the ranks have essentially converged by R30, with page C highest at roughly 0.332, page D at roughly 0.316, and page A at roughly 0.176.

Worked Example
Note: assume the per-page random-jump term 1/N = 0.15, and instead of the uniform R0 = 1/N, let R0 = 1 everywhere.
[Figure: four pages A, B, C, D, each with initial rank 1]

Navigational Effects (matrix M)
Each page starts with rank 1 and propagates 0.85 of its rank, split evenly across its out-links:
- Page A (two out-links) sends 1*0.85/2 to Page B and 1*0.85/2 to Page C
- Page B sends 1*0.85 to Page C
- Page C sends 1*0.85 to Page A
- Page D sends 1*0.85 to Page C

Random Jumps Give an Extra 0.15
Round 1:
- Page A: 0.85 (from Page C) + 0.15 (random) = 1
- Page B: 0.425 (from Page A) + 0.15 (random) = 0.575
- Page C: 0.85 (from Page D) + 0.85 (from Page B) + 0.425 (from Page A) + 0.15 (random) = 2.275
- Page D: receives nothing but the 0.15 (random) = 0.15
Result after round 1: A = 1, B = 0.575, C = 2.275, D = 0.15

Round 2:
- Page A: 2.275*0.85 (from Page C) + 0.15 (random) = 2.08375
- Page B: 1*0.85/2 (from Page A) + 0.15 (random) = 0.575
- Page C: 0.15*0.85 (from Page D) + 0.575*0.85 (from Page B) + 1*0.85/2 (from Page A) + 0.15 (random) = 1.19125
- Page D: receives nothing but the random 0.15 = 0.15
Result after round 2: A = 2.08375, B = 0.575, C = 1.19125, D = 0.15

Example of Calculation (4)
After 20 iterations, we get: Page A = 1.490, Page B = 0.783, Page C = 1.577, Page D = 0.15
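A short sketch that reproduces this worked example; the 0.15/0.85 split and R0 = 1 follow the slides, and the edge list is the one implied by the propagation steps above.

```python
# Each page starts with rank 1, keeps a fixed 0.15 random-jump term, and receives
# 0.85 * (rank / outdegree) from each page linking to it, updated in lock-step.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
rank = {page: 1.0 for page in links}

for _ in range(20):
    new_rank = {page: 0.15 for page in links}      # random-jump contribution
    for src, dests in links.items():
        share = 0.85 * rank[src] / len(dests)      # 85% of src's rank, split over its out-links
        for d in dests:
            new_rank[d] += share
    rank = new_rank

print({page: round(r, 3) for page, r in rank.items()})
# -> {'A': 1.49, 'B': 0.783, 'C': 1.577, 'D': 0.15}, matching the slide
```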

Example - Conclusions
Page C has the highest importance in the page graph, and Page A has the next highest. Convergence requires many iterations. Is convergence guaranteed?

Linear Algebraic Interpretation
In the limit, PageRank is the principal eigenvector of M*. HITS computes the principal eigenvector of M*(M*)ᵀ, where ᵀ denotes transpose. One can prove that PageRank is stable under small changes to the graph, and that HITS is not.
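A quick check of the eigenvector claim, reusing the example M* (with the assumed c = 0.8 from the earlier sketches): the eigenvector for eigenvalue 1 of the column-stochastic M* matches the result of power iteration.

```python
import numpy as np

# M* for the example graph with damping c = 0.8 (as in the earlier sketches).
M = np.array([[0, 0, 0, 0.5],
              [1, 0, 0, 0.0],
              [0, 1, 0, 0.5],
              [0, 0, 1, 0.0]])
M_star = 0.8 * M + 0.2 * np.full((4, 4), 0.25)

# Principal eigenvector of the column-stochastic M* (eigenvalue 1) ...
eigvals, eigvecs = np.linalg.eig(M_star)
principal = eigvecs[:, np.argmax(eigvals.real)].real
principal /= principal.sum()                   # normalize to sum to 1

# ... should match the result of power iteration (PageRank in the limit).
R = np.full(4, 0.25)
for _ in range(50):
    R = M_star @ R

print(np.allclose(principal, R, atol=1e-6))    # -> True
```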

Stability Analysis
Make 5 subsets by deleting 30% of the pages at random, then re-rank.
[Table: rank positions of the top pages across the perturbed subsets, comparing HITS and PageRank]
Result: PageRank is much more stable.

Practicality
Challenges: M is no longer sparse (so don't represent it explicitly!), and the data is too big for memory (so be clever about disk usage).
Stanford version of Google: 24 million documents in the crawl, 147 GB of documents, 259 million links; computing PageRank took "a few hours" on a single 1997 workstation.
But how? The next discussion follows the Haveliwala paper: instead of representing M* explicitly, perform the multiplication against the rank vector directly from the link structure, using the update equation.

Efficient Computation: Preprocess
Remove "dangling" nodes (pages with no children), then repeat the process, since removing danglers can create new ones.
Stanford WebBase: 25 M pages, 81 M URLs in the link graph; after two prune iterations, 19 M nodes remain.
One issue with this model is dangling links: links that point to pages with no outgoing links. They affect the model because it is not clear where their weight should be distributed, and there are a large number of them. Often dangling links are simply pages that have not been downloaded yet (of the 24 million pages downloaded, 51 million URLs had not been downloaded and hence were dangling). Because dangling links do not directly affect the ranking of any other page, they are simply removed from the system until all the PageRanks are calculated, then added back in afterwards without affecting things significantly. The normalization of the other links on a page with a removed link changes slightly, but this should not have a large effect.
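A minimal sketch of the iterative pruning step described above, on a hypothetical toy graph: repeatedly drop nodes with no out-links until none remain.

```python
# Repeatedly drop nodes with no out-links; removing a dangler can create new ones,
# so the process is repeated until no dangling nodes remain. (Toy graph, hypothetical.)
def prune_danglers(links):
    """links: dict mapping each node to a list of its out-neighbors."""
    links = {src: list(dsts) for src, dsts in links.items()}
    while True:
        dangling = {node for node, dsts in links.items() if not dsts}
        if not dangling:
            return links
        # drop dangling nodes, and drop edges pointing to them
        links = {src: [d for d in dsts if d not in dangling]
                 for src, dsts in links.items() if src not in dangling}

graph = {"A": ["B"], "B": ["A", "C"], "C": ["D"], "D": []}
print(prune_danglers(graph))   # D is dangling; pruning D makes C dangling too
# -> {'A': ['B'], 'B': ['A']}
```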

Representing the 'Links' Table
Stored on disk in binary format:

  Source node (32-bit int)   Outdegree (16-bit int)   Destination nodes (32-bit ints)
  0                          4                        12, 26, 58, 94
  1                          3                        5, 56, 69
  2                          5                        1, 9, 10, 36, 78

Size for the Stanford WebBase: 1.01 GB, assumed to exceed main memory.
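A sketch of how records shaped like the table above could be packed and streamed with Python's struct module; the exact on-disk layout and the helper names here are assumptions based on the field sizes listed.

```python
import struct

# Hypothetical writer/reader for records shaped like the table above:
# a 32-bit source id, a 16-bit outdegree, then that many 32-bit destination ids.
def write_links(path, rows):
    with open(path, "wb") as f:
        for source, dests in rows:
            f.write(struct.pack("<IH", source, len(dests)))   # source (32 bit) + outdegree (16 bit)
            f.write(struct.pack(f"<{len(dests)}I", *dests))   # destination ids (32 bit each)

def read_links(path):
    with open(path, "rb") as f:
        while header := f.read(6):                            # 4 + 2 header bytes per record
            source, n = struct.unpack("<IH", header)
            dests = struct.unpack(f"<{n}I", f.read(4 * n))
            yield source, list(dests)

write_links("links.bin", [(0, [12, 26, 58, 94]), (1, [5, 56, 69]), (2, [1, 9, 10, 36, 78])])
print(list(read_links("links.bin")))
```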

Algorithm 1
[Figure: Dest = Links (sparse) × Source]

  ∀s: Source[s] = 1/N
  while residual > ε {
      ∀d: Dest[d] = 0
      while not Links.eof() {
          Links.read(source, n, dest1, ..., destn)
          for j = 1 ... n
              Dest[destj] = Dest[destj] + Source[source]/n
      }
      ∀d: Dest[d] = c * Dest[d] + (1-c)/N   /* dampening, with c as in M* = cM + (1-c)K */
      residual = ||Source - Dest||          /* recompute every few iterations */
      Source = Dest
  }
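A runnable sketch of Algorithm 1. It streams the link file once per iteration and never materializes M. It assumes the hypothetical read_links() helper from the previous sketch, a link file whose node ids lie in [0, N), and illustrative values for N, c, and epsilon.

```python
N, c, epsilon = 4, 0.8, 1e-6

def pagerank_from_disk(links_path):
    source = [1.0 / N] * N                            # Source[s] = 1/N
    while True:
        dest = [0.0] * N                              # Dest[d] = 0
        for s, dests in read_links(links_path):       # sequential pass over Links
            share = source[s] / len(dests)
            for d in dests:
                dest[d] += share                      # Dest[d] += Source[s]/n
        dest = [c * x + (1 - c) / N for x in dest]    # dampening
        residual = sum(abs(a - b) for a, b in zip(source, dest))
        source = dest                                 # Source = Dest
        if residual <= epsilon:
            return source
```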

Analysis
If memory is big enough to hold Source and Dest:
- IO cost per iteration is |Links|
- Fine for a crawl of 24 M pages, but the web had over 8 B pages in 2005 [Google], up from 320 M pages in 1997 [NEC study]
If memory is big enough to hold just Dest:
- Sort Links on the source field
- Read Source sequentially during the rank-propagation step
- Write Dest to disk to serve as Source for the next iteration
- IO cost per iteration is |Source| + |Dest| + |Links|
If memory can't hold Dest:
- The random access pattern makes the working set as large as |Dest|: thrash!
Note on precision: single-precision floating point is good enough for the rank arrays.

Block-Based Algorithm
Partition Dest into B blocks of D pages each. If memory holds P physical pages, then D < P - 2, since input buffers are also needed for Source and Links. Partition Links into B files: Linksi holds, for each source, only the dest nodes such that DD*i <= dest < DD*(i+1), where DD is the number of 32-bit integers that fit in D pages.
[Figure: Dest = Links (sparse) × Source, with Dest and Links partitioned into blocks]
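A minimal sketch of one block-based pass, assuming the partitioned link data is already in memory as Python lists rather than the on-disk Linksi files, and the assumed c = 0.8 from earlier; the function and variable names are hypothetical.

```python
# One block-based iteration: process the Dest vector one block at a time so that only
# the current block needs to be memory-resident. Each link record carries the source's
# full outdegree plus only the destinations that fall in the current block.
def block_iteration(links_by_block, source_rank, N, block_size, c=0.8):
    dest = [0.0] * N
    for i, block_links in enumerate(links_by_block):            # one pass per Dest block
        lo, hi = i * block_size, min((i + 1) * block_size, N)   # block i covers [lo, hi)
        for src, outdegree, dests in block_links:
            share = source_rank[src] / outdegree                # divide by the FULL outdegree
            for d in dests:
                if lo <= d < hi:                                # all dests in Links_i satisfy this
                    dest[d] += share
    return [c * x + (1 - c) / N for x in dest]                  # dampening, as before

# The partitioned records shown on the next slide (buckets of 32 destination ids each):
links_by_block = [
    [(0, 4, [12, 26]), (1, 3, [5]),  (2, 5, [1, 9, 10])],   # dests 0-31
    [(0, 4, [58]),     (1, 3, [56]), (2, 5, [36])],          # dests 32-63
    [(0, 4, [94]),     (1, 3, [69]), (2, 5, [78])],          # dests 64-95
]
ranks = block_iteration(links_by_block, [1.0 / 96] * 96, N=96, block_size=32)
```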

Partitioned Link File

  Source node    Outdegree    Num out     Destination nodes
  (32-bit int)   (16-bit)     (16-bit)    (32-bit ints)

  Bucket 0-31:
    0              4            2           12, 26
    1              3            1           5
    2              5            3           1, 9, 10
  Bucket 32-63:
    0              4            1           58
    1              3            1           56
    2              5            1           36
  Bucket 64-95:
    0              4            1           94
    1              3            1           69
    2              5            1           78

Analysis of Block Algorithm
IO cost per iteration = B*|Source| + |Dest| + |Links|*(1+e), where e is the factor by which Links grew when partitioned (typically 0.1 to 0.3, depending on the number of blocks). The algorithm is essentially a nested-loops join.

Comparing the Algorithms

Adding PageRank to a Search Engine
Use a weighted sum of importance and similarity to the query:
  Score(q, p) = w * sim(q, p) + (1-w) * R(p),  if sim(q, p) > 0
              = 0,                             otherwise
where 0 < w < 1, and sim(q, p) and R(p) are each normalized to [0, 1].
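A tiny sketch of this scoring rule; the weight w = 0.5 and the example values are assumptions for illustration.

```python
# w is a tunable weight between similarity and PageRank; 0.5 here is just an example.
def score(sim_qp: float, pagerank_p: float, w: float = 0.5) -> float:
    """sim_qp and pagerank_p are assumed to be normalized to [0, 1]."""
    if sim_qp <= 0:
        return 0.0
    return w * sim_qp + (1 - w) * pagerank_p

print(score(sim_qp=0.4, pagerank_p=0.9))   # -> 0.65
```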

Summary of Key Points
- PageRank: an iterative algorithm; beware rank sinks
- Efficiency of computation: memory!
  - Single-precision numbers suffice
  - Don't represent M* explicitly
  - Break arrays into blocks; minimize IO cost
- Number of iterations of PageRank
- Weighting of PageRank vs. document similarity