Link Analysis. CSE 454 Advanced Internet Systems, University of Washington. Copyright © 2000-2005 D.S. Weld.

Presentation transcript:


Administrivia
Schedule
–Today (Me): Link Analysis & Alta Vista Case Studies
–Thurs 11/31 (Adar): Web Log Mining
–Tues 12/4 (Benaloh): Cryptography & Web Security
–Thurs 12/6 (You): Project Presentations

Project Presentations
Length: 10 min + 3 min for questions
–Practice presentation & gauge length
PowerPoint or PDF slides (max 10)
–Mail to me by 11am on 12/6
Every member should talk for some of the slot
Subtopics
–Aspirations & reality of what you built
–Demo?
–Surprises (what was harder or easier than expected)
–What did you learn?
–Experiments & validation
–Who did what

Ranking Search Results
TF / IDF calculation
Tag information
–Title, headers
Font size / capitalization
Anchor text on other pages
Link analysis
–HITS (Hubs and Authorities)
–PageRank

PageRank Intuition
Think of the Web as a big graph.
Suppose a surfer keeps randomly clicking on links.
Importance of a page = probability of being on that page.
Derive the transition matrix from the adjacency matrix:
–Suppose there are N forward links from page P
–Then the probability that the surfer clicks on any one of them is 1/N
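The step from adjacency information to transition probabilities can be sketched as follows. This is a minimal illustration, not the course's code: the four-page graph is hypothetical (it is not the slide's example), and `transition_probabilities` is a name chosen here. It assumes each page lists every target at most once.

```python
from fractions import Fraction

def transition_probabilities(out_links):
    """Give each of a page's N forward links probability 1/N.

    out_links maps a page to the list of pages it links to
    (assumed to contain no duplicates).
    """
    probs = {}
    for page, targets in out_links.items():
        n = len(targets)
        # A page with no forward links gets an empty row here.
        probs[page] = {t: Fraction(1, n) for t in targets} if n else {}
    return probs

# Hypothetical graph, invented for illustration.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "B", "C"]}
probs = transition_probabilities(graph)
print(probs["A"])  # each of A's two links gets probability 1/2
```

Exact fractions are used only to make the 1/N values easy to read; a real computation over a crawl would use floating point.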

Matrix Representation
Let M be an $N \times N$ matrix:
–$m_{uv} = 1/N_v$ if page v has a link to page u, where $N_v$ is the number of forward links from v
–$m_{uv} = 0$ if there is no link from v to u
Let $R_0$ be the initial rank vector
Let $R_i$ be the $N \times 1$ rank vector for the i-th iteration
Then $R_i = M \times R_{i-1}$
Example (figure): a four-page graph A, B, C, D with its matrix M and initial rank vector $R_0 = (¼, ¼, ¼, ¼)^T$
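A small sketch of the iteration $R_i = M \times R_{i-1}$ may help. The graph below is hypothetical (the entries of the slide's own matrix did not survive), the helper names `build_matrix` and `step` are invented here, and 30 iterations is an arbitrary choice for illustration.

```python
def build_matrix(pages, out_links):
    """Return M as a list of rows with M[u][v] = 1/N_v if v links to u, else 0."""
    index = {p: i for i, p in enumerate(pages)}
    n = len(pages)
    m = [[0.0] * n for _ in range(n)]
    for v, targets in out_links.items():
        for u in targets:
            m[index[u]][index[v]] = 1.0 / len(targets)
    return m

def step(m, r):
    """One iteration: R_i = M x R_{i-1}."""
    return [sum(m_uv * r_v for m_uv, r_v in zip(row, r)) for row in m]

pages = ["A", "B", "C", "D"]
# Hypothetical links; every page has at least one forward link.
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["A", "B", "C"]}
M = build_matrix(pages, links)
R = [0.25] * 4  # R_0: start with equal rank everywhere
for _ in range(30):
    R = step(M, R)
print(dict(zip(pages, R)))
```

Because every column of this M sums to 1, the total rank stays at 1 across iterations. Page D has no in-links, so its rank drops to 0 after the first step.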

Problem: Page Sinks
Sink = node (or set of nodes) with no out-edges.
Why is this a problem?
Figure: three-node graph A, B, C
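One way to see the problem is to watch the total rank while iterating on a graph that ends in a sink. The chain A → B → C below is an assumed example (the slide's figure only shows the node labels), and `propagate` is a name chosen for this sketch.

```python
def propagate(out_links, rank):
    """Push each page's rank evenly along its out-edges; a sink pushes nothing."""
    new_rank = {p: 0.0 for p in rank}
    for page, targets in out_links.items():
        for t in targets:
            new_rank[t] += rank[page] / len(targets)
    return new_rank

# Assumed chain: A -> B -> C, where C has no out-edges (a sink).
links = {"A": ["B"], "B": ["C"], "C": []}
rank = {p: 1.0 / 3 for p in links}
totals = []
for _ in range(3):
    rank = propagate(links, rank)
    totals.append(sum(rank.values()))
print(totals)  # the total shrinks every step
```

Rank that reaches the sink is never passed on, so the total leaks away step by step and the ranks stop describing a probability distribution.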

Solution to Sink Nodes
Let:
–(1-c) = chance of a random transition from a sink
–N = the number of pages
–K = the $N \times N$ matrix whose every entry is $1/N$
$M^* = cM + (1-c)K$
$R_i = M^* \times R_{i-1}$
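The construction of $M^*$ can be sketched on a tiny graph. Several things here are assumptions rather than content from the slides: the chain A → B → C, the value c = 0.85, the helper names, and the rescaling of R to sum to 1 after each step (added so the ranks stay comparable even though the sink's column of $M^*$ sums to less than 1).

```python
def damped_matrix(m, c):
    """Return M* = c*M + (1-c)*K, where every entry of K is 1/N."""
    n = len(m)
    return [[c * m[u][v] + (1.0 - c) / n for v in range(n)] for u in range(n)]

def step_normalized(m_star, r):
    """R_i = M* x R_{i-1}, rescaled to sum to 1 (the rescaling is an assumption)."""
    new_r = [sum(a * b for a, b in zip(row, r)) for row in m_star]
    total = sum(new_r)
    return [x / total for x in new_r]

# Assumed chain A -> B -> C; C is a sink, so its column in M is all zeros.
M = [[0.0, 0.0, 0.0],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
M_star = damped_matrix(M, c=0.85)  # c = 0.85 is an illustrative value
R = [1.0 / 3] * 3
for _ in range(50):
    R = step_normalized(M_star, R)
print(R)
```

Every entry of $M^*$ is positive, so rank can always move from any page to any other and no longer drains to zero. The explicit matrix is built here only because the example is tiny.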

Computing PageRank - Example
Figure: the matrices M and $M^*$ for the four-page graph A, B, C, D, with the initial rank vector $R_0 = (¼, ¼, ¼, ¼)^T$ iterated to $R_{30}$

Authority and Hub Pages (1)
A page is a good authority (with respect to a given query) if it is pointed to by many good hubs (with respect to the query).
A page is a good hub (with respect to a given query) if it points to many good authorities (for the query).
Good authorities and good hubs reinforce one another.

Authority and Hub Pages (2)
Authorities and hubs for a query tend to form a bipartite subgraph of the web graph.
A page can be both a good authority and a good hub.
Figure: bipartite graph with hubs on one side and authorities on the other
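The mutual reinforcement between hubs and authorities can be sketched as an iteration. This is a minimal version under stated assumptions: the page names and links are hypothetical, scores are rescaled to unit length after each round (a common choice, not something the slides specify), and the iteration count is arbitrary.

```python
import math

def hits(out_links, iterations=50):
    """Return (hub, authority) score dicts for a small link graph."""
    pages = set(out_links) | {t for ts in out_links.values() for t in ts}
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages pointing to it.
        auth = {p: 0.0 for p in pages}
        for src, targets in out_links.items():
            for t in targets:
                auth[t] += hub[src]
        # Hub score: sum of the authority scores of pages it points to.
        hub = {p: sum(auth[t] for t in out_links.get(p, [])) for p in pages}
        # Rescale both vectors to unit length so the scores stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth

# Hypothetical bipartite-looking graph: h* pages point at a* pages.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2"], "h3": ["a2"]}
hub, auth = hits(links)
print(sorted(auth.items(), key=lambda kv: -kv[1])[:2])
```

Here a2 ends up as the stronger authority because all three hubs point to it, and h1 and h2 end up as stronger hubs than h3 because they point to both authorities.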

Linear Algebraic Interpretation
PageRank = principal eigenvector of $M^*$
–in the limit
HITS = principal eigenvector of $M^* \times (M^*)^T$
–where $[\,]^T$ denotes transpose
Stability: small changes to the graph ⇒ small changes to the weights
–Can prove PageRank is stable
–And HITS isn't
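The eigenvector claim for PageRank can be checked numerically on a toy case: after enough iterations, multiplying R by $M^*$ should leave it unchanged. The sketch below assumes a hypothetical three-page graph with no sinks and c = 0.85; the helper names are invented here.

```python
def mat_vec(m, r):
    """Multiply matrix m (list of rows) by vector r."""
    return [sum(a * b for a, b in zip(row, r)) for row in m]

def power_iteration(m, iterations=100):
    """Repeatedly apply m to a uniform start vector, rescaling to sum to 1."""
    n = len(m)
    r = [1.0 / n] * n
    for _ in range(iterations):
        r = mat_vec(m, r)
        total = sum(r)
        r = [x / total for x in r]
    return r

c, n = 0.85, 3  # illustrative damping value and graph size
# Hypothetical sink-free graph: A -> B, A -> C, B -> C, C -> A.
M = [[0.0, 0.0, 1.0],
     [0.5, 0.0, 0.0],
     [0.5, 1.0, 0.0]]
M_star = [[c * M[u][v] + (1.0 - c) / n for v in range(n)] for u in range(n)]
R = power_iteration(M_star)
# If R is an eigenvector with eigenvalue 1, then M* x R equals R.
residual = max(abs(a - b) for a, b in zip(mat_vec(M_star, R), R))
print(residual)
```

A residual near zero means the iteration has settled on a vector that $M^*$ maps to itself, which is what "principal eigenvector in the limit" refers to.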

Stability Analysis (Empirical)
Make 5 subsets, each by randomly deleting 30%
PageRank is much more stable

Practicality
Challenges
–M is no longer sparse (don't represent it explicitly!)
–Data too big for memory (be sneaky about disk usage)
Stanford version of Google:
–24 million documents in crawl
–147 GB of documents
–259 million links
–Computing PageRank took a "few hours" on a single 1997 workstation
But how?
–Next discussion from Haveliwala paper…

Efficient Computation: Preprocess
Remove 'dangling' nodes
–Pages w/ no children
Then repeat the process
–Since there are now more danglers
Stanford WebBase
–25 M pages
–81 M URLs in the link graph
–After two prune iterations: 19 M nodes
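The repeated pruning described above can be sketched like this. The graph is hypothetical, `prune_dangling` is a name chosen here, and treating any URL that appears only as a link target as a childless node is an assumption of the sketch.

```python
def prune_dangling(out_links, max_rounds=None):
    """Repeatedly drop nodes with no children, plus the links pointing at them.

    Returns the pruned graph and the number of prune rounds performed.
    """
    pages = set(out_links) | {t for ts in out_links.values() for t in ts}
    graph = {p: set(out_links.get(p, ())) for p in pages}
    rounds = 0
    while max_rounds is None or rounds < max_rounds:
        dangling = {p for p, ts in graph.items() if not ts}
        if not dangling:
            break
        # Removing danglers can leave their parents childless: "more danglers".
        graph = {p: ts - dangling for p, ts in graph.items() if p not in dangling}
        rounds += 1
    return graph, rounds

# Hypothetical graph: D dangles; once D is gone, C dangles too.
links = {"A": ["B", "C"], "B": ["A"], "C": ["D"], "D": []}
pruned, rounds = prune_dangling(links)
print(pruned, rounds)
```

Passing `max_rounds=2` would mirror the "two prune iterations" mentioned for the Stanford WebBase data, stopping even if some danglers remain.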

Representing 'Links' Table
Stored on disk in binary format
Size for Stanford WebBase: 1.01 GB
–Assumed to exceed main memory
–(But Source & Dest assumed to fit)
Record layout: Source node (32-bit integer) | Outdegree (16-bit integer) | Destination nodes (32-bit integers)
Example destination lists from the figure, one per record:
–…, 26, 58, 94
–5, 56, 69
–1, 9, 10, 36, 78
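A record layout like this one can be written and read with Python's `struct` module. The sketch below is only an illustration: the little-endian byte order, the function names, and the source node ids 1 and 2 are assumptions (the slide gives the field widths but not these details).

```python
import io
import struct

HEADER = struct.Struct("<IH")  # 32-bit source node, 16-bit outdegree (little-endian assumed)

def write_record(f, source, dests):
    """Append one Links record: header followed by the 32-bit destination ids."""
    f.write(HEADER.pack(source, len(dests)))
    f.write(struct.pack("<%dI" % len(dests), *dests))

def read_records(f):
    """Yield (source, dests) records sequentially until end of file."""
    while True:
        header = f.read(HEADER.size)
        if len(header) < HEADER.size:
            return
        source, n = HEADER.unpack(header)
        dests = struct.unpack("<%dI" % n, f.read(4 * n))
        yield source, list(dests)

buf = io.BytesIO()  # stands in for the on-disk file
write_record(buf, 1, [5, 56, 69])          # source ids here are hypothetical
write_record(buf, 2, [1, 9, 10, 36, 78])
buf.seek(0)
records = list(read_records(buf))
print(records)
```

Each record costs 6 bytes of header plus 4 bytes per destination, and the reader only ever moves forward through the file, which matches the sequential scan the later algorithm relies on.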

Algorithm 1
Figure: Dest = Links (sparse) × Source; rows are indexed by dest node, columns by source node
∀s Source[s] = 1/N
while residual > τ {
    ∀d Dest[d] = 0
    while not Links.eof() {
        Links.read(source, n, dest_1, …, dest_n)
        for j = 1 … n
            Dest[dest_j] = Dest[dest_j] + Source[source]/n
    }
    ∀d Dest[d] = c * Dest[d] + (1-c)/N   /* damping, with c as in M* = cM + (1-c)K */
    residual = ‖Source – Dest‖   /* recompute every few iterations */
    Source = Dest
}
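A runnable sketch of this loop is given below. It is not the paper's code: the function name, the in-memory list standing in for the Links file, the L1 norm for the residual, c = 0.85, and the tolerance are all assumptions. The damping line follows the convention $M^* = cM + (1-c)K$ from the sink-node slide, so the random-jump share is (1-c)/N.

```python
def pagerank_algorithm1(links, n, c=0.85, tol=1e-8, max_iter=100):
    """Iterate Dest = damped(Links x Source) until the residual falls below tol.

    links: sequence of (source, [dest, ...]) records, scanned once per iteration.
    n: number of pages, numbered 0 .. n-1.
    """
    source = [1.0 / n] * n
    for _ in range(max_iter):
        dest = [0.0] * n
        for s, dests in links:          # one sequential pass over Links
            if not dests:               # dangling nodes are assumed to be pruned
                continue
            share = source[s] / len(dests)
            for d in dests:
                dest[d] += share
        # Damping: keep fraction c of the propagated rank, spread (1-c) evenly.
        dest = [c * x + (1.0 - c) / n for x in dest]
        residual = sum(abs(a - b) for a, b in zip(source, dest))
        source = dest
        if residual <= tol:
            break
    return source

# Hypothetical three-page graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
links = [(0, [1, 2]), (1, [2]), (2, [0])]
ranks = pagerank_algorithm1(links, n=3)
print(ranks)
```

The matrix is never built; only the two rank arrays and the stream of link records are touched. This version recomputes the residual every iteration for simplicity, whereas the slide suggests doing so only every few iterations.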

Analysis
If memory can hold both Source & Dest
–IO cost per iteration is |Links|
–Fine for a crawl of 24 M pages
–But web > 8 B pages in 2005 [Google]
–Increase from 320 M pages in 1997 [NEC study]
If memory is only big enough to hold just Dest…?
–Sort Links on source field
–Read Source sequentially during rank propagation step
–Write Dest to disk to serve as Source for next iteration
–IO cost per iteration is |Source| + |Dest| + |Links|
But what if memory can't even hold Dest?
–Random access pattern will make working set = |Dest|
–Thrash!!!
Figure: Dest = Links × Source; rows are indexed by dest node, columns by source node

Block-Based Algorithm
Partition Dest into B blocks of D pages each
–If memory = P physical pages
–D < P-2, since input buffers are needed for Source & Links
Partition (sorted) Links into B files
–Links_i only has some of the dest nodes for each source
–Specifically, Links_i only has dest nodes such that $DD \cdot i \le dest < DD \cdot (i+1)$
–where DD = number of 32-bit integers that fit in D pages
Figure: Dest = Links (sparse) × Source; rows are indexed by dest node, columns by source node
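The partitioning rule can be sketched as follows. The function name, the in-memory lists standing in for the B files, and the example records are assumptions; `block_size` plays the role of DD. Each partitioned record keeps the source's full outdegree, since the rank share sent along a link is still Source[source] divided by the total number of out-links.

```python
def partition_links(links, block_size, num_blocks):
    """Split Links into num_blocks partitions by destination range.

    Partition i receives, for each source, only the dests with
    block_size * i <= dest < block_size * (i + 1).
    Each output record is (source, full_outdegree, dests_in_this_block).
    Destinations are assumed to be below block_size * num_blocks.
    """
    parts = [[] for _ in range(num_blocks)]
    for source, dests in links:
        outdegree = len(dests)
        by_block = {}
        for d in dests:
            by_block.setdefault(d // block_size, []).append(d)
        for i, subset in by_block.items():
            parts[i].append((source, outdegree, subset))
    return parts

# Hypothetical records; the first destination list is invented for illustration.
links = [(0, [3, 40, 70]), (1, [5, 56, 69]), (2, [1, 9, 10, 36, 78])]
parts = partition_links(links, block_size=32, num_blocks=3)
for i, part in enumerate(parts):
    print(i, part)
```

With the links split this way, one block of Dest can stay in memory while its partition is scanned, at the price of re-reading Source once per block.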

Partitioned Link File
Figure: the Links table split into three partitions by destination range, labeled Buckets 0-31, a middle bucket range, and Buckets 64-95
Record layout: Source node (32-bit int) | Outdegree (16-bit) | Num out (16-bit) | Destination nodes (32-bit integers)

Analysis of Block Algorithm
IO cost per iteration = B*|Source| + |Dest| + |Links|*(1+e)
–e is the factor by which Links increased in size
–Depends on the number of blocks
Algorithm ~ nested-loops join
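The per-iteration IO costs from the analysis slides can be put side by side as small functions. The function names and the sizes in the example are made up for illustration (any consistent unit, such as disk pages, works); only the three formulas come from the slides.

```python
def io_cost_all_in_memory(links):
    """Source and Dest both fit in memory: only Links is scanned."""
    return links

def io_cost_dest_in_memory(source, dest, links):
    """Only Dest fits: read Source, write Dest, scan Links."""
    return source + dest + links

def io_cost_block(source, dest, links, num_blocks, e):
    """Block-based: B * |Source| + |Dest| + |Links| * (1 + e)."""
    return num_blocks * source + dest + links * (1 + e)

# Made-up sizes, all in the same unit.
source_size, dest_size, links_size = 10, 10, 100
print(io_cost_all_in_memory(links_size))
print(io_cost_dest_in_memory(source_size, dest_size, links_size))
print(io_cost_block(source_size, dest_size, links_size, num_blocks=4, e=0.25))
```

The block-based cost grows with the number of blocks because Source is re-read once per block, which is the sense in which the algorithm resembles a nested-loops join.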

Comparing the Algorithms (figure)


Adding PageRank to a Search Engine
Weighted sum of importance and similarity with the query:
$Score(q, p) = w \cdot sim(q, p) + (1-w) \cdot R(p)$ if $sim(q, p) > 0$
$Score(q, p) = 0$ otherwise
where
–$0 < w < 1$
–$sim(q, p)$ and $R(p)$ must be normalized to [0, 1]
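The scoring rule translates directly into a small function. The name `score`, the default w = 0.5, and the candidate pages with their similarity and rank values are all invented for this sketch; both inputs are assumed to be already normalized to [0, 1].

```python
def score(sim, rank, w=0.5):
    """Combine query similarity and PageRank: w*sim + (1-w)*rank, or 0 if sim is 0."""
    if not 0.0 < w < 1.0:
        raise ValueError("w must satisfy 0 < w < 1")
    if sim <= 0.0:
        return 0.0  # pages that do not match the query are never returned
    return w * sim + (1.0 - w) * rank

# Hypothetical candidates: page -> (sim(q, p), R(p)), both already in [0, 1].
candidates = {"p1": (0.9, 0.1), "p2": (0.5, 1.0), "p3": (0.0, 0.9)}
ranked = sorted(candidates, key=lambda p: score(*candidates[p]), reverse=True)
print(ranked)
```

With equal weights, p2 outranks p1 because its high importance outweighs its lower similarity, while p3 scores 0 despite a high rank because it does not match the query at all.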

Summary of Key Points
PageRank iterative algorithm
Sink pages
Efficiency of computation: memory!
–Don't represent M* explicitly.
–Minimize IO cost.
–Break arrays into blocks.
–Single-precision numbers are OK.
Number of iterations of PageRank.
Weighting of PageRank vs. document similarity.