1
CS246: Page Selection
2
Junghoo "John" Cho (UCLA Computer Science)
Page Selection
Infinite number of pages on the Web
– E.g., infinite pages from a calendar site
How to select the pages to download?
3
Challenges Due to Infinity
What does Web coverage mean?
– 8 billion vs. 20 billion
How much should I download?
– 8 billion? 100 billion?
– How much have I covered?
– When can I stop?
How to maximize coverage?
– How can we define coverage?
4
RankMass
Web coverage weighted by PageRank
Q: Why PageRank?
A:
– Primary ranking metric for search results
– User’s visit probability under the random surfer model
5
PageRank
A page is important if it is pointed to by many important pages
$PR(p) = PR(p_1)/c_1 + \dots + PR(p_k)/c_k$
– $p_i$: page pointing to $p$; $c_i$: number of links in $p_i$
The PageRank of $p$ is the sum of the contributions of its parents, where each parent $p_i$ passes on $PR(p_i)/c_i$
One equation for every page
– $N$ equations, $N$ unknown variables
6
PageRank: Random Surfer Model
The probability that a Web surfer reaches a page after many clicks, following random links
7
Damping Factor and Trust Score
Users do not always follow links
– They get distracted and “jump” to other pages
– $d$: damping factor. Probability of following a link.
– $t_i$: trust score. Non-zero only for the pages that the user trusts and jumps to.
“TrustRank”, “Personalized PageRank”
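A minimal sketch of how $d$ and $t_i$ enter the computation, assuming the damped equation takes the common form $PR(p_i) = d \sum_j PR(p_j)/c_j + (1-d)\,t_i$, where the sum runs over the pages $p_j$ pointing to $p_i$. The function name, the dictionary-based graph format, and the handling of pages without out-links (the surfer jumps according to the trust scores) are assumptions made for illustration, not part of the slides.

```python
def personalized_pagerank(links, trust, d=0.85, iterations=100):
    """Power iteration for personalized PageRank (illustrative sketch).

    links: dict mapping each page to the list of pages it links to;
           every link target must also appear as a key.
    trust: dict mapping trusted pages to trust scores that sum to 1.
    """
    pages = list(links)
    # Start with all probability on the trusted pages.
    pr = {p: trust.get(p, 0.0) for p in pages}
    for _ in range(iterations):
        # With probability (1 - d) the surfer jumps to a trusted page.
        nxt = {p: (1 - d) * trust.get(p, 0.0) for p in pages}
        for p, outs in links.items():
            if outs:
                # With probability d the surfer follows one of c links.
                share = d * pr[p] / len(outs)
                for q in outs:
                    nxt[q] += share
            else:
                # Assumption: from a page with no links, jump per trust.
                for q, t in trust.items():
                    nxt[q] += d * pr[p] * t
        pr = nxt
    return pr


# Hypothetical two-page Web: a and b link to each other, only a is trusted.
example = personalized_pagerank({"a": ["b"], "b": ["a"]}, {"a": 1.0}, d=0.5)
```

In this two-page example the equations are $PR(a) = 0.5\,PR(b) + 0.5$ and $PR(b) = 0.5\,PR(a)$, so the iteration should approach $PR(a) = 2/3$ and $PR(b) = 1/3$.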
8
RankMass: Definition
RankMass of $D_C$ (the set of downloaded pages): $RM(D_C) = \sum_{p_i \in D_C} PR(p_i)$
– Assuming personalized PageRank
Now what? How can we use it for the crawling problem?
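Read as a sum of PageRank values over the downloaded set, the definition can be sketched in a few lines. The function name and the example PageRank values below are hypothetical; in a real crawl the exact values are not known, which is the problem the following slides address.

```python
def rank_mass(pagerank, downloaded):
    """Sum of the PageRank values of the downloaded pages (sketch)."""
    return sum(pagerank.get(p, 0.0) for p in downloaded)


# Hypothetical PageRank values for a three-page Web.
pagerank = {"a": 0.5, "b": 0.3, "c": 0.2}
coverage = rank_mass(pagerank, {"a", "c"})  # mass of pages a and c
```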
9
Two Crawling Challenges
Coverage guarantee:
– Given $\epsilon$, make sure we download at least $1 - \epsilon$ of the RankMass
Crawling efficiency:
– For a given $|D_C|$, pick $D_C$ such that $RM(D_C)$ is maximized
10
RankMass Guarantee
Q: How can we provide a RankMass guarantee when we stop?
Q: How do we calculate RankMass without downloading the whole Web?
Q: Is there any way to provide the guarantee without knowing the exact PageRank?
11
RankMass Guarantee
We can’t compute the exact PageRank, but we can lower-bound it
How? Let’s start with a simple case
12
Single Trusted Page
$t_1 = 1$; $t_i = 0$ $(i \neq 1)$
Always jump to $p_1$ when bored
$N_L(p_1)$: pages reachable from $p_1$ in at most $L$ links
13
Single Trusted Page
Q: What is the probability of getting to a page $L$ links away from $p_1$?
14
RankMass Lower Bound: Single Trusted Page
Assuming the trust vector $T^{(1)}$, the sum of the PageRank values of all $L$-neighbors of $p_1$ is at least $d^{L+1}$ close to 1:
$\sum_{p_i \in N_L(p_1)} PR(p_i) \geq 1 - d^{L+1}$
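One way to see where this bound comes from is the short derivation below. It is a sketch that assumes every page has at least one out-link, so the surfer never gets stuck; the counter $k$ (links followed since the last jump to $p_1$) is introduced here for illustration.

```latex
% k = number of links the surfer has followed since the last jump to p_1.
% At every step the surfer jumps with probability (1-d), so in the long run
\Pr[\,k \text{ links since the last jump}\,] = (1-d)\,d^{k}

% After k \le L links the surfer is still inside N_L(p_1), hence
\sum_{p_i \in N_L(p_1)} PR(p_i)
  \;\ge\; \sum_{k=0}^{L} (1-d)\,d^{k}
  \;=\; 1 - d^{L+1}
```

The first line also suggests an answer to the question on the previous slide: the surfer is exactly $L$ links past the last jump with probability $(1-d)\,d^{L}$.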
15
PageRank Linearity
Let $R_i$ be the PageRank vector based on trust vector $T_i$.
Then, for any weights $\alpha_i \geq 0$ with $\sum_i \alpha_i = 1$, the PageRank vector based on $\sum_i \alpha_i T_i$ is $\sum_i \alpha_i R_i$.
16
RankMass Lower Bound: General Case
The RankMass of the $L$-neighbors of the group of all trusted pages $G$, $N_L(G)$, is at least $d^{L+1}$ close to 1. That is:
$\sum_{p_i \in N_L(G)} PR(p_i) \geq 1 - d^{L+1}$
Q: Given the result, how should we download for a RankMass guarantee?
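As a small worked example of what the bound implies, the sketch below finds the smallest depth $L$ for which $d^{L+1} \leq \epsilon$, that is, the depth at which the $L$-neighbors are guaranteed to hold at least $1 - \epsilon$ of the RankMass. The function name and the sample values of $d$ and $\epsilon$ are assumptions chosen for illustration.

```python
def min_depth(d, epsilon):
    """Smallest L with d**(L + 1) <= epsilon (illustrative sketch)."""
    if not (0 < d < 1) or epsilon <= 0:
        raise ValueError("expected 0 < d < 1 and epsilon > 0")
    depth = 0
    # Go one level deeper while the uncovered mass bound is still too large.
    while d ** (depth + 1) > epsilon:
        depth += 1
    return depth


# With d = 0.85 and epsilon = 0.02: 0.85**24 is about 0.0202 (too large),
# while 0.85**25 is about 0.0172, so 24 levels are needed.
levels = min_depth(0.85, 0.02)
```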
17
The L-Neighbor Crawler
1. L := 0
2. N[0] = {p_i | t_i > 0}   // Start with trusted pages
3. While (ε < d^(L+1)):
   1. Download all uncrawled pages in N[L]
   2. N[L + 1] = {all pages linked to by a page in N[L]}
   3. L = L + 1
Essentially, a BFS (Breadth-First Search) crawling algorithm
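The steps above can be sketched as a simulation over an in-memory link graph. This is an illustration under stated assumptions, not the authors' implementation: the function name, the dictionary-based graph, and the default values of `d` and `epsilon` are hypothetical. One deliberate choice: the stopping test runs after each level is downloaded, so the returned set is exactly the $L$-neighbors for the first $L$ with $d^{L+1} \leq \epsilon$.

```python
def l_neighbor_crawl(links, trust, d=0.85, epsilon=0.02):
    """Breadth-first crawl that stops once d**(L + 1) <= epsilon (sketch).

    links: dict mapping a page to the list of pages it links to.
    trust: dict mapping pages to trust scores; crawling starts from the
           pages whose score is positive.
    Returns the set of downloaded pages and the final depth L.
    """
    frontier = {p for p, t in trust.items() if t > 0}  # N[0]
    downloaded = set()
    level = 0
    while True:
        # "Download" every page at this level that is not crawled yet.
        downloaded |= frontier - downloaded
        # After level L the guaranteed RankMass is 1 - d**(L + 1).
        if d ** (level + 1) <= epsilon:
            break
        # N[L + 1]: all pages linked to by a page in N[L].
        frontier = {q for p in frontier for q in links.get(p, [])}
        level += 1
    return downloaded, level


# Hypothetical chain a -> b -> c -> d -> e with a as the only trusted page.
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"], "e": []}
pages, depth = l_neighbor_crawl(chain, {"a": 1.0}, d=0.5, epsilon=0.2)
```

In the chain example, $0.5^3 = 0.125$ is the first power not exceeding 0.2, so the crawl stops at depth 2 after downloading a, b, and c.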
18
Crawling Efficiency
For a given $|D_C|$, pick $D_C$ such that $RM(D_C)$ is maximized
Q: Can we use L-Neighbor?
A:
– L-Neighbor is simple, but we need to further prioritize certain pages over others
– Page-level prioritization
19
Page Level Prioritization
Q: Which page should we download first to maximize RankMass?
A: Pages with high PageRank
Q: How do we know which pages have high PageRank?
The idea:
– Calculate a PageRank lower bound for undownloaded pages
– Give high priority to pages with a high lower bound
20
Calculating PageRank Lower Bound
$PR(p)$: probability that the random surfer is at $p$
Break down each path by “interrupts”, i.e., jumps to a trusted page
Sum up all paths that start with an interrupt, jump to a trusted page, and end at $p$
[Figure: paths from an interrupt through trusted page $p_j$ and pages $p_1$ to $p_5$ to $p_i$; edge labels $(1-d)$, $t_j$, $d \cdot 1/3$, $d \cdot 1/4$, $d \cdot 1/5$]
21
Calculating PageRank Lower Bound
Q: What if we sum up the probabilities of only a subset of the paths to $p$?
A: We get a “lower bound” on the PageRank of $p$
Basic idea
– Start with the set of trusted pages $G$
– Enumerate paths to a page $p$ as we discover links
– Sum up the probability of each discovered path to $p$
Not every path is needed, only the ones that we have discovered so far
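Using the edge labels from the path figure on the previous slide, the probability of a single path and the resulting lower bound can be written out as follows. The path notation ($w$, $q_0, \dots, q_m$, and the sets $W(p)$ and $W_{\text{disc}}(p)$) is introduced here for illustration and does not appear on the slides.

```latex
% A path w: interrupt -> q_0 -> q_1 -> ... -> q_m = p,
% where q_0 is a trusted page and c_q is the number of links in page q.
\Pr[w] = (1-d)\; t_{q_0} \prod_{k=0}^{m-1} \frac{d}{c_{q_k}}

% W(p): all such paths ending at p; W_disc(p): the ones discovered so far.
% Every term is non-negative, so a partial sum can only underestimate:
\sum_{w \in W_{\text{disc}}(p)} \Pr[w]
  \;\le\; \sum_{w \in W(p)} \Pr[w] \;=\; PR(p)
```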
22
RankMass Crawler: High Level
Dynamically update the lower bound on PageRank
– By enumerating paths to pages
Download the page with the highest lower bound
– Sum of downloaded lower bounds = RankMass coverage
23
RankMass Crawler
CRM = 0   // CRM: crawled RankMass
rm_i = (1 − d) t_i for each t_i > 0   // rm_i: RankMass (PageRank lower bound) of p_i
While (CRM < 1 − ε):
– Pick p_i with the largest rm_i
– Download p_i if not downloaded yet
– CRM = CRM + rm_i   // we have downloaded p_i
– For each p_j linked to by p_i:
    rm_j = rm_j + (d / c_i) rm_i   // Update RankMass based on the discovered links from p_i
– rm_i = 0
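The pseudocode above can be sketched as a simulation over an in-memory link graph. This is an illustration under stated assumptions rather than the authors' implementation: the function name, the graph format, and the default parameters are hypothetical, a linear scan stands in for a priority queue, and the loop also stops when no page has any mass left (which can happen if some pages have no out-links).

```python
def rankmass_crawl(links, trust, d=0.85, epsilon=0.02):
    """Greedy crawl ordered by the PageRank lower bound rm (sketch).

    links: dict mapping a page to the list of pages it links to.
    trust: dict mapping trusted pages to trust scores that sum to 1.
    Returns the download order and the crawled RankMass (CRM).
    """
    # rm[p]: lower-bound mass not yet counted for p (only positive entries).
    rm = {p: (1 - d) * t for p, t in trust.items() if t > 0}
    crm = 0.0
    order = []          # pages in the order they were first downloaded
    seen = set()
    while crm < 1 - epsilon and rm:
        page = max(rm, key=rm.get)      # page with the largest rm
        mass = rm.pop(page)             # equivalent to setting rm_i = 0
        if page not in seen:            # download only on the first visit
            seen.add(page)
            order.append(page)
        crm += mass
        outs = links.get(page, [])
        for target in outs:
            # Pass d / c_i of the mass along each discovered link.
            rm[target] = rm.get(target, 0.0) + d * mass / len(outs)
    return order, crm


# Hypothetical two-page Web: a and b link to each other, only a is trusted.
order, crm = rankmass_crawl({"a": ["b"], "b": ["a"]}, {"a": 1.0},
                            d=0.5, epsilon=0.2)
```

In this example the loop picks a (mass 0.5), then b (0.25), then a again (0.125), so CRM reaches 0.875 after two actual downloads. For a large graph, a heap keyed on `rm` would avoid the linear scan in `max`.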
24
Experimental Setup
HTML files only
Algorithms simulated over a web graph
Crawled between Dec. 2003 and Jan. 2004
141 million URLs spanning 6.9 million host names
233 top-level domains
25
Metrics of Evaluation
1. How much RankMass is actually collected during the crawl
2. How much RankMass is “known” to have been collected during the crawl
26
L-Neighbor
[Figure: experimental results for the L-Neighbor crawler]
27
RankMass
[Figure: experimental results for the RankMass crawler]
28
Algorithm Efficiency
Downloads required to reach a RankMass above 0.98:
Algorithm     Guaranteed RankMass    Actual RankMass
L-Neighbor    7 million              65,000
RankMass      131,072                27,939
Optimal       -                      27,101
29
Summary
– Web crawler and its challenges
– Page selection problem
– PageRank
– RankMass guarantee
– Computing PageRank lower bound
– RankMass crawling algorithm
Any questions?