7CCSMWAL Algorithmic Issues in the WWW


1 7CCSMWAL Algorithmic Issues in the WWW
Lecture 5

2 Web Crawling
Download web pages to create a repository
The basic crawling algorithm is like a graph traversal algorithm (e.g., BFS or DFS), but it differs in that it:
Starts with an initial set of URLs (not one single URL)
Adds sites to Google's list of to-be-crawled URLs
Terminates before the entire Web graph can be traversed

3 Basic Crawler Architecture

4 Basic Crawler Architecture
DNS = Domain Name Server: translates a textual website name into an IP address
Parse: analyzes the content of the downloaded pages and extracts the hyperlinks (URLs) in them
Doc FP: stores fingerprints of the downloaded pages, used to check for duplicated pages, maybe from mirror sites
Robots templates: decide whether the extracted URLs should be excluded from crawling, e.g., to crawl only some domains, say .uk
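For illustration, a minimal Python sketch of two of these components: a document fingerprint for detecting duplicate pages and a URL filter restricted to one domain. The function names, the .uk suffix check and the example URLs are illustrative assumptions, not part of the original architecture.

```python
import hashlib
from urllib.parse import urlparse

def doc_fingerprint(content: bytes) -> str:
    """Doc FP: fingerprint of the page content, used to detect duplicate
    pages, e.g., the same page served by a mirror site."""
    return hashlib.sha1(content).hexdigest()

def url_allowed(url: str, allowed_suffix: str = ".uk") -> bool:
    """URL filter: keep only URLs whose host falls in the chosen domain."""
    host = urlparse(url).netloc.lower()
    return host.endswith(allowed_suffix)

print(url_allowed("http://www.example.co.uk/index.html"))  # True
print(url_allowed("http://www.example.com/index.html"))    # False
```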

5 Issues Arising in Web Crawling
Page Selection: What pages should the crawler download?
Page Refreshment: How should the crawler refresh pages?
Efficient Resource Use: How should the load on the visited Web sites be minimized?
Coordination: How should the crawling process be parallelised?

6 Page Selection
Web crawlers are not able to download all of the Web because:
Network bandwidth and disk space of the Web crawler are neither infinite nor free
Pages change over time, so the Web crawler's copy of the Web quickly becomes obsolete
The amount of information on the Web is finite, but the number of pages is infinite (dynamic pages, e.g., [my diary August 11th 3005])

7 Page Selection
What pages should the crawler download?
It is important to select pages carefully and to visit "important" pages first, by prioritizing the URLs in the crawl queue properly
The pagerank score is used as a metric of the importance of webpages. The problem is:
It is used as an evaluation measure after the crawling takes place
No full information is available to compute the "real" pagerank during crawling

8 Basic Crawling Algorithm
Start off with an initial set of URLs S0
First place S0 in the URL frontier, where all URLs to be retrieved are kept and prioritized
Get the URL with the highest priority from the frontier, download the page, extract any URLs (hyperlinks) on the downloaded page, and put the new URLs in the frontier
Repeat the process until the crawler decides to stop
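A minimal sketch of this loop in Python, assuming hypothetical fetch_page, extract_urls and priority helpers that stand in for the downloader, the parser and whichever page-ordering strategy is used:

```python
import heapq

def crawl(seed_urls, priority, fetch_page, extract_urls, max_pages=1000):
    """Basic crawling loop: repeatedly take the highest-priority URL
    from the frontier, download it, and add any new URLs it links to."""
    # heapq is a min-heap, so priorities are negated to pop the max first
    frontier = [(-priority(u), u) for u in seed_urls]
    heapq.heapify(frontier)
    seen = set(seed_urls)
    downloaded = 0

    while frontier and downloaded < max_pages:
        _, url = heapq.heappop(frontier)      # URL with the highest priority
        page = fetch_page(url)                # download the page
        downloaded += 1
        for link in extract_urls(page):       # parse out the hyperlinks
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-priority(link), link))
```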

9 URL Frontier Every URL in the frontier is associated with a priority
The frontier is implemented as a heap data structure, which supports the following operations efficiently (in O(log n) time):
Insert a new URL
Extract a URL with the maximum priority
The heap is used to implement a priority queue
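A sketch of such a frontier using Python's heapq module; heapq is a min-heap, so priorities are negated to make extraction return the maximum-priority URL. Both operations run in O(log n). The class and example URLs are illustrative.

```python
import heapq

class URLFrontier:
    """Priority queue of (priority, URL) pairs backed by a binary heap."""

    def __init__(self):
        self._heap = []                                    # stores (-priority, url)

    def insert(self, url, priority):
        heapq.heappush(self._heap, (-priority, url))       # O(log n)

    def extract_max(self):
        neg_priority, url = heapq.heappop(self._heap)      # O(log n)
        return url, -neg_priority

frontier = URLFrontier()
frontier.insert("http://example.org/a", 54)
frontier.insert("http://example.org/b", 25)
print(frontier.extract_max())   # ('http://example.org/a', 54)
```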

10 Heap A binary tree of priority values
Maintain the property that each parent node has a priority value larger than its two children
Example heap, in level order: 54, 25, 34, 19, 8, 29, 14, 10, 5

11 Insert a New Node Insert a new node in the lowest level (compare from root) Swap the priority value with the parent node if necessary, to maintain the heap property. Repeatedly do so along the path to the root. E.g., add a new node with priority 47 (start at root = 54) 54 25 34 19 8 29 14 10 5 47

12 Insert a New Node
47 is placed in the next free position, as a child of the node 8: since 47 > 8, they swap, giving (in level order) 54, 25, 34, 19, 47, 29, 14, 10, 5, 8
Since 47 > 25 (its new parent), they swap, giving 54, 47, 34, 19, 25, 29, 14, 10, 5, 8
Since 47 < 54 (the root), there is no swap; the heap property is restored
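The same insertion expressed as a sketch over the level-order array representation (children of index i sit at 2i+1 and 2i+2); it reproduces the swaps described above for the value 47:

```python
def heap_insert(heap, value):
    """Append value at the lowest level, then sift it up by repeatedly
    swapping with its parent while it is larger than the parent."""
    heap.append(value)
    i = len(heap) - 1
    while i > 0:
        parent = (i - 1) // 2
        if heap[i] <= heap[parent]:
            break                                          # heap property holds
        heap[i], heap[parent] = heap[parent], heap[i]      # swap up
        i = parent

heap = [54, 25, 34, 19, 8, 29, 14, 10, 5]   # the slide's heap in level order
heap_insert(heap, 47)
print(heap)   # [54, 47, 34, 19, 25, 29, 14, 10, 5, 8]
```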

13 Extract-Max
Remove the root label
Substitute the root with the right-most leaf on the lowest level
Restore the heap property from the root downwards, if necessary
Example: extract the maximum (54) from the heap, in level order, 54, 47, 34, 19, 25, 29, 14, 10, 5, 8

14 Extract-Max
Swap the parent with the child that has the larger priority value, if necessary, to restore the heap property
The right-most leaf 8 replaces the root: 8, 47, 34, 19, 25, 29, 14, 10, 5 (level order)
Since 8 < 47 (its larger child), they swap: 47, 8, 34, 19, 25, 29, 14, 10, 5
Since 8 < 25 (its larger child), they swap again

15 Extract-Max
Final result, in level order: 47, 25, 34, 19, 8, 29, 14, 10, 5
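A matching sketch of extract-max over the level-order array: move the last leaf to the root and sift it down by swapping with the larger child. It reproduces the result above.

```python
def heap_extract_max(heap):
    """Remove and return the root; the right-most leaf replaces the root
    and is sifted down, always swapping with the larger child."""
    max_value = heap[0]
    heap[0] = heap[-1]                 # right-most leaf replaces the root
    heap.pop()
    i, n = 0, len(heap)
    while True:
        left, right = 2 * i + 1, 2 * i + 2
        largest = i
        if left < n and heap[left] > heap[largest]:
            largest = left
        if right < n and heap[right] > heap[largest]:
            largest = right
        if largest == i:
            break                      # heap property restored
        heap[i], heap[largest] = heap[largest], heap[i]
        i = largest
    return max_value

heap = [54, 47, 34, 19, 25, 29, 14, 10, 5, 8]
print(heap_extract_max(heap))          # 54
print(heap)                            # [47, 25, 34, 19, 8, 29, 14, 10, 5]
```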

16 Page Ordering
Define the page crawl priority, and hence determine the order in which the crawler downloads pages
We must work with partial information, because the "complete" graph is not known
We need a measure of vertex importance that can be computed using the part of the network seen so far
If we can choose only the important pages, we can ignore the others; this saves time and space
Useful for the construction of the initial network in HITS

17 How to prioritize? Different strategies
BFS (oldest first)
The longer a URL has been in the frontier, the higher its priority
Backlink-count (in-degree)
The more links pointing to the URL from pages already downloaded, the higher the priority; this is a simplified authority measure (see the sketch below)
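A minimal sketch of the backlink-count strategy, assuming a hypothetical fetch_and_parse helper that returns the out-links of a downloaded page; ties are broken by the smallest label, as in the worked example a few slides below.

```python
from collections import defaultdict

def backlink_count_crawl(seeds, fetch_and_parse, max_pages=100):
    """Crawl using the backlink-count (in-degree) priority: pick the
    frontier URL linked to by the most already-downloaded pages,
    breaking ties by the smallest URL/label."""
    backlinks = defaultdict(int)       # URL -> links seen from crawled pages
    frontier = set(seeds)
    crawled = []

    while frontier and len(crawled) < max_pages:
        url = min(frontier, key=lambda u: (-backlinks[u], u))
        frontier.remove(url)
        crawled.append(url)
        for link in fetch_and_parse(url):       # out-links of the page
            backlinks[link] += 1                # one more known backlink
            if link not in crawled and link not in frontier:
                frontier.add(link)
    return crawled
```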

18 Page Ordering
Different strategies (cont)
Batch-pagerank
Compute the pagerank based on the pages already downloaded; the higher the pagerank, the higher the priority
Re-compute the pagerank after every K pages are downloaded, where K is a parameter (used to refresh the pagerank values)
Partial-pagerank
Similar to batch-pagerank
Between pagerank re-calculations, a temporary pagerank is assigned to each new page, using the sum of the pageranks of the pages pointing to it divided by the number of out-links of those pages

19 Page Ordering
Different strategies (cont)
OPIC (On-line Page Importance Computation): a weighted backlink-count
All pages start with the same amount of "cash". Every time a page is crawled, its "cash" is split among the pages it links to
The more "cash" a page has received, the higher its priority
More time-efficient than a pagerank computation (a sketch follows this slide)
Larger-sites-first
Use the number of un-crawled pages found so far at a given site as the priority for picking a Web site
Start with the sites that have the largest number of pending pages
Pick the page within the chosen site in BFS order
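A minimal OPIC-style sketch, with the link structure given as a hypothetical adjacency dictionary and ties broken by the smallest label, as in the worked example on the next slides:

```python
def opic_crawl(graph, start):
    """OPIC sketch: every page starts with one unit of "cash"; crawling a
    page splits its cash equally among the pages it links to, and the
    uncrawled page holding the most cash is crawled next."""
    nodes = set(graph) | {v for targets in graph.values() for v in targets}
    cash = {v: 1.0 for v in nodes}
    crawled = []
    current = start

    while True:
        crawled.append(current)
        out_links = graph.get(current, [])
        if out_links:
            share = cash[current] / len(out_links)
            for v in out_links:
                cash[v] += share       # already-crawled pages still receive cash
        cash[current] = 0.0            # its cash has been handed out
        pending = [v for v in nodes if v not in crawled]
        if not pending:
            return crawled
        # most cash first; ties broken by the smallest label
        current = min(pending, key=lambda v: (-cash[v], v))

# Tiny made-up graph just to show the call; the vertex names are arbitrary
print(opic_crawl({1: [2, 3], 2: [3], 3: [1]}, start=1))   # [1, 2, 3]
```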

20 Back-Link
If there is a choice at any step, choose the vertex with the lowest label
Suppose the crawling starts at vertex 1. Add vertices 3 and 5; back-link list: 3(1), 5(1)
Vertex 3 is crawled. Add vertices 2 and 7; back-link list: 2(1), 5(1), 7(1)
Vertex 2 is crawled. Add vertices 4 and 6; back-link list: 4(1), 5(1), 6(1), 7(1)
Vertex 4 is crawled. No new vertex added; back-link list: 5(2), 6(1), 7(1)
Vertex 5 is crawled. Add vertex 9; back-link list: 6(2), 7(1), 9(1)
Vertex 6 is crawled. Add vertex 8; back-link list: 7(1), 8(1), 9(1)
Followed by 7, 8, 9 in that order
Back-Link crawl order: [1, 3, 2, 4, 5, 6, 7, 8, 9]
(Figure: example graph with vertices 1-9 spread over Site 1, Site 2 and Site 3)

21 Example (OPIC)
Let C[i] denote the cash at vertex i; initially C[i] = 1 for all i
Suppose the crawling starts at vertex 1:
C[3] += 1/2, so C[3] = 1.5; C[5] += 1/2, so C[5] = 1.5
Vertex 3 is crawled with C[3] = 1.5 (vertices 3 and 5 have the same amount of cash):
C[2] += 1.5/2, so C[2] = 1.75; C[7] += 1.5/2, so C[7] = 1.75
Vertex 2 is crawled with C[2] = 1.75 (vertices 2 and 7 have the same amount of cash). Vertex 1 has already been crawled, but still gets 1/3 of the cash:
C[4] += 1.75/3, so C[4] = 1.58 (19/12); C[6] += 1.75/3, so C[6] = 1.58
Vertex 7 is crawled with C[7] = 1.75:
C[5] += 1.75/1, so C[5] = 3.25
Vertex 5 is crawled with C[5] = 3.25, and so on
OPIC crawl order: [1, 3, 2, 7, 5, 6, 8, 4, 9]
(Figure: the same example graph with vertices 1-9)

22 Example (Largest-Site-First)
Suppose the crawling starts at vertex 1. The frontier has vertices 3 and 5, which belong to two different sites [1,1,0] (3,5,*)
Suppose vertex 3 is crawled next. The frontier has vertices 2, 5 and 7; Site 2 has the largest number of vertices in the frontier, with vertex 5 the "oldest" [1,2,0] (2,5-7,*)
Vertex 5 is crawled (1st in BFS order). The frontier has vertices 2, 6, 7 and 9; Site 2 has the largest number of vertices in the frontier, with vertex 7 the "oldest" [1,2,1] (2,7-9,6)
Vertex 7 is crawled (1st in BFS order). The frontier has vertices 2, 6 and 9, which belong to three different sites [1,1,1] (2,9,6)
Suppose vertex 2 is crawled (1st in BFS order). The frontier has vertices 4, 6 and 9; Site 3 has the largest number of vertices in the frontier, with vertex 6 the "oldest" (*,9,6-4)
Vertex 6 is crawled, and so on
Largest-Site-First crawl order: [1, 3, 5, 7, 2, 6, 4, 9, 8]
(Figure: the same example graph with vertices 1-9 spread over Site 1, Site 2 and Site 3)

23 Exercise
Use Back-Link and OPIC for the directed graph with edges (1,2), (1,3), (1,4), (2,3), (2,4), (2,5), (3,6), (4,6), (6,5)
Use Largest-Site-First assuming Site 1 = {1,4,5} and Site 2 = {2,3,6}

24 Case Study
Baeza-Yates et al. (2005) studied the above crawling algorithms on pages under the .cl (Chile) and .gr (Greece) top-level domains, crawled between April and September 2004
"Complete" crawls of each domain
Included both static and dynamic pages
Stopped at depth 5 for dynamic pages and depth 15 for static pages
Limited to downloading at most 25,000 pages from each website

25 Omniscient ordering
Omniscient (Oracle): for evaluation purposes only
We suppose we can query an "oracle" which knows the complete Web graph and has calculated the actual pagerank of each page
The higher the pagerank of a page, the higher its priority
Same restriction as the other strategies: a page can only be downloaded if a page that points to it has already been downloaded
In other words: prioritize URLs based on the true pagerank, and see what happens

26 Performance Metrics
Cumulative pagerank order
The sum of the "real" pagerank of the pages downloaded so far (by whatever algorithm)
The cumulative pagerank is plotted at different points of the crawling process
The best algorithm gets the most pagerank as early as possible
Nothing can beat the ORACLE (the real answer)

27 Experiment (Chile, April 2004)
(Plot: fraction of total pagerank against fraction of total webpages downloaded)

28 Experiment (Chile, May 2004)
(Plot: fraction of total pagerank against fraction of total webpages downloaded)

29 Experiment (Greece, May 2004)
(Plot: fraction of total pagerank against fraction of total webpages downloaded)

30 Experiment (Greece, September 2004)
(Plot: fraction of total pagerank against fraction of total webpages downloaded)

31 Some Observations
Omniscient has the best performance, as expected
Backlink-count and partial-pagerank are the worst strategies in the experiments, according to cumulative pagerank
The performance of the other four strategies is close
These strategies can retrieve about half of the pagerank value of their domains while downloading only around 20-30% of the pages

32 Page Refreshment
Webpages are constantly being updated
The crawler has to revisit the downloaded pages in order to detect changes and refresh the downloaded collection
How should the crawler refresh pages?
Decide which pages to revisit and which pages to skip
Decide how often a page should be revisited

33 End of Crawling Strategies
Start of Page Refreshment. Not covered in the lecture (background reading only)
Continue with Lecture 6
Note: how up-to-date a page is, is an important issue for media-based webpages

34 End of taught material For interest only

35 Page Refreshment
Intuitively, we want to maintain a collection of pages that is as up-to-date / fresh as possible
E.g., consider two collections, A and B, containing the same 20 web pages
If A maintains 10 pages up-to-date on average, and B maintains 15 up-to-date pages, we consider B to be "fresher" than A
Even if all pages are obsolete, we consider collection A "more current" than B if A was refreshed 1 day ago and B was refreshed 1 year ago
We need a definition of freshness and age

36 Freshness
Let S = {e1, ..., eN} be the local copies of N pages
The freshness of a local page ei at time t is
F(ei; t) = 1 if ei is up-to-date at time t
F(ei; t) = 0 otherwise
The freshness of S at time t is
F(S; t) = (1/N) Σ_{i=1..N} F(ei; t)
If only M (< N) pages are up-to-date at a specific time t, the freshness of S at time t is F(S; t) = M / N, i.e., the freshness is the fraction of S that is up-to-date

37 Age
To capture "how old" the local copies are
Let tm(ei) be the time of the first modification of ei after the most recent download of ei
E.g., if ei is downloaded at time 10 and ei was modified at times 5, 17, 20, ..., then tm(ei) = 17
The age of ei at time t is
A(ei; t) = 0 if ei is up-to-date at time t
A(ei; t) = t - tm(ei) otherwise
The age of S at time t is
A(S; t) = (1/N) Σ_{i=1..N} A(ei; t)
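A small sketch that computes F(S; t) and A(S; t) from hypothetical per-page bookkeeping (the last download time and the page's modification times); the example reproduces tm(ei) = 17 from the slide.

```python
def freshness_and_age(pages, t):
    """pages: list of (last_download_time, sorted_modification_times).
    A copy is up-to-date at time t if the page has not been modified
    between its last download and t."""
    N = len(pages)
    total_fresh, total_age = 0, 0.0
    for last_download, mods in pages:
        # first modification after the most recent download, if any
        pending = [m for m in mods if last_download < m <= t]
        if not pending:
            total_fresh += 1              # F(e_i; t) = 1, A(e_i; t) = 0
        else:
            total_age += t - pending[0]   # A(e_i; t) = t - t_m(e_i)
    return total_fresh / N, total_age / N

# Slide 37's example: e_i downloaded at time 10, modified at 5, 17, 20
print(freshness_and_age([(10, [5, 17, 20])], t=25))   # (0.0, 8.0)
```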

38 Evolution of Freshness and Age
The change in the freshness and age of a page ei over time, as ei is updated and refreshed
(Figure: F(ei; t) and A(ei; t) plotted against time t, marking the points where the page is changed and where it is refreshed)

39 Freshness over a Period of Time
To compare different refreshment policies we use:
The time average of the freshness of a page ei
The time average of the freshness of S, where F(S) is the average of F(ei) over all i
The time average of the age can be defined similarly
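The formulas themselves were shown as images in the original slides; a hedged reconstruction, following the standard definitions of Cho and Garcia-Molina, is:

\[
\bar{F}(e_i) = \lim_{t\to\infty} \frac{1}{t}\int_{0}^{t} F(e_i; s)\,ds,
\qquad
\bar{F}(S) = \lim_{t\to\infty} \frac{1}{t}\int_{0}^{t} F(S; s)\,ds
           = \frac{1}{N}\sum_{i=1}^{N} \bar{F}(e_i).
\]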

40 Refreshment Policies
Refreshment frequency
How frequently pages should be refreshed
How many pages can be refreshed in a given time period
Resource allocation
How frequently each individual page is refreshed
Uniform allocation: all pages are refreshed at the same rate
Non-uniform allocation: pages are refreshed at different rates. E.g., proportional allocation refreshes page ei with a frequency fi proportional to its change frequency λi

41 Example
Consider only three pages e1, e2 and e3, which change at the rates λ1 = 4, λ2 = 3 and λ3 = 2 (times/day)
Suppose the crawler can download 9 pages/day
Examples of refreshment policies:
Refresh all pages uniformly at the same rate, i.e., refresh each of e1, e2, e3 at the rate of 3 (times/day)
Refresh pages proportionally more often when they change more often, e.g., f1 = 4, f2 = 3 and f3 = 2 (times/day) for e1, e2 and e3, respectively

42 Refreshment Policies (cont)
Refreshment order
The order in which the pages are refreshed
E.g., fixed order: all pages are refreshed in the same order repeatedly
Refreshment points
It may be necessary to refresh pages only in a limited time window
E.g., if a website is heavily accessed during the daytime, it might be desirable to crawl the site only at night, when it is less frequently visited

43 Resource Allocation
Assume pages change at different rates
Page ei changes at the rate λi (times/day)
Suppose the crawler can download n*f pages per day, where n is the number of pages in S, so on average every page can be refreshed f times per day
Uniform allocation
Refresh every page f times per day
Proportional allocation
Refresh page ei at the rate fi (times/day), where Σ_{i=1..N} fi = N*f and λ1/f1 = λ2/f2 = ... = λN/fN
In terms of freshness and age, it can be proved that uniform allocation is better than the proportional policy under any distribution of the λi values

44 Simple Example
S = {e1, e2}
e1 changes 9 times/day. Assume one day is split into 9 intervals, and e1 changes once and only once in each interval
e2 changes 1 time/day
The crawler can only download 1 page per day
To maximize the freshness averaged over time, which page should it refresh?
(Figure: timelines of e1 and e2, with a marker at each page modification time)

45 Refresh e2
Suppose e2 is refreshed at mid-day
If the element e2 changes in the morning, it will remain up-to-date for the afternoon, i.e., we get 1/2 day "benefit"
The probability that e2 changes in the morning is 1/2, so the "expected benefit" of refreshing e2 is 1/2 x 1/2 = 1/4 day

46 Refresh e1
By the same reasoning, if we refresh e1 in the middle of an interval, e1 will remain up-to-date for the remaining half of the interval (1/18 day) with probability 1/2
Therefore, the expected benefit is 1/2 x 1/18 = 1/36 day
The expected benefit can be seen as an estimate of average freshness, so it is more effective to refresh e2
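The same reasoning in one formula, generalized (as an assumption consistent with the two cases above) to a page that changes λ times per day, i.e., once per interval of length 1/λ, refreshed in the middle of an interval:

\[
\text{expected benefit} = \underbrace{\tfrac{1}{2}}_{\Pr[\text{change in first half}]} \times \underbrace{\tfrac{1}{2\lambda}}_{\text{remaining half-interval}} = \frac{1}{4\lambda}\ \text{days}
\quad\Longrightarrow\quad \tfrac{1}{4}\ \text{for } e_2\ (\lambda=1),\qquad \tfrac{1}{36}\ \text{for } e_1\ (\lambda=9).
\]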

47 When More Pages Can Be Downloaded Per Day
f1 = refreshment rate for e1, λ1 = 9
f2 = refreshment rate for e2, λ2 = 1

48 Optimal Refreshment Policy
Given λi, for i = 1 to N, the rate of change of ei, find the values of the refresh rates fi of ei that maximize the time-averaged freshness F(S), where Σ_{i=1..N} fi = N * f, the download rate of the crawler

49 Optimal Refreshment Policy
Cho and Garcia-Molina (2003) gave the optimal solution using the method of Lagrange multipliers (details omitted)
The optimal solution can always be represented as a graph like the one below
(Figure: optimal refresh rate fi plotted against rate of change λi)
To improve freshness, we should penalize the elements that change too often

50 Example
S = {e1, e2, e3, e4, e5}, with λ1 = 1, λ2 = 2, λ3 = 3, λ4 = 4, λ5 = 5 (times/day)
The crawler downloads 5 pages per day
The optimal solution:
Page  Refresh rate
e1    1.15
e2    1.36
e3    1.35
e4    1.14
e5

51 Efficient Resource Use
How should the load on the visited Web sites be minimized?
When the crawler collects pages from the WWW, it consumes resources belonging to other organizations
E.g., when the crawler downloads page p from a site, the site needs to retrieve p from its file system, consuming disk and CPU resources, and the page needs to be transferred through the network

52 Efficient Resource Use
Also called the politeness policy
Use of the robots.txt protocol: Web administrators indicate which parts of their Web servers should not be accessed by crawlers
A "Crawl-delay" parameter in the robots.txt file indicates the number of seconds to delay between crawler requests (see the sketch below)
Enforce an interval between connections by a crawler, e.g., 60 sec, but this may be too long for practical use; other experiments have implemented delays of 10 sec and 15 sec
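A minimal sketch of honouring robots.txt with Python's standard urllib.robotparser; the robots.txt content, the user-agent name and the URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: disallow /private/ and ask crawlers to wait 10 seconds
robots_txt = """
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyCrawler", "http://example.org/index.html"))   # True
print(rp.can_fetch("MyCrawler", "http://example.org/private/x"))    # False
print(rp.crawl_delay("MyCrawler"))                                  # 10
```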

53 Efficient Resource Use
Adaptive politeness policy
If it took t seconds to download a document from a given server, the crawler waits for 10t seconds before downloading the next page from that server
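A sketch of this adaptive policy, with fetch standing in for a hypothetical downloader; a real crawler would track the delay per server rather than sleeping its whole crawl loop.

```python
import time

def polite_fetch(url, fetch):
    """Download url, then wait 10x the time the download took before
    the next request to the same server (adaptive politeness)."""
    start = time.monotonic()
    page = fetch(url)                     # hypothetical downloader
    elapsed = time.monotonic() - start    # t seconds to download
    time.sleep(10 * elapsed)              # wait 10 * t seconds
    return page
```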

54 Coordination
How should the crawling process be distributed?
Due to the enormous size of the Web, crawlers often run on multiple machines and download pages in parallel
The distributed crawlers should be coordinated properly, so that different crawlers do not visit the same Web site multiple times

55 Coordination
Use distributed crawlers to maximize the download rate, while minimizing the overhead from parallelization and avoiding repeated downloads of the same page
To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes

56 Coordination
Policy to assign new URLs
Dynamic assignment
A central server assigns new URLs to the different crawler processes; the server could become a bottleneck
Static assignment
There is a fixed rule for assigning new URLs to crawler processes
E.g., a hashing function transforms URLs (or website names) into numbers, and different crawler processes are responsible for different numbers (as sketched below)
URLs are exchanged between crawler processes when there are links from one website to another website that has a different hash number
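A minimal sketch of static assignment by hashing the website name, so that every crawler process makes the same ownership decision without a central server; the number of processes and the example URLs are illustrative.

```python
import hashlib
from urllib.parse import urlparse

NUM_PROCESSES = 4   # illustrative number of crawler processes

def assigned_process(url):
    """Hash the website name to a process number in [0, NUM_PROCESSES)."""
    host = urlparse(url).netloc
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PROCESSES

# URLs on the same site always map to the same process; a crawler that
# discovers a URL owned by another process forwards (exchanges) it.
print(assigned_process("http://www.example.org/page1.html"))
print(assigned_process("http://www.example.org/page2.html"))   # same value
```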

57 Distributing the Basic Crawl Architecture
The crawling system is distributed over different "nodes" (machines or geographical locations)

