CS728: Internet Studies and Web Algorithms Lecture 2: Web Crawlers and DNS Algorithms April 3, 2008
2 Web Crawler basic algorithm 1.Maintain list of unvisited URLs 2.Remove a URL from the unvisited URL list 3.Use DNS lookup to determine the IP Address of its host name 4.Download the corresponding document 5.Parse doc, and extract any links contained in it. 6.Check if the URL is new, if yes add it to the list of unvisited URLs 7.Post process the downloaded document 8.Back to step 1
3 Single-threaded Crawler Initialize URL list with starting URLs Termination ? Pick URL from URL list Parse page Add URL to URL List [No more URL] [URL] [not done] [done] Crawling loop
4 Multithreaded Crawler Check for termination Add URL Get URL Thread Lock URL List Pick URL from List Unlock URL List Fetch Page Parse Page Check for termination Add URL Get URL Lock URL List Pick URL from List Unlock URL List Fetch Page Parse Page Thread end
5 Parallel Crawler Queues of URLs to visit Local Connect Collected Pages C - Proc Local Connect Internet
6 Modified Breadth-first search and URL frontier Crawlers maintain the URLs in the frontier and regurgitates them in some order whenever a crawler thread seeks a URL. Two important considerations govern the order: Quality: pages that change frequently should be prioritized Politeness: must avoid repeated fetch requests to a host within a short time span. A common heuristic is to insert a gap between successive fetch requests to a host that is an order of magnitude larger than the time taken for the most recent fetch from that host.
7 Robot Exclusion Protocol Contains part of the web site that a crawler should not visit. Placed at the root of a web site, robots.txt An example that tells all crawlers not to enter into three directories of a website: User-agent: * Disallow: /cgi-bin/ Disallow: /registration/ Disallow: /login/
8 Search Engine : architecture WWW Crawler(s) Page Repository Indexer Module Collection Analysis Module Query Engine Ranking Client Indexes : Text Structure Utility Queries Results
9 Search Engine : major components Crawlers Collects documents by recursively fetching links from a set of starting pages. Each crawler has different policies The pages indexed by various search engines are different The Indexer Processes pages, decide which of them to index, build various data structures representing the pages (inverted index,web graph, etc), different representation among search engines. Might also build additional structure ( LSI ) The Query Processor Processes user queries and returns matching answers in an order determined by a ranking algorithm.
10 Issues for crawlers 1.General software architecture 2.What pages should the crawler download ? 3.How should the crawler refresh pages ? 4.How should the load on the visited web sites be minimized? 5.How should the crawling process be parallelized? 6.What impact is performance of DNS system? Later, we will look at SE design and Web page analysis.
11 Web Crawlers and DNS Resolution DNS resolution is a well-known bottleneck in web crawling. Entails multiple requests and round-trips across the internet. Puts in jeopardy the goal of fetching several hundred documents a second. Caching: URLs for which we have recently performed DNS lookups are likely to be found in the DNS cache, avoiding the need to go to the DNS servers on the internet. Limited cache hit rate when obeying politeness constraints. Most web crawlers implement their own DNS resolver as a component of the crawler, since standard DNS methods are synchronous (only 1 request at a time). Basic Design: –A crawler thread sends a message to the DNS server and then performs a timed wait –A separate DNS-listen thread listens on the standard DNS port (port 53) for incoming response packets from the name service. Upon receiving a response, it signals the appropriate crawler thread. –A crawler thread that resumes operation because its wait time quantum has expired retries for a fixed number of attempts, sending out a new message to the DNS server and performing a timed wait each time. Various time-out strategies are possible: Mercator recommends five attempts. The time quantum increases exponentially with each of these attempts; Mercator started with one second and ended with roughly 90 seconds, in consideration of the fact that there are host names that take tens of seconds to resolve.
12 DNS Algorithms The objective of any DNS algorithm is two-fold –find quickest server (smallest latency) –find most accurate answers (highest reliability) BIND (Berkeley Internet Name Domain) is the most commonly used DNS software - originally created by four graduate students! BIND algorithms are designed to attempt to pick the fastest server, but not to use one server exclusively. Why? BIND uses a ranking score on servers that combines –average latency time for a query –a penalty on a server each time it is picked
13 BIND Algorithm Example, say there are two name servers for the domain uc.edu, and BIND may start by querying one, say , and penalizing it proportional to the time that it takes for the query. Thus, will eventually be queried and penalized in the same fashion. The process will continue by choosing server with the least total penalty. Questions: –How should we penalize? –How well do different strategies work? –Can we compare it to an optimal strategy?
14 Analysis of BIND Let us use the following notation: N = number of potential name servers T[i](t) = time taken server i to complete request at step t. S(t) = server chosen to query on step t The goal is to minimize the average query time AQT AQT = lim 1/t ∑ T[S(t)](t) as t goes to infinity The simplest BIND algorithm works as follows on each step: Let R[i](t) = total query time used for server i before step t 1. S(t) is picked to be the server i such that R[i](t) is a minimum (with arbitrarily broken ties). 2. If S(t) = i, then update R[i] (t+1) = R[i](t) + T[i](t) and all other R[i] are unchanged.
15 Simplifying assumption: the T[i](t) are constant over all time Clearly, in this case, the optimal algorithm always picks the server with the smallest T[i] value. Theorem: Assuming the T[i] are constant, the simple BIND algorithm has avg. query time AQT = N / ∑ 1/ T[ i ] Example: 3 servers: T[1]=1ms, T[2]=2ms, T[3]=3ms AQT= 3 / (1/1+1/2+1/3) = better than random?? Proof: Let n[i](t) be the number of times server i is picked before step t. Then, since the T[i] are constant, R[i](t) = T[i] n[i](t). Consider case 2 servers a and b. After t steps, server a will be chosen if R[a](t) is less than R[b](t), which implies T[a] n[a](t) < T[b] n[b](t). Analysis of BIND continued
16 In the limit these values will roughly equalize. So too does the ratios tend to equalize: n[a](t) / n[b](t) = T[b]/T[a] So server a is chosen a number of times inversely proportional to T[a]. This holds for all servers. Thus we have that the probability that any server a will be chosen is c/T[a] for some constant c. The “law of total probability” says that these must sum to 1, hence we get c = 1/ ∑1/ T[i]. Now to calculate AQT, server a will be picked with probability c/T[a] and take T[a] time to run. Thus the expected running time is ∑ (c/T[a]) T[a] = Nc = N / ∑ 1/T[i]
17 So AQT is N / ∑ 1/T[i] When is this good, when bad? Suppose N=2…. Let T[1]= 1 and let T[2]= x for some large number x. How bad can AQT be? AQT=2/(1+1/x) Can you think of a really bad case? A case which makes AQT N times as bad the optimal? Theorem: There is a case where the simple BIND algorithm is at least N(1-e) times as bad as the optimal algorithm. Proof: Set T[1] = 1, and T[i] = (N-1)/e for all i> 1. Then the average query time AQT is N/(1+e) as given by Theorem 1 above; while the optimal algorithm would always choose server 1 each time. Hence, performance time goes up as the number of servers does ! Is this unlikely in practice? Maybe not, If there is one nearby server and several far away.
18 BIND8 is the successor to simple version of BIND BIND8 accounts for accuracy of results in the scoring update. BIND8 uses running average of latencies for selected servers, and an exponential decay on unselected servers Uses three factors as follows: R[i](t+1) = a T[i](t) + (1-a) R[i](t) if i is selected and correct answer = b R[i](t) if i selected and incorrect answer = c R[i](t) if server i is not selected Typical values selected are a= 0.3, b =1.2 and c= 0.98 Surprisingly this enhanced algorithm actually has worse theoretical bound on performance.
19 Theorem: BIND8 is arbitrarily bad as compared with the optimal algorithm even with only 2 servers. Recall that Simple BIND gave a 2-appoximation on 2 servers. Proof. Let T[1]= 1 and let T[2]= x for some large number x. Then server 1 is selected and the get resulting sequence of scores on server 2: R[2] = x, cx, c 2 x, …, c n x which is >1 as long as n < log 1/c x. Hence average query time AQT = 1 + x/ log 1/c x. So for x large the AQT is arbitrarily bad.
20 Manipulating DNS to address Locality Problems in Applications Web service providers can manipulate the DNS system to better serve their clients. Problem: we wish for clients to choose “best” web server for themselves based on their locality Hence this requires changes to the usual DNS lookup. A domain name such as must not translate directly to an IP address such as , but rather must be aliased to an intermediate address. DNS allows for aliases called a “CNAME”, for example a CNAME for xyz.com could be a212.g.akamai.net The CNAME address is then be resolved by DNS to an IP address of an optimal server, or possibly another CNAME.