1
Measuring the Web
2
What?
- Use, size
  - Of entire Web, of sites (popularity), of pages
  - Growth thereof
- Technologies in use (servers, media types)
- Properties
  - Traffic (periodicity, self-similarity, timeouts, ...)
  - Page-change properties (frequency, amount, ...)
  - Links (self-similarity, ...)
3
Why?
- Improve Web technologies
- Improve sites
- Improve search
- Justify prices
- Science
4
How?
- Surveys
- Instrumentation
  - Proxy/router logs, server logs, ...
- Sampling and statistical inference
5
A few survey services
- Nielsen/NetRatings
- Pew Internet Project
- DLF/CLIR study
6
Some survey results
- NetRatings (Dec 2002)
  - 168M US "home" Internet users
  - Use the Web 7 hours/week to view 17 sites
- Pew Study (July 2002)
  - 111M US Internet users
  - 33M of them use a search engine once/day
7
Simple sampling
- Netcraft server survey
  - Generated by crawling and URL submission
  - 35M sites in 2002 (the Archive has 50M)
- OCLC host survey
  - Generate random IP addresses and look for hosts
  - 9,040,000 IP addresses with web servers in 2002; 8,712,000 unique Web sites
8
OCLC technique
- Generate a random 0.1% sample of the 2^32 IP address space
- Screen out "bad" ones
  - Private addresses, IANA reserved lists
- HTTP to port 80 of the remainder
- Multiply the number of responses by 1,000
- Use heuristics to eliminate "duplicates"
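A minimal Python sketch of this style of IP-space census, assuming the 0.1% sampling fraction and the connect-to-port-80 probe described above; the function names and structure are my own illustration, not OCLC's actual code:

```python
import ipaddress
import random
import socket

SAMPLE_FRACTION = 0.001                     # 0.1% of the 2^32 IPv4 space
SAMPLE_SIZE = int(SAMPLE_FRACTION * 2**32)  # roughly 4.3M probes

def is_usable(addr: ipaddress.IPv4Address) -> bool:
    """Screen out addresses that cannot host a public web server."""
    return not (addr.is_private or addr.is_reserved or
                addr.is_multicast or addr.is_loopback)

def has_web_server(addr: ipaddress.IPv4Address, timeout: float = 2.0) -> bool:
    """Probe port 80; any successful TCP connection counts as a web host."""
    try:
        with socket.create_connection((str(addr), 80), timeout=timeout):
            return True
    except OSError:
        return False

def estimate_web_server_count(sample_size: int = SAMPLE_SIZE) -> int:
    """Count responding hosts in a random sample and scale up by 1,000."""
    hits = 0
    for _ in range(sample_size):
        addr = ipaddress.IPv4Address(random.getrandbits(32))
        if is_usable(addr) and has_web_server(addr):
            hits += 1
    return int(hits / SAMPLE_FRACTION)
```

Duplicate elimination (the same site answering on several addresses) is the step this sketch omits; it is where the heuristics, and much of the uncertainty, live.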
9
IP sampling and virtual hosting
- Netcraft says 1/2 of domain names are virtually hosted on 100K IP numbers
- In 2000, OCLC said 3M IP addresses serving data, versus 3.4M IP addresses found by Netcraft
10
Interlude: "Size of the Web"
- Size in (virtual) hosts: probably 40-60M
  - Based on Netcraft, OCLC, and Archive data
- Size in pages: infinite
  - People are obsessed with providing page-count estimates, but this is a silly thing to do!
11
Heavy-tailed distributions
- Zipf, Pareto, power laws, lognormal
- Chic to find such things (Web, physics, bio)
  - ...and then postulate "generative models"
- Statistics are squirrelly
  - For example, averages can be misleading
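A quick self-contained illustration (my own, not from the slides) of why averages mislead for heavy-tailed data, using a Pareto sample in Python:

```python
import random

# Heavy-tailed toy data: Pareto with shape close to 1 (very heavy tail).
random.seed(0)
alpha = 1.1
sizes = [random.paretovariate(alpha) for _ in range(100_000)]

mean = sum(sizes) / len(sizes)
median = sorted(sizes)[len(sizes) // 2]
top_share = sum(sorted(sizes)[-100:]) / sum(sizes)

print(f"mean   = {mean:7.2f}")   # several times the median, unstable across runs
print(f"median = {median:7.2f}")
print(f"top 100 of 100,000 observations hold {top_share:.0%} of the total")
```

The sample mean is dragged up by a handful of enormous observations, so quoting an "average page size" or "average site size" hides most of the story.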
12
Heavy tails on the Web
- Host and page:
  - Links (in and out)
  - Sizes
  - Popularity
  - In the page case, both inter- and intra-site
- Page-size-to-popularity (Zipfian)
- Page and user reading times
13
Tripping on heavy tails
- How not to compute the size of the Web:
  - Use the OCLC approach to find random hosts
  - Crawl each of these to measure average size
  - Multiply average size by host count
- Problem: the heavy-tailed distribution of host size means that the host sample is biased towards smaller hosts
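A toy simulation (my own illustration, with invented Pareto-distributed host sizes) of how this recipe goes wrong:

```python
import random

random.seed(1)
NUM_HOSTS = 1_000_000
# Invented heavy-tailed host sizes; a few giant hosts hold much of the total.
host_sizes = [random.paretovariate(1.1) for _ in range(NUM_HOSTS)]
true_total = sum(host_sizes)

for trial in range(5):
    sample = random.sample(host_sizes, 1_000)        # "random hosts"
    estimate = (sum(sample) / len(sample)) * NUM_HOSTS
    print(f"trial {trial}: estimate / true total = {estimate / true_total:.2f}")
```

Most trials come in well under the true total because the sample almost never contains one of the rare gigantic hosts; when it does, the estimate overshoots instead. Either way, "average host size" is not a trustworthy multiplier.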
14
Advanced inference
- Determine the relative size of search engines A, B
- Pr[A&B | A] = |A&B| / |A|
- Pr[A&B | B] = |A&B| / |B|
- => |A| / |B| = Pr[A&B | B] / Pr[A&B | A]
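Worked example with hypothetical numbers: if 30% of URLs sampled from A are also found in B, and 60% of URLs sampled from B are also found in A, then |A| / |B| = 0.60 / 0.30 = 2, i.e., A is estimated to be twice the size of B.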
15
Advanced inference
- Sample URLs from A
  - Issue a random conjunctive query with < 200 results; select a random result
- Test if the URL is present in B
  - Query with its 8 rarest words and look for the result
- Assume Pr[A&B | A] ≈ fraction of URLs sampled from A that are also found in B
- URL sampling is biased towards long documents
- Also biased by ranking and the details of each engine
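A rough Python sketch of this sampling-and-checking estimator. The helpers query_engine(engine, terms) and rarest_words(url, k) are hypothetical placeholders (no real search-engine API is given here); they stand for "run a query and return matching URLs" and "pick the k rarest words of the document at url":

```python
import random

def sample_url_from(engine, lexicon, max_results=200):
    """Sample a URL from `engine` via a random conjunctive query."""
    while True:
        terms = random.sample(lexicon, 2)            # random two-word conjunction
        results = query_engine(engine, terms)        # hypothetical helper
        if 0 < len(results) < max_results:
            return random.choice(results)            # random hit from a small result set

def found_in(engine, url):
    """Check whether `engine` has indexed `url` via a strong (rare-word) query."""
    return url in query_engine(engine, rarest_words(url, k=8))

def relative_size(engine_a, engine_b, lexicon, n=1000):
    """Estimate |A| / |B| from the two overlap fractions."""
    a_in_b = sum(found_in(engine_b, sample_url_from(engine_a, lexicon))
                 for _ in range(n)) / n              # ~ Pr[A&B | A]
    b_in_a = sum(found_in(engine_a, sample_url_from(engine_b, lexicon))
                 for _ in range(n)) / n              # ~ Pr[A&B | B]
    return b_in_a / a_in_b
```

The biases listed on the slide live inside the placeholders: the query-based sampler favors long, well-indexed documents, and the rare-word check depends on each engine's ranking and coverage.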
16
Conclusions
- Measuring the Web is hard because it cannot be enumerated or even reliably sampled
- Statistical methods are impacted by biases that cannot be quantified
- Validation is not possible
- The problem is getting harder (e.g., link spam)
- Quantitative studies are fascinating and a good research problem