Measuring the Size of the Web
Dongwon Lee, Ph.D.
IST 501, Fall 2014, Penn State
Studying the Web
To study the characteristics of the Web:
- Statistics
- Topology
- Behavior
- ...
Why?
- Scientific curiosity
- Practical value, e.g., search engine coverage (Nature, 1999)
Web as Platform
The Web becomes a new computation platform, posing new challenges:
- Scale
- Efficiency
- Heterogeneity
- Impact on people's lives
E.g., How Big Is the Web?
- Q1: How many web sites?
- Q2: How many web pages?
- Q3: How many surface/deep web pages?
Research method: mostly the experimental method, used to validate novel solutions.
Q1: How Many Web Sites?
DNS registrars keep lists of domain names, but there are issues:
- Not every domain is a web site
- A domain can contain more than one web site
- Registrars are under no obligation to keep their records correct
- There are so many registrars ...
How Many Web Sites?
Brute force: poll every IP address.
- IPv4: 256 x 256 x 256 x 256 = 2^32, about 4 billion addresses (IPv6: 2^128)
- At 10 sec/IP with 1,000 simultaneous connections: 2^32 x 10 / (1000 x 24 x 60 x 60) =~ 500 days
- Not going to work!
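The slide's back-of-envelope arithmetic can be checked directly. This is a minimal sketch using only the slide's own assumptions (10 seconds per probe, 1,000 parallel connections):

```python
# Time to probe every IPv4 address, under the slide's assumptions:
# 10 seconds per IP, 1,000 simultaneous connections.
total_ips = 2 ** 32          # ~4.29 billion IPv4 addresses
seconds_per_ip = 10
parallel = 1000

days = total_ips * seconds_per_ip / (parallel * 24 * 60 * 60)
print(f"{days:.0f} days")    # roughly 500 days -- clearly infeasible
```

Even with these optimistic assumptions the scan takes well over a year, which is why the slide abandons brute force for sampling.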
How Many Web Sites?
2nd attempt: sampling.
- T: all 2^32 IPs
- S: sampled IPs
- V: IPs that give a valid reply
How Many Web Sites?
1. Select |S| random IPs
2. Send an HTTP request to port 80 at each selected IP
3. Count valid replies ("HTTP 200 OK"): |V|
4. |T| = 2^32, so estimate the number of sites as |T| x |V| / |S|
Q: What are the issues here?
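The four steps above can be sketched as follows. The probe is simulated here (a real run would open an HTTP connection to port 80 of each address); the 1-in-2,000 hit rate is a made-up toy figure, not a measurement:

```python
import random

# Sampling estimate of the number of web sites: probe |S| random IPs,
# count valid replies |V|, and scale up by |T| = 2^32 addresses.
def estimate_sites(sample_size, probe, total=2 ** 32):
    sample = (random.getrandbits(32) for _ in range(sample_size))
    valid = sum(1 for ip in sample if probe(ip))          # |V|
    return total * valid / sample_size                    # |T| * |V| / |S|

# Toy probe: pretend 1 in 2,000 addresses answers on port 80.
random.seed(0)
fake_probe = lambda ip: random.random() < 1 / 2000
est = estimate_sites(100_000, fake_probe)
print(f"~{est:,.0f} sites")
```

With a real probe, the same scaling gives the slide's estimator; the issues listed next (virtual hosting, non-standard ports) are exactly what this simple probe misses.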
Issues
- Virtual hosting (many sites per IP)
- Web servers on ports other than 80
- Temporarily unavailable sites
- ...
OCLC Survey (2002)
- OCLC (Online Computer Library Center) results: http://wcp.oclc.org/
- Still room for growth (at least for web sites)?
NetCraft Web Server Survey (2010)
- Goal is to measure web server market share
- Also records the # of sites their crawlers visited
- August 2010: 213,458,815 distinct sites
http://news.netcraft.com/archives/category/web-server-survey/
NetCraft Web Server Survey (2013)
- Goal is to measure web server market share
- Also records the # of sites their crawlers visited
- August 2013: 716,822,317 distinct sites
http://news.netcraft.com/archives/category/web-server-survey/
NetCraft Web Server Survey (2014)
- Goal is to measure web server market share
- Also records the # of sites their crawlers visited
- August 2014: 992,177,228 distinct sites
http://news.netcraft.com/archives/category/web-server-survey/
Q2: How Many Web Pages?
Sampling based? What is the issue here?
- T: all URLs
- S: sampled URLs
- V: URLs that give a valid reply
How Many Web Pages?
Method #1:
- For each site with a valid reply, download all pages
- Measure the average # of pages per site
- Estimate: (avg # of pages per site) x (total # of sites)
Result [Lawrence & Giles, 1999]: 289 pages per site, 2.8M sites, so 289 x 2.8M =~ 800M web pages
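The Lawrence & Giles back-of-envelope from the slide is a single multiplication:

```python
# Method #1 estimate: average pages per responding site, times the
# estimated total number of sites (figures from Lawrence & Giles, 1999).
pages_per_site = 289
num_sites = 2.8e6

total_pages = pages_per_site * num_sites
print(f"~{total_pages / 1e6:.0f}M pages")   # ~809M, i.e. roughly 800M
```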
Further Issues
- A small # of sites have TONS of pages; sampling could miss these sites
- The majority of sites (99.99%) have a small # of pages, so lots of samples are necessary
How Many Web Pages?
Method #2: random sampling. Assume:
- T: all pages
- B: base set
- S: random samples
Random Page?
Idea: random walk.
- Start from a portal home page (e.g., Yahoo)
- Estimate the size of the portal: B
- Follow random links, say 10,000 times, selecting pages along the way
- At the end, a set of random web pages S has been gathered
Straightforward Random Walk
Follow a random out-link at each step.
[Diagram: a 9-step walk over pages of google.com, amazon.com, and pike.psu.edu]
Issues?
1. Gets stuck in sinks and in dense Web communities
2. Biased towards popular pages
3. Converges slowly, if at all
Going to Converge?
A random walk on a regular, undirected graph yields a uniformly distributed sample.
Theorem [Markov chain folklore]: after O(log N / epsilon) steps, a random walk reaches the stationary distribution, where epsilon depends on the graph structure and N is the number of nodes.
Idea: transform the Web graph into a regular, undirected graph, then perform a random walk.
Problem: the Web is neither regular nor undirected.
Intuition
A random walk on the undirected (but not regular) Web graph has a high chance of being at a "popular" node at any particular time.
Increase the chance of being at an "unpopular" node by staying there longer, through self-loops.
WebWalker: Undirected Regular Random Walk on the Web
Fact: a random walk on a connected, undirected, regular graph converges to a uniform stationary distribution after a certain # of steps.
- Follow a random out-link or a random in-link at each step
- Use weighted self-loops to even out pages' degrees: w(v) = deg_max - deg(v)
[Diagram: toy graph over google.com, amazon.com, and pike.psu.edu with self-loop weights]
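The self-loop trick can be sketched on a toy undirected graph. The three-site graph below is a made-up miniature stand-in for the diagram, not the paper's data; each node v gets w(v) = deg_max - deg(v) implicit self-loops, so every node ends up with the same total degree and the walk's stationary distribution becomes uniform:

```python
import random

# Toy undirected graph (hypothetical links for illustration).
graph = {
    "google.com":   ["amazon.com", "pike.psu.edu"],
    "amazon.com":   ["google.com"],
    "pike.psu.edu": ["google.com"],
}
deg_max = max(len(nbrs) for nbrs in graph.values())

def step(v):
    # w(v) = deg_max - deg(v) self-loops: choose uniformly among
    # deg_max slots; slots beyond the real neighbors stay at v.
    choice = random.randrange(deg_max)
    return graph[v][choice] if choice < len(graph[v]) else v

random.seed(1)
counts = {v: 0 for v in graph}
v = "google.com"
for _ in range(200_000):
    v = step(v)
    counts[v] += 1
print(counts)   # each node is visited roughly 1/3 of the time
```

Without the self-loops, the walk would spend twice as much time at google.com (degree 2) as at the other nodes; with them, all three are sampled near-uniformly.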
Ideal Random Walk
Generate the regular, undirected graph:
- Make all edges undirected
- Decide d, the maximum # of edges per page: say, 300,000
- If deg(n) < 300,000, add self-loops to node n
Perform random walks on the graph: epsilon =~ 10^-5 for the 1996 Web, N =~ 10^9
WebWalker Results (2000)
Size of the Web:
- AltaVista: |B| = 250M
- |B ∩ S| / |S| = 35%
- Estimated |T| =~ 720M
- Avg page size: 12K
- Avg # of out-links: 10
Ziv Bar-Yossef, Alexander Berg, Steve Chien, Jittat Fakcharoenphol, and Dror Weitz, "Approximating Aggregate Queries about Web Pages via Random Walks," VLDB, 2000.
How Large Is a Search Engine's Index?
- Prepare a representative corpus (e.g., DMOZ)
- Draw a word W with a known frequency percentage F; e.g., "the" is present in 60% of all documents within the corpus
- Submit W to a search engine E
- If E reports X documents containing W, extrapolate the total size of E's index as =~ X / F
- Repeat multiple times and average
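The extrapolation step is one division. The hit count below is a hypothetical example value, not a real engine's answer; the 60% frequency is the slide's own figure for "the":

```python
# Word-frequency extrapolation: if a word appears in fraction F of a
# representative corpus and the engine reports X hits for it, the
# index holds roughly X / F documents.
def index_size(reported_hits, corpus_frequency):
    return reported_hits / corpus_frequency

# Hypothetical: the engine reports 27.6 billion hits for a word
# present in 60% of corpus documents.
size = index_size(27.6e9, 0.60)
print(f"~{size / 1e9:.0f}B pages")   # ~46B
```

Averaging over many words smooths out words whose corpus frequency does not match their frequency in the engine's index.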
http://www.worldwidewebsize.com/ (2010): ~28 billion pages
http://www.worldwidewebsize.com/ (2011): ~46 billion pages
http://www.worldwidewebsize.com/ (2013): ~46 billion pages
http://www.worldwidewebsize.com/ (2013): ~10 billion pages
Google Reveals Itself (2008)
- 1998: 26 million URLs
- 2000: 1 billion URLs
- 2008: 1 trillion URLs
Not all of them are indexed:
- Duplicates
- Auto-generated pages (e.g., calendars)
- Spam
Experts suspected (2010) that Google indexes at least 40 billion pages.
Deep Web (aka Hidden Web)
[Diagram: a query submitted through an HTML FORM interface returns answers from a back-end source]
Q3: Size of the Deep Web?
Deep Web: information reachable only through a query interface (e.g., an HTML FORM), often backed by a DBMS.
How to estimate? By sampling:
(Avg size of a record) x (Avg # of records per site) x (Total # of Deep Web sites)
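The estimate above is a product of three sampled averages. The sketch below uses the slide's 14KB record size and 200,000 sites; the records-per-site figure is a hypothetical placeholder chosen only to illustrate the multiplication:

```python
# Deep-web size estimate: product of three sampled averages.
def deep_web_bytes(avg_record_bytes, avg_records_per_site, num_sites):
    return avg_record_bytes * avg_records_per_site * num_sites

# 14KB/record and 200,000 sites are from the slides; 3.6M records
# per site is a made-up placeholder, not a BrightPlanet figure.
total = deep_web_bytes(14_000, 3_600_000, 200_000)
print(f"~{total:.1e} bytes")   # on the order of 10^16 bytes = 10 PB
```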
Size of the Deep Web?
- Total # of Deep Web sites: estimated by sampling, |B ∩ S| / |S|
- Avg size of a record: issue random queries and estimate the reply sizes
- Avg # of records per site: permute all possible queries for the FORM, issue them all, and count valid returns
Size of the Deep Web (2005)
A BrightPlanet report estimated:
- Avg size of a record: 14KB
- Avg # of records per site: 5MB
- Total # of Deep Web sites: 200,000
- Size of the Deep Web: ~10^16 bytes (10 petabytes), 1,000 times larger than the "Surface Web"
How to access it? Wrapper/mediator (aka web scraping).
http://brightplanet.com/the-deep-web/deep-web-faqs/ (now obsolete)