Web Characterization Week 11 LBSC 690 Information Technology
The Why of the Web (in 1995) Affordable storage –300,000 words/$ Adequate backbone capacity –25,000 simultaneous transfers Adequate “last mile” bandwidth –1 second/screen Display capability –10% of US population Effective search capabilities –Lycos, Yahoo
Defining the Web HTTP, HTML, or URL? Static, dynamic or streaming? Public, protected, or internal?
Number of Web Sites
Discussion Topic: What’s a Web “Site”? OCLC counted any server at port 80 –Misses many servers at other ports Some servers host unrelated content –Geocities Some content requires specialized servers –rtsp
Crawling the Web
Web Crawl Challenges Discovering “islands” and “peninsulas” Duplicate and near-duplicate content –30-40% of total content Server and network loads Dynamic content generation Link rot –Changes at 1% per week Temporary server interruptions
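A minimal sketch of why "islands" are hard to discover: a crawler that only follows links finds nothing that lacks a path from its seed pages. The toy link graph, the page names, and the reading of "island" as "no inbound path from the seeds" are illustrative assumptions, not part of the slide.

```python
from collections import deque


def crawl(link_graph, seeds):
    """Breadth-first crawl over an in-memory link graph.

    link_graph maps each existing page to the list of pages it links to.
    Returns the set of pages the crawler discovers starting from the seeds.
    """
    discovered = set()
    frontier = deque()
    for seed in seeds:
        if seed in link_graph and seed not in discovered:
            discovered.add(seed)
            frontier.append(seed)
    while frontier:
        page = frontier.popleft()
        for target in link_graph[page]:
            # Skip dead links (targets that no longer exist) and pages already seen.
            if target in link_graph and target not in discovered:
                discovered.add(target)
                frontier.append(target)
    return discovered


# Hypothetical toy web: keys are pages, values are the pages they link to.
TOY_WEB = {
    "home": ["news", "about"],
    "news": ["home", "archive"],
    "about": ["home", "gone"],   # "gone" is a dead link (link rot)
    "archive": [],               # reachable only through "news"
    "island-a": ["island-b"],    # the two islands link only to each other
    "island-b": ["island-a"],
}

found = crawl(TOY_WEB, ["home"])
missed = set(TOY_WEB) - found    # pages the crawl never reaches
```

Here the two island pages are never reached from "home", so a real crawler would need another discovery channel (for example, submitted URLs) to find them.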
Link Structure of the Web
Duplicate Detection Structural –Identical directory structure (e.g., mirrors, aliases) Syntactic –Identical bytes –Identical markup (HTML, XML, …) Semantic –Identical content –Similar content (e.g., with a different banner ad) –Related content (e.g., translated)
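A rough sketch of two of the levels above: syntactic duplicates (identical bytes) caught with a hash, and similar content (same page, different banner ad) caught with word-shingle overlap. Shingling with Jaccard similarity is one common technique swapped in here for illustration; the slide does not name a method. The regex tag stripper is deliberately crude, and the sample pages are hypothetical.

```python
import hashlib
import re


def byte_fingerprint(data: bytes) -> str:
    """Identical bytes give identical fingerprints (syntactic duplicates)."""
    return hashlib.sha256(data).hexdigest()


def strip_markup(html: str) -> str:
    """Crude tag removal, then whitespace and case normalization."""
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.split()).lower()


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences in the text."""
    words = text.split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def similarity(html_a: str, html_b: str, k: int = 3) -> float:
    """Shingle overlap of the visible text of two pages, from 0.0 to 1.0."""
    return jaccard(shingles(strip_markup(html_a), k),
                   shingles(strip_markup(html_b), k))


BODY = "<p>The quick brown fox jumps over the lazy dog near the river bank today</p>"
PAGE_A = "<html><body><div>Buy widgets now</div>" + BODY + "</body></html>"
PAGE_B = "<html><body><div>Visit our sponsor</div>" + BODY + "</body></html>"

same_bytes = byte_fingerprint(PAGE_A.encode()) == byte_fingerprint(PAGE_B.encode())
score = similarity(PAGE_A, PAGE_B)
```

The two sample pages differ in bytes, so the hash misses them, but most of their shingles are shared; any threshold for calling them near-duplicates is a tuning choice.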
Robots Exclusion Protocol Requires voluntary compliance by crawlers Exclusion by site –Create a robots.txt file at the server’s top level –Indicate which directories not to crawl Exclusion by document (in HTML head) –Not implemented by all crawlers
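A small sketch of both exclusion styles from the crawler's side, using Python's standard urllib.robotparser for site-level rules and html.parser for the per-document robots meta tag. The robots.txt contents, the "ExampleBot" user agent, and the example.com URLs are made up for illustration; compliance remains voluntary, so this only shows what a well-behaved crawler would check.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt placed at the server's top level.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """True if the site-level rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                d.strip().lower() for d in content.split(",") if d.strip()
            )


def meta_directives(html: str) -> set:
    """Per-document exclusion directives found in the HTML."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


rules = build_parser(ROBOTS_TXT)
blocked = may_crawl(rules, "ExampleBot", "http://www.example.com/private/report.html")
allowed = may_crawl(rules, "ExampleBot", "http://www.example.com/index.html")
page_rules = meta_directives(
    '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
)
```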
Hands on: The Internet Archive alexa.com Web crawls since 1997 – Check out Maryland’s Web site in 1997 Check out the history of your favorite site
Discussion Point Can we save everything? Should we? Do people have a right to remove things?
The “Deep Web” Dynamic pages, generated from databases Not easily discovered using crawling Perhaps times larger than surface Web Fastest growing source of new information
Content of the Deep Web
Deep Web 60 Deep Sites Exceed Surface Web by 40 Times
Name | Type | URL | Web Size (GBs)
National Climatic Data Center (NOAA) | Public | http:// urces.html | 366,000
NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http:// | 32,940
Alexa | Public (partial) | |
Right-to-Know Network (RTK Net) | Public | http:// |
MP3.com | Public | http:// |
Source: James Crawford,
Native speakers, Global Reach projection for 2004 (as of Sept, 2003) Global Internet Users
World Trade in 2001 Source: World Trade Organization
European Web Content Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997
Blogs Doubling 18.9 Million Weblogs Tracked Doubling in size approx. every 5 months Consistent doubling over the last 36 months
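A back-of-the-envelope check of what the doubling claim implies, assuming an exact 5-month doubling period held for all 36 months (the slide says only "approx."). The function names are illustrative.

```python
def growth_factor(elapsed_months: float, doubling_months: float) -> float:
    """Multiplicative growth after elapsed_months at a fixed doubling period."""
    return 2.0 ** (elapsed_months / doubling_months)


def implied_earlier_count(current_count: float, elapsed_months: float,
                          doubling_months: float) -> float:
    """Count implied elapsed_months ago, working backward from today's count."""
    return current_count / growth_factor(elapsed_months, doubling_months)


TRACKED_NOW = 18_900_000                             # weblogs tracked, from the slide
factor = growth_factor(36, 5)                        # 7.2 doublings
earlier = implied_earlier_count(TRACKED_NOW, 36, 5)  # implied count 36 months earlier
```

Under that assumption, 36 months is 7.2 doublings, a growth factor of roughly 147, which would put the tracked count at roughly 130 thousand three years earlier.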
Blue = Mainstream Media Red = Blog Challenge: Fight, or Embrace?
Daily Posting Volume 1.2 Million legitimate Posts/Day Spam posts marked in red On average, additional 5.8% are spam posts Some spam spikes as high as 18%
[Chart event labels: Kryptonite Lock Controversy, US Election Day, Indian Ocean Tsunami, Superbowl, Schiavo Dies, Newsweek Koran, Deepthroat Revealed, Justice O’Connor, Live 8 Concerts, London Bombings, Katrina]
A Web of Speech?
 | Web in 1995 | Speech in 2005
Storage (words per $) | 300K | 1.5M
Internet Backbone (simultaneous users) | 250K | 30M
“Last Mile” (Download time) | 1 second (no graphics) | Streaming
Display Capability (Computers/US population) | 10% | 100%
Search Systems | Lycos, Yahoo |
Rethinking the Spoken Word Speech is better for some things than writing Spoken bits are as persistent as written bits Storage cost is 80 times that of text –Disk cost falls by a factor of 80 in ~16 years If speech is searchable, we will keep lots of it
A Little Math Collectable spoken words ≈ 10 Tw/day –1 billion users * 100 words/min * 200 min/day / 2 Compressed speech ≈ 2 words/kiloByte –(100/60 w/sec) / (6.5 kb/sec / 8 b/B) Required storage ≈ 5 PetaBytes/day Storage array sales > 5 PB/day –457 PB in 2Q 2005 (increasing 59% per year) $22/person/year (decreasing at 31%/year) Source: IDC Worldwide Disk Storage Systems Tracker, 2Q 2005
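The slide's arithmetic can be reproduced in a few lines. This sketch takes words per kilobyte as the speaking rate divided by the compressed byte rate, and assumes decimal units (1 kB = 1,000 bytes, 1 PB = 10^15 bytes) and a 91-day second quarter. The constant names are illustrative, and the halving factor is carried over from the slide without interpretation.

```python
USERS = 1_000_000_000       # 1 billion users
WORDS_PER_MIN = 100         # speaking rate
MINUTES_PER_DAY = 200       # speech per user per day
HALVING_FACTOR = 2          # the slide's "/ 2", taken as given
BITRATE_KBPS = 6.5          # compressed speech, kilobits per second
BITS_PER_BYTE = 8
BYTES_PER_KB = 1_000        # assumption: decimal kilobytes
BYTES_PER_PB = 10 ** 15     # assumption: decimal petabytes
Q2_SALES_PB = 457           # storage array sales in 2Q 2005, from the slide
Q2_DAYS = 91                # April through June


def words_per_day() -> float:
    """Collectable spoken words per day."""
    return USERS * WORDS_PER_MIN * MINUTES_PER_DAY / HALVING_FACTOR


def words_per_kilobyte() -> float:
    """Words per second divided by kilobytes per second."""
    words_per_sec = WORDS_PER_MIN / 60
    kilobytes_per_sec = BITRATE_KBPS / BITS_PER_BYTE
    return words_per_sec / kilobytes_per_sec


def storage_pb_per_day() -> float:
    """Petabytes needed per day to store all collectable speech."""
    kilobytes = words_per_day() / words_per_kilobyte()
    return kilobytes * BYTES_PER_KB / BYTES_PER_PB


def sales_pb_per_day() -> float:
    """Average daily storage array sales during the quarter."""
    return Q2_SALES_PB / Q2_DAYS
```

With these inputs the unrounded figures come out near 2.05 words per kilobyte and a little under 5 PB/day, against sales of just over 5 PB/day; the slide's round numbers follow from rounding to 2 words per kilobyte.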
Human History Oral Tradition Writing Human Future Writing and Speech
Hands On: Speech on the Web singingfish.com blinkx.com ocw.mit.edu podcasts.yahoo.com