Web Characterization Week 9 LBSC 690 Information Technology.

1 Web Characterization Week 9 LBSC 690 Information Technology

2 Outline
What is the Web?
What’s on the Web?
What is the nature of the Web?
Preserving the Web

3 Defining the Web
HTTP, HTML, or URL?
Static, dynamic or streaming?
Public, protected, or internal?

4 Economics of the Web in 1995
Affordable storage
–300,000 words/$
Adequate backbone capacity
–25,000 simultaneous transfers
Adequate “last mile” bandwidth
–1 second/screen
Display capability
–10% of US population
Effective search capabilities
–Lycos (now Google), Yahoo

5 Nature of the Web
Over one billion pages by 1999
–Growing at 25% per month!
–Google indexed about 3 billion pages in 2003
Unstable
–Changing at 1% per week
Redundant
–30-40% (near) duplicates, e.g., Unix man page tree
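
To get a feel for what "25% per month" means, the short sketch below works out the implied doubling time and yearly growth factor. It assumes the rate compounds monthly and stays constant, which the slide does not state.

```python
import math

monthly_growth = 0.25  # 25% per month, the figure quoted on the slide

# Doubling time in months: solve (1 + g)^t = 2 for t.
doubling_months = math.log(2) / math.log(1 + monthly_growth)

# Growth factor after 12 months of compounding (assumes a constant rate).
annual_factor = (1 + monthly_growth) ** 12

print(f"doubling time: {doubling_months:.1f} months")   # doubling time: 3.1 months
print(f"annual growth factor: {annual_factor:.1f}x")    # annual growth factor: 14.6x
```

Under those assumptions the Web would double roughly every three months, which is why such a rate could not be sustained for long.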

6 Source: Michael Lesk, How Much Information is there in the World?

7 Number of Web Sites

8 Web Sites by Country, 2002

9 What’s a Web “Site”?
OCLC counts any server at port 80
–Misses many servers at other ports
Some servers host unrelated content
–Geocities
Some content requires specialized servers
–rtsp

10 World Trade in 2001 Source: World Trade Organization

11 Global Internet User Population [chart by language, including English and Chinese; axis labels 2000 and 2005] Source: Global Reach

12 Widely Spoken Languages Source: http://www.g11n.com/faq.html

13 Source: James Crawford, http://ourworld.compuserve.com/homepages/JWCRAWFORD/can-pop.htm

14 Web Page Languages Source: Jack Xu, Excite@Home, 1999

15 European Web Size: Exponential Growth Source: Extrapolated from Grefenstette and Nioche, RIAO 2000

16 European Web Content Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997

17 Live Streams
Almost 2000 Internet-accessible Radio and Television Stations
Source: www.real.com, Feb 2000

18 Streaming Media
SingingFish indexes 35 million streams
60% of queries are for music
–Then movies
–Then sports
–Then news

19 Crawling the Web

20 Web Crawl Challenges
Temporary server interruptions
Discovering “islands” and “peninsulas”
Duplicate and near-duplicate content
Dynamic content
Link rot
Server and network loads
Have I seen this page before?
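
Two of these challenges, "Have I seen this page before?" and undiscovered "islands", can be illustrated with a toy breadth-first crawler. This is only a sketch: the link graph, the `.example` URLs, and the `normalize` and `crawl` helpers are all made up for illustration, and a real crawler would fetch pages over the network and handle errors, politeness delays, and much more.

```python
from collections import deque
from urllib.parse import urldefrag, urlsplit, urlunsplit

# Toy link graph standing in for the Web; every URL here is hypothetical.
LINKS = {
    "http://a.example/": ["http://a.example/x", "HTTP://A.EXAMPLE/x#top"],
    "http://a.example/x": ["http://a.example/", "http://b.example/"],
    "http://b.example/": [],
    "http://island.example/": ["http://a.example/"],  # no page links to this one
}


def normalize(url: str) -> str:
    """Drop the fragment and lowercase scheme and host so aliases compare equal."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       parts.path or "/", parts.query, ""))


def crawl(seed: str) -> list:
    """Breadth-first crawl that visits each normalized URL at most once."""
    start = normalize(seed)
    seen = {start}
    frontier = deque([start])
    order = []
    while frontier:
        url = frontier.popleft()
        order.append(url)
        for link in LINKS.get(url, []):
            link = normalize(link)
            if link not in seen:  # "Have I seen this page before?"
                seen.add(link)
                frontier.append(link)
    return order


visited = crawl("http://a.example/")
print(visited)
```

The `seen` set keeps the crawler from looping between pages that link to each other, and `http://island.example/` is never reached because no crawled page links to it.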

21 Duplicate Detection
Structural
–Identical directory structure (e.g., mirrors, aliases)
Syntactic
–Identical bytes
–Identical markup (HTML, XML, …)
Semantic
–Identical content
–Similar content (e.g., with a different banner ad)
–Related content (e.g., translated)
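
As a rough illustration of two of these levels, the sketch below compares pages by a byte-level hash (the "identical bytes" case) and by word-shingle overlap (one possible way to catch "similar content"). Shingling with Jaccard similarity is a technique chosen here for illustration, not one named on the slide, and the sample pages and helper names are invented.

```python
import hashlib


def byte_fingerprint(data: bytes) -> str:
    """Syntactic check: identical bytes give identical SHA-256 digests."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Return the set of k-word shingles (overlapping word windows)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Two hypothetical pages that differ only in their banner ad.
page_a = "Ad: buy shoes. The ls command lists directory contents on Unix systems"
page_b = "Ad: buy hats. The ls command lists directory contents on Unix systems"

same_bytes = byte_fingerprint(page_a.encode()) == byte_fingerprint(page_b.encode())
similarity = jaccard(shingles(page_a), shingles(page_b))
print(same_bytes, round(similarity, 2))
```

The byte hashes differ even though most shingles are shared, so an exact-match check alone would miss this pair. The similarity score above which two pages count as near-duplicates is a tuning decision that this sketch leaves open.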

22 Robots Exclusion Protocol
Based on voluntary compliance by crawlers
Exclusion by site
–Create a robots.txt file at the server’s top level
–Indicate which directories not to crawl
Exclusion by document (in HTML head)
–Not implemented by all crawlers
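
A well-behaved crawler might check site-level exclusions roughly as follows, using Python's standard `urllib.robotparser` module. The robots.txt contents, the `ExampleBot` user agent, and the example.com URLs are placeholders invented for this sketch; a real crawler would download the file from the site's top level instead of parsing an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents for an illustrative site.
robots_txt = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# "ExampleBot" and example.com are placeholders, not a real crawler or site.
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/private/notes.html")
print(allowed, blocked)  # True False
```

Document-level exclusion is typically expressed with a tag such as `<meta name="robots" content="noindex, nofollow">` in the page's HTML head. In both cases nothing enforces the rule; the crawler has to choose to honor it.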

23 Link Structure of the Web

24 The Deep Web
Dynamic pages, generated from databases
Not easily discovered using crawling
Perhaps 400-500 times larger than surface Web
Fastest growing source of new information

25 Content of the Deep Web

26 Deep Web: 60 Deep Sites Exceed Surface Web by 40 Times
Name | Type | URL | Web Size (GBs)
National Climatic Data Center (NOAA) | Public | http://www.ncdc.noaa.gov/ol/satellite/satelliteresources.html | 366,000
NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http://www.nodc.noaa.gov/, http://www.ngdc.noaa.gov/ | 32,940
Alexa | Public (partial) | http://www.alexa.com/ | 15,860
Right-to-Know Network (RTK Net) | Public | http://www.rtk.net/ | 14,640
MP3.com | Public | http://www.mp3.com/ |

27 Hands on: The Wayback Machine
Internet Archive
–Stored Alexa.com Web crawls since 1997
–http://archive.org
Check out Maryland’s Web site in 1997
Check out the history of your favorite site
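
For the hands-on exercise, archived captures are commonly addressed with URLs of the form `https://web.archive.org/web/<timestamp>/<original URL>`. The helper below only builds such a URL and fetches nothing. The function name is made up, and `http://www.umd.edu/` is an assumed address for Maryland's site, since the slide does not give one.

```python
def wayback_url(page_url: str, timestamp: str) -> str:
    """Build a Wayback Machine URL for a capture near `timestamp`.

    `timestamp` is a string of digits in YYYYMMDDhhmmss order.
    """
    if not timestamp.isdigit():
        raise ValueError("timestamp must contain only digits")
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


# Assumed address for the University of Maryland site; not given on the slide.
url_1997 = wayback_url("http://www.umd.edu/", "19970101000000")
print(url_1997)
```

Opening the printed URL in a browser should lead to a capture close to the requested date, if the archive holds one.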

28 Discussion Point
Can we save everything?
Should we?
Do people have a right to remove things?

