Web Characterization Week 11 LBSC 690 Information Technology.

The Why of the Web (in 1995)
Affordable storage
  – 300,000 words/$
Adequate backbone capacity
  – 25,000 simultaneous transfers
Adequate “last mile” bandwidth
  – 1 second/screen
Display capability
  – 10% of US population
Effective search capabilities
  – Lycos, Yahoo

Defining the Web
HTTP, HTML, or URL?
Static, dynamic, or streaming?
Public, protected, or internal?

Number of Web Sites

Discussion Topic: What’s a Web “Site”?
OCLC counted any server at port 80
  – Misses many servers at other ports
Some servers host unrelated content
  – Geocities
Some content requires specialized servers
  – rtsp

Crawling the Web

Web Crawl Challenges
Discovering “islands” and “peninsulas”
Duplicate and near-duplicate content
  – 30-40% of total content
Server and network loads
Dynamic content generation
Link rot
  – Changes at 1% per week
Temporary server interruptions

Link Structure of the Web

Duplicate Detection
Structural
  – Identical directory structure (e.g., mirrors, aliases)
Syntactic
  – Identical bytes
  – Identical markup (HTML, XML, …)
Semantic
  – Identical content
  – Similar content (e.g., with a different banner ad)
  – Related content (e.g., translated)
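The syntactic and "similar content" cases can be made concrete with a small sketch. It assumes SHA-256 digests for the identical-bytes case and word shingles compared by Jaccard similarity for near-duplicates. The slide names neither technique, so the function names, the shingle size of 3, and the sample pages are all illustrative.

```python
import hashlib


def byte_fingerprint(page: bytes) -> str:
    """Identical bytes always produce identical digests (syntactic duplicates)."""
    return hashlib.sha256(page).hexdigest()


def shingles(text: str, k: int = 3) -> set:
    """Set of overlapping k-word sequences, case-folded."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: str, b: str, k: int = 3) -> float:
    """Shingle-set overlap in [0, 1]; 1.0 means the shingle sets are equal."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


# Hypothetical pages: a mirror, a copy with a different banner, and an unrelated page.
page = "the quick brown fox jumps over the lazy dog near the river bank"
mirror = page
with_banner = "buy now " + page
unrelated = "storage prices keep falling every single year"

same_bytes = byte_fingerprint(page.encode()) == byte_fingerprint(mirror.encode())
banner_similarity = jaccard(page, with_banner)
unrelated_similarity = jaccard(page, unrelated)
```

A byte hash catches mirrors but misses the banner-ad copy, which is why a similarity score with a threshold is needed for near-duplicates. Related content such as translations shares no shingles and needs other methods.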

Robots Exclusion Protocol
Requires voluntary compliance by crawlers
Exclusion by site
  – Create a robots.txt file at the server’s top level
  – Indicate which directories not to crawl
Exclusion by document (in HTML head)
  – Not implemented by all crawlers
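As a rough illustration of exclusion by site, the sketch below feeds a made-up robots.txt to Python's standard `urllib.robotparser` and asks whether two URLs may be crawled. The file contents, the `example.org` URLs, the `ExampleBot` agent name, and the `may_crawl` helper are assumptions for the example, not taken from the slides.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, as it might be served from a site's top level.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_crawl(url: str, agent: str = "ExampleBot") -> bool:
    """A compliant crawler checks this before fetching; nothing enforces it."""
    return parser.can_fetch(agent, url)


allowed = may_crawl("http://example.org/index.html")
blocked = may_crawl("http://example.org/private/report.html")
```

The check runs entirely on the crawler's side, which is why compliance is voluntary. Exclusion by document is commonly expressed with a `<meta name="robots" content="noindex, nofollow">` element in the page head.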

Hands on: The Internet Archive
alexa.com Web crawls since 1997
Check out Maryland’s Web site in 1997
Check out the history of your favorite site

Discussion Point
Can we save everything?
Should we?
Do people have a right to remove things?

The “Deep Web”
Dynamic pages, generated from databases
Not easily discovered using crawling
Perhaps times larger than surface Web
Fastest growing source of new information

Content of the Deep Web

Deep Web
60 Deep Sites Exceed Surface Web by 40 Times

Name | Type | URL | Web Size (GBs)
National Climatic Data Center (NOAA) | Public | http:// urces.html | 366,000
NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http:// | 32,940
Alexa | Public (partial) | |
Right-to-Know Network (RTK Net) | Public | http:// |
MP3.com | Public | http:// |

Source: James Crawford,

Native speakers, Global Reach projection for 2004 (as of Sept, 2003) Global Internet Users

World Trade in 2001
Source: World Trade Organization

European Web Content
Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997

Blogs Doubling
18.9 million weblogs tracked
Doubling in size approx. every 5 months
Consistent doubling over the last 36 months
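To get a feel for what this doubling rate implies, the arithmetic can be sketched as follows. It treats the doubling period as exactly 5 months for all 36 months, which the slide states only approximately, so the results are rough orders of magnitude rather than reported figures.

```python
months_observed = 36       # "over the last 36 months"
doubling_period = 5        # "approx. every 5 months", taken as exact here
blogs_now = 18.9e6         # weblogs tracked

doublings = months_observed / doubling_period   # 7.2 doublings
growth_factor = 2 ** doublings                  # roughly 147-fold growth
implied_start = blogs_now / growth_factor       # roughly 130,000 weblogs

print(f"{doublings:.1f} doublings, about {growth_factor:.0f}x growth")
print(f"implied count 36 months earlier: about {implied_start:,.0f}")
```

Under these assumptions the tracked blogosphere would have grown by a factor of about 147 in three years.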

Challenge: Fight, or Embrace?
Blue = Mainstream Media
Red = Blog

Daily Posting Volume
1.2 million legitimate posts/day
Spam posts marked in red
On average, an additional 5.8% are spam posts
Some spam spikes as high as 18%
Events marked on the chart: Kryptonite Lock Controversy, US Election Day, Indian Ocean Tsunami, Superbowl, Schiavo Dies, Newsweek Koran, Deepthroat Revealed, Justice O’Connor, Live 8 Concerts, London Bombings, Katrina

A Web of Speech?

 | Web in 1995 | Speech in 2005
Storage (words per $) | 300K | 1.5M
Internet Backbone (simultaneous users) | 250K | 30M
“Last Mile” (Download time) | 1 second (no graphics) | Streaming
Display Capability (Computers/US population) | 10% | 100%
Search Systems | Lycos, Yahoo |

Rethinking the Spoken Word
Speech is better for some things than writing
Spoken bits are as persistent as written bits
Storage cost for speech is 80 times that of text
  – Disk cost falls by a factor of 80 in ~16 years
=> If speech is searchable, we will keep lots of it
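The "factor of 80 in ~16 years" figure can be converted to an annual rate and a halving time. This is a back-of-the-envelope sketch that assumes a steady exponential decline over the whole period, which the slide does not state.

```python
import math

factor = 80    # total drop in disk cost, from the slide
years = 16     # period over which it happens (approximate on the slide)

annual_factor = factor ** (1 / years)      # cost shrinks by this factor each year
annual_decline = 1 - 1 / annual_factor     # fractional price drop per year
halving_time_years = years * math.log(2) / math.log(factor)

print(f"annual decline about {annual_decline:.0%}")
print(f"cost halves about every {halving_time_years:.1f} years")
```

With a steady decline, disk cost would fall by about a quarter each year and halve roughly every two and a half years.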

A Little Math
Collectable spoken words ≈ 10 Tw/day
  – 1 billion users * 100 words/min * 200 min/day / 2
Compressed speech ≈ 2 words/kiloByte
  – (100/60 w/sec) / (6.5 kb/sec / 8 b/B)
=> Required storage ≈ 5 PetaBytes/day
Storage array sales > 5 PB/day
  – 457 PB in 2Q 2005 (increasing 59% per year)
$22/person/year (decreasing at 31%/year)
Source: IDC Worldwide Disk Storage Systems Tracker, 2Q 2005
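The estimate can be checked step by step. The sketch below uses the slide's inputs and divides words per second by kilobytes per second to get words per kilobyte. It assumes decimal units (1 kB = 1,000 bytes, 1 PB = 10^15 bytes) and a 91-day quarter for 2Q 2005; the variable names are illustrative.

```python
# Collectable spoken words per day (inputs from the slide, including the /2).
users = 1e9
words_per_min = 100
minutes_per_day = 200
words_per_day = users * words_per_min * minutes_per_day / 2   # 1e13 = 10 Tw/day

# Words stored per kilobyte of compressed speech.
words_per_sec = words_per_min / 60
kilobytes_per_sec = 6.5 / 8            # 6.5 kilobits/s expressed in kilobytes/s
words_per_kb = words_per_sec / kilobytes_per_sec              # about 2.05

# Storage needed per day, in petabytes (assuming 1 kB = 1000 B, 1 PB = 1e15 B).
storage_pb_per_day = words_per_day / words_per_kb * 1000 / 1e15

# Storage array sales per day: 457 PB over the 91 days of 2Q 2005.
array_sales_pb_per_day = 457 / 91

print(f"{words_per_day:.0e} words/day, {words_per_kb:.2f} words/kB")
print(f"need about {storage_pb_per_day:.1f} PB/day, "
      f"sales about {array_sales_pb_per_day:.1f} PB/day")
```

With these inputs the required storage comes out just under 5 PB/day and array sales just over it, which is the comparison the slide draws.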

Human History
  – Oral Tradition
  – Writing
Human Future
  – Writing and Speech

Hands On: Speech on the Web
singingfish.com
blinkx.com
ocw.mit.edu
podcasts.yahoo.com