Web Characterization Week 9 LBSC 690 Information Technology.



Outline
What is the Web?
What’s on the Web?
What is the nature of the Web?
Preserving the Web

Defining the Web
HTTP, HTML, or URL?
Static, dynamic, or streaming?
Public, protected, or internal?

Economics of the Web in 1995
Affordable storage
  – 300,000 words/$
Adequate backbone capacity
  – 25,000 simultaneous transfers
Adequate “last mile” bandwidth
  – 1 second/screen
Display capability
  – 10% of US population
Effective search capabilities
  – Lycos (now Google), Yahoo

Nature of the Web
Over one billion pages by 1999
  – Growing at 25% per month!
  – Google indexed about 3 billion pages in 2003
Unstable
  – Changing at 1% per week
Redundant
  – 30-40% (near) duplicates, e.g., the Unix man page tree
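To get a feel for the growth and change rates on this slide, the arithmetic can be sketched as below. This is only an illustration: it assumes steady compound growth and, for the change rate, that each week's changes are independent. Neither assumption comes from the slide, and the function names are made up for this example.

```python
import math


def doubling_time(rate_per_period):
    """Number of periods needed to double under steady compound growth."""
    return math.log(2) / math.log(1 + rate_per_period)


def growth_factor(rate_per_period, periods):
    """Total multiplication in size after the given number of periods."""
    return (1 + rate_per_period) ** periods


def fraction_changed(rate_per_period, periods):
    """Share of pages changed at least once, ASSUMING independent changes."""
    return 1 - (1 - rate_per_period) ** periods


if __name__ == "__main__":
    # 25% per month: how long to double, and how much growth in a year?
    print(f"doubling time: {doubling_time(0.25):.1f} months")
    print(f"growth over 12 months: x{growth_factor(0.25, 12):.1f}")
    # 1% of pages changing per week, over one year
    print(f"changed within 52 weeks: {fraction_changed(0.01, 52):.0%}")
```

Under these assumptions, 25% per month means the Web doubles roughly every three months, which is why such a rate could not be sustained for long.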

Source: Michael Lesk, How Much Information is there in the World?

Number of Web Sites

Web Sites by Country, 2002

What’s a Web “Site”?
OCLC counts any server at port 80
  – Misses many servers at other ports
Some servers host unrelated content
  – Geocities
Some content requires specialized servers
  – rtsp
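The counting problem on this slide can be illustrated with a small sketch. It is not OCLC's method, just a toy census that counts distinct hosts answering on port 80. The URLs are hypothetical (example.org and example.net are reserved example domains), and the helper names are invented for this illustration.

```python
from urllib.parse import urlsplit

# Well-known default ports for the schemes used in this example.
DEFAULT_PORTS = {"http": 80, "https": 443, "rtsp": 554}


def server_key(url):
    """Identify a server by (scheme, host, port), filling in default ports."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def count_port_80_sites(urls):
    """Count distinct hosts on port 80, as a port-80 census would."""
    hosts = {host for _scheme, host, port in map(server_key, urls) if port == 80}
    return len(hosts)


# Hypothetical URLs: two unrelated personal pages on one shared host,
# a server on a non-standard port, and a streaming server.
URLS = [
    "http://www.example.org/",
    "http://www.example.org/~alice/poetry.html",
    "http://www.example.org/~bob/cars.html",
    "http://www.example.org:8080/admin/",
    "rtsp://media.example.net/live/radio",
]

if __name__ == "__main__":
    print("distinct servers:", len({server_key(u) for u in URLS}))
    print("sites seen by a port-80 count:", count_port_80_sites(URLS))
```

In this toy data the port-80 count reports a single site, even though that host carries unrelated content from two users and two other servers are missed entirely.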

World Trade in 2001
Source: World Trade Organization

Global Internet User Population (chart labels: English, Chinese)
Source: Global Reach

Widely Spoken Languages Source:

Source: James Crawford,

Web Page Languages
Source: Jack Xu, 1999

European Web Size: Exponential Growth
Source: Extrapolated from Grefenstette and Nioche, RIAO 2000

European Web Content
Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997

Live Streams
Almost 2000 Internet-accessible radio and television stations (source: Feb 2000)

Streaming Media
SingingFish indexes 35 million streams
60% of queries are for music
  – Then movies
  – Then sports
  – Then news

Crawling the Web

Web Crawl Challenges
Temporary server interruptions
Discovering “islands” and “peninsulas”
Duplicate and near-duplicate content
Dynamic content
Link rot
Server and network loads
Have I seen this page before?
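The “Have I seen this page before?” question is commonly handled by normalizing each URL and keeping a set of the ones already queued. The sketch below is one minimal way to do that, not a production crawler: it walks a hypothetical in-memory link table (`LINKS`) instead of fetching pages, and the normalization rules are a simplified assumption.

```python
from collections import deque
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Reduce trivially different spellings of a URL to one form."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the HTTP default.
    if parts.port and not (scheme == "http" and parts.port == 80):
        host = f"{host}:{parts.port}"
    path = parts.path or "/"
    # The fragment (#...) never reaches the server, so it is dropped.
    return urlunsplit((scheme, host, path, parts.query, ""))


def crawl(seed, get_links):
    """Breadth-first crawl; get_links(url) returns the links on that page."""
    start = normalize(seed)
    seen = {start}
    frontier = deque([start])
    visited = []
    while frontier:
        url = frontier.popleft()
        visited.append(url)
        for link in get_links(url):
            key = normalize(link)
            if key not in seen:  # have I seen this page before?
                seen.add(key)
                frontier.append(key)
    return visited


# Hypothetical link table standing in for real HTTP fetches.
LINKS = {
    "http://example.org/": [
        "http://EXAMPLE.org/a.html",
        "http://example.org:80/a.html#top",
        "http://example.org/b.html",
    ],
    "http://example.org/a.html": ["http://example.org"],
    "http://example.org/b.html": [],
    "http://island.example.org/": [],  # nothing links here
}

if __name__ == "__main__":
    for page in crawl("http://example.org", lambda u: LINKS.get(u, [])):
        print(page)
```

The page at island.example.org is never visited because no crawled page links to it, which is the “island” problem from the slide. A crawler can only find such pages through other seeds.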

Duplicate Detection
Structural
  – Identical directory structure (e.g., mirrors, aliases)
Syntactic
  – Identical bytes
  – Identical markup (HTML, XML, …)
Semantic
  – Identical content
  – Similar content (e.g., with a different banner ad)
  – Related content (e.g., translated)
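Two of these levels can be sketched in a few lines. Byte-identical copies can be caught by comparing hashes. Similar content, such as the same page with a different banner ad, can be estimated with word shingles and Jaccard similarity. Shingling is a common technique chosen here for illustration; the slide does not name a method, and the sample texts and the shingle size of 3 are assumptions.

```python
import hashlib


def fingerprint(data):
    """Hash of the raw bytes; equal fingerprints mean identical bytes."""
    return hashlib.sha256(data).hexdigest()


def shingles(text, k=3):
    """Set of all k-word sequences in the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}


def jaccard(a, b):
    """Overlap of two sets: |A and B| / |A or B|."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Hypothetical page texts: the same body under two different banner ads.
BODY = ("the unix manual page for ls lists directory contents and supports "
        "many options such as long listing sorting by time and recursion")
PAGE_A = "Buy shoes now! " + BODY
PAGE_B = "Visit our sponsor! " + BODY
UNRELATED = "completely different text about the weather in maryland this week"

if __name__ == "__main__":
    same_bytes = fingerprint(PAGE_A.encode()) == fingerprint(PAGE_B.encode())
    print("identical bytes:", same_bytes)
    print("similarity A vs B:", round(jaccard(shingles(PAGE_A), shingles(PAGE_B)), 2))
    print("similarity A vs unrelated:", jaccard(shingles(PAGE_A), shingles(UNRELATED)))
```

The hash comparison fails as soon as one byte differs, while the shingle overlap stays high for the two banner variants. Related content such as translations would need a different approach altogether.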

Robots Exclusion Protocol
Based on voluntary compliance by crawlers
Exclusion by site
  – Create a robots.txt file at the server’s top level
  – Indicate which directories not to crawl
Exclusion by document (in HTML head)
  – Not implemented by all crawlers
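Both kinds of exclusion can be checked with Python's standard library, as in the sketch below. The robots.txt content, the crawler name "ExampleBot", and the URLs are made up for illustration. The per-document form is assumed to be the usual robots meta tag in the HTML head.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt placed at the top level of a server.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def site_allows(robots_txt, agent, url):
    """Exclusion by site: may this crawler fetch this URL?"""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives |= {
                item.strip().lower() for item in content.split(",") if item.strip()
            }


def page_directives(html):
    """Exclusion by document: directives found in the page itself."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    print(site_allows(ROBOTS_TXT, "ExampleBot", "http://www.example.org/index.html"))
    print(site_allows(ROBOTS_TXT, "ExampleBot", "http://www.example.org/private/notes.html"))
    page = '<html><head><meta name="robots" content="noindex, nofollow"></head></html>'
    print(sorted(page_directives(page)))
```

Nothing here is enforced by the server. As the slide notes, the protocol works only because well-behaved crawlers run checks like these before fetching or indexing.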

Link Structure of the Web

The Deep Web
Dynamic pages, generated from databases
Not easily discovered using crawling
Perhaps times larger than surface Web
Fastest growing source of new information
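Why database-backed pages escape a link-following crawler can be shown with a toy example. The table, its rows, and the search form below are invented for illustration. The point is only that the result pages exist solely as answers to queries, so there are no links for a crawler to follow.

```python
import re
import sqlite3


def build_db():
    """Small in-memory database standing in for a deep Web source."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE stations (id INTEGER PRIMARY KEY, name TEXT, state TEXT)")
    db.executemany(
        "INSERT INTO stations (name, state) VALUES (?, ?)",
        [("College Park", "MD"), ("Baltimore", "MD"), ("Richmond", "VA")],
    )
    return db


# The only static page: a search form with no links to any record.
SEARCH_FORM = '<form action="/search"><input name="state"><input type="submit"></form>'


def result_page(db, state):
    """Dynamic page generated on demand from a query."""
    rows = db.execute(
        "SELECT name FROM stations WHERE state = ? ORDER BY name", (state,)
    ).fetchall()
    items = "".join(f"<li>{name}</li>" for (name,) in rows)
    return f"<ul>{items}</ul>"


def extract_links(html):
    """What a simple link-following crawler would find on a page."""
    return re.findall(r'href="([^"]+)"', html)


if __name__ == "__main__":
    db = build_db()
    print("links on the form page:", extract_links(SEARCH_FORM))
    print("page for state=MD:", result_page(db, "MD"))
```

The crawler sees a form and nothing else. The records appear only when someone submits a value, which a crawler that merely follows links never does.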

Content of the Deep Web

Deep Web
60 Deep Sites Exceed Surface Web by 40 Times

Name | Type | URL | Web Size (GBs)
National Climatic Data Center (NOAA) | Public | http:// urces.html | 366,000
NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http:// | 32,940
Alexa | Public (partial) | |
Right-to-Know Network (RTK Net) | Public | http:// |
MP3.com | Public | http:// |

Hands on: The Wayback Machine
Internet Archive
  – Stored Alexa.com Web crawls since 1997
Check out Maryland’s Web site in 1997
Check out the history of your favorite site
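For the hands-on exercise, an archived page can be addressed directly by putting a timestamp and the original address into a Wayback Machine URL. The sketch below only builds such a URL and does not contact the archive. It assumes Maryland's site is http://www.umd.edu/, which the slide does not state, and the helper name is invented.

```python
def wayback_url(page_url, year, month=1, day=1):
    """Build a Wayback Machine address for a page near the given date.

    The timestamp has the form YYYYMMDDhhmmss. The archive is expected
    to serve the capture closest to it.
    """
    timestamp = f"{year:04d}{month:02d}{day:02d}000000"
    return f"https://web.archive.org/web/{timestamp}/{page_url}"


if __name__ == "__main__":
    # Assumed address for the University of Maryland home page.
    print(wayback_url("http://www.umd.edu/", 1997))
```

Opening the printed address in a browser should lead to the capture closest to that date, if one exists. Swapping in another site and year covers the second exercise.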

Discussion Point
Can we save everything?
Should we?
Do people have a right to remove things?