Web Characterization Week 11 LBSC 690 Information Technology.

The Why of the Web (in 1995)
Affordable storage
  – 300,000 words/$
Adequate backbone capacity
  – 25,000 simultaneous transfers
Adequate “last mile” bandwidth
  – 1 second/screen
Display capability
  – 10% of US population
Effective search capabilities
  – Lycos, Yahoo

Defining the Web
HTTP, HTML, or URL?
Static, dynamic, or streaming?
Public, protected, or internal?
Content or behavior?

Number of Web Sites

Discussion Topic: What’s a Web “Site”?
OCLC counted any server at port 80
  – Misses many servers at other ports
Some servers host unrelated content
  – Geocities
Some content requires specialized servers
  – rtsp

Crawling the Web

Link Structure of the Web

Web Crawl Challenges
Discovering “islands” and “peninsulas”
Duplicate and near-duplicate content
  – 30-40% of total content
Server and network loads
Dynamic content generation
Link rot
  – Changes at 1% per week
Temporary server interruptions

Duplicate Detection
Structural
  – Identical directory structure (e.g., mirrors, aliases)
Syntactic
  – Identical bytes
  – Identical markup (HTML, XML, …)
Semantic
  – Identical content
  – Similar content (e.g., with a different banner ad)
  – Related content (e.g., translated)
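The syntactic and semantic levels above can be illustrated with a small sketch. It is not from the slides: the function names, the 3-word shingle width, and the choice of SHA-256 and Jaccard overlap are assumptions made for illustration, and real crawlers use more scalable variants.

```python
import hashlib


def byte_fingerprint(data: bytes) -> str:
    """Digest of the raw bytes; equal digests indicate identical bytes."""
    return hashlib.sha256(data).hexdigest()


def shingles(text: str, width: int = 3) -> set:
    """Set of overlapping word n-grams ("shingles") from the text."""
    words = text.lower().split()
    if len(words) < width:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + width]) for i in range(len(words) - width + 1)}


def jaccard(a: str, b: str, width: int = 3) -> float:
    """Shared shingles divided by all shingles; 1.0 means the same word sequence."""
    sa, sb = shingles(a, width), shingles(b, width)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)
```

Two pages that differ only in a banner ad share most shingles and score well above unrelated pages, while their byte fingerprints differ. Related content such as translations would need a different technique.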

Robots Exclusion Protocol
Requires voluntary compliance by crawlers
Exclusion by site
  – Create a robots.txt file at the server’s top level
  – Indicate which directories not to crawl
Exclusion by document (in HTML head)
  – Not implemented by all crawlers
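A minimal sketch of site-level exclusion, using Python's standard urllib.robotparser module. The robots.txt content, the example.com URLs, and the crawler name ExampleCrawler are hypothetical; a compliant crawler would fetch the real file from the server's top level before requesting pages.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: every crawler is asked to skip two directories.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def allowed(url: str, robots_txt: str = ROBOTS_TXT,
            agent: str = "ExampleCrawler") -> bool:
    """Return True if the rules in robots_txt permit agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)
```

Nothing enforces the answer: the check only helps crawlers that choose to comply.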

Hands on: The Internet Archive alexa.com Web crawls since 1997 – Check out the CLIS Web site from 1998! Check out the history of your favorite site

Discussion Point
Can we save everything?
Should we?
Do people have a right to remove things?

The “Deep Web” Dynamic pages, generated from databases Not easily discovered using crawling Perhaps times larger than surface Web Fastest growing source of new information

Content of the Deep Web

Deep Web
60 Deep Sites Exceed Surface Web by 40 Times
Name | Type | URL | Web Size (GBs)
National Climatic Data Center (NOAA) | Public | http:// urces.html | 366,000
NASA EOSDIS | Public | http://harp.gsfc.nasa.gov/~imswww/pub/imswelcome/plain.html | 219,600
National Oceanographic (combined with Geophysical) Data Center (NOAA) | Public/Fee | http:// | 32,940
Alexa | Public (partial) | |
Right-to-Know Network (RTK Net) | Public | http:// |
MP3.com | Public | http:// |

Source: James Crawford,

Global Internet Users
Native speakers, Global Reach projection for 2004 (as of Sept. 2003)

World Trade in 2001 Source: World Trade Organization

European Web Content Source: European Commission, Evolution of the Internet and the World Wide Web in Europe, 1997

Blogs Doubling
18.9 Million Weblogs Tracked
Doubling in size approx. every 5 months
Consistent doubling over the last 36 months

Challenge: Fight, or Embrace?
Blue = Mainstream Media
Red = Blog

Daily Posting Volume
1.2 Million legitimate Posts/Day
Spam posts marked in red
On average, additional 5.8% are spam posts
Some spam spikes as high as 18%
Events labeled on the chart: Kryptonite Lock Controversy, US Election Day, Indian Ocean Tsunami, Superbowl, Schiavo Dies, Newsweek Koran, Deepthroat Revealed, Justice O’Connor, Live 8 Concerts, London Bombings, Katrina

A Web of Speech?
 | Web in 1995 | Speech in 2005
Storage (words per $) | 300K | 1.5M
Internet Backbone (simultaneous users) | 250K | 30M
“Last Mile” (Download time) | 1 second (no graphics) | Streaming
Display Capability (Computers/US population) | 10% | 100%
Search Systems | Lycos, Yahoo |

Rethinking the Spoken Word
Speech is better for some things than writing
Spoken bits are as persistent as written bits
Storage cost is 80 times that of text
  – Disk cost falls by a factor of 80 in ~16 years
→ If speech is searchable, we will keep lots of it

A Little Math
Collectable spoken words ≈ 10 Tw/day
  – 1 billion users * 100 words/min * 200 min/day / 2
Compressed speech ≈ 2 words/kiloByte
  – (100/60 w/sec) / (6.5 kb/sec / 8 b/B)
→ Required storage ≈ 5 PetaBytes/day
Storage array sales > 5 PB/day
  – 457 PB in 2Q 2005 (increasing 59% per year)
$22/person/year (decreasing at 31%/year)
Source: IDC Worldwide Disk Storage Systems Tracker, 2Q 2005
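The arithmetic on this slide can be checked with a few lines of Python. The inputs are the slide's own figures; the variable names, the decimal units (1 kB = 1,000 bytes, 1 PB = 10^15 bytes), and the 91-day quarter are assumptions added here. The division by 2 is taken from the slide as given.

```python
# Inputs quoted on the slide.
USERS = 1_000_000_000
WORDS_PER_MIN = 100
MIN_PER_DAY = 200
SPEECH_KBITS_PER_SEC = 6.5
PB_SHIPPED_2Q_2005 = 457

# Collectable spoken words per day (the slide halves the raw product).
words_per_day = USERS * WORDS_PER_MIN * MIN_PER_DAY / 2

# Words per kilobyte: speaking rate divided by the compressed data rate.
words_per_sec = WORDS_PER_MIN / 60
kbytes_per_sec = SPEECH_KBITS_PER_SEC / 8
words_per_kb = words_per_sec / kbytes_per_sec

# Required storage per day, in petabytes (assumed decimal units).
petabytes_per_day = words_per_day / words_per_kb * 1_000 / 1e15

# Storage array sales per day, assuming a 91-day quarter.
shipped_pb_per_day = PB_SHIPPED_2Q_2005 / 91

print(f"{words_per_day:.1e} words/day")
print(f"{words_per_kb:.2f} words/kB")
print(f"{petabytes_per_day:.2f} PB/day needed")
print(f"{shipped_pb_per_day:.2f} PB/day shipped")
```

Without rounding words per kilobyte to 2, the requirement comes out slightly under 5 PB/day, which is consistent with the slide's approximation.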

Human History: Oral Tradition, Writing
Human Future: Writing and Speech

Hands On: Speech on the Web
audio.search.yahoo.com
blinkx.com
ocw.mit.edu
podcasts.net

Minimum Scope: Segment, Object, Class
Behavior Category: Examine, Retain, Reference, Annotate, Create
Behaviors: View, Listen, Select, Print, Bookmark, Save, Purchase, Delete, Subscribe, Copy / paste, Quote, Forward, Reply, Link, Cite, Mark up, Tag, Publish, Organize, Type, Edit

Estimating Authority from Links
Figure labels: Authority, Hub
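The Authority and Hub labels match the vocabulary of the HITS algorithm. Assuming that is what this slide illustrates (the transcript does not say), a minimal sketch follows. The function name, the link-dictionary input format, and the iteration count are illustrative choices.

```python
import math


def hits(links: dict, iterations: int = 50):
    """Iteratively score pages as hubs and authorities.

    links maps each page to the list of pages it links to.
    Returns (hub_scores, authority_scores) as dictionaries.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    hub = {n: 1.0 for n in nodes}
    auth = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of pages linking to it.
        auth = {n: 0.0 for n in nodes}
        for src, targets in links.items():
            for t in targets:
                auth[t] += hub[src]
        # A page's hub score is the sum of the authority scores it links to.
        hub = {n: sum(auth[t] for t in links.get(n, ())) for n in nodes}
        # Rescale both score sets to unit length so values stay bounded.
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for n in scores:
                scores[n] /= norm
    return hub, auth
```

For example, with pages a and c linking to both x and y, and page b linking only to x, x ends up with the highest authority score and b with the lowest nonzero hub score.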

Collecting Click Streams
Browsing histories are easily captured
  – Make all links initially point to a central site
    Encode the desired URL as a parameter
  – Build a time-annotated transition graph for each user
    Cookies identify users (when they use the same machine)
  – Redirect the browser to the desired page
Reading time is correlated with interest
  – Can be used to build individual profiles
  – Used to target advertising by doubleclick.com
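The capture steps above can be sketched as follows. This is an illustration, not a description of any real tracking service: the tracker.example address, the url parameter name, the ClickLog class, and the use of a plain user id in place of a cookie are all hypothetical.

```python
from collections import defaultdict
from urllib.parse import parse_qs, urlencode, urlparse

TRACKER = "https://tracker.example/r"  # hypothetical central site


def tracked_link(target: str) -> str:
    """Rewrite a link so it points at the central site, with the target as a parameter."""
    return f"{TRACKER}?{urlencode({'url': target})}"


def target_of(tracked: str) -> str:
    """Recover the desired URL from a rewritten link."""
    return parse_qs(urlparse(tracked).query)["url"][0]


class ClickLog:
    """Per-user, time-annotated transition graph built from observed clicks."""

    def __init__(self):
        self.last = {}                   # user id -> (page, timestamp)
        self.graph = defaultdict(list)   # user id -> [(from, to, seconds)]

    def record(self, user: str, tracked: str, timestamp: float) -> str:
        """Log one click and return the URL to redirect the browser to."""
        page = target_of(tracked)
        if user in self.last:
            prev, prev_time = self.last[user]
            # Time between clicks approximates reading time on the previous page.
            self.graph[user].append((prev, page, timestamp - prev_time))
        self.last[user] = (page, timestamp)
        return page
```

The third element of each edge is the elapsed time between clicks, the quantity the slide treats as a proxy for interest.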

Search Engine Query Logs
A: Southeast Asia (Dec 27, 2004)
B: Indonesia (Mar 29, 2005)
C: Pakistan (Oct 10, 2005)
D: Hawaii (Oct 16, 2006)
E: Indonesia (Aug 8, 2007)
F: Peru (Aug 16, 2007)

AOL User

Gaining Access to Observations
Observe public behavior
  – Hypertext linking, publication, citing, …
Policy protection
  – EU: Privacy laws
  – US: Privacy policies + FTC enforcement
Statistical assurance of privacy
  – Distributed architecture
  – Model and mitigate privacy risks

Full Text Articles (Telecommunications)
Figure: Reading Time (seconds) by Rating (No Interest, Low Interest, Moderate Interest, High Interest)

More Complete Observations
User selects an article
  – Interpretation: Summary was interesting
User quickly prints the article
  – Interpretation: They want to read it
User selects a second article
  – Interpretation: Another interesting summary
User scrolls around in the article
  – Interpretation: Parts with high dwell time and/or repeated revisits are interesting
User stops scrolling for an extended period
  – Interpretation: User was interrupted

Abstracts (Pharmaceuticals)
Figure: rating categories No Interest, Low Interest, Moderate Interest, High Interest

Critical Issues
Protecting privacy
  – What absolute assurances can we provide?
  – How can we make remaining risks understood?
Scalable rating servers
  – Is a fully distributed architecture practical?
Non-cooperative users
  – How can the effect of spamming be limited?