TEK: Internet Search System for Low-Connectivity Communities Saman Amarasinghe Bill Thies Computer Science and Artificial Intelligence Laboratory Massachusetts.

Slides:



Advertisements
Similar presentations
3.02H Publishing a Website 3.02 Develop webpages..
Advertisements

The Internet and the Web
Copyright © 2008 Roger Webster, Ph.D. EDW647 Internet For Educators Conclusion Roger W. Webster, Ph.D. Department of Computer Science Millersville University.
By the end of this section, you will know and understand the hardware and software involved in making a LAN!
 2008 Pearson Education, Inc. All rights reserved Web Browser Basics: Internet Explorer and Firefox.
Google Chrome & Search C Chapter 18. Objectives 1.Use Google Chrome to navigate the Word Wide Web. 2.Manage bookmarks for web pages. 3.Perform basic keyword.
Pervasive Web Content Delivery with Efficient Data Reuse Chi-Hung Chi and Cao Yang School of Computing National University of Singapore
$200 $300 $400 $500 $100 $200 $300 $400 $500 $100 $200 $300 $400 $500 $100 $200 $300 $400 $500 $100 $200 $300 $500 $100 Category One Category Two Category.
What is the Internet? Internet: The Internet, in simplest terms, is the large group of millions of computers around the world that are all connected to.
CIS101 Introduction to Computing Week 05. Agenda Your questions Exam next week - Excel Introduction to the Internet & HTML Online HTML Resources Using.
World Wide Web1 Applications World Wide Web. 2 Introduction What is hypertext model? Use of hypertext in World Wide Web (WWW) – HTML. WWW client-server.
Providing Internet Search for Low-Connectivity Communities Bill Thies Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of.
Introduction to HTML 2006 CIS101. What is the Internet? Global network of computers that are connected and communicate via a series of Protocols Protocols.
Introduction to HTML 2006 INT197B. What is the Internet? Global network of computers that are connected and communicate via a series of Protocols Protocols.
Web Clipping Presentation By: Alex Jacobs, Philip Kim, Nathan Po Web Clipping.
Introduction Web Development II 5 th February. Introduction to Web Development Search engines Discussion boards, bulletin boards, other online collaboration.
The Internet 8th Edition Tutorial 1 Browser Basics.
UNIFORM RESOURCE LOCATOR (URL)
 Proxy Servers are software that act as intermediaries between client and servers on the Internet.  They help users on private networks get information.
S ELECTION OF WEB HOST AND WEB PAGE SYSTEM. W EB HOST stores all the pages of your website and makes them available to computers connected to the Internet.
CIS101 Introduction to Computing Week 06. Agenda Your questions Excel Exam during second hour Our status after the snow day Introduction to the Internet.
Chapter 10 Publishing and Maintaining Your Web Site.
Searching the World Wide Web in Low Connectivity Communities Libby Levison Massachusetts Institute of Technology.
Practical PC, 7 th Edition Chapter 9: Sending and Attachments.
INTERNET CHAPTER 12 Information Available The INTERNET contains a huge amount of information a huge amount of information information on any topic you.
Internet Standard Grade Computing. Internet a wide area network spanning the globe. consists of many smaller networks linked together. Service a way of.
Computer Concepts 2014 Chapter 7 The Web and .
UNIT 14 Lecturer: Ghadah Aldehim 1 Websites. Introduction 2.
 Internet vs WWW  Pages vs Sites  How the Internet Works  Getting a Web Presence.
CHAPTER 2 Communications, Networks, the Internet, and the World Wide Web.
Paul Mundy and Bob Huggan 1 Websites.
The Internet A Wide Area Network across the world The network of networks –Lots of smaller networks joined together.
Web Search Created by Ejaj Ahamed. What is web?  The World Wide Web began in 1989 at the CERN Particle Physics Lab in Switzerland. The Web did not gain.
Copyright © 2006 by The McGraw-Hill Companies, Inc. All rights reserved. McGraw-Hill Technology Education Copyright © 2006 by The McGraw-Hill Companies,
Copyright © Allyn & Bacon 2008 POWER PRACTICE Chapter 7 The Internet and the World Wide Web START This multimedia product and its contents are protected.
CSCE Chapter 5 (Links, Images, & Multimedia) CSCE General Applications Programming Benito Mendoza 1 By Benito Mendoza Department.
What is the Internet? Internet: The Internet, in simplest terms, is the large group of millions of computers around the world that are all connected to.
The INTERNET A worldwide network of computers linked together. Web Pages, , Instant Messaging, Xbox Live, iTunes downloads, etc. are all part of.
WHAT IS A WEBSITE AND HOW TO GET YOUR BUSINESS ONLINE Anna Gabali – 30/07/ MKLC.
CIS 250 Advanced Computer Applications Internet/WWW Review.
Chapter 9 Publishing and Maintaining Your Site. 2 Principles of Web Design Chapter 9 Objectives Understand the features of Internet Service Providers.
Networks.
Copyright © Terry Felke-Morris WEB DEVELOPMENT & DESIGN FOUNDATIONS WITH HTML5 7 TH EDITION Chapter 1 Key Concepts 1.
The Anatomy of a Large-Scale Hyper textual Web Search Engine S. Brin, L. Page Presenter :- Abhishek Taneja.
INTERNET. Objectives Explain the origin of the Internet and describe how the Internet works. Explain the difference between the World Wide Web and the.
World Wide Web “WWW”, "Web" or "W3". World Wide Web “WWW”, "Web" or "W3"
Empirical Quantification of Opportunities for Content Adaptation in Web Servers Michael Gopshtein and Dror Feitelson School of Engineering and Computer.
Chapter 29 World Wide Web & Browsing World Wide Web (WWW) is a distributed hypermedia (hypertext & graphics) on-line repository of information that users.
MODULE 3 Internet Basics © Paradigm Publishing, Inc.1.
THE INTERNET. TABLE OF CONTENT CONNECTING TO THE INTERNET ELECTRONIC MAIL WORLD WIDE WEB INTERNET SERVICES.
Web Design. What is the Internet? A worldwide collection of computer networks that links millions of computers by – Businesses (.com.net) – the government.
Web Design and Development. World Wide Web  World Wide Web (WWW or W3), collection of globally distributed text and multimedia documents and files 
The Internet. Important Terms Network Network Internet Internet WWW (World Wide Web) WWW (World Wide Web) Web page Web page Web site Web site Browser.
and Internet Explorer.  The transmission of messages and files via a computer network  Messages can consist of simple text or can contain attachments,
WebScan: Implementing QueryServer 2.0 Karl Geiger, Amgen Inc. BRS NA UG August 1999.
Web Design Terminology Unit 2 STEM. 1. Accessibility – a web page or site that address the users limitations or disabilities 2. Active server page (ASP)
Introduction. Internet Worldwide collection of computers and computer networks that link people to businesses, governmental agencies, educational institutions,
The Internet. The Internet and Systems that Use It Internet –A group of computer networks that encircle the entire globe –Began in 1969 Protocol –Language.
General Architecture of Retrieval Systems 1Adrienn Skrop.
Information Networks. Internet It is a global system of interconnected computer networks that link several billion devices worldwide. It is an international.
Introduction to Information Systems SSD1: Introduction to Information Systems Unit 1. The World Wide Web Unit 2. Introduction to Java and Object- Oriented.
Website Deployment Week 12. Software Engineering Practices Consider the generic process framework – Communication – Planning – Modeling – Construction.
Information Architecture
3.02H Publishing a Website 3.02 Develop webpages..
What is the Internet? © EIT, Author Gay Robertson, 2016.
Evolution of Internet.
Web Caching? Web Caching:.
What is the World Wide Web (www)
4.02 Develop web pages using various layouts and technologies.
INTELLIGENT BROWSERS Cenk Ursavas.
Presentation transcript:

TEK: Internet Search System for Low-Connectivity Communities Saman Amarasinghe Bill Thies Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

TEK Team Current –Saman Amarasinghe - Marjorie Cheng –Bill Thies Previous participants –Libby Levison (PI) –Samidh Chakrabarti –Tazeen Mahtab –Genevieve Cuevas –Saad Shakhshir –Janelle Prevost –Mark Halsey –Hongfei Tian –Damon Berry –Bihn Vo –Sheldon Chan –Sid Henderson –Alexandro Artola

Percentage of Country’s Population Online Internet Users Worldwide Copyright 2004, Matthew Zook. Data Source: ClickZ Stats.

Barriers to Internet Access Infrastructure –Limited phone lines –Low-bandwidth international links –Unreliable power supplies High costs –Computer unaffordable or unavailable –ISP, telephone costs can exceed local wage –Exacerbated by slow connections Social barriers –Illiterate or non-technical users –Lack of local content

Cost of Dial-up Internet Access as a Fraction of Household Income 240% 140% Sources: ISP Websites 2005, UNDP Development Report 2004, WorldBank 2003 Monthly Internet access: 30 hours –Or unlimited rate (if cheaper) Monthly household income derived from GNP per capita –Ignore richest 20% of population –Household size: 2.5

Cost of Dial-up Internet Access as a Fraction of Household Income 240% 140% Sources: ISP Websites 2005, UNDP Development Report 2004, WorldBank 2003 Monthly Internet access: 30 hours –Or unlimited rate (if cheaper) Monthly household income derived from GNP per capita –Ignore richest 20% of population –Household size: 2.5

Cost of Dial-up Internet Access as a Fraction of Household Income 240% 140% Sources: ISP Websites 2005, UNDP Development Report 2004, WorldBank 2003 Monthly Internet access: 30 hours –Or unlimited rate (if cheaper) Monthly household income derived from GNP per capita –Ignore richest 20% of population –Household size: 2.5

TEK: -Based Search Solution has two components: 1. Transfer all data through , not http - Connect only to send/receive , not to browse web 2. TEK Server optimizes for bandwidth requirements TEK: “Time Equals Knowledge” user ISP World Wide Web TEK Server

Outline TEK System Usage Scenarios Optimizing for Bandwidth

Outline TEK System Usage Scenarios Optimizing for Bandwidth

TEK Client Implemented as an HTTP Proxy Server bundled with a custom version of Firefox When offline, users can: –Search and browse old results as if connected –Enqueue queries for new results or missing pages When online, users can: –Send pending queries –Receive new results (attached to standard s)

TEK Server Queries Google for relevant pages Returns filtered content of ̃20 pages to user –Remove images –Remove junk HTML (JavaScript, colors, meta tags, etc.) Uses loband library for page simplification (loband.org) –Convert PDF, PS to HTML (uses pdftohtml) Maintain server image of client page cache –Avoid sending duplicate pages Compress pages, send as single attachment –Limit attachment size to 150K (or smaller, for some users)

TEK Rationale I: More Affordable accounts cheaper than web access Cost of Internet Cost of

TEK Rationale I: More Affordable accounts cheaper than web access –Some infrastructures support only

TEK Rationale I: More Affordable accounts cheaper than web access –Some infrastructures support only Can send/receive all queries at night Daytime Cost Nighttime Cost

TEK Rationale I: More Affordable accounts cheaper than web access –Some infrastructures support only Can send/receive all queries at night Connection time is shorter –Avoids reading pages online –Content direct from ISP, not distant server –Server compression shrinks results

TEK Rationale II: More Usable Viewing results offline: quick, reliable –Establish local database of shared information In school: time-share Internet line with voice –Reduced time online makes Internet viable Manageable amount of information

Outline TEK System Usage Scenarios Optimizing for Bandwidth

Deployment Status Released summer 2002 See tek.sourceforge.net

People’s First Network Solomon Islands served by HF Radio Network only Source:

People’s First Network Many applications reported 1. Farmers – information on diseases; networking 2. Teachers – environmental impact of local logging 3. Pastors – downloading sermons 4. Entrepreneurs – download / sell lyrics 5. General – health, education, sports, entertainment Subsistence farmers on Rennell have obtained advice concerning taro diseases affecting their crop. Via the 'TEK-websearch' facility, one group of farmers was able to access detailed technical information about vanilla farming and to communicate with a specialist from the Kastom Gaden Association. -- Chand et al., PFNet Case Study, 2005

First Mile Solutions Source: Store-and-forward connectivity via Mobile Access Point –Cambodia, Rwanda, Costa Rica, India TEK provides only Web access

Outline TEK System Usage Scenarios Optimizing for Bandwidth

Low-Bandwidth Search is Different Real-time Search -based Search Acceptable Latency 1-2 secondsminutes/hours Optimization Metric relevance per pagerelevance per byte Search Processtrial-and-errorcareful User Identityunknown address

State-Based Compression Cheaper to store information than re-download it –100 GB disk drive: $250 –100 GB at 56kbs, $1/hr:$4000 TEK Server If server knows everything stored on client, can it improve compression of search results?

State-Based Compression General problem: If two parties share a large dictionary, can they reduce communication bandwidth? In general: no –info content (index) = info content (entry) In practice: maybe –Space of inputs is not uniformly populated E.g., many images are text, bullets, smileys, patterns –Lossy: send index of closest match in dictionary –Lossless: send exact diff from dictionary entry

Photo Mosaics Mosaic: picture made of other pictures 1. Break image into cells 2. Match each cell against image library Use wavelet decomposition for perceptual match

Mosaic Compression (Samidh Chakrabarti 2002) Idea: server constructs mosaic from client images –Send pointers to image components, not image data Image size (bits): #cells * log 2 (library_size) –Gzip offers further savings Possible image libraries –Images previously downloaded by client –Pre-defined library

Experiments Setup –4096 images from Wikipedia –Cell size: 12x12 pixels –PhotoMosaic software (BlackDog, shareware) Touch-up features disabled Processing time –~20 minutes to analyze library –~1 minute to build mosaic

Wikipedia JPEG: 46 Kb Mosaic: 2.0 Kb (22X smaller)

0-Quality JPEG: 27 Kb Mosaic: 2.0 Kb (13X smaller)

30X Smaller JPEG: 2.0 Kb Mosaic: 2.0 Kb

Small JPEG, Zoomed: 2.0 Kb Mosaic: 2.0 Kb

21X Smaller GIF: 2.0 Kb Mosaic: 2.0 Kb

Small GIF, Zoomed: 2.0 Kb Mosaic: 2.0 Kb

Compressing Landscapes JPEG Image: 52 Kb Mosaic: 1.6 Kb (33X smaller)

Compressing Landscapes 23X Smaller GIF: 1.6 Kb Mosaic: 1.6 Kb

Importance of Small Images Most bandwidth spent on small images! File Size (Kb) 100% 80% 60% 40% 20% 0% Fraction of Total Bandwidth Source: Chakrabarti’ ,684 images from sites in Google programming contest - 5,540 images from 1,000 most popular sites (ZDNet) 60% of bandwidth on images < 10Kb  Text recognition?  Icon substitution?

Many avenues for improvement –What is the best image library? –Impact of smoothing, rotation, diffs? –Edge detection + texture mapping Lossy compression of edges Random noise for realism In current form, perhaps useful as a preview –5-33X smaller than JPEG –More entertaining than ALT tag or blurry picture What’s the Verdict?

2. Breaking the URL Abstraction Entire webpage is unlikely to be useful Alternate abstractions for search engines: –Document sections ( ) –Paragraphs –Tables –PDF Bookmarks If low bandwidth, Extract relevant content and return to user If high bandwidth, Jump* to relevant portion * may require cached version or HTML / browser extensions

3. Client-Specific Pagerank (ala Google Personalized) Ambiguous searches have clusters of results –“Mercury” – element, planet, car, or Roman God? –High-bandwidth users do iterative searches –Low-bandwidth users can’t afford many iterations And often lack skills to eliminate spurious hits Idea: select pages based on client profile –Geography, demographics, previous searches –“Java history” from Indonesia  history of island –“GDP” after biology queries  guanine diphosphate Pagerank: boost links from user’s demographic

Future Work: Smart Query Builder Spelling error is costly for -based search Client interface should: –Check spelling –Anticipate number of results –Identify ambiguous queries New opportunity for advanced query building –E.g., users willing to categorize searches New opportunity for evaluating search results –Users willing to provide careful feedback –Research vehicle for IR and UI testing

Conclusion High demand for low-bandwidth search –Today: emerging Internet users worldwide, PDAs –Future: pervasive computing, space exploration Much room for technical innovation Prototype systems have proven useful –TEK, Web, www4mail, loband –Robust, visible service could have large impact