Web Search by Ray Mooney

Presentation transcript:

Web Search by Ray Mooney Introduction

The World Wide Web
Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet. Combined the idea of documents available by FTP with the idea of hypertext to link documents. Developed the initial HTTP network protocol, URLs, HTML, and the first “web server.”

Web Pre-History
Ted Nelson developed the idea of hypertext in 1965. Doug Engelbart invented the mouse and built the first implementation of hypertext in the late 1960’s at SRI. ARPANET was developed in the early 1970’s. The basic technology was in place in the 1970’s, but it took the PC revolution and widespread networking to inspire the web and make it practical.

Web Browser History
Early browsers were developed in 1992 (Erwise, ViolaWWW). In 1993, Marc Andreessen and Eric Bina at UIUC NCSA developed the Mosaic browser and distributed it widely. Andreessen joined with James Clark (Stanford Prof. and Silicon Graphics founder) to form Mosaic Communications Inc. in 1994 (which became Netscape to avoid conflict with UIUC). Microsoft licensed the original Mosaic from UIUC and used it to build Internet Explorer in 1995.

Search Engine Early History
By the late 1980’s many files were available by anonymous FTP. In 1990, Alan Emtage of McGill Univ. developed Archie (short for “archives”), which assembled lists of files available on many FTP servers and allowed regex search of these file names. In 1993, Veronica and Jughead were developed to search names of text files available through Gopher servers.

Web Search History
In 1993, early web robots (spiders) were built to collect URL’s:
  Wanderer
  ALIWEB (Archie-Like Index of the WEB)
  WWW Worm (indexed URL’s and titles for regex search)
In 1994, Stanford grad students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.

Web Search History (cont)
In early 1994, Brian Pinkerton developed WebCrawler as a class project at U Wash. (it eventually became part of Excite and AOL). A few months later, Fuzzy Mauldin, a grad student at CMU, developed Lycos. It was the first to use a standard IR system as developed for the DARPA Tipster project, and the first to index a large set of pages. In late 1995, DEC developed AltaVista. It used a large farm of Alpha machines to quickly process large numbers of queries, and supported boolean operators, phrases, and “reverse pointer” queries.

Web Search Recent History
In 1998, Larry Page and Sergey Brin, Ph.D. students at Stanford, started Google. The main advance is the use of link analysis to rank results partially based on authority.
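
The slide does not say which link-analysis method is used. As one illustration of ranking by authority, here is a minimal sketch of a PageRank-style power iteration. The function name, the damping value of 0.85, the iteration count, and the toy link graph are all assumptions made for this example, not details from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page gets a small baseline score ("random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its score evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page (no out-links): spread its score over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: A, B and D all link to C; C links back to A.
toy_web = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(toy_web)
most_authoritative = max(scores, key=scores.get)
```

In this toy graph the page with the most in-links ends up with the highest score, which is the intuition behind ranking "partially based on authority".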

Web Challenges for IR
Distributed Data: Documents spread over millions of different web servers.
Volatile Data: Many documents change or disappear rapidly (e.g. dead links).
Large Volume: Billions of separate documents.
Unstructured and Redundant Data: No uniform structure, HTML errors, up to 30% (near) duplicate documents.
Quality of Data: No editorial control, false information, poor quality writing, typos, etc.
Heterogeneous Data: Multiple media types (images, video, VRML), languages, character sets, etc.
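
The slides do not say how near-duplicate documents are detected. One common approach, sketched here under that assumption, compares the sets of overlapping word k-grams ("shingles") of two documents using Jaccard similarity. The function names, the shingle size of 3, and the sample sentences are illustrative only.

```python
def shingles(text, k=3):
    """Return the set of overlapping k-word sequences in text."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over the sleepy dog"
doc3 = "web search engines index billions of pages"

near_dup_similarity = jaccard(shingles(doc1), shingles(doc2))
unrelated_similarity = jaccard(shingles(doc1), shingles(doc3))
```

A pair would be flagged as near duplicates when the similarity exceeds some threshold. The threshold is a tuning choice that the slides do not specify.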

Number of Web Servers

Number of Web Pages

Searches per Day
Info missing for fast.com, Excite, Northernlight, etc.

Number of Web Pages Indexed
Source: SearchEngineWatch, Aug. 15, 2001.
Assuming about 20KB per page, 1 billion pages is about 20 terabytes of data.
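
The storage estimate above can be checked with a few lines of arithmetic. This sketch assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes) and also shows the binary-unit figure for comparison. The variable names are illustrative.

```python
PAGES = 1_000_000_000          # 1 billion pages
KB_PER_PAGE = 20               # slide's assumed average page size

# Decimal units: 1 KB = 1000 bytes, 1 TB = 1000**4 bytes.
total_bytes_decimal = PAGES * KB_PER_PAGE * 1000
terabytes = total_bytes_decimal / 1000**4

# Binary units: 1 KiB = 1024 bytes, 1 TiB = 1024**4 bytes.
total_bytes_binary = PAGES * KB_PER_PAGE * 1024
tebibytes = total_bytes_binary / 1024**4

print(f"{terabytes:.1f} TB (decimal), {tebibytes:.2f} TiB (binary)")
```

Either way the estimate is on the order of 20 terabytes, which matches the slide.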

Growth of Web Pages Indexed
Source: SearchEngineWatch, Aug. 15, 2001.
Google lists current number of pages searched.

Graph Structure in the Web
http://www9.org/w9cdrom/160/160.html

Zipf’s Law on the Web
The number of in-links/out-links to/from a page has a Zipfian distribution.
The length of web pages has a Zipfian distribution.
The number of hits to a web page has a Zipfian distribution.
http://www.useit.com/alertbox/zipf.html
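
As a reminder of what "Zipfian" means, the sketch below generates idealized counts where the item at rank r gets a share proportional to 1/r, then checks that the counts fall on a straight line of slope about -1 on a log-log plot of count against rank. The function names and the synthetic data are assumptions for illustration. Real web data would only approximate this.

```python
import math


def zipf_frequencies(n_items, total, exponent=1.0):
    """Idealized Zipfian counts: the item at rank r gets a share ~ 1 / r**exponent."""
    weights = [1.0 / rank ** exponent for rank in range(1, n_items + 1)]
    norm = sum(weights)
    return [total * w / norm for w in weights]


def loglog_slope(values):
    """Least-squares slope of log(value) against log(rank)."""
    ordered = sorted(values, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(v) for v in ordered]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical example: 1,000,000 hits spread over 1,000 pages.
hits = zipf_frequencies(1000, 1_000_000)
slope = loglog_slope(hits)  # close to -1 for an ideal Zipfian distribution
```

Applied to measured in-link counts, page lengths, or hit counts, a slope near a constant negative value over a wide range of ranks is the usual informal sign of a Zipfian distribution.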

Manual Hierarchical Web Taxonomies
The Yahoo approach uses human editors to assemble a large hierarchically structured directory of web pages. http://www.yahoo.com/
The Open Directory Project is a similar approach based on the distributed labor of volunteer editors (“net-citizens provide the collective brain”). Used by most other search engines. Started by Netscape. http://www.dmoz.org/

Automatic Document Classification
Manual classification into a given hierarchy is labor intensive, subjective, and error-prone. Text categorization methods provide a way to automatically classify documents. The best methods are based on training a machine learning (pattern recognition) system on a labeled set of examples (supervised learning). Text categorization is a topic we will discuss later in the course.
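
The slide does not name a specific learning method. As one example of supervised text categorization, here is a minimal multinomial Naive Bayes sketch with add-one smoothing. The function names, the category labels, and the four training sentences are made up for illustration.

```python
import math
from collections import Counter, defaultdict


def train(examples):
    """examples is a list of (text, label) pairs; returns the counts needed to classify."""
    class_docs = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        class_docs[label] += 1
        for word in text.lower().split():
            word_counts[label][word] += 1
            vocab.add(word)
    return class_docs, word_counts, vocab


def classify(model, text):
    """Return the label with the highest log-probability under Naive Bayes."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in class_docs:
        score = math.log(class_docs[label] / total_docs)  # class prior
        total_words = sum(word_counts[label].values())
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are ignored
                # Add-one (Laplace) smoothing avoids zero probabilities.
                count = word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical labeled training set.
training = [
    ("cheap flights hotel deals", "travel"),
    ("hotel booking beach resort", "travel"),
    ("python compiler software bug", "computers"),
    ("software update laptop driver", "computers"),
]
model = train(training)
label_a = classify(model, "beach hotel")
label_b = classify(model, "laptop software bug")
```

A real system would train on far more labeled pages and typically use richer features, but the train-then-classify shape is the same.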

Automatic Document Hierarchies
Manual hierarchy development is labor intensive, subjective, and error-prone. It would be nice to automatically construct a meaningful hierarchical taxonomy from a corpus of documents. This is possible with hierarchical text clustering (unsupervised learning), for example Hierarchical Agglomerative Clustering (HAC). Text clustering is another topic we will discuss later in the course.
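
To make the idea of HAC concrete, the sketch below starts with each document in its own cluster and repeatedly merges the two most similar clusters until one tree remains. Representing a cluster by its summed word counts and comparing clusters with cosine similarity is an assumption made for this example. The slides do not specify a similarity measure or linkage rule, and the sample documents are invented.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hac(docs):
    """Cluster a non-empty list of documents; returns a nested tuple of document indexes."""
    # Each cluster is (tree, summed word counts of its documents).
    clusters = [(i, Counter(d.lower().split())) for i, d in enumerate(docs)]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(clusters[i][1], clusters[j][1])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]), clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]


# Hypothetical mini-corpus with two obvious topics.
documents = [
    "web search engine ranking",
    "search engine ranking links",
    "cat dog pet food",
    "dog pet food toys",
]
tree = hac(documents)
```

The nested tuple is the automatically constructed hierarchy: the two search documents are grouped together, the two pet documents are grouped together, and the two groups join at the root. This naive version compares every pair of clusters at every step, so it is only suitable for small collections.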

Web Search Using IR
[Diagram: a spider crawls the Web to build a document corpus; the IR system takes a query string and the corpus and returns ranked documents: 1. Page1, 2. Page2, 3. Page3, ...]
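
The pipeline in this slide (spider, document corpus, IR system, ranked documents) can be sketched end to end on a toy in-memory "web". Everything here is hypothetical: the `TOY_WEB` dictionary stands in for real HTTP fetching, and the simple TF-IDF scoring stands in for whatever ranking a real IR system would use.

```python
import math
from collections import Counter, deque

# Hypothetical web: url -> (page text, outgoing links).
TOY_WEB = {
    "page1": ("web search engines rank pages", ["page2", "page3"]),
    "page2": ("search engines use spiders to crawl the web", ["page1"]),
    "page3": ("cooking recipes for pasta", []),
}


def spider(start, web):
    """Breadth-first crawl from start; returns the corpus as {url: text}."""
    corpus, queue, seen = {}, deque([start]), {start}
    while queue:
        url = queue.popleft()
        text, links = web[url]
        corpus[url] = text
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return corpus


def rank(query, corpus):
    """Return URLs ordered by summed TF-IDF weight of the query terms."""
    docs = {url: Counter(text.lower().split()) for url, text in corpus.items()}
    n_docs = len(docs)
    scores = {}
    for url, term_counts in docs.items():
        score = 0.0
        for term in query.lower().split():
            doc_freq = sum(1 for d in docs.values() if term in d)
            if doc_freq:
                score += term_counts[term] * math.log(1 + n_docs / doc_freq)
        if score > 0:
            scores[url] = score
    return sorted(scores, key=scores.get, reverse=True)


corpus = spider("page1", TOY_WEB)
results = rank("web search spiders", corpus)
```

Here `results` is the ranked document list from the diagram. Pages that contain none of the query terms are left out entirely.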

How do Web Search Engines Differ?
Different kinds of information
Unedited – anyone can enter
Quality issues
Spam
Varied information types
Phone book, brochures, catalogs, dissertations, news reports, weather, all in one place!
Sources are not differentiated
Search over medical text the same as over product catalogs

What Do People Search for on the Web? (from Spink et al. 98 study)
Topics:
  Genealogy/Public Figure: 12%
  Computer related: 12%
  Business: 12%
  Entertainment: 8%
  Medical: 8%
  Politics & Government: 7%
  News: 7%
  Hobbies: 6%
  General info/surfing: 6%
  Science: 6%
  Travel: 5%
  Arts/education/shopping/images: 14%

Web Search Queries
Web search queries are SHORT
  ~2.4 words on average (Aug 2000)
  Has increased, was 1.7 (~1997)
User Expectations
  Many say “the first item shown should be what I want to see”!
  This works if the user has the most popular/common notion in mind

What about Ranking?
Lots of variation here
  Pretty messy in many cases
  Details usually proprietary and fluctuating
Combining subsets of:
  Term frequencies
  Term proximities
  Term position (title, top of page, etc)
  Term characteristics (boldface, capitalized, etc)
  Link analysis information
  Category information
  Popularity information
Most use a variant of vector space ranking to combine these
Here’s how it might work:
  Make a vector of weights for each feature
  Multiply this by the counts for each feature
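
The "here's how it might work" recipe amounts to a dot product between a weight vector and a per-page feature-count vector. The sketch below shows that computation. The feature names, the weight values, and the two example pages are invented for illustration, since real engines keep these details proprietary.

```python
# Hypothetical weights, one per ranking feature.
WEIGHTS = {
    "term_frequency": 1.0,
    "title_match": 3.0,
    "bold_match": 0.5,
    "in_links": 0.2,
}


def score(features, weights=WEIGHTS):
    """Dot product of the weight vector with a page's feature counts."""
    return sum(weights[name] * features.get(name, 0) for name in weights)


# Hypothetical feature counts for two pages matching the same query.
pages = {
    "page1": {"term_frequency": 4, "title_match": 1, "bold_match": 0, "in_links": 10},
    "page2": {"term_frequency": 7, "title_match": 0, "bold_match": 2, "in_links": 3},
}

ranked = sorted(pages, key=lambda url: score(pages[url]), reverse=True)
```

With these made-up weights, a title match and more in-links outweigh a higher raw term frequency, so `page1` ranks first. Changing the weights changes the ordering, which is one reason ranking details fluctuate between engines and over time.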

Summary
Web Search
  Directories vs. search engines
  How web search differs from other search
    Type of data searched over
    Type of searches done
    Type of searchers doing search
  Web queries are short
    This probably means people are often using search engines to find starting points
    Once at a useful site, they must follow links or use site search
  Web search ranking combines many features