INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

Slides:



Advertisements
Similar presentations
The Internet and the Web
Advertisements

1 Web Search Introduction. 2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet.
1 Web Search Introduction. 2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet.
Searching and Researching the World Wide: Emphasis on Christian Websites Developed from the book: Searching and Researching on the Internet and World Wide.
Internet Research Search Engines & Subject Directories.
What Is A Web Page? An Introduction to the Internet.
1 Internet History Internet made up of thousands of networks worldwide No one in charge of Internet - No governing body Internet backbone owned by private.
1 Web Search Introduction. 2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet.
1 Accessing the Global Database The World Wide Web.
Web Search Created by Ejaj Ahamed. What is web?  The World Wide Web began in 1989 at the CERN Particle Physics Lab in Switzerland. The Web did not gain.
Introducing the Internet Source: Learning to Use the Internet.
1 Web Search Introduction. 2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet.
The Internet : Exploration, Evaluation, and Elaboration presented by Kathy Schrock.
Overview What is a Web search engine History Popular Web search engines How Web search engines work Problems.
1/28: The Internet & Website Design What is the Internet? –Parts of the Internet –Internet & WWW basics –Searching the WWW Website design considerations.
Information Retrieval and Web Search Web search. Spidering Instructor: Rada Mihalcea Class web page: (some of these.
Retrieving Information on the Web Presented by Md. Zaheed Iftekhar Course : Information Retrieval (IFT6255) Professor : Jian E. Nie DIRO, University of.
Web Based Systems for Engineering and Management Professors Iris D. Tommelein and Arpad Horvath Fall 2000.
Internet and WWW. Internet Network linking computers to other computers Access to numerous resources – Communications systems Instant messaging.
World Wide Web Guide * for Students to the Internet.
The World Wide Web: Information Resource. How a Search Engine works… How Search Works - YouTube
Search Engines Information Technology and Social Life March 2, 2005.
8/31: Ch. 1 The Internet & WWW What is the Internet? What is the WWW? –Browser basics What is a search engine? What search engines are used today? Images.
1 Web Search Introduction. 2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet.
Web Design Terminology Unit 2 STEM. 1. Accessibility – a web page or site that address the users limitations or disabilities 2. Active server page (ASP)
1 Web Search Introduction. 2 The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet.
SEARCH ENGINE by: by: B.Anudeep B.Anudeep Y5CS016 Y5CS016.
The World Wide Web.
Marking the Most of the Web’s Resources
SEARCH ENGINES & WEB CRAWLER Akshay Ghadge Roll No: 107.
McGraw-Hill Technology Education
Internet.
A Brief History of the Internet
What is Internet Internet is a network of networks, linking computers to computers. Each runs software to provide or “serve” information and/or to access.
Search Engines & Subject Directories
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Computer Networks and Internet
The Internet and the World Wide Web
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Web Search Introduction.
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Search Engines & Subject Directories
Search Engines & Subject Directories
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
All About the Internet.
Web Search Introduction.
Welcome to Cyberspace The Internet - World Wide Web
Educational Computing
Web Search by Ray Mooney
Web Searching Everything, now..
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Web Search Basics Berlin Chen
Web Search Introduction.
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Internet and the world wide web (www)
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Information Retrieval and Web Search
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
Presentation transcript:

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID Lecture # 32 Web Search

ACKNOWLEDGEMENTS The presentation of this lecture has been taken from the underline sources “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze “Managing gigabytes” by Ian H. Witten, ‎Alistair Moffat, ‎Timothy C. Bell “Modern information retrieval” by Baeza-Yates Ricardo, ‎  “Web Information Retrieval” by Stefano Ceri, ‎Alessandro Bozzon, ‎Marco Brambilla

Outline The World Wide Web Web Pre-History Web Search History Web Challenges for IR Graph Structure in the Web (Bowtie) Zipf’s Law on the Web Manual Hierarchical Web Taxonomies Business Models for Web Search

The World Wide Web Developed by Tim Berners-Lee in 1990 at CERN to organize research documents available on the Internet. Combined idea of documents available by FTP with the idea of hypertext to link documents. Developed initial HTTP network protocol, URLs, HTML, and first “web server.” 00:01:28  00:02:15 (developed) 00:02:30  00:02:50 (combined) 00:03:00  00:03:30 (developed initial)

Web Pre-History Ted Nelson developed idea of hypertext in 1965. Doug Engelbart invented the mouse and built the first implementation of hypertext in the late 1960’s at SRI. ARPANET was developed in the early 1970’s. The basic technology was in place in the 1970’s; but it took the PC revolution and widespread networking to inspire the web and make it practical. 00:04:20  00:04:50 (bullets) 00:05:15  00:05:40 (bullets)

Web Browser History Early browsers were developed in 1992 (Erwise, ViolaWWW). In 1993, Marc Andreessen and Eric Bina at UIUC NCSA developed the Mosaic browser and distributed it widely. Andreessen joined with James Clark (Stanford Prof. and Silicon Graphics founder) to form Mosaic Communications Inc. in 1994 (which became Netscape to avoid conflict with UIUC). Microsoft licensed the original Mosaic from UIUC and used it to build Internet Explorer in 1995. 00:06:25  00:07:45 (bullets)

Search Engine Early History By late 1980’s many files were available by anonymous FTP. In 1990, Alan Emtage of McGill Univ. developed Archie (short for “archives”) Assembled lists of files available on many FTP servers. Allowed regex search of these file names. In 1993, Veronica and Jughead were developed to search names of text files available through Gopher servers. 00:08:20  00:09:35 (by late & in 1990) 00:09:50  00:10:10 (in 1993)

Web Search History In 1993, early web robots (spiders) were built to collect URL’s: Wanderer ALIWEB (Archie-Like Index of the WEB) WWW Worm (indexed URL’s and titles for regex search) In 1994, Stanford grad students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo. 00:10:35  00:11:50 (in 1993) 00:11:57  00:12:18 (in 1994)

Web Search History (cont) In early 1994, Brian Pinkerton developed WebCrawler as a class project at U Wash. (eventually became part of Excite and AOL). A few months later, Fuzzy Maudlin, a grad student at CMU developed Lycos. First to use a standard IR system as developed for the DARPA Tipster project. First to index a large set of pages. In late 1995, DEC developed Altavista. Used a large farm of Alpha machines to quickly process large numbers of queries. Supported boolean operators, phrases, and “reverse pointer” queries. 00:12:24  00:12:40 (in early) 00:12:45  00:13:34 (a few & in late)

Web Search Recent History In 1998, Larry Page and Sergey Brin, Ph.D. students at Stanford, started Google. Main advance is use of link analysis to rank results partially based on authority. 00:13:55  00:14:20

Web Challenges for IR Distributed Data: Documents spread over millions of different web servers. Volatile Data: Many documents change or disappear rapidly (e.g. dead links). Large Volume: Billions of separate documents. Unstructured and Redundant Data: No uniform structure, HTML errors, up to 30% (near) duplicate documents. Quality of Data: No editorial control, false information, poor quality writing, typos, etc. Heterogeneous Data: Multiple media types (images, video, VRML), languages, character sets, etc. 00:15:36  00:16:00 (distributed) 00:16:50  00:17:05 (volatile) 00:20:52  00:21:00 (large volume) 00:23:15  00:23:44 (unstructured and redundant) 00:29:25  00:29:35 (quality) 00:32:40  00:32:55 (Heterogeneous)

Graph Structure in the Web (Bowtie) {Graphics  (Animation required)} 00:37:20  00:40:20 http://www9.org/w9cdrom/160/160.html

Zipf’s Law on the Web Number of in-links/out-links to/from a page has a Zipfian distribution. Length of web pages has a Zipfian distribution. Number of hits to a web page has a Zipfian distribution. 00:42:20  00:43:05 (Bullets)

Manual Hierarchical Web Taxonomies Yahoo approach of using human editors to assemble a large hierarchically structured directory of web pages. http://www.yahoo.com/ Open Directory Project is a similar approach based on the distributed labor of volunteer editors (“net-citizens provide the collective brain”). Used by most other search engines. Started by Netscape. http://www.dmoz.org/ 00:45:08  00:45:35 (yahoo) 00:45:48  00:46:00 (open directory)

Business Models for Web Search Advertisers pay for banner ads on the site that do not depend on a user’s query. Goto/Overture CPM: Cost Per Mille (thousand impressions). Pay for each ad display. CPC: Cost Per Click. Pay only when user clicks on ad. CTR: Click Through Rate. Fraction of ad impressions that result in clicks throughs. CPC = CPM / (CTR * 1000) CPA: Cost Per Action (Acquisition). Pay only when user actually makes a purchase on target site. Advertisers bid for “keywords”. Ads for highest bidders displayed when user query contains a purchased keyword. PPC: Pay Per Click. CPC for bid word ads (e.g. Google AdWords). 00:47:20  00:47:35 (advertisers) 00:47:40  00:48:50 (cpm) 00:49:05  00:49:37 (cpc) 00:51:00  00:51:10 (ctr) 00:51:25  00:51:45 (cpa) 00:52:10  00:52:35 (advertiser bid)