Retrieving Information on the Web Presented by Md. Zaheed Iftekhar Course : Information Retrieval (IFT6255) Professor : Jian E. Nie DIRO, University of.

Presentation transcript:

Retrieving Information on the Web
Presented by Md. Zaheed Iftekhar
Course: Information Retrieval (IFT6255)
Professor: Jian E. Nie
DIRO, University of Montreal
April 9th, 2003

April 9, 2003 Presented by: Md. Zaheed Iftekhar
Overview
Web search: general description
– Introduction of web, search engines
– Definitions
– Major search engines
– Current technologies
The future
– Where is the technology heading
– Proposal for further improvement
Conclusion
References

History of the Web
In 1990 the World Wide Web (WWW) was developed by Tim Berners-Lee at CERN to organize research documents available on the Internet.
Combined the idea of documents available by FTP with the idea of hypertext to link documents.
Developed the initial HTTP network protocol, URLs, HTML, and the first “web server.”

World Wide Web
Ted Nelson developed the idea of hypertext.
Doug Engelbart invented the mouse and built the first implementation of hypertext in the late 1960’s at SRI.
ARPANET was developed in the early 1970’s.
The basic technology was in place in the 1970’s, but it took the PC revolution and widespread networking to inspire the web and make it practical.

Web Browser
Early browsers were developed in 1992 (Erwise, ViolaWWW).
In 1993, Marc Andreessen and Eric Bina at UIUC NCSA developed Mosaic.
Andreessen joined with James Clark (Stanford Prof. and Silicon Graphics founder) to form Mosaic Communications Inc. in 1994 (which became Netscape to avoid conflict with UIUC).
Microsoft licensed the original Mosaic from UIUC and used it to build Internet Explorer in 1995.

Web Search
By the late 1980’s many files were available by anonymous FTP.
In 1990, Alan Emtage of McGill Univ. developed Archie (short for “archives”).
– Assembled lists of files available on many FTP servers.
– Allowed regex search of these file names.
In 1993, Veronica and Jughead were developed to search names of text files available through Gopher servers.

Web Search
In 1993, early web robots (spiders) were built to collect URL’s:
– Wanderer
– ALIWEB (Archie-Like Index of the WEB)
– WWW Worm (indexed URL’s and titles for regex search)
In 1994, Stanford grad students David Filo and Jerry Yang started manually collecting popular web sites into a topical hierarchy called Yahoo.

Web Search
In early 1994, Brian Pinkerton developed WebCrawler as a class project at U Wash. (it became part of Excite and AOL).
The same year, Fuzzy Mauldin, a grad student at CMU, developed Lycos.
– First to use a standard IR system.
– First to index a large set of pages.
In late 1995, DEC developed AltaVista. It supported boolean operators, phrases, and “reverse pointer” queries.
In 1998, Larry Page and Sergey Brin, Ph.D. students at Stanford, started Google.

Spiders (Robots/Bots/Crawlers)
Start with a comprehensive set of root URL’s from which to start the search.
Follow all links on these pages recursively to find additional pages.
Index all newly found pages in an inverted index as they are encountered.
May allow users to directly submit pages to be indexed (and crawled from).
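
To make "index pages in an inverted index as they are encountered" concrete, here is a minimal sketch in Python. The names `tokenize`, `add_page` and `search_all`, the example URLs and the whitespace-free tokenizer are all assumptions made for illustration; a real engine would also store positions, weights and much more.

```python
from collections import defaultdict
import re

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def add_page(index, url, text):
    """Record url under every distinct term that appears in text."""
    for term in set(tokenize(text)):
        index[term].add(url)

def search_all(index, query):
    """Return the URLs that contain every query term (Boolean AND)."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical pages, indexed in the order a spider might find them.
index = defaultdict(set)
add_page(index, "http://example.org/a", "Web search engines crawl the web")
add_page(index, "http://example.org/b", "Search trees and hash tables")
print(sorted(search_all(index, "web search")))
```

The index maps each term to the set of pages containing it, so a query only touches the posting sets of its own terms rather than every stored page.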

Breadth-first Search

Depth-first Search

Search Strategy Trade-Offs
Breadth-first explores uniformly outward from the root page but requires memory of all nodes on the previous level (exponential in depth). It is the standard spidering method.
Depth-first requires memory of only depth times branching-factor (linear in depth) but gets “lost” pursuing a single thread.
Both strategies are implementable using a queue of links (URL’s).
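
The memory trade-off can be checked with a small experiment. The sketch below walks an idealized complete tree (every page has the same number of links, no page is reached twice) and records the largest number of pending links. The function `max_frontier` and the tree model are assumptions for illustration only; the real web is a graph with varying out-degree.

```python
from collections import deque

def max_frontier(depth, branching, breadth_first):
    """Walk a complete tree and return the largest frontier size seen.

    Nodes are represented only by their depth; the root is at depth 0.
    """
    frontier = deque([0])
    largest = 1
    while frontier:
        # FIFO (popleft) gives breadth-first; LIFO (pop) gives depth-first.
        level = frontier.popleft() if breadth_first else frontier.pop()
        if level < depth:
            frontier.extend([level + 1] * branching)
        largest = max(largest, len(frontier))
    return largest

for d in (2, 4, 6):
    print(d, max_frontier(d, 3, True), max_frontier(d, 3, False))
```

With a branching factor of 3, the breadth-first frontier peaks at 9, 81 and 729 entries for depths 2, 4 and 6, while the depth-first frontier peaks at 5, 9 and 13, which matches the exponential versus linear growth described above.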

Avoiding Page Duplication
Must detect when revisiting a page that has already been spidered (the web is a graph, not a tree).
Must efficiently index visited pages to allow a rapid recognition test.
– Tree indexing (e.g. trie)
– Hashtable
Index page using URL as a key.
– Must canonicalize URL’s (e.g. delete ending “/”)
– Does not detect duplicated or mirrored pages.
Index page using textual content as a key.
– Requires first downloading the page.
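
One possible way to use the URL as a hashtable key is sketched below. The normalization rules in `canonicalize` (lowercase scheme and host, drop default ports, strip a trailing "/", drop the fragment) and the helper `first_visit` are assumptions chosen for illustration. Stripping the trailing "/" follows the slide's example, although some servers treat the two forms as different resources.

```python
from urllib.parse import urlsplit, urlunsplit

def canonicalize(url):
    """Return a normalized form of url for use as a 'visited' key."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # hostname is already lowercased; any user:password part is dropped here.
    host = (parts.hostname or "").lower()
    # Keep the port only when it is not the default for the scheme.
    default_ports = {"http": 80, "https": 443}
    if parts.port and parts.port != default_ports.get(scheme):
        host = f"{host}:{parts.port}"
    # Drop a trailing "/" (the slide's example); the root path becomes "".
    path = parts.path.rstrip("/")
    # Drop the fragment: it names a position inside the same page.
    return urlunsplit((scheme, host, path, parts.query, ""))

def first_visit(visited, url):
    """Return True and remember url if its canonical form is new."""
    key = canonicalize(url)
    if key in visited:
        return False
    visited.add(key)
    return True

visited = set()  # a Python set is a hashtable keyed on the canonical URL
print(first_visit(visited, "http://example.org/a/"))
print(first_visit(visited, "HTTP://EXAMPLE.org:80/a#top"))
```

As the slide notes, this only catches different spellings of the same URL. Mirrored pages at different addresses would need a key derived from the page content, which requires downloading the page first.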

Spidering Algorithm
Initialize queue (Q) with initial set of known URL’s.
Until Q empty or page or time limit exhausted:
    Pop URL, L, from front of Q.
    If L is not to an HTML page (.gif, .jpeg, .ps, .pdf, .ppt, …) continue loop.
    If already visited L, continue loop.
    Download page, P, for L.
    If cannot download P (e.g. 404 error, robot excluded) continue loop.
    Index P (e.g. add to inverted index or store cached copy).
    Parse P to obtain list of new links N.
    Append N to the end of Q.
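
The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the presenter's implementation: `spider`, `LinkParser`, `page_limit`, and the in-memory `FAKE_WEB` dictionary (which stands in for real downloads so the example runs without network access) are all hypothetical names, and "indexing" is reduced to storing a cached copy.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

SKIP_EXTENSIONS = (".gif", ".jpeg", ".ps", ".pdf", ".ppt")

class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def spider(seeds, fetch, page_limit=100):
    """Crawl breadth-first from seeds; fetch(url) returns HTML or None."""
    queue = deque(seeds)
    visited = set()
    pages = {}                      # url -> cached copy (stands in for the index)
    while queue and len(pages) < page_limit:
        url = queue.popleft()       # pop from the front of Q
        if url.lower().endswith(SKIP_EXTENSIONS):
            continue                # not an HTML page
        if url in visited:
            continue                # already visited
        visited.add(url)
        html = fetch(url)
        if html is None:            # e.g. 404 error or robot excluded
            continue
        pages[url] = html           # "index" the page
        parser = LinkParser()
        parser.feed(html)
        # Resolve relative links against the page URL, append to the end of Q.
        queue.extend(urljoin(url, link) for link in parser.links)
    return pages

# A tiny in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://example.org/": '<a href="a.html">A</a> <a href="logo.gif">logo</a>',
    "http://example.org/a.html": '<a href="/">home</a> <a href="b.html">B</a>',
    "http://example.org/b.html": "no links here",
}
crawled = spider(["http://example.org/"], FAKE_WEB.get)
print(list(crawled))
```

Passing the download step in as `fetch` keeps the loop testable. A real crawler would substitute a function that performs the HTTP request, honors robots exclusion, and returns `None` on failure.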

Queueing Strategy
How new links are added to the queue determines the search strategy.
FIFO (append to end of Q) gives breadth-first search.
LIFO (add to front of Q) gives depth-first search.
Heuristically ordering the Q gives a “focused crawler” that directs its search towards “interesting” pages.
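
A heuristically ordered queue can be sketched with a priority queue. In the example below, the `Frontier` class and the `keyword_score` heuristic (counting topic words in the URL) are hypothetical and chosen only to keep the example short; a practical focused crawler would more likely score links using page content or anchor text.

```python
import heapq
from itertools import count

class Frontier:
    """Priority queue of URLs; higher-scoring URLs are popped first."""
    def __init__(self, score):
        self._score = score
        self._heap = []
        self._order = count()   # tie-breaker keeps insertion order stable

    def push(self, url):
        # heapq is a min-heap, so negate the score to pop the best URL first.
        heapq.heappush(self._heap, (-self._score(url), next(self._order), url))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)

def keyword_score(url, keywords=("search", "crawl")):
    """Hypothetical heuristic: count topic keywords found in the URL."""
    return sum(word in url.lower() for word in keywords)

frontier = Frontier(keyword_score)
for url in ["http://example.org/sports",
            "http://example.org/search/crawl",
            "http://example.org/search"]:
    frontier.push(url)
order = [frontier.pop() for _ in range(len(frontier))]
print(order)
```

Because ties are broken by insertion order, a scoring function that returns the same value for every URL makes this frontier behave like the FIFO queue, so breadth-first search is the special case with no heuristic.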


Google
Google is a search engine that maintains its own spider-based index.
Google also has a directory that is powered by the Open Directory.
Google supports:
– Boolean search
– Phrase
– Similarity
– Proximity
Source: lookoff.com

Google
Strengths
– The interface is tremendously simple, but the quality of results is not significantly impeded
– Accuracy for common topics
Weaknesses
– Lack of power features
– Coverage of the Internet is much less than some competitors
– No OR keyword support for boolean searches
Source: lookoff.com

Yahoo!
Strengths
– Coverage of the Internet is excellent
– Links are generally quite up to date and free of spam and poor quality sites
– Human maintainers ensure that sites are placed correctly within the relevant topic
– The search interface is very fast
– Yahoo integrates with indexed searches after presenting Yahoo topic areas
– Accuracy for common topics
Weaknesses
– The search interface is very effective for general searches but could be better for powerful searches
– Not all relevant sites are listed in Yahoo; they have to be submitted and accepted.
Source: lookoff.com

Ask Jeeves
Strengths
– A simple interface makes it very easy to form queries.
– Excellent for new users and children.
– If your query corresponds to a pre-packaged answer, you can expect some surprisingly good results.
– Millions of bundled answers provide premium answers that are superior to standard index searches.
– The site is actively maintained.
– An integrated metacrawler provides results for your search from Goto, AltaVista, Mamma and 4Anything.
– The search code is very fast.
Weaknesses
– The site supposedly takes pay for top spots, sometimes placing dubious quality links at the top of results.
– No advanced search.
– Very little power in constructing your keywords.
– Little control over filtering results.

MSN
Strengths
– Very active news portal with updated and well-presented headlines.
– Integrated single sign-on with hotmail, msn, etc.
– Configurable interface lets you customize content, layout and colors.
– Very actively maintained.
– Many interesting (although often commercially-oriented) services tied into the MSN network.
– Nationalized versions for quite a few countries providing more specific content and news feeds.
– Ability to save (i.e. tag) results to quickly filter search results into a candidates list.
Weaknesses
– Not a low-bandwidth interface. Slow modem users should beware.
– Mediocre search interface.
– Less web coverage than most search engines.

Program        Pages (#)  Class      FAQ  FTP  Index  Meta  Misc  News  Portal
Dejanews       300M msg   Best       N    N    N      N     Y     Y     N
Raging         250M       Best       N    N    Y      N     N     N     N
Yahoo          500T       Best       N    N    N      N     N     N     Y
AllTheWeb      300M       Excellent  N    N    Y      N     N     N     N
AltaVista      250M       Excellent  N    N    Y      N     N     Y     Y
FAQS           3300 FAQs  Excellent  Y    N    N      N     Y     N     N
FTPSearch      100M file  Excellent  N    Y    N      N     N     N     N
Search.com     N/A        Excellent  N    N    N      Y     N     N     N
About          ?          Good       N    N    N      N     Y     N     Y
AskJeeves      8M Ques.   Good       Y    N    Y      N     N     N     Y
DirectHit      ?          Good       N    N    N      N     N     N     Y
Excite         ?          Good       N    N    Y      N     N     Y     Y
Go             50M?       Good       N    N    Y      N     N     N     Y
Google         100M?      Good       N    N    Y      N     N     N     N
HotBot         150M?      Good       N    N    Y      N     N     N     Y
Lycos          250M?      Good       N    Y    Y      N     N     N     Y
MetaCrawler    N/A        Good       N    N    N      Y     N     N     N
MSN            120M?      Good       N    N    Y      N     N     N     Y
NorthernLight  200M?      Good       N    N    Y      N     N     Y     N
OpenDirectory  1M?        Good       N    N    N      N     N     N     Y
WebCenter      500T?      Good       N    N    N      N     N     N     Y
DogPile        N/A        Okay       N    Y    N      Y     Y     Y     Y
GoTo           ?          Okay       N    N    Y      N     N     N     Y
InfoSpace      very few   Okay       N    N    Y      N     Y     N     N
iWon           350M?      Okay       N    N    Y      N     Y     N     N
Snap           ?          Okay       N    N    Y      N     N     N     Y
Mamma          n/a        Weak       N    N    N      Y     N     N     N

Conclusion
Intelligent agent technology could be used to improve the searching method.
Quantum search methods could also be explored.

Thank you all!