(c) Maria Indrawan 2004 Distributed Information Retrieval.

Challenges in Managing Distributed Information
No topology of the data organisation.
Dynamic data.
The size of the collection.
No control over the quality of the data.
Multimedia data.

Challenges: Human Factor
Diversity of users
  – Expert to novice
Ill-formed queries.
Specific behaviour
  – Favour precision over recall (85% of users look only at the first screen; Lan Huang, A survey on Web Information Technology)

Types of Distributed IR
Directory
  – Yahoo
Search Engine
  – Google, AskJeeves, Yahoo, Teoma
Meta Search
  – Metacrawler, Dogpile
Distributed Broker
  – Harvest

Directory Listing
Manually created
  – Yahoo, Google, MSN
  – Open Directory Project

Directory Listing
Automatic classification
TERENA
  – http:// seac.html
Scorpion

Search Engine Architecture
Crawler (robots)
  – Collects the pages from the Web.
Indexer
  – Indexes the pages collected by the crawler and represents them in an efficient data structure.
Query Server
  – Accepts and processes the user's query and returns the results.

Crawler: Design Considerations
Crawling algorithm
  – Breadth-first vs depth-first
How do we handle URL aliases?
How do we reduce server load?
How do we detect a duplicate page or a mirror site?
How often do we need to revisit a site?
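
The crawling questions above can be made concrete with a small sketch. It is only an illustration under stated assumptions: the "web" is a hypothetical in-memory dictionary (TOY_WEB), URL aliases are reduced to case and fragment differences, and duplicates are detected with an exact content hash, which would miss near-duplicates. Server load and revisit scheduling are not covered.

```python
from collections import deque
from hashlib import sha256
from urllib.parse import urldefrag, urlsplit, urlunsplit

# Hypothetical in-memory "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "http://a.example/": (
        "home page",
        ["http://a.example/x", "HTTP://A.EXAMPLE/x#top", "http://b.example/"],
    ),
    "http://a.example/x": ("page x", ["http://a.example/"]),
    "http://b.example/": ("home page", []),  # mirror: same text as a.example/
}


def normalise(url):
    """Collapse simple URL aliases: lower-case scheme and host, drop the fragment."""
    url, _fragment = urldefrag(url)
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


def crawl_bfs(seed, web):
    """Breadth-first crawl; returns (URLs fetched in order, URLs skipped as duplicates)."""
    start = normalise(seed)
    frontier = deque([start])
    seen_urls = {start}
    seen_hashes = set()
    fetched, duplicates = [], []
    while frontier:
        url = frontier.popleft()  # FIFO queue gives breadth-first order
        if url not in web:
            continue
        text, links = web[url]
        digest = sha256(text.encode("utf-8")).hexdigest()
        if digest in seen_hashes:  # identical content already stored elsewhere
            duplicates.append(url)
            continue
        seen_hashes.add(digest)
        fetched.append(url)
        for link in links:
            link = normalise(link)
            if link not in seen_urls:
                seen_urls.add(link)
                frontier.append(link)
    return fetched, duplicates


fetched, duplicates = crawl_bfs("http://a.example/", TOY_WEB)
```

Replacing popleft() with pop() turns the queue into a stack and gives a depth-first crawl instead.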

Update Rate (May 2003)

Search Engine   Newest Page Found   Rough Average   Oldest Page Found
Google          2 days              1 month         165 days
MSN (Ink)       1 day               4 weeks         51 days
HotBot (Ink)    1 day               4 weeks         51 days
AlltheWeb       1 day               1 month         599 days
Gigablast       45 days             7 months        381 days
Teoma           41 days             2.5 months      81 days
WiseNut         133 days            6 months        183 days

Indexer: Design Considerations
How do we handle typing mistakes?
Do we use a stop list and a stemming algorithm?
How much of a given web page do we want to index?
  – Google indexes only the first 101 KB of a web page and 120 KB of a PDF file.
How big do we want the indexed database to be?
  – Response time vs coverage
Do we want to index PDF and PS files?
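
A minimal sketch of how a stop list, stemming and a size cap might fit together in an indexer. The stop list, the crude suffix-stripping stem() function and the sample pages are all assumptions made for illustration. A real system would use a proper stemming algorithm, and the 101 KB cap simply reuses the figure quoted on the slide.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}  # illustrative stop list
MAX_BYTES = 101 * 1024  # index only the first 101 KB, the figure quoted on the slide


def stem(word):
    """Very crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens(text):
    """Truncate, lower-case, split into words, drop stop words, then stem."""
    text = text.encode("utf-8")[:MAX_BYTES].decode("utf-8", errors="ignore")
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]


def build_index(pages):
    """Inverted index: map each term to the sorted list of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for term in tokens(text):
            index[term].add(page_id)
    return {term: sorted(ids) for term, ids in index.items()}


pages = {1: "Crawling the web", 2: "Indexing crawled pages", 3: "The index of pages"}
index = build_index(pages)
```

Here "Indexing" and "index" collapse to the same term, so a query for either finds pages 2 and 3, while "the" and "of" never enter the index.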

[Figure: Size Growth]

[Figure: Estimated Size Dec 31,]

Query Server: Design Considerations
Retrieval model.
Complexity of the query syntax.
HCI (human-computer interface).
Output display.

Retrieval Model
Traditional approach:
  – Keyword matching returns too many low-quality matches (low precision).
Search engines need VERY high precision output, even at the expense of RECALL.
How can we achieve this?
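
To make the precision/recall trade-off concrete, the sketch below computes both measures for two made-up result lists against a made-up set of relevant documents. The document ids are hypothetical. Only the standard definitions are assumed: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against a set of relevant ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


relevant = {"d1", "d2", "d3", "d4"}

# A broad keyword match finds every relevant document but also much noise.
broad = precision_recall(["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8"], relevant)

# A selective ranking returns only two documents, both relevant.
narrow = precision_recall(["d1", "d2"], relevant)
```

The broad list scores precision 0.5 and recall 1.0. The narrow list scores precision 1.0 and recall 0.5, which is the trade a search engine accepts when users read only the first screen.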

Google Retrieval Model
Utilise the popularity of a page
  – If many other pages point to a page, that page must be very important. We can assign it a high weight during search.
  – If a page is pointed to by a popular page, it can also be considered important because it is referred to by a reputable source (a popular page).
  – PageRank function.
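
The two intuitions above can be sketched as a simple power iteration. The slides do not give the formula, so this is a common normalised variant of PageRank rather than Google's actual implementation. The damping factor of 0.85 is the value usually quoted from Brin and Page's paper, and the five-page link graph is invented for illustration.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Iteratively spread each page's rank over the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: page -> pages it points to.
links = {"A": ["C"], "B": ["C"], "C": ["D"], "D": ["A"], "E": ["C"]}
rank = pagerank(links)
```

In this graph C is pointed to by three pages and ends up with the highest rank. D has a single incoming link, but it comes from the popular page C, so D still ranks well above B and E, which nobody links to.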

[Figure: PageRank Example]

Google Retrieval Model
Utilise the anchor text.
  – Anchors often provide more accurate descriptions of web pages than the pages themselves.
  – Anchors may exist for documents which cannot be indexed by a text-based search engine.
Utilise the appearance of the text.
  – Words in larger or bolder fonts are weighted higher than other words.
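
One way to picture anchor text and text appearance in scoring is to treat them as separately weighted fields. The sketch below does that. The field names, the weights and the two sample documents are assumptions chosen for illustration and are not values used by Google.

```python
# Made-up weights: anchor text counts most, emphasised text next, plain body least.
FIELD_WEIGHTS = {"anchor": 3.0, "emphasised": 2.0, "body": 1.0}


def score(query_terms, doc):
    """Weighted term count; doc maps a field name to the list of words in that field."""
    total = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        words = doc.get(field, [])
        total += weight * sum(words.count(term) for term in query_terms)
    return total


docs = {
    # An image has no indexable text of its own, only the anchors pointing to it.
    "logo.gif": {"anchor": ["monash", "logo"]},
    "about.html": {
        "body": ["monash", "university", "history"],
        "emphasised": ["history"],
    },
}

ranked = sorted(docs, key=lambda name: score(["monash"], docs[name]), reverse=True)
```

The image is retrievable for "monash" purely through its anchor text, and it outranks the page that mentions the word only in its body.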

[Figure: Results Overlap]

Metasearch
Meta search engines do not build their own index. They use the indexes of existing search engines.
When a user submits a query to a meta search engine, it sends the query to a number of search engines and collates the results.
A list of metacrawlers:
  – http://

Meta Search
metacrawler
  – uses Google, Yahoo, AskJeeves, About, Looksmart, Teoma, Overture, FindWhat.
dogpile
  – uses Google, Yahoo, AskJeeves, About, Looksmart, Teoma, Overture, FindWhat.

Metasearch Design Issues
Potential problems:
  – Translating the user query into the query syntax of each underlying search engine.
  – Query time is bounded by the least powerful (slowest) underlying system.
  – Combining results into a single ranked list is difficult.
Effectiveness depends on heuristics and on the information passed back from the underlying search engines:
  – detecting overlap in the query results
  – different scoring schemes (some engines do not use one)
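
A sketch of one possible merging heuristic. Because engines use different scoring schemes, or none, the example ignores scores and combines rank positions using reciprocal rank fusion. That is a technique swapped in for illustration and is not claimed to be what Metacrawler or Dogpile do. The engine names, URLs and the constant k=60 are assumptions.

```python
from collections import defaultdict


def canonical(url):
    """Crude overlap detection: ignore case and a trailing slash."""
    return url.rstrip("/").lower()


def merge_results(result_lists, k=60):
    """Reciprocal rank fusion: each engine adds 1 / (k + rank) to a URL's score."""
    scores = defaultdict(float)
    for results in result_lists.values():
        for rank, url in enumerate(results, start=1):
            scores[canonical(url)] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Hypothetical ranked lists returned by two underlying engines.
engine_results = {
    "engine_a": ["http://x.example/", "http://y.example/", "http://z.example/"],
    "engine_b": ["http://y.example", "http://w.example/"],
}
merged = merge_results(engine_results)
```

The page returned by both engines rises to the top of the merged list even though only one engine ranked it first. Lower-casing a whole URL is a simplification, since paths can be case-sensitive.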

Distributed Broker
Information is indexed locally, by geographical location or institutional boundary.
  – Suitable for supporting a community that wants to have a common search database.
Local indexes are combined to provide wider coverage.
Document scoring is performed locally by each index server.

Distributed Broker
[Diagram: brokers labelled CSSE, SIMS, ACC, MGM, FIT, F. Business and Monash]

Distributed Broker
Example: Harvest
  – http:// hing/schwartz.harvest/schwartz.harvest.html

General Architecture
Hierarchical vs flat
Hierarchical: the underlying index servers are connected through a hierarchy of brokers.
  – The broker hierarchy provides efficient and global coverage.
  – Brokers can be geographical, institutional or subject-based.
[Diagram labels: query, broker, index server]
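
The hierarchical arrangement can be sketched with two small classes. This is an illustration and not Harvest's actual design. The class names, the term-frequency scoring and the sample documents are assumptions, and the tree shape (a Monash broker over faculty brokers over department index servers) is a guess that borrows the labels from the broker diagram.

```python
class IndexServer:
    """Holds a local collection and scores its own documents."""

    def __init__(self, name, documents):
        self.name = name
        self.documents = documents  # doc id -> text

    def search(self, term):
        """Local scoring: here, simply how often the term occurs in the document."""
        hits = []
        for doc_id, text in self.documents.items():
            count = text.lower().split().count(term.lower())
            if count:
                hits.append((count, self.name, doc_id))
        return hits


class Broker:
    """Forwards a query to its children (brokers or index servers) and merges the hits."""

    def __init__(self, name, children):
        self.name = name
        self.children = children

    def search(self, term):
        hits = []
        for child in self.children:
            hits.extend(child.search(term))
        # Highest local score first; ties broken by server name, then doc id.
        return sorted(hits, key=lambda hit: (-hit[0], hit[1], hit[2]))


csse = IndexServer("CSSE", {"c1": "distributed retrieval retrieval", "c2": "compilers"})
sims = IndexServer("SIMS", {"s1": "information retrieval"})
acc = IndexServer("ACC", {"a1": "accounting"})

fit = Broker("FIT", [csse, sims])
business = Broker("Business", [acc])
monash = Broker("Monash", [fit, business])

results = monash.search("retrieval")
```

A query sent to the top broker reaches every index server, while a query sent to the FIT broker covers only its own departments. Scores computed locally by different servers are not necessarily comparable, so merging them directly, as done here, is a simplification.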

Flat Graph Model
[Diagram labels: query, broker, index server]

Useful Site
  – Provides links to most of the information discovery tools.

Summary
Types of Distributed Information Discovery
  – Directory Listing: Yahoo
  – Search Engines: Google, AskJeeves, Teoma
  – Metasearch: Metacrawler, Dogpile
  – Distributed Broker: Harvest