Retrieval of Information from Distributed Databases By Ananth Anandhakrishnan.

Outline
- Introduction to Distributed Computer Systems
- The need for Distributed IR
- Distributed IR
- Problems of Distributed IR: system components
- Federated search engine
- Other examples of Distributed IR
- Conclusion

Introduction to Distributed Computer Systems
What is it?
- A distributed system is a collection of independent computers that appears to its users as a single coherent system.
An early precursor
- The Compatible Time-Sharing System (CTSS), developed at MIT and first demonstrated in 1961 on IBM hardware.
More recent times
- The WWW concept was designed in 1989 at CERN and came into widespread use in the 1990s.
- ARPANET: the building block of the Internet.
- Internet: a network of networks.

Other types of distributed systems
- ORACLE: distributed database management system
- Air traffic control system: real-time distributed system
- University network: client-server system
Question: is a search engine a distributed system?
- Yes: it presents a single interface, and search engines like Google have a cluster of 4000 computers doing their web crawling.
- No: the user is aware of where the retrieved documents come from (the web address, or URL), and Google's control is centralised (index and presentation).
Goals
- Share and access resources located on remote sites
- Scalability
- Transparency
- Fault tolerance

The need for Distributed IR
Benefits of centralised IR
- Centralised control of resources is easier to manage.
- More relevant resources are selected for a user query on a centralised system.
Why is there a need for distributed IR?
Problems of centralised systems
- A single server does not scale to millions of users: both network traffic and server load increase.
- There is a single point of failure.
Problems in IR
- Information is constantly growing.
- Different types of information are emerging with different formats and standards, residing on heterogeneous networks, which makes services hard to integrate.

The need for Distributed IR
Improve scalability
- Distribute the information across a network of servers.
Apply standards and protocols
- Z39.50 search protocol: a client-server protocol that allows uniform access to a large number of diverse and heterogeneous information sources.
- Dublin Core: a standard applied to metadata (data about data) to make searching for information more efficient.
Replication model for information retrieval
- Removes the single point of failure
- Improves scalability

Distributed IR
What is distributed information retrieval?
- The goal of distributed information retrieval is to enable the identification and retrieval of data sets relevant to a general description or query, wherever those data sets may be located or hosted.
- Architecture: user -> application -> distributed databases
- Environments: cooperative or uncooperative

Distributed IR
How does it work?
Library example
- A library organisation has sites in different locations, with different internet-accessible resources (e-journals, e-books) in different categories (literature, science, computing, geography, history, sports).
- Each library maintains its own database, holding the resources, a unique identifier for each resource, detailed descriptions of the resources, and statistical information about the resource content.
- Several libraries may hold resources of the same type.
- The library organisation has an online search engine which lets users search for any online resource, in any category, across all the libraries' databases.
Example of a query
- The user enters a query into the search application, which passes the request to all the individual library databases.
- Each database returns a list of unique identifiers of the relevant resources; the application merges these lists and presents a single ranked list to the user.
- If the user finds a resource they want to view, the resource identifier is used to retrieve the resource.
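The query flow in the library example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LibraryDatabase` class, its term-count scoring, and the `federated_search` function are hypothetical names invented here, not part of any system described in the slides.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    resource_id: str  # unique identifier of the resource
    score: float      # relevance score reported by one library database


class LibraryDatabase:
    """Hypothetical stand-in for one library's own database."""

    def __init__(self, name, resources):
        self.name = name
        self.resources = resources  # {resource_id: description text}

    def search(self, query):
        """Score each resource by how often the query terms occur in its description."""
        terms = query.lower().split()
        hits = []
        for resource_id, description in self.resources.items():
            words = description.lower().split()
            score = sum(words.count(term) for term in terms)
            if score > 0:
                hits.append(Hit(resource_id, float(score)))
        return hits

    def fetch(self, resource_id):
        """Retrieve a resource by its identifier (None if this library lacks it)."""
        return self.resources.get(resource_id)


def federated_search(query, libraries):
    """Send the query to every library and merge the hits into one ranked list."""
    best = {}
    for library in libraries:
        for hit in library.search(query):
            current = best.get(hit.resource_id)
            if current is None or hit.score > current.score:
                best[hit.resource_id] = hit  # keep one entry per identifier
    # Highest score first; ties broken by identifier for a stable order.
    return sorted(best.values(), key=lambda h: (-h.score, h.resource_id))


north = LibraryDatabase("north", {
    "N1": "distributed databases e-book",
    "N2": "history of sport",
})
south = LibraryDatabase("south", {
    "S1": "distributed information retrieval distributed systems",
    "S2": "geography atlas",
})
results = federated_search("distributed retrieval", [north, south])
```

Here `results` holds the merged identifiers in ranked order, and `south.fetch("S1")` plays the role of the final retrieval step. A real system would not broadcast to every database; the resource selection component described later narrows the set first.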

Problems of Distributed IR: system sub-components
Main components
- Resource description
- Resource selection
- Query translation
- Resource merging
Resource description
- Database files which contain detailed information about the resources.
- Cooperative environments: STARTS protocol
- Uncooperative environments: query-based sampling
Dublin Core
- A standard used to improve indexing information for resource descriptions.
- 15 elements, used to uniquely identify information or resources.
- Embedded into XML or HTML.
Example
  TITLE: Information Retrieval from Distributed Databases
  CREATOR: Ananth Anandhakrishnan
  DATE:
  FORMAT: WORD DOCUMENT
  LANGUAGE: ENGLISH

Dublin Core Metadata in HTML and XML
HTML has a tag called META. Example for a page titled "Distributed Information Retrieval":
  <meta name="DC.Title" content="Retrieval of information from DIR">
  <meta name="DC.Creator" content="Ananth Anandhakrishnan">
  <meta name="DC.Date" content="24/11/2004">
  <meta name="DC.Format" content="text/html">
  <meta name="DC.Language" content="en">
XML embeds the metadata using a framework, RDF. The slide's <rdf:RDF> example carries these values:
- Ananth Anandhakrishnan
- Distributed Information Retrieval
- How does information retrieval from distributed databases work
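As a rough illustration of how such META tags might be produced, the sketch below builds them from a small record. The function name `dublin_core_meta` and the record layout are assumptions made for this example. The date is written as 2004-11-24, following the Dublin Core recommendation to use an ISO 8601 style date, rather than the 24/11/2004 form shown on the slide.

```python
from html import escape


def dublin_core_meta(record):
    """Render a {element: value} record as HTML <meta> tags with DC.* names."""
    lines = []
    for element, value in record.items():
        name = escape(f"DC.{element}", quote=True)
        content = escape(value, quote=True)  # protects quotes and ampersands
        lines.append(f'<meta name="{name}" content="{content}">')
    return "\n".join(lines)


# Hypothetical record mirroring the slide's example values.
record = {
    "Title": "Retrieval of information from DIR",
    "Creator": "Ananth Anandhakrishnan",
    "Date": "2004-11-24",
    "Format": "text/html",
    "Language": "en",
}
html_head = dublin_core_meta(record)
print(html_head)
```

Escaping the values matters because a title containing a double quote would otherwise end the `content` attribute early.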

Resource Selection Component
Resource selection has two jobs
- Identify a small set of databases in the distributed information retrieval system that contain documents relevant to a query.
- Produce a ranked list of the selected databases.
This process is based on algorithms
- CORI
- KL divergence
- Relevant Document Distribution Estimation (ReDDE)
Which is the best?
- ReDDE is reported to be the most effective of the three for resource selection.
- It estimates the distribution of relevant documents across the databases for each user query and ranks the databases according to this distribution.
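The idea behind ReDDE can be sketched as follows. This is a simplified illustration, not the published algorithm in full: it assumes the query has already been run against a centralised sample database, that database sizes are known or estimated, and that a fixed fraction (`ratio`) of the whole collection counts as relevant. All names and numbers are hypothetical.

```python
def redde_rank(sample_ranking, db_sizes, sample_sizes, ratio=0.003):
    """Rank databases by their estimated number of relevant documents.

    sample_ranking: [(doc_id, db_name), ...], best match first, obtained by
                    running the query on the centralised sample database.
    db_sizes:       {db_name: estimated number of documents in the database}
    sample_sizes:   {db_name: number of documents sampled from the database}
    ratio:          assumed fraction of the whole collection that is relevant.
    """
    threshold = ratio * sum(db_sizes.values())
    estimates = {db: 0.0 for db in db_sizes}
    ranked_above = 0.0  # estimated documents ranked above the current one
    for _doc_id, db in sample_ranking:
        if ranked_above >= threshold:
            break  # everything further down is treated as non-relevant
        # Each sampled document stands for this many documents in its database.
        weight = db_sizes[db] / sample_sizes[db]
        estimates[db] += weight
        ranked_above += weight
    # Largest estimate first; ties broken by database name.
    return sorted(estimates.items(), key=lambda kv: (-kv[1], kv[0]))


ranking = [("d1", "B"), ("d2", "A"), ("d3", "B"), ("d4", "A")]
selected = redde_rank(
    ranking,
    db_sizes={"A": 1000, "B": 100},
    sample_sizes={"A": 10, "B": 10},
    ratio=0.1,
)
```

In this toy run a single sampled document from the large database A represents 100 real documents, so A can outrank B even though B's sample has the top hit. That scaling by database size is the part of the approach the sketch is meant to show.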

Resource Merging
Result merging
- The selected results are compiled into a single result list.
- Duplicate resources are removed.
Problems
- Different databases use different selection algorithms, so their results are difficult to merge.
- Solution: use standard selection algorithms.
More problems
- Current merging methods take place at the client end, isolated from the DIR system.
- Current methods are not very good:
  - Round robin: takes results from each database in turn, without taking their relevance into account.
  - Raw score merge: orders results by each database's own document scores.
- Solution: place the merging component near the selection component.
Semi Supervised Learning model: a resource merging method
- Aim: produce a ranked list which is similar to that of a centralised information retrieval system.
- How it is achieved: by running a centralised sample database in parallel with the distributed databases.
- The centralised sample database is built by using query-based sampling to build resource descriptions.
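The two baseline merging methods named above can be sketched to show why neither is satisfactory. The function names and the toy result lists are assumptions for illustration only; each result list is a best-first list of (document id, score) pairs as one database might return it.

```python
from itertools import zip_longest


def round_robin_merge(ranked_lists):
    """Interleave results: first hit of every database, then second hits, and so on."""
    merged, seen = [], set()
    for tier in zip_longest(*ranked_lists):
        for item in tier:
            if item is None:  # this database has run out of results
                continue
            doc_id, _score = item  # the score is ignored entirely
            if doc_id not in seen:
                seen.add(doc_id)
                merged.append(doc_id)
    return merged


def raw_score_merge(ranked_lists):
    """Sort all results by the scores exactly as each database reported them."""
    best = {}
    for results in ranked_lists:
        for doc_id, score in results:
            best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    return sorted(best, key=lambda doc_id: (-best[doc_id], doc_id))


# Database A scores in the range 0..1, database B on a much larger scale.
db_a = [("a1", 0.9), ("a2", 0.8)]
db_b = [("b1", 42.0), ("a1", 30.0), ("b2", 12.0)]

interleaved = round_robin_merge([db_a, db_b])
by_raw_score = raw_score_merge([db_a, db_b])
```

Round robin ignores relevance altogether, while raw score merging lets database B dominate simply because its scores are on a larger scale. Both methods do remove duplicates (`a1` appears once), which matches the merging requirement above.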

[Figure: Semi Supervised Learning model. A query is entered and passes through resource selection to the distributed databases; in parallel it is sent to a centralised sample database, which holds resource descriptions of documents held on all databases, obtained by querying. The distributed databases return individual ranked lists with database-specific scores; the central database returns a ranked list of documents with database-independent scores. Merging combines the document rankings into a merged list ranked by relevance, presented to the user as a ranked list of document links.]

Semi Supervised Learning Model
How distributed information retrieval works in more detail
- A user enters a query.
- The query is used to rank the collection of databases, from which a set of databases is selected.
- The query is then broadcast to all the selected databases, each of which produces a ranked list of matches with document ids and scores.
- The document ids and scores are passed to the merging algorithm.
- The query is also broadcast to the centralised sample database running in parallel, and its ranked list of document ids and scores is also passed to the merging algorithm.
- The ranked list provided by the central database influences how the results from the distributed databases are merged.
SSL
- The SSL algorithm models result merging as the task of transforming sets of database-specific document scores into a single set of database-independent document scores, using the documents acquired by query-based sampling as training data.
- It uses a regression algorithm to do this.
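A minimal sketch of the regression step might look like the following. It assumes a simple one-variable linear fit per database, trained on the documents that appear both in that database's result list and in the centralised sample database's result list. The function names, the data layout, and the choice to skip databases with too little overlap are assumptions for this example.

```python
def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return slope, mean_y - slope * mean_x


def ssl_merge(db_results, central_scores):
    """Map database-specific scores onto the central scale, then merge.

    db_results:     {db_name: [(doc_id, db_specific_score), ...]}
    central_scores: {doc_id: score from the centralised sample database}
    """
    merged = {}
    for _db, results in db_results.items():
        # Training data: documents scored by both this database and the
        # centralised sample database.
        overlap = [(s, central_scores[d]) for d, s in results if d in central_scores]
        xs = [x for x, _ in overlap]
        if len(set(xs)) < 2:
            continue  # assumption: skip databases with too little overlap to fit
        slope, intercept = fit_linear(xs, [y for _, y in overlap])
        for doc_id, score in results:
            comparable = slope * score + intercept
            merged[doc_id] = max(comparable, merged.get(doc_id, float("-inf")))
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))


db_results = {
    "A": [("a1", 10.0), ("a2", 8.0), ("a3", 6.0)],  # scores on a 0..10 scale
    "B": [("b1", 0.9), ("b2", 0.5), ("b3", 0.1)],   # scores on a 0..1 scale
}
central_scores = {"a1": 0.8, "a3": 0.4, "b1": 0.7, "b3": 0.3}
merged = ssl_merge(db_results, central_scores)
```

Documents `a2` and `b2` were never sampled, yet they receive comparable scores through the fitted lines, so the final list interleaves the two databases by estimated relevance rather than by raw score.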

ISI Web of Knowledge
- An incredible wealth of content: ISI-Derwent + Partners = depth and diversity.
- Engineered to work as a single resource; uniquely integrated like no other platform.
What makes the Web of Knowledge so unique?
- CrossSearch: 9,000+ international journals; 100,000+ meetings, symposia, and reports; 11.3 million patented inventions.
ISI products are registered trademarks and service marks used under license.

Our research interests involve the development of plant species that will actually assist in the clean-up of polluted soils.

We can choose to explore our results using the CrossSearch results summary list as a base.

We can also filter results by specific database. This is especially helpful in identifying particular information, such as patent data, within the results list.

Other Examples
Emerge
- Emerge is software built for information retrieval of scientific data.
- It makes use of Dublin Core and the Z39.50 search protocol.
- It has an XML-based translation engine which can perform metadata mapping and query translation.
Harvest
- Collects information from the internet and intranets using HTTP and FTP, and from local files such as data on hard disk, CD-ROM and file servers.
- Makes the information searchable through a web interface.
- Supports a wide range of formats.
- Uses the Summary Object Interchange Format (SOIF) for metadata mapping.
- Architecture: Provider 1, Provider 2, Provider 3 -> Gatherer -> Broker -> Client
  - Gatherer: collects the information available at a provider.
  - Broker: collects, stores and manages the information for clients to query.
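The gatherer and broker roles can be sketched as two small classes. This is not Harvest's actual code or its SOIF record format; the `Gatherer` and `Broker` classes, the keyword-based summaries, and the provider names are hypothetical and only illustrate the division of labour.

```python
class Gatherer:
    """Collects the information available at one provider and summarises it."""

    def __init__(self, provider, documents):
        self.provider = provider
        self.documents = documents  # {path: text held at the provider}

    def gather(self):
        """Return one summary record per document (a stand-in for a metadata record)."""
        return [
            {
                "url": f"{self.provider}/{path}",
                "keywords": sorted(set(text.lower().split())),
            }
            for path, text in self.documents.items()
        ]


class Broker:
    """Collects, stores and manages gathered summaries for clients to query."""

    def __init__(self):
        self.index = {}  # keyword -> set of document URLs

    def collect(self, gatherer):
        for summary in gatherer.gather():
            for keyword in summary["keywords"]:
                self.index.setdefault(keyword, set()).add(summary["url"])

    def query(self, keyword):
        """Return the URLs of all documents whose summary contains the keyword."""
        return sorted(self.index.get(keyword.lower(), set()))


broker = Broker()
broker.collect(Gatherer("provider1", {"a.txt": "Distributed search", "b.txt": "Metadata mapping"}))
broker.collect(Gatherer("provider2", {"c.txt": "distributed metadata"}))
matches = broker.query("distributed")
```

The client only ever talks to the broker, so providers can be added or removed without changing the search interface.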

Other Examples
SETI@home
- A screensaver program used to aid the search for extraterrestrial life.
- Uses the CPU power of client computers to process data packets.
- User information is kept to track the processing of data.

Conclusion
- Distributed computing concepts help information retrieval systems.
- Distributed IR depends on centralised IR and tries to emulate it.
Current state of distributed search
- GRUB: a screensaver program which uses your bandwidth and CPU power. It aims to produce the most up-to-date indexes, but has not gained a wide level of support.
- P2P search: well known from Napster and Kazaa. It is more dynamic than Google, because it allows users to upload whatever they want and make it searchable, whereas Google is a controlled environment. It is not considered in the commercial field, which does not see the benefits.