An Architecture for Online Information Integration on Concurrent Resource Access on a Z39.50 Environment, by Michalis Sfakakis and Sarantos Kapidakis


An Architecture for Online Information Integration on Concurrent Resource Access on a Z39.50 Environment
Michalis Sfakakis (1) and Sarantos Kapidakis (2)
(1) National Documentation Centre / National Hellenic Research Foundation
(2) Laboratory on Digital Libraries and Electronic Publishing, Archive and Library Sciences Department / Ionian University
7th European Conference on Digital Libraries, August 2003, Trondheim, Norway

Presentation Summary
- Main Contributions
- Resource Access in a Network Environment (models, characteristics, issues, implementations)
- Proposed Architecture (goal, critical points, characteristics, benefits)
- Technical Details of the Proposed Architecture
- Conclusions
- Future Research

Main Contributions
- Analysis of the problems, in a networked environment, of:
  - Concurrent resource access via parallel search
  - Information integration
- Proposal of an architecture for these problems that:
  - Is able to improve online information integration
  - Takes into account the restrictions imposed by the:
    - Network environment
    - Z39.50 information retrieval protocol

Resource Access in Union Catalogues
- Give access to library content from one central point
- Functional requirements:
  - Consistent searching and indexing
  - Consolidation of records (information integration)
  - Performance and management
- Conformance of the current implementation models to these requirements:
  - Centralized (the vast majority of current implementations): conforms well to all functional requirements
  - Distributed (current approaches, i.e. virtual union catalogues): conformance to each functional requirement varies

Why Virtual Union Catalogues (VUC)
Why move from centralized to distributed:
- Local autonomy and control of the participating systems
- Retention of the specific resource characteristics
- User ability to dynamically define their own collections of resources
- Vast and increasing number of available resources

Prerequisites for VUC
- Ensure systems interoperability through the implementation of international metadata standards and information retrieval protocols
- Provide information integration (a need indicated by user studies)
- Achieve acceptable performance from the systems that emulate the union catalogue
- Support parallel searching
- Have adequate network performance

Is it possible to implement VUC now?
Depends on:
- Current technology and network improvements
- Existence and wide acceptance of metadata standards (e.g. DC, MARC, MODS)
- Wide acceptance of the Z39.50 information retrieval protocol and its associated profiles

Requirements for Information Integration
- Information integration (consolidation of records) is a two-step process:
  - Identification of the duplicate records
  - Presentation: creation of a union record or, according to the Z39.50 duplicate detection model, clustering of the records in ‘equivalence classes’ and selection of a representative record
- Its effectiveness and quality are affected by:
  - Differences in the semantic models and formats of the metadata
  - Metadata quality (i.e. specificity, completeness of fields, syntactic correctness, and consistency as implemented by authority files)
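The two-step process above can be illustrated with a minimal Python sketch. The matching key (normalized title plus year) and the "most complete record wins" rule are assumptions made only for illustration; the slides do not prescribe a particular matching rule, and the field names are hypothetical.

```python
from collections import defaultdict

def match_key(record):
    """Hypothetical matching key: title without case or punctuation, plus year."""
    title = "".join(ch for ch in record["title"].lower() if ch.isalnum())
    return (title, record.get("year"))

def completeness(record):
    """Count non-empty fields; a crude stand-in for metadata quality."""
    return sum(1 for value in record.values() if value)

def consolidate(records):
    """Step 1: cluster records into equivalence classes.
    Step 2: select one representative record per class."""
    classes = defaultdict(list)
    for record in records:
        classes[match_key(record)].append(record)
    return [max(members, key=completeness) for members in classes.values()]

records = [
    {"title": "Digital Libraries", "year": 2003, "publisher": ""},
    {"title": "Digital  libraries.", "year": 2003, "publisher": "Springer"},
    {"title": "Information Retrieval", "year": 1999, "publisher": "ACM"},
]
representatives = consolidate(records)
```

Here the first two records fall into the same class, and the one with the publisher filled in is chosen as the representative, which shows how field completeness feeds into the result.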

Methods for Information Integration
Depending on the challenge:
- High-quality duplicate detection and merging on large amounts of data, offline, without hard time restrictions:
  - Development of centralized union catalogues, or creation of collections by harvesting techniques
- Good de-duplication quality on medium to small amounts of data, online, with results presented to the user in an acceptable response time:
  - Development of virtual union catalogues

Z39.50 Information Retrieval Protocol
- A complicated, stateful, client/server protocol, widely used in the area of libraries
- For every session (Z-association):
  - The server holds a search history (at least the last query)
  - During the session the client can request data from any result set included in the search history
  - The search history stays alive during the session
  - The session can be abruptly terminated by the server (timeout) on ‘lack of activity’
    - The timeout period is server dependent
- Depending on its implementation level, a server may implement, in a number of variations, the:
  - Sort service
  - Duplicate detection service
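The session behaviour described above can be modelled with a small Python sketch. This is a toy simulation, not a Z39.50 client or server: the class name, method names and the injected clock are all hypothetical, and real servers differ in how they report a timeout.

```python
class ZAssociation:
    """Toy model of one Z39.50 session (Z-association) on the server side."""

    def __init__(self, timeout_seconds, clock):
        self.timeout_seconds = timeout_seconds
        self.clock = clock            # callable returning the current time in seconds
        self.last_activity = clock()
        self.search_history = {}      # result set name -> list of records
        self.closed = False

    def _touch(self):
        # The server drops the session, and its result sets, on lack of activity.
        if self.closed or self.clock() - self.last_activity > self.timeout_seconds:
            self.closed = True
            self.search_history.clear()
            raise ConnectionError("session timed out; result sets are gone")
        self.last_activity = self.clock()

    def search(self, name, hits):
        self._touch()
        self.search_history[name] = list(hits)
        return len(hits)

    def present(self, name, start, count):
        self._touch()
        return self.search_history[name][start:start + count]

now = [0.0]                                   # fake clock for the example
session = ZAssociation(timeout_seconds=300, clock=lambda: now[0])
hit_count = session.search("default", ["rec1", "rec2", "rec3"])
now[0] = 200.0
first_page = session.present("default", 0, 2)  # still within the timeout
now[0] = 600.0                                 # 400 seconds of silence
try:
    session.present("default", 2, 1)
    timed_out = False
except ConnectionError:
    timed_out = True
```

Once the session is gone, the client would have to reconnect and repeat the search, which is the session reactivation the later slides try to avoid.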

Summary of VUC Implementation Issues
- Network dependent:
  - Performance and availability of the network links
- Protocol dependent:
  - Interoperability level (e.g. supported services and their implementation variations)
  - Timeout period and session reactivation
- Participating systems dependent:
  - Performance, availability, extensibility, metadata encoding and semantics
- De-duplication complexity and cost:
  - Highly affected by the different semantic models and formats, and by the quality, completeness, consistency and amount of the metadata
- Overall system performance

Current VUC Implementations
- Server side:
  - The majority support the basic services (e.g. Init, Search, Present, Scan)
  - A small number support the sort service
  - A minority support the duplicate detection service
- Client side:
  - Has to deal with heterogeneity in the data it receives
  - Must overcome timeout issues, avoiding session reactivation
  - Has to de-duplicate incoming results, even if no individual server reply contains duplicates
  - The majority of implementations do not perform any integration, due to performance issues
  - Primitive duplicate detection approaches are based on some coded data (e.g. ISBN, ISSN, LC number)
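As an illustration of the primitive, coded-data approach mentioned in the last bullet, here is a minimal Python sketch. The field names (`isbn`, `issn`, `lccn`) and the normalization rule are assumptions for the example, not taken from any particular client.

```python
def coded_keys(record):
    """Collect normalized identifiers from the hypothetical isbn/issn/lccn fields."""
    keys = set()
    for field in ("isbn", "issn", "lccn"):
        value = record.get(field)
        if value:
            normalized = "".join(ch for ch in value.upper() if ch.isalnum())
            keys.add((field, normalized))
    return keys

def dedupe_by_codes(records):
    """Keep the first record seen for every identifier; later matches are dropped."""
    seen = set()
    unique = []
    for record in records:
        keys = coded_keys(record)
        if keys & seen:
            seen |= keys          # remember any extra identifiers of the duplicate
            continue
        seen |= keys
        unique.append(record)
    return unique

records = [
    {"title": "A", "isbn": "0-13-110362-8"},
    {"title": "A (second copy)", "isbn": "0131103628"},
    {"title": "B"},
    {"title": "B"},
]
unique = dedupe_by_codes(records)
```

The two copies of "A" are merged, but both copies of "B" survive because they carry no coded data, which is exactly why such approaches are described as primitive.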

User and VUC System Interactions
The user:
- Defines the desired collection of resources
- Sends a search request, specifying the desired number of records (the Presentation Set) to display each time
- After receiving a Presentation Set, may or may not request subsequent Presentation Sets

Goal of the Proposed Architecture
To improve information integration in the online access of a distributed system that:
- Accesses resources concurrently via the network
- Applies good-quality duplicate detection procedures online (so that each record located in multiple resources is presented only once)

Critical Points of the Proposed Architecture
We have to deal with:
- Performance of the network links and availability of the resources
- Complexity and cost of the duplicate detection algorithms, especially on large numbers of records
- Extraction of the Presentation Set in a reasonable response time

Characteristics of the Proposed Architecture
What we do:
- We do not apply the duplicate detection algorithms in one shot: the duplicate detection process is applied to each received set of data, comparing it against the previously processed results
- Incremental comparison and elimination of the duplicates in every Presentation Set: the processed results are sorted and do not contain duplicates
- Use of the sort or duplicate detection service, when supported
- While the user is reading the results, the system prepares the next few sets of unique records
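The incremental idea above can be sketched in a few lines of Python. The pairwise test `is_duplicate` is a hypothetical placeholder (same year, same title ignoring case and punctuation); the slides leave the actual duplicate detection algorithm open.

```python
def is_duplicate(a, b):
    """Hypothetical pairwise test: same year, same title ignoring case and punctuation."""
    def norm(title):
        return "".join(ch for ch in title.lower() if ch.isalnum())
    return a["year"] == b["year"] and norm(a["title"]) == norm(b["title"])

def integrate_batch(processed, batch):
    """Compare each incoming record against the already processed unique records.

    `processed` is updated in place and kept sorted by title; the function
    returns only the records of this batch that turned out to be new.
    """
    new_unique = []
    for record in batch:
        if not any(is_duplicate(record, kept) for kept in processed):
            processed.append(record)
            new_unique.append(record)
    processed.sort(key=lambda r: r["title"].lower())
    return new_unique

processed = []
batch_a = [{"title": "Networks", "year": 2001}, {"title": "Databases", "year": 1999}]
batch_b = [{"title": "databases.", "year": 1999}, {"title": "Algorithms", "year": 2003}]
new_from_a = integrate_batch(processed, batch_a)
new_from_b = integrate_batch(processed, batch_b)
```

Each batch is small, and in the first steps the processed set is small too, so the number of comparisons stays low early on, when the user is waiting for the first Presentation Set.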

Benefits of the Proposed Architecture
- Avoids downloading large amounts of data over the network and loading the servers unnecessarily
- Applies the duplicate detection algorithm to a small number of records, especially in the first steps
- Every record is compared against an already processed set during de-duplication
- Exploits the time the user spends reading the presented data, without exhausting the system resources

Overview of the Proposed Architecture
- Modules: Request Interface, Data Integrator, Resource Communicator
- Components: Data Provider, Local Result Set Manager, De-duplicator, Data Presenter
- Interaction is accomplished by messages or synchronous data transmissions

Modules of the Proposed Architecture
- The Request Interface: receives every user request (search or present), dispatches it to the appropriate modules and waits for the Presentation Set
- The Resource Communicator: accesses the resources and supplies the data for the integration
- The Data Integrator: receives the data sets, performs the information integration and keeps the unique records ready for presentation

Components of the Proposed Architecture
- The Local Result Set Manager: holds and arranges (e.g. sorts) the de-duplicated records and prepares the Presentation Set
- The Data Provider: receives data from the Resource Communicator module and passes it on, one record at a time, for further processing
- The De-duplicator(s): receives a record from the Local Result Set Manager and compares it with all the unique records in the Local Result Set
- The Data Presenter: dispatches the request for data received from the Request Interface to the Local Result Set Manager and returns the next unique records for presentation

[Figure: overview diagram of the proposed architecture. Labels: Z39.50 Servers for Resources 1…j, j+1…k and l+1…r; Resource Communicator; Data Integrator; Request Interface; User Interaction.]

Accomplishing a search request: Module Interactions
1. The Request Interface requests p records from the Data Integrator and waits for (at most p) records
2. The Request Interface also forwards the search request, including the number p, to the Resource Communicator and continues monitoring for user requests
3. The Resource Communicator waits for messages from the Request Interface and, when it receives a new search request, concurrently starts the following sequence of steps for every server:
   1. Translates the search request into the appropriate message format for the server, sends it and waits for the reply
   2. Adds the number of hits from all the replies and sends the total to the Request Interface
   3. If the server supports either the duplicate detection or the sort service, invokes it after the server's initial response to the search request
   4. Requests a number of records (e.g. p) from every server that replied to its last request
   5. Sends the arrived data to the Data Integrator
   6. Waits for further commands, but if there is no communication with the server for a period close to its timeout, the procedure jumps to step [...]
4. The Data Integrator de-duplicates part of the received data, prepares the set of unique records and, when p records are found, sends them to the Request Interface
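The per-server sequence run by the Resource Communicator can be sketched with Python's standard `concurrent.futures` module. `FakeServer` is a hypothetical in-memory stand-in for a Z39.50 server, not a real client library, and the waiting and keep-alive behaviour of the last sub-step is left out.

```python
from concurrent.futures import ThreadPoolExecutor

class FakeServer:
    """In-memory stand-in for a Z39.50 server (illustration only)."""

    def __init__(self, name, records, supports_sort=False):
        self.name = name
        self.records = list(records)
        self.supports_sort = supports_sort
        self.result_set = []

    def search(self, query):
        self.result_set = [r for r in self.records if query in r]
        return len(self.result_set)

    def sort(self):
        self.result_set.sort()

    def present(self, start, count):
        return self.result_set[start:start + count]

def query_server(server, query, p):
    hits = server.search(query)          # send the search and wait for the reply
    if server.supports_sort:             # use the sort service when available
        server.sort()
    return server.name, hits, server.present(0, p)   # request the first p records

def resource_communicator(servers, query, p):
    """Run the sequence for every server concurrently and collect the replies."""
    with ThreadPoolExecutor(max_workers=len(servers)) as pool:
        replies = list(pool.map(lambda s: query_server(s, query, p), servers))
    total_hits = sum(hits for _, hits, _ in replies)            # reported to the Request Interface
    batches = {name: records for name, _, records in replies}   # handed to the Data Integrator
    return total_hits, batches

servers = [
    FakeServer("A", ["zebra book", "apple book", "car manual"], supports_sort=True),
    FakeServer("B", ["apple book", "book of kells"]),
]
total_hits, batches = resource_communicator(servers, "book", 2)
```

In this example the total is 4 although only three distinct records match, because "apple book" is held by both servers and is counted once per server.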

Module Interactions: Comments & Clarifications
- All modules work in parallel
- The number of records requested from each server may vary, depending on the server's performance and timeout, the network links and the Result Set size
- For the sake of overall system performance, the Resource Communicator detects whether a server is down, using the Profiles of the Z39.50 servers, and continues the interaction with the other modules
- The calculated number of hits is only an approximation of the actual one
- To avoid session reactivation, imposed by the server timeout, the Resource Communicator may request data from any server at any time
- A threshold value triggers the Data Integrator to ‘request data’ from the Resource Communicator
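One way to decide when the Resource Communicator should contact a server to keep its session alive is sketched below in Python. The profile layout (`timeout_seconds`) and the 80% safety margin are assumptions for illustration; the slides only say that the timeout is server dependent and that data may be requested at any time.

```python
def servers_needing_keepalive(profiles, last_contact, now, safety_margin=0.8):
    """Return the servers whose idle time is close to their own timeout.

    profiles:     server name -> {"timeout_seconds": ...} (hypothetical profile layout)
    last_contact: server name -> time of the last exchange, in seconds
    """
    due = []
    for name, profile in profiles.items():
        idle = now - last_contact[name]
        if idle >= safety_margin * profile["timeout_seconds"]:
            due.append(name)
    return sorted(due)

profiles = {"A": {"timeout_seconds": 300}, "B": {"timeout_seconds": 60}}
last_contact = {"A": 0, "B": 0}
due_early = servers_needing_keepalive(profiles, last_contact, now=50)
due_late = servers_needing_keepalive(profiles, last_contact, now=250)
```

For every server returned, the communicator would request another batch of records, which both keeps the session alive and pre-fetches data for the Data Integrator.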

[Figure: component diagram of the proposed architecture. Labels: Request Interface; Resource Communicator; Profiles of the Z39.50 Servers; Data Integrator; De-duplicator; Data Presenter; Local Result Set; Output Queue; Input Queue; Data Provider; Local Result Set Manager; Presentation Set.]

Accomplishing a search request: Component Interactions
1. The Data Provider starts to transfer data, possibly rearranging it. If the number of records it holds is less than a threshold (e.g. 5p), the Data Provider sends a ‘request data’ message to the Resource Communicator
2. While the Local Result Set Manager has fewer than a threshold (e.g. 3p) of unique records, it tries to read from the Data Provider and, for every record found, calls the De-duplicator to compare the record:
   1. The De-duplicator compares the record with the records in the Local Result Set and sends the results back to the Local Result Set Manager
   2. The Local Result Set Manager receives the results of the duplicate detection process and arranges the record into the Local Result Set
   3. If the number of new unique records in the Local Result Set reaches p, it copies the p new unique records into the Presentation Set and activates the Data Presenter
3. When the Presentation Set is filled with the p records, the Data Presenter dispatches the records to the Request Interface module and waits to receive the next ‘request data’ message from it. If the component does not receive any request during its predefined timeout period, it terminates the system
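The interplay of the two thresholds can be followed in a simplified, single-threaded Python sketch. In the architecture the components run in parallel and exchange messages; here everything happens inside one call, the Resource Communicator is replaced by a plain iterator of batches, and exact key equality stands in for the De-duplicator. The class and method names are hypothetical.

```python
from collections import deque

class Integrator:
    """Synchronous toy version of the Data Integrator components."""

    def __init__(self, batches, p, key):
        self.batches = iter(batches)  # stands in for the Resource Communicator
        self.p = p
        self.key = key
        self.provider = deque()       # Data Provider buffer
        self.seen = set()             # keys of the records in the Local Result Set
        self.pending = deque()        # unique records not yet presented

    def _refill_provider(self):
        # Data Provider: below 5p buffered records, 'request data'.
        while len(self.provider) < 5 * self.p:
            batch = next(self.batches, None)
            if batch is None:         # nothing more to fetch
                return
            self.provider.extend(batch)

    def _fill_local_result_set(self):
        # Local Result Set Manager: read while fewer than 3p unique records wait.
        while len(self.pending) < 3 * self.p:
            self._refill_provider()
            if not self.provider:
                return
            record = self.provider.popleft()
            k = self.key(record)
            if k not in self.seen:    # De-duplicator comparison
                self.seen.add(k)
                self.pending.append(record)

    def next_presentation_set(self):
        # Data Presenter: hand out at most p unique records per request.
        self._fill_local_result_set()
        count = min(self.p, len(self.pending))
        return [self.pending.popleft() for _ in range(count)]

integrator = Integrator([[1, 2, 2, 3], [3, 4, 5, 1], [6, 7]], p=2, key=lambda r: r)
sets = [integrator.next_presentation_set() for _ in range(5)]
```

With p = 2 the user receives two unique records per request, while up to 3p unique records are prepared in advance, mirroring the look-ahead that the thresholds are meant to provide.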

Component Interactions: Comments & Clarifications
- The combination of the threshold values in the Data Provider and the Local Result Set Manager controls the ‘request data’ activity towards the Resource Communicator
- The Local Result Set Manager keeps two orderings of the unique records in order to:
  - Improve the performance of the De-duplicator
  - Present the stored records and facilitate easy access to them
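A possible shape for a result set with two orderings is sketched below in Python. The slides do not say which orderings are kept, so the choice here (a dictionary keyed by a matching key for fast duplicate lookup, plus a list sorted for presentation) is an assumption, as are the field names.

```python
import bisect

class LocalResultSet:
    """Unique records reachable through two orderings (illustrative assumption)."""

    def __init__(self, match_key, sort_key):
        self.match_key = match_key
        self.sort_key = sort_key
        self._by_match = {}   # ordering 1: matching key -> record, for the De-duplicator
        self._sorted = []     # ordering 2: (sort key, insertion number, record)
        self._counter = 0

    def add(self, record):
        """Insert the record unless an equivalent one is already stored."""
        k = self.match_key(record)
        if k in self._by_match:
            return False
        self._by_match[k] = record
        # The insertion number breaks ties, so records themselves are never compared.
        bisect.insort(self._sorted, (self.sort_key(record), self._counter, record))
        self._counter += 1
        return True

    def in_presentation_order(self):
        return [record for _, _, record in self._sorted]

result_set = LocalResultSet(match_key=lambda r: r["isbn"],
                            sort_key=lambda r: r["title"].lower())
added = [
    result_set.add({"isbn": "111", "title": "Zebra"}),
    result_set.add({"isbn": "222", "title": "apple"}),
    result_set.add({"isbn": "111", "title": "Zebra (reprint)"}),
]
titles = [r["title"] for r in result_set.in_presentation_order()]
```

The dictionary answers the De-duplicator's question in constant time, while the sorted list lets the Data Presenter read the next p records without re-sorting.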

Conclusions
- The online de-duplication of results from resources accessed concurrently in a network environment:
  - Is a requirement identified by user studies
  - Is challenged by a number of issues related to:
    - The performance of the participating servers
    - Their network links
    - The complexity and cost of the duplicate detection algorithms
- These issues make any approach to applying information integration inefficient:
  - In online environments
  - Especially when large amounts of data must be processed
- In our proposed system:
  - We do not try to integrate all the results from all the resources at once
  - We attack this problem by:
    - Retrieving a small number of records, regardless of whether the servers provide de-duplicated or sorted results
    - Applying the de-duplication process to small amounts of sorted records
    - Creating a presentation set of unique records to display to the user
    - Exploiting the time the user spends reading the presented data, without wasting system resources

Future Research
- To better approximate the number of records satisfying the search request
- To derive priorities for the servers and their resources
- To select or adapt a good de-duplication algorithm for different levels of record completeness and different ways in which the servers provide records
- To optimize the number of records requested from a server
- To implement the system and evaluate its performance