1 CS 430 / INFO 430 Information Retrieval Lecture 24 Architecture of Information Retrieval Systems.

Presentation transcript:

1 CS 430 / INFO 430 Information Retrieval Lecture 24 Architecture of Information Retrieval Systems

2 Course Administration: CS 490 and CS 790 Independent Research Projects. Web Research Infrastructure: build a system to bring complete crawls of the Web from the Internet Archive to the Cornell Theory Center and make them available for researchers through a standard API. (Continues planning work carried out this semester.) There will not be an independent research project in information retrieval.

3 Course Administration: Final Examination. The final examination is on Monday, December 13, between 12:00 and 1:30. A make-up examination will be available on another date, which has not yet been chosen. The proposed date is December 9. If you would like to take the make-up examination, send a message to Anat Nidar-Levi.

4 Distributed Architecture 1: Standard Search Protocols. [Figure: user query "Find x".] Strict adherence to standards allows any user interface to search any conforming search service.

5 Distributed Architecture 1: Standard Search Protocols. Example: Z39.50 Family of Standards for Searching Library Catalogs. Content: Anglo-American Cataloguing Rules. Structure of content: MARC. Encoding rules: Basic Encoding Rules (character sets, separators, etc.). Message-passing protocol: Z39.50. Query format: Bib-1 (Boolean), Type 102 (full text). In addition, there are the underlying network standards, e.g., the Internet suite of protocols.

6 Z39.50 Principles. Servers store a set of databases with searchable indexes. Interactions are based on a session: the client opens a connection with the server(s), carries out a sequence of interactions, and then closes the connection. During the course of the session, both the server and the client remember the state of their interaction.

7 State in Z39.50. The server carries out the search, builds a result set, and saves it. Subsequent messages from the client can reference the result set. Thus the client can narrow a large set with increasingly precise requests, or can request a presentation of any record in the set, without searching the entire database again.
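
The session behaviour described on this slide can be sketched as a toy in-memory model. This is a minimal illustration only, not the Z39.50 wire protocol: the `Session` class, its method names, and the sample records are all hypothetical. It shows a server-side result set being saved under a name, refined by a later request, and presented without a fresh search of the whole database.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Toy stateful search session (hypothetical; not the real Z39.50 protocol)."""
    database: dict                                   # record id -> record text
    result_sets: dict = field(default_factory=dict)  # set name -> list of record ids

    def search(self, name, term, within=None):
        """Run a search, save the result set under `name`, and return its size."""
        # A refinement scans only the saved set, not the whole database.
        if within is not None:
            candidates = self.result_sets[within]
        else:
            candidates = list(self.database)
        hits = [rid for rid in candidates
                if term.lower() in self.database[rid].lower()]
        self.result_sets[name] = hits
        return len(hits)

    def present(self, name, start, count):
        """Return `count` records from a saved set, starting at 1-based `start`."""
        ids = self.result_sets[name][start - 1:start - 1 + count]
        return [self.database[rid] for rid in ids]


session = Session({
    "r1": "Digital libraries and metadata",
    "r2": "Metadata harvesting",
    "r3": "Web crawling",
})
print(session.search("A", "metadata"))                 # size of result set A
print(session.search("B", "harvesting", within="A"))   # refine A into B
print(session.present("B", 1, 1))                      # fetch a record from B
```

The cost of this design is that the server must hold state for every open session, which is one reason stateless protocols are easier to scale.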

8 Standard Search Protocols. Example: Z39.50 Family of Standards for Searching Library Catalogs. The Z39.50 family of standards has proved successful in a tightly knit community, where: (a) there is a strong tradition of standardization, with many professionally trained people; (b) the categories of material change gradually, allowing a slow-moving standardization process. The standardization approach has failed where these two criteria are not met. Historical note: WAIS was based on an early version of Z39.50.

9 Distributed Architecture 2: Broadcast Search (a.k.a. Federated Search). [Figure: user query "Find x" sent to an Interface Service.] An interface server broadcasts a query to each collection, combines the results, and returns them to the user. Examples: Dienst (digital library protocol), Web metasearch services.

10 Distributed Architecture 2: Broadcast Search. Interface service: can be a separate server (e.g., CGI), or run on the user's computer (e.g., applet). Protocols: in the simple version, each collection must support the same standards and protocols (e.g., Z39.50, HTTP, etc.).

11 Distributed Architecture 2: Broadcast Search. Problems with broadcast search. Performance: if any collection does not respond, the interface server waits for a timeout. Recall: if any collection does not respond, documents in that collection are not found. Ranking and duplicates: there are great difficulties in reconciling ranked lists from different collections. Broadcast searching is as bad as its weakest link! Conclusion: broadcast search does not scale beyond about five or ten collections, even with strict standardization.
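
The performance and recall problems listed on this slide can be seen in a small sketch. It assumes each collection is just a Python function returning `(doc_id, score)` pairs; the function names, the scores, and the 0.2-second timeout are invented for illustration. The merge step simply keeps the best raw score per document, which is exactly the naive reconciliation the slide warns about, because scores from different collections are not comparable.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def broadcast_search(query, collections, timeout=0.2):
    """Send `query` to every collection in parallel.

    Returns (ranked hits, names of collections that did not answer in time).
    """
    pool = ThreadPoolExecutor(max_workers=max(1, len(collections)))
    futures = {pool.submit(fn, query): name for name, fn in collections.items()}
    done, pending = wait(futures, timeout=timeout)   # the whole search waits here
    pool.shutdown(wait=False)                        # do not block on stragglers
    best = {}
    for future in done:
        for doc_id, score in future.result():
            # Duplicates: keep the highest score seen for each document.
            best[doc_id] = max(score, best.get(doc_id, score))
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))
    missing = sorted(futures[f] for f in pending)
    return ranked, missing


# Hypothetical collections: two answer at once, one is too slow.
def fast_a(query):
    return [("doc1", 0.9), ("doc2", 0.4)]

def fast_b(query):
    return [("doc2", 0.7), ("doc3", 0.5)]

def slow_c(query):
    time.sleep(1.0)
    return [("doc4", 1.0)]


ranked, missing = broadcast_search("x", {"A": fast_a, "B": fast_b, "C": slow_c})
print(ranked)    # doc4 is absent: the slow collection hurt recall
print(missing)   # the collection that timed out
```

Every query pays the full timeout whenever one collection is slow, so response time is set by the worst collection rather than the average one.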

12 Standardization: Function Versus Cost of Acceptance. [Figure: plot with axes "Function" and "Cost of acceptance"; regions labeled "Many adopters" and "Few adopters".]

13 Example: Textual Mark-up. [Figure: SGML, ASCII, HTML, and XML plotted on axes "Function" and "Cost of acceptance".]

14 Distributed Architecture 3: Centralized Search Services. Batch indexing: metadata about all items is accumulated in a central system. Real-time searching: the user (a) searches the central system, and (b) retrieves items from collections. Examples: union catalogs, Web search services. [Figure: user query "Find x"; arrows labeled "search" to the Search Service and "retrieve" to the collections.]
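
The two phases on this slide, batch indexing followed by real-time searching, can be sketched with a tiny inverted index. This is a minimal illustration under stated assumptions: the record layout (`title`, `url`), the identifiers, and the `example.org` URLs are hypothetical, and a real service would tokenize, stem, and rank rather than match whole words with a Boolean AND.

```python
from collections import defaultdict


def build_index(records):
    """Batch step: map each lowercase title word to the ids of items containing it."""
    index = defaultdict(set)
    for item_id, metadata in records.items():
        for word in metadata["title"].lower().split():
            index[word].add(item_id)
    return index


def search(index, records, query):
    """Real-time step (a): Boolean AND over the central index.

    Returns (item id, URL) pairs so the user can carry out step (b),
    retrieving each item directly from its home collection.
    """
    words = query.lower().split()
    if not words:
        return []
    ids = set.intersection(*(index.get(word, set()) for word in words))
    return [(item_id, records[item_id]["url"]) for item_id in sorted(ids)]


# Hypothetical metadata accumulated from two collections.
records = {
    "nsdl:1": {"title": "Plate tectonics lesson plan", "url": "http://example.org/a/1"},
    "nsdl:2": {"title": "Plate boundaries map", "url": "http://example.org/b/7"},
}
index = build_index(records)
print(search(index, records, "plate"))
print(search(index, records, "plate map"))
```

Because the query touches only the central index, a slow or unavailable collection affects retrieval of its own items but not the search itself.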

15 Distributed Architecture 3: Centralized Search Services. Gathering by Web crawling. Entirely automatic, low cost. Highly efficient at gathering very large amounts of material. But: can only gather openly accessible materials; cannot gather material in databases unless explicit URLs are known; cannot easily make use of metadata provided by collections. Examples: Web search services.

16 Distributed Architecture 3: Centralized Search Services. Harvesting. Each collection makes a copy of its metadata available from a server associated with the collection. A search service harvests metadata from all collections on a regular cycle and builds a central search system. Advantages: can index material from databases without explicit URLs; allows authentication and selection of material. But: requires that collections have metadata and support a harvesting protocol (e.g., Open Archives Initiative Protocol for Metadata Harvesting).
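
The regular harvesting cycle described on this slide might look like the following sketch. It is a simulation, not a protocol client: the `Collection` class, the record tuples, and the dates are hypothetical. The point it illustrates is that the search service remembers when it last visited each collection and asks only for records changed since then.

```python
from datetime import date


class Collection:
    """Hypothetical data provider holding (identifier, datestamp, metadata) records."""

    def __init__(self, records):
        self.records = list(records)

    def list_records(self, since=None):
        """Return all records, or only those stamped on or after `since`."""
        return [r for r in self.records if since is None or r[1] >= since]


def harvest(collections, central, last_harvest, today):
    """One harvesting cycle: copy new or changed metadata into the central store.

    Returns the number of records fetched in this cycle.
    """
    fetched = 0
    for name, collection in collections.items():
        for identifier, _stamp, metadata in collection.list_records(last_harvest.get(name)):
            central[identifier] = metadata   # re-harvesting a record just overwrites it
            fetched += 1
        last_harvest[name] = today
    return fetched


central, last_harvest = {}, {}
physics = Collection([("phys:1", date(2004, 11, 1), {"title": "Pendulum lab"})])
print(harvest({"physics": physics}, central, last_harvest, date(2004, 11, 10)))

# The collection adds a record; the next cycle fetches only that one.
physics.records.append(("phys:2", date(2004, 11, 20), {"title": "Optics demo"}))
print(harvest({"physics": physics}, central, last_harvest, date(2004, 12, 1)))
print(sorted(central))
```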

17 Open Archives Initiative Protocol for Metadata Harvesting. Low-barrier protocol for exposing structured information (metadata) from cooperating repositories. Provides an opportunity for building a comprehensive service network.

18 OAI-PMH: A simple two-party model for sharing structured information. [Figure: Service Providers (Discovery, Current Awareness, Preservation) and Data Providers, linked by metadata harvesting.]

19 OAI-PMH Key Technical Features. Simple HTTP encoding. Built on established XML standards. Multiple metadata formats, but Dublin Core required. Repository partitioning (sets). Selective harvesting (sets and dates). Clean partition between core and implementation-specific extensions: multiple item-level metadata; collection-level metadata.

20 OAI Verbs. Identify – repository characteristics; ListMetadataFormats – DC required; ListSets – repository partitioning; ListRecords – (selectively) harvest metadata; ListIdentifiers – (selectively) harvest metadata identifiers; GetRecord – known item retrieval.
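
Because OAI-PMH uses a simple HTTP encoding, each verb above becomes an ordinary URL with query parameters. The sketch below only builds request URLs and does not contact any server. The helper `oai_request`, the base URL `http://example.org/oai`, the set name, and the identifier are hypothetical; the verb names and the `metadataPrefix`, `from`, `set`, and `identifier` arguments, along with the `oai_dc` prefix for Dublin Core, come from the protocol itself.

```python
from urllib.parse import urlencode

VERBS = {"Identify", "ListMetadataFormats", "ListSets",
         "ListRecords", "ListIdentifiers", "GetRecord"}


def oai_request(base_url, verb, **args):
    """Build an OAI-PMH request URL (hypothetical helper; performs no network I/O)."""
    if verb not in VERBS:
        raise ValueError(f"unknown OAI verb: {verb}")
    params = {"verb": verb}
    for key, value in args.items():
        if value is not None:
            # `from` is a Python keyword, so callers pass it as `from_`.
            params[key.rstrip("_")] = value
    return base_url + "?" + urlencode(params)


BASE = "http://example.org/oai"   # hypothetical repository address

# Repository characteristics.
print(oai_request(BASE, "Identify"))

# Selective harvesting by date and set, in the required Dublin Core format.
print(oai_request(BASE, "ListRecords", metadataPrefix="oai_dc",
                  from_="2004-11-01", set="physics"))

# Known-item retrieval.
print(oai_request(BASE, "GetRecord", metadataPrefix="oai_dc",
                  identifier="oai:example.org:item1"))
```

A harvester would fetch each URL over HTTP and parse the XML response; long lists are returned in chunks, each continued by passing a `resumptionToken` argument.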

21 The National Science Digital Library. The integration task is to provide a coherent set of collections and services across great diversity (all digital collections relevant to science education).

22 Interoperability in the NSDL: The Problem. Conventional approaches require partners to support agreements (technical, content, and business). But NSDL needs thousands of very different partners, most of whom are not directly part of the NSDL program. The challenge is to create incentives for independent digital libraries to adopt agreements.

23 Architecture for Searching: Basic Assumptions. The integration team will not manage most of the collections. The integration team will not create most of the metadata.

24 The NSDL Search Service. Full text or metadata? Full text indexing is excellent, but is not possible for all materials (non-textual, no access for indexing). Comprehensive metadata is available for very few of the materials. What architecture to use? Few collections support an established search protocol (e.g., Z39.50).

25 NSDL: The Spectrum of Interoperability
Level | Agreements | Example
Federation | Strict use of standards (syntax, semantic, and business) | AACR, MARC, Z39.50
Harvesting | Digital libraries expose metadata; simple protocol and registry | Open Archives metadata harvesting
Gathering | Digital libraries do not cooperate; services must seek out information | Web crawlers and search engines

26 The NSDL Repository. The repository is a resource for service providers. It holds information about every collection and item known to the NSDL, including contextual information. [Figure: Users, Services, Collections, and the NSDL Repository.]

27 NSDL Search Service: First Phase. [Figure: Portal, Search and Discovery Service, Collections, and NSDL Repository; links labeled SDLIP, OAI harvest, and crawl; search engine: Inquery -> Lucene.]

28 NSDL Search Service: First Phase. Approach: (a) Collections map metadata to Dublin Core and provide it via the Open Archives protocol. (b) Search service augments Dublin Core metadata with indexing of full text where available. (c) User interface returns snippets derived from the metadata, with links to full content and to metadata.

29 NSDL Search Service: First Phase. Weaknesses: (a) Ranking by similarity to query is not sufficient. (b) Snippets do not indicate why an item was returned (e.g., terms in full text but not in metadata). (c) Dublin Core records provide limited information. (d) Browsing environment is limited. (e) Most users begin their search with a Web search engine (e.g., Google).

30 NSDL Search Service: Second Phase Developments. Metadata: (a) Accept any metadata that is available, in a range of formats. (b) System for reviews and annotations, with reputation management. Search system: (a) Multimodal retrieval and ranking. (b) Dynamic generation of snippets by the search engine.
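
Dynamic snippet generation, as opposed to a fixed snippet taken from the metadata, can be sketched as follows. This is a minimal illustration of the idea and not the NSDL implementation: the function name, the 40-character window, and the sample text are assumptions. The snippet is cut from around the first place a query term occurs, so it shows the user why the item matched.

```python
def dynamic_snippet(text, query, width=40):
    """Return a window of `text` around the earliest query term found.

    Falls back to the start of the text when no term occurs.
    """
    lowered = text.lower()
    positions = [lowered.find(term) for term in query.lower().split()]
    positions = [p for p in positions if p >= 0]
    if not positions:
        return text[:width]
    start = max(0, min(positions) - width // 2)   # centre the window on the match
    return text[start:start + width]


full_text = ("The National Science Digital Library collects resources for "
             "science education, including lesson plans on plate tectonics.")

print(dynamic_snippet(full_text, "tectonics"))   # window around the matching term
print(full_text[:40])                            # a fixed snippet misses the term
```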

31 NSDL Search Service: Second Phase Developments (cont.). Usability and human factors: (a) Wider range of browsing tools (e.g., collection visualization). (b) Filters by education level and education quality, where known. Web compatibility: (a) Expose records for Web crawlers to index. (b) Browser bookmarklet to add NSDL information to Web pages.