1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems.

Presentation transcript:

1 CS 430: Information Discovery Lecture 26 Architecture of Information Retrieval Systems

2 Course Administration CS 490 and CS 790 Independent Research Projects Next semester, the National Science Digital Library (NSDL) will have several projects that are suitable for independent research. Topics will include selective web crawling, annotation services and information discovery. If you are interested, send to Assignment 4 due date If you want an extension to 5 p.m. on Monday, send to

3 Cornell SIGCHI - Student Chapter Interested in user interface design? Cornell has a chapter of the ACM’s special interest group for computer human interaction (SIGCHI) Meeting tonight in the Information Science Building (301 College Ave.) at 5pm. For info: ajf15 or google: cornell sigchi Benefits of membership include but are not limited to: access to a network of students and faculty interested in HCI, help with securing professional or graduate positions in HCI, access to information about courses and research on campus, access to the resources of the Cornell HCI lab, regular meetings to share ideas and have fun, cool magazine if you join national org.

4 Basic Architecture 1: Single Homogeneous Collection. Documents and indexes are held on a single computer system (which may be several computers). The user interface and search methods are selected for the specific service. Examples: Medline (medical information), Westlaw (legal information). Diagram labels: Index, Documents.

5 Basic Architecture 2: Several Similar Collections -- One Computer System Several more or less similar collections are held on a single computer system. Each collection is indexed separately using the same software, procedures, algorithms, etc. (with minor differences, e.g., stoplists). The user interface is the same (or very similar) for each service. Examples: OCLC's FirstSearch

6 Distributed Architecture 1: Standard Search Protocols. Strict adherence to standards allows any user interface to search any conforming search service. Diagram label: Find x.

7 Distributed Architecture 1: Standard Search Protocols. Example: Z39.50 Family of Standards for Searching Library Catalogs. Content: Anglo-American Cataloguing Rules (AACR). Structure of Content: MARC. Encoding Rules: Basic Encoding Rules (character sets, separators, etc.). Message Passing Protocol: Z39.50. Query Format: Bib-1 (Boolean), Type 102 (full text). In addition, there are the underlying network standards, e.g., the Internet suite of protocols.

8 Distributed Architecture 1: Standard Search Protocols. Example: Z39.50 Family of Standards for Searching Library Catalogs. The Z39.50 family of standards has proved successful in a tightly knit community, where: (1) there is a strong tradition of standardization, with many professionally trained people; (2) the categories of material change gradually, allowing a slow-moving standardization process. The standardization approach has failed where these two criteria are not met. Historic note: WAIS was based on an early version of Z39.50.

9 Distributed Architecture 2: Broadcast Search. An interface server broadcasts a query to each collection, combines the results and returns them to the user. Examples: Dienst (digital library protocol), Web metasearch services. Diagram labels: Find x, Interface Service.

10 Distributed Architecture 2: Broadcast Search. Interface Service: Can be a separate server (e.g., CGI), or run on the user's computer (e.g., applet). Protocols: In the simple version, each collection must support the same standards and protocols (e.g., Z39.50, HTTP, etc.).

11 Distributed Architecture 2: Broadcast Search. Problems with Broadcast Search. Performance: If any collection does not respond, the interface server waits for a timeout. Recall: If any collection does not respond, documents in that collection are not found. Ranking and duplicates: There are great difficulties in reconciling ranked lists from different collections. Broadcast searching is only as good as its weakest link! Conclusion: Broadcast search does not scale beyond about five or ten collections, even with strict standardization.
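
The performance, recall and ranking problems above can be seen in a minimal Python sketch. It is an illustration only, not the design of Dienst or of any metasearch service: the function name broadcast_search, the timeout value, and the idea that each collection is a local function returning (doc_id, score) pairs are all assumptions made for the example.

```python
import concurrent.futures as cf


def broadcast_search(query, collections, timeout=0.2):
    """Send a query to every collection in parallel and merge the answers.

    collections maps a collection name to a search function (a hypothetical
    stand-in for a remote service) that returns (doc_id, score) pairs.
    Returns (results, missing): the merged ranked list, and the names of
    collections that timed out or failed.
    """
    pool = cf.ThreadPoolExecutor(max_workers=max(1, len(collections)))
    futures = {pool.submit(fn, query): name for name, fn in collections.items()}

    # Performance problem: the caller waits up to `timeout` for slow services.
    done, not_done = cf.wait(futures, timeout=timeout)

    # Recall problem: documents held by these collections are never seen.
    missing = [futures[f] for f in not_done]

    merged = {}
    for fut in done:
        if fut.exception() is not None:
            missing.append(futures[fut])
            continue
        for doc_id, score in fut.result():
            # Duplicates: keep one entry per document id.
            # Ranking problem: raw scores from different collections are
            # compared directly here, which is not justified in general.
            merged[doc_id] = max(score, merged.get(doc_id, score))

    pool.shutdown(wait=False)
    results = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return results, sorted(missing)
```

Every additional collection adds another chance of a timeout and another incompatible scoring scale, which is consistent with the conclusion that the approach stops scaling at around five to ten collections.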

12 Distributed Architecture 3: Centralized Search Services. Batch indexing: Metadata about all items is accumulated in a central system. Real-time searching: The user (a) searches the central system, and (b) retrieves items from collections. Examples: Union catalogs, Web search services. Diagram labels: Find x, Search Service, search, retrieve.
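
The two phases of a centralized search service can be sketched as follows. This is a toy example under stated assumptions: the class name CentralIndex, its methods, and the record fields are hypothetical, and a real service would use a proper inverted file with ranking rather than a plain Boolean AND.

```python
from collections import defaultdict


class CentralIndex:
    """Toy central index over metadata records gathered from many collections."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> set of item ids
        self.locations = {}               # item id -> (collection, url)

    def add_record(self, item_id, collection, url, text):
        """Batch indexing: accumulate metadata about one item centrally."""
        self.locations[item_id] = (collection, url)
        for term in text.lower().split():
            self.postings[term].add(item_id)

    def search(self, query):
        """Real-time step (a): search the central system only."""
        terms = query.lower().split()
        if not terms:
            return []
        hits = set.intersection(*(self.postings.get(t, set()) for t in terms))
        return sorted(hits)

    def locate(self, item_id):
        """Real-time step (b): find where to retrieve the item itself."""
        return self.locations[item_id]
```

The search never contacts the collections, so a slow or unavailable collection affects only retrieval of its own items, not the query as a whole.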

13 Distributed Architecture 3: Centralized Search Services. Gathering by Web Crawling. Entirely automatic, low cost. Highly efficient at gathering very large amounts of material. But: Can only gather openly accessible materials. Cannot gather material in databases unless explicit URLs are known. Cannot easily make use of metadata provided by collections. Examples: Web search services.
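
A breadth-first crawl over a link graph shows why material without an explicit URL is never gathered. The sketch below is illustrative only: crawl and fetch_links are hypothetical names, and fetch_links stands in for fetching a page and extracting its links.

```python
from collections import deque


def crawl(seed_urls, fetch_links):
    """Breadth-first crawl starting from seed_urls.

    fetch_links(url) is a caller-supplied function (an assumption of this
    sketch) that returns the URLs linked from a page. Returns the URLs in
    the order they were visited.
    """
    seen = set()
    queue = deque()
    for url in seed_urls:
        if url not in seen:
            seen.add(url)
            queue.append(url)

    visited = []
    while queue:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            # Only pages reachable through links are ever discovered.
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

A record that exists only behind a database query form has no inbound link, so it never enters the queue, however long the crawl runs.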

14 Distributed Architecture 3: Centralized Search Services. Harvesting. Each collection makes a copy of its metadata available from a server associated with the collection. A search service harvests metadata from all collections on a regular cycle and builds a central search system. Advantages: Can index material from databases without explicit URLs. Allows authentication and selection of material. But: Requires that collections have metadata and support a harvesting protocol (e.g., Open Archives Initiative Protocol for Metadata Harvesting).
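
One harvesting cycle with the Open Archives protocol might look roughly like the sketch below. The verb ListRecords, the arguments metadataPrefix, from and resumptionToken, and the XML namespaces come from OAI-PMH 2.0; everything else is an assumption for illustration: the helper names, the base URL, and the SAMPLE response, which is hand-written rather than taken from a real repository. The sketch builds request URLs and parses a response but makes no network calls.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hand-written sample response, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2002-11-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Plate Tectonics</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>tok123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, since=None, token=None):
    """Build a ListRecords request URL for one collection's metadata server."""
    if token:
        # A resumption token is sent on its own, with only the verb.
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
        if since:
            params["from"] = since  # harvest only records changed since then
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return ([(identifier, title), ...], resumption_token_or_None)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        ident = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        title = rec.findtext(".//dc:title", default="", namespaces=NS)
        records.append((ident, title))
    token = root.findtext(".//oai:resumptionToken", namespaces=NS)
    return records, (token or None)
```

A harvester would request list_records_url(base, since=last_harvest_date), feed the parsed records to the central index, and repeat with the returned token until none is returned.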

15 The National Science Digital Library (NSDL). The Integration Task is to provide a coherent set of collections and services across great diversity (all digital collections relevant to science education).

16 Interoperability in the NSDL. The Problem: Conventional approaches require partners to support agreements (technical, content, and business). But NSDL needs thousands of very different partners, most of whom are not directly part of the NSDL program. The challenge is to create incentives for independent digital libraries to adopt agreements.

17 The NSDL Search Service. Full Text or Metadata? Full text indexing is excellent, but is not possible for all materials (non-textual, no access for indexing). Comprehensive metadata is available for very few of the materials. What Architecture to Use? Few collections support an established search protocol (e.g., Z39.50).

18 Architecture for Searching. Basic Assumptions: The integration team will not manage any collections. The integration team will not create any metadata.

19 Options for Effective Information Retrieval Comprehensive metadata with Boolean retrieval (e.g., monograph catalog). Can be excellent for well-understood categories of material, but requires expensive metadata, which is rarely available. Full text indexing with ranked retrieval (e.g., news articles). Excellent for relatively homogeneous textual material. Full text indexing with contextual information and ranked retrieval (e.g., Google). Excellent for mixed textual information with rich structure. Contextual information about non-textual materials and ranked retrieval (e.g., Google image retrieval). Promising, but still experimental.
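
The difference between Boolean retrieval and ranked retrieval can be shown on a toy collection. This is a minimal sketch, not the method of any system named above: the function names are hypothetical, Boolean retrieval is reduced to an AND of all query terms, and ranking uses a simple term-frequency times log(1 + N/df) weight chosen for illustration.

```python
import math
from collections import Counter


def boolean_and(query, docs):
    """Boolean retrieval: return ids of documents containing every query term."""
    terms = set(query.lower().split())
    return sorted(d for d, text in docs.items()
                  if terms <= set(text.lower().split()))


def ranked(query, docs):
    """Ranked retrieval: score documents by summed tf * idf of matching terms."""
    n = len(docs)
    tokens = {d: Counter(text.lower().split()) for d, text in docs.items()}
    scores = {}
    for term in query.lower().split():
        df = sum(1 for counts in tokens.values() if term in counts)
        if df == 0:
            continue  # term occurs nowhere in the collection
        idf = math.log(1 + n / df)
        for d, counts in tokens.items():
            if counts[term]:
                scores[d] = scores.get(d, 0.0) + counts[term] * idf
    # Highest score first; ties broken by document id.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

Boolean matching returns an unordered set and drops partial matches, while ranking returns every document that shares a term with the query, ordered by estimated relevance.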

20 The Spectrum of Interoperability

Level        Agreements                                Example
Federation   Strict use of standards                   AACR, MARC, Z39.50
             (syntax, semantic, and business)
Harvesting   Digital libraries expose metadata;        Open Archives metadata
             simple protocol and registry              harvesting
Gathering    Digital libraries do not cooperate;       Web crawlers and
             services must seek out information        search engines

21 The Metadata Repository. The metadata repository is a resource for service providers. It holds information about every collection and item known to the NSDL, including contextual information. Diagram labels: Users, Services, Metadata repository, Collections.

22 Search Service. Diagram labels: Portal, Search and Discovery Service, Collections, SDLIP, Harvest, Crawl, Metadata Repository.

23 Acknowledgements. The NSDL is a program of the National Science Foundation's Directorate for Education and Human Resources, Division of Undergraduate Education. The NSDL Core Integration is a collaboration between the University Corporation for Atmospheric Research (Dave Fulker), Columbia University (Kate Wittenberg) and Cornell University (Bill Arms). The Technical Director is Carl Lagoze (Cornell University). The Search Service is being developed by James Allan and colleagues at the University of Massachusetts, Amherst.

24 CS 430 Information Discovery The End!