CS 430 / INFO 430 Information Retrieval


CS 430 / INFO 430 Information Retrieval Lecture 18 Metadata 5

Course Administration

Effective Information Discovery Before Digital Information Searching
(a) Resources separated into categories of related materials. Each category organized, indexed and searched separately.
(b) Catalogs and indexes built on tightly controlled metadata standards, e.g., MARC, MeSH headings, etc.
(c) Search engines used Boolean operators and fielded searching.
(d) Query languages and search interfaces assumed a trained user.
(e) Resources were physical items.

Effective Information Discovery With Homogeneous Digital Information
• Comprehensive metadata with Boolean retrieval. Can be excellent for well-understood categories of material, but requires standardized metadata and relatively homogeneous content (e.g., MARC catalog).
• Full text indexing with ranked retrieval. Can be excellent, but the methods were developed and validated for relatively homogeneous textual material (e.g., TREC ad hoc track).
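A minimal sketch of Boolean retrieval over fielded metadata, to make the first approach concrete. The records, the field names and the helper functions are hypothetical; they are not MARC tags and are not taken from the lecture.

```python
# Hypothetical catalog records with standardized fields (illustrative only).
RECORDS = [
    {"id": 1, "title": "introduction to volcanoes", "subject": "geology"},
    {"id": 2, "title": "bird song recordings", "subject": "ornithology"},
    {"id": 3, "title": "earthquakes and volcanoes", "subject": "geology"},
]


def match_field(records, field, term):
    """Return the ids of records whose field contains the term as a whole word."""
    return {r["id"] for r in records if term in r[field].split()}


def boolean_and(*id_sets):
    """A record matches only if it appears in every operand set."""
    return set.intersection(*id_sets)


def boolean_or(*id_sets):
    """A record matches if it appears in at least one operand set."""
    return set.union(*id_sets)


# Fielded query: title:volcanoes AND subject:geology
hits = boolean_and(
    match_field(RECORDS, "title", "volcanoes"),
    match_field(RECORDS, "subject", "geology"),
)
print(sorted(hits))
```

The result is an unranked set: a record either satisfies the query or it does not, which is why this style depends on consistent, standardized metadata.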

Mixed Content
Examples: NSDL-funded collections at Cornell
• Atlas. Data sets of earthquakes, volcanoes, etc.
• Reuleaux. Digitized kinematics models from the nineteenth century.
• Laboratory of Ornithology. Sound recordings, images, videos of birds and other animals.
• Nuprl. Logic-based tools to support programming and to implement formal computational mathematics.

Mixed Metadata: the Chimera of Standardization
• Technical reasons
• Characteristics of formats and genres
• Differing user needs
• Social and cultural reasons
• Economic factors
• Installed base

Information Discovery in a Messy World
Building blocks
• Brute force computation
• The expertise of users -- human in the loop
Methods
(a) Better understanding of how and why users seek information
(b) Relationships and context information
(c) Multi-modal information discovery
(d) User interfaces for exploring information

Understanding How and Why Users Seek Information
Homogeneous content
• All documents are assumed equal
• Criterion is relevance (binary measure)
• Goal is to find all relevant documents (high recall)
• Hits ranked in order of similarity to query
Mixed content
• Some documents are more important than others
• Goal is to find the most useful documents on a topic and then browse
• Hits ranked in an order that combines importance and similarity to query
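One way to picture a ranking "that combines importance and similarity to query" is a weighted blend of the two scores. The sketch below is an assumption for illustration only: the linear formula, the weight parameter and the sample scores are hypothetical and are not the method used by any particular search service.

```python
def combined_score(similarity, importance, weight=0.5):
    """Linear blend of query similarity and query-independent importance.

    Both inputs are assumed to be normalized to the range [0, 1].
    weight=1.0 ranks purely by similarity; weight=0.0 purely by importance.
    """
    return weight * similarity + (1.0 - weight) * importance


def rank(hits, weight=0.5):
    """Sort hits from highest to lowest combined score."""
    return sorted(
        hits,
        key=lambda h: combined_score(h["similarity"], h["importance"], weight),
        reverse=True,
    )


# Hypothetical hits with made-up scores.
hits = [
    {"id": "a", "similarity": 0.9, "importance": 0.1},
    {"id": "b", "similarity": 0.7, "importance": 0.8},
    {"id": "c", "similarity": 0.4, "importance": 0.9},
]
print([h["id"] for h in rank(hits)])              # blended ordering
print([h["id"] for h in rank(hits, weight=1.0)])  # similarity only
```

With similarity alone, item "a" comes first; once importance is mixed in, the more important items "b" and "c" move ahead of it.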

Case Study Information discovery in the National Science Foundation's National Science Digital Library (NSDL). The goal of the NSDL is to be a digital library for all aspects of science education, where science and education are very broadly defined. http://nsdl.org

Why Technology in Education? Why a Digital Library for Education?
• Higher Education. U.S. higher education is the best in the world, but it is very expensive. How can we keep the quality while lowering the cost?
• K-12. The best K-12 education in the U.S. is excellent, but much is mediocre or worse. How can the best be made available to all?
Technology-enhanced education offers a way to increase the productivity of the skilled people who teach in both higher and K-12 education.

Why a Digital Library for Science Education?
Excellent teaching materials have been developed... but they are not being used effectively.
The NSDL provides organization and access for teachers and students:
• Preservation and reuse.
• Searching and browsing.
• Links between teaching materials and their educational use.

The NSDL Architecture
Educational materials are scattered across the Internet.
[Figure: example sources of educational materials: State standards, Math Forum, NASA, Ask a Scientist, Scientific American.]

The NSDL Architecture: Basic Assumptions
• The NSDL is a partnership of organizations who manage collections and provide educational and library services.
• There is a central team to integrate the parts and provide central services.
• The central team does not manage any collections and does not create any metadata.

Architectural Assumptions: One Library, Many Portals
Different Groups of Users Need Different Views of the Library
• Central portal for general users.
• Development portal for library developers.
• Pathways portals by discipline (e.g., mathematics) and educational level (e.g., middle school).

Architectural Assumptions: A Spectrum of Interoperability
The Problem
• Conventional approaches require partners to support agreements (technical, content, and business).
• But NSDL needs thousands of very different partners ... most of whom are not directly part of the NSDL program.
The challenge is to create incentives for independent digital libraries to adopt agreements.

Approaches to interoperability
The conventional approach
• Wise people develop standards: protocols, formats, etc.
• Everybody implements the standards.
• This creates an integrated, distributed system.
Unfortunately ...
• Standards are expensive to adopt.
• Concepts are continually changing.
• Systems are continually changing.
• Different people have different ideas.

Interoperability is about agreements
• Technical agreements cover formats, protocols, security systems, etc., so that messages can be exchanged.
• Content agreements cover the data and metadata, and include semantic agreements on the interpretation of the messages.
• Organizational agreements cover the ground rules for access, for changing collections and services, payment, authentication, etc.
The challenge is to create incentives for independent digital libraries to adopt agreements.

Function versus cost of acceptance
[Figure: function plotted against cost of acceptance, with regions labeled "Few adopters" and "Many adopters".]

Example: security
[Figure: function plotted against cost of acceptance for three options: IP address, login ID and password, public key infrastructure.]

Example: metadata standards
[Figure: function plotted against cost of acceptance for three options: free text, Dublin Core, MARC.]

NSDL: The Spectrum of Interoperability
Level        Agreements                                Example
Federation   Strict use of standards                   AACR, MARC, Z 39.50
             (syntax, semantic, and business)
Harvesting   Digital libraries expose metadata;        Open Archives metadata harvesting
             simple protocol and registry
Gathering    Digital libraries do not cooperate;       Web crawlers and search engines
             services must seek out information

Architecture: the NSDL Repository The Repository holds information about every collection and item known to the NSDL.

Standards Implemented in the NSDL Repository: Phase 1
• Object model: [Figure: a Collection with collection metadata and a URL; Items with item metadata and a URL.]
• Metadata: Dublin Core with educational extensions
• Ingest and redistribution: Open Archives Initiative, Protocol for Metadata Harvesting
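A minimal sketch of what harvesting Dublin Core records over the Open Archives protocol looks like from the harvester's side. The base URL, the sample response and the helper names are hypothetical; only the ListRecords verb, the oai_dc metadata prefix and the XML namespaces come from the OAI-PMH and Dublin Core specifications. No network request is made, and details such as resumption tokens and the educational extensions are left out.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for a repository's OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


# Hypothetical, heavily trimmed response from a collection (not real NSDL data).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:item-1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Earthquake data set</dc:title>
          <dc:identifier>http://example.org/items/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def parse_records(xml_text):
    """Extract the OAI identifier, title and item URL from each harvested record."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "oai_id": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "url": rec.findtext(".//dc:identifier", namespaces=NS),
        })
    return records


print(list_records_url("http://example.org/oai"))
print(parse_records(SAMPLE))
```

Each parsed record pairs item metadata with a URL, which mirrors the object model on this slide.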

The NSDL Search Service
Full Text or Metadata?
• Full text indexing is excellent, but is not possible for all materials (non-textual, no access for indexing).
• Comprehensive metadata is available for very few of the materials.
What Architecture to Use?
• Few collections support an established search protocol (e.g., Z39.50).

NSDL Search Service: Phase 1
[Figure: the NSDL Repository, the Search Service and the Collections, connected by http.]
The search service combines metadata from the Repository and full text from the collections.

NSDL Search Service: Phase 1
Approach
• Collections map metadata to Dublin Core and provide it via the Open Archives protocol.
• Search service augments Dublin Core metadata with indexing of full text where available.
• The search engine is Lucene (tf.idf weighting).
• User interface returns snippets derived from the metadata, links to full content and to metadata.
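A minimal sketch of tf.idf weighting, the general idea behind the ranking mentioned above. It uses one common textbook form (term frequency times log(N / document frequency)) and does not reproduce Lucene's scoring formula. The documents and function names are hypothetical.

```python
import math
from collections import Counter


def tf_idf_index(docs):
    """Map each document id to a dict of term -> tf * log(N / df)."""
    n = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}
    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for tokens in tokenized.values():
        df.update(set(tokens))
    index = {}
    for doc_id, tokens in tokenized.items():
        tf = Counter(tokens)
        index[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return index


def score(index, query):
    """Rank documents by the summed tf.idf weight of the query terms."""
    terms = query.lower().split()
    scores = {
        doc_id: sum(weights.get(t, 0.0) for t in terms)
        for doc_id, weights in index.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical metadata or full-text strings, one per item.
DOCS = {
    "d1": "volcano eruption data",
    "d2": "bird song data",
    "d3": "volcano volcano images",
}
index = tf_idf_index(DOCS)
print(score(index, "volcano"))
```

A term that appears often in one document but in few documents overall gets the highest weight, so "d3" ranks above "d1" for the query "volcano".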

NSDL Search Service: Phase 1
Weaknesses
(a) Ranking by similarity to query is not sufficient (e.g., no ranking by grade level).
(b) Snippets do not indicate why an item was returned (e.g., terms in full text but not in metadata).
(c) Dublin Core records provide limited information.
(d) Browsing environment limited.
(e) Many users begin their search with a Web search engine (e.g., Google or Yahoo).

NSDL and the Web
Many people will find NSDL materials through Web search engines. Therefore the NSDL must be indexed by them.
[Figure: the NSDL Repository and the Collections, reached over http.]

NSDL Search Service: Second Phase Developments
Metadata
• Accept any metadata that is available in a range of formats
• System for reviews and annotations, with reputation management
Search system
• Multimodal retrieval and ranking
• Dynamic generation of snippets by search engine
This work is currently in progress. The first stage is to reimplement the Repository to manage relationships among resources.
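A minimal sketch of dynamic snippet generation: instead of a fixed summary taken from the metadata, the snippet is cut from the text around the first query term that matches, so the user can see why the item was returned. The function name, window size and sample text are hypothetical, and this is not the NSDL implementation.

```python
def make_snippet(text, query, width=5):
    """Return a window of words around the first query term found in text.

    Returns None when no query term occurs in the text.
    """
    words = text.split()
    terms = {t.lower() for t in query.split()}
    for i, word in enumerate(words):
        # Compare case-insensitively and ignore surrounding punctuation.
        if word.lower().strip(".,;:()") in terms:
            start = max(0, i - width)
            end = min(len(words), i + width + 1)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    return None


# Hypothetical full text of an item.
TEXT = ("The Atlas collection holds data sets of earthquakes, volcanoes "
        "and plate boundaries for classroom use.")
print(make_snippet(TEXT, "volcanoes", width=2))
```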

NSDL Search Service: Second Phase Developments (cont.)
Usability and human factors
• Wider range of browsing tools (e.g., collection visualization)
• Filters by education level and education quality, where known
Web compatibility
• Expose records for Web crawlers to index
• Browser bookmarklet to add NSDL information to Web pages

Relationship and Contextual Information
Methods for capturing context
• Analysis of citations and links (e.g., PageRank)
• Mining usage logs (e.g., customers who buy the same product)
• Reviews (e.g., reputation management)
• Structural relationships (e.g., domain names)
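A minimal sketch of link analysis in the style of PageRank, using plain power iteration on a small graph. The graph, the damping value of 0.85 and the iteration count are assumptions for illustration; nothing here is taken from an NSDL implementation.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank scores by power iteration.

    links maps each node to the list of nodes it links to.
    Nodes with no outgoing links spread their rank evenly over all nodes.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same small baseline share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: "a" and "b" both link to "c"; "c" links back to "a".
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(LINKS)
print(sorted(ranks, key=ranks.get, reverse=True))
```

The score depends only on the link structure, not on any query, which is what makes it usable as an importance signal alongside similarity to the query.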

Acknowledgements
The NSDL is a program of the National Science Foundation's Directorate for Education and Human Resources, Division of Undergraduate Education.
The NSDL Core Integration is a collaboration between the University Corporation for Atmospheric Research, Columbia University, and Cornell University.
The initial version of the Search Service was developed by James Allan and colleagues at the University of Massachusetts, Amherst. The current version was developed by Naomi Dushay and colleagues at Cornell University.