CS 430 / INFO 430 Information Retrieval

Lecture 18 Metadata 5

Course Administration

Effective Information Discovery Before Digital Information
Searching
• Resources separated into categories of related materials.
• Each category organized, indexed and searched separately.
• Catalogs and indexes built on tightly controlled metadata standards, e.g., MARC, MeSH headings, etc.
• Search engines used Boolean operators and fielded searching.
• Query languages and search interfaces assumed a trained user.
• Resources were physical items.
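To make Boolean, fielded searching concrete, here is a minimal Python sketch. The records, the field names, and the `build_index` and `lookup` helpers are hypothetical illustrations; they are not taken from MARC, MeSH, or any real catalog system.

```python
# Hypothetical catalog records with controlled fields (not real MARC data).
RECORDS = [
    {"id": 1, "author": "darwin", "subject": {"evolution", "biology"}},
    {"id": 2, "author": "lyell", "subject": {"geology"}},
    {"id": 3, "author": "darwin", "subject": {"geology", "coral reefs"}},
]

def build_index(records):
    """Map each (field, term) pair to the set of record ids that contain it."""
    index = {}
    for rec in records:
        index.setdefault(("author", rec["author"]), set()).add(rec["id"])
        for term in rec["subject"]:
            index.setdefault(("subject", term), set()).add(rec["id"])
    return index

def lookup(index, field, term):
    """Return the ids matching one fielded term (empty set if none)."""
    return index.get((field, term), set())

INDEX = build_index(RECORDS)

# Boolean operators become set operations on the posting sets.
# author:darwin AND subject:geology
and_hits = lookup(INDEX, "author", "darwin") & lookup(INDEX, "subject", "geology")
# author:darwin OR subject:geology
or_hits = lookup(INDEX, "author", "darwin") | lookup(INDEX, "subject", "geology")
# author:darwin NOT subject:geology
not_hits = lookup(INDEX, "author", "darwin") - lookup(INDEX, "subject", "geology")
```

The result is an unranked set: a record either satisfies the query or it does not, which is one reason such interfaces assumed a trained user.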

Effective Information Discovery With Homogeneous Digital Information
Comprehensive metadata with Boolean retrieval
• Can be excellent for well-understood categories of material, but requires standardized metadata and relatively homogeneous content (e.g., MARC catalog).
Full text indexing with ranked retrieval
• Can be excellent, but methods were developed and validated for relatively homogeneous textual material (e.g., TREC ad hoc track).

Mixed Content Examples: NSDL-funded collections at Cornell
• Atlas. Data sets of earthquakes, volcanoes, etc.
• Reuleaux. Digitized kinematics models from the nineteenth century.
• Laboratory of Ornithology. Sound recordings, images, videos of birds and other animals.
• Nuprl. Logic-based tools to support programming and to implement formal computational mathematics.

Mixed Metadata: the Chimera of Standardization
Technical reasons
• Characteristics of formats and genres
• Differing user needs
Social and cultural reasons
• Economic factors
• Installed base

Information Discovery in a Messy World
Building blocks
• Brute force computation
• The expertise of users -- human in the loop
Methods
(a) Better understanding of how and why users seek information
(b) Relationships and context information
(c) Multi-modal information discovery
(d) User interfaces for exploring information

Understanding How and Why Users Seek Information
Homogeneous content
• All documents are assumed equal
• Criterion is relevance (binary measure)
• Goal is to find all relevant documents (high recall)
• Hits ranked in order of similarity to query
Mixed content
• Some documents are more important than others
• Goal is to find most useful documents on a topic and then browse
• Hits ranked in an order that combines importance and similarity to query
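The slides do not say how importance and similarity are combined. As one possible illustration, the sketch below uses a simple linear blend; the `combined_score` function, the `weight` parameter, and the sample scores are all assumptions made for this example.

```python
def combined_score(similarity, importance, weight=0.5):
    """Linear blend of query similarity and query-independent importance.

    Both inputs are assumed to be normalized to [0, 1]; `weight` is the
    share given to similarity (an assumed tuning parameter).
    """
    return weight * similarity + (1.0 - weight) * importance

def rank(hits, weight=0.5):
    """Sort (doc_id, similarity, importance) tuples by descending combined score."""
    return sorted(hits, key=lambda h: combined_score(h[1], h[2], weight), reverse=True)

# Hypothetical hits: (doc_id, similarity to query, importance).
hits = [("a", 0.9, 0.1), ("b", 0.6, 0.8), ("c", 0.3, 0.3)]

# weight=1.0 reproduces the homogeneous-content case: similarity only.
by_similarity = [h[0] for h in rank(hits, weight=1.0)]
# weight=0.5 lets an important document overtake a slightly more similar one.
blended = [h[0] for h in rank(hits, weight=0.5)]
```

With these sample values, document "b" moves ahead of "a" once importance is taken into account, which is the behavior the mixed-content case asks for.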

Case Study Information discovery in the National Science Foundation's National Science Digital Library (NSDL). The goal of the NSDL is to be a digital library for all aspects of science education, where science and education are very broadly defined.

Why Technology in Education? Why a Digital Library for Education?
Higher Education. U.S. higher education is the best in the world, but it is very expensive. How can we keep the quality while lowering the cost?
K-12. The best K-12 education in the U.S. is excellent, but much is mediocre or worse. How can the best be made available to all?
Technology-enhanced education offers a way to increase the productivity of the skilled people who teach in both higher and K-12 education.

Why a Digital Library for Science Education?
Excellent teaching materials have been developed... but they are not being used effectively.
The NSDL provides organization and access for teachers and students:
• Preservation and reuse.
• Searching and browsing.
• Links between teaching materials and their educational use.

Educational materials are scattered across the Internet
[Figure: examples of sources scattered across the Internet: State standards, Math Forum, NASA, Ask a Scientist, Scientific American.]

The NSDL Architecture: Basic Assumptions
• The NSDL is a partnership of organizations who manage collections and provide educational and library services.
• There is a central team to integrate the parts and provide central services.
• The central team does not manage any collections and does not create any metadata.

Architectural Assumptions: One Library, Many Portals
Different Groups of Users Need Different Views of the Library
• Central portal for general users.
• Development portal for library developers.
• Pathways portals by discipline (e.g., mathematics) and educational level (e.g., middle school).

Architectural Assumptions: A Spectrum of Interoperability
The Problem
• Conventional approaches require partners to support agreements (technical, content, and business).
• But NSDL needs thousands of very different partners, most of whom are not directly part of the NSDL program.
• The challenge is to create incentives for independent digital libraries to adopt agreements.

Approaches to interoperability
The conventional approach
• Wise people develop standards: protocols, formats, etc.
• Everybody implements the standards.
• This creates an integrated, distributed system.
Unfortunately ...
• Standards are expensive to adopt.
• Concepts are continually changing.
• Systems are continually changing.
• Different people have different ideas.

Interoperability is about agreements
• Technical agreements cover formats, protocols, security systems so that messages can be exchanged, etc.
• Content agreements cover the data and metadata, and include semantic agreements on the interpretation of the messages.
• Organizational agreements cover the ground rules for access, for changing collections and services, payment, authentication, etc.
The challenge is to create incentives for independent digital libraries to adopt agreements.

Function versus cost of acceptance
[Figure: function plotted against cost of acceptance, with regions labeled "Few adopters" and "Many adopters".]

Example: security
[Figure: function plotted against cost of acceptance for three mechanisms: public key infrastructure, login ID and password, and IP address.]

Example: metadata standards
[Figure: function plotted against cost of acceptance for three options: MARC, Dublin Core, and free text.]

NSDL: The Spectrum of Interoperability
Level        Agreements                                                                Example
Federation   Strict use of standards (syntax, semantic, and business)                  AACR, MARC, Z 39.50
Harvesting   Digital libraries expose metadata; simple protocol and registry           Open Archives metadata harvesting
Gathering    Digital libraries do not cooperate; services must seek out information    Web crawlers and search engines

Architecture: the NSDL Repository
The Repository holds information about every collection and item known to the NSDL.

Standards Implemented in the NSDL Repository Phase 1
Metadata: Dublin Core with educational extensions
Ingest and redistribution: Open Archives Initiative, Protocol for Metadata Harvesting
[Figure: object model showing a Collection (collection metadata URL, item metadata URL) and its Items.]
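To show roughly what harvesting Dublin Core through the Open Archives protocol involves, the sketch below builds a `ListRecords` request URL and parses one unqualified Dublin Core record. The base URL `http://example.org/oai`, the sample record, and the helper names are hypothetical; the NSDL educational extensions are not shown, and no network request is made.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespace of the Dublin Core element set used by the oai_dc format.
DC_NS = "http://purl.org/dc/elements/1.1/"

def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository base URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

# Hypothetical record, shaped like the <metadata> payload of an oai_dc response.
SAMPLE = """<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>Sample earthquake data set</dc:title>
  <dc:subject>Seismology</dc:subject>
  <dc:subject>Earth science</dc:subject>
  <dc:identifier>http://example.org/items/1</dc:identifier>
</oai_dc:dc>"""

def parse_dc(xml_text):
    """Collect Dublin Core elements into a dict of lists (elements may repeat)."""
    root = ET.fromstring(xml_text)
    record = {}
    for child in root:
        if child.tag.startswith("{" + DC_NS + "}"):
            name = child.tag.split("}", 1)[1]
            record.setdefault(name, []).append((child.text or "").strip())
    return record

url = list_records_url("http://example.org/oai")
record = parse_dc(SAMPLE)
```

Every Dublin Core element is optional and repeatable, so the parser stores a list per element rather than a single value.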

The NSDL Search Service
Full Text or Metadata?
• Full text indexing is excellent, but is not possible for all materials (non-textual, no access for indexing).
• Comprehensive metadata is available for very few of the materials.
What Architecture to Use?
• Few collections support an established search protocol (e.g., Z39.50).

NSDL Search Service: Phase 1
The search service combines metadata from the Repository and full text from the collections.
[Figure: the NSDL Repository, the Search Service, and the Collections, with an http connection.]

Approach
• Collections map metadata to Dublin Core and provide it via the Open Archives protocol.
• Search service augments Dublin Core metadata with indexing of full text where available.
• The search engine is Lucene (tf.idf weighting).
• User interface returns snippets derived from the metadata, with links to full content and to metadata.
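As a reminder of what tf.idf weighting means, here is a minimal Python sketch of tf.idf vectors ranked by cosine similarity. It is a textbook formulation written for this example, not Lucene's actual scoring formula, and the documents, tokenizer, and function names are assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_vectors(docs):
    """Return tf.idf vectors per document and the idf table."""
    tokenized = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n = len(tokenized)
    df = Counter()
    for terms in tokenized.values():
        df.update(set(terms))  # document frequency: count each term once per doc
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = {
        doc_id: {t: tf * idf[t] for t, tf in Counter(terms).items()}
        for doc_id, terms in tokenized.items()
    }
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def search(query, vectors, idf):
    """Return ids of documents with non-zero similarity, best match first."""
    q = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(query)).items()}
    scored = [(cosine(q, vec), doc_id) for doc_id, vec in vectors.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0]

# Hypothetical documents standing in for metadata plus full text.
DOCS = {
    "d1": "earthquake data sets and volcano data",
    "d2": "bird sound recordings and videos",
    "d3": "kinematics models of mechanisms",
}
VECTORS, IDF = build_vectors(DOCS)
results = search("volcano data", VECTORS, IDF)
```

Terms that occur in many documents, such as "and" here, receive a low idf and contribute little to the score; rare terms dominate the ranking.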

Weaknesses
(a) Ranking by similarity to query is not sufficient (e.g., no ranking by grade level).
(b) Snippets do not indicate why an item was returned (e.g., terms in full text but not in metadata).
(c) Dublin Core records provide limited information.
(d) Browsing environment is limited.
(e) Many users begin their search with a Web search engine (e.g., Google or Yahoo).

NSDL and the Web
Many people will find NSDL materials through Web search engines. Therefore the NSDL must be indexed by them.
[Figure: the NSDL Repository and the Collections, each reached via http.]

Second Phase Developments
Metadata
• Accept any metadata that is available in a range of formats
• System for reviews and annotations, with reputation management
Search system
• Multimodal retrieval and ranking
• Dynamic generation of snippets by search engine
This work is currently in progress. The first stage is to reimplement the Repository to manage relationships among resources.

Second Phase Developments (cont.)
Usability and human factors
• Wider range of browsing tools (e.g., collection visualization)
• Filters by education level and education quality, where known
Web compatibility
• Expose records for Web crawlers to index
• Browser bookmarklet to add NSDL information to Web pages

Relationship and Contextual Information
Methods for capturing context
• Analysis of citations and links (e.g., PageRank)
• Mining usage logs (e.g., customers who buy the same product)
• Reviews (e.g., reputation management)
• Structural relationships (e.g., domain names)
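As an illustration of link analysis, the following is a minimal power-iteration sketch of PageRank in Python. The damping factor of 0.85, the fixed iteration count, the handling of dangling nodes, and the four-node link graph are assumptions chosen for the example, not details from the lecture.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Estimate PageRank by power iteration.

    `links` maps each node to the list of nodes it links to.
    """
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the "random jump" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's rank evenly among the pages it links to.
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank

# Hypothetical link graph: "c" is linked to by every other page.
LINKS = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(LINKS)
top = max(ranks, key=ranks.get)
```

The score depends only on the link structure, not on any query, so it can serve as the importance signal that is combined with similarity when ranking mixed content.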

Acknowledgements
The NSDL is a program of the National Science Foundation's Directorate for Education and Human Resources, Division of Undergraduate Education.
The NSDL Core Integration is a collaboration between the University Corporation for Atmospheric Research, Columbia University, and Cornell University.
The initial version of the Search Service was developed by James Allan and colleagues at the University of Massachusetts, Amherst. The current version was developed by Naomi Dushay and colleagues at Cornell University.
