MARIAN: Searching and Querying Across Heterogeneous Federated Digital Libraries Marcos André Gonçalves Robert K. France Edward A. Fox Tamas E. Doszkocs Work performed at Virginia Tech, Blacksburg, VA USA Support provided in part by NSF & National Library of Medicine.
JCDL 2001 First Joint ACM/IEEE Conference on Digital Libraries (+ NSF DLI-2 PI mtg) June 24-28, 2001 in Roanoke, VA Conference Committee: General Chair: Edward A. Fox, Virginia Tech Program Chair: Christine Borgman, UCLA Treasurer: Neil Rowe, Naval Postgraduate School Posters Chair: Craig Nevill-Manning, Rutgers U. …
Outline NDLTD Harvesting Strategies and the OAI MARIAN Middleware Generating Digital Libraries with 5SL Future Directions
NDLTD (1 of 3) Context: Networked Digital Library of Theses and Dissertations, Please join! Submit your (student’s) works! International federation of universities, libraries, supporting institutions (e.g., VTLS union catalog) Extremely heterogeneous Autonomy of management and decentralization Disparate protocols, metadata, repositories (e.g., UMI, OCLC’s WorldCat), language, encodings, user characteristics and preferences
NDLTD (2 of 3) Worldwide organization: educational/social context National/regional projects in Australia, Catalunya, Germany, India, Latin America (UNESCO/OAS/ISTEC), South Africa (Mellon), USA (including OhioLINK), … International conference (225 in March 2000, more expected for next, at Caltech) Steering committee representing supporting groups as well as the hundreds of universities
NDLTD (3 of 3) Unique collection – discipline/document context Multilingual and multimedia content Large book-size documents Full-content in several formats (XML, PDF, etc.) Large number of bibliographic references Several sets of metadata with different ranges of quality, that can fit with the Open Archives Initiative
Harvesting Strategies Harvesting vs. Federated Search Harvesting plus Federated Search Plus local collections The NDLTD Union Collection Multiple Harvesting Protocols Harvest™ System Z39.50 Dienst OAI
Union Collection Architecture
Open Archives Initiative (OAI) Interoperability Standards: Released - Jan/Feb Data + Service Providers Metadata Harvesting Protocol Unique identifiers (URNs) for each record Date-stamp for each record when last modified/created/deleted HTTP server with scripting capabilities 6 Service requests (verbs): Identify, ListMetadataFormats, ListSets, ListIdentifiers, GetRecord, ListRecords
Low-barrier interoperability umbrella (Herbert Van de Sompel) [Figure: shared metadata layer spanning OPAC, image, full-text, A&I, and e-print collections]
OAI harvesting tools (Herbert Van de Sompel) [Figure: a service provider's harvester draws on data provider repositories; labels: Datestamp, Identifier, Set, Records]
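The datestamp, identifier, and set labels in this diagram are what let a harvester fetch only what changed since its last visit. The following is a minimal sketch of that idea, assuming an in-memory stand-in for a data provider. The class names `HarvestRecord` and `ToyRepository` and the `oai:example:` identifiers are hypothetical and are not part of any OAI tool or of MARIAN.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for one harvested record: identifier, datestamp, set.
class HarvestRecord {
    final String identifier;
    final String datestamp; // ISO 8601 date such as "2001-03-15"; sorts correctly as text
    final String set;

    HarvestRecord(String identifier, String datestamp, String set) {
        this.identifier = identifier;
        this.datestamp = datestamp;
        this.set = set;
    }
}

// Hypothetical in-memory data provider; a real one would answer HTTP requests.
class ToyRepository {
    private final List<HarvestRecord> records = new ArrayList<>();

    void add(HarvestRecord r) {
        records.add(r);
    }

    // Return records stamped on or after `from`, optionally restricted to one set.
    // A null `from` means "everything"; a null `set` means "all sets".
    List<HarvestRecord> listRecords(String from, String set) {
        List<HarvestRecord> out = new ArrayList<>();
        for (HarvestRecord r : records) {
            boolean recentEnough = from == null || r.datestamp.compareTo(from) >= 0;
            boolean inSet = set == null || set.equals(r.set);
            if (recentEnough && inSet) {
                out.add(r);
            }
        }
        return out;
    }
}

class IncrementalHarvestDemo {
    public static void main(String[] args) {
        ToyRepository repo = new ToyRepository();
        repo.add(new HarvestRecord("oai:example:etd-1", "2001-01-10", "physics"));
        repo.add(new HarvestRecord("oai:example:etd-2", "2001-03-15", "physics"));
        repo.add(new HarvestRecord("oai:example:etd-3", "2001-04-02", "history"));

        // Second harvest: ask only for records stamped since the previous visit.
        for (HarvestRecord r : repo.listRecords("2001-03-01", null)) {
            System.out.println(r.identifier + " " + r.datestamp);
        }
    }
}
```

A service provider would remember the date of its last successful harvest and pass it as `from` on the next run, so the cost of each harvest tracks the number of changed records rather than the size of the repository.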
OAI harvesting tools (Herbert Van de Sompel) [Figure: service provider harvester and data provider repositories] Supporting protocol requests: Identify, ListMetadataFormats, ListSets. Harvesting protocol requests: ListRecords, ListIdentifiers, GetRecord.
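Each of these requests travels as an ordinary HTTP request whose `verb` argument names the operation. As a rough illustration, the sketch below assembles such request URLs. The helper class `OaiRequestBuilder` and the base URL `http://example.org/oai` are assumptions made for the example; the argument names (`metadataPrefix`, `from`, `identifier`) follow the published protocol.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that assembles an OAI request URL from a verb and its arguments.
class OaiRequestBuilder {
    private final String baseUrl;
    // LinkedHashMap keeps arguments in insertion order, so URLs are predictable.
    private final Map<String, String> params = new LinkedHashMap<>();

    OaiRequestBuilder(String baseUrl, String verb) {
        this.baseUrl = baseUrl;
        params.put("verb", verb);
    }

    // Add one request argument and return this builder for chaining.
    OaiRequestBuilder with(String name, String value) {
        params.put(name, value);
        return this;
    }

    // Join the base URL and the URL-encoded arguments into one request URL.
    String build() {
        StringBuilder sb = new StringBuilder(baseUrl);
        char separator = '?';
        for (Map.Entry<String, String> e : params.entrySet()) {
            sb.append(separator)
              .append(encode(e.getKey()))
              .append('=')
              .append(encode(e.getValue()));
            separator = '&';
        }
        return sb.toString();
    }

    private static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 support is required of every Java platform.
            throw new IllegalStateException(ex);
        }
    }
}

class OaiRequestDemo {
    public static void main(String[] args) {
        String base = "http://example.org/oai"; // hypothetical data provider
        System.out.println(new OaiRequestBuilder(base, "Identify").build());
        System.out.println(new OaiRequestBuilder(base, "ListRecords")
                .with("metadataPrefix", "oai_dc")
                .with("from", "2001-03-01")
                .build());
    }
}
```

Keeping the whole protocol inside plain HTTP requests like these is what makes the barrier to becoming a data provider low: an HTTP server with scripting capabilities is enough.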
Design Features Combined Harvesting, Federated Search, and Local Collections Object-Oriented Information Graph Representation 5S Model and 5SL Specification Language
MARIAN Middleware Flexible Representation Model Information Graph Class Hierarchies Weights and Weighted Sets (w. lazy eval) Class-Based Search Unified Searcher API Combining Heterogeneous Information Structural Matching Synthetic Superclasses
Information Graph Model (1/2) Each Information Object is a Node. Structure: exposed through Links Features of interest can become Nodes or can remain Hidden within Node Class Search Methods.
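To make the node-and-link idea concrete, here is a small sketch of a graph in which a thesis and its author are both nodes, joined by a typed, weighted link. The classes `GraphNode`, `GraphLink`, and `ToyInfoGraph` are hypothetical illustrations and not MARIAN's own classes; the class and link names in the usage example (ThesisDissertation, Individual, HasAuthor) are borrowed from the NDLTD collection view later in the deck.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical node: one information object, tagged with its class name.
class GraphNode {
    final String id;
    final String className;

    GraphNode(String id, String className) {
        this.id = id;
        this.className = className;
    }
}

// Hypothetical typed, weighted link between two nodes.
class GraphLink {
    final String linkClass;
    final GraphNode from;
    final GraphNode to;
    final double weight;

    GraphLink(String linkClass, GraphNode from, GraphNode to, double weight) {
        this.linkClass = linkClass;
        this.from = from;
        this.to = to;
        this.weight = weight;
    }
}

class ToyInfoGraph {
    private final Map<String, GraphNode> nodes = new HashMap<>();
    private final List<GraphLink> links = new ArrayList<>();

    GraphNode addNode(String id, String className) {
        GraphNode n = new GraphNode(id, className);
        nodes.put(id, n);
        return n;
    }

    void addLink(String linkClass, String fromId, String toId, double weight) {
        GraphNode from = nodes.get(fromId);
        GraphNode to = nodes.get(toId);
        if (from == null || to == null) {
            throw new IllegalArgumentException("unknown node id");
        }
        links.add(new GraphLink(linkClass, from, to, weight));
    }

    // Structure is exposed through links: follow one link class out of a node.
    List<GraphNode> follow(String fromId, String linkClass) {
        List<GraphNode> out = new ArrayList<>();
        for (GraphLink l : links) {
            if (l.from.id.equals(fromId) && l.linkClass.equals(linkClass)) {
                out.add(l.to);
            }
        }
        return out;
    }
}

class InfoGraphDemo {
    public static void main(String[] args) {
        ToyInfoGraph g = new ToyInfoGraph();
        g.addNode("etd-1", "ThesisDissertation");
        g.addNode("person-1", "Individual");
        g.addLink("HasAuthor", "etd-1", "person-1", 1.0);
        for (GraphNode n : g.follow("etd-1", "HasAuthor")) {
            System.out.println(n.id + " (" + n.className + ")");
        }
    }
}
```

In this sketch the author is promoted to a node of its own, so it can be searched and linked independently; a feature such as a title could instead stay hidden inside the document node and be reached only through that class's search methods.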
Information Graph Model (2/2)
Class-Based Search Common Search Methods Text Link / Weighted Link Node in Context Common Searcher Operations Match Best (weighted maximum) Match Most (summative union)
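The two searcher operations differ only in how they merge weights when an object appears in more than one result set. A minimal sketch follows, assuming a weighted set can be modelled as a map from object ID to weight. The class `WeightedSets` is hypothetical, and the absence of any normalization after summing is a simplification of this example rather than a statement about MARIAN.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helpers over weighted sets modelled as (object ID -> weight) maps.
class WeightedSets {
    // Match Best: an object keeps the largest weight it receives from either set.
    static Map<String, Double> matchBest(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> out = new HashMap<>(a);
        for (Map.Entry<String, Double> e : b.entrySet()) {
            out.merge(e.getKey(), e.getValue(), Double::max);
        }
        return out;
    }

    // Match Most: an object accumulates the weights it receives from both sets.
    static Map<String, Double> matchMost(Map<String, Double> a, Map<String, Double> b) {
        Map<String, Double> out = new HashMap<>(a);
        for (Map.Entry<String, Double> e : b.entrySet()) {
            out.merge(e.getKey(), e.getValue(), Double::sum);
        }
        return out;
    }
}

class WeightedSetsDemo {
    public static void main(String[] args) {
        Map<String, Double> titleHits = new HashMap<>();
        titleHits.put("etd-1", 0.5);
        titleHits.put("etd-2", 0.25);

        Map<String, Double> abstractHits = new HashMap<>();
        abstractHits.put("etd-2", 0.5);
        abstractHits.put("etd-3", 0.125);

        System.out.println("best: " + WeightedSets.matchBest(titleHits, abstractHits));
        System.out.println("most: " + WeightedSets.matchMost(titleHits, abstractHits));
    }
}
```

Under Match Best a document is ranked by its single strongest piece of evidence, while under Match Most a document that matches in several places outranks one that matches in only one of them.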
Class-Based Search
public interface ClassManager {
    public WtdObjSet match(InfoDesc description);
    public boolean isInClass(FullID id);
    public Object idToObject(FullID id);
    public Vector idsToObjects(Vector ids);
}
Class-Based Search
Combining Sources of Information Structural Matching Extends Weighted Retrieval to include “Best Match to Document Structure” Recursive, Extensible Collection Views Simple Interface to Complex Collections Common Interface to Diverse Collections Weighted Interface to Collections of Varying Quality
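One way to picture a weighted interface over sources of varying quality is a view that maps several source-specific fields onto one unified field, each with a trust weight. The sketch below is an illustration under that assumption only: the class `ToyCollectionView`, the substring matching, and the weights 1.0 and 0.5 are invented for the example and do not describe how MARIAN scores structural matches.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical view: several source fields feed one unified field, each with a trust weight.
class ToyCollectionView {
    private final Map<String, Double> sourceFields = new LinkedHashMap<>();

    // Register a source field and how much its content is trusted (illustrative values).
    ToyCollectionView map(String sourceField, double trust) {
        sourceFields.put(sourceField, trust);
        return this;
    }

    // Score a record as the highest trust weight among mapped fields containing the term.
    double score(Map<String, String> record, String term) {
        String needle = term.toLowerCase();
        double best = 0.0;
        for (Map.Entry<String, Double> e : sourceFields.entrySet()) {
            String value = record.get(e.getKey());
            if (value != null && value.toLowerCase().contains(needle)) {
                best = Math.max(best, e.getValue());
            }
        }
        return best;
    }
}

class CollectionViewDemo {
    public static void main(String[] args) {
        // Unified "author" view over curated metadata and crawler-extracted metadata.
        ToyCollectionView authorView = new ToyCollectionView()
                .map("dc.creator", 1.0)
                .map("crawlerAuthor", 0.5);

        Map<String, String> curated = new HashMap<>();
        curated.put("dc.creator", "France, Robert K.");

        Map<String, String> crawled = new HashMap<>();
        crawled.put("crawlerAuthor", "R. France");

        System.out.println(authorView.score(curated, "france"));
        System.out.println(authorView.score(crawled, "france"));
    }
}
```

The caller asks one question of one interface; records from the curated source and the crawled source both answer, but the less reliable source contributes a lower weight to the ranking.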
NDLTD Collection View (part) [Figure: class diagram for the PhysDis-ETD (SOIF) collection. A ThesisDissertation has a title and description and is linked by HasAuthor to an Individual and by HasSubject to a Subject. Subclasses shown: HasDcCreator (dc.creator) and HasCrawlerAuthor; HasDcSubject (dc.subject), HasHeadings (Headings), and HasKeywords (Keywords). Source fields shown: dc.title, crawlerTitle, dc.description, crawlerDescription, body.]
5S Model for Digital Libraries (1/2) Formal Model Streams Structures Spaces Services Societies
5S Model for Digital Libraries (2/2) Formal Model Streams Structures Spaces Services Societies NDLTD / MARIAN Example Document (presentable, indexable information object) Weighted Set (e.g., of results to a match operation) Collection Graph; Inheritance Lattice; Measure Space Adaptive Search; Query History Maintenance Library End-Users; DL Builders
5SL Generates Digital Library (Components)
Generating Digital Libraries: XML
Interoperability with 5S and 5SL Reductionist / Constructivist Approach Compositional mappings between DLs Composition of S-based constructs Mapping language
Student Projects to Integrate Schedule-driven Harvester SDI / Filtering for NDLTD MARIAN-Phronesis (Spanish – Monterrey); and work with German (Oldenburg / DFG), Portuguese, Chinese, Japanese, Korean TREC data formatted for loading
Future Work Fusion on hybrid architecture Incorporation of belief networks Using 5SL to generate wrappers New services/ functionalities Personalization (e.g., history, folders) Visualization (e.g., Envision applet) Integration with PetaPlex (100 nodes, 2.5 Tbytes disk capacity, > 300 Mbps to campus backbone, Sornil inversion)
Conclusions NDLTD provides a real, fertile, DL testbed. Harvesting strategies and the OAI MARIAN middleware: graphs, classes, views Generating Digital Libraries with 5SL Future: high performance services, experimental comparisons