U.S. Geological Survey National Biological Information Infrastructure
Technical Overview: NBII Metadata Clearinghouse
May 2008
Mike Frame
Topics for discussion
- Metadata CH Background
- New Metadata CH Design & Demo
- Underlying Architecture
Services Overview
NBII Metadata Resources
Metadata Resources:
- FGDC Metadata Program
- Tool reviews
- Training Opportunities
- Resources for using the Standard
- NBII Clearinghouse
Some basic metadata facts… about the FGDC Standard
7 Sections make up the FGDC Standard:
1. Identification Information
2. Data Quality Information
3. Spatial Data Information
4. Spatial Reference Information
5. Entity and Attribute Information
6. Data Distribution Information
7. Metadata Reference Information
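As a rough illustration of the seven sections listed above, the sketch below checks which of them appear in a record. It assumes the record is encoded in the usual FGDC XML form, with a `metadata` root and the short element names `idinfo`, `dataqual`, `spdoinfo`, `spref`, `eainfo`, `distinfo`, and `metainfo`. The sample record and the `section_report` helper are hypothetical and are not part of the Clearinghouse.

```python
import xml.etree.ElementTree as ET

# Short FGDC element names paired with the section titles used on the slide.
FGDC_SECTIONS = [
    ("idinfo", "Identification Information"),
    ("dataqual", "Data Quality Information"),
    ("spdoinfo", "Spatial Data Information"),
    ("spref", "Spatial Reference Information"),
    ("eainfo", "Entity and Attribute Information"),
    ("distinfo", "Data Distribution Information"),
    ("metainfo", "Metadata Reference Information"),
]


def section_report(xml_text):
    """Return {section title: bool} for the top-level sections of one record."""
    root = ET.fromstring(xml_text)
    present = {child.tag for child in root}
    return {title: tag in present for tag, title in FGDC_SECTIONS}


# Hypothetical, heavily abbreviated record used only for this example.
SAMPLE_RECORD = """<metadata>
  <idinfo><citation><citeinfo><title>Hypothetical bird survey</title></citeinfo></citation></idinfo>
  <metainfo><metstdn>FGDC Content Standard for Digital Geospatial Metadata</metstdn></metainfo>
</metadata>"""

report = section_report(SAMPLE_RECORD)
missing = [title for title, found in report.items() if not found]
print("Missing sections:", missing)
```

A check like this only confirms that a section is present; it does not validate the content inside it.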
NBII Metadata CH
Rationale for Metadata CH Redesign
- User Feedback
- Metadata creation
- Metadata management
- Metadata integration with data
- Open architecture framework
- Speed and Reliability
- Data quality
- Data visualization
- License Costs
NBII Metadata CH provides:
- Single portal to information contained in disparate data management systems
- Free text, fielded, spatial, and temporal search capabilities
- Allows individuals and database managers to distribute their data while maintaining complete control and ownership
- Leverages investment in existing information systems and research
- NBII is part of the Mercury consortium (ORNL)
NBII CH: New Functionalities
- Rich Client Interface
- Combined search results (status page)
- Filtering search results (Facet)
- Dynamic sorting of search results
- Bookmark brief and full metadata pages
- Based on open source technologies: Lucene, Solr
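Facet filtering and dynamic sorting map onto standard Solr request parameters (`q`, `fq`, `facet`, `facet.field`, `sort`, `rows`). The sketch below only builds such a request URL. The base URL, the field names (`datasource`, `keyword`, `pubdate`), and the `build_search_url` helper are assumptions made for illustration, not the Clearinghouse's actual schema or code.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(base_url, text, facet_fields=(), filters=None, sort=None, rows=10):
    """Build a Solr select URL with optional faceting, filters, and sorting."""
    params = [("q", text), ("rows", str(rows))]
    if facet_fields:
        params.append(("facet", "true"))
        # One facet.field parameter per field whose value counts we want back.
        params.extend(("facet.field", field) for field in facet_fields)
    for field, value in (filters or {}).items():
        # Each fq narrows the result set without changing the main query.
        params.append(("fq", f'{field}:"{value}"'))
    if sort:
        params.append(("sort", sort))
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and field names.
url = build_search_url(
    "http://localhost:8983/solr/select",
    "wetlands",
    facet_fields=["datasource", "keyword"],
    filters={"datasource": "Hypothetical Partner"},
    sort="pubdate desc",
)
print(url)
```

Keeping the user's choice of facet in `fq` rather than in `q` means the same free-text query can be refined step by step.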
NBII CH: New Functionalities (cont.)
- SOA based design
- Web services
- RSS services for search results
- Portlet support
- Search Sharing support
- Thesaurus Support
- Seamless data ordering/data extraction with various data partners
- Seamless data visualization integration with external visualization tools
- Improved User Statistics Collection
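To make "RSS services for search results" concrete, here is a minimal sketch that wraps a list of search hits in an RSS 2.0 feed. The result dictionaries, the example URLs, and the `results_to_rss` helper are hypothetical. The Clearinghouse's real feed layout is not described in these slides.

```python
import xml.etree.ElementTree as ET


def results_to_rss(query, results, feed_link):
    """Serialize search results as a minimal RSS 2.0 document."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = f"Search results for: {query}"
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = f"{len(results)} matching metadata records"
    for record in results:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = record["title"]
        ET.SubElement(item, "link").text = record["link"]
        ET.SubElement(item, "description").text = record["summary"]
    return ET.tostring(rss, encoding="unicode")


# Hypothetical search hits.
hits = [
    {"title": "Amphibian survey", "link": "http://example.org/rec/1", "summary": "Survey of coastal wetlands."},
    {"title": "Bird banding records", "link": "http://example.org/rec/2", "summary": "Banding data, 1999 to 2003."},
]
feed = results_to_rss("wetlands", hits, "http://example.org/search?q=wetlands")
print(feed)
```

A user who subscribes to such a feed sees new matching records without re-running the search.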
The NBII Clearinghouse
- The Clearinghouse is operated for NBII by the Oak Ridge National Laboratory
- Over 38,000 records
- 41 partners contributing metadata records
- Ability to search in a variety of ways
- Redesigned in 2008
NBII CH Demo
NBII Clearinghouse interface:
How does the NBII Clearinghouse work?
Metadata CH RSS World Data Center
NBII Metadata Clearinghouse Architecture
Metadata CH Architecture
- CH Function of the NBII Metadata Program
- Operated by ORNL
- NBII is 1 Organization in Mercury Consortium
- Established relationship in 2001
- Formerly based on “Blue Angel Technologies”
- Currently based on Lucene/Solr Open Source Technologies
Distributed Data Discovery and Access System
1. Principal investigators create detailed metadata and data files using local applications or ORNL-OME
2. NBII Mercury collects metadata and key data from contributing agencies’ servers distributed around the country and builds a centralized index
3. Remote users query the index via a Web-based browser
4. Metadata summaries are returned to the remote users, including links back to detailed information and data at the PIs’ server or data repository
5. Remote users select links to data of interest
6. Highly detailed data and documentation are downloaded directly from the contributing agency
[Diagram labels: Users, Internet, Index, Virtual Database]
[Index fields: P.I. Name, Product Number, Product Title, Site, Subject Area, Thematic Area, Keywords, etc.]
[Example summary: P.I. John Smith. Product A: Container 1, 10/12/2003; Container 2, 01/20/2002; Container 3, 07/05/2001. Product B: Container 1, 03/05/1999; ….]
A Virtual Aggregate Database
- Metadata exists in remote legacy databases using any platform, OS or RDBMS
- Databases can be of different structures and content
- Metadata are extracted into XML files yielding standardized data objects
- Export programs are easily written and automated
- These files can be remotely harvested via the Internet
- Harvested metadata are combined at the central site, transformed (if needed), and indexed
- Frequent, automated harvesting and complete rebuilding of the index keep the aggregate database up to date
- Users work with a single, simple, web-like interface to access all data simultaneously
- No re-programming of existing systems required
- Business as usual for contributing databases
[Diagram labels: Existing Database, Custom Export Program, Encrypted XML, Z39.50 or WS, Index]
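The "custom export program" idea can be sketched in a few lines: read rows from an existing database and write one XML record per row for later harvesting. Everything specific here is an assumption for illustration: the `datasets` table, its columns, the element names, and the in-memory SQLite database standing in for a partner's RDBMS. The encryption step shown in the diagram is omitted.

```python
import sqlite3
import xml.etree.ElementTree as ET


def export_records(conn):
    """Yield (filename, xml_text) for each row of the hypothetical 'datasets' table."""
    cursor = conn.execute(
        "SELECT id, title, investigator, keywords FROM datasets ORDER BY id"
    )
    for dataset_id, title, investigator, keywords in cursor:
        record = ET.Element("record", id=str(dataset_id))
        ET.SubElement(record, "title").text = title
        ET.SubElement(record, "investigator").text = investigator
        # Keywords are stored as one semicolon-separated string in this example.
        for word in keywords.split(";"):
            ET.SubElement(record, "keyword").text = word.strip()
        yield f"dataset_{dataset_id}.xml", ET.tostring(record, encoding="unicode")


# Stand-in for an existing partner database.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE datasets (id INTEGER PRIMARY KEY, title TEXT, investigator TEXT, keywords TEXT)"
)
conn.executemany(
    "INSERT INTO datasets VALUES (?, ?, ?, ?)",
    [
        (1, "Hypothetical amphibian survey", "J. Smith", "amphibians; wetlands"),
        (2, "Hypothetical bird banding records", "J. Smith", "birds"),
    ],
)

exported = dict(export_records(conn))
for name, xml_text in exported.items():
    print(name, xml_text)
```

Because the export only reads from the source database, the contributing system keeps running unchanged, which is the point of the "business as usual" item above.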
NBII CH Design Diagram
[Diagram components:]
- External Metadata (http, ftp, web crawl)
- NBII CH Harvester
- FGDC-BIO Transformed Files
- XML Beans to extract the contents
- DB updater tool (custom Java)
- MySQL: Mercury3_harvests_nbii
- Solr Indexer tool (custom Java)
- Solr Schema for defining the fields
- Index metadata records
- SOLR Search Server (Extended Lucene Index)
- Solr Searcher (custom Java, Spring)
- UI, Web Service, RSS, Portlets
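The indexing step in this design (extract fields from a transformed FGDC file, then hand them to Solr) can be sketched as follows. The real pipeline uses XML Beans and custom Java tools; this Python stand-in only illustrates the idea. The FGDC element paths are the standard short names, while the Solr field names (`id`, `title`, `abstract`, `keyword`) and the `fgdc_to_solr_add` helper are assumptions. The output follows Solr's XML update message format (`<add><doc><field name="...">`).

```python
import xml.etree.ElementTree as ET

# Hypothetical Solr field name -> path inside an FGDC record.
FIELD_PATHS = {
    "title": "idinfo/citation/citeinfo/title",
    "abstract": "idinfo/descript/abstract",
    "keyword": "idinfo/keywords/theme/themekey",
}


def fgdc_to_solr_add(record_id, fgdc_xml):
    """Turn one FGDC record into a Solr <add> update message."""
    source = ET.fromstring(fgdc_xml)
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    ET.SubElement(doc, "field", name="id").text = record_id
    for field_name, path in FIELD_PATHS.items():
        # findall returns every match, so repeated keywords become repeated fields.
        for node in source.findall(path):
            if node.text and node.text.strip():
                ET.SubElement(doc, "field", name=field_name).text = node.text.strip()
    return ET.tostring(add, encoding="unicode")


# Hypothetical, abbreviated record.
SAMPLE = """<metadata><idinfo>
  <citation><citeinfo><title>Hypothetical wetland survey</title></citeinfo></citation>
  <descript><abstract>Counts of amphibians at coastal sites.</abstract></descript>
  <keywords><theme><themekey>amphibians</themekey><themekey>wetlands</themekey></theme></keywords>
</idinfo></metadata>"""

message = fgdc_to_solr_add("rec-001", SAMPLE)
print(message)
```

The message would then be posted to the Solr update handler; that network step is left out so the sketch runs without a server.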
Future Development
Phase II (May 2008 to September 2008):
- Harvester engine to use open source tools (Remove COTS) (Phase I & II)
- Portal integration through JSR-168 Portlet standard: search portlets, portlets for recent datasets, top most searched words, etc.
- Web service implementation (Phase I & II):
  - Thesaurus support (semantic web integration support)
  - Gazetteer web service implementation
  - OGC Catalog Service (include Web Mapping/Coverage/Feature Servers in search)
  - Universal Description, Discovery, and Integration (UDDI) Directory Services
- Dynamic RSS support, including Geo-RSS support
- ISO support
- OpenSearch support
- Documentation and Help (Phase I & II)
- User Statistics Application modifications
Phase III (October 2008 to January 2009):
- Save, Retrieve and user queries
- Possible integration to OPeNDAP
- Web Service Harvesting (OAI)
- Internationalization ????
Search technology using Lucene/SOLR
- Lucene Overview
- Who uses Lucene
- Solr Overview
- Who uses Solr
Lucene Overview
- High-performance, full-featured text search engine library written entirely in Java
- Mature Apache Open Source Java Project
- Index speed and integrity, search speed
- Uses file-based full-text and inverted indexing; extremely fast, with built-in caching
- Can easily handle millions of documents
- Very active mailing list for support
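For readers unfamiliar with inverted indexing, the toy sketch below shows the core idea: map each term to the set of documents that contain it, then intersect those sets to answer a query. It is not Lucene's implementation, which adds analyzers, scoring, on-disk segments, and caching. The class name, the record IDs, and the sample text are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


class InvertedIndex:
    def __init__(self):
        # term -> set of document IDs containing that term
        self.postings = defaultdict(set)

    def add(self, doc_id, text):
        for term in tokenize(text):
            self.postings[term].add(doc_id)

    def search(self, query):
        """Return the IDs of documents containing every query term."""
        terms = tokenize(query)
        if not terms:
            return set()
        result = None
        for term in terms:
            ids = self.postings.get(term, set())
            result = ids if result is None else result & ids
        return set(result)


index = InvertedIndex()
index.add("rec-1", "Amphibian survey of coastal wetlands")
index.add("rec-2", "Breeding birds of coastal wetlands")
index.add("rec-3", "Bird banding records")

print(index.search("coastal wetlands"))
print(index.search("birds wetlands"))
```

Lookups touch only the postings for the query terms rather than scanning every document, which is why this structure scales to large collections. Note that this toy version has no stemming, so "bird" and "birds" are treated as different terms.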
Who uses Lucene
- Wikipedia
- MediaWiki
- European Bioinformatics Institute
- Liferay
- Bigsearch.ca
- Monster
- Academic Archive On-line
Complete list:
SOLR Overview
- Open source enterprise search server based on the Lucene Java search library
- Apache project, sub-project of Lucene
- Advanced Full-Text Search Capabilities
- Optimized for High Volume Web Traffic
- Standards Based Open Interfaces - XML and HTTP
- Solr uses the Lucene search library and extends it
SOLR Overview (cont.)
- A Real Data Schema, with Numeric Types, Date fields, Dynamic Fields
- Dynamic Faceted Browsing and Filtering
- Advanced, Configurable Text Analysis
- Highly Configurable and User Extensible Caching
- External Configuration via XML
- Scalability - Efficient Replication to other Solr Search Servers
- Administration Interface is available
Who uses SOLR
- CNET Reviews
- shopper.com
- AOL Music
- Netflix
- search.com
- The Digital Commonwealth
- Mindquarry
For complete list:
Mercury Instances Demo
- NBII Clearinghouse interface:
- ORNL DAAC interface:
- LBA Mercury interface:
- DADDI Mercury interface:
- GFIS RSS Portal interface:
User Statistics Report Generation Tool
Open source Harvester Re-design (Aperture)
Questions, Comments
Mike Frame
Thanks to:
- Giri Palanisamy, Systems Architect and Team Leader, Mercury Consortium
- Vivian Hutchison, NBII Metadata Program Manager