The Earth System Grid Discovery and Semantic Web Technologies
Line Pouchard, Oak Ridge National Laboratory
Luca Cinquini, Gary Strand, National Center for Atmospheric Research
Scientific Web Technologies for Searching and Retrieving Scientific Data
ISWC II, Sanibel Island, FL, October 20, 2003
Line Pouchard, Oak Ridge National Laboratory, October 20, 2003
A geographically distributed team of climate and computer scientists:
–Climate scientists are our target users – simultaneous users
–Scientists providing expertise and leadership to the Intergovernmental Panel on Climate Change (IPCC)
A computing and data Grid collaboratory sponsored by the US Department of Energy.
A distributed system for storage, access, and discovery of post-processing data resulting from climate simulations on supercomputers.
ESG: Collaboration Network
[Figure elements: data producers; data consumers; Grid and network infrastructure; online storage systems; computational resources; ESG services: information, replica (R), metadata (M), community authorization (CAS).]
Current Status of Climate Data
Data sizes (estimated to be produced in the next 3-4 years for IPCC), types of storage, location of storage
–NCAR (Boulder, CO): Terabytes, NERSC (Berkeley, CA): TB, ORNL (Oak Ridge, TN): TB. Total: TB.
–Stored on mass storage archives, disk caches, and tapes.
–Data replicated at 3 locations in the US.
Data format conventions and simulation output formats
–Minimal metadata produced or associated by current simulations.
–Multiple output formats.
–Many complex standards.
Discovery and retrieval
–Datasets are not described in detail.
–Metadata resides in the data manager's head.
–Largely manual access.
–Different access mechanisms at different sites.
Far from seamless, automated data discovery and access.
ESG goals for search and retrieval
Enable searches and downloads through a seamless process
–Data search across multiple sites and storage locations.
–Access to all ESG functionality from the desktop through a single point of entry (a Web data portal).
–Some degree of access control (authentication, certificates).
Keep track of datasets, particularly on deep storage (archives, caches, tapes)
–Data formats
–Find related datasets: “campaign,” “ensembles”
–Simulation model descriptions and configurations
–Related simulations: “parent,” “child,” “sibling”
–Browsable, searchable, and extensible metadata
Several levels of users
–Easy-to-use, integrated tools (otherwise, no one will use them)
Collaborate with other groups: CCLRC e-Science Center and the British Atmospheric Data Center.
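The goal of searching several sites through a single point of entry can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `DatasetRecord`, `SiteCatalog`, and `Portal` names, the record fields, and the dataset identifiers are hypothetical and do not describe the actual ESG portal or its services.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical minimal metadata record for one dataset."""
    dataset_id: str
    site: str
    campaign: str


class SiteCatalog:
    """Stand-in for one site's metadata catalog (not a real ESG API)."""

    def __init__(self, site, records):
        self.site = site
        self._records = list(records)

    def search(self, **criteria):
        # Keep records whose fields equal every given criterion.
        return [
            r for r in self._records
            if all(getattr(r, key) == value for key, value in criteria.items())
        ]


class Portal:
    """Single point of entry that fans one query out to every site."""

    def __init__(self, catalogs):
        self._catalogs = list(catalogs)

    def search(self, **criteria):
        hits = []
        for catalog in self._catalogs:
            hits.extend(catalog.search(**criteria))
        # Sort by identifier so the result does not depend on site order.
        return sorted(hits, key=lambda r: r.dataset_id)


# Made-up holdings at two of the sites named in the slides.
ncar = SiteCatalog("NCAR", [
    DatasetRecord("run-a.atm", "NCAR", "campaign-1"),
    DatasetRecord("run-b.ocn", "NCAR", "campaign-2"),
])
ornl = SiteCatalog("ORNL", [
    DatasetRecord("run-a.ocn", "ORNL", "campaign-1"),
])

portal = Portal([ncar, ornl])
related = portal.search(campaign="campaign-1")
print([r.dataset_id for r in related])
```

The user issues one query and never addresses a site directly; finding related datasets by “campaign” becomes a filter on a shared metadata field.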
Discovery: Ontology and Metadata Services
[Figure: layered diagram of metadata services, top to bottom.]
ESG clients, API & user interfaces: publishing; search & discovery; browsing & display
High-level metadata services: metadata display; metadata browsing; metadata query; metadata discovery; metadata registration
Core metadata services: metadata access (update, insert, delete, query)
Metadata holdings: metadata catalogs; legacy data catalogs
Motivations for a prototype ontology
Development of an ESG metadata schema
–Help structure and guide the development efforts
–Provide a context
Trust
–Provenance and logistic information
–Data quality and curation
Prepare for a federation of data sources and interoperability between metadata schemas
–The ability to perform searches across these sources from a single point of entry.
ESG ontology concepts and relationships
Datasets
–File names (which tell a lot)
–Formats and conventions
–Coverage (space, time, multi-dimensional physical grids)
–Calendar years
–Parameters
–Related datasets
–Campaigns
ESG Service
–Used_by
Pedigree
–Participants, roles in ESG
–Provenance: traces origins and transformations
–Is_generated_by
–Storage location
Scientific Use: Simulations
–has_parent, has_child, has_sibling
–Input_type
–hardware_type
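As a rough illustration of how these relationships could support provenance tracing, the sketch below stores them as subject–relationship–object triples and walks `is_generated_by` and `has_parent` links. The relationship names come from the slide; the triple representation, the helper functions, and every identifier in the data are assumptions made for the example.

```python
# Each triple is (subject, relationship, object). All identifiers are made up.
TRIPLES = [
    ("dataset-1", "is_generated_by", "sim-child"),
    ("sim-child", "has_parent", "sim-root"),
    ("sim-root", "has_child", "sim-child"),
    ("sim-child", "has_sibling", "sim-other"),
    ("sim-root", "hardware_type", "machine-x"),
]


def objects(subject, relationship):
    """Return every object linked to `subject` by `relationship`."""
    return [o for s, r, o in TRIPLES if s == subject and r == relationship]


def provenance(dataset):
    """Trace a dataset to its generating simulation and that simulation's ancestors."""
    chain = []
    seen = set()
    frontier = objects(dataset, "is_generated_by")
    while frontier:
        simulation = frontier.pop(0)
        if simulation in seen:
            continue  # guard against cycles in the data
        seen.add(simulation)
        chain.append(simulation)
        frontier.extend(objects(simulation, "has_parent"))
    return chain


print(provenance("dataset-1"))
print(objects("sim-child", "has_sibling"))
```

A flat triple list is only one possible encoding; it is used here because it keeps each named relationship explicit and queryable.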
Guiding principles for the development of an ESG ontology
–Separate entities describing “things” from entities describing processes.
–Decouple concepts specific to a domain area from those common to other (Grid) projects.
–Keep terminology intuitive to users.
–Make explicit relationships between XML elements.
–Ontology tools were used to analyze current ESG schemas at every stage of development.
[Figure: class diagram. Legend: Class, AbstractClass, inheritance, association. Attribute multiplicities are shown in brackets.]
Object: [1] id
Activity: [0,1] name; [0,1] description; [0,1] rights; [0,n] date type=; [0,n] note; [0,n] participant role=; [0,n] reference uri=
Project: [0,n] topic type=; [0,1] funding
Simulation: [0,n] simulationInput type=; [0,n] simulationHardware
Dataset: [0,1] type; [0,1] conventions; [0,n] date type=; [0,n] format type= uri=; [0,1] timeCoverage; [0,1] spaceCoverage
Person: [0,1] firstName; [0,1] lastName; [0,1] contact
Institution: [0,1] name; [0,1] type; [0,1] contact
Service: [0,1] name; [0,1] description
Parameter: [1] name; [0,1] mapping authority=
Classes shown without attributes: Investigation, Ensemble, Campaign, Observation, Experiment, Analysis, ParameterList
Relationship labels: isA, isPartOf, hasParent, hasChild, hasSibling, generatedBy, worksFor, participant role=, serviceId, hasParameters, hasParameter
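One way to read the multiplicities in the diagram is as type constraints: `[1]` is a required field, `[0,1]` an optional one, and `[0,n]` a possibly empty list. The sketch below encodes a few of the classes that way. It makes several assumptions that the extracted diagram does not confirm: that Activity and Dataset inherit from Object, that Simulation inherits from Activity, that a simulation has at most one parent, and that a dataset is generated by at most one activity. The Python class and field names are illustrative only.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ESGObject:
    id: str  # [1] id: exactly one, so the field is required


@dataclass
class Activity(ESGObject):  # assumed: Activity isA Object
    name: Optional[str] = None                      # [0,1] name
    description: Optional[str] = None               # [0,1] description
    notes: List[str] = field(default_factory=list)  # [0,n] note


@dataclass
class Simulation(Activity):  # assumed: Simulation is a kind of Activity
    simulation_inputs: List[str] = field(default_factory=list)    # [0,n] simulationInput
    simulation_hardware: List[str] = field(default_factory=list)  # [0,n] simulationHardware
    parent: Optional["Simulation"] = None  # hasParent (assumed at most one)


@dataclass
class Dataset(ESGObject):  # assumed: Dataset isA Object
    dataset_type: Optional[str] = None                # [0,1] type
    conventions: Optional[str] = None                 # [0,1] conventions
    formats: List[str] = field(default_factory=list)  # [0,n] format
    generated_by: Optional[Activity] = None           # generatedBy (assumed at most one)


# Made-up instances: a control run, a branch run, and the branch's output.
control = Simulation(id="sim-root", name="control run")
branch = Simulation(id="sim-child", parent=control,
                    simulation_hardware=["machine-x"])
output = Dataset(id="dataset-1", formats=["format-x"], generated_by=branch)

print(output.generated_by.parent.name)
```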
Discovery Services Architecture
[Figure elements: ESG Portal; Discovery Service; Metadata Catalog Service (metadata); Replica Location Services; Storage. Arrow labels: Searches; Logical File Names; Physical File Names; Storage Location; Download.]
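The labels in this architecture suggest a two-step lookup: a metadata search yields logical file names, and a replica lookup resolves each logical name to physical file names at storage locations. The sketch below illustrates that reading only. The flow itself is inferred from the figure labels, and the dictionaries, function names, file names, and paths are all hypothetical; none of this is the API of the real catalog or replica services.

```python
# Hypothetical metadata catalog: logical file name -> descriptive metadata.
METADATA_CATALOG = {
    "lfn:run-a.atm.0001": {"parameter": "temperature", "campaign": "campaign-1"},
    "lfn:run-a.atm.0002": {"parameter": "pressure", "campaign": "campaign-1"},
    "lfn:run-b.ocn.0001": {"parameter": "temperature", "campaign": "campaign-2"},
}

# Hypothetical replica catalog: logical file name -> physical copies.
REPLICA_LOCATIONS = {
    "lfn:run-a.atm.0001": ["site-a:/archive/run-a/atm.0001",
                           "site-b:/cache/run-a/atm.0001"],
    "lfn:run-a.atm.0002": ["site-a:/archive/run-a/atm.0002"],
    "lfn:run-b.ocn.0001": ["site-c:/tape/run-b/ocn.0001"],
}


def search_metadata(**criteria):
    """Step 1: return logical file names whose metadata match all criteria."""
    return sorted(
        lfn for lfn, meta in METADATA_CATALOG.items()
        if all(meta.get(key) == value for key, value in criteria.items())
    )


def locate_replicas(lfn):
    """Step 2: resolve one logical file name to its physical file names."""
    return list(REPLICA_LOCATIONS.get(lfn, []))


def discover(**criteria):
    """Chain both steps, as a portal might do behind a single search form."""
    return {lfn: locate_replicas(lfn) for lfn in search_metadata(**criteria)}


result = discover(parameter="temperature")
for lfn, replicas in result.items():
    print(lfn, "->", replicas)
```

Keeping logical and physical names in separate catalogs means data can be replicated or moved between storage locations without touching the descriptive metadata.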
Leveraging Semantic Web efforts in Grid projects
The Semantic Web
–Highlighted the need for sharing information based on content.
–Provided web-based languages for knowledge acquisition and reasoning.
–Offers directions for ontology reconciliation.
–Ontologies already exist in the Earth Sciences.
Challenges presented by ESG
–Real-life complexity.
–Scientists as beginners and expert users demand usability …
–Measures of success.
–Changing a scientist's work habits requires an immediate and visible payoff.
–Data sizes: scalability of the approach.