SWWG PROJECT OVERVIEW Semantic Technologies for Integrating USGS Data
Past, Present, Future of The Web
What is the Semantic Web?
  “the idea of having data on the web defined and linked in a way that it can be used by machines not just for display purposes, but for automation, integration and reuse of data across various applications” Tim Berners-Lee (2001)
  “The main idea of the semantic web is to support a distributed web at the level of data rather than at the level of presentation” quote from the book Semantic Web for the Working Ontologist
  The Semantic Web is the “Web of Data”
  Moving forward from the “Document Web” to the “Data Web”
Why is it important?
  Data can answer questions that documents can’t
  Putting data on the web is more important than creating a beautiful website that silos the data
  Open linked data can be used by anyone
It sounds like a good idea, but why should we do it?
  Supports dynamic applications
  Designed for change
    Changes are made at the data modeling level
  Reference multiple sources of data without combining them; the data sources remain separate
    No need for the mega-database (data warehouse model)
  Ability to map distributed data
    Merging information from multiple sources
  Relational and semantic technologies work together well
“Generic” Web vs. Semantic Web
Generic Web = Document Web
  Links documents to documents
  Focused on presenting documents to humans
  Standard used: HTML
  HTML describes the syntax, not the semantics
  Example: Wikipedia
Semantic Web = Data Web
  Links data to data
  Focused on providing meaningful data to machines
  Standard used: RDF
  RDF represents the semantics of the data to machines
  Example: DBpedia (linked open data source)
SWWG Project Goals
  Learn Semantic Web technologies
  Integrate sample data sets using a common ontology
  Develop a semantic data integration prototype
Methodology Semantic Web Methodology & Technology Development Process Graphic Credit & Copyright: Dr. Peter Fox, Rensselaer Polytechnic Institute (RPI)
The Use Case
Goal: Combine data from a variety of sources into a single dataset to support aquatic habitat research of freshwater fish species in the Susquehanna River Basin.
Data Sets
  Aquatic Bioassessment Data for the Nation (BioData)
    BioData provides access to aquatic bioassessment data (biological community and physical habitat data) collected by USGS scientists from stream ecosystems across the Nation.
    Available online at
  Mineral Resources Online Spatial Data (Geochemistry)
    Offers national-scale geochemical analysis of stream sediments and soils in the United States collected and analyzed under the National Uranium Resource Evaluation program.
    Available online at
  Multistate Aquatic Resources Information System (MARIS)
    MARIS serves as an online resource containing over one million population estimate, total catch, total weight, and water quality records for nearly 600 fish species sampled by a growing number of state fish and wildlife agencies.
    Available online at
  National Hydrography Dataset (NHD)
    NHD contains detailed geospatial information about the Nation's surface water including features such as lakes, ponds, streams, rivers, canals, dams, and stream gages.
    Available online at
Heterogeneous Information Models
Map to O & M Ontology Open Geospatial Consortium: Observations & Measurements Ontology
Flexible Data Model
Resource Description Framework (RDF)
  Convert relational data to RDF for integration using the O & M Ontology
  Data will then be stored in triple stores with SPARQL endpoints
  “The Resource Description Framework (RDF) provides a flexible data model that is used to build a conceptual representation of the data with formal semantics and allows disparate data to share formal relationships along points of integration such as spatial, temporal, and taxonomic information” Stephan Zednik, RPI
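Illustration (not project code): a minimal Python sketch of how relational rows might become RDF-style triples and then be merged along a shared point of integration. Everything here is an assumption made for the example: the `example.org` namespaces, the `Observation` class URI standing in for an O & M term, the table and column names, and the placeholder row values. A real pipeline would use an RDF library and a triple store rather than plain Python sets.

```python
from typing import Dict, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespaces, for illustration only.
EX = "http://example.org/swwg/"
OM_OBSERVATION = "http://example.org/om#Observation"  # stand-in for an O & M class
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def row_to_triples(table: str, key: str, row: Dict[str, object]) -> Set[Triple]:
    """Turn one relational row into (subject, predicate, object) triples."""
    subject = f"{EX}{table}/{row[key]}"
    triples = {(subject, RDF_TYPE, OM_OBSERVATION)}
    for column, value in row.items():
        if column == key or value is None:
            continue  # the key is already in the subject URI; skip NULLs
        triples.add((subject, f"{EX}{column}", str(value)))
    return triples


def match(graph: Set[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])
    )


# Two made-up rows from two made-up source tables that share a stream reach.
biodata = row_to_triples(
    "biodata", "sample_id",
    {"sample_id": 1, "reach": "R100", "taxon": "example-taxon"},
)
geochem = row_to_triples(
    "geochem", "site_id",
    {"site_id": 7, "reach": "R100", "element": "example-element"},
)

# Merging two sets of triples is a set union; neither source is restructured.
graph = biodata | geochem

# One pattern now finds records from both sources at the shared reach.
shared = match(graph, p=f"{EX}reach", o="R100")
for subject, _, _ in shared:
    print(subject)
```

The shared `reach` value plays the role of a spatial point of integration: each source keeps its own columns, and the merge needs no common table schema.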
High Level Architecture
Apache Jena Framework
A configurable way to access RDF data using simple RESTful URLs that are translated into queries to a SPARQL endpoint
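Illustration (not the project's architecture and not Apache Jena's API): a toy Python sketch of the general idea of translating a simple RESTful URL into a SPARQL query string. The URL layout, the `example.org` namespace, and the function name `url_to_sparql` are all hypothetical, and the sketch only builds the query text; it does not contact an endpoint.

```python
import re
from urllib.parse import parse_qs, urlparse

# Hypothetical vocabulary namespace, for illustration only.
PREFIX = "http://example.org/swwg/"
SAFE_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def url_to_sparql(url: str) -> str:
    """Map /<type>?<property>=<value> onto a SPARQL SELECT query string."""
    parsed = urlparse(url)
    segments = [part for part in parsed.path.split("/") if part]
    if len(segments) != 1 or not SAFE_NAME.fullmatch(segments[0]):
        raise ValueError("expected a path of the form /<type>")

    lines = ["SELECT ?item WHERE {", f"  ?item a <{PREFIX}{segments[0]}> ."]
    filters = parse_qs(parsed.query)
    for name in sorted(filters):
        if not SAFE_NAME.fullmatch(name):
            raise ValueError(f"unsupported filter name: {name!r}")
        # Escape backslashes and quotes so the value stays inside the literal.
        value = filters[name][0].replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'  ?item <{PREFIX}{name}> "{value}" .')
    lines.append("}")
    return "\n".join(lines)


query = url_to_sparql("http://example.org/samples?reach=R100")
print(query)
```

Each query-string parameter becomes one triple pattern, so a client can filter results without writing SPARQL itself.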
Rapid Prototype Development: Using Scrum
Scrum is an agile framework for completing complex projects (mostly software development).
Roles
  Product Owner: responsible for the business value of the product
  ScrumMaster: ensures that the team is functional & productive
  Team: self-organizes to get work done
Meetings
  Sprint* Planning: the team meets with the product owner to choose the set of work to be delivered
  Daily Scrum: the team meets each day to share struggles & progress
  Sprint Review: the team demonstrates to the product owner what it has completed during the sprint
  Sprint Retrospective: the team looks at ways to improve the product and the process
Artifacts
  Product Backlog: prioritized list of desired project outcomes/features
  Sprint Backlog: set of work from the product backlog that the team agrees to complete in a sprint, broken out into tasks
  Burndown chart: at-a-glance look at the work remaining (can be 2 charts: one for the sprint, one for the overall project)
*A Sprint is a development period typically 2-4 weeks in length.
Scrum Alliance :
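Illustration: the arithmetic behind a burndown chart, as a minimal Python sketch. The function names and the point values are made up for the example; a real chart would be driven by the team's own sprint backlog estimates.

```python
from typing import List


def burndown(total_points: int, completed_per_day: List[int]) -> List[int]:
    """Work remaining at the start of the sprint and after each day."""
    remaining = [total_points]
    for done in completed_per_day:
        remaining.append(max(remaining[-1] - done, 0))
    return remaining


def ideal_line(total_points: int, days: int) -> List[float]:
    """Straight-line reference from all work remaining down to zero."""
    return [total_points * (days - day) / days for day in range(days + 1)]


# Made-up sprint: 20 points of work, 4 days of recorded progress.
actual = burndown(20, [3, 5, 0, 4])
ideal = ideal_line(20, 4)
print(actual)  # [20, 17, 12, 12, 8]
print(ideal)   # [20.0, 15.0, 10.0, 5.0, 0.0]
```

Plotting the two lists against the day number gives the chart: wherever the actual line sits above the ideal line, the sprint is behind pace.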
Next Steps
  Finalize Functional Requirements & Create Product Backlog
  Sprint Planning
  Start the Development (over a 2-3 week “sprint”)
  Prototype Testing
Questions?