Oceanographic Informatics in a Collaborative Environment. Data Management Special Session N12: Strategies for Improved Marine and Synergistic Data Access and Interoperability. 19 December 2008 San Francisco, CA P.H. Wiebe, R.C. Groman, C. Chandler, M.D. Allison, and D. Glover Woods Hole Oceanographic Institution Woods Hole, MA, USA
A Context
Data and information in oceanography are expanding at a rapid pace, and there is a significant need for more and better management tools and techniques to preserve and serve them.
Talk Objectives
To discuss current developments and new directions to enable better opportunities for data discovery, integration, and synthesis of oceanographic data regardless of origin.
To encourage comprehensive efforts to establish broadly based and accepted best practices in the quest to obtain new information about ocean physics, chemistry, biology, geology, and geophysics.
To highlight some of the changes I have observed during the past four decades and to strongly endorse the New Age that is fast approaching in the way we gather, store, access, and analyze information and data.
A Personal Context
I have worked throughout my career as a biological oceanographer on multi-investigator and multi-disciplinary programs and projects.
I realized early on that data and information management was an essential element in the design, acquisition, and synthesis of data sets in the oceanographic scientific enterprise.
But the technology (hardware/software), resources (funding), and mandates needed to do it effectively were not in place until recently.
The effort now involves more than data and information management. It involves what is termed “Data informatics”.
Informatics Defined
“Informatics is the science and engineering that occupies the gap between information and communications technology (ICT) systems and cyberinfrastructure (computers, grids, Web services, etc.), and the use of digital data, information, and related services for research and knowledge generation.”
From: Baker, D.N., C.E. Barton, W.K. Peterson, and P. Fox. 2008. Informatics and the 2007–2008 Electronic Geophysical Year. Eos. 89(48):
Evolution of MOCNESS Data Acquisition
1976 CCR Program: HP2100
1982 WCR Program: CBM 8032
1999 GLOBEC Program: Windows PC
Sampling in the Cold-Core Ring Program Cruises Total PO, bio-process, & mapping
Sampling in the Warm-Core Ring Program
Cruises Total: 6 PO, 3 bio-process, 3 bio-mapping, 2 bio-process & mapping
Ships: Knorr, Endeavor, Oceanus
Sampling in the U.S. GLOBEC Georges Bank Program
Cruises Total: 31 broad-scale, 91 process and mooring.
Data Storage
1970s: Honeywell Sigma 7. Simple file storage plus the Sigma 7 Extended Database Management System. MOCNESS data only; terminal access.
1980s: Digital VAX 11/780. Flat file storage; all data; terminal access. Micro-computers with floppies and small hard drives.
1990s: Sun/Unix-Linux servers. GLOBEC Data & Information Management system; project specific; all data; web available. Micro-computers become the mainstay for labs.
2000s: Unix/Linux servers. BCO-DMO Data & Information Management system; multiple projects; web available.
The Biological and Chemical Oceanography Data Management Office (BCO-DMO)
The BCO-DMO was created in late 2006 to serve investigators funded by the NSF Biological and Chemical Oceanography Sections to conduct marine chemical and ecological research.
BCO-DMO provides open access to marine biogeochemical and ecological data and information developed in the course of scientific research, so that they can easily be disseminated, protected, and stored on short and intermediate time frames.
Groman’s Theorems
Theorem 1: The probability that all the necessary data and information are collected and preserved to allow another researcher to properly use your data is inversely proportional to the time since the data were collected.
Corollary: Unless data and information are collected and preserved during the experiment (e.g., cruise), subsequent researchers will have a difficult time using those data.
Theorem 2: The longer the time since the data were collected, the less likely the data will ever be considered “final” or available.
Conclusion: It is essential that data and information management begin with the start of a project or program.
The Informatics Imperative
The Rise in Interdisciplinary Oceanography and Collaboration in Ocean Science have been emphasized by Powell (2008) and Briscoe (2008).
Powell: “Ocean science has long been interdisciplinary… Today, one can scarcely conceive of an oceanographic question that does not cut across disciplines.”
Briscoe: “Ocean science must head toward more collaboration, because many of the research and applications questions we face demand teams of scientists and engineers (and probably social scientists and economists)…..Collaboration in the ocean sciences is critical to addressing emerging ocean problems, and is worth the effort.”
It will take data informatics to make it possible!
Powell, T.M. 2008. The rise of interdisciplinary Oceanography. Oceanography. 21(3):
Briscoe, M.G. 2008. Collaboration in the Ocean Sciences. Oceanography. 21(3):
What has happened to cause a change?
Computers are more powerful and storage is much larger.
Software and software tools to handle data management are now widely available.
More multi-disciplinary research is building on the work of earlier programs, and the earlier data are needed for current and future work.
Programs have policies that require data sharing in reasonable time frames (~2 years).
Program Managers are requiring that data from previous grants be made publicly accessible on the web before funding the next grant.
Still resistance to sharing data – Why?
Scientist does not want others to use the data: fear of lost opportunities.
Scientist does not know how to do it.
Other reasons expressed:
I’m not done publishing my papers based on the data.
My graduate student is almost done analyzing the data.
It’s not final yet.
Structural impediments:
Lack of positive acknowledgment of data shared (give credit on par with papers? Need for DOIs).
Reasons for sharing data
A scientist’s data are not nearly as valuable by themselves as they are in the context of all the other data sets collected within a program.
Using others’ data within a program without sharing one’s own is not fair.
Data publishing with author-citable references is coming. Scientists will get credit for putting their data in public repositories.
There are real advantages to sharing.
Data Informatics
Semantic Web: RDF, OWL, SPARQL
Resource Description Framework (RDF); Web Ontology Language (OWL); SPARQL Query Language for RDF
BASIN – an example of a prospective new program that will require all the Data Informatics and management techniques possible.
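To give a feel for what RDF and SPARQL contribute, here is a minimal sketch in plain Python. It is a toy illustration under stated assumptions, not an RDF or SPARQL implementation and not how any program named in this talk stores its data. RDF describes data as subject-predicate-object triples, and a SPARQL query matches patterns with variables against those triples. Every identifier below (the `ex:` names, `match`, `query`) is hypothetical and invented for this example.

```python
# Toy illustration only: plain tuples stand in for RDF triples, and a small
# pattern matcher stands in for a SPARQL basic graph pattern.
# All identifiers (ex:cruise_A, ex:usedInstrument, ...) are hypothetical.

TRIPLES = [
    ("ex:cruise_A", "rdf:type", "ex:Cruise"),
    ("ex:cruise_A", "ex:partOfProgram", "ex:ProgramX"),
    ("ex:cruise_A", "ex:usedInstrument", "ex:MOCNESS"),
    ("ex:cruise_B", "rdf:type", "ex:Cruise"),
    ("ex:cruise_B", "ex:partOfProgram", "ex:ProgramY"),
    ("ex:cruise_B", "ex:usedInstrument", "ex:CTD"),
]


def match(pattern, triples=TRIPLES):
    """Match one (subject, predicate, object) pattern against the triples.

    Terms starting with '?' are variables; the result is a list of
    variable bindings, one dict per matching triple.
    """
    results = []
    for triple in triples:
        binding = {}
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # A variable repeated within one pattern must bind consistently.
                if binding.get(term, value) != value:
                    break
                binding[term] = value
            elif term != value:
                break
        else:
            results.append(binding)
    return results


def query(patterns, triples=TRIPLES):
    """Join several patterns on their shared variables."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for solution in solutions:
            # Substitute variables already bound by earlier patterns.
            bound = tuple(solution.get(term, term) for term in pattern)
            for binding in match(bound, triples):
                next_solutions.append({**solution, **binding})
        solutions = next_solutions
    return solutions


# "Which cruises used a MOCNESS?" expressed as two joined patterns.
mocness_cruises = query([
    ("?cruise", "rdf:type", "ex:Cruise"),
    ("?cruise", "ex:usedInstrument", "ex:MOCNESS"),
])
print(mocness_cruises)
```

The point of the sketch is that the question is asked against shared vocabulary terms rather than against one project's file layout, which is what allows the same query to run over data sets of different origin. In practice the vocabulary would be defined in an ontology (OWL) and the query written in SPARQL against a real triple store.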
Summary
Research in oceanography proceeds along three major lines: field observation, field and laboratory experimentation, and modeling.
Data management and informatics until now have been an afterthought.
Efforts like ecosystem-based management require the integration of oceanographic, biodiversity, fisheries, and other marine environmental data, as well as the development of analysis and assessment tools.
The exponential increase in data sources and the proliferation and distributed nature of databases have created a fourth, new and important line of marine research.
Data management and informatics is now on par with the other lines of oceanographic research (Baker et al. 2008).
[Figure: Past: FO, EX, MO. Future: FO, EX, MO, DM&I.]
Baker, D.N., C.E. Barton, W.K. Peterson, and P. Fox. 2008. Informatics and the 2007–2008 Electronic Geophysical Year. Eos. 89(48):
Summary
Research priorities include:
More rapid and efficient data acquisition,
Enhanced data management,
More effective data utilization and reuse,
Improved data visualization, and
Development of ontologies.
The ultimate goal is to create a cyberinfrastructure for oceanography that enables open, transparent, interoperable access to data and information, regardless of their location.
Acknowledgments
Thanks to:
Charlton Galvarino for his excellent skill in implementing the MapServer interface.
Huan-Xiang Xu for his help during the metadata database design and in the initial loading of the database.
Xiaoyan Ye for her help in the initial attempts to develop comprehensive search options and geospatial displays of all the data, and for updating software to take advantage of the new database.
Julie Allen for her extensive help and support in implementing our BCO-DMO web site using Drupal and in using Cold Fusion to provide web access to the database.
The National Science Foundation supported our work under grant numbers OCE and ANT