SCEC5 Planning for Scientific Computing and NSF Data Management Plan SCEC Leadership Retreat 2 June 2015 I’ve learned that proposals represent planning opportunities. SCEC should use these opportunities to
SCEC5 Data Management Planning
USGS View of Data Management
NSF Data Management Plan Requirement The National Science Foundation (NSF) has published a revised version of its Proposal and Award Policies and Procedures Guide (PAPPG) that requires, in all proposals submitted or due on or after January 18, 2011, a supplementary document of no more than two pages describing a Data Management Plan for the proposed research. As a supplementary document, the data management plan is not included in the 15-page limit for proposal bodies. FastLane will not permit submission of a proposal that is missing the Data Management Plan.
Contents of an NSF Data Management Plan Products of the Research: The types of data, samples, physical collections, software, curriculum materials, and other materials to be produced in the course of the project. Data Formats: The standards to be used for data and metadata format and content (where existing standards are absent or deemed inadequate, this should be documented along with any proposed solutions or remedies). Access to Data and Data Sharing Practices and Policies: Policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements. Policies for Re-Use, Re-Distribution, and Production of Derivatives. Archiving of Data: Plans for archiving data, samples, and other research products, and for preservation of access to them.
Division of Earth Sciences (GEO/EAR) Specific Requirements Preservation of all data, samples, physical collections and other supporting materials needed for long-term earth science research and education is required of all EAR-supported researchers. Data archives must include easily accessible information about the data holdings, including quality assessments, supporting ancillary information, and guidance and aids for locating and obtaining data. It is the responsibility of researchers and organizations to make results, data, derived data products, and collections available to the research community in a timely manner and at a reasonable cost. In the interest of full and open access, data should be provided at the lowest possible cost to researchers and educators. This cost should, as a first principle, be no more than the marginal cost of filling a specific user request. Data may be made available for secondary use through submission to a national data center, publication in a widely available scientific journal, book or website, through the institutional archives that are standard for a particular discipline (e.g. IRIS for seismological data, UNAVCO for GPS data), or through other EAR-specified repositories.
Information on Creating NSF Data Management Plans
1. NSF Data Management Plan Requirements
2. Directorate for Geosciences--Data Policies
3. DataOne (NSF Geo Data Management Project)
4. UCLA Library Data Management Plan Information
5. University of Michigan Library Systems
f. Biographical Sketch(es) (c) Products A list of: (i) up to five products most closely related to the proposed project; and (ii) up to five other significant products, whether or not related to the proposed project. Acceptable products must be citable and accessible including but not limited to publications, data sets, software, patents, and copyrights. Unacceptable products are unpublished documents not yet submitted for publication, invited lectures, and additional lists of products. Only the list of 10 will be used in the review of the proposal. Each product must include full citation information including (where applicable and practicable) names of all authors, date of publication or release, title, title of enclosing work such as journal or book, volume, issue, pages, website and Uniform Resource Locator (URL) or other Persistent Identifier. If only publications are included, the heading "Publications" may be used for this section of the Biographical Sketch.
SCEC5 Data Management Strategies
Data Citation Citation of Data Using Digital Object Identifiers (DOIs) Use Digital Object Identifiers (DOIs) to signify datasets that are complete, in a useable format, stable (changes are implemented by publication of new versions), have valid metadata, have passed the quality control checks within the domain of expertise of the data centre, and have long-term stewardship guaranteed by that data centre, underwritten by the ICSU World Data System. This provides the basis for a dataset to be cited as if it were a research paper, putting it on a par with other scientific outputs. [Reference] The International Journal of Digital Curation Volume 7, Issue 1 | 2012
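As an illustration of citing a dataset "as if it were a research paper", the sketch below assembles a citation string from a small set of fields and a DOI. The field set and ordering (creators, year, title, version, publisher, DOI link) are an assumption loosely based on common dataset-citation practice, not a SCEC or NSF requirement, and the dataset, data center, and DOI shown are hypothetical placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetCitation:
    """Minimal citation record for a published, versioned dataset."""
    creators: tuple[str, ...]
    year: int
    title: str
    version: str
    publisher: str
    doi: str  # bare DOI, without the https://doi.org/ resolver prefix

    def format(self) -> str:
        # Join creators with semicolons and append the DOI as a resolvable link.
        names = "; ".join(self.creators)
        return (
            f"{names} ({self.year}). {self.title}, version {self.version}. "
            f"{self.publisher}. https://doi.org/{self.doi}"
        )


# Hypothetical example values; none of these identify a real dataset.
example = DatasetCitation(
    creators=("Example, A.", "Sample, B."),
    year=2015,
    title="Hypothetical Velocity Model Dataset",
    version="1.0",
    publisher="Hypothetical Data Center",
    doi="10.5555/hypothetical.dataset.v1",
)
print(example.format())
```

Carrying the version inside the citation reflects the point above that a DOI-identified dataset is stable and that changes appear as new published versions.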
Data Citation Creating a Culture of Data Citation
SCEC5 Data Exchange As Focus of External Collaborations With adequate personnel support, SCEC5 could establish itself as a valuable contributor and collaborator with external NSF and other projects including IRIS, CIG, EarthScope, EarthCube, NEHRP, USGS, CGS, DOE, etc. Collaborations might work as follows:
1. SCEC5 expresses interest and willingness to collaborate with external projects.
2. SCEC5 meets with each group to discuss which of its data products might be of interest to SCEC.
3. SCEC5 describes SCEC products that might be of interest to the external group.
4. Agree on data to be exchanged.
5. Agree on exchange format.
6. Agree on metadata content.
7. Implement data formatting of selected products.
8. Implement access mechanism.
9. Release prototype data exchange.
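The "agree on metadata content" step could be made checkable once both sides settle on a field list. The sketch below assumes a purely hypothetical set of required fields and a hypothetical product record; the real list would come out of the discussions with each external group.

```python
# Hypothetical field list; the actual fields would be negotiated per collaboration.
REQUIRED_FIELDS = ("title", "producer", "format", "version", "access_url")


def missing_metadata(record: dict) -> list[str]:
    """Return the agreed-upon fields that are absent or empty in a record."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Hypothetical product record with two gaps: an empty version and no access URL.
record = {
    "title": "Hypothetical ground motion product",
    "producer": "SCEC",
    "format": "NetCDF",
    "version": "",
}
print(missing_metadata(record))
```

A check like this would run before the prototype release, so that products missing agreed metadata are caught before they reach the external group.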
Discussion SCEC5 Computational Research
SCEC5 Scientific Computing Scientific computing software development is a valuable capability within core SCEC and within Special Projects. SCEC5 planning should include Scientific Computing, for several reasons, including: Scientific Computing is Expensive; Scientific Computing Could Lead to SCEC5 Growth. In this session, I’ll present issues and suggestions for SCEC5 scientific computing for discussion.
SCEC5 Scientific Computing SCEC computing activities go under several names, including: Scientific Computing, Research Computing, High Performance Computing, Big Data Processing, Computer Science, Community Modeling Environment, Information Technology, Computational Science. SCEC5 scientific computing includes, but is not limited to, High Performance Computing.
SCEC5 Scientific Software Capabilities SCEC’s core computing skill is scientific software. Both core SCEC and Special Projects have this capability: Core SCEC researchers develop new scientific codes, often to do individual research. Special projects often develop software to perform large-scale community calculations.
SCEC5 Scientific Computing SCEC5 should focus on developing scientific software and using the software to perform research. SCEC5 should avoid spending significant resources building and operating large-scale computer hardware.
Scientific Computing Core SCEC Core SCEC researchers should continue to create, evaluate, and improve research software. Collaborative Computational Research Activities are Very Valuable: Source Inversion, Site Response, Dynamic Rupture Comparison, Utilization of Ground Motion Simulations. Core SCEC would benefit from a software developer available to the community. However, even if funding existed, finding the right person and setting appropriate priorities would be a challenge.
CME Software Eco-System The SCEC Community Modeling Environment (CME) software covers the computational pathways designed to improve ground motion forecasting. CME software represents an inter-related set of computational tools from CVMs, to UCERF3, to CyberShake, to Full 3D Tomography. SCEC CME software is a collection of scientific codes that together provide a full range of seismic hazard analysis tools including SCEC Velocity Models, UCVM, Dynamic Rupture Codes, 1D Broadband, 3D AWP-ODC, 3D Hercules, OpenSHA, CyberShake, and full 3D tomography. In NSF OCI terminology, these programs form a software “eco-system” of inter-related and inter-dependent modeling tools that can be used to calculate physics-based probabilistic seismic hazard models.
SCEC Scientific Computing Successes SCEC’s most productive scientific computing collaborations are organized around an important seismic hazard data product or calculation that can be improved using advanced computational techniques. The scientific challenge defines the computational goal, and computing techniques are introduced as needed to reach the goal. SCEC scientific computing projects are integrative, bringing together inter-related SCEC structural and computational models. SCEC5 should continue to organize and focus integrative, science-driven, broad-impact, scientific computing projects.
SCEC Scientific Computing In Special Projects Within Special Projects, the most successful SCEC scientific computing projects have been multi-disciplinary collaborations that include scientists, engineers, computer scientists, and software developers. Examples Include: OpenSHA, Broadband Platform, CyberShake, OEF, CSEP. Having software developers work with scientists and engineers is our key strategy for avoiding wasted software developer time and software nobody uses.
SCEC Scientific Computing Successes SCEC special projects are often a mechanism for extending the computational capabilities of individual research codes into community-based, practical, computational data products. Special project calculations often represent more of a community calculation, rather than an individual researcher calculation. Core SCEC5 computational science should play an increased role in identifying the best available codes that should be used in special project calculations.
SCEC Scientific Computing Successes Due to the importance of scientific software, SCEC5 should initiate efforts to improve software development capabilities within both scientific and research staff. SCEC should train scientific staff in software basics, such as the material covered in “Software Carpentry” and other software engineering overviews (e.g., by the end of SCEC5, SCEC researchers should use version control for their research software). Because the software field changes rapidly, SCEC software staff should be required to complete annual training to keep skills current. SCEC computer training likely needs to be increased. Increasing interactions between SCEC computational science and CEO might enable CEO to support SCEC computer training.
Project Sizes SCEC has had the most success coordinating the efforts of small software teams working on well-focused research activities. We recommend that SCEC5 special software projects be organized around project teams of approximately six or fewer people. If SCEC software project groups grow larger, SCEC will need to rethink how groups are organized and managed.
Software Staff-related Issues To maintain a software staff, SCEC5 management must recognize that most software staff are not academics. Often, the software developer’s goal is to produce working software, used by a community, or used to produce an important result, rather than to publish papers. Also, in the fast-paced software field, forward career motion is important to software people. To retain talented software staff, SCEC5 will need to provide a non-academic software staff career path through which staff members can reasonably progress. The staff software developer career path should define positions with gradually increasing responsibilities, and each SCEC position should be linked to an appropriate official USC staff position. The career paths should enable staff to progress into either advanced technical or management roles.
Finding Good Software Staff Special projects’ best source of staff software developers has been the UseIT Intern program. UseIT was highly effective as a way to attract student interest in SCEC research and to evaluate the students’ readiness to contribute to SCEC software projects. The SCEC intern programs work as a farm team for SCEC’s wider computational science program. If SCEC5 must maintain a significant software staff, operating a UseIT-type intern program could be very valuable for recruitment.
Obtaining HPC Time Both core SCEC and SCEC special projects need HPC time, but special projects need more. If special projects are funded, including Keck and Central California, the importance of computing time will increase. To avoid a shortfall, SCEC will need to dedicate personnel to obtaining, managing, and reporting on supercomputer hours. At the large proposed scales, the staff will not be able to both raise the computing hours and have time to perform the research. The cost of a person to raise the computing hours will be less than the cost of directly purchasing the computing time.
Obtaining HPC Time An important SCEC5 strategy for obtaining large-scale computing time will be to stay qualified on the largest systems to meet the needs of HPC research. Staying qualified on a new system often requires a new or re-written version of a high-performance code. SCEC wave propagation codes, which are being pushed to higher and higher frequencies, are good candidates for codes that SCEC can develop to keep us qualified on the newest and largest HPC systems. Participating with HPC centers developing next generation supercomputers (co-Design concept) is advanced HPC. It would require several more SCEC people, including senior computer scientists. This is high-risk, high-reward; the greatest danger to SCEC is that no research computing gets done, only system-testing software.
Sustainability Strategy SCEC5 can benefit from a computational science group, and avoid wasting software development time, by doing the following: Integrate the best available core SCEC scientific software into important broad-impact data products such as CSEP, UCERF, Broadband, CyberShake, High-F, and Full 3D tomography. Evaluate all USGS seismological data products, including EEW, ShakeMap, UCERF, Hazard Maps, and OEF, and identify ways core SCEC research can improve them. Where clear improvements are possible, form a multi-disciplinary group to implement the improvements.
Additional Topics (in HPC White Paper) Key Software Needed for SCEC5 Computational Science Contributing to SCEC Visibility Additional Software Sustainability Strategies