Data Management Scope and Strategies
K.L. Sender and J.L. Pappas
Information and Technical Services
National Marine Fisheries Service, Southwest Fisheries Science Center, Honolulu Laboratory

For those of you who haven’t met me before, my name is Karen Sender. I have been working at the Honolulu Lab as a JIMAR employee for a little over one year as an application developer in the Information and Technical Services group. For most of the past 25 years, I worked in marine geology and geophysics at the Hawaii Institute of Geophysics, first as the scientific lab officer onboard UH’s research vessel and the data manager for their ocean-bottom seismometer program. Subsequently, I was a principal in the development of the University’s seafloor mapping project, serving as data manager and data processing systems developer. My responsibilities included all aspects of acquisition, processing, analysis, imaging, archiving and documentation of large, complex data sets.

Over the past several months, Jan Pappas and I have spent much of our time working with the Longline Observer data, trying primarily to resolve issues with data transfer from PIAO to HL, but in the process also assessing data integrity. Our findings were presented to PIAO and interested HL data users last week. In a broad sense, most of the issues we were concerned with seemed to be the result of a lack of good data management practices. These issues were also not uncommon with other data sets at the Lab, and eventually we were asking ourselves, with increasing frustration, why these mistakes get repeated. When news came our way about the new Bottom Fish Observer Program, we started talking about ways to design data sets that would result in maximum data quality with the least amount of wasted effort. These talks resulted in our drafting a Data System Design Protocol that outlined procedures, roles and responsibilities for designing and maintaining a quality data set.
While Jan and I were, we think, justifiably pleased with this document, further discussion led us to the conclusion that we have no way of implementing it. No structure or mechanism exists wherein it can be proposed or adopted as a laboratory standard. Clearly, what is needed within the Laboratory is a better understanding of the scope and importance of good data management and, just as importantly, a cohesive set of policies and guidelines on how data are collected and managed within our organization. This presentation is our attempt to address those issues. Let us first begin by reviewing why we care about data management.
Data Is the Foundation on Which We Build Success. Within NMFS we are working in an environment in which our success is measured not only by our ability to provide informed decisions on managing our natural marine resources, but also by our ability to defend the science on which we base those decisions. Successful management of data is critical to our mission to make informed decisions.
The quality of the science can be only as good as the data it was based upon.
What Is the Goal of Data Management? The most important goal of Data Management is to ensure access to, and dissemination of, quality data to appropriate end-users in a timely manner.
Quality Data Is the Result of Good Data Management From data program conception through the lifecycle of the data, good data management requires consistent, well thought-out and universally supported procedures and guidelines for its collection, maintenance and dissemination.
Why Do We Want to Manage Data? To fulfill the agency’s mission to conserve and manage the nation’s coastal and marine resources to ensure sustainable economic opportunities. Why do we want to manage data? Our gut answer should unanimously be, “because we want the best data.” But even if that were not true, we are obligated to manage our data to fulfill the agency’s mission.
What Is Data Management? Data Management is the vast array of tasks that begin before data collection and continue throughout the lifecycle of the data. Before the rise of the modern scientific enterprise, data would commonly be collected and processed by an individual or a project, and various levels of raw, processed or final data were possibly turned over to a data repository for inventory and archiving. Lack of data management standards might cause headaches for those generating and processing the data, but often the only noticeable results would be delays in the final product. In today’s scientific enterprise, data must be shared, sometimes in near real time, in order to maximize the value of the data. Management of data in this type of environment requires a broader understanding of the processes through which data are collected and maintained within the centralized data resource. Documentation, or metadata, must be complete and accessible. Users must be able to rely on the quality of the data they obtain from sources not their own. It’s Not Just Archiving!
Data Management Includes Data Resource Development. The data resource is the centralized data repository used for storage and access of scientific enterprise data. Data management tasks for developing and maintaining the data resource include…
Identification of data to be included in the data resource
· What data need to be shared?
· Will sharing a data set increase its value and the value of an existing data set?
Defining business rules for the data elements
· What are the valid ranges of a given data element?
· What does a null value mean versus a zero in a data field?
Designing the data models
· Is the data model compatible with other data sets with which it needs to be integrated?
Designing data element naming/format standards
· Can users easily understand what the data fields represent?
· (6 names for temperature and multiple units of measure)
Complying with hardware/software standards
· Are we using agency-approved hardware and software tools?
· Are our current tools maximizing data quality and facilitating data access?
Performing routine and disaster recovery system administration
· Are we performing both preventive maintenance and planning for catastrophic failures?
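The business-rule questions above (valid ranges, and what a null means versus a zero) can be made concrete with a small sketch. Everything in it is an assumption made for illustration: the element names, units and ranges are hypothetical and are not taken from any Laboratory data set.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ElementRule:
    """Business rule for one data element (all values are illustrative)."""
    name: str
    units: str
    minimum: float
    maximum: float
    null_meaning: str  # what a missing value means, as opposed to a zero


# Hypothetical rules; a real data resource would define its own.
RULES = {
    "sea_surface_temp_c": ElementRule(
        "sea_surface_temp_c", "degrees Celsius", -2.0, 40.0,
        "temperature was not measured"),
    "hooks_set": ElementRule(
        "hooks_set", "count", 0, 5000,
        "count was not recorded"),
}


def check_value(element: str, value: Optional[float]) -> str:
    """Classify a value as 'missing', 'valid' or 'out_of_range'."""
    rule = RULES[element]
    if value is None:
        return "missing"        # null: no observation was made
    if rule.minimum <= value <= rule.maximum:
        return "valid"          # a zero is a real observation, not a null
    return "out_of_range"
```

Keeping one standard name and one unit per element in a table like `RULES` is one possible way to avoid several names and units for the same measurement.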
Data Management Includes Data Collection. Data management includes all aspects of data design and collection…
Identifying not only the data elements desired but also any additional information that will be required to make use of those data.
· Do data records require unique identifiers?
· Do any reference tables need to be designed or can existing ones be used?
Developing clear, concise manuals and instructions for data collection.
· Are the instructions for data fields unambiguous?
Designing user-friendly paper and electronic forms for data capture and entry
· Is data quality being compromised at data collection because the forms are too complicated or difficult to fill out?
Building data validation schemes in data entry applications
· Are unreasonable data excluded from the data set at data entry?
· Do you ever want to see north longitudes in your data?
Maintaining data set documentation and history
· Can you reproduce the path of a piece of data from data collection through data extraction should you be required by some legal review process?
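As an illustration of validation at data entry, the following minimal sketch rejects a "north longitude" before it reaches the data set. The function name, argument layout and messages are hypothetical; an actual data entry application would follow its own form design.

```python
def validate_position(lat_deg, lat_hem, lon_deg, lon_hem):
    """Return a list of error messages; an empty list means the position passes."""
    errors = []
    # Latitude must carry a north/south hemisphere and lie within 0-90 degrees.
    if lat_hem not in ("N", "S"):
        errors.append(f"latitude hemisphere must be N or S, got {lat_hem!r}")
    if not 0 <= lat_deg <= 90:
        errors.append(f"latitude degrees out of range: {lat_deg}")
    # Longitude must carry an east/west hemisphere and lie within 0-180 degrees.
    if lon_hem not in ("E", "W"):
        errors.append(f"longitude hemisphere must be E or W, got {lon_hem!r}")
    if not 0 <= lon_deg <= 180:
        errors.append(f"longitude degrees out of range: {lon_deg}")
    return errors
```

A data entry form could call `validate_position` when a record is saved and refuse the record until the returned list is empty.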
Data Management Includes Data Maintenance. Data Maintenance is an ongoing process that ensures the highest quality data are available at all times. It includes…
Storing data in the centralized data resource
· Are data-transfer routines automated and fully documented?
· Have they been proved not to introduce errors or modify the data?
Detecting and reporting errors via data monitoring applications
· Do you want to have your users catch the errors before you do?
Maintaining a history of when and by whom data are added, modified or deleted
· Not only is this required for data security considerations, it provides information that can be used for process improvement.
· How quickly is the data set growing?
· What are the most common errors that are edited post data entry and can we eliminate them?
Periodically auditing data by tracking data from collection to dissemination via the data path.
· Are we ensuring that data are not lost or contaminated?
· The first time your data are audited, do you really want it to be by an outside auditor?
Developing and following formal change control procedures
· Are all changes documented?
· Have the changes been tested in a development environment?
· Has proper notice been provided to all role groups?
Developing and documenting data processing procedures
· Can you fully document the path of your data?
Developing and testing applications before release
· Can data downtime and user frustration be minimized?
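The change history described above can be sketched as a simple in-memory log that records who did what and when, and then answers the two process-improvement questions (growth of the data set and the most commonly edited fields). The class and method names are hypothetical, and a production system would more likely keep this history in the database itself.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChangeRecord:
    when: datetime
    who: str
    action: str       # "add", "modify" or "delete"
    record_id: str
    field_name: str   # field that was edited; empty for add/delete


class ChangeLog:
    ACTIONS = ("add", "modify", "delete")

    def __init__(self):
        self._records = []

    def record(self, who, action, record_id, field_name="", when=None):
        """Append one history entry, stamping it with the current UTC time."""
        if action not in self.ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        entry = ChangeRecord(when or datetime.now(timezone.utc),
                             who, action, record_id, field_name)
        self._records.append(entry)
        return entry

    def net_growth(self):
        """Records added minus records deleted."""
        counts = Counter(r.action for r in self._records)
        return counts["add"] - counts["delete"]

    def most_edited_fields(self, n=3):
        """Fields most often modified after data entry, most frequent first."""
        counts = Counter(r.field_name for r in self._records
                         if r.action == "modify")
        return counts.most_common(n)
```

Fields that appear at the top of `most_edited_fields` are candidates for better form design or stricter validation at data entry.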
Data Management Includes Data Dissemination. Data management includes those tasks related to data dissemination…
Providing user support via training and problem resolution
· Are the users provided with proper tools and information to allow access to the data and metadata?
Complying with data accessibility issues
· Have data access tools been reviewed and tested by an outside, independent reviewer?
Ensuring data availability and ease of integration with other data sets, as needed
· (Whether this is within the lab or across the agency.)
Establishing data security
· Are all appropriate users granted access in a timely manner?
· Is access revoked promptly when necessary?
Complying with data publication formats and deadlines
· Do final data products such as maps, reports and web pages include standardized logs, scales and necessary metadata and disclaimers?
· Do dates clearly indicate the point in time when data were extracted?
Data management is…
…a lot more than we tend to think!
What Are the Costs of Poor Data Management?
· Misinterpretation of the data
· Lost data
· Inaccessible data
· Indefensible data
· Wasted time and money
· Missed deadlines
· Lost user confidence
Any one of these can mean failure for a project!
What Are the Benefits of Good Data Management?
· Optimum data quality
· Improved user confidence
· Efficient and timely access to data
· Improved knowledge and understanding of the agency’s data holdings
All of which should be our goal.
How Is Good Data Management Achieved? Through the development and implementation of well-conceived data management policy and data administration guidelines. Data management policy is a short, clearly written statement or outline of the organization’s philosophy, vision and goals for management of data. This policy applies to the entire organization, not just the IT/IM role groups and as such must be written in clear non-technical language. It should be inspiring, not threatening. The policy defines what you want everyone in the enterprise to accomplish. The actual methods for fulfilling the data management policy are defined in the Data Administration Guidelines. These guidelines define how you are going to implement the Data Management Policy. Where the Data Management policy might not change for a number of years, the Data Administration Guidelines will most likely be revised and refined on a regular basis.
Data Management Policy Policy for the management and protection of agency data. Set of broad, high-level principles forming a framework in which data management can operate efficiently and effectively. Data management policies exist for NOAA and NMFS and we are mandated to work within them. Why do we require a separate data management policy? Because creating our own defines and dictates our own attitudes and philosophies toward our data.
Data Management Policy Serves To…
· Ensure availability of stable, reliable and accessible collections of data in electronic form to all appropriate parties.
· Ensure compliance with all agency-wide mandates and directives.
· Improve direct access to data by the public and across the agency.
Data Management Policy Allows…
· Good fisheries science.
· Good fisheries management.
· Data users to be confident in their interpretations of quality data.
· The agency to properly defend its data in a court of law.
Data Management Policy What might work… After reviewing NOAA and NMFS data management policies, along with innumerable others from both federal and state research institutions, Jan and I came up with what we think might be taken as a framework or at least a starting point for developing data management policy for the Laboratory.
A Draft Data Management Policy
· Programs that generate data will adhere to data management policies and guidelines.
· Data and metadata will be managed and stored in a centralized data resource.
· The data resource will be safeguarded and protected.
and…
1. All functional units within the agency will comply with the data management policy. All outside organizations that collaborate with the agency on data will conform to the established data management policies and guidelines. (It is not enough to follow good data management practices ourselves if we then do not hold our data collaborators and contractors to the same standards.)
2. Database organization and structure will be planned on functional and agency levels. Data will be managed through the data stewardship principles of administering and controlling data quality and standards in support of agency goals and objectives. (Data stewardship implies a caretaker rather than an owner of the data.)
3. Data will be protected from deliberate, unintentional or unauthorized alteration, destruction and/or inappropriate disclosure or use in accordance with agency policies and practices and federal and state laws. (An obvious and necessary consideration.)
A Draft Data Management Policy
· Data will be shared based on agency policies.
· Agency data will be cataloged and documented.
· Information quality will be actively managed throughout the life cycle of the data.
4. A particular individual, unit or group does not own agency data. The data will be made accessible to all authorized users in a timely manner, per agency policy and state and federal laws.
5. Standards will be developed for the representation of agency data and its metadata in the database. Business processes will be defined and documented. Controls will be established to assure the completeness and validity of the data and to manage redundancy. The agency data resource will be the officially recognized source for data reporting purposes.
6. Explicit criteria for data validity, availability, accessibility, understanding and ease of use will be established and promoted through data administration guidelines. An active program of process improvement will be applied to all data management policies, guidelines and protocols.
Who Sets Data Management Policy and Guidelines?
An information technology/information management steering committee, composed of one or more representatives from the key data management role groups:
· Administration
· Data generators
· Data users
· Information technology
· Management
Data Management Steering Committee
· Ensures that data management policies are in line with those of NMFS, NOAA and DOC.
· Directs development, implementation and maintenance of detailed data policies, standards, procedures and guidelines across the agency.
· Reports progress to the director on the performance achieved against the targets for improvement of data quality and the value gained from effective data management.
Think of this as much more a “working group” than a committee.
Data Management Roles and Responsibilities Clearly defined data management roles and responsibilities are required to execute the policies and guidelines of the agency.
All data management role groups must live within the policies and guidelines of DOC, NOAA and NMFS.
Data Ownership
The agency is the owner of the data. The organizational unit or group that commissions the collection of a data set assigns a data steward to that data set. Agency data are not owned by a particular individual, unit or project but by the agency itself. Data Stewardship implies formal responsibility and accountability for management and quality of the data assigned to this role, in accordance with the defined data management policy. The buck stops here!
Data Stewards set policy for the collection, management and accessibility of the data set and its metadata to ensure compliance with agency data management policies, mandates and relevant state/federal laws.
Data Stewards
· Have planning and policy-level responsibility in their functional area.
· Document policies and procedures for access and use of the data set.
· Ensure that the data set is stored, managed and accessed in the enterprise database per the agreed upon data management policy.
· Periodically review costs and benefits of continuing to maintain the data set.
Data Managers
· Perform and supervise operational management of their data sets per the data management policy, data administration guidelines and the data set policies set by the data stewards.
· Have operational-level responsibility for data management activities related to data capture, maintenance and dissemination.
Database Administrator
Has responsibility for the physical data resource:
· Generating physical database schema
· Performing database tuning
· Creating database backups
· Planning for database capacity
· Implementing data security requirements
Data Users are individuals requiring access to agency data in the course of meeting the requirements of their position and anyone in the public who wishes access to public information held in the data resource.
The Data Administrator, along with the IT/IM steering committee, develops and maintains data administration guidelines and procedures. The Data Administrator facilitates the coordination between the data management role groups and provides guidance and training to comply with data policies and guidelines.
Data Administrator
· Ensures that all data management role groups comply with data management policies and guidelines.
· Periodically reports to the director on the status of compliance with data management policies and guidelines.
Data Administration Guidelines
Required for all data management task areas:
· Data resource development
· Data collection
· Data maintenance
· Data dissemination
The Data Administration Guidelines define how we accomplish our goals set in the Data Management Policy. Each task within the four data management areas must have clearly defined guidelines that, when followed, will ensure that the final information produced is of the highest quality. The guidelines will typically outline the roles and responsibilities of each member of the data management team. They most likely will state specifics on what hardware and software tools are acceptable to use. They should define what types of data should be stored in the enterprise database and specify the time frame in which that should happen. Some guidelines will probably refer to the development and implementation of very detailed procedures and protocols that must be followed, for example, the Data System Design Protocol that we referred to earlier. It is important to note again that where the Data Management Policy outlines high level principles and philosophy on managing data within our organization, the Data Administration Guidelines provide us with the specifics on how to achieve those goals.
The task areas in Data Management may seem daunting, and we do not mean to imply that most of these are not getting accomplished here at the Lab. Unfortunately, some tasks are slipping through the cracks, some are being performed redundantly or by the wrong role groups, and others are only partially addressed. Here at the Lab we have great people with a lot of talent working in these areas, but we seem not to have completely adopted the concept of the scientific enterprise. What we really need to do is assess how we can redirect our skills and energies to accomplish all these tasks in an effective and efficient manner with the common goal of maximizing the value of our data. Changes do need to be made, but in the field of Information Technology and Information Management, change is a way of life, especially in science and research. It is important to note that in today’s scientific enterprise, none of these tasks can be executed in isolation from the rest. A data form cannot be designed without input from the database designer or even a publication editor. The data steward cannot decide to add fields to the data set without consulting the person training the data collectors, the person writing the data collection manuals and probably every role group down the data chain.
Good data management is fundamental to our success. For some, it will require a new way of thinking: Enterprise Thinking. In the past, collecting an additional data set would, of course, be of great value to the lab. Now, adding that same data set to the data resource has the potential to significantly increase the value of multiple existing data sets if the data are properly managed.