U.S. Department of the Interior U.S. Geological Survey National Water Quality Monitoring Council April 2014 Duplicated Water Data - Causes, Implications,

Slides:



Advertisements
Similar presentations
Common Data Repository and Water Resource Assessment for the Southern and Northern Piceance Basin Northwest Colorado Oil and Gas Forum June 2008 Jude Thomas,
Advertisements

History Data Service1 Good Design for Historical source based Databases History Data Service Hamish James.
GTA Network Management Systems On Behalf Of BellSouth.
Foundational Objects. Areas of coverage Technical objects Foundational objects Lessons learned from review of Use Case content Simple Study Simple Questionnaire.
From Words to Meaning to Insight Julia Cretchley & Mike Neal.
1 Web Services USGS/EPA Collaboration November 27, 2007 Dwane Young, U.S. EPA Nate Booth, USGS.
Maintenance Modifying the data –Add records –Delete records –Update records Modifying the design –Add fields into tables –Remove fields from a table –Change.
Time Series Analyst An Internet Based Application for Viewing and Analyzing Environmental Time Series Jeffery S. Horsburgh Utah State University David.
Chapter 14 Getting to First Base: Introduction to Database Concepts.
Security in Databases. 2 Srini & Nandita (CSE2500)DB Security Outline review of databases reliability & integrity protection of sensitive data protection.
Reporting Module for Gateway Yvonne Yao. Recap: What is the Gateway? Web-base system Create, schedule, send mailings Statistics collected and presented.
Security in Databases. 2 Outline review of databases reliability & integrity protection of sensitive data protection against inference multi-level security.
1 Chapter 2 Reviewing Tables and Queries. 2 Chapter Objectives Identify the steps required to develop an Access application Specify the characteristics.
Multiple Indicator Cluster Surveys Survey Design Workshop Data Analysis and Reporting MICS Survey Design Workshop.
U.S. Department of the Interior U.S. Geological Survey NWIS Web Services Snapshot for ArcGIS Sally Holl and David Maltby Based on work by David McCulloch.
Business and Management Research
NWIS Data Pulls for National- and Regional-Scale Applications Kathy Smith (Crustal Geophysics and Geochemistry) and Nancy Bauch (Colorado Water Science.
More than you probably wanted to know about NWIS and NWISWeb U.S. Department of the Interior, U.S. Geological Survey Kenneth J. Lanfear USGS.
Pam Fuller U.S. Geological Survey Gainesville, FL Nancy Elder U.S. Geological Survey Western Fisheries Research Center Marrowstone Marine Station Nonindigenous.
6. Implications for Analysis: Data Content. 1 Prerequisites Recommended modules to complete before viewing this module  1. Introduction to the NLTS2.
U.S. Department of the Interior U.S. Geological Survey NWIS, STORET, and XML National Water Quality Monitoring Council August 20, 2003.
Roger Miller, Arkansas Department of Environmental Quality Barry Jackson, USGS Arkansas Water Science Center ARKANSAS EXCHANGE NETWORK FOR GROUNDWATER-QUALITY.
Managing Monitoring Data from Many Sources A New Hampshire Experience Deb Soule Watershed Management Bureau New Hampshire Department of Environmental Services.
TECHNICAL DOCUMENTATIONPARTNERS DOWNLOAD DATA Download water quality data in MS Excel, CSV, TSV, and KML formats. Learn how to use the portal and data.
Data management in the field Ari Haukijärvi 2nd EHES training seminar.
Concepts and Terminology Introduction to Database.
Copyright © 2007, Oracle. All rights reserved. Managing Concurrent Requests.
Copyright : Hi Tech Criminal Justice, Raymond E. Foster Public PolicyPublic Policy and Practice in Criminal Justice Criminal Justice Public.
Introduction to Geospatial Metadata – ISO 191** Metadata National Centers for Environmental Information (NCEI)
Key Data Management Tasks in Stata
Water Quality Data, Maps, and Graphs Over the Web · Chemical concentrations in water, sediment, and aquatic organism tissues.
U.S. Department of the Interior U.S. Geological Survey NWIS, STORET, and XML Advisory Committee on Water Information September 10, 2003 Kenneth J. Lanfear,
PM2.5 Model Performance Evaluation- Purpose and Goals PM Model Evaluation Workshop February 10, 2004 Chapel Hill, NC Brian Timin EPA/OAQPS.
Microsoft ® Office Access ® 2007 Training Build a database I: Design tables for a new Access database ICT Staff Development presents:
U.S. Department of the Interior U.S. Geological Survey Dr. Robert M. Hirsch Associate Director for Water April 16, 2007 USGS: Water Resources Program.
Copyright: ©2005 by Elsevier Inc. All rights reserved. 1 Chapter - 2 Basics of Sound Structure Author: Graeme C. Simsion and Graham C. Witt.
Return to Overview Slide A B,C,D,E Possible NCI based status display program Site anonymization computer Secure transfer in-house to LIDC computer Remote.
Chapter 4 Data and Databases. Learning Objectives Upon successful completion of this chapter, you will be able to: Describe the differences between data,
U.S. Department of the Interior U.S. Geological Survey USGS Water Data Exchange Services USGS Office of Water Information June 2009 Nate Booth, Dave Briar.
Databases Organizing Sorting Querying A Presentation by Karen Work Richardson.
U.S. Department of the Interior U.S. Geological Survey Using Monitoring Data from Multiple Networks/Agencies to Calibrate Nutrient SPARROW* Models, Southeastern.
Topic 1: Introduction to SQL. SQL stands for Structured Query Language. SQL is a standard computer language for accessing and manipulating databases SQL.
1 Non-Response Non-response is the failure to obtain survey measures on a sample unit Non-response increases the potential for non-response bias. It occurs.
N atural R esource I nformation S ystems. ObjectivesObjectives  NRIS Water Overview  Spatial Data in NRIS Water  Integration of NHD data.
SSWD Advisory Committee on Water Information Status report: Subcommittee on Spatial Water Data Herndon, VA September 15, 2004.
Data Sharing: Progress Report National Water-Quality Monitoring Council July 20, 2004.
Water Quality Exchange (WQX) Pilot and its Potential Role in NWIS/STORET Coordination October 18, 2005.
6 th Annual Focus Users’ Conference 6 th Annual Focus Users’ Conference Import Testing Data Presented by: Adrian Ruiz Presented by: Adrian Ruiz.
Compass ® Reports Customized List Report KDE:OAA:pp: 9/5/
Metadata Content Entering Metadata Information. Discovery vs. Access vs. Understanding Cannot search on content if it is not documented. Cannot access.
U.S. Department of the Interior U.S. Geological Survey Automatic Generation of Parameter Inputs and Visualization of Model Outputs for AGNPS using GIS.
U.S. Department of the Interior U.S. Geological Survey MACOORA 4 th Annual Meeting Understanding the Coastal Ocean: Partnerships for a Changing World--
U.S. Department of the Interior U.S. Geological Survey Overview and Evaluation of the Current Hydrologic Data Network Scott Morlock USGS Indiana Water.
U.S. Department of the Interior U.S. Geological Survey Ensuring Full Access to Federal Cost Shared Conservation Practices W. Dean Hively, Ph.D. U.S. Geological.
1 Web Services USGS/EPA Collaboration February 21, 2008 Dwane Young, U.S. EPA; Jon Scott, USGS; Dorinda Gellenbeck, USGS; Nate Booth, USGS.
TOPSpro Special Topics I: Database Managemen t. Agenda for Module I: Database Management  TOPSpro Backup/Restore Wizard  TOPS-TOPS Import/Export Wizard.
How official statistics is produced Alan Vask
Introduction/ Section 5.1 Designing Samples.  We know how to describe data in various ways ◦ Visually, Numerically, etc  Now, we’ll focus on producing.
Research Proposal Writing Resource Person : Furqan-ul-haq Siddiqui Lecture on; Wednesday, May 13, 2015 Quetta Campus.
TOPSpro Special Topics TOPSpro Special Topics Data Detective I: Tracking Your Pre- and Post-test Results.
IT 5433 LM3 Relational Data Model. Learning Objectives: List the 5 properties of relations List the properties of a candidate key, primary key and foreign.
Chapter 9 - Report Writing: From Formal Documents to Short Summaries 1 Understanding the Nature of a Report A report is the compilation of information.
Overview Introduction to marketing research Research design Data collection Data analysis Reporting results.
WORKSHOP GROUP ON QUALITY IN STATISTICS
Intro to Expert Systems Paula Matuszek CSC 8750, Fall, 2004
Getting to First Base: Introduction to Database Concepts
Getting to First Base: Introduction to Database Concepts
Getting to First Base: Introduction to Database Concepts
Presentation transcript:

U.S. Department of the Interior U.S. Geological Survey National Water Quality Monitoring Council April 2014 Duplicated Water Data - Causes, Implications, Solutions Dorrie Gellenbeck and Jonathon Scott, U.S. Geological Survey

Introduction

Overview  This is a story about :  How we broke the USGS hydrologic database through the best of intentions  The consequences of those intentions  How we are fixing it and the implications  What may lie ahead in the future

How we broke it.  WATSTORE precursor to NWIS  Born in 1971  Disabled in 1997  Followed technology changes  Mainframe->minicomputers.  The WATSTORE database was moved into NWIS instances.  48 separate NWIS databases

How we broke it - Best of intentions  Given the huge time and expense of new data collection...  Copying data was commonplace:  Within USGS between NWIS systems  Outside USGS with USGS data - STORET  Coupled with large spatial and temporal variability  It was natural to compile everyone's data

How we broke it - Best of intentions  Congress identified difficulties with synthesizing data from multiple agencies  USGS copied data into STORET for years  Warning signs: we observed multiple copies of USGS data in EPA STORET  ~2003 USGS data from STORET was removed  Triage step #1

Problems of duplicated data:  Bias  Statistical analyses  Same data included more than once  Incorrect analyses from stale copies  Incomplete analyses  Copies between systems  Data updates in one version and not the other

How can USGS fix the problem?  Identify the problem  A completely described problem is 50% solved ~Charles Kettering  Gain support and resources  Alone we can do so little; together we can do so much ~Helen Keller  Identify the technical approach  The devil is in the details ~Anonymous

How we are fixing it.  Create aggregated national database  Includes all non-continuous data from separate NWIS systems  Water-quality data are one data type  Currently internal only  Non-authoritative databases were identified  Not included in the aggregation  Find & fix identical station ID for non- colocated stations  Where possible

How we are fixing it. *The short version* 1. Group data into “Data Topics” 2. Develop algorithms to compare duplicates 3. Choose the “best” copy 4. Eventually, keep only the best record

National Internal DB Automated Deduplication National Internal DB

How we are fixing it. *The long version*  Divide DB into topics. For each topic:  Specify keys for exact duplicates  Specify keys for inexact duplicates  Computer automatically resolves exact duplicates  Subject matter experts formulate rules to resolve inexact duplicates  Devise scoring weights and tie breakers

Example: Subset of water-quality scores Note: each of 37 data topics have unique rules ScoreRule +2Most results +1Earliest result-creation date +0.5Most results with non-empty remark code +0.5Most results with non-empty value-qualifier codes +1Greatest number of non-null, non-mandatory fields +0.1Earliest sample creation date

Example: Subset of water-quality scores  Scores are weighted  More important characteristics get more points  Determined by data experts  Example:  Highest scoring rule (2.0 pts) – Most results  Explanation: ‘More is better’  Lowest scoring rule (0.1 pts) - Earliest sample creation date  Explanation: Indicates original sample record

How we are fixing it.  A water-quality example CategoryArizonascoreNew Mexicoscore Number of results Earliest result creation dateno0yes1 Most results with non-empty remark code no0yes0.5 Most results with non-empty value-qualifier codes yes (tie)0.5yes (tie)0.5 Greatest number of non null, non-mandatory fields no0yes1 Earliest sample creation dateno0yes0.1 Total Score

What's next for USGS?  NWISWeb & QW Portal - Use ‘deduplicated’ aggregated database as single source for public display  NWISWeb – expand data-types available  Public may see some content changes:  Drainage area  Well depth  HUC

What's next for others? Duplicates may exist from multiple agencies  Agency code differences  Site Identifier differences  USGS internal problems were just the tip of the iceberg.

What's next for others?  Employ similar approaches?  Employ river reach?  Track aliases for site identifiers among agencies  Employ a common site numbering scheme  Stop COPYING!

Questions? Contacts  Dorrie Gellenbeck, USGS  Denver, Colorado   Jon Scott, USGS Retired 