# 1 METADATA: A LEGACY FOR OUR GRANDCHILDREN N. Scott Urquhart STARMAP Program Director Department of Statistics Colorado State University
# 2 DISCLAIMERS The work reported here today was developed under the STAR Research Assistance Agreement CR awarded by the U.S. Environmental Protection Agency (EPA) to Colorado State University. This presentation has not been formally reviewed by EPA. The views expressed here are solely those of the author and of STARMAP, the program he represents. EPA does not endorse any products or commercial services mentioned in this presentation. The people of CEER-GOM have heard parts of this presentation. Sorry. That presentation at Ocean Springs, MS (3/26/02) led to an invitation for this talk.
# 3 CONTEXT FOR COMMENTS SPACE-TIME AQUATIC RESOURCES MODELING AND ANALYSIS PROGRAM = STARMAP STARMAP IS FUNDED BY EPA’s STAR PROGRAM, AS ARE ALL OF THE EaGLes PROGRAMS (==> “SIBLING” PROGRAMS) STARMAP IS TO USE EMAP AS A DATA SOURCE AND CONTEXT NSU = STARMAP PROGRAM CSU 10 YEARS OF COLLABORATION WITH EMAP 40 + YEARS AS STATISTICIAN WORKING WITH ECOLOGISTS
# 4 AN IMPORTANT LESSON YOU DO NOT KNOW WHAT YOUR DATA WILL BE USED FOR 20 YEARS FROM NOW. BY THE TIME THE VARIOUS EaGLes PROGRAMS ARE COMPLETE, WE, AS TAXPAYERS, WILL HAVE INVESTED > $40M IN THE VARIOUS STUDIES. THE RESULTING DATA NEEDS TO BE RESPONSIBLY AND READILY AVAILABLE TO FUTURE GENERATIONS.
# 5 YOU DO NOT KNOW WHAT YOUR DATA WILL BE USED FOR 20 YEARS FROM NOW POPULAR PERSPECTIVE - WE “KNOW” LOTS ABOUT THE “ENVIRONMENT” REALITY: GOOD AQUATIC DATA IS SCARCE: SPATIALLY EXTENSIVE; OVER A REASONABLE TIME SPAN; WELL DOCUMENTED PROCEDURES; WELL TRAINED CREWS; CAREFULLY EXECUTED STUDIES; DATA PUBLICLY AVAILABLE
# 6 THE VALUE OF “METADATA” DATA WITHOUT CONTEXT ARE NUMBERS: NEARLY WORTHLESS TO OTHERS. How many file cabinets full of data are in your park offices? DATA WITH CONTEXT IS INFORMATION: CAN BE VALUABLE TO OTHERS. CONTEXT IS CALLED METADATA
# 7 VERY DISCOURAGING EXPERIENCE WITH HISTORIC DATA THREE HISTORIC DATA SETS NUTRIENTS IN NORTHEAST LAKES Larsen, D. P., N. S. Urquhart and D. Kugler (1995). Regional scale trend monitoring of indicators of trophic condition of lakes. Water Resources Bulletin 31: E. COLI IN A RIVER BASIN IN OREGON NUTRIENTS IN LAKES & STREAMS IN EPA REGION 10 EMAP SURFACE WATERS I THOUGHT THIS WAS WELL DOCUMENTED!
# 8 SO WHAT IS METADATA? BEST DEF’N SEEMS TO BE ORGANIZED “DATA ABOUT DATA” VERY DIVERSE VIEWS ABOUT WHAT IT SHOULD CONTAIN: LIBRARIANS; W3 GROUP - DEFINING FEATURES OF THE WORLD WIDE WEB { title, description, publication date and author }; CENSUS-BUREAU TYPES, WORLDWIDE; GEOGRAPHIC DATA STANDARDS; EPA’s STORET
# 9 WHAT IS METADATA GOOD FOR? A Librarian probably would answer: Discovery; Managing the resource (Ownership & responsibility); ARCHIVING; AUTHENTICATING - QA/QC - UNCHANGING; GROWING. This statistician answers: For correctly analyzing data in the future; Not discovery, but correct utilization; Paths to related documents based on the same dataset
# 10 METADATA COMPONENTS IMPORTANT TO A PERSON ANALYZING THE DATA NAME OF DATASET DEFINITION OF RESPONSES EVALUATED MOTIVATING FACTORS INTERNAL FEATURES OF DATASET
# 11 IMPORTANT METADATA COMPONENT: DATASET NAME IS THIS REALLY IMPORTANT? YES! IMPORTANT FINDINGS FROM A DATASET WILL BE PUBLISHED. WE NEED TO ADOPT A CONVENTION THAT THE DATASET NAME IS A KEYWORD. Name needs to be permanent and consistently used. THEN FUTURE INVESTIGATORS CAN USE STANDARD SEARCH TOOLS TO FIND INFORMATION EXTRACTED FROM EACH DATASET. MUCH LONGER LIVED THAN WEB LINKS
# 12 IMPORTANT METADATA COMPONENT: DATASET NAME { continued } Filtering criteria for data on which publication is based Name of existing named subset Geographic/temporal subset Response subset
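One way to record the filtering criteria behind a publication is a small structured record that always leads with the permanent dataset name. The sketch below is illustrative only: the class `SubsetDescription`, its field names, and the example values are assumptions made for this example, not part of any EMAP or STARMAP convention.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SubsetDescription:
    """Hypothetical record of the data subset behind one publication."""
    dataset_name: str                        # permanent name, cited as a keyword
    named_subset: Optional[str] = None       # existing named subset, if any
    region: Optional[str] = None             # geographic filter
    years: Optional[Tuple[int, int]] = None  # temporal filter (first, last)
    responses: Tuple[str, ...] = ()          # response variables used

    def keyword_line(self) -> str:
        """Build a keyword string that always starts with the dataset name."""
        parts = [self.dataset_name]
        if self.named_subset:
            parts.append(f"subset={self.named_subset}")
        if self.region:
            parts.append(f"region={self.region}")
        if self.years:
            parts.append(f"years={self.years[0]}-{self.years[1]}")
        if self.responses:
            parts.append("responses=" + ",".join(self.responses))
        return "; ".join(parts)


# Example values are invented for illustration.
example = SubsetDescription(
    dataset_name="EXAMPLE-LAKES",
    region="Northeast",
    years=(1991, 1994),
    responses=("total_phosphorus", "chlorophyll_a"),
)
print(example.keyword_line())
```

Placing the dataset name first keeps it searchable with standard tools even when the filter details differ from paper to paper.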
# 13 IMPORTANT METADATA COMPONENT: DEFINITION OF RESPONSES EVALUATED USE IT TO DOCUMENT SITE SELECTION AND LOCATION FIELD PROTOCOLS FOR GATHERING DATA & MATERIAL Peck DV, Lazorchak JM, Klemm DJ, editors EMAP Surface Waters: Western Pilot Study field operations manual for wadeable streams. Corvallis (OR): U.S. Environmental Protection Agency, Office of Research and Development. 275 p. LABORATORY METHODS QUALITY ASSURANCE/QUALITY CONTROL
# 14 IMPORTANT METADATA COMPONENT: MOTIVATING FACTORS WHAT WERE THE STUDY OBJECTIVES? Scale = one page (perhaps a lot more in this context); Specific objectives Narrative on their origin WHY & HOW WERE THE SITES SELECTED? From some population of sites (restrictions) Purposefully Good idea - accessibility of whole study plan
# 15 IMPORTANT METADATA COMPONENT: INTERNAL FEATURES OF DATASET LARGE DATASETS OFTEN CONSIST OF MANY SUB DATA SETS EG: EMAP MAHA DATA COLLECTION CONSISTS OF 42 SAS DATASETS UNIQUE SITE IDENTIFICATION; WITH DATE OF SITE VISIT DATA IS UNIQUELY IDENTIFIED. Why was this subset of the data constructed? Who knows more about it? Which responses are in which data sets? Be careful that values are the same in each data set
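Two of the points above can be checked mechanically: that site identifier plus visit date uniquely identifies a record, and that a response stored in more than one sub dataset has the same value in each. The following is a minimal sketch under stated assumptions: the column names `site_id`, `visit_date`, and `ph`, the table names, and all values are hypothetical, and the rows are plain dictionaries rather than SAS datasets.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, str]  # (site_id, visit_date); both column names are assumed


def duplicate_keys(rows: Iterable[dict]) -> List[Key]:
    """Return (site_id, visit_date) pairs that occur more than once."""
    counts = Counter((r["site_id"], r["visit_date"]) for r in rows)
    return sorted(k for k, n in counts.items() if n > 1)


def conflicting_values(a: Iterable[dict], b: Iterable[dict], variable: str) -> List[Key]:
    """Return keys where both tables hold `variable` but disagree on its value."""
    index_a: Dict[Key, object] = {
        (r["site_id"], r["visit_date"]): r[variable] for r in a if variable in r
    }
    conflicts = set()
    for r in b:
        key = (r["site_id"], r["visit_date"])
        if variable in r and key in index_a and index_a[key] != r[variable]:
            conflicts.add(key)
    return sorted(conflicts)


# Invented rows standing in for two sub datasets.
chem = [
    {"site_id": "S001", "visit_date": "1994-07-12", "ph": 7.1},
    {"site_id": "S002", "visit_date": "1994-07-13", "ph": 6.8},
]
habitat = [
    {"site_id": "S001", "visit_date": "1994-07-12", "ph": 7.1},
    {"site_id": "S002", "visit_date": "1994-07-13", "ph": 6.5},
    {"site_id": "S002", "visit_date": "1994-07-13", "ph": 6.5},
]
print(duplicate_keys(habitat))
print(conflicting_values(chem, habitat, "ph"))
```

Exact equality is used for the comparison; for measured values stored at different precisions a tolerance would likely be more appropriate.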
# 16 IMPORTANT METADATA COMPONENT: INTERNAL FEATURES OF DATASET (continued) Data dictionary Usable paths to definition of variables METHODS USED TO DEAL WITH NONDETECTS, MISSING OR LOST DATA, ETC
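A data dictionary entry can carry the definition of a variable together with the rules used for nondetects and missing data, so a future analyst does not have to guess them. The sketch below is an assumption-laden illustration: the class `VariableEntry`, its fields, the variable name, the detection limit, and the nondetect rule are all invented and do not describe how any EMAP dataset is actually coded.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VariableEntry:
    """Hypothetical data dictionary entry for one response variable."""
    name: str
    definition: str
    units: str
    detection_limit: Optional[float]  # None if not applicable
    nondetect_rule: str               # how values below the limit are stored
    missing_code: Optional[str]       # code used for missing or lost data


# Invented entry for illustration only.
DATA_DICTIONARY = {
    "total_phosphorus": VariableEntry(
        name="total_phosphorus",
        definition="Total phosphorus in a surface water sample",
        units="ug/L",
        detection_limit=1.0,
        nondetect_rule="stored as half the detection limit and flagged 'ND'",
        missing_code="NA",
    ),
}


def describe(variable: str) -> str:
    """Return a one-line description, or fail loudly for undocumented variables."""
    if variable not in DATA_DICTIONARY:
        raise KeyError(f"{variable!r} has no data dictionary entry")
    e = DATA_DICTIONARY[variable]
    return f"{e.name} [{e.units}]: {e.definition}; nondetects {e.nondetect_rule}"


print(describe("total_phosphorus"))
```

Raising an error for an undocumented variable is a deliberate choice here: it makes a gap in the dictionary visible instead of letting an analysis proceed on an undefined column.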
# 17 THANK YOU FOR YOUR ATTENTION Acknowledgement: Nancy Chaffin, Metadata Librarian, Morgan Library, Colorado State University QUESTIONS and/or COMMENTS ARE WELCOME