George Alter, ICPSR (PI)

Presentation transcript:

George Alter, ICPSR (PI)
Pascal Heus, Metadata Technology North America
Jeremy Iverson, Colectica
Jared Lyle, ICPSR
Ørnulf Risnes, Norwegian Centre for Research Data
Dan Smith, Colectica
NSF Data Infrastructure Building Blocks (DIBBs) (ACI-1640575)

Title: C2Metadata project. Alternative title: Automated Capture and Description of Data Transformations

Abstract: Accurate and complete metadata is essential for data sharing and for interoperability across different data types. However, describing and documenting scientific data has remained a tedious, manual process even when data collection is fully automated. Researchers are often reluctant to share data even with close colleagues, because creating documentation takes so much time. This presentation will describe a project to greatly reduce the cost and increase the completeness of metadata by creating tools to capture data transformations from general purpose statistical analysis packages. Researchers in many fields use the main statistics packages (SPSS®, SAS®, Stata®, R) for data management as well as analysis, but these packages lack tools for documenting variable transformations in the manner of a workflow system or even a database. At best, the operations performed by the statistical package are described in a script, which more often than not is unavailable to future data users. Our project is developing new tools that will work with common statistical packages to automate the capture of metadata at the granularity of individual data transformations. Software-independent data transformation descriptions will be added to metadata in two internationally accepted standards, the Data Documentation Initiative (DDI) and Ecological Metadata Language (EML). These tools will create efficiencies and reduce the costs of data collection, preparation, and re-use. Our project targets research communities with strong metadata standards and heavy reliance on statistical analysis software (social and behavioral sciences and earth observation sciences), but it is generalizable to other domains, such as biomedical research.

EDDI URL: http://www.eddi-conferences.eu/ocs/index.php/eddi/eddi17/paper/view/316

Our web site: http://c2metadata.org/
Partners include the American National Election Studies, Colectica, ICPSR, MTNA, NORC, the Norwegian Centre for Research Data, and the University of Michigan. The C2Metadata project is supported by the Data Infrastructure Building Blocks (DIBBs) program of the United States National Science Foundation through grant NSF ACI-1640575.

The Problem

Most data are born digital (the program is the metadata). Metadata can often be extracted directly from the data creation process. (We already have tools to convert computer-assisted interviewing (CAI) instruments to machine-readable metadata, including MQDS and Colectica.) But when a project modifies the data, the data no longer match the metadata. The transformations are then recorded manually and incompletely, if at all.

What’s missing? Statistics packages have limited metadata: no question text, no interview flow (question order, skip pattern), and no variable provenance. Data transformations are not documented.

The Solution

We are automating the capture of transformation metadata. We will build the missing links with the script parser and XML updater.

Why Metadata? Data are useless without metadata. Metadata should include all information about data creation, describe transformations to variables, and be easy to create. Our goal: automated capture of metadata.

Benefits of automated metadata capture
Metadata will be better: all the original information can be included, and variable transformations can be described.
Automation will lower costs: metadata will not be discarded and re-created.
All metadata will be standardized and machine readable.
Codebooks with rich information can be rendered at will.
If we make it easy and beneficial, researchers will use it.

Some Details

Project High Level Tasks
All: Collect scripts. Agree on VTL representations. Standards for incorporating VTL into DDI and EML.
ICPSR: Test data and scripts. Testing created DDI metadata. Testing created EML metadata. Active DDI codebook with VTL.
NSD: Stata script parser. SAS script parser.
Colectica: SPSS script parser. R data transformation package.
MTNA: VTL to DDI metadata updater. VTL to EML metadata updater. DDI comparison and validation tool.

Target Data Transformations for Year 1
Unconditional assignment (SPSS: COMPUTE; Stata: generate): arithmetic operations, list of mathematical functions (exp, ln, log, …)
Conditional assignment (SPSS: IF; Stata: replace ... if): logical expressions
Recode (SPSS: RECODE; Stata: recode)

Equivalent commands by package (no R entry survives for the last two rows):
Assignment. SPSS: COMPUTE; Stata: generate, replace; SAS: [assignment]; R: [assignment]; VTL: :=
Conditional. SPSS: IF; Stata: replace … if; SAS: IF...THEN/ELSE; VTL: if … then … else
Recode. SPSS: RECODE; Stata: recode; SAS: IF...THEN … ELSE … IF … THEN; VTL: if … then … elseif … then

One command in detail: RECODE

What is recode (in Stata)?
recode -- Recode categorical variables
recode var1 var2 (1 2 = 2) (3/max = 5)
recode a b (1 = 2), prefix("modified_")
recode x y (1 = 2), generate(modified_x modified_y)
recode x y (1 = 2 "Labelfor2") (3 = 5 "Labelfor5")
recode total (0/140=0 F) (141/180=1 D) (181/210=2 C) (211/234=3 B) (235/300=4 A), gen(grade)
recode a (1 . .a 5/6 = 7) (nonmissing = 8) (missing = 9) (* = 2)
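To make the script-parser idea concrete, here is a minimal sketch of turning the rule groups of a Stata recode command into structured data. It is not the project's parser: the names `Rule` and `parse_recode_rules` are hypothetical, and the sketch handles only numeric values, `low/high` ranges, and the keywords `min` and `max`. Value labels, missing-value codes, and options such as `generate()` are out of scope.

```python
import re
from typing import List, NamedTuple, Tuple, Union

# A bound is either a number or one of the Stata keywords "min" / "max".
Bound = Union[float, str]


class Rule(NamedTuple):
    """One parenthesized rule group (hypothetical structure, for illustration)."""
    sources: List[Tuple[Bound, Bound]]  # inclusive (low, high) ranges
    target: float


def _bound(token: str) -> Bound:
    # Keep the keywords symbolic; convert everything else to a number.
    return token if token in ("min", "max") else float(token)


def parse_recode_rules(spec: str) -> List[Rule]:
    """Parse rule groups such as '(1 2 = 2) (3/max = 5)'."""
    rules = []
    for group in re.findall(r"\(([^)]*)\)", spec):
        left, right = group.split("=")
        sources = []
        for token in left.split():
            # "3/max" is a range; a bare "1" is treated as the range 1..1.
            low, _, high = token.partition("/")
            sources.append((_bound(low), _bound(high or low)))
        rules.append(Rule(sources, float(right.strip())))
    return rules


if __name__ == "__main__":
    for rule in parse_recode_rules("(1 2 = 2) (3/max = 5)"):
        print(rule)
```

Keeping `min` and `max` symbolic rather than resolving them means the parsed rules describe the command itself, independent of any particular data file.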

What is recode (in SPSS)?
RECODE var1 var2 var3 (-1=7) (-2=8).
RECODE AGE (MISSING=9) (18 THRU HI=1) (0 THRU 18=0) INTO VOTER.
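The AGE example above has overlapping ranges (18 appears in two rules), so the order of evaluation matters. The sketch below assumes rules are scanned left to right and the first match wins, which is our reading of RECODE. It is an illustration in Python, not project code: `thru`, `is_missing`, `recode`, and `VOTER_RULES` are hypothetical names, missing values are simplified to `None` or NaN, and unmatched values simply return `None`.

```python
import math
from typing import Callable, List, Optional, Tuple

# A rule pairs a predicate on the input value with the value to assign.
Rule = Tuple[Callable[[Optional[float]], bool], float]


def is_missing(value: Optional[float]) -> bool:
    # Simplification: treat None and NaN as "missing".
    return value is None or (isinstance(value, float) and math.isnan(value))


def thru(low: float, high: float) -> Callable[[Optional[float]], bool]:
    """Build a predicate for an inclusive range, like 'low THRU high'."""
    return lambda value: not is_missing(value) and low <= value <= high


# RECODE AGE (MISSING=9) (18 THRU HI=1) (0 THRU 18=0) INTO VOTER.
VOTER_RULES: List[Rule] = [
    (is_missing, 9),
    (thru(18, math.inf), 1),
    (thru(0, 18), 0),
]


def recode(value: Optional[float], rules: List[Rule]) -> Optional[float]:
    """Return the target of the first matching rule, or None if none match."""
    for matches, new_value in rules:
        if matches(value):
            return new_value
    return None


if __name__ == "__main__":
    for age in (None, 17, 18, 40):
        print(age, "->", recode(age, VOTER_RULES))
```

Under the first-match assumption, an AGE of exactly 18 is assigned 1 because the (18 THRU HI=1) rule is listed before (0 THRU 18=0).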

Recode in JSON
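The slide's own JSON is not preserved in this transcript. As a rough illustration of the idea only, the sketch below serializes a software-independent description of the first SPSS RECODE example to JSON. Every field name here (`command`, `source_language`, `source_text`, `variables`, `rules`, `from`, `to`) is an assumption made for this example and is not taken from the SDTL schema.

```python
import json

# Hypothetical, simplified description of:
#   RECODE var1 var2 var3 (-1=7) (-2=8).
# Field names are illustrative assumptions, not the SDTL schema.
recode_description = {
    "command": "recode",
    "source_language": "SPSS",
    "source_text": "RECODE var1 var2 var3 (-1=7) (-2=8).",
    "variables": ["var1", "var2", "var3"],
    "rules": [
        {"from": [-1], "to": 7},
        {"from": [-2], "to": 8},
    ],
}


def to_json(description: dict) -> str:
    """Serialize a transformation description as indented JSON text."""
    return json.dumps(description, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(to_json(recode_description))
```

Keeping the original command text next to the structured rules lets a reader of the metadata check the machine-readable description against what the script actually said.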

SDTL Development
Still in active development: http://c2metadata.gitlab.io/sdtl-docs/
Using COGS, an open source production framework, to generate JSON schema, XML schema, and rich documentation.

Apps to Detect Transforms

Stata Parser Tool: Online. A functional recode parser is available at http://ekstern.nsd.no/metacap/stata2vtl

SPSS: Command line

SDTL Reader: Desktop application for Windows, macOS, and Linux. Opens SPSS syntax files (*.sps) and SDTL JSON files.

DDI Metadata Updater. This illustrates the process of: (1) using a script parser to generate C2-VTL; (2) reading the C2-VTL to create a DDI file matching the updated data file (using the input DDI from the source file); (3) using the comparison tool to analyze and document the differences; (4) for validation purposes, using the tool to compare the generated DDI with the actual DDI produced from the new data file (e.g. with our SledgeHammer utility).
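The comparison in steps (3) and (4) can be pictured with a small sketch. This is not the project's comparison tool and does not read DDI XML: it assumes variable-level metadata has already been loaded into plain dictionaries keyed by variable name, and `compare_variables` is a hypothetical helper.

```python
from typing import Any, Dict, List

# Variable-level metadata as plain dictionaries (an assumption for this
# sketch; real DDI is XML and far richer).
Metadata = Dict[str, Dict[str, Any]]


def compare_variables(before: Metadata, after: Metadata) -> Dict[str, List[str]]:
    """Report which variables were added, removed, or changed."""
    shared = before.keys() & after.keys()
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(name for name in shared if before[name] != after[name]),
    }


if __name__ == "__main__":
    original = {
        "AGE": {"label": "Age in years"},
        "var1": {"label": "Item 1", "codes": [-2, -1, 1, 2]},
    }
    updated = {
        "AGE": {"label": "Age in years"},
        "var1": {"label": "Item 1", "codes": [1, 2, 7, 8]},
        "VOTER": {"label": "Derived from AGE", "codes": [0, 1, 9]},
    }
    print(compare_variables(original, updated))
```

For validation, the same comparison could be run between the generated metadata and metadata produced directly from the new data file; an empty report in all three categories would indicate that the two agree.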

Future

Continuing Work: Describe more types of transformations. Detect those transformations in SPSS, Stata, and SAS. R library for transforming data.

Follow Along: Project website at c2metadata.org. Source code managed at gitlab.com/c2metadata. Public proof-of-concept app in March.

Thanks! http://c2metadata.org/

Following are slides not used in the main presentation...

Metadata for the American National Election Study: What question was asked? How was the question coded? Who answered this question?

We already have tools to convert CAI to machine-readable metadata. [Diagram: Computer Assisted Interviewing (CAI) produces the original data. CAI-to-DDI conversion tools (Colectica, MQDS, others) produce the original metadata as DDI XML.]

What happens when a project modifies the data. [Diagram: Computer Assisted Interviewing (CAI) produces the original data, and CAI-to-DDI conversion tools (Colectica, MQDS, others) produce the original metadata as DDI XML. Command scripts in statistical packages (SPSS, SAS, Stata, R) turn the original data into revised data.] The modified data no longer match the metadata.

Metadata are re-created after the data are transformed. [Diagram: Computer Assisted Interviewing (CAI) produces the original data, and CAI-to-DDI conversion tools (Colectica, MQDS, others) produce the original metadata as DDI XML. Command scripts in statistical packages (SPSS, SAS, Stata, R) turn the original data into revised data.] Transformations are documented by hand.

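Automating the capture of transformation metadata. [Diagram: Computer Assisted Interviewing (CAI) produces the original data, and CAI-to-DDI conversion tools (Colectica, MQDS, others) produce the original metadata as DDI XML. Command scripts in statistical packages (SPSS, SAS, Stata, R) turn the original data into revised data. A script parser converts the command scripts into Standard Data Transformation Language (SDTL), and an XML updater applies the SDTL to the original DDI XML to produce the revised metadata as DDI XML.] The script parser and the XML updater are the missing links that we will build.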