

William Block, Co-PI
Warren Brown & Stefan Kramer, Senior Scientists
Florio Arguillas & Jeremy Williams, Project Staff
Cornell Institute for Social and Economic Research (CISER)
The Cornell NSF-Census Research Node: Integrated Research Support, Training and Data
3rd Annual European DDI Users Group Meeting (EDDI11), Dec. 6, 2011

Overview
- Background on NCRN and the RDC Environment
- The Problem
- Proposed Solution
- Early Work

NSF : NSF-Census Research Network (NCRN) Program Solicitation
“The NSF-Census Research Network will provide support for a set of research nodes, each of which will be staffed by a team of scientists conducting interdisciplinary research and educational activities on methodological questions of interest and significance to the broader research community and to the Federal Statistical System, particularly the U.S. Census Bureau. The activities will be expected to advance both fundamental and applied knowledge as well as further the training of current and future generations of researchers in research skills of relevance to the measurement of economic units, households, and persons.”
Total funding: $18,500,000
Expected Number of Awards:

NCRN Program Goals
1. Establish a set of complementary research programs that advance the development of innovative methods and models for the collection, analysis, and dissemination of data in the social, behavioral, and economic sciences.
2. Relate fundamental advances in methods development to the problems of the Federal Statistical System, particularly the U.S. Census Bureau.
3. Facilitate the collaborative activities of scientists from across multiple disciplines, including the social, behavioral, and economic sciences, the statistical sciences, and the computer sciences.
4. Foster the development of the next generation of researchers in research skills of relevance to the measurement of economic units, households, and persons.

The RDC Network

Research Opportunities at the Cornell Census Research Data Center (RDC) CISER, 391 Pine Tree Road

Data Available in the RDC
- Economic Data
  - Interview business establishments and firms
  - No public use versions
- Demographic Data
  - Interview individuals and households
  - Complete geography, no income topcoding
- Matched Employer-Employee Data
- Health Data: Partnerships with NCHS and AHRQ

Current situation within Census RDCs
- Tension between confidentiality and user-friendliness
- Lack of consistent documentation at variable level
- Barriers to data discovery and use
- Long term curation precluded
- Scientific replication impossible
- Two examples: Census of Manufactures and American Community Survey

Present data documentation situation for RDC users
What can be learned online about the most widely used RDC dataset, the Census of Manufactures, outside of an RDC:
No information provided here about which variables are contained in this dataset.
Source:

Present data documentation situation for RDC users (cont.)
What can be learned about the most widely used RDC dataset, the Census of Manufactures, inside an RDC:
- One can browse folders and “read me” files, or
- Leave the RDC and use a computer with public Internet access!

Example from American Community Survey (ACS)
- Internal project experience
- Data and basic SAS program file for ACS “zero obs” files
- No internal documentation
- IPUMS documentation (public)

The Cornell NSF-Census Research Node: Integrated Research Support, Training, and Data
The Comprehensive Census Bureau Metadata Repository (CCBMR; co-PI Block)
- DDI-based metadata schema synchronization between public and confidential instances of the repository
- Disclosure Avoidance Review-compliant and friendly
- Similar web interfaces for public (partial) and internal (full-information) documentation
- Create and share a CCBMR toolkit
Other elements of the Cornell NCRN Project
- Integrated Doctoral Instruction on the Information Science of Restricted-access Data Analysis (PI: John Abowd; co-PI: Lars Vilhuber)
- Adding Computational Statistics to the Administrative Record Toolkit (co-PI Ping Li)

Different metadata inside vs. outside RDCs
- Not all of the metadata about studies in RDCs that should be viewable inside an RDC can be made available on the public web
  - Example: certain value ranges, or the very existence, of a variable cannot be made known outside of an RDC
- Maintaining two separate versions of the metadata: bad idea
- Better: complete internal version; derive subset for public

How DDI could help with this challenge
- Develop a metadata schema based on DDI, with modifications and additional fields/elements and/or attributes as warranted to capture and represent the necessary information about RDC datasets
- Information about specific confidentiality considerations and access conditions can already be expressed in the DDI 3.1 specification with elements such as and
- DDI could be extended through the addition of machine-actionable markup describing variables to control what information about them is revealed; an example follows (assuming an XML-based implementation, which is not the only option)

How DDI could help with this challenge (cont.)
RDC Metadata (complete) …:
Derived Public Use Metadata (limited) …: 0 not disclosable
The above example denotes that one extreme value associated with this variable is not releasable, regardless of the public-use nature of other information on the variable. In this example, the maximum value would not be included in the publicly released documentation.
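The idea of keeping one complete internal record and deriving the public one from it can be sketched in a few lines. This is only an illustration under stated assumptions: the element names, the `disclosable` attribute, the variable, and its values are all hypothetical and are not taken from the DDI 3.1 schema or from any Census Bureau dataset.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical internal (complete) record. Element names, the "disclosable"
# attribute, and all values are invented for illustration only.
INTERNAL_XML = """
<Variable name="VAR1">
  <Label>Example variable</Label>
  <Minimum>0</Minimum>
  <Maximum disclosable="false">987654</Maximum>
</Variable>
"""

PLACEHOLDER = "not disclosable"


def derive_public(internal_root):
    """Return a masked copy of the record; the internal record is not modified."""
    public_root = copy.deepcopy(internal_root)
    # Materialize the element list first, because the tree is edited below.
    for elem in list(public_root.iter()):
        # Remove the control attribute itself so it does not appear publicly.
        if elem.attrib.pop("disclosable", "true") == "false":
            elem.text = PLACEHOLDER
            # Drop nested elements so nothing below a masked element leaks.
            for child in list(elem):
                elem.remove(child)
    return public_root


internal = ET.fromstring(INTERNAL_XML)
public = derive_public(internal)
print(ET.tostring(public, encoding="unicode"))
```

Because the public record is always generated from the internal one, there is only a single version to maintain, which is the point made on the preceding slide.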

Metadata platforms being considered for CCBMR
Related to this topic: Representing and Utilizing DDI in Relational Databases
- XML database: BaseX or eXist, or
- Relational database (RDB): SQL Server (MS Windows only) or PostgreSQL or MySQL or Oracle (or possibly others), or
- Hybrid approach of using RDB and storing XML (such as variable-level DDI generated by Stat/Transfer) as objects inside it, or
- Proprietary product: Colectica Repository with SDK (with which we could create a customized web front-end)
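The hybrid option might look roughly like the sketch below: a few searchable columns are extracted into a relational table, and the full XML fragment is stored alongside them. SQLite is used here only as a self-contained stand-in; it is not one of the platforms listed above, and the table layout, element names, and sample record are assumptions made for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# In-memory SQLite database as a stand-in for whichever RDB would be chosen.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE variable (
        dataset TEXT NOT NULL,
        name    TEXT NOT NULL,
        label   TEXT,
        ddi_xml TEXT NOT NULL,      -- full variable-level XML kept as-is
        PRIMARY KEY (dataset, name)
    )
    """
)

# Hypothetical variable-level fragment; the element names are illustrative.
fragment = '<Variable name="VAR1"><Label>Example variable</Label></Variable>'
root = ET.fromstring(fragment)

# Extract searchable fields into columns and store the XML unchanged.
conn.execute(
    "INSERT INTO variable (dataset, name, label, ddi_xml) VALUES (?, ?, ?, ?)",
    ("EXAMPLE", root.get("name"), root.findtext("Label"), fragment),
)

# Search on the relational column, then return the stored XML object.
row = conn.execute(
    "SELECT ddi_xml FROM variable WHERE label LIKE ?", ("%example%",)
).fetchone()
print(row[0])
```

The trade-off in this design is that searching stays simple SQL while the original XML remains available for export, at the cost of keeping the extracted columns in sync with the stored fragment.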

Metadata specifications being considered for CCBMR
Approach: “as much as necessary, as little as possible” metadata for the project; not limited only to DDI
- DDI
- SDMX
- DataCite

From draft functional requirements for CCBMR of Nov. 14, 2011
….
3. Searching and navigating
  3.1. Nested Boolean searching (e.g., (A OR B) AND (X OR Y OR Z))
  3.2. Explicit truncation
  3.3. Case-sensitivity option
  3.4. Date-range searching
  3.5. Choice of which field(s) to search, e.g., title, description/abstract, variable-level elements
  3.6. Browsing by … to-be-determined fields/elements
  3.7. Searching by values from summary statistics in metadata?
4. Metadata export and transformation
  4.1. Between RDC-internal and external version, generate elements/fields and their contents based on the (to be) developed suppression method (page 6 of proposal – “Disclosability” example)
  4.2. In a format that will allow fairly easy creation of syntax/command files for statistical packages – DDI?
5. Metadata import and transformation
  5.1. Ingest DDI 3.x metadata generated by tools such as Stat/Transfer from datasets housed in RDC that are selected for project
  5.2. Transform ingested metadata into schema (to be) developed for project, e.g., rename and/or drop fields/elements
  5.3. Import metadata from/for PUFs, e.g. from IPUMS – method TBD
6. Metadata editing
  6.1. Web-based front end for adding and editing metadata elements
7. Persistent identifiers
  7.1. Use DOIs? Needs further consideration.
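Requirement 4.2 could be approached along the lines of the sketch below, which turns variable-level metadata into Stata labelling commands (`label variable`, `label define`, `label values`). The input structure, the variable names, and the `_lbl` naming convention are assumptions for illustration; the requirements do not specify a target package or format.

```python
def stata_labels(variables):
    """Build Stata labelling commands from a list of variable descriptions.

    Each description is a dict with "name" and "label", plus an optional
    "codes" dict mapping numeric values to category text. Labels containing
    double quotes are not escaped in this sketch.
    """
    lines = []
    for var in variables:
        name = var["name"]
        lines.append(f'label variable {name} "{var["label"]}"')
        codes = var.get("codes")
        if codes:
            pairs = " ".join(
                f'{value} "{text}"' for value, text in sorted(codes.items())
            )
            lines.append(f"label define {name}_lbl {pairs}")
            lines.append(f"label values {name} {name}_lbl")
    return "\n".join(lines)


# Hypothetical metadata, as it might look after being read from a repository.
variables = [
    {"name": "var1", "label": "Example categorical variable",
     "codes": {1: "Yes", 2: "No"}},
    {"name": "var2", "label": "Example numeric variable"},
]
syntax = stata_labels(variables)
print(syntax)
```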

CCBMR: planned features
- Nested Boolean searching, for instance: (Literacy OR Dropout rates) AND (Employee turnover OR Employee retention) AND (Number of employees)
- Harnessing collective knowledge: incorporating social media (“data wikis”)
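The nested Boolean query above can be represented as a small tree and evaluated recursively. The sketch below is a minimal illustration, assuming case-insensitive substring matching against a single text field; the tuple representation and the two sample records are hypothetical, and a production repository would more likely delegate this to a search engine or database.

```python
def matches(query, text):
    """Evaluate a nested Boolean query against one record's text.

    A query is either a search term (str) or a tuple whose first item is
    "AND" or "OR" and whose remaining items are sub-queries.
    """
    if isinstance(query, str):
        return query.lower() in text.lower()
    op, *operands = query
    results = (matches(sub, text) for sub in operands)
    if op == "AND":
        return all(results)
    if op == "OR":
        return any(results)
    raise ValueError(f"unknown operator: {op}")


# (Literacy OR Dropout rates) AND (Employee turnover OR Employee retention)
# AND (Number of employees)
QUERY = (
    "AND",
    ("OR", "Literacy", "Dropout rates"),
    ("OR", "Employee turnover", "Employee retention"),
    "Number of employees",
)

# Hypothetical record descriptions used only to exercise the query.
records = {
    "A": "Literacy levels, employee retention, and number of employees by firm",
    "B": "Dropout rates and employee turnover",
}
hits = [rid for rid, text in records.items() if matches(QUERY, text)]
print(hits)
```

Record A satisfies all three groups; record B is rejected because it never mentions the number of employees.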

Initial CCBMR Datasets
- Longitudinal Business Database (LBD)
- American Community Survey (ACS)
- American Housing Survey (AHS)
- Longitudinal Employer-Household Dynamics (LEHD)

First draft of an enterprise application diagram

Project Collaborators
- Pascal Heus – Metadata Technology
- Jeremy Iverson – Colectica
- Ingo Barkow – German Institute for Educational Research (DIPF)
- David Schiller – Research Data Center (FDZ) of the German Federal Employment Agency (BA) at the Institute for Employment Research (IAB)
- Chuck Humphrey – University of Alberta; Canadian RDC Network

Thank you! Questions? Comments? William Block