The Complicated Provenance of American Community Survey Data: How Far will PROV and DDI Take Us?

Presentation transcript:

The Complicated Provenance of American Community Survey Data: How Far will PROV and DDI Take Us?
William C. Block (1), Warren Brown (1), Jeremy Williams (1), Lars Vilhuber (2), and Carl Lagoze (3)
(1) Cornell Institute for Social and Economic Research (CISER), Cornell University
(2) Labor Dynamics Institute (LDI), Cornell University
(3) School of Information, University of Michigan
Presentation at the 2nd Annual North American DDI User Conference (NADDI14), Vancouver, British Columbia, Canada, 2 April 2014

Outline
Answering the question: How far will PROV and DDI take us? We don't know; it's a complicated story!
Background/Previous Work
Use Case(s) involving the ANCESTRY Variable in ACS
Technical Solutions at the File (Dataset) and Variable Level
Future Work
Questions and Discussion

CED²AR is one part of this project.
Funded by NSF Grant # #
For more information, see NSF-Census Research Network (NCRN) – Cornell Node ("Integrated Research Support, Training and Documentation")

Part of NCRN Research Network

CED²AR: Comprehensive Extensible Data Documentation and Access Repository
Method for solving the data curation problem that confronts the custodians of restricted-access research data and the scientific users of such data
Accommodates physical security and access limitation protocols, and allows for much improved provenance tracking
Metadata repository system that allows researchers to search, browse, access, and cite confidential data and metadata (via a web-based user interface or programmatically through a search API)

NCRN DDI Solution at the Variable Level: Proposed a Solution at EDDI12 in Bergen

Variable Level Solution (continued)

No DDI Solution at the Level of a Value Label
A small tweak to the DDI Codebook schema would fix this.

Developments since EDDI12
In Lagoze, Block et al. (2013) we more completely described the solution for embedding field-specific and value-specific cloaking in DDI metadata.*
Proposed a formal change to DDI 2.5 (April 2013)
Brought the modified "DDI 2.5.NCRN" schema online for testing (Fall 2013)
Look forward to the DDI Technical Implementation Committee taking up our proposal
* Lagoze, C., Block, W., Williams, J., Abowd, J. M., & Vilhuber, L. (2013). Data Management of Confidential Data. In International Data Curation Conference. Amsterdam.

Select Cornell NCRN Publications
Forthcoming. Lagoze, Carl, Lars Vilhuber, Jeremy Williams, Benjamin Perry, and William C. Block. "CED²AR: The Comprehensive Extensible Data Documentation and Access Repository." In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), London UK, September
Lagoze, Carl, with William C. Block, Jeremy Williams, John M. Abowd, and Lars Vilhuber. "Data Management of Confidential Data." In: International Journal of Digital Curation 8.1, pp DOI: /ijdc.v8il
Abowd, John M., Lars Vilhuber, and William C. Block. "A Proposed Solution to the Archiving and Curation of Confidential Scientific Inputs." In: Privacy in Statistical Databases. Ed. by Josep Domingo-Ferrer and Ilenia Tinnirello. Vol Lecture Notes in Computer Science. Springer, pp DOI: / _17

Provenance
"data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources" [...] "from it, one can ascertain the quality of the data based on its ancestral data and derivations, track back sources of errors, allow automated reenactment of derivations to update the data, and provide attribution of data sources"*
* Simmhan, Plale, and Gannon, "A survey of data provenance in e-science," ACM SIGMOD Record, 2005

Provenance and Metadata
Not (currently) a "native" component of DDI; closest thing is:
Downside: No structure. Mostly verbose entries.

2013 Work with PROV
Explored encoding PROV in RDF/XML* (required use of the CDATA tag to avoid interfering with schema compliance; deemed less promising)
More recently: exploring the W3C PROV Model as a basis for encoding provenance metadata in DDI
The W3C PROV Model is based upon:
entities that are physical, digital, and conceptual things in the world;
activities that are dynamic aspects of the world that change and create entities;
agents that are responsible for activities; and
a set of relationships that can exist between them that express attribution, delegation, derivation, etc.
* Lagoze, C., Williams, J., & Vilhuber, L. (2013). Encoding Provenance Metadata for Social Science Datasets. In 7th Metadata and Semantics Research Conference. Thessaloniki.
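To make the entity/activity/agent vocabulary concrete, here is a minimal Python sketch of how such records and their relationships could be held in memory. It is an illustration only, not the authors' encoding and not a W3C serialization: the class `ProvRecord`, its method names, and every `ex:` identifier are hypothetical. Only the relation names (`used`, `wasGeneratedBy`, `wasAssociatedWith`, `wasAttributedTo`, `wasDerivedFrom`) come from the PROV model itself.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ProvRecord:
    """Minimal in-memory stand-in for a PROV document (hypothetical helper)."""
    entities: List[str] = field(default_factory=list)    # things, e.g. data files
    activities: List[str] = field(default_factory=list)  # processes that use or create entities
    agents: List[str] = field(default_factory=list)      # parties responsible for activities
    # Each relation is stored as (name, subject, object),
    # e.g. ("wasDerivedFrom", derived_entity, source_entity).
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def relate(self, name: str, subject: str, obj: str) -> None:
        """Record one PROV relation between two identifiers."""
        self.relations.append((name, subject, obj))

    def objects_of(self, name: str, subject: str) -> List[str]:
        """Return every object linked to `subject` by the relation `name`."""
        return [o for (n, s, o) in self.relations if n == name and s == subject]


# Hypothetical identifiers, chosen for illustration only.
doc = ProvRecord(
    entities=["ex:internal-file", "ex:public-use-file"],
    activities=["ex:create-public-use-file"],
    agents=["ex:census-bureau"],
)
doc.relate("used", "ex:create-public-use-file", "ex:internal-file")
doc.relate("wasGeneratedBy", "ex:public-use-file", "ex:create-public-use-file")
doc.relate("wasAssociatedWith", "ex:create-public-use-file", "ex:census-bureau")
doc.relate("wasAttributedTo", "ex:public-use-file", "ex:census-bureau")
doc.relate("wasDerivedFrom", "ex:public-use-file", "ex:internal-file")

print(doc.objects_of("wasDerivedFrom", "ex:public-use-file"))
```

Storing relations as plain triples keeps the sketch close to how PROV statements read, at the cost of any validation that identifiers actually exist.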

The American Community Survey (ACS)
Ongoing statistical survey conducted by the U.S. Census Bureau
Approximately 250,000 surveys/month (3 million per year)
Replacement for the detailed long-form decennial census

ACS Question on Ancestry or Ethnic Origin

Three Use Cases: Researchers interested in people of Alsatian, Andorran, and Cypriot Ancestry
U.S. Census Bureau Documentation: Ancestry Code List, 2012 ACS

Multiple Sources of Data originating from the ACS: Examples of Aggregate Data
2012 ACS 1-year Estimate: 6,626 individuals of Alsatian Ancestry living in the United States

Multiple Sources of Data originating from the ACS: Example of PUMS Microdata
ACS 2012 PUMS: ANCESTRY Code is 001 for Alsatian

Multiple Sources of Data originating from the ACS: Example of IPUMS-USA
IPUMS-USA for ACS 2012: ANCESTRY Code 001 (Alsatian); 75 cases in the sample

Let's review…
Columns: 2012 ACS Code List; ACS 2012 PUMS; IPUMS-USA; AFF; NHGIS
Alsatian: YES (001); YES (001); YES (75 cases); 6,626 (est.)
Andorran:
Cypriots:


Columns: 2012 ACS Code List; ACS 2012 PUMS; IPUMS-USA; AFF; NHGIS
Alsatian: YES (001); YES (001); YES (75 cases); 6,626 (est.)
Andorran: YES (002)
Cypriots: YES (017)


Columns: 2012 ACS Code List; ACS 2012 PUMS; IPUMS-USA; AFF; NHGIS
Alsatian: YES (001); YES (001); YES (75 cases); 6,626 (est.)
Andorran: YES (002)
Cypriots: YES (017); 6,486 (est.); 6,486 (est.)


Columns: 2012 ACS Code List; ACS 2012 PUMS; IPUMS-USA; AFF; NHGIS
Alsatian: YES (001); YES (001); YES (75 cases); 6,626 (est.)
Andorran: YES (002)
Cypriots: YES (017); NO; 6,486 (est.); 6,486 (est.)



Columns: 2012 ACS Code List; ACS 2012 PUMS; IPUMS-USA; AFF; NHGIS
Alsatian: YES (001); YES (001); YES (75 cases); 6,626 (est.)
Andorran: YES (002); NO
Cypriots: YES (017); NO; 6,486 (est.); 6,486 (est.)


Simple Provenance of ACS Data Files
Diagram nodes: ACS Questionnaire; Internal Census File(s); PUMS; IPUMS; Aggregate Tabulations; AFF; NHGIS; RDC
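The diagram's arrows did not survive in this transcript, so the derivation chain can only be sketched under assumptions. The Python snippet below treats the diagram as a small graph and walks it to list everything a given product descends from. The node names come from the slide; every edge in `DERIVED_FROM` is an assumption made for illustration (for example, that IPUMS is built from PUMS and that AFF and NHGIS both draw on aggregate tabulations), and the function name `lineage` is hypothetical.

```python
from collections import deque
from typing import Dict, List

# ASSUMED edges (child -> direct sources). The slide lists only the node names;
# the arrows are not preserved, so this mapping is illustrative, not authoritative.
DERIVED_FROM: Dict[str, List[str]] = {
    "Internal Census File(s)": ["ACS Questionnaire"],
    "PUMS": ["Internal Census File(s)"],
    "IPUMS": ["PUMS"],
    "Aggregate Tabulations": ["Internal Census File(s)"],
    "AFF": ["Aggregate Tabulations"],
    "NHGIS": ["Aggregate Tabulations"],
    "RDC": ["Internal Census File(s)"],
}


def lineage(node: str, graph: Dict[str, List[str]] = DERIVED_FROM) -> List[str]:
    """Return all ancestors of `node`, nearest first, via breadth-first search."""
    seen = set()
    ordered: List[str] = []
    queue = deque(graph.get(node, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        ordered.append(current)
        queue.extend(graph.get(current, []))
    return ordered


for product in ("IPUMS", "NHGIS", "RDC"):
    print(product, "<-", " <- ".join(lineage(product)))
```

Under these assumed edges, a researcher who finds an ancestry code missing in one product can see which upstream files might still carry it, which is the question the use cases raise.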

Columns: 2012 ACS Code List; ACS 2012 PUMS; IPUMS-USA; AFF; NHGIS; RDC
Alsatian: YES (001); YES (001); YES (75 cases); 6,626 (est.); Yes
Andorran: YES (002); NO; ?
Cypriots: YES (017); NO; 6,486 (est.); 6,486 (est.); ?

Provenance of ACS Data Files
Diagram nodes: ACS Questionnaire; Internal Census File(s); PUMS; IPUMS; Aggregate Tabulations; AFF; NHGIS; RDC

PROV at the Dataset/File Level
Mentioned earlier our exploration of encoding PROV in RDF/XML* (CDATA; less promising)
More recently: exploring the W3C PROV Model as a basis for encoding provenance metadata in DDI XML using the relStdy element: "…information on the relationship of the current data collection to others (e.g., predecessors, successors, other waves or rounds or to other editions of the same file). This would include the names of additional data collections generated from the same data collection vehicle plus other collections directed at the same general topic, which can take the form of bibliographic citations."
Others are working to develop an RDF encoding for DDI metadata that could easily accommodate RDF-encoding of provenance metadata
Both solutions might be viable; implementation could depend on local preferences

Embed PROV within relStdy
Step 1 - Material Reference Complex Type with PROV
To include a prov:Document within the relStdy element, a new complex type called 'materialReferenceWithProvType', which inherits from materialReferenceType, can be introduced.
This allows a PROV document to be embedded or referenced by URI.

Embed PROV within relStdy (cont.)
Step 2 - Modify the type of relStdy to the new complex type
The relStdy element is changed to inherit from 'materialReferenceWithProvType', and examples are given.
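The schema listings from these two slides are not preserved in the transcript. As a rough illustration of what the change would permit in an instance document, the Python sketch below builds a relStdy element that carries an embedded PROV document. It makes several assumptions: the DDI Codebook 2.5 namespace `ddi:codebook:2_5`, PROV-XML element names (`prov:document`, `prov:entity`, `prov:wasDerivedFrom`), and hypothetical `ex:` identifiers. The helper names `q` and `build_rel_stdy` are invented here, and the output is not validated against any schema, including the proposed "DDI 2.5.NCRN" one.

```python
import xml.etree.ElementTree as ET

DDI_NS = "ddi:codebook:2_5"            # DDI Codebook 2.5 namespace (assumed here)
PROV_NS = "http://www.w3.org/ns/prov#"  # W3C PROV namespace
ET.register_namespace("ddi", DDI_NS)
ET.register_namespace("prov", PROV_NS)


def q(ns: str, tag: str) -> str:
    """Build an ElementTree qualified name of the form {namespace}tag."""
    return f"{{{ns}}}{tag}"


def build_rel_stdy(derived_id: str, source_id: str) -> ET.Element:
    """Return a relStdy element embedding a small PROV document (sketch only)."""
    rel = ET.Element(q(DDI_NS, "relStdy"))
    doc = ET.SubElement(rel, q(PROV_NS, "document"))
    # Declare both data products as PROV entities.
    for entity_id in (source_id, derived_id):
        ET.SubElement(doc, q(PROV_NS, "entity"), {q(PROV_NS, "id"): entity_id})
    # State that the derived product was derived from the source product.
    derivation = ET.SubElement(doc, q(PROV_NS, "wasDerivedFrom"))
    ET.SubElement(derivation, q(PROV_NS, "generatedEntity"), {q(PROV_NS, "ref"): derived_id})
    ET.SubElement(derivation, q(PROV_NS, "usedEntity"), {q(PROV_NS, "ref"): source_id})
    return rel


# Hypothetical identifiers; a real document would also declare the "ex" prefix.
rel_stdy = build_rel_stdy("ex:ipums-usa-acs-2012", "ex:acs-2012-pums")
print(ET.tostring(rel_stdy, encoding="unicode"))
```

Referencing an external PROV document by URI, the other option the slide mentions, would replace the embedded `prov:document` with a pointer; that variant is not sketched here because the attribute used for it is not given in the transcript.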


Variable Level Provenance
A single attribute is added to the variable type.

Variable Level Provenance (cont.)
In our implementation, a variable would reference a prov:Bundle that would be found within the embedded prov:Document. Here is an example of a prov:Bundle:
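The slide's own bundle example is not preserved in this transcript. As a stand-in, the Python sketch below shows one way a variable could point at a bundle and how that pointer might be resolved. Several things are assumptions: the attribute name `provRef` (the transcript does not name the added attribute), the PROV-XML element `prov:bundleContent` used to represent a prov:Bundle, the `ex:` identifiers, and the pared-down codebook, which omits elements a valid DDI file would require.

```python
import xml.etree.ElementTree as ET
from typing import Optional

DDI_NS = "ddi:codebook:2_5"
PROV_NS = "http://www.w3.org/ns/prov#"
PROV_REF_ATTR = "provRef"  # HYPOTHETICAL attribute name; not given in the source

# Pared-down, non-validating sample: one bundle inside relStdy, one variable.
SAMPLE = """
<codeBook xmlns="ddi:codebook:2_5"
          xmlns:prov="http://www.w3.org/ns/prov#"
          xmlns:ex="http://example.org/">
  <stdyDscr><othrStdyMat><relStdy>
    <prov:document>
      <prov:bundleContent prov:id="ex:ancestry-bundle">
        <prov:entity prov:id="ex:internal-ancestry"/>
        <prov:entity prov:id="ex:pums-ancestry"/>
        <prov:wasDerivedFrom>
          <prov:generatedEntity prov:ref="ex:pums-ancestry"/>
          <prov:usedEntity prov:ref="ex:internal-ancestry"/>
        </prov:wasDerivedFrom>
      </prov:bundleContent>
    </prov:document>
  </relStdy></othrStdyMat></stdyDscr>
  <dataDscr>
    <var name="ANCESTRY" provRef="ex:ancestry-bundle"/>
  </dataDscr>
</codeBook>
"""


def q(ns: str, tag: str) -> str:
    """Build an ElementTree qualified name of the form {namespace}tag."""
    return f"{{{ns}}}{tag}"


def resolve_bundle(root: ET.Element, var_name: str) -> Optional[ET.Element]:
    """Find the bundle that the named variable references, or None if absent."""
    var = next((v for v in root.iter(q(DDI_NS, "var")) if v.get("name") == var_name), None)
    if var is None or var.get(PROV_REF_ATTR) is None:
        return None
    wanted = var.get(PROV_REF_ATTR)
    for bundle in root.iter(q(PROV_NS, "bundleContent")):
        if bundle.get(q(PROV_NS, "id")) == wanted:
            return bundle
    return None


codebook = ET.fromstring(SAMPLE.strip())
bundle = resolve_bundle(codebook, "ANCESTRY")
print([e.get(q(PROV_NS, "id")) for e in bundle.findall(q(PROV_NS, "entity"))])
```

Keeping the provenance statements in one embedded document and having variables refer to bundles by identifier means several variables with the same history can share a single bundle.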



DDI Presentations at NCRN Spring Meeting
Data Documentation Initiative (DDI) Metadata within the Federal Statistical System: Implementation Challenges, Record Linkage, and Provenance Encoding
Jay Greenfield and Sophia Kuan (Booz Allen Hamilton): Describing Adaptive/Responsive Protocols and Data Processing in DDI
Tim Mulcahy (NORC): DDI and Record Linkage
Jeremy Williams (Cornell University): Encoding Provenance Metadata for Social Science Datasets
DDI Tools Demonstration: Abdul K. Rahim and Pascal Heus (Metadata Technologies North America)

Thank you! Questions?
ncrn.cornell.edu