Integrating PROV with DDI: Mechanisms of Data Discovery within the U.S. Census Bureau


Integrating PROV with DDI: Mechanisms of Data Discovery within the U.S. Census Bureau
William C. Block,1 Warren Brown,1 Jeremy Williams,1 Lars Vilhuber,2 and Carl Lagoze3
1 Cornell Institute for Social and Economic Research (CISER), Cornell University
2 Labor Dynamics Institute (LDI), Cornell University
3 School of Information, University of Michigan
Presentation at the IASSIST 2014 Meeting, Toronto, Canada

Outline
Background and Previous Work
Use Case involving ANCESTRY Variable in ACS
Technical solutions at Dataset and Variable Level
Future Work
Questions

NSF-Census Research Network (NCRN) – Cornell Node (“Integrated Research Support, Training and Documentation”)
CED2AR is one part of this project.
Funded by NSF Grant #1131848.
For more information, see www.ncrn.cornell.edu.

CED2AR: Comprehensive Extensible Data Documentation and Access Repository
A method for solving the data curation problem that confronts the custodians of restricted-access research data and the scientific users of such data
Accommodates physical security and access limitation protocols, and allows for much improved provenance tracking
A metadata repository system that allows researchers to search, browse, access, and cite confidential data and metadata (via a web-based user interface or programmatically through a search API)

Select Cornell NCRN Publications
Forthcoming: Lagoze, Carl, Lars Vilhuber, Jeremy Williams, Benjamin Perry, and William C. Block. “CED2AR: The Comprehensive Extensible Data Documentation and Access Repository.” In: Proceedings of the ACM/IEEE Joint Conference on Digital Libraries (JCDL), London, UK, September 2014.
2013: Lagoze, Carl, with William C. Block, Jeremy Williams, John M. Abowd, and Lars Vilhuber. “Data Management of Confidential Data.” In: International Journal of Digital Curation 8.1, pp. 265-278. DOI: 10.2218/ijdc.v8i1.259
2012: Abowd, John M., Lars Vilhuber, and William C. Block. “A Proposed Solution to the Archiving and Curation of Confidential Scientific Inputs.” In: Privacy in Statistical Databases. Ed. by Josep Domingo-Ferrer and Ilenia Tinnirello. Vol. 7556. Lecture Notes in Computer Science. Springer, pp. 216-225. DOI: 10.1007/978-3-642-33627-0_17

Provenance
“data provenance, one kind of metadata, pertains to the derivation history of a data product starting from its original sources” [...] “from it, one can ascertain the quality of the data based on its ancestral data and derivations, track back sources of errors, allow automated reenactment of derivations to update the data, and provide attribution of data sources”*
*Simmhan, Plale, and Gannon, “A survey of data provenance in e-science,” ACM SIGMOD Record, 2005

The American Community Survey (ACS)
Ongoing statistical survey conducted by the U.S. Census Bureau
Approximately 250,000 surveys/month (3 million per year)
Replacement for the detailed long-form decennial census

ACS Question on Ancestry or Ethnic Origin

Three Use Cases: Researchers interested in people of Alsatian, Andorran, and Cypriot Ancestry U.S. Census Bureau Documentation Ancestry Code List 2012 ACS

Multiple Sources of Data originating from the ACS: Examples of Aggregate Data 2012 ACS 1-year Estimate: 6,626 individuals of Alsatian Ancestry living in the United States

Multiple Sources of Data originating from the ACS: Example of PUMS Microdata ACS 2012 PUMS: ANCESTRY Code is 001 for Alsatian

Multiple Sources of Data originating from the ACS: Example of IPUMS-USA IPUMS-USA for ACS 2012: 001 Alsatian ANCESTRY Code 75 cases in the sample

Let’s review… 2012 ACS Code List ACS 2012 PUMS IPUMS-USA AFF NHGIS Alsatian YES (001) YES (75 cases) 6,626 (est.) Andorran Cypriots

Three Use Cases: Researchers interested in people of Alsatian, Andorran, and Cypriot Ancestry U.S. Census Bureau Documentation Ancestry Code List 2012 ACS

2012 ACS Code List ACS 2012 PUMS IPUMS-USA AFF NHGIS Alsatian YES (001) YES (75 cases) 6,626 (est.) Andorran (002) Cypriots (017)

2012 ACS Code List ACS 2012 PUMS IPUMS-USA AFF NHGIS Alsatian YES (001) YES (75 cases) 6,626 (est.) Andorran (002) Cypriots (017) 6,486 (est.)

2012 ACS Code List ACS 2012 PUMS IPUMS-USA AFF NHGIS Alsatian YES (001) YES (75 cases) 6,626 (est.) Andorran (002) Cypriots (017) NO 6,486 (est.)

Three Use Cases: Researchers interested in people of Alsatian, Andorran, and Cypriot Ancestry U.S. Census Bureau Documentation Ancestry Code List 2012 ACS

2012 ACS Code List ACS 2012 PUMS IPUMS-USA AFF NHGIS Alsatian YES (001) YES (75 cases) 6,626 (est.) Andorran (002) NO Cypriots (017) 6,486 (est.)

Three Use Cases: Researchers interested in people of Alsatian, Andorran, and Cypriot Ancestry U.S. Census Bureau Documentation Ancestry Code List 2012 ACS

Simple Provenance of ACS Data Files
Diagram nodes: ACS Questionnaire; Internal Census File(s); PUMS; IPUMS; Aggregate Tabulations; AFF; NHGIS; RDC
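The derivation chain named on this slide can be sketched as a small graph. The arrows of the original diagram are not preserved in this transcript, so the edges below are an assumption based on how these products are generally described (IPUMS built from PUMS; AFF and NHGIS serving aggregate tabulations; RDC access to internal files). The names `DERIVED_FROM`, `lineage`, and `common_ancestors` are hypothetical helpers, not part of any Census Bureau or CED2AR software.

```python
# Assumed derivation edges: derived product -> products it was produced from.
# These edges are an illustration, not a statement of official Census lineage.
DERIVED_FROM = {
    "Internal Census File(s)": ["ACS Questionnaire"],
    "PUMS": ["Internal Census File(s)"],
    "IPUMS": ["PUMS"],
    "Aggregate Tabulations": ["Internal Census File(s)"],
    "AFF": ["Aggregate Tabulations"],
    "NHGIS": ["Aggregate Tabulations"],
    "RDC": ["Internal Census File(s)"],
}


def lineage(product, derived_from=DERIVED_FROM):
    """Return every upstream source of `product`, nearest source first."""
    seen = []
    queue = list(derived_from.get(product, []))
    while queue:
        source = queue.pop(0)
        if source not in seen:
            seen.append(source)
            queue.extend(derived_from.get(source, []))
    return seen


def common_ancestors(a, b, derived_from=DERIVED_FROM):
    """Return the sources shared by two products, in `a`'s lineage order."""
    upstream_of_b = set(lineage(b, derived_from))
    return [s for s in lineage(a, derived_from) if s in upstream_of_b]


if __name__ == "__main__":
    print(lineage("IPUMS"))
    print(common_ancestors("IPUMS", "NHGIS"))
```

Under these assumed edges, a researcher comparing an IPUMS case count with an NHGIS estimate could see that both trace back to the same internal files, which is the kind of question variable-level provenance is meant to answer.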

2012 ACS Code List ACS 2012 PUMS IPUMS-USA AFF NHGIS RDC Alsatian YES (001) YES (75 cases) 6,626 (est.) Yes Andorran (002) NO ? Cypriots (017) 6,486 (est.)

Provenance of ACS Data Files
Diagram nodes: ACS Questionnaire; Internal Census File(s); PUMS; IPUMS; Aggregate Tabulations; AFF; NHGIS; RDC

Variable Level Provenance (cont.)

Variable Level Provenance (cont.)

Provenance and Metadata
Not (currently) a “native” component of DDI; the closest thing in Codebook is:

    <xs:complexType name="othrStdyMatType">
      <xs:complexContent>
        <xs:extension base="baseElementType">
          <xs:sequence>
            <xs:element ref="relMat" minOccurs="0" maxOccurs="unbounded"/>
            <xs:element ref="relStdy" minOccurs="0" maxOccurs="unbounded"/>
            <xs:element ref="relPubl" minOccurs="0" maxOccurs="unbounded"/>
            <xs:element ref="othRefs" minOccurs="0" maxOccurs="unbounded"/>
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>

Downside: No structure. Mostly verbose entries.

2013 work with PROV
Explored encoding PROV in RDF/XML* (required use of a CDATA section to avoid interfering with schema compliance; deemed less promising)
More recently: exploring the W3C PROV Model as a basis for encoding provenance metadata in DDI
*Lagoze, C., Williams, J., & Vilhuber, L. (2013). Encoding Provenance Metadata for Social Science Datasets. In 7th Metadata and Semantics Research Conference. Thessaloniki.

PROV Integration with DDI-C
Intro to PROV
Our approach to integration
Challenges and future work

PROV 101
Entities - physical, digital, conceptual, or other kinds of things
Activities - how entities come into existence and how their attributes change
Usage and Generation - activities generate new entities, often using other entities to do so
Agents - take a role in an activity such that they can be assigned some degree of responsibility
Derivation and Revision - when one entity's existence, content, characteristics and so on are at least partly due to another entity
Available as an ontology, an XML schema, and a generic data model
http://www.w3.org/TR/2013/NOTE-prov-primer-20130430/
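The concepts above can be illustrated with a minimal sketch that records entities, activities, agents, and the relations between them, then prints them in a PROV-N-style notation. `ProvSketch` is a hypothetical stand-in written for this example, not a W3C or CED2AR library, and the `ex:` identifiers (`ex:internalFile`, `ex:tabulate`, and so on) are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ProvSketch:
    """Tiny in-memory stand-in for a PROV document (hypothetical helper)."""
    entities: set = field(default_factory=set)
    activities: set = field(default_factory=set)
    agents: set = field(default_factory=set)
    relations: list = field(default_factory=list)  # (relation, subject, object)

    def used(self, activity, entity):
        # Usage: an activity consumed an existing entity.
        self.activities.add(activity)
        self.entities.add(entity)
        self.relations.append(("used", activity, entity))

    def was_generated_by(self, entity, activity):
        # Generation: an activity produced a new entity.
        self.entities.add(entity)
        self.activities.add(activity)
        self.relations.append(("wasGeneratedBy", entity, activity))

    def was_associated_with(self, activity, agent):
        # Association: an agent bears responsibility for an activity.
        self.activities.add(activity)
        self.agents.add(agent)
        self.relations.append(("wasAssociatedWith", activity, agent))

    def was_derived_from(self, derived, source):
        # Derivation: one entity is at least partly due to another.
        self.entities.update({derived, source})
        self.relations.append(("wasDerivedFrom", derived, source))

    def to_provn(self):
        """Serialize to PROV-N-style text (a sketch, not validated output)."""
        lines = ["document", "  prefix ex <http://example.org/acs#>"]
        lines += [f"  entity({e})" for e in sorted(self.entities)]
        lines += [f"  activity({a})" for a in sorted(self.activities)]
        lines += [f"  agent({g})" for g in sorted(self.agents)]
        lines += [f"  {name}({s}, {o})" for name, s, o in self.relations]
        lines.append("endDocument")
        return "\n".join(lines)


def build_example():
    """Assumed example: a tabulation step turning an internal file into a table."""
    doc = ProvSketch()
    doc.used("ex:tabulate", "ex:internalFile")
    doc.was_generated_by("ex:aggregateTable", "ex:tabulate")
    doc.was_associated_with("ex:tabulate", "ex:censusBureau")
    doc.was_derived_from("ex:aggregateTable", "ex:internalFile")
    return doc


if __name__ == "__main__":
    print(build_example().to_provn())
```

Keeping relations as plain tuples makes it easy to emit other serializations later; a production system would more likely rely on an existing PROV toolkit and validate against the W3C schema.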

Example of Small PROV Graph http://www.w3.org/TR/2013/NOTE-prov-primer-20130430/

PROV Integration with DDI-C

Embed PROV within <relStdy>
Extend Material Reference Complex Type with PROV
To include a prov:Document within the <relStdy> element, a new complex type called ‘materialReferenceWithProvType’, which inherits from materialReferenceType, can be introduced. This allows a PROV document to be embedded or referenced by URI.

Embed PROV within <relStdy> (cont.)
Step 2 - Modify the type of relStdy to the new complex type
The relStdy element is changed to inherit from ‘materialReferenceWithProvType’, to facilitate the embedding of a prov:Document within the relStdy element.

Provenance of ACS Data Files
Diagram nodes: ACS Questionnaire; Internal Census File(s); PUMS; IPUMS; Aggregate Tabulations; AFF; NHGIS; RDC

Variable Level Provenance
A single attribute, prov:ref, is added to the variable type.
The value of prov:ref must be a valid PROV identifier, but there is no requirement that for every prov:ref a corresponding prov:id must be known to exist.
Uses xsd:QName instead of xsd:ID and IDREF.
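A minimal sketch of how such references might be checked, assuming a codebook in which a prov:document is embedded under relStdy and each var carries a prov:ref attribute. The sample XML, the `ex:` identifiers, and the `resolve_refs` helper are all hypothetical; the element layout is simplified and is not claimed to validate against the modified DDI-C schema. QName values are compared as plain strings, which assumes prefixes are used consistently within the file.

```python
import xml.etree.ElementTree as ET

PROV = "http://www.w3.org/ns/prov#"   # W3C PROV namespace
DDI = "ddi:codebook:2_5"              # DDI Codebook 2.5 namespace

# Hypothetical, simplified codebook used only for this illustration.
CODEBOOK = """\
<codeBook xmlns="ddi:codebook:2_5"
          xmlns:prov="http://www.w3.org/ns/prov#"
          xmlns:ex="http://example.org/acs#">
  <stdyDscr>
    <othrStdyMat>
      <relStdy>
        <prov:document>
          <prov:entity prov:id="ex:ANCESTRY_2012_PUMS"/>
        </prov:document>
      </relStdy>
    </othrStdyMat>
  </stdyDscr>
  <dataDscr>
    <var name="ANCESTRY" prov:ref="ex:ANCESTRY_2012_PUMS"/>
    <var name="AGE" prov:ref="ex:AGE_2012_PUMS"/>
  </dataDscr>
</codeBook>
"""


def resolve_refs(xml_text):
    """Map each variable name to (prov:ref value, whether a matching prov:id exists)."""
    root = ET.fromstring(xml_text)
    id_key = f"{{{PROV}}}id"
    ref_key = f"{{{PROV}}}ref"
    # Collect every prov:id declared anywhere in the document.
    known_ids = {el.get(id_key) for el in root.iter() if el.get(id_key)}
    result = {}
    for var in root.iter(f"{{{DDI}}}var"):
        ref = var.get(ref_key)
        if ref is not None:
            # An unresolved reference is permitted; it is reported, not rejected.
            result[var.get("name")] = (ref, ref in known_ids)
    return result


if __name__ == "__main__":
    for name, (ref, found) in resolve_refs(CODEBOOK).items():
        print(name, ref, "resolved" if found else "unresolved (allowed)")
```

Because xsd:QName does not carry the uniqueness and existence guarantees of xsd:ID and IDREF, a check like this has to happen outside schema validation.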

Variable Level Provenance (cont.)
A prov:Bundle provides a way to wrap a provenance chain and refer to it as an entity.
In our implementation, a variable would reference a prov:Bundle that would be found within the embedded prov:Document.

Variable Level Provenance (cont.)
To establish the relationship between a given variable and the dataset that contains it, a prov:Collection (which is also a prov:Entity) can be used.
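The collection idea can be sketched with PROV's hadMember relation, treating each dataset as a collection and each variable as a member entity. The identifiers and the two helper functions below are hypothetical and exist only to show the lookup in both directions.

```python
# Assumed hadMember(collection, entity) statements; identifiers are invented.
HAD_MEMBER = [
    ("ex:ACS_2012_PUMS", "ex:ANCESTRY_2012_PUMS"),
    ("ex:ACS_2012_PUMS", "ex:AGE_2012_PUMS"),
    ("ex:IPUMS_USA_2012", "ex:ANCESTRY_IPUMS_2012"),
]


def members_of(collection, had_member=HAD_MEMBER):
    """Return the variable entities contained in a dataset collection."""
    return [entity for coll, entity in had_member if coll == collection]


def collections_containing(entity, had_member=HAD_MEMBER):
    """Return the dataset collections that list the given variable entity."""
    return [coll for coll, member in had_member if member == entity]


if __name__ == "__main__":
    print(members_of("ex:ACS_2012_PUMS"))
    print(collections_containing("ex:ANCESTRY_IPUMS_2012"))
```

Giving the PUMS and IPUMS versions of ANCESTRY separate identifiers, as assumed here, is one way to sidestep the problem of a variable appearing in multiple codebooks; a derivation relation between the two entities would then carry the link.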

Challenges and Future Work
To encode provenance at the variable level, each variable must be uniquely identified as its own entity (difficult when variables may exist in multiple codebooks)
Investigating DOIs, ARKs, and URNs as possible solutions for global variable-level identification
Generic activities: defining, in a standard way, the processes by which datasets and variables are derived
Considering feasibility of transitioning from DDI-C to DDI-L
Considering the experimental W3C PROV-Links schema: using the Mention to connect bundles
Finding ways to efficiently generate PROV-encoded metadata
How to deal with low-quality metadata when forming linkages in PROV (in the UI)
http://www.w3.org/TR/prov-links/#term-mention

Thank you! Questions? ncrn.cornell.edu 11/15/2018 Not for further distribution