Social Science Data Management & Curation Jared Lyle January 13, 2014
New Data
Safety Pilot Project 2800 cars, trucks, and buses with Vehicle Awareness Devices, sensors, and video ~ 1 Petabyte of data per year
MET Longitudinal Database Scale –2 academic years –6 large school districts –6 grade levels –3,000 teachers –44,500 students –24,000 videos –22,500 observation sessions –900+ observers trained by ETS to score videos –~12GB of quantitative data –~10TB of video
New Incentives & Discussions
Berman, F., and V. Cerf, Who Will Pay for Public Access to Research Data? Science, (6146): p
“There’s an attitude in the profession that collecting data is for lesser people. That it’s like janitor work; it would dirty our hands. There’s social climbing in academia. So if you write a paper computing an index, that seems low-prestige, so you don’t want to do that. …some of the best theorizing comes after collecting data because then you become aware of another reality.” -Robert Shiller (2013 Nobel Laureate, Economic Science)
Challenges remain
“It saves funding and avoids repeated data collecting efforts, allows the verification and replication of research findings, facilitates scientific openness, deters scientific misconduct, and supports communication and progress.” Niu (2006). “Reward and Punishment Mechanism for Research Data Sharing.”
Vines et al. Current Biology 24, 94–97, January 6,
Griswold et al. (2013)
Pienta, Gutmann, & Lyle (2009). “Research Data in The Social Sciences: How Much is Being Shared?” See also: Pienta, Gutmann, Hoelter, Lyle, & Donakowski (2008). “The LEADS Database at ICPSR: Identifying Important ‘At Risk’ Social Science Data.” See also: Pienta, Alter, & Lyle (2010). “The Enduring Value of Social Science Research: The Use and Reuse of Primary Research Data.”
Data Management & Curation
About ICPSR Founded in 1962 as a consortium of 21 universities to share the National Election Survey Today: 700+ members around the world Data dissemination for more than 20 federal and non-government sponsors 600,000+ visitors per year
Examples of popular data General Social Surveys, [Cumulative File] National Longitudinal Study of Adolescent Health (Add Health), Monitoring the Future: A Continuing Study of American Youth (12th-Grade Survey), 2012 Drug Abuse Warning Network (DAWN), 2011 National Survey on Drug Use and Health, 2012 American National Election Study, Collaborative Psychiatric Epidemiology Surveys (CPES), [United States]
What we do Acquire, curate and archive social science data Distribute data to researchers Preserve data for future generations Provide training in quantitative methods Archive size 8,600+ data collections, over 60,000 data sets Grows by 300+ collections a year
Unique capabilities Curated data with rich metadata Digital preservation Bibliography and citation Confidential data Training Community
Data Management & Curation
Quality
A well-prepared data collection “contains information intended to be complete and self-explanatory” for future users.
Do no harm.
Data
Documentation
Variable-level Details National Longitudinal Study of Adolescent Health (Add Health), Wave I School Administrator Codebook.
Processing History
practice-co-published-with-ads/
Confidentiality
Sharing confidential data Safe data: Modify the data to reduce the risk of re-identification Safe places: Physical isolation and secure technologies Safe people: Training and Data use agreements
Safe Data –Suppressing unique cases –Grouping values (e.g., 13-29=1, 30-49=2) –Top-coding (e.g., >1,000=1,000) –Aggregating geographic areas –Swapping values –Sampling within a larger data collection –Adding “noise” –Replacing real data with synthetic data
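A few of the Safe Data techniques can be sketched in code. The block below is a minimal illustration, not ICPSR's actual procedure: the function names, the sample records, and the noise scale are all hypothetical, and only the two recodes shown on the slide (13-29=1, 30-49=2 and >1,000=1,000) come from the source.

```python
import random
from collections import Counter

def group_age(age):
    """Grouping values: recode exact age into bands (13-29 -> 1, 30-49 -> 2)."""
    if 13 <= age <= 29:
        return 1
    if 30 <= age <= 49:
        return 2
    return None  # outside the two bands shown; a real scheme must cover all values

def top_code(value, cap=1000):
    """Top-coding: any value above the cap is reported as the cap."""
    return min(value, cap)

def suppress_unique(records, keys):
    """Suppressing unique cases: drop records whose key combination occurs once."""
    counts = Counter(tuple(r[k] for k in keys) for r in records)
    return [r for r in records if counts[tuple(r[k] for k in keys)] > 1]

def add_noise(value, scale, rng):
    """Adding noise: perturb a numeric value with zero-mean Gaussian noise."""
    return value + rng.gauss(0, scale)

# Hypothetical records, invented for illustration only.
records = [
    {"age": 17, "income": 400, "county": "A"},
    {"age": 25, "income": 2500, "county": "A"},
    {"age": 41, "income": 900, "county": "B"},
]

# Apply grouping and top-coding to every record.
recoded = [
    {
        "age_group": group_age(r["age"]),
        "income": top_code(r["income"]),
        "county": r["county"],
    }
    for r in records
]

# The single (age_group=2, county="B") record is unique, so it is suppressed.
safe = suppress_unique(recoded, keys=("age_group", "county"))

# A seeded generator keeps the perturbation reproducible for this demo.
noisy_income = add_noise(900, scale=50, rng=random.Random(0))
```

In practice these steps trade analytic detail for lower re-identification risk, so the bands, caps, and key variables would be chosen per study rather than hard-coded as they are here.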
Further Resources: Safe Data Statistical Policy Working Paper 22 - Report on Statistical Disclosure Limitation Methodology The American Statistical Association, Committee on Privacy and Confidentiality - Methods for Reducing Disclosure Risks When Sharing Data ICPSR's Confidentiality and Privacy web page confidentiality/
Safe Places –Secure Deposit Form –Secure Data Environment (SDE) –Data protection plans –Virtual data enclave –Physical enclave
The Virtual Data Enclave (VDE) provides remote access to quantitative data in a secure environment.
Further Resources: Safe Places ICPSR “Instructions for Preparing the Data Protection Plan” all.pdf “Introducing ICPSR’s Virtual Data Enclave (VDE)” icpsrs-virtual-data-enclave.html ICPSR Physical Data Enclave access/restricted/enclave.html
Safe People Staff training Data use agreements –Responsible Use Statement –Research plan –IRB approval –Data protection plan –Behavior rules –Security pledge –Institutional signature
Further Resources: Safe People Example NAHDAP Restricted Data Use Agreement NAHDAP “Restricted-Use Data Deposit and Dissemination Procedures” DAP-RestrictedDataProcedures.pdf “Navigating Your IRB to Share Restricted Data” Webinar
Preservation
Digital Preservation has a unique set of requirements: –Persistence –Reliability –Scalability –Preserving bits as well as the meaning –Cost Source: Yakel, 2012
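One common way to check that the bits themselves are preserved is a fixity check: record a checksum for each file at deposit and recompute it later. The sketch below assumes SHA-256 and a simple in-memory manifest. The function names are hypothetical and this is not a description of ICPSR's own tooling. It also says nothing about preserving meaning, which depends on documentation and metadata.

```python
import hashlib
from pathlib import Path

def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to limit memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest(directory):
    """Map each file's path (relative to the directory) to its digest."""
    root = Path(directory)
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }

def changed_files(directory, manifest):
    """List files that were added, removed, or altered since the manifest was built."""
    current = build_manifest(directory)
    names = manifest.keys() | current.keys()
    return sorted(n for n in names if manifest.get(n) != current.get(n))
```

A repository would typically store the manifest separately from the files and rerun `changed_files` on a schedule, so that silent corruption or accidental edits are detected while a good copy still exists.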
Digital Preservation Challenges Vulnerabilities of digital information –Neglect –System Failure –Intentional Destruction –Errors (Human and System-Induced) –Inter-dependencies (hardware, software, OS) –Context dependencies –Technology Obsolescence –Heterogeneity Source: Yakel, 2012
Digital Preservation Challenges Sustainability –Repositories –File formats –Processes –Expertise Source: Yakel, 2012
Digital Preservation Policies Digital Preservation Policy Framework –OAIS compliance; organizational capacity; technology and security Access Policy Framework –Access levels; authorization/authentication rules Collection Development Policy –Selection and appraisal criteria; areas of emphasis Disaster Planning Policy Framework –Business continuity, communications, disaster recovery
Repository Assessments TRAC/ISO Data Seal of Approval World Data System
Attribution
Title Author Date Version Persistent identifier (such as the Digital Object Identifier, Uniform Resource Name URN, or Handle System)
Example
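The attribution elements (title, author, date, version, persistent identifier) can be held in a small record and rendered as a citation string. This is a sketch only: the class name, the field order in the output, and every value below are hypothetical placeholders, and the layout is not an official ICPSR or DataCite citation format.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataCitation:
    """The five attribution elements for a data collection."""
    author: str
    title: str
    date: str
    version: str
    identifier: str  # persistent identifier, e.g. a DOI, URN, or Handle

    def format(self) -> str:
        """Render the elements in one assumed order; real styles vary."""
        return (
            f"{self.author} ({self.date}). {self.title}. "
            f"{self.version}. {self.identifier}"
        )

# Placeholder values, invented for illustration; not a real data collection.
citation = DataCitation(
    author="Example, A.",
    title="Hypothetical Survey of Examples",
    date="2014",
    version="Version 1",
    identifier="doi:10.0000/EXAMPLE",
)

print(citation.format())
```

Keeping the version and the persistent identifier as separate fields lets a reader tell exactly which release of the data was analyzed, even after the collection is updated.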
Access
ICPSR’s Guidelines for OSTP Data Access Plan Page
Tools
Manage and Curate to Share
See: Summer Program Course: Curating and Managing Research Data for Re-Use
Thank you!
Comparing variables across studies