Creating Something from Nothing: Synthetic and Dummy files Bo Wandschneider University of Guelph Chuck Humphrey University of Alberta DLI Training: Ottawa,


Creating Something from Nothing: Synthetic and Dummy files Bo Wandschneider University of Guelph Chuck Humphrey University of Alberta DLI Training: Ottawa, May, 2003

Outline
- Types of data files
- Implications for analysis
- Where do we get access?
- Which file is appropriate?
- Providing service with synthetic files
- NPHS: an exercise
- SLID: an exercise

Types of Data Files
- Microdata
  - Confidential Microdata Products
    - Master Files
    - Share Files
  - Public Access Microdata Products
    - Public Use Anonymized Microdata (PUMFs)
    - Synthetic Files

Microdata Products: Microdata
- raw data organized in a file where the records or lines in the file are observations of a specific unit of analysis and the information on each line is the values of variables
- requires some form of processing or analysis to be used
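To make the definition concrete, here is a minimal sketch of microdata and the processing step it requires before it yields a statistic. The records, variable names and values are invented for illustration and do not come from any Statistics Canada file.

```python
from collections import Counter

# Hypothetical microdata: one record per respondent (the unit of analysis),
# one key per variable. All values are invented for illustration.
records = [
    {"id": 1, "province": "ON", "age": 34, "smoker": "yes"},
    {"id": 2, "province": "ON", "age": 51, "smoker": "no"},
    {"id": 3, "province": "AB", "age": 27, "smoker": "no"},
    {"id": 4, "province": "AB", "age": 45, "smoker": "yes"},
    {"id": 5, "province": "ON", "age": 62, "smoker": "no"},
]

# Processing step: the raw records only become usable statistics
# once they are tabulated or summarized.
smoker_counts = Counter(r["smoker"] for r in records)
mean_age = sum(r["age"] for r in records) / len(records)

print(dict(smoker_counts))
print(mean_age)
```

The file itself holds no statistics; the frequency table and the mean exist only after the analysis is run.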

Microdata Products Microdata - SCF Example

Confidential Microdata: Master Files
- These files contain the fullness of detail captured about the unit of observation.
- The information in these files can identify the individual who provided the original information; the files are therefore considered confidential.

Confidential Microdata Master File – Example

Confidential Microdata Master File - Personal identifiers

Confidential Microdata Master File – Geography (SLID)

Confidential Microdata Master File - Fullness of Data (NPHS)

Confidential Microdata Master File - Fullness of Data

Confidential Microdata Master File - Fullness of Data (SLID)

Confidential Microdata Master File - Fullness of Data

Confidential Microdata: Share Files
- These are confidential files for which the respondents have signed a consent form permitting Statistics Canada to allow access to their information for approved research.
- Used with NPHS and NLSCY.

Public Access Microdata: Anonymized Microdata
- these microdata are specially prepared to minimize the possibility of disclosing or identifying any of the cases or observations
- the original data from the master file are edited to create a public use microdata file

Public Access Microdata: Steps in Anonymizing Microdata
- removal of all personal identification information (names, addresses, etc.)
- include only gross levels of geography
- collapse detailed information into a smaller number of general categories
- suppress the values of a variable
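The four steps above can be sketched on a single record. This is only an illustration under stated assumptions: the field names, the province-to-region mapping, the age bands and the income cut-off are all hypothetical, and the sketch is not Statistics Canada's actual procedure.

```python
# All field names, mappings and cut-offs below are hypothetical.
IDENTIFIERS = {"name", "address"}
REGION_OF_PROVINCE = {
    "NS": "Atlantic", "NB": "Atlantic", "ON": "Ontario", "AB": "Prairies",
}


def age_group(age):
    """Collapse exact age into a few broad categories."""
    if age < 25:
        return "under 25"
    if age < 65:
        return "25-64"
    return "65+"


def anonymize(record, income_cutoff=100_000):
    # Step 1: remove personal identification information.
    out = {k: v for k, v in record.items() if k not in IDENTIFIERS}
    # Step 2: keep only a gross level of geography.
    out["region"] = REGION_OF_PROVINCE[out.pop("province")]
    out.pop("postal_code", None)
    # Step 3: collapse detailed information into general categories.
    out["age_group"] = age_group(out.pop("age"))
    # Step 4: suppress a value that could single out a respondent.
    if out["income"] > income_cutoff:
        out["income"] = None
    return out


master_record = {
    "name": "A. Example", "address": "1 Main St", "postal_code": "B4P 2R6",
    "province": "NS", "age": 37, "income": 250_000,
}
print(anonymize(master_record))
```

Each step discards detail, which is why the resulting public use file supports fewer analyses than the master file it was derived from.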

Public Access Microdata: Statistics Canada PUMFs
- only available for select social surveys that undergo a review by the Data Release Committee, an internal Statistics Canada committee
- no ‘enterprise’ public use microdata

Public Access Microdata: Statistics Canada PUMFs
- almost all are cross-sectional, that is, represent data collected at one point in time
- longitudinal data are difficult to anonymize while maintaining any useful information

Public Access Microdata PUMFs – personal identifiers

Public Access Microdata PUMFs – gross geography

Public Access Microdata PUMFs – collapsed data

Public Access Microdata PUMFs – suppressed data

Public Access Microdata: Synthetic Files
- These microdata do not contain actual ‘real’ cases but pseudo-cases that provide aggregate results close to those of the ‘real’ cases.

Public Access Microdata: Synthetic Files
- They are prepared so that analysis runs intended for the master file can be developed without any possibility of disclosing or identifying any of the cases.

Public Access Microdata: Synthetic Files
- The results are not to be reported; the files are strictly for preparing analyses of master files.
- Usually associated with longitudinal files.

Public Access Microdata: Steps in Creating Synthetic Files
- Observations are transformed
- No records actually exist
- Keep fullness of detail

Public Access Microdata Synthetic Files – NPHS example

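Public Access Microdata: Synthetic Files – NPHS 1999 general file
[Table comparing the PUMF and the Synthetic file by number of observations (Obs) and variables (Var); the cell values were fused during extraction: Obs 49046, Var 176400]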

Public Access Microdata Synthetic Files – NPHS 1999

Public Access Microdata Synthetic Files – NPHS 1999

Implications for Analysis What are the implications in doing analysis with these different types of microdata files?

Implications for Analysis: Master File
- All observations
- Has the most variables with the most detail
- Lots of geography and personal characteristics
- Little grouping or capping of categories

Implications for Analysis: Master File
- Restricted access: only available to authorized Statistics Canada employees, which includes ‘deemed employees’

Implications for Analysis: Master File
- Includes linkage variables across files within a study, e.g., NLSCY linkage among the files for different units of analysis (kids, parents, teachers)

Implications for Analysis: Public Use Microdata (PUMF)
- Suppressed observations
- Suppressed variables: removed from the file
- Suppressed content
- Gross geography
- Collapsed categories
- Capped values

Implications for Analysis: Public Use Microdata (PUMF)
- Licensed product: agree to certain terms of use
- No linkage to multiple units of analysis, with a few exceptions (GSS Time Use and Family)

Implications for Analysis Synthetic Files “Looks like a duck and quacks like a duck”, but it isn’t a duck or any other type of fowl.

Implications for Analysis: Synthetic Files
- Looks like master files
- Lots of observations
- Lots of variables
- Little grouping or capping of categories
- Lots of geographic detail

Synthetic Files: Precautions
- Results not authentic, but close in the aggregate
- Use for testing analysis setups only
- Still need the master files for publishable results

Where do we get Access? Master File
- Restricted access governed under the Statistics Act
- Remote Job Submission
- Research Data Centres: apply to SSHRC to have the proposal peer-reviewed and to STC for security clearance

Where do we get Access? Public Use Microdata Files (PUMF)
- Get from DLI
- Analyze wherever is convenient
- Can use a variety of analysis software, including SAS, SPSS, Stata, HLM, LISREL, etc.
- SLIDRET sans data

Where do we get Access? Synthetic Files
- Author Divisions ‘may’ create it
- Most relevant when dealing with new Panel Data, but not necessarily, e.g., the Census has potential
- NPHS synthetic files on DLI FTP site

Where do we get Access? Synthetic Files
- SLID, WES, YITS coming ????
- Do we need to encourage them?
- Work with the files locally
- Build SAS and SPSS setups

Which File is Appropriate?
- First stop is still the PUMF
- This file has the easiest access for us
- Probably meets the needs of most clients
- Not as administratively burdensome as the synthetic or master file
- Perfect for clients just looking for ‘data’, e.g., for courses in quantitative analysis

Which File is Appropriate?
- If more detail is needed, refer to the Master File Documentation (similar to Synthetic File Documentation)
- Make them aware that the cost of use is higher, both in terms of accessibility and analytical requirements
- Interest most likely to come from grad students and ‘experienced’ researchers

Which File is Appropriate?
- Download the synthetic files from DLI
- Make them aware of problems with synthetic files: RESULTS ARE NOT PUBLISHABLE
- Encourage them to submit an application for RDC access, since there is a time lag

Which File is Appropriate? RDC

Which File is Appropriate?
- Some of you may work with a client using synthetic files before passing her/him off to the RDC

Services for Synthetic Files
DLI Contacts can provide four basic services with synthetic files:
- Build SPSS and SAS system files from the raw synthetic data files that are distributed through DLI;
- Provide information about the use of Remote Job Submission (a.k.a. Remote Access) and RDCs;

Services for Synthetic Files
- Assist with finding variables in the synthetic files;
- Provide instruction about ways of capturing SPSS or SAS code from “dummy” analysis runs with the synthetic files. It is this code that is then submitted to STC through Remote Job Submission.

Services for Synthetic Files
1. Building SPSS and SAS system files for synthetic data
- The NPHS synthetic data are distributed as a raw ASCII file with accompanying command files for SPSS and SAS
- Separate synthetic data files exist for the master file setup and for bootstrapping analysis

Services for Synthetic Files
1. Building SPSS and SAS system files for synthetic data
- The synthetic data file for the NPHS has 4,138 variables and 17,276 fabricated cases.
- Creating the SPSS and SAS system files from this file is not difficult, but it does take time. DLI Contacts may wish to create these products for their patrons.
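As a rough picture of what a command file does with a raw ASCII data file, the sketch below reads fixed-width records using a column layout. It assumes a fixed-column file; the three-variable layout and the sample lines are invented, and the real NPHS command files define thousands of variables in SPSS or SAS syntax rather than Python.

```python
import io

# Hypothetical layout: (variable name, start column, end column),
# with columns counted from 1 and both ends inclusive.
LAYOUT = [("CASEID", 1, 5), ("SEX", 6, 6), ("AGE", 7, 9)]


def parse_line(line, layout=LAYOUT):
    """Cut one fixed-width record into named variables."""
    return {name: line[start - 1:end].strip() for name, start, end in layout}


def read_fixed_width(handle, layout=LAYOUT):
    """Read every non-blank line of a raw ASCII file into a list of records."""
    return [parse_line(line.rstrip("\n"), layout) for line in handle if line.strip()]


# Two invented records standing in for the raw synthetic data file.
raw = io.StringIO("000011 34\n000022 51\n")
rows = read_fixed_width(raw)
print(rows)
```

With several thousand variables the layout table becomes long, which is one reason building the system files once and sharing them saves patrons time.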

Services for Synthetic Files
2. Information about Remote Job Submission (RJS)
- The author divisions supporting RJS have established their own guidelines and have different operating procedures.
- Not all divisions supporting longitudinal surveys currently support RJS.
- Therefore, there is a need to track down this information for our patrons.

Services for Synthetic Files
2. Information about Remote Job Submission (RJS)
- For example, the sources for information about RJS include the Centre for Education Statistics:

Services for Synthetic Files
2. Information about Remote Job Submission (RJS)
- Where do you find this information?
- Ask the DLI Team via the DLI List
- The EAC has asked for a description of RJS on the DLI website, which should be on the DLI Team’s to-do list

Services for Synthetic Files
2. Information about Research Data Centres
- The collection of master files available through RDCs is listed on the STC website for RDCs
- Each RDC has its own website describing its services

Services for Synthetic Files
3. Data Reference for the content of the synthetic files
- Helping researchers identify variables over longitudinal files is an important service
- Need to keep the unit of analysis straight
- Need to understand the mnemonic naming convention for variables over cycles
- Develop indexing aids for you and your patrons
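One possible indexing aid is a table that groups variable names by concept and cycle. The sketch below assumes a made-up convention in which a name ends in `_C` followed by the cycle number; this is not the actual NPHS mnemonic convention, which should be taken from the survey documentation.

```python
from collections import defaultdict


def build_index(variable_names, sep="_C"):
    """Group names of the hypothetical form STEM_C<cycle> by stem and cycle."""
    index = defaultdict(dict)
    for name in variable_names:
        stem, _, cycle = name.rpartition(sep)
        if stem and cycle.isdigit():
            index[stem][int(cycle)] = name
    return dict(index)


def missing_cycles(index, stem, cycles):
    """List the cycles in which a concept has no variable."""
    return sorted(set(cycles) - set(index.get(stem, {})))


# Invented variable names for three cycles of a longitudinal file.
names = ["SMOKE_C1", "SMOKE_C2", "SMOKE_C3", "HEIGHT_C1", "HEIGHT_C3", "CASEID"]
index = build_index(names)

print(index["SMOKE"])
print(missing_cycles(index, "HEIGHT", range(1, 4)))
```

An index like this lets a patron see at a glance which concepts were measured in every cycle and which have gaps.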

Services for Synthetic Files
4. Provide helpful tips for preserving the code from “dummy” analysis runs in SPSS and SAS
- Researchers will run analyses on the synthetic file to generate the code that they will subsequently submit for Remote Job Submission
- Providing information about how to do this easily will be helpful to your patrons
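One way to keep the code from a dummy run intact is to store the analysis commands as a template in which the data file name is the only thing that changes. The sketch below is an assumption about workflow, not a documented RJS procedure: the file names and variable names are placeholders, and the author division decides how the master file is actually referenced in a submitted job.

```python
from string import Template

# SPSS syntax kept as a template; only the data file name varies.
SYNTAX = Template(
    "GET FILE='$datafile'.\n"
    "FREQUENCIES VARIABLES=$variables.\n"
)


def render(datafile, variables):
    """Fill in the data file and variable list for one run."""
    return SYNTAX.substitute(datafile=datafile, variables=" ".join(variables))


variables = ["SMOKE_C1", "SMOKE_C2"]  # hypothetical variable names

# Dummy run against the synthetic file (placeholder file name).
dummy_run = render("synthetic.sav", variables)
# The same commands prepared for submission (placeholder for the master file).
submission = render("MASTER_FILE_PLACEHOLDER", variables)

print(dummy_run)
print(submission)
```

Because both versions come from one template, the commands tested on the synthetic file are exactly the commands that get submitted.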

An Example Using the NPHS
Let’s look at an example of these four services using the synthetic files from the NPHS.

An Example Using SLID
Let’s look at an example of a “dummy” file using SLIDRET, a retrieval system developed to extract data from the cycles of the SLID. A “data-less” version of SLIDRET is available through DLI to help identify variables for RJS.

Location of Slides and Exercises