

Working with Data Julia Lane

Key idea: Measures, measures everywhere, and we have to stop and think (with apologies to Samuel Taylor Coleridge). What are we measuring? How are we measuring it? What are we missing? Are we protecting human subjects? Can entities be reidentified?

Outline: Define a research question (what are we measuring?); Think about what data are available and the measurement error (how are we measuring it?); Link datasets (what are we missing?); Address Privacy and Confidentiality/Ethics (are we protecting human subjects?); Disseminate results (can people be reidentified?)

Here’s the problem: “Big Data” is an imprecise description of a rich and complicated set of characteristics, practices, techniques, ethics, and outcomes all associated with data (AAPOR). No canonical definition. By characteristics: Volume, Velocity, Variety (and Variability and Veracity). By source: found vs. made. By use: professionals vs. citizen science. By reach: datafication. By paradigm: Fourth paradigm.

Define a research question: Write down a conceptual framework/hypothesis; Check the literature; Develop an empirical approach. Why? Correlations.

Example: LEHD. What is the return to training? What is the impact of firms on workers (low wage work)? What is the impact of workers on firms (productivity and competitiveness)?

Outline: Define a research question; Think about what data are available and the measurement error; Link datasets; Address Privacy and Confidentiality/Ethics; Disseminate results

UI Wage Record data. Universal: 98% of employment in partner states. Longitudinal in businesses and workers. Current: sent quarterly (six months after transaction date), start date 1990(+). Comparable: same definitions across states.

Measurement Error: Data generation process (Kreuter and Peng, 2014). Total error = Row error + Column error + Cell error (thanks to Paul Biemer, 2015).

Row error. Omissions: some rows are missing, which implies that elements in the target population are not represented on the file. Duplications: some population elements occupy more than one row. Erroneous inclusions: some rows contain elements or entities that are not part of the target population.
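
The three kinds of row error can be checked mechanically when a list of the target population is available. The following is a minimal sketch, assuming each row carries a unit identifier; the function name `row_errors` and the worker IDs are hypothetical and not part of the original slides.

```python
from collections import Counter

def row_errors(file_ids, target_population):
    """Classify row errors by comparing a file's unit IDs with a target population."""
    counts = Counter(file_ids)
    on_file = set(counts)
    target = set(target_population)
    return {
        # Omissions: population elements with no row on the file.
        "omissions": sorted(target - on_file),
        # Duplications: population elements that occupy more than one row.
        "duplications": sorted(i for i, n in counts.items() if n > 1 and i in target),
        # Erroneous inclusions: rows for elements outside the target population.
        "erroneous_inclusions": sorted(on_file - target),
    }

# Hypothetical example: worker IDs on a file versus a known population list.
result = row_errors(["w1", "w2", "w2", "w9"], ["w1", "w2", "w3"])
print(result)
```

In practice the target population is rarely fully known, so a check like this is usually run against the best available frame rather than a true census of units.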

Column error. Specification error: Concept (what is the underlying latent variable of interest); Measurement (what is the actual measurement used); Interpretation (how is the analyst interpreting it).

Cell error: content error (problems in the measurement process, transcription error, data processing error); specification error (error in data capture for specific units); missing data (the measurement process, transcription error, data processing error).

Consequences for inference: regression of y on x with and without variable error. The figure on the left is the population regression with no error in the x variable. On the right, variable error was added to the x-values with a reliability ratio of [...]. Note its attenuated slope, which is very near the theoretical value of 0.77 (Source: Paul Biemer, 2015).
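
The attenuation described on this slide can be reproduced in a small simulation. The sketch below assumes classical measurement error (noise added to x that is independent of x and of y), in which case the fitted slope shrinks towards the true slope multiplied by the reliability ratio var(x) / (var(x) + var(u)). All values here (true slope 1.0, reliability ratio 0.8, sample size, seed) are illustrative assumptions and are not the values behind Biemer's figure.

```python
import random

def ols_slope(x, y):
    """Ordinary least squares slope of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(0)          # fixed seed so the run is repeatable
n = 20000
beta = 1.0                      # assumed true slope
sd_x, sd_u = 1.0, 0.5           # spread of the true x and of the added error
reliability = sd_x**2 / (sd_x**2 + sd_u**2)   # 0.8 under these assumptions

x_true = [rng.gauss(0, sd_x) for _ in range(n)]
y = [beta * x + rng.gauss(0, 0.5) for x in x_true]
x_obs = [x + rng.gauss(0, sd_u) for x in x_true]   # x measured with error

slope_true = ols_slope(x_true, y)
slope_obs = ols_slope(x_obs, y)
print(f"slope without variable error: {slope_true:.3f}")
print(f"slope with variable error:    {slope_obs:.3f}")
print(f"theoretical attenuated slope: {reliability * beta:.3f}")
```

Increasing `n` tightens the estimate around the attenuated value but does not move it back towards the true slope, which is why a large sample does not correct this kind of error.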

Mitigation: a major challenge with big data (and not corrected with large samples!). Data editing and cleaning; anomaly identification and resolution (Chandola et al., 2009); selective editing strategies (see, for example, Granquist and Kovar, 1997; De Waal, Pannekoek, and Scholtus, 2011).

Mitigation: data mining (Natarajan, Li, and Koronis, 2009); machine learning (Clarke, 2014); cluster analysis (Duan, Xu, and Lee, 2009; He, Xu, Deng, 2003); various data visualization tools such as treemaps (Shneiderman, 1992; Tennekes, de Jonge and Daas, 2012) and tableplots (Tennekes, de Jonge and Daas, 2013; Tennekes, 2012; Puts, Daas and de Waal, 2015).
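
As one very simple illustration of anomaly identification during data editing, the sketch below flags values that sit far from the median, using a robust score based on the median absolute deviation. This is a stand-in chosen for brevity, not one of the methods in the cited papers; the function name `flag_anomalies`, the 3.5 cutoff, and the earnings figures are assumptions made for the example.

```python
from statistics import median

def flag_anomalies(values, threshold=3.5):
    """Flag values whose robust (median/MAD based) score exceeds the threshold."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # No spread around the median: nothing can be scored, so flag nothing.
        return [False] * len(values)
    # 0.6745 rescales the MAD so the score is comparable to a z-score
    # when the data are roughly normal.
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]

# Hypothetical quarterly earnings with one likely keying error.
earnings = [5200, 5100, 4900, 5300, 5000, 520000]
flags = flag_anomalies(earnings)
print([v for v, is_odd in zip(earnings, flags) if is_odd])
```

A flagged value is a candidate for review, not an automatic deletion: resolving it still requires deciding whether it is an error or a genuine extreme.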

Outline: Define a research question; Think about what data are available and the measurement error; Link datasets; Address Privacy and Confidentiality/Ethics; Disseminate results

UI Wage Record data. Universal: 98% of employment in partner states. Longitudinal in businesses and workers. Worker Information: date of birth, place of birth, sex, earnings. Firm Information: four-digit industry, turnover, growth, sales. Current: sent quarterly (six months after transaction date), start date 1990(+). Comparable: same definitions across states. Detailed Geography: place of residence and place of work to latitude/longitude.

LEHD data. Link Record: Person-ID, Employer-ID, Data. Business Register: Employer-ID, Census Entity-ID, Data. Economic Censuses and Surveys: Census Entity-ID, Data. Demographic Surveys: Household Record (Household-ID, Data) and Person Record (Household-ID, Person-ID, Data).
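
The key structure on this slide amounts to a chain of joins: the link record connects a person to an employer, the employer ID leads to the business register, the Census entity ID leads to the economic censuses and surveys, and the person ID leads to the demographic surveys. The sketch below walks that chain with plain dictionaries. All records, field names, and values are invented for illustration and do not reflect the actual LEHD file layouts.

```python
# Hypothetical miniature versions of the four sources, keyed as on the slide.
link_records = [
    {"person_id": "P1", "employer_id": "E1", "earnings": 9000},
    {"person_id": "P2", "employer_id": "E1", "earnings": 7000},
    {"person_id": "P3", "employer_id": "E2", "earnings": 8000},
]
business_register = {"E1": {"entity_id": "C10"}, "E2": {"entity_id": "C20"}}
economic_census = {"C10": {"industry": "retail"}, "C20": {"industry": "construction"}}
person_records = {"P1": {"household_id": "H1"}, "P2": {"household_id": "H2"}}

def link(link_records, business_register, economic_census, person_records):
    """Follow Person-ID and Employer-ID keys; keep track of what fails to link."""
    linked, unmatched = [], []
    for rec in link_records:
        employer = business_register.get(rec["employer_id"])
        firm = economic_census.get(employer["entity_id"]) if employer else None
        person = person_records.get(rec["person_id"])
        if employer is None or firm is None or person is None:
            unmatched.append(rec)   # the "what are we missing?" pile
            continue
        linked.append({**rec, **employer, **firm, **person})
    return linked, unmatched

linked, unmatched = link(link_records, business_register, economic_census, person_records)
print(len(linked), "linked;", len(unmatched), "unmatched")
```

Keeping the unmatched records rather than silently dropping them makes it possible to ask whether the records that fail to link differ systematically from those that do.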

Preprocessing: Workflow

Source: Köpcke H, Thor A, Rahm E. Evaluation of entity resolution approaches on real-world match problems. Proceedings of the VLDB Endowment

[Table: evaluation results; numeric values not preserved in the transcript. Columns: UMASS, ISTIC, FLEMING/LI, with results when trained on the random mixture dataset, results when trained on the common characteristics dataset, and no training. Rows: Precision, Splitting, Recall, Lumping, F score, True Positives, False Negatives, False Positives, Runtime (entries include "7 hours on c3.8xlarge AWS instance", "N/A", "CPU usage topped at 69%", "CPU usage topped at 11.85%").] Source: PatentsView Evaluation
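
Precision, recall, and the F score in an evaluation like this one are computed from the counts of true positives, false positives, and false negatives. The sketch below uses the standard definitions; the counts are hypothetical and are not the PatentsView results, and the splitting and lumping measures are not reproduced here.

```python
def evaluate(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F score) from pairwise match counts."""
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    # Precision: share of predicted matches that are real matches.
    precision = true_positives / predicted if predicted else 0.0
    # Recall: share of real matches that were found.
    recall = true_positives / actual if actual else 0.0
    # F score: harmonic mean of precision and recall.
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

# Hypothetical counts for one matching run.
precision, recall, f_score = evaluate(true_positives=90, false_positives=10, false_negatives=30)
print(f"precision={precision:.3f} recall={recall:.3f} F={f_score:.3f}")
```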

Outline: Define a research question; Think about what data are available and the measurement error; Link datasets; Address Privacy and Confidentiality/Ethics; Disseminate results

Impossible to ask for consent

Approach Change way in which we access and disseminate data

Common Rule Suggestions Source: Julia Lane

Set up major research facilities

Practical approach

Outline: Define a research question; Think about what data are available and the measurement error; Link datasets; Address Privacy and Confidentiality/Ethics; Disseminate results

Core challenge (Introduction to Statistical Disclosure Control (SDC), Matthias Templ, Bernhard Meindl and Alexander Kowarik)

What is disclosure? Identity disclosure (linkage with externally available data); attribute disclosure; inferential disclosure.

Approaches: basic risk measurement; recoding; local suppression; PRAM (post-randomization); information loss measures; shuffling; microaggregation; adding noise.
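
Two of these approaches are easy to show on a toy file: a basic risk measure that counts how many records share each combination of identifying variables, and recoding, which coarsens a variable so that fewer combinations are rare. This is a minimal sketch under assumed names; `risky_records`, `recode_age`, the choice of key variables, the threshold `k=2`, and the records themselves are all hypothetical.

```python
from collections import Counter

def risky_records(records, keys, k=2):
    """Return records whose combination of key variables occurs fewer than k times."""
    combos = Counter(tuple(r[key] for key in keys) for r in records)
    return [r for r in records if combos[tuple(r[key] for key in keys)] < k]

def recode_age(record, width=10):
    """Recode exact age into a band of the given width, e.g. 34 -> '30-39'."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}

# Hypothetical microdata with three identifying (key) variables.
records = [
    {"age": 31, "sex": "F", "region": "North"},
    {"age": 34, "sex": "F", "region": "North"},
    {"age": 38, "sex": "F", "region": "North"},
    {"age": 52, "sex": "M", "region": "South"},
    {"age": 57, "sex": "M", "region": "South"},
]
keys = ["age", "sex", "region"]

before = risky_records(records, keys)            # every exact age is unique
recoded = [recode_age(r) for r in records]
after = risky_records(recoded, keys)             # banded ages are shared
print(len(before), "risky records before recoding;", len(after), "after")
```

The drop in risky records comes at the cost of detail in the age variable, which is the trade-off that information loss measures are meant to quantify.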

Example (Introduction to Statistical Disclosure Control (SDC), Matthias Templ, Bernhard Meindl and Alexander Kowarik)

Practical approach

And a reminder of why: Measures, measures everywhere, and we have to stop and think. What are we measuring? How are we measuring it? What are we missing? Are we protecting human subjects? Can people be reidentified?