Manipulating data: Deriving variables, handling missing data, and cleaning data - practices, services and standards Paul Lambert (Dept. Applied Social.

Presentation transcript:

Manipulating data: Deriving variables, handling missing data, and cleaning data - practices, services and standards
Paul Lambert (Dept. Applied Social Science, Univ. Stirling)
Vernon Gayle (Dept. Applied Social Science, Univ. Stirling and ISER, Univ. Essex)
27th January 2009
Presented to the workshop "The significance of data management for social survey research", University of Essex, a workshop organised by the Economic and Social Data Service and the Data Management through e-Social Science research Node of the National Centre for e-Social Science

2 Manipulating data
Operations performed on datasets by researchers and/or data distributors
  o At any stage of the research lifecycle
  o Of considerable consequence to analytical results
DAMES Node: Data Management = manipulation of data, and documenting/assisting the processes of manipulation
E-Social Science approach to facilitating data manipulation (metadata resources; data access facilities; workflow models)

3 Deriving variables, handling missing data, and cleaning data
…Especially common types of data manipulation…
1) Deriving variables = computing new measures for purposes of analysis
  o E.g. recoding complex categorical variables; standardising scores; linking micro- and macro-data
  o {Creating composite variables, e.g. selection model hazards, propensity scores, weights}
2) Handling missing data = strategies for item or case non-response
  o E.g. imputation approaches; listwise/pairwise deletion
  o {Deriving missing variables via data fusion}
  o Clarifying, stating and documenting assumptions
3) Cleaning data = monitoring and adjusting responses across a given set of variables
  o E.g. extreme values; erroneous values; re-scaling distributions
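The three types of manipulation listed on this slide can be sketched on a toy dataset. This is only an illustration in Python, not the authors' own syntax: the records, the age cut-points, the function names and the choice of a median-absolute-deviation rule for extreme values are all assumptions made for the example.

```python
from statistics import median

# Hypothetical toy survey records (not real data); None marks item non-response.
records = [
    {"id": 1, "age": 34, "income": 21000.0},
    {"id": 2, "age": 51, "income": None},
    {"id": 3, "age": 29, "income": 18500.0},
    {"id": 4, "age": 47, "income": 250000.0},
    {"id": 5, "age": 62, "income": 23500.0},
]


def recode_age(age):
    """Derive a categorical age band (cut-points are illustrative assumptions)."""
    if age < 35:
        return "under 35"
    if age < 55:
        return "35-54"
    return "55+"


def listwise_delete(rows, variables):
    """Keep only the cases with a valid value on every named variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


def flag_extreme(values, threshold=5.0):
    """Flag values far from the median, in units of the median absolute deviation."""
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        return [False for _ in values]
    return [abs(v - centre) / mad > threshold for v in values]


# 1) Deriving a variable: add an age band to every case.
derived = [dict(r, age_band=recode_age(r["age"])) for r in records]

# 2) Handling missing data: listwise deletion on income.
complete = listwise_delete(derived, ["income"])

# 3) Cleaning data: flag (rather than silently drop) extreme incomes.
flags = flag_extreme([r["income"] for r in complete])
for row, is_extreme in zip(complete, flags):
    print(row["id"], row["age_band"], row["income"], "EXTREME" if is_extreme else "")
```

Flagging rather than deleting the extreme case keeps the decision visible, which fits the slide's point about stating and documenting assumptions.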

4 In this talk…
Practices, services and standards
…for deriving variables, handling missing data, and cleaning data…
Practices
  o Key, or common, features of current approaches
Services
  o Resources available/conceivable
Standards
  o Preliminary thoughts on standards setting

5 (i) A brief illustrative example from the UK RAE 2008
Research Assessment Exercise data published Dec 2008
Extended reporting on basic data by media/within HE sector, e.g.
  o "Cambridge leads the way"
  o "Nursing raises its status"
Numerous enhancements/amendments to data & analysis could be easily generated, and often lead to a different story
Lambert, P.S. and Gayle, V. (2008). Data management and standardisation: A methodological comment on using results from the UK Research Assessment Exercise 2008. University of Stirling: Technical Paper of the Data Management through e-Social Science Research Node.

6 …Extending analysis of the 2008 RAE using data manipulations…
Deriving variables
  o Commonly used RAE grade point average: GPA = [4 × (%4*) + 3 × (%3*) + 2 × (%2*) + 1 × (%1*)] / 100
  o Calculate alternative GPA measures
  o Standardise GPA within Units of Assessment (UoAs)
  o Rate Units of Assessment by external measures of relative prestige
  o Link with 2001 standard thresholds
  o Other external data, e.g. university typologies; RAE panel membership
Cleaning data
  o Of 159 HEIs, 27 have only 1 UoA (cf. a mean of 15 UoAs within HEI, max 53 at Manchester)
  o The single-UoA HEIs often have outlying GPAs
  o Analyses of averages might exclude these HEIs
Handling missing data
  o Less conventionally missing data (admin dataset)
  o But not all HEI staff were included within the RAE; consider analysis accounting for the number of excluded staff?
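As a rough illustration of the derivations on this slide, the sketch below computes the grade point average from a quality profile, standardises it within each Unit of Assessment, and lists HEIs with a single UoA. It is written in Python rather than the Stata syntax used in the authors' paper. The quality profiles are invented, and the use of the population standard deviation for the within-UoA z-score is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev


def gpa(pct4, pct3, pct2, pct1):
    """RAE grade point average from the percentages of work rated 4*, 3*, 2*, 1*."""
    return (4 * pct4 + 3 * pct3 + 2 * pct2 + 1 * pct1) / 100


# Hypothetical quality profiles (HEI, UoA, %4*, %3*, %2*, %1*); not real RAE results.
submissions = [
    ("HEI A", "Sociology", 20, 40, 30, 10),
    ("HEI B", "Sociology", 10, 30, 40, 20),
    ("HEI C", "Sociology", 30, 40, 20, 10),
    ("HEI A", "Nursing", 10, 40, 40, 10),
    ("HEI B", "Nursing", 5, 25, 50, 20),
]

# Derive the conventional GPA for each submission.
scores = [(hei, uoa, gpa(*profile)) for hei, uoa, *profile in submissions]

# Group GPAs by Unit of Assessment so each can be standardised within subject.
by_uoa = defaultdict(list)
for _, uoa, score in scores:
    by_uoa[uoa].append(score)


def standardise(score, uoa):
    """Z-score of a GPA relative to all submissions in the same UoA."""
    spread = pstdev(by_uoa[uoa])
    return 0.0 if spread == 0 else (score - mean(by_uoa[uoa])) / spread


z_scores = {(hei, uoa): standardise(score, uoa) for hei, uoa, score in scores}

# Cleaning step: list HEIs that submitted to only one UoA, as candidates
# for exclusion from analyses of institutional averages.
uoa_counts = defaultdict(int)
for hei, _, _ in scores:
    uoa_counts[hei] += 1
single_uoa_heis = sorted(h for h, n in uoa_counts.items() if n == 1)

for (hei, uoa), z in z_scores.items():
    print(f"{hei:6s} {uoa:10s} z = {z:+.2f}")
print("Single-UoA HEIs:", single_uoa_heis)
```

Standardising within UoA compares a submission only with others in the same subject, which is one way a ranking can change relative to the raw GPA.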

7 Conventional RAE 2008 results for Univ. Essex

8 Alternative RAE 2008 measures for Univ. Essex (within- and between-subject standardisations)

13 RAE data manipulations example – practices, services and standards
Practices
  o Media/HEI announcements concentrate upon simplistic, unweighted, unstandardised rankings/averages
  o Various alternative measures tell different stories; we found, e.g., that LSE outranks Cambridge and that Nursing ranks as the 6th least prestigious UoA of 67
Services
  o Raw data and relevant supplementary data available online
Standards
  o RAE-level documentation on grading criteria and approach
  o Software-based workflow approach (cf. Scott Long, 2009)
  o In our paper we show Stata syntax for derived variables

14 (ii) Some wider thoughts on data manipulation practices, services and standards
Currently…
Practices are messy and painful
  o Lack of replication and consistency in data manipulation tasks with complex survey data
  o Few people relish data manipulations!
Services exist but are under-exploited
Standards are not agreed
  o Ignoring standards is no barrier to publication

15 Practices: apparent trends
Deriving variables, handling missing data, cleaning data
More interest in harmonisation and comparability
  o Longitudinal and cross-national data
  o Documentation challenges encourage simplifying approaches
New data and analytical opportunities
  o Increasing opportunities for enhancing data by linking at micro- or aggregate level
  o Increasing availability of routines for missing values, extreme values
Raising standards in secondary analysis of large-scale surveys
  o Inadequacy of simple analyses which ignore multivariate relations, missing data, multiprocess systems, hierarchical structures
    o Data manipulations often conducted outside these considerations
  o Desirability of replication

16 Services: key challenges
Deriving variables, handling missing data, cleaning data
Software issues
  o Dominance of major proprietary database packages
  o Other specialist/minority packages (e.g. MLwiN)
  o Documentation / replication between packages?
Data security
  o Few services can offer to let experts take over a dataset
  o Approaches to reviewing data ought to avoid inspecting cases, duplicate copies
Keeping up-to-date?
  o Finding data: need for search facilities [via metadata]
  o Updating specialist advice
    o E.g. in GEODE, occupational data were out of date before completion
    o NSIs' strict focus on contemporary data

17 Standards: key requirements
Deriving variables, handling missing data, cleaning data
Need for documentation for replication
  o Detailed accounts of process
  o Citation of sources
  o DAMES – to facilitate with metadata and process tools
Resolving some difficult debates
  o Approaches to comparative research (measurement equivalence vs meaning equivalence)
  o Necessary standards for analysis/reporting on missing data
  o Appropriate approaches to extreme values, e.g. robust regressions
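One way to picture the "detailed accounts of process" and "citation of sources" requirements is a log that is written as each manipulation is applied. The Python sketch below is a hypothetical illustration only and does not represent the DAMES tools: the function `apply_step`, the log fields, the toy cases and the coding frame are all invented for the example.

```python
def apply_step(rows, log, name, description, func, source=None):
    """Apply one manipulation and append a record of it to the log."""
    result = func(rows)
    log.append({
        "step": name,
        "description": description,
        "source": source,  # citation for any external resource used
        "n_before": len(rows),
        "n_after": len(result),
    })
    return result


# Hypothetical toy cases; None marks a missing education code.
cases = [
    {"id": 1, "educ": 3},
    {"id": 2, "educ": None},
    {"id": 3, "educ": 1},
]

# Hypothetical recoding scheme standing in for a published classification.
EDUC_LABELS = {1: "low", 2: "medium", 3: "high"}

log = []

cases = apply_step(
    cases, log,
    name="drop_missing_educ",
    description="Listwise deletion of cases with missing education",
    func=lambda rows: [r for r in rows if r["educ"] is not None],
)

cases = apply_step(
    cases, log,
    name="derive_educ3",
    description="Recode education into three labelled levels",
    func=lambda rows: [dict(r, educ3=EDUC_LABELS[r["educ"]]) for r in rows],
    source="hypothetical coding frame, v1",
)

# The log is a step-by-step account that could accompany the analysis.
for entry in log:
    print(entry)
```

Recording case counts before and after each step makes it easy to see where cases were lost, which is the kind of detail a replication would need.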

18 Forthcoming DAMES contributions
Summer workshops on documenting manipulation and analysis of complex survey data
  o To Stata and beyond…
Services for improving data manipulation activities
  o Specialist data on occupations, ethnicity, education
  o Specialist data on social care, mental health
  o Tools for performing data manipulations (linking data and operationalising variables)
Services for recording data manipulation activities
  o Workflow modelling tools
  o Metadata records for data linkages and variables
  o Citation information