Linking the DAMES & e-Stat Nodes Paul Lambert, 26 Feb 2010, Bristol, e-Stat review meeting DAMES is the Data Management through e-Social Science research.

Presentation transcript:

Linking the DAMES & e-Stat Nodes Paul Lambert, 26 Feb 2010, Bristol, e-Stat review meeting DAMES is the Data Management through e-Social Science research Node,

2 Outline
1. Some background on DAMES
2. First thoughts on linking DAMES and e-Stat
3. Some proposals on usability / services

3 1) Data Management through e-Social Science
DAMES: ESRC-funded Node
Aim: Useful social science provisions
- Specialist data topics: occupations; educational qualifications; ethnicity; social care; health
- Mainstream packages and accessible resources
Aim: To exploit/engage with existing DM resources
- In social science, e.g. ESDS, CESSDA
- In e-Science, e.g. OGSA-DAI; OMII

4 To us, data management means…
"the tasks associated with linking related data resources, with coding and re-coding data in a consistent manner, and with accessing related data resources and combining them within the process of analysis" […DAMES Node..]
- Usually performed by social scientists themselves
- Pre-analysis tasks (though often revised/updated)
- Inputs also from data providers
- Usually a substantial component of the work process
- But may not be explicitly rewarded (and is sometimes penalised)
- Differentiate from archiving / controlling data itself

5 Some components…
Manipulating data
- Recoding categories / operationalising variables
Linking data
- Linking related data (e.g. longitudinal studies)
- Combining / enhancing data (e.g. linking micro- and macro-data)
Secure access to data
- Linking data with different levels of access permission
- Detailed access to micro-data cf. access restrictions
Harmonisation standards
- Approaches to linking concepts and measures (indicators)
- Recommendations on particular variable constructions
Cleaning data
- Missing values; implausible responses; extreme values

6 Example – recoding data
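The slide's illustration is not preserved in this transcript. As a rough sketch of what a recode involves, the Python below mirrors the age-band rule quoted in the appendix slide (`recode ageb 16/30=1 31/50=2 *=.`). The function name and the sample ages are hypothetical, and Python is used purely for illustration.

```python
from typing import Optional


def recode_age_band(age: Optional[int]) -> Optional[int]:
    """Map age in years to a band: 16-30 -> 1, 31-50 -> 2, anything else -> missing."""
    if age is None:
        return None
    if 16 <= age <= 30:
        return 1
    if 31 <= age <= 50:
        return 2
    # Mirrors the rule '*=.': every other value is set to missing.
    return None


# Hypothetical input values, including boundary cases and a missing value.
ages = [15, 16, 30, 31, 50, 51, None]
bands = [recode_age_band(a) for a in ages]
print(bands)
```

Keeping the rule in one small, named function is one way to make the recode easy to document and re-run, which is the record-keeping point made later in the talk.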

7 Example – Linking data
Linking via ojbsoc00: c1-5 = original data / c6 = derived from data / c7 = derived from

8 Matching files (deterministic)
Complex data (complex research) is distributed across different files. In surveys, use key linking variables for...
One-to-one matching
  SPSS: match files /file="file1.sav" /file="file2.sav" /by=pid.
  Stata: merge pid using file2.dta (Stata 11 onwards: merge 1:1 pid using file2.dta)
One-to-many matching (table distribution)
  SPSS: match files /file="file1.sav" /table="file2.sav" /by=pid.
  Stata: merge pid using file2.dta (Stata 11 onwards: merge m:1 pid using file2.dta)
Many-to-one matching (aggregation)
  SPSS: aggregate outfile="file3.sav" /break=pid /meaninc=mean(income).
  Stata: collapse (mean) meaninc=income, by(pid)
Many-to-many matches
Related cases matching
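The three deterministic matches named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not DAMES code: the records, the household key `hid`, and the helper names `match_on_key` and `aggregate_mean` are all hypothetical (the slide itself uses `pid` throughout).

```python
from collections import defaultdict
from statistics import mean

# Hypothetical person-level file: one row per pid.
persons = [
    {"pid": 1, "hid": 10, "sex": "F"},
    {"pid": 2, "hid": 10, "sex": "M"},
    {"pid": 3, "hid": 20, "sex": "F"},
]
# Hypothetical second person-level file: one row per pid.
jobs = [{"pid": 1, "soc": 2311}, {"pid": 2, "soc": 5231}, {"pid": 3, "soc": 4150}]
# Hypothetical household lookup table: one row per hid.
households = [{"hid": 10, "region": "Scotland"}, {"hid": 20, "region": "Wales"}]
# Hypothetical repeated records: several rows per pid.
incomes = [
    {"pid": 1, "income": 100.0},
    {"pid": 1, "income": 300.0},
    {"pid": 2, "income": 250.0},
]


def match_on_key(master, lookup, key):
    """Copy the lookup row's fields onto every master row with the same key.

    The lookup may hold at most one row per key value, so the same function
    covers one-to-one matching and table distribution. Master rows without a
    match are kept unchanged.
    """
    index = {}
    for row in lookup:
        if row[key] in index:
            raise ValueError(f"duplicate {key}={row[key]!r} in lookup file")
        index[row[key]] = row
    return [{**row, **index.get(row[key], {})} for row in master]


def aggregate_mean(rows, key, value, new_name):
    """Collapse many rows per key into one row holding the mean of `value`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    return [{key: k, new_name: mean(v)} for k, v in sorted(groups.items())]


one_to_one = match_on_key(persons, jobs, "pid")            # person to person
table_spread = match_on_key(persons, households, "hid")   # household data onto persons
aggregated = aggregate_mean(incomes, "pid", "income", "meaninc")  # many rows to one

print(one_to_one[0])
print(table_spread[1])
print(aggregated)
```

Raising an error on duplicate keys in the lookup file is a deliberate choice here: a silent many-to-many match is a common source of inflated case counts.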

9 A bit of focus…
I tend to emphasise two data management activities:
1) Variable constructions
   - Coding and re-coding values
2) Linking datasets
   - Internal and external linkages

10 …plus the centrality of keeping clear records of DM activities
- Reproducible (for self)
- Replicable (for all)
- Paper trail for whole lifecycle
- Cf. Dale 2006; Freese 2007
In survey research, this means using clearly annotated syntax files (e.g. SPSS/Stata)
Syntax Examples:

11 Principal DAMES services (current status)
- GESDE specialist data environments (prototypes): occupations, educational qualifications, ethnicity
- Data curation tool (prototype)
- Data fusion tool (prototype)
- Secure data demonstrator for e-Health research (complete)
- Micro-simulation model for social care data (prototype)
- Training workshops and events (in progress)

12 GESDE – Grid Enabled Specialist Data Environments

GEODE – Occupational data

14 Data curation tool
The curation tool obtains metadata and supports the storage and organisation of data resources in a more generic way

15 Data fusion tool

16 2) Linking DAMES and e-Stat
High level vision is to ingrain data management functionality and uptake within e-Stat modelling capabilities
- Using/adapting DAMES contributions
- DAMES services for data linking
- DAMES resources for recoding variables
- Making replication central to the data story

17 Data and variables
- DAMES does not in general provide routes to new/alternative microdata, but to relevant supplementary data (e.g. aggregate data)
- Anything on educational qualifications, occupations, ethnicity is of particular interest
- Generic tools for merging micro-data
- Generic tools for other variable processes

18 Data oriented review
- Applied research perspective
- Range of data resources
- Accessing and documenting data resource options

The implementation for e-Stat
This is mostly a blank space… and we've not hitherto used Python
- Data curation tool and GEODE/GEEDE use iRODS
- GEMDE uses a bespoke SQL database
- Data fusion tool uses R (and some Stata) scripts accessed via a Liferay portal

20 3) A pitch for specific e-Stat facilities
…harvest the best of data analysis packages from an applied data perspective
- Replication in human readable syntax
- Something like Stata's est store for multiple model comparisons
- Fluency in data oriented options
- Training resources in data

21 Est store demo here
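The original demo is not reproduced in this transcript. As a rough illustration of the idea behind Stata's `estimates store` and `estimates table` (fit several models, keep each set of results under a name, then compare them side by side), here is a minimal Python sketch. The data, the model names and the helpers `fit_simple_ols`, `est_store` and `est_table` are hypothetical; this is not an e-Stat or Stat-JR interface.

```python
from statistics import mean


def fit_simple_ols(x, y):
    """Closed-form least squares for y = intercept + slope * x."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return {"intercept": my - slope * mx, "slope": slope, "n": len(x)}


# Named store of results, loosely analogous to 'estimates store <name>'.
estimates = {}


def est_store(name, results):
    """Keep a copy of a model's results under a chosen name."""
    estimates[name] = dict(results)


def est_table(names):
    """Return (statistic, [value for each named model]) rows for comparison."""
    stats = sorted({key for name in names for key in estimates[name]})
    return [(stat, [estimates[name].get(stat) for name in names]) for stat in stats]


# Hypothetical data: income modelled on age, then on years of education.
age = [20, 30, 40, 50]
educ = [10, 12, 14, 16]
income = [10.0, 20.0, 30.0, 40.0]

est_store("m_age", fit_simple_ols(age, income))
est_store("m_educ", fit_simple_ols(educ, income))

for stat, values in est_table(["m_age", "m_educ"]):
    print(f"{stat:<10}", *values)
```

The design point is that results are stored as data rather than only printed, so a comparison table can be rebuilt at any time from the same script.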

22 Appendix items

23 Data file specification / Variable manipulation & analysis
DAMES most common commands:
-> usedataset{UKDA_5151}
-> usedatafile{individuals wave A}
-> matchdata{individuals wave A;individuals wave B; link variable=pid; format=wide}
Commands invoking other packages:
-> SPSS{match files file=aindresp.sav /file=bindresp.sav /by=pid}
-> SPSS{fre var=ajbrgsc}
-> Stata{recode ageb 16/30=1 31/50=2 *=.}
-> R{..}
-> Stata{do $path2\part1_analysis.do}
Model 1: Graphics
Text interface: invoked manually or in response to manipulating graphs
[Figure labels: BHPS, wave A individuals; BHPS wave B individuals; Analytical file; Wave C; Gender; Current job RGSC; Spouse CAMSIS; Age (yrs); Age bands; Spouse SOC]

24 The significance of data management for social survey research (see
The data manipulations described above are a major component of the social survey research workload
- Pre-release manipulations performed by distributors / archivists
  - Coding measures into standard categories
  - Dealing with missing records
- Post-release manipulations performed by researchers
  - Re-coding measures into simple categories
We do have existing tools, facilities and expert experience to help us… but we don't make a good job of using them efficiently or consistently
So the significance of DM is about how much better research might be if we did things more effectively…

25 Some provocative examples for the UK…
Social mobility is increasing, not decreasing!
- Popularity of controversial findings associated with Blanden et al (2004)
- Contradicted by wider ranging datasets and/or better measures of stratification position
- DM: researchers ought to be able to more easily access wider data and better variables
Degrees, MScs and PhDs are getting easier! {or at least, more people are getting such qualifications}
- Correlates with measures of education are changing over time
- DM: facility in identifying qualification categories & standardising their relative value within age/cohort/gender distributions isn't, but should and could be, widespread
Black-Caribbeans are not disappearing!
- As the immigrant cohort ages, the Black-Caribbean group is decreasingly prominent due to return migration and social integration of immigrant descendants
- Data collectors are under pressure to measure large groups only
- DM: It ought to remain easy to access and analyse survey data on Black-Caribbeans, such as by merging survey data sources and/or linking with suitable summary measures
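One possible reading of "standardising their relative value within age/cohort/gender distributions" is to score each qualification category by its position in its own cohort's distribution. The sketch below uses a midpoint cumulative proportion for that purpose; this scoring choice, the category codes and the cohort data are all assumptions made for illustration, not a method stated in the talk.

```python
from collections import Counter


def relative_position(levels):
    """Score ordered category codes within ONE cohort.

    Each code gets (share of cases below it) + (half the share holding it),
    so a score near 1 means the category is held by few people at the top.
    """
    counts = Counter(levels)
    n = len(levels)
    scores = {}
    below = 0
    for code in sorted(counts):
        scores[code] = (below + counts[code] / 2) / n
        below += counts[code]
    return scores


# Hypothetical codes: 0 = no qualifications, 1 = school level, 2 = degree.
cohorts = {
    "born 1940s": [0, 0, 0, 0, 0, 0, 1, 1, 1, 2],
    "born 1970s": [0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
}
scores = {name: relative_position(levels) for name, levels in cohorts.items()}

for name, cohort_scores in scores.items():
    print(name, cohort_scores)
```

In this made-up data a degree scores higher in the older cohort, where it is rarer, which is the kind of shift in relative value the slide refers to.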

26 Comment – growing interest in data management..?
Historically, references covering DM were few and far between
- Dale, A., Arber, S., & Procter, M. (1988). Doing Secondary Analysis. London: Unwin Hyman Ltd.
Recently, there's been a small burst of relevant references
- Levesque, R., & SPSS Inc. (2008). Programming and Data Management for SPSS Statistics. Chicago, IL: SPSS Inc.
- Long, J. S. (2009). The Workflow of Data Analysis Using Stata. Boca Raton: CRC Press.
- Treiman, D. J. (2009). Quantitative Data Analysis: Doing Social Research to Test Ideas. New York: Jossey Bass.
Growing interest re. documentation for replication
- Dale, A. (2006). Quality Issues with Survey Research. International Journal of Social Research Methodology, 9(2).
- Freese, J. (2007). Replication Standards for Quantitative Social Science: Why Not Sociology? Sociological Methods and Research, 36(2).

27 E-Science and Data Management
E-Science isn't essential to good DM, but it has capacity to improve and support conduct of DM…
1) Concern with standards setting in communication and enhancement of data
2) Linking distributed/heterogeneous/dynamic data
   - Coordinating disparate resources; interrogating live resources
3) Contribution of metadata tools/standards for variable harmonisation and standardisation
4) Linking data subject to different security levels
5) The workflow nature of many DM tasks