A Data Curation Application Using DDI: The DAMES Data Curation Tool for Organising Specialist Social Science Data Resources Simon Jones, Guy Warner,

Slides:

Advertisements

Similar presentations

UK DATA ARCHIVE Louise Corti, ODAF April UK Data Archive an internationally-renowned centre of expertise in data acquisition, preservation, dissemination.

Advertisements

DDI for the Uninitiated ACCOLEDS /DLI Training: December 2003 Ernie Boyko Statistics Canada Chuck Humphrey University of Alberta.

The Economic and Social Data Service (ESDS) Kevin Schürer ESDS/UKDA ESDS Awareness Day 5 December 2003.

ESRS Data Policy ESDS role in its successful implementation Kristine Doronenkova,

New Services for Data Creators and Providers Louise Corti, Head ESDS Qualidata/ Outreach & Training Alasdair Crockett, ESDS Data Services Manager.

ESDS Qualidata Libby Bishop, ESDS Qualidata Economic and Social Data Service UK Data Archive ESDS Awareness Day Friday 5 December 2003Royal Statistical.

GEODE - NeSC workshop, Oct 2006 GEODE: Grid Enabled Occupational Data Environment Paul Lambert and Larry Tan University of Stirling

For the e-Stat meeting of 27 Sept 2010 Paul Lambert / DAMES Node inputs.

For the e-Stat meeting of 6-7 April 2011 Paul Lambert / DAMES Node inputs 1)Updates on DAMES 2)Bringing DAMES inputs to e-Stat 3)Misc. feedback - Stat-JR.

DAMES - Data Management through e-Social Science 1 DAMES: Data Management through e-Social Science NCeSS Research Node University of Stirling / University.

UKOLN is supported by: JISC Information Environment update Repositories and Preservation Programme meeting, October 24-25, 2006 Rachel Heery UKOLN

CLiP 2006: Literatures, Languages and Cultural Heritage in a digital world Building a Virtual Research Environment for the Humanities The JISC funded ‘Building.

Itembanking Infrastructure: A Proposal for a Decoupled Architecture Mhairi McAlpine and Linn van der Zanden Scottish Qualifications Authority CAA Conference.

E-Infrastructure for Social Science data: Obesity e-Lab & MethodBox Ian Dunlop 15/03/11

GEODE Workshop 16 th January 2007 Issues in e-Science Richard Sinnott University of Glasgow Ken Turner University of Stirling.

GEODE Project introduction and summary, 12/12/05 GEODE: Grid Enabled Occupational Data Environment GEODE Project introduction and summary, 12/12/05 Motivation.

Joint Information Systems Committee Supporting Higher and Further Education Development of an Information Environment for UK Learning and Teaching NOF-Digitise.

QCDgrid Technology James Perry, George Beckett, Lorna Smith EPCC, The University Of Edinburgh.

Data Management Development and Implementation: an example from the UK SLA Conference, Boston, June 2015 Geraldine Clement-Stoneham Knowledge and Information.

SCIENCE-DRIVEN INFORMATICS FOR PCORI PPRN Kristen Anton UNC Chapel Hill/ White River Computing Dan Crichton White River Computing February 3, 2014.

1 Yolanda Gil Information Sciences InstituteJanuary 10, 2010 Requirements for caBIG Infrastructure to Support Semantic Workflows Yolanda.

Distributed Access to Data Resources: Metadata Experiences from the NESSTAR Project Simon Musgrave Data Archive, University of Essex.

Connecting OurGrid & GridSAM A Short Overview. Content Goals OurGrid: architecture overview OurGrid: short overview GridSAM: short overview GridSAM: example.

Using ISO/IEC to Help with Metadata Management Problems Graeme Oakley Australian Bureau of Statistics.

Enabling E Research ANU Data Commons. What is it ? Building a repository for data sets o data can be deposited o updated o published to Research Data.

Statistics Canada’s Real Time Remote Access Solution 2011 MSIS Meeting – Karen Doherty May 2011.

Flexibility and user-friendliness of grid portals: the PROGRESS approach Michal Kosiedowski

GEODE - eSS Manchester, June 2006 Development of a Grid Enabled Occupational Data Environment GEODE – Paper presented.

DAME: Distributed Engine Health Monitoring on the Grid

Development of metadata in the National Statistical Institute of Spain Work Session on Statistical Metadata Genève, 6-8 May-2013 Ana Isabel Sánchez-Luengo.

ESDS resources for managing data Jack Kneeshaw Economic and Social Data Service University of Essex, 27 January 2009.

QCDGrid Progress James Perry, Andrew Jackson, Stephen Booth, Lorna Smith EPCC, The University Of Edinburgh.

11 CORE Architecture Mauro Bruno, Monica Scannapieco, Carlo Vaccari, Giulia Vaste Antonino Virgillito, Diego Zardetto (Istat)

Design of a Search Engine for Metadata Search Based on Metalogy Ing-Xiang Chen, Che-Min Chen,and Cheng-Zen Yang Dept. of Computer Engineering and Science.

Some comments on using research data in the social sciences Paul Lambert, School of Applied Social Science, University of Stirling, 25 March 2013.

Supported by EU projects 12/12/2013 Athens, Greece Open Data in Agriculture Hands-on with data infrastructures that can power your agricultural data products.

Metadata driven application for data processing – from local toward global solution Rudi Seljak Statistical Office of the Republic of Slovenia.

Community Mental Health Common Assessment Project (CMH CAP) Vendor Update Mary Lou Bozin Delivery Manager, CMH CAP.

DAME: A Distributed Diagnostics Environment for Maintenance Duncan Russell University of Leeds.

NA-MIC National Alliance for Medical Image Computing UCSD: Engineering Core 2 Portal and Grid Infrastructure.

1 Earth System Modeling Framework Documenting and comparing models using Earth System Curator Sylvia Murphy: Julien Chastang:

Infrastructures for Social Simulation Rob Procter National e-Infrastructure for Social Simulation ISGC 2010 Social Simulation Tutorial.

Any data..! Any where..! Any time..! Linking Process and Content in a Distributed Spatial Production System Pierre Lafond HydraSpace Solutions Inc

A Practical Approach to Metadata Management Mark Jessop Prof. Jim Austin University of York.

Combining the strengths of UMIST and The Victoria University of Manchester “Use cases” Stephen Pickles e-Frameworks meets e-Science workshop Edinburgh,

Create Content Capture Content Review Content Edit Content Version Content Version Content Translate Content Translate Content Format Content Transform.

GEODE - Durban ISA RC33, July 2006 Utilising a Grid Enabled Occupational Data Environment GEODE – Paper presented.

CASE (Computer-Aided Software Engineering) Tools Software that is used to support software process activities. Provides software process support by:- –

Distributed Data Analysis & Dissemination System (D-DADS ) Special Interest Group on Data Integration June 2000.

Development of e-Science Application Portal on GAP WeiLong Ueng Academia Sinica Grid Computing

Adrian Jackson, Stephen Booth EPCC Resource Usage Monitoring and Accounting.

Organising social science data – computer science perspectives Simon Jones Computing Science and Mathematics University of Stirling, Stirling, Scotland,

Fire Emissions Network Sept. 4, 2002 A white paper for the development of a NSF Digital Government Program proposal Stefan Falke Washington University.

© Geodise Project, University of Southampton, Integrating Data Management into Engineering Applications Zhuoan Jiao, Jasmin.

GSIM, DDI & Standards- based Modernisation of Official Statistics Workshop – DDI Lifecycle: Looking Forward October 2012.

AHM04: Sep 2004 Nottingham CCLRC e-Science Centre eMinerals: Environment from the Molecular Level Managing simulation data Lisa Blanshard e- Science Data.

Enabling Plant Sciences Research with the iPlant Discovery Environment and Condor Juan Antonio Raygoza Garay, Sonya Lowry, John Wregglesworth.

GEODE – Sharing Occupational Data Through The Grid Dr. Paul Lambert, Dr. Vernon Gayle, Prof. Ken Prandy, Prof. Richard Sinnott, Prof. Ken Turner, Koon.

Application of RDF-OWL in the ESG Ontology Sylvia Murphy: Julien Chastang: Luca Cinquini:

© Geodise Project, University of Southampton, Workflow Support for Advanced Grid-Enabled Computing Fenglian Xu *, M.

Developing GRID Applications GRACE Project

Linking data resources Paul Lambert, University of Stirling Presentation to the Scottish Civil Society Data Partnership Project (S-CSDP), Webinar 3 on.

Statistical process model Workshop in Ukraine October 2015 Karin Blix Quality coordinator

PARTHENOS-project.eu EOSC market demand for art, humanties and cultural heritage Amsterdam– EGI Conference– 7/4/2016 Franco Niccolucci Scientific Coordinator,

InSilicoLab – Grid Environment for Supporting Numerical Experiments in Chemistry Joanna Kocot, Daniel Harężlak, Klemens Noga, Mariusz Sterzel, Tomasz Szepieniec.

INTRODUCTION TO XSEDE. INTRODUCTION  Extreme Science and Engineering Discovery Environment (XSEDE)  “most advanced, powerful, and robust collection.

Advanced Higher Computing Science

DataNet Collaboration

An Overview of Data-PASS Shared Catalog

Colectica 5 A New Generation of Open Metadata Tools

Presentation transcript:

A Data Curation Application Using DDI: The DAMES Data Curation Tool for Organising Specialist Social Science Data Resources Simon Jones*, Guy Warner*, Paul Lambert**, Jesse Blum* * Computing Science, ** Applied Social Science University of Stirling, Stirling, Scotland, UK IASSIST 2010 Session: Automated Curation Tools and Services for Metadata Ithaca, NY 3 June 2010 DAMES: Data Management through e-Social Science

2 DAMES: Background  DAMES: Case studies, provision and support for data management in the social sciences  A 3 year ESRC research “node”: Feb 2008-Feb 2011  Funded by the Digital Social Research Strategy group  Driven by social science needs for support for advanced data management operations  “In practice, social researchers often spend more time on data management than any other part of the research process” (Lambert)  A ‘methodology’ of data management is relevant to ‘harmonisation’, ‘comparability’, ‘reproducibility’ in quantitative social science

3 DAMES: Themes  Four social science themes  Grid Enabled Specialist Data Environments: occupations; education; ethnicity  Micro-simulation on social care data  Linking e-Health and social science databases  Training and interfaces for data management support  Underlying computer science research themes  Metadata  Data curation  Data fusion  Workflow modelling  Data security

4 DAMES goals  To build on previous work on the GEODE project  Grid Enabled Occupational Data Environment  To build an integrated DAMES Grid-based portal:  Combining specialist occupational, ethnicity, educational, health data resources: GE*DE(“GESDE”)  Enabling a virtual community of soc sci researchers:  To deposit and search heterogeneous data resources  To access online services/‘tools’ that enable researchers to carry out repeatable and challenging data management techniques such as: fusion matching imputation …  Facilitating access is an important goal

5 GEODE

6 DAMES scenarios  Curation scenarios include:  Uploading occupational data to distribute across academic community  Recording data properties prior to undertaking data fusion involving a survey and an aggregate dataset  Fusion scenarios include:  Linking a micro-social survey with aggregate occupational information (deterministic link)  Enhancing a survey dataset with ‘nearest match’ explanatory variables (probabilistic link)

7 Key role for metadata  Metadata records are absolutely core to the functioning of the portal infrastructure  For adequate, searchable records for the heterogeneous resources (data tables, command files, notes and documentation)  To connect the resources and the data mgmt tools  To document the data sets resulting from application of the data mgmt tools: inputs, process, rationale,…  DAMES requirements:  (Micro-)data based, very general, lifecycle oriented, Grid friendly  DDI 3

8 The metadata "cycle" Processing Metadata Search

9 DAMES portal architecture overview Portal DAMES Resources External Dataset Repositories User Services Search Enact Fusion File Access Compute Resources Metadata Local Datasets (Note: Security omitted)

10 Curation Tool prototype The source data:

11

12

13

14

15

16

17

18

19

20

21

22

23 Also automatically uploaded to searchable eXist database

24 Example: The Fusion Tool  Scenario: A soc sci researcher wishes to fuse Scottish Household Survey data with privately collected study data:  Uses the data curation tool to upload the data  Uses the data fusion/imputation tool to select the data, identify corresponding variables, and to generate a derived dataset (held in the portal)  The metadata about this derived dataset is stored and (may be) made public through the portal  Another researcher can now search the portal (metadata) for SHS data and find the derived dataset  DAMES metadata handling must facilitate this process

25 The Fusion Tool prototype Select datasets (recipient and donor) Select "common variables" Select variables to be imputed Select data fusion method Submit to fusion "enactor" Metadata accessed

26 Select datasets (recipient and donor) Select "common variables" Select variables to be imputed Select data fusion method Submit to fusion "enactor" Metadata accessed

27 Select datasets (recipient and donor) Select "common variables" Select variables to be imputed Select data fusion method Submit to fusion "enactor" Skipped Metadata for result dataset

28 Job submission: Information flow Wizard Enactor Compute resources (Condor) subjob1 subjob2 User's local file store Resultant data DDI record notify (job id) fetch job submit JFDL/JSDL description.xml Further infrastructure

29 Fusion job flow description  The fusion/imputation is potentially resource greedy  Generalise to using a pool of computing resources  We use a Job Flow Description Language (JFDL) to submit the job to the computing resources pool  The JFDL job description includes references to:  Input data sets  Processing steps and their relationships  Outputs  This must be recorded in the DDI record for the resultant data set  We are automating the translation JFDL → DDI3

30 JSDL/JFDL DAMES::Fusion … A brief extract!

31 The DDI3 metadata... Donor Dataset Recipient Dataset [URI for the command file] … A fragmentary outline!

32 Technology – other components  Liferay portal  eXist  XML based database – ideal for storing DDI3 metadata  Condor  Job management  iRODS  Highly flexible filestore  Capable of running automated processes on file upload: e.g. metadata extraction (e.g. STATA files), JFDL → DDI3 translation, & transfer from file store to metadata store

33 Summary/Work in progress  The DAMES infrastructure is still very much in the development phase  We have a prototype outline portal, with prototype curation and fusion tools  We are firmly fixed on using DDI 3 as our metadata standard, but we are still:  Refining the JFDL  Refining the DDI3  Improving generation of DDI3 from JFDL  Improving searching and discovery of datasets Thank you!