Published by Adela Rodgers. Modified over 9 years ago.
1
Organising social science data – computer science perspectives
Simon Jones
Computing Science and Mathematics, University of Stirling, Stirling, Scotland, UK
Seminar: Data management in the social sciences and the contribution of the DAMES Node
Stirling, 31 January 2012
DAMES: Data Management through e-Social Science
http://www.dames.org.uk
2
DAMES: Background
- DAMES: case studies, provision and support for data management in the social sciences
- This talk: focusing on "support for data management"
  - Infrastructure/tools
  - Driven by social science needs for support for advanced data management operations
- “In practice, social researchers often spend more time on data management than any other part of the research process” (Lambert)
- A ‘methodology’ of data management is relevant to ‘harmonisation’, ‘comparability’ and ‘reproducibility’ in quantitative social science
3
DAMES: Themes
Enabling the (social science) researcher:
- To deposit, search and process heterogeneous data resources
- To access online services/‘tools’ that enable researchers to carry out repeatable and challenging data management techniques such as:
  - fusion
  - matching
  - imputation
  - …
Facilitating access is an important goal
Underlying computer science research themes:
- Metadata
- Data curation
- Data management/processing
- Portals
4
Data management/processing scenarios
Curation scenarios include:
- Uploading occupational data to distribute across the academic community
- Recording data properties prior to undertaking data fusion involving a survey and an aggregate dataset
Fusion scenarios include:
- Linking a micro-social survey with aggregate occupational information (deterministic link)
- Enhancing a survey dataset with ‘nearest match’ explanatory variables (probabilistic link)
Other processes: recoding, operationalising, linking, cleaning…
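A deterministic link of the kind named in the fusion scenarios can be illustrated with a small sketch. It is not the DAMES implementation: the records, the `occ_code` key and the `status_score` attribute are all invented for illustration, and the link is simply an exact match on a shared occupation code.

```python
# Hypothetical micro-survey records; "occ_code" is an assumed linking key.
survey = [
    {"person_id": 1, "occ_code": "2310", "income": 31000},
    {"person_id": 2, "occ_code": "5223", "income": 24000},
    {"person_id": 3, "occ_code": "9999", "income": 18000},
]

# Hypothetical aggregate occupational information, one entry per occupation code.
occupation_info = {
    "2310": {"occ_title": "Teaching professional", "status_score": 68.0},
    "5223": {"occ_title": "Metal working fitter", "status_score": 41.5},
}


def deterministic_link(records, lookup, key):
    """Attach lookup attributes to each record whose key matches exactly."""
    linked = []
    for record in records:
        match = lookup.get(record[key])
        merged = dict(record)  # copy, so the source records stay untouched
        if match is not None:
            merged.update(match)
        merged["linked"] = match is not None  # flag unmatched records explicitly
        linked.append(merged)
    return linked


linked_survey = deterministic_link(survey, occupation_info, "occ_code")
for row in linked_survey:
    print(row)
```

Records with no matching code are kept and flagged rather than dropped, so the analyst can see how much of the survey failed to link.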
5
Generic data flows
[Diagram labels: Data set store; Processing]
- Data sets are deposited
- Data sets are selected
- Processing is configured
- Result is saved
Data set selection and the configuration of processing jobs must be informed by knowledge about the data sets: metadata
6
Key role for metadata
Metadata records are absolutely core to the functioning of the portal infrastructure:
- For adequate, searchable records for the heterogeneous resources (data tables, command files, notes and documentation)
- To connect the resources and the data management tools
- To document the data sets resulting from application of the data management tools: inputs, process, rationale, …
DAMES requirements: (micro-)data based, very general
- DDI (= Data Documentation Initiative)
7
DDI 2 – An XML language
[XML extract; most of the markup was lost in extraction. Surviving text:]
An interesting study 12 DAMES Portal Univ of Stirling July 29, 2010
<ddi2:grantNo source="Financial_1" agency="Economic and Social Research Council">RES-149-25-1066...
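To show how such a record can be read programmatically, here is a minimal sketch using Python's standard `xml.etree.ElementTree`. Only the `grantNo` element and its two attributes come from the slide. The namespace URI bound to `ddi2` and the enclosing `prodStmt` wrapper are assumptions, since the slide extract does not show them, and the grant number is used exactly as far as the slide shows it.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URI (assumption): the slide does not show the
# declaration bound to the "ddi2" prefix.
DDI2_NS = "urn:example:ddi2"

# Only <ddi2:grantNo> and its attributes are taken from the slide;
# the wrapper element is assumed so that the fragment is well-formed.
fragment = f"""
<ddi2:prodStmt xmlns:ddi2="{DDI2_NS}">
  <ddi2:grantNo source="Financial_1"
                agency="Economic and Social Research Council">RES-149-25-1066</ddi2:grantNo>
</ddi2:prodStmt>
"""

root = ET.fromstring(fragment)
grant = root.find("ddi2:grantNo", {"ddi2": DDI2_NS})

# Collect the element text and attributes into a plain dictionary.
grant_info = {
    "number": grant.text.strip(),
    "agency": grant.get("agency"),
    "source": grant.get("source"),
}
print(grant_info)
```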
8
The metadata "cycle"
[Diagram labels: Processing; Metadata; Search; Configure/process; Select; Deposit/curate]
Data is mirrored by metadata
9
DAMES portal architecture overview
[Diagram labels: Portal; DAMES Resources; External Dataset Repositories; User Services; Search; Enact Fusion; File Access; Compute Resources; Metadata; Local Datasets]
(Note: Security omitted)
10
Tools
Since metadata must have a key role in data management, tools for managing and exploiting the metadata have a key role in the use and operation of the DAMES portal:
- At deposit/curation
- For searching
- For informing the configuration of processing steps
The following slides illustrate use of our tools
11
Curation Tool
The source data:
12–23
[Slides 12 to 23: images only, no slide text]
24
Also automatically uploaded to searchable eXist database
25
Metadata searching
26
Browsing the search results
27
Fusion Tool prototype
Scenario: a social science researcher wishes to fuse Scottish Household Survey (SHS) data with privately collected study data:
- Uses the data curation tool to upload the data
- Uses the data fusion/imputation tool to select the data, identify corresponding variables, and generate a derived dataset (held in the portal)
- The metadata about this derived dataset is stored and (may be) made public through the portal
- Another researcher can now search the portal (metadata) for SHS data and find the derived dataset
DAMES metadata handling must facilitate this process
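The last step of the scenario, a second researcher finding the derived dataset through a metadata search, can be sketched as below. This is a plain in-memory keyword search written for illustration, not a query against the portal's metadata store. The records and field names are hypothetical and are not DDI elements.

```python
# Hypothetical metadata records; field names are illustrative, not DDI elements.
records = [
    {"id": "ds-001", "title": "Scottish Household Survey extract",
     "keywords": ["SHS", "household", "Scotland"]},
    {"id": "ds-002", "title": "Occupational status scores",
     "keywords": ["occupation", "aggregate"]},
    {"id": "ds-003", "title": "Fused SHS and private study data",
     "keywords": ["SHS", "fusion", "derived"]},
]


def search_metadata(records, term):
    """Return records whose title or keywords contain the term (case-insensitive)."""
    needle = term.lower()
    hits = []
    for record in records:
        haystack = [record["title"]] + record["keywords"]
        if any(needle in text.lower() for text in haystack):
            hits.append(record)
    return hits


shs_hits = search_metadata(records, "shs")
print([record["id"] for record in shs_hits])
```

Because the derived dataset carries its own metadata record, it turns up in the same search as the original survey data.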
28
The Fusion Tool prototype
- Select datasets (recipient and donor)
- Select "common variables"
- Select variables to be imputed
- Select data fusion method
- Submit to fusion "enactor"
[Screenshot annotation: Metadata accessed]
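The choices made in these steps (recipient, donor, common variables, variables to impute, method) map naturally onto the parameters of a fusion routine. The sketch below uses nearest-neighbour matching on Euclidean distance as one possible fusion method. The talk does not say which methods the tool offers, and the datasets and variable names here are invented.

```python
import math

# Hypothetical recipient records (they lack the variable to be imputed).
recipient = [
    {"id": "r1", "age": 34, "hh_size": 2},
    {"id": "r2", "age": 61, "hh_size": 1},
]

# Hypothetical donor records (they carry the variable to be imputed).
donor = [
    {"id": "d1", "age": 30, "hh_size": 2, "volunteer_hours": 4.0},
    {"id": "d2", "age": 65, "hh_size": 1, "volunteer_hours": 9.5},
    {"id": "d3", "age": 45, "hh_size": 4, "volunteer_hours": 1.0},
]


def fuse_nearest(recipient, donor, common_vars, impute_vars):
    """Copy impute_vars from the closest donor, by Euclidean distance on common_vars."""
    fused = []
    for rec in recipient:
        rec_point = [rec[v] for v in common_vars]
        nearest = min(
            donor,
            key=lambda don: math.dist(rec_point, [don[v] for v in common_vars]),
        )
        row = dict(rec)
        for var in impute_vars:
            row[var] = nearest[var]
        row["donor_id"] = nearest["id"]  # keep a trace of where the value came from
        fused.append(row)
    return fused


fused = fuse_nearest(recipient, donor, ["age", "hh_size"], ["volunteer_hours"])
for row in fused:
    print(row)
```

The distance is computed on raw values to keep the sketch short. In practice the common variables would normally be put on a comparable scale first, otherwise a wide-ranging variable such as age dominates the match.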
29
- Select datasets (recipient and donor)
- Select "common variables"
- Select variables to be imputed
- Select data fusion method
- Submit to fusion "enactor"
[Screenshot annotation: Metadata accessed]
30
- Select datasets (recipient and donor)
- Select "common variables"
- Select variables to be imputed
- Select data fusion method
- Submit to fusion "enactor"
[Screenshot annotations: Skipped; Metadata for result dataset]
31
Job submission: Information flow
[Diagram labels: Wizard; Enactor; Compute resources (Condor); subjob1; subjob2; User's local file store; Resultant data; DDI record; notify (job id); fetch job; submit JFDL/JSDL description.xml; Further infrastructure]
32
Fusion job flow description
We use a Job Flow Description Language (JFDL) to submit the job to the computing resources pool
The JFDL job description includes references to:
- Input data sets
- Processing steps and their relationships
- Outputs
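The three kinds of reference in a job description (inputs, related processing steps, outputs) can be modelled as a small data structure. The sketch below is a Python stand-in, not JFDL syntax. The class names, step names and file names are hypothetical, and the relationships between steps are reduced to simple "depends on" links, ordered with the standard-library `graphlib` module (Python 3.9+).

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Step:
    """One processing step and the steps it must wait for."""
    name: str
    depends_on: list = field(default_factory=list)


@dataclass
class JobFlow:
    """Inputs, processing steps and outputs of one job (illustrative only)."""
    inputs: list
    steps: list
    outputs: list

    def execution_order(self):
        # Map each step to its prerequisites and sort so that
        # prerequisites always come first.
        graph = {step.name: set(step.depends_on) for step in self.steps}
        return list(TopologicalSorter(graph).static_order())


flow = JobFlow(
    inputs=["recipient.dta", "donor.dta"],
    steps=[
        Step("fuse", depends_on=["prepare_recipient", "prepare_donor"]),
        Step("prepare_recipient"),
        Step("prepare_donor"),
        Step("write_metadata", depends_on=["fuse"]),
    ],
    outputs=["fused.dta", "fused_ddi.xml"],
)

order = flow.execution_order()
print(order)
```

Steps with no dependency between them, such as the two preparation steps, may be run in either order or as parallel subjobs.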
33
JSDL/JFDL
[XML extract; markup lost in extraction. Surviving text: DAMES::Fusion … ]
A brief extract!
34
Technology – other components
- Liferay portal
- eXist: XML-based database, ideal for storing DDI metadata
- Condor: job management
- iRODS: highly flexible filestore
  - Capable of running automated processes on file upload, e.g. metadata extraction (e.g. Stata files), JFDL → DDI translation, and transfer from file store to metadata store
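The idea of automated processing on upload can be illustrated with a simple dispatch table. This is a Python stand-in for the concept only: it does not use the iRODS rule mechanism, the handler bodies are placeholders, and the `.jfdl` file suffix is a hypothetical convention.

```python
from pathlib import PurePosixPath


def extract_stata_metadata(path):
    # Placeholder: a real handler would read variable names and labels from the file.
    return f"extract metadata from {path}"


def translate_jfdl_to_ddi(path):
    # Placeholder: a real handler would turn the job description into a DDI record.
    return f"translate {path} to a DDI record"


# Hypothetical mapping from file suffix to the process triggered on upload.
HANDLERS = {
    ".dta": extract_stata_metadata,
    ".jfdl": translate_jfdl_to_ddi,
}


def on_upload(path):
    """Run the handler registered for the file's suffix, if there is one."""
    handler = HANDLERS.get(PurePosixPath(path).suffix.lower())
    return handler(path) if handler else None


print(on_upload("surveys/shs_extract.dta"))
print(on_upload("notes/readme.txt"))
```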
35
Thank you!