How Can e-Social Science Promote the Re-Use of Data?

1 How Can e-Social Science Promote the Re-Use of Data?
Rob Procter, National Centre for e-Social Science
The Grid and e-Social Science; NCeSS: Background; Research agenda; Some examples: ConvertGrid; How can e-SS benefit data providers and users; Some issues; Over to you …
UPTAP Workshop 2007

2 The e-Science Vision “e-Science is about global collaboration in key areas of science and the next generation of infrastructure that will enable it.” (John Taylor, former DG, Research Councils) That infrastructure is the Grid: “ … a software infrastructure that enables flexible, secure, coordinated resource sharing among dynamic collections of individuals, institutions and resources” (Foster, Kesselman and Tuecke) The Grid is not just an enabler of visionary research, however; it can also help researchers in more mundane ways. But, to be successful, the development of the Grid must be driven by researchers’ needs. I want to use the opportunity provided by this workshop to gather ideas from you about what those needs are, with a specific focus on the (re-)use of data. If that sounds too abstract or technical, let’s make it more concrete with some examples.

3 NCeSS Overview Launched in May 2004 to develop and promote UK e-Social Science. Unified Centre with distributed structure: Co-ordinating Hub (Manchester & UKDA); seven research Nodes located across UK; twelve small projects.

4 NCeSS Overview
Applications of e-Social Science: harnessing new kinds of research infrastructure and tools to tackle substantive problems and promote innovation in research methods. Social shaping: usability of new infrastructure and tools; socio-technical factors in their design, uptake and use; research and policy drivers, impacts.

5 NCeSS 2006
Conceptual map. Emphasises the importance of NCeSS achieving synergies between the research projects. [Figure labels: Social Shaping, Research methods, Data, Tools, Infrastructure and services, Hub, CQeSS, MoSeS, PolicyGrid, MiMeG, Analysis, OeSS, Disclosure Risk Assessment, CeSDeMIDE, GeSRM, Intelligent Simulation, HeadTalk, DReSS, AGN enabled interviews, Learning Disabilities, Entangled Data, Data chronicles, Replayer, Grid-enabled data collection, GeoVUE, GeODE.]

6 Today’s Research Infrastructure
Heterogeneous resources with poor inter-operability and complex administrative arrangements. This doesn’t scale well and makes re-use and sharing of data and other research resources difficult. [Figure labels: Researcher, Data archive, HPC, Study, Analysis, Experiment, Computing.]

7 Grid-Enabled Research Infrastructure
Grid middleware manages the interactions between users and heterogeneous, distributed resources, providing seamless integration of data, analytic tools and compute resources. The grid is a new kind of research infrastructure based on distributed, interoperable and composable resources such as data and computational services. [Figure labels: Social scientist, Grid middleware, Data archive, Study, HPC, Storage, Analysis, Computing, Experiment.]

8 The Grid Dissected Tools to support collaboration between distributed researchers. Computational Grids for scalable, high-performance computation. Data Grids for accessing and integrating heterogeneous datasets. Sensor Grids for collecting real-time data.

9 Research and Policy Drivers
Census and population surveys; administrative data; longitudinal surveys; socio-medical data; business and economic data; international macro/micro data.
Ageing population; migration; globalisation; childhood development.
It will also involve linking multiple datasets. Provision of datasets through services such as UKDA is well established, but re-use of datasets, and especially their linking, is inhibited by problems of heterogeneity. Data linking and data fusion (where there are no common elements in the sets?). Multi-disciplinary, multi-scale problems: ageing population: social scientists of different kinds, healthcare researchers.

10 Research and Policy Drivers
The range of research resources on offer to the social science community has never been greater. These include not only traditional research datasets, but new kinds of social data. However, the often highly distributed and heterogeneous character of these datasets makes it difficult to exploit them to their full potential.

11 Research and Policy Drivers
The data deluge in social sciences: the WWW archive currently contains 55 billion Web pages, or 2 petabytes (2 × 10^15 bytes) of data, and is growing at the rate of 20 terabytes (20 × 10^12 bytes) per month. Administrative and transactional data is generated on an increasing scale as a by-product of our everyday activities. This data is complex and multi-dimensional. Exploiting these new data sources to their full potential requires more sophisticated techniques for multi-media data discovery, linking and management, and more powerful services, such as text mining and social network analysis, for data extraction, annotation and analysis. These services are potentially computationally intensive and emphasise how social sciences can benefit from compute grids. But it is data grids that I now want to focus on.
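
To put the quoted figures in proportion, the short calculation below derives the average size per archived page and the monthly growth as a share of the archive. It is only a back-of-envelope sketch that assumes decimal units (1 TB = 10^12 bytes) and uses no numbers beyond those on the slide.

```python
# Figures quoted on the slide; decimal units are an assumption.
archive_bytes = 2 * 10**15             # 2 petabytes
pages = 55 * 10**9                     # 55 billion Web pages
growth_bytes_per_month = 20 * 10**12   # 20 terabytes per month

# Average stored size per page, in kilobytes (1 kB = 1000 bytes).
avg_page_kb = archive_bytes / pages / 1000

# Monthly growth expressed as a percentage of the current archive size.
monthly_growth_pct = 100 * growth_bytes_per_month / archive_bytes

print(f"average size per page: {avg_page_kb:.1f} kB")
print(f"monthly growth: {monthly_growth_pct:.1f}% of the current archive")
```

On these figures the archive holds roughly 36 kB per page and grows by about 1% of its current size each month.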

12 Data Grids for Social Science
Data Grids are designed to provide unimpeded and integrated use of distributed, heterogeneous, autonomous data resources. Grid-enabling a dataset creates new opportunities for (re-)use: it enables users to integrate it with other datasets; it makes it possible to analyse the dataset using techniques that require the kind of computational power that is only feasible using the Grid (e.g., more complex models, more data points); standardisation of the procedures and mechanisms used to access and update the dataset increases its shareability; and analyses can be automated (i.e., re-run automatically when databases are updated). Social data is currently under-analysed – Ian Diamond.
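
The idea of analyses that re-run automatically when a database is updated can be sketched in a few lines. This is an illustration only, not how any Grid middleware does it: the `AutoAnalysis` class, the `fingerprint` helper and the toy survey records are all hypothetical, and change detection is done by hashing an in-memory dataset rather than by a real database trigger.

```python
import hashlib
import json


def fingerprint(rows):
    """Return a stable hash of a dataset given as a list of dict records."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class AutoAnalysis:
    """Hypothetical wrapper that re-runs an analysis only when the data changes."""

    def __init__(self, analyse):
        self.analyse = analyse
        self.last_fingerprint = None
        self.result = None
        self.runs = 0

    def refresh(self, rows):
        current = fingerprint(rows)
        if current != self.last_fingerprint:
            # Data is new or has changed: re-run and remember the result.
            self.result = self.analyse(rows)
            self.last_fingerprint = current
            self.runs += 1
        return self.result


def mean_income(rows):
    return sum(r["income"] for r in rows) / len(rows)


# Made-up records standing in for a survey dataset.
survey = [{"id": 1, "income": 20000}, {"id": 2, "income": 30000}]
job = AutoAnalysis(mean_income)

first = job.refresh(survey)      # first call: analysis runs
again = job.refresh(survey)      # unchanged data: stored result is reused
survey.append({"id": 3, "income": 40000})
updated = job.refresh(survey)    # data updated: analysis runs again
```

In a real deployment the refresh would be triggered by the data service itself; polling with a hash is simply the smallest self-contained stand-in.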

13 An Example Data Linkage Problem
Many research questions require combination of data from multiple geo-referenced datasets, e.g., linking postcoded data to census geography. Conversion of data relating to different geographies to a common target geography is a complex, time-consuming task that requires a range of data handling/processing skills. A major barrier to use! The data conversion process requires users to perform the following generic tasks: (1) extract and download data in different formats from a number of databases using different interfaces; (2) convert each dataset to the desired target geography using geographical conversion tables; (3) combine the converted sets into a single dataset for analysis. These generic tasks can be automated. Is automation relevant where human judgment is critical? Automate the routine parts of the process.
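
The convert and combine tasks above can be sketched as follows. This is a minimal illustration under stated assumptions, not ConvertGrid's implementation: the postcodes, ward codes, the `POSTCODE_TO_WARD` table and the `convert` and `combine` functions are all made up, the data are counts (so summing within a target unit is valid), and each postcode maps to exactly one ward.

```python
from collections import defaultdict

# Hypothetical conversion table: postcode -> census ward (made-up codes).
POSTCODE_TO_WARD = {
    "AA1 1AA": "W01",
    "AA1 1AB": "W01",
    "BB2 2BB": "W02",
}


def convert(records, lookup, value_key):
    """Aggregate postcode-level counts to the target geography."""
    totals = defaultdict(float)
    unmatched = []
    for rec in records:
        target = lookup.get(rec["postcode"])
        if target is None:
            # No entry in the conversion table: leave for human judgment.
            unmatched.append(rec["postcode"])
            continue
        totals[target] += rec[value_key]
    return dict(totals), unmatched


def combine(**converted):
    """Merge converted datasets into one row per target unit."""
    units = sorted(set().union(*(d.keys() for d in converted.values())))
    return [
        {"ward": unit, **{name: d.get(unit) for name, d in converted.items()}}
        for unit in units
    ]


# Made-up extracts standing in for two downloaded datasets.
sales = [
    {"postcode": "AA1 1AA", "sales": 3},
    {"postcode": "AA1 1AB", "sales": 2},
    {"postcode": "BB2 2BB", "sales": 4},
    {"postcode": "ZZ9 9ZZ", "sales": 1},
]
entrants = [
    {"postcode": "AA1 1AA", "entrants": 10},
    {"postcode": "BB2 2BB", "entrants": 7},
]

sales_by_ward, unmatched_sales = convert(sales, POSTCODE_TO_WARD, "sales")
entrants_by_ward, _ = convert(entrants, POSTCODE_TO_WARD, "entrants")
linked = combine(sales=sales_by_ward, entrants=entrants_by_ward)
```

Records with no entry in the conversion table are returned separately rather than dropped silently, which keeps the routine part automated while leaving the judgment calls to the researcher.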

14 A Solution: ConvertGrid
ConvertGrid provides access to 225 UK-wide geography conversion tables between census, electoral, administrative, postal, health and statistical geographies derived from the AFPD. Facility to convert a researcher’s data from one set of geographical units to another (e.g., from postcode geography to health geography). Extensible system: further conversion tables from any source can be incorporated.

15 ConvertGrid – Data Visualisation Interface
ConvertGrid is an example of a grid service: it runs on a server (in this case located at MIMAS) and not on the user’s desktop. However, this would normally be transparent to the user. Different services can be composed together to carry out more complex tasks. This slide illustrates how ConvertGrid results can be fed into a visualisation service. Ten minutes from start to finish. Contact Keith Cole for more information. [Figure: relationship between average house price sales (Experian) and percentage of year olds entering university (Neighbourhood Statistics & Census aggregate statistics). Annotations: “High average house price sales but low participation rates”; “Low average house price sales but high participation rates”.]

16 Supporting the Research Lifecycle
Stages shown in the lifecycle diagram: share results and conclusions and discuss with collaborators; explore datasets and determine suitability; analyse results and compare with hypothesis; review literature and generate hypothesis; write papers; build models and execute them; publish papers; find datasets related to proposed area of work.

17 Increasing (Re-)Use of Social Data
Removing barriers to more effective use of existing social data collections: data providers (e.g., ONS, data archives); data users. Many researchers are both generators and users of data: preparation of data for submission to data archives is not well rewarded, so re-use suffers. Removing barriers to use of new kinds of social data: privacy and confidentiality of personal data.

18 The Data Provider Perspective
Preparation procedures: cleaning the data; generating derived variables; re-weighting; adding metadata; writing user documentation. Maintenance: managing changes in sampling frames, definitions, variables and questionnaire over time. User support: handling queries from users about concepts, meaning and linking waves. Definitions: ILO definition of unemployment adopted in 1984, replacing LF definition.

19 The Data User Perspective
Discovering appropriate data: determining what can be done with the data and how. Accessing the data: are existing provisions, such as VMDLs, for access to confidential data adequate? Understanding how the data has been used to generate answers to other research questions: provenance of results, links to publications; re-running statistical models, comparing results. Ease of use and quality of documentation: user manuals. This meeting is intended to improve understanding between users and providers. Can we use technology to facilitate this further? E.g., Annual Population Survey? An Amazon for datasets?

20 The Data User Perspective
Data preparation: selecting variables; linking waves; linking data sets. Performing and possibly repeating analysis with different data. Interpreting and visualising results. Supporting the research lifecycle. Collaboration with other users and with data providers. This meeting is intended to improve understanding between users and providers. Can we use the Grid to facilitate this further? E.g., Annual Population Survey? An Amazon for datasets?

21 Contacting NCeSS and Getting Involved
Join our list: Participate in events: Agenda setting workshop on combining and sharing data, January 22nd-23rd, Manchester; Annual conference

