Grid Enabled Data Fusion for Calculating Poverty Measures Social Science Applications UK e-Science ALL HANDS MEETING 2006.

Slides:



Advertisements
Similar presentations
Delivering User Needs: A middleware perspective Steven Newhouse Director.
Advertisements

Dealing with confidential research information - Anonymisation techniques and access regulations to enable using and sharing research data Data Management.
The National Grid Service Mike Mineter.
DAMES - Data Management through e-Social Science 1 DAMES: Data Management through e-Social Science NCeSS Research Node University of Stirling / University.
Grid-Enabling Data: Sticking Plaster, Sellotape, & Chewing Gum? Colin C. Venters National Centre for e-Social Science University.
/ ConvertGrid: Grid Enabling Population Datasets Keith Cole National Centre for e-Social Science (NCeSS) & MIMAS University.
Copyright Information Here Junaid Arshad 1, Wei Jie 2, Andy Turner 1 University of Leeds 1, University of Manchester 2, UK Securing.
E-Social Science: scaling up social scientific investigations Alex Voss, Andy Turner (ESRC National Centre for e-Social Science) Gabor Terstyanszky, Gabor.
1 Software & Grid Middleware for Tier 2 Centers Rob Gardner Indiana University DOE/NSF Review of U.S. ATLAS and CMS Computing Projects Brookhaven National.
Access routes to 2001 UK Census Microdata: Issues and Solutions Jo Wathan SARs support Unit, CCSR University of Manchester, UK
Modelling and Simulation for e-Social Science Mark Birkin School of Geography University of Leeds.
MoSeS meets NEC 10 th March 2008 MoSeSMoSeS Andy Turner
John Kewley e-Science Centre GIS and Grid Computing Workshop 13 th September 2005, Leeds Grid Middleware and GROWL John Kewley
SEE-GEO Meeting 20 th March 2008 NCeSS e-Infrastructure for the Social Sciences Project: Security and Geospatial Services Andy Turner
The MashMyData project Combining and comparing environmental science data on the web Alastair Gemmell 1, Jon Blower 1, Keith Haines 1, Stephen Pascoe 2,
Oxford eResearch Conference 2008 Paper Session 4A: NCeSS Oxford, UK, ( ) Experience of e-Social Science: A Case of Andy Turner and MoSeS Andy.
Individual and Household Level Estimates Based on 2001 UK Human Population Census Data Andy Turner CSAP Seminar on Microsimulation: Problems and Solutions.
CCG 1 MoSeS Introduction and Progress Report Andy Turner
School of Geography CENTRE FOR SPATIAL ANALYSIS AND POLICY e-Infrastructure for Large-Scale Social Simulation Mark Birkin Andy Turner.
E-Social Science: scaling up social scientific investigations Alex Voss, Andy Turner, Rob Procter National Centre for e-Social Science Gabor Terstyanszky,
Shirley Crompton Source: Rob Allan. Institutional Repository Subject Repository Data Producer Repository share resources solve bigger problems integrate.
GridSphere for GridLab A Grid Application Server Development Framework By Michael Paul Russell Dept Computer Science University.
Web-based Portal for Discovery, Retrieval and Visualization of Earth Science Datasets in Grid Environment Zhenping (Jane) Liu.
Bootstrapping applied to t-tests
Globus 4 Guy Warner NeSC Training.
TPAC Digital Library Talk Overview Presenter:Glenn Hyland Tasmanian Partnership for Advanced Computing & Australian Antarctic Division Outline: TPAC Overview.
Problems with reuse – Increased maintenance costs; lack of tool support; not-invented- here syndrome; creating, maintaining, and using a component library.
Manchester Computing Supercomputing, Visualization & e-Science OntoGrid GridPrimer Training University of Manchester 18 th to 22 nd October 2004 ConvertGrid:
TeraGrid Gateway User Concept – Supporting Users V. E. Lynch, M. L. Chen, J. W. Cobb, J. A. Kohl, S. D. Miller, S. S. Vazhkudai Oak Ridge National Laboratory.
Implementing the services WATER FOR A HEALTHY COUNTRY FLAGSHIP SISS Workshop v2.3 Pavel Golodoniuc | Computer scientist 7 May 2013.
Supercomputing, Visualization & eScience1 e-Social Science Grid technologies for Social Science: the Seamless Access to Multiple Datasets (SAMD) project.
DISTRIBUTED COMPUTING
GT Components. Globus Toolkit A “toolkit” of services and packages for creating the basic grid computing infrastructure Higher level tools added to this.
Manchester Computing Supercomputing, Visualization & eScience Celia Russell, Stephen Pickles and Mike Jones Combining Data Workshop ESRC Research Methods.
EE325 Introductory Econometrics1 Welcome to EE325 Introductory Econometrics Introduction Why study Econometrics? What is Econometrics? Methodology of Econometrics.
Geodemographic modelling collaboration Alex Voss, Andy Turner Presentation to Academia Sinica Centre for Survey Research
The National Grid Service Guy Warner.
Grid Execution Management for Legacy Code Applications Grid Enabling Legacy Code Applications Tamas Kiss Centre for Parallel.
Service - Oriented Middleware for Distributed Data Mining on the Grid ,劉妘鑏 Antonio C., Domenico T., and Paolo T. Journal of Parallel and Distributed.
Web based Hydrology and Water Resources Information System for India
Holding slide prior to starting show. A Portlet Interface for Computational Electromagnetics on the Grid Maria Lin and David Walker Cardiff University.
NA-MIC National Alliance for Medical Image Computing UCSD: Engineering Core 2 Portal and Grid Infrastructure.
The National Grid Service Mike Mineter
FDT Foil no 1 On Methodology from Domain to System Descriptions by Rolv Bræk NTNU Workshop on Philosophy and Applicablitiy of Formal Languages Geneve 15.
© 2006 STEP Consortium ICT Infrastructure Strand.
SICSA student induction day, 2009Slide 1 Social Simulation Tutorial International Symposium on Grid Computing Taipei, Taiwan, 7 th March 2010.
Combining the strengths of UMIST and The Victoria University of Manchester “Use cases” Stephen Pickles e-Frameworks meets e-Science workshop Edinburgh,
NeSC Workshop - February /14 Study of User Priorities for e-Infrastructure for e-Research (SUPER) Steven Newhouse Jennifer Schopf Andrew Richards.
Cole David Ronnie Julio. Introduction Globus is A community of users and developers who collaborate on the use and development of open source software,
Next Steps.
The National Grid Service Mike Mineter.
Middleware for Campus Grids Steven Newhouse, ETF Chair (& Deputy Director, OMII)
Enabling e-Research in Combustion Research Community T.V Pham 1, P.M. Dew 1, L.M.S. Lau 1 and M.J. Pilling 2 1 School of Computing 2 School of Chemistry.
1 e-Science AHM st Aug – 3 rd Sept 2004 Nottingham Distributed Storage management using SRB on UK National Grid Service Manandhar A, Haines K,
Distributed Data Analysis & Dissemination System (D-DADS ) Special Interest Group on Data Integration June 2000.
CSPC 464 Fall 2014 Son Nguyen.  Attendance/Roster  Introduction ◦ Instructor ◦ Students  Syllabus  Q & A.
Development of e-Science Application Portal on GAP WeiLong Ueng Academia Sinica Grid Computing
VisPortal Project developer’s experience C.E.Siegerist, J. Shalf, E.W. Bethel NERSC/LBNL Visualization Group T.J. Jankun-Kelley, O. Kreylos, K.L. Ma CIPIC/UC.
Exploring Microsimulation Methodologies for the Estimation of Household Attributes Dimitris Ballas, Graham Clarke, and Ian Turton School of Geography University.
John Kewley e-Science Centre All Hands Meeting st September, Nottingham GROWL: A Lightweight Grid Services Toolkit and Applications John Kewley.
Oman College of Management and Technology Course – MM Topic 7 Production and Distribution of Multimedia Titles CS/MIS Department.
Rob Allan Daresbury Laboratory NW-GRID Training Event 26 th January 2007 Next Steps R.J. Allan CCLRC Daresbury Laboratory.
GEODE – Sharing Occupational Data Through The Grid Dr. Paul Lambert, Dr. Vernon Gayle, Prof. Ken Prandy, Prof. Richard Sinnott, Prof. Ken Turner, Koon.
The National Grid Service Mike Mineter.
The National Grid Service Mike Mineter
1 Porting applications to the NGS, using the P-GRADE portal and GEMLCA Peter Kacsuk MTA SZTAKI Hungarian Academy of Sciences Centre for.
Virtual Organisations for Trials and Epidemiological Studies (VOTES) Overview VOTES is a pioneering project investigating the application of Grid technology.
OGSA-DAI.
DATA COLLECTION METHODS IN NURSING RESEARCH
Introductory Econometrics
Presentation transcript:

Grid Enabled Data Fusion for Calculating Poverty Measures Social Science Applications UK e-Science ALL HANDS MEETING 2006

Researchers frequently have to use more than one data set in order to obtain a more complete answer to their questions One data set may provide a large sample of the target population, but offer incomplete coverage of the topics of interest Another data set with coverage of the topics of interest may not sample the target population adequately Background Social Science ProblemAnd Policy Issue What do we know about ethnic minority economic welfare when it is disaggregated by group and geography Census data can lack direct measures of income Survey data yield minority samples that may be too small for meaningful results to be obtained

The statistical analysis involved belongs in the group of data linkage methods. As there are no identifiable common units in the problem considered here, it is one of statistical data fusion. The actual methodology used is taken from the poverty mapping literature. The survey data can be viewed as the donor data set, with the census based data being the recipient data set.

Data  The British Household Panel Survey (BHPS) provides the small scale survey data. BHPS is a longitudinal (panel) study with yearly waves.  The Sample of Anonymised Records (SARs) provides the large scale Census data. SARs are a random sample of individuals and households from the UK Census  Uses 1991 data because of projected confidentiality restrictions on the publicly available version of the 2001 SARs. 2% sample of individuals, 1% sample of households.

It is time consuming dealing with different data sets when they are available in a wide variety of user unfriendly formats. Need common and coherent variable definitions for the donor and recipient data sets. Data Issues Addressed by the Data Grid? The issues suggest a work practice which becomes messy when dealing with communication between the steps in the overall analysis: data extraction, computation, and results presentation. This is alleviated by hosting the data on a data grid.

Empirical Analysis 1.estimate a statistical model using the BHPS data taking account the heterogeneous nature of the survey data 2.use the results to provide income predictions for the SARs data uses parameter estimates from 1. 3.use the income predictions along with other results to estimate poverty measures and their standard errors. headcount (% below a given poverty line) poverty gap (% distance from a poverty line) 4.present these poverty measures by UK regions, SARs areas, GB profiled areas for ethnic groups (by gender if using individuals)

Statistical inference may be difficult in combined data problems. Underlying theoretical assumptions may be too strong. Calculation may be difficult. Simulation techniques may address these difficulties, however, these can be computational intensive. Here we use a so-called casewise bootstrap: resample from the BHPS data, repeating steps 1 & 2 and SARs sub-sample poverty measure calculation B times. Empirical Issues

Addressed by the Computational Grid? Yes. These techniques are embarrassingly parallel and well suited to implementation on High Performance Computers (HPC). Scatter the data over P processors, do B/P bootstrap replications per processor, gather the results.

1.Social scientists rarely have easy access to HPC or HTC resources: 2.Need to avoid deploying complex middleware stacks on end-user's computers. Implementation Issues staff, human capital, equipment Suggested the following solution Tap into established trends & ongoing investments in UK e-Science  National Grid Service (NGS), solves 1.  Web service technologies (portals), solves 2.

GEMEDA Project Architecture Diagram JSP Web interface + Results visualization applet NGS Oracle hosted datasets BHPSSARs Household SARs Individual GEMEDA Apache Tomcat Spring Framework NGS HPC computing nodes Manchester, Oxford, Leeds, CCLRC RAL GEMEDA parallelized analysis code grid service DAISGR (DAI Service Group Registry) Athens Server NGS MyProxy server FTP server OGSA-DAI Grid Security Infrastructure Globus Toolkit Core Axis Apache Tomcat

CAN WE REPLACE THIS

Visualization/ Results Presentation GIS style chloropleth map, –area colouring represents range of poverty measure with linked plot –boxplot style graphic of income for main category of interest for a chosen area. Requires mapping data –from EDINA, Athens authenticated Implemented as a Java applet –uses opensource GeoTools java library 2.

area toggle ethnic group buttons gender buttons

An Empirical Worker's Perspective Equipment. middle range HPCs (and software) are accessible to Social Scientists. Staff. maintenance, development and implementation of middleware. Human capital. knowledge and experience gained by researchers stays relevant and remains within the Social Science. What has an e-Research solution provided? No output clutter. Visualization allows output filtering.

What are the weaknesses of the present implementation?  Sustainability.  staff, maintenance,...  re-engineering (Globus WSRF, OGSA-DAI & SQL Server, GridSphere, GROWL...)  Reliability.  error reporting for distributed architecture, documentation  robustness of compute nodes, interoperability of middleware  Security.  Athens authentication, e-certifcates,...  firewalls,...  Modeling.  more variables, more data sets, repeat data sets (2001)  interactivity, visualization,...

Implementation Limitations  Having grid-enabled data does not mean you have the process of data cleaning or data manipulation grid-enabled.  Computational grids require compute resource brokerage (NGS is batch).  Supporting code much larger than business code, specialist elements of the project were easiest to develop.

Other Issues Related to User Engagement  Data disclosure controls.  Confidentiality restrictions.  Proprietorship.  Expertise of researchers within discipline areas.  Technical rather than substantive research agenda

RES Project Team Simon Peters, Ken Clark SoSS (economics) Pascal Ekin, Anja Le Blanc, Stephen Pickles Manchester Computing Project portal and service: Acknowledgements Celia Russell, Mike JonesSAMD Mark Birkin, Andy Turner Hydra I Grid (now MoSeS) Keith Cole ConvertGrid Matt FordNGS