4th International Conference on e-Social Science: Workshop 5: Agent-Based Modelling for the Spatial-Social Sciences. 2008-06-18 Reconstruction of the.


4th International Conference on e-Social Science: Workshop 5: Agent-Based Modelling for the Spatial-Social Sciences. Reconstruction of the entire UK population using microsimulation. Andy Turner

Overview: Introduction; What; Why; How; What next

Introduction The title is a bit odd and vague –“reconstruction… using microsimulation” I can only guess what this is. I don’t think this presentation addresses that. Hopefully it does address something relevant and of interest!

This presentation focuses on: Developing digital demographic data for the UK –A reconstruction of data which has existed for 2001 since around …; the MoSeS Genetic Algorithm that attempts to reconstruct individual level data for every individual in the UK in 2001; how you can reconstruct the MoSeS reconstruction.

What is MoSeS? Modelling and Simulation for e-Social Science –e-Social Science being the application of e-Science concepts to social science problem domains. e-Science is enhanced science that uses the Internet, software tools and structured information for collaborative work. A first phase research node of NCeSS –Part of a UK collaborative partnership developing e-Social Science –The key part of its programme of work is to develop an individually based demographic model of the UK for 2001 to 2031. MoSeS people

What am I on about and what do we want? UK demographic data reconstruction for The UK demographic data we want largely exists as 2001 human population (census) data, but it is not available as 2001 census output

Why do we want it? Reconstructed data is input into a dynamic model that operates at the individual and household level to simulate population change for MoSeS applications. Belinda and/or Mark will be talking about the dynamic model work later on. It is theorised that in order to be realistic and of use in local service and transport planning, the demographic models have to operate at this individual and household level.

Enriching the base population Efforts are being made to enrich the census reconstruction with additional data from other sources (e.g. British Household Panel Survey) The results of this data integration are new constructions, data that has not previously existed. The idea is to add non-census variables to the base census data reconstruction. Chengchao Zuo is doing some of this work, but is not presenting it here.

Introducing Census Data In reconstructing the census data it is necessary to: –know some details of the available published data; –consider the different ways of doing it. So I’m going to describe the available census data and then introduce a couple of ways of reconstructing the individual census data for all individuals.

2001 UK Human Population Census: Scope and general characteristics Attempt to collect demographic data about all individuals in the UK at a specific time. Data collected via a paper form and digitised. Includes (in the region of) a hundred variables that detail each individual.

2001 UK Human Population Census: Key Units and references Data collected for households and communal establishments –For each household there is a household reference person (HRP) and there are some variables that inform of the relationships between each household individual and this HRP –Communal establishments include hospitals, hospices, prisons etc and in Scotland, residential schools. The definition and difference between households and communal establishments is important. Output Areas (OAs) –Smallest regions of aggregated data dissemination –Grouped into MSOA, Wards, Regions –New to 2001 –A typical OA might contain 300 people and about a hundred households and may contain a communal establishment.

Households

Communal Establishments

2001 UK Human Population Census: Anonymisation and the individual data Digitised data was anonymised –A new version was produced that had names and addresses removed. Data with names and addresses is more useful than the anonymised form, but due to various concerns the file that would link individual records with the name and address information is classified. In MoSeS we have not been concerned with trying to assign the correct names to our individual data. –It is the anonymised data that we are trying to reconstruct.

2001 UK Human Population Census: The individual data exists! The individual data are not available due to concerns over abuse of the data. –This is a legitimate concern, but it could be harmless to allow some way to link other data on names and addresses with this individual census data. This has been done for some epidemiological work. It is not routine to do this even in controlled facilities, AFAIK. For similar reasons of concern the anonymised data is subjected to further obfuscation by Disclosure Control Measures (DCMs).

2001 UK Human Population Census: Variable aggregation For the different data products, variables (e.g. age) are aggregated into groups differently. Consequently reconstruction is non-trivial. NB. Although the full address is removed from the data, for some outputs it is necessary to know which Output Area or higher spatial unit an individual is from.
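The banding problem can be made concrete with a small sketch. The band boundaries below are made up for illustration and are not the bands used in any census product; `band_index` and `nests_within` are hypothetical helper names. The point is only that two groupings of the same variable cannot be converted into one another exactly unless their boundaries line up.

```python
from bisect import bisect_right

def band_index(age, lower_bounds):
    """Index of the band containing `age`, given sorted band lower bounds."""
    return bisect_right(lower_bounds, age) - 1

def nests_within(fine_bounds, coarse_bounds):
    """True if every coarse band is an exact union of fine bands."""
    return set(coarse_bounds) <= set(fine_bounds)

# Illustrative bounds only (assumed, not taken from the census outputs).
two_year_bands = list(range(0, 90, 2))   # 0-1, 2-3, ..., 88 and over
table_bands = [0, 5, 16, 25, 45, 65]     # a differently grouped table

print(band_index(17, table_bands))                # the band whose lower bound is 16
print(nests_within(two_year_bands, table_bands))  # boundaries such as 5 and 25 do not align
```

When `nests_within` is false, counts in one banding can only be apportioned to the other approximately, which is one reason the reconstruction is not a simple lookup.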

2001 UK Human Population Census: Available census outputs: Sample of Anonymised Records (SARs) and Small Area Microdata (SAM); Census Aggregate Statistics (CAS); Special Transport Statistics (STS); Special Migration Statistics (SMS); Longitudinal Study (LS); Commissioned Tables

HSAR The 2001 Household SAR is available for England and Wales only. 1% stratified sample of households household records individual records Individual records are available only for households with 11 or fewer residents There are 60 variables some of which are aggregated. –Age is in 2 year bands

ISAR The 2001 Individual SAR is for all of the UK. 3% Sample Records Includes people from the Communal Establishment Population (CEP) Very similar variables to HSAR, but some crucial differences (e.g. Age)

CAS Census Aggregate Statistics Available at Output Area Level (and larger aggregate spatial units) for all the UK Various table types –Key Statistics –Univariate –Standard –Multivariate –Themed

2001 UK Human Population Census: DCMs again Disclosure control measures (DCMs) on CAS add additional and unknown levels of error to the data –The Small Cell Adjustment Measure (SCAM) ensures that no count in any aggregate table that is disseminated is 1 or 2. This DCM is notorious for adding unwanted error (making the census very difficult to use). Among other issues it raises, it has the undesirable effect that counts from different tables that represent the same thing will not necessarily match.
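A toy sketch of why small cell adjustment breaks consistency between tables. The slide only states that no published count is 1 or 2; the rule used here (replace each 1 or 2 with 0 or 3 at random) is an assumption for illustration, and the counts and the name `small_cell_adjust` are made up.

```python
import random

def small_cell_adjust(counts, rng):
    """Replace every count of 1 or 2 with 0 or 3 (assumed rule, illustration only)."""
    return [rng.choice((0, 3)) if c in (1, 2) else c for c in counts]

rng = random.Random(0)

# Two hypothetical tables describing the same 200 people in one area.
by_age = [120, 2, 1, 77]
by_tenure = [150, 48, 2]

adj_age = small_cell_adjust(by_age, rng)
adj_tenure = small_cell_adjust(by_tenure, rng)

# Before adjustment both tables sum to 200; afterwards the totals need not agree.
print(sum(by_age), sum(by_tenure))
print(sum(adj_age), sum(adj_tenure))
```

Because each table is adjusted independently, any fitting procedure that uses several tables for the same area has to tolerate totals that disagree slightly.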

2 ways to reconstruct individual level data 1. Take the CAS and create synthetic individuals that match the aggregate characteristics 2. Select from the Individual and Household SAR populations such that the aggregate characteristics closely match those in the CAS

General limitations It is not possible to be sure that the data for individuals assigned to any location exactly matches the characteristics of the individuals that were there at the time of the census. In doing 1 it is possible to make a perfect match for every area, but in doing 2, it might not be possible for any area.

Option 1 (Synthetic Individuals) Constraints can be added to try to make the data reasonable –(e.g. someone aged 85 and with limiting long term illness probably does not work). –This is either arbitrary or non-trivial. There is no census data that can be used to inform if there exist individuals with the synthetically assigned characteristics (combination of age group, ethnicity, socio-economic group, educational attainment, health status etc...) except for the SAR, which is Option 2. Scales well in that it is not much more work to produce outputs for regions containing much larger populations.

Option 2 Selecting from the SARs It is too much to consider every combination of individuals from the SARs for the average Output Area (and there are OAs). Indeed, the number of combinations increases for regions with larger populations and greater numbers of households. –N_Areas * (N_Records in SAR)^(Population of area) –Some heuristic or strategy is needed to help select a good solution.
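To get a feel for the size of the search, the count on this slide can be read as the number of areas multiplied by the number of SAR records raised to the power of an area's population. That reading is an interpretation of the flattened expression, and the sizes below are hypothetical apart from the 300 people per Output Area mentioned earlier. The sketch works in logarithms because the raw number is far too large to be useful.

```python
import math

def log10_search_space(n_areas, n_sar_records, area_population):
    """log10 of n_areas * n_sar_records ** area_population."""
    return math.log10(n_areas) + area_population * math.log10(n_sar_records)

# Hypothetical sizes: 1,000 areas, 500,000 SAR records, 300 people per area.
digits = log10_search_space(1_000, 500_000, 300)
print(f"roughly 10^{digits:.0f} candidate selections")
```

Even for these modest made-up sizes the exponent runs to four figures, which is why exhaustive enumeration is ruled out and a guided search is used instead.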

Option 2 using a genetic algorithm to guide the search. Various ways to do this. An algorithm: 1. Select Household Population (HP) from Household SAR records and Communal Establishment Population (CEP) from the Individual SAR a number of times 2. Measure performance 3. Select a number of the best performing sets 4. Breed these sets by swapping some HP and CEP 5. Repeat Steps 2 to 4 until convergence
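The steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MoSeS code: it does not distinguish HP from CEP, records are toy tuples of 0/1 indicators, the performance measure is a plain sum of absolute errors, and a fixed iteration limit stands in for a convergence test. All function and parameter names are hypothetical.

```python
import random

def error(selection, records, target):
    """Sum of absolute differences between aggregated counts and the target."""
    totals = [0] * len(target)
    for idx in selection:
        for k, value in enumerate(records[idx]):
            totals[k] += value
    return sum(abs(t - g) for t, g in zip(totals, target))

def breed(parent_a, parent_b, rng):
    """Copy parent_a, then swap in parent_b's records at a few random positions."""
    child = list(parent_a)
    n_swaps = rng.randint(1, max(1, len(child) // 3))
    for pos in rng.sample(range(len(child)), n_swaps):
        child[pos] = parent_b[pos]
    return child

def reconstruct(records, target, size, rng, n_sets=30, n_keep=10, n_iter=200):
    # Step 1: draw several random candidate selections of record indices.
    sets = [[rng.randrange(len(records)) for _ in range(size)] for _ in range(n_sets)]
    for _ in range(n_iter):
        # Steps 2 and 3: measure performance and keep the best sets.
        sets.sort(key=lambda s: error(s, records, target))
        best = sets[:n_keep]
        if error(best[0], records, target) == 0:
            break
        # Step 4: breed the survivors to refill the population.
        children = [breed(rng.choice(best), rng.choice(best), rng)
                    for _ in range(n_sets - n_keep)]
        sets = best + children
    return min(sets, key=lambda s: error(s, records, target))

# Toy data, made up for illustration.
rng = random.Random(1)
records = [(rng.randint(0, 1), rng.randint(0, 1), rng.randint(0, 1)) for _ in range(50)]
target = [6, 4, 5]   # hypothetical aggregate counts for an area of 10 people
best = reconstruct(records, target, size=10, rng=rng)
print(error(best, records, target))
```

Keeping the best sets unchanged between iterations means the best error never gets worse, at the cost of some diversity in the breeding population.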

Enhancements: Constraints 2 types of constraint –Control constraints: these things must be met for a solution to be viable (from CAS003, constrain by age of HRP for HP; from CAS001, constrain by age for CEP) –Optimisation constraints: can be any number of variables from the 60 or so in the SARs that are also in CAS; done in the performance measure; some are household population based, some total population based
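One way to picture the two kinds of constraint is as a hard check plus a weighted error. The sketch below is an assumption about how this could be organised, not the MoSeS implementation: records are plain dictionaries, and the field names (`hrp_age_band`, `tenure`), the weights and both function names are hypothetical.

```python
from collections import Counter

def meets_control(selection, control_target):
    """Control constraint: HRP age band counts must match the target exactly."""
    counts = Counter(rec["hrp_age_band"] for rec in selection)
    bands = set(counts) | set(control_target)
    return all(counts.get(b, 0) == control_target.get(b, 0) for b in bands)

def optimisation_error(selection, targets, weights):
    """Optimisation constraints: weighted sum of absolute count errors."""
    total = 0.0
    for variable, target_counts in targets.items():
        counts = Counter(rec[variable] for rec in selection)
        weight = weights.get(variable, 1.0)
        for category in set(counts) | set(target_counts):
            total += weight * abs(counts.get(category, 0) - target_counts.get(category, 0))
    return total

# Hypothetical selection of two household records.
selection = [
    {"hrp_age_band": "16-29", "tenure": "owned"},
    {"hrp_age_band": "30-44", "tenure": "rented"},
]
print(meets_control(selection, {"16-29": 1, "30-44": 1}))
print(optimisation_error(selection, {"tenure": {"owned": 2}}, {"tenure": 2.0}))
```

Changing the entries in `weights` is one simple way to make some variables matter more than others without touching the control check.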

Swapping records in breeding This becomes harder the more control constraints are applied –The aggregate constraint characteristics from the set being swapped must match those selected –Being able to swap multiple records is a big advantage: more breadth of search, less chance of getting stuck in a local minimum

[Diagram labels: HSAR; ISAR; Aggregate HP Control Characteristics; Aggregate CEP Control Characteristics]

Breeding parameters Need to not swap too much HP or CEP –Else optimisation is slow –Swapping a random amount each time is good, and swapping up to about a third of the HP and CEP seems OK Good to keep a diversity in the breeding population of solutions –Especially in the early iterations
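A sketch of a swap that respects control constraints and the breeding parameters above: a random number of records is exchanged, capped at a fraction of the solution (one third here), and a record is only replaced by a donor with the same control characteristic. Records are toy `(id, control_group)` tuples and every name in the block is hypothetical.

```python
import random

def swap_matching(sol_a, sol_b, control_of, max_fraction, rng):
    """Copy sol_a, replacing some records with same-control-group records from sol_b."""
    donors_by_group = {}
    for rec in sol_b:
        donors_by_group.setdefault(control_of(rec), []).append(rec)
    child = list(sol_a)
    # Swap a random amount each time, up to max_fraction of the solution.
    n_swaps = rng.randint(1, max(1, int(len(child) * max_fraction)))
    for pos in rng.sample(range(len(child)), n_swaps):
        donors = donors_by_group.get(control_of(child[pos]))
        if donors:  # only swap when the control characteristic matches
            child[pos] = rng.choice(donors)
    return child

# Toy solutions with the same control profile (two "young", one "old").
sol_a = [(1, "young"), (2, "young"), (3, "old")]
sol_b = [(7, "young"), (8, "old"), (9, "young")]
rng = random.Random(4)
child = swap_matching(sol_a, sol_b, lambda rec: rec[1], 1 / 3, rng)
print(child)
```

Because each replacement keeps the control group of the record it replaces, the child satisfies the same control constraints as its parent by construction.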

Re-constraining There is a limit to the number of control constraints that can be used. New optimisation constraints can be added and others removed by modifying the fitness function –e.g. For some applications it might be more important to get household composition right rather than socio-economic group

Results Sorry, no results to show here! Results for Leeds were produced with optimisation constraints on household composition, employment, health, age and gender. –The same type of result for the UK is nearly available. A week away… I have produced graphs that indicate how well the results perform. Maps of the residuals can also be produced and any spatial patterns may provide clues for improvement

Reconstructing the reconstructions Each HSAR record and ISAR record and Output Area have unique IDs and these can be publicly disseminated. Using a simple structure of two lists, one for the HP (either all records or just the HRP), the other for the CEP for each OA it is straightforward to recreate the result.
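The two-list structure can be sketched as follows. The IDs, field names and record contents are invented for illustration; the only idea taken from the slide is that a result for an Output Area is a list of HSAR record IDs for the HP and a list of ISAR record IDs for the CEP, which anyone holding the SARs can look up.

```python
def rebuild_area(id_lists, hsar_by_id, isar_by_id):
    """Look up the full records for one Output Area from its two ID lists."""
    households = [hsar_by_id[i] for i in id_lists["hp"]]
    communal = [isar_by_id[i] for i in id_lists["cep"]]
    return households, communal

# Hypothetical SAR extracts keyed by record ID.
hsar_by_id = {101: {"hrp_age_band": "30-44"}, 102: {"hrp_age_band": "60-74"}}
isar_by_id = {9001: {"age_band": "75+"}}

# Hypothetical published result: only IDs, no record contents.
result = {"OA_0001": {"hp": [101, 102], "cep": [9001]}}

households, communal = rebuild_area(result["OA_0001"], hsar_by_id, isar_by_id)
print(len(households), len(communal))
```

Only the ID lists need to be shared; the record contents stay with whoever is licensed to hold the SARs.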

Plans in the near future Archive what we have done (results and code) and run for the UK again with some additional transport variables included in the optimisation. –Can be done by restarting from the previous best optimisation –Do some experiments with modifying the optimisation function during training.

MoSeS meets NEC 10 th March 2008 Acknowledgements and Thanks Thanks to MoSeS researchers, collaborators and funders. Thanks to all involved in eResearch for improving our hardware, software and data resources so that we can all do our bit to better understand and plan our future. Thank you for listening!

More Background on MoSeS follows in the next 6 slides

Initial Tasks Develop methods to generate individual human population data for the UK from 2001 UK human population census data Develop a Toy Model –Dynamic agent based microsimulation modelling toolkit and apply it to simulate change in the UK Develop applications for –Health –Business –Transport

Challenges Grid enabling the data and tools Visualisation –Google Earth –Computer Games Collaboration Retaining a problem focus Design and Development

Generic MoSeS Approach MoSeS to date has approached Modelling and Simulation from a specific angle –Geographic –Demographic –Contemporary –About the UK –Targeted towards supporting a developing set of applications It is not a requirement to make it clear what steps can be followed by other Social Scientists wanting to Model and Simulate something different –However, the generic work of MoSeS should be relevant and we are working towards this

MoSeS Vision Suppose that computational power and data storage were not an issue, what would you build? –SimCity http://en.wikipedia.org/wiki/SimCity For real on a national scale

MoSeS Rationale The idea is to provide planners, policy makers and the public with a tool to help them analyse the potential impacts and the likely effect of planning and policy changes. Example Application: –There may be a housing policy to do with joint ownership, taxation and planning restriction legislation that can be developed to alleviate problems to do with lack of affordable housing and workers without precipitating a crash in the housing market and economy as a whole –A balanced policy may be easier to develop by running a large number of simulations within a system like SimCity for real to understand the sensitivities involved

MoSeS First Steps The development of a national demographic model The development of 3 applications –Health care –Transport –Business The development of a portal interface to support the development and resulting applications by providing access to the data, models and simulations and presenting information to users (application developers) in a secure way