School of Geography FACULTY OF EARTH & ENVIRONMENT New disclosure threats in Census interaction data Presented at the 6th International Conference on Population Geographies, Umeå, Sweden 17th June 2011 Oliver Duke-Williams
What are interaction data? Migration data Journey to work data Journey to school data Interaction data are flow data Also referred to as ‘origin-destination’ data
Sources of interaction data: Interaction data are derived from census questions that are asked in many countries, e.g. "What is your place of work?" and "What was your usual address 1 year ago?" (different time periods are used in different countries). These can (potentially) be used to derive detailed flow matrices: Special Migration Statistics (SMS), Special Workplace Statistics (SWS), Special Travel Statistics (STS).
UK Census questions 2011
[Figure: 1991 SMS Set 2. Key: White: no migrants; Blue: 1-9 migrants; Red: 10+ migrants. Labels: London; Metropolitan counties]
Location tracing Mobile phone handsets (and other equipment) are often location aware: they can determine a location using GPS or cell tower triangulation This can be used to offer a variety of location based services Restaurant recommendation Local services etc
tripadviser.com
geocaching.com
getlondonreading.com
Social networking and location A potential use of location aware hardware is to add location data to social networking updates Geo-tagged Tweets Google Places FourSquare
fourwhere.com
4mapper.appspot.com
seekatweet.com
twittervision.com
The risks If third party location tracing makes it easy to monitor the location of an individual (possibly without their knowledge), what does it tell us about them? Association with particular locations Many adults spend a lot of time in particular locations: Their home Their workplace
Location trace examples
Can people be identified from location traces? Krumm (2007) – using volunteer donated records Using 2 weeks worth of in-car GPS data Try to determine the subject’s home location Given a location, try to determine personal attributes Source: Krumm (2007); Fig 2 Source: Krumm (2007); un-numbered table
How easy is it to get location data? Data can be purchased: Danezis et al (2005) found a median value of £10 placed on the data by students; Cvrcek et al (2006) found variations by country, purpose and length of observation. Data can be exchanged for a chance to win: in Krumm (2007), data were donated for a 1/100 chance to win a $200 MP3 player. Data can be exchanged for services: location-based apps etc. Data could be obtained without permission: iPhone tracking etc.
Source: Cvrcek et al (2006); Figure 1(a)
Uniqueness Combinations of characteristics make people unique 87% of Americans have unique combination of gender, birth date and ZIP-code (Sweeney, 2000) This includes spatial identifiers Area of residence is not unique Area of workplace is not unique A combination of home and workplace might be unique...
Workplace flow data Golle and Partridge (2009) Examined US Longitudinal Employer-Household Dynamics dataset Studied potential disclosure threat from generalised home and workplace location
Golle & Partridge (2009)
Assessing risk in the UK Workplace data What proportion of all workers are uniquely identifiable on the basis of their home and workplace location alone? Migration data What proportion of all migrants are uniquely identifiable on the basis of their origin and destination alone? What if we add extra attributes such as gender or age group?
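The questions above all reduce to counting how many people sit alone in their combination of key values. A minimal sketch of that count, assuming a hypothetical list of individual records; the names `records` and `unique_share` and the toy values are illustrative only and are not Census data or part of any Census product:

```python
from collections import Counter

def unique_share(records, key_fields):
    """Share of people whose combination of key_fields is held by nobody else."""
    if not records:
        return 0.0
    keys = [tuple(r[f] for f in key_fields) for r in records]
    sizes = Counter(keys)
    unique = sum(1 for k in keys if sizes[k] == 1)
    return unique / len(keys)

# Hypothetical toy records (assumption: one dict per worker).
records = [
    {"home": "A", "work": "X", "sex": "F"},
    {"home": "A", "work": "X", "sex": "M"},
    {"home": "A", "work": "Y", "sex": "F"},
    {"home": "B", "work": "X", "sex": "M"},
    {"home": "B", "work": "X", "sex": "M"},
]

home_work_only = unique_share(records, ["home", "work"])
with_sex = unique_share(records, ["home", "work", "sex"])
print(home_work_only, with_sex)
```

In this toy example one of five workers is unique on home and workplace alone, and three of five once sex is added, which mirrors the way extra attributes raise the proportions reported later in the talk.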
1991 SMS 1991 SMS Set 1 Migration within and between 10,000+ wards Limited attribute detail 1991 SMS Set 2 Migration within and between 459 districts More attribute detail Subject to suppression Scope of both sets is Great Britain
1991 SMS
Key                                  1991 SMS 1   1991 SMS 2 broad age groups   1991 SMS 2 narrow age groups
Origin, destination                  12.5%        0.6%                          -
Origin, destination, sex             21.5%        1.7%                          -
Origin, destination, age group       22.2%        2.7%                          6.8%
Origin, destination, age group, sex  37.4%        6.2%                          11.0%
Proportion of migrants uniquely disclosed by various keys, 1991 SMS
2001 SMS Data are released at different ‘Levels’ 2001 SMS Level 1 Migration within and between 426 Local Authorities 2001 SMS Level 2 Migration within and between 10,000+ wards 2001 SMS Level 3 Migration within and between 223,000+ Output Areas As the spatial detail increases, the attribute detail is reduced Scope of the data is United Kingdom Some variations in detail in different countries
2001 SMS
Analysis is affected by disclosure control. Small Cell Adjustment Methodology: small counts randomly adjusted. Not applied to data for destinations in Scotland.

All ages             Persons   Male   Female
< pensionable age    12        7      5
Pensionable age+     5         1      4

All ages             Persons   Male   Female
< pensionable age    12        7      5
Pensionable age+     5         3      4

All ages             Persons   Male   Female
< pensionable age    12        7      5
Pensionable age+     7         3      4
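The slide says only that small counts are randomly adjusted; the exact rule is not given here. A minimal sketch of one scheme of this kind, assuming (this is an assumption, not the documented procedure) that counts of 1 and 2 are moved to 0 or 3 with probabilities that preserve the expected value. The name `adjust_small_cell` is hypothetical:

```python
import random

def adjust_small_cell(count, rng):
    """Randomly move a count of 1 or 2 to 0 or 3; leave other counts unchanged.

    ASSUMPTION: unbiased rounding to base 3, used here for illustration only.
    """
    if count not in (1, 2):
        return count
    # P(round up to 3) = count / 3 keeps the expected value equal to count.
    return 3 if rng.random() < count / 3 else 0

rng = random.Random(0)
cells = [12, 7, 5, 1, 4]  # illustrative cell values only
adjusted = [adjust_small_cell(c, rng) for c in cells]
print(adjusted)
```

Under a rule like this an adjusted table never contains a 1 or a 2, and totals rebuilt from adjusted cells can differ from the original totals.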
2001 SMS
Data set   Proportion of all migrants   Additional attributes
Level 3    35.1%                        2 (age, sex)
Level 2    11.2%                        3 (age, sex, ethnic group)
Level 1    0.3%                         7 (age, sex, ethnic group, family status, household status, illness status, economic activity)
Proportions of migrants uniquely disclosed by origin and destination, 2001 SMS, destinations in Scotland
2001 SWS / STS: Published for the same set of Levels as the 2001 SMS. SWS published for residences in England, Wales and Northern Ireland. STS published for residences in Scotland; includes SWS components, plus students’ travel to place of study and the non-working, non-student population. Similar disclosure control issues, but more confusing: Small Cell Adjustment applied to residences in Scotland at Level 3; not applied to residences in Scotland at Levels 1 and 2.
Accommodating SCAM: Much of the data is affected by the Small Cell Adjustment Methodology (SCAM), so values of ‘1’ are not seen. We can identify general ‘small’ flows by looking at those with a revised total of 3. We can estimate the proportion of these that originally had a value of ‘1’ in two ways: by applying flow frequencies observed in un-modified Scottish data, or by generating a mean value across multiple affected tables.
2001 SWS
Data set                      Flows with total=3   Estimated flow = 1 [mean method]   Estimated flow = 1 [sts2 method]
Level 3 (England and Wales)   63.4%                16.8%                              27.4%
Level 2 (UK exc. Scotland)    9.2%                 2.4%                               4.0%
Level 1 (UK exc. Scotland)    0.4%                 0.1%                               0.2%
Proportion of all workers uniquely identifiable (or in small flows) from home and workplace locations only
Varying geography Spatial scale has an important effect on the results What happens if we vary the scale for only one end of the flow? e.g. Detailed workplace geography, but less detailed home geography? Three sets of flows constructed from mean flow data Ward-to-ward District-to-ward Government Office Region-to-ward
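Coarsening only one end of the flow can be sketched as a recode of the origin zone followed by a re-count. The example below is a minimal illustration under stated assumptions: the flow table, the ward-to-district lookup and the function names `aggregate_origins` and `share_in_unit_flows` are all hypothetical, and the numbers are toy values rather than SWS figures:

```python
from collections import Counter

def aggregate_origins(flows, origin_lookup):
    """Sum flows after recoding each origin to a coarser zone; destinations unchanged."""
    coarse = Counter()
    for (origin, destination), count in flows.items():
        coarse[(origin_lookup[origin], destination)] += count
    return coarse

def share_in_unit_flows(flows):
    """Share of all people who are the only person in their flow."""
    total = sum(flows.values())
    ones = sum(c for c in flows.values() if c == 1)
    return ones / total if total else 0.0

# Hypothetical ward-to-ward flows: (home ward, workplace ward) -> workers.
ward_flows = {
    ("w1", "x"): 1, ("w2", "x"): 1, ("w3", "x"): 4,
    ("w1", "y"): 1, ("w3", "y"): 3,
}
# Hypothetical lookup from home ward to home district.
ward_to_district = {"w1": "d1", "w2": "d1", "w3": "d2"}

district_flows = aggregate_origins(ward_flows, ward_to_district)
print(share_in_unit_flows(ward_flows), share_in_unit_flows(district_flows))
```

In the toy data the share of workers in flows of one falls from 3 in 10 to 1 in 10 when home wards are merged into districts, while the workplace geography stays at ward level.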
Asymmetric 2001 SWS results
Data set           Flow=1
Ward to ward       2.4%
District to ward   0.5%
GOR to ward        0.03%
Proportions of workers uniquely identifiable given home and workplace location only
Anonymity sets for asymmetric SWS
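An anonymity set here is taken to be the group of people sharing the same origin and destination pair, so its size is simply the flow count. A minimal sketch of how a distribution of anonymity set sizes could be tabulated, assuming a hypothetical flow table; `anonymity_set_distribution` is an illustrative name and the values are toy data:

```python
from collections import Counter

def anonymity_set_distribution(flows):
    """Map anonymity-set size k to the share of people in a flow of exactly k people."""
    total = sum(flows.values())
    people_by_size = Counter()
    for count in flows.values():
        if count > 0:
            people_by_size[count] += count  # a flow of k people contributes k people
    return {k: n / total for k, n in sorted(people_by_size.items())}

# Hypothetical flows: (home zone, workplace zone) -> workers.
flows = {("a", "x"): 1, ("a", "y"): 2, ("b", "x"): 2, ("b", "y"): 5}
distribution = anonymity_set_distribution(flows)
print(distribution)
```

The entry for size 1 is the uniquely identifiable share; summing entries up to a chosen size k gives the share of people in small flows.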
Existing asymmetric data sets As well as standard outputs, there are also commissioned outputs Some of these have been interaction data, including some asymmetric data sets Two data sets were studied, both showing commuting flows by mode of transport to Output Areas in Greater London
Flows to OAs in Greater London
Data set                                                 Flow=3   Estimated flow=1 [mean method]
C0310 Origins: Wards in Greater London                   26.9%    7.0%
C0311 Origins: Districts in E&W                          8.8%     2.3%
C0311 – subset Origins: Districts in Greater London      4.6%     1.2%
Proportions of workers uniquely identifiable given home and workplace location only
Does it matter? What are the implications of these results? The 1991 and 2001 data sets are old No location traces from then are likely to exist Even if they did, the individuals may have moved, died, changed status etc. Any risk from location tracing will apply to future data sets e.g. from the 2011 Census Publishing data sets for detailed geographies may constitute a potential risk, but it is limited
Disclosure risks The most obvious risks are posed by the OA-OA flows However, there is little potential for attribute disclosure A simple headcount data set at this scale would allow modelling of coarser flows, but with little attribute disclosure risk The ward-to-ward flows pose a smaller risk Proposed record-swapping based disclosure control can introduce enough noise to reduce the risk The district-to-district flows pose no practical risk Statistical agencies should not be afraid of publishing detailed flow data at this level
Asymmetric data sets have important potential Can act as hybrid between flexible interaction data and area-based data Can show high level of spatial detail Could be published as complementary pairs Utility would depend on user satisfaction
Questions?
References
Danezis, G., Lewis, S. and Anderson, R. (2005) How much is location privacy worth? In: Fourth Workshop on the Economics of Information Security.
Cvrcek, D., Kumpost, M., Matyas, V. and Danezis, G. (2006) A study on the value of location privacy. In: Proceedings of the 5th ACM Workshop on Privacy in Electronic Society (WPES '06). ACM, New York, NY, USA.
Golle, P. and Partridge, K. (2009) On the Anonymity of Home/Work Location Pairs. Pervasive Computing, Lecture Notes in Computer Science, vol 5538/2009. Springer, Berlin / Heidelberg.
Krumm, J. (2007) Inference Attacks on Location Tracks. In: Proceedings of the Fifth International Conference on Pervasive Computing (Pervasive 2007).
Sweeney, L. (2000) Uniqueness of Simple Demographics in the U.S. Population. Laboratory for International Data Privacy, Carnegie Mellon University.
Distribution of 2001 SWS Level 2 mean flows