How to combine data from multiple sources Frans Willekens, NIDI NTTS2017 Satellite Event ‘Measuring migration’ Brussels, 13 March 2017
Content Framework Data sources Census and sample survey Administrative data, incl. big data How to combine data from different sources? Conclusion
Migration statistics <- data on individual and location Identify individual: personal characteristics and proof of identity Approach person and ask (or ask proxy) Person approaches authorities and self-reports (e.g. registration) Biometric identifiers (biometric authentication) Electronic identification (authentication): e-Verify Electronic ID card (eID) / Personal Identity Verification Card (e.g. LincPass, USA) RFID implant (Radio Frequency Identification)
Migration statistics <- data on individual and location Determine location of individual (and relocation): proof of residence Location ≠ residence Actual and usual residence (de facto / de jure) IP address GPS tracking (device ≠ person) In document Implanted microchip with GPS Info on individual and relocation Personal attributes Date of relocation Forced / Voluntary Reason for relocation Is relocation authorized by authorities?
Data source: census and sample survey Location Place of birth Place of previous residence Place of residence at given prior date Date Date of birth Date of census or survey Data reliability: good, but Natives and immigrants only (no emigrants except if proxy respondents are interviewed) Census: coverage ok, but few personal attributes Sample survey: few migrants unless migrants oversampled
Data source: Administrative data A. Civil registration Location: address Date: date of birth and date of change of address / date of registration Data reliability: Self reporting: notify authorities of arrival and departure Registration: new address (arrival / immigration) Deregistration: old address (departure / emigration) Response: international cooperation Nordic countries Romania, Italy, Spain (Pisicã, 2016) USA-Canada
Data source: Administrative data B. Entry visa / residence permit / blue (green) card Authorization of stay Location: no address Date: starting date (visa issuing date) and ending date (visa expiration date or status expiration date (USA)) Other info: reason for entry; citizenship; location issued Data reliability Unauthorized (illegal) entry Visa overstay
Responses to visa overstays: e-borders Assumes eID / e-Passport A. Travel Authorisation System USA: Visa Waiver Program: Electronic System for Travel Authorization (ESTA) EU: EU Smart Border Initiative European Travel Information and Authorisation System (ETIAS) Visa Information System (VIS) (EU) Collects biographic and biometric info Identify prior to arrival if a traveller poses a security or migratory risk and exchange data with other countries
B. Verification system USA: NSEERS (National Security Entry-Exit Registration System) (initiated in 2002 for citizens from ‘high-risk’ countries) (ended by Obama in 2016) US-VISIT (since 2004) Trump Executive Order 27/1/2017: Biometric Entry-Exit Tracking System for all travelers to the United States Visa Interview Security: “extreme vetting” (= ideological screening incl. pw social media) EU: Entry/Exit System (EES): replaces manual stamping of passports Stores biographic and biometric info, including date and place of entry and exit, four fingerprints and the facial image Comparison of info in EES and VIS (EES and VIS connected) http://www.consilium.europa.eu/en/press/press-releases/2017/03/02-entry-exit-system/
Data sources: social networks and Google Location: geo-locator IP address ≠ residence (accurate identification of country) GPS location ≠ residence (accurate identification of country) Account holder Internet Service Provider (ISP) may give name and address Data not representative for general population Google: searches may signal outmigration intentions Review: Hughes et al. (2016) Inferring Migrations: Traditional Methods and New Approaches based on Mobile Phone, Social Media, and other Big Data. Feasibility study on Inferring (labour) mobility and migration in the European Union from big data and social media data. European Commission project #VT/2014/093
Data issues Definition of migration Usual residence Duration threshold Coverage Underreporting / undercount Accuracy of data collection
How to combine data from different sources? Answer: use a model of the true migration flows Consequence: estimates or synthetic data
[Diagram: data (measurements) from census and survey, administrative data, and big data, together with other relevant information (quantitative and qualitative), enter a model that yields synthetic data (estimates)] Distinguish between observations (data) and ‘true’ migration flows Raymer, J., Wiśniowski, A., Forster, J.J., Smith, P.W.F., and Bijak, J. (2013). Integrated modeling of European migration. Journal of the American Statistical Association, 108(503):801-819
Approach: Model the true migration flows True flows are stochastic Model the stochastic process: counting process Example: Poisson process Observations are realisations of the stochastic process Use all available data: quantitative and qualitative
Data generating process: counting process Count data: number of migrations n(t) Stochastic process {N(t), t ≥ 0}, with N(t) the ‘true’ migration flows Simplest counting process: Poisson process One parameter, but varies by migrant category Unobserved heterogeneity: mover-stayer model Direction of migration: Origin - destination E(N) = λ; var(N) = λ; λ = μt
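The Poisson properties on this slide (E(N) = λ, var(N) = λ, λ = μt) can be checked with a small simulation. The sketch below is only an illustration: the rate `mu`, the interval length `t`, and the number of replications are hypothetical values, and the process is simulated through exponential waiting times between events.

```python
import random
import statistics


def simulate_count(mu, t, rng):
    """Count the events of a Poisson process with rate mu in the interval [0, t]."""
    count = 0
    elapsed = rng.expovariate(mu)  # waiting time until the first event
    while elapsed <= t:
        count += 1
        elapsed += rng.expovariate(mu)  # waiting time until the next event
    return count


rng = random.Random(2017)   # fixed seed so the run is reproducible
mu, t = 0.05, 100.0         # hypothetical rate and interval length
lam = mu * t                # expected number of migrations, lambda = mu * t

counts = [simulate_count(mu, t, rng) for _ in range(10_000)]
mean_count = statistics.mean(counts)
var_count = statistics.variance(counts)

print(f"lambda = {lam:.2f}, simulated mean = {mean_count:.2f}, "
      f"simulated variance = {var_count:.2f}")
```

Mean and variance of the simulated counts should both lie close to λ. In observed migration counts the variance often exceeds the mean, which is one reason to consider unobserved heterogeneity such as the mover-stayer model.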
How to estimate model of true flows from data? Theory of counting processes Number of migrations (occurrences) by origin, destination, and attributes of migrants and stayers Exposure: Number of persons exposed Duration of exposure Parameter of model: migration rate
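Under a Poisson model the maximum likelihood estimate of the migration rate is the number of occurrences divided by the exposure. A minimal sketch follows, assuming exposure is measured in person-years in the country of origin; the country labels, counts, and exposures are invented for illustration.

```python
import math

# Hypothetical number of migrations by (origin, destination).
migrations = {("A", "B"): 120, ("A", "C"): 30, ("B", "A"): 80}

# Hypothetical exposure: person-years lived in each origin during the period.
exposure = {"A": 10_000.0, "B": 4_000.0}


def occurrence_exposure_rates(migrations, exposure):
    """Return {(origin, destination): (rate, standard_error)}.

    rate = occurrences / exposure is the Poisson maximum likelihood estimate;
    sqrt(occurrences) / exposure is its approximate standard error.
    """
    rates = {}
    for (origin, destination), n in migrations.items():
        person_years = exposure[origin]
        rates[(origin, destination)] = (n / person_years,
                                        math.sqrt(n) / person_years)
    return rates


rates = occurrence_exposure_rates(migrations, exposure)
for (origin, destination), (rate, se) in rates.items():
    print(f"{origin} -> {destination}: rate = {rate:.4f} (s.e. {se:.4f})")
```

The same calculation can be repeated within categories of migrant attributes (age, sex, citizenship) when occurrences and exposures are available at that level.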
Use all relevant info: data and prior info Types of prior information Quantitative: primary data and auxiliary data Qualitative: expert opinions How to add prior information? Information as probability distributions Bayes' rule: posterior ∝ likelihood × prior
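One simple way to see how prior information enters as a probability distribution is the conjugate Gamma-Poisson update. This is an assumption chosen for illustration, not the model of the presentation or of Raymer et al. (2013), which is considerably richer. The expert opinion is encoded as a Gamma prior on the migration rate; all numbers are hypothetical.

```python
def gamma_posterior(prior_shape, prior_rate, n_migrations, exposure):
    """Update a Gamma(shape, rate) prior on a Poisson migration rate.

    With n_migrations observed over `exposure` person-years, the posterior
    is Gamma(prior_shape + n_migrations, prior_rate + exposure).
    """
    return prior_shape + n_migrations, prior_rate + exposure


# Hypothetical expert opinion: rate around 0.01, worth about 200 person-years.
prior_shape, prior_rate = 2.0, 200.0
prior_mean = prior_shape / prior_rate

# Hypothetical data: 30 migrations observed over 1,000 person-years.
n_migrations, exposure = 30, 1_000.0
data_rate = n_migrations / exposure

post_shape, post_rate = gamma_posterior(prior_shape, prior_rate,
                                        n_migrations, exposure)
posterior_mean = post_shape / post_rate

print(f"prior mean = {prior_mean:.4f}, data rate = {data_rate:.4f}, "
      f"posterior mean = {posterior_mean:.4f}")
```

The posterior mean is a weighted average of the prior mean and the observed rate, with weights proportional to the prior "exposure" and the observed exposure, so a vague expert opinion is quickly dominated by the data.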
Observed vs ‘true’ migrations Migration model predicts ‘true’ number of migrations Measurement model quantifies difference between observed flows and ‘true’ flows (separate for origin and destination country) True flow = Observed flow * Correction factor
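The relation True flow = Observed flow * Correction factor can be applied separately to the figure reported by the origin country and the figure reported by the destination country. In the sketch below the reported flows and the correction factors are invented; in practice the factors would be estimated by the measurement model.

```python
# Hypothetical reported flow from country A to country B in one year.
observed = {
    "origin": 800,         # emigration recorded by sending country A
    "destination": 1200,   # immigration recorded by receiving country B
}

# Hypothetical correction factors, one per reporting country.
correction = {"origin": 1.5, "destination": 1.05}


def corrected_flows(observed, correction):
    """Apply true flow = observed flow * correction factor to each source."""
    return {source: observed[source] * correction[source]
            for source in observed}


estimates = corrected_flows(observed, correction)
for source, flow in estimates.items():
    print(f"{source}-reported flow corrected to about {flow:.0f}")
```

After correction the two sources should point to similar values of the true flow; a large remaining gap suggests that the correction factors need to be revisited.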
Conclusion Combine data from different sources Metadata are essential Use statistical models: Migration model Measurement model For missing data: use proxy respondents Emigration data
Thank you Willekens@nidi.nl