Geocoding of Population and Housing Census 2000: Lessons Learned


1 Geocoding of Population and Housing Census 2000: Lessons Learned
Dāvis Kļaviņš, GIS specialist

2 Research on topics such as migration and depopulation in space and time requires a longitudinal geocoded dataset on population, employment and housing conditions. The most complete data of this kind can be obtained only from population censuses. Unlike the data of the Population and Housing Census (PHC) 2011, the data of PHC 2000 had not been geocoded, which limited their use in longitudinal research.

3 Within the framework of the Eurostat grant project:
the PHC 2000 dataset was geocoded;
population indicators were published, including in a 1×1 km grid for the whole of Latvia and in a 100×100 m grid for cities under state jurisdiction, as well as in a remotely accessible anonymized dataset for scientific use (similar to how the PHC 2011 dataset is published); and
local indicators of spatial association and the Getis-Ord Gi* statistic were tested for analysing migration and depopulation data.

4 Geocoding of PHC 2000 dataset
Information on addresses from questionnaires was recognized optically and contains errors.
Examples of recognized values (House No., Street name): /1 TARTU 11 11 TART4 11 TARTU 2A TA.RTU . 2A TAR+4 2A TART4 2A TARTU 2A TART-U -2A TARTU 5 5 TART4 5 TARTU 5 TARTU - 5 TARTU. ZA TARTU

5 Geocoding of PHC 2000 dataset
By linking the address information in PHC 2000 with that in the Population Register (PR) as of 1 March 2000, the probability that the addresses match was determined. Address data in the PR also contain format errors, but far fewer than the PHC 2000 dataset. Probabilities were calculated in R, using the fields that form an address (street/house name, house No., etc.).

6 Geocoding of PHC 2000 dataset
For each field of every record, the Levenshtein distance (the minimum number of single-character changes required to transform one word into the other) was calculated. It was then divided by the length of the longer of the two words and subtracted from 1, giving the probability that the fields are equal.
DAUGAVPIL.5 → DAUGAVPILS (distance is equal to 2; the longer word has 11 characters)
Probability = 1 − 2/11 ≈ 0.82
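The calculation described on this slide can be sketched in a few lines. The project itself used R; the Python below is only an illustration, and the function names `levenshtein` and `match_probability` are hypothetical. Treating two empty fields as equal is an assumption, since the slides do not say how empty values were handled.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete a character of a
                               current[j - 1] + 1,       # insert a character of b
                               previous[j - 1] + cost))  # substitute (or keep)
        previous = current
    return previous[-1]


def match_probability(a: str, b: str) -> float:
    """1 minus the distance divided by the length of the longer word."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # assumption: two empty fields count as equal
    return 1 - levenshtein(a, b) / longest


# The example from the slide: distance 2, longer word has 11 characters.
print(levenshtein("DAUGAVPIL.5", "DAUGAVPILS"))
print(round(match_probability("DAUGAVPIL.5", "DAUGAVPILS"), 2))
```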

7 Geocoding of PHC 2000 dataset
To calculate the combined probability for each record, the field probabilities were combined using weights. Using MS SQL Server, addresses from the PHC 2000 dataset were divided into several groups (~30) according to the form in which the address is written and linked with 1) historical data from the State Address Register (SAR) or, if that was not possible, 2) the PHC 2011 dataset. The linked datasets contain the coordinates to be added to the PHC 2000 dataset.
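The slides do not state which weights were used or how exactly they were applied. As a minimal sketch, assuming a normalized weighted average of the per-field probabilities, the combination might look as follows. The field names and weight values are hypothetical.

```python
def combined_probability(field_probabilities: dict, weights: dict) -> float:
    """Weighted average of per-field match probabilities (assumed combination rule)."""
    total_weight = sum(weights[name] for name in field_probabilities)
    weighted_sum = sum(p * weights[name] for name, p in field_probabilities.items())
    return weighted_sum / total_weight


# Hypothetical weights: street name and house No. count twice as much as village name.
weights = {"village": 1.0, "street": 2.0, "house_no": 2.0}
field_probabilities = {"village": 1.0, "street": 0.8, "house_no": 0.5}

record_probability = combined_probability(field_probabilities, weights)
print(round(record_probability, 2))
```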

8 Geocoding of PHC 2000 dataset
In some groups, the calculated probabilities were used to substitute the address information from questionnaires with that in the PR. E.g., if the probability that the address in the questionnaire and the address in the PR are equal is >0.85, the address from the PR was used for linking with the SAR (linked fields: village name, street name and house No.).
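The substitution rule in the example can be expressed as a simple threshold check. This is a sketch only: the function name and the dictionary layout are hypothetical, and returning `None` for records below the threshold (to be handled by a later linking step) is an assumption.

```python
PR_THRESHOLD = 0.85  # from the example on the slide; the comparison is strict (>)


def pr_address_for_linking(pr_address, probability):
    """Return the PR address if it may replace the questionnaire address, else None."""
    if pr_address is not None and probability > PR_THRESHOLD:
        return pr_address
    return None  # assumption: the record is left for another linking step


pr_address = {"village": "", "street": "TARTU", "house_no": "2A"}
print(pr_address_for_linking(pr_address, 0.90))  # PR address is used
print(pr_address_for_linking(pr_address, 0.85))  # not above the threshold
```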

9 Geocoding of PHC 2000 dataset
For the remaining addresses, the probability of equality between the PHC 2000 and PHC 2011 datasets was determined and, under several conditions, used to assume that the address from PHC 2011 can be used. E.g., in the PHC 2000 dataset sometimes only the village name was given in the house name field; if the probability of equality between that house name and the village name in the PHC 2011 dataset is ≥0.7, the address from PHC 2011 can be used, provided that according to PHC 2011 at least one person from the household lives in the same village.

10 Geocoding of PHC 2000 dataset
For addresses which could not be linked automatically and cover ~ persons (10%), probabilities of address equality between PHC 2000 and SAR were determined. The 10 highest probabilities for each address were used for manual linking (all if the highest probability was equal for more than 10 addresses). The number of records to be linked manually is considerably smaller than ~ because of the grouping by address (initially those were ~37 000, but in some cases addresses had to be split into households or persons).
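The selection of candidates shown to the person doing the manual linking can be sketched as below. The function name and the data layout are hypothetical. How ties at the tenth position were handled is not stated on the slide, so this sketch simply cuts the list after ten entries unless the top probability itself is shared by more than ten addresses.

```python
def candidates_for_manual_linking(scored, limit=10):
    """scored: list of (sar_address, probability) pairs for one PHC 2000 address."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    top_probability = ranked[0][1]
    tied_at_top = [item for item in ranked if item[1] == top_probability]
    if len(tied_at_top) > limit:
        return tied_at_top  # keep all addresses that share the highest probability
    return ranked[:limit]


# 15 candidates with distinct probabilities: only the 10 best are kept.
distinct = [(f"address {i}", 0.5 + i / 100) for i in range(15)]
# 12 candidates tied at the top plus 3 weaker ones: all 12 tied ones are kept.
tied = [(f"address {i}", 0.9) for i in range(12)] + [(f"other {i}", 0.5) for i in range(3)]

print(len(candidates_for_manual_linking(distinct)))
print(len(candidates_for_manual_linking(tied)))
```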

11 Geocoding of PHC 2000 dataset
To facilitate manual linking, software was developed which, apart from the SAR addresses with the highest probabilities of being equal, also gave access to the following sources: scanned questionnaires; all addresses surveyed by the interviewer; registered places of residence (until 2016); and the address in PHC 2011.

12 Geocoding of PHC 2000 dataset
Results of linking:
Linked automatically: 88% of population
Linked manually (incl. linked automatically without coordinates, more than one address linked automatically, and errors corrected in automatic linking): 12% of population
Address cannot be geocoded or existence is doubtful at the critical moment of the PHC: 510 persons
Source code of the scripts generated in the course of the project is added to the report and published on the GISCO site on EC Extranet Wiki.

13 Data publishing
1×1 km grid of Latvia: population number (already published in PHC 2011 and population estimates of 2016 and 2017), sex (already published in PHC 2011), age groups (0–14, 15–64 and 65+, already published in PHC 2011).
100×100 m grid of the cities under state jurisdiction: population number (already published in PHC 2011).

14 Data publishing
Neighbourhoods of Riga, territorial units and administrative territories based on the boundaries in force at the beginning of 2018 (published also in PHC 2011 and population estimates of 2016 and 2017): population number, sex, age groups (0–14, 15–24, 25–49, 50–64 and 65+), average age, citizenship, ethnicity.

15 Data publishing
Status of the usually resident population in neighbourhoods of Riga, territorial units and administrative territories on two dates (a choice of PHC 2000, PHC 2011 and the population estimates of 2016 and 2017), based on the boundaries in force at the beginning of 2018: lives within the same territory, moved to another territory in Latvia, moved from another territory in Latvia, moved abroad, moved from abroad, passed away, was born.

16 Data publishing
Publishing of variables obtained via questionnaires is suspended due to the measurement and processing errors revealed. Nevertheless, anonymized data are provided to researchers as a remotely accessible scientific-use dataset, together with information about the data quality problems. The dataset includes identifiers ensuring compatibility not only between the datasets of PHC 2000, PHC 2011 and the population estimates of 2016 and 2017, but also with administrative registers (e.g., State Revenue Service data).

17 Share of permanent residents aged 18 and over having higher education (PHC 2000, %)

18 One of the persons whose level of education differs in the PHC 2000 database and questionnaire

19 Areas with very high or low share of usually resident population aged 18 and over with higher education or doctorate degree (PHC 2011)

20 Analysis of migration and depopulation data
Migration and depopulation were planned to be analysed using grid data together with local indicators of spatial association (LISA), to detect spatial autocorrelation at a local scale, and the Getis-Ord Gi* statistic, to identify hotspots and coldspots.
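For orientation, the Getis-Ord Gi* z-score of a cell compares the sum of values in its neighbourhood (including the cell itself) with what the global mean would predict. The sketch below uses binary weights on a one-dimensional chain of cells; the function name, the neighbourhood definition and the population change values are all hypothetical, and it is not the implementation used in the project.

```python
import math


def getis_ord_gi_star(values, neighbours):
    """Gi* z-scores with binary weights; each cell is part of its own neighbourhood."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum(v * v for v in values) / n - mean ** 2)
    scores = []
    for i in range(n):
        window = set(neighbours[i]) | {i}   # the "star": cell i is included
        w_sum = len(window)                 # sum of binary weights (equals sum of squares)
        local_sum = sum(values[j] for j in window)
        numerator = local_sum - mean * w_sum
        # Undefined (division by zero) if all values are equal or the window covers every cell.
        denominator = s * math.sqrt((n * w_sum - w_sum ** 2) / (n - 1))
        scores.append(numerator / denominator)
    return scores


# Hypothetical population change in 8 cells along a line; neighbours are adjacent cells.
changes = [-40, -35, -30, -2, 0, 3, 25, 30]
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(changes)]
         for i in range(len(changes))}

scores = getis_ord_gi_star(changes, chain)
for change, score in zip(changes, scores):
    print(change, round(score, 2))
```

Strongly negative scores indicate coldspots and strongly positive scores indicate hotspots; in practice the scores are compared with normal-distribution critical values.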

21 Spatial clusters and outliers of absolute change in usually resident population; 2000–2017 (differential Local Moran’s I implementation in GeoDa)
As the algorithm is based on absolute change, the towns with larger populations and their surroundings dominate: population decreases in the centre but increases outside the town (a well-known fact). High-high clusters are also formed where the most populous allotment garden territories are located, as many parcels there have been converted into permanent residential ones. Because clusters are based on the average value of population change, not zero, clusters with high values may include negative population change. No information is given about the countryside, because the results there are statistically insignificant due to the smaller population size.
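The point that high-high clusters may contain negative population change follows from centring the change on its mean. The sketch below applies the Local Moran's I formula to the difference between two dates, with row-standardised weights on a chain of cells. It omits the permutation-based significance test, and the function name and all population figures are hypothetical; it is not GeoDa's code.

```python
def differential_local_moran(pop_start, pop_end, neighbours):
    """Local Moran's I of the change between two dates, with cluster quadrant labels."""
    change = [end - start for start, end in zip(pop_start, pop_end)]
    n = len(change)
    mean = sum(change) / n
    z = [c - mean for c in change]             # deviations from the mean change, not from zero
    m2 = sum(v * v for v in z) / n
    results = []
    for i in range(n):
        lag = sum(z[j] for j in neighbours[i]) / len(neighbours[i])  # row-standardised lag
        local_i = z[i] * lag / m2
        if z[i] >= 0:
            quadrant = "high-high" if lag >= 0 else "high-low"
        else:
            quadrant = "low-low" if lag < 0 else "low-high"
        results.append((local_i, quadrant))
    return results


# Hypothetical cells: three populous ones losing many residents, three small ones losing few.
pop_start = [1000, 900, 800, 50, 40, 30]
pop_end = [700, 650, 600, 45, 36, 27]
chain = {i: [j for j in (i - 1, i + 1) if 0 <= j < len(pop_start)]
         for i in range(len(pop_start))}

results = differential_local_moran(pop_start, pop_end, chain)
for start, end, (local_i, quadrant) in zip(pop_start, pop_end, results):
    print(end - start, round(local_i, 2), quadrant)
```

In this example every cell loses population, yet the small cells fall into the high-high quadrant because their losses are smaller than the average loss.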

22 Analysis of migration and depopulation data
Relative population change was not used in LISA or the Getis-Ord Gi* statistic because an increase from zero is undefined and, unlike absolute change, positive and negative relative changes are not symmetric (e.g., an increase from 10 to 50 is 400%, but a decrease from 50 to 10 is only -80%, so the results would be positively biased). Hence, the use of grid data for detecting correlation between population change and other variables (which are restricted to those from the PR because of the identified processing and measurement errors) is also limited and was therefore omitted.
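The asymmetry and the undefined case can be checked with the numbers from the slide. The helper name `relative_change` is hypothetical, and returning `None` for a start value of zero is just one way to mark the undefined case.

```python
def relative_change(start, end):
    """Percentage change from start to end; None when start is zero (undefined)."""
    if start == 0:
        return None
    return (end - start) * 100 / start


print(relative_change(10, 50))   # increase from 10 to 50
print(relative_change(50, 10))   # decrease from 50 to 10
print(relative_change(0, 25))    # increase from zero is undefined

# Absolute change, by contrast, is symmetric: +40 and -40.
print(50 - 10, 10 - 50)
```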

23 Change in usually resident population (%); 2000–2017
To provide more information that is also easier to interpret (increase or decrease starts from zero), a map of relative population change was published instead.

24 Spatial clusters and outliers of relative change in usually resident population and absolute change in average age (Bivariate Local Moran’s I implementation in GeoDa)
Red – population decreased not as much as average, but ageing was significant; light red – population decrease and ageing were both not as much as average; blue – population decreased significantly, but ageing was not as much as average; light blue – population decrease and ageing were both significant.
Unlike in grid cells, relative population change within territorial units is not as extreme, and no areas are unpopulated; therefore it makes more sense to use LISA or the Getis-Ord Gi* statistic here.

25 Areas in Riga where the share of usually resident population in the total city population has increased or decreased significantly; 2000–2017
To create publishable visualizations of population change hotspots and coldspots using grid data, the area of interest was reduced to the cities under state jurisdiction, which are small enough to use the share of usually resident population in the total city population and its absolute change.
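The indicator described here, the absolute change in a cell's share of the total city population, stays defined for cells that were empty on one of the dates. A minimal sketch, with hypothetical cell identifiers and population counts:

```python
def share_change(cells_start: dict, cells_end: dict) -> dict:
    """Change, in percentage points, of each cell's share of the total city population."""
    total_start = sum(cells_start.values())
    total_end = sum(cells_end.values())
    changes = {}
    for cell in set(cells_start) | set(cells_end):
        share_start = 100 * cells_start.get(cell, 0) / total_start
        share_end = 100 * cells_end.get(cell, 0) / total_end
        changes[cell] = share_end - share_start
    return changes


# Hypothetical city with three grid cells on two dates; cell D is populated only later.
cells_2000 = {"A": 600, "B": 300, "C": 100}
cells_2017 = {"A": 400, "B": 300, "C": 60, "D": 40}

result = share_change(cells_2000, cells_2017)
for cell in sorted(result):
    print(cell, round(result[cell], 1))
```

Cell B keeps the same population but its share grows because the city as a whole shrinks, which is the kind of relative shift this indicator is meant to show.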
