Chris Dibben University of Edinburgh Linking historical administrative data
Context
History of very important contributions:
–Dutch Famine Birth Cohort Study – epigenetics, thrifty phenotype
–Överkalix study – epigenetics, sex differences
–UK Longitudinal Study – health inequalities
Two new developmental projects
Scottish Mental Surveys 1932 and 1947
Scottish civil registration data
New cohorts for people now in old age
The ‘Scottish Mental Survey’
[Linkage diagram. Components shown:]
1947 Scottish Mental Survey
1939 register
Birth 1936
ED code, address, household members: marital status, occupation
The Scottish Longitudinal Study
Scottish morbidity records
1939 books recorded the date of death (up to 1980)
Linkage to the death database (1974 onwards)
Education
Employment
[Life-course timeline, axes: Age, Year. Labels shown: Birth; Early life environment; Mental ability (11); School achievement (time estimated); 1947; Occupation (estimated); Detailed household/individual information; Hospitalisation; Mortality]
Background – Scottish vital events
Civil registration of births, deaths and marriages in Scotland began on 1 January 1855
All historical vital events records have been converted into digital image format with a supporting index
Modern vital events data (from 1974 onwards) are available electronically
Digitising Scotland
Approximately 50 million occupation strings, 8 million causes of death
Classify occupations to Historical International Standard Classification of Occupations (HISCO)
Cause of death to a modified ICD-10
Each with a location
Historical Geocoding
[Diagram: GEOCODING TOOL, GEOMETRY FEATURES]
Year | Historical address
2010 | Ladywell House, Ladywell Road, Edinburgh, EH12 7T
1910 | Ladywell House, Ladywell Street, Edinburgh
1810 | Ladywell House, Ladywell Street, Edinburgh
1710 | Ladywell House, Lady[vv]ell Street, Edinburgh
Issues illustrated: postcode change; without postcode; interpretation error
Change of road networks (new roads replace old) over time
Change of road names over time
Interpretation errors from the address digitisation
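The address variants on this slide suggest a matching step between a transcribed address and a list of known street names. The following is a minimal sketch of that idea, not the project's geocoding tool: the gazetteer `GAZETTEER`, its feature identifiers, the `[vv]` repair rule and the 0.85 cutoff are all assumptions made for illustration, and the fuzzy comparison uses Python's standard `difflib`.

```python
import re
from difflib import SequenceMatcher

# Hypothetical gazetteer: street name -> geometry feature identifier.
# The identifiers are placeholders, not real project data.
GAZETTEER = {
    "ladywell road": "feature:ladywell-road",
    "ladywell street": "feature:ladywell-street",
}


def normalise(address):
    """Lower-case, repair the 'vv for w' misreading, split on commas."""
    text = address.lower().replace("[vv]", "w")
    text = re.sub(r"[^a-z0-9, ]", "", text)  # drop stray punctuation
    return [part.strip() for part in text.split(",") if part.strip()]


def geocode(address, gazetteer=GAZETTEER, cutoff=0.85):
    """Return the feature whose street name best matches any address part."""
    best_score, best_feature = 0.0, None
    for part in normalise(address):
        for street, feature in gazetteer.items():
            score = SequenceMatcher(None, part, street).ratio()
            if score > best_score:
                best_score, best_feature = score, feature
    # Refuse to guess when nothing is similar enough.
    return best_feature if best_score >= cutoff else None


print(geocode("Ladywell House, Lady[vv]ell Street, Edinburgh"))
print(geocode("Ladywell House, Ladywell Road, Edinburgh, EH12 7T"))
```

A sketch like this handles transcription noise only. Road renaming and changes to the road network would need a gazetteer that is specific to each period, which is not modelled here.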
Challenges
Significant methodological issues:
–How can we consistently code occupational data so that researchers can explore changing patterns and trends?
–How can we automate this process so that the majority of records do not need to be manually coded?
Digitising Scotland
Records of births, marriages and deaths recorded in Scotland from 1855 to the present day.
Experimental Dataset
Use a dataset with similar content for experiments
60,000 records from the Cambridge Family History Study (records from )
Occupation descriptions and associated HISCO codes
HISCO coding done by historians
Dataset contains 330 different HISCO codes
HISCO Hierarchy Example
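The hierarchy can be illustrated with a short sketch, assuming the usual reading of a five-digit HISCO code in which the first digit gives the major group, the first two the minor group and the first three the unit group. The function names `hisco_levels` and `deepest_shared_level` are hypothetical helpers, not part of any HISCO tooling.

```python
LEVELS = ("occupation", "unit_group", "minor_group", "major_group")


def hisco_levels(code):
    """Split a five-digit HISCO code into its nested prefixes."""
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"expected a five-digit code, got {code!r}")
    return {
        "major_group": code[:1],
        "minor_group": code[:2],
        "unit_group": code[:3],
        "occupation": code,
    }


def deepest_shared_level(code_a, code_b):
    """Name the most specific level at which two codes agree, or None."""
    a, b = hisco_levels(code_a), hisco_levels(code_b)
    for level in LEVELS:  # ordered from most to least specific
        if a[level] == b[level]:
            return level
    return None


print(hisco_levels("62460"))
print(deepest_shared_level("98330", "58100"))
```

One use of such a helper would be to score an automatic classification at several levels of the hierarchy, so that a prediction in the right unit group counts as closer than one in the wrong major group.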
Classification Example
String from record | Gold Standard Classification | Automatic Classification Output
Farm horseman | 62460 Horse Worker |
Shoe maker | 80110 Shoemaker, General |
Fireman (railway) | 98330 Railway Steam-Engine Fireman |
Fireman | 58100 Fire-Fighter |
Stationer | 41000 Working Proprietors (Wholesale and Retail Trade) | Paper and Paperboard product makers
Approach
Text analysis
Supervised machine learning
–Apache Mahout framework
Combination of these techniques
Supervised Machine Learning
[Diagram: Training Data → Machine Learning → Prediction Model; Unseen Data → Prediction Model → Predicted Classification]
Training data: Farm horseman 62460; Shoe maker 80110; Fireman 58100; Stationer 41000
Unseen data: Farm horseman; Boot maker; Fireman; Painter
Predicted classification: ?
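The train-then-predict flow in the diagram can be sketched in a few lines. This is a toy nearest-neighbour classifier on word tokens written in plain Python, swapped in for illustration; it is not the Apache Mahout setup named in the approach. The training pairs and unseen strings are the ones on the slide, while `train`, `predict` and the Jaccard scoring are assumptions.

```python
# Training pairs as shown on the slide: occupation string -> HISCO code.
TRAINING = [
    ("Farm horseman", "62460"),
    ("Shoe maker", "80110"),
    ("Fireman", "58100"),
    ("Stationer", "41000"),
]


def tokens(text):
    """Lower-cased set of whitespace-separated words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def train(examples):
    """For nearest neighbour, 'training' just stores tokenised examples."""
    return [(tokens(text), code) for text, code in examples]


def predict(model, text):
    """Return the code of the most similar training string, or None."""
    query = tokens(text)
    best_score, best_code = 0.0, None
    for example_tokens, code in model:
        score = jaccard(query, example_tokens)
        if score > best_score:
            best_score, best_code = score, code
    return best_code  # None means no token overlap: leave for manual coding


model = train(TRAINING)
for unseen in ["Farm horseman", "Boot maker", "Fireman", "Painter"]:
    print(unseen, "->", predict(model, unseen))
```

With this toy model, "Boot maker" borrows the code of "Shoe maker" through the shared word "maker", and "Painter" gets no prediction because nothing in the training data resembles it. Strings like that would fall back to manual coding.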
[Figure: 100% Asthma. Terms shown: Miners asthma spasmodic collier's miner's miners asthma dropsy bronchial]
Creation of a fully-linked vital events database for the whole of Scotland back to
[Diagram: Present Vital Events (24 million births, deaths and marriages); Digital Images + Index; Vital Events Database; Fully-linked Vital Events Database]
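Linking one vital event to another usually means comparing candidate record pairs and keeping the pairs that look alike. The sketch below shows that general idea only: the slides do not describe the project's linkage method, and the toy records, field names, blocking on birth year and the 0.75 threshold are all invented for illustration.

```python
from difflib import SequenceMatcher

# Invented toy records; `birth_year` on a death record is assumed to have
# been derived beforehand (for example from year of death and stated age).
BIRTHS = [
    {"id": "B1", "forename": "margaret", "surname": "macdonald", "birth_year": 1860},
    {"id": "B2", "forename": "john", "surname": "smith", "birth_year": 1860},
]
DEATHS = [
    {"id": "D1", "forename": "margt", "surname": "mcdonald", "birth_year": 1860},
    {"id": "D2", "forename": "jon", "surname": "smyth", "birth_year": 1860},
    {"id": "D3", "forename": "john", "surname": "smith", "birth_year": 1875},
]


def similarity(birth, death):
    """Mean of the forename and surname string similarities."""
    fore = SequenceMatcher(None, birth["forename"], death["forename"]).ratio()
    sur = SequenceMatcher(None, birth["surname"], death["surname"]).ratio()
    return (fore + sur) / 2


def link(births, deaths, threshold=0.75):
    """Map each death id to its best-matching birth id, if any."""
    links = {}
    for death in deaths:
        # Blocking: only compare records that share a birth year.
        block = [b for b in births if b["birth_year"] == death["birth_year"]]
        if not block:
            continue
        best = max(block, key=lambda b: similarity(b, death))
        if similarity(best, death) >= threshold:
            links[death["id"]] = best["id"]
    return links


print(link(BIRTHS, DEATHS))
```

Blocking keeps the number of comparisons manageable at the scale of millions of records, at the cost of missing links when the blocking field itself is wrong.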
Large scale family reconstruction studies and Pedigrees
Gottfredsson, Magnús, et al. "Lessons from the past: familial aggregation analysis of fatal pandemic influenza (Spanish flu) in Iceland in 1918." Proceedings of the National Academy of Sciences (2008):
Acknowledgments
The Digitising Scotland project is funded by ESRC. The support from National Records of Scotland is also gratefully acknowledged.