Machine Learning Documentation Initiative
Workshop on the Modernisation of Statistical Production
Topic iii) Innovation in technology and methods driving opportunities for modernisation
Kenneth Chu and Claude Poirier
Geneva, Switzerland, April 2015
What is Machine Learning (ML)?
Application of artificial intelligence in which algorithms use available information to process (or assist the processing of) statistical data
20 applications were reported.
Areas considered: Coding, Editing, Linkage, Collection
18/11/2015 Statistics Canada Statistique Canada
Why should we consider ML?
Relatively new discipline of computer science
No need for probabilistic models
Less stringent assumptions, which suits the Big Data era
All NSOs (national statistical offices) should explore the use of ML
Classes of ML: SUPERVISED ML
Ex.1: Logistic regression [statistics]
  Training data: binary response (0 or 1) and predictors
  Maximum likelihood estimation yields the model parameters
  The resulting model is used to predict responses
Ex.2: Support Vector Machines [non-statistics]
  Training data: binary response (0 or 1) and predictors
  Hyperplanes in the space of predictors separate the responses
  The SVM optimisation problem comes from geometry
Other examples: decision trees, neural networks, Bayesian networks
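To make Ex.1 concrete, here is a minimal sketch of logistic regression fitted by maximum likelihood, using plain gradient ascent on the average log-likelihood. It is an illustration only, not the method of any application in this deck. The function names, the learning rate, the iteration count and the toy data are all assumptions chosen for the example.

```python
import math

def sigmoid(z):
    # Logistic function, written to avoid overflow for large |z|
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(X, y, lr=0.1, n_iter=5000):
    """Estimate [intercept, weights...] by gradient ascent on the
    average Bernoulli log-likelihood (hypothetical helper)."""
    n, p = len(X), len(X[0])
    w = [0.0] * (p + 1)  # w[0] is the intercept
    for _ in range(n_iter):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            # Residual: observed response minus fitted probability
            err = yi - sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], xi)))
            grad[0] += err
            for j, xj in enumerate(xi):
                grad[j + 1] += err * xj
        w = [wj + lr * gj / n for wj, gj in zip(w, grad)]
    return w

def predict(w, x):
    # Predict 1 when the fitted probability is at least 0.5
    prob = sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))
    return 1 if prob >= 0.5 else 0

# Hypothetical toy training data: one predictor, overlapping classes
X = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [4.0]]
y = [0, 0, 0, 1, 0, 1, 1, 1]

w = fit_logistic(X, y)
print("parameters:", w)
print("predictions:", predict(w, [0.5]), predict(w, [4.0]))
```

The toy classes overlap on purpose: with perfectly separable data the maximum likelihood estimate does not exist, and the parameters would keep growing with the number of iterations. Ex.2 (SVM) is not sketched here; it replaces the likelihood with a geometric margin criterion.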
Classes of ML: UNSUPERVISED ML
Ex.1: Principal Component Analysis [statistics]
  PCA summarises a set of data by finding orthogonal sub-spaces that represent most of the variation
  There is no response variable in this setting
Ex.2: Cluster Analysis [non-statistics]
  CA seeks to determine groupings in the given data
  Again, there are no response variables in this setting
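Both unsupervised examples can be sketched in a few lines. The block below computes the first principal component by power iteration on the sample covariance matrix, then groups points with k-means (Lloyd's algorithm). K-means is only one of many clustering methods and is chosen here as an assumption. The function names, the starting vector, the starting centroids and the toy data are likewise hypothetical.

```python
import math

def first_principal_component(rows, n_iter=200):
    """Leading eigenvector of the sample covariance matrix, found by
    power iteration. Assumes the all-ones starting vector is not
    orthogonal to that eigenvector."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[a] * c[b] for c in centred) / (n - 1) for b in range(p)]
           for a in range(p)]
    v = [1.0] * p
    for _ in range(n_iter):
        v = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
    return v

def k_means(rows, centroids, n_iter=20):
    """Lloyd's algorithm with caller-supplied starting centroids."""
    for _ in range(n_iter):
        # Assignment step: nearest centroid by squared Euclidean distance
        labels = [min(range(len(centroids)),
                      key=lambda k: sum((a - b) ** 2
                                        for a, b in zip(r, centroids[k])))
                  for r in rows]
        # Update step: each centroid moves to the mean of its members
        updated = []
        for k in range(len(centroids)):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[k])  # keep an empty cluster's centroid
        centroids = updated
    return labels, centroids

# Hypothetical data lying close to the line y = x
line_data = [[1, 1.1], [2, 1.9], [3, 3.2], [4, 3.9], [5, 5.1]]
pc1 = first_principal_component(line_data)
print("first principal component:", pc1)

# Hypothetical data with two visually separate groups
group_data = [[1, 1], [1.2, 0.8], [0.9, 1.1], [5, 5], [5.1, 4.9], [4.8, 5.2]]
labels, centres = k_means(group_data, [group_data[0], group_data[3]])
print("cluster labels:", labels)
```

Neither function receives a response variable: PCA works only from the covariance of the data, and k-means only from distances between records.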
Applications: Automated Coding
Bayesian classifier (Germany): Occupation coding
CASCOT (United Kingdom): Occupation coding
Indexing utility (Ireland): Individual consumption
SVM (New Zealand): Occupation and Qualification
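As a rough illustration of how a Bayesian classifier can assign codes to free-text responses, the sketch below trains a multinomial naive Bayes model with add-one smoothing. It is not the system used by any agency listed above. The training texts, the code labels `HEALTH` and `EDUCATION`, and the function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_nb(examples):
    """examples: list of (free_text, code) pairs. Returns the counts
    needed for prediction (hypothetical helper)."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, code in examples:
        tokens = text.lower().split()
        class_counts[code] += 1
        word_counts[code].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict_nb(model, text):
    """Return the code with the highest posterior log-score."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    best_code, best_score = None, -math.inf
    for code, n in class_counts.items():
        # Add-one (Laplace) smoothing over the training vocabulary
        denom = sum(word_counts[code].values()) + len(vocab)
        score = math.log(n / total)
        for tok in text.lower().split():
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[code][tok] + 1) / denom)
        if score > best_score:
            best_code, best_score = code, score
    return best_code

# Hypothetical training responses and codes
training = [
    ("registered nurse hospital", "HEALTH"),
    ("nurse in clinic", "HEALTH"),
    ("primary school teacher", "EDUCATION"),
    ("high school teacher", "EDUCATION"),
]
model = train_nb(training)
print(predict_nb(model, "hospital nurse"))
print(predict_nb(model, "school teacher"))
```

In practice an automated coder would typically also report a confidence score, so that low-confidence responses can be routed to manual coding; that step is omitted here.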
Applications: Data Editing
Bayesian Networks (Eurostat): Voting intentions
Classification Trees (Portugal): Foreign trade data
Cluster Analysis (USA): Census of agriculture
CART (New Zealand): Census of population
Random Forests (New Zealand): Donor imputation
Association Analysis (New Zealand): Edit rules
Applications: Record Linkage
Unlike both coding and editing
Linkage quality depends more on pre-processing than on matching
No applications of Machine Learning in official statistics were listed
Applications: Other areas – Data collection
Classification Tree (USA): Non-response prediction
Classification Tree (USA): Reporting errors
Naïve Bayes text mining (Italy): Web scraping
K-nearest neighbours (Hungary): Tax audit
Image Processing (Canada): Remote sensing
Concluding remarks
Several machine learning applications were reported
A gap remains in the area of record linkage
Attention is required outside statistical paradigms
Next: applying Machine Learning to Big Data
Will this be possible only on a case-by-case basis?
Thank you
For more information, please contact: