Machine Learning in R and its use in the statistical offices


Machine Learning in R and its use in the statistical offices
stat.unido.org
v.todorov@unido.org

Outline
- Machine learning and R
- R packages
- Machine learning in official statistics
- Top 10 algorithms
- References

What I talk about when I talk about Machine Learning

R and R packages
What makes R so useful? Users can extend and improve the software or write variations for specific tasks. The R package mechanism allows packages written for R to add advanced algorithms, graphs, machine learning and data mining techniques. Each R package provides structured, standard documentation, including code application examples.

R and R packages
## Naive Bayes example
> install.packages('e1071', dependencies = TRUE)
> library(class)
> library(e1071)
> data(iris)
> pairs(iris[1:4],
+       main = "Iris Data (red=setosa, green=versicolor, blue=virginica)",
+       pch = 21,
+       bg = c("red", "green3", "blue")[unclass(iris$Species)])

R and R packages
> classifier <- naiveBayes(iris[, 1:4], iris[, 5])
> table(predict(classifier, iris[, -5]), iris[, 5])

             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         47         3
  virginica       0          3        47
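The table above is the confusion matrix of the Naive Bayes classifier evaluated on the training data: 144 of the 150 irises are classified correctly, and the 6 errors all confuse versicolor with virginica. Predicting on the training data is only a sanity check; a held-out test set or cross-validation gives a more honest error estimate.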

Machine Learning for Official Statistics
- Automatic Coding
- Editing and Imputation
- Record Linkage
- Other Methods

Automatic coding
- Automatic coding via Bayesian classifier: caret, klaR
- Automatic occupation coding via CASCOT: algorithm not described
- Automatic coding via open-source indexing utility: ?
- Automatic coding of census variables via SVM: e1071 (interface to libsvm), see the sketch below
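A minimal sketch of the SVM-based coding item, assuming verbatim answers have already been turned into a document-term matrix; the objects dtm_train, codes_train and dtm_new are hypothetical placeholders for the coded training texts, their manually assigned codes and the records still to be coded.

## Sketch only (hypothetical inputs): automatic coding with an SVM from e1071
## dtm_train   - numeric matrix of term frequencies for already-coded answers
## codes_train - factor of manually assigned classification codes
## dtm_new     - term frequencies for answers that still need a code
library(e1071)

fit <- svm(x = dtm_train, y = codes_train, kernel = "linear")

## propose codes for the uncoded records
proposed <- predict(fit, dtm_new)
table(proposed)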

Editing and Imputation
- Categorical data imputation via neural networks and Bayesian networks: neuralnet, gRain, bnlearn, deal
- Identification of error-containing records via classification trees: rpart, tree, caret (see the sketch below)
- Imputation donor pool screening via cluster analysis: class, klaR, cluster, kmeans(), hclust()
- Imputation via Classification and Regression Trees (CART): rpart, caret, RWeka
- Determination of imputation matching variables via Random Forests: randomForest
- Creation of homogeneous imputation classes via CART: rpart
- Derivation of edit rules via association analysis: arules
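A sketch of the error-identification item, assuming a data frame records with numeric survey variables and a factor has_error (levels "ok"/"error") available from a manually edited training sample; the variable names are hypothetical.

## Sketch only (hypothetical data): flag likely error-containing records
## with a classification tree from rpart
library(rpart)

tree <- rpart(has_error ~ turnover + employees + wages,
              data = records, method = "class")

## estimated probability that a record contains an error
p_error <- predict(tree, newdata = records, type = "prob")[, "error"]

## route high-risk records to manual review
to_review <- records[p_error > 0.5, ]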

Record Linkage
Weighting vector classification: the last major step in record linkage or record de-duplication can be understood as a classification problem.
In R: rpart, bagging() in package ipred, ada, svm() in package e1071, nnet() in package nnet (see the sketch below).
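A sketch of the classification step with bagging() from ipred; pairs_df is a hypothetical data frame with one row per candidate record pair, similarity scores for the matching variables, and a factor is_match known only for a clerically reviewed subset.

## Sketch only (hypothetical data): classify comparison vectors as
## match / non-match with bagged trees from ipred
library(ipred)

train <- pairs_df[!is.na(pairs_df$is_match), ]
fit <- bagging(is_match ~ name_sim + dob_sim + addr_sim,
               data = train, nbagg = 25)

## classify all candidate pairs
pairs_df$link <- predict(fit, newdata = pairs_df, type = "class")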

Other Methods
- Questionnaire consolidation via cluster analysis: class, klaR, cluster...
- Forming non-response weighting groups via classification trees: rpart, tree, caret
- Non-respondent prediction via classification trees: rpart, tree, caret
- Analysis of reporting errors via classification trees: rpart, tree, caret
- Substitutes for surveys via internet scraping: scrapeR, rvest
- Tax evader detection via k-nearest neighbours: class, kknn (see the sketch below)
- Crop yield estimation via image processing on satellite imaging data: is this ML?
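A sketch of the k-nearest-neighbour item with knn() from the class package; train_x, train_y and new_x are hypothetical placeholders for scaled numeric features of audited units, their audit outcome, and units not yet audited.

## Sketch only (hypothetical data): nearest-neighbour screening for
## likely tax evaders with class::knn()
library(class)

## train_x: scaled numeric features of audited units
## train_y: factor with levels "compliant" / "evader"
## new_x  : the same features for units not yet audited
flag <- knn(train = train_x, test = new_x, cl = train_y, k = 5)
table(flag)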

Fernandez-Delgado, Cernadas, Barro (2014): Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
- Evaluate 179 classifiers arising from 17 families on 121 data sets
- By far the best are random forests and SVM with Gaussian kernel
- Most of the best classifiers are implemented in R and tuned using caret, which seems the best alternative for selecting a classifier implementation
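As an illustration of the tuning-with-caret point, a minimal example on the iris data used earlier (requires the randomForest package to be installed):

## Tune a random forest with caret (5-fold cross-validation over mtry)
library(caret)

set.seed(1)
ctrl <- trainControl(method = "cv", number = 5)
rf_fit <- train(Species ~ ., data = iris, method = "rf",
                trControl = ctrl, tuneLength = 3)
rf_fit$bestTune   # selected value of mtry
rf_fit$results    # cross-validated accuracy per candidate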

Top 10 ML/DM Algorithms
Xindong Wu and Vipin Kumar (2009)
- C4.5 – generates classifiers expressed as decision trees or in ruleset form
- K-Means – simple iterative method to partition a given dataset into a user-specified number of clusters, k
- SVM – support vector machines
- Apriori – derives association rules
- EM – Expectation-Maximization algorithm
- PageRank – produces a static ranking of Web pages
- AdaBoost – ensemble learning
- kNN – k-nearest neighbor classification
- Naive Bayes – simple classifier applying Bayes' theorem with independence assumptions between the features
- CART – Classification and Regression Trees

C4.5 builds decision trees from a set of training data using the concept of information entropy. The algorithm was developed by Ross Quinlan (1993) and is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and thus C4.5 is often referred to as a statistical classifier.
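Two of these algorithms can be tried directly in R with nothing beyond base R and rpart (C4.5 itself is available through the J48() interface in the RWeka package); a quick illustration on the iris data:

## k-means: partition the iris measurements into k = 3 clusters
set.seed(1)
km <- kmeans(iris[, 1:4], centers = 3)
table(km$cluster, iris$Species)   # clusters vs. the true species

## CART: a classification tree with rpart
library(rpart)
cart <- rpart(Species ~ ., data = iris, method = "class")
print(cart)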

The best R packages for ML
- e1071: Naive Bayes, SVM, latent class analysis
- rpart: classification and regression trees
- randomForest: random forests
- gbm: generalized boosted models
- kernlab: SVM
- caret: Classification and Regression Training
- neuralnet: neural networks
- CRAN Task View: Machine Learning & Statistical Learning
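For example, the randomForest package listed above can be tried on the iris data in a couple of lines:

## A random forest on iris with out-of-bag error estimate
library(randomForest)

set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
print(rf)        # OOB error rate and confusion matrix
varImpPlot(rf)   # variable importance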

Machine learning books