Machine Learning in R and its use in the statistical offices

stat.unido.org

Outline
- Machine learning and R
- R packages
- Machine learning in official statistics
- Top 10 algorithms
- References

What I talk about when I talk about Machine Learning

R and R packages: What makes R so useful?
- Users can extend and improve the software or write variations for specific tasks.
- The R package mechanism allows packages written for R to add advanced algorithms, graphs, machine learning and data mining techniques.
- Each R package provides structured, standard documentation, including code application examples.

R and R packages
## Naive Bayes example
> install.packages('e1071', dependencies = TRUE)
> library(class)
> library(e1071)
> data(iris)
> pairs(iris[1:4], main = "Iris Data (red=setosa,green=versicolor,blue=virginica)",
+       pch = 21, bg = c("red", "green3", "blue")[unclass(iris$Species)])

R and R packages
> classifier <- naiveBayes(iris[,1:4], iris[,5])
> table(predict(classifier, iris[,-5]), iris[,5])
             setosa versicolor virginica
  setosa
  versicolor
  virginica
(rows: predicted class, columns: true class; the cell counts are not preserved in this transcript)
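The table above scores the classifier on the same rows it was trained on, which tends to look optimistic. A minimal sketch of a hold-out check follows. The 70/30 split and the seed are assumptions made for illustration and are not part of the original slides.

```r
library(e1071)   # provides naiveBayes()

data(iris)
set.seed(42)     # arbitrary seed, only for reproducibility

# Hold out 30% of the rows as a test set (the split ratio is an assumption)
test_idx <- sample(nrow(iris), size = round(0.3 * nrow(iris)))
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# Fit on the training rows only
classifier <- naiveBayes(Species ~ ., data = train)

# Confusion table and accuracy on the unseen rows
pred <- predict(classifier, newdata = test)
conf <- table(predicted = pred, actual = test$Species)
accuracy <- sum(diag(conf)) / sum(conf)

print(conf)
print(accuracy)
```

The diagonal of `conf` holds the correctly classified test records, so `accuracy` is the share of held-out flowers assigned to the right species.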

Machine Learning for Official Statistics
- Automatic Coding
- Editing and Imputation
- Record Linkage
- Other Methods

Automatic coding
- Automatic coding via Bayesian classifier: caret, klaR
- Automatic occupation coding via CASCOT: algorithm not described
- Automatic coding via open-source indexing utility: ?
- Automatic coding of census variables via SVM: e1071 (interface to libsvm)

Editing and Imputation
- Categorical data imputation via neural networks and Bayesian networks: neuralnet, gRain, bnlearn, deal
- Identification of error-containing records via classification trees: rpart, tree, caret
- Imputation donor pool screening via cluster analysis: class, klaR, cluster, kmeans(), hclust()
- Imputation via Classification and Regression Trees (CART): rpart, caret, RWeka
- Determination of imputation matching variables via Random Forests: randomForest
- Creation of homogeneous imputation classes via CART: rpart
- Derivation of edit rules via association analysis: arules
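As an illustration of imputation via CART, here is a minimal sketch with rpart. The data set (iris), the variable chosen for imputation and the artificially created missing values are assumptions made for the example, not a method taken from any statistical office.

```r
library(rpart)   # recommended package shipped with R

data(iris)
set.seed(1)

# Blank out Petal.Width in 20 randomly chosen records (hypothetical missingness)
dat <- iris
missing_idx <- sample(nrow(dat), 20)
dat$Petal.Width[missing_idx] <- NA

# Fit a regression tree on the complete records only
complete <- dat[!is.na(dat$Petal.Width), ]
tree <- rpart(Petal.Width ~ Sepal.Length + Sepal.Width + Petal.Length + Species,
              data = complete, method = "anova")

# Impute: each incomplete record receives the mean of its terminal node
dat$Petal.Width[missing_idx] <- predict(tree, newdata = dat[missing_idx, ])

# Compare the imputed values with the true values that were removed
mae <- mean(abs(dat$Petal.Width[missing_idx] - iris$Petal.Width[missing_idx]))
print(mae)
```

The terminal nodes of the tree play the role of homogeneous imputation classes. A donor-based variant would draw an observed value from the same node instead of using the node mean.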

Record Linkage
Weighting vector classification:
- The last major step in record linkage or record de-duplication can be understood as a classification problem: each candidate pair of records is classified as a match or a non-match from its vector of comparison weights.
- In R: rpart, bagging() in package ipred, ada, svm() in package e1071, nnet() in package nnet
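To make the classification view concrete, here is a minimal sketch using rpart on simulated comparison vectors. Everything about the data is hypothetical: the field names (`name_sim`, `address_sim`, `birth_sim`), the 20% match rate and the Beta distributions of the similarity scores are assumptions chosen only to produce a runnable example.

```r
library(rpart)   # recommended package shipped with R

set.seed(7)
n <- 1000

# Hypothetical candidate pairs: about 20% are true matches
is_match <- rbinom(n, 1, 0.2) == 1

# Simulated similarity score in [0, 1]; true matches tend to score higher
sim_field <- function(match) {
  ifelse(match, rbeta(length(match), 8, 2), rbeta(length(match), 2, 5))
}

pair_data <- data.frame(
  name_sim    = sim_field(is_match),
  address_sim = sim_field(is_match),
  birth_sim   = sim_field(is_match),
  status      = factor(ifelse(is_match, "match", "non-match"))
)

# Train on 700 labelled pairs, evaluate on the remaining 300
train_idx <- sample(n, 700)
tree <- rpart(status ~ ., data = pair_data[train_idx, ], method = "class")

pred <- predict(tree, newdata = pair_data[-train_idx, ], type = "class")
actual <- pair_data$status[-train_idx]
accuracy <- mean(pred == actual)

print(table(predicted = pred, actual = actual))
print(accuracy)
```

In practice the labelled pairs would come from a clerically reviewed sample, and the same data frame could be passed to the other classifiers named above.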

Other Methods
- Questionnaire consolidation via cluster analysis: class, klaR, cluster...
- Forming non-response weighting groups via classification trees: rpart, tree, caret
- Non-respondent prediction via classification trees: rpart, tree, caret
- Analysis of reporting errors via classification trees: rpart, tree, caret
- Substitutes for surveys via internet scraping: scrapeR, rvest
- Tax evader detection via k-nearest neighbours: class, kknn
- Crop yield estimation via image processing on satellite imaging data: is this ML?
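One way to read "forming non-response weighting groups via classification trees" is sketched below with rpart: the terminal nodes of a tree that predicts response become the weighting groups, and each respondent is weighted by the inverse response rate of its group. The sample, the auxiliary variables (`age_group`, `urban`) and the response propensities are simulated and hypothetical.

```r
library(rpart)   # recommended package shipped with R

set.seed(3)
n <- 2000

# Hypothetical sample with two auxiliary variables known for all units
sample_df <- data.frame(
  age_group = factor(sample(c("15-29", "30-49", "50+"), n, replace = TRUE)),
  urban     = factor(sample(c("urban", "rural"), n, replace = TRUE))
)

# Simulated response behaviour: older people respond more, urban units less
p <- 0.5 + 0.2 * (sample_df$age_group == "50+") - 0.15 * (sample_df$urban == "urban")
sample_df$responded <- factor(ifelse(runif(n) < p, "yes", "no"))

# Classification tree for response; each terminal node is one weighting group
tree <- rpart(responded ~ age_group + urban, data = sample_df, method = "class",
              control = rpart.control(cp = 0.001, minbucket = 100))
sample_df$group <- factor(tree$where)

# Response rate per group and the resulting adjustment factor
rate <- tapply(sample_df$responded == "yes", sample_df$group, mean)
sample_df$adj <- unname(1 / rate[as.character(sample_df$group)])

print(rate)

# The adjusted respondents reproduce the full sample size
respondents <- sample_df$responded == "yes"
print(sum(sample_df$adj[respondents]))
```

The `minbucket` setting keeps every group large enough for a stable response rate. The value 100 is an arbitrary choice for this sketch.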

Fernandez-Delgado, Cernadas, Barro (2014)
Do we Need Hundreds of Classifiers to Solve Real World Classification Problems?
- Evaluates 179 classifiers arising from 17 families on 121 data sets
- By far the best are random forests and SVM with Gaussian kernel
- Most of the best classifiers are implemented in R and tuned using caret, which seems the best alternative to select a classifier implementation
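A small sketch of what "tuned using caret" can look like for the two best-performing families, a random forest and an SVM with Gaussian (radial) kernel. It assumes the packages caret, randomForest and kernlab are installed. The data set, the 5-fold cross-validation and the tuning grid size are choices made for this illustration, not the setup of the paper.

```r
library(caret)   # train() and trainControl(); uses randomForest and kernlab

data(iris)
ctrl <- trainControl(method = "cv", number = 5)   # 5-fold cross-validation

# Random forest, tuning mtry over 3 candidate values
set.seed(2014)
fit_rf <- train(Species ~ ., data = iris, method = "rf",
                trControl = ctrl, tuneLength = 3)

# SVM with radial (Gaussian) kernel, tuning the cost parameter C
set.seed(2014)
fit_svm <- train(Species ~ ., data = iris, method = "svmRadial",
                 trControl = ctrl, tuneLength = 3)

# Best cross-validated accuracy found for each model
best_acc <- c(rf = max(fit_rf$results$Accuracy),
              svmRadial = max(fit_svm$results$Accuracy))
print(round(best_acc, 3))
```

Because both calls share the same `trainControl` object and seed, the two models are compared on the same resampling scheme.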

Top 10 ML/DM Algorithms Xindong Wu and Vipin Kumar (2009)
- C4.5: generates classifiers expressed as decision trees or in ruleset form
- K-Means: simple iterative method to partition a given dataset into a user-specified number of clusters, k
- SVM: support vector machines
- Apriori: derives association rules
- EM: Expectation-Maximization algorithm
- PageRank: produces a static ranking of Web pages
- AdaBoost: ensemble learning
- kNN: k-nearest neighbor classification
- Naive Bayes: simple classifier applying Bayes' theorem with independence assumptions between the features
- CART: Classification and Regression Trees

C4.5 builds decision trees from a set of training data, using the concept of information entropy. The algorithm was developed by Ross Quinlan (1993) and is an extension of Quinlan's earlier ID3 algorithm. The decision trees generated by C4.5 can be used for classification, and thus C4.5 is often referred to as a statistical classifier.
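The role of information entropy in tree building can be illustrated in base R. The helper functions `entropy()` and `info_gain()` below are written for this sketch and do not come from any package. C4.5 itself goes one step further and normalizes the gain to a gain ratio, which is not shown here.

```r
# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]                      # drop empty classes, since 0 * log2(0) counts as 0
  -sum(p * log2(p))
}

# Information gain of splitting `labels` by the grouping vector `groups`
info_gain <- function(labels, groups) {
  weights <- table(groups) / length(groups)
  conditional <- sum(sapply(names(weights), function(g) {
    weights[[g]] * entropy(labels[groups == g])
  }))
  entropy(labels) - conditional
}

data(iris)

# Three equally frequent species, so the entropy is log2(3)
h <- entropy(iris$Species)

# Candidate split: Petal.Length below 2.5 isolates all setosa flowers
split <- ifelse(iris$Petal.Length < 2.5, "short", "long")
gain <- info_gain(iris$Species, split)

print(h)
print(gain)
```

The "short" branch contains only setosa (entropy 0) and the "long" branch contains the other two species in equal numbers (entropy 1 bit), so the gain equals log2(3) minus 2/3. A tree builder evaluates many such candidate splits and keeps the one with the best score.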

The best R packages for ML
- e1071: Naive Bayes, SVM, latent class analysis
- rpart: classification and regression trees (recursive partitioning)
- randomForest: random forests
- gbm: generalized boosted regression models
- kernlab: SVM
- caret: Classification and Regression Training
- neuralnet: neural networks
- CRAN Task View: Machine Learning & Statistical Learning

Machine learning books
