
Solving Data Mining Tasks. R and Hadoop

Классификация Decision Tree  Исходные данные >names(iris) [1] "Sepal.Length" "Sepal.Width" "Petal.Length“ [4] "Petal.Width" "Species" > attributes(iris$Species) $levels [1] "setosa" "versicolor" "virginica“ … 2

Классификация Decision Tree  Подготовка данных > set.seed(1234) > ind <- sample(2, nrow(iris), replace=TRUE, + prob=c(0.7, 0.3)) > trainData <- iris[ind==1,] > testData <- iris[ind==2,] 3

Классификация Decision Tree  Построение модели > library(party) > myFormula <- Species ~ Sepal.Length + + Sepal.Width + Petal.Length + Petal.Width > iris_ctree <- ctree(myFormula, data=trainData) 4

Классификация Decision Tree  Применение модели > table(predict(iris_ctree), trainData$Species) setosa versicolor virginica setosa versicolor virginica

Классификация Decision Tree  Применение модели >test_predicted <- predict(iris_ctree, + newdata=testData) > table(test_predicted, testData$Species) setosa versicolor virginica setosa versicolor virginica

Классификация Decision Tree  Визуализация >plot(iris_ctree) 7

Классификация Decision Tree  Визуализация >plot(iris_ctree, type="simple") 8

Классификация Random Forest  Построение модели > library(randomForest) > rf <- randomForest(Species ~., + data=trainData, ntree=100, proximity=TRUE) > table(predict(rf), trainData$Species) 9

Классификация Random Forest  Информация о модели >print(rf) Call: randomForest(formula = Species ~., data = trainData, ntree = 100, proximity = TRUE) Type of random forest:classification Number of trees: 100 No. of variables tried at each split: 2 OOB estimate of error rate: 4.46% 10

Классификация Random Forest  Информация о модели >print(rf) Confusion matrix: setosa versicolor virginica

Классификация Random Forest  Альтернативная реализация >library(party) >cf <- cforest(Species ~., data=trainData, + control=cforest_unbiased(mtry=2,ntree=100)) > table(predict(cf), trainData$Species) 12

Классификация  Наиболее используемые пакеты  Decision trees: rpart, party  Random forest: randomForest, party  SVM: e1071, kernlab  Neural networks: nnet, neuralnet, RSNNS  Performance evaluation: ROCR 13

Регрессия  Подготовка данных > year <- rep(2008:2010, each=4) > quarter <- rep(1:4, 3) > cpi <- c(162.2, 164.6, 166.5, 166.0, , 167.0, 168.6, 169.5, , 172.1, 173.3, 174.0) > plot(cpi, xaxt="n", ylab="CPI", xlab="") > axis(1, labels=paste(year,quarter,sep="Q"), + at=1:12, las=3) 14

Регрессия  Подготовка данных 15

Регрессия  Корреляция > cor(year, cpi) [1] > cor(quarter, cpi) [1]

Регрессия  Линейная модель > fit <- lm(cpi ~ year + quarter) > print(fit) Call: lm(formula = cpi ~ year + quarter) Coefficients: (Intercept) year quarter

Регрессия  Линейная модель > summary(fit) 18

Регрессия  Линейная модель >residuals(fit) … 19

Регрессия  Линейная модель > data2011 <- data.frame(year=2011, + quarter=1:4) > cpi2011 <- predict(fit, newdata=data2011) > style <- c(rep(1,12), rep(2,4)) > plot(c(cpi, cpi2011), xaxt="n", ylab="CPI", + xlab="", pch=style, col=style) > axis(1, at=1:16, las=3, + labels=c(paste(year,quarter,sep="Q"), + "2011Q1", "2011Q2", "2011Q3", "2011Q4")) 20

Регрессия  Линейная модель 21

Кластеризация. K-Means  Построение модели > iris2 <- iris > iris2$Species <- NULL > kmeans.result <- kmeans(iris2, 3) > table(iris$Species, kmeans.result$cluster) setosa versicolor virginica

Кластеризация. K-Means  Визуализация > plot(iris2[c("Sepal.Length", "Sepal.Width")], + col = kmeans.result$cluster) > # plot cluster centers > points(kmeans.result$centers[,c("Sepal.Length", + "Sepal.Width")], col = 1:3, pch = 8, cex=2) 23

Кластеризация. K-Means  Визуализация 24

Кластеризация  Наиболее используемые пакеты и функции  k-means: kmeans(), kmeansruns()  k-medoids: pam(), pamk()  Hierarchical clustering: hclust(), agnes(), diana()  DBSCAN: fpc  BIRCH: birch 25

R и Hadoop  Необходимые пакеты  Rhadoop  rmr2 – реализует интерфейс MapReduce  rhdfs – взаимодействие с HDFS  Rhbase – взаимодействие с Hbase  Rhive - взаимодействие с Apache Hive 26

R и Hadoop  library(rmr2)  map <- function(k, lines) {  words.list <- strsplit(lines, "nns")  words <- unlist(words.list)  Return(keyval(words, 1))  }  reduce <- function(word, counts) {  keyval(word, sum(counts))  } 27

R и Hadoop  wordcount <- function(input, output = NULL) {  mapreduce(input = input, output = output, input.format = "text",  map = map, reduce = reduce)  }  ## Submit job  out <- wordcount(in.file.path, out.file.path) 28

References