Lecture Data Mining in R 732A44 Programming in R

Logistic regression: two classes
Consider a logistic model with one predictor: X = price of the car, Y = equipment.
Logistic model: P(Y = 1 | X) = exp(a + bX) / (1 + exp(a + bX))
Use function glm(formula, family, data)
– Formula: Response ~ Model, where the model terms can be a+b (addition), a:b (interaction), a*b (addition and interaction), or . (all predictors)
– Family: specify binomial

Logistic regression: two classes
reg<-glm(X3...Equipment~Price.in.SEK., family=binomial, data=mydata);
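A minimal sketch of the full two-class workflow, assuming mydata is the car data frame with the column names used above; the 0.5 cutoff is an illustrative choice, not something prescribed on the slide:
reg <- glm(X3...Equipment ~ Price.in.SEK., family = binomial, data = mydata)
summary(reg)                          # coefficients and their significance
p <- predict(reg, type = "response")  # fitted probabilities P(Y = 1 | X)
pred <- ifelse(p > 0.5, 1, 0)         # classify using an illustrative 0.5 cutoff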

Logistic regression: several predictors
Example: data about contraceptive use.
– Response: a matrix of successes/failures
– Several diagnostic plots can be obtained by plot(lrfit)
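A hedged sketch of the matrix-response form mentioned above; the data frame cuse and its columns (using = number of successes, notUsing = number of failures, education, wantsMore) are hypothetical names for the contraceptive-use data:
# hypothetical aggregated data: one row per covariate combination
lrfit <- glm(cbind(using, notUsing) ~ education + wantsMore,
             family = binomial, data = cuse)  # response given as a success/failure matrix
summary(lrfit)
plot(lrfit)                                   # the diagnostic plots mentioned above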

Logistic regression: further comments
– Nominal logistic regression: library mlogit, function mlogit()
– Stepwise model selection: step() function
– Prediction: predict() function (see the sketch below)
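A small sketch of stepwise selection and prediction, reusing the model reg fitted earlier; step() performs AIC-based stepwise selection by default:
reg2 <- step(reg)                                   # AIC-based stepwise model selection
predict(reg2, newdata = mydata, type = "response")  # predicted probabilities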

Smoothing splines
Minimize a penalized sum of squared residuals
    Σ_i (y_i − f(x_i))^2 + λ ∫ (f''(t))^2 dt,
where λ is the smoothing parameter.
– λ = 0: any function interpolating the data
– λ = +∞: the least-squares line fit

Smoothing splines
smooth.spline(x, y, df, spar, cv, …)
– df: degrees of freedom
– spar: penalty (smoothing) parameter
– cv: TRUE = ordinary leave-one-out CV, FALSE = generalized cross-validation (GCV), NA = no cross-validation
plot(m2$Kilometer, m2$Price, main="df=40");
res<-smooth.spline(m2$Kilometer, m2$Price, df=40);
lines(res, col="blue");
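A short sketch letting smooth.spline() pick the smoothing parameter itself via GCV instead of fixing df; the plot title and colour are arbitrary choices:
res_gcv <- smooth.spline(m2$Kilometer, m2$Price, cv = FALSE)  # cv = FALSE means GCV
plot(m2$Kilometer, m2$Price, main = "GCV-chosen smoothing");
lines(res_gcv, col = "red");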

Generalized additive models
A function of the expected response is additive in the set of inputs, i.e.
    g(E[Y]) = α + f_1(x_1) + … + f_p(x_p)
Example: nonlinear logistic regression of a binary response.

GAM
Library: mgcv
gam(formula, family=gaussian, data, method="GCV.Cp", select=FALSE, sp)
– formula: usual terms and spline terms s(…)
– method: method for selection of smoothing parameters
– select: TRUE – variable selection is performed
– sp: supplied smoothing parameters
predict.gam() can be used for predictions.
Example: car properties
bp<-gam(MPG~s(WT, sp=2)+s(SP, sp=1), data=m3)
vis.gam(bp, theta=10, phi=30);
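A hedged sketch of prediction from the fitted GAM bp; newcars is a hypothetical data frame with the same column names WT and SP used in the model, and its values are made up for illustration:
newcars <- data.frame(WT = c(25, 30), SP = c(100, 110))  # hypothetical new cars
predict(bp, newdata = newcars)                           # predicted MPG for the new cars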

GAM
Smoothing components: plot(bp, pages=1)

Principal components analysis
Idea: introduce a new coordinate system (PC1, PC2, …) where
– the first principal component (PC1) is the direction that maximizes the variance of the projected data,
– the second principal component (PC2) is the direction that maximizes the variance of the projected data after the variation along PC1 has been removed,
– …
In the new coordinate system, the coefficients corresponding to the last principal components are very small, so these columns can be removed.

Principal components analysis
princomp(x, ...)
m4<-m3;
m4$MODEL<-c();
res<-princomp(m4);
loadings(res);
plot(res);
biplot(res);
summary(res);
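A sketch of the same analysis with prcomp(), an alternative to princomp() that makes it easy to standardize the variables first; standardizing is an assumption here (not done on the slide), and it requires the remaining columns of m4 to be numeric:
res2 <- prcomp(m4, scale. = TRUE)  # scale. = TRUE standardizes each variable
summary(res2)                      # proportion of variance explained per component
biplot(res2)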

Decision trees
[Figure: example decision tree splitting the input space on X1 and X2 using thresholds such as 9, 16, 7 and 15]

Regression tree example [figure]

Training-validation-test
Training-validation split (60/40):
sub <- sample(nrow(m2), floor(nrow(m2) * 0.6))
training <- m2[sub, ]
validation <- m2[-sub, ]
If a training-validation-test split is required, use a similar strategy (see the sketch below).
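A sketch of a three-way split following the same idea; the 60/20/20 proportions are chosen for illustration:
n <- nrow(m2)
idx <- sample(n)                                              # random permutation of the rows
training <- m2[idx[1:floor(0.6 * n)], ]
validation <- m2[idx[(floor(0.6 * n) + 1):floor(0.8 * n)], ]
test <- m2[idx[(floor(0.8 * n) + 1):n], ]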

Decision trees by CART
Growing a full tree: library "tree". Create the tree with
tree(formula, data, subset, split = c("deviance", "gini"), …)
– subset: if only a subset of cases is to be used for training
– split: splitting criterion
– more parameters via the control argument
Prune the tree with the help of a validation set:
prune.tree(tree, newdata, method = c("deviance", "misclass"), …)
Prune the tree with cross-validation:
cv.tree(object, FUN = prune.tree, K = 10, ...)
– K: number of folds in the cross-validation

Classification trees: CART
Example: olive oils in Italy
sub <- sample(nrow(m5), floor(nrow(m5) * 0.6))
training <- m5[sub, ]
validation <- m5[-sub, ]
mytree<-tree(Area~.-Region-X, data=training);
summary(mytree)
plot(mytree, type="uniform");
text(mytree, cex=0.5);

Classification trees: CART
Dependence of the misclassification rate on the size of the tree:
treeseq1<-prune.tree(mytree, newdata=validation, method="misclass")
plot(treeseq1); title("Validation");
treeseq2<-cv.tree(mytree, method="misclass")
plot(treeseq2); title("CV");

Regression trees: CART
mytree2<-tree(eicosenoic~linoleic+linolenic+palmitic+palmitoleic, data=training);
mytree3<-prune.tree(mytree2, best=4)  # keep 4 leaves in total
print(mytree3)
summary(mytree3)
plot(mytree3)
text(mytree3)

Decision trees: other techniques
Conditional inference trees. Library: party
training$X<-c(); training$Area<-c();
mytree4<-ctree(Region~., data=training);
print(mytree4)
plot(mytree4, type="simple");  # gives nice plots
CART is also available in another library, "rpart" (see the sketch below).
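A minimal sketch of the rpart interface, reusing the formula from the ctree() call above (after X and Area have been dropped from training); coercing Region to a factor and method = "class" are choices made here to obtain a classification tree:
library(rpart)
mytree5 <- rpart(as.factor(Region)~., data=training, method="class")  # Region treated as a class label
printcp(mytree5)                       # cross-validated error for each subtree size
plot(mytree5); text(mytree5, cex=0.5)  # plot the tree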

Neural network
– Input nodes, input layer
– [Hidden nodes, hidden layer(s)]
– Output nodes, output layer
– Weights
– Activation functions
– Combination functions
[Figure: feed-forward network with inputs x1, …, xp, hidden units z1, …, zM, and outputs f1, …, fK]

Neural networks
Feed-forward NNs. Library: neuralnet
neuralnet(formula, data, hidden = 1, rep = 1, startweights = NULL, algorithm = "rprop+", err.fct = "sse", act.fct = "logistic", linear.output = TRUE, …)
– hidden: vector giving the number of hidden neurons in each layer
– rep: number of repetitions of the network training
– startweights: starting weights
– algorithm: "backprop", "rprop+", "sag", "slr"
– err.fct: any function, or "sse" or "ce" (cross-entropy)
– act.fct: any function, or "logistic" or "tanh"
– linear.output: TRUE if no activation at the output
confidence.interval(x, alpha = 0.05): confidence intervals for the weights
compute(x, covariate): prediction
plot(x, …): plot the given neural network

Neural networks
Example:
mynet<-neuralnet(Region~eicosenoic+linoleic+linolenic+palmitic, data=training, rep=5, hidden=c(2,2), act.fct="tanh")
plot(mynet);
mynet$result.matrix

Neural networks
Prediction with compute() (see the sketch below).
Finding the misclassification rate: table(true_values, predicted_values) – not only for neural networks.
Another package, nnet, is ready for a qualitative response (the classical nnet):
mynet1<-nnet(Region~eicosenoic+linoleic, data=training, size=3);
coef(mynet1)
predict(mynet1, newdata=validation);
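A hedged sketch of prediction with compute() for the neuralnet model mynet fitted earlier, followed by a misclassification table; rounding the numeric network output to the nearest class is a simplification that assumes Region is coded numerically:
covars <- validation[, c("eicosenoic", "linoleic", "linolenic", "palmitic")]
out <- compute(mynet, covars)    # forward pass through the fitted network
pred <- round(out$net.result)    # round the numeric output to the nearest class (simplification)
table(validation$Region, pred)   # rows: true values, columns: predictions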

Clustering
The purpose is to identify (separated) groups of observations in the input space.
– K-means
– Hierarchical
– Density-based

K-means
The number of clusters K must be given; starting seed positions are needed.
kmeans(x, centers, iter.max = 10, nstart = 1)
– x: data frame
– centers: either the value of K or a set of initial cluster centers
– iter.max: maximum number of iterations
res<-kmeans(data.frame(m5$linoleic, m5$eicosenoic), 2);
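A small sketch making the k-means result less dependent on the random starting seeds; the seed value and nstart = 20 are arbitrary choices:
set.seed(12345)                                                     # for reproducibility
res <- kmeans(data.frame(m5$linoleic, m5$eicosenoic), centers = 2, nstart = 20)
res$cluster                                                         # cluster assignment per observation
res$centers                                                         # final cluster centers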

K-means
One way to visualize:
plot(m5$linoleic, m5$eicosenoic, col=res$cluster);
points(res$centers[,1], res$centers[,2], col = 1:2, pch = 8, cex=2)

Hierarchical clustering
Agglomerative:
– place each point in its own cluster
– merge the nearest clusters until only 1 cluster remains
What does "two objects are close" mean?
– Measure of proximity (e.g., Euclidean distance for quantitative variables)
– Similarity measure s_rs (= 1 if the same object, < 1 otherwise), e.g., correlation
– Dissimilarity measure δ_rs (= 0 if the same object, > 0 otherwise), e.g., Euclidean distance

Hierarchical clustering
hclust(d, method = "complete", members = NULL)
– d: dissimilarity structure (e.g., produced by dist())
– method: "ward", "single", "complete", "average", "mcquitty", "median" or "centroid"
Returned: a tree showing the merging sequence.
cutree(tree, k = NULL, h = NULL)
– k: number of clusters to make
– h: at which height to cut
Returned: cluster indices.

Hierarchical clustering
Example:
x<-data.frame(m5$linolenic, m5$eicosenoic);
m5_dist<-dist(x);
m5_dend<-hclust(m5_dist, method="complete")
plot(m5_dend);

Hierarchical clustering
Example (continued). Do NOT forget to standardize (see the sketch below)!
clust=cutree(m5_dend, k=2);
plot(m5$linoleic, m5$eicosenoic, col=clust);
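A sketch of the standardization step the slide warns about, assuming scale() (centering and scaling each column) is an acceptable choice of standardization:
xs <- scale(x)                                  # center and scale each variable
m5_dist <- dist(xs)
m5_dend <- hclust(m5_dist, method="complete")
clust <- cutree(m5_dend, k=2)
plot(m5$linoleic, m5$eicosenoic, col=clust);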

Density-based clustering
Kernel-based density estimation. Library: pdfCluster
pdfCluster(x, h = h.norm(x), hmult = 0.75, …)
– x: data to be partitioned
– h: a vector of smoothing parameters
– hmult: shrinkage factor
x<-data.frame(m5$linolenic, m5$eicosenoic);
res<-pdfCluster(x);
plot(res)

Reference
refcard-data-mining.pdf