Statistical Models and Machine Learning Algorithms


Statistical Models and Machine Learning Algorithms
B. Ramamurthy, Rich's Big Data Analytics Training, 12/1/2018

Review of Last Class
- R and RStudio
- The data science process
- Predictive analytics: linear regression, a statistical method; simple but powerful and popular (Twitter vs. CDC Ebola data)
- K-means clustering, a machine learning algorithm; application: customer segmentation
- K-NN classification, a machine learning algorithm; predicting a class through training and learning

What is pch in R? (pch selects the plotting symbol in base R graphics.)
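A minimal sketch answering the question (the output file name pch_demo.png is just illustrative): draw the 25 standard symbols so you can see what each pch value looks like.

```r
# Draw the 25 standard plotting symbols so you can see what each pch value looks like.
png("pch_demo.png")                           # plot to a file so this runs non-interactively
plot(1:25, rep(1, 25), pch = 1:25,            # one point per symbol: pch = 1 .. 25
     ylim = c(0.5, 1.5), yaxt = "n", ylab = "",
     xlab = "pch value", main = "Base R plotting symbols (pch)")
text(1:25, 1.25, labels = 1:25, cex = 0.7)    # print each pch number above its symbol
dev.off()
```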

Transforming Data into Analytics
[Diagram: data from vertical domains (e.g., food services data) and horizontal domains (social/media/web data) flows through probability, statistics and stochastic methods (randomness, uncertainty) and machine learning algorithms to produce results: decisions, diagnoses, and strategies.]
CSE651C, B. Ramamurthy, 6/28/2014

The Data Science Process (contd.)
[Diagram, seven numbered stages: 1. raw data collected; 2. data is processed; 3. data is cleaned; 4. exploratory data analysis; 5. machine learning algorithms, statistical models; 6. communication: visualization, report findings; 7. build data products, make decisions, micro-level data strategy.]

The Seven Steps
1. Collect data
2. Prepare data
3. Clean data
4. EDA
5. Visualize + understand
6. Fit models / apply algorithms + analyze + predict + understand (deeper) + visualize
7. Build data products

EDA to Data Product
Fit model → extract parameters → build data tool → deploy on the cloud → use it

Today's Lab
We will illustrate three algorithms using simple exercises. You may need to install several packages within RStudio, as and when needed. We will also explore 6 of the 7 steps in the data science process using two applications.

Plots, Graphs and Models
Plots, graphs and maps let you visualize data and the results of model fitting. Models give you parameters that you can use to:
- Get a deeper understanding of the data's characteristics
- Build data tools based on those parameters, so that you can automate the analysis
Algorithms drive the model (fitting) of your data. For example, if you guess a linear relationship in the data, use linear regression; if you want to see clusters in your data, use K-means; if you want to classify the data into classes, use K-NN.

Algorithms
Three types of algorithms:
- Data munging (also called data engineering or data wrangling): mapping raw, dirty data to a format usable by analysis tools. By some estimates (NYT), this is about 50% of a data scientist's time; start-ups exist to do this kind of work, re-packaging data and selling it.
- Optimization algorithms for parameter estimation: least squares, Newton's methods, stochastic gradient descent. R has many functions readily available to do this. Example: lsfit(x, y, ... tolerance = 1e-07, ...)
- Machine learning algorithms: used to predict, classify and cluster.

Linear Regression
Very simple and easy to understand. It expresses the relationship between two variables/attributes: you assume a linear relationship between an outcome variable (dependent variable, response variable, label) and the predictor(s) (independent variables, features, explanatory variables). Examples: company sales vs. money spent on ads; number of friends vs. time spent on social media.

Linear Regression (contd.)
y = f(x), where y = β0 + β1·x. For this revenue table you can see y = 25x:

Subscribers x  Revenue y
 5             125
10             250
15             375
20             500

How about this head(data1)? Not so obvious:

 x    y
 7   276
 3    43
 4    83
 6   136
10   417

Here you need to "fit the model".

Fitting the Model
The least squares method improves the fit. We will now look at an R exercise for linear regression and extract the parameters. A fitted model can be evaluated, then refined or updated. Many variations of linear regression exist and can be used as the need arises.
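As a sketch of the least-squares step (the data here are made up, with a true slope of 25 like the revenue table), lm() fits the line and coef() extracts the parameters:

```r
# Fit y = b0 + b1*x by least squares on noisy, roughly linear synthetic data.
set.seed(42)
x <- c(5, 10, 15, 20, 7, 3, 4, 6, 10)
y <- 25 * x + rnorm(length(x), sd = 5)   # true slope 25, plus noise
fit <- lm(y ~ x)                         # least-squares fit of y = b0 + b1*x
coef(fit)                                # estimated intercept b0 and slope b1
summary(fit)$r.squared                   # fraction of variance the line explains
```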

Multiple Regression
Regression with multiple predictors: y = β0 + β1·x1 + β2·x2 + ... + ε. We are considering a data table/frame with one dependent variable y and multiple independent variables. Model fitting using R:
model <- lm(y ~ x0 + x1 + x2)
model <- lm(y ~ x0 + x1 + x2*x3), where predictors x2 and x3 have an interaction between them.
Example: y = price of oil ~ x1 = # full oil storage tanks, x2 = # half tanks, x3 = # empty tanks.
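A hedged sketch with made-up variables and known coefficients, showing how x2*x3 in a formula expands to the main effects plus the x2:x3 interaction:

```r
# Multiple regression with an interaction term, on synthetic data with known coefficients.
set.seed(7)
x1 <- runif(100); x2 <- runif(100); x3 <- runif(100)
y  <- 1 + 2*x1 + 3*x2 + 4*x2*x3 + rnorm(100, sd = 0.1)
m  <- lm(y ~ x1 + x2*x3)     # x2*x3 means x2 + x3 + x2:x3
coef(m)                      # the x2:x3 estimate should land near its true value, 4
```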

Synthetic Data
You can generate synthetic data to explore, learn, and/or hypothesize and evaluate a certain model. The random-number generating functions are usually named "r" followed by the distribution name. Examples:
runif(10, 0.0, 100.0)       # generate 10 numbers between 0.0 and 100.0: uniform distribution
rpois(100, 56)              # generate 100 numbers with mean 56: Poisson distribution
rnorm(100, mean=56, sd=9)   # generate 100 numbers with mean 56, std. dev. 9: normal distribution

Synthetic Data (contd.)
Generate 30 random numbers in a given range: x <- sample(0:100, 30)
Non-numerical data: sts <- sample(state.name, 12)
Try this twice and observe the randomness. You can set a seed to make the pseudo-random draws reproducible: set.seed(5)
You can also save the synthetic data for later use in another experiment with: write.csv(data6, "mydata.csv", row.names=FALSE)
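The seeding and saving steps can be sketched as follows (the file name mydata.csv is from the slide; the data-frame name syn is illustrative):

```r
# Same seed => same draws; write.csv saves the synthetic data for a later experiment.
set.seed(5)
a <- sample(0:100, 30)
set.seed(5)
b <- sample(0:100, 30)
identical(a, b)                              # TRUE: reseeding reproduces the draw
syn <- data.frame(x = a, state = sample(state.name, 30))
write.csv(syn, "mydata.csv", row.names = FALSE)
```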

Machine Learning Methods
Learning: three major categories of algorithms:
- Unsupervised
- Supervised
- Semi-supervised
Another categorization: generative vs. discriminative.
Clustering: putting items together in clusters. Given a collection of items, bin them into groups with similar characteristics. Example: K-means.
Classification: label a set of items based on prior knowledge of classes. Given a collection of items, label them according to their properties. Example: K-NN (semi-supervised), which works with a training set, a test set, and unlabeled data alongside labeled data.

K-means
K-means is unsupervised: there is no prior knowledge of the "right answer". The goal of the algorithm is to discover the right answer by finding clusters in the data. Kinds of data: social data, survey data, medical data, SAT scores, customer data. Assume data {age, gender, income, state, household size}; your goal is to segment the users. Let's understand K-means using an example.

K-means Algorithm
1. Choose the number of clusters k (by some requirement).
2. Initialize the centers (centroids/means) to some values (by some strategy). Choosing the centroids can itself be a special step/algorithm; it can be random.
Then the algorithm proceeds in two steps:
3. Reassign all the points in the data to the closest centroid, creating clusters around the centroids.
4. Recalculate the centroids based on this assignment.
The reassign-recalculate process continues until there is no change in centroid values, or points stop switching clusters. The input data has to be in a table/matrix with each column representing an (influencing) feature.
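The reassign-recalculate loop above can be sketched by hand in a few lines of R (1-D case, hand-picked initial centroids; in practice you would call kmeans(), and a random initialization would also need to guard against empty clusters):

```r
# A bare-bones 1-D k-means: alternate (1) reassign points, (2) recompute centroids.
x <- c(23, 25, 24, 23, 21, 31, 32, 30, 31, 30, 37, 35, 38, 37, 39, 42, 43, 45, 43, 45)
centers <- c(21, 31, 45)                          # initial centroids, chosen by hand
repeat {
  # Step 1: assign every point to its closest centroid
  assignment <- sapply(x, function(p) which.min(abs(p - centers)))
  # Step 2: recompute each centroid as the mean of its assigned points
  new_centers <- sapply(1:3, function(k) mean(x[assignment == k]))
  if (identical(new_centers, centers)) break      # no centroid moved: converged
  centers <- new_centers
}
round(centers, 2)                                 # final cluster means
```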

Theory Behind K-means
K-means searches for the minimum sum-of-squares assignment; it minimizes unnormalized variance (= total_SS) by assigning points to cluster centers. For K-means to converge, two conditions must hold:
- Reassigning points reduces the sum of squares
- Recomputing the means reduces the sum of squares
Since there is only a finite number of assignments, this value cannot decrease forever, and the algorithm must converge at some point to a local optimum. A good measure of convergence: between_SS / total_SS.
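The convergence measure can be read straight off the kmeans() result; a sketch on deliberately well-separated synthetic data:

```r
# between_SS / total_SS close to 1 means the clusters explain most of the variance.
set.seed(1)
x  <- c(rnorm(50, mean = 0), rnorm(50, mean = 10))  # two clearly separated groups
km <- kmeans(x, centers = 2, nstart = 5)
km$betweenss / km$totss                             # should be close to 1 here
```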

K-means Theory (contd.)
The K-means algorithm is not distance based per se; it minimizes the variances to arrive at clusters. That is why, in the cluster plots (produced via discriminant/principal component analysis), the axes are labeled by the discriminating component values.

Let's Examine an Example
Attributes: {age, income range, education, skills, social, paid work}. Let's take just the age: {23, 25, 24, 23, 21, 31, 32, 30, 31, 30, 37, 35, 38, 37, 39, 42, 43, 45, 43, 45}. Cluster this data using K-means. Let's assume K = 3, i.e., 3 groups. Can you guess the centroids?

Results for K = 3

R Code
library(fpc)   # provides plotcluster()
age <- c(23, 25, 24, 23, 21, 31, 32, 30, 31, 30, 37, 35, 38, 37, 39, 42, 43, 45, 43, 45)
clust <- kmeans(age, centers=3)
plotcluster(age, clust$cluster)

Discussion on K-means
We discussed the problem with a small data set; the R code in the previous slide is the same irrespective of size. Typically, for really big data, and also for building a data tool, we take the parameters from the K-means model fitted by the R code and write a program that automates the K-means clustering. The full signature is:
kmeans(x, centers, iter.max=10, algorithm=c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
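A sketch of the full kmeans() call with an explicit algorithm choice, and the pieces of the fitted model you would carry into an automated tool (the data are synthetic):

```r
# kmeans() on a two-feature matrix; km$centers and km$cluster are the fitted parameters.
set.seed(2)
x  <- matrix(rnorm(200), ncol = 2)     # 100 rows, one column per feature
km <- kmeans(x, centers = 3, iter.max = 10,
             algorithm = "Lloyd")      # also: "Hartigan-Wong" (default), "Forgy", "MacQueen"
km$centers                             # 3 centroids, one row per cluster
head(km$cluster)                       # cluster membership of the first rows
```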

R Data Sets
Anderson's iris dataset is a very popular dataset that ships with R. It gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica (sometimes coded 1, 2 and 3, respectively).
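Because the species labels are known, iris makes a nice sanity check for K-means; a sketch clustering just the petal measurements:

```r
# Cluster the iris petal measurements into 3 groups and compare with the true species.
set.seed(9)
km <- kmeans(iris[, c("Petal.Length", "Petal.Width")], centers = 3, nstart = 10)
table(km$cluster, iris$Species)   # each cluster should line up mostly with one species
```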

K-means Application: Customer Segmentation
Clustering (and classification) are important data analysis methods for customer segmentation: understanding your customers and targeting your business practices. Example: Obama's 2008 election. Of 230M voters, 59% voted; 2.2M were between 18 and 29, and 66% of them, about 1.5M, voted for Obama. In some states, like Indiana, this accounted for the small 1% margin! An election won by computing and social media.

K-NN
Semi-supervised ML: you know the "right answers", or at least have data that is "labeled": the training set. One set of objects has been classified or labeled (the training set); another set of objects is yet to be labeled or classified (the test set). Your goal is to automate the process of labeling the test set. The intuition behind K-NN: consider the most similar items (similarity defined by their attributes), look at their existing labels, and assign the object a label accordingly.

K-NN Classification
A set of objects has been classified or labeled in some way; other similar objects with unknown class need to be labeled/classified. The goal of a classification algorithm such as K-NN is to automatically label these objects. NN stands for nearest neighbor. How do you determine closeness? Euclidean distance; color, shape or similar features.

Example 2
[Figure: points of two classes, Label1 and Label2, with an unknown item indicated by a green circle. What is its class for K = 3? For K = 5?]

Intuition (K = 3)
[Figure: with K = 3, the green circle's three nearest neighbors cast 2 votes for one label and 1 vote for the other, so the unclassified object is assigned the majority label.]

Intuition (K = 5)
[Figure: with K = 5, the same object's five nearest neighbors cast 2 votes for the first label and 3 votes for the second, so the unknown object is assigned the second label. Note that the answer can change with K.]

Discussion on K-NN
How do you select K, the number of voters/neighbors with known labels? There are three phases: training, testing, and classification/labeling of actual data. Let's generate some synthetic data and experiment with K-NN.
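A sketch of that experiment using knn() from the class package (a recommended package bundled with standard R installations); the two groups here are synthetic:

```r
# Label two unlabeled points by majority vote among their k = 3 nearest training neighbors.
library(class)
set.seed(3)
train  <- rbind(cbind(rnorm(50, 0), rnorm(50, 0)),   # class "a" scattered around (0, 0)
                cbind(rnorm(50, 5), rnorm(50, 5)))   # class "b" scattered around (5, 5)
labels <- factor(rep(c("a", "b"), each = 50))
test   <- rbind(c(0, 0), c(5, 5))                    # the unlabeled points
knn(train, test, cl = labels, k = 3)                 # predicted labels: "a" then "b"
```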

Different Values of K

K-NN Application: Credit Rating
Predict a credit rating as "good" or "bad" based on a set of known data.

R Code for the Previous Example
age <- c(25, 35, 45, 20, 37, 48, 67, 90, 85)             # good rating
income <- c(40, 60, 80, 20, 120, 90, 70, 40, 60)
plot(income ~ age, pch=17, col="red")
xn <- c(52, 23, 40, 60, 48, 33, 67, 89, 34, 45)          # bad rating
yn <- c(68, 95, 82, 100, 220, 150, 100, 120, 110, 96)
points(yn ~ xn, pch=15, col="green")
grid(nx=NULL, ny=NULL)
x <- 46                                                  # predict the rating of this customer
y <- 90
points(y ~ x, pch=8)
legend("topright", inset=.05, pch=c(17,15), c("good","bad"), col=c("red","green"), horiz=TRUE)

Applying the K-NN Process
1. Decide on your similarity or distance metric.
2. Split the original data into training and test data.
3. Pick an evaluation metric (the misclassification rate is a good one).
4. Run K-NN a few times and check the evaluation metric.
5. Optimize K by picking the value with the best evaluation measure.
6. Once K is chosen, create the new test data with no labels and predict the "class" for the test data.
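Steps 1-5 above can be sketched on the iris data (Euclidean distance, a random train/test split, misclassification rate, and a few candidate values of K):

```r
# Try several k and keep the one with the lowest misclassification rate on held-out data.
library(class)
set.seed(4)
idx      <- sample(nrow(iris), 100)               # 100 training rows, 50 test rows
train    <- iris[idx, 1:4];    test    <- iris[-idx, 1:4]
train_lb <- iris$Species[idx]; test_lb <- iris$Species[-idx]
errs <- sapply(c(1, 3, 5, 7), function(k) {
  pred <- knn(train, test, cl = train_lb, k = k)
  mean(pred != test_lb)                           # misclassification rate for this k
})
errs                                              # pick the k with the smallest error
```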

Discussion
How do you apply these methods and techniques to data problems at your work? See the exercises at the end of the lab handout. When you launch a data analytics project, here are some of the questions to ask: What are the basic variables? What are the underlying processes? What influences what? What are the predictors? What causes what? What do you want to know? (This last one needs a domain expert.)

Summary
We covered 3 of the top 10 data mining algorithms: see this paper. We studied:
- Linear regression
- The clustering algorithm K-means
- The classification algorithm K-NN
These are powerful methods for predicting and discovering intelligence. R/RStudio provides convenient functions to model your datasets with these algorithms.

References
1. N. Harris, Visualizing K-Means Clustering, http://www.naftaliharris.com/blog/visualizing-k-means-clustering/, 2014.
2. Example figure, http://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm, 2014.
3. R in a Nutshell, online version: http://web.udl.es/Biomath/Bioestadistica/R/Manuals/r_in_a_nutshell.pdf