Naïve Bayes and Logistic Regression & Classification


Naïve Bayes and Logistic Regression & Classification
B. Ramamurthy
Rich's Big Data Analytics Training, 11/10/2018

Outline
- Review of last class's methods: we will answer Jim's question about K-means, cover more details on K-means and its issues, and introduce Partitioning Around Medoids (PAM), working through the data science process using World Bank data.
- Supervised machine learning approaches: Naïve Bayes and logistic regression, discussed with several applications.
- A high-level review of the concept of "classification" and its relevance to business intelligence.
- An introduction to the Shiny package of R for building web applications.
- Data strategy recommendations along the way.
This is the most code-intensive session of all; we are done with R by the end of this session.

K-means: Issues
- A popular clustering method that groups data using a distance measure.
- Clusters form around centers that are the means of the clusters; a center need not be a data point!
- As you observed, the clusters may not be unique between runs: because the analysis starts with k randomly chosen centroids, a different solution can be obtained each time the function is invoked. Use the set.seed() function to make the results reproducible.
- K-means is also sensitive to the initial selection of centroids. The kmeans() function has an nstart option that tries multiple initial configurations and reports the best one; for example, nstart=25 generates 25 initial configurations. This approach is often recommended.

K-means Solution
library(fpc)    # provides plotcluster()
#set.seed(100)
age <- c(23, 25, 24, 23, 21, 31, 32, 30, 31, 30, 37, 35, 38, 37, 39, 42, 43, 45, 43, 45)
#clust <- kmeans(age, centers=3)
clust <- kmeans(age, centers=3, nstart=25)   # try 25 random starts and keep the best
plotcluster(age, clust$cluster)
clust
# try this with the set.seed() and nstart approach
# you should see the same cluster centers and clusters on every run

Categorical Data
K-means clustering does not work with categorical data.
Example: cluster the countries of the world into categories determined by many attributes. This World Bank data contains numerical information as well as categorical data such as income levels, regions, etc. We will work through a complete example [1].
The outcome of this exercise is the countries grouped into 12 clusters, determined by the combination of various economic indicators. Observe and study the clusters for different years (2011, 2013).
On to the PAM algorithm details.

PAM [4]
1. Initialize: randomly select (without replacement) k of the n data points as the medoids.
2. Associate each data point with the closest medoid ("closest" is defined using any valid distance metric, most commonly Euclidean, Manhattan, or Minkowski distance).
3. For each medoid m and each non-medoid data point o: swap m and o and compute the total cost of the configuration.
4. Select the configuration with the lowest cost.
5. Repeat steps 2 to 4 until there is no change in the medoids.
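As a minimal illustration (not the full World Bank exercise that follows), the pam() function in R's cluster package implements this algorithm directly. The tiny data frame below is hypothetical, and daisy() with metric = "gower" is one common way to get a dissimilarity matrix for mixed numeric/categorical data:

library(cluster)
# toy mixed data: one numeric column, one categorical column (hypothetical example)
df <- data.frame(gdp_growth = c(2.1, 0.5, 7.3, 6.8, 1.9, 3.4),
                 income     = factor(c("High", "High", "Low", "Low", "Middle", "Middle")))
d   <- daisy(df, metric = "gower")   # Gower dissimilarity handles mixed data types
fit <- pam(d, k = 3)                 # PAM with 3 medoids
fit$clustering                       # cluster assignment for each row
fit$medoids                          # which observations were chosen as medoids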

Exercise 1: World Bank Data
For this exercise we will work with World Bank data available through the WDI package. The data has already been downloaded and is in the data folder of today's zip file.
Our goal is to form 12 clusters of countries based on factors such as income level, lending level, region, etc. We will spend a lot of time cleaning and filtering the data before we do a one-liner PAM clustering.
Data strategy: designate a team member as the data wrangler who will "tame" the data into a form that can be processed easily.
Clustering results in a model; a simple plot of the clusters is too complex to read or to use for visual communication and discussion.

World Bank Data
Since we are dealing with country information, we can map the clusters onto a world map. We get the world map from another World Bank data source; the files are included in your zip folder.
We will build a data frame from the clusters and plot it with ggplot on a world map for a quick, engaging display of the clusters (a rough sketch of the idea follows below). ggplot2 is a highly useful and popular plotting/graphing package.
Once the R script is developed for one year (say 2011), you can reuse it for any other year just by changing the year parameter.
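The exercise's own script ships in the zip file; the fragment below is only a sketch of the plotting idea, assuming a hypothetical data frame clusters_df with columns country and cluster, and using the world polygons that ggplot2's map_data() helper provides (it requires the maps package to be installed):

library(ggplot2)
# clusters_df is assumed to hold one row per country with its PAM cluster label (hypothetical)
clusters_df <- data.frame(country = c("France", "India", "Brazil"), cluster = c(1, 2, 3))
world <- map_data("world")                                   # polygons for every country
world$cluster <- clusters_df$cluster[match(world$region, clusters_df$country)]
ggplot(world, aes(x = long, y = lat, group = group, fill = factor(cluster))) +
  geom_polygon(colour = "grey70") +
  labs(fill = "Cluster", title = "Countries coloured by PAM cluster")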

Clustering Countries by World Bank Data (maps of the clusters for 2011 and 2013)

More on Classification
Classification is placing things where they belong in order to discover patterns, such as like-minded people or customers with similar tastes.
Classification relies on a priori reference structures that divide the space of all possible data points into a set of non-overlapping classes.

Classification Examples in Daily Life
- Restaurant menu: appetizers, salads, soups, entrées, desserts, drinks, ...
- The Library of Congress (LOC) system classifies books according to a standard scheme.
- Injuries and diseases are classified by physicians and healthcare workers.
- Classification of all living things, e.g., Homo sapiens (genus, species).
- Classification of products by UPC code or some such attribute.

Categories of Classification Algorithms
With respect to the underlying technique, there are two broad categories:
- Statistical algorithms: regression for forecasting; Bayes classifiers, which capture the dependency of the various attributes of the classification problem.
- Structural algorithms: rule-based algorithms (if-else, decision trees), distance-based algorithms (similarity, nearest neighbor), and neural networks.

Classifiers

Life Cycle of a Classifier: training, testing and production

Training Stage
Provide the classifier with data points for which we have already assigned an appropriate class. The purpose of this stage is to determine the parameters of the model.

Validation Stage
In the testing or validation stage we validate the classifier to ensure the credibility of its results. The primary goal of this stage is to determine the classification errors. The quality of the results should be evaluated using various metrics.
The training and testing stages may be repeated several times before a classifier transitions to the production stage.

Production Stage
The classifier is used here in a live production system. It is possible to enhance the production results by allowing human-in-the-loop feedback.
The three stages are repeated as we get more data from the production system.
Data strategy: configuring these three stages will be the responsibility of a team member who is a domain expert, knowledgeable about the data being classified, the classes needed, etc.

Advantages and Disadvantages
Distance-based classifiers work well in low-dimensional spaces: one or two features. How about classifying a data set with a large number of features? Chapter 4 discusses two methods for that: Naïve Bayes and logistic regression.

Naïve Bayes
The Naïve Bayes classifier is one of the most celebrated and well-known classification algorithms of all time.
- A probabilistic algorithm.
- Typically applied under the assumption of independent attributes, and works well there, but has also been found to work well even with some dependencies.
- Handles multiple features (think of these as columns in your relational table).

Overview [2]
Two classes: binary classification. Our goal is to learn how to correctly classify items into one of two classes: yes or no, 0 or 1, will click or not click, will buy this product or not, recommend or not, good or bad.
First step: devise a model of the function f; here we use Naïve Bayes or logistic regression. Given this model f, classify the data into the two classes {0, 1}: if f(x) = p is the estimated probability of belonging to a class, a typical rule assigns that class when p > 0.5 and the other class otherwise (a minimal sketch appears below).
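A minimal sketch of that thresholding rule in R (the 0.5 cutoff and the example probabilities are illustrative, not part of the chapter):

# p: estimated probabilities of one class from any model (Naïve Bayes, logistic regression, ...)
classify <- function(p, threshold = 0.5) ifelse(p > threshold, 1, 0)
classify(c(0.10, 0.49, 0.51, 0.93))   # returns 0 0 1 1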

Bayesian Inference
Intuition: H = hypothesis, E = evidence.
P(H|E) = P(E|H) * P(H) / P(E)
Posterior probability is proportional to likelihood times prior: P(H|E) is the posterior, P(E|H) the likelihood, and P(H) the prior.
Can be extended to multiple features.

Example 1 for Naïve Bayes
A rare disease has 1% probability (the prior). We have a highly sensitive and specific test: 99% positive for sick patients, 99% negative for non-sick patients. If a patient tests positive, what is the probability that he/she is sick?
Approach: let "sick" denote that the patient is sick and "+" denote a positive test.
P(sick|+) = P(+|sick) P(sick) / P(+) = 0.99 * 0.01 / (0.99 * 0.01 + 0.01 * 0.99) = 0.0099 / 0.0198 = 1/2 = 0.5
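A quick R check of the same arithmetic, using only the numbers from the example:

p_sick        <- 0.01        # prior: 1% of patients have the disease
p_pos_sick    <- 0.99        # sensitivity: P(+|sick)
p_pos_healthy <- 1 - 0.99    # false positive rate: P(+|not sick)
p_pos <- p_pos_sick * p_sick + p_pos_healthy * (1 - p_sick)   # total probability of testing +
p_pos_sick * p_sick / p_pos  # posterior P(sick|+) = 0.5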

Example 2 for NB: a very popular and common use
Enron emails: 1500 spam emails, 3672 good emails. For a given word, "meeting", it is known that spam emails contain the word 16 times and good emails contain the word 153 times.
"Learn" from this by applying the Bayes rule. Now you receive an email containing the word "meeting". What is the probability that this email is spam? At what probability would you classify it as spam?

Classification Review
Training set → design a model. Test set → validate the model. Then classify the data set using the model.
The goal of classification is to label the items in the set with one of the given/known classes. For spam filtering it is a binary class: spam or not spam (good).

Why not use the methods in Chapter 3?
- Linear regression is about continuous variables, not binary classification.
- K-means/PAM are for clustering, where there is no prior information about classes.
- K-NN cannot accommodate many features: the curse of dimensionality (K-NN performs well for a few dimensions). For spam classification, 1 distinct word → 1 feature, so 10,000 words → 10,000 features!
So what are we going to use? Naïve Bayes.

Spam Filter for Individual Words
Classifying mail into spam and not spam: binary classification. Let's say we get a mail saying "you have won a lottery": right away you know it is spam. We will assume that if a single word qualifies as spam, then the email is spam.

Further Discussion
Let's call good emails "good", so P(good) = 1 - P(spam).
P(word) = P(word|spam) P(spam) + P(word|good) P(good)
P(spam|word) = P(word|spam) P(spam) / P(word)

Sample Data
Enron data: https://www.cs.cmu.edu/~enron (Enron employee emails). A small subset is chosen for EDA: 1500 spam, 3672 ham.
The test word is "meeting"; that is, your goal is to label an email containing the word "meeting" as spam or good (not spam). What is your intuition? Now prove it using Bayes.

Calculations
P(spam) = 1500 / (1500 + 3672) = 0.29
P(ham) = 0.71
P(meeting|spam) = 16 / 1500 = 0.0106
P(meeting|ham) = 153 / 3672 = 0.0416
P(meeting) = P(meeting|spam) P(spam) + P(meeting|ham) P(ham) = 0.0106 * 0.29 + 0.0416 * 0.71 = 0.0326
P(spam|meeting) = P(meeting|spam) P(spam) / P(meeting) = 0.0106 * 0.29 / 0.0326 = 0.094 → 9.4%
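The same calculation in R, using only the counts given on the slide:

spam_total <- 1500; ham_total <- 3672          # emails in each class
meeting_in_spam <- 16; meeting_in_ham <- 153   # occurrences of the word "meeting"
p_spam <- spam_total / (spam_total + ham_total)
p_ham  <- 1 - p_spam
p_word_given_spam <- meeting_in_spam / spam_total
p_word_given_ham  <- meeting_in_ham / ham_total
p_word <- p_word_given_spam * p_spam + p_word_given_ham * p_ham
p_word_given_spam * p_spam / p_word            # P(spam | "meeting") ≈ 0.094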

Discussion
On to the demo in R: see the lab handout, Exercise 2.
The Bayesian analysis determined that an email with the word "meeting" is spam with 9.4% probability. What is your data strategy: what is the threshold for qualifying as spam? At UB it is 50%; your strategy could be a little more relaxed at 60%, or more stringent at 40%, meaning that when an email classifies as spam with 40% probability you intercept it and throw it into the spam folder.
The single-word approach can easily be extended to a multi-word, phrase-based Bayesian classification.

UB Strategy
Quarantined spam: incoming email messages that are 50% - 98% likely to be spam are held in your Blocked Messages folder for 28 days before the spam quarantine service automatically deletes them. Quarantined spam is not delivered to your mailbox, so it does not count toward your quota.
Discarded spam: incoming email messages that are 99% - 100% likely to be spam are automatically deleted. Outgoing email messages (either generated at UB or forwarded through UB) that are 80% - 99% likely to be spam are automatically deleted before reaching their destinations.

Exercise 3: Predicting the Behavior of Our Congressional Representatives
We have studied the Naïve Bayes rule and its application to spam filtering in emails. Work through and understand the examples discussed in class: the disease example and the spam filter.
Now let's look at an example using data on congressional votes on several issues (the data is from 1984, but nothing has changed!). The model we develop could be applied to any data that conforms to this template. Once again the data is readily available in a package called mlbench (for machine learning benchmarks).

Predicting Behavior Using Naïve Bayes
We use an existing record of congressional votes to build a model. The goal is to label a voting record as belonging to a Democrat or a Republican. Of course, we need to clean up/reframe the data first. We then compare the predicted classification (two classes: Democrat or Republican) to the actual class.
Next we take a sample of synthetic data containing a secret ballot and guess the voter's class. Is it still a secret ballot when machines can learn who you are?
We also plot (as a histogram) the voting record on two arbitrary issues, V10 and V11 (missiles, immigration), with actual classes and predicted classes (n:n, y:n, n:y, y:y). We can also do other, more complex data analytics to understand how they voted.
On to the demo/exercise; a rough sketch of the model-building step is shown below.
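A minimal sketch of how such a model might be fit in R, assuming the HouseVotes84 data set from mlbench and the naiveBayes() function from the e1071 package (the actual lab script may differ):

library(mlbench); library(e1071)
data(HouseVotes84, package = "mlbench")              # 435 voting records, 16 issues, Class = party
model <- naiveBayes(Class ~ ., data = HouseVotes84)  # fit Naïve Bayes: party given the 16 votes
pred  <- predict(model, HouseVotes84)                # predicted party for each record
table(predicted = pred, actual = HouseVotes84$Class) # confusion matrix: predicted vs. actual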

Logistic Regression
What is it? It is an approach for calculating the odds of an event happening vs. the other possibilities; the odds ratio is an important concept, and we will discuss it with examples.
Why are we studying it? To use it for classification. It is a discriminative classification scheme, vs. Naïve Bayes' generative classification scheme.
Linear regression handles continuous outcomes; logistic regression handles categorical ones: the logit function bridges this gap.
According to experts [3], logistic regression classification has better error rates than Naïve Bayes in certain situations (e.g., large data sets, in the context of big data).

Logistic Regression
Predict:
- whether a patient has a given disease (we did this using Bayes): binary classification using a variety of data such as age, gender, BMI, blood tests, etc.
- whether a person will vote Democratic or Republican
- the odds of a failure (or success) of a process, system, or product
- a customer's propensity to purchase a product: they bought products {A, X, Y} and did not buy {B, C}; will they buy D, yes or no?
- the odds of a person staying in the workforce
- the odds of a homeowner defaulting on a loan

Basics
The basic function is the logit → logistic regression.
Definition: logit(p) = log(p / (1 - p)) = log(p) - log(1 - p)
The logit function takes values p in the range [0, 1] and transforms them to values along the entire real line. The inverse logit does the reverse: it takes a value along the real line and transforms it into the range [0, 1].
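In base R, qlogis() and plogis() implement the logit and inverse logit, so you can see the mapping directly (the probabilities below are just sample values):

p <- c(0.1, 0.5, 0.9)
qlogis(p)                              # logit: log(p / (1 - p)) -> -2.197, 0, 2.197
plogis(qlogis(p))                      # inverse logit maps back into [0, 1]: 0.1, 0.5, 0.9
curve(plogis(x), from = -6, to = 6)    # the familiar S-shaped (sigmoid) curve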

Demo in R
Do an exploration (EDA) of the data; observe whether the outcome follows a sigmoid (S-shaped) curve; fit the logistic regression model; use the fit/plot to classify.
Exercise 4: We have collected data about brand recognition. Our sample of subjects is in the age group 19-30, and they answer "yes" or "no" to a question (somewhat like a soda taste test). The data gives <age, # of yes, # of subjects of that age>: one set of data, R1, before a marketing campaign and another set, R2, after the campaign (pre and post). You will see repeated entries for an age, since the data was collected from several places.
We have two regression curves. Which one is better? What is your interpretation? This is for a small data set of 25; how about big data? The model can be replicated for big data too.

Plot: Pre and Post

R Code
data1 <- read.csv(file.choose(), header = TRUE)     # pick the brand-recognition CSV interactively
summary(data1)
head(data1)
# successes (R2 = "yes" after the campaign) and failures (Total - R2) modeled against Age
glm.out <- glm(cbind(R2, Total - R2) ~ Age, family = binomial(logit), data = data1)
plot(R2 / Total ~ Age, data = data1)                 # observed proportion of "yes" by age
lines(data1$Age, glm.out$fitted, col = "red")        # fitted logistic curve
title(main = "Brand Recognition Data: Logistic Regression Line")
grid(nx = NULL, ny = NULL)
summary(glm.out)

Understanding Probability vs. Odds
odds = p / (1 - p); for example, 0.8 / (1 - 0.8) = 0.8 / 0.2 = 4.
Probability   Odds
0.001         0.001001
0.01          0.010101  (about 1:100)
0.5           1         (1:1)
0.6           1.5       (1.5:1, or 3:2)
0.8           4         (4:1)
0.9           9         (9:1)
0.9999        9999      (9999:1)
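The whole table can be reproduced with a couple of lines of R:

p <- c(0.001, 0.01, 0.5, 0.6, 0.8, 0.9, 0.9999)
round(p / (1 - p), 6)    # odds for each probability: 0.001001, 0.010101, 1, 1.5, 4, 9, 9999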

Odds Ratio
Example from a 4/16/2014 news article: "Woods is still favored to win the U.S. Open. He and Rory McIlroy are each 10/1 favorites on the online betting site Bovada. Adam Scott has the next best odds at 12/1..."
How to interpret this?
P(Tiger Woods will not win) / P(Tiger Woods will win) = 10/1
P(Rory McIlroy will not win) / P(Rory McIlroy will win) = 10/1
P(Adam Scott will not win) / P(Adam Scott will win) = 12/1
"Woods is also the favorite to win the Open Championship at Hoylake in July. He's 7/1 there."
P(Tiger Woods will not win) / P(Tiger Woods will win) = 7/1

Multiple Features
The training data has columns click, url1, url2, url3, url4, url5, where click is 1 if the user clicked. We are interested in finding out whether the user will click or not click: predict based on the training data.
Fit the model using the command
fit1 <- glm(click ~ url1 + url2 + url3 + url4 + url5, data = train, family = binomial(logit))
It will give you a probability that can then be used to predict/classify, as shown below.
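For instance (a hedged sketch, assuming train and newdata are data frames with those column names), the fitted model can be turned into class labels like this:

p_click <- predict(fit1, newdata = newdata, type = "response")   # probability of a click for new users
predicted_class <- ifelse(p_click > 0.5, 1, 0)                   # 0.5 cutoff; pick one that fits your data strategy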

How to Select Your Classifier
- If you have a continuous outcome variable, use linear regression.
- If you have discrete outcome values (yes/no data), you may use logistic regression.
- If you don't have much information about the classes in your data, use clustering: K-means for numeric data, PAM for categorical data.
- If you have information about the classes (training and test data), use K-NN for one or two features and Bayesian classification for many features.

Shiny Package of R
The Shiny package allows you to develop web applications using R scripts. A Shiny-based web application has two major components: ui.R and server.R. The UI specifies the user input layout, the components, and the variables for transporting data between the UI and the server. The server takes the variable values from the UI, uses R's capabilities (packages, commands) to compute the results, and displays the results on the UI. See http://shiny.rstudio.com/ for some amazing examples.
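A minimal sketch of such an app (a generic histogram demo, not one of the course exercises; in recent Shiny versions the ui and server parts can also live together in a single app.R file):

library(shiny)

ui <- fluidPage(                                   # ui.R part: layout and inputs
  sliderInput("n", "Sample size", min = 10, max = 1000, value = 100),
  plotOutput("hist")
)

server <- function(input, output) {                # server.R part: computation
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = paste("Histogram of", input$n, "random values"))
  })
}

shinyApp(ui = ui, server = server)                 # launch the app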

Summary
- We studied clustering categorical data with PAM: e.g., customer segmentation for targeted marketing.
- We discussed the acclaimed Naïve Bayes and its application to classification: a robust approach used for predicting diseases, text classification (good/bad chatter), and sentiment analysis.
- We also discussed logistic regression: used in recommendation systems and in understanding what factors influence the sale of a product.
- Data strategy: identify the major functions associated with data analytics and match each to a team member.

References
[1] J. P. Lander. R for Everyone: Advanced Analytics and Graphics. Addison-Wesley, 2014.
[2] M. Hauskrecht. Supervised Learning, CS2710, University of Pittsburgh, 2014.
[3] A. Ng and M. Jordan. On discriminative vs. generative classifiers: A comparison of logistic regression and naïve Bayes. NIPS, 2001.
[4] Sergios Theodoridis and Konstantinos Koutroumbas. Pattern Recognition, 3rd ed., 2006, p. 635.