CH5 Data Mining Classification Prepared By Dr. Maher Abuhamdeh


Classification: Definition. Given a collection of records (a training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.

Classification vs. Prediction. Classification predicts categorical class labels: it constructs a model from the training set and the values (class labels) of a classifying attribute, and uses that model to classify new data. Prediction models continuous-valued functions, i.e., it predicts unknown or missing numeric values.

Classification Example (figure): a training set with categorical, continuous, and class attributes is used to learn a classifier; the resulting model is then applied to the test set.

Classification: Application 1 Direct Marketing Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product. Approach: Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute. Collect various demographic, lifestyle, and company-interaction related information about all such customers. Type of business, where they stay, how much they earn, etc. Use this information as input attributes to learn a classifier model.

Classification: Application 2 Fraud Detection Goal: Predict fraudulent cases in credit card transactions. Approach: Use credit card transactions and the information on the account holder as attributes. When does a customer buy, what does he buy, how often does he pay on time, etc. Label past transactions as fraud or fair transactions. This forms the class attribute. Learn a model for the class of the transactions. Use this model to detect fraud by observing credit card transactions on an account.

Classification: Application 3 Customer Attrition/Churn: Goal: To predict whether a customer is likely to be lost to a competitor. Approach: Use detailed records of transactions with each of the past and present customers to find attributes. How often the customer calls, where he calls, what time of the day he calls most, his financial status, marital status, etc. Label the customers as loyal or disloyal. Find a model for loyalty. From [Berry & Linoff] Data Mining Techniques, 1997

Classification: Application 4 Sky Survey Cataloging Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory). 3000 images with 23,040 x 23,040 pixels per image. Approach: Segment the image. Measure image attributes (features) - 40 of them per object. Model the class based on these features. Success Story: Could find 16 new high red-shift quasars, some of the farthest objects that are difficult to find! From [Fayyad, et.al.] Advances in Knowledge Discovery and Data Mining, 1996

Classifying Galaxies (figure, courtesy: http://aps.umn.edu). Class: stage of formation (Early, Intermediate, Late). Attributes: image features, characteristics of the light waves received, etc. Data size: 72 million stars, 20 million galaxies; object catalog: 9 GB; image database: 150 GB.

Illustrating Classification Task

KNN: K-Nearest Neighbor method. Given a new test query q, we find the k closest training queries to it in terms of Euclidean distance, train a local ranking model online using those neighboring training queries (e.g., Ranking SVM), and rank the documents of the test query using the trained local model.

Example: We have data from a questionnaire survey (asking people's opinions) and from objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are the four training samples:

X1 = Acid Durability (seconds)   X2 = Strength (kg/square meter)   Y = Classification
7                                7                                 Bad
7                                4                                 Bad
3                                4                                 Good
1                                4                                 Good

Example (Cont.) Now the factory produces a new paper tissue that passes the laboratory test with X1 = 3 and X2 = 7. Without another expensive survey, can we guess the classification of this new tissue? 1. Suppose we use K = 3. 2. Calculate the distance between the query instance and all the training samples.

Example (Cont.) The coordinates of the query instance are (3, 7). Instead of the distance we compute the squared distance, which is faster to calculate (no square root):

X1   X2   Squared Distance
7    7    (7-3)^2 + (7-7)^2 = 16
7    4    (7-3)^2 + (4-7)^2 = 25
3    4    (3-3)^2 + (4-7)^2 = 9
1    4    (1-3)^2 + (4-7)^2 = 13

Example (Cont.) 3. Sort the distances and determine the nearest neighbours based on the K-th minimum distance:

X1   X2   Distance                          Rank   Class
7    7    sqrt((7-3)^2 + (7-7)^2) = 4       3      Bad
7    4    sqrt((7-3)^2 + (4-7)^2) = 5       4      (not among the 3 nearest)
3    4    sqrt((3-3)^2 + (4-7)^2) = 3       1      Good
1    4    sqrt((1-3)^2 + (4-7)^2) = 3.6     2      Good

Example (Cont.) 4. Gather the categories of the 3 nearest neighbours (the last column of the table above): Good (rank 1), Good (rank 2), Bad (rank 3).

Example (Cont.) 5. Use a simple majority vote over the categories of the nearest neighbours as the prediction for the query instance. We have 2 Good and 1 Bad; since 2 > 1, we conclude that the new paper tissue that passed the laboratory test with X1 = 3 and X2 = 7 belongs to the Good category. If K = 1 we would choose only the single nearest point, which also has class Good.
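A minimal Python sketch of this worked example (the variable names are my own, not from the slides): it computes the squared distances, sorts them, and takes the majority vote among the K nearest training samples.

from collections import Counter

# Training samples: (X1 acid durability, X2 strength, class)
train = [(7, 7, "Bad"), (7, 4, "Bad"), (3, 4, "Good"), (1, 4, "Good")]
query = (3, 7)   # the new tissue: X1 = 3, X2 = 7
K = 3

# Squared Euclidean distance from the query to every training sample
dists = [((x1 - query[0]) ** 2 + (x2 - query[1]) ** 2, label)
         for x1, x2, label in train]
dists.sort()                       # smallest distance first
neighbours = [label for _, label in dists[:K]]
prediction = Counter(neighbours).most_common(1)[0][0]
print(dists)        # [(9, 'Good'), (13, 'Good'), (16, 'Bad'), (25, 'Bad')]
print(prediction)   # Good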

Example:

Age   Loan         Default
25    40,000 JD    N
35    60,000 JD    N
45    80,000 JD    N
20    20,000 JD    N
35    120,000 JD   N
52    18,000 JD    N
23    95,000 JD    Y
40    62,000 JD    Y
60    100,000 JD   Y
48    221,000 JD   Y
33    150,000 JD   Y

We can now use the training set to classify an unknown case (Age = 48, Loan = 142,000 JD) using Euclidean distance. If K = 1, the nearest neighbour is the last case in the training set, with Default = Y.

Example:

Age   Loan         Default   Distance   Rank
25    40,000 JD    N         102,000
35    60,000 JD    N         82,000
45    80,000 JD    N         62,000
20    20,000 JD    N         122,000
35    120,000 JD   N         22,000     2
52    18,000 JD    N         124,000
23    95,000 JD    Y         47,000
40    62,000 JD    Y         80,000
60    100,000 JD   Y         42,000     3
48    221,000 JD   Y         79,000
33    150,000 JD   Y         8,000      1

D = sqrt((48-33)^2 + (142,000-150,000)^2) ≈ 8,000  >>  Default = Y
With K = 3, there are two Default = Y and one Default = N among the three closest neighbours. The prediction for the unknown case is again Default = Y.

Standardized Distance. One major drawback of calculating distance measures directly from the training set arises when the variables have different measurement scales, or when there is a mixture of numerical and categorical variables. For example, if one variable is annual income (in dollars, JD, ...) and the other is age in years, then income will have a much higher influence on the calculated distance. One solution is to standardize the training set, as shown below.

After standardizing both attributes and recomputing the distances: with K = 3, there are two Default = N and one Default = Y among the three closest neighbours. The prediction for the unknown case is now Default = N.

Standardized Age: min age = 20, max age = 60.
X' = (x - min) / (max - min)
X' = (25 - 20) / (60 - 20) = 5/40 = 0.125

Standardized Loan: min = 18,000, max = 221,000.
X' = (40,000 - 18,000) / (221,000 - 18,000) ≈ 0.11
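A small Python sketch (my own illustration, not from the slides) of min-max standardization applied to the Age/Loan data before running KNN; the values are assumed to match the table above.

# Min-max standardization of the Age/Loan training data, then 3-NN.
ages   = [25, 35, 45, 20, 35, 52, 23, 40, 60, 48, 33]
loans  = [40000, 60000, 80000, 20000, 120000, 18000,
          95000, 62000, 100000, 221000, 150000]
labels = ["N", "N", "N", "N", "N", "N", "Y", "Y", "Y", "Y", "Y"]

def minmax(x, lo, hi):
    # Scale x into [0, 1] using the training minimum and maximum
    return (x - lo) / (hi - lo)

a_lo, a_hi = min(ages), max(ages)
l_lo, l_hi = min(loans), max(loans)

# Standardize the training set and the unknown case (Age = 48, Loan = 142,000)
train = [(minmax(a, a_lo, a_hi), minmax(l, l_lo, l_hi), y)
         for a, l, y in zip(ages, loans, labels)]
qa, ql = minmax(48, a_lo, a_hi), minmax(142000, l_lo, l_hi)

dists = sorted(((a - qa) ** 2 + (l - ql) ** 2, y) for a, l, y in train)
# Should print ['N', 'N', 'Y']: two N and one Y, so the prediction is Default = N
print([y for _, y in dists[:3]])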

Pseudocode for the basic K-NN:

Input:  D = {(x1, c1), ..., (xN, cN)}  (labelled training instances)
        x = (x1, ..., xn), a new instance to be classified
FOR each labelled instance (xi, ci): calculate d(xi, x)
Order the d(xi, x) from lowest to highest (i = 1, ..., N)
Select the K nearest instances to x: DxK
Assign to x the most frequent class in DxK
END
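The pseudocode above could be implemented in Python roughly as follows (a sketch under the assumption of purely numeric attributes and Euclidean distance; the function name is mine):

import math
from collections import Counter

def knn_classify(D, x, K):
    # D is a list of (attribute_vector, class_label) pairs; x is an attribute vector.
    # 1. Compute the distance from x to every labelled instance
    distances = [(math.dist(xi, x), ci) for xi, ci in D]
    # 2. Order the distances from lowest to highest
    distances.sort(key=lambda pair: pair[0])
    # 3. Select the K nearest instances
    nearest = [ci for _, ci in distances[:K]]
    # 4. Assign the most frequent class among them
    return Counter(nearest).most_common(1)[0][0]

# Example: the paper-tissue data from the earlier slides
D = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
print(knn_classify(D, (3, 7), K=3))   # Good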

Advantages: The KNN algorithm provides good accuracy on many domains, is easy to understand and implement, is very quick to "train" (there is no explicit training phase), and can estimate complex target concepts locally.

Disadvantages: KNN has a large storage requirement because it has to store all the training data; it is slow for huge data sets because all the training instances have to be visited at classification time; and the value of the parameter K (the number of nearest neighbours) has to be determined.

Naïve Bayesian

Bayesian Classification: Why? A statistical classifier: it performs probabilistic prediction, i.e., it predicts class membership probabilities. Foundation: based on Bayes' theorem. Performance: a simple Bayesian classifier, the naïve Bayesian classifier, has performance comparable to decision trees. Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct, so prior knowledge can be combined with observed data. Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured.

Naïve Bayesian Classifier: Training Dataset. Classes: C1: buys_computer = 'yes', C2: buys_computer = 'no'. Data sample to classify: X = (age <= 30, Income = medium, Student = yes, Credit_rating = Fair).

Bayesian Theorem: Basics. Let X be a data sample ("evidence") whose class label is unknown. Let H be the hypothesis that X belongs to class C. Classification is to determine P(H|X), the posterior probability that the hypothesis holds given the observed data sample X. P(H) is the prior probability, the initial probability, e.g., that X will buy a computer regardless of age, income, etc. P(X) is the probability that the sample data is observed. P(X|H) is the likelihood, the probability of observing the sample X given that the hypothesis holds, e.g., given that X will buy a computer, the probability that X is 31..40 with medium income.

Bayesian Theorem. Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X). Informally, this can be written as posterior = likelihood × prior / evidence. The classifier predicts that X belongs to class Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes.

Towards the Naïve Bayesian Classifier. Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, ..., xn). Suppose there are m classes C1, C2, ..., Cm. Classification is to derive the maximum posterior, i.e., the maximal P(Ci|X). This can be derived from Bayes' theorem: P(Ci|X) = P(X|Ci) P(Ci) / P(X). Since P(X) is constant for all classes, only P(X|Ci) P(Ci) needs to be maximized.

Derivation of the Naïve Bayes Classifier. A simplifying assumption: the attributes are conditionally independent given the class (i.e., no dependence relation between attributes), so P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci). This greatly reduces the computation cost: only the class distribution has to be counted. If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D). If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ, i.e. P(xk|Ci) = g(xk, μCi, σCi), where g is the Gaussian density function given later.

Probabilities for the weather data (counts, then fractions):

Outlook:      Sunny    yes 2 (2/9),  no 3 (3/5)
              Overcast yes 4 (4/9),  no 0 (0/5)
              Rainy    yes 3 (3/9),  no 2 (2/5)
Temperature:  Hot      yes 2 (2/9),  no 2 (2/5)
              Mild     yes 4 (4/9),  no 2 (2/5)
              Cool     yes 3 (3/9),  no 1 (1/5)
Humidity:     High     yes 3 (3/9),  no 4 (4/5)
              Normal   yes 6 (6/9),  no 1 (1/5)
Windy:        False    yes 6 (6/9),  no 2 (2/5)
              True     yes 3 (3/9),  no 3 (3/5)
Play:                  yes 9 (9/14), no 5 (5/14)

(witten & eibe)

Day   Outlook    Temp   Humidity   Windy   Play
1     Sunny      Hot    High       False   No
2     Sunny      Hot    High       True    No
3     Overcast   Hot    High       False   Yes
4     Rainy      Mild   High       False   Yes
5     Rainy      Cool   Normal     False   Yes
6     Rainy      Cool   Normal     True    No
7     Overcast   Cool   Normal     True    Yes
8     Sunny      Mild   High       False   No
9     Sunny      Cool   Normal     False   Yes
10    Rainy      Mild   Normal     False   Yes
11    Sunny      Mild   Normal     True    Yes
12    Overcast   Mild   High       True    Yes
13    Overcast   Hot    Normal     False   Yes
14    Rainy      Mild   High       True    No

Probabilities for the weather data (table as above). A new day:

Outlook   Temp.   Humidity   Windy   Play
Sunny     Cool    High       True    ?

Likelihood of the two classes:
For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795
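A short Python sketch (my own illustration) that reproduces these numbers from the conditional-probability fractions in the table above:

# Naive Bayes for the new day: Outlook=Sunny, Temp=Cool, Humidity=High, Windy=True
# Conditional probabilities taken from the weather-data table above.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # likelihood of "yes"
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # likelihood of "no"
total = p_yes + p_no
print(round(p_yes, 4), round(p_no, 4))                     # 0.0053 0.0206
print(round(p_yes / total, 3), round(p_no / total, 3))     # 0.205 0.795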

Example on Naïve Bayes. Training data: 7 examples with attributes Age, Income, Has a car and class Buy (4 examples with Buy = yes, 3 with Buy = no). Counts and conditional probabilities derived from the training set:

Age:        senior   yes 2 (2/4)   no 1 (1/3)
            junior   yes 1 (1/4)   no 1 (1/3)
            youth    yes 1 (1/4)   no 1 (1/3)
Income:     middle   yes 2 (2/4)   no 1 (1/3)
            low      yes 0 (0/4)   no 2 (2/3)
            high     yes 2 (2/4)   no 0 (0/3)
Has a car:  yes      yes 3 (3/4)   no 1 (1/3)
            no       yes 1 (1/4)   no 2 (2/3)
Class:      Buy = yes 4/7,  Buy = no 3/7

Example on Naïve Bayes (Cont.) Classify the new instance X = (Age = youth, Income = middle, Has a car = no):

Likelihood for class yes = 1/4 × 2/4 × 1/4 = 0.03125
Likelihood for class no  = 1/3 × 1/3 × 2/3 = 0.07407
P(X|yes) × P(yes) = 0.03125 × 4/7 ≈ 0.0179
P(X|no)  × P(no)  = 0.07407 × 3/7 ≈ 0.0317
Therefore X = (youth, middle, no) is assigned to class no.

Bayes' rule. The probability of event H given evidence E: P(H|E) = P(E|H) P(H) / P(E). The a priori probability of H, P(H), is the probability of the event before the evidence is seen. The a posteriori probability of H, P(H|E), is the probability of the event after the evidence is seen. From Bayes' "Essay towards solving a problem in the doctrine of chances" (1763). Thomas Bayes: born 1702 in London, England; died 1761 in Tunbridge Wells, Kent, England. (witten & eibe)

Naïve Bayes for classification. Classification learning asks: what is the probability of the class given an instance? Evidence E = the instance; event H = the class value for the instance. Naïve assumption: the evidence splits into parts (the attributes) that are conditionally independent given the class, so P(H|E) is proportional to P(E1|H) × P(E2|H) × ... × P(En|H) × P(H). (witten & eibe)

Weather data example. Evidence E: Outlook = Sunny, Temp. = Cool, Humidity = High, Windy = True, Play = ? The probability of class "yes" is proportional to P(Sunny|yes) × P(Cool|yes) × P(High|yes) × P(True|yes) × P(yes). (witten & eibe)

Missing values. Training: the instance is not included in the frequency count for that attribute value-class combination. Classification: the attribute is omitted from the calculation. Example:

Outlook   Temp.   Humidity   Windy   Play
?         Cool    High       True    ?

Likelihood of "yes" = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
Likelihood of "no"  = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
P("no")  = 0.0343 / (0.0238 + 0.0343) = 59%
(witten & eibe)

The "zero-frequency problem". What if an attribute value never occurs with some class value (e.g. "Humidity = high" for class "yes")? The estimated probability would be zero, so the a posteriori probability would also be zero, no matter how likely the other values are! Remedy: add 1 to the count for every attribute value-class combination (the Laplace estimator). Result: probabilities will never be zero (this also stabilizes the probability estimates). (witten & eibe)

Avoiding the zero-probability problem. Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability will be zero. Example: suppose a dataset with 1000 tuples in which income = low occurs 0 times, income = medium 990 times, and income = high 10 times. Use the Laplacian correction (Laplacian estimator): add 1 to each count.
Prob(income = low)    = 1/1003
Prob(income = medium) = 991/1003
Prob(income = high)   = 11/1003
The "corrected" probability estimates are close to their "uncorrected" counterparts.
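As a quick sketch (my own, not from the slides), the Laplacian correction for this example can be computed as:

# Laplace (add-one) smoothing for the income attribute with 3 possible values
counts = {"low": 0, "medium": 990, "high": 10}
total = sum(counts.values())                       # 1000 tuples
smoothed = {v: (c + 1) / (total + len(counts))     # add 1 per attribute value
            for v, c in counts.items()}
# low ≈ 0.001 (1/1003), medium ≈ 0.988 (991/1003), high ≈ 0.011 (11/1003)
print(smoothed)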

Numeric attributes. Usual assumption: attributes have a normal (Gaussian) probability distribution given the class. The probability density function of the normal distribution is defined by two parameters, the sample mean μ and the standard deviation σ. The density function is then
f(x) = (1 / (sqrt(2π) σ)) · exp(−(x − μ)² / (2σ²)).
(Karl Gauss, 1777-1855, great German mathematician)

Day   Outlook    Temp   Humidity   Windy   Play
1     Sunny      85     85         False   No
2     Sunny      80     90         True    No
3     Overcast   83     86         False   Yes
4     Rainy      70     96         False   Yes
5     Rainy      68     80         False   Yes
6     Rainy      65     70         True    No
7     Overcast   64     65         True    Yes
8     Sunny      72     95         False   No
9     Sunny      69     70         False   Yes
10    Rainy      75     80         False   Yes
11    Sunny      75     70         True    Yes
12    Overcast   72     90         True    Yes
13    Overcast   81     75         False   Yes
14    Rainy      71     91         True    No

Statistics for the weather data:

Outlook:      Sunny 2/9 (yes), 3/5 (no); Overcast 4/9, 0/5; Rainy 3/9, 2/5
Temperature:  yes values 83, 70, 68, 64, 69, 75, 75, 72, 81   (μ = 73,  σ = 6.2)
              no values  85, 80, 65, 72, 71                    (μ = 75,  σ = 7.9)
Humidity:     yes values 86, 96, 80, 65, 70, 80, 70, 90, 75    (μ = 79,  σ = 10.2)
              no values  85, 90, 70, 95, 91                    (μ = 86,  σ = 9.7)
Windy:        False 6/9, 2/5; True 3/9, 3/5
Play:         9/14 (yes), 5/14 (no)

Example density value: f(temperature = 66 | yes) = (1 / (sqrt(2π) · 6.2)) · exp(−(66 − 73)² / (2 · 6.2²)) = 0.0340.
(witten & eibe)

Classifying a new day. A new day:

Outlook   Temp.   Humidity   Windy   Play
Sunny     66      90         true    ?

Likelihood of "yes" = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
Likelihood of "no"  = 3/5 × 0.0291 × 0.0380 × 3/5 × 5/14 = 0.000136
P("yes") = 0.000036 / (0.000036 + 0.000136) = 20.9%
P("no")  = 0.000136 / (0.000036 + 0.000136) = 79.1%

(Missing values during training are not included in the calculation of the mean and standard deviation.)
(witten & eibe)
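A small Python sketch (my own, under the assumption of the means and standard deviations in the statistics table above) showing how the Gaussian densities and the final class probabilities are obtained:

import math

def gaussian(x, mu, sigma):
    # Normal probability density f(x) with mean mu and standard deviation sigma
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# New day: Outlook=Sunny, Temp=66, Humidity=90, Windy=True
# Numeric attributes use the Gaussian density; nominal ones use the table fractions.
like_yes = (2/9) * gaussian(66, 73, 6.2) * gaussian(90, 79.1, 10.2) * (3/9) * (9/14)
like_no  = (3/5) * gaussian(66, 74.6, 7.9) * gaussian(90, 86.2, 9.7) * (3/5) * (5/14)
total = like_yes + like_no
print(round(gaussian(66, 73, 6.2), 4))                          # ~0.0340, the example density
print(round(like_yes / total, 2), round(like_no / total, 2))    # roughly 0.21 and 0.79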

Naïve Bayes Text Classification

Naïve Bayes Text Classification.
P(word_i | class c) = (T_ct + λ) / (N_c + λV)
where:
T_ct: the number of times the word occurs in category c
N_c:  the total number of words in category c
V:    the size of the vocabulary (the number of unique words)
λ:    a positive constant, usually 1 or 0.5, used to avoid zero probabilities.

You have a set of reviews (documents) and a classification:

Doc   TEXT                          CLASS
1     I loved the movie             +
2     I hated the movie             -
3     A great movie, good movie     +
4     Poor acting                   -
5     Great acting, a good movie    +

First we need to extract the unique words: I, loved, the, movie, hated, a, great, poor, acting, good. We have 10 unique words. Word counts per document:

Doc   I   loved   the   movie   hated   a   great   poor   acting   good   class
1     1   1       1     1       0       0   0       0      0        0      +
2     1   0       1     1       1       0   0       0      0        0      -
3     0   0       0     2       0       1   1       0      0        1      +
4     0   0       0     0       0       0   0       1      1        0      -
5     0   0       0     1       0       1   1       0      1        1      +

Take the documents with positive outcomes: documents 1, 3 and 5.
P(+) = 3/5 = 0.6
Compute P(I|+), P(loved|+), P(the|+), P(movie|+), P(hated|+), P(a|+), P(great|+), P(poor|+), P(acting|+), P(good|+).

Let P(wk|+) = (nk + 1) / (n + number of unique words), where n is the total number of words in the positive documents (n = 14) and nk is the number of times word k occurs in those documents.
P(I|+)      = (1+1) / (14+10) = 0.0833
P(the|+)    = (1+1) / (14+10) = 0.0833
P(a|+)      = (2+1) / (14+10) = 0.125
P(acting|+) = (1+1) / (14+10) = 0.0833

P(hated|+) = (0+1) / (14+10) = 0.0417
P(loved|+) = (1+1) / (14+10) = 0.0833
P(movie|+) = (4+1) / (14+10) ≈ 0.208
P(great|+) = (2+1) / (14+10) = 0.125
P(good|+)  = (2+1) / (14+10) = 0.125
P(poor|+)  = (0+1) / (14+10) = 0.0417

Take the documents with negative outcomes: documents 2 and 4.
P(-) = 2/5 = 0.4
Compute P(I|-), P(loved|-), P(the|-), P(movie|-), P(hated|-), P(a|-), P(great|-), P(poor|-), P(acting|-), P(good|-).

Here n = 6, the total number of words in the negative documents:
P(I|-)      = (1+1) / (6+10) = 0.125
P(loved|-)  = (0+1) / (6+10) = 0.0625
P(movie|-)  = (1+1) / (6+10) = 0.125
P(great|-)  = (0+1) / (6+10) = 0.0625
P(the|-)    = (1+1) / (6+10) = 0.125
P(hated|-)  = (1+1) / (6+10) = 0.125
P(acting|-) = (1+1) / (6+10) = 0.125

P(a|-)    = (0+1) / (6+10) = 0.0625
P(good|-) = (0+1) / (6+10) = 0.0625
P(poor|-) = (1+1) / (6+10) = 0.125

The class is chosen by
v_NB = argmax over vj of [ P(vj) × Π over w in words of P(w|vj) ],
where vj ranges over the class values. Let's classify a new sentence: "I hated the poor acting".

If vj = +:
P(+) × P(I|+) × P(hated|+) × P(the|+) × P(poor|+) × P(acting|+)
= 0.6 × 0.0833 × 0.0417 × 0.0833 × 0.0417 × 0.0833 ≈ 6.03 × 10^-7

If vj = -:
P(-) × P(I|-) × P(hated|-) × P(the|-) × P(poor|-) × P(acting|-)
= 0.4 × 0.125 × 0.125 × 0.125 × 0.125 × 0.125 ≈ 1.22 × 10^-5
Since 1.22 × 10^-5 > 6.03 × 10^-7, the sentence is classified as negative.
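A compact Python sketch (my own illustration, using λ = 1 add-one smoothing as in the formula above) that reproduces this review example end to end:

from collections import Counter

# Training reviews and their classes, as in the slides
docs = [("I loved the movie", "+"), ("I hated the movie", "-"),
        ("A great movie, good movie", "+"), ("Poor acting", "-"),
        ("Great acting, a good movie", "+")]

def tokenize(text):
    return text.lower().replace(",", "").split()

vocab = {w for text, _ in docs for w in tokenize(text)}     # 10 unique words
word_counts = {"+": Counter(), "-": Counter()}
class_counts = Counter()
for text, cls in docs:
    class_counts[cls] += 1
    word_counts[cls].update(tokenize(text))

def score(sentence, cls, lam=1.0):
    # P(class) * product over words of (T_ct + lambda) / (N_c + lambda * V)
    n_c = sum(word_counts[cls].values())
    p = class_counts[cls] / sum(class_counts.values())
    for w in tokenize(sentence):
        p *= (word_counts[cls][w] + lam) / (n_c + lam * len(vocab))
    return p

new = "I hated the poor acting"
print(score(new, "+"), score(new, "-"))                 # ~6.0e-07 vs ~1.2e-05
print(max(["+", "-"], key=lambda c: score(new, c)))     # "-" (negative)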