
1 CH5 Data Mining Classification Prepared By Dr. Maher Abuhamdeh

2 Classification: Definition
Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class, find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets; the training set is used to build the model and the test set to validate it.
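As an illustration of this train/test workflow (not part of the slides), here is a minimal Python sketch; it assumes scikit-learn is available, and the synthetic data and the 70/30 split are illustrative choices only.

```python
# Sketch only: build a model on a training set and validate it on a held-out test set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))             # 100 records, 2 attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # the class attribute

# Divide the data set into training and test sets (70% / 30%).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)        # build the model
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))   # validate it
```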

3 Classification vs. Prediction
Classification: predicts categorical class labels; it classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses that model to classify new data. Prediction: models continuous-valued functions, i.e., predicts unknown or missing values.

4 Classification Example
[Figure: a training set with categorical and continuous attributes plus a class label is used to learn a classifier model, which is then applied to a test set.]

5 Classification: Application 1
Direct Marketing Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product. Approach: Use the data for a similar product introduced before. We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute. Collect various demographic, lifestyle, and company-interaction related information about all such customers. Type of business, where they stay, how much they earn, etc. Use this information as input attributes to learn a classifier model.

6 Classification: Application 2
Fraud Detection. Goal: predict fraudulent cases in credit card transactions. Approach: use credit card transactions and information on the account holder as attributes: when the customer buys, what he buys, how often he pays on time, etc. Label past transactions as fraud or fair; this forms the class attribute. Learn a model for the class of the transactions, and use this model to detect fraud by observing the credit card transactions on an account.

7 Classification: Application 3
Customer Attrition/Churn. Goal: to predict whether a customer is likely to be lost to a competitor. Approach: use detailed records of transactions with each of the past and present customers to find attributes: how often the customer calls, where he calls, what time of day he calls most, his financial status, marital status, etc. Label the customers as loyal or disloyal, and find a model for loyalty. From [Berry & Linoff] Data Mining Techniques, 1997

8 Classification: Application 4
Sky Survey Cataloging. Goal: to predict the class (star or galaxy) of sky objects, especially visually faint ones, based on telescopic survey images (from the Palomar Observatory): 3000 images with 23,040 x 23,040 pixels per image. Approach: segment the image, measure image attributes (features), 40 of them per object, and model the class based on these features. Success story: found 16 new high red-shift quasars, some of the farthest objects, which are difficult to find! From [Fayyad et al.] Advances in Knowledge Discovery and Data Mining, 1996

9 Classifying Galaxies
Class: stage of formation (Early, Intermediate, Late). Attributes: image features, characteristics of the light waves received, etc. Data size: 72 million stars, 20 million galaxies. Object catalog: 9 GB. Image database: 150 GB.

10 Illustrating Classification Task

11 KNN : K –Nearest Neighbor method
KNN approach (K-Nearest Neighbour method): given a new test query q, find the k closest training queries to it in terms of Euclidean distance, train a local ranking model online using those neighbouring training queries (e.g. Ranking SVM), and rank the documents of the test query using the trained local model.

12

13 Example
We have data from a questionnaire survey (asking people's opinions) and from objective testing, with two attributes (acid durability and strength), to classify whether a special paper tissue is good or not. Here are four training samples:

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Y = Classification
7 | 7 | Bad
7 | 4 | Bad
3 | 4 | Good
1 | 4 | Good

14 Example (Cont.) Now the factory produces a new paper tissue that passes the laboratory tests with X1 = 3 and X2 = 7. Without another expensive survey, can we guess what the classification of this new tissue is? Suppose we use K = 3. 1. Calculate the distance between the query instance and all the training samples.

15 Example (Cont.)
The coordinate of the query instance is (3, 7). Instead of the distance we compute the squared distance, which is faster to calculate (no square root):

X1 = Acid Durability (seconds) | X2 = Strength (kg/square meter) | Squared Distance
7 | 7 | (7-3)² + (7-7)² = 16
7 | 4 | (7-3)² + (4-7)² = 25
3 | 4 | (3-3)² + (4-7)² = 9
1 | 4 | (1-3)² + (4-7)² = 13

16 Example (Cont.)
3. Sort the distances and determine the nearest neighbours based on the K-th minimum distance:

X1 | X2 | Squared Distance | Rank | Class
7 | 7 | 16 | 3 | Bad
7 | 4 | 25 | 4 | Bad
3 | 4 | 9 | 1 | Good
1 | 4 | 13 | 2 | Good

17 Example (Cont.)
4. Gather the category (Y) of the nearest 3 neighbours (the rows with rank 1 to 3):

X1 | X2 | Squared Distance | Rank | Included in 3-NN? | Y = Class
7 | 7 | 16 | 3 | Yes | Bad
7 | 4 | 25 | 4 | No | -
3 | 4 | 9 | 1 | Yes | Good
1 | 4 | 13 | 2 | Yes | Good

18 Example (Cont.) 5. Use a simple majority of the categories of the nearest neighbours as the prediction for the query instance. We have 2 Good and 1 Bad; since 2 > 1, we conclude that the new paper tissue that passed the laboratory tests with X1 = 3 and X2 = 7 is classified as Good. If k = 1 we choose only the nearest point, which has class Good.
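This worked example can be checked with a few lines of Python; the sketch below (not part of the slides) recomputes the squared distances, the ranking and the majority vote.

```python
# Sketch of the tissue example: query (X1=3, X2=7), K=3, squared Euclidean
# distance, simple majority vote.
from collections import Counter

training = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
query, k = (3, 7), 3

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

# Squared distances: 16, 25, 9, 13 -> sorted neighbours: (3,4) Good, (1,4) Good, (7,7) Bad
neighbours = sorted(training, key=lambda rec: sq_dist(rec[0], query))[:k]
votes = Counter(label for _, label in neighbours)
print(votes.most_common(1)[0][0])   # Good (2 Good vs 1 Bad)
```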

19 Example
Age | Loan | Default
25 | 40,000 JD | N
35 | 60,000 JD | N
45 | 80,000 JD | N
20 | 20,000 JD | N
- | 120,000 JD | N
52 | 18,000 JD | N
23 | 95,000 JD | Y
40 | 62,000 JD | Y
60 | 100,000 JD | Y
48 | 221,000 JD | Y
33 | 150,000 JD | Y

We can now use the training set to classify an unknown case (Age = 48 and Loan = 142,000 JD) using Euclidean distance. If K = 1, the nearest neighbour is the last case in the training set, with Default = Y.

20 Example
Age | Loan | Default | Distance | Rank
25 | 40,000 JD | N | 102,000 |
35 | 60,000 JD | N | 82,000 |
45 | 80,000 JD | N | 62,000 |
20 | 20,000 JD | N | 122,000 |
- | 120,000 JD | N | 22,000 | 2
52 | 18,000 JD | N | 124,000 |
23 | 95,000 JD | Y | 47,000 |
40 | 62,000 JD | Y | 80,000 |
60 | 100,000 JD | Y | 42,000 | 3
48 | 221,000 JD | Y | 78,000 |
33 | 150,000 JD | Y | 8,000 | 1

D = Sqrt[(48 - 33)² + (142,000 - 150,000)²] ≈ 8,000  >>  Default = Y
With K = 3, there are two Default = Y and one Default = N among the three closest neighbours, so the prediction for the unknown case is again Default = Y.

21 Standardized Distance
One major drawback of calculating distance measures directly from the training set arises when variables have different measurement scales or when there is a mixture of numerical and categorical variables. For example, if one variable is annual income (in dollars, JD, ...) and the other is age in years, then income will have a much higher influence on the calculated distance. One solution is to standardize the training set, as shown below.

22 Using the standardized training set, with K = 3 there are two Default = N and one Default = Y among the three closest neighbours, so the prediction for the unknown case becomes Default = N.

23 Standardized Age: min age = 20, max age = 60. X' = (x - min) / (max - min), so X' = (25 - 20) / (60 - 20) = 5/40 = 0.125. Standardized loan: min = 18,000, max = 212,000, so X' = (40,000 - 18,000) / (212,000 - 18,000) ≈ 0.11.
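A small Python sketch (not from the slides) of this min-max standardization, using the min/max values quoted on the slide:

```python
# Min-max standardization: X' = (x - min) / (max - min), applied per attribute
# so that age and loan both end up on a 0..1 scale before distances are computed.
def min_max(x, lo, hi):
    return (x - lo) / (hi - lo)

print(min_max(25, lo=20, hi=60))                  # age 25      -> 0.125
print(round(min_max(40000, 18000, 212000), 2))    # loan 40,000 -> 0.11
```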

24 Pseudo for the basic k-nn
Input:  D = {(x1, c1), ..., (xN, cN)}   (labelled training instances)
        x = (x1, ..., xn)               (new instance to be classified)
FOR each labelled instance (xi, ci):
    calculate d(xi, x)
Order the d(xi, x) from lowest to highest (i = 1, ..., N)
Select the K instances nearest to x: DxK
Assign to x the most frequent class in DxK
END
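A runnable Python version of the pseudocode above (a sketch, not the slides' own code); math.dist gives the Euclidean distance d(xi, x).

```python
import math
from collections import Counter

def knn_classify(D, x, k):
    """D is a list of (attribute-vector, class) pairs; x is the new instance."""
    # 1. Calculate d(xi, x) for each labelled instance (xi, ci).
    distances = [(math.dist(xi, x), ci) for xi, ci in D]
    # 2. Order the distances from lowest to highest and keep the K nearest.
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # 3. Assign to x the most frequent class among those K instances.
    return Counter(ci for _, ci in nearest).most_common(1)[0][0]

# The tissue example again: the query (3, 7) with K = 3 comes out as "Good".
D = [((7, 7), "Bad"), ((7, 4), "Bad"), ((3, 4), "Good"), ((1, 4), "Good")]
print(knn_classify(D, (3, 7), k=3))
```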

25 Advantages and disadvantages
The KNN algorithm provides good accuracy on many domains. It is easy to understand and implement. Training is very quick (the algorithm simply stores the training data). The KNN algorithm can estimate complex target concepts locally.

26 Disadvantages KNN has large storage requirements because it has to store all the training data. KNN is slow for huge data sets because all the training instances have to be visited. The value of the parameter k (the number of nearest neighbours) has to be determined.

27 Naïve Bayesian

28 Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities Foundation: Based on Bayes’ Theorem. Performance: A simple Bayesian classifier, naïve Bayesian classifier, has comparable performance with decision tree Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct — prior knowledge can be combined with observed data Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

29 Naïve Bayesian Classifier: Training Dataset
C1: buys_computer = ‘yes’, C2: buys_computer = ‘no’. Data sample: X = (age <= 30, Income = medium, Student = yes, Credit_rating = Fair)

30 Bayesian Theorem: Basics
Let X be a data sample (“evidence”); its class label is unknown. Let H be a hypothesis that X belongs to class C. Classification is to determine P(H|X), the posterior probability: the probability that the hypothesis holds given the observed data sample X. P(H) is the prior probability, the initial probability; e.g., that X will buy a computer, regardless of age, income, etc. P(X) is the probability that the sample data is observed. P(X|H) is the likelihood: the probability of observing the sample X given that the hypothesis holds; e.g., given that X will buy a computer, the probability that X is in a particular age bracket and has a medium income.

31 Bayesian Theorem Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem. Informally, this can be written as: posterior = likelihood × prior / evidence. Predict that X belongs to class Ci iff the probability P(Ci|X) is the highest among all the P(Ck|X) for the k classes.
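In standard notation, the theorem referenced on this slide reads:

```latex
P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}
\qquad\text{i.e.}\qquad
\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}
```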

32 Towards Naïve Bayesian Classifier
Let D be a training set of tuples and their associated class labels, where each tuple is represented by an n-dimensional attribute vector X = (x1, x2, …, xn). Suppose there are m classes C1, C2, …, Cm. Classification is to derive the maximum posterior, i.e., the maximal P(Ci|X). This can be derived from Bayes' theorem. Since P(X) is constant for all classes, only P(X|Ci) · P(Ci) needs to be maximized.

33 Derivation of Naïve Bayes Classifier
A simplifying assumption: attributes are conditionally independent given the class (i.e., no dependence relations between attributes). This greatly reduces the computation cost: only the class distribution has to be counted. If Ak is categorical, P(xk|Ci) is the number of tuples in Ci having value xk for Ak, divided by |Ci,D| (the number of tuples of Ci in D). If Ak is continuous-valued, P(xk|Ci) is usually computed from a Gaussian distribution with mean μ and standard deviation σ.
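In standard notation, the conditional-independence assumption and the Gaussian estimate referenced on this slide are:

```latex
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)
             = P(x_1 \mid C_i) \times P(x_2 \mid C_i) \times \cdots \times P(x_n \mid C_i)

% and, for a continuous-valued attribute A_k,
P(x_k \mid C_i) = g(x_k, \mu_{C_i}, \sigma_{C_i})
               = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}}\,
                 e^{-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}}
```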

34 Probabilities for weather data
Attribute   | Value    | Yes       | No
Outlook     | Sunny    | 2 (2/9)   | 3 (3/5)
Outlook     | Overcast | 4 (4/9)   | 0 (0/5)
Outlook     | Rainy    | 3 (3/9)   | 2 (2/5)
Temperature | Hot      | 2 (2/9)   | 2 (2/5)
Temperature | Mild     | 4 (4/9)   | 2 (2/5)
Temperature | Cool     | 3 (3/9)   | 1 (1/5)
Humidity    | High     | 3 (3/9)   | 4 (4/5)
Humidity    | Normal   | 6 (6/9)   | 1 (1/5)
Windy       | False    | 6 (6/9)   | 2 (2/5)
Windy       | True     | 3 (3/9)   | 3 (3/5)
Play        | (prior)  | 9 (9/14)  | 5 (5/14)
witten&eibe

35
Day | Outlook  | Temp | Humidity | Windy | Play
1   | Sunny    | Hot  | High     | False | No
2   | Sunny    | Hot  | High     | True  | No
3   | Overcast | Hot  | High     | False | Yes
4   | Rainy    | Mild | High     | False | Yes
5   | Rainy    | Cool | Normal   | False | Yes
6   | Rainy    | Cool | Normal   | True  | No
7   | Overcast | Cool | Normal   | True  | Yes
8   | Sunny    | Mild | High     | False | No
9   | Sunny    | Cool | Normal   | False | Yes
10  | Rainy    | Mild | Normal   | False | Yes
11  | Sunny    | Mild | Normal   | True  | Yes
12  | Overcast | Mild | High     | True  | Yes
13  | Overcast | Hot  | Normal   | False | Yes
14  | Rainy    | Mild | High     | True  | No

36 Probabilities for weather data
A new day: Outlook = Sunny, Temperature = Cool, Humidity = High, Windy = True, Play = ?
Using the table on slide 34, the likelihoods of the two classes are:
For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For "no" = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
P("no") = 0.0206 / (0.0053 + 0.0206) = 0.795
witten&eibe
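These numbers can be checked with a short Python sketch (not from the slides) that simply multiplies the entries of the count table:

```python
# Naive Bayes for the weather example: Sunny, Cool, High, Windy=True.
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)   # attribute values given "yes"
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)   # attribute values given "no"

print(round(p_yes, 4), round(p_no, 4))        # 0.0053 0.0206
print(round(p_yes / (p_yes + p_no), 3))       # P("yes") = 0.205
print(round(p_no / (p_yes + p_no), 3))        # P("no")  = 0.795
```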

37 Example on Naïve Bayes
Training data: 7 tuples with attributes Age (youth, junior, senior), Income (low, middle, high) and Has a car (yes, no), and class Buy; 4 tuples have Buy = yes and 3 have Buy = no. The conditional probabilities derived from the training data:

Attribute | Value  | Buy = yes | Buy = no
Age       | senior | 2/4       | 1/3
Age       | junior | 1/4       | 1/3
Age       | youth  | 1/4       | 1/3
Income    | middle | 2/4       | 1/3
Income    | low    | -         | -
Income    | high   | -         | -
Has a car | yes    | 3/4       | 1/3
Has a car | no     | 1/4       | 2/3
Prior     |        | P(yes) = 4/7 | P(no) = 3/7

38 Example on Naïve Bayes Cont.
New instance: Age = youth, Income = middle, Has a car = no, Buy = ?
Likelihood for class (yes) = 1/4 × 2/4 × 1/4 = 0.031
Likelihood for class (no) = 1/3 × 1/3 × 2/3 = 0.074
P(X|yes) × P(yes) = 0.031 × 4/7 ≈ 0.018
P(X|no) × P(no) = 0.074 × 3/7 ≈ 0.032
Therefore X = (youth, middle, no) belongs to class "no".
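A quick Python check of this example (not from the slides), using the conditional probabilities read off the table on the previous slide:

```python
# Naive Bayes for the car example: Age=youth, Income=middle, Has a car=no.
like_yes = (1/4) * (2/4) * (1/4)   # P(youth|yes) * P(middle|yes) * P(no car|yes)
like_no  = (1/3) * (1/3) * (2/3)   # P(youth|no)  * P(middle|no)  * P(no car|no)

print(round(like_yes * 4/7, 3))    # 0.018
print(round(like_no * 3/7, 3))     # 0.032  -> class "no"
```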

39 Bayes's rule
Probability of event H given evidence E: P(H|E) = P(E|H) · P(H) / P(E).
The a priori probability of H, P(H), is the probability of the event before the evidence is seen; the a posteriori probability of H, P(H|E), is the probability of the event after the evidence is seen. From Bayes' "Essay towards solving a problem in the doctrine of chances" (1763). Thomas Bayes: born 1702 in London, England; died 1761 in Tunbridge Wells, Kent, England. witten&eibe

40 Naïve Bayes for classification
Classification learning: what is the probability of the class given an instance? Evidence E = the instance; event H = the class value of the instance. Naïve assumption: the evidence splits into parts (i.e., attributes) that are independent given the class. witten&eibe

41 Weather data example
Evidence E: Outlook = Sunny, Temp. = Cool, Humidity = High, Windy = True, Play = ?
Probability of class "yes":
P(yes|E) = P(Sunny|yes) × P(Cool|yes) × P(High|yes) × P(True|yes) × P(yes) / P(E) = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / P(E)
witten&eibe

42 Missing values Training: the instance is not included in the frequency count for that attribute value-class combination. Classification: the attribute is omitted from the calculation. Example: Outlook = ?, Temp. = Cool, Humidity = High, Windy = True, Play = ?
Likelihood of "yes" = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
Likelihood of "no" = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
P("no") = 0.0343 / (0.0238 + 0.0343) = 59%
witten&eibe

43 The “zero-frequency problem”
What if an attribute value doesn't occur with every class value? (E.g., "Outlook = Overcast" never occurs with class "no".) Its probability will be zero, so the a posteriori probability will also be zero, no matter how likely the other values are! Remedy: add 1 to the count for every attribute value-class combination (the Laplace estimator). Result: probabilities will never be zero (this also stabilizes the probability estimates). witten&eibe

44 Avoiding the 0-Probability Problem
Naïve Bayesian prediction requires each conditional probability to be non-zero; otherwise the predicted probability will be zero. Example: suppose a dataset with 1000 tuples, where income = low occurs 0 times, income = medium 990 times, and income = high 10 times. Use the Laplacian correction (Laplace estimator): adding 1 to each count gives Prob(income = low) = 1/1003, Prob(income = medium) = 991/1003, Prob(income = high) = 11/1003. The "corrected" probability estimates are close to their "uncorrected" counterparts.

45 Numeric attributes Usual assumption: attributes have a normal (Gaussian) probability distribution given the class. The probability density function of the normal distribution is defined by two parameters, the sample mean μ and the standard deviation σ. The density function is then
f(x) = 1 / (√(2π) · σ) · e^( -(x - μ)² / (2σ²) )
(Carl Friedrich Gauss, the great German mathematician)

46

47
Day | Outlook  | Temp | Humidity | Windy | Play
1   | Sunny    | 85   | 85       | False | No
2   | Sunny    | 80   | 90       | True  | No
3   | Overcast | 83   | 86       | False | Yes
4   | Rainy    | 70   | 96       | False | Yes
5   | Rainy    | 68   | 80       | False | Yes
6   | Rainy    | 65   | 70       | True  | No
7   | Overcast | 64   | 65       | True  | Yes
8   | Sunny    | 72   | 95       | False | No
9   | Sunny    | 69   | 70       | False | Yes
10  | Rainy    | 75   | 80       | False | Yes
11  | Sunny    | 75   | 70       | True  | Yes
12  | Overcast | 72   | 90       | True  | Yes
13  | Overcast | 81   | 75       | False | Yes
14  | Rainy    | 71   | 91       | True  | No

48 Statistics for weather data
Outlook: Sunny 2/9 (yes) | 3/5 (no); Overcast 4/9 | 0/5; Rainy 3/9 | 2/5. Windy: False 6/9 | 2/5; True 3/9 | 3/5. Play: 9/14 | 5/14.
Temperature (yes): 83, 70, 68, 64, 69, 75, 75, 72, 81; μ = 73, σ = 6.2. Temperature (no): 85, 80, 65, 72, 71; μ = 74.6, σ = 7.9.
Humidity (yes): 86, 96, 80, 65, 70, 80, 70, 90, 75; μ = 79.1, σ = 10.2. Humidity (no): 85, 90, 70, 95, 91; μ = 86.2, σ = 9.7.
Example density value: f(temperature = 66 | yes) = 1/(√(2π) · 6.2) · e^(-(66 - 73)² / (2 · 6.2²)) = 0.034
witten&eibe

49 Classifying a new day
A new day: Outlook = Sunny, Temp. = 66, Humidity = 90, Windy = true, Play = ?
Likelihood of "yes" = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 ≈ 0.000036
Likelihood of "no" = 3/5 × 0.0279 × 0.0381 × 3/5 × 5/14 ≈ 0.000137
P("yes") = 0.000036 / (0.000036 + 0.000137) ≈ 20.9%
P("no") ≈ 79.1%
Missing values during training are not included in the calculation of the mean and standard deviation.
witten&eibe
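A Python sketch (not from the slides) recomputing this example; the means and standard deviations are taken from the statistics slide, so the results differ from the slide only by rounding.

```python
import math

def gauss(x, mu, sigma):
    """Normal density f(x) with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# New day: Outlook = Sunny, Temperature = 66, Humidity = 90, Windy = true.
like_yes = (2/9) * gauss(66, 73.0, 6.2) * gauss(90, 79.1, 10.2) * (3/9) * (9/14)
like_no  = (3/5) * gauss(66, 74.6, 7.9) * gauss(90, 86.2, 9.7)  * (3/5) * (5/14)

print(round(like_yes, 6), round(like_no, 6))      # 3.6e-05 0.000137
print(round(like_yes / (like_yes + like_no), 3))  # 0.207, about 21%, close to the 20.9% above
```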

50 Naïve Bayes Text Classification

51 Naïve Bayes Text Classification
P(wordi | class) = (Tct + λ) / (Nc + λV), where: Tct = the number of times the word occurs in category c; Nc = the total number of words in category c; V = the size of the vocabulary (number of unique words); λ = a positive constant, usually 1 or 0.5, to avoid zero probabilities.

52 You have a set of reviews (documents ) and a classification
Doc | TEXT                       | CLASS
1   | I loved the movie          | +
2   | I hated the movie          | -
3   | a great movie, good movie  | +
4   | poor acting                | -
5   | great acting, a good movie | +

53 First we need to extract unique words
I, loved, the, movie, hated, a, great, poor, acting, good. We have 10 unique words.

Doc | I | loved | the | movie | hated | a | great | poor | acting | good | class
1   | 1 | 1     | 1   | 1     | 0     | 0 | 0     | 0    | 0      | 0    | +
2   | 1 | 0     | 1   | 1     | 1     | 0 | 0     | 0    | 0      | 0    | -
3   | 0 | 0     | 0   | 2     | 0     | 1 | 1     | 0    | 0      | 1    | +
4   | 0 | 0     | 0   | 0     | 0     | 0 | 0     | 1    | 1      | 0    | -
5   | 0 | 0     | 0   | 1     | 0     | 1 | 1     | 0    | 1      | 1    | +

54 Take documents with positive outcomes
Positive documents: 1, 3 and 5 (the rows marked + in the table above). P(+) = 3/5 = 0.6.
Compute P(I|+), P(loved|+), P(the|+), P(movie|+), P(hated|+), P(a|+), P(great|+), P(poor|+), P(acting|+), P(good|+).

55 Let P(wk|+) = (nk + 1) / (n + number of unique words)
n = the total number of words in the positive documents = 14
nk = the number of times word k occurs in the positive documents
P(I|+) = (1 + 1) / (14 + 10) = 0.0833
P(the|+) = (1 + 1) / (14 + 10) = 0.0833
P(a|+) = (2 + 1) / (14 + 10) = 0.125
P(acting|+) = (1 + 1) / (14 + 10) = 0.0833

56 P(hated|+) = (0 + 1) / (14 + 10) = 0.0417
P(loved|+) = (1 + 1) / (14 + 10) = 0.0833
P(movie|+) = (4 + 1) / (14 + 10) = 0.2083
P(great|+) = (2 + 1) / (14 + 10) = 0.125
P(good|+) = (2 + 1) / (14 + 10) = 0.125
P(poor|+) = (0 + 1) / (14 + 10) = 0.0417

57 Take the documents with negative outcomes
Negative documents: 2 and 4 (the rows marked - in the table above). P(-) = 2/5 = 0.4.
Compute P(I|-), P(loved|-), P(the|-), P(movie|-), P(hated|-), P(a|-), P(great|-), P(poor|-), P(acting|-), P(good|-).

58 Here n = the total number of words in the negative documents = 6.
P(I|-) = (1 + 1) / (6 + 10) = 0.125
P(loved|-) = (0 + 1) / (6 + 10) = 0.0625
P(movie|-) = (1 + 1) / (6 + 10) = 0.125
P(great|-) = (0 + 1) / (6 + 10) = 0.0625
P(the|-) = (1 + 1) / (6 + 10) = 0.125
P(hated|-) = (1 + 1) / (6 + 10) = 0.125
P(acting|-) = (1 + 1) / (6 + 10) = 0.125

59 P(a|-) = (0 + 1) / (6 + 10) = 0.0625
P(good|-) = (0 + 1) / (6 + 10) = 0.0625
P(poor|-) = (1 + 1) / (6 + 10) = 0.125
vNB = argmax over classes vj of P(vj) · ∏(w ∈ words) P(w|vj), where v = the value (class).
Let's classify a new sentence: "I hated the poor acting".

60 If vj = +:
P(+) × P(I|+) × P(hated|+) × P(the|+) × P(poor|+) × P(acting|+)
= 0.6 × 0.0833 × 0.0417 × 0.0833 × 0.0417 × 0.0833 ≈ 6.03 × 10⁻⁷

61 If vj = -:
P(-) × P(I|-) × P(hated|-) × P(the|-) × P(poor|-) × P(acting|-)
= 0.4 × 0.125 × 0.125 × 0.125 × 0.125 × 0.125 ≈ 1.22 × 10⁻⁵
Since 1.22 × 10⁻⁵ > 6.03 × 10⁻⁷, the sentence is classified as negative.
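The whole text-classification example can be reproduced with a short Python sketch (not from the slides); the word probabilities use the Laplace-smoothed formula from slide 51 with λ = 1 and V = 10.

```python
# Naive Bayes text classification for the review example above.
from collections import Counter

vocab = ["i", "loved", "the", "movie", "hated", "a", "great", "poor", "acting", "good"]
pos_docs = ["I loved the movie", "a great movie good movie", "great acting a good movie"]
neg_docs = ["I hated the movie", "poor acting"]

def word_probs(docs, V=10):
    counts = Counter(w.lower() for d in docs for w in d.split())
    n = sum(counts.values())                      # total number of words in this class
    return {w: (counts[w] + 1) / (n + V) for w in vocab}

p_pos, p_neg = word_probs(pos_docs), word_probs(neg_docs)

score_pos, score_neg = 3/5, 2/5                   # class priors P(+) and P(-)
for w in "I hated the poor acting".lower().split():
    score_pos *= p_pos[w]
    score_neg *= p_neg[w]

print(f"{score_pos:.2e} {score_neg:.2e}")         # 6.03e-07 1.22e-05 -> negative
```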

