Review of statistics in data mining


1 Review of statistics in data mining

2 Mean, total sum of squares, standard deviation
In the absence of other information, the average (m) is commonly used as an indicator of the scale of sample values, even though their probability distribution is unknown. The total sum of squares (SST) is a commonly used measure of the total variability of an attribute over a dataset, regardless of its actual distribution; note that variability is deviation from the mean, and SST is model independent. The standard deviation (SD) is a commonly used indicator of the accuracy of averages (via the standard error SD/√n), regardless of the distribution of the samples in the average.
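
As a concrete illustration (not from the original slides), these quantities are easy to compute with Python/numpy; the data values below are made up:

    import numpy as np

    x = np.array([4.1, 5.0, 3.8, 6.2, 5.5])   # hypothetical attribute values

    m   = x.mean()                   # average of the sample values
    sst = ((x - m) ** 2).sum()       # total sum of squares: variability about the mean
    sd  = x.std(ddof=1)              # sample standard deviation
    se  = sd / np.sqrt(len(x))       # standard error SD/sqrt(n): accuracy of the average

    print(m, sst, sd, se)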

3 Z-scores: attributes transformed to zero mean and unit variance
If variable x is normally distributed with mean m and variance s², then z = (x − m)/s is a normally distributed variable with zero mean and unit variance. It is common practice to transform attributes to z-scores even if the samples in the dataset are not normally distributed. One justification is that z-scores are dimensionless and have similar scale, even though the raw attributes may have very different scales. It is also commonly believed that data mining techniques work better with z-scores.
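
A minimal sketch of the transformation (illustrative data):

    import numpy as np

    x = np.array([12.0, 15.5, 9.8, 14.2, 11.1])   # hypothetical attribute column
    z = (x - x.mean()) / x.std(ddof=1)            # z-scores: zero mean, unit variance
    print(z.mean(), z.var(ddof=1))                # approximately 0.0 and exactly 1.0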

4 Sometimes we can make an attribute’s distribution more “normal”
before reduction to z-scores

5 Terms used to describe the variation of attribute values
Variability: the scale of attribute deviation from the mean of the population.
Variance: a quantitative measure of variability, generally associated with normally distributed attributes.
Proportion of variance: many different effects contribute to the total variability in a dataset, each of which has its own proportion of variance. Example: part of the variation in weight is due to age.
Correlation: related variation of different attributes; age and weight show positive correlation.
Covariance: a quantitative measure of correlation, generally associated with normally distributed attributes.
Common variance: correlation among a subset of attributes due to an underlying cause, like annual income.

6 Multivariate normal distribution
The mean is a d×1 vector whose components are the means of each attribute. The variance generalizes to a symmetric d×d matrix called the covariance matrix: its diagonal elements are the variances of the individual attributes, and its off-diagonal elements are the pairwise covariances of attributes. The covariance matrix is not independent of the attribute units (e.g., change miles to kilometers and the covariance matrix changes).

7 Correlation matrix is the covariance matrix of z-scores
All elements of the covariance matrix are quadratic in the attribute units. Divide every element, including the diagonals, by the product of the corresponding standard deviations: this gives 1's on the diagonal and correlation coefficients off the diagonal. The correlation matrix is therefore the covariance matrix of z-scores.
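
A short numpy sketch (synthetic data, for illustration only) confirming that dividing the covariance matrix by the products of standard deviations gives the same matrix as the covariance of the z-scores:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))      # hypothetical dataset: 100 examples, d = 3 attributes

    S  = np.cov(X, rowvar=False)       # d x d covariance matrix
    sd = np.sqrt(np.diag(S))           # standard deviations from the diagonal
    R  = S / np.outer(sd, sd)          # divide every element by the product of SDs

    Z  = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # z-scores
    print(np.allclose(R, np.cov(Z, rowvar=False)))      # True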

8 Hypothesis testing by p-values
Given a null hypothesis H0 and an alternative Ha, use the sample data (assuming H0 is true) to calculate the value of a test statistic, t-stat. We must know how t-stat is distributed if H0 is true (this usually involves degrees of freedom). We must also know whether t-stat is always positive (1-sided) or whether ± values need to be considered (2-sided). Calculate the probability of values at least as extreme as t-stat. This is equivalent to the question "If the defendant is innocent, what is the chance that we would observe such extreme criminal evidence?" This calculation is the area under the probability distribution of t-stat for extreme values (1 tail or 2 tails). If the probability is less than your doubt threshold (e.g., < 0.05 for 95% confidence), reject H0 and accept Ha.
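
As a hedged example of this recipe (assuming scipy is available; the sample and the H0 mean are made up), a two-sided one-sample t-test:

    import numpy as np
    from scipy import stats

    x = np.array([5.1, 4.8, 5.6, 5.3, 4.9, 5.4])         # hypothetical sample
    t_stat, p_value = stats.ttest_1samp(x, popmean=5.0)  # H0: population mean = 5.0
    if p_value < 0.05:                                   # doubt threshold for 95% confidence
        print("reject H0 in favor of Ha")
    else:
        print("fail to reject H0")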

9 In data mining we sometimes encounter discrete probabilities
Discrete probabilities are defined in terms of a finite sample space S. Usually S is a collection of outcomes of independent experiments, such as flipping coins, throwing dice, pulling cards, etc. Example: S = set of outcomes from flipping 2 coins = {HH, HT, TH, TT}. Events are subsets of S (S itself is called the "certain" event). In {HH, HT, TH, TT}, the event of getting 1 head and 1 tail = {HT, TH}. The empty subset, ∅, is called the "null" event.

10 Conditional probability: Given some knowledge about outcomes, we want the probability of an outcome conditioned on our prior knowledge about it. Suppose that someone flips 2 coins and tells us that at least one shows heads. What is the probability that both coins show heads? S = {HH, HT, TH, TT}. Our prior knowledge eliminates the outcome TT. The remaining 3 outcomes are equally likely, so Pr{HH | at least 1 head showing} = 1/3. In the absence of the prior knowledge, Pr{HH} = 1/4.

11 Conditional probability: general definition
The probability of A conditioned on B, Pr{A|B}, is meaningful only if Pr{B} ≠ 0. Given that B occurs, the probability that A also occurs is related to the set of outcomes in which both A and B occur, so Pr{A|B} is proportional to Pr{A∩B}. If we normalize by dividing by Pr{B} (which is ≠ 0), then Pr{B|B} = Pr{B∩B}/Pr{B} = Pr{B}/Pr{B} = 1, as required.

12 Apply Pr{A|B} = Pr{A∩B}/Pr{B} to the problem
Example of application of the definition of conditional probability: apply Pr{A|B} = Pr{A∩B}/Pr{B} to the problem "probability of 2 heads showing given that at least one head is showing". A∩B is the event with 2 heads showing and at least one head showing; in S = {HH, HT, TH, TT}, HH is the only such outcome, so Pr{A∩B} = 1/4. B is the event with at least one head showing; 3 out of 4 outcomes in S have this property, therefore Pr{B} = 3/4. Using the definition, Pr{A|B} = Pr{A∩B}/Pr{B} = (1/4)/(3/4) = 1/3. Equivalently, Pr{A|B} = (# of outcomes in both A and B)/(# of outcomes in B).
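
The counting argument can be checked by brute-force enumeration of the sample space; a tiny sketch:

    from itertools import product

    S  = list(product("HT", repeat=2))        # sample space {HH, HT, TH, TT}
    B  = [o for o in S if "H" in o]           # at least one head showing
    AB = [o for o in S if o == ("H", "H")]    # two heads showing (A and B both occur)

    print(len(AB) / len(B))                   # Pr{A|B} = 1/3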

13 Bayesian statistics explicitly includes prior knowledge
Bayes' rule applied to discrete probabilities: assume that Pr{A} and Pr{B} are both non-zero. Pr{A|B} = Pr{A∩B}/Pr{B} and Pr{B|A} = Pr{A∩B}/Pr{A}. Use the second relation to eliminate Pr{A∩B} from the first: Pr{A|B} = Pr{A} Pr{B|A}/Pr{B}. Pr{A} is called the "prior", Pr{B|A} the "likelihood", Pr{B} the "evidence" (a normalization factor), and Pr{A|B} the "posterior".

14 Bayes’ Rule for binary classification
Bayes' rule: posterior = prior × class likelihood / normalization (evidence). The prior is information relevant to classification that is independent of the attributes x. The class likelihood is the probability that a member of class C will have attributes x. The posterior is the probability that an example with attributes x should be assigned to class C. Assign an example with attributes x to class C if P(C|x) > 0.5. Credit-risk illustration: the prior is what we know about credit risk before we observe a client's attributes; it might be per-capita bankruptcies. The class likelihood, p(x|C), is the probability of observing x conditioned on being in class C: given that a client is high-risk (C = 1), how likely is X = {x1, x2}? It is deduced from data on a set of known high-risk clients. The evidence, p(x), is essentially a normalization, also called the "marginal probability" that x is seen regardless of class. The posterior, P(C|x), is the probability that the client belongs to class C conditioned on the attributes being X. When normalized by the evidence, the posteriors add up to 1.

15 Bayes’ Rule: K>2 Classes
Prior: information about class Ci that is independent of attributes x. Class likelihood: probability that a member of class Ci will have attributes x. Posterior: probability that an example with attributes x should be assigned to class Ci. Assign an example with attributes x to class Ci if P(Ci|x) = maxk P(Ck|x). Priors, likelihoods, posteriors, and margins are class specific; the evidence is the sum of the margins over the classes.

16 Application of Bayes’ rule for binary classification
Bayes' rule: posterior = prior × class likelihood / normalization (evidence), as defined above. A phone service offers a discount for the 1st year; cancellation after the 1st year is called "churn". The phone service has options: voice mail, international plan, caller ID, etc. Based on historical data, how do these options affect the probability of churn? Let C denote the class churn = true and ¬C the class churn = false.

17 Application of Bayes’ rule for binary classification
There are 2 choices for the method of calculation: normalized or not normalized. Normalization is not required because p(x) is the same for C and ¬C. If we normalize, we don't have to calculate P(¬C|x) = 1 − P(C|x), but we do have to calculate p(x) = P(C)P(x|C) + P(¬C)P(x|¬C). Let's choose to normalize.

18 Data on churn: 3333 records. Class size: 483 with churn = true → prior P(C) = 483/3333 = 0.1449. Prior knowledge: most customers stick with the service. Since P(C|x) is proportional to P(C)P(x|C), x must be a strong predictor of churn to overcome our prior knowledge that churn is rare. For records with churn = true, 80 out of 483 sign up for voice mail: P(V|C) = 0.1656. Voice mail is not an attribute that makes the likelihood of churn high. For records with churn = true, 137 out of 483 sign up for the International plan: P(I|C) = 0.2836 is a stronger likelihood of churn, but is it strong enough?

19 Normalization for international plan
p(I) = P(C)P(I|C) + P(¬C)P(I|¬C). P(¬C) = 1 − P(C) = 0.8551. P(I|¬C) = 186 out of 2850 = 0.0653. p(I) = (0.1449)(0.2836) + (0.8551)(0.0653) = 0.0969. P(C|I) = P(C)P(I|C)/p(I) = 0.0411/0.0969 = 0.424, close to but still less than 0.5. A model in which new customers who sign up for the International Plan are classified as churners is not supported by the data.

20 Without Normalization
P(C)P(I|C) = (0.1449)(0.2836) = 0.0411 is smaller than P(¬C)P(I|¬C) = (0.8551)(0.0653) = 0.0558; therefore assign a new customer who signs up for the International plan to the non-churner class. Same conclusion as before, without calculating p(I). The sketch below works through the calculation.
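
The whole churn calculation fits in a few lines of Python; the counts are the ones quoted on the slides:

    # Counts from the churn dataset: 3333 records, 483 churners
    N, n_churn = 3333, 483
    n_int_churn, n_int_stay = 137, 186        # International-plan subscribers per class

    P_C      = n_churn / N                    # prior P(C)        = 0.1449
    P_notC   = 1 - P_C                        # prior P(not C)    = 0.8551
    P_I_C    = n_int_churn / n_churn          # likelihood P(I|C)     = 0.2836
    P_I_notC = n_int_stay / (N - n_churn)     # likelihood P(I|not C) = 0.0653

    p_I = P_C * P_I_C + P_notC * P_I_notC     # evidence (normalization)
    print(P_C * P_I_C / p_I)                  # posterior P(C|I) = 0.424 < 0.5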

21 Dominance of priors The imbalance in the dataset toward customers who do not churn (2850 out of 3333) makes the posterior for the non-churn class greater for all options in the phone plan. Exploring the fine detail of how options might influence churning requires a more balanced dataset, which could be achieved by randomly deleting some non-churner examples from the dataset.

22 In preparation for your 1st assignment using WEKA:
Cover a simple method of classification that can be justified by Bayes' rule. Cover some of the measures of performance WEKA will report.

23 Derive K-nearest-neighbors (KNN) classification method
A classification method that WEKA calls "lazy". When K = 1, assign the example with attributes x to the same class as its nearest neighbor (the example with attributes y such that |x − y| is smallest). Note that x and y can be attribute vectors. Rationalize K > 1 by Bayes' rule.

24 Bayes’ M>2 classifier based on K nearest neighbors
Consider a dataset with N examples, Ni of which belong to class i; set P(Ci) = Ni/N. Given an example with attributes x, draw a hyper-sphere of volume V in attribute space, centered on x and containing precisely K training examples (the K nearest neighbors), irrespective of their class. Suppose this sphere contains ni examples from class i; then p(x|Ci) ≈ ni/(Ni·V), so p(x|Ci)P(Ci) = (ni/(Ni·V))(Ni/N) = ni/(N·V).

25 Bayes’ classifier based on K nearest neighbors
Summing over classes gives the evidence p(x) = K/(N·V), so using Bayes' rule the posterior for class k is P(Ck|x) = nk/K. Assign x to the class with the highest posterior, which is the class with the highest representation among the K nearest neighbors of x. More specifically: for every other example y in the dataset, calculate |x − y|; rank these results by increasing magnitude; cut the list off after K values and consider the classes to which members of the short list belong; assign x to the class that predominates among the short-list members. If the number of classes exceeds 2, no class may predominate; for binary classification there is always a predominant class if K is an odd integer.
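
A minimal sketch of this ranking procedure (function and variable names are illustrative):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=5):
        """Assign x to the class most represented among its k nearest neighbors."""
        d = np.linalg.norm(X_train - x, axis=1)      # |x - y| for every training example
        nearest = np.argsort(d)[:k]                  # indices of the k smallest distances
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[np.argmax(counts)]             # class with the largest n_k/K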

26 Best value for K: test set approach
Most data mining methods involve both "complexity" and "optimization". Example: fitting a polynomial to 1D data (x, y). "Complexity" is the degree of the polynomial (linear, quadratic, cubic, etc.); after choosing the degree, optimize to get the best coefficients of a polynomial of that degree. Finding the best degree requires separating the data into training and validation sets, discussed in more detail in "Regression in 1D". The KNN method does not involve optimization; the value of K can be considered its "complexity". Divide the dataset into training and test sets, with the test set 10-50% of the dataset. Try a range of values of K (usually odd integers) on the training data and pick the value of K that gives the highest accuracy; there may not be a minimum in the error as a function of K. For the selected K value, use the test set to see if the accuracy on the training data is preserved; see the sketch below.
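
One way to carry out this procedure with scikit-learn (a sketch under the assumption that cross-validation on the training split is used to score each K, since raw training accuracy is trivially perfect at K = 1; the dataset here is synthetic):

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=200, random_state=0)   # stand-in dataset
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    scores = {}
    for k in range(1, 16, 2):                                   # odd K avoids ties
        clf = KNeighborsClassifier(n_neighbors=k)
        scores[k] = cross_val_score(clf, X_tr, y_tr, cv=5).mean()

    best_k = max(scores, key=scores.get)
    final  = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
    print(best_k, final.score(X_te, y_te))   # check that accuracy is preserved on the test set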

27 Test-set dilemma and cross validation

28 Let h be a hypothesis tested by data mining that yields f(x) as the predicted response to attributes x. Let green marbles represent cases in the general population where our data mining result is correct, and red marbles cases where it is not. The probability of red marbles is a measure of the "out-of-sample" error, Eout(h). The fraction of red marbles in a test set is a measure of the "test-set" error, Etest(h). Apply Hoeffding's inequality to get an upper bound on |Eout(h) − Etest(h)|.

29 In this analogy, the sample cannot be the training set, because that sample was used to select the best hypothesis. This "in-sample" error is a biased estimate of the "out-of-sample" error.

30 Given a test set of size N and confidence level 1 − δ,
|Etest − Eout| < ε(δ, N) = sqrt(ln(2/δ)/(2N)), obtained by solving δ = 2·exp(−2ε²N) for ε. The larger N is, the smaller ε will be for a given δ. The "test-set dilemma" is that in order to make the test set more meaningful, we must give up more training data. Cross-validation is an alternative to a test set when a dataset is too small to allow a meaningful test set without compromising training.
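
The bound is simple enough to tabulate; a quick sketch:

    import math

    def test_set_epsilon(N, delta=0.05):
        """Hoeffding bound: with confidence 1 - delta, |Etest - Eout| < epsilon."""
        return math.sqrt(math.log(2 / delta) / (2 * N))

    for N in (50, 200, 1000):
        print(N, round(test_set_epsilon(N), 3))   # epsilon shrinks as the test set grows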

31 Leave-one-out: best cross-validation choice
Let gopt be the hypothesis that minimizes the in-sample error when the full dataset is used for training. Let g−k be the hypothesis that minimizes the in-sample error when every example except the kth datum is used for training, and let e−k be the error in predicting the kth datum with hypothesis g−k. E−1 = ⟨e−k⟩, the average of the errors in predicting the datum left out, is the best approximation to Etest(gopt) by cross-validation.

32 K-fold cross validation for K>1
Leave-one-out gives the best estimate of Etest because g−k should be close to gopt when only one datum is left out, but it requires a lot of computation if the dataset is large. The diagram illustrates the calculation of one 10-fold hypothesis, g−10, when the 4th tenth of the training data is left out. 10-fold cross-validation is a compromise: much less computation, but g−10 is not as close to gopt as the leave-one-out hypotheses g−k are. A sketch of both variants follows.
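
Both variants are available in scikit-learn; a sketch on synthetic stand-in data:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import LeaveOneOut, cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    X, y = make_classification(n_samples=150, random_state=0)   # stand-in dataset
    clf = KNeighborsClassifier(n_neighbors=5)

    loo = cross_val_score(clf, X, y, cv=LeaveOneOut()).mean()   # N fits: best estimate
    k10 = cross_val_score(clf, X, y, cv=10).mean()              # 10 fits: cheaper compromise
    print(loo, k10)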

33 Analysis of binary classification results

34 Quantities defined by binary confusion matrix
Let C1 be the positive class, C2 the negative class, and N the total # of instances. Error rate = (FP+FN)/N = 1 − accuracy. True positive rate = TP/(TP+FN) = fraction of C1 instances correctly classified. False positive rate = FP/(FP+TN) = fraction of C2 instances misclassified.
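
These definitions translate directly into code; a sketch with hypothetical counts:

    def binary_rates(TP, FN, FP, TN):
        """Rates defined by the 2x2 confusion matrix (C1 = positive class)."""
        N = TP + FN + FP + TN
        error_rate = (FP + FN) / N        # 1 - accuracy
        tp_rate = TP / (TP + FN)          # fraction of C1 instances correctly classified
        fp_rate = FP / (FP + TN)          # fraction of C2 instances misclassified
        return error_rate, tp_rate, fp_rate

    print(binary_rates(TP=40, FN=10, FP=5, TN=45))   # illustrative counts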

35 Quantities defined by binary confusion matrix
If either class could be the "positive" class (i.e., there is no reason why one class should be "yes"), then the TP and FP rates become class specific. Changing the identity of the positive class changes the numerical values of the elements in the 2×2 confusion matrix.

36 Performance metrics in WEKA output
Explain the TP and FP rates for class=2 from the confusion matrix. How does the confusion matrix change when 4 is the positive class? Explain the TP and FP rates for class=4 from the new confusion matrix.

37 Receiver operating characteristic (ROC) curve
Let C1 be the positive class and let q be the threshold of P(C1|x) for assignment of x to C1. If q is near 1, assignments to C1 are rare but have a high probability of being correct: both the FP-rate and the TP-rate are small. As q decreases, both the FP-rate and the TP-rate increase. For every value of q, (FP-rate, TP-rate) is a point on the ROC curve. Objective: find a value of q such that the TP-rate is near 1 while the FP-rate << 1.

38 Examples of ROC curves
[Figure: ROC curves labeled "marginal success" and "chance alone", with the 0.9 TP-rate level marked.] Suppose the minimum acceptable TP-rate is 0.9; which classifier is best? The ideal is TP-rate = 1 when FP-rate = 0 (the upper-left corner). Classifying by chance alone, we expect TP-rate = FP-rate (the diagonal). The shape of the ROC curve provides a basis for comparing classifiers.

39 Calculating smooth ROC curves
At selected thresholds of P(C1|x) for assignment of x to C1, calculate FP-rate and TP-rate and plot (FP-rate, TP-rate) pairs.

40 Digital ROC curves Assume C1 is the positive class. Rank all examples by decreasing value of P(C1|x). In decreasing rank order, move up 1/N1 for each positive example and move right 1/N2 for each negative example, where N1 and N2 are the numbers of positive and negative examples. If all examples are correctly classified, the ROC curve hugs the upper left and the area under the ROC = 1. If P(C1|x) is not correlated with the class labels, the ROC curve will be close to the diagonal and the area under the ROC ≈ 0.5.
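
A sketch of this stepping construction (names and the toy scores are illustrative):

    import numpy as np

    def digital_roc(scores, labels):
        """Rank by decreasing P(C1|x); step up for positives, right for negatives."""
        order = np.argsort(-np.asarray(scores))        # decreasing rank order
        y = np.asarray(labels)[order]                  # 1 = positive, 0 = negative
        n_pos, n_neg = y.sum(), len(y) - y.sum()
        tp = np.cumsum(y) / n_pos                      # up 1/N1 per positive example
        fp = np.cumsum(1 - y) / n_neg                  # right 1/N2 per negative example
        return fp, tp, np.trapz(tp, fp)                # area under the step curve

    fp, tp, auc = digital_roc([0.9, 0.8, 0.7, 0.4, 0.3], [1, 1, 0, 1, 0])
    print(auc)                                         # 5 of 6 pairs ranked correctly: 0.833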

41 Like TP and FP rates, ROC curves can be class specific
In this case, the ROC curves were sufficiently similar that the ROC areas are nearly the same.

42 Assignment 1: Due 9/5/17. Classification by the K Nearest Neighbors (KNN) technique. Dataset on the class web page, from Golub et al., Science, 286 (1999). Can 2 types of leukemia, AML and ALL, be distinguished by gene-expression data? See the class website for details.

43 Assignment 1: Classification by K Nearest Neighbors (KNN) technique
Due 8/31/17. Dataset on the class web page, from Golub et al., Science, 286 (1999). Download and familiarize yourself with Weka, a useful tool that has implemented most of the major machine learning algorithms. We will be using the Weka GUI in this assignment. Download and start the Weka GUI, following the instructions on the Weka site. On the Weka GUI you will see four buttons; we will only use the Explorer functionality. For more information about the Explorer functionality, check out the Explorer guide.

44 Open the leukemia gene expression file in Weka
Open the leukemia gene expression file in Weka. This file has data from 72 leukemia patients (rows). The expression values are for 150 genes (columns).  The last column is the type of leukemia (ALL or AML) for each patient. Q1.  What is the mean value of expression of the gene labeled “CD33 CD33 antigen (differentiation antigen)”?

45 Go to the “classify” tab. Under “Classifier” click the “Choose” button
Go to the "classify" tab. Under "Classifier" click the "Choose" button. Expand the "lazy" menu and choose "IBk"; this is KNN (IBk stands for Instance-Based k). Click on this text in the parameter box for IBk; a menu will pop up. For "KNN", enter 5. Recall that this means the algorithm will use the five nearest neighbors to classify each data point. Leave the rest of the values at their defaults. Under "Test options" choose "Cross-validation" and under "Folds" enter 5. The dropdown menu below Test options should say "(Nom) leukemia_type". This means that the algorithm will classify "leukemia_type" (AML or ALL) using the gene expression values as attributes. Click the "Start" button. The main window will show a variety of results, such as accuracy, true positive rates, false positive rates, and a confusion matrix when ALL is treated as the positive class. Q2a. What is the % of correctly classified instances? Q2b. Calculate the TP and FP rates for ALL from the confusion matrix. Q2c. What is the confusion matrix when AML is treated as the positive class? Q2d. Calculate the TP and FP rates for AML from the new confusion matrix.

46 Right click on your result in the “Result list” on the left side of the screen.  Choose “visualize threshold curve” and “ALL”.  An ROC curve plots true positive (TP) rate vs. false positive (FP) rate, which are the defaults.  You can also view other types of curves by clicking the dropdown menus.  For example, precision-recall curves are an alternative to ROC curves; precision and recall are options in the dropdown menu. Q3a. Capture the ROC curve when ALL is the positive class. Q3b. Capture the ROC curve when AML is the positive class.

47 Which class has the best ROC?
ALL Area = 0.978 AML Area = 0.967

48 ZeroR is a baseline classifier that identifies the class with the most examples and predicts all examples to be in that class. Click the Choose button under Classifier and expand the "rules" folder. Choose "ZeroR". Again use cross-validation with Folds = 5. Run it. Q4a. What is the % of correctly classified instances? Q4b. Calculate the TP and FP rates for ALL from the confusion matrix. Q4c. What is the confusion matrix when AML is treated as the positive class? Q4d. Calculate the TP and FP rates for AML from the new confusion matrix. Any successful classification should yield more accurate results than ZeroR; however, if the numbers of examples of each class in the dataset are greatly imbalanced, results with ZeroR will look good because most of the examples are correctly classified. This is an indication that you need to deal with uneven class sizes by weighting, a topic not covered in this class. Check out the "Cost Sensitive Classifier" and "Cost Sensitive evaluation" if you're interested.

49 Bayesian classification does not scale well with dimension
Classifying objects with m binary attributes into k classes requires calculation of k·2^m class likelihoods. Example: binary classification of phone customers (C vs ¬C) considering both the International plan and Voice mail requires P(I∧V|C), P(I∧¬V|C), P(¬I∧V|C), and P(¬I∧¬V|C), and likewise for ¬C. For high dimensionality, alternative approximations to the class likelihoods are needed. Example: assume the class likelihood is a multivariate normal distribution.

50 Mahalanobis distance: (x – μ)T ∑–1 (x – μ) is analogous to (x-m)2/s2
Mahalanobis distance: (x − μ)T Σ−1 (x − μ) is analogous to (x − m)²/s²: it measures the distance of x from the mean μ in units of Σ. Here x − μ is a d×1 column vector and Σ is the d×d covariance matrix; estimates of μ and Σ come from the examples in class C. The Mahalanobis distance is a scalar. It requires calculating the inverse (and, for the likelihood, the determinant) of the covariance matrix, and it includes the correlation between attributes.
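
A direct numpy sketch (synthetic class data for illustration):

    import numpy as np

    def mahalanobis_sq(x, mu, Sigma):
        """Squared Mahalanobis distance (x - mu)^T Sigma^-1 (x - mu); a scalar."""
        diff = x - mu
        return float(diff @ np.linalg.inv(Sigma) @ diff)

    rng = np.random.default_rng(1)
    Xc = rng.normal(size=(50, 2))        # hypothetical examples from one class C
    mu = Xc.mean(axis=0)                 # class mean estimate
    Sigma = np.cov(Xc, rowvar=False)     # class covariance estimate
    print(mahalanobis_sq(np.array([1.0, -0.5]), mu, Sigma))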

51 Naïve Bayes classification
Neglect correlation among attributes: assume the covariance matrix is diagonal. The class likelihood, p(x|C), is then a product of 1D Gaussians, one for each attribute; p(xi|C) is the probability that an example in class C will have value xi for attribute i. Each class is characterized by a set of means and variances of the attributes of the examples in the dataset that belong to that class.
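
A minimal sketch of this per-class fit and the resulting (unnormalized) log posterior, assuming continuous attributes:

    import numpy as np

    def fit_naive_bayes(X, y):
        """Per class: prior, attribute means, attribute variances (diagonal covariance)."""
        model = {}
        for c in np.unique(y):
            Xc = X[y == c]
            model[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0, ddof=1))
        return model

    def log_posterior(model, x):
        out = {}
        for c, (prior, mu, var) in model.items():
            log_like = -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
            out[c] = np.log(prior) + log_like      # product of 1D Gaussians, in log space
        return out                                 # assign x to the class with the max value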

52 Try naïve Bayes in Weka for the leukemia data set
Under the "bayes" Classifier folder, choose "NaiveBayes" and run. What is the % of correctly classified instances? What are the TP and FP rates for ALL and AML? Naïve Bayes is another method, like KNN, that does not require minimization of the in-sample error for its application. Maximum likelihood estimation applied to a 1D Gaussian shows that the sample mean m and sample variance s² are the best estimates of μ and σ², under the assumption that the attributes are normally distributed.

53 More about naïve Bayes and Bayesian networks in Chapter 5
Quiz #1 Thursday 9/7/17. Material in "Jargon of data mining" and "Review of statistics in data mining". Nothing from the textbook.

