1
Advanced databases – Inferring implicit/new knowledge from data(bases): Evaluation (especially of models)
Bettina Berendt, Katholieke Universiteit Leuven, Department of Computer Science
http://www.cs.kuleuven.be/~berendt/teaching/2007w/adb/
Last update: 26 November 2007
2
Agenda
- The concept and importance of evaluation
- Error measures
- Training and testing
- Other forms of evaluation in KDD
- Costs of errors
3
What is evaluation?
- Evaluation: the act of ascertaining or fixing the value or worth of something.
- (Programming) evaluation: the process of examining a system or system component to determine the extent to which specified properties are present. (http://www.webster-dictionary.org/definition/evaluation)
- Refined definition: the act of ascertaining the value of an object according to specified criteria, operationalised in terms of measures.
4
What can be evaluated?
1. Is the model good?
   - e.g.: does a classifier mis-classify only few instances?
   - Are the patterns interesting? [note: also a question for 2. below]
   - Focus: machine learning – "evaluation of mining"
2. Is the fit between model and application good? I.e., does the model meet the business objectives?
   - e.g.: is there some important business issue that has not been sufficiently considered?
   - Focus: KDD / fit between the model and the application
3. Is the application that produces the data good?
   - e.g.: does a Web site satisfy its customers?
   - Focus: application – "mining for evaluation"
5
Recall: Knowledge discovery
6
Phases talked about today
7
Evaluation and data analysis – Evaluation of data analysis
- Objects of evaluation:
  - one or more data analysis procedures
    - specific algorithms or comprehensive software/information systems solutions
    - existing / in place or suggested / planned
  - patterns (results of a data analysis)
- Criteria: quality and performance of a procedure; interestingness of a pattern
- Measures include:
  - accuracy of a classification algorithm
  - impact of introducing a data mining software solution on tasks, resources, staffing
  - interestingness measures
- These measures' values are derived from theoretical analyses and from the data.
[Diagram: data, mining procedure, patterns – connected by "uses", "outputs", "measures / describes".]
8
Agenda
- The concept and importance of evaluation
- Error measures
- Training and testing
- Other forms of evaluation in KDD
- Costs of errors
9
Some measures to assess classification quality
- Percentage of retrieved documents that are relevant: precision = TP / (TP + FP)
- Percentage of relevant documents that are returned: recall = TP / (TP + FN)
- Summary measures:
  - average precision at 20%, 50% and 80% recall (three-point average recall)
  - F-measure = (2 × recall × precision) / (recall + precision)
  - sensitivity × specificity = (TP / (TP + FN)) × (TN / (FP + TN))
- Accuracy = percentage of correctly classified instances
- From Information Retrieval to classification: "relevant" = truly in class C; "retrieved" = classified as class C
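All of these measures follow directly from the four confusion-matrix counts TP, FP, TN, FN. A minimal sketch (not part of the slides; the counts in the example are made up):

```python
# Sketch: classification quality measures from confusion-matrix counts.
def classification_measures(tp, fp, tn, fn):
    precision = tp / (tp + fp) if (tp + fp) else 0.0      # retrieved that are relevant
    recall = tp / (tp + fn) if (tp + fn) else 0.0         # relevant that are retrieved
    f_measure = (2 * recall * precision) / (recall + precision) if (recall + precision) else 0.0
    sensitivity = recall                                   # TP / (TP + FN), same as recall
    specificity = tn / (fp + tn) if (fp + tn) else 0.0     # TN / (FP + TN)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "precision": precision,
        "recall": recall,
        "F-measure": f_measure,
        "sensitivity x specificity": sensitivity * specificity,
        "accuracy": accuracy,
    }

# Example with made-up counts:
print(classification_measures(tp=40, fp=10, tn=45, fn=5))
```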
10
Agenda
- The concept and importance of evaluation
- Error measures
- Training and testing
- Other forms of evaluation in KDD
- Costs of errors
11
Evaluation: the key to success
- How predictive is the model we learned?
- Error on the training data is not a good indicator of performance on future data
  - Otherwise 1-NN would be the optimum classifier!
- Simple solution that can be used if lots of (labeled) data is available: split data into training and test set
- However: (labeled) data is usually limited
  - More sophisticated techniques need to be used
12
Issues in evaluation
- Statistical reliability of estimated differences in performance (→ significance tests)
- Choice of performance measure:
  - Number of correct classifications
  - Accuracy of probability estimates
  - Error in numeric predictions
- Costs assigned to different types of errors
  - Many practical applications involve costs
13
Training and testing I
- Natural performance measure for classification problems: error rate
  - Success: instance's class is predicted correctly
  - Error: instance's class is predicted incorrectly
  - Error rate: proportion of errors made over the whole set of instances
- Resubstitution error: error rate obtained from training data
  - Resubstitution error is (hopelessly) optimistic!
14
Training and testing II
- Test set: independent instances that have played no part in formation of classifier
  - Assumption: both training data and test data are representative samples of the underlying problem
- Test and training data may differ in nature
  - Example: classifiers built using customer data from two different towns A and B
  - To estimate performance of classifier from town A in completely new town, test it on data from B
15
Note on parameter tuning
- It is important that the test data is not used in any way to create the classifier
- Some learning schemes operate in two stages:
  - Stage 1: build the basic structure
  - Stage 2: optimize parameter settings
- The test data can't be used for parameter tuning!
- Proper procedure uses three sets: training data, validation data, and test data
  - Validation data is used to optimize parameters
16
Making the most of the data
- Once evaluation is complete, all the data can be used to build the final classifier
- Generally, the larger the training data the better the classifier (but returns diminish)
- The larger the test data the more accurate the error estimate
- Holdout procedure: method of splitting original data into training and test set
  - Dilemma: ideally both training set and test set should be large!
17
Predicting performance
- Assume the estimated error rate is 25%. How close is this to the true error rate?
  - Depends on the amount of test data
- Prediction is just like tossing a (biased!) coin
  - "Head" is a "success", "tail" is an "error"
- In statistics, a succession of independent events like this is called a Bernoulli process
  - Statistical theory provides us with confidence intervals for the true underlying proportion
- [More details in the Witten & Frank slides, chapter 5]
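The details are deferred to the Witten & Frank material; as a hedged sketch, one standard choice is the Wilson score interval derived from the normal approximation to the Bernoulli process. The observed 25% error rate is the slide's example; the test-set sizes are made up.

```python
# Sketch: confidence interval for the true error rate, given an observed
# error rate f on n test instances (Bernoulli process, normal approximation).
from math import sqrt

def wilson_interval(f, n, z=1.96):
    """Wilson score interval for the true proportion (z = 1.96 ~ 95% confidence)."""
    center = (f + z * z / (2 * n)) / (1 + z * z / n)
    half = (z / (1 + z * z / n)) * sqrt(f * (1 - f) / n + z * z / (4 * n * n))
    return center - half, center + half

# Observed 25% error rate: the interval shrinks as the test set grows.
for n in (100, 1000, 10000):          # hypothetical test-set sizes
    lo, hi = wilson_interval(0.25, n)
    print(f"n={n}: true error rate in [{lo:.3f}, {hi:.3f}] with ~95% confidence")
```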
18
Holdout estimation
- What to do if the amount of data is limited?
- The holdout method reserves a certain amount for testing and uses the remainder for training
  - Usually: one third for testing, the rest for training
- Problem: the samples might not be representative
  - Example: a class might be missing in the test data
- Advanced version uses stratification
  - Ensures that each class is represented with approximately equal proportions in both subsets
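A minimal sketch of a stratified holdout split (roughly one third of each class reserved for testing), using only the standard library; the toy dataset is made up.

```python
# Sketch: stratified holdout split -- roughly one third of each class goes
# to the test set, the rest to the training set. Data are made up.
import random
from collections import defaultdict

def stratified_holdout(instances, test_fraction=1/3, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, label in instances:
        by_class[label].append((x, label))
    train, test = [], []
    for label, group in by_class.items():
        rng.shuffle(group)
        cut = int(round(len(group) * test_fraction))
        test.extend(group[:cut])          # one third of this class for testing
        train.extend(group[cut:])         # the rest for training
    return train, test

data = [([i], "yes" if i % 3 else "no") for i in range(30)]   # toy labeled data
train, test = stratified_holdout(data)
print(len(train), "training instances,", len(test), "test instances")
```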
19
Repeated holdout method
- Holdout estimate can be made more reliable by repeating the process with different subsamples
  - In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
  - The error rates on the different iterations are averaged to yield an overall error rate
- This is called the repeated holdout method
- Still not optimum: the different test sets overlap
  - Can we prevent overlapping?
20
Cross-validation
- Cross-validation avoids overlapping test sets
  - First step: split data into k subsets of equal size
  - Second step: use each subset in turn for testing, the remainder for training
- Called k-fold cross-validation
- Often the subsets are stratified before the cross-validation is performed
- The error estimates are averaged to yield an overall error estimate
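A minimal sketch of plain (non-stratified) k-fold cross-validation; the `fit` and `error_rate` callables stand in for an arbitrary learning scheme and are assumptions for illustration, as is the toy majority-class learner at the end.

```python
# Sketch: k-fold cross-validation around an arbitrary learning scheme.
import random
from collections import Counter

def k_fold_cross_validation(instances, k, fit, error_rate, seed=0):
    rng = random.Random(seed)
    data = list(instances)
    rng.shuffle(data)
    folds = [data[i::k] for i in range(k)]          # k roughly equal subsets
    errors = []
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = fit(train)                          # train on the other k-1 folds
        errors.append(error_rate(model, test))      # test on the held-out fold
    return sum(errors) / k                          # averaged error estimate

# Toy usage: a majority-class "learner" on made-up labeled data.
data = [([i], "yes" if i % 4 else "no") for i in range(40)]
fit = lambda train: Counter(label for _, label in train).most_common(1)[0][0]
error_rate = lambda model, test: sum(label != model for _, label in test) / len(test)
print("10-fold CV error estimate:", k_fold_cross_validation(data, 10, fit, error_rate))
```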
21
More on cross-validation
- Standard method for evaluation: stratified ten-fold cross-validation
- Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate
  - There is also some theoretical evidence for this
- Stratification reduces the estimate's variance
- Even better: repeated stratified cross-validation
  - E.g. ten-fold cross-validation is repeated ten times and results are averaged (reduces the variance)
22
Leave-One-Out cross-validation
- Leave-One-Out: a particular form of cross-validation:
  - Set number of folds to number of training instances
  - I.e., for n training instances, build classifier n times
- Makes best use of the data
- Involves no random subsampling
- Very computationally expensive (exception: NN)
23
Leave-One-Out-CV and stratification
- Disadvantage of Leave-One-Out-CV: stratification is not possible
  - It guarantees a non-stratified sample because there is only one instance in the test set!
- Extreme example: random dataset split equally into two classes
  - Best inducer predicts majority class → 50% accuracy on fresh data
  - Leave-One-Out-CV estimate is 100% error!
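The extreme case is easy to reproduce. A small demonstration (made-up, perfectly balanced two-class data; a majority-class predictor stands in for the "best inducer"):

```python
# Sketch: leave-one-out with a majority-class predictor on a balanced,
# random two-class dataset. Every held-out instance is predicted wrongly.
import random
from collections import Counter

rng = random.Random(1)
labels = ["yes"] * 50 + ["no"] * 50        # 100 random instances, 50/50 split
rng.shuffle(labels)

errors = 0
for i, true_label in enumerate(labels):
    rest = labels[:i] + labels[i + 1:]     # leave one instance out
    majority = Counter(rest).most_common(1)[0][0]
    errors += (majority != true_label)     # majority of the rest is the *other* class

print("Leave-one-out error estimate:", errors / len(labels))   # prints 1.0
```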
24
Agenda
- The concept and importance of evaluation
- Error measures
- Training and testing
- Other forms of evaluation in KDD
- Costs of errors
25
Counting the cost
- In practice, different types of classification errors often incur different costs
- Examples:
  - Terrorist profiling ("Not a terrorist" is correct 99.99% of the time)
  - Loan decisions
  - Oil-slick detection
  - Fault diagnosis
  - Promotional mailing
26
Counting the cost
The confusion matrix:

                          Predicted class
                          Yes              No
    Actual class   Yes    True positive    False negative
                   No     False positive   True negative

There are many other types of cost! E.g.: cost of collecting training data
27
Aside: the kappa statistic
- Two confusion matrices for a 3-class problem: actual predictor (left) vs. random predictor (right) [matrices shown as figures on the slide]
- Number of successes: sum of entries in diagonal (D)
- Kappa statistic: measures relative improvement over random predictor
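The slide's matrices are not reproduced here. As a hedged sketch, Cohen's kappa can be computed from a single confusion matrix by comparing the observed diagonal with the diagonal expected from a random predictor that keeps the same marginal class distributions; the 3-class counts below are made up.

```python
# Sketch: kappa statistic from a (made-up) 3-class confusion matrix.
# Rows = actual class, columns = predicted class.
def kappa(confusion):
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(len(confusion))) / total
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(col) for col in zip(*confusion)]
    # Expected diagonal of a random predictor with the same marginals.
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    return (observed - expected) / (1 - expected)

matrix = [[88, 10, 2],     # illustrative counts only
          [14, 40, 6],
          [18, 10, 12]]
print("kappa =", round(kappa(matrix), 3))
```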
28
Classification with costs
- Two cost matrices [shown as figures on the slide]
- Success rate is replaced by average cost per prediction
  - Cost is given by appropriate entry in the cost matrix
29
Cost-sensitive classification
- Can take costs into account when making predictions
  - Basic idea: only predict high-cost class when very confident about prediction
- Given: predicted class probabilities
  - Normally we just predict the most likely class
  - Here, we should make the prediction that minimizes the expected cost
    - Expected cost: dot product of vector of class probabilities and appropriate column in cost matrix
    - Choose column (class) that minimizes expected cost
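A minimal sketch of this rule: compute the dot product of the class-probability vector with each column of the cost matrix and predict the class whose column gives the smallest value. The probabilities and costs are made up.

```python
# Sketch: cost-sensitive prediction by minimizing expected cost.
# cost_matrix[actual][predicted] is the cost of predicting `predicted`
# when the true class is `actual`. All numbers are illustrative.
def min_expected_cost_class(class_probs, cost_matrix, classes):
    best_class, best_cost = None, float("inf")
    for j, predicted in enumerate(classes):
        # Expected cost of predicting class j: dot product of the probability
        # vector with column j of the cost matrix.
        expected = sum(class_probs[i] * cost_matrix[i][j] for i in range(len(classes)))
        if expected < best_cost:
            best_class, best_cost = predicted, expected
    return best_class, best_cost

classes = ["no", "yes"]
cost_matrix = [[0, 1],      # actual "no":  predicting "yes" costs 1
               [10, 0]]     # actual "yes": predicting "no" costs 10
probs = [0.8, 0.2]          # the most likely class is "no" ...
print(min_expected_cost_class(probs, cost_matrix, classes))
# ... but expected costs are 0.2*10 = 2.0 for "no" vs. 0.8*1 = 0.8 for "yes",
# so the cost-sensitive prediction is "yes".
```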
30
Cost-sensitive learning
- So far we haven't taken costs into account at training time
- Most learning schemes do not perform cost-sensitive learning
  - They generate the same classifier no matter what costs are assigned to the different classes
  - Example: standard decision tree learner
- Simple methods for cost-sensitive learning:
  - Resampling of instances according to costs
  - Weighting of instances according to costs
- Some schemes can take costs into account by varying a parameter, e.g. naïve Bayes
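A hedged sketch of the first of these simple methods, resampling instances according to costs: instances of costlier classes are drawn more often when building the new training sample. The class costs and data are illustrative.

```python
# Sketch: cost-proportional resampling of training instances. Instances of
# costlier classes are over-represented in expectation in the new sample.
import random

def cost_resample(instances, class_costs, size=None, seed=0):
    """Draw a new training sample where each instance's chance of being
    picked is proportional to the misclassification cost of its class."""
    rng = random.Random(seed)
    weights = [class_costs[label] for _, label in instances]
    size = size or len(instances)
    return rng.choices(instances, weights=weights, k=size)

data = [([i], "yes" if i % 5 == 0 else "no") for i in range(50)]   # made-up data
resampled = cost_resample(data, class_costs={"yes": 10, "no": 1})
print(sum(1 for _, label in resampled if label == "yes"), "of", len(resampled),
      "resampled instances are 'yes'")
```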
31
Lift charts
- In practice, costs are rarely known
- Decisions are usually made by comparing possible scenarios
- Example: promotional mailout to 1,000,000 households
  - Mail to all; 0.1% respond (1000)
  - Data mining tool identifies subset of 100,000 most promising; 0.4% of these respond (400)
    - 40% of responses for 10% of cost may pay off
  - Identify subset of 400,000 most promising; 0.2% respond (800)
- A lift chart allows a visual comparison
32
Generating a lift chart
Sort instances according to predicted probability of being positive:

    Rank   Predicted probability   Actual class
    1      0.95                    Yes
    2      0.93                    Yes
    3      0.93                    No
    4      0.88                    Yes
    ...    ...                     ...

- x axis is sample size
- y axis is number of true positives
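A minimal sketch of the lift-chart data: sort the instances by predicted probability of being positive and record, for each sample size, how many true positives are included. The first four (probability, class) pairs are the slide's example; the rest are made up.

```python
# Sketch: points of a lift chart. Each pair is (predicted probability of
# "positive", actual class); the last four entries are made up.
ranked = [(0.95, "yes"), (0.93, "yes"), (0.93, "no"), (0.88, "yes"),
          (0.82, "no"), (0.80, "yes"), (0.77, "no"), (0.65, "yes")]

# Sort by predicted probability, highest first.
ranked.sort(key=lambda pair: pair[0], reverse=True)

true_positives = 0
for sample_size, (prob, actual) in enumerate(ranked, start=1):
    true_positives += (actual == "yes")
    print(f"x = {sample_size:2d} instances  ->  y = {true_positives} true positives")
```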
33
A hypothetical lift chart [figure]
- 40% of responses for 10% of cost
- 80% of responses for 40% of cost
34
ROC curves
- ROC curves are similar to lift charts
  - Stands for "receiver operating characteristic"
  - Used in signal detection to show tradeoff between hit rate and false alarm rate over noisy channel
- Differences to lift chart:
  - y axis shows percentage of true positives in sample rather than absolute number
  - x axis shows percentage of false positives in sample rather than sample size
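A minimal sketch of computing ROC points: sweep a threshold down the probability ranking and record the true-positive and false-positive rates (ties in probability are handled one instance at a time here for simplicity). The ranked data are made up.

```python
# Sketch: ROC curve points (FP rate, TP rate) from instances ranked by
# predicted probability of being positive. Data are made up.
ranked = [(0.95, "yes"), (0.93, "yes"), (0.93, "no"), (0.88, "yes"),
          (0.82, "no"), (0.80, "yes"), (0.77, "no"), (0.65, "yes")]

positives = sum(1 for _, c in ranked if c == "yes")
negatives = len(ranked) - positives

tp = fp = 0
points = [(0.0, 0.0)]
for prob, actual in sorted(ranked, key=lambda p: p[0], reverse=True):
    if actual == "yes":
        tp += 1
    else:
        fp += 1
    points.append((fp / negatives, tp / positives))   # (FP rate, TP rate)

for x, y in points:
    print(f"FP rate = {x:.2f}, TP rate = {y:.2f}")
```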
35
A sample ROC curve [figure]
- Jagged curve: one set of test data
- Smooth curve: use cross-validation
36
Cross-validation and ROC curves
- Simple method of getting a ROC curve using cross-validation:
  - Collect probabilities for instances in test folds
  - Sort instances according to probabilities
- This method is implemented in WEKA
- However, this is just one possibility
  - Another possibility is to generate an ROC curve for each fold and average them
37
ROC curves for two schemes [figure]
- For a small, focused sample, use method A
- For a larger one, use method B
- In between, choose between A and B with appropriate probabilities
38
Summary of some measures

    Plot                     Domain                  Axes (y vs. x)        Explanation (y; x)
    Lift chart               Marketing               TP vs. subset size    TP; (TP+FP)/(TP+FP+TN+FN)
    ROC curve                Communications          TP rate vs. FP rate   TP/(TP+FN); FP/(FP+TN)
    Recall-precision curve   Information retrieval   Recall vs. precision  TP/(TP+FN); TP/(TP+FP)
39
Agenda
- The concept and importance of evaluation
- Error measures
- Training and testing
- Other forms of evaluation in KDD
- Costs of errors
40
Evaluation as a KDD / CRISP-DM phase
41
Forms of evaluation and their main foci
- Mode: Summative vs. Formative
- Purpose: assess concrete achievements, give results and evidence (summative) vs. understand how something works, analyze strengths and weaknesses towards improvement, give feedback (formative)
- Conceptualisation: independent and dependent variables vs. holistic interdependent system
- Design: experimental design vs. naturalistic inquiry
- Relationship to prior knowledge: confirmatory, hypothesis testing vs. exploratory, hypothesis generating (in mining: pattern discovery)
- Sampling: random, probabilistic vs. purposeful, key informants (in mining: interesting patterns)
- Analysis: descriptive and inferential statistics vs. case studies, content and pattern analysis
42
Evaluation and data analysis – Evaluation of data analysis
- Objects of evaluation:
  - one or more data analysis procedures
    - specific algorithms or comprehensive software/information systems solutions
    - existing / in place or suggested / planned
  - patterns (results of a data analysis)
- Criteria: quality and performance of a procedure; interestingness of a pattern
- Measures include:
  - accuracy of a classification algorithm
  - impact of introducing a data mining software solution on tasks, resources, staffing
  - interestingness measures
- These measures' values are derived from theoretical analyses and from the data.
[Diagram: Application, data, goal, mining procedure, patterns – connected by "gives rise to", "uses", "contributes to", "outputs", "measures / describes".]
43
Evaluation and data analysis – Data analysis for evaluation
- Object of evaluation: a Web site
- Criteria:
  - quality as interactive software,
  - quality as a retailer's distribution channel,
  - quality as a (mass) communication medium,
  - ...
- The measures include:
  - usability metrics,
  - business metrics,
  - ...
- These measures' values are derived from analysing the data.
[Diagram: Application, data, goal, mining procedure – connected by "gives rise to", "uses", "contributes to", "measures / describes".]
44
Site-specific usability issues: Example II
- Criterion: Navigation should "support users' goals and behaviors." Search criteria that are popular should be easy to find and use.
- [BS00, Ber02] investigated search behavior in an online catalog with support of pages / of sequences as measures:
  - Search using selection interfaces (clickable map, drop-down menu) was most popular.
  - Search by location was most popular.
  - The most efficient search by location (type in city name) was not used much.
- Action: modify page design.
45
Next lecture
[Overview diagram labels: Inputs, Data preparation, Algorithm, Outputs, Evaluation, Multirelational data mining]
- What if the input isn't in a table (or even multiple tables)?
- Mining semi-structured / unstructured data
46
References / background reading; acknowledgements
- "Evaluation of mining"
  - Witten, I.H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed. Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html
  - In particular, pp. 10-38 are based on the instructor slides for that book (chapter 5), available at http://books.elsevier.com/companions/9780120884070/ :
    http://books.elsevier.com/companions/9780120884070/revisionnotes/01~PDFs/chapter5.pdf or
    http://books.elsevier.com/companions/9780120884070/revisionnotes/02~ODP%20Files/chapter5.odp
- "Evaluation as a KDD phase"
  - CRISP-DM 1.0. Step-by-step data mining guide. http://www.crisp-dm.org/CRISPWP-0800.pdf
- "Mining for evaluation"
  - is investigated in great detail (focusing on Web mining) in (and the slides are based on): Berendt, B., Spiliopoulou, M., & Menasalvas, E. (2004). Evaluation in Web Mining. Tutorial at ECML/PKDD'04, Pisa, Italy, 20 September 2004. http://ecmlpkdd.isti.cnr.it/tutorials.html#work11