
1 Advanced databases – Inferring implicit/new knowledge from data(bases): Evaluation (especially of models)
Bettina Berendt, Katholieke Universiteit Leuven, Department of Computer Science
http://www.cs.kuleuven.be/~berendt/teaching/2007w/adb/
Last update: 26 November 2007

2 Agenda
The concept and importance of evaluation
Error measures
Training and testing
Other forms of evaluation in KDD
Costs of errors

3 What is evaluation?
Evaluation: the act of ascertaining or fixing the value or worth of something.
(Programming) evaluation: the process of examining a system or system component to determine the extent to which specified properties are present. (http://www.webster-dictionary.org/definition/evaluation)
Refined definition: the act of ascertaining the value of an object according to specified criteria, operationalised in terms of measures.

4 What can be evaluated?
1. Is the model good?
   - e.g.: does a classifier mis-classify only few instances?
   - Are the patterns interesting? [note: also a question for 2. below]
   - Focus: machine learning ("evaluation of mining")
2. Is the fit between model and application good? I.e., does the model meet the business objectives?
   - e.g., is there some important business issue that has not been sufficiently considered?
   - Focus: KDD / fit between the model and the application
3. Is the application that produces the data good?
   - e.g., does a Web site satisfy its customers?
   - Focus: application ("mining for evaluation")

5 Recall: Knowledge discovery (figure)

6 Phases talked about today (figure)

7 Evaluation and data analysis – Evaluation of data analysis
Objects of evaluation:
- one or more data analysis procedures: specific algorithms or comprehensive software/information-systems solutions, existing/in place or suggested/planned
- patterns (results of a data analysis)
Criteria: quality and performance of a procedure; interestingness of a pattern
Measures include:
- accuracy of a classification algorithm
- impact of introducing a data mining software solution on tasks, resources, staffing
- interestingness measures
These measures' values are derived from theoretical analyses and from the data.
[diagram: a data mining procedure outputs patterns; evaluation uses measures to describe the procedure and the patterns]

8 Agenda
The concept and importance of evaluation
Error measures
Training and testing
Other forms of evaluation in KDD
Costs of errors

9 Some measures to assess classification quality
- Precision = TP/(TP+FP): percentage of retrieved documents that are relevant
- Recall = TP/(TP+FN): percentage of relevant documents that are returned
- Summary measures: average precision at 20%, 50% and 80% recall (three-point average recall)
- F-measure = (2 × recall × precision) / (recall + precision)
- Sensitivity × specificity = (TP/(TP+FN)) × (TN/(FP+TN))
- Accuracy = percentage of correctly classified instances
From Information Retrieval to classification: relevant ↔ truly in class C; retrieved ↔ classified as class C
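These formulas translate directly into code. A minimal Python sketch computed from the four confusion-matrix counts; the function name and the example numbers are illustrative, not from the slides:

```python
def classification_measures(tp, fp, tn, fn):
    """Compute the slide's measures from the four confusion-matrix counts."""
    precision   = tp / (tp + fp)                   # retrieved documents that are relevant
    recall      = tp / (tp + fn)                   # relevant documents that are returned (= sensitivity, TP rate)
    specificity = tn / (fp + tn)                   # TN rate
    f_measure   = (2 * recall * precision) / (recall + precision)
    accuracy    = (tp + tn) / (tp + fp + tn + fn)  # fraction of correctly classified instances
    return {"precision": precision, "recall": recall,
            "specificity": specificity, "F": f_measure, "accuracy": accuracy}

# Illustrative counts: 40 TP, 10 FP, 45 TN, 5 FN
print(classification_measures(tp=40, fp=10, tn=45, fn=5))
```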

10 Agenda
The concept and importance of evaluation
Error measures
Training and testing
Other forms of evaluation in KDD
Costs of errors

11 Evaluation: the key to success
How predictive is the model we learned?
Error on the training data is not a good indicator of performance on future data; otherwise 1-NN would be the optimal classifier!
Simple solution that can be used if lots of (labeled) data are available: split the data into a training set and a test set.
However, (labeled) data is usually limited, so more sophisticated techniques need to be used.

12 Issues in evaluation
Statistical reliability of estimated differences in performance (→ significance tests)
Choice of performance measure:
- number of correct classifications
- accuracy of probability estimates
- error in numeric predictions
Costs assigned to different types of errors: many practical applications involve costs

13 Training and testing I
Natural performance measure for classification problems: the error rate.
- Success: instance's class is predicted correctly
- Error: instance's class is predicted incorrectly
- Error rate: proportion of errors made over the whole set of instances
Resubstitution error: error rate obtained from the training data.
The resubstitution error is (hopelessly) optimistic!

14 Training and testing II
Test set: independent instances that have played no part in formation of the classifier.
Assumption: both training data and test data are representative samples of the underlying problem.
Test and training data may differ in nature.
Example: classifiers built using customer data from two different towns A and B; to estimate performance of the classifier from town A in a completely new town, test it on data from B.

15 Note on parameter tuning
It is important that the test data is not used in any way to create the classifier.
Some learning schemes operate in two stages:
- Stage 1: build the basic structure
- Stage 2: optimize parameter settings
The test data can't be used for parameter tuning!
Proper procedure uses three sets: training data, validation data, and test data. Validation data is used to optimize parameters.
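A minimal sketch of the three-set discipline in Python; `train`, `error_rate` and `candidate_params` are hypothetical placeholders for whatever learner, error measure and parameter grid are in use, and the 60/20/20 split is an assumption:

```python
import random

def three_way_split(instances, train_frac=0.6, valid_frac=0.2, seed=42):
    """Split the data into training, validation and test sets (60/20/20 here)."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    n_train = int(train_frac * len(data))
    n_valid = int(valid_frac * len(data))
    return (data[:n_train],                    # training data: build the basic structure
            data[n_train:n_train + n_valid],   # validation data: optimize parameter settings
            data[n_train + n_valid:])          # test data: used once, for the final estimate only

# Tuning touches only the training and validation sets (hypothetical helpers):
# best = min(candidate_params, key=lambda p: error_rate(train(train_set, p), valid_set))
# final_estimate = error_rate(train(train_set, best), test_set)
```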

16 Making the most of the data
Once evaluation is complete, all the data can be used to build the final classifier.
Generally, the larger the training data the better the classifier (but returns diminish).
The larger the test data, the more accurate the error estimate.
Holdout procedure: method of splitting the original data into training and test set.
Dilemma: ideally both training set and test set should be large!

17 Predicting performance
Assume the estimated error rate is 25%. How close is this to the true error rate? It depends on the amount of test data.
Prediction is just like tossing a (biased!) coin: "head" is a "success", "tail" is an "error".
In statistics, a succession of independent events like this is called a Bernoulli process. Statistical theory provides us with confidence intervals for the true underlying proportion. [More details in the Witten & Frank slides, chapter 5]
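As a sketch of the Bernoulli argument: for an observed error rate f on N independent test instances, the usual normal-approximation confidence interval for the true rate is f ± z·sqrt(f(1−f)/N), with z ≈ 1.96 for 95% confidence. Witten & Frank use a slightly more refined (Wilson-style) interval, but the simple version already shows how the interval shrinks with N:

```python
import math

def error_rate_interval(f, n, z=1.96):
    """Normal-approximation confidence interval for the true error rate,
    given observed error rate f on n independent test instances."""
    half_width = z * math.sqrt(f * (1 - f) / n)
    return max(0.0, f - half_width), min(1.0, f + half_width)

# Observed 25% error: the interval shrinks as the test set grows.
print(error_rate_interval(0.25, 100))    # roughly (0.165, 0.335)
print(error_rate_interval(0.25, 10000))  # roughly (0.242, 0.258)
```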

18 Holdout estimation
What to do if the amount of data is limited?
The holdout method reserves a certain amount for testing and uses the remainder for training. Usually: one third for testing, the rest for training.
Problem: the samples might not be representative. Example: a class might be missing in the test data.
An advanced version uses stratification: it ensures that each class is represented with approximately equal proportions in both subsets.
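A sketch of a stratified holdout split in plain Python, assuming instances are (features, label) pairs and the one-third test fraction from the slide; this is not the WEKA implementation:

```python
import random
from collections import defaultdict

def stratified_holdout(instances, test_frac=1/3, seed=0):
    """Reserve roughly test_frac of each class for testing, the rest for training."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for features, label in instances:             # instances assumed to be (features, label) pairs
        by_class[label].append((features, label))
    train, test = [], []
    for label, items in by_class.items():
        rng.shuffle(items)
        cut = int(round(test_frac * len(items)))  # per-class cut keeps class proportions similar
        test.extend(items[:cut])
        train.extend(items[cut:])
    return train, test
```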

19 Repeated holdout method
The holdout estimate can be made more reliable by repeating the process with different subsamples:
- In each iteration, a certain proportion is randomly selected for training (possibly with stratification)
- The error rates from the different iterations are averaged to yield an overall error rate
This is called the repeated holdout method.
Still not optimal: the different test sets overlap. Can we prevent overlapping?

20 Cross-validation
Cross-validation avoids overlapping test sets:
- First step: split the data into k subsets of equal size
- Second step: use each subset in turn for testing, the remainder for training
Called k-fold cross-validation.
Often the subsets are stratified before the cross-validation is performed.
The error estimates are averaged to yield an overall error estimate.
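A sketch of plain (non-stratified) k-fold cross-validation; `train` and `error_rate` are hypothetical stand-ins for the learner and error measure being evaluated:

```python
import random

def k_fold_cv(instances, k, train, error_rate, seed=0):
    """Average the error over k folds; each fold serves as the test set exactly once."""
    data = list(instances)
    random.Random(seed).shuffle(data)
    folds = [data[i::k] for i in range(k)]          # k roughly equal-sized, non-overlapping subsets
    errors = []
    for i in range(k):
        test_fold = folds[i]
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)                     # hypothetical learner
        errors.append(error_rate(model, test_fold)) # hypothetical error measure
    return sum(errors) / k
```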

21 More on cross-validation
Standard method for evaluation: stratified ten-fold cross-validation.
Why ten? Extensive experiments have shown that this is the best choice to get an accurate estimate, and there is also some theoretical evidence for this.
Stratification reduces the estimate's variance.
Even better: repeated stratified cross-validation, e.g. ten-fold cross-validation repeated ten times with the results averaged (reduces the variance).

22 Leave-One-Out cross-validation
Leave-One-Out is a particular form of cross-validation:
- Set the number of folds to the number of training instances
- I.e., for n training instances, build the classifier n times
Makes the best use of the data and involves no random subsampling, but is computationally very expensive (exception: NN).
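Leave-One-Out is just k-fold cross-validation with k = n; a direct sketch, again with hypothetical `train` and `error_rate` functions:

```python
def leave_one_out(instances, train, error_rate):
    """Build the classifier n times, each time testing on one held-out instance."""
    data = list(instances)
    errors = []
    for i, held_out in enumerate(data):
        training = data[:i] + data[i + 1:]           # everything except the single test instance
        model = train(training)                      # n model builds: the expensive part
        errors.append(error_rate(model, [held_out])) # error on the one held-out instance
    return sum(errors) / len(data)
```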

23 Leave-One-Out-CV and stratification
Disadvantage of Leave-One-Out-CV: stratification is not possible. It guarantees a non-stratified sample because there is only one instance in the test set!
Extreme example: a random dataset split equally into two classes.
- The best inducer predicts the majority class, giving 50% accuracy on fresh data
- The Leave-One-Out-CV estimate is 100% error!

24 Agenda
The concept and importance of evaluation
Error measures
Training and testing
Other forms of evaluation in KDD
Costs of errors

25 Counting the cost
In practice, different types of classification errors often incur different costs.
Examples:
- Terrorist profiling: "Not a terrorist" correct 99.99% of the time
- Loan decisions
- Oil-slick detection
- Fault diagnosis
- Promotional mailing

26 Counting the cost
The confusion matrix:

                           Predicted class
                           Yes               No
  Actual class   Yes       True positive     False negative
                 No        False positive    True negative

There are many other types of cost, e.g. the cost of collecting training data.

27 Aside: the kappa statistic
Two confusion matrices for a 3-class problem: actual predictor (left) vs. random predictor (right) (figure).
Number of successes: sum of the entries on the diagonal (D).
The kappa statistic measures the relative improvement over the random predictor.
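A sketch of the computation: with D_obs the observed diagonal sum, D_exp the diagonal a random predictor with the same row and column totals would be expected to achieve, and N the total number of instances, kappa = (D_obs − D_exp) / (N − D_exp). The 3-class matrix below is only illustrative; the slide's own matrices are in the figure:

```python
def kappa(confusion):
    """Kappa statistic from a square confusion matrix (rows = actual, columns = predicted)."""
    n = sum(sum(row) for row in confusion)
    d_obs = sum(confusion[i][i] for i in range(len(confusion)))
    row_totals = [sum(row) for row in confusion]
    col_totals = [sum(col) for col in zip(*confusion)]
    # Expected diagonal for a random predictor with the same marginal totals
    d_exp = sum(r * c / n for r, c in zip(row_totals, col_totals))
    return (d_obs - d_exp) / (n - d_exp)

# Illustrative 3-class confusion matrix: 140 of 200 instances on the diagonal
print(kappa([[88, 10, 2],
             [14, 40, 6],
             [18, 10, 12]]))   # roughly 0.49
```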

28 Classification with costs
Two cost matrices (figure).
Success rate is replaced by average cost per prediction; the cost is given by the appropriate entry in the cost matrix.

29 Cost-sensitive classification
Costs can be taken into account when making predictions. Basic idea: only predict the high-cost class when very confident about the prediction.
Given: predicted class probabilities.
- Normally we just predict the most likely class
- Here, we should make the prediction that minimizes the expected cost
  - Expected cost: dot product of the vector of class probabilities and the appropriate column in the cost matrix
  - Choose the column (class) that minimizes the expected cost
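A sketch of minimum-expected-cost prediction; the probabilities and the cost matrix below are made up for illustration, with the convention cost_matrix[i][j] = cost of predicting class j when the actual class is i:

```python
def min_expected_cost_class(class_probs, cost_matrix):
    """Return the class index with the lowest expected cost, plus all expected costs."""
    n = len(class_probs)
    expected = [sum(class_probs[i] * cost_matrix[i][j] for i in range(n))  # dot product with column j
                for j in range(n)]
    return min(range(n), key=lambda j: expected[j]), expected

# Illustrative 2-class case ("yes", "no"): missing a "yes" is ten times as costly as a false alarm.
probs = [0.3, 0.7]           # P(yes) = 0.3, P(no) = 0.7: "no" is the most likely class
costs = [[0, 10],            # actual yes: predicting yes costs 0, predicting no costs 10
         [1, 0]]             # actual no:  predicting yes costs 1, predicting no costs 0
print(min_expected_cost_class(probs, costs))   # picks class 0 ("yes") despite its lower probability
```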

30 Cost-sensitive learning
So far we haven't taken costs into account at training time.
Most learning schemes do not perform cost-sensitive learning: they generate the same classifier no matter what costs are assigned to the different classes. Example: the standard decision tree learner.
Simple methods for cost-sensitive learning:
- Resampling of instances according to costs
- Weighting of instances according to costs
Some schemes can take costs into account by varying a parameter, e.g. naïve Bayes.
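A sketch of the resampling idea: duplicate instances of the costly-to-miss class roughly in proportion to their misclassification cost before handing the data to an ordinary, cost-blind learner. The cost mapping and function name are illustrative; real schemes often use fractional instance weights instead of copies.

```python
import random

def cost_resample(instances, class_costs, seed=0):
    """Oversample each class roughly in proportion to its misclassification cost.
    instances: list of (features, label); class_costs: label -> relative cost."""
    rng = random.Random(seed)
    base = min(class_costs.values())
    resampled = []
    for features, label in instances:
        copies = class_costs[label] / base          # e.g. cost 10 vs. 1 -> about 10 copies
        whole = int(copies)
        resampled.extend([(features, label)] * whole)
        if rng.random() < copies - whole:           # handle the fractional remainder stochastically
            resampled.append((features, label))
    rng.shuffle(resampled)
    return resampled
```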

31 Lift charts
In practice, costs are rarely known; decisions are usually made by comparing possible scenarios.
Example: promotional mailout to 1,000,000 households.
- Mail to all: 0.1% respond (1000)
- A data mining tool identifies a subset of the 100,000 most promising households, of which 0.4% respond (400); 40% of the responses for 10% of the cost may pay off
- Identify a subset of the 400,000 most promising: 0.2% respond (800)
A lift chart allows a visual comparison.

32 Generating a lift chart
Sort the instances according to the predicted probability of being positive:

  Rank   Predicted probability   Actual class
  1      0.95                    Yes
  2      0.93                    Yes
  3      0.93                    No
  4      0.88                    Yes
  …      …                       …

x axis is sample size, y axis is number of true positives.
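A sketch of how the lift-chart points come out of that ranking; the four-instance example from the table above is reused, and the function name is illustrative:

```python
def lift_chart_points(predictions):
    """predictions: list of (predicted_probability, actual_is_positive) pairs.
    Returns (sample size, cumulative true positives) points for the lift chart."""
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    points, tp = [], 0
    for size, (prob, is_positive) in enumerate(ranked, start=1):
        tp += 1 if is_positive else 0
        points.append((size, tp))                   # x = sample size, y = true positives so far
    return points

print(lift_chart_points([(0.95, True), (0.93, True), (0.93, False), (0.88, True)]))
# [(1, 1), (2, 2), (3, 2), (4, 3)]
```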

33 A hypothetical lift chart (figure)
40% of responses for 10% of cost; 80% of responses for 40% of cost.

34 ROC curves
ROC curves are similar to lift charts:
- "ROC" stands for "receiver operating characteristic"
- Used in signal detection to show the tradeoff between hit rate and false alarm rate over a noisy channel
Differences to the lift chart:
- The y axis shows the percentage of true positives in the sample rather than the absolute number
- The x axis shows the percentage of false positives in the sample rather than the sample size
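The ROC points come from the same ranking as the lift chart, with both axes turned into rates. A sketch that assumes both classes actually occur in the test data (otherwise the rate denominators would be zero):

```python
def roc_points(predictions):
    """predictions: list of (predicted_probability, actual_is_positive) pairs.
    Returns (FP rate, TP rate) points obtained by lowering the threshold one instance at a time."""
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)
    n_pos = sum(1 for _, positive in ranked if positive)
    n_neg = len(ranked) - n_pos
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for prob, positive in ranked:
        if positive:
            tp += 1
        else:
            fp += 1
        points.append((fp / n_neg, tp / n_pos))     # x = FP rate, y = TP rate
    return points
```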

35 A sample ROC curve (figure)
Jagged curve: one set of test data. Smooth curve: use cross-validation.

36 Cross-validation and ROC curves
Simple method of getting an ROC curve using cross-validation:
- Collect probabilities for the instances in the test folds
- Sort the instances according to these probabilities
This method is implemented in WEKA.
However, this is just one possibility; another is to generate an ROC curve for each fold and average them.

37 ROC curves for two schemes (figure)
For a small, focused sample, use method A; for a larger one, use method B; in between, choose between A and B with appropriate probabilities.

38 Summary of some measures

  Plot                     Domain                  Axes (with explanation)
  Lift chart               Marketing               TP  vs.  subset size = (TP+FP)/(TP+FP+TN+FN)
  ROC curve                Communications          TP rate = TP/(TP+FN)  vs.  FP rate = FP/(FP+TN)
  Recall-precision curve   Information retrieval   Recall = TP/(TP+FN)  vs.  Precision = TP/(TP+FP)

39 Agenda
The concept and importance of evaluation
Error measures
Training and testing
Other forms of evaluation in KDD
Costs of errors

40 Evaluation as a KDD / CRISP-DM phase (figure)

41 Forms of evaluation and their main foci
Summative vs. formative evaluation:
- Purpose: summative = assess concrete achievements, give results and evidence; formative = understand how something works, analyze strengths and weaknesses towards improvement, give feedback
- Conceptualisation: summative = independent and dependent variables; formative = holistic, interdependent system
- Design: summative = experimental design; formative = naturalistic inquiry
- Relationship to prior knowledge: summative = confirmatory, hypothesis testing; formative = exploratory, hypothesis generating (→ pattern discovery)
- Sampling: summative = random, probabilistic; formative = purposeful, key informants (→ in mining: interesting patterns)
- Analysis: summative = descriptive and inferential statistics; formative = case studies, content and pattern analysis

42 Evaluation and data analysis – Evaluation of data analysis
Objects of evaluation:
- one or more data analysis procedures: specific algorithms or comprehensive software/information-systems solutions, existing/in place or suggested/planned
- patterns (results of a data analysis)
Criteria: quality and performance of a procedure; interestingness of a pattern
Measures include:
- accuracy of a classification algorithm
- impact of introducing a data mining software solution on tasks, resources, staffing
- interestingness measures
These measures' values are derived from theoretical analyses and from the data.
[diagram: an application gives rise to data and a goal; the mining procedure uses the data, contributes to the goal, and outputs patterns; evaluation measures / describes the procedure and the patterns]

43 Evaluation and data analysis – Data analysis for evaluation
Object of evaluation: a Web site.
Criteria:
- quality as interactive software
- quality as a retailer's distribution channel
- quality as a (mass) communication medium
- ...
The measures include:
- usability metrics
- business metrics
- ...
These measures' values are derived from analysing the data.
[diagram: the application gives rise to data and a goal; the mining procedure uses the data and contributes to the goal; its results measure / describe the application]

44 Site-specific usability issues: Example II
Criterion: Navigation should "support users' goals and behaviors", i.e. search criteria that are popular should be easy to find and use.
[BS00, Ber02] investigated search behavior in an online catalog, using the support of pages / of sequences as measures:
- Search using selection interfaces (clickable map, drop-down menu) was most popular.
- Search by location was most popular.
- The most efficient search by location (typing in the city name) was not used much.
- Action: modify the page design.

45 Next lecture
What if the input isn't in a table (or even multiple tables)? Mining semi-structured / unstructured data.
[overview diagram: Inputs, Data preparation, Algorithm, Outputs, Evaluation, Multirelational data mining]

46 References / background reading; acknowledgements
"Evaluation of mining":
- Witten, I.H., & Frank, E. (2005). Data Mining: Practical Machine Learning Tools and Techniques (2nd ed.). Morgan Kaufmann. http://www.cs.waikato.ac.nz/%7Eml/weka/book.html
- In particular, slides 10-38 are based on the instructor slides for that book (chapter 5), available at http://books.elsevier.com/companions/9780120884070/revisionnotes/01~PDFs/chapter5.pdf or http://books.elsevier.com/companions/9780120884070/revisionnotes/02~ODP%20Files/chapter5.odp
"Evaluation as a KDD phase":
- CRISP-DM 1.0. Step-by-step data mining guide. http://www.crisp-dm.org/CRISPWP-0800.pdf
"Mining for evaluation":
- investigated in great detail (focusing on Web mining), and the corresponding slides are based on: Berendt, B., Spiliopoulou, M., & Menasalvas, E. (2004). Evaluation in Web Mining. Tutorial at ECML/PKDD'04, Pisa, Italy, 20 September 2004. http://ecmlpkdd.isti.cnr.it/tutorials.html#work11

