CS 299 Introduction to Data Science
Lecture 5: Evaluations
Dr. Sampath Jayarathna, Cal Poly Pomona
Credit for some of the slides in this lecture goes to Prof. Ray Mooney at UT Austin and Prof. Rong Jin at MSU.

Evaluation
Evaluation = the process of judging the merit or worth of something.
Evaluation is key to building effective and efficient data science systems. It is usually carried out in controlled experiments; online testing can also be done.

Why System Evaluation?
There are many models/algorithms/systems; which one is the best?
What is the best component for:
- similarity function (cosine, correlation, ...)
- term selection (stopword removal, stemming, ...)
- term weighting (TF, TF-IDF, ...)
How far down the list will a user need to look to find some/all relevant documents in text retrieval?

Precision and Recall
(Venn diagram in the original slides: the space of all documents, partitioned into relevant vs. not relevant and retrieved vs. not retrieved.)

Confusion Matrix
A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data.
- true positives (TP): we predicted positive (they have the disease), and they do have the disease.
- true negatives (TN): we predicted negative, and they don't have the disease.
- false positives (FP): we predicted positive, but they don't actually have the disease (also known as a "Type I error").
- false negatives (FN): we predicted negative, but they actually do have the disease (also known as a "Type II error").

                     Actual: Positive   Actual: Negative
Predicted: Positive         TP                 FP
Predicted: Negative         FN                 TN
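
A minimal sketch of computing the four confusion-matrix counts in plain Python; the label lists here are invented for illustration (1 = positive, 0 = negative):

    actual    = [1, 1, 0, 1, 0, 0, 1, 0, 0, 0]
    predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

    # Count each cell of the confusion matrix by comparing actual vs. predicted
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    print(tp, fp, fn, tn)   # 3 1 1 5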

Precision and Recall in Text Retrieval
Precision: the ability to retrieve top-ranked documents that are mostly relevant.
  Precision P = tp/(tp + fp)
Recall: the ability of the search to find all of the relevant items in the corpus.
  Recall R = tp/(tp + fn)

               Relevant   Nonrelevant
Retrieved         tp           fp
Not Retrieved     fn           tn
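
Continuing the sketch above, precision and recall fall directly out of the counts (a sketch, not a library API):

    def precision(tp, fp):
        return tp / (tp + fp)

    def recall(tp, fn):
        return tp / (tp + fn)

    print(precision(3, 1))  # 0.75, using the counts from the previous sketch
    print(recall(3, 1))     # 0.75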

Precision/Recall: Example
(Worked ranking examples shown as figures in the original slides.)

Accuracy
Overall, how often is the classifier correct?
Accuracy = number of correct predictions / total number of predictions
Accuracy = (tp + tn)/(tp + fp + fn + tn)
Example: Accuracy = (1 + 90)/(1 + 1 + 8 + 90) = 0.91, i.e., 91 correct predictions out of 100 total examples. But Precision = 1/2 and Recall = 1/9.
Accuracy alone doesn't tell the full story when you're working with a class-imbalanced data set.

                     Actual: Positive   Actual: Negative
Predicted: Positive          1                  1
Predicted: Negative          8                 90
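
The slide's class-imbalance example reproduced as a quick check, with the four counts taken from the table above:

    tp, fp, fn, tn = 1, 1, 8, 90
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    print(accuracy)          # 0.91 -- looks good...
    print(tp / (tp + fp))    # precision = 0.5
    print(tp / (tp + fn))    # recall ~ 0.11 -- ...but most positives are missed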

Activity 14
Accuracy of a retrieval model is defined by:
  Accuracy = (tp + tn)/(tp + tn + fp + fn)
Calculate the tp, fp, fn, tn and accuracy for ranking algorithm #1 and #2 for the highlighted location in the ranking.

               Relevant   Nonrelevant
Retrieved       tp = ?      fp = ?
Not Retrieved   fn = ?      tn = ?

import pandas as pd
obj = pd.read_csv('values.csv')

F Measure (F1/Harmonic Mean)
One measure of performance that takes into account both recall and precision.
Harmonic mean of recall and precision:
  F = 2 * Recall * Precision / (Recall + Precision)
Why harmonic mean? The harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by outliers that are unusually large. Data are extremely skewed: over 99% of documents are non-relevant, which is why accuracy is not an appropriate measure. Compared to the arithmetic mean, both precision and recall need to be high for the harmonic mean to be high.

F Measure (F1/Harmonic Mean): Example
Recall = 2/6 = 0.33
Precision = 2/3 = 0.67
F = 2*Recall*Precision/(Recall + Precision) = 2*0.33*0.67/(0.33 + 0.67) = 0.44

F Measure (F1/Harmonic Mean): Example
Recall = 5/6 = 0.83
Precision = 5/6 = 0.83
F = 2*Recall*Precision/(Recall + Precision) = 2*0.83*0.83/(0.83 + 0.83) = 0.83
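
The two worked examples above as a tiny sketch:

    def f1(precision, recall):
        # harmonic mean of precision and recall
        return 2 * precision * recall / (precision + recall)

    print(round(f1(0.67, 0.33), 2))  # 0.44 (first example)
    print(round(f1(0.83, 0.83), 2))  # 0.83 (second example)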

Mean Average Precision (MAP)
Average Precision: the average of the precision values at the rank positions at which each relevant document is retrieved, with the precision value set to zero for relevant documents that are not retrieved.
Ex1: (1 + 1 + 0.75 + 0.667 + 0.38 + 0)/6 = 0.633
Ex2: (1 + 0.667 + 0.6 + 0.5 + 0.556 + 0.429)/6 = 0.625
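
A sketch of average precision over a single ranked list; the example ranking is invented for illustration (ranking[i] is True when the document at rank i+1 is relevant):

    def average_precision(ranking, num_relevant):
        # Average the precision at each rank where a relevant document appears;
        # relevant documents never retrieved contribute a precision of zero.
        hits, total = 0, 0.0
        for rank, relevant in enumerate(ranking, start=1):
            if relevant:
                hits += 1
                total += hits / rank
        return total / num_relevant

    print(average_precision([True, False, True, True, False], num_relevant=4))
    # (1/1 + 2/3 + 3/4) / 4, approximately 0.604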

Average Precision: Example
(Worked ranked-list examples shown as figures in the original slides, including variants that miss one and miss two relevant documents.)

Mean Average Precision (MAP)
- Summarize rankings from multiple queries by averaging average precision (see the sketch below).
- The most commonly used measure in research papers.
- Assumes the user is interested in finding many relevant documents for each query.
- Requires many relevance judgments in the text collection.
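
A sketch of MAP over several queries, inlining the per-query average-precision computation from the earlier sketch; the query data are invented for illustration:

    def mean_average_precision(queries):
        # queries: list of (ranking, num_relevant) pairs, one per query
        ap_values = []
        for ranking, num_relevant in queries:
            hits, total = 0, 0.0
            for rank, relevant in enumerate(ranking, start=1):
                if relevant:
                    hits += 1
                    total += hits / rank
            ap_values.append(total / num_relevant)
        return sum(ap_values) / len(ap_values)

    queries = [
        ([True, False, True, True, False], 4),   # query 1: 4 relevant docs total
        ([False, True, True, False, True], 3),   # query 2: 3 relevant docs total
    ]
    print(mean_average_precision(queries))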

Significance Testing
Also called "hypothesis testing". Objective: to test a claim about a parameter μ.
Procedure:
1. State hypotheses H0 and Ha
2. Calculate the test statistic
3. Convert the test statistic to a P-value and interpret
4. Consider the significance level (optional)

Hypotheses
H0 (null hypothesis) claims "no difference".
Ha (alternative hypothesis) contradicts the null.
Example: we test whether a population gained weight on average.
  H0: no average weight gain in population
  Ha: H0 is wrong (i.e., "weight gain")
Next, collect data and quantify the extent to which the data provide evidence against H0.
The null hypothesis (read "H naught") is a statement of no difference; the alternative hypothesis ("H sub a") is a statement of difference. We seek evidence against the claim of H0 as a way of bolstering Ha.

Significance Tests
Given the results from a number of queries, how can we conclude that ranking algorithm B is better than algorithm A?
A significance test:
- null hypothesis: no difference between A and B
- alternative hypothesis: B is better than A
- the power of a test is the probability that the test will correctly reject the null hypothesis

t-test
The t-test (also called Student's t-test) compares two averages (means) and tells you if they are different from each other. The t-test also tells you how significant the differences are; in other words, it lets you know if those differences could have happened by chance.

t-test
What are t-values and p-values? How big is "big enough"? Every t-value has a p-value to go with it. A p-value is the probability that results at least as extreme as your sample data would occur by chance if the null hypothesis were true. P-values range from 0 to 1 (0% to 100%) and are usually written as a decimal; for example, a p-value of 5% is 0.05. Low p-values are good: they indicate your data are unlikely to have occurred by chance. For example, a p-value of 0.01 means there is only a 1% probability that the results from an experiment happened by chance. In most cases, a threshold of p = 0.05 (5%) is accepted, i.e., 95% confidence that the result didn't happen by chance.

Example Experimental Results
Significance level: α = 0.05
(Per-query results table for algorithms A and B shown as a figure in the original slides; the p-value is the probability that B = A.)

Example Experimental Results
Significance level: α = 0.05; p-value = probability that B = A.
p-value = 0.02 < 0.05, so we reject the null hypothesis: Avg(A) = 41.1, Avg(B) = 62.5, and B is better than A.
The p-value is less than the alpha level (p < 0.05), so we can be 95% confident in rejecting the null hypothesis and concluding that there is a significant difference between the means.
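
A sketch of this comparison as a paired t-test in SciPy: the same queries are scored under both algorithms, so the samples are paired rather than independent. The per-query scores below are invented for illustration and are not the numbers behind the slide's table:

    import numpy as np
    from scipy import stats

    # Hypothetical per-query effectiveness scores for algorithms A and B
    scores_a = np.array([25.0, 43.0, 39.0, 75.0, 43.0, 15.0, 20.0, 52.0, 49.0, 50.0])
    scores_b = np.array([45.0, 45.0, 56.0, 89.0, 59.0, 22.0, 45.0, 70.0, 89.0, 105.0])

    # Related-samples (paired) t-test on the per-query differences
    t_stat, p_val = stats.ttest_rel(scores_b, scores_a)
    print(t_stat, p_val)
    if p_val < 0.05:
        print("Reject H0: B is significantly different from A")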

T-test Python

import scipy.stats as stats
import numpy as np

# Two samples of 10 draws each; sample2 is shifted up by 1,
# so the population means genuinely differ
sample1 = np.random.randn(10)
sample2 = 1 + np.random.randn(10)

# Welch's t-test (equal_var=False does not assume equal variances)
t_stat, p_val = stats.ttest_ind(sample1, sample2, equal_var=False)
print(t_stat)
print(p_val)