1 Running Experiments for Your Term Projects Dana S. Nau CMSC 722, AI Planning University of Maryland Lecture slides for Automated Planning: Theory and.

Slides:



Advertisements
Similar presentations
(nothing to see here). First thing you need to learn is that sysadmin is about people, not technology If youre a sysadmin so you dont have to deal with.
Advertisements

Useful tricks in studying reading papers doing research writing papers publishing papers English e-manuscripts.
Exams and Revision Some hints and tips.
Dana Nau: Lecture slides for Automated Planning Licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License:
Exponential Functions Logarithmic Functions
Copyright © 2010, 2007, 2004 Pearson Education, Inc. Chapter 18 Sampling Distribution Models.
Copyright © 2010, 2007, 2004 Pearson Education, Inc. Chapter 18 Sampling Distribution Models.
Statistics.  Statistically significant– When the P-value falls below the alpha level, we say that the tests is “statistically significant” at the alpha.
Physics 270 – Experimental Physics. Let say we are given a functional relationship between several measured variables Q(x, y, …) What is the uncertainty.
Copyright © 2010 Pearson Education, Inc. Chapter 18 Sampling Distribution Models.
1-1 Copyright © 2015, 2010, 2007 Pearson Education, Inc. Chapter 25, Slide 1 Chapter 25 Comparing Counts.
Lecture 5 CS171: Game Design Studio 1I UC Santa Cruz School of Engineering 18 Feb 2010.
1 Psych 5500/6500 The t Test for a Single Group Mean (Part 5): Outliers Fall, 2008.
Copyright © 2010 Pearson Education, Inc. Chapter 19 Confidence Intervals for Proportions.
“Teach A Level Maths” Statistics 1
Scot Exec Course Nov/Dec 04 Ambitious title? Confidence intervals, design effects and significance tests for surveys. How to calculate sample numbers when.
Exam Strategies for Essay Exams
How Many Valentines?.
One-Factor Experiments Andy Wang CIS 5930 Computer Systems Performance Analysis.
Unit 1 (Chapter 5): Significant Figures
CORRELATION & REGRESSION
Academic Viva POWER and ERROR T R Wilson. Impact Factor Measure reflecting the average number of citations to recent articles published in that journal.
Problem Determination Your mind is your most important tool!
Significance Tests: THE BASICS Could it happen by chance alone?
Other Regression Models Andy Wang CIS Computer Systems Performance Analysis.
The Guessing Game The entire business of Statistics is dedicated to the purpose of trying to guess the value of some population parameter. In what follows.
Psy B07 Chapter 4Slide 1 SAMPLING DISTRIBUTIONS AND HYPOTHESIS TESTING.
Copyright © 2009 Pearson Education, Inc. Chapter 21 More About Tests.
Exam Exam starts two weeks from today. Amusing Statistics Use what you know about normal distributions to evaluate this finding: The study, published.
Dana Nau: CMSC 722, AI Planning Licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License:
How many times a week are you late to school? Leslie Lariz Marisol Cipriano Period
1 Psych 5500/6500 The t Test for a Single Group Mean (Part 4): Power Fall, 2008.
How to read a scientific paper
Hypotheses tests for means
Conditions. Objectives  Understanding what altering the flow of control does on programs and being able to apply thee to design code  Look at why indentation.
Lecture 1 Page 1 CS 239, Fall 2010 Distributed Denial of Service Attacks and Defenses CS 239 Advanced Topics in Computer Security Peter Reiher September.
1 Psych 5500/6500 The t Test for a Single Group Mean (Part 1): Two-tail Tests & Confidence Intervals Fall, 2008.
Chapter 7 Probability and Samples: The Distribution of Sample Means
Section 10.1 Confidence Intervals
The Guide to Great Graphs Ten things every graph needs! Or… Ten things we hate about graphs!
Data Analysis Econ 176, Fall Populations When we run an experiment, we are always measuring an outcome, x. We say that an outcome belongs to some.
Optimizing Your Computer To Run Faster Using Msconfig Technical Demonstration by: Chris Kilkenny.
STA291 Statistical Methods Lecture 17. Bias versus Efficiency 2 AB CD.
11/25/2015Slide 1 Scripts are short programs that repeat sequences of SPSS commands. SPSS includes a computer language called Sax Basic for the creation.
S2 Chapter 2: Poisson Distribution Dr J Frost Last modified: 20 th September 2015.
Inference about a population proportion. 1. Paper due March 29 Last day for consultation with me March 22 2.
Slide 21-1 Copyright © 2004 Pearson Education, Inc.
Conditional Statements.  Quiz  Hand in your jQuery exercises from last lecture  They don't have to be 100% perfect to get full credit  They do have.
Business Statistics for Managerial Decision Farideh Dehkordi-Vakil.
Chapter 9 Day 2 Tests About a Population Proportion.
1 Introduction to Statistics − Day 4 Glen Cowan Lecture 1 Probability Random variables, probability densities, etc. Lecture 2 Brief catalogue of probability.
Measurements and Their Analysis. Introduction Note that in this chapter, we are talking about multiple measurements of the same quantity Numerical analysis.
The Practice of Statistics, 5th Edition Starnes, Tabor, Yates, Moore Bedford Freeman Worth Publishers CHAPTER 9 Testing a Claim 9.2 Tests About a Population.
Linear Regression Models Andy Wang CIS Computer Systems Performance Analysis.
Lecture PowerPoint Slides Basic Practice of Statistics 7 th Edition.
Section 9.1 First Day The idea of a significance test What is a p-value?
Dana Nau: CMSC 722, AI Planning Licensed under the Creative Commons Attribution-NonCommercial-ShareAlike License:
Tests of Significance We use test to determine whether a “prediction” is “true” or “false”. More precisely, a test of significance gets at the question.
10.1 Estimating with Confidence Chapter 10 Introduction to Inference.
Copyright © 2010, 2007, 2004 Pearson Education, Inc. Chapter 21 More About Tests and Intervals.
How to make an effective presentation 8 July :00pm-4:00pm.
What Is a Test of Significance?
Chapter 21 More About Tests.
CHAPTER 9 Testing a Claim
CHAPTER 9 Testing a Claim
How to make an effective presentation
Hypothesis Testing for Proportions
CHAPTER 26: Inference for Regression
CHAPTER 22: Inference about a Population Proportion
CSCE 489- Problem Solving Programming Strategies Spring 2018
Presentation transcript:

1 Running Experiments for Your Term Projects Dana S. Nau CMSC 722, AI Planning University of Maryland Lecture slides for Automated Planning: Theory and Practice

2 Measuring a Program’s Performance Some of you want to run experiments to measure the performance of a program Possible things to measure  Running time »usually CPU time, not clock time  Solution quality »size? cost? some other measure?  Number of problems solved  Other things »E.g., in a game you might measure the number of wins, or the number of points How to display the results  Usually as a graph  It can be hard to look at a large table of numbers and figure out what it means

3 Sources of Test Problems International Planning Competition  For most of the competitions, you can download the entire set of competition problems Usually they’re written in a language called PDDL  Some PDDL tutorials:   

4 The IPC test data isn’t always adequate Only 20 problems total!  Not clear whether the results are statistically meaningful The problems aren’t sorted by difficulty  Hard to see what the trends are

5 Preferably each data point should be an average of at least 20 problems of similar difficulty What does similar difficulty mean?  Generally it means similar size  Sometimes there’s more than one possible measure of size (see below) Below, each data point is an average over 100 problems  The curve is much smoother -- easier to see how the program is performing  These results have high statistical significance

6 Statistical significance? If each data point is an average of many runs, it’s possible to compute error bars  95% confidence interval around each point  Some graphing programs can compute this for you automatically In other areas of AI (e.g., machine learning), people do this a lot In AI planning, almost nobody does it, and I won’t require it

7 How to get a large set of problems In some cases, you can download the program that generated the problems, and use it to generate more problems  If you can’t find the program on the web, ask me and I’ll check to see whether I can find it In other cases you can build such a program yourself  But be careful how you do it  Important for the program to generate an unbiased random sample  On a biased sample, a program’s performance can look much different than on an unbiased sample Many years ago, when one of my PhD students showed me his experimental results, his program appeared to be running in time O(n 2 )  When I looked at how he was generating his problems, it turned out his program only generated very easy ones  When we fixed that, the running time turned out to be exponential

8 Polynomial versus exponential? Which of these is growing the fastest?

9 Polynomial versus exponential? Which of these is growing the fastest?

10 Polynomial versus exponential? Which of these is growing the fastest? y = x 2 y = x 3.5 y = 2 x y = 1.2 x

11 What if a program can’t handle all the data? Sometimes a program can’t solve all of the problems at each data point  It may run out of time or run out of memory In this case, you generally need to throw out that data point  Don’t just take an average over the problems that the program can solve This biases the data – you’re only reporting the average on the easy problems  That makes a program’s performance look much better than it really is

12 2-hour time limit for each run If all 100 runs took less than 2 hours each, we took the average time per run If lots of runs failed to finish with 2 hours, we omitted those data points  No good way to get a good value for those data points If at least 97 of the runs finished within 2 hours  we counted others failure as 2 hours each when computing the average  we marked the data point with an asterisk, to tell the reader than in this case the program looks better than it actually is This gave us data points that were close to being correct  If we had thrown them away, the graph wouldn’t have been as informative

13 What problem domains to use? Depends on what you’re trying to show For a journal or conference paper  If you want to show that a program works well in lots of cases, then you generally want to compare its performance against other programs on several (at least 3?) domains that are significantly different from each other For the term project, results on a single domain are probably OK  Select a domain that illustrates a particular situation that you’re trying to investigate

14 If you have difficulty If you’re trying to evaluate a program’s performance and you run into difficulty  e.g., there’s some reason why it wouldn’t be feasible to run your program on a large set of problems Come discuss it with me