Workshop: Experimental research in practice

Roland Geraerts
16 November 2016
Picture: http://www.quickmeme.com/meme/3odobv/
Please send comments to R.J.Geraerts@uu.nl

Bad repeatability
Why can it be hard to reproduce papers’ claims?

Bad repeatability (1)
Problem
- The results cannot be reproduced easily
Cause
- Details of the method are lacking:
  - Parts of the method are not described
  - Degenerate cases are missing
  - References to other papers are given without mentioning details
  - Parameters don’t get assigned values (usually weights)
- Source code is not available
- The experimental setup is not clear:
  - Tested hardware (e.g. which PC/GPU, the number of cores used)
  - Statistical setup (e.g. number of runs, seed)
  - Details of the scenario(s) are missing
Picture: http://tvaholics.blogspot.nl/

Bad repeatability (2)
Problem
- The results cannot be reproduced easily
Cause
- Low significance caused by a low number of runs
- Hard problems can be hard to implement
Solution
- Let someone else implement the method/paper
- Provide the source code
Picture: http://tvaholics.blogspot.nl/

Data collection errors
What kind of errors occur during the collection of (raw) data?
Picture: http://resource.wur.nl/en/wetenschap/detail/science_cafe_measurement_error_or_revolution/

Data collection errors (1)
Problem
- Errors occur during the collection of raw data
  - E.g. copy/pasting values from GUIs into Excel sheets or text files
Cause
- The data collection process was not automated
  - There is a GUI but no command-line (console) version
- Variables aren’t assigned the right values (how to verify this?)
- The precision of the stored numbers is too low
- Statistics are computed wrongly (e.g. the standard deviation: must we divide by n or n-1?)
- Only the execution of a part of the algorithm is recorded (partial results)
- The visualization part is not strictly separated from the execution part of the algorithm
  - E.g. while the method performs its computations, the results are being written to a log file and sent to the GPU for visualization purposes
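The n versus n-1 question above can be made concrete. A minimal sketch (not from the slides; the function name is illustrative): dividing by n gives the population standard deviation, dividing by n-1 gives the Bessel-corrected sample standard deviation, which is the usual choice when your runs are a sample of all possible runs.

```cpp
#include <cmath>
#include <vector>

// Illustrative helper: standard deviation of a data set.
// sample = true  -> divide by n-1 (sample SD, Bessel's correction)
// sample = false -> divide by n   (population SD)
double stddev(const std::vector<double>& x, bool sample) {
    double mean = 0.0;
    for (double v : x) mean += v;
    mean /= x.size();

    double ss = 0.0;                       // sum of squared deviations
    for (double v : x) ss += (v - mean) * (v - mean);

    double denom = sample ? x.size() - 1.0 : static_cast<double>(x.size());
    return std::sqrt(ss / denom);
}
```

For the same data, the sample SD is always slightly larger than the population SD; for large n the difference becomes negligible, but for the small run counts that are typical in experiments it can matter.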

Data collection errors (2)
Solution
- Automate the process using a console version called from a batch file
  - For small experiments, pass the arguments in the batch file
  - Otherwise, build a load/save mechanism
- Create an API that supports setting up experiments
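A console version that a batch file can drive only needs a small argument parser. The sketch below is hypothetical: the parameter set (scenario file, number of runs, seed) is an example, not the workshop's actual tool.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical experiment configuration; fields are illustrative.
struct ExperimentConfig {
    std::string scenario;  // path to a scenario description
    int runs = 0;          // number of repetitions
    unsigned seed = 0;     // RNG seed, so runs are reproducible
};

// Parses argv-style arguments: <scenario> <runs> <seed>.
// Returns false when the argument count is wrong, so the caller
// can print a usage message and exit with a non-zero status.
bool parseArgs(int argc, const char* argv[], ExperimentConfig& cfg) {
    if (argc != 4)
        return false;
    cfg.scenario = argv[1];
    cfg.runs = std::atoi(argv[2]);
    cfg.seed = static_cast<unsigned>(std::atoi(argv[3]));
    return true;
}
```

A batch file can then loop over scenarios and seeds, so every number in the paper can be regenerated with one command.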

Data collection errors (3): Time measurement errors
Problem
- Time is measured wrongly
Cause
- Lack of timer accuracy
  - C++: don’t use time.h
- The timer is started/stopped inside the method, which is especially harmful when the parts take less than 1 ms to compute
- Intervening network/CPU/GPU processes

Data collection errors (4): Time measurement errors
Solution
- Use accurate timers
  - C++: use QueryPerformanceCounter(…) instead; be careful of 0.3 s jumps, or
  - C++11: std::chrono::high_resolution_clock
- Run fast methods many times and take the average; watch out for non-deterministic behavior
- Take the average of several runs, also in the case of deterministic algorithms
- Only measure the running time of the algorithm
- Switch off the network
- Kill the virus scanner
- Stop the e-mail program
- Disable update functionality
- Use only 1 core
- Don’t work on your thesis while running the experiments on the same machine; and yes, this happens
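The "run fast methods many times and take the average" advice can be sketched with the std::chrono clock the slide mentions. The helper name and structure are my own, assuming the method under test can be called repeatedly:

```cpp
#include <chrono>
#include <cstddef>

// Times `runs` repetitions of runMethod with a high-resolution clock
// and returns the average wall-clock time per run in milliseconds.
// Timing around the whole loop (not inside the method) avoids paying
// the timer overhead once per call.
template <typename F>
double averageMillis(F runMethod, std::size_t runs) {
    auto start = std::chrono::high_resolution_clock::now();
    for (std::size_t i = 0; i < runs; ++i)
        runMethod();
    auto end = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double, std::milli> total = end - start;
    return total.count() / runs;
}
```

For non-deterministic methods, report the spread (SD or a boxplot) over many such averages rather than a single number.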

Bad figures
When do figures convey information badly?
Figure 7. Motion planning result considering the effect of currents.

Bad figures
Problem
- The figures convey information badly
Cause
- The figures are hard to read (e.g. too small or bitmapped)
- Axes haven’t been labeled
- The y-axis doesn’t start at 0, which amplifies (random) differences
- The wrong number precision/format is used
  - Don’t display 100,000.001
  - Don’t display 0.0005 s, or axis labels such as 0.1 0.15 0.2 …
- The meaning is not conveyed clearly
- Some colors/patterns don’t reproduce well on black & white printers
Solution
- Use e.g. GNUplot (set all labels and export to vector: EPS or PDF)
- Use vector images as much as possible (e.g. use IPE)
- Explain all phenomena
Picture: http://nl.123rf.com/photo_8159463_gea-soleerde-golden-luxe-bad-gemaakt-in-3d-graphics.html

Conclusions are too general
When are the drawn conclusions too general?

Conclusions are too general (1)
Problem
- The conclusions drawn are often too general
Cause
- Only one instance is tested, e.g. one environment / moving entity
- Only one problem setting is tested
- A favorable setup is used, e.g.
  - a few axis-aligned rectangular obstacles
  - polygonal convex obstacles
  - 1 fixed query
- Deterministic experiments do suffer from the ‘variance problem’
The variance problem: deterministic versus randomized techniques
- Deterministic techniques can respond sensitively to small changes in the problem setting
- Even worse, there might be statistical significance while a better implementation might halve the running times

Conclusions are too general (2)
Solution
- Try to sample the problem space as well as possible
- Don’t bias any method
- Use both a favorable setup (to show certain properties) and a ‘normal’ one
- Also choose worst-case scenarios
- Tune all methods equally
- Compare against the state of the art instead of old methods only
- Dare to show the weakness(es) of your method

Statistical weaknesses
When are the statistics less reliable?

Statistical weaknesses
Problem
- Statistics are done badly
Cause
- Results have been collected on different sets of hardware
- Too few runs
- Not all running times are mentioned (e.g. initialization)
- Only averages are mentioned
Solution
- Use the same machine (and don’t change the setup)
- Use e.g. GNUplot and set all (relevant) labels
- Use other measures, e.g.
  - SD
  - Boxplot
  - Student’s t-test: statistical hypothesis test
  - ANOVA: analysis of variance

Student’s t-test, source: http://en.wikipedia.org/wiki/Student_t_test
“A t-test is any statistical hypothesis test in which the test statistic follows a Student’s t distribution if the null hypothesis is supported. It can be used to determine if two sets of data are significantly different from each other, and is most commonly applied when the test statistic would follow a normal distribution if the value of a scaling term in the test statistic were known. When the scaling term is unknown and is replaced by an estimate based on the data, the test statistic (under certain conditions) follows a Student’s t distribution.”

ANOVA, source: http://en.wikipedia.org/wiki/Analysis_of_variance
“Analysis of variance (ANOVA) is a collection of statistical models used to analyze the differences between group means and their associated procedures (such as ‘variation’ among and between groups), developed by R.A. Fisher. In the ANOVA setting, the observed variance in a particular variable is partitioned into components attributable to different sources of variation. In its simplest form, ANOVA provides a statistical test of whether or not the means of several groups are equal, and therefore generalizes the t-test to more than two groups. Doing multiple two-sample t-tests would result in an increased chance of committing a type I error. For this reason, ANOVAs are useful in comparing (testing) three or more means (groups or variables) for statistical significance.”
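To make the t-test mention concrete, here is a minimal sketch of the two-sample t statistic. I assume Welch's unequal-variance variant, which is a common choice when comparing running times of two methods; a real analysis would also compute degrees of freedom and a p-value, e.g. with a statistics library.

```cpp
#include <cmath>
#include <vector>

// Welch's two-sample t statistic for samples a and b.
// A large |t| suggests the sample means differ by more than their
// variability explains; significance still requires a p-value.
double welchT(const std::vector<double>& a, const std::vector<double>& b) {
    auto meanVar = [](const std::vector<double>& x, double& mean, double& var) {
        mean = 0.0;
        for (double v : x) mean += v;
        mean /= x.size();
        var = 0.0;
        for (double v : x) var += (v - mean) * (v - mean);
        var /= (x.size() - 1);          // sample variance (n-1)
    };
    double ma, va, mb, vb;
    meanVar(a, ma, va);
    meanVar(b, mb, vb);
    return (ma - mb) / std::sqrt(va / a.size() + vb / b.size());
}
```

Comparing more than two methods this way inflates the type I error rate, which is exactly why the quoted ANOVA passage recommends an F-test over repeated pairwise t-tests.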

Statistically significant?

So your method is statistically significant
- While a method was granted statistical significance, this does not have to mean anything in practice…
  …due to the programmer’s bias.
- Suppose different methods run in 10.2, 10.0, 10.3, and 9.6 seconds (with appropriate SDs etc.). While the last one might seem better, in reality it does not have to be…
  …since the third one might be the only one that wasn’t optimized.
Picture: http://tvaholics.blogspot.nl/

Ways to bias your results (1)
Run the code with choices of:
- Hardware (CPU, GPU, memory, cache, #cores, #threads)
- Language (C++/C#, 32/64-bit, different optimizations)
- Software libraries (own code/Boost/STL)
- Implementation done by different people

Ways to bias your results (2): Some code optimizations
- Enable optimizations in your compiler
  - Run in release mode!
  - Visual Studio: full optimization, inline function expansion, enable intrinsic functions, etc.
- Compile the code with a 64-bit compiler
  - 2-15% improvement of running times due to the use of a larger instruction set
  - No need to simulate 32-bit code
  - However, watch code that deals with memory and loops: use memsize-types in address arithmetic
  - See http://www.viva64.com/
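The "memsize-types in address arithmetic" advice can be illustrated with a loop index. This is my own minimal example, not code from the talk: on a 64-bit target, indexing a container with a 32-bit int can cost extra sign-extension instructions and overflows for containers with more than 2^31 elements, while std::size_t always matches the platform's pointer width.

```cpp
#include <cstddef>
#include <vector>

// Sums a vector using a memsize-type (std::size_t) as the index,
// matching the return type of data.size() and the platform word size.
double sumAll(const std::vector<double>& data) {
    double sum = 0.0;
    for (std::size_t i = 0; i < data.size(); ++i)   // size_t, not int
        sum += data[i];
    return sum;
}
```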

Ways to bias your results (3): Some code optimizations
- Unroll loops
  - Improves usage of parallel execution (e.g. SSE2)
- Create small code, e.g. by improving the implementation; properly align data
  - Improves cache behavior
- Avoid mixed arithmetic
- Use the STL: it is heavily optimized
- Avoid disk usage, writing to a console, etc.
- Follow the course: Optimization and vectorization

Ethics versus mistakes
Let’s have a discussion here!