Wake Up and Smell the Coffee: Evaluation Methodology for the 21st Century
May 4th, 2017
Ben Lenard
Introduction
- Methodology is the foundation that determines whether an experiment yields good or bad results.
- Like anything else, methodology needs to keep pace with current technologies.
- The article compares the testing methods used for C/C++ with those used for Java, and shows how outdated benchmarks can lead to the wrong conclusions.
- DaCapo is a benchmarking suite for Java.
Workload Design and Use
- DaCapo was created in 2003, after the group pointed out to an NSF panel the need for realistic Java benchmarks.
- Even without additional NSF funding, the group continued to develop the benchmark suite, since the existing benchmarks were dated.
- Relevant and diverse workload: a wide range of current applications.
- Suitable for research: controlled and easy to use.
Relevance and Diversity
- The authors used 'real world' applications, such as Eclipse, a Java IDE.
- The DaCapo suite supports repeatable runs with various parameters; each run takes about a minute.
- In addition to standard metrics, the authors also collected metrics about the Java heap, such as allocation rate, GC activity, and heap growth; a rough illustration of collecting such metrics follows.
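As a minimal sketch (this is not DaCapo's actual instrumentation), heap and GC statistics of this kind can be sampled with the standard java.lang.management API. Here runWorkload() is a hypothetical stand-in for one benchmark iteration:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapMetrics {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage before = memory.getHeapMemoryUsage();
        long gcBefore = totalGcCount();

        runWorkload(); // hypothetical stand-in for a benchmark iteration

        MemoryUsage after = memory.getHeapMemoryUsage();
        System.out.printf("Heap used: %d -> %d bytes (growth %d)%n",
                before.getUsed(), after.getUsed(),
                after.getUsed() - before.getUsed());
        System.out.printf("GC cycles during run: %d%n", totalGcCount() - gcBefore);
    }

    // Sum collection counts across all collectors (young and old generation).
    static long totalGcCount() {
        long count = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            count += gc.getCollectionCount();
        }
        return count;
    }

    static void runWorkload() {
        // Placeholder work: allocate enough to make the numbers move.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1_000_000; i++) sb.append(i);
    }
}
```

For reference, the DaCapo harness itself is typically invoked as java -jar dacapo.jar <benchmark> and handles iteration and timing on its own.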
Suitable for Research
- Workloads are easy to control.
- Instrumentation and packaging are easy to use, which encourages adoption and makes multiple runs straightforward.
- Experiments can run on a single host rather than a whole infrastructure.
The Researcher / Do Not Cherry-Pick!
- Workloads need to be relevant to the experiment; if a suitable one does not exist, create one with a consortium.
- A well-designed benchmark reflects a range of application behaviors, and all results should be shown so conclusions are not skewed.
Experimental Design / Gaming Your Results
- In addition to selecting a baseline, one must also identify the parameters that are relevant to the experiment.
- Make sure your results do not mislead. For example, the authors cite work that compares Java garbage collectors without varying the heap size, as illustrated in the sketch below.
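A hedged sketch of sweeping heap sizes for a GC comparison: the jar name benchmark.jar is a hypothetical placeholder, and the sweep simply relaunches the same workload under different fixed heap sizes.

```java
import java.io.IOException;

public class HeapSweep {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Hypothetical benchmark jar; substitute the real benchmark on your system.
        String jar = "benchmark.jar";
        // Sweep heap sizes so GC is compared across the space-time tradeoff,
        // not at a single arbitrary heap size.
        for (String heap : new String[] {"64m", "128m", "256m", "512m", "1g"}) {
            Process p = new ProcessBuilder(
                    "java", "-Xms" + heap, "-Xmx" + heap, "-jar", jar)
                    .inheritIO()
                    .start();
            int exit = p.waitFor();
            System.out.println("heap=" + heap + " exit=" + exit);
        }
    }
}
```

Setting -Xms equal to -Xmx pins the heap to a fixed size, removing heap resizing as a hidden variable in each run.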
Control in a Changing World
- In C/C++ and Fortran, the most important variables are the host, the compiler, and the runtime libraries.
- In Java there are more variables (one way to record them is sketched below):
  - Heap size and its parameters
  - Warm-up of the JVM or runtime environment
  - Nondeterminism
  - The Java/JIT compiler itself
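A small sketch of recording these variables alongside each result, using the standard RuntimeMXBean and system properties; logging the configuration is one way to make runs comparable later:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.RuntimeMXBean;

public class RecordConfig {
    public static void main(String[] args) {
        RuntimeMXBean rt = ManagementFactory.getRuntimeMXBean();
        // VM name and version identify the JIT/runtime being measured.
        System.out.println("VM: " + rt.getVmName() + " " + rt.getVmVersion());
        // Input arguments capture heap settings such as -Xmx and any GC flags.
        System.out.println("JVM args: " + rt.getInputArguments());
        // Host-level properties behind architecture-dependent behavior.
        System.out.println("OS: " + System.getProperty("os.name")
                + " " + System.getProperty("os.arch"));
        System.out.println("Java: " + System.getProperty("java.version"));
    }
}
```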
A Case Study
- The authors designed a study to evaluate garbage collection in a JVM, examining:
  - The space-time tradeoff in the heap
  - The relationship between the collector and the application itself
- Meaningful baseline: needed to keep the study 'apples-to-apples'.
- Host platform: architecture-dependent performance properties.
- Language runtime: libraries and the JIT compiler behave differently and should be controlled.
A Case Study (cont.)
- Heap size: since the authors are studying GC, multiple heap sizes should be used, because collectors can behave differently at different sizes.
- Warm-up: as more iterations occur, less compilation and class loading takes place, yielding more stable results.
- Controlling nondeterminism:
  - Use deterministic replay of optimization plans.
  - Take multiple measurements in a single JVM invocation, after warm-up.
  - Generate sufficient data points and apply suitable statistical analysis (a harness along these lines is sketched below).
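A minimal sketch of such a harness: warm-up iterations are discarded while the JIT compiles hot code, then the remaining iterations in the same JVM invocation are summarized. The iteration counts and runWorkload() are placeholders, not values from the paper.

```java
import java.util.ArrayList;
import java.util.List;

public class WarmupHarness {
    public static void main(String[] args) {
        int warmup = 10;   // discarded iterations while the JIT warms up
        int measured = 30; // iterations kept for analysis
        List<Long> samples = new ArrayList<>();

        for (int i = 0; i < warmup + measured; i++) {
            long start = System.nanoTime();
            runWorkload(); // hypothetical stand-in for one benchmark iteration
            long elapsed = System.nanoTime() - start;
            if (i >= warmup) samples.add(elapsed);
        }

        double mean = samples.stream().mapToLong(Long::longValue).average().orElse(0);
        double var = samples.stream()
                .mapToDouble(s -> (s - mean) * (s - mean))
                .sum() / (samples.size() - 1); // sample variance
        System.out.printf("mean=%.1f ns, stddev=%.1f ns over %d warm iterations%n",
                mean, Math.sqrt(var), samples.size());
    }

    static void runWorkload() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100_000; i++) sb.append(i);
    }
}
```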
Analysis
- Data analysis means:
  - Looking at repeated experiments to defeat experimental noise (for instance, by reporting confidence intervals, as sketched below).
  - Looking at diverse experiments to draw conclusions.
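One common way to summarize repeated runs, sketched here with placeholder numbers: a confidence interval over per-invocation means. The 1.96 factor assumes a normal approximation; a Student's t value is more appropriate for this few samples.

```java
public class ConfidenceInterval {
    public static void main(String[] args) {
        // Placeholder data: mean execution time (ms) from separate JVM invocations.
        double[] runMeans = {812.4, 798.1, 805.9, 821.7, 809.3};

        double mean = 0;
        for (double m : runMeans) mean += m;
        mean /= runMeans.length;

        double var = 0;
        for (double m : runMeans) var += (m - mean) * (m - mean);
        var /= (runMeans.length - 1); // sample variance

        // 95% interval under a normal approximation.
        double half = 1.96 * Math.sqrt(var / runMeans.length);
        System.out.printf("%.1f ms +/- %.1f ms (95%% CI)%n", mean, half);
    }
}
```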
Conclusion
- Sound methodology relies on:
  - Relevant workloads
  - Principled experimental design
  - Rigorous analysis
- The underlying point of the article is to control the variables within the experiment's environment.