On Benchmarking Frequent Itemset Mining Algorithms
Balázs Rácz, Ferenc Bodon, Lars Schmidt-Thieme
Budapest University of Technology and Economics
Computer and Automation Research Institute of the Hungarian Academy of Sciences
Computer-Based New Media Group, Institute for Computer Science
OSDM05, On Benchmarking Frequent Itemset Mining Algorithms

History
- Over 100 papers on frequent itemset mining
- Many of them claim to be the ‘best’
  - Based on benchmarks run against some publicly available implementation on some datasets
- FIMI03, 04 workshops: extensive benchmarks with many implementations and datasets
  - Have served as a guideline ever since
- How ‘fair’ was the benchmark, and what did it measure?
On FIMI contests
- Problem 1: We are interested in the quality of algorithms, but we can only measure implementations.
  - There is no good theoretical data model yet for analytical comparison
  - As we will see later, it would also need a good hardware model
- Problem 2: If we gave our algorithms and ideas to a very talented and experienced low-level programmer, the result could completely redraw the current FIMI rankings.
  - A FIMI contest is all about the ‘constant factor’
On FIMI contests (2)
- Problem 3: Seemingly unimportant implementation details can hide all algorithmic features when benchmarking.
  - These details often go unnoticed even by the author and are almost never published.
On FIMI contests (3)
- Problem 4: FIM implementations are complete ‘suites’ of a basic algorithm and several algorithmic/implementational optimizations.
  - Comparing such complete ‘suites’ tells us what is fast, but does not tell us why.
- Recommendation:
  - Modular programming
  - Benchmarks on the individual features
On FIMI contests (4)
- Problem 5: The run time of all ‘dense’ mining tasks is dominated by I/O.
- Problem 6: On ‘dense’ datasets, FIMI benchmarks measure the ability of submitters to code a fast integer-to-string conversion function.
- Recommendation: Share as much identical code as possible
  - A library of FIM functions
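To make Problem 6 concrete, here is a minimal sketch of the kind of shared output helper such a library could provide. The function names `append_item` and `append_itemset` are hypothetical, and the line layout (items separated by spaces, support in parentheses) is an assumption, not something stated on the slides. It relies on `std::to_chars` from C++17, which avoids locale handling and heap allocation per number.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <cstdint>
#include <string>
#include <system_error>
#include <vector>

// Hypothetical helper: append the decimal text of one identifier to `out`.
inline void append_item(std::string& out, std::uint32_t value) {
    char buf[16];  // a 32-bit unsigned value needs at most 10 digits
    const auto res = std::to_chars(buf, buf + sizeof buf, value);
    assert(res.ec == std::errc());  // cannot fail with a buffer this large
    out.append(buf, static_cast<std::size_t>(res.ptr - buf));
}

// Hypothetical helper: append one result line "i1 i2 ... ik (support)\n".
// The layout is an assumption made for this sketch.
inline void append_itemset(std::string& out,
                           const std::vector<std::uint32_t>& items,
                           std::uint32_t support) {
    for (std::uint32_t item : items) {
        append_item(out, item);
        out.push_back(' ');
    }
    out.push_back('(');
    append_item(out, support);
    out.append(")\n");
}
```

If every submission called the same routine, differences in the conversion code would no longer show up in the measured run times.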
On FIMI contests (5)
- Problem 7: Run time differences are small
- Problem 8: Run time varies from run to run
  - The very same executable on the very same input
  - Bug or feature of modern hardware?
  - What to measure?
- Recommendation: ‘winner takes all’ evaluation of a mining task is unfair
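One possible way to act on Problems 7 and 8 is to repeat each measurement and compare summaries instead of single runs. The sketch below is an illustration under assumptions, not the evaluation used in the FIMI contests: the names `RunStats`, `summarize`, `time_repeated` and `effectively_tied`, the choice of the median, and the relative tolerance are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Summary of repeated wall-clock measurements, in seconds.
struct RunStats {
    double min_s;
    double median_s;
    double max_s;
};

// Sort the samples and report minimum, median and maximum.
inline RunStats summarize(std::vector<double> seconds) {
    assert(!seconds.empty());
    std::sort(seconds.begin(), seconds.end());
    const std::size_t n = seconds.size();
    const double median = (n % 2 == 1)
        ? seconds[n / 2]
        : 0.5 * (seconds[n / 2 - 1] + seconds[n / 2]);
    return {seconds.front(), median, seconds.back()};
}

// Run `task` several times and summarize the elapsed times.
template <typename Task>
RunStats time_repeated(Task&& task, std::size_t repetitions) {
    assert(repetitions > 0);
    std::vector<double> seconds;
    seconds.reserve(repetitions);
    for (std::size_t i = 0; i < repetitions; ++i) {
        const auto start = std::chrono::steady_clock::now();
        task();
        const auto stop = std::chrono::steady_clock::now();
        seconds.push_back(std::chrono::duration<double>(stop - start).count());
    }
    return summarize(std::move(seconds));
}

// Assumption of this sketch: two implementations count as tied when their
// medians differ by at most `rel_tolerance` of the slower median.
inline bool effectively_tied(const RunStats& a, const RunStats& b,
                             double rel_tolerance) {
    const double faster = std::min(a.median_s, b.median_s);
    const double slower = std::max(a.median_s, b.median_s);
    return (slower - faster) <= rel_tolerance * slower;
}
```

With a rule like this, two implementations whose medians lie within the run-to-run noise would share a rank rather than one of them taking the whole win.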
On FIMI contests (6)
- Problem 9: Traditional run-time (+ memory need) benchmarks do not tell us whether an implementation is better than another in algorithmic aspects or in implementational (hardware-friendliness) aspects.
- Problem 10: Traditional benchmarks do not show whether the conclusions would still hold on a slightly different hardware architecture (like AMD vs. Intel).
- Recommendation: extend benchmarks
Library and pluggability
- Code reuse, pluggable components, data structures
  - Object-oriented design
- Do not sacrifice efficiency
  - No virtual method calls allowed in the core
- Then how? C++ templates
  - Allow pluggability with inlining
  - Plugging requires a source code change, but several versions can coexist
  - Sometimes tricky to code with templates
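The idea of pluggability without virtual calls can be sketched with a template parameter that selects a component at compile time. Everything named here (`CountingSink`, `CollectingSink`, `FrequentItemCounter`) is hypothetical and only illustrates the technique; it is not the interface of the authors' library. The sketch assumes each item occurs at most once per transaction.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical output component 1: only counts the frequent items found.
struct CountingSink {
    std::size_t count = 0;
    void emit(unsigned /*item*/, unsigned /*support*/) { ++count; }
};

// Hypothetical output component 2: stores (item, support) pairs.
struct CollectingSink {
    std::vector<std::pair<unsigned, unsigned>> found;
    void emit(unsigned item, unsigned support) {
        found.emplace_back(item, support);
    }
};

// The core is a template over the sink type. The call to sink.emit() is
// resolved at compile time, so the compiler can inline it and no virtual
// dispatch happens in the inner loop.
template <typename Sink>
class FrequentItemCounter {
public:
    explicit FrequentItemCounter(unsigned min_support)
        : min_support_(min_support) {}

    void run(const std::vector<std::vector<unsigned>>& transactions,
             Sink& sink) const {
        std::vector<unsigned> support;
        for (const auto& transaction : transactions) {
            for (unsigned item : transaction) {
                if (item >= support.size()) support.resize(item + 1, 0);
                ++support[item];
            }
        }
        for (std::size_t item = 0; item < support.size(); ++item) {
            if (support[item] >= min_support_) {
                sink.emit(static_cast<unsigned>(item), support[item]);
            }
        }
    }

private:
    unsigned min_support_;
};
```

Swapping the component means changing the template argument in the source, for example `FrequentItemCounter<CountingSink>` versus `FrequentItemCounter<CollectingSink>`, and both instantiations can live in the same binary.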
I/O efficiency
Variations of the output routine:
- normal-simple: renders each itemset and each item separately to text
- normal-cache: caches the string representation of item identifiers
- df-buffered: (depth-first) reuses the string representation of the last line, appends the last item
- df-cache: like df-buffered, but also caches the string representation of item identifiers
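A minimal sketch of what the df-cache variant might look like, assuming a depth-first miner that extends the current itemset by one item when descending and drops the last item when backtracking. The class name `DfCachedWriter`, its methods, and the line layout are assumptions made for illustration, not the authors' code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical writer combining both ideas of df-cache:
//  * the text of every item identifier is rendered once, up front;
//  * the text of the current prefix is kept and only extended or truncated.
class DfCachedWriter {
public:
    explicit DfCachedWriter(unsigned item_count) {
        cache_.reserve(item_count);
        for (unsigned i = 0; i < item_count; ++i) {
            cache_.push_back(std::to_string(i) + ' ');
        }
    }

    // Descend one level: append the cached text of `item` to the prefix.
    void push(unsigned item) {
        assert(item < cache_.size());
        lengths_.push_back(line_.size());
        line_ += cache_[item];
    }

    // Backtrack one level: cut the prefix back to its previous length.
    void pop() {
        assert(!lengths_.empty());
        line_.resize(lengths_.back());
        lengths_.pop_back();
    }

    // Append the current itemset and its support to `out`.
    void emit(unsigned support, std::string& out) const {
        out += line_;
        out += '(';
        out += std::to_string(support);
        out += ")\n";
    }

private:
    std::vector<std::string> cache_;    // rendered item identifiers
    std::vector<std::size_t> lengths_;  // prefix length before each push
    std::string line_;                  // text of the current itemset
};
```

In this sketch only the support value is converted per output line; the items of the prefix are never rendered again.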
Benchmarking: desiderata
1. The benchmark should be stable and reproducible. Ideally it should have no variation, certainly not on the same hardware.
2. The benchmark numbers should reflect the actual performance. The benchmark should be a fairly accurate model of actual hardware.
3. The benchmark should be hardware-independent, in the sense that it should be stable against slight variations of the underlying hardware architecture, like changing the processor manufacturer or model.
Benchmarking: reality
- Different implementations stress different aspects of the hardware
- Migrating to other hardware:
  - May be better in one aspect, worse in another
  - Ranking cannot be migrated between hardware platforms
- Complex benchmark results are necessary
  - Is a win due to algorithmic or hardware-friendliness reasons?
- Performance is not as simple as ‘run time in seconds’
Benchmark platform
- Virtual machine
  - How to define?
  - How to code the implementations?
  - Cost function?
- Instrumentation (simulation of actual CPU)
  - Slow (100-fold slower than plain run time)
  - Accuracy?
  - Cost function?
Benchmark platform (2)
- Run-time measurement
- Performance counters
  - Present in all modern processors (since i586)
  - Count performance-related events in real time
  - PerfCtr kernel patch under Linux, vendor-specific software under Windows
- Problem: the measured numbers reflect the actual execution, and are thus subject to variation
Figure legend: three sets of bars
- Wide, centered: total size shows the total clockticks used, i.e. run time; purple shows stall time (CPU waiting for something)
- Narrow, centered: brown shows the number of instructions (u-ops) executed, which is stable; cyan shows u-ops wasted due to branch mispredictions
- Narrow, right: light brown shows ticks of memory reads/writes (mostly waiting); black shows read-ahead (prefetch)
Conclusion
- We cannot measure algorithms, only implementations
- Modular implementations with pluggable features
- Shared code for the common functionality (like I/O)
  - FIMI library with C++ templates
- Benchmark: run time varies and depends on the hardware used
- Complex benchmarks needed
  - Conclusions on algorithmic aspects or hardware friendliness?
Thank you for your attention
Big question: how does the choice of compiler influence the performance and the ranking?