Experiment Databases: Towards better experimental research in machine learning and data mining
Hendrik Blockeel
Katholieke Universiteit Leuven
Motivation
Much research in ML / DM involves experimental evaluation
Interpreting results is more difficult than it may seem
  Typically, a few specific implementations of algorithms, with specific parameter settings, are compared on a few datasets, and then general conclusions are drawn
  How generalizable are these results really?
Evidence exists that overly general conclusions are often drawn
  E.g., Perlich & Provost: the relative performance of techniques differs depending on the size of the dataset
Very sparse evidence
[Figure: a few experiments (x) scattered in a plot whose axes are dataset space (DS) and algorithm parameter space (AP)]
A few points in an N-dimensional space, where N is very large: very sparse evidence!
An improved methodology
We argue here in favour of an improved experimental methodology:
  Perform many more experiments
    Better coverage of algorithm – dataset space
  Store results in an “experiment database”
    Better reproducibility
  Mine that database for patterns
    More advanced analysis possible
The approach shares characteristics of inductive databases:
  The database will be mined for specific kinds of patterns: inductive queries, constraint-based mining
Classical setup of experiments
Currently, performance evaluations of algorithms rely on
  few specific instantiations of algorithms (implementations, parameters),
  tested on few datasets (with specific properties),
  often focusing on specific evaluation criteria,
  and with a specific research question in mind
Disadvantages:
  Limited generalisability (see before)
  Limited reusability of experiments
    If we want to test another hypothesis, we need to run new experiments, with a different setup, and record other information
Setup of an experiment database
The ExpDB is filled with results from random instantiations of algorithms, on random datasets
Algorithm parameters and dataset properties are recorded
Performance criteria are measured and stored
These experiments cover the whole DS x AP-space
[Figure: experiment pipeline: Choose algorithm (CART, C4.5, Ripper, ...) -> Choose parameters (Leaf size > 2, Heuristic = gain, ...) -> Generate dataset (#examples = 1000, #attr = 20, ...) -> Run -> Store (algorithm parameters, dataset properties, results)]
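The fill loop described on this slide could be sketched roughly as follows. This is only an illustration under assumptions: the value ranges, the column names (Alg, MLS, Heur, Ex, Attr, TP, FP, RT) and the function names fill_expdb and run_experiment are hypothetical, and run_experiment is a placeholder that returns dummy values instead of running a real learner.

```python
import random
import sqlite3

# Hypothetical choices; the slides only show single example values
# (leaf size 2, heuristic "gain", 1000 examples, 20 attributes).
ALGORITHMS = ["CART", "C4.5", "Ripper"]
HEURISTICS = ["gain", "gini"]


def run_experiment(alg, mls, heur, n_examples, n_attr):
    """Placeholder for 'generate dataset' and 'run'.

    Returns dummy (TP, FP, RT) values so that the sketch is runnable;
    a real ExpDB would store measured results here.
    """
    return 0.0, 0.0, 0.0


def fill_expdb(conn, n_experiments, seed=0):
    """Store one row per randomly instantiated experiment."""
    rng = random.Random(seed)
    conn.execute(
        "CREATE TABLE ExpDB (Alg TEXT, MLS INTEGER, Heur TEXT, "
        "Ex INTEGER, Attr INTEGER, TP REAL, FP REAL, RT REAL)"
    )
    for _ in range(n_experiments):
        alg = rng.choice(ALGORITHMS)                     # choose algorithm
        mls = rng.randint(2, 20)                         # choose parameters
        heur = rng.choice(HEURISTICS)
        n_examples = rng.choice([100, 1000, 10000])      # dataset properties
        n_attr = rng.randint(5, 50)
        tp, fp, rt = run_experiment(alg, mls, heur, n_examples, n_attr)
        conn.execute(
            "INSERT INTO ExpDB VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
            (alg, mls, heur, n_examples, n_attr, tp, fp, rt),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
fill_expdb(conn, n_experiments=50)
```

Because parameters and dataset properties are stored next to the results, every later question can be asked of this one table.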
Setup of an experiment database
When experimenting with 1 learner, e.g., C4.5:
  Algorithm parameters | Dataset characteristics | Performance
  MLS   heur           | Ex    Attr   Compl      | TP   FP   RT
  2     gain           | ...                     | ...
  ...
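Setup of an experiment database
When experimenting with multiple learners:
  More complicated setting, will not be considered here
[Figure: multi-table schema. The ExpDB table has columns Alg., Inst., PI, Ex, Attr, Compl, TP, FP, RT, ... with rows such as (DT, C4.5, C45-1, ...) and (DT, CART, CA, ...). Each PI value points into a per-algorithm parameter table: C4.5ParInst with columns PI, MLS, heur, ... (row: C45-1, 2, gain, ...) and CART-ParInst with columns PI, BS, heur, ... (row: CA-1, yes, Gini, ...)]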
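One way to read the multi-learner layout is as a main table plus one parameter table per algorithm, linked through PI. The sketch below assumes that reading. The table names are adapted so they are valid SQL identifiers, values elided on the slide are stored as NULL, and the key CA-1 is assumed for the CART row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE C45ParInst  (PI TEXT PRIMARY KEY, MLS INTEGER, Heur TEXT);
CREATE TABLE CARTParInst (PI TEXT PRIMARY KEY, BS TEXT, Heur TEXT);
CREATE TABLE ExpDB (Alg TEXT, Inst TEXT, PI TEXT,
                    Ex INTEGER, Attr INTEGER, RT REAL);
""")

# Parameter instantiations as shown on the slide.
conn.execute("INSERT INTO C45ParInst VALUES ('C45-1', 2, 'gain')")
conn.execute("INSERT INTO CARTParInst VALUES ('CA-1', 'yes', 'Gini')")

# Dataset properties and results are elided on the slide, hence NULL.
conn.execute("INSERT INTO ExpDB VALUES ('DT', 'C4.5', 'C45-1', NULL, NULL, NULL)")
conn.execute("INSERT INTO ExpDB VALUES ('DT', 'CART', 'CA-1', NULL, NULL, NULL)")

# Recover the C4.5 experiments together with their parameter settings.
c45_rows = conn.execute(
    "SELECT e.Inst, p.MLS, p.Heur "
    "FROM ExpDB AS e JOIN C45ParInst AS p ON e.PI = p.PI "
    "WHERE e.Inst = 'C4.5'"
).fetchall()
print(c45_rows)  # [('C4.5', 2, 'gain')]
```

Every query that involves parameters now needs a join per algorithm, which is part of what makes this setting more complicated than the single-learner table.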
Experimental questions and hypotheses
Example questions:
  What is the effect of Parameter X on runtime?
  What is the effect of the number of examples in the dataset on TP and FP?
  ...
With classical methodology:
  Different sets of experiments needed for each
  (unless all questions are known in advance, and the experiments are designed to answer all of them)
ExpDB approach:
  Just query the ExpDB table for the answer
  New question = 1 new query, not new experiments
Inductive querying
To find the right patterns in the ExpDB, we need a suitable query language
Many queries can be answered with standard SQL, but (probably) not all (easily)
We illustrate this with some simple examples
Investigating a simple effect
The effect of #Items on Runtime for frequent itemset algorithms
  SELECT NItems, Runtime FROM ExpDB ORDER BY NItems
  SELECT NItems, AVG(Runtime) FROM ExpDB GROUP BY NItems ORDER BY NItems
[Figure: scatter plot of Runtime against NItems]
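As a rough illustration, both queries can be run against a small in-memory SQLite table. The rows below are made up for the example and are not measurements. The column names follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ExpDB (NItems INTEGER, MinSupport REAL, Runtime REAL)")

# Made-up rows for illustration only; not real measurements.
rows = [(10, 0.05, 1.0), (10, 0.10, 3.0), (20, 0.05, 4.0), (20, 0.10, 6.0)]
conn.executemany("INSERT INTO ExpDB VALUES (?, ?, ?)", rows)

# All individual experiments, ordered by the number of items.
raw = conn.execute(
    "SELECT NItems, Runtime FROM ExpDB ORDER BY NItems"
).fetchall()

# Average runtime per number of items.
avg = conn.execute(
    "SELECT NItems, AVG(Runtime) FROM ExpDB GROUP BY NItems ORDER BY NItems"
).fetchall()
print(avg)  # [(10, 2.0), (20, 5.0)]
```

The first query gives the points of the scatter plot, and the second gives one averaged point per NItems value.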
Investigating a simple effect
Note: setting all parameters randomly creates more variance in the results
  In the classical approach, these other parameters would simply be kept constant
  This leads to clearer, but possibly less generalisable results
This can be simulated easily in the ExpDB setting!
  SELECT NItems, Runtime FROM ExpDB WHERE MinSupport = 0.05 ORDER BY NItems
  + : condition is explicit in the query
  - : we use only a part of the ExpDB
So, the ExpDB needs to contain many experiments
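The trade-off can be made concrete with a small sketch that fixes MinSupport and reports how much of the table is still used. The rows are again made up for illustration. Exact equality on a real-valued column only works here because the stored value and the query constant are the same literal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ExpDB (NItems INTEGER, MinSupport REAL, Runtime REAL)")

# Made-up rows for illustration only; not real measurements.
rows = [(10, 0.05, 1.0), (10, 0.10, 3.0), (20, 0.05, 4.0),
        (20, 0.02, 6.0), (30, 0.05, 7.0)]
conn.executemany("INSERT INTO ExpDB VALUES (?, ?, ?)", rows)

# Keep MinSupport constant, as a classical experiment would.
fixed = conn.execute(
    "SELECT NItems, Runtime FROM ExpDB WHERE MinSupport = ? ORDER BY NItems",
    (0.05,),
).fetchall()
total = conn.execute("SELECT COUNT(*) FROM ExpDB").fetchone()[0]

print(fixed)  # [(10, 1.0), (20, 4.0), (30, 7.0)]
print(f"{len(fixed)} of {total} experiments used")  # 3 of 5 experiments used
```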
Investigating interaction of effects
E.g., does the effect of NItems on Runtime change with MinSupport and NTrans?
  FOR a = 0.01, 0.02, 0.05, 0.1 DO
    FOR b = 10^3, 10^4, 10^5, 10^6, 10^7 DO
      PLOT SELECT NItems, Runtime FROM ExpDB
           WHERE MinSupport = $a AND NTrans >= $b AND NTrans < 10*$b
           ORDER BY NItems
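The nested loop could be driven from a host language, for instance as in the sketch below. It collects one (NItems, Runtime) series per (a, b) combination instead of plotting, so it needs no plotting library. The function name interaction_series and the sample rows are assumptions made for the example.

```python
import sqlite3


def interaction_series(conn, supports, ntrans_lower_bounds):
    """Return {(a, b): [(NItems, Runtime), ...]} for every non-empty cell."""
    series = {}
    for a in supports:
        for b in ntrans_lower_bounds:
            rows = conn.execute(
                "SELECT NItems, Runtime FROM ExpDB "
                "WHERE MinSupport = ? AND NTrans >= ? AND NTrans < ? "
                "ORDER BY NItems",
                (a, b, 10 * b),
            ).fetchall()
            if rows:  # skip combinations without any experiments
                series[(a, b)] = rows
    return series


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ExpDB (NItems INTEGER, MinSupport REAL, NTrans INTEGER, Runtime REAL)"
)
# Made-up rows for illustration only; not real measurements.
conn.executemany(
    "INSERT INTO ExpDB VALUES (?, ?, ?, ?)",
    [(20, 0.01, 2000, 2.0), (10, 0.01, 1500, 1.0), (10, 0.05, 50000, 0.5)],
)

series = interaction_series(
    conn, [0.01, 0.02, 0.05, 0.1], [10 ** k for k in range(3, 8)]
)
```

Each entry of series corresponds to one plot in the slide's loop. Empty cells show directly where the ExpDB is too sparse to answer the question.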
Direct questions instead of repeated hypothesis testing (“true” data mining)
What is the algorithm parameter that has the strongest influence on the runtime of my decision tree learner?
  SELECT ParName, Var(A)/Avg(V) AS Effect
  FROM AlgorithmParameters,
       (SELECT $ParName, Var(Runtime) AS V, Avg(Runtime) AS A
        FROM ExpDB GROUP BY $ParName)
  GROUP BY ParName
  ORDER BY Effect
Not (easily) expressible in standard SQL!
  (Pivoting is possible by hardcoding all attribute names in the query, but the result is not very readable or reusable)
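Outside a dedicated inductive query language, one workaround is to generate one standard SQL query per parameter from a host language. The sketch below does this for Var(A)/Avg(V), the variance of the per-group average runtimes divided by the average within-group variance. It makes several assumptions: SQLite has no built-in variance aggregate, so population variance is computed as AVG(x*x) - AVG(x)^2; the function name parameter_effects, the parameter columns MLS and Heur, and the rows are illustrative; and results are sorted with the largest effect first.

```python
import sqlite3
from statistics import mean, pvariance


def parameter_effects(conn, par_names):
    """Rank parameters by Var(group means) / Avg(group variances) of Runtime."""
    effects = {}
    for par in par_names:
        # Column names cannot be bound as SQL parameters, so they are
        # interpolated; par_names must come from a trusted, fixed list.
        groups = conn.execute(
            "SELECT AVG(Runtime), "
            "AVG(Runtime * Runtime) - AVG(Runtime) * AVG(Runtime) "
            f'FROM ExpDB GROUP BY "{par}"'
        ).fetchall()
        between = pvariance([m for m, _ in groups])  # Var(A)
        within = mean(v for _, v in groups)          # Avg(V)
        effects[par] = between / within if within > 0 else float("inf")
    return sorted(effects.items(), key=lambda kv: kv[1], reverse=True)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE ExpDB (MLS INTEGER, Heur TEXT, Runtime REAL)")
# Made-up rows for illustration only; not real measurements.
conn.executemany(
    "INSERT INTO ExpDB VALUES (?, ?, ?)",
    [(2, "gain", 1.0), (2, "gini", 2.0), (10, "gain", 9.0), (10, "gini", 10.0)],
)

ranking = parameter_effects(conn, ["MLS", "Heur"])
print(ranking[0][0])  # MLS
```

The loop over column names lives in Python rather than in the query, which is exactly the part that standard SQL cannot easily express.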
A comparison
Classical approach:
  1) Experiments are goal-oriented
  2) Experiments seem more convincing than they are
  3) Need to do new experiments when new research questions pop up
  4) Conditions under which results are valid are unclear
  5) Relatively simple analysis of results
  6) Mostly repeated hypothesis testing, rather than direct questions
  7) Low reusability and reproducibility
ExpDB approach:
  1) Experiments are general-purpose
  2) Experiments seem as convincing as they are
  3) No new experiments needed when new research questions pop up
  4) Conditions under which results are valid are explicit in the query
  5) Sophisticated analysis of results possible
  6) Direct questions possible, given suitable inductive query languages
  7) Better reusability and reproducibility
Summary
The ExpDB approach
  Is more efficient
    The same set of experiments is reusable and reused
  Is more precise and trustworthy
    Conditions under which the conclusions hold are explicitly stated
  Yields better documented experiments
    Precise information on all experiments is kept, experiments are reproducible
  Allows more sophisticated analysis of results
    Interaction of effects, true data mining capacity
Note: interesting for meta-learning!
The challenges... (*)
Good dataset generators necessary
  Generating truly varying datasets is not easy
  Could start from real-life datasets (build variations)
Extensive descriptions of datasets and algorithms
  Vary as many possibly relevant properties as possible
Database schema for multi-algorithm ExpDB
Suitable inductive query languages
(*) Note: even without solving all these problems, some improvement over the current situation is feasible and easy to achieve