FAKE GAME updates
Pavel Kordík
2/67 FAKE GAME concept
3/67 Automated data preprocessing For each feature, an optimal sequence of preprocessing methods is evolved…
4/67 Preprocessing methods implemented in FAKE GAME
5/67 Evolving preprocessing sequences
6/67 More on automated preprocessing
7/67 FAKE GAME concept
8/67 Automated Data Mining The GAME engine – automated evolution of models
9/67 Example: Housing data
Input variables: CRIM (per capita crime rate by town), ZN, INDUS, NOX, RM, AGE (proportion of owner-occupied units built prior to 1940), DIS (weighted distances to five Boston employment centers), RAD, TAX, PTRATIO, B, LSTAT
Output variable: MEDV (median value of owner-occupied homes in $1000's)
10/67 Housing data – records
Input variables CRIM … LSTAT, output variable MEDV; the records are split into three parts:
A = training set … to adjust weights and coefficients of neurons
B = validation set … to select neurons with the best generalization
C = test set … not used during training
11/67 Housing data – inductive model
Candidate units of various types (sigmoid, gauss, exp, linear, …) are attached to single inputs, e.g.
MEDV = a1*PTRATIO + a0 (linear)
MEDV = 1/(1 + exp(-(a1*CRIM + a0))) (sigmoid)
12/67 Housing data – inductive model
Fitted candidates are scored on the validation set:
MEDV = 1/(1 + exp(-5.724*CRIM)) … validation error 0.13
MEDV = 1/(1 + exp(-5.861*AGE)) … validation error 0.21
13/67 Housing data – inductive model
Surviving units feed the next layer (sigmoid 0.13, sigmoid 0.21, sigmoid 0.26, linear 0.24, polynomial 0.10), e.g.
MEDV = 0.747 * (1/(1 + exp(-5.724*CRIM))) * (1/(1 + exp(-5.861*AGE)))
14/67 Housing data – inductive model
Final model: layers of sigmoid, linear, polynomial, polynomial, linear and exponential units; validation error 0.08
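The layer-by-layer induction on slides 11–14 can be sketched as follows. The helper names (`fit_linear`, `grow_layer`) are hypothetical, and only a linear unit type is implemented for brevity; GAME also evolves sigmoid, polynomial and other units:

```python
import math

# Hypothetical sketch of layer-wise induction: one candidate unit per
# input feature is fitted on the training split, and the units with the
# lowest validation RMSE survive to feed the next layer.

def fit_linear(x, y):
    # least-squares fit of y = a1*x + a0 on a single feature
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    var = sum((xi - mx) ** 2 for xi in x) or 1e-12
    a1 = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / var
    a0 = my - a1 * mx
    return lambda v: a1 * v + a0

def rmse(model, x, y):
    return math.sqrt(sum((model(xi) - yi) ** 2 for xi, yi in zip(x, y)) / len(x))

def grow_layer(features, y_train, y_val, keep=2):
    # features: name -> (training column, validation column)
    scored = []
    for name, (col_train, col_val) in features.items():
        unit = fit_linear(col_train, y_train)
        scored.append((rmse(unit, col_val, y_val), name, unit))
    scored.sort(key=lambda t: t[0])
    return scored[:keep]   # (validation RMSE, feature, unit) tuples
```

Selection by validation error, rather than training error, is what gives the model its generalization pressure, exactly as on slide 10 (set B).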
15/67 Optimization of coefficients (learning)
Gaussian unit (GaussianNeuron) with inputs x1, x2, …, xn:
y' = a_{n+1} * exp(-(a1*x1 + a2*x2 + … + an*xn + a0)^2) + a_{n+2}
We have inputs x1, x2, …, xn and the target output y in the training data set; we are looking for the optimal values of the coefficients a0, a1, …, a_{n+2}. The difference between the unit output y' and the target value y should be minimal for all vectors from the training data set.
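A minimal sketch of the coefficient optimization described above, shown for a single sigmoid unit and plain stochastic gradient descent. GAME offers several optimization methods ("use them all"); this is just one illustrative, hypothetical choice, not the engine's implementation:

```python
import math

# One sigmoid unit y' = 1 / (1 + exp(-(a1*x + a0))); we minimize the
# squared error between y' and the target y over the training set.

def sigmoid_unit(a0, a1, x):
    return 1.0 / (1.0 + math.exp(-(a1 * x + a0)))

def fit_sigmoid(xs, ys, lr=0.5, epochs=2000):
    a0, a1 = 0.0, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            p = sigmoid_unit(a0, a1, x)
            # gradient of 0.5*(p - y)^2 w.r.t. the pre-activation
            g = (p - y) * p * (1.0 - p)
            a0 -= lr * g
            a1 -= lr * g * x
    return a0, a1
```

The same scheme extends to the Gaussian unit above: only the transfer function and its gradient change, which is why the engine can swap optimizers and unit types independently.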
16/67 Which optimization method is the best? It depends on
– dataset
– type of unit
– location of unit
– configuration of the method
Use them all!
17/67 Remember the Genetic algorithm optimizing the structure of GAME?
18/67 More on optimization …
19/67 Experimental results supporting the “use them all” strategy Atrial fibrillation (AF) is the most common clinically significant arrhythmia … the number of affected patients is expected to rise to 3.3 million by 2020 and to 5.6 million by 2050
20/67 Input features
1. Number of inflection points in a particular A-EGM signal (IP).
2. Mean number of inflection points in the found SCs in a particular A-EGM signal (MIPSC).
3. Variance of the number of inflection points in the found SCs in a particular A-EGM signal (VIPSC).
4. Mean width of the found SCs in a particular A-EGM signal (MWSC).
5. Number of inflection points in the found SCs in a particular A-EGM signal (IPSC).
6. IPSC normalized per number of found SCs in a particular A-EGM signal (NIPSC).
7. IP + MIPSC (IPMIPSC).
8. sqrt(IPSC^2 + TDM^2) (IPSCTDM).
9. Number of zero-level crossing points in the found SCs in a particular A-EGM signal (ZCP).
10. Maximum of IPSC in a particular A-EGM signal (MIPSC).
11. Time domain method (see below) with the rough (unfiltered) input A-EGM signal (TDM).
12. Time domain method using the input A-EGM signal filtered by the wavelet filter described above (FTDM).
21/67 Output Three experts manually annotated the signals (Class I, II, III, IV).
22/67 Prepared data
Columns: x36, x37, x39, x40, Avg Cls, Class I, Class II, Class III, Class IV
Avg Cls – average of the experts' rankings (regression)
Class I, II, III, IV – majority ranking (classification)
23/67 Experimental setup
Regression and classification tasks
10-fold cross-validation alone is not enough to get statistically significant results, so it is repeated 10 times
Each boxplot shows the average RMSE or classification accuracy over 100 models
Compared with WEKA methods
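The evaluation protocol above (10 folds × 10 repeats = 100 models per boxplot) can be sketched as follows; `repeated_kfold` is a hypothetical helper, not GAME or WEKA code:

```python
import random

# Generate 10 x 10 = 100 disjoint train/test index splits: each repeat
# reshuffles the data and partitions it into k folds, each fold serving
# once as the test set.

def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    rng = random.Random(seed)
    indices = list(range(n_samples))
    for _ in range(repeats):
        rng.shuffle(indices)
        folds = [indices[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            yield train, test
```

Training one model per split and aggregating its RMSE or accuracy yields the 100 values behind each boxplot on the following slides.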
24/67 Regression – GAME
[Boxplots: RMSE per GAME configuration]
Configurations: lin – linear units only; QN std – subset of unit types; QN quick – std with 5 epochs only; all – all units, all methods; ens3 postfix – ensemble of 3 models
Conclusions: the ensemble is more accurate and more stable; linear regression is sufficient; “all” performs slightly worse
25/67 Comparison with WEKA (prefix w- = WEKA)
[Boxplots: RMSE per configuration]
Conclusions: w-RBFN fails in spite of tuning; linear regression is best; game-all3 is not bad
26/67 Classification – GAME
[Boxplots: classification accuracy [%] per GAME configuration]
lin – fails; ens3 – good enough; all-ens – best
27/67 Comparison with WEKA Conclusions: GAME is slightly better than the J48 decision tree
28/67 It takes a lot of time: do it in parallel and employ all cores
29/67 More on distributed computing
30/67 Parallel threads synchronized by join()
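A minimal sketch of the join() synchronization pattern named on this slide: worker threads train candidate units in parallel and the main thread waits for all of them before collecting results. The unit names and the "training" work are placeholders, shown in Python rather than the engine's own implementation language:

```python
import threading

# Each worker thread "trains" one candidate unit and stores its result;
# distinct keys mean the threads never write to the same slot.
results = {}

def train_unit(name, data):
    # stand-in for fitting one unit's coefficients
    results[name] = sum(data) / len(data)

data = [1.0, 2.0, 3.0, 4.0]
threads = [threading.Thread(target=train_unit, args=(name, data))
           for name in ("sigmoid", "linear", "polynomial")]
for t in threads:
    t.start()
for t in threads:
    t.join()   # block until every worker has finished
```

Without the join() loop, the main thread could read `results` before the workers finish, which is exactly the race this pattern prevents.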
31/67 Experiment
32/67 Speedup for two cores
33/67 Speedup for 8 cores
34/67 Speedup limitations
The speedup for N cores according to Amdahl's law, with P the parallel fraction of the computation:
S(N) = 1 / ((1 - P) + P/N)
The speedup for 2 cores is 1.7, for 8 cores just 3.5 … and even for an infinite number of cores the speedup is limited to 5.82!
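The slide's numbers can be checked directly from Amdahl's law; the parallel fraction P ≈ 0.828 is inferred here from the stated limit of 5.82, it is not given on the slide:

```python
# Amdahl's law: speedup on N cores with parallel fraction P.
def amdahl(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# P inferred from the asymptotic speedup 1/(1 - P) = 5.82
P = 1.0 - 1.0 / 5.82   # about 0.828

s2 = amdahl(P, 2)   # about 1.71, matching the reported 1.7
s8 = amdahl(P, 8)   # about 3.63, close to the reported 3.5
```

The small gap at 8 cores (3.63 theoretical vs. 3.5 reported) is expected: the measured speedup also pays synchronization and scheduling overhead that the law ignores.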
35/67 FAKE GAME concept
36/67 Information that might be useful
Extracted from data
– data statistics – histograms, … plots, matrices, feature ranking, etc.
Extracted from automated preprocessing
– outliers
– data transformations
– data reduction
Extracted from models
– accuracy (10-fold cross-validation)
– 2D structure – type of units, optimization methods
– formulas
– 3D structure of the model – behavior
– feature ranking
– behavior of models
– credibility of prediction/classification
42/67 Data projections (2D, 3D)
43/67 Feature ranking
44/67 More on feature ranking
45/67 Model structure …
47/67 Model structure, behavior of units
48/67 Play FAKE GAME with your data
49/67 Log messages
Evolving preprocessing sequences …
Evolving ensemble of inductive models …
Evolving “interesting” visualizations …
Generating report …
Done … in 2009