Published by Elijah Perkins. Modified over 9 years ago.
Random Sets Approach and its Applications

Vladimir Nikulin, Suncorp, Australia

- Introduction: input data, objectives and main assumptions.
- Basic iterative feature selection, and modifications.
- Tests for independence & trimmings (similar to the HITON algorithm).
- Experimental results with some comments.
- Concluding remarks.
Introduction

Training data: {(x_t, y_t), t = 1, ..., n}, where y_t ∈ {-1, 1} is a binary label and x_t is a vector of m features. In a practical situation the label y may be hidden, and the task is to estimate it using the vector of features. The area under the receiver operating curve (AUC) will be used as the evaluation and optimisation criterion.
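Since AUC is the stated evaluation criterion, a minimal rank-based implementation may be useful as a reference. This is a generic sketch, not code from the slides; the `auc` helper and its signature are illustrative.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity.

    labels: iterable of -1/+1 class labels; scores: predicted scores,
    where a higher score means 'more likely positive'.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == -1]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    # count positive-over-negative pairs; ties contribute 0.5
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# a perfect ranking of positives above negatives gives AUC = 1.0
print(auc([1, 1, -1, -1], [0.9, 0.8, 0.2, 0.1]))   # -> 1.0
```

The pairwise form is O(|pos| * |neg|); for large sets a sort-based version is preferable, but the identity computed is the same.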
Causal relations

[Figure: causal graph linking features X1–X9 to the target Y]

Main assumption: direct features have a stronger influence on the target variable and are therefore more likely to be selected by FS algorithms. Manipulations are actions or experiments performed by an external agent on a system, whose effect disrupts the natural functioning of the system. By definition, direct features cannot be manipulated.
Basic iterative FS-algorithm (BIFS)

1: Input: Y – target variable; S – set of features.
2: Select evaluation criterion D and algorithm g.
3: Set Z = [].
4: Select feature f according to D and g.
5: Transfer feature f from S to Z.
6: Stop if there is no improvement; otherwise, go to Step 4.
BIFS: behaviour of the target function

[Figure: target-function behaviour on the CINA, LUCAP, MARTI and REGED sets]
RS-algorithm

1: Evaluate a long sequence of random sets (RS) using CV.
2: Sort the results in increasing order.
3: Select a block B of top/worst performing sets of features.
4: For every feature, compute its number of repeats in B.
5: Select a range of repeats for detailed investigation.
6: Apply some trimming (tests for independence).
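Steps 1–4 can be sketched as follows. The names `score`, `n_sets`, `set_size` and `block` are illustrative defaults, not from the slides; the notation RS(10000, 40) on the next slide suggests 10000 random subsets of 40 features each.

```python
import random
from collections import Counter

def random_sets(features, score, n_sets=1000, set_size=10, block=0.1, seed=0):
    """Random sets (RS) sketch: score many random feature subsets,
    then count how often each feature repeats in the top block B."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_sets):                      # step 1: long sequence of RS
        Z = rng.sample(features, set_size)
        trials.append((score(Z), Z))             # e.g. CV score of subset Z
    trials.sort(key=lambda t: t[0])              # step 2: increasing order
    top = [Z for _, Z in trials[-int(block * n_sets):]]   # step 3: block B
    return Counter(f for Z in top for f in Z).most_common()  # step 4: repeats

# toy criterion: subsets containing feature 0 score higher, plus noise
noise = random.Random(1).random
toy = lambda Z: (0 in Z) + 0.1 * noise()
ranking = random_sets(list(range(50)), toy, n_sets=2000, set_size=5)
```

On this toy problem, feature 0 dominates the repeat counts of the top block, which is exactly the signal steps 5–6 then investigate and trim.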
RS(10000, 40), MARTI case

[Figure: RS(10000, 40) repeat counts on MARTI; 10% block marked]
Test for independence (or trimming)

1: Input: Z – subset of features; Δ – threshold parameter.
2: Compute the statistic α.
3: If α < Δ, set Z := Z \ f and go to Step 2; otherwise, stop the procedure.
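The formula for α was not preserved in this transcript, so the sketch below treats it as an opaque per-feature statistic: `alpha_score` is a stand-in you would replace with the actual test-for-independence computation.

```python
def trim(Z, alpha_score, delta):
    """Trimming sketch: repeatedly drop the weakest feature f from Z
    while its statistic alpha falls below the threshold delta.

    alpha_score(f) is a hypothetical stand-in for the slide's
    (unspecified) test-for-independence statistic.
    """
    Z = list(Z)
    while Z:
        f = min(Z, key=alpha_score)      # weakest feature in Z
        if alpha_score(f) < delta:       # step 3: drop f and repeat
            Z.remove(f)
        else:
            break                        # otherwise stop the procedure
    return Z

# toy stand-in statistic: each feature's strength of association
alpha = {"x1": 0.9, "x2": 0.4, "x3": 0.05}.get
print(trim(["x1", "x2", "x3"], alpha, delta=0.1))   # -> ['x1', 'x2']
```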
Base models and software

Data    #Train (positive)   #Test   Dimension   Method               Software
LUCAS   2000 (1443)         10000   11          neural+gentleboost   MATLAB-CLOP
LUCAP   2000 (1443)         10000   143         neural+gentleboost   MATLAB-CLOP
REGED   500 (59)            20000   999         SVM-RBF              C
SIDO    12678 (452)         10000   4932        binaryRF             C
CINA    16033 (3939)        10000   132         adaBoost             R
MARTI   500 (59)            20000   1024        svc+standardize      MATLAB-CLOP
Final results (first 4 lines)

Data    Submission   CASE0    CASE1    CASE2    Mean     Rank
REGED   vn14         0.9989   0.9522   0.7772   0.9094   4
SIDO    vn14         0.9429   0.7192   0.6143   0.7588   5
CINA    vn14a        0.9764   0.8617   0.7132   0.8504   2
MARTI   vn14         0.9889   0.8953   0.7364   0.8736   4
LUCAS   vn1          0.9209   0.9097   0.7958   0.8755   validation
LUCAP   vn10b+vn1    0.9755   0.9167   0.9212   0.9378   validation
CINA    vn1          0.9765   0.8564   0.7253   0.8528   all features
CINA    vn11         0.9778   0.8637   0.7180   0.8532   CE
Some particular results

Data     Submission   #features   Fscore   TrainAUC   TestAUC
REGED1   vn14         400         0.7316   1          0.9522
REGED1   vn11d        150         0.8223   1          0.9487
REGED1   vn1          999         0.5      1          0.9445
REGED1   vn8          899         0.5145   1          0.9436
MARTI1   vn12c        500         0.5784   1          0.8977
MARTI1   vn14         400         0.5554   1          0.8953
MARTI1   vn3          999         0.5124   1          0.8872
MARTI1   vn7          899         0.4895   1          0.8722
SIDO0    vn9          203         0.5218   0.9684     0.9460
SIDO0    vn9a         326         0.5360   0.9727     0.9459
SIDO0    vn1          1030        0.5785   0.9811     0.9430
SIDO0    vn14         527         0.5502   0.9779     0.9429
[Figure: behaviour of linear filtering coefficients, MARTI set]
[Figure: CINA set, AdaBoost; plot of one solution against another]
[Figure: SIDO, RF(1000, 70, 10)]
Some comments

In practical applications we deal not with pure probability distributions but with mixtures of distributions, which reflect trends and patterns that change over time. Accordingly, it appears more natural to form the training set as an unlabeled mixture of subsets drawn from different (manipulated) distributions, for example REGED1, REGED2, ..., REGED9. As the distribution for the test set we can select any "pure" distribution. Proper validation is particularly important when the training and test sets have different distributions. It is therefore good practice to apply the traditional strategy: randomly split the available test set into two equal parts, one used for validation and the other for testing.
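The 50/50 split suggested above is a one-liner in practice; this minimal sketch (generic code, not from the slides) splits an index set reproducibly.

```python
import random

def split_half(items, seed=0):
    """Randomly split the available test set 50/50 into a validation
    part and a held-out test part, reproducibly via a fixed seed."""
    items = list(items)
    random.Random(seed).shuffle(items)
    mid = len(items) // 2
    return items[:mid], items[mid:]

valid, test = split_half(range(10000))
assert len(valid) == 5000 and len(test) == 5000
assert not set(valid) & set(test)        # the halves are disjoint
```

Keeping the seed fixed matters here: if the validation half changes between submissions, validation scores are no longer comparable.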
Concluding remarks

The random sets approach is heuristic in nature and has been inspired by the growing speed of computation. It is a general method, and there are many directions for further development. Performance of the model depends on the particular data: we certainly cannot expect one method to produce good solutions for all problems. It was probably necessary to apply a more aggressive FS strategy in the case of the Causal Discovery competition. Our results against all unmanipulated and all validation sets are in line with the top results.