1
Index Driven Selective Sampling for CBR
Nirmalie Wiratunga, Susan Craw, Stewart Massie
School of Computing, The Robert Gordon University, Aberdeen
2
Overview
- Selective sampling
- Cluster creation using an index
- Cluster and case utility scores
- Evaluation
3
Selective Sampling
[Diagram: interesting cases are selected from the unlabelled pool, labelled, and added as selected cases to the indexed case-base]
Example applications: relevance feedback, distance learning, patient monitoring
4
Uncertainty and Representativeness
[Diagram: a few labelled cases (+ / -) surrounded by many unlabelled cases (?)]
5
Sampling Procedure
L = set of labelled cases
U = set of unlabelled cases
LOOP
  model <= create-domain-model(L)
  clusters <= create-clusters(model, L, U)
  k-clusters <= select-clusters(k, clusters, L, U)
  FOR 1 to Max-Batch-Size
    case <= select-case(k-clusters, L, U)
    L <= L ∪ get-label(case, oracle)
    U <= U \ case
UNTIL stopping-criterion
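A minimal Python sketch of this loop, assuming the model-building, cluster-ranking and case-ranking steps are supplied as functions; all names here are placeholders for illustration, not the authors' implementation.

    def selective_sampling(L, U, oracle, k, max_batch_size, stopping_criterion,
                           create_domain_model, create_clusters,
                           select_clusters, select_case):
        """Iteratively move the most useful cases from the unlabelled pool U
        into the labelled case-base L."""
        while not stopping_criterion(L, U):
            model = create_domain_model(L)                    # e.g. an index built from L
            clusters = create_clusters(model, L, U)           # partition L and U using the index
            k_clusters = select_clusters(k, clusters, L, U)   # top-k clusters by utility score
            for _ in range(max_batch_size):
                case = select_case(k_clusters, L, U)          # highest case utility score
                L.add((case, oracle(case)))                   # oracle supplies the label
                U.discard(case)
        return L, U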
6
Overview
- Selective sampling
- Cluster creation using an index
- Cluster and case utility scores
- Evaluation
7
Forming Clusters
[Diagram: an index tree splits on features f1, f2, f3 with tests of the form < N / >= N; each leaf a–e forms a cluster with its own mix of labelled cases (classes X, Y, Z) and unlabelled cases, e.g. one leaf holds 5 labelled (4X, 1Y) and 6 unlabelled cases, another 5 labelled (4Y, 1Z) and 0 unlabelled]
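A minimal sketch of this cluster-forming step, assuming the index exposes a function mapping a case to the leaf it falls into; leaf_id is a hypothetical helper, not the authors' API.

    from collections import defaultdict

    def create_clusters(index, labelled, unlabelled):
        """Group labelled and unlabelled cases by the index leaf they fall into."""
        clusters = defaultdict(lambda: {"labelled": [], "unlabelled": []})
        for case, label in labelled:
            clusters[index.leaf_id(case)]["labelled"].append((case, label))
        for case in unlabelled:
            clusters[index.leaf_id(case)]["unlabelled"].append(case)
        return clusters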
8
Analysing Clusters
[Diagram: the class mix within each cluster is examined, e.g. clusters dominated by a single class (X or Y) versus clusters mixing X, Y and Z labels]
9
Overview
- Selective sampling
- Cluster creation
- Cluster and case utility scores
- Evaluation
10
Ranking Clusters - Cluster Utility Score
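The ClUS formula on this slide is not preserved in the text export. As a rough illustration only, based on the Conclusions slide's description (ClUS captures uncertainty within a cluster and uses entropy to weight it), and not the authors' exact formula:

    import math

    def cluster_utility(labelled_labels, n_unlabelled):
        """Illustrative entropy-weighted score: clusters whose labelled cases are
        impure (high entropy) and which still contain unlabelled cases score highly."""
        n = len(labelled_labels)
        if n == 0 or n_unlabelled == 0:
            return 0.0
        entropy = 0.0
        for c in set(labelled_labels):
            p = labelled_labels.count(c) / n
            entropy -= p * math.log2(p)
        return entropy * n_unlabelled / (n + n_unlabelled)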
11
Ranking Cases - Case Utility Score
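The CaUS formula is likewise an image on the original slide. A rough illustration of the idea described later (a case score capturing impact on the other cases via distances); this is my assumption, not the authors' definition:

    def case_utility(case, other_unlabelled, distance):
        """Illustrative score: a case close to many other unlabelled cases has high
        impact, since labelling it informs its dense neighbourhood."""
        if not other_unlabelled:
            return 0.0
        return sum(1.0 / (1.0 + distance(case, u)) for u in other_unlabelled) / len(other_unlabelled)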
12
Overview
- Selective sampling
- Cluster creation
- Cluster and case utility scores
- Evaluation
13
Evaluation
- Selection heuristics:
  Rnd: randomly selected cluster and cases
  Rnd-Cluster: random cluster with highest ranked cases
  Rnd-Case: highest ranked cluster, random cases
  Informed-S: highest ranked cluster and cases
  Informed-M: highest ranked clusters and case
- UCI ML (6 datasets):
  smaller data sets (Zoo, Iris, Lymph, Hep)
  medium data sets (house votes, breast cancer)
14
Experimental Design
[Diagram: data split into an initial indexed case-base, a sampling pool, and a test set; the case base grows over increments Inc1–Inc5]
case base size = L + selected cases
selected cases = sampling iterations * Max-Batch-Size
kNN accuracy measured on the test set
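A minimal sketch of this evaluation loop, assuming a sample_batch function standing in for one of the selection heuristics above; the function names and the use of scikit-learn's kNN are my assumptions, not the authors' setup.

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    def learning_curve(X_lab, y_lab, X_pool, y_pool, X_test, y_test,
                       sample_batch, iterations, batch_size, k=3):
        """Grow the case base by sampled batches and record kNN test accuracy."""
        accuracies = []
        for _ in range(iterations):
            picked = sample_batch(X_lab, y_lab, X_pool, batch_size)   # indices into the pool
            X_lab = np.vstack([X_lab, X_pool[picked]])                # oracle reveals the labels
            y_lab = np.concatenate([y_lab, y_pool[picked]])
            keep = np.setdiff1d(np.arange(len(X_pool)), picked)
            X_pool, y_pool = X_pool[keep], y_pool[keep]
            knn = KNeighborsClassifier(n_neighbors=k).fit(X_lab, y_lab)
            accuracies.append(knn.score(X_test, y_test))
        return accuracies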
15
Results I
[Learning-curve plots; legend: Rnd, Rnd-cluster, Rnd-case, Informed-M, Informed-S]
- Zoo (7C, 18F, A, P9)
- Iris (3C, 4F, #+A, P3)
16
Results II
[Learning-curve plots; legend: Rnd, Rnd-cluster, Rnd-case, Informed-M, Informed-S]
- Lymphography (4C, 19F, #+A, P9)
- Hepatitis (2C, 20F, A+?, P7)
17
Results III
[Learning-curve plots; legend: Rnd, Rnd-cluster, Rnd-case, Informed-M, Informed-S]
- House (2C, 16F, A+?, P3)
- Breast (2C, 9F, A+?, P7)
18
Conclusions
- Developed a case selection mechanism exploiting case base partitions
- Utility scores to rank clusters and cases:
  ClUS captures uncertainty within clusters and uses entropy to further weight this score
  CaUS captures the impact on other cases
- Significant improvement with informed selection on 6 data sets
- The influence of votes, partitions and entropy needs further investigation
19
Training Time Ratio (Informed-M / Rnd)

Training set size   50    75    100   125   150
Zoo                 1.1   1.6   2.3   2.6   2.9
Iris                1.5   1.7   1.9   2.1   2.3
Lymphography        1.5   1.9   2.1   2.4   2.6
Hepatitis           1.6   1.9   2.1   2.1   2.3

Training set size   150   200   250   300   350
House Votes         2.3   2.8   3.4   3.9   4.5
Breast Cancer       2.6   3.3   4.1   4.4   4.7

- Small data sets (difference 2 sec to 15 sec)
- Large data sets (difference 15 sec to 60 sec)
20
Discussion
- Improving the utility scores:
  the changing performance of Informed-M and Informed-S with different partition numbers needs to be examined
  should the distances employed with CaUS be transformed?
  what about considering the votes of the labelled cases?
  should the training accuracy play a more active role in ClUS?
- How can the presented approach be used for hole discovery? case base maintenance?
- Should be evaluated against other sampling methods, e.g. uncertainty sampling
21
Entropy
L = labelled cases, m = 2 classes
p⊕ is the proportion of positive cases in L, p⊖ the proportion of negative cases in L
Entropy measures the impurity of L:
  Entropy(L) = p⊕(-log2 p⊕) + p⊖(-log2 p⊖) = -p⊕ log2 p⊕ - p⊖ log2 p⊖
[Plot: entropy against p, with maximum log2 m]
Examples:
  Entropy(C_unlabelled) = 0
  Entropy(+1, -1) = 1
  Entropy(+6, -1) = 0.59
  Entropy(+7, -2) = 0.76
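A quick Python check of these example values (a small sketch added here, not from the slides):

    import math

    def entropy(*counts):
        """Impurity of a labelled set given per-class counts; an empty (all-unlabelled) set scores 0."""
        total = sum(counts)
        if total == 0:
            return 0.0
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    entropy(1, 1)   # 1.0
    entropy(6, 1)   # ~0.59
    entropy(7, 2)   # ~0.76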
22
Creation, Sampling, Maintenance
[Diagram linking case generation, meta knowledge, sampling, and the impact of sampling]
23
Some Requirements for Sampling
- Uncertainty alone is not enough:
  consider the effect of sampling on the rest of the unlabelled cases
  sampling in dense regions may be better than sampling isolated points, because it influences many cases
  selecting more than one case may help pick representatives from dense areas, i.e. informed selection
24
Forming Clusters
[Diagram: the index tree on f1, f2, f3 (tests < N / >= N) shown again, with clusters a–e annotated first by their labelled class mixes (e.g. 4X, 1Y; 2X, 2Y, 1Z; 4Y, 1Z) and then by their labelled/unlabelled counts]
25
Experimental Design
- UCI ML (6 datasets):
  larger data sets (house votes, breast cancer)
  smaller data sets (Zoo, Iris, Lymph, Hep)
- 5 increasing train / test set sizes:
  equally sized splits for selection pool / test sets
  training set (case base) initialised with labelled cases
  larger data sets: 150, with an increment of 50
  smaller data sets: 50, with an increment of 25
- kNN accuracy on the test set, averaged over 25 runs