Bidirectional Active Learning: A Two-Way Exploration Into Unlabeled and Labeled Data Set
By Xiao-Yu Zhang, Shupeng Wang, Xiaochun Yun
Presented by Ruhani Faiheem Rahman
Abstract
- Labelling data is one of the major problems in Machine Learning.
- We have a huge amount of unlabeled data, while only a little of it is labeled.
- Active Learning helps in that case. But what about mislabeled data?
- This noise will propagate into the model.
- This paper explores the labeled and unlabeled data sets simultaneously.
Introduction
- Classic Machine Learning: Supervised Learning and Unsupervised Learning.
- If unlabeled data is explored along with the labeled data, a considerable amount of improvement is possible.
- Active learning: actively select the most informative instances to improve the model.
Unidirectional Active Learning
- Traditional active learning: chooses instances from the unlabeled pool so that the model is learned effectively. Common selection mechanisms (see the sketch after this list):
- Uncertainty Sampling: choose the instance the model is least certain about.
- Query by Committee: a committee of models votes on each instance; the samples with the most disagreement are selected.
- Decision-Theoretic Approach: choose the instance that would most reduce the model's generalization error if its label were known.
- Limitation: noise in the labeled data can jeopardize the learning performance.
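A minimal sketch of uncertainty sampling, assuming a scikit-learn-style classifier with predict_proba; the helper name and the entropy-based measure are illustrative choices, not taken from the paper.

```python
import numpy as np

def select_most_uncertain(model, X_unlabeled):
    """Pick the unlabeled instance whose predicted class distribution
    has the highest entropy, i.e. the one the model is least certain about."""
    proba = model.predict_proba(X_unlabeled)                # shape: (n, n_classes)
    entropy = -np.sum(proba * np.log(proba + 1e-12), axis=1)
    return int(np.argmax(entropy))                          # index of the chosen instance
```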
Unidirectional Active Learning
Bidirectional Active Learning
- Forward Learning
- Backward Learning
  - Backward Instance Detecting: instance-level detecting, label-level detecting
  - Backward Instance Processing: Undo; Redo (Relabel, Reselect)
- Backward Learning Algorithm
Forward Learning
- Similar to Unidirectional Active Learning (UDAL).
- Selects a forward instance from the unlabeled data set based on one of the selection mechanisms described before.
- Adds the instance to the labeled data set and removes it from the unlabeled data set.
- Trains a new model. (A minimal loop sketch follows.)
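A minimal sketch of one forward-learning iteration, reusing the select_most_uncertain helper above; the oracle callback stands in for a (possibly noisy) human annotator and is an assumption, not the paper's interface.

```python
import numpy as np

def forward_step(model, X_lab, y_lab, X_unlab, oracle):
    """One forward-learning iteration: select, query, move, retrain."""
    idx = select_most_uncertain(model, X_unlab)   # pick the forward instance
    x = X_unlab[idx]
    y = oracle(x)                                 # query its label (may be noisy)
    # Move the instance from the unlabeled pool to the labeled set.
    X_lab = np.vstack([X_lab, x])
    y_lab = np.append(y_lab, y)
    X_unlab = np.delete(X_unlab, idx, axis=0)
    model.fit(X_lab, y_lab)                       # train a new model
    return model, X_lab, y_lab, X_unlab
```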
Backward Learning: Detecting a Backward Instance
- Explores the labeled data set instead of the unlabeled data set.
- Detects an instance in the labeled data set using one of two criteria (see the sketch after this list):
- Instance-level detecting: find the instance without which the entropy over the unlabeled data set would be minimized.
- Label-level detecting: find the most suspiciously mislabeled instance, i.e., the one whose label change would minimize the overall error.
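A minimal sketch of instance-level detecting, assuming a brute-force leave-one-out refit over the labeled set and total entropy on the unlabeled pool as the criterion; this follows the slide's description, not necessarily the paper's exact formulation.

```python
import numpy as np
from sklearn.base import clone

def detect_backward_instance(model, X_lab, y_lab, X_unlab):
    """Instance-level detecting: index of the labeled instance whose
    removal minimizes the entropy over the unlabeled data set."""
    best_idx, best_entropy = None, np.inf
    for i in range(len(X_lab)):
        m = clone(model)
        m.fit(np.delete(X_lab, i, axis=0), np.delete(y_lab, i))
        proba = m.predict_proba(X_unlab)
        entropy = -np.sum(proba * np.log(proba + 1e-12))
        if entropy < best_entropy:
            best_idx, best_entropy = i, entropy
    return best_idx
```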
Backward Learning: Processing the Backward Instance
- Undo: eliminate the negative influence of the backward instance by removing it from the training set. Suitable for the instance-level method.
- Redo (see the sketch after this list):
  - Relabel: the backward instance is returned to be labeled a second time. If the new label is the same as the previous one, the instance is copied twice; otherwise the old label is replaced with the new one.
  - Reselect: find the nearest neighbour of the backward instance in the unlabeled data set; with high probability the neighbour shares the backward instance's true label. Add this neighbour to the training set.
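A minimal sketch of the three processing strategies under the same assumptions as the earlier sketches (numpy arrays, an oracle callback, Euclidean nearest neighbour); the details are illustrative.

```python
import numpy as np

def undo(X_lab, y_lab, idx):
    """Undo: drop the backward instance from the training set."""
    return np.delete(X_lab, idx, axis=0), np.delete(y_lab, idx)

def relabel(X_lab, y_lab, idx, oracle):
    """Redo/Relabel: query the backward instance's label a second time."""
    new_y = oracle(X_lab[idx])
    if new_y == y_lab[idx]:
        # Same label twice: keep the instance with double weight.
        X_lab = np.vstack([X_lab, X_lab[idx]])
        y_lab = np.append(y_lab, new_y)
    else:
        y_lab[idx] = new_y                       # replace with the new label
    return X_lab, y_lab

def reselect(X_lab, y_lab, idx, X_unlab, oracle):
    """Redo/Reselect: label the nearest unlabeled neighbour instead."""
    j = int(np.argmin(np.linalg.norm(X_unlab - X_lab[idx], axis=1)))
    X_lab = np.vstack([X_lab, X_unlab[j]])
    y_lab = np.append(y_lab, oracle(X_unlab[j]))
    X_unlab = np.delete(X_unlab, j, axis=0)
    return X_lab, y_lab, X_unlab
```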
BDAL Algorithm
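The slide presents the BDAL algorithm as a figure; below is a minimal end-to-end loop assembled from the sketches above. The strict forward/backward alternation and the fixed iteration budget are assumptions for illustration, not the paper's exact pseudocode.

```python
def bdal(model, X_lab, y_lab, X_unlab, oracle, n_iters=100):
    """Bidirectional active learning: alternate forward and backward steps."""
    model.fit(X_lab, y_lab)
    for _ in range(n_iters):
        # Forward: query a new informative instance from the unlabeled pool.
        model, X_lab, y_lab, X_unlab = forward_step(
            model, X_lab, y_lab, X_unlab, oracle)
        # Backward: detect a suspicious labeled instance and process it
        # (here with the Undo strategy; Relabel/Reselect are alternatives).
        idx = detect_backward_instance(model, X_lab, y_lab, X_unlab)
        X_lab, y_lab = undo(X_lab, y_lab, idx)
        model.fit(X_lab, y_lab)
    return model
```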
Experiments
- Synthetic Data Classification
- Handwritten Digit Classification
- Image Classification
- Patent Document Classification
A. Synthetic Data Classification
- Two-class synthetic data: 410 instances, 205 per class.
- 5 instances from each class are selected randomly for initial training.
A. Synthetic Data Classification
B. Handwritten Digit Classification
- MNIST data set: 10,000 test instances; each image has 28 × 28 pixels.
- 100 images are randomly picked for initial training.
- For each model update, 100 images are labeled with 5% noise (a noise-injection sketch follows).
- Results are averaged over 20 runs.
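A minimal sketch of 5% label-noise injection, under the assumption that a noisy label is a uniformly random wrong class; the slides do not specify the noise model, so this is illustrative.

```python
import numpy as np

def inject_label_noise(y, noise_rate=0.05, n_classes=10, seed=0):
    """Flip a random fraction of labels to a different class."""
    rng = np.random.default_rng(seed)
    y_noisy = y.copy()
    for i in np.where(rng.random(len(y)) < noise_rate)[0]:
        wrong = [c for c in range(n_classes) if c != y[i]]
        y_noisy[i] = rng.choice(wrong)
    return y_noisy
```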
B. Handwritten Digit Classification
C. Image Classification
- 50 categories of images, e.g., car, ship, human.
- Each category contains 100 images.
- 10 images from each category are used for initial training, i.e., 500 images in total.
- For each model update, 500 images are labeled with 5% noise.
- Results are averaged over 50 runs.
C. Image Classification
D. Patent Document Classification
- 5,000 patents from the Innography database, manually classified by domain experts into 5 classes.
- 5,484 terms are extracted from the raw text, then PCA is used for dimension reduction to a 150-D feature vector (a pipeline sketch follows).
- 50 instances are picked randomly as initial training data.
- For each model update, 100 instances are labeled with 5% noise.
- Results are averaged over 20 runs.
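A minimal sketch of the term-extraction-plus-PCA preprocessing, assuming scikit-learn's CountVectorizer and PCA; the vectorizer choice is an assumption, as the slides only mention extracted terms.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import PCA

def patent_features(documents, n_terms=5484, n_dims=150):
    """Extract term counts, then reduce them to a 150-D feature vector via PCA."""
    counts = CountVectorizer(max_features=n_terms).fit_transform(documents)
    return PCA(n_components=n_dims).fit_transform(counts.toarray())
```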
D. Patent Document Classification
Conclusion
- BDAL performs better in all the experiments: BDAL > UDAL > NonAL.
- The Redo strategy achieves slightly better performance than UDAL.
- The Undo strategy outperforms the others in most of the experiments.
Future Plan
- Different strategies can be adopted during the backward learning process based on the noise level in the data.
- Fast approximation algorithms will be studied for computational efficiency.
Thank You
Any Questions?