Slide 1: Semi-supervised Learning on Partially Labeled Imbalanced Data
May 16, 2010
Jianjun Xie and Tao Xiong
Slide 2: What Problem We Are Facing
- Six data sets extracted from six different domains (the domains were withheld during the contest)
- All are binary classification problems
- All are imbalanced: the percentage of positive labels varies from 7.2% to 25.2% (this information was also withheld during the competition)
- They differ significantly from the development sets
- Each starts with a single known label
Slide 3: Datasets Summary (final contest datasets)

Dataset | Domain                  | Features | Train Examples | Positive Label %
A       | Handwriting Recognition | 92       | 17,535         | 7.23
B       | Marketing               | 250      | 25,000         | 9.16
C       | Chemo-informatics       | 851      | 25,720         | 8.15
D       | Text Processing         | 12,000   | 10,000         | 25.19
E       | Embryology              | 154      | 32,252         | 9.03
F       | Ecology                 | 12       | 67,628         | 7.68
Slide 4: Stochastic Semi-supervised Learning
Conditions:
- The label distribution is highly imbalanced; positive labels are rare
- Known labels are few
- Unlabeled data are abundant
Approach to A, C, and D (our approach when the number of known labels is < 200):
- Randomly pick one record from the unlabeled data pool as a "negative" seed
- Use the given positive seed and the picked "negative" seed as initial cluster centers for k-means clustering
- Label as positive the cluster where the positive seed resides
- Repeat the above process n times
- Take the normalized positive-cluster membership count of each data point as the first set of prediction scores
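The loop above can be sketched as follows. This is a minimal illustration, assuming a feature matrix `X` and the index `pos_idx` of the single known positive; all function and variable names are my own, not from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def stochastic_kmeans_scores(X, pos_idx, n_rounds=20, seed=0):
    """Score each point by how often it lands in the positive seed's cluster."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(len(X))
    for _ in range(n_rounds):
        neg_idx = rng.integers(len(X))               # random "negative" seed
        init = np.vstack([X[pos_idx], X[neg_idx]])   # two seeds as initial centers
        km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
        pos_cluster = km.labels_[pos_idx]            # cluster holding the positive seed
        counts += (km.labels_ == pos_cluster)
    return counts / n_rounds                         # normalized membership count

# Toy data: two well-separated blobs; the single known positive is point 0.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
scores = stochastic_kmeans_scores(X, pos_idx=0)
```

Points that repeatedly co-cluster with the positive seed get scores near 1, while the random "negative" seeds average out over the repetitions.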
Slide 5: Stochastic Semi-supervised Learning (continued)
Approach to A, C, and D (our approach when the number of known labels is < 200):
- When more labels are known after querying, use both the known labels and randomly picked "negative" seeds as initial cluster centers
- Label clusters using the known positive seeds
- Discard any cluster whose membership is not clear
- Store the cluster membership of each data point
- Use the normalized positive-cluster membership counts as the prediction score
Slide 6: Stochastic Semi-supervised Learning (continued)
Approach to B, E, and F (our approach when the number of known labels is < 200):
- Randomly pick 20 unlabeled records as "negative" labels for each known positive label
- Build an over-fit logistic regression model on this dataset
- Repeat the random picking and model building process n times
- The final score is the average of the n models' predictions
Slide 7: Supervised Learning Using Gradient Boosting Decision Tree (TreeNet)
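TreeNet is a commercial gradient boosting package; as an open-source stand-in, scikit-learn's GradientBoostingClassifier implements the same gradient boosting decision tree idea. The toy data and hyperparameters below are illustrative only, not the authors' configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # toy binary target

# Gradient boosted decision trees: shallow trees fit sequentially,
# each correcting the residual errors of the ensemble so far.
gbdt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbdt.fit(X[:400], y[:400])                     # train on the first 400 rows
auc = roc_auc_score(y[400:], gbdt.predict_proba(X[400:])[:, 1])
```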
Slide 8: Querying Strategy
- The query strategy is a critical part of active learning
- Popular approaches: uncertainty sampling, expected model change, query by committee
- What we tried:
  - Uncertainty sampling plus density-based selective sampling
  - Random sampling (for large label purchases)
  - Certainty sampling (to try to obtain more positive labels)
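The combined uncertainty-plus-density strategy can be sketched as below. Weighting uncertainty by an inverse-distance density estimate is one common formulation; the authors' exact implementation may differ, and all names here are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def uncertainty_density_query(X, proba, n_query, k=10):
    """Rank points by prediction uncertainty weighted by local density."""
    uncertainty = 1.0 - 2.0 * np.abs(proba - 0.5)       # 1 at p=0.5, 0 at p=0 or 1
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    density = 1.0 / (1e-12 + dist[:, 1:].mean(axis=1))  # column 0 is the point itself
    score = uncertainty * density        # uncertain points in dense regions win
    return np.argsort(score)[::-1][:n_query]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
proba = rng.uniform(size=200)            # stand-in for model probabilities
picks = uncertainty_density_query(X, proba, n_query=5)
```

The density term keeps the query away from isolated outliers that happen to sit near the decision boundary.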
Slide 9: Dataset A (Handwriting Recognition)
Global score = 0.623, rank 2nd.

Sequence | Samples Purchased | Labels Used | AUC  | Sampling Strategy
1        | 232               | 1           | 0.67 | Uncertainty/Selective
2        | 1,959             | 233         | 0.82 | Uncertainty/Selective
3        | 4,286             | 2,192       | 0.92 | Random
4        | 11,057            | 6,478       | 0.94 | Get All
5        | 0                 | 17,535      | 0.93 |
Slide 10: Dataset B (Marketing)
Global score = 0.375, rank 2nd.
Slide 11: Dataset C (Chemo-informatics)
Global score = 0.334, rank 4th. Passive learning.
Slide 12: Dataset D (Text Processing)
Global score = 0.331, rank 18th.
Slide 13: Dataset E (Embryology)
Global score = 0.533, rank 3rd.

Sequence | Samples Purchased | Labels Used | AUC  | Sampling Strategy
1        | 2                 | 1           | 0.75 | Certainty
2        | 3                 | 3           | 0.66 | Uncertainty/Selective
3        | 3                 | 6           | 0.67 | Uncertainty/Selective
4        | 32,243            | 9           | 0.72 | Get All
5        | 0                 | 32,252      | 0.86 |
Slide 14: Dataset E (Embryology), continued
Global score = 0.533, rank 3rd.
- Performance got worse with more labels
- Newly queried labels over-corrected the existing model
- This phenomenon was common in this contest
Slide 15: Dataset F (Ecology)
Global score = 0.77, rank 4th.

Sequence | Samples Purchased | Labels Used | AUC  | Sampling Strategy
1        | 2                 | 1           | 0.76 | Uncertainty/Selective
2        | 7                 | 3           | 0.73 | Uncertainty/Selective
3        | 542               | 10          | 0.77 | Uncertainty/Selective
4        | 5,175             | 552         | 0.95 | Random
5        | 61,901            | 5,727       | 0.98 | Get All
6        | 0                 | 67,628      | 0.99 |
Slide 16: Dataset F (Ecology), continued
- Performance got worse with the first 2 additional labels
- Most of the time, too many small queries do more harm than good to the global score
Slide 17: Summary of Results
Overall rank: 3rd.

Dataset | Positive Label % | AUC   | ALC   | Num. Queries | Rank | Winner AUC | Winner ALC
A       | 7.23             | 0.925 | 0.623 | 4            | 2    | 0.862      | 0.629
B       | 9.16             | 0.767 | 0.375 | 2            | 2    | 0.733      | 0.376
C       | 8.15             | 0.814 | 0.334 | 1            | 4    | 0.799      | 0.427
D       | 25.19            | 0.890 | 0.331 | 3            | 18   | 0.964      | 0.745
E       | 9.03             | 0.865 | 0.533 | 4            | 3    | 0.894      | 0.627
F       | 7.68             | 0.988 | 0.771 | 5            | 4    | 0.999      | 0.802
Slide 18: Discussions
- How can we consistently get good performance with only a few labels across different datasets?
- How can we consistently improve model performance as the number of labels increases within a given dataset?
- Does the log2 scaling give too much weight to the first few queries?
- What if every dataset started with a few more labels?
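On the log2 question: the contest's global score was an area under the learning curve (AUC versus number of queried labels) computed on a log2-scaled label axis. The sketch below is illustrative and the challenge's exact normalization may differ, but it shows why early queries dominate: a curve that is strong with few labels can beat one that is strong only with many labels under log2 scaling, even when the linear-axis area says the opposite. The label counts are taken from dataset A; the two AUC curves are hypothetical.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal area under y(x)."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

def log2_scaled_alc(n_labels, aucs, n_max):
    x = np.log2(np.asarray(n_labels, float) + 1)    # log2-scaled label axis
    return trapezoid(aucs, x) / np.log2(n_max + 1)  # normalize by the full range

n = [1, 233, 2192, 6478, 17535]
early = [0.90, 0.90, 0.70, 0.70, 0.70]   # strong with few labels
late  = [0.50, 0.50, 0.90, 0.90, 0.90]   # strong only with many labels
```

Under the log2 axis the `early` curve wins, while a plain linear-axis area would favor `late`.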