Slide 1: Semi-supervised Learning on Partially Labeled Imbalanced Data
May 16, 2010
Jianjun Xie and Tao Xiong
Slide 2: What Problem We Are Facing
- Six data sets extracted from six different domains (the domains were withheld during the contest)
- All are binary classification problems
- All are imbalanced: the percentage of positive labels varies from 7.2% to 25.2% (this information was also withheld during the competition)
- They differ significantly from the development sets
- Each starts with a single known label
Slide 3: Datasets Summary (final contest datasets)

Dataset | Domain                  | Features | Train Examples | Positive Label %
A       | Handwriting Recognition | 92       | 17,535         | 7.23
B       | Marketing               | 250      | 25,000         | 9.16
C       | Chemo-informatics       | 851      | 25,720         | 8.15
D       | Text Processing         | 12,000   | 10,000         | 25.19
E       | Embryology              | 154      | 32,252         | 9.03
F       | Ecology                 | 12       | 67,628         | 7.68
Slide 4: Stochastic Semi-supervised Learning
Conditions:
- The label distribution is highly imbalanced; positive labels are rare
- Known labels are few
- Unlabeled data are abundant
Approach to A, C, and D (our approach when the number of known labels is < 200):
- Randomly pick one record from the unlabeled data pool as a "negative" seed
- Use the given positive seed and the picked "negative" seed as initial cluster centers for k-means clustering
- Label as positive the cluster where the positive seed resides
- Repeat the above process n times
- Take the normalized positive-cluster membership count of each data point as the first set of prediction scores
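The loop above can be sketched as follows. This is a minimal illustration, assuming a feature matrix `X` and the index `pos_idx` of the single known positive; all function and variable names are my own, not from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

def stochastic_kmeans_scores(X, pos_idx, n_rounds=20, seed=0):
    """Score each point by how often it lands in the positive seed's cluster."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(len(X))
    for _ in range(n_rounds):
        neg_idx = rng.integers(len(X))               # random "negative" seed
        init = np.vstack([X[pos_idx], X[neg_idx]])   # two seeds as initial centers
        km = KMeans(n_clusters=2, init=init, n_init=1).fit(X)
        pos_cluster = km.labels_[pos_idx]            # cluster holding the positive seed
        counts += (km.labels_ == pos_cluster)
    return counts / n_rounds                         # normalized membership count

# Toy data: two well-separated blobs; the single known positive is point 0.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
scores = stochastic_kmeans_scores(X, pos_idx=0)
```

Points that repeatedly co-cluster with the positive seed get scores near 1, while the random "negative" seeds average out over the repetitions.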
Slide 5: Stochastic Semi-supervised Learning (continued)
Approach to A, C, and D (our approach when the number of known labels is < 200):
- When more labels are known after querying, use both the known labels and randomly picked "negative" seeds as initial cluster centers
- Label clusters using the known positive seeds
- Discard any cluster whose membership is not clear
- Store the cluster membership of each data point
- Use the normalized positive-cluster membership counts as the prediction score
Slide 6: Stochastic Semi-supervised Learning (continued)
Approach to B, E, and F (our approach when the number of known labels is < 200):
- Randomly pick 20 unlabeled records as "negative" labels for each known positive label
- Build an over-fit logistic regression model on this dataset
- Repeat the random picking and model building process n times
- The final score is the average of the n models' predictions
Slide 7: Supervised Learning Using Gradient Boosting Decision Tree (TreeNet)
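TreeNet is a commercial gradient boosting package; as an open-source stand-in, scikit-learn's GradientBoostingClassifier implements the same gradient boosting decision tree idea. The toy data and hyperparameters below are illustrative only, not the authors' configuration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # toy binary target

# Gradient boosted decision trees: shallow trees fit sequentially,
# each correcting the residual errors of the ensemble so far.
gbdt = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3)
gbdt.fit(X[:400], y[:400])                     # train on the first 400 rows
auc = roc_auc_score(y[400:], gbdt.predict_proba(X[400:])[:, 1])
```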
Slide 8: Querying Strategy
- The query strategy is a critical part of active learning
- Popular approaches: uncertainty sampling, expected model change, query by committee
- What we tried:
  - Uncertainty sampling plus density-based selective sampling
  - Random sampling (for large label purchases)
  - Certainty sampling (to try to obtain more positive labels)
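The combined uncertainty-plus-density strategy can be sketched as below. Weighting uncertainty by an inverse-distance density estimate is one common formulation; the authors' exact implementation may differ, and all names here are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def uncertainty_density_query(X, proba, n_query, k=10):
    """Rank points by prediction uncertainty weighted by local density."""
    uncertainty = 1.0 - 2.0 * np.abs(proba - 0.5)       # 1 at p=0.5, 0 at p=0 or 1
    dist, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    density = 1.0 / (1e-12 + dist[:, 1:].mean(axis=1))  # column 0 is the point itself
    score = uncertainty * density        # uncertain points in dense regions win
    return np.argsort(score)[::-1][:n_query]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
proba = rng.uniform(size=200)            # stand-in for model probabilities
picks = uncertainty_density_query(X, proba, n_query=5)
```

The density term keeps the query away from isolated outliers that happen to sit near the decision boundary.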
Slide 9: Dataset A (Handwriting Recognition)
Global score = 0.623, rank 2nd.

Sequence | Samples Purchased | Labels Used | AUC  | Sampling Strategy
1        | 232               | 1           | 0.67 | Uncertainty/Selective
2        | 1,959             | 233         | 0.82 | Uncertainty/Selective
3        | 4,286             | 2,192       | 0.92 | Random
4        | 11,057            | 6,478       | 0.94 | Get All
5        | 0                 | 17,535      | 0.93 |
Slide 10: Dataset B (Marketing)
Global score = 0.375, rank 2nd.
Slide 11: Dataset C (Chemo-informatics)
Global score = 0.334, rank 4th. Passive learning.
Slide 12: Dataset D (Text Processing)
Global score = 0.331, rank 18th.
Slide 13: Dataset E (Embryology)
Global score = 0.533, rank 3rd.

Sequence | Samples Purchased | Labels Used | AUC  | Sampling Strategy
1        | 2                 | 1           | 0.75 | Certainty
2        | 3                 | 3           | 0.66 | Uncertainty/Selective
3        | 3                 | 6           | 0.67 | Uncertainty/Selective
4        | 32,243            | 9           | 0.72 | Get All
5        | 0                 | 32,252      | 0.86 |
Slide 14: Dataset E (Embryology), continued
Global score = 0.533, rank 3rd.
- Performance got worse with more labels
- Newly queried labels over-corrected the existing model
- This phenomenon was common in this contest
Slide 15: Dataset F (Ecology)
Global score = 0.77, rank 4th.

Sequence | Samples Purchased | Labels Used | AUC  | Sampling Strategy
1        | 2                 | 1           | 0.76 | Uncertainty/Selective
2        | 7                 | 3           | 0.73 | Uncertainty/Selective
3        | 542               | 10          | 0.77 | Uncertainty/Selective
4        | 5,175             | 552         | 0.95 | Random
5        | 61,901            | 5,727       | 0.98 | Get All
6        | 0                 | 67,628      | 0.99 |
Slide 16: Dataset F (Ecology), continued
- Performance got worse with the first 2 additional labels
- Most of the time, too many small queries do more harm than good to the global score
Slide 17: Summary of Results
Overall rank: 3rd.

Dataset | Positive Label % | AUC   | ALC   | Num. Queries | Rank | Winner AUC | Winner ALC
A       | 7.23             | 0.925 | 0.623 | 4            | 2    | 0.862      | 0.629
B       | 9.16             | 0.767 | 0.375 | 2            | 2    | 0.733      | 0.376
C       | 8.15             | 0.814 | 0.334 | 1            | 4    | 0.799      | 0.427
D       | 25.19            | 0.890 | 0.331 | 3            | 18   | 0.964      | 0.745
E       | 9.03             | 0.865 | 0.533 | 4            | 3    | 0.894      | 0.627
F       | 7.68             | 0.988 | 0.771 | 5            | 4    | 0.999      | 0.802
Slide 18: Discussions
- How can we consistently get good performance with only a few labels across different datasets?
- How can we consistently improve model performance as the number of labels increases within a given dataset?
- Does the log2 scaling give too much weight to the first few queries?
- What if every dataset started with a few more labels?
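On the log2 question: the contest's global score was an area under the learning curve (AUC versus number of queried labels) computed on a log2-scaled label axis. The sketch below is illustrative and the challenge's exact normalization may differ, but it shows why early queries dominate: a curve that is strong with few labels can beat one that is strong only with many labels under log2 scaling, even when the linear-axis area says the opposite. The label counts are taken from dataset A; the two AUC curves are hypothetical.

```python
import numpy as np

def trapezoid(y, x):
    """Trapezoidal area under y(x)."""
    y, x = np.asarray(y, float), np.asarray(x, float)
    return float(np.sum((y[1:] + y[:-1]) / 2 * np.diff(x)))

def log2_scaled_alc(n_labels, aucs, n_max):
    x = np.log2(np.asarray(n_labels, float) + 1)    # log2-scaled label axis
    return trapezoid(aucs, x) / np.log2(n_max + 1)  # normalize by the full range

n = [1, 233, 2192, 6478, 17535]
early = [0.90, 0.90, 0.70, 0.70, 0.70]   # strong with few labels
late  = [0.50, 0.50, 0.90, 0.90, 0.90]   # strong only with many labels
```

Under the log2 axis the `early` curve wins, while a plain linear-axis area would favor `late`.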