1
Cross-training: Learning probabilistic relations between taxonomies Sunita Sarawagi Soumen Chakrabarti Shantanu Godbole IIT Bombay
2
Chakrabarti, KDD 2003
Document classification
- Set of labels A: bookmark folders, Yahoo topics
- Training documents, each with one A label
- Supervised approach: use the training docs to induce a classifier; invoke the classifier on each unlabeled document in isolation
- Semi-supervised approach: unlabeled documents are available during training; Nigam et al. show how to exploit them collectively
3
Cross-training from another taxonomy
- Another set of labels B, partly related but not identical to A
  - A = Dmoz topics, B = Yahoo topics
  - A = personal bookmark topics, B = Yahoo topics
- Training docs now come in two flavors:
  - Fully labeled with both an A and a B label (rare)
  - Half-labeled with either an A or a B label
- Can B make classification for A more accurate (and vice versa)?
- Related ideas: inductive transfer, multi-task learning
[Figure: the two half-labeled document sets D_A and D_B]
4
Motivation
- Symmetric taxonomy mapping
  - E-commerce catalogs: A = distributor, B = retailer
  - Web directories: A = Dmoz, B = Yahoo
- Incomplete taxonomies, small training sets
  - Bookmark taxonomy vs. Yahoo
- Cartesian label spaces
[Figure: Region axis (Top > Regional > UK, USA, ...) crossed with Topic axis (Top > Sports > Baseball, Cricket); each (region, topic) cell carries a label-pair-conditioned term distribution]
5
Obvious approach: labels as features
- A-label known, estimate the B-label
- Suppose we have an A+B labeled training set
  - Discrete-valued "label column" appended to the term features
  - Multinomial naive Bayes is too biased; it cannot balance such heterogeneous features
- In practice we do not have fully-labeled data
  - Must guess the label (use soft scores instead of 0/1)
[Figure: term feature values plus the label column form an augmented feature vector; the target label is predicted from it]
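A minimal sketch of the augmented-vector idea (function and variable names are illustrative, not from the paper): instead of a hard 0/1 label column, append the other taxonomy's soft classifier scores to each document's term-feature vector before training the target classifier.

```python
def augment_features(term_vector, a_scores):
    """Append soft A-label scores to a term-feature vector.

    term_vector: list of term weights (e.g. tf-idf values)
    a_scores:    list of |A| scores from an A-classifier, used
                 in place of a hard 0/1 label column
    """
    return list(term_vector) + list(a_scores)

# A document with three term features; the A-classifier is 80%
# confident in the first of two A-labels.
doc = [0.5, 0.0, 1.2]
augmented = augment_features(doc, [0.8, 0.2])
# augmented == [0.5, 0.0, 1.2, 0.8, 0.2]
```

The augmented vectors then feed the classifier for the target taxonomy in place of the plain term vectors.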
6
SVM-CT: cross-trained SVM
- S(A,0): one-vs-rest SVM ensemble for A, trained on D_A - D_B (docs having only A-labels); returns |A| scores for each test doc (signed distance from the separator)
- S(B,1): one-vs-rest SVM ensemble for B (the target label set), trained on D_B - D_A (docs having only B-labels); its feature vector is the term features plus the |A| label features
- A test case with the A-label known is coded with a vector of +1 and -1: term features followed by -1, ..., -1, +1, -1, ...
- Iterate the cross-training: S(A,1), S(B,2), S(A,2), ...
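The +1/-1 coding of a known A-label can be sketched as follows (a small illustrative helper, not the paper's code): the label's own position gets +1 and every other position -1, so a known label and a vector of SVM scores occupy the same feature slots.

```python
def code_known_label(label_index, num_labels):
    """Code a known A-label as a +1/-1 vector of length |A|:
    +1 at the label's position, -1 everywhere else."""
    return [1.0 if i == label_index else -1.0 for i in range(num_labels)]

# Label 2 out of 5 A-labels:
coded = code_known_label(2, 5)
# coded == [-1.0, -1.0, 1.0, -1.0, -1.0]
```

When the A-label is unknown, the |A| signed distances returned by the A-ensemble fill the same positions instead.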
7
SVM-CT anecdotes
- The discriminant reveals relations between A and B labels: one-to-one, many-to-one, related, antagonistic
- However, accuracy gains are meager
[Figure: example discriminant weights on the label features, split into positive and negative]
8
EM1D: information from unlabeled docs
- Use the training docs to induce an initial classifier for taxonomy B, say
- Repeat until the classifier is satisfactory:
  - Estimate Pr(β|d) for each unlabeled doc d and each β in B
  - Reweigh d by the factor Pr(β|d) and add it to the training set for label β
  - Retrain the classifier
- EM1D = expectation maximization with one label set B (Nigam et al.)
- Ignores labels from the other taxonomy A
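The EM1D loop above can be sketched as a tiny semi-supervised naive Bayes trainer (an illustrative reconstruction in the style of Nigam et al.; all names and the toy data are hypothetical, not the paper's code):

```python
import math

def nb_train(docs, vocab, labels):
    """Multinomial naive Bayes with Laplace smoothing.
    docs: list of (tokens, {label: weight}) pairs; weights may be soft."""
    prior = {c: 1.0 for c in labels}
    counts = {c: {t: 1.0 for t in vocab} for c in labels}
    for tokens, label_weights in docs:
        for c, w in label_weights.items():
            prior[c] += w
            for t in tokens:
                if t in counts[c]:
                    counts[c][t] += w
    z = sum(prior.values())
    log_prior = {c: math.log(prior[c] / z) for c in labels}
    log_theta = {}
    for c in labels:
        tot = sum(counts[c].values())
        log_theta[c] = {t: math.log(counts[c][t] / tot) for t in vocab}
    return log_prior, log_theta

def nb_posterior(tokens, log_prior, log_theta):
    """Pr(label | doc) under the current model."""
    scores = {c: log_prior[c] + sum(log_theta[c][t] for t in tokens
                                    if t in log_theta[c])
              for c in log_prior}
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def em1d(labeled, unlabeled, vocab, labels, iters=5):
    """E-step: soft-label the unlabeled docs with Pr(label|d).
    M-step: retrain on labeled docs plus the reweighed unlabeled docs."""
    hard = [(toks, {lab: 1.0}) for toks, lab in labeled]
    model = nb_train(hard, vocab, labels)
    for _ in range(iters):
        soft = [(toks, nb_posterior(toks, *model)) for toks in unlabeled]
        model = nb_train(hard + soft, vocab, labels)
    return model

# Toy usage: one labeled doc per class, two unlabeled docs.
vocab = ["ball", "bat", "vote", "law"]
labeled = [(["ball", "bat"], "sports"), (["vote", "law"], "politics")]
unlabeled = [["ball", "ball", "bat"], ["vote"]]
model = em1d(labeled, unlabeled, vocab, ["sports", "politics"])
post = nb_posterior(["ball"], *model)
# "ball" should lean strongly toward "sports"
```

Stratified EM1D (next slide) simply runs one such loop per A-label row.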
9
Stratified EM1D
- Target labels = B; B-labeled docs are the labeled training instances
- Consider the A-labeled docs with label α: these are unlabeled for taxonomy B
- Run EM1D separately for each row α
- A test instance has a known α: invoke the semi-supervised model for row α to classify
- Equivalent to EM2D minus the 2D model interaction
[Figure: grid with A topics as rows and B topics as columns; docs in D_A - D_B labeled α fill row α, docs in D_B - D_A carry B-labels]
10
EM2D: Cartesian product EM
- Initialize with the fully labeled docs, which go to a specific (α, β) cell
- Smear each half-labeled training doc across its label row or column
  - A uniform smear could be bad; use a naive Bayes classifier to seed
- Parameters extend those of EM1D:
  - π_{α,β}: prior probability of the label pair (α, β)
  - θ_{α,β,t}: multinomial term probability for the pair (α, β)
[Figure: grid of labels in A vs. labels in B; an A-labeled doc is smeared across a row, a B-labeled doc across a column]
11
EM2D updates
- E-step for an A-labeled document: distribute the doc across the cells of its row in proportion to the current model's posterior
- M-step:
  - Updated class-pair priors
  - Updated class-pair-conditioned term statistics
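The slide's equations are not reproduced in this transcript; under the standard multinomial mixture assumptions and the parameters π_{α,β}, θ_{α,β,t} defined on the previous slide, the updates would take roughly this form (a reconstruction, with n(d,t) the count of term t in doc d and smoothing omitted):

```latex
% E-step for a doc d with known A-label \alpha: posterior over B-labels
\Pr(\beta \mid d, \alpha) =
  \frac{\pi_{\alpha,\beta} \prod_{t \in d} \theta_{\alpha,\beta,t}^{\,n(d,t)}}
       {\sum_{\beta'} \pi_{\alpha,\beta'} \prod_{t \in d} \theta_{\alpha,\beta',t}^{\,n(d,t)}}

% M-step: class-pair priors and class-pair-conditioned term stats
\pi_{\alpha,\beta} \propto \sum_{d} \Pr(\alpha,\beta \mid d)
\qquad
\theta_{\alpha,\beta,t} \propto \sum_{d} \Pr(\alpha,\beta \mid d)\, n(d,t)
```

The symmetric E-step for a B-labeled doc distributes it across its column instead of its row.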
12
Applying EM2D to a test doc
- Mapping a B-labeled test doc d to an A-label (e-commerce catalogs):
  - Given β, find argmax_α Pr(α, β | d)
- Classifying a document d with no labels to an A-label:
  - Aggregation: for each α, aggregate Pr(α, β | d) over β, then pick the best α
  - Guessing (EM2D-G): guess the best β* using a B-classifier, then find argmax_α Pr(α, β* | d)
- EM pitfalls: damping factor, early stopping
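The two ways of reading off an A-label from the pair posterior can be sketched as follows (illustrative helpers over a toy posterior grid; names and data are hypothetical):

```python
def map_b_to_a(pair_posterior, beta):
    """Known B-label beta: argmax over alpha of Pr(alpha, beta | d).
    pair_posterior: dict {(alpha, beta): probability}."""
    candidates = {a: p for (a, b), p in pair_posterior.items() if b == beta}
    return max(candidates, key=candidates.get)

def classify_by_aggregation(pair_posterior):
    """No labels known: sum Pr(alpha, beta | d) over beta, pick best alpha."""
    totals = {}
    for (a, b), p in pair_posterior.items():
        totals[a] = totals.get(a, 0.0) + p
    return max(totals, key=totals.get)

# Toy pair posterior over Region x Topic for one document
post = {("uk", "cricket"): 0.5, ("usa", "cricket"): 0.1,
        ("uk", "baseball"): 0.1, ("usa", "baseball"): 0.3}
map_b_to_a(post, "cricket")      # "uk"
classify_by_aggregation(post)    # "uk" (0.6 vs 0.4)
```

EM2D-G replaces the sum with a single guessed column β*, trading a little accuracy risk for a sharper posterior.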
13
Experiments
- Selected 5 Dmoz and Yahoo subtree pairs
- Compared EM2D against:
  - Naive Bayes (best #features and smoothing)
  - EM1D: ignore labels from the other taxonomy, treat those docs as unlabeled
  - Stratified EM1D
- Tasks: mapping a test doc with an A-label to a B-label (or vice versa); classifying a zero-labeled test doc
- Accuracy = fraction of test docs given correct labels
14
Accuracy benefits in mapping
- EM1D and NB are close, because the training set for each taxonomy is not too small
- EM2D > Stratified EM1D > NB
  - 2D transfer of model information seems important
- Improvement over NB: 30% best, 10% average
15
Asymmetric setting
- Few (only 300) bookmarked URLs (taxonomy B, the target)
- Many Yahoo URLs and a larger number of classes (taxonomy A)
- Need to control the damping factor (the relative importance of labeled vs. unlabeled docs) to tackle population skew
16
Zero-labeled test documents
- EM1D improves accuracy only for 12 train docs
- EM2D with guessing improves beyond EM1D; in fact, better than aggregating scores down to 1D
- The choice of the unlabeled:labeled damping ratio L may be important to get benefits
17
Robustness to initialization
- Seeding choices: hard (best class), NB scores, uniform
- Smear a fraction of the mass uniformly and the rest by NB scores
- EM2D is robust across a wide range of smear fractions
- Fully uniform smearing can fail (local optima)
[Figure: accuracy under uniform smear vs. naive Bayes smear]
18
Related work
- Multi-task learning, "life-long learning", inductive transfer (Thrun; Caruana)
  - Find earlier learning tasks similar to the current one; reuse models, features, parameters
- Co-training (Blum, Mitchell)
  - Two learners over a single label set, with a partitioned feature set
- Catalog mapping (Agrawal, Srikant)
  - Use two-label docs to estimate priors; raise the prior to an exponent, tuned by validation
  - EM2D: a generative model, slightly better accuracy
19
Summary and future work
- Two algorithms for cross-training:
  - EM2D: an EM-based semi-supervised algorithm
  - SVM-CT: an SVM-based algorithm
- Benefits: improved accuracy; interpretable mappings between label sets
- General issue: how best to deal with a large number of heterogeneous attributes?
- Future work:
  - Brittle naive Bayes scores in EM2D
  - Small relative gains in SVM-CT: better kernels? feature selection?