1
Cross-training: Learning probabilistic relations between taxonomies Sunita Sarawagi Soumen Chakrabarti Shantanu Godbole IIT Bombay
2
Chakrabarti, KDD 2003
Document classification
- Set of labels A: bookmark folders, Yahoo topics
- Training documents, each with one A label
- Supervised approach: use the training docs to induce a classifier; invoke the classifier on each unlabeled document in isolation
- Semi-supervised approach: unlabeled documents are available during training; Nigam et al. show how to exploit them collectively
3
Cross-training from another taxonomy
- Another set of labels B, partly related but not identical to A
  - A = Dmoz topics, B = Yahoo topics
  - A = personal bookmark topics, B = Yahoo topics
- Training docs now come in two flavors:
  - Fully labeled with both an A and a B label (rare)
  - Half-labeled with either an A or a B label
- Can B make classification for A more accurate (and vice versa)?
- Related ideas: inductive transfer, multi-task learning
[Figure: the two half-labeled document sets D_A and D_B]
4
Motivation
- Symmetric taxonomy mapping
  - E-commerce catalogs: A = distributor, B = retailer
  - Web directories: A = Dmoz, B = Yahoo
- Incomplete taxonomies, small training sets
  - Bookmark taxonomy vs. Yahoo
- Cartesian label spaces
[Figure: Region axis (Top > Regional > UK, USA, ...) crossed with Topic axis (Top > Sports > Baseball, Cricket); each (region, topic) cell carries a label-pair-conditioned term distribution]
5
Obvious approach: labels as features
- A-label known, estimate the B-label
- Suppose we have an A+B labeled training set
  - Discrete-valued "label column" appended to the term features
  - Multinomial naive Bayes is too biased; it cannot balance such heterogeneous features
- In practice we do not have fully-labeled data
  - Must guess the label (use soft scores instead of 0/1)
[Figure: term feature values plus the label column form an augmented feature vector; the target label is predicted from it]
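A minimal sketch of the augmented-vector idea (function and variable names are illustrative, not from the paper): instead of a hard 0/1 label column, append the other taxonomy's soft classifier scores to each document's term-feature vector before training the target classifier.

```python
def augment_features(term_vector, a_scores):
    """Append soft A-label scores to a term-feature vector.

    term_vector: list of term weights (e.g. tf-idf values)
    a_scores:    list of |A| scores from an A-classifier, used
                 in place of a hard 0/1 label column
    """
    return list(term_vector) + list(a_scores)

# A document with three term features; the A-classifier is 80%
# confident in the first of two A-labels.
doc = [0.5, 0.0, 1.2]
augmented = augment_features(doc, [0.8, 0.2])
# augmented == [0.5, 0.0, 1.2, 0.8, 0.2]
```

The augmented vectors then feed the classifier for the target taxonomy in place of the plain term vectors.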
6
SVM-CT: cross-trained SVM
- S(A,0): one-vs-rest SVM ensemble for A, trained on D_A - D_B (docs having only A-labels); returns |A| scores for each test doc (signed distance from the separator)
- S(B,1): one-vs-rest SVM ensemble for B (the target label set), trained on D_B - D_A (docs having only B-labels); its feature vector is the term features plus the |A| label features
- A test case with the A-label known is coded with a vector of +1 and -1: term features followed by -1, ..., -1, +1, -1, ...
- Iterate the cross-training: S(A,1), S(B,2), S(A,2), ...
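The +1/-1 coding of a known A-label can be sketched as follows (a small illustrative helper, not the paper's code): the label's own position gets +1 and every other position -1, so a known label and a vector of SVM scores occupy the same feature slots.

```python
def code_known_label(label_index, num_labels):
    """Code a known A-label as a +1/-1 vector of length |A|:
    +1 at the label's position, -1 everywhere else."""
    return [1.0 if i == label_index else -1.0 for i in range(num_labels)]

# Label 2 out of 5 A-labels:
coded = code_known_label(2, 5)
# coded == [-1.0, -1.0, 1.0, -1.0, -1.0]
```

When the A-label is unknown, the |A| signed distances returned by the A-ensemble fill the same positions instead.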
7
SVM-CT anecdotes
- The discriminant reveals relations between A and B labels: one-to-one, many-to-one, related, antagonistic
- However, accuracy gains are meager
[Figure: example discriminant weights on the label features, split into positive and negative]
8
EM1D: information from unlabeled docs
- Use the training docs to induce an initial classifier for taxonomy B, say
- Repeat until the classifier is satisfactory:
  - Estimate Pr(β|d) for each unlabeled doc d and each β in B
  - Reweigh d by the factor Pr(β|d) and add it to the training set for label β
  - Retrain the classifier
- EM1D = expectation maximization with one label set B (Nigam et al.)
- Ignores labels from the other taxonomy A
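The EM1D loop above can be sketched as a tiny semi-supervised naive Bayes trainer (an illustrative reconstruction in the style of Nigam et al.; all names and the toy data are hypothetical, not the paper's code):

```python
import math

def nb_train(docs, vocab, labels):
    """Multinomial naive Bayes with Laplace smoothing.
    docs: list of (tokens, {label: weight}) pairs; weights may be soft."""
    prior = {c: 1.0 for c in labels}
    counts = {c: {t: 1.0 for t in vocab} for c in labels}
    for tokens, label_weights in docs:
        for c, w in label_weights.items():
            prior[c] += w
            for t in tokens:
                if t in counts[c]:
                    counts[c][t] += w
    z = sum(prior.values())
    log_prior = {c: math.log(prior[c] / z) for c in labels}
    log_theta = {}
    for c in labels:
        tot = sum(counts[c].values())
        log_theta[c] = {t: math.log(counts[c][t] / tot) for t in vocab}
    return log_prior, log_theta

def nb_posterior(tokens, log_prior, log_theta):
    """Pr(label | doc) under the current model."""
    scores = {c: log_prior[c] + sum(log_theta[c][t] for t in tokens
                                    if t in log_theta[c])
              for c in log_prior}
    m = max(scores.values())
    exps = {c: math.exp(s - m) for c, s in scores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

def em1d(labeled, unlabeled, vocab, labels, iters=5):
    """E-step: soft-label the unlabeled docs with Pr(label|d).
    M-step: retrain on labeled docs plus the reweighed unlabeled docs."""
    hard = [(toks, {lab: 1.0}) for toks, lab in labeled]
    model = nb_train(hard, vocab, labels)
    for _ in range(iters):
        soft = [(toks, nb_posterior(toks, *model)) for toks in unlabeled]
        model = nb_train(hard + soft, vocab, labels)
    return model

# Toy usage: one labeled doc per class, two unlabeled docs.
vocab = ["ball", "bat", "vote", "law"]
labeled = [(["ball", "bat"], "sports"), (["vote", "law"], "politics")]
unlabeled = [["ball", "ball", "bat"], ["vote"]]
model = em1d(labeled, unlabeled, vocab, ["sports", "politics"])
post = nb_posterior(["ball"], *model)
# "ball" should lean strongly toward "sports"
```

Stratified EM1D (next slide) simply runs one such loop per A-label row.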
9
Stratified EM1D
- Target labels = B; B-labeled docs are the labeled training instances
- Consider the A-labeled docs with label α: these are unlabeled for taxonomy B
- Run EM1D separately for each row α
- A test instance has a known α: invoke the semi-supervised model for row α to classify
- Equivalent to EM2D minus the 2D model interaction
[Figure: grid with A topics as rows and B topics as columns; docs in D_A - D_B labeled α fill row α, docs in D_B - D_A carry B-labels]
10
EM2D: Cartesian product EM
- Initialize with the fully labeled docs, which go to a specific (α, β) cell
- Smear each half-labeled training doc across its label row or column
  - A uniform smear could be bad; use a naive Bayes classifier to seed
- Parameters extend those of EM1D:
  - π_{α,β}: prior probability of the label pair (α, β)
  - θ_{α,β,t}: multinomial term probability for the pair (α, β)
[Figure: grid of labels in A vs. labels in B; an A-labeled doc is smeared across a row, a B-labeled doc across a column]
11
EM2D updates
- E-step for an A-labeled document: distribute the doc across the cells of its row in proportion to the current model's posterior
- M-step:
  - Updated class-pair priors
  - Updated class-pair-conditioned term statistics
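The slide's equations are not reproduced in this transcript; under the standard multinomial mixture assumptions and the parameters π_{α,β}, θ_{α,β,t} defined on the previous slide, the updates would take roughly this form (a reconstruction, with n(d,t) the count of term t in doc d and smoothing omitted):

```latex
% E-step for a doc d with known A-label \alpha: posterior over B-labels
\Pr(\beta \mid d, \alpha) =
  \frac{\pi_{\alpha,\beta} \prod_{t \in d} \theta_{\alpha,\beta,t}^{\,n(d,t)}}
       {\sum_{\beta'} \pi_{\alpha,\beta'} \prod_{t \in d} \theta_{\alpha,\beta',t}^{\,n(d,t)}}

% M-step: class-pair priors and class-pair-conditioned term stats
\pi_{\alpha,\beta} \propto \sum_{d} \Pr(\alpha,\beta \mid d)
\qquad
\theta_{\alpha,\beta,t} \propto \sum_{d} \Pr(\alpha,\beta \mid d)\, n(d,t)
```

The symmetric E-step for a B-labeled doc distributes it across its column instead of its row.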
12
Applying EM2D to a test doc
- Mapping a B-labeled test doc d to an A-label (e-commerce catalogs):
  - Given β, find argmax_α Pr(α, β | d)
- Classifying a document d with no labels to an A-label:
  - Aggregation: for each α, aggregate Pr(α, β | d) over β, then pick the best α
  - Guessing (EM2D-G): guess the best β* using a B-classifier, then find argmax_α Pr(α, β* | d)
- EM pitfalls: damping factor, early stopping
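The two ways of reading off an A-label from the pair posterior can be sketched as follows (illustrative helpers over a toy posterior grid; names and data are hypothetical):

```python
def map_b_to_a(pair_posterior, beta):
    """Known B-label beta: argmax over alpha of Pr(alpha, beta | d).
    pair_posterior: dict {(alpha, beta): probability}."""
    candidates = {a: p for (a, b), p in pair_posterior.items() if b == beta}
    return max(candidates, key=candidates.get)

def classify_by_aggregation(pair_posterior):
    """No labels known: sum Pr(alpha, beta | d) over beta, pick best alpha."""
    totals = {}
    for (a, b), p in pair_posterior.items():
        totals[a] = totals.get(a, 0.0) + p
    return max(totals, key=totals.get)

# Toy pair posterior over Region x Topic for one document
post = {("uk", "cricket"): 0.5, ("usa", "cricket"): 0.1,
        ("uk", "baseball"): 0.1, ("usa", "baseball"): 0.3}
map_b_to_a(post, "cricket")      # "uk"
classify_by_aggregation(post)    # "uk" (0.6 vs 0.4)
```

EM2D-G replaces the sum with a single guessed column β*, trading a little accuracy risk for a sharper posterior.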
13
Experiments
- Selected 5 Dmoz and Yahoo subtree pairs
- Compared EM2D against:
  - Naive Bayes (best #features and smoothing)
  - EM1D: ignore labels from the other taxonomy, treat those docs as unlabeled
  - Stratified EM1D
- Tasks: mapping a test doc with an A-label to a B-label (or vice versa); classifying a zero-labeled test doc
- Accuracy = fraction of test docs given correct labels
14
Accuracy benefits in mapping
- EM1D and NB are close, because the training set for each taxonomy is not too small
- EM2D > Stratified EM1D > NB
  - 2D transfer of model information seems important
- Improvement over NB: 30% best, 10% average
15
Asymmetric setting
- Few (only 300) bookmarked URLs (taxonomy B, the target)
- Many Yahoo URLs and a larger number of classes (taxonomy A)
- Need to control the damping factor (the relative importance of labeled vs. unlabeled docs) to tackle population skew
16
Zero-labeled test documents
- EM1D improves accuracy only for 12 train docs
- EM2D with guessing improves beyond EM1D; in fact, better than aggregating scores down to 1D
- The choice of the unlabeled:labeled damping ratio L may be important to get benefits
17
Robustness to initialization
- Seeding choices: hard (best class), NB scores, uniform
- Smear a fraction of the mass uniformly and the rest by NB scores
- EM2D is robust across a wide range of smear fractions
- Fully uniform smearing can fail (local optima)
[Figure: accuracy under uniform smear vs. naive Bayes smear]
18
Related work
- Multi-task learning, "life-long learning", inductive transfer (Thrun; Caruana)
  - Find earlier learning tasks similar to the current one; reuse models, features, parameters
- Co-training (Blum, Mitchell)
  - Two learners over a single label set, with a partitioned feature set
- Catalog mapping (Agrawal, Srikant)
  - Use two-label docs to estimate priors; raise the prior to an exponent, tuned by validation
  - EM2D: a generative model, slightly better accuracy
19
Summary and future work
- Two algorithms for cross-training:
  - EM2D: an EM-based semi-supervised algorithm
  - SVM-CT: an SVM-based algorithm
- Benefits: improved accuracy; interpretable mappings between label sets
- General issue: how best to deal with a large number of heterogeneous attributes?
- Future work:
  - Brittle naive Bayes scores in EM2D
  - Small relative gains in SVM-CT: better kernels? feature selection?