Instance Weighting for Domain Adaptation in NLP
Jing Jiang & ChengXiang Zhai
University of Illinois at Urbana-Champaign
June 25, 2007
Domain Adaptation
Many NLP tasks are cast as classification problems, but labeled training data is often lacking in new domains. Domain adaptation examples:
– POS tagging: WSJ → biomedical text
– NER: news → blogs, speech
– Spam filtering: public corpus → personal inboxes
Domain overfitting (NER F1):
Task | Train → Test | F1
find PER, LOC, ORG in news text | NYT → NYT | 0.855
find PER, LOC, ORG in news text | Reuters → NYT | 0.641
find gene/protein in biomedical literature | mouse → mouse | 0.541
find gene/protein in biomedical literature | fly → mouse | 0.281
Existing Work on Domain Adaptation
Existing work:
– Prior on model parameters [Chelba & Acero 04]
– Mixture of general and domain-specific distributions [Daumé III & Marcu 06]
– Analysis of representations [Ben-David et al. 07]
Our work:
– A fresh instance weighting perspective
– A framework that incorporates both labeled and unlabeled instances
Outline
– Analysis of domain adaptation
– Instance weighting framework
– Experiments
– Conclusions
The Need for Domain Adaptation
[Figure: labeled source-domain instances vs. differently distributed target-domain instances]
Where Does the Difference Come From?
The joint distribution factors as p(x, y) = p(x) p(y | x), so a difference between domains can come from two sources:
– labeling difference: p_s(y | x) ≠ p_t(y | x) → calls for labeling adaptation
– instance difference: p_s(x) ≠ p_t(x) → calls for instance adaptation
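To make the factorization concrete, here is a toy numeric illustration (not from the talk; all numbers invented): the same conditional p(y | x) combined with different marginals p(x) gives a pure instance difference, while changing p(y | x) under a fixed p(x) would give a labeling difference.

```python
import numpy as np

def joint(p_x, p_y_given_x):
    """Joint p(x, y) from a marginal and a conditional; x, y binary."""
    return p_x[:, None] * p_y_given_x

p_y_given_x = np.array([[0.9, 0.1],   # p(y | x=0)
                        [0.2, 0.8]])  # p(y | x=1)

# Instance difference only: same conditional, shifted marginal p(x).
source = joint(np.array([0.8, 0.2]), p_y_given_x)
target = joint(np.array([0.3, 0.7]), p_y_given_x)

# A labeling difference would instead change p(y | x) while keeping p(x),
# e.g., target_cond = np.array([[0.6, 0.4], [0.2, 0.8]]).
print(source)
print(target)
```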
An Instance Weighting Solution (Labeling Adaptation)
When p_t(y | x) ≠ p_s(y | x): remove or demote the source instances whose labels are misleading in the target domain.
[Figure: source instances conflicting with target labels are removed/demoted]
An Instance Weighting Solution (Instance Adaptation: p_t(x) < p_s(x))
When a region of the instance space is less likely in the target domain (p_t(x) < p_s(x)): remove or demote the source instances in that region.
[Figure: over-represented source instances are removed/demoted]
An Instance Weighting Solution (Instance Adaptation: p_t(x) > p_s(x))
When a region is more likely in the target domain (p_t(x) > p_s(x)): promote the instances in that region.
[Figure: under-represented instances are promoted]
An Instance Weighting Solution (Instance Adaptation: p_t(x) > p_s(x))
– Labeled target-domain instances are useful
– Unlabeled target-domain instances may also be useful
[Figure: target-domain instances fill the region where p_t(x) > p_s(x)]
The Exact Objective Function
θ* = argmax_θ ∫_X Σ_y p_t(x) p_t(y | x) log p(y | x; θ) dx
– p_t(x), p_t(y | x): unknown true marginal and conditional probabilities in the target domain
– log p(y | x; θ): log-likelihood (log loss function)
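As a sanity check on why this objective is "exact", here is a minimal sketch (not from the talk) on a discrete toy domain where p_t is known: the expected log-likelihood is maximized exactly when the model matches the target conditional. In practice p_t(x) and p_t(y | x) are unknown, which is why the objective must be approximated.

```python
import numpy as np

# Toy target distribution over binary x and y, assumed known here.
p_t_x = np.array([0.3, 0.7])               # target marginal p_t(x)
p_t_y_given_x = np.array([[0.9, 0.1],      # target conditional p_t(y | x)
                          [0.2, 0.8]])

def ideal_objective(model_p_y_given_x):
    """Expected log-likelihood under the true target distribution."""
    return np.sum(p_t_x[:, None] * p_t_y_given_x *
                  np.log(model_p_y_given_x + 1e-12))

# Maximized by matching the target conditional exactly (Gibbs' inequality):
print(ideal_objective(p_t_y_given_x))                 # best achievable
print(ideal_objective(np.array([[0.5, 0.5],
                                [0.5, 0.5]])))        # strictly worse
```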
Three Sets of Instances
– D_s: labeled instances from the source domain
– D_t,l: labeled instances from the target domain
– D_t,u: unlabeled instances from the target domain
Three Sets of Instances: Using D_s
Approximate the integral over X with the source instances in D_s, reweighting each instance by:
– p_t(y | x) / p_s(y | x): needs labeled target data to estimate
– p_t(x) / p_s(x): in principle, non-parametric density estimation; in practice, hard for high-dimensional data (future work)
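The talk leaves density estimation for the ratio p_t(x) / p_s(x) as future work. One common practical stand-in (an assumption of this sketch, not the authors' method) is a discriminative estimate: train a source-vs-target domain classifier and convert its probabilities into the ratio.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def density_ratio_beta(Xs, Xt_unlabeled):
    """Approximate beta_i = p_t(x) / p_s(x) on the source instances via a
    domain discriminator d(x) = P(domain = target | x):
    p_t(x)/p_s(x) = [d(x) / (1 - d(x))] * (N_s / N_t)."""
    X = np.vstack([Xs, Xt_unlabeled])
    domain = np.concatenate([np.zeros(len(Xs)), np.ones(len(Xt_unlabeled))])
    disc = LogisticRegression().fit(X, domain)
    d = disc.predict_proba(Xs)[:, 1]                  # P(target | x) on D_s
    ratio = d / np.clip(1.0 - d, 1e-6, None)
    return ratio * (len(Xs) / len(Xt_unlabeled))      # correct for set sizes
```

Training on D_s with these weights then approximates training under the target marginal p_t(x).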
Three Sets of Instances: Using D_t,l
Approximate the integral over X with D_t,l directly; but the sample size is small, so the estimation is not accurate.
Three Sets of Instances: Using D_t,u
Approximate the integral over X with D_t,u; the labels are unknown, so use pseudo-labels (e.g., from bootstrapping or EM).
Using All Three Sets of Instances
Can we approximate the objective using D_s + D_t,l + D_t,u together?
A Combined Framework
θ* = argmax_θ [ λ_s Σ_i α_i β_i log p(y_i | x_i; θ) / C_s
              + λ_t,l Σ_j log p(y_j | x_j; θ) / C_t,l
              + λ_t,u Σ_k Σ_y γ_k(y) log p(y | x_k; θ) / C_t,u ] + log p(θ)
– α_i, β_i: per-instance weights on D_s; γ_k(y): weight of tentative label y for unlabeled instance x_k
– λ_s + λ_t,l + λ_t,u = 1 trade off the three sets; the C's normalize the weights
A flexible setup covering both standard methods and new domain-adaptive methods.
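A minimal numpy sketch of the combined objective, written as a negative log-likelihood for a binary logistic model; the function name, the {0, 1} label encoding, and the Gaussian prior standing in for log p(θ) are choices of this sketch.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def combined_nll(theta, Xs, ys, alpha, beta, Xtl, ytl, Xtu, gamma,
                 lam_s, lam_tl, lam_tu, l2=1.0):
    """Negative combined objective for p(y=1 | x) = sigmoid(x . theta).
    gamma[k, y] weights the tentative label y for unlabeled instance k."""
    def log_p(X, y):
        p = sigmoid(X @ theta)
        return np.where(y == 1, np.log(p + 1e-12), np.log(1.0 - p + 1e-12))

    w = alpha * beta                                   # source instance weights
    term_s = np.sum(w * log_p(Xs, ys)) / max(w.sum(), 1e-12)
    term_tl = np.mean(log_p(Xtl, ytl)) if len(ytl) else 0.0
    zeros, ones = np.zeros(len(Xtu)), np.ones(len(Xtu))
    term_tu = np.sum(gamma[:, 0] * log_p(Xtu, zeros) +
                     gamma[:, 1] * log_p(Xtu, ones)) / max(gamma.sum(), 1e-12)
    prior = -0.5 * l2 * theta @ theta                  # Gaussian log p(theta)
    return -(lam_s * term_s + lam_tl * term_tl + lam_tu * term_tu + prior)
```

Minimizing this with any gradient-based optimizer (e.g., scipy.optimize.minimize) recovers the special cases on the following slides by fixing the λ's and instance weights accordingly.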
Standard Supervised Learning Using Only D_s
α_i = β_i = 1, λ_s = 1, λ_t,l = λ_t,u = 0
Standard Supervised Learning Using Only D_t,l
λ_t,l = 1, λ_s = λ_t,u = 0
Standard Supervised Learning Using Both D_s and D_t,l
α_i = β_i = 1, λ_s = N_s / (N_s + N_t,l), λ_t,l = N_t,l / (N_s + N_t,l), λ_t,u = 0
(all instances weighted equally, so each set's weight is proportional to its size)
Domain Adaptive Heuristic 1: Instance Pruning
α_i = 0 if (x_i, y_i) is predicted incorrectly by a model trained on D_t,l; α_i = 1 otherwise
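A minimal scikit-learn sketch of this heuristic; the choice of logistic regression is illustrative, and the variant in the result slides that prunes only the top k most misleading instances is omitted.

```python
from sklearn.linear_model import LogisticRegression

def prune_source(Xs, ys, Xtl, ytl):
    """Heuristic 1: set alpha_i = 0 for source instances that a model
    trained on the small labeled target set D_t,l misclassifies."""
    target_model = LogisticRegression().fit(Xtl, ytl)
    alpha = (target_model.predict(Xs) == ys).astype(float)
    return alpha  # 1 = keep, 0 = prune
```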
Domain Adaptive Heuristic 2: D_t,l with Higher Weights
Give D_t,l more than its proportional share: λ_t,l > N_t,l / (N_s + N_t,l), equivalently λ_s < N_s / (N_s + N_t,l)
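A sketch of this heuristic using per-instance sample weights, matching the "D_s + a·D_t,l" settings that appear in the result slides; the classifier choice is illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_with_promoted_target(Xs, ys, Xtl, ytl, a=10.0):
    """Heuristic 2: give each D_t,l instance a times the weight of a
    source instance (the "D_s + a*D_t,l" setting)."""
    X = np.vstack([Xs, Xtl])
    y = np.concatenate([ys, ytl])
    w = np.concatenate([np.ones(len(ys)), a * np.ones(len(ytl))])
    return LogisticRegression().fit(X, y, sample_weight=w)
```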
Standard Bootstrapping
γ_k(y) = 1 if p(y | x_k) is large; 0 otherwise (self-train on confidently pseudo-labeled target instances)
Domain Adaptive Heuristic 3: Balanced Bootstrapping
γ_k(y) = 1 if p(y | x_k) is large; 0 otherwise
λ_s = λ_t,u = 0.5 (source and pseudo-labeled target get equal total weight)
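A sketch of balanced bootstrapping as self-training with λ_s = λ_t,u = 0.5; the confidence criterion (top fraction per round) and the round count are assumptions of this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def balanced_bootstrap(Xs, ys, Xtu, rounds=5, top_frac=0.1):
    """Self-train on unlabeled target data, giving the pseudo-labeled
    target set the same total weight as D_s (lambda_s = lambda_t,u = 0.5)
    rather than weight proportional to its size."""
    model = LogisticRegression().fit(Xs, ys)
    X_pl = np.empty((0, Xs.shape[1]))
    y_pl = np.empty(0, dtype=ys.dtype)
    remaining = Xtu
    for _ in range(rounds):
        if len(remaining) == 0:
            break
        proba = model.predict_proba(remaining)
        k = max(1, int(top_frac * len(remaining)))
        idx = np.argsort(-proba.max(axis=1))[:k]      # most confident first
        X_pl = np.vstack([X_pl, remaining[idx]])
        y_pl = np.concatenate([y_pl, model.classes_[proba[idx].argmax(axis=1)]])
        remaining = np.delete(remaining, idx, axis=0)
        # Balanced: source and pseudo-labeled target each get total weight 0.5.
        w = np.concatenate([np.full(len(ys), 0.5 / len(ys)),
                            np.full(len(y_pl), 0.5 / len(y_pl))])
        model = LogisticRegression().fit(np.vstack([Xs, X_pl]),
                                         np.concatenate([ys, y_pl]),
                                         sample_weight=w)
    return model
```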
Experiments
Three NLP tasks:
– POS tagging: WSJ (Penn TreeBank) → Oncology text (Penn BioIE)
– NE type classification: newswire → conversational telephone speech (CTS) and web logs (WL) (ACE 2005)
– Spam filtering: public collection → personal inboxes (u01, u02, u03) (ECML/PKDD 2006)
Experiments
Three heuristics:
1. Instance pruning
2. D_t,l with higher weights
3. Balanced bootstrapping
Performance measure: accuracy
Instance Pruning: Removing "Misleading" Instances from D_s
[Results tables: accuracy vs. number k of pruned instances for NE type (CTS, WL), POS (Oncology), and spam (Users 1–3); the values were lost in extraction, apart from one surviving entry (0.8830)]
– Useful in most cases; failed in some cases
– When is it guaranteed to work? (future work)
D_t,l with Higher Weights until D_s and D_t,l Are Balanced
[Results tables (accuracy values lost in extraction):
– NE type (CTS, WL): D_s vs. D_s + D_t,l vs. D_s + 5 D_t,l vs. D_s + 10 D_t,l
– POS (Oncology): D_s vs. D_s + D_t,l vs. D_s + 10 D_t,l vs. D_s + 20 D_t,l
– Spam (Users 1–3): D_s vs. D_s + D_t,l vs. D_s + 5 D_t,l vs. D_s + 10 D_t,l]
– D_t,l is very useful
– Promoting D_t,l is more useful
Instance Pruning + D_t,l with Higher Weights
[Results tables (accuracy values lost in extraction): D_s + a D_t,l vs. pruned D_s' + a D_t,l, with a = 10 for NE type (CTS, WL) and spam (Users 1–3), a = 20 for POS (Oncology)]
– The two heuristics do not work well together
– How to combine heuristics? (future work)
Balanced Bootstrapping
[Results tables (accuracy values lost in extraction): supervised vs. standard bootstrapping vs. balanced bootstrapping on NE type (CTS, WL), POS (Oncology), and spam (Users 1–3)]
– Promoting target instances is useful, even with pseudo-labels
Conclusions
– Formally analyzed domain adaptation from an instance weighting perspective
– Proposed an instance weighting framework for domain adaptation that incorporates both labeled and unlabeled instances through various weight parameters
– Proposed a number of heuristics for setting the weight parameters
– Experiments showed the effectiveness of the heuristics
Future Work
– Combining different heuristics
– Principled ways to set the weight parameters (e.g., density estimation for setting β)
Thank You!