Domain Adaptation in Natural Language Processing
Jing Jiang
Department of Computer Science, University of Illinois at Urbana-Champaign
Jan 8
Textual Data in the Information Age
Textual data contains much useful information, e.g., more than 85% of corporate data is stored as text. But it is hard to handle:
– Large amount: by 2002, 2.5 billion documents on the surface Web, growing by 7.3 million per day
– Diversity: emails, news, digital libraries, Web logs, etc.
– Unstructured: unlike relational databases
How do we manage textual data?
Information retrieval ranks documents by their relevance to keyword queries, but this is not always satisfactory – more sophisticated services are desired.
Automatic Text Summarization
Question Answering
Information Extraction
Company – Founder
Google – Larry Page
Beyond Information Retrieval
– Automatic text summarization
– Question answering
– Information extraction
– Sentiment analysis
– Machine translation
– Etc.
All rely on Natural Language Processing (NLP) techniques to deeply understand and analyze text.
Typical NLP Tasks
“Larry Page was Google’s founding CEO”
– Part-of-speech tagging: Larry/noun Page/noun was/verb Google/noun ’s/possessive-end founding/adjective CEO/noun
– Chunking: [NP: Larry Page] [V: was] [NP: Google ’s founding CEO]
– Named entity recognition: [person: Larry Page] was [organization: Google] ’s founding CEO
– Relation extraction: Founder(Larry Page, Google)
– Word sense disambiguation: “Larry Page” vs. “Page 81”
State-of-the-art solution: supervised machine learning
Supervised Learning for NLP
[figure: part-of-speech tagging on news articles – a representative corpus (WSJ articles) is human-annotated, yielding POS-tagged WSJ articles such as Larry/NNP Page/NNP was/VBD Google/NNP ’s/POS founding/ADJ CEO/NN; a standard supervised learning algorithm trains a POS tagger on this data]
In Reality…
[figure: part-of-speech tagging on biomedical articles, e.g., We/PRP analyzed/VBD the/DT mutations/NNS of/IN the/DT H-ras/NN genes/NNS – annotating a representative MEDLINE corpus by hand is expensive, so only POS-tagged WSJ articles are available for training]
Many Other Examples
– Named entity recognition: news articles → personal blogs; organism A → organism B
– Spam filtering: public collection → personal inboxes
– Sentiment analysis of product reviews (positive vs. negative): movies → books; cell phones → digital cameras
What is the problem with this non-standard setting, where the domains differ?
Domain Difference → Performance Degradation
– Ideal setting: train on MEDLINE, test on MEDLINE → POS tagger accuracy ~96%
– Realistic setting: train on WSJ, test on MEDLINE → POS tagger accuracy ~86%
Another Example
– Ideal setting (train and test on the same domain): gene name recognizer at 54.1%
– Realistic setting (train and test on different domains): gene name recognizer at 28.1%
Domain Adaptation
[figure: labeled source domain data and unlabeled target domain data feed a domain adaptive learning algorithm]
Goal: to design learning algorithms that are aware of the domain difference and exploit all available data to adapt to the target domain
With Domain Adaptation Techniques…
– Standard learning: Fly + Mouse → Yeast gene name recognizer at 63.3%
– Domain adaptive learning: Fly + Mouse → Yeast gene name recognizer at 75.9%
Roadmap
– What is domain adaptation in NLP?
– Our work: overview; instance weighting; feature selection
– Summary and future work
Overview [figure: source domain and target domain]
Ideal Goal [figure: source domain and target domain]
Standard Supervised Learning [figure: source domain and target domain]
Standard Semi-Supervised Learning [figure: source domain and target domain]
Idea 1: Generalization [figure: source domain and target domain]
Idea 2: Adaptation [figure: source domain and target domain]
How to formally formulate the ideas? [figure: source domain and target domain]
Instance Weighting
[figure: instance space; each point represents an observed instance]
Goal: to find appropriate weights for different instances
Feature Selection
[figure: feature space; each point represents a useful feature]
Goal: to separate generalizable features from domain-specific features
Roadmap
– What is domain adaptation in NLP?
– Our work: overview; instance weighting; feature selection
– Summary and future work
Observation [figure: source domain and target domain]
Analysis of Domain Difference
p(x, y) = p(x) p(y | x), where x is an observed instance and y is the class label to be predicted
– Instance difference: p_s(x) ≠ p_t(x) → instance adaptation
– Labeling difference: p_s(y | x) ≠ p_t(y | x) → labeling adaptation
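The instance-difference term above suggests weighting each source instance by the density ratio p_t(x) / p_s(x). A minimal sketch using raw relative-frequency estimates over discrete instances (the function name and estimator are illustrative, not from the talk; real instance spaces need proper density estimation):

```python
from collections import Counter

def importance_weights(source_xs, target_xs):
    """Weight each source instance x by p_t(x) / p_s(x), with both
    densities estimated by simple relative frequencies (illustrative)."""
    ps, pt = Counter(source_xs), Counter(target_xs)
    n_s, n_t = len(source_xs), len(target_xs)
    return [(pt[x] / n_t) / (ps[x] / n_s) for x in source_xs]

# "a" is over-represented in the source, "b" under-represented
weights = importance_weights(["a", "a", "b"], ["a", "b", "b"])
```

Instances more common in the target than the source get weights above 1 (promotion); the reverse get weights below 1 (demotion).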
Labeling Adaptation
[figure: source domain and target domain] Where p_t(y | x) ≠ p_s(y | x), remove or demote those source instances.
Instance Adaptation (p_t(x) < p_s(x))
[figure: source domain and target domain] Where p_t(x) < p_s(x), remove or demote those source instances.
Instance Adaptation (p_t(x) > p_s(x))
[figure: source domain and target domain] Where p_t(x) > p_s(x), promote those instances.
Instance Adaptation (p_t(x) > p_s(x))
[figure: source domain and target domain] Where p_t(x) > p_s(x), target domain instances are useful.
Empirical Risk Minimization with Three Sets of Instances
– Three instance sets: D_s (labeled source), D_t,l (labeled target), D_t,u (unlabeled target)
– Define a loss function; the optimal classification model minimizes the expected loss; replace the expected loss with the empirical loss over the available instances
Using D_s
– Correcting the instance difference p_s(x) ≠ p_t(x) is hard for high-dimensional data
– Correcting the labeling difference p_s(y | x) ≠ p_t(y | x) needs labeled target data
Using D_t,l
– Small sample size, so the estimation is not accurate
Using D_t,u
– Use predicted labels (bootstrapping)
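The combination of the three sets can be sketched as a weighted empirical risk; this is a hedged illustration (the names and the per-set rather than per-instance weights are simplifications of the framework described in the talk):

```python
def combined_empirical_risk(loss, model, D_s, D_tl, D_tu_pred,
                            lam_s=1.0, lam_tl=1.0, lam_tu=1.0):
    """D_s, D_tl: (x, y) pairs; D_tu_pred: (x, predicted_y) pairs.
    loss(model, x, y) -> float. Returns the weighted empirical risk."""
    def avg(pairs):
        return sum(loss(model, x, y) for x, y in pairs) / len(pairs) if pairs else 0.0
    return lam_s * avg(D_s) + lam_tl * avg(D_tl) + lam_tu * avg(D_tu_pred)

# Toy check: absolute-error loss with a constant "model" 0 and no D_t,u
risk = combined_empirical_risk(lambda m, x, y: abs(m - y), 0,
                               [(None, 1), (None, 3)], [(None, 2)], [])
```

Setting lam_tl or lam_tu to 0 recovers standard supervised learning on D_s; raising them promotes the target-domain sets.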
Combined Framework
A flexible setup covering both standard methods and new domain adaptive methods
Experiments
NLP tasks:
– POS tagging: WSJ (Penn TreeBank) → Oncology text (Penn BioIE)
– NE type classification: newswire → conversational telephone speech (CTS) and web log (WL) (ACE 2005)
– Spam filtering: public collection → personal inboxes (u01, u02, u03) (ECML/PKDD 2006)
Three heuristics to partially explore the parameter settings
Instance Pruning
Removing “misleading” instances from D_s; evaluated on the POS, NE type (CTS, WL), and spam (User 1–3) tasks [results table not recoverable]
– Useful in most cases, but failed in some
– When is it guaranteed to work? (future work)
D_t,l with Larger Weights
Methods compared on the POS, NE type (CTS, WL), and spam (User 1–3) tasks: D_s alone, D_s + D_t,l, and D_s + k·D_t,l for k = 5, 10, or 20 [results table not recoverable]
– D_t,l is very useful
– Promoting D_t,l is even more useful
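One way to realize "D_s + k·D_t,l" in any off-the-shelf learner is to replicate the labeled target instances k times; a trivial sketch of that intuition (the experiments above set per-instance weights rather than literally copying data):

```python
def promote_target(D_s, D_tl, k=10):
    """Build a training set equivalent to D_s + k * D_t,l by replicating
    the labeled target instances k times."""
    return list(D_s) + list(D_tl) * k

train = promote_target([("x1", "A"), ("x2", "B")], [("t1", "A")], k=5)
```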
Bootstrapping with Larger Weights
Promoted instances are weighted up until D_s and D_t,u are balanced. Methods compared on the POS, NE type (CTS, WL), and spam (User 1–3) tasks: supervised, standard bootstrapping, balanced bootstrapping [results table not recoverable]
– Promoting target instances is useful, even with predicted labels
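A minimal sketch of balanced bootstrapping under assumed interfaces (`fit` consumes (x, weight, y) triples and `predict` returns a label with a confidence; both names are illustrative, not the talk's implementation): each round promotes the most confident unlabeled target instances with their predicted labels, then reweights them so the total target weight matches the total source weight.

```python
from collections import defaultdict

def balanced_bootstrap(fit, predict, D_s, D_tu, rounds=2, per_round=1):
    """D_s: (x, y) pairs; D_tu: unlabeled target instances."""
    promoted = []                                   # (x, predicted_label)
    pool = list(D_tu)
    model = fit([(x, 1.0, y) for x, y in D_s])
    for _ in range(rounds):
        if not pool:
            break
        preds = [(predict(model, x), x) for x in pool]
        preds.sort(key=lambda p: -p[0][1])          # most confident first
        for (label, _conf), x in preds[:per_round]:
            promoted.append((x, label))
        pool = [x for _p, x in preds[per_round:]]
        w = len(D_s) / len(promoted)                # balance target vs. source
        model = fit([(x, 1.0, y) for x, y in D_s] +
                    [(x, w, y) for x, y in promoted])
    return model, promoted

# Toy instantiation: a weighted majority-vote "model" with constant confidence
def fit(triples):
    votes = defaultdict(float)
    for _x, w, y in triples:
        votes[y] += w
    return max(votes, key=votes.get)

def predict(model, x):
    return model, 1.0

model, promoted = balanced_bootstrap(fit, predict,
                                     [(1, "A"), (2, "A"), (3, "B")], [4, 5])
```

Standard bootstrapping corresponds to fixing w = 1 instead of rebalancing each round.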
Roadmap
– What is domain adaptation in NLP?
– Our work: overview; instance weighting; feature selection
– Summary and future work
Observation 1: Domain-Specific Features
– wingless, daughterless, eyeless, apexless, … describe phenotype in fly gene nomenclature, so the feature “-less” is useful for this organism
– CD38, PABPC5, … is the feature still useful for other organisms? No!
Observation 2: Generalizable Features
“…decapentaplegic and wingless are expressed in analogous patterns in each…”
“…that CD38 is expressed by both neurons and glial cells…” “…that PABPC5 is expressed in fetal brain and in a range of adult tissues.”
The feature “X be expressed” generalizes across organisms.
Assume Multiple Source Domains
[figure: labeled data from several source domains plus unlabeled target domain data feed a domain adaptive learning algorithm]
Detour: Logistic Regression Classifiers
x is a binary feature vector (e.g., features “-less” and “X be expressed”, firing on “… and wingless are expressed in …”); the classifier models p(y | x) ∝ exp(w_y^T x).
Learning a Logistic Regression Classifier
Maximize the log likelihood of the training data minus a regularization term that penalizes large weights and controls model complexity.
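The objective just described (log likelihood plus an L2 penalty) can be sketched for the binary case with gradient ascent; a self-contained illustration, not the talk's implementation:

```python
import math

def train_logreg(X, Y, lam=0.1, lr=0.5, epochs=200):
    """X: binary feature vectors; Y: labels in {0, 1}.
    Maximizes sum_i log p(y_i | x_i) - (lam/2) * ||w||^2 by gradient ascent."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [-lam * wi for wi in w]              # gradient of the penalty
        for x, y in zip(X, Y):
            p = 1.0 / (1.0 + math.exp(-sum(wi * xi for wi, xi in zip(w, x))))
            for j, xj in enumerate(x):
                grad[j] += (y - p) * xj             # log-likelihood gradient
        w = [wi + lr * g for wi, g in zip(w, grad)]
    return w

# Feature 0 fires only on the positive example, feature 1 only on the negative
w = train_logreg([[1, 0], [0, 1]], [1, 0])
```

The regularizer keeps the learned weights finite even on this separable toy data.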
Generalizable Features in Weight Vectors
Train one classifier per source domain D_1, …, D_K, giving weight vectors w_1, …, w_K; generalizable features receive consistent weights across the K domains, while domain-specific features do not.
Decomposition of w_k for Each Source Domain
w_k = A^T v + u_k, where v is shared by all domains, u_k is domain-specific, and A is a matrix that selects the generalizable features.
Framework for Generalization
Fix A and optimize over the w_k: maximize the log likelihood of the labeled data from the K source domains, plus a regularization term with λ_s >> 1 to penalize the domain-specific features.
Framework for Adaptation
Fix A and optimize: maximize the log likelihood of target domain examples with predicted labels, with λ_t = 1 << λ_s so the model can pick up domain-specific features of the target domain.
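The two fixed-A objectives above can be written out as follows; this is a hedged reconstruction from the slide descriptions (the exact form in the paper may differ, and the $\lVert v\rVert^2$ term is an assumed default regularizer). Here $\ell_k$ and $\ell_t$ denote the log likelihoods of the labeled source data and the pseudo-labeled target data, respectively.

```latex
% Generalization over the K source domains, with w_k = A^T v + u_k
% and lambda_s >> 1 penalizing the domain-specific components u_k:
\max_{v,\,u_1,\dots,u_K}\;
  \sum_{k=1}^{K} \ell_k\!\left(A^{\top} v + u_k\right)
  \;-\; \lambda_s \sum_{k=1}^{K} \lVert u_k \rVert^2
  \;-\; \lVert v \rVert^2

% Adaptation to the target domain, with lambda_t = 1 << lambda_s so
% domain-specific target features can be picked up:
\max_{v,\,u_t}\;
  \ell_t\!\left(A^{\top} v + u_t\right)
  \;-\; \lambda_t \lVert u_t \rVert^2
```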
How to Find A? (1)
Joint optimization
How to Find A? (2)
Domain cross validation
– Idea: train on (K – 1) source domains and validate on the held-out source domain
– Approximation: let w_f^k be the weight for feature f learned from domain k, and w_f^(−k) the weight for f learned from the other domains; rank features by how well the two agree
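The held-out-domain approximation can be sketched as follows; the function name and the exact scoring (a summed product of the held-out and rest-of-domains weights, as the intuition slide suggests) are illustrative:

```python
def rank_generalizable(held_out_w, rest_w):
    """held_out_w[k]: feature -> weight learned from domain k alone;
    rest_w[k]: feature -> weight learned from the other K-1 domains.
    Score each feature by the summed product across held-out domains,
    so features weighted consistently high everywhere rank first."""
    scores = {}
    for hw, rw in zip(held_out_w, rest_w):
        for f in hw:
            scores[f] = scores.get(f, 0.0) + hw[f] * rw.get(f, 0.0)
    return sorted(scores, key=scores.get, reverse=True)

# "X be expressed" is weighted high in every split; "-less" is not
ranking = rank_generalizable(
    [{"X be expressed": 1.0, "-less": 1.0},
     {"X be expressed": 0.9, "-less": -0.2}],
    [{"X be expressed": 0.8, "-less": -0.1},
     {"X be expressed": 1.0, "-less": 0.9}])
```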
Intuition for Domain Cross Validation
[figure: the feature “expressed” gets a large weight both in domain D_k (fly) and in the other domains D_1, …, D_{k-1}, while “-less” does not; features are scored by the product of the two weights]
Experiments
Data set: BioCreative Challenge Task 1B; gene/protein recognition; 3 organisms/domains: fly, mouse, and yeast
Experimental setup: 2 organisms for training, 1 for testing; F1 as the performance measure
Experiments: Generalization
Methods compared on F+M→Y, M+Y→F, and Y+F→M (F: fly, M: mouse, Y: yeast): BL, DA-1 (joint optimization), DA-2 (domain cross validation) [results table not recoverable]
– Using generalizable features is effective
– Domain cross validation is more effective than joint optimization
Experiments: Adaptation
Methods compared on F+M→Y, M+Y→F, and Y+F→M (F: fly, M: mouse, Y: yeast): BL-SSL vs. DA-2-SSL [results table not recoverable]
– Domain-adaptive bootstrapping is more effective than regular bootstrapping
Related Work
The problem is relatively new to the NLP and ML communities; most related work was developed concurrently with ours.
Instances used | Standard | Instance weighting | Feature selection | IW + FS
D_s | supervised learning | Shimodaira 00 | Blitzer et al. 06 | our future work
D_s + D_t,l | supervised learning | Daumé III & Marcus 06 | Daumé III 07 |
D_s + D_t,u | semi-supervised learning | ACL’07 | HLT’06, CIKM’07 |
D_s + D_t,l + D_t,u | semi-supervised learning | | |
Roadmap
– What is domain adaptation in NLP?
– Our work: overview; instance weighting; feature selection
– Summary and future work
Summary
Domain adaptation is a critical new problem in natural language processing and machine learning.
Contributions:
– First systematic formal analysis of domain adaptation
– Two novel general frameworks, both shown to be effective
– Potentially applicable to classification problems outside of NLP
Future work:
– A domain difference measure
– Unifying the two frameworks
– Incorporating domain knowledge into the adaptation process
– Leveraging domain adaptation for large-scale information extraction on scientific literature and on the Web
[figure: an information extraction system (entity recognition, relation extraction) that learns intelligently via domain adaptive learning, exploiting knowledge resources (existing knowledge bases, labeled data from related domains) with interactive supervision from a domain expert]
Applications
[figure: the information extraction system processes biomedical literature (MEDLINE abstracts, full-text articles, etc.; e.g., “DWnt-2 is expressed in somatic cells of the gonad throughout development.”), extracting facts such as the expression relation gene DWnt-2 / tissue-position gonad into a knowledge base; an inference engine then supports pathway construction, hypothesis generation, and knowledge base curation]
Applications (cont.)
Similar ideas apply to Web text mining, e.g., product reviews:
– Existing annotated reviews are limited (certain products from certain sources)
– Large amounts of semi-structured reviews are available from review websites
– Unstructured reviews appear in personal blogs
Selected Publications
This talk:
– J. Jiang & C. Zhai. “A two-stage approach to domain adaptation for statistical classifiers.” In CIKM’07.
– J. Jiang & C. Zhai. “Instance weighting for domain adaptation in NLP.” In ACL’07.
– J. Jiang & C. Zhai. “Exploiting domain structure for named entity recognition.” In HLT-NAACL’06.
Feature exploration for relation extraction:
– J. Jiang & C. Zhai. “A systematic exploration of the feature space for relation extraction.” In NAACL-HLT’07.
Information retrieval:
– J. Jiang & C. Zhai. “Extraction of coherent relevant passages using hidden Markov models.” ACM Transactions on Information Systems (TOIS), Jul.
– J. Jiang & C. Zhai. “An empirical study of tokenization strategies for biomedical information retrieval.” Information Retrieval, Oct.
Gene summarization:
– X. Ling, J. Jiang, X. He, Q. Mei, C. Zhai & B. Schatz. “Generating semi-structured gene summaries from biomedical literature.” Information Processing & Management, Nov.
– X. Ling, J. Jiang, X. He, Q. Mei, C. Zhai & B. Schatz. “Automatically generating gene summaries from biomedical literature.” In PSB’06.