The Necessity of Combining Adaptation Methods. Ming-Wei Chang, Michael Connor and Dan Roth. Cognitive Computation Group, University of Illinois.

Presentation transcript:

The Necessity of Combining Adaptation Methods
Ming-Wei Chang, Michael Connor and Dan Roth
Cognitive Computation Group, University of Illinois

Take home message

"It is necessary to combine labeled and unlabeled adaptation frameworks!"

Most works focus on only one aspect. We argue this is not enough because:
1. Mutual Benefit: We analyze these two types of frameworks and find that they address different adaptation issues.
2. Complex Interaction: These two types of frameworks are not independent.

Contributions

- Propose a theoretical analysis of the "Frustratingly Easy" (FE) framework [Daume07]
- Demonstrate the complex interaction between unlabeled and labeled approaches (via artificial experiments)
- Simple "Source+Target" + "Cluster-like features" is often the best approach! (More details later)
- State-of-the-art adaptation performance!

Domain Adaptation

While recent advances in statistical modeling for natural language processing are exciting, the problem of domain adaptation remains a big challenge. It is widely known that a classifier trained on one domain (e.g. the news domain) usually performs poorly on a different domain (e.g. the medical domain). The inability of current statistical models to handle multiple domains is one of the key obstacles hindering the progress of NLP.

Current Approaches

Focuses on P(X) (Unlabeled)
This type of adaptation algorithm attempts to resolve the difference between the feature space statistics of two domains. While many different techniques have been proposed, the common goal of these algorithms is to find (or append) a better shared representation that brings the source domain and the target domain closer. Often these algorithms do not use labeled examples in the target domain. The works [BlitzerMcPe06, HuangYa09] belong to this category.

Focuses on P(Y|X) (Labeled)
These adaptation algorithms assume that a small amount of labeled data exists for the target domain. Instead of training two weight vectors independently (one for the source domain and the other for the target domain), these algorithms try to relate the source and target weight vectors. This is often achieved with a specially designed regularization term. The works [ChelbaAc04, Daume07, FinkelMa09] belong to this category.

Adaptation Frameworks

Framework   Labeled Data         Unlabeled Data            Approach
Unlabeled   Source               Cover Source and Target   Generate features that span domains
Labeled     Source plus Target   None                      Train classifier using both source and target data

Artificial Adaptation Experiments

To demonstrate some of the complexities and benefits of combining adaptation approaches, we ran experiments on artificial data showing the performance of three adaptation frameworks as the similarity between the two domains was controlled. In both experiments, training and test data were generated for two domains according to random hyperplanes whose difference (cosine) was controlled.

Tgt: Train on target only.
FE: Frustratingly Easy.
S+T: Train on source and target labeled data together as one.

[Figure panels: Adaptation Without Clusters; Adaptation With Clusters]

In the first experiment (without clusters) we see that the tasks need to be similar for FE to work. Once they are nearly identical, the simpler S+T is better. In the second experiment a set of identical shared features (clusters) is added to both hyperplanes, so both adaptation algorithms improve, and the cluster adaptation has effectively moved the two tasks closer, enlarging the region where S+T improves over FE. Adding clusters allows a simpler algorithm.

Selected References

[BlitzerMcPe06] John Blitzer, Ryan McDonald, and Fernando Pereira. 2006. Domain adaptation with structural correspondence learning. In EMNLP.
[ChelbaAc04] Ciprian Chelba and Alex Acero. 2004. Adaptation of maximum entropy capitalizer: Little data can help a lot. In EMNLP.
[Daume07] Hal Daumé III. 2007. Frustratingly easy domain adaptation. In ACL.
[FinkelMa09] J. R. Finkel and C. D. Manning. 2009. Hierarchical Bayesian domain adaptation. In NAACL.
[HuangYa09] Fei Huang and Alexander Yates. 2009. Distributional representations for handling sparsity in supervised sequence-labeling. In ACL.
L. Ratinov and D. Roth. 2009. Design challenges and misconceptions in named entity recognition. In CoNLL.

Named Entity Recognition

The goal of this adaptation experiment is to maximize performance on the test data of the MUC7 dataset, given CoNLL training data and (some) MUC7 labeled data. As an unlabeled adaptation method to address feature sparsity, we add cluster-like features based on the gazetteers and word clustering resources used in (Ratinov and Roth, 2009) to bridge the source and target domains.

Importantly, adding cluster-like features changes the behavior of the labeled adaptation algorithms. When the cluster-like features are not added, the FE+ algorithm is in general the best labeled adaptation framework. However, after adding the cluster-like features, the simple S+T approach becomes very competitive with both FE and FE+. Resolving feature sparsity will change the behavior of labeled adaptation frameworks.

NER Experiments

[Table; the numeric cells are missing from this transcript. Columns: Algorithm (TGT, FE, FE+, S+T). Header row "SRC labeled data?": No, Yes. Rows, by TGT labeled data, report token F1: MUC7 Dev; cluster; MUC7 Train; cluster.]

NER Comparison

System          Unlabeled?   Labeled?   P.F1    T.F1
FM09            No           Yes        79.98   N/A
RR09            Yes          No         N/A     83.2
RR09 + global   Yes          No         N/A     86.2
Our NER         Yes

TGT: Uses only the target labeled training dataset.
FE: Uses both labeled datasets.
FE+: Modification of FE, equivalent to multiplying the "shared" part of the FE feature vector by 10 (Finkel and Manning, 2009).
S+T: Uses both source and target labeled datasets to train a single model on all labeled data directly.
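The training setups named in this transcript (TGT, S+T, FE, FE+) and the artificial two-domain data can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the dimension, the sample sizes, the cosine value and the Gaussian inputs are all assumptions chosen for illustration. The FE feature map follows the usual description of [Daume07] (a shared copy of the features plus one domain-specific copy per domain), and FE+ is modeled, as stated above, by scaling the shared copy by 10.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def unit(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def make_hyperplanes(dim, cosine, rng):
    """Return two unit weight vectors whose cosine similarity is `cosine`."""
    w_src = unit([rng.gauss(0, 1) for _ in range(dim)])
    r = [rng.gauss(0, 1) for _ in range(dim)]
    # Gram-Schmidt step: keep only the part of r orthogonal to w_src.
    proj = dot(r, w_src)
    ortho = unit([ri - proj * wi for ri, wi in zip(r, w_src)])
    sine = math.sqrt(1.0 - cosine * cosine)
    w_tgt = [cosine * a + sine * b for a, b in zip(w_src, ortho)]
    return w_src, w_tgt


def sample_domain(w, n, rng):
    """Draw n Gaussian points (an assumption) and label them by hyperplane w."""
    X = [[rng.gauss(0, 1) for _ in w] for _ in range(n)]
    y = [1 if dot(w, x) >= 0 else -1 for x in X]
    return X, y


def fe_augment(x, domain, shared_scale=1.0):
    """FE feature map: <shared copy, source-only copy, target-only copy>.

    shared_scale=10.0 mimics FE+ as described in the transcript.
    """
    zeros = [0.0] * len(x)
    shared = [shared_scale * v for v in x]
    if domain == "source":
        return shared + list(x) + zeros
    if domain == "target":
        return shared + zeros + list(x)
    raise ValueError("domain must be 'source' or 'target'")


# Illustrative settings only; the poster does not report these values.
rng = random.Random(0)
w_src, w_tgt = make_hyperplanes(dim=20, cosine=0.8, rng=rng)
Xs, ys = sample_domain(w_src, 200, rng)  # plentiful source data
Xt, yt = sample_domain(w_tgt, 20, rng)   # scarce target data

train_tgt = (Xt, yt)            # TGT: target labeled data only
train_st = (Xs + Xt, ys + yt)   # S+T: pool both domains, original features
train_fe = (
    [fe_augment(x, "source") for x in Xs] + [fe_augment(x, "target") for x in Xt],
    ys + yt,
)
train_fe_plus = (
    [fe_augment(x, "source", 10.0) for x in Xs]
    + [fe_augment(x, "target", 10.0) for x in Xt],
    ys + yt,
)
```

Each of the four training sets can then be passed to any linear classifier. Sweeping `cosine` from 0 to 1 would reproduce the kind of similarity-controlled comparison the poster describes, though the exact protocol used there is not given in this transcript.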