Cross-lingual Dataless Classification for Many Languages

Yangqiu Song Lane Department of CSEE West Virginia University
Presentation transcript:

Cross-lingual Dataless Classification for Many Languages
Yangqiu Song, Shyam Upadhyay, Haoruo Peng, and Dan Roth
Much of the work was done at UIUC.

Document Topical Classification
Example document: "On Feb. 8, Dong Nguyen announced that he would be removing his hit game Flappy Bird from both the iOS and Android app stores, saying that the success of the game is something he never wanted. Some fans of the game took it personally, replying that they would either kill Nguyen or kill themselves if he followed through with his decision."
Pick a label: Class1 or Class2? Mobile Game or Sports?
Document classification is a classic problem in text mining and NLP applications. The task is to predict the label of a document. Traditional classification treats the labels as numbers or IDs when training a classifier. But if you, as a human, are asked to choose between Class1 and Class2 for a document like this one, can you tell which label is right? No. Given the label names, however, you can make the decision without any problem. So the labels carry a lot of information about the meaning of each category. If we know the label names, we can even classify the document without any labeled data. Yet even when label names are available, traditional classification models still treat the labels as numbers or IDs and ignore this information.
Labels carry a lot of information, but traditional approaches do not use it: models are trained with "numbers or IDs" as labels.

Cross-lingual Document Categorization
How can we map a document in language L to an English ontology of semantic categories, without training on task-specific labeled data? Potentially, we are given only a single document (not a coherent collection).
Example categories: Economy - Taxation; Sports - Basketball.

Categorization without Labeled Data [AAAI'08, AAAI'14, NAACL'15]
This is not an unsupervised learning scenario. Unsupervised learning assumes a coherent collection of data points, and that similar labels are assigned to similar data points. It cannot work on a single document.
Given: a single document (or a collection of documents), and a taxonomy of categories into which we want to classify the documents.
Dataless procedure: let $f(l_i)$ be the semantic representation of label $l_i$, and let $f(d)$ be the semantic representation of a document $d$. Select the most appropriate category: $l^* = \arg\min_{l_i} \| f(l_i) - f(d) \|$.
Bootstrap: label the most confident documents, and use them to train a model.
Key questions: How do we generate a good semantic representation? How do we do it in many languages, with minimal resources?
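The selection rule on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sparse vectors, their weights, and the function names `l2_distance` and `dataless_classify` are hypothetical and not taken from the talk, and the norm is assumed to be Euclidean.

```python
import math

def l2_distance(u, v):
    """Euclidean distance between two sparse vectors stored as dicts."""
    keys = set(u) | set(v)
    return math.sqrt(sum((u.get(k, 0.0) - v.get(k, 0.0)) ** 2 for k in keys))

def dataless_classify(doc_vec, label_vecs):
    """Return the label l_i minimizing ||f(l_i) - f(d)||."""
    return min(label_vecs, key=lambda label: l2_distance(label_vecs[label], doc_vec))

# Toy concept-space representations (hypothetical weights, illustration only).
label_vecs = {
    "Mobile Game": {"Video game": 0.8, "Smartphone": 0.6},
    "Sports": {"Basketball": 0.7, "Olympic Games": 0.7},
}
doc_vec = {"Video game": 0.6, "Smartphone": 0.5, "App store": 0.6}

predicted = dataless_classify(doc_vec, label_vecs)
```

No labeled training documents appear anywhere in the sketch; only the label names and the document are mapped into the shared space.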

General Framework
[Figure: label names (for example, "Mobile Game or Sports?") and documents are mapped to the same space using (cross-lingual) world knowledge; document and label similarities are computed; labels are chosen.]
M. Chang, L. Ratinov, D. Roth, V. Srikumar: Importance of Semantic Representation: Dataless Classification. AAAI'08.
Y. Song, D. Roth: On Dataless Hierarchical Text Classification. AAAI'14.
Y. Song, D. Roth: Unsupervised Sparse Vector Densification for Short Text Similarity. HLT-NAACL'15.
Y. Song, S. Upadhyay, H. Peng, D. Roth: Cross-lingual Dataless Classification for Many Languages. IJCAI'16.

Text Representation
No task-specific supervision. Wikipedia is there.
[Dense] Distributed representations (embeddings): new, powerful implementations of good old ideas. Learn a representation for a word as a function of the words in its context.
Brown clusters: an HMM-based approach that has found many applications in other NLP tasks.
[Sparse] Explicit Semantic Analysis (ESA): a Wikipedia-driven approach, best for topical classification. Represent a word as a (weighted) list of all Wikipedia titles it occurs in (Gabrilovich & Markovitch 2009).
Cross-lingual ESA: exploits the shared semantic space between two languages.
The ideal representation is task specific. These ideas can also be shown in the context of more involved tasks such as event and relation extraction.
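To make the ESA idea concrete, here is a minimal sketch that represents words and short texts as weighted lists of article titles. It assumes a three-article toy collection in place of Wikipedia, whitespace tokenization, and plain TF-IDF weighting; the helper names `build_esa_index` and `esa_vector` are made up for this example and do not come from any ESA implementation.

```python
import math
from collections import Counter, defaultdict

def build_esa_index(articles):
    """Map each word to {title: tf-idf weight} over a toy article collection."""
    tokenized = {title: text.lower().split() for title, text in articles.items()}
    doc_freq = Counter(w for words in tokenized.values() for w in set(words))
    n_docs = len(articles)
    index = defaultdict(dict)
    for title, words in tokenized.items():
        for word, tf in Counter(words).items():
            idf = math.log(n_docs / doc_freq[word])
            if idf > 0:  # words that occur in every article carry no signal
                index[word][title] = tf * idf
    return index

def esa_vector(text, index):
    """Represent a text as the sum of the title vectors of its words."""
    vec = defaultdict(float)
    for word in text.lower().split():
        for title, weight in index.get(word, {}).items():
            vec[title] += weight
    return dict(vec)

# Toy stand-in for Wikipedia (hypothetical article texts).
articles = {
    "Basketball": "basketball is a team sport played with a ball",
    "Mobile game": "a mobile game is a video game played on a phone",
    "Taxation": "a tax is a charge imposed by a government",
}
index = build_esa_index(articles)
vec = esa_vector("phone game", index)
```

The resulting vectors are sparse: a text only receives weight on titles whose articles share words with it, which is why the slide labels ESA as a sparse representation.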

Language Links

Wikipedia Pages Across Languages
Topical classification of documents in language L relies on the availability of an L-Wikipedia and the existence of a title space mapping between L and English.
292 languages have a Wikipedia. We filter the title space to include only long, well-linked pages that are linked to the English Wikipedia, yielding 179 languages.

                                        English     Deutsch     Spanish     Hindi
# Wikipedia pages                       >15.8M      3,653,951   3,165,178   179,131
# Pruned pages (Len>=100, link>=5)      3,090,649   1,482,675   914,927     33,298
# Pages linked to English Wikipedia     -           459,421     342,285     16,463
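The pruning and title-mapping step described on this slide might look roughly like the following. The `Page` record, its field names, and the sample pages are hypothetical; the thresholds come from the slide (Len>=100, link>=5), but the unit of "Len" is not stated there, so the sketch treats it as an opaque length count.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Page:
    title: str
    length: int                   # "Len" on the slide; unit is an assumption
    n_links: int                  # number of links, "link" on the slide
    english_title: Optional[str]  # language link to English Wikipedia, if any

def prune(pages, min_len=100, min_links=5):
    """Keep only long, well-linked pages."""
    return [p for p in pages if p.length >= min_len and p.n_links >= min_links]

def title_mapping(pages):
    """Build the L-to-English title space from pages that have a language link."""
    return {p.title: p.english_title for p in pages if p.english_title is not None}

# Hypothetical sample pages for some language L.
pages = [
    Page("L:Sport", 850, 40, "Sport"),
    Page("L:Stub", 20, 1, "Stub"),            # too short, too few links
    Page("L:LocalTopic", 300, 12, None),      # no English counterpart
    Page("L:Tax", 120, 5, "Tax"),
]
shared = title_mapping(prune(pages))
```

The size of `shared` corresponds to the last row of the table: the number of pruned pages in language L that are linked to the English Wikipedia.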

Percentage of Language Links to English
We can collect 179 Wikipedias from the Wikipedia dump. The languages above the line may be good candidates for us.

Cross-lingual Semantic Similarity: For One Document
Cross-lingual Explicit Semantic Analysis (CLESA):
Build an inverted index of the English Wikipedia and the L-Wikipedia.
Search using the English label and the L-language document as queries.
Compute similarity based on the intersected Wikipedia titles.
[Figure: a Hindi document is searched against the Hindi Wikipedia and an English label ("Sports") against the English Wikipedia; the retrieved Wikipedia articles are compared by cosine similarity, $\cos[e(l_i), c(h(d))]$.]
To be more formal: a five-step algorithm.
Martin Potthast, Benno Stein, and Maik Anderka. A Wikipedia-based multilingual retrieval model. In ECIR, pages 522–530, 2008.
Philipp Sorg and Philipp Cimiano. Exploiting Wikipedia for cross-lingual and multilingual information retrieval. Data and Knowledge Engineering, 74:26–45, 2012.
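The similarity computation over intersected titles can be sketched as follows. This is a hedged illustration, not the authors' code: it assumes the two ESA vectors have already been retrieved, that `links` is a title mapping from language L to English, and that unlinked titles are simply dropped. All titles, weights, and function names are hypothetical placeholders.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def map_to_english(vec_l, links):
    """Project a language-L title vector onto English titles via language links."""
    mapped = {}
    for title, weight in vec_l.items():
        english = links.get(title)
        if english is not None:  # titles without a language link are dropped
            mapped[english] = mapped.get(english, 0.0) + weight
    return mapped

def clesa_similarity(label_vec_en, doc_vec_l, links):
    """Cosine between the English label vector and the mapped document vector."""
    return cosine(label_vec_en, map_to_english(doc_vec_l, links))

# Hypothetical placeholder titles; "hi:" marks a Hindi Wikipedia title.
links = {"hi:Khel": "Sport", "hi:Cricket": "Cricket", "hi:Kar": "Tax"}
doc_vec_hi = {"hi:Khel": 0.9, "hi:Cricket": 0.4, "hi:Unlinked": 0.7}
labels_en = {
    "Sports": {"Sport": 1.0, "Cricket": 0.5},
    "Economy": {"Tax": 1.0},
}
scores = {name: clesa_similarity(vec, doc_vec_hi, links) for name, vec in labels_en.items()}
best = max(scores, key=scores.get)
```

Because only linked titles survive the mapping, the quality of the comparison depends on how large the shared title space is for language L.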

Bootstrapping with Unlabeled Data
Apply world knowledge of the label meaning.
Initialize with N documents for each label, chosen by pure similarity-based classification.
Train a classifier to label N more documents.
Continue labeling more data until no unlabeled document remains.
This acts as domain adaptation.
Example labels: Mobile games, Sports.
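The bootstrapping loop can be sketched as below. The slide does not say which classifier is trained, so this example swaps in a nearest-centroid model as a stand-in; that choice, the toy vectors, and the names `bootstrap` and `centroid` are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def cosine(u, v):
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Average of sparse vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for key, weight in vec.items():
            total[key] += weight
    return {key: weight / len(vectors) for key, weight in total.items()}

def bootstrap(doc_vecs, label_vecs, n_per_round=1):
    """Label all documents, starting from pure label-name similarity."""
    prototypes = dict(label_vecs)  # round 0: the label representations themselves
    assigned = {}
    unlabeled = set(doc_vecs)
    while unlabeled:
        # Best label and confidence for every unlabeled document.
        candidates = defaultdict(list)
        for doc_id in unlabeled:
            scores = {lab: cosine(doc_vecs[doc_id], proto) for lab, proto in prototypes.items()}
            best = max(scores, key=scores.get)
            candidates[best].append((scores[best], doc_id))
        # Accept the n most confident documents for each label.
        for label, scored in candidates.items():
            for _, doc_id in sorted(scored, reverse=True)[:n_per_round]:
                assigned[doc_id] = label
                unlabeled.discard(doc_id)
        # "Retrain": rebuild each prototype from its label vector and accepted docs.
        for label in prototypes:
            members = [doc_vecs[d] for d, lab in assigned.items() if lab == label]
            prototypes[label] = centroid([label_vecs[label]] + members)
    return assigned

label_vecs = {"Mobile Game": {"game": 1.0, "phone": 1.0}, "Sports": {"team": 1.0, "ball": 1.0}}
doc_vecs = {
    "d1": {"game": 1.0, "phone": 1.0},
    "d2": {"game": 1.0, "app": 1.0},
    "d3": {"app": 1.0, "store": 1.0, "game": 0.2},
    "d4": {"team": 1.0, "ball": 1.0},
}
result = bootstrap(doc_vecs, label_vecs)
```

In the toy data, `d3` barely overlaps with the label name, but once `d2` has been accepted the "Mobile Game" prototype also carries weight on "app", which is the domain adaptation effect the slide mentions.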

Experiments
Two existing multilingual text categorization collections:
TED: 13 languages; 15 labels; 1200 docs/label.
RCV2: 13 languages; 4 labels (top level); 10^2 to 10^4 docs/label.

Results: better than supervised learning with 100 docs/label; not as good as with 500 docs/label.
[Figure: results on the TED data and the RCV data. Note on the figure labels: change "dataless" to "Cross-lingual ESA".]
Karl Moritz Hermann and Phil Blunsom. Multilingual models for compositional distributed semantics. In ACL, pages 58–68, 2014.

Experiments (continued)
To evaluate the coverage, we generate a new data set from 20 newsgroups:
Take 100 documents that the English dataless classifier labels correctly.
Translate them into 88 languages using Google Translate.
This gives us a multilingual collection for which we know the labels.
Evaluate the cross-lingual dataless classifier on it.

88 Languages: Single Document Classification
[Figure: accuracy versus size of the shared English-Language L title space; Hausa and Hindi are marked, along with dataless classification for English.]

Conclusions
Document categorization without labeled data: semantic representation plays the key role.
Cross-lingual dataless classification: applied to many languages; cross-lingual ESA outperformed standard word embeddings; comparable to supervised learning with 100-200 labeled documents per label.
Future work: low-resource languages with a small presence on Wikipedia.
Thank You!

Translate 100 documents of 20-newsgroups back to English; English ESA-based dataless classification.
E. Gabrilovich and S. Markovitch. Wikipedia-based Semantic Interpretation for Natural Language Processing. J. of Art. Intell. Res. (JAIR). 2009.
M. Chang, L. Ratinov, D. Roth, V. Srikumar: Importance of Semantic Representation: Dataless Classification. AAAI'08.
Y. Song, D. Roth: On Dataless Hierarchical Text Classification. AAAI'14.

Cross-lingual Classification for 88 Languages
Single document classification without bootstrapping.
Hindi: 0.33/0.82
Hausa: 0.12/0.08