Special Topics in Text Mining

Slides:



Advertisements
Similar presentations
Pseudo-Relevance Feedback For Multimedia Retrieval By Rong Yan, Alexander G. and Rong Jin Mwangi S. Kariuki
Advertisements

Mustafa Cayci INFS 795 An Evaluation on Feature Selection for Text Clustering.
Chapter 5: Introduction to Information Retrieval
Data Mining Methodology 1. Why have a Methodology  Don’t want to learn things that aren’t true May not represent any underlying reality ○ Spurious correlation.
Semantic Access to Data from the Web Raquel Trillo *, Laura Po +, Sergio Ilarri *, Sonia Bergamaschi + and E. Mena * 1st International Workshop on Interoperability.
Automatic Text Processing: Cross-Lingual Text Categorization Automatic Text Processing: Cross-Lingual Text Categorization Dipartimento di Ingegneria dell’Informazione.
S ENTIMENTAL A NALYSIS O F B LOGS B Y C OMBINING L EXICAL K NOWLEDGE W ITH T EXT C LASSIFICATION. 1 By Prem Melville, Wojciech Gryc, Richard D. Lawrence.
Query Operations: Automatic Local Analysis. Introduction Difficulty of formulating user queries –Insufficient knowledge of the collection –Insufficient.
T.Sharon - A.Frank 1 Internet Resources Discovery (IRD) IR Queries.
Ch 4: Information Retrieval and Text Mining
1 Noun Homograph Disambiguation Using Local Context in Large Text Corpora Marti A. Hearst Presented by: Heng Ji Mar. 29, 2004.
Designing clustering methods for ontology building: The Mo’K workbench Authors: Gilles Bisson, Claire Nédellec and Dolores Cañamero Presenter: Ovidiu Fortu.
Presented by Zeehasham Rasheed
Statistical Natural Language Processing. What is NLP?  Natural Language Processing (NLP), or Computational Linguistics, is concerned with theoretical.
Large-Scale Cost-sensitive Online Social Network Profile Linkage.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Challenges in Information Retrieval and Language Modeling Michael Shepherd Dalhousie University Halifax, NS Canada.
Evaluating the Contribution of EuroWordNet and Word Sense Disambiguation to Cross-Language Information Retrieval Paul Clough 1 and Mark Stevenson 2 Department.
MediaEval Workshop 2011 Pisa, Italy 1-2 September 2011.
COMP423: Intelligent Agent Text Representation. Menu – Bag of words – Phrase – Semantics – Bag of concepts – Semantic distance between two words.
Processing of large document collections Part 3 (Evaluation of text classifiers, applications of text categorization) Helena Ahonen-Myka Spring 2005.
Processing of large document collections Part 2 (Text categorization) Helena Ahonen-Myka Spring 2006.
1 Wikification CSE 6339 (Section 002) Abhijit Tendulkar.
Name : Emad Zargoun Id number : EASTERN MEDITERRANEAN UNIVERSITY DEPARTMENT OF Computing and technology “ITEC547- text mining“ Prof.Dr. Nazife Dimiriler.
Text Classification, Active/Interactive learning.
Special Topics in Text Mining Manuel Montes y Gómez University of Alabama at Birmingham, Spring 2011.
A Simple Unsupervised Query Categorizer for Web Search Engines Prashant Ullegaddi and Vasudeva Varma Search and Information Extraction Lab Language Technologies.
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
1 Statistical NLP: Lecture 9 Word Sense Disambiguation.
Features and Algorithms Paper by: XIAOGUANG QI and BRIAN D. DAVISON Presentation by: Jason Bender.
An Effective Word Sense Disambiguation Model Using Automatic Sense Tagging Based on Dictionary Information Yong-Gu Lee
Terminology and documentation*  Object of the study of terminology:  analysis and description of the units representing specialized knowledge in specialized.
Word Translation Disambiguation Using Bilingial Bootsrapping Paper written by Hang Li and Cong Li, Microsoft Research Asia Presented by Sarah Hunter.
Intelligent Database Systems Lab N.Y.U.S.T. I. M. Externally growing self-organizing maps and its application to database visualization and exploration.
Harvesting Social Knowledge from Folksonomies Harris Wu, Mohammad Zubair, Kurt Maly, Harvesting social knowledge from folksonomies, Proceedings of the.
Number Sense Disambiguation Stuart Moore Supervised by: Anna Korhonen (Computer Lab)‏ Sabine Buchholz (Toshiba CRL)‏
Information Retrieval using Word Senses: Root Sense Tagging Approach Sang-Bum Kim, Hee-Cheol Seo and Hae-Chang Rim Natural Language Processing Lab., Department.
Iterative similarity based adaptation technique for Cross Domain text classification Under: Prof. Amitabha Mukherjee By: Narendra Roy Roll no: Group:
Special Topics in Text Mining Manuel Montes y Gómez University of Alabama at Birmingham, Spring 2011.
From Words to Senses: A Case Study of Subjectivity Recognition Author: Fangzhong Su & Katja Markert (University of Leeds, UK) Source: COLING 2008 Reporter:
A Supervised Machine Learning Algorithm for Research Articles Leonidas Akritidis, Panayiotis Bozanis Dept. of Computer & Communication Engineering, University.
Pattern Recognition. What is Pattern Recognition? Pattern recognition is a sub-topic of machine learning. PR is the science that concerns the description.
Sentiment Analysis Using Common- Sense and Context Information Basant Agarwal 1,2, Namita Mittal 2, Pooja Bansal 2, and Sonal Garg 2 1 Department of Computer.
1 Text Categorization  Assigning documents to a fixed set of categories  Applications:  Web pages  Recommending pages  Yahoo-like classification hierarchies.
Multi-Class Sentiment Analysis with Clustering and Score Representation Yan Zhu.
COMP423: Intelligent Agent Text Representation. Menu – Bag of words – Phrase – Semantics Semantic distance between two words.
2016/9/301 Exploiting Wikipedia as External Knowledge for Document Clustering Xiaohua Hu, Xiaodan Zhang, Caimei Lu, E. K. Park, and Xiaohua Zhou Proceeding.
Trends in NL Analysis Jim Critz University of New York in Prague EurOpen.CZ 12 December 2008.
Jonatas Wehrmann, Willian Becker, Henry E. L. Cagnini, and Rodrigo C
Queensland University of Technology
Measuring Monolinguality
Designing Cross-Language Information Retrieval System using various Techniques of Query Expansion and Indexing for Improved Performance  Hello everyone,
Sentiment analysis algorithms and applications: A survey
CHP - 9 File Structures.
Text Based Information Retrieval
MID-SEM REVIEW.
Multimedia Information Retrieval
Special Topics on Information Retrieval
European Network of e-Lexicography
Statistical NLP: Lecture 9
Presented by: Prof. Ali Jaoua
Text Categorization Assigning documents to a fixed set of categories
iSRD Spam Review Detection with Imbalanced Data Distributions
Special Topics in Text Mining
Inf 722 Information Organisation
Authors: Wai Lam and Kon Fan Low Announcer: Kyu-Baek Hwang
Information Retrieval
Text Mining Application Programming Chapter 9 Text Categorization
Extracting Why Text Segment from Web Based on Grammar-gram
Statistical NLP : Lecture 9 Word Sense Disambiguation
Presentation transcript:

Special Topics in Text Mining Manuel Montes y Gómez http://ccc.inaoep.mx/~mmontesg/ mmontesg@inaoep.mx

Multilingual text classification

Special Topics on Information Retrieval Agenda Multilinguism data/problem Poly-lingual text classification Language identification Cross-lingual text classification Using machine translation Employing multilingual dictionaries or ontologies Re-categorization methods Special Topics on Information Retrieval

Special Topics on Information Retrieval Initial questions What is multilingual text classification? Is it concerns a practical problem? How to build a multilingual text classification system? Which multilingual resources are necessary? Equally difficult for all language combinations? Special Topics on Information Retrieval

Special Topics on Information Retrieval Languages in the world It is difficult to give an exact figure of the number of languages that exist in the world Not always easy to differentiate between language and dialect. It is usually estimated that the number of languages in the world varies between 3,000 and 8,000. Special Topics on Information Retrieval

Languages in the Web (users) Special Topics on Information Retrieval

Importance of handling multilingual data Existence of a multilingual worldwide network Representation of English is now less than 40% The time of globalization is coming; many countries have been unified. Example: European Union In addition, many countries adopt multiple languages as their official languages Example: Moroco New technologies in network infrastructure and Internet set the platform of the cooperation and globalization. Special Topics on Information Retrieval

Multilingual text classification Poly-lingual classification The system is trained using labeled documents from all the different languages, and allows to classify documents from any of these languages. Cross-lingual classification The system use labeled training data for only one language to classify documents in other languages. Ideas for achieving these two approaches? Possible applications? Complicated or challenging situations? Special Topics on Information Retrieval

Poly-lingual classification Two main steps: Learning of categorization model(s) from a set of pre classified training documents written in different languages Assignment of unclassified poly-lingual documents to predefined categories on the basis of the induced text categorization model The naïve approach considers the problem as multiple independent monolingual text categorization problems. Architecture is a combination of several monolingual classifiers Special Topics on Information Retrieval

Special Topics on Information Retrieval General architecture Classifiers Training sets Unlabeled Document Language 1 Classifier 1 Language Identification Language 1 Classifier 2 Classifier Construction Assigned Category Language N Classifier N How to determine the language? Problems with this architecture? How to take advantage of resources from other languages? Special Topics on Information Retrieval

Written language identification Determine the language of a document from a given set of possible languages A supervised task: we require example documents from all considered languages. Two main approaches: Based on character frequency and co-occurrence (using n-gram models) Based on the occurrence of some particular words (particularly, the stopwords) Special Topics on Information Retrieval

Character frequencies English Special Topics on Information Retrieval

Taking advantage of multilingual data Main Idea: take into account all training documents of all languages when constructing a monolingual classifier for a specific language. They proposed to reassess the weight of a feature in one language by considering the weight of its related features in another language. At the end they have also N different classifiers, but training is more accurate. Specially useful for small training sets or imbalanced multilingual sets Special Topics on Information Retrieval

Construction of classifier for one language Special Topics on Information Retrieval

How to assign new weights Initial weight depends on the discriminative power of the feature in target language There is a weight that depends on the discriminative power of related words in other languages The final weight is a combination of both weights. Problems? How to select the alpha value? Ideas? Special Topics on Information Retrieval

Cross-lingual text classification It consists of using a labeled dataset in one language (L1) to classify unlabelled data in other language (L2). A method that is able to effectively perform this task would reduce the costs of building multi-language classification systems, since the human effort would be reduced to provide a training set in just one language. How can we train a classifier of such characteristics? How similar must be both document sets? Special Topics on Information Retrieval

Using machine translation Main approach is to use translation to ensure that all documents are available in a single language Translation can be used in two different ways: Training-Set Translation: the labeled set is translated into the target language(s). Became a poly-lingual approach Test-Set Translation: This approach consists in translating the unlabelled documents into one language (L1). Which approach is better? Problems of translation? Special Topics on Information Retrieval

Problems caused by translations Certain drawbacks of the bag-of-words model become particularly severe in cross-lingual classification: Spanish ‘coche’ is generally mapped to ‘car ’, whereas French ‘voiture’ is translated to ‘automobile’. Spanish ‘Me duele la cabeza’ to ‘It hurts the head to me’, which does not contain the word ‘headache’. In Japanese and Chinese, there are separate words for older and younger sisters. How to tackle these problems? Special Topics on Information Retrieval

Special Topics on Information Retrieval keyword translation Most methods consider the translation of the whole documents. But our representation is based on a SET of words Order is not capture; moreover, no all words are included. Is it really important to have a GOOD translation? In order to reduce translation errors some methods only approach the translation of keywords. A variant is to translate the sentences containing the N more important keywords. The purpose is to give some context to the translation machine. How to select the keywords of a document? What are the main characteristics of a keyword? Special Topics on Information Retrieval

Special Topics on Information Retrieval Keyword extraction Keywords are the set of significant words in a document that give high-level description of its content. They give clue about the its main idea Two main ideas for keyword extraction: Frequent words are more important Very common words (in the collection) are not relevant to characterize the content of a given document. Frequency of word i in document k Size of the whole collection Number of documents having word i Special Topics on Information Retrieval

Keyword extraction by term distribution Keywords of a document appear here and there in the document Extract important terms in documents applying the TF-IDF criterion. Examine the distribution characteristics of those candidate keywords. Select as document keywords the terms with great frequency and wide distribution Special Topics on Information Retrieval

Supervised keyword extraction Consider the keyword extraction as a classification problem: the purpose is to determine whether a word belong to the class of keywords or ordinary words Assume that there is a training set that can be used to learn how to identify keywords and using the knowledge gained from the training set Some common used features are: Frequency of the word in the document, inverse document frequency, position of the word in the document, position of the word according to the paragraph, format of the word, POS tag. Special Topics on Information Retrieval

Other problems of CL text classification It is clear that, in spite of a perfect translation, there is also a cultural distance between both languages, which will inevitably affect the classification performance. As an example, consider the case of news about sports from France (in French) and from US (in English): The first will include more documents about soccer, rugby and cricket The later will mainly consider notes about baseball, basketball and American football. How to address this issue? Special Topics on Information Retrieval

An EM based algorithm for CLTC Uses two different sets of data: a set of manually labeled documents in language L1 a large amount of unlabeled documents in the target language L2. The main process: Translate training set to L2. Build a classifier using the labeled translated examples Use information in unlabeled examples from L2 to iteratively enrich the classifier The idea is that, even if the labels are not available, useful statistical properties can be extracted by looking at the distribution of terms in unlabeled texts. Rigutini L., Maggini M., and Liu B. An EM based training algorithm for Cross-Language Text Categorization. 2005 IEEE/WIC/ACM International Conference on Web Intelligence. Compiegne, France, Sept. 2005. Special Topics on Information Retrieval

Special Topics on Information Retrieval Scheme of the method When to stop? Another criterion? Which values for k1 and k2? Equal values? Special Topics on Information Retrieval

Special Topics on Information Retrieval Results of the method Monolingual results Training  English Test  Italian Cross language results Translating training to Italian Translation by Idiomax Results from their method K1 = 300, K2 =1000 Special Topics on Information Retrieval

Re-classification using neighbor´s information Post-processing method for CLTC Its purpose is to reduce the classification errors caused by the cultural distance between the two given languages It takes advantage from the synergy between similar documents from the target corpus in order to achieve their re-classification. It relies on the idea that similar documents from the target corpus are about the same topic, and, therefore, that they must belong to the same category. Special Topics on Information Retrieval

Special Topics on Information Retrieval Scheme of the method Iteratively, modify the current class of a document by considering information from their neighbors If all neighbors belong to the same class, assign that class to the document If neighbors do not belong to the same class, maintain current classification Iterate σ times, or repeat until no document changes their category. Special Topics on Information Retrieval

Special Topics on Information Retrieval Results Special Topics on Information Retrieval

Alternative: using a multilingual wordnet Instead of translating documents from one language to other, make them comparable by means of a multilingual wordnet. A wordnet is a large lexical database organized in terms of meanings. Synonym words are grouped into synset ({car, auto, automobile, machine, motorcar}) In a multilingual wordnet there are relations between related synsents It is possible to go from the words in one language to similar words in any other language. Special Topics on Information Retrieval

Special Topics on Information Retrieval Wordnet example Special Topics on Information Retrieval

Using multilingual wordnets (2) Idea is representing documents by a common (monolingual) set of concepts, and not by a common set of words. Advantages: Synonym is captured (car and auto represented by the same instance) Generalization is possible (if one document talk about lions, it somehow talk about felines) Disadvantages: More difficult to have a multilingual wordnet than a translation system. A BIG problem: word sense disambiguation Special Topics on Information Retrieval

Word sense disambiguation The task of selecting a sense for a word from a set of predefined possibilities The bank close at 8pm ? Special Topics on Information Retrieval

Alternative 2: Hybrid approach Translate all documents to English Training and test sets Because English has the largest wordnet Represent documents by a bag-of-synsets Applied any supervised learning approach to learn from this representation. Advantages: Not necessary to have/construct a wordnet for each language WSD in only one single language Special Topics on Information Retrieval

Next section: Non-topical classification Authorship attribution Sentiment classification Genre classification (related) Plagiarism detection What are these tasks about? In what way is it different from thematic classification? Special Topics on Information Retrieval