Evgeniy Gabrilovich and Shaul Markovitch, American Association for Artificial Intelligence (AAAI) 2006. Prepared by Qi Li.

• Introduction
• Feature Generation with Wikipedia
  ◦ Wikipedia as a knowledge repository
  ◦ Feature construction
  ◦ Feature generator design
  ◦ Using the link structure
• Empirical Evaluation
  ◦ Implementation details
  ◦ Experimental methodology
  ◦ The effect of feature generation
  ◦ Classifying short documents
• Conclusions and Future Work

• Text categorization
  ◦ Deals with the automatic assignment of category labels to natural language documents
  ◦ Documents are represented as bags of words (BOW)
  ◦ Features are derived from words
  ◦ Categorization is based on these features
  ◦ Limitation of BOW:
    ▪ Limited to individual word occurrences in the training set
    ▪ "Wal-Mart supply chain goes real time"
    ▪ "Wal-Mart manages its stock with RFID technology"
    ▪ Effective for medium-difficulty categorization, but weak on small categories or short documents
• Use an encyclopedia to endow the machine with the breadth of knowledge available to humans
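The BOW limitation above can be illustrated with a minimal sketch. It assumes a plain whitespace tokenizer and a tiny hand-picked stop-word list; both are illustrative choices, not the preprocessing used in the presented work. Under these assumptions the two Wal-Mart headlines share only the company name, so nothing links "supply chain" to "RFID".

```python
from collections import Counter

# Illustrative stop-word list (an assumption); a real system would use a fuller one.
STOP_WORDS = {"its", "with", "the", "a", "of"}

def bag_of_words(text):
    """Lowercase, split on whitespace, and count the non-stop-word tokens."""
    tokens = [t.lower() for t in text.split()]
    return Counter(t for t in tokens if t not in STOP_WORDS)

doc_a = bag_of_words("Wal-Mart supply chain goes real time")
doc_b = bag_of_words("Wal-Mart manages its stock with RFID technology")

# Words the two headlines have in common under a plain BOW representation.
shared = set(doc_a) & set(doc_b)
print(shared)
```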

• Auxiliary text classifier:
  ◦ Matches documents with the most relevant Wikipedia articles
  ◦ Conventional bag of words + new features
• Example of the auxiliary text classifier idea:
  ◦ Input: "Bernanke takes charge"
  ◦ Generated features: BEN BERNANKE, FEDERAL RESERVE, CHAIRMAN OF THE FEDERAL RESERVE, ALAN GREENSPAN, MONETARISM, …
• Using Wikipedia
  ◦ Use text similarity algorithms to automatically identify encyclopedia articles relevant to each document
  ◦ Leverage the knowledge gained from these articles

• Extend the representation of documents for text categorization with knowledge concepts relevant to the document text.
• Wikipedia
  ◦ Largest knowledge repository
  ◦ Large-scale hierarchies
  ◦ High-quality, standard written English
  ◦ …

• Receive a text fragment and map it to the most relevant Wikipedia articles
  ◦ E.g. input: "Overcoming the Brittleness Bottleneck using Wikipedia: Enhancing Text Categorization with Encyclopedic Knowledge"
  ◦ Output: ENCYCLOPEDIA, WIKIPEDIA, ENTERPRISE CONTENT MANAGEMENT, BOTTLENECK, PERFORMANCE PROBLEM, HERMENEUTICS
• Training documents -> features -> Wikipedia concepts -> augmented bag of words
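The mapping from a text fragment to concepts can be sketched as a small similarity search. This is only a toy under stated assumptions: the three-article `ARTICLES` dictionary, the smoothed IDF formula, and the names `build_index` and `generate_features` are all hypothetical, and cosine similarity over TF-IDF vectors stands in for whatever text similarity algorithm the real feature generator uses.

```python
import math
from collections import Counter

# Hypothetical miniature "encyclopedia": concept title -> article text.
ARTICLES = {
    "BEN BERNANKE": "bernanke economist chairman federal reserve",
    "FEDERAL RESERVE": "federal reserve central bank united states "
                       "chairman bernanke greenspan",
    "JAGUAR (CAR)": "jaguar british car manufacturer luxury models",
}

def tokenize(text):
    return text.lower().split()

def build_index(articles):
    """Return per-concept TF-IDF vectors and the IDF table."""
    n = len(articles)
    df = Counter()
    for text in articles.values():
        df.update(set(tokenize(text)))
    # Smoothed IDF (an illustrative choice) so shared terms keep some weight.
    idf = {term: math.log(n / count) + 1.0 for term, count in df.items()}
    vectors = {}
    for title, text in articles.items():
        tf = Counter(tokenize(text))
        vectors[title] = {t: c * idf[t] for t, c in tf.items()}
    return vectors, idf

def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def generate_features(text, vectors, idf, top_k=2):
    """Return the titles of the top_k concepts most similar to the text."""
    tf = Counter(t for t in tokenize(text) if t in idf)
    query = {t: c * idf[t] for t, c in tf.items()}
    scored = [(cosine(query, vec), title) for title, vec in vectors.items()]
    scored = [(s, title) for s, title in scored if s > 0]
    scored.sort(reverse=True)
    return [title for _, title in scored[:top_k]]

vectors, idf = build_index(ARTICLES)
fragment = "Bernanke takes charge"
concepts = generate_features(fragment, vectors, idf)
# Augmented representation: original words plus generated concept features.
augmented = tokenize(fragment) + concepts
print(augmented)
```

The generated concept titles are appended to the original tokens, so a downstream classifier sees both the words and the concepts.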

• Unit for feature generation?
  ◦ Word, sentence, paragraph, document?
• Multi-resolution approach
  ◦ Features are generated for
    ▪ Individual words
    ▪ Sentences
    ▪ Paragraphs
    ▪ Entire document
  ◦ Polysemous words are mapped to the concepts that correspond to the sense shared by the context words
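A minimal sketch of the multi-resolution idea: split a document into contexts at each of the four levels, so that a feature generator can be run on every context separately. The splitting rules here (blank lines for paragraphs, end punctuation for sentences) and the function name `multi_resolution_contexts` are assumptions made for illustration.

```python
import re

def multi_resolution_contexts(document):
    """Return (level, text) pairs for words, sentences, paragraphs, and the document."""
    contexts = []
    # Assumption: paragraphs are separated by blank lines.
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    for paragraph in paragraphs:
        # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
        sentences = [s.strip()
                     for s in re.split(r"(?<=[.!?])\s+", paragraph)
                     if s.strip()]
        for sentence in sentences:
            for word in re.findall(r"[A-Za-z0-9'-]+", sentence):
                contexts.append(("word", word))
            contexts.append(("sentence", sentence))
        contexts.append(("paragraph", paragraph))
    contexts.append(("document", document.strip()))
    return contexts

doc = "Jaguar unveiled new car models. Sales rose.\n\nThe jaguar is a big cat."
contexts = multi_resolution_contexts(doc)
levels = [level for level, _ in contexts]
print(levels.count("word"), levels.count("sentence"), levels.count("paragraph"))
```

Feeding the sentence "Jaguar unveiled new car models." rather than the lone word "Jaguar" gives the generator the context words it needs to pick the car sense over the animal sense.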

• For "jaguar car models", the Wikipedia-based feature generator returns:
  ◦ JAGUAR (CAR)
  ◦ DAIMLER and BRITISH LEYLAND MOTOR CORPORATION (companies merged with Jaguar)
  ◦ V12 (Jaguar's engine)
  ◦ JAGUAR E-TYPE
  ◦ JAGUAR XJ
• For "jaguar Panthera onca", it returns:
  ◦ JAGUAR
  ◦ FELIDAE (feline species family)
  ◦ Related felines such as LEOPARD, PUMA and BLACK PANTHER
  ◦ KINKAJOU

• A set of simple heuristics for pruning the set of Wikipedia concepts:
  ◦ Discard:
    ▪ Articles with fewer than 100 non-stop words
    ▪ Articles with fewer than 5 incoming and outgoing links (too short)
    ▪ Disambiguation pages
  ◦ Each concept is represented as an attribute vector whose entries are weighted using TF.IDF
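The pruning heuristics can be sketched as a simple filter. The `Article` record and its field names are hypothetical, and the example counts are made up. The slide does not say whether the link threshold applies to each direction or to the total; this sketch assumes each direction separately.

```python
from dataclasses import dataclass

@dataclass
class Article:
    """Hypothetical record for one Wikipedia article (field names are illustrative)."""
    title: str
    non_stop_words: int
    incoming_links: int
    outgoing_links: int
    is_disambiguation: bool = False

def keep_article(article, min_words=100, min_links=5):
    """Apply the three pruning heuristics; True means the concept is kept."""
    if article.is_disambiguation:
        return False
    if article.non_stop_words < min_words:
        return False
    # Assumption: the link threshold is applied to each direction separately.
    if article.incoming_links < min_links or article.outgoing_links < min_links:
        return False
    return True

# Made-up example records, for illustration only.
articles = [
    Article("JAGUAR (CAR)", 850, 120, 60),
    Article("JAGUAR (DISAMBIGUATION)", 150, 30, 25, is_disambiguation=True),
    Article("OBSCURE STUB", 40, 2, 1),
    Article("RARELY LINKED", 300, 3, 10),
]
kept = [a.title for a in articles if keep_article(a)]
print(kept)
```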

• Link anchor text:
  ◦ Can be identical to the canonical name of the target article
  ◦ Different anchor texts can refer to the same article: alternative names, variant spellings, and related phrases
  ◦ Incoming links: indicate the significance of an article
  ◦ Problem: taking all articles pointed to from a concept is ill-advised, as it brings in a lot of weakly related material
  ◦ Pursue this direction in future work

• Wikipedia snapshot: November 5, 2005
• 1.8 GB of text in 910,989 articles
  ◦ Removing small and overly specific concepts leaves 171,332 articles
  ◦ Stop words and rare words removed
  ◦ Remaining words stemmed
  ◦ 296,157 distinct terms representing concepts

• Datasets:
  1. Reuters-21578
  2. Reuters Corpus Volume I (RCV1)
  3. OHSUMED
  4. 20 Newsgroups (20NG)
  5. Movie Reviews (Movies)
• Method: SVM with a linear kernel
• Metrics:
  ◦ Precision-recall break-even point (BEP)
  ◦ Reuters and OHSUMED: micro- and macro-averaged BEP
  ◦ 20NG and Movies: 4-fold cross-validation
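The BEP metric can be sketched as follows. One common way to compute it is to rank documents by classifier score and cut the ranking at the number of true positives, since precision and recall are equal at that cutoff. The function names, the toy scores, and the micro-averaging by pooling true positives at each category's own cutoff are all assumptions made for illustration; the evaluation in the presented work may differ in detail.

```python
def break_even_point(scores, labels):
    """Precision (= recall) when as many documents are predicted positive
    as there are true positives in the data."""
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = sum(label for _, label in ranked[:n_pos])
    return true_pos / n_pos

def macro_bep(per_category):
    """Average the per-category BEP values, weighting every category equally."""
    values = [break_even_point(s, y) for s, y in per_category]
    return sum(values) / len(values)

def micro_bep(per_category):
    """Pool true positives and positives over categories (an assumed simplification)."""
    total_tp = 0
    total_pos = 0
    for scores, labels in per_category:
        n_pos = sum(labels)
        ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
        total_tp += sum(label for _, label in ranked[:n_pos])
        total_pos += n_pos
    return total_tp / total_pos if total_pos else 0.0

# Toy classifier scores and 0/1 gold labels for two categories (made-up data).
category_a = ([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
category_b = ([0.7, 0.6, 0.2, 0.1], [1, 0, 0, 0])

print(macro_bep([category_a, category_b]), micro_bep([category_a, category_b]))
```

Macro-averaging gives small categories the same weight as large ones, which is why it is the more sensitive measure when small categories are of interest.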

• Improves more
• More effective in small categories

• Only the titles of the articles are used for classification

• Feature generator:
  ◦ Identifies the most relevant encyclopedia articles
  ◦ Creates new features
• Adds semantics to conventional BOW
  ◦ Latent semantic indexing (LSI)
  ◦ LSI + SVM: not good
  ◦ Wikipedia + SVM: improves
• Information retrieval