 Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University.


 Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University at Bloomington

Outline
- The big picture
- A specific problem: opinion detection

Intelligent information retrieval
- Characteristics
  - Not restricted to keyword matching and Boolean search
  - Deals with natural-language queries and advanced search criteria
  - Coarse-to-fine level of granularity
  - Automatically organizes, evaluates, and interprets the solution space
  - User-centered, e.g., adapts to the user's learning habits
  - Etc.

Intelligent information retrieval
- System preferences
  - Various sources of evidence
  - Natural language processing
  - Semantic web technologies
  - Automatic text classification
  - Etc.

Intelligent IR system diagram

 A Specific Question: Semi-Supervised Learning for Identifying Opinions in Web Content Dissertation work

Growing demand for online opinions
- Enormous body of user-generated content
- About anything, published anywhere and at any time
- Useful for literature review, decision making, market monitoring, etc.

Major approaches for opinion detection

- To acquire a broad and comprehensive collection of opinion-bearing features (e.g., bag-of-words, POS words, N-grams (n>1), linguistic collocations, stylistic features, contextual features)
- To generate complex patterns (e.g., “good amount”) that can approximate the context of words
- To generate and evaluate opinion detection systems
- To allow evaluation of opinion detection strategies with high confidence
What’s Essential? Labeled Data! And lots of them!!!

Challenges for opinion detection
- Shortage of opinion-labeled data: manual annotation is tedious, error-prone, and difficult to scale up
- Domain transfer: strategies designed for opinion detection in one data domain generally do not perform well in another domain

Motivations & research question
- Easy to collect unlabeled user-generated content that contains opinions
- Semi-Supervised Learning (SSL) requires only a limited amount of labeled data to automatically label unlabeled data; it has achieved promising results in NLP studies
Is SSL effective in opinion detection both in sparse data situations and for domain adaptation?

Datasets & data split

Data split
- SSL: Labeled (1-5%), Unlabeled (90%), Evaluation (5%)
- Baseline Supervised Learning (SL): Labeled (1-5%), Evaluation (5%)
- Full SL: Labeled (95%), Evaluation (5%)

Dataset (sentences)   Blog Posts   Movie Reviews   News Articles
Opinion               4,843        5,000           5,297
Non-opinion           4,843        5,000           5,174
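The split above can be sketched in a few lines of Python. This is only an illustration under assumptions: the slide does not say how sentences were sampled, so the sketch uses a simple seeded shuffle, and the function name `split_for_ssl`, its parameters, and the toy data are hypothetical.

```python
import random

def split_for_ssl(sentences, labeled_frac=0.05, seed=0):
    """Split (text, label) pairs into labeled, unlabeled, and evaluation sets.

    Assumed proportions, following the slide: 5% evaluation, 90% unlabeled,
    and 1-5% labeled (controlled by labeled_frac).
    """
    if not 0.0 < labeled_frac <= 0.05:
        raise ValueError("labeled_frac must be in (0, 0.05]")
    items = list(sentences)
    random.Random(seed).shuffle(items)  # seeded shuffle; sampling scheme is an assumption
    n = len(items)
    n_eval = round(n * 0.05)
    n_unlabeled = round(n * 0.90)
    n_labeled = round(n * labeled_frac)
    evaluation = items[:n_eval]
    # Labels are dropped for the unlabeled pool.
    unlabeled = [text for text, _ in items[n_eval:n_eval + n_unlabeled]]
    start = n_eval + n_unlabeled
    labeled = items[start:start + n_labeled]
    return labeled, unlabeled, evaluation

# Toy data standing in for one of the sentence-level datasets.
data = [(f"sentence {i}", "opinion" if i % 2 else "non-opinion") for i in range(1000)]
labeled, unlabeled, evaluation = split_for_ssl(data, labeled_frac=0.01)
print(len(labeled), len(unlabeled), len(evaluation))
```

The baseline SL setting would train on `labeled` alone, while the SSL setting would additionally draw on `unlabeled`; both would be scored on `evaluation`.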

Two major SSL methods: Self-training
- Assumption: Highly confident predictions made by an initial opinion classifier are reliable and can be added to the labeled set.
- Limitation: Auto-labeled data may be biased by the particular opinion classifier.
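A minimal sketch of the self-training loop described above, assuming a small Naive Bayes classifier over binary (set-valued) features with add-one smoothing. The helper names `train_nb`, `predict_nb`, and `self_train`, the confidence threshold, and the toy sentences are all illustrative assumptions, not the settings used in the study.

```python
import math
from collections import Counter

def train_nb(examples):
    """Fit a Naive Bayes model from (feature_set, label) pairs."""
    class_counts = Counter(label for _, label in examples)
    feat_counts = {label: Counter() for label in class_counts}
    vocab = set()
    for feats, label in examples:
        feat_counts[label].update(feats)
        vocab.update(feats)
    return class_counts, feat_counts, vocab

def predict_nb(model, feats):
    """Return (best_label, posterior probability of that label)."""
    class_counts, feat_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        denom = sum(feat_counts[label].values()) + len(vocab)  # add-one smoothing
        score = math.log(n / total)
        for f in feats:
            if f in vocab:  # features never seen in training are ignored
                score += math.log((feat_counts[label][f] + 1) / denom)
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    best = max(exp, key=exp.get)
    return best, exp[best] / sum(exp.values())

def self_train(labeled, unlabeled, threshold=0.8, max_rounds=10):
    """Repeatedly move confidently auto-labeled examples into the labeled set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model = train_nb(labeled)
        confident, rest = [], []
        for feats in pool:
            label, prob = predict_nb(model, feats)
            if prob >= threshold:
                confident.append((feats, label))
            else:
                rest.append(feats)
        if not confident:
            break  # nothing new to add, so stop early
        labeled.extend(confident)
        pool = rest
    return train_nb(labeled)

seed_set = [({"great", "film"}, "opinion"), ({"terrible", "plot"}, "opinion"),
            ({"released", "in", "1999"}, "fact"), ({"runs", "two", "hours"}, "fact")]
pool = [{"great", "plot"}, {"released", "two", "hours"}]
model = self_train(seed_set, pool)
print(predict_nb(model, {"great"}))
```

Because the same classifier both labels and learns from the pool, any early mistakes above the threshold are reinforced in later rounds, which is the limitation the slide points out.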

Two major SSL methods: Co-training
- Assumption: Two opinion classifiers with different strengths and weaknesses can benefit from each other.
- Limitation: It is not always easy to create two different classifiers.
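The co-training idea can be sketched as two classifiers trained on different views of the same sentences, each contributing confident labels to a shared labeled set. This is a simplified illustration, not the study's procedure: it assumes a unigram view and a bigram view, a small Naive Bayes classifier with add-one smoothing, and a rule that takes the label from whichever view is more confident. All names (`two_views`, `co_train`, and so on) and the toy sentences are hypothetical.

```python
import math
from collections import Counter

def train_nb(examples):
    """Fit a Naive Bayes model from (feature_set, label) pairs."""
    class_counts = Counter(label for _, label in examples)
    feat_counts = {label: Counter() for label in class_counts}
    vocab = set()
    for feats, label in examples:
        feat_counts[label].update(feats)
        vocab.update(feats)
    return class_counts, feat_counts, vocab

def predict_nb(model, feats):
    """Return (best_label, posterior probability of that label)."""
    class_counts, feat_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = {}
    for label, n in class_counts.items():
        denom = sum(feat_counts[label].values()) + len(vocab)  # add-one smoothing
        score = math.log(n / total)
        score += sum(math.log((feat_counts[label][f] + 1) / denom)
                     for f in feats if f in vocab)
        log_scores[label] = score
    top = max(log_scores.values())
    exp = {label: math.exp(s - top) for label, s in log_scores.items()}
    best = max(exp, key=exp.get)
    return best, exp[best] / sum(exp.values())

def two_views(sentence):
    """Return (unigram set, bigram set) using whitespace tokens."""
    tokens = sentence.lower().split()
    return set(tokens), {f"{a}_{b}" for a, b in zip(tokens, tokens[1:])}

def fit_views(labeled):
    """Train one classifier per view on the shared labeled set."""
    return (train_nb([(views[0], y) for views, y in labeled]),
            train_nb([(views[1], y) for views, y in labeled]))

def co_train(labeled, unlabeled, threshold=0.8, max_rounds=10):
    """labeled: [((view_a, view_b), label)]; unlabeled: [(view_a, view_b)]."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        model_a, model_b = fit_views(labeled)
        added, rest = [], []
        for views in pool:
            guess_a = predict_nb(model_a, views[0])
            guess_b = predict_nb(model_b, views[1])
            # Take the label from whichever view is more confident.
            label, prob = guess_a if guess_a[1] >= guess_b[1] else guess_b
            if prob >= threshold:
                added.append((views, label))
            else:
                rest.append(views)
        if not added:
            break
        labeled.extend(added)
        pool = rest
    return fit_views(labeled)

seed_set = [(two_views("i love this movie"), "opinion"),
            (two_views("i hate this plot"), "opinion"),
            (two_views("the film opens in june"), "fact"),
            (two_views("the film runs two hours"), "fact")]
pool = [two_views("i love this plot"), two_views("the film opens in two hours")]
model_a, model_b = co_train(seed_set, pool)
print(predict_nb(model_a, two_views("i love it")[0]))
```

The benefit depends on the two views making different kinds of errors; if both classifiers see nearly the same evidence, the loop degenerates into self-training.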

Experimental design
- General settings for SSL
  - Naïve Bayes classifier for self-training
  - Binary values for unigram and bigram features
- Co-training strategies:
  - Unigrams and bigrams (content vs. context)
  - Two randomly split feature/training sets
  - A character-based language model (CLM) and a bag-of-words model (BOW)
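As a rough illustration of two of the settings listed above, the sketch below builds binary unigram and bigram features and randomly partitions a feature set into two disjoint views. Whitespace tokenization, lowercasing, the underscore bigram format, and the function names are assumptions made for the example; the slides do not specify the actual preprocessing.

```python
import random

def binary_ngram_features(sentence):
    """Return (unigrams, bigrams) as sets; membership in a set means value 1."""
    tokens = sentence.lower().split()  # assumed tokenizer: lowercase + whitespace
    unigrams = set(tokens)
    bigrams = {f"{a}_{b}" for a, b in zip(tokens, tokens[1:])}
    return unigrams, bigrams

def random_feature_split(vocabulary, seed=0):
    """Randomly partition a feature vocabulary into two disjoint views."""
    vocab = sorted(vocabulary)  # sort first so the seeded shuffle is reproducible
    random.Random(seed).shuffle(vocab)
    half = len(vocab) // 2
    return set(vocab[:half]), set(vocab[half:])

unigrams, bigrams = binary_ngram_features("A good amount of action")
view_a, view_b = random_feature_split(unigrams | bigrams)
print(sorted(unigrams))
print(sorted(bigrams))
print(len(view_a), len(view_b))
```

The unigram and bigram sets correspond to the "content vs. context" strategy, while `random_feature_split` corresponds to the randomly split feature sets; each classifier in co-training would then see only the features in its own view.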

Results: Overall
- For movie reviews and news articles, co-training proved to be most robust
- For blog posts, SSL showed no benefits over SL due to the low initial accuracy

Results: Movie reviews
- Both self-training and co-training can improve opinion detection performance
- Co-training is more effective than self-training

Results: Movie reviews (cont.)  The more different the two classifiers, the better the performance

Results: Domain transfer (movie reviews->blog posts)  For a difficult domain (e.g., blog), simple self-training alone is promising for tackling the domain transfer problem.

Contributions
- Comprehensive research expands the spectrum of SSL application to opinion detection
- Investigation of SSL model that best fits the problem space extends understanding of opinion detection and provides a resource for knowledge-based representation
- Generation of guidelines and evaluation baselines advances later studies using SSL algorithms in opinion detection
- Research extensible to other data domains, non-English texts, and other text mining tasks

“All my opinions are posted on my online blog.”
“A grade of 85 or higher will get you favorable mention on my blog.”
“If you want a second opinion, I’ll ask my computer”
Thank you!