Text Representation & Text Classification for Intelligent Information Retrieval Ning Yu School of Library and Information Science Indiana University at Bloomington
Outline The big picture A specific problem – opinion detection
Intelligent information retrieval Characteristics Not restricted to keyword matching and Boolean search Deals with natural language queries and advanced search criteria Coarse-to-fine levels of granularity Automatically organizes/evaluates/interprets the solution space User-centered, e.g., adapts to the user’s learning habits Etc.
Intelligent information retrieval System Preferences Various sources of evidence Natural language processing Semantic web technologies Automatic text classification Etc.
Intelligent IR system diagram
A Specific Question: Semi-Supervised Learning for Identifying Opinions in Web Content Dissertation work
Growing demand for online opinions Enormous body of user-generated content About anything, published anywhere and at any time Useful for literature review, decision making, market monitoring, etc.
Major approaches for opinion detection
What’s Essential? To acquire a broad and comprehensive collection of opinion-bearing features (e.g., bag-of-words, POS words, N-grams (n>1), linguistic collocations, stylistic features, contextual features); to generate complex patterns (e.g., “good amount”) that can approximate the context of words; to generate and evaluate opinion detection systems; to allow evaluation of opinion detection strategies with high confidence. Labeled Data! And lots of it!!!
Challenges for opinion detection Shortage of opinion-labeled data: manual annotation is tedious, error-prone and difficult to scale up Domain transfer: strategies designed for opinion detection in one data domain generally do not perform well in another domain
Motivations & research question Easy to collect unlabeled user-generated content that contains opinions Semi-Supervised Learning (SSL) requires only a limited amount of labeled data to automatically label unlabeled data; has achieved promising results in NLP studies Is SSL effective in opinion detection both in sparse data situations and for domain adaptation?
Datasets & data split

Dataset (sentences)   Blog Posts   Movie Reviews   News Articles
Opinion               4,843        5,000           5,297
Non-opinion           4,843        5,000           5,174

SSL: Labeled (1-5%), Unlabeled (90%), Evaluation (5%)
Baseline Supervised Learning (SL): Labeled (1-5%), Evaluation (5%)
Full SL: Labeled (95%), Evaluation (5%)
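A minimal sketch of how such a three-way split could be produced, assuming a random shuffle with a fixed seed. The function name `split_dataset`, the seed, and the toy corpus are illustrative assumptions; the slides do not say how the actual split was drawn.

```python
import random

def split_dataset(items, labeled_frac=0.05, unlabeled_frac=0.90,
                  eval_frac=0.05, seed=0):
    """Shuffle items and cut them into (labeled, unlabeled, evaluation)."""
    pool = list(items)
    random.Random(seed).shuffle(pool)  # fixed seed: an assumption for repeatability
    n = len(pool)
    n_eval = round(n * eval_frac)
    n_unl = round(n * unlabeled_frac)
    n_lab = round(n * labeled_frac)
    evaluation = pool[:n_eval]
    unlabeled = pool[n_eval:n_eval + n_unl]
    labeled = pool[n_eval + n_unl:n_eval + n_unl + n_lab]
    return labeled, unlabeled, evaluation

# Toy stand-in for one corpus: 10,000 (sentence_id, label) pairs.
corpus = [(i, "opinion" if i % 2 == 0 else "non-opinion") for i in range(10000)]
labeled, unlabeled, evaluation = split_dataset(corpus)
print(len(labeled), len(unlabeled), len(evaluation))
```

Passing `labeled_frac=0.01` gives the sparsest labeled condition; in this sketch the items left over are simply unused.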
Two major SSL methods: Self-training Assumption: Highly confident predictions made by an initial opinion classifier are reliable and can be added to the labeled set. Limitation: Auto-labeled data may be biased by the particular opinion classifier.
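The self-training loop can be sketched as follows. This is an illustration under stated assumptions, not the dissertation's implementation: `TinyNB` is a hypothetical, deliberately small Naive Bayes over binary features, and the confidence threshold of 0.6 and the toy sentences are made up for the example.

```python
import math
from collections import Counter

class TinyNB:
    """Hypothetical minimal Naive Bayes over binary features (add-one smoothing)."""
    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = Counter(labels)
        self.counts = {c: Counter() for c in self.classes}
        for feats, y in zip(docs, labels):
            self.counts[y].update(set(feats))
        self.vocab = {f for c in self.classes for f in self.counts[c]}
        return self

    def predict(self, feats):
        """Return (best label, posterior probability of that label)."""
        scores = {}
        for c in self.classes:
            denom = sum(self.counts[c].values()) + len(self.vocab)
            s = math.log(self.prior[c] / sum(self.prior.values()))
            for f in set(feats) & self.vocab:  # unseen features are ignored
                s += math.log((self.counts[c][f] + 1) / denom)
            scores[c] = s
        top = max(scores.values())
        z = sum(math.exp(v - top) for v in scores.values())
        best = max(scores, key=scores.get)
        return best, math.exp(scores[best] - top) / z

def self_train(labeled, unlabeled, threshold=0.6, max_rounds=10):
    docs = [d for d, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    model = TinyNB().fit(docs, labels)
    for _ in range(max_rounds):
        confident, rest = [], []
        for d in pool:
            y, p = model.predict(d)
            (confident if p >= threshold else rest).append((d, y))
        if not confident:
            break  # nothing passes the threshold, so stop
        # Trust the confident predictions and add them to the labeled set.
        docs += [d for d, _ in confident]
        labels += [y for _, y in confident]
        pool = [d for d, _ in rest]
        model = TinyNB().fit(docs, labels)
    return model

labeled = [(["great", "movie"], "opinion"), (["awful", "movie"], "opinion"),
           (["released", "friday"], "non-opinion"),
           (["runs", "two", "hours"], "non-opinion")]
unlabeled = [["great", "acting"], ["released", "in", "june"], ["awful", "acting"]]

base = TinyNB().fit([d for d, _ in labeled], [y for _, y in labeled])
model = self_train(labeled, unlabeled, threshold=0.6)
print(base.predict(["superb", "acting"]))
print(model.predict(["superb", "acting"]))
```

The word "acting" never occurs in the labeled set, so the base classifier has no evidence for it; after self-training it has been picked up from auto-labeled sentences. The same mechanism is the source of the stated limitation: any mistake the initial classifier makes with high confidence is fed back as training data.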
Two major SSL methods: Co-training Assumption: Two opinion classifiers with different strengths and weaknesses can benefit from each other. Limitation: It is not always easy to create two different classifiers.
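A rough sketch of co-training with two feature views, assuming a shared labeled set that both classifiers retrain on each round. The helper names (`train_nb`, `predict_nb`, `fit_views`, `co_train`), the 0.75 threshold, the rule that the first confident view wins a disagreement, and the toy sentences are all assumptions made for illustration.

```python
import math
from collections import Counter

def unigrams(tokens):
    return set(tokens)

def bigrams(tokens):
    return {a + "_" + b for a, b in zip(tokens, tokens[1:])}

def train_nb(feature_sets, labels):
    """Hypothetical minimal Naive Bayes: returns (prior, counts, vocab)."""
    prior = Counter(labels)
    counts = {c: Counter() for c in prior}
    for feats, y in zip(feature_sets, labels):
        counts[y].update(feats)
    vocab = set().union(*feature_sets)
    return prior, counts, vocab

def predict_nb(model, feats):
    """Return (best label, posterior probability of that label)."""
    prior, counts, vocab = model
    scores = {}
    for c in sorted(prior):
        denom = sum(counts[c].values()) + len(vocab)
        scores[c] = math.log(prior[c]) + sum(
            math.log((counts[c][f] + 1) / denom) for f in feats & vocab)
    top = max(scores.values())
    z = sum(math.exp(s - top) for s in scores.values())
    return max(scores, key=scores.get), 1.0 / z

def fit_views(train, views):
    labels = [y for _, y in train]
    return [train_nb([view(t) for t, _ in train], labels) for view in views]

def co_train(labeled, unlabeled, views, threshold=0.75, max_rounds=10):
    train, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        models = fit_views(train, views)
        newly = {}  # pool index -> predicted label
        for view, model in zip(views, models):
            for i, tokens in enumerate(pool):
                label, p = predict_nb(model, view(tokens))
                if p >= threshold:
                    newly.setdefault(i, label)  # simplification: first confident view wins
        if not newly:
            break
        # Confident labels from either view become training data for both.
        train += [(pool[i], y) for i, y in newly.items()]
        pool = [t for i, t in enumerate(pool) if i not in newly]
    return fit_views(train, views), train

labeled = [(["i", "love", "this"], "opinion"),
           (["it", "opens", "friday"], "non-opinion")]
unlabeled = [["i", "love", "that"], ["it", "opens", "monday"],
             ["love", "that", "ending"]]

models, train = co_train(labeled, unlabeled, [unigrams, bigrams])
print(len(train))
print(predict_nb(models[1], bigrams(["that", "ending", "was"])))
```

In this toy run the unigram view is the confident one, and the bigram classifier ends up with evidence for bigrams it never saw in the original labeled data. With two views that are nearly identical there would be little to exchange, which matches the limitation stated above.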
Experimental design General settings for SSL Naïve Bayes classifier for self-training Binary values for unigram and bigram features Co-training strategies: Unigrams and bigrams (content vs. context) Two randomly split feature/training sets A character-based language model (CLM) and a bag-of-words model (BOW)
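Binary unigram and bigram features can be illustrated with a short sketch. It assumes whitespace tokenization and lowercasing, which the slides do not specify, and the function names are hypothetical.

```python
def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def extract(sentence):
    """Set of unigram and bigram features present in a sentence."""
    tokens = sentence.lower().split()  # assumption: whitespace tokenization
    return set(ngrams(tokens, 1)) | set(ngrams(tokens, 2))

def build_vocab(sentences):
    return sorted(set().union(*(extract(s) for s in sentences)))

def binarize(sentence, vocab):
    """1 if the feature occurs at least once, else 0 (counts are discarded)."""
    present = extract(sentence)
    return [1 if f in present else 0 for f in vocab]

sentences = ["a good amount of fun", "good good good"]
vocab = build_vocab(sentences)
vec = binarize("good good good", vocab)
print(vocab)
print(vec)
```

Because the values are binary, "good good good" switches on only two features ("good" and "good good") no matter how often the word repeats, while the bigram "good amount" remains a separate feature from the unigram "good".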
Results: Overall For movie reviews and news articles, co-training proved to be most robust For blog posts, SSL showed no benefits over SL due to the low initial accuracy
Results: Movie reviews Both self-training and co-training can improve opinion detection performance Co-training is more effective than self-training
Results: Movie reviews (cont.) The more different the two classifiers, the better the performance
Results: Domain transfer (movie reviews->blog posts) For a difficult domain (e.g., blog), simple self-training alone is promising for tackling the domain transfer problem.
Contributions Comprehensive research expands the spectrum of SSL application to opinion detection Investigation of the SSL model that best fits the problem space extends understanding of opinion detection and provides a resource for knowledge-based representation Generation of guidelines and evaluation baselines advances later studies using SSL algorithms in opinion detection Research extensible to other data domains, non-English texts, and other text mining tasks
“All my opinions are posted on my online blog.” “A grade of 85 or higher will get you favorable mention on my blog.” “If you want a second opinion, I’ll ask my computer” Thank you!