Classification Technology at LexisNexis. SIGIR 2001 Workshop on Operational Text Classification. Mark Wasson, LexisNexis, September 13, 2001.


Classification Technology at LexisNexis
SIGIR 2001 Workshop on Operational Text Classification
Mark Wasson
LexisNexis
September 13, 2001

Our Boolean Origins

The Topic Identification System
The Topic Identification System Model
  – Term-based Topic Identification (TTI)
  – Term Mapping System
  – Company Concept Indexing
  – Named Entity Indexing (Companies, People, Organizations, Places)
  – Subject Indexing Prototype (not released)
  – NEXIS Topical Indexing

Psycholinguistics Features
Propositional Language Model Underlies Surface Forms
Word Concepts
Semantic Priming, Additive up to a Point
Spreading Activation

Terms and Word Concepts
All words and phrases are searchable – no stop words
No automatic morphological or thesaurus expansion
  – Exception: name variant generation, but subject to human verification
Word Concept: a set of functionally equivalent terms with respect to a given topic; 1 to 100s of terms in a single word concept

Frequency & Weighting
Frequency & weighting at word concept level rather than at individual term level
TTI used chi-square to compare individual word concepts to supervised training set
TTI used stepwise linear regression to test in combination and suggest weights
Allow both positive and negative weights in addition to absolute yes/no Boolean functionality
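
The chi-square comparison mentioned on this slide can be illustrated with the standard Pearson statistic on a 2x2 table of documents (relevant or irrelevant, containing the word concept or not). This is only a sketch: the slides do not say how TTI built its table, and the function name and the counts in the usage note are made up for illustration.

```python
def chi_square_2x2(rel_with, rel_without, irr_with, irr_without):
    """Pearson chi-square statistic for a 2x2 contingency table.

    Rows: relevant / irrelevant training documents.
    Columns: documents with / without the word concept.
    (Assumed layout; the slides do not specify TTI's actual table.)
    """
    observed = ((rel_with, rel_without), (irr_with, irr_without))
    row_totals = (rel_with + rel_without, irr_with + irr_without)
    col_totals = (rel_with + irr_with, rel_without + irr_without)
    n = sum(row_totals)

    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            if expected == 0:
                # Degenerate table: no evidence of association either way.
                return 0.0
            stat += (observed[i][j] - expected) ** 2 / expected
    return stat
```

For example, `chi_square_2x2(30, 20, 10, 40)` scores a word concept that appears in 30 of 50 relevant and 10 of 50 irrelevant documents; a larger value suggests a stronger association between the word concept and the topic.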

Problem Word Concepts
5 documents: 3 relevant (G), 2 irrelevant (B)
W1 in G1, G2, B1
W2 in G2, G3, B2
W3 in G1, G3, B1
Each W by itself produces 67% recall, 67% precision
W1 + W2 -> 100% recall, 60% precision
W1 + W3 -> 100% recall, 75% precision
W2 + W3 -> 100% recall, 60% precision
W1 + W2 + W3 -> 100% recall, 60% precision
Also, fewer terms -> faster processing
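
The figures on this slide can be reproduced with a few lines of Python. The document labels and word-concept assignments come from the slide; the helper name `recall_precision` is an illustrative choice, and treating a combination as the union of the documents each word concept matches is an assumption that agrees with the slide's numbers.

```python
from itertools import combinations

RELEVANT = {"G1", "G2", "G3"}  # the 3 relevant documents; B1 and B2 are irrelevant
WORD_CONCEPTS = {
    "W1": {"G1", "G2", "B1"},
    "W2": {"G2", "G3", "B2"},
    "W3": {"G1", "G3", "B1"},
}


def recall_precision(concepts):
    """Recall and precision when a document matches any of the given word concepts."""
    retrieved = set().union(*(WORD_CONCEPTS[c] for c in concepts))
    hits = len(retrieved & RELEVANT)
    return hits / len(RELEVANT), hits / len(retrieved)


for size in (1, 2, 3):
    for combo in combinations(sorted(WORD_CONCEPTS), size):
        recall, precision = recall_precision(combo)
        print(f"{' + '.join(combo)}: recall {recall:.0%}, precision {precision:.0%}")
```

Only W1 + W3 avoids pulling in both irrelevant documents, which is why adding W2 costs precision without improving recall.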

Looking Up Terms in Documents
Count a term extra in key document parts
  – Headlines
  – Leading text
  – Captions
Count all potential matches
  – American gets counted for 100s of companies
Don't count a term when part of another
  – Mead in Mead Corp.
  – French in French Fry
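
The three lookup rules on this slide might be sketched as follows. Everything concrete here is hypothetical: the lexicon entries, the concept labels, the zone names, and the boost values are invented for illustration, and the slides do not describe the production matcher.

```python
from collections import Counter

# Hypothetical lexicon: lower-cased token tuple -> word concepts it may signal.
LEXICON = {
    ("mead",): ["topic:mead"],
    ("mead", "corp"): ["company:Mead Corp."],
    ("american",): ["company:American Airlines", "company:American Express"],
}
# Assumed multipliers for key document parts; the real values are not given.
ZONE_BOOST = {"headline": 2, "lead": 2, "caption": 2, "body": 1}
PUNCTUATION = ".,;:!?\"'()"


def tokenize(text):
    """Lower-case whitespace tokens with surrounding punctuation stripped."""
    stripped = (token.strip(PUNCTUATION).lower() for token in text.split())
    return [token for token in stripped if token]


def find_matches(tokens):
    """Return (start, end, term) spans, dropping terms nested inside longer matches."""
    spans = []
    for start in range(len(tokens)):
        for term in LEXICON:
            end = start + len(term)
            if tuple(tokens[start:end]) == term:
                spans.append((start, end, term))

    def nested(span):
        # True when a strictly longer matched span covers this one.
        return any(
            other[0] <= span[0] and span[1] <= other[1]
            and (other[1] - other[0]) > (span[1] - span[0])
            for other in spans
        )

    return [span for span in spans if not nested(span)]


def count_concepts(zones):
    """Count every candidate word concept, weighting matches by document zone."""
    counts = Counter()
    for zone, text in zones.items():
        for _, _, term in find_matches(tokenize(text)):
            for concept in LEXICON[term]:  # an ambiguous term counts for all candidates
                counts[concept] += ZONE_BOOST.get(zone, 1)
    return counts
```

With a headline of "Mead Corp. posts gains", the nested term "Mead" is suppressed and only the company concept is counted, at the boosted headline rate; "American" in the body counts once for each company it could refer to.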

Calculating Topic Scores
Summation of frequency * weight across all word concepts
Normalize score
Compare to threshold
  – Verification range in TTI
  – Major references, strong passing references, weak passing references in indexing tools
Add controlled vocabulary term or marker to document if score >= threshold
  – Add score, any associated secondary CVTs
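
A minimal sketch of the scoring steps on this slide follows. The sum of frequency times weight and the comparison against thresholds come from the slide; the normalization (score per 100 tokens), the function names, and all weights and cutoffs are assumptions, since the slides do not give them.

```python
def topic_score(frequencies, weights, doc_length):
    """Sum frequency * weight over word concepts, then normalize.

    Normalizing per 100 tokens is an assumption; the slides only say
    "Normalize score". Weights may be positive or negative.
    """
    raw = sum(freq * weights.get(concept, 0.0) for concept, freq in frequencies.items())
    if doc_length <= 0:
        return 0.0
    return raw * 100 / doc_length


def classify(score, bands):
    """Return the first band whose minimum the score meets, else None.

    `bands` is a list of (label, minimum_score) pairs sorted from highest
    minimum to lowest.
    """
    for label, minimum in bands:
        if score >= minimum:
            return label
    return None


# Hypothetical cutoffs for the three reference levels named on the slide.
BANDS = [
    ("major reference", 1.5),
    ("strong passing reference", 0.8),
    ("weak passing reference", 0.3),
]
```

For instance, with frequencies `{"merger_terms": 6, "sports_terms": 2}`, weights `{"merger_terms": 1.5, "sports_terms": -2.0}`, and a 250-token document, the raw score is 5.0 and the normalized score is 2.0, which `classify` places in the "major reference" band; a document scoring below every cutoff gets no controlled vocabulary term.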

Source-dependent, -independent
Similar field functions, different field names and locations
Database and file information to guide production processes
The source specification file allows us to reuse a single topic definition across a wide variety of sources and source types

Manual vs. Automatic
Build each definition using iterative manual process
Use supervised learning?
  – TTI's chi-square and regression
  – Cost of creating training samples
Automate repetitive, labor-intensive tasks
  – Generate name variants
Cheap labor cost – few minutes to 8 hours

Test, Test, Test
Business unit benchmarks prior to adoption
Development process test cases
Internal benchmarks with 3rd-party technologies
Sorry, not TREC
Most tests, topics, sources – recall and precision both in the 90-95% range

The End?
TIS Model? 16 years old
TTI? In production for 11 years
Term Mapping? 9 years old
Entity Indexing? 6-7 years old
Topical Indexing? 3 years old
Complemented by SRA NetOwl-based indexing 2 years ago
No movement afoot to replace any of them

Related Papers
TTI
  – Leigh, S. (1991). The Use of Natural Language Processing in the Development of Topic Specific Databases. Proceedings of the 12th National Online Meeting.
Company Concept Indexing
  – Wasson, M. (2000). Large-scale Controlled Vocabulary Indexing for Named Entities. Proceedings of the ANLP-NAACL 2000 Conference.