Classification Technology at LexisNexis
SIGIR 2001 Workshop on Operational Text Classification
Mark Wasson
LexisNexis
mark.wasson@lexisnexis.com
September 13, 2001
Our Boolean Origins
The Topic Identification System
The Topic Identification System Model
– Term-based Topic Identification (TTI)
– Term Mapping System
– Company Concept Indexing
– Named Entity Indexing (Companies, People, Organizations, Places)
– Subject Indexing Prototype (not released)
– NEXIS Topical Indexing
Psycholinguistics Features
Propositional Language Model Underlies Surface Forms
Word Concepts
Semantic Priming, Additive up to a Point
Spreading Activation
Terms and Word Concepts
All words and phrases are searchable – no stop words
No automatic morphological or thesaurus expansion
– Exception: name variant generation, but subject to human verification
Word concept: a set of functionally equivalent terms with respect to a given topic; a single word concept may contain from one to hundreds of terms (see the sketch below)
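A minimal sketch of the word concept idea, assuming a simple token-level lookup: all terms in the concept are treated as one unit when counting matches. The concept name, term list, and helper below are illustrative, not LexisNexis code.

```python
# Hypothetical word concept: a set of functionally equivalent terms
# counted as a single unit for a given topic.
WORD_CONCEPT_MERGER = {
    "merger", "mergers", "acquisition", "acquisitions",
    "takeover", "takeovers", "buyout", "buyouts",
}

def concept_frequency(tokens, concept_terms):
    """Count how often any term in the concept appears in a token list."""
    return sum(1 for tok in tokens if tok.lower() in concept_terms)

doc = "The takeover follows two earlier acquisitions and a failed buyout."
tokens = doc.replace(".", "").split()
print(concept_frequency(tokens, WORD_CONCEPT_MERGER))  # -> 3
```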
Frequency & Weighting
Frequency and weighting are applied at the word concept level rather than at the individual term level
TTI used chi-square to compare individual word concepts against a supervised training set (sketched below)
TTI used stepwise linear regression to test word concepts in combination and to suggest weights
Allow both positive and negative weights in addition to absolute yes/no Boolean functionality
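The slide does not spell out the chi-square formulation TTI used, so the following is a hedged sketch of one standard way to score a single word concept against a labeled training sample: a 2x2 contingency table of concept presence versus document relevance. The counts in the example are invented.

```python
def chi_square(present_relevant, present_irrelevant,
               absent_relevant, absent_irrelevant):
    """Pearson chi-square for a 2x2 table: concept presence vs. relevance."""
    n = present_relevant + present_irrelevant + absent_relevant + absent_irrelevant
    row_present = present_relevant + present_irrelevant
    row_absent = absent_relevant + absent_irrelevant
    col_relevant = present_relevant + absent_relevant
    col_irrelevant = present_irrelevant + absent_irrelevant
    chi2 = 0.0
    for observed, row, col in [
        (present_relevant, row_present, col_relevant),
        (present_irrelevant, row_present, col_irrelevant),
        (absent_relevant, row_absent, col_relevant),
        (absent_irrelevant, row_absent, col_irrelevant),
    ]:
        expected = row * col / n
        if expected:
            chi2 += (observed - expected) ** 2 / expected
    return chi2

# e.g. a concept found in 40 of 50 relevant docs and 5 of 100 irrelevant docs
print(round(chi_square(40, 5, 10, 95), 1))  # -> 89.3
```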
Problem Word Concepts
5 documents: 3 relevant (G), 2 irrelevant (B)
W1 in G1, G2, B1
W2 in G2, G3, B2
W3 in G1, G3, B1
Each W by itself produces 67% recall, 67% precision
W1 + W2 -> 100% recall, 60% precision
W1 + W3 -> 100% recall, 75% precision
W2 + W3 -> 100% recall, 60% precision
W1 + W2 + W3 -> 100% recall, 60% precision
Also, fewer terms -> faster processing
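The numbers on this slide are easy to check. The short script below recomputes recall and precision for each word concept and each combination from the document sets given above; it is a worked verification of the slide's arithmetic, not production code.

```python
from itertools import combinations

relevant = {"G1", "G2", "G3"}
hits = {
    "W1": {"G1", "G2", "B1"},
    "W2": {"G2", "G3", "B2"},
    "W3": {"G1", "G3", "B1"},
}

def recall_precision(concepts):
    """Recall and precision of the union of documents retrieved by the concepts."""
    retrieved = set().union(*(hits[c] for c in concepts))
    recall = len(retrieved & relevant) / len(relevant)
    precision = len(retrieved & relevant) / len(retrieved)
    return recall, precision

for size in (1, 2, 3):
    for combo in combinations(hits, size):
        r, p = recall_precision(combo)
        print(" + ".join(combo), f"-> {r:.0%} recall, {p:.0%} precision")
```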
Looking Up Terms in Documents
Count a term extra in key document parts
– Headlines
– Leading text
– Captions
Count all potential matches
– American gets counted for 100s of companies
Don't count a term when it is part of another term
– Mead in Mead Corp.
– French in French fry
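A hedged sketch of two of these rules: boosting matches found in key document parts, and suppressing a term when it occurs only inside a longer known term. The zone names, boost factors, and blocking list are assumptions made for illustration.

```python
import re

ZONE_BOOST = {"headline": 2, "leading_text": 2, "caption": 2, "body": 1}
LONGER_TERMS = ["Mead Corp.", "French fry"]  # longer terms that block their parts

def count_term(term, zones):
    """Count occurrences of term per zone, skipping hits inside longer terms."""
    total = 0
    for zone, text in zones.items():
        # Blank out longer terms so their components are not counted.
        for longer in LONGER_TERMS:
            if term.lower() in longer.lower() and term.lower() != longer.lower():
                text = text.replace(longer, " ")
        matches = re.findall(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE)
        total += ZONE_BOOST.get(zone, 1) * len(matches)
    return total

doc = {
    "headline": "Mead reports record earnings",
    "body": "Mead Corp. said profits rose. Mead attributed the gain to paper sales.",
}
print(count_term("Mead", doc))  # -> 3: headline hit counts double, "Mead Corp." is skipped
```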
Calculating Topic Scores
Sum frequency * weight across all word concepts
Normalize the score
Compare the score to a threshold
– Verification range in TTI
– Major references, strong passing references, weak passing references in the indexing tools
Add the controlled vocabulary term or marker to the document if score >= threshold
– Add the score and any associated secondary CVTs
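A minimal sketch of this scoring pipeline. The normalization (dividing by the topic's total positive weight) and the threshold values are placeholders invented for the example; the slide does not specify either.

```python
def topic_score(concept_frequencies, concept_weights):
    """Sum frequency * weight over word concepts, then normalize."""
    raw = sum(freq * concept_weights[c] for c, freq in concept_frequencies.items())
    max_positive = sum(w for w in concept_weights.values() if w > 0)
    return raw / max_positive if max_positive else 0.0

# Thresholds listed from strongest to weakest so the first match wins.
THRESHOLDS = {
    "major reference": 0.8,
    "strong passing reference": 0.5,
    "weak passing reference": 0.3,
}

def classify(score):
    for label, threshold in THRESHOLDS.items():
        if score >= threshold:
            return label   # document would receive the controlled vocabulary term
    return None            # below every threshold: no CVT added

weights = {"merger_concept": 3.0, "regulator_concept": 1.0, "sports_concept": -2.0}
freqs = {"merger_concept": 1, "regulator_concept": 1, "sports_concept": 0}
score = topic_score(freqs, weights)
print(round(score, 2), classify(score))  # -> 1.0 major reference
```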
Source-dependent, -independent
Similar field functions, but different field names and locations across sources
Database and file information to guide production processes
The source specification file lets us reuse a single topic definition across a wide variety of sources and source types (illustrated below)
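The slides do not show what a source specification looks like, so the mapping below is a purely hypothetical illustration of the idea: one entry per source that translates its native field names into the field functions (headline, leading text, body) that topic definitions and scoring refer to. Source names and field names are invented.

```python
# Hypothetical source specification entries; field and source names are invented.
SOURCE_SPECS = {
    "newspaper_a": {"headline": "HED", "leading_text": "LP", "body": "TEXT"},
    "newswire_b": {"headline": "TITLE", "leading_text": "LEAD", "body": "BODY"},
}

def zones_for(source, raw_fields):
    """Translate a source's native fields into the field functions scoring uses."""
    spec = SOURCE_SPECS[source]
    return {function: raw_fields.get(native_name, "")
            for function, native_name in spec.items()}

print(zones_for("newswire_b", {"TITLE": "Mead reports earnings", "BODY": "..."}))
```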
Manual vs. Automatic
Build each definition using an iterative manual process
Use supervised learning?
– TTI's chi-square and regression
– Cost of creating training samples
Automate repetitive, labor-intensive tasks
– Generate name variants (see the sketch below)
Cheap labor cost – a few minutes to 8 hours
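As an example of the kind of repetitive task that gets automated, here is a hedged sketch of simple name variant generation for a company name. The suffix list and rules are invented for illustration, and as an earlier slide notes, generated variants remain subject to human verification.

```python
# Suffixes and rules here are illustrative, not the production rule set.
CORPORATE_SUFFIXES = ["Corporation", "Corp.", "Incorporated", "Inc.", "Company", "Co."]

def name_variants(name):
    """Generate simple surface variants of a company name."""
    variants = {name}
    base = name
    for suffix in CORPORATE_SUFFIXES:
        if base.endswith(" " + suffix):
            base = base[: -(len(suffix) + 1)]
            break
    variants.add(base)                                          # suffix dropped
    variants.update(f"{base} {s}" for s in CORPORATE_SUFFIXES)  # suffix swapped
    return sorted(variants)

print(name_variants("Mead Corporation"))
# ['Mead', 'Mead Co.', 'Mead Company', 'Mead Corp.', 'Mead Corporation',
#  'Mead Inc.', 'Mead Incorporated']
```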
Test, Test, Test
Business unit benchmarks prior to adoption
Development process test cases
Internal benchmarks against 3rd-party technologies
Sorry, not TREC
Across most tests, topics and sources, recall and precision are both in the 90-95% range
The End?
TIS Model? 16 years old
TTI? In production for 11 years
Term Mapping? 9 years old
Entity Indexing? 6-7 years old
Topical Indexing? 3 years old
Complemented by SRA NetOwl-based indexing 2 years ago
No movement afoot to replace any of them
Related Papers
TTI
– Leigh, S. (1991). The Use of Natural Language Processing in the Development of Topic Specific Databases. Proceedings of the 12th National Online Meeting.
Company Concept Indexing
– Wasson, M. (2000). Large-scale Controlled Vocabulary Indexing for Named Entities. Proceedings of the ANLP-NAACL 2000 Conference.