Lecture 16: Filtering & TDT
Principles of Information Retrieval
Prof. Ray Larson
University of California, Berkeley
School of Information Management & Systems
Tuesday and Thursday 10:30 am - 12:00 pm, Spring 2006
http://www.sims.berkeley.edu/academics/courses/is240/s06/
IS 240 – Spring 2006
Overview
- Review
  - LSI
- Filtering & Routing
- TDT – Topic Detection and Tracking
How LSI Works
- Start with a matrix of terms by documents
- Analyze the matrix using SVD to derive a particular “latent semantic structure model”
- Two-mode factor analysis, unlike conventional factor analysis, permits an arbitrary rectangular matrix with different entities on the rows and columns, such as terms and documents
How LSI Works
- The rectangular matrix is decomposed into three other matrices of a special form by SVD
- The resulting matrices contain “singular vectors” and “singular values”
- The matrices show a breakdown of the original relationships into linearly independent components or factors
- Many of these components are very small and can be ignored, leading to an approximate model that contains many fewer dimensions
How LSI Works
Titles
C1: Human machine interface for LAB ABC computer applications
C2: A survey of user opinion of computer system response time
C3: The EPS user interface management system
C4: System and human system engineering testing of EPS
C5: Relation of user-perceived response time to error measurement
M1: The generation of random, binary, unordered trees
M2: The intersection graph of paths in trees
M3: Graph minors IV: Widths of trees and well-quasi-ordering
M4: Graph minors: A survey
Words that occur in more than one title (italicized on the original slide) are indexed
How LSI Works
Terms       Documents
            c1  c2  c3  c4  c5  m1  m2  m3  m4
Human        1   0   0   1   0   0   0   0   0
Interface    1   0   1   0   0   0   0   0   0
Computer     1   1   0   0   0   0   0   0   0
User         0   1   1   0   1   0   0   0   0
System       0   1   1   2   0   0   0   0   0
Response     0   1   0   0   1   0   0   0   0
Time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
Survey       0   1   0   0   0   0   0   0   1
Trees        0   0   0   0   0   1   1   1   0
Graph        0   0   0   0   0   0   1   1   1
Minors       0   0   0   0   0   0   0   1   1
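The term-by-document counts can be rebuilt from the nine titles. The following is a minimal sketch, not the procedure used for the slide: the tokenizer and the `STOPWORDS` list are assumptions, chosen so that only words occurring in more than one title are kept as index terms. Note that "survey" is counted in M4 because the title "Graph minors: A survey" contains it.

```python
import re

# The nine titles from the slide.
TITLES = {
    "c1": "Human machine interface for LAB ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user-perceived response time to error measurement",
    "m1": "The generation of random, binary, unordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}

# Assumed stopword list (hypothetical): the slide does not say how function
# words such as "of" and "the" were excluded from indexing.
STOPWORDS = {"a", "and", "for", "in", "of", "the", "to"}


def tokenize(text):
    """Lowercase, split on anything that is not a letter, drop stopwords."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def term_document_matrix(titles):
    """Return (document ids, {term: row of raw counts}) for multi-document terms."""
    docs = list(titles)
    tokens = {d: tokenize(t) for d, t in titles.items()}
    # Document frequency: the number of titles each word appears in.
    df = {}
    for words in tokens.values():
        for w in set(words):
            df[w] = df.get(w, 0) + 1
    terms = sorted(w for w, n in df.items() if n > 1)
    return docs, {t: [tokens[d].count(t) for d in docs] for t in terms}


docs, matrix = term_document_matrix(TITLES)
for term, row in matrix.items():
    print(f"{term:10s}", row)
```

Rows come out in alphabetical order rather than the slide's order; the counts per term are the same.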
How LSI Works
[Figure: SVD to 2 dimensions. Axes: Dimension 1 and Dimension 2.]
- Terms (blue dots): 1 human, 2 interface, 3 computer, 4 user, 5 system, 6 response, 7 time, 8 EPS, 9 survey, 10 trees, 11 graph, 12 minors
- Documents (red squares), with the numbers of the terms they contain: C1(1,2,3), C2(3,4,5,6,7,9), C3(2,4,5,8), C4(1,5,8), C5(4,6,7), M1(10), M2(10,11), M3(10,11,12), M4(9,11,12)
- Query (blue square): Q(1,3), “Human Computer Interaction”
- The dotted cone marks cosine .9 from the query; even docs with no terms in common with the query (c3 and c5) lie within the cone.
How LSI Works
$$X = T_0 S_0 D_0'$$
Dimensions: $X$ is $t \times d$ (terms by docs), $T_0$ is $t \times m$, $S_0$ is $m \times m$, $D_0'$ is $m \times d$
- $T_0$ has orthogonal, unit-length columns ($T_0' T_0 = I$)
- $D_0$ has orthogonal, unit-length columns ($D_0' D_0 = I$)
- $S_0$ is the diagonal matrix of singular values
- $t$ is the number of rows in $X$
- $d$ is the number of columns in $X$
- $m$ is the rank of $X$ ($m \le \min(t, d)$)
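The two-dimensional model can be approximated numerically from the term-by-document matrix. This is a sketch under stated assumptions: it uses plain power iteration with deflation in place of a library SVD routine, keeps only the two largest singular triplets, and folds the query in as $q' T$ so that it is scaled the same way as the document coordinates $D S$. The folding-in step comes from the standard LSI literature, not from these slides. The survey row counts M4, whose title contains the word.

```python
import math

TERMS = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
DOCS = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
X = [
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(v):
    return math.sqrt(dot(v, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))


def top_k_svd(A, k, iters=500):
    """Return the k largest (sigma, u, v) triplets by power iteration on A'A."""
    At = [list(col) for col in zip(*A)]
    triplets = []
    for _component in range(k):
        v = [1.0] * len(At)
        for _step in range(iters):
            # Deflation: project out right singular vectors already found.
            for _, _, prev in triplets:
                c = dot(v, prev)
                v = [a - c * b for a, b in zip(v, prev)]
            w = matvec(At, matvec(A, v))
            n = norm(w)
            v = [x / n for x in w]
        Av = matvec(A, v)
        sigma = norm(Av)
        triplets.append((sigma, [x / sigma for x in Av], v))
    return triplets


svd2 = top_k_svd(X, 2)
sigmas = [s for s, _, _ in svd2]

# Document j in the reduced space, scaled by the singular values (row j of D S).
doc_xy = {d: [s * v[j] for s, _, v in svd2] for j, d in enumerate(DOCS)}

# Query containing "human" and "computer", folded in as q' T.
q = [1 if t in ("human", "computer") else 0 for t in TERMS]
q_xy = [dot(q, u) for _, u, _ in svd2]

for d in DOCS:
    print(d, round(cosine(q_xy, doc_xy[d]), 2))
```

The signs of the singular vectors are arbitrary, but the cosines are unaffected because the query and the documents are projected with the same vectors.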
Filtering
- Characteristics of filtering systems:
  - Designed for unstructured or semi-structured data
  - Deal primarily with text information
  - Deal with large amounts of data
  - Involve streams of incoming data
- Filtering is based on descriptions of individual or group preferences, called profiles. These may be negative profiles (e.g. junk mail filters)
- Filtering implies removing non-relevant material as opposed to selecting relevant material.
Filtering
- Similar to IR, with some key differences
- Similar to Routing: sending relevant incoming data to different individuals or groups is virtually identical to filtering with multiple profiles
- Similar to Categorization systems: attaching one or more predefined categories to incoming data objects is also similar, but is more concerned with static categories (might be considered information extraction)
Structure of an IR System
[Diagram: Information Storage and Retrieval System. Adapted from Soergel, p. 19. Labels in the diagram:]
- Search Line
- Interest profiles & Queries
- Documents & data
- Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
- Storage
- Potentially Relevant
- Comparison/Matching
- Store1: Profiles/Search requests
- Store2: Document representations (Descriptive and Subject)
- Formulating query in terms of descriptors
- Storage of profiles
Structure of a Filtering System
[Diagram: Information Filtering System. Adapted from Soergel, p. 19. Labels in the diagram:]
- Individual or Group users
- Interest profiles
- Raw Documents & data
- Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
- Incoming Data Stream
- Potentially Relevant Documents
- Comparison/filtering
- Store1: Profiles/Search requests
- Doc surrogate
- Indexing/Categorization/Extraction
- Formulating query in terms of descriptors
- Storage of profiles
Major differences between IR and Filtering
- IR concerned with single uses of the system
- IR recognizes inherent faults of queries
  - Filtering assumes profiles can be better than IR queries
- IR concerned with collection and organization of texts
  - Filtering is concerned with distribution of texts
- IR is concerned with selection from a static database.
  - Filtering concerned with dynamic data stream
- IR is concerned with single interaction sessions
  - Filtering concerned with long-term changes
Contextual Differences
- In filtering the timeliness of the text is often of greatest significance
- Filtering often has a less well-defined user community
- Filtering often has privacy implications (how complete are user profiles? what do they contain?)
- Filtering profiles can (should?) adapt to user feedback
  - Conceptually similar to relevance feedback
Methods for Filtering
- Adapted from IR
  - E.g. use a retrieval ranking algorithm against incoming documents.
- Collaborative filtering
  - Individual and comparative profiles
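The "adapted from IR" approach can be sketched as follows: score each incoming document against every stored profile with a ranking function and pass it on when the score clears a threshold. This is an illustration under assumptions, not a method prescribed by the slides: raw term counts, cosine similarity, the threshold value, and the names `filter_stream`, `alice`, and `bob` are all hypothetical.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words term counts (no stemming or stopword removal)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def filter_stream(stream, profiles, threshold=0.3):
    """Yield (user, document) for each incoming document that matches a profile."""
    vectors = {user: vectorize(text) for user, text in profiles.items()}
    for doc in stream:
        d = vectorize(doc)
        for user, p in vectors.items():
            # A yes/no decision per document: streams are not ranked.
            if cosine(p, d) >= threshold:
                yield user, doc


profiles = {"alice": "graph trees minors", "bob": "user interface system"}
stream = [
    "graph minors a survey",
    "the eps user interface management system",
    "stock prices fell sharply",
]
delivered = list(filter_stream(stream, profiles))
for user, doc in delivered:
    print(user, "<-", doc)
```

With several profiles the same loop performs routing: each document goes to every user whose profile it matches, and documents matching no profile are dropped.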
TREC Filtering Track
- Original Filtering Track
  - Participants are given a starting query
  - They build a profile using the query and the training data
  - The test involves submitting the profile (which is not changed) and then running it against a new data stream
- New Adaptive Filtering Track
  - Same, except the profile can be modified as each new relevant document is encountered.
- Since streams are being processed, there is no ranking of documents
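One way to modify a profile as judged documents arrive is a Rocchio-style relevance feedback update. The track does not mandate any particular update rule, so this is only a sketch; the function name `update_profile` and the weights `beta` and `gamma` are assumptions.

```python
from collections import Counter


def update_profile(profile, doc_terms, relevant, beta=0.5, gamma=0.25):
    """Move the profile toward a relevant document or away from a non-relevant one."""
    updated = Counter(profile)
    weight = beta if relevant else -gamma
    for term, count in Counter(doc_terms).items():
        updated[term] += weight * count
    # Drop terms whose weight has fallen to zero or below.
    return Counter({t: w for t, w in updated.items() if w > 0})


profile = Counter({"graph": 1.0, "trees": 1.0})
after_hit = update_profile(profile, ["graph", "minors"], relevant=True)
after_miss = update_profile(after_hit, ["trees", "generation"], relevant=False)
print(dict(after_hit))
print(dict(after_miss))
```

In the original (non-adaptive) track the profile would be built once from the training data and this function would never be called during the test.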
TREC-8 Filtering Track
Following slides from the TREC-8 Overview by Ellen Voorhees
http://trec.nist.gov/presentations/TREC8/overview/index.htm
TDT: Topic Detection and Tracking
- Intended to automatically identify new topics (events, etc.) from a stream of text and follow the development/further discussion of those topics
Topic Detection and Tracking
- Introduction and Overview
- The TDT3 R&D Challenge
- TDT3 Evaluation Methodology
Slides from “Overview NIST Topic Detection and Tracking Introduction and Overview” by G. Doddington
http://www.itl.nist.gov/iaui/894.01/tests/tdt/tdt99/presentations/index.htm
TDT Task Overview*
5 R&D Challenges:
- Story Segmentation
- Topic Tracking
- Topic Detection
- First-Story Detection
- Link Detection
TDT3 Corpus Characteristics:†
- Two Types of Sources: Text, Speech
- Two Languages: English (30,000 stories), Mandarin (10,000 stories)
- 11 Different Sources:
  - 8 English: ABC, CNN, PRI, VOA, NBC, MNB, APW, NYT
  - 3 Mandarin: VOA, XIN, ZBN
* see http://www.itl.nist.gov/iaui/894.01/tdt3/tdt3.htm for details
† see http://morph.ldc.upenn.edu/Projects/TDT3/ for details
Preliminaries
A topic is … a seminal event or activity, along with all directly related events and activities.
A story is … a topically cohesive segment of news that includes two or more DECLARATIVE independent clauses about a single event.
Example Topic
Title: Mountain Hikers Lost
WHAT: 35 or 40 young Mountain Hikers were lost in an avalanche in France around the 20th of January.
WHERE: Orres, France
WHEN: January 1998
RULES OF INTERPRETATION: 5. Accidents
The Segmentation Task:
To segment the source stream into its constituent stories, for all audio sources (for Radio and TV only).
[Figure: a transcription stream of text (words) divided into story and non-story segments]
Story Segmentation Conditions
- 1 Language Condition
- 3 Audio Source Conditions
- 3 Decision Deferral Conditions
The Topic Tracking Task:
To detect stories that discuss the target topic, in multiple source streams.
- Find all the stories that discuss a given target topic
  - Training: Given Nt sample stories that discuss a given target topic,
  - Test: Find all subsequent stories that discuss the target topic.
[Figure: training data followed by test data, with stories marked on-topic or unknown. New This Year: unknown stories are not guaranteed to be off-topic]
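A tracker in this setting can be sketched as a centroid built from the Nt on-topic training stories, with each later story given a yes/no decision by thresholded similarity. The cosine measure, the threshold, and the sample stories are assumptions for illustration, not part of the task definition.

```python
import math
from collections import Counter


def vectorize(text):
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def train_topic(sample_stories):
    """Sum the term vectors of the Nt on-topic training stories."""
    centroid = Counter()
    for story in sample_stories:
        centroid.update(vectorize(story))
    return centroid


def track(centroid, stories, threshold=0.3):
    """Return one on-topic decision (True/False) per subsequent story."""
    return [cosine(centroid, vectorize(s)) >= threshold for s in stories]


training = [  # Nt = 2
    "hikers lost in avalanche in france",
    "avalanche buries hikers near orres",
]
test_stream = [
    "rescuers search for hikers after avalanche",
    "stock markets closed higher today",
]
decisions = track(train_topic(training), test_stream)
print(decisions)
```

Only on-topic examples are used for training here, which sidesteps the fact that the unlabeled training stories are not guaranteed to be off-topic.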
Topic Tracking Conditions
- 9 Training Conditions
- 1 Language Test Condition
- 3 Source Conditions
- 2 Story Boundary Conditions
The Topic Detection Task:
To detect topics in terms of the (clusters of) stories that discuss them.
- Unsupervised topic training: a meta-definition of topic is required, independent of topic specifics.
- New topics must be detected as the incoming stories are processed.
- Input stories are then associated with one of the topics.
[Figure: a cluster of stories labeled “a topic!”]
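One common way to meet these requirements is single-pass incremental clustering: each arriving story joins the most similar existing cluster, or starts a new topic if none is similar enough. This is a sketch of that general idea, not the method of any particular TDT system; the similarity measure, the threshold, and the example stories are assumptions.

```python
import math
from collections import Counter


def vectorize(text):
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def detect_topics(stories, threshold=0.3):
    """Assign each story, in arrival order, to an existing topic or a new one."""
    centroids = []  # one summed term vector per detected topic
    labels = []     # topic index assigned to each story
    for story in stories:
        v = vectorize(story)
        scores = [cosine(c, v) for c in centroids]
        best = max(range(len(scores)), key=scores.__getitem__, default=None)
        if best is not None and scores[best] >= threshold:
            centroids[best].update(v)  # the story joins an existing topic
            labels.append(best)
        else:
            centroids.append(Counter(v))  # the story starts a new topic
            labels.append(len(centroids) - 1)
    return labels


stories = [
    "hikers lost in avalanche",
    "graph minors survey published",
    "avalanche hikers found alive",
    "new survey of graph minors",
]
labels = detect_topics(stories)
print(labels)
```

The threshold plays the role of the meta-definition of a topic: it decides how different a story must be before it counts as something new, without reference to any specific topic.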
Topic Detection Conditions
- 3 Language Conditions
- 3 Source Conditions
- Decision Deferral Conditions
- 2 Story Boundary Conditions
The First-Story Detection Task:
To detect the first story that discusses a topic, for all topics.
- There is no supervised topic training (like Topic Detection)
[Figure: stories along a time axis, marked as First Stories or Not First Stories, with symbols distinguishing Topic 1 and Topic 2]
First-Story Detection Conditions
- 1 Language Condition
- 3 Source Conditions
- Decision Deferral Conditions
- 2 Story Boundary Conditions
The Link Detection Task
To detect whether a pair of stories discuss the same topic.
- The topic discussed is a free variable.
- Topic definition and annotation is unnecessary.
- The link detection task represents a basic functionality, needed to support all applications (including the TDT applications of topic detection and tracking).
- The link detection task is related to the topic tracking task, with Nt = 1.
[Figure: a pair of stories labeled “same topic?”]
Link Detection Conditions
- 1 Language Condition
- 3 Source Conditions
- Decision Deferral Conditions
- 1 Story Boundary Condition
TDT3 Evaluation Methodology
All TDT3 tasks are cast as statistical detection (yes-no) tasks.
- Story Segmentation: Is there a story boundary here?
- Topic Tracking: Is this story on the given topic?
- Topic Detection: Is this story in the correct topic-clustered set?
- First-Story Detection: Is this the first story on a topic?
- Link Detection: Do these two stories discuss the same topic?
Performance is measured in terms of detection cost, which is a weighted sum of miss and false alarm probabilities:
$$C_{Det} = C_{Miss} \cdot P_{Miss} \cdot P_{target} + C_{FA} \cdot P_{FA} \cdot (1 - P_{target})$$
Detection cost is normalized so that a perfect system scores 0 and the better of the two trivial systems (always “yes” or always “no”) scores 1:
$$(C_{Det})_{Norm} = \frac{C_{Det}}{\min\{C_{Miss} \cdot P_{target},\; C_{FA} \cdot (1 - P_{target})\}}$$
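The two cost formulas translate directly into code. In this sketch the default values for `c_miss`, `c_fa`, and `p_target` are illustrative assumptions, not values taken from the slide; the actual constants are fixed by the evaluation plan for each task.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Weighted sum of miss and false alarm probabilities (C_Det)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)


def normalized_detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """C_Det divided by the cost of the better trivial (always yes/no) system."""
    floor = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return detection_cost(p_miss, p_fa, c_miss, c_fa, p_target) / floor


# A system that misses 10% of targets and false-alarms on 1% of non-targets.
print(normalized_detection_cost(0.10, 0.01))

# The two trivial systems: never say yes, always say yes.
print(normalized_detection_cost(1.0, 0.0))
print(normalized_detection_cost(0.0, 1.0))
```

With these assumed constants the never-yes system scores exactly 1, while the always-yes system scores well above 1, so a normalized cost below 1 indicates that a system extracts useful information from the data.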
Example Performance Measures:
[Figure: Tracking Results on Newswire Text (BBN). Normalized Tracking Cost (axis ticks at 0.01, 0.1, 1) for English and Mandarin]
More on TDT
Some slides from James Allan from the HICSS meeting in January 2005