2010.03.17 - IS 240 – Spring 2010, Prof. Ray Larson, University of California, Berkeley School of Information: Principles of Information Retrieval


SLIDE 1 | IS 240 – Spring 2010
Prof. Ray Larson, University of California, Berkeley, School of Information
Principles of Information Retrieval
Lecture 16: Filtering & TDT

SLIDE 2 | IS 240 – Spring 2010
Overview
– Review: LSI
– Filtering & Routing
– TDT: Topic Detection and Tracking

SLIDE 3 | IS 240 – Spring 2010
Overview
– Review: LSI
– Filtering & Routing
– TDT: Topic Detection and Tracking

SLIDE 4 | IS 240 – Spring 2010
How LSI Works
Start with a matrix of terms by documents
Analyze the matrix using SVD to derive a particular "latent semantic structure model"
Two-mode factor analysis, unlike conventional factor analysis, permits an arbitrary rectangular matrix with different entities on the rows and columns
– such as terms and documents

SLIDE 5 | IS 240 – Spring 2010
How LSI Works
The rectangular matrix is decomposed by SVD into three other matrices of a special form
– The resulting matrices contain "singular vectors" and "singular values"
– The matrices show a breakdown of the original relationships into linearly independent components or factors
– Many of these components are very small and can be ignored, leading to an approximate model that contains many fewer dimensions

SLIDE 6 | IS 240 – Spring 2010
How LSI Works
Titles:
C1: Human machine interface for LAB ABC computer applications
C2: A survey of user opinion of computer system response time
C3: The EPS user interface management system
C4: System and human system engineering testing of EPS
C5: Relation of user-perceived response time to error measurement
M1: The generation of random, binary, unordered trees
M2: The intersection graph of paths in trees
M3: Graph minors IV: Widths of trees and well-quasi-ordering
M4: Graph minors: A survey
Italicized words occur in multiple docs and are indexed

SLIDE 7 | IS 240 – Spring 2010
How LSI Works
Term-by-document matrix (terms × documents):

           c1 c2 c3 c4 c5 m1 m2 m3 m4
Human       1  0  0  1  0  0  0  0  0
Interface   1  0  1  0  0  0  0  0  0
Computer    1  1  0  0  0  0  0  0  0
User        0  1  1  0  1  0  0  0  0
System      0  1  1  2  0  0  0  0  0
Response    0  1  0  0  1  0  0  0  0
Time        0  1  0  0  1  0  0  0  0
EPS         0  0  1  1  0  0  0  0  0
Survey      0  1  0  0  0  0  0  0  1
Trees       0  0  0  0  0  1  1  1  0
Graph       0  0  0  0  0  0  1  1  1
Minors      0  0  0  0  0  0  0  1  1
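The term-by-document matrix for these nine titles can be built and decomposed directly. A minimal sketch using numpy (counts derived from the indexed words in the titles; the variable names mirror the T0/S0/D0 notation used later in the deck but are otherwise mine):

```python
import numpy as np

# Term-by-document count matrix for the nine example titles
# (rows = terms, columns = documents c1..c5, m1..m4).
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],  # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],  # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],  # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],  # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],  # system ("system" occurs twice in C4)
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],  # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],  # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],  # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],  # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],  # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],  # minors
], dtype=float)

# SVD breaks X into singular vectors and singular values.
T0, S0, D0t = np.linalg.svd(X, full_matrices=False)

# The columns of T0 (and rows of D0t) are orthonormal,
# and the product reconstructs X exactly.
assert np.allclose(T0.T @ T0, np.eye(T0.shape[1]))
assert np.allclose(T0 @ np.diag(S0) @ D0t, X)
```

Keeping only the largest singular values in S0 gives the reduced-dimension approximate model described on the earlier slide.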

SLIDE 8 | IS 240 – Spring 2010
How LSI Works
[2-D plot of the SVD reduced to 2 dimensions: terms (blue dots) and documents (red squares) plotted on Dimension 1 vs. Dimension 2]
The blue square is the query "Human Computer Interaction"
The dotted cone marks cosine 0.9 from the query; even docs with no terms in common with it (c3 and c5) lie within the cone
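The picture on this slide can be approximated in code: keep the two largest singular values, project the documents onto the top two left singular vectors, and place the query in the same plane. A sketch using the example matrix from the earlier slides (the projection convention and variable names are my assumptions, not from the slides):

```python
import numpy as np

# Same example term-by-document matrix as the earlier slides
# (rows: human, interface, computer, user, system, response,
#  time, EPS, survey, trees, graph, minors; columns c1..c5, m1..m4).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

T0, S0, D0t = np.linalg.svd(X, full_matrices=False)
Tk = T0[:, :2]            # top two left singular vectors

docs_2d = Tk.T @ X        # each document as a point in the 2-D plane

q = np.zeros(12)
q[0] = q[2] = 1.0         # query terms "human" and "computer"
q_2d = Tk.T @ q           # query folded into the same plane

def cos(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

sims = [cos(q_2d, docs_2d[:, j]) for j in range(9)]
# In the reduced space the query sits near the c-documents and far
# from the m-documents, even for c3 and c5, which share no query terms.
```

This is the point of the cosine cone on the slide: rank reduction pulls documents about the same latent topic together, whether or not they share literal terms with the query.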

SLIDE 9 | IS 240 – Spring 2010
How LSI Works
X = T0 S0 D0'   where X is t × d, T0 is t × m, S0 is m × m, and D0' is m × d
T0 has orthogonal, unit-length columns (T0' T0 = I)
D0 has orthogonal, unit-length columns (D0' D0 = I)
S0 is the diagonal matrix of singular values
t is the number of rows in X
d is the number of columns in X
m is the rank of X (<= min(t, d))

SLIDE 10 | IS 240 – Spring 2010
Overview
– Review: LSI
– Filtering & Routing
– TDT: Topic Detection and Tracking

SLIDE 11 | IS 240 – Spring 2010
Filtering
Characteristics of filtering systems:
– Designed for unstructured or semi-structured data
– Deal primarily with text information
– Deal with large amounts of data
– Involve streams of incoming data
– Filtering is based on descriptions of individual or group preferences – profiles. Profiles may be negative (e.g. junk mail filters)
– Filtering implies removing non-relevant material, as opposed to selecting relevant material

SLIDE 12 | IS 240 – Spring 2010
Filtering
Similar to IR, with some key differences
Similar to Routing: sending relevant incoming data to different individuals or groups is virtually identical to filtering with multiple profiles
Similar to Categorization systems: attaching one or more predefined categories to incoming data objects is also similar, but is more concerned with static categories (might be considered information extraction)

SLIDE 13 | IS 240 – Spring 2010
Structure of an IR System (adapted from Soergel, p. 19)
[Diagram of an information storage and retrieval system]
Search line: interest profiles & queries are formulated in terms of descriptors and stored as profiles/search requests (Store 1)
Storage line: documents & data are indexed (descriptive and subject) and stored as document representations (Store 2)
Rules of the game = rules for subject indexing + thesaurus (which consists of lead-in vocabulary and indexing language)
Comparison/matching of the two stores yields potentially relevant documents

SLIDE 14 | IS 240 – Spring 2010
Structure of a Filtering System (adapted from Soergel, p. 19)
[Diagram of an information filtering system]
Interest profiles from individual or group users are formulated in terms of descriptors and stored as profiles/search requests (Store 1)
Raw documents & data arrive as an incoming data stream and pass through indexing/categorization/extraction to form a document surrogate stream
Rules of the game = rules for subject indexing + thesaurus (which consists of lead-in vocabulary and indexing language)
Comparison/filtering of the surrogate stream against the stored profiles yields potentially relevant documents

SLIDE 15 | IS 240 – Spring 2010
Major differences between IR and Filtering
– IR is concerned with single uses of the system
– IR recognizes inherent faults of queries; filtering assumes profiles can be better than IR queries
– IR is concerned with collection and organization of texts; filtering is concerned with distribution of texts
– IR is concerned with selection from a static database; filtering is concerned with a dynamic data stream
– IR is concerned with single interaction sessions; filtering is concerned with long-term changes

SLIDE 16 | IS 240 – Spring 2010
Contextual Differences
– In filtering, the timeliness of the text is often of greatest significance
– Filtering often has a less well-defined user community
– Filtering often has privacy implications (how complete are user profiles? what do they contain?)
– Filtering profiles can (should?) adapt to user feedback, which is conceptually similar to relevance feedback

SLIDE 17 | IS 240 – Spring 2010
Methods for Filtering
Adapted from IR
– E.g. use a retrieval ranking algorithm against incoming documents
Collaborative filtering
– Individual and comparative profiles
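The "adapted from IR" approach can be sketched as cosine matching of each incoming document vector against a stored profile vector, with a Rocchio-style update when the user gives feedback (the relevance-feedback parallel from the previous slide). This is a minimal illustration, not the method of any particular system; the class name, weights, and threshold are all my assumptions:

```python
import numpy as np

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b) / denom if denom else 0.0

class ProfileFilter:
    """Match an incoming stream against a user profile: a binary
    deliver/discard decision per document, with no ranking."""

    def __init__(self, profile, threshold=0.3, alpha=0.9, beta=0.1):
        self.profile = np.asarray(profile, dtype=float)
        self.threshold = threshold          # deliver when similarity >= this
        self.alpha, self.beta = alpha, beta  # illustrative Rocchio weights

    def deliver(self, doc):
        return cosine(self.profile, np.asarray(doc, dtype=float)) >= self.threshold

    def feedback(self, doc, relevant):
        # Rocchio-style adaptive update: move the profile toward
        # documents judged relevant, away from non-relevant ones.
        doc = np.asarray(doc, dtype=float)
        step = self.beta if relevant else -self.beta
        self.profile = self.alpha * self.profile + step * doc
```

For example, with profile [1, 1, 0, 0], the document [1, 0, 0, 0] has cosine about 0.71 and is delivered, while [0, 0, 1, 1] has cosine 0 and is discarded.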

SLIDE 18 | IS 240 – Spring 2010
TREC Filtering Track
Original Filtering Track:
– Participants are given a starting query
– They build a profile using the query and the training data
– The test involves submitting the profile (which is not changed) and then running it against a new data stream
New Adaptive Filtering Track:
– Same, except the profile can be modified as each new relevant document is encountered
Since streams are being processed, there is no ranking of documents

SLIDE 19 | IS 240 – Spring 2010
TREC-8 Filtering Track
Following slides from the TREC-8 overview by Ellen Voorhees

SLIDES 20–23 | IS 240 – Spring 2010
[Charts from the TREC-8 overview]

SLIDE 24 | IS 240 – Spring 2010
Overview
– Review: LSI
– Filtering & Routing
– TDT: Topic Detection and Tracking

SLIDE 25 | IS 240 – Spring 2010
TDT: Topic Detection and Tracking
Intended to automatically identify new topics (events, etc.) from a stream of text and follow the development/further discussion of those topics

SLIDE 26 | IS 240 – Spring 2010
Topic Detection and Tracking
– Introduction and Overview
– The TDT3 R&D Challenge
– TDT3 Evaluation Methodology
Slides from "Overview NIST Topic Detection and Tracking - Introduction and Overview" by G. Doddington

SLIDE 27 | IS 240 – Spring 2010
TDT Task Overview
5 R&D Challenges:
– Story Segmentation
– Topic Tracking
– Topic Detection
– First-Story Detection
– Link Detection
TDT3 Corpus Characteristics:
– Two types of sources: text and speech
– Two languages: English (30,000 stories) and Mandarin (10,000 stories)
– 11 different sources: 8 English (ABC, CNN, VOA, PRI, NBC, MNB, APW, NYT) and 3 Mandarin (VOA, XIN, ZBN)

SLIDE 28 | IS 240 – Spring 2010
Preliminaries
A topic is … a seminal event or activity, along with all directly related events and activities.
A story is … a topically cohesive segment of news that includes two or more declarative independent clauses about a single event.

SLIDE 29 | IS 240 – Spring 2010
Example Topic
Title: Mountain Hikers Lost
– WHAT: 35 or 40 young mountain hikers were lost in an avalanche in France around the 20th of January
– WHERE: Orres, France
– WHEN: January 1998
– RULES OF INTERPRETATION: 5. Accidents

SLIDE 30IS 240 – Spring 2010 (for Radio and TV only) Transcription: text (words) Story: Non-story: The Segmentation Task: To segment the source stream into its constituent stories, for all audio sources.

SLIDE 31 | IS 240 – Spring 2010
Story Segmentation Conditions
– 1 language condition
– 3 audio source conditions
– 3 decision deferral conditions

SLIDE 32 | IS 240 – Spring 2010
The Topic Tracking Task: to detect stories that discuss the target topic, in multiple source streams
Find all the stories that discuss a given target topic
– Training: given Nt sample stories that discuss a given target topic
– Test: find all subsequent stories that discuss the target topic
New this year: the remaining training stories are not guaranteed to be off-topic
[Diagram: timeline of on-topic and unknown stories split into training data and test data]
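A minimal centroid-based tracker is a common baseline for this task (this is an illustrative sketch, not the method of any evaluated system; the threshold and the term-vector representation are assumptions): build one centroid from the Nt training stories, then flag each subsequent story whose cosine similarity to the centroid clears the threshold.

```python
import numpy as np

def track(training_stories, stream, threshold=0.5):
    """Flag stories in `stream` that discuss the topic defined by the
    Nt `training_stories`. Each story is a term-weight vector."""
    centroid = np.mean(np.asarray(training_stories, dtype=float), axis=0)
    centroid /= np.linalg.norm(centroid)

    on_topic = []
    for i, story in enumerate(stream):
        v = np.asarray(story, dtype=float)
        sim = float(centroid @ v) / np.linalg.norm(v)  # cosine similarity
        if sim >= threshold:
            on_topic.append(i)
    return on_topic
```

With training stories [[1, 1, 0], [1, 0, 0]], the stream story [1, 1, 0] is flagged as on-topic while [0, 0, 1] is not.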

SLIDE 33 | IS 240 – Spring 2010
Topic Tracking Conditions
– 9 training conditions
– 1 language test condition
– 3 source conditions
– 2 story boundary conditions

SLIDE 34 | IS 240 – Spring 2010
The Topic Detection Task: to detect topics in terms of the (clusters of) stories that discuss them
– Unsupervised topic training: a meta-definition of topic is required, independent of topic specifics
– New topics must be detected as the incoming stories are processed
– Input stories are then associated with one of the topics

SLIDE 35 | IS 240 – Spring 2010
Topic Detection Conditions
– 3 language conditions
– 3 source conditions
– Decision deferral conditions
– 2 story boundary conditions

SLIDE 36 | IS 240 – Spring 2010
The First-Story Detection Task: to detect the first story that discusses a topic, for all topics
There is no supervised topic training (as in Topic Detection)
[Diagram: timeline marking first stories vs. not-first stories for Topic 1 and Topic 2]
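First-story detection is commonly framed as single-pass novelty detection: compare each incoming story to everything seen so far and declare a first story when nothing is sufficiently similar. A sketch under that framing (the threshold and vector representation are illustrative assumptions):

```python
import numpy as np

def first_story_detection(stream, threshold=0.4):
    """Single-pass novelty detection: a story is a 'first story' if its
    best cosine similarity to every earlier story stays below `threshold`."""
    seen, firsts = [], []
    for i, story in enumerate(stream):
        v = np.asarray(story, dtype=float)
        v = v / np.linalg.norm(v)
        if all(float(v @ s) < threshold for s in seen):
            firsts.append(i)   # nothing similar seen before: a new topic
        seen.append(v)
    return firsts
```

For a stream [[1, 0, 0], [1, 1, 0], [0, 0, 1]], stories 0 and 2 are first stories; story 1 is too similar to story 0 to count as new.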

SLIDE 37 | IS 240 – Spring 2010
First-Story Detection Conditions
– 1 language condition
– 3 source conditions
– Decision deferral conditions
– 2 story boundary conditions

SLIDE 38 | IS 240 – Spring 2010
The Link Detection Task: to detect whether a pair of stories discuss the same topic
– The topic discussed is a free variable; topic definition and annotation is unnecessary
– The link detection task represents a basic functionality, needed to support all applications (including the TDT applications of topic detection and tracking)
– The link detection task is related to the topic tracking task, with Nt = 1

SLIDE 39 | IS 240 – Spring 2010
Link Detection Conditions
– 1 language condition
– 3 source conditions
– Decision deferral conditions
– 1 story boundary condition

SLIDE 40 | IS 240 – Spring 2010
TDT3 Evaluation Methodology
All TDT3 tasks are cast as statistical detection (yes-no) tasks:
– Story Segmentation: is there a story boundary here?
– Topic Tracking: is this story on the given topic?
– Topic Detection: is this story in the correct topic-clustered set?
– First-Story Detection: is this the first story on a topic?
– Link Detection: do these two stories discuss the same topic?
Performance is measured in terms of detection cost, a weighted sum of miss and false alarm probabilities:
C_Det = C_Miss * P_Miss * P_target + C_FA * P_FA * (1 - P_target)
Detection cost is normalized to lie between 0 and 1:
(C_Det)_Norm = C_Det / min{ C_Miss * P_target, C_FA * (1 - P_target) }
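The two cost formulas are easy to check numerically. A small sketch (C_Miss = 1, C_FA = 0.1, P_target = 0.02 are parameter values typically used in TDT evaluations, but treat them here as illustrative defaults):

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """C_Det = C_Miss * P_Miss * P_target + C_FA * P_FA * (1 - P_target)."""
    return c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)

def normalized_cost(p_miss, p_fa, c_miss=1.0, c_fa=0.1, p_target=0.02):
    """Normalize by the cost of the better trivial system
    (always answer YES, or always answer NO)."""
    floor = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return detection_cost(p_miss, p_fa, c_miss, c_fa, p_target) / floor
```

A perfect system (no misses, no false alarms) scores 0; a system that always answers NO (P_Miss = 1, P_FA = 0) scores exactly 1 after normalization.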

SLIDE 41 | IS 240 – Spring 2010
Example Performance Measures
[Chart: normalized tracking cost for tracking results on newswire text (BBN), English and Mandarin]

SLIDE 42 | IS 240 – Spring 2010
More on TDT
Some slides from James Allan from the HICSS meeting in January 2005