The Boolean Retrieval Model LBSC 708A/CMSC 838L Session 2 - September 11, 2001 Philip Resnik.

Presentation transcript:

Agenda
– Questions
– General model for detection
– The "bag of words" representation
– Boolean "free text" retrieval
– Proximity operators
– Controlled vocabulary retrieval
– Automating controlled vocabulary
– Retrieval versus filtering

But First...
Rate the textbook reading:
– Was it easy to understand?
– How long did it take you to read?

Retrieval System Model
[Diagram: the retrieval cycle — source selection, query formulation, search (query → ranked list), examination, and document delivery, with feedback loops for query reformulation / relevance feedback and source reselection. The system nominates and predicts; the user chooses.]

Search Goal
Choose the same documents a human would:
– Without human intervention (less work)
– Faster than a human could (less time)
– As accurately as possible (less accuracy)
Humans start with an information need; machines start with a query.
Humans match documents to information needs; machines match document and query representations.

Search Component Model
[Diagram: an information need is turned into a query (query formulation) and processed by a representation function into a query representation; a document is processed by a representation function into a document representation; a comparison function produces a retrieval status value, which stands in for the utility a human judgment would assign.]

Detection Component Model
"Retrieval status value" is an estimate of utility:
– Utility ≈ what the user would pay for the document
A co-design problem:
– Document representation function
– Query representation function
– Comparison function
Boolean "free text" retrieval is one way of allocating functionality to each function.

"Bag of Words" Representation
Bag = multiset: keeps track of members and counts.
The quick brown fox jumped over the lazy dog's back →
{back, brown, dog, fox, jumped, lazy, over, quick, 's, the, the}
A "term" is any lexical item that you choose:
– A fixed-length sequence of characters (an "n-gram")
– A word (delimited by "white space" or punctuation)
– "Root form" of each word (destroyed → destroy)
– "Stem" of each word (destroyed → destr)
– A phrase (e.g., phrases listed in a dictionary)
Counts can be recorded in any consistent order.
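A minimal sketch of the idea in Python, using the slide's example sentence. The tokenizer here is an assumption (lowercase, split on non-letters); unlike the slide, it keeps "dog's" as a single token rather than splitting off the clitic 's.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, split into word-like tokens, and count each term."""
    terms = re.findall(r"[a-z']+", text.lower())
    return Counter(terms)

bag = bag_of_words("The quick brown fox jumped over the lazy dog's back")
# 'the' appears twice; every other term appears once
```

A Counter is exactly a multiset: membership plus counts, in no particular order.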

Bag of Words Example
Document 1: The quick brown fox jumped over the lazy dog's back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stopword list: the, is, for, to, of, 's

Indexed term   Doc 1  Doc 2
quick            1      0
brown            1      0
fox              1      0
over             1      0
lazy             1      0
dog              1      0
back             1      0
jump             1      0
now              0      1
time             0      1
all              0      1
good             0      1
men              0      1
come             0      1
aid              0      1
their            0      1
party            0      1

Boolean "Free Text" Retrieval
Limit the bag of words to "absent" and "present":
– "Boolean" values, represented as 0 and 1
Represent terms as a "bag of documents":
– Same representation, but rows rather than columns
Combine the rows using Boolean operators:
– AND, OR, NOT
Any document with a 1 remaining is "detected".

Boolean Operators
[Venn diagrams: A OR B (union), A AND B (intersection), A NOT B (A minus B), and NOT B (everything outside B); A NOT B is equivalent to A AND NOT B.]

Boolean Free Text Example
[The slide shows a term–document incidence table for Docs 1–8 over the terms quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party; the queries below are evaluated against it.]
– dog AND fox → Doc 3, Doc 5
– dog NOT fox → empty
– fox NOT dog → Doc 7
– dog OR fox → Doc 3, Doc 5, Doc 7
– good AND party → Doc 6, Doc 8
– good AND party NOT over → Doc 6
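The "bag of documents" view maps directly onto set operations: each term's row is the set of documents containing it, and AND/OR/NOT become intersection/union/difference. The postings below are hypothetical, chosen only so the results match the slide's queries.

```python
# Hypothetical postings: term -> set of documents containing it.
index = {
    "dog":   {3, 5},
    "fox":   {3, 5, 7},
    "good":  {6, 8},
    "party": {6, 8},
    "over":  {1, 8},
}

def term(t):
    """Return the posting set for a term (empty if the term is unindexed)."""
    return index.get(t, set())

# AND = intersection, OR = union, NOT = set difference
print(sorted(term("dog") & term("fox")))                    # [3, 5]
print(sorted(term("dog") - term("fox")))                    # []
print(sorted(term("fox") - term("dog")))                    # [7]
print(sorted(term("dog") | term("fox")))                    # [3, 5, 7]
print(sorted(term("good") & term("party")))                 # [6, 8]
print(sorted(term("good") & term("party") - term("over")))  # [6]
```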

Why Boolean Retrieval Works
Boolean operators approximate natural language:
– Find documents about a good party that is not over
AND can discover relationships between concepts:
– good party
OR can discover alternate terminology:
– excellent party
NOT can discover alternate meanings:
– Democratic party

The Perfect Query Paradox
Every information need has a perfect document set:
– If not, there would be no sense doing retrieval
Almost every document set has a perfect query:
– AND every word to get a query for document 1
– Repeat for each document in the set
– OR every document query to get the set query
But users find Boolean query formulation hard:
– They get too much, too little, useless stuff, …

Why Boolean Retrieval Fails
Natural language is way more complex:
– She saw the man on the hill with a telescope
AND "discovers" nonexistent relationships:
– Terms in different paragraphs, chapters, …
Guessing terminology for OR is hard:
– good, nice, excellent, outstanding, awesome, …
Guessing terms to exclude is even harder!
– Democratic party, party to a lawsuit, …

Proximity Operators
More precise versions of AND:
– "NEAR n" allows at most n-1 intervening terms
– "WITH" requires terms to be adjacent and in order
Easy to implement, but less efficient:
– Store a list of positions for each word in each doc (stopwords become very important!)
– Perform normal Boolean computations, treating WITH and NEAR like AND with an extra constraint

Proximity Operator Example
[The slide shows the Doc 1 / Doc 2 term table from before, augmented with each term's word positions.]
– time AND come → Doc 2
– time (NEAR 2) come → empty
– quick (NEAR 2) fox → Doc 1
– quick WITH fox → empty
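The slide's queries can be reproduced with a small positional index: store 1-based word offsets per term per document, then add the distance constraint on top of the AND (set-intersection) step. This is a sketch under the slide's definitions: NEAR n permits at most n-1 intervening terms (either order), WITH requires adjacency in order.

```python
# Build a positional index over the two example documents (no stopwording here).
docs = {
    1: "the quick brown fox jumped over the lazy dog's back",
    2: "now is the time for all good men to come to the aid of their party",
}

positions = {}  # term -> {doc_id: [1-based word positions]}
for doc_id, text in docs.items():
    for pos, term in enumerate(text.split(), start=1):
        positions.setdefault(term, {}).setdefault(doc_id, []).append(pos)

def near(a, b, n):
    """Docs where a and b occur within distance n (at most n-1 intervening terms)."""
    hits = set()
    for doc in positions.get(a, {}).keys() & positions.get(b, {}).keys():
        if any(0 < abs(pa - pb) <= n
               for pa in positions[a][doc] for pb in positions[b][doc]):
            hits.add(doc)
    return hits

def with_(a, b):
    """Docs where a is immediately followed by b (adjacent, in order)."""
    return {doc
            for doc in positions.get(a, {}).keys() & positions.get(b, {}).keys()
            if any(pb - pa == 1
                   for pa in positions[a][doc] for pb in positions[b][doc])}
```

With these documents, near("quick", "fox", 2) finds Doc 1 (only "brown" intervenes), while near("time", "come", 2) and with_("quick", "fox") are empty, matching the slide.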

Concept Retrieval
Goal: retrieve using "concepts," not just words.
– Some words have many meanings (e.g., bank); a bigger problem for large, diverse collections
– Some meanings are associated with many words, especially when shades of meaning are unimportant
This is the holy grail of information retrieval:
– Everyone agrees that it is a good idea
– But every known approach has some limitations

Controlled Vocabulary Retrieval
A straightforward concept retrieval approach:
– Works equally well for non-text materials
– Index terms are a form of metadata
Assign a unique "descriptor" to each concept:
– Can be done by hand for collections of limited scope
– In theory, descriptors are unambiguous
Assign some descriptors to each document:
– Practical for valuable collections of limited size
Use Boolean retrieval based on descriptors.

Controlled Vocabulary Example
Document 1: The quick brown fox jumped over the lazy dog's back. [Canine] [Fox]
Document 2: Now is the time for all good men to come to the aid of their party. [Political action] [Volunteerism]

Descriptor        Doc 1  Doc 2
Canine              1      0
Fox                 1      0
Political action    0      1
Volunteerism        0      1

– Canine AND Fox → Doc 1
– Canine AND Political action → empty
– Canine OR Political action → Doc 1, Doc 2

Thesaurus Design
Thesauri contain descriptors and relationships:
– Broader term (≈ IS-A), narrower term, used for, …
Indexers select descriptors for each document:
– The thesaurus must match the document collection
Searchers select descriptors for each query:
– The thesaurus must match information needs
Indexers must anticipate searchers' information needs:
– Or searchers must discern indexers' perspective
– Or the thesaurus itself must be accessible/browsable

Challenges
Thesaurus design is expensive:
– Shifting concepts generate continuing expense
Manual indexing is even more expensive:
– And consistent indexing is very expensive
User needs are often difficult to anticipate:
– A challenge for thesaurus designers and indexers
End users find thesauri hard to use:
– A co-design problem with query formulation

Applications
When implied concepts must be captured:
– Political action, volunteerism, …
When terminology selection is impractical:
– Searching foreign-language materials
When no words are present:
– Photos without captions, videos without transcripts, …
When user needs are easily anticipated:
– Weather reports, yellow pages*, …
*But cf. Bill Woods' classic example of the paraphrase problem: "car washing" vs. "automobile cleaning"

Yahoo

Machine Assisted Indexing
Goal: automatically suggest descriptors:
– Better consistency with lower cost
Descriptors chosen by a rule-based expert system:
– Design the thesaurus by hand in the usual way
– Design an expert system to process text (string matching, proximity operators, …)
– Write rules for each thesaurus/collection/language
– Try it out and fine-tune the rules by hand

Machine Assisted Indexing Example
(Access Innovations system; "near" = within 250 words, "with" = in the same sentence)

//TEXT: science
IF (all caps)
    USE research policy
    USE community program
ENDIF
IF (near "Technology" AND with "Development")
    USE community development
    USE development aid
ENDIF
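A hypothetical sketch of how rules of this shape might be evaluated for the trigger term "science". Everything here (function name, tokenization, sentence splitting) is an illustration, not the Access Innovations implementation; the window sizes follow the slide (near = within 250 words, with = same sentence).

```python
import re

def suggest_descriptors(text):
    """Apply the slide's two example rules for the trigger term 'science'."""
    descriptors = set()
    words = text.split()
    sentences = re.split(r"[.!?]", text)
    for i, w in enumerate(words):
        token = w.strip(".,")
        if token.lower() != "science":
            continue
        # Rule 1: the trigger occurrence is written in all caps.
        if token.isupper():
            descriptors.update({"research policy", "community program"})
        # Rule 2: "technology" within 250 words AND "development" in the same sentence.
        window = [x.lower().strip(".,") for x in words[max(0, i - 250):i + 251]]
        sentence = next((s for s in sentences if "science" in s.lower()), "")
        if "technology" in window and "development" in sentence.lower():
            descriptors.update({"community development", "development aid"})
    return descriptors
```

For example, "Funding for SCIENCE is shrinking." fires rule 1, while "science and technology drive development" fires rule 2.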

Text Categorization
Goal: fully automatic descriptor assignment.
A machine learning approach:
– Assign descriptors manually for a "training set"
– Design a learning algorithm to find and use patterns (Bayesian classifier, neural network, genetic algorithm, …)
– Present new documents: the system assigns descriptors like those in the training set

Supervised Learning
[Diagram: labelled training examples — feature vectors (f1 … fN with values v1 … vN and w1 … wN) paired with class labels Cv and Cw — are fed to a learner, which produces a classifier; the classifier assigns a class Cx to a new example x1 … xN.]
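As a concrete instance of this loop, here is a toy multinomial Naive Bayes text classifier, one of the Bayesian-classifier family the previous slide mentions. The four training "documents" and their labels are invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Labelled training examples: (text, descriptor).
train = [
    ("the quick brown fox jumped over the lazy dog", "Canine"),
    ("good men come to the aid of their party", "Political action"),
    ("the dog barked at the fox in the yard", "Canine"),
    ("the party platform needs volunteer aid", "Political action"),
]

# "Learner": collect per-class document counts and word counts.
class_docs = defaultdict(int)
class_words = defaultdict(Counter)
for text, label in train:
    class_docs[label] += 1
    class_words[label].update(text.split())

vocab = {w for counts in class_words.values() for w in counts}

def classify(text):
    """'Classifier': pick the class with the highest posterior log-probability."""
    scores = {}
    for label in class_docs:
        score = math.log(class_docs[label] / len(train))  # log prior
        total = sum(class_words[label].values())
        for w in text.split():  # log likelihoods with add-one smoothing
            score += math.log((class_words[label][w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)
```

A new example like "the lazy dog" is then assigned the descriptor whose training documents it most resembles.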

Retrieval vs. Filtering
Retrospective retrieval: a relatively static collection; a constant flow of queries.
Information filtering: a relatively static profile (query); a constant stream of new documents.
Examples:
– Yahoo categorization of new Web pages (could also be viewed as an ongoing indexing task)
– A personalized newspaper

Case Study: Individual Inc.
First of the personalized newspapers (original delivery mechanism: 8am fax).
Core technology: SMART + extended Boolean.
Key insights:
– Targeted, industry-specific marketing
– Large staff of non-technical domain specialists
– "Building block" Boolean profiles
– Nightly update of profiles based on the data stream, e.g. (OJ or "orange juice") and not Simpson
– Inexpensive detection and selection; more costly examination/delivery

Things to Do This Week
– Homework 1 (due next week)
– Do the readings
– Note reading list changes

One Minute Paper
Brief answers, no names, online:
– In your opinion, what is the most important positive and most important negative characteristic of Boolean retrieval? Please provide exactly one of each.
– What was the muddiest point in today's lecture?
– What was the most interesting point in today's lecture?
I'll summarize the answers next class.