9/6/2001 Information Organization and Retrieval: Introduction to Information Retrieval (cont.): Boolean Model. University of California, Berkeley School of.

Presentation transcript:

9/6/2001 Information Organization and Retrieval
Introduction to Information Retrieval (cont.): Boolean Model
University of California, Berkeley, School of Information Management and Systems
SIMS 202: Information Organization and Retrieval
Lecture authors: Marti Hearst & Ray Larson

The Standard Retrieval Interaction Model

IR is an Iterative Process
[Figure labels: Repositories, Workspace, Goals]

A sketch of a searcher… “moving through many actions towards a general goal of satisfactory completion of research related to an information need.” (after Bates 89)
[Figure labels: Q0, Q1, Q2, Q3, Q4, Q5]

Restricted Form of the IR Problem
The system has available only pre-existing, “canned” text passages.
Its response is limited to selecting from these passages and presenting them to the user.
It must select, say, 10 or 20 passages out of millions or billions!

Information Retrieval
Revised Task Statement: Build a system that retrieves documents that users are likely to find relevant to their queries.
This set of assumptions underlies the field of Information Retrieval.

Some IR History
– Roots in the scientific “Information Explosion” following WWII
– Interest in computer-based IR from mid 1950’s
  – H.P. Luhn at IBM (1958)
  – Probabilistic models at Rand (Maron & Kuhns) (1960)
  – Boolean system development at Lockheed (‘60s)
  – Vector Space Model (Salton at Cornell 1965)
  – Statistical Weighting methods and theoretical advances (‘70s)
  – Refinements and Advances in application (‘80s)
  – User Interfaces, Large-scale testing and application (‘90s)

Structure of an IR System
[Figure: Information Storage and Retrieval System, adapted from Soergel, p. 19]
– Search Line: Interest profiles & Queries; Formulating query in terms of descriptors; Storage of profiles; Store1: Profiles/Search requests
– Storage Line: Documents & data; Indexing (Descriptive and Subject); Storage of Documents; Store2: Document representations
– Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
– Comparison/Matching; Potentially Relevant Documents

Relevance (introduction)
In what ways can a document be relevant to a query?
– Answer precise question precisely.
  – Who is buried in Grant’s tomb? Grant.
– Partially answer question.
  – Where is Danville? Near Walnut Creek.
– Suggest a source for more information.
  – What is lymphedema? Look in this Medical Dictionary.
– Give background information.
– Remind the user of other knowledge.
– Others...
Ideally, IR systems should retrieve ALL and ONLY the RELEVANT documents for a user…

Query Languages
A way to express the question (information need)
Types:
– Boolean
– Natural Language
– Stylized Natural Language
– Form-Based (GUI)

Simple query language: Boolean
– Terms + Connectors (or operators)
– terms
  – words
  – normalized (stemmed) words
  – phrases
  – thesaurus terms
– connectors
  – AND
  – OR
  – NOT

Boolean Queries
– Cat
– Cat OR Dog
– Cat AND Dog
– (Cat AND Dog)
– (Cat AND Dog) OR Collar
– (Cat AND Dog) OR (Collar AND Leash)
– (Cat OR Dog) AND (Collar OR Leash)

Boolean Queries
(Cat OR Dog) AND (Collar OR Leash)
– Each of the following combinations works:
[Table: rows Cat, Dog, Collar, Leash; an x marks the terms present in each combination]

Boolean Queries
(Cat OR Dog) AND (Collar OR Leash)
– None of the following combinations work:
[Table: rows Cat, Dog, Collar, Leash; an x marks the terms present in each combination]
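
The working and non-working combinations for this query can be enumerated mechanically. The following is a minimal Python sketch, assuming each document is reduced to the set of query terms it contains; the names `TERMS` and `matches` are illustrative and do not come from the lecture.

```python
from itertools import product

TERMS = ("cat", "dog", "collar", "leash")

def matches(doc_terms):
    """Evaluate (Cat OR Dog) AND (Collar OR Leash) against a set of terms."""
    has_animal = "cat" in doc_terms or "dog" in doc_terms
    has_gear = "collar" in doc_terms or "leash" in doc_terms
    return has_animal and has_gear

# Enumerate all 2**4 presence/absence combinations of the four terms.
working, failing = [], []
for flags in product([True, False], repeat=len(TERMS)):
    doc_terms = {t for t, present in zip(TERMS, flags) if present}
    (working if matches(doc_terms) else failing).append(doc_terms)

print(len(working), "combinations satisfy the query")
print(len(failing), "combinations do not")
```

Of the sixteen possible combinations, nine satisfy the query and seven do not; a document containing only Cat and Dog, for instance, fails because neither Collar nor Leash is present.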

Boolean Logic
[Figure: two sets, A and B]

Boolean Queries
– Usually expressed as INFIX operators in IR
  – ((a AND b) OR (c AND b))
– NOT is UNARY PREFIX operator
  – ((a AND b) OR (c AND (NOT b)))
– AND and OR can be n-ary operators
  – (a AND b AND c AND d)
– Some rules (De Morgan revisited)
  – NOT(a) AND NOT(b) = NOT(a OR b)
  – NOT(a) OR NOT(b) = NOT(a AND b)
  – NOT(NOT(a)) = a
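
The three rules can be checked exhaustively, since each involves at most two Boolean variables. The sketch below does so in Python and then repeats the first rule on sets of document ids, where NOT is taken as the complement relative to the whole collection. The collection and the sets `A` and `B` are invented for illustration.

```python
from itertools import product

def check_identities():
    """Return True if the three rules hold for every truth assignment."""
    for a, b in product([False, True], repeat=2):
        if ((not a) and (not b)) != (not (a or b)):
            return False
        if ((not a) or (not b)) != (not (a and b)):
            return False
        if (not (not a)) != a:
            return False
    return True

print(check_identities())

# The same rule on result sets: NOT is the complement within the collection.
collection = set(range(1, 11))   # invented document ids 1..10
A = {1, 2, 3, 4}                 # documents containing term a
B = {3, 4, 5, 6}                 # documents containing term b
lhs = (collection - A) & (collection - B)   # NOT(a) AND NOT(b)
rhs = collection - (A | B)                  # NOT(a OR b)
print(lhs == rhs)
```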

Boolean Logic
[Figure: three term sets t1, t2, t3 over documents D1 to D11, partitioned into eight regions m1 to m8. Each region is a conjunction of t1, t2 and t3, with each term either plain or negated.]

Boolean Searching
“Measurement of the width of cracks in prestressed concrete beams”
Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete
[Figure labels: Cracks, Beams, Width measurement, Prestressed concrete]
Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)
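
The relaxed query accepts any document that contains at least three of the four concepts. A small Python sketch of that idea follows; the helper names `relaxed_query` and `satisfies` are hypothetical, and the single-letter terms stand for the four concepts as on the slide.

```python
from itertools import combinations

def relaxed_query(terms, k):
    """Build a Boolean query string requiring any k of the given terms."""
    groups = ["(" + " AND ".join(group) + ")" for group in combinations(terms, k)]
    return " OR ".join(groups)

def satisfies(doc_terms, terms, k):
    """True if the document contains at least k of the query terms."""
    return sum(1 for t in terms if t in doc_terms) >= k

# C = cracks, B = beams, W = width measurement, P = prestressed concrete
terms = ["C", "B", "W", "P"]
print(relaxed_query(terms, 3))
print(relaxed_query(terms, 4))
```

With `k` equal to the number of terms the function reproduces the formal query; with `k = 3` it yields the same four disjuncts as the relaxed query, in a different order.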

Pseudo-Boolean Queries
A new notation, from web search
– +cat dog +collar leash
Does not mean the same thing!
Need a way to group combinations.
Phrases:
– “stray cat” AND “frayed collar”
– +“stray cat” +“frayed collar”

[Figure labels: Information need, text input, Parse, Query, Rank, Index, Pre-process, Collections]

Result Sets
Run a query, get a result set
Two choices
– Reformulate query, run on entire collection
– Reformulate query, run on result set
Example: Dialog query
– (Redford AND Newman) -> S documents
– (S1 AND Sundance) -> S2 898 documents
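
The difference between the two choices is only the set of candidates the reformulated query is run against. A minimal Python sketch, using an invented four-document collection and a hypothetical `search` helper (this is not Dialog's actual command syntax):

```python
# Toy collection, invented for illustration: document id -> set of index terms.
docs = {
    1: {"redford", "newman", "sundance"},
    2: {"redford", "newman"},
    3: {"redford", "sundance"},
    4: {"newman"},
}

def search(term_list, within=None):
    """Return ids of documents containing every term, optionally restricted
    to a previous result set instead of the entire collection."""
    candidates = docs.keys() if within is None else within
    return {d for d in candidates if all(t in docs[d] for t in term_list)}

s1 = search(["redford", "newman"])        # run on the entire collection
s2 = search(["sundance"], within=s1)      # run on result set s1 only
print(sorted(s1), sorted(s2))
```

Document 3 contains "sundance" but is not in `s2`, because the second query only looked inside `s1`.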

[Figure labels: Information need, text input, Parse, Query, Reformulated Query, Rank, Re-Rank, Index, Pre-process, Collections]

Ordering of Retrieved Documents
Pure Boolean has no ordering
In practice:
– order chronologically
– order by total number of “hits” on query terms
  – What if one term has more hits than others?
  – Is it better to have one of each term or many of one term?
Fancier methods have been investigated
– p-norm is most famous
  – usually impractical to implement
  – usually hard for user to understand

Boolean
Advantages
– simple queries are easy to understand
– relatively easy to implement
Disadvantages
– difficult to specify what is wanted
– too much returned, or too little
– ordering not well determined
Dominant language in commercial systems until the WWW

Faceted Boolean Query
Strategy: break query into facets (polysemous with earlier meaning of facets)
– conjunction of disjunctions
  (a1 OR a2 OR a3) AND (b1 OR b2) AND (c1 OR c2 OR c3 OR c4)
– each facet expresses a topic
  (“rain forest” OR jungle OR amazon) AND (medicine OR remedy OR cure) AND (Smith OR Zhou)

Faceted Boolean Query
Query still fails if one facet missing
Alternative: Coordination level ranking
– Order results in terms of how many facets (disjuncts) are satisfied
– Also called Quorum ranking, Overlap ranking, and Best Match
Problem: Facets still undifferentiated
Alternative: assign weights to facets
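
Strict faceted matching and coordination level ranking differ only in what is done with the count of satisfied facets. The Python sketch below uses the three facets from the slide; the documents and all function names are invented for illustration, and each document is assumed to be a set of already-normalized terms and phrases.

```python
facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]

def facets_satisfied(doc_terms, facets):
    """Count how many facets have at least one of their terms in the document."""
    return sum(1 for facet in facets if facet & doc_terms)

def strict_match(doc_terms, facets):
    """Conjunction of disjunctions: every facet must be satisfied."""
    return facets_satisfied(doc_terms, facets) == len(facets)

def coordination_rank(docs, facets):
    """Order document ids by number of satisfied facets, best first.
    Documents that satisfy no facet are dropped; ties are broken by id."""
    scored = [(facets_satisfied(terms, facets), doc_id) for doc_id, terms in docs.items()]
    ordered = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for score, doc_id in ordered if score > 0]

docs = {
    "d1": {"jungle", "cure", "zhou"},
    "d2": {"amazon", "remedy"},
    "d3": {"smith"},
    "d4": {"weather"},
}
print([d for d, terms in docs.items() if strict_match(terms, facets)])
print(coordination_rank(docs, facets))
```

The strict query returns only `d1`, while the ranking also surfaces `d2` and `d3` below it. Weighting facets, the final alternative on the slide, would amount to replacing the count with a weighted sum.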

Proximity Searches
Proximity: terms occur within K positions of one another
– pen w/5 paper
A “Near” function can be more vague
– near(pen, paper)
Sometimes order can be specified
Also, Phrases and Collocations
– “United Nations” “Bill Clinton”
Phrase Variants
– “retrieval of information” “information retrieval”
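
A proximity operator needs token positions, not just term presence. The following Python sketch assumes that "within K positions" means the two token positions differ by at most K; real systems may count differently. The function names and the sample sentence are invented.

```python
def positions(tokens, term):
    """Return the list of positions at which term occurs in tokens."""
    return [i for i, tok in enumerate(tokens) if tok == term]

def within_k(tokens, a, b, k, ordered=False):
    """True if some occurrence of a and some occurrence of b are at most
    k positions apart; with ordered=True, a must come before b."""
    for i in positions(tokens, a):
        for j in positions(tokens, b):
            gap = j - i
            if ordered and gap <= 0:
                continue
            if 0 < abs(gap) <= k:
                return True
    return False

text = "she put the pen down next to the paper".split()
print(within_k(text, "pen", "paper", 5))                 # pen w/5 paper
print(within_k(text, "pen", "paper", 4))                 # tighter window
print(within_k(text, "paper", "pen", 5, ordered=True))   # order enforced
```

Here "pen" is at position 3 and "paper" at position 8, so the first check succeeds and the other two fail.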

Filters
Filters: Reduce set of candidate docs
Often specified simultaneously with query
Usually restrictions on metadata
– restrict by:
  – date range
  – internet domain (.edu .com .berkeley.edu)
  – author
  – size
  – limit number of documents returned
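
Because filters act on metadata rather than on content, they can be applied as simple predicates over each candidate record. A minimal Python sketch follows; the record fields, the sample records and the `apply_filters` helper are all assumptions made for illustration, and the author filter is omitted for brevity.

```python
from datetime import date

# Invented metadata records, for illustration only.
records = [
    {"id": 1, "date": date(2001, 3, 1), "host": "www.berkeley.edu", "size": 12000},
    {"id": 2, "date": date(1999, 7, 15), "host": "example.com", "size": 4000},
    {"id": 3, "date": date(2001, 8, 30), "host": "sims.berkeley.edu", "size": 90000},
]

def in_domain(host, domain):
    """True if host equals the domain or is a subdomain of it."""
    domain = domain.lstrip(".")
    return host == domain or host.endswith("." + domain)

def apply_filters(records, start=None, end=None, domain=None, max_size=None, limit=None):
    """Keep ids of records passing every supplied filter, then truncate to limit."""
    kept = []
    for rec in records:
        if start is not None and rec["date"] < start:
            continue
        if end is not None and rec["date"] > end:
            continue
        if domain is not None and not in_domain(rec["host"], domain):
            continue
        if max_size is not None and rec["size"] > max_size:
            continue
        kept.append(rec["id"])
    return kept if limit is None else kept[:limit]

print(apply_filters(records, start=date(2001, 1, 1), domain=".berkeley.edu"))
print(apply_filters(records, domain=".edu", limit=1))
```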

How are the texts handled?
– What happens if you take the words exactly as they appear in the original text?
– What about punctuation, capitalization, etc.?
– What about spelling errors?
– What about plural vs. singular forms of words?
– What about cases and declension in non-English languages?
– What about non-Roman alphabets?

Content Analysis
Automated transformation of raw text into a form that represents some aspect(s) of its meaning
Including, but not limited to:
– Automated Thesaurus Generation
– Phrase Detection
– Categorization
– Clustering
– Summarization

Techniques for Content Analysis
– Statistical
  – Single Document
  – Full Collection
– Linguistic
  – Syntactic
  – Semantic
  – Pragmatic
– Knowledge-Based (Artificial Intelligence)
– Hybrid (Combinations)

Text Processing
Standard Steps:
– Recognize document structure
  – titles, sections, paragraphs, etc.
– Break into tokens
  – usually space and punctuation delineated
  – special issues with Asian languages
– Stemming/morphological analysis
– Store in inverted index (to be discussed later)
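
The tokenizing and indexing steps can be sketched in a few lines of Python. This is a simplified illustration: it skips document-structure recognition, splits only on characters outside a-z and 0-9 (so it would not suit languages written without spaces), and leaves stemming as a pluggable `normalize` function. All names and sample documents are invented.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split on anything that is not a letter or digit."""
    return [tok for tok in re.split(r"[^a-z0-9]+", text.lower()) if tok]

def build_inverted_index(docs, normalize=lambda tok: tok):
    """Map each normalized token to the sorted ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for tok in tokenize(text):
            index[normalize(tok)].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {
    1: "The cat chased the dog.",
    2: "A dog's collar, and a leash!",
}
index = build_inverted_index(docs)
print(tokenize(docs[1]))
print(index["dog"])
```

A Boolean AND over two terms then becomes an intersection of their posting lists, and OR a union.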

[Figure labels: Information need, text input, Parse, Query, Rank, Index, Pre-process, Collections. Annotations: How is the query constructed? How is the text processed?]

Document Processing Steps

Stemming and Morphological Analysis
Goal: “normalize” similar words
Morphology (“form” of words)
– Inflectional Morphology
  – E.g., inflect verb endings and noun number
  – Never change grammatical class
    – dog, dogs
    – tengo, tienes, tiene, tenemos, tienen
– Derivational Morphology
  – Derive one word from another
  – Often change grammatical class
    – build, building; health, healthy

Automated Methods
Powerful multilingual tools exist for morphological analysis
– PCKimmo, Xerox Lexical technology
– Require a grammar and dictionary
– Use “two-level” automata
Stemmers:
– Very dumb rules work well (for English)
– Porter Stemmer: Iteratively remove suffixes
– Improvement: pass results through a lexicon
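
To give a feel for iterative suffix removal, here is a deliberately crude Python sketch. It is not the Porter algorithm: the suffix list, the minimum stem length and the function name `toy_stem` are all invented for illustration, and Porter's actual rules are considerably more elaborate.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # illustrative list only, not Porter's rules

def toy_stem(word, min_stem=3):
    """Repeatedly strip the first matching suffix while the remaining stem
    keeps at least min_stem characters."""
    changed = True
    while changed:
        changed = False
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
                word = word[: -len(suffix)]
                changed = True
                break
    return word

for w in ("dogs", "buildings", "builds", "news"):
    print(w, "->", toy_stem(w))
```

The sketch conflates "dogs" with "dog" and reduces "buildings" to "build" in two passes, but it also turns "news" into "new". A lexicon check on the result, as the slide suggests, is one way to catch that kind of error.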

Errors Generated by Porter Stemmer (Krovetz 93)

Next
– Statistical Properties of Text
– Preparing information for search: Lexical analysis
– Introduction to the Vector Space model of IR.