2011.02.02 - SLIDE 1 IS 240 – Spring 2011 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.

Presentation transcript:

SLIDE 1 IS 240 – Spring 2011 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval Lecture 5: Boolean and Extended Boolean

SLIDE 2 Today Review –IR Components –Inverted Files IR Models The Boolean Model Fuzzy sets, Rubric, P-norm, etc.

SLIDE 3 Structure of an IR System (adapted from Soergel, p. 19)
Search Line: Interest profiles & Queries; Formulating query in terms of descriptors; Storage of profiles; Store1: Profiles/Search requests
Storage Line: Documents & data; Indexing (Descriptive and Subject); Storage of Documents; Store2: Document representations
Rules of the game = Rules for subject indexing + Thesaurus (which consists of Lead-In Vocabulary and Indexing Language)
Comparison/Matching of Store1 against Store2 yields Potentially Relevant Documents
Together these form the Information Storage and Retrieval System

SLIDE 4 Document Processing Steps

SLIDE 5 Boolean Implementation: Inverted Files We will look at “Vector files” in detail later. But conceptually, an Inverted File is a vector file “inverted” so that rows become columns and columns become rows

SLIDE 6 How Are Inverted Files Created Documents are parsed to extract words (or stems) and these are saved with the Document ID.
Doc 1: Now is the time for all good men to come to the aid of their country
Doc 2: It was a dark and stormy night in the country manor. The time was past midnight

SLIDE 7 How Inverted Files are Created After all documents have been parsed, the inverted file is sorted

SLIDE 8 How Inverted Files are Created Multiple term entries for a single document are merged and frequency information added

SLIDE 9 Inverted Files The file is commonly split into a Dictionary and a Postings file
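
The four steps described so far (parse, sort, merge with frequencies, split into dictionary and postings) can be sketched in a few lines of Python. This is only an illustration: the function name `build_inverted_file`, the regular-expression tokenizer, and the in-memory dict layout are assumptions, not the layout of any particular system.

```python
import re
from collections import Counter, defaultdict

def build_inverted_file(docs):
    """docs maps a document ID to its text; returns (dictionary, postings)."""
    # Step 1: parse each document into (term, doc_id) pairs.
    pairs = []
    for doc_id, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            pairs.append((term, doc_id))
    # Step 2: sort the pairs by term, then by document ID.
    pairs.sort()
    # Step 3: merge repeated (term, doc_id) entries and record the frequency.
    counts = Counter(pairs)
    postings = defaultdict(list)
    for (term, doc_id), freq in sorted(counts.items()):
        postings[term].append((doc_id, freq))
    # Step 4: split into a dictionary (term -> document count) and postings.
    dictionary = {term: len(plist) for term, plist in postings.items()}
    return dictionary, dict(postings)

docs = {
    1: "Now is the time for all good men to come to the aid of their country",
    2: "It was a dark and stormy night in the country manor. The time was past midnight",
}
dictionary, postings = build_inverted_file(docs)
print(postings["country"])  # [(1, 1), (2, 1)]
print(postings["manor"])    # [(2, 1)]
```

Each postings entry here is a (document ID, within-document frequency) pair; a real system would typically keep the postings on disk and the dictionary in a searchable index.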

SLIDE 10 Inverted files Permit fast search for individual terms. The search result for each term is a list of document IDs (and optionally, frequency and/or positional information). These lists can be used to solve Boolean queries: –country: d1, d2 –manor: d2 –country and manor: d2

SLIDE 11 Inverted Files Lots of alternative implementations –E.g.: Cheshire builds within-document frequency using a hash table during document parsing. Then Document IDs and frequency info are stored in a BerkeleyDB B-tree index keyed by the term.

SLIDE 12 Btree (conceptual) [Figure: B-tree with index nodes holding the keys B | D | F, F | P | Z, H | L | P and R | S | Z, and leaf entries Aces, Boilers, Cars, Devils, Flyers, Hawkeyes, Hoosiers, Minors, Panthers, Seminoles]

SLIDE 13 Btree with Postings [Figure: the same B-tree, with index nodes B | D | F, F | P | Z, H | L | P and R | S | Z and leaf entries Aces, Boilers, Cars, Devils, Flyers, Hawkeyes, Hoosiers, Minors, Panthers, Seminoles; postings lists such as 2,4,8,12 and 5,7,200 and 8,120 are attached to leaf entries]

SLIDE 14 Inverted files Permit fast search for individual terms. The search result for each term is a list of document IDs (and optionally, frequency and/or positional information). These lists can be used to solve Boolean queries: –country: d1, d2 –manor: d2 –country and manor: d2

SLIDE 15 Today Review –IR Components –Inverted Files IR Models The Boolean Model Fuzzy sets, Rubric, P-norm, etc.

SLIDE 16 IR Models Set Theoretic Models –Boolean –Fuzzy –Extended Boolean Vector Models (Algebraic) Probabilistic Models (probabilistic) Others (e.g., neural networks, etc.)

SLIDE 17 Boolean Model for IR Based on Boolean Logic (Algebra of Sets). Fundamental principles established by George Boole in the 1850’s Deals with set membership and operations on sets Set membership in IR systems is usually based on whether (or not) a document contains a keyword (term)

SLIDE 18 Boolean Operations on Sets Intersection – Boolean ‘AND’ Union – Boolean ‘OR’ Negation – Boolean ‘NOT’ –Usually means “AND NOT” in IR Exclusive OR – ‘XOR’ – seldom used, –Instead
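
As a small illustration, the four operations map directly onto Python's built-in set operators. The two sets below reuse the postings from the inverted-file example (country: d1, d2; manor: d2); the variable names are only for illustration.

```python
# Document-ID sets for two terms, taken from the inverted-file example.
country = {1, 2}
manor = {2}

both = country & manor          # intersection: Boolean AND
either = country | manor        # union: Boolean OR
country_only = country - manor  # difference: "AND NOT" as used in IR
exactly_one = country ^ manor   # symmetric difference: XOR

print(both, either, country_only, exactly_one)  # {2} {1, 2} {1} {1}
```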

SLIDE 19 Boolean Logic [Figure: Venn diagram of two sets, A and B]

SLIDE 20 Query Languages A way to express the query (formal expression of the information need) Types: –Boolean –Natural Language –Stylized Natural Language –Form-Based (GUI)

SLIDE 21 Simple query language: Boolean Terms + Connectors –terms words normalized (stemmed) words phrases thesaurus terms –connectors AND OR NOT –parentheses (for grouping operations)

SLIDE 22 Boolean Queries Cat Cat OR Dog Cat AND Dog (Cat AND Dog) (Cat AND Dog) OR Collar (Cat AND Dog) OR (Collar AND Leash) (Cat OR Dog) AND (Collar OR Leash)

SLIDE 23 Boolean Queries (Cat OR Dog) AND (Collar OR Leash) –Each of the following combinations works:

SLIDE 24 Boolean Queries (Cat OR Dog) AND (Collar OR Leash) –None of the following combinations works:

SLIDE 25 Boolean Queries Usually expressed as INFIX operators in IR –((a AND b) OR (c AND b)) NOT is UNARY PREFIX operator –((a AND b) OR (c AND (NOT b))) AND and OR can be n-ary operators –(a AND b AND c AND d) Some rules - (De Morgan revisited) –NOT(a) AND NOT(b) = NOT(a OR b) –NOT(a) OR NOT(b) = NOT(a AND b) –NOT(NOT(a)) = a
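
The three rewrite rules can be checked mechanically by trying every truth assignment. The sketch below does that; the function name `identities_hold` is made up for this example.

```python
from itertools import product

def identities_hold():
    """Check the three rewrite rules for every truth assignment of a and b."""
    return all(
        ((not a) and (not b)) == (not (a or b))       # NOT(a) AND NOT(b) = NOT(a OR b)
        and ((not a) or (not b)) == (not (a and b))   # NOT(a) OR NOT(b) = NOT(a AND b)
        and (not (not a)) == a                        # NOT(NOT(a)) = a
        for a, b in product([False, True], repeat=2)
    )

print(identities_hold())  # True
```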

SLIDE 26 Boolean Searching Formal Query: cracks AND beams AND Width_measurement AND Prestressed_concrete Facets: Cracks (C), Beams (B), Width measurement (W), Prestressed concrete (P) Relaxed Query: (C AND B AND P) OR (C AND B AND W) OR (C AND W AND P) OR (B AND W AND P)

SLIDE 27 Boolean Logic [Figure: Venn diagram of three terms t1, t2, t3 with documents D1 to D11 placed in the eight minterm regions m1 to m8; each minterm is a conjunction of t1, t2 and t3, each either plain or negated]

SLIDE 28 Precedence Ordering In what order do we evaluate the components of the Boolean expression? –Parentheses get done first: (a or b) and (c or d); (a or (b and c) or d) –Usually start from the left and work right (in case of ties) –Usually (if there are no parentheses) NOT before AND, AND before OR
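
One way to make these precedence rules concrete is a small recursive-descent evaluator with one function per precedence level. This is a minimal sketch under stated assumptions: the names `tokenize` and `evaluate` are hypothetical, a document is represented simply as a set of lowercase terms, and malformed queries are not handled.

```python
import re

def tokenize(query):
    # Split the query into parentheses and words.
    return re.findall(r"\(|\)|[^\s()]+", query)

def evaluate(query, doc_terms):
    """Return True if the set doc_terms satisfies the Boolean query."""
    tokens = tokenize(query)
    pos = 0

    def peek():
        return tokens[pos].upper() if pos < len(tokens) else None

    def advance():
        nonlocal pos
        pos += 1

    def parse_or():                 # lowest precedence
        value = parse_and()
        while peek() == "OR":
            advance()
            rhs = parse_and()       # always parse the right side, then combine
            value = value or rhs
        return value

    def parse_and():                # binds tighter than OR
        value = parse_not()
        while peek() == "AND":
            advance()
            rhs = parse_not()
            value = value and rhs
        return value

    def parse_not():                # binds tighter than AND
        if peek() == "NOT":
            advance()
            return not parse_not()
        return parse_atom()

    def parse_atom():               # a term or a parenthesized expression
        token = tokens[pos]
        advance()
        if token == "(":
            value = parse_or()
            advance()               # skip the closing parenthesis
            return value
        return token.lower() in doc_terms

    return parse_or()

print(evaluate("a OR b AND c", {"a"}))    # True: read as a OR (b AND c)
print(evaluate("(a OR b) AND c", {"a"}))  # False: parentheses override
```

Operators of equal precedence are combined left to right, matching the "start from the left" rule.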

SLIDE 29 Faceted Boolean Query Strategy: break query into facets (polysemous with earlier meaning of facets) –conjunction of disjunctions: (a1 OR a2 OR a3) AND (b1 OR b2) AND (c1 OR c2 OR c3 OR c4) –each facet expresses a topic: (“rain forest” OR jungle OR amazon) AND (medicine OR remedy OR cure) AND (Smith OR Zhou)

SLIDE 30 Ordering of Retrieved Documents Pure Boolean has no ordering In practice: –order chronologically –order by total number of “hits” on query terms What if one term has more hits than others? Is it better to have one of each term or many of one term? Fancier methods have been investigated –p-norm is most famous usually impractical to implement usually hard for user to understand

SLIDE 31 Faceted Boolean Query Query still fails if one facet missing Alternative: –Coordination level ranking –Order results in terms of how many facets (disjuncts) are satisfied –Also called Quorum ranking, Overlap ranking, and Best Match Problem: Facets still undifferentiated Alternative: –Assign weights to facets
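
Coordination level ranking can be sketched as counting, for each document, how many facets have at least one matching term. The facets below come from the rain forest example; the four documents and the function name `coordination_level` are invented for illustration.

```python
def coordination_level(facets, doc_terms):
    """Count how many facets (OR-groups) share at least one term with the document."""
    return sum(1 for facet in facets if facet & doc_terms)

facets = [
    {"rain forest", "jungle", "amazon"},
    {"medicine", "remedy", "cure"},
    {"smith", "zhou"},
]
docs = {  # hypothetical documents, already reduced to sets of index terms
    "d1": {"jungle", "cure", "zhou"},
    "d2": {"amazon", "remedy"},
    "d3": {"smith"},
    "d4": {"river"},
}
ranked = sorted(docs, key=lambda d: coordination_level(facets, docs[d]), reverse=True)
print(ranked)  # ['d1', 'd2', 'd3', 'd4']
```

A strict faceted Boolean query would return only d1; the ranked version also surfaces d2 and d3 as partial matches. Weighting facets, the second alternative, would amount to summing a per-facet weight instead of 1.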

SLIDE 32 Boolean Processing Boolean Processing (classic Boolean) –Data structures for Query representation and Boolean Operations Boolean processing logic and algorithms Extended Boolean Models –Fuzzy Logic –Others

SLIDE 33 Boolean Processing All processing takes place on postings lists Different methods can be used for sorted or unsorted postings lists

SLIDE 34 Boolean Query Processing The query must be parsed to determine what the –Search Words –Optional field or index qualifications –Boolean Operators are and how they relate to one another Typical parsing uses lexical analysers (like lex or flex) along with parser generators like YACC, BISON or LLgen –These produce code to be compiled into programs. –Example…

SLIDE 35 Z39.50 Query Structure (ASN.1 Notation)
-- Query Definitions
Query ::= CHOICE{
    type-0   [0]   ANY,
    type-1   [1]   IMPLICIT RPNQuery,
    type-2   [2]   OCTET STRING,
    type-100 [100] OCTET STRING,
    type-101 [101] IMPLICIT RPNQuery,
    type-102 [102] OCTET STRING}

SLIDE 36 Z39.50 RPN Query (ASN.1 Notation)
-- Definitions for RPN query
RPNQuery ::= SEQUENCE{
    attributeSet AttributeSetId,
    rpn          RPNStructure}

SLIDE 37 RPN Structure
RPNStructure ::= CHOICE{
    op       [0] Operand,
    rpnRpnOp [1] IMPLICIT SEQUENCE{
        rpn1 RPNStructure,
        rpn2 RPNStructure,
        op   Operator }
    }

SLIDE 38 Operand
Operand ::= CHOICE{
    attrTerm   AttributesPlusTerm,
    resultSet  ResultSetId,
    -- If version 2 is in force:
    --  - If query type is 1, one of the above two must be chosen;
    --  - resultAttr (below) may be used only if query type is 101.
    resultAttr ResultSetPlusAttributes}

SLIDE 39 Operator
Operator ::= [46] CHOICE{
    and     [0] IMPLICIT NULL,
    or      [1] IMPLICIT NULL,
    and-not [2] IMPLICIT NULL,
    -- If version 2 is in force:
    --  - For query type 1, one of the above three must be chosen;
    --  - prox (below) may be used only if query type is 101.
    prox    [3] IMPLICIT ProximityOperator}

SLIDE 40 Parse Result (Query Tree) Z39.50 queries… Query: Title XXX and Subject YYY
Operator: AND
    left Operand: Index = Title, Value = XXX
    right Operand: Index = Subject, Value = YYY

SLIDE 41 Parse Results Query: Subject XXX and (title yyy and author zzz)
Op: AND
    Oper: Index: Subject, Value: XXX
    Oper: Index: Title, Value: YYY
    Oper: Index: Author, Value: ZZZ
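
A parsed query of this shape can be held in a small tree and evaluated recursively. The sketch below loosely mirrors the RPNStructure definition (a node is either an operand or an operator with two sub-trees). The class and field names, and the postings in `indexes`, are assumptions made for the example and are not part of Z39.50.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Operand:
    index: str   # e.g. "title", "subject", "author"
    value: str   # the search term

@dataclass
class Operator:
    op: str          # "and", "or", or "and-not"
    left: "Node"
    right: "Node"

Node = Union[Operand, Operator]

def evaluate(node, indexes):
    """indexes maps index name -> term -> set of document IDs."""
    if isinstance(node, Operand):
        return indexes.get(node.index, {}).get(node.value, set())
    left = evaluate(node.left, indexes)
    right = evaluate(node.right, indexes)
    if node.op == "and":
        return left & right
    if node.op == "or":
        return left | right
    if node.op == "and-not":
        return left - right
    raise ValueError(f"unknown operator: {node.op}")

# Subject XXX and (title yyy and author zzz)
tree = Operator(
    "and",
    Operand("subject", "xxx"),
    Operator("and", Operand("title", "yyy"), Operand("author", "zzz")),
)
indexes = {  # hypothetical postings, one inverted index per field
    "subject": {"xxx": {1, 2, 3}},
    "title": {"yyy": {2, 3, 4}},
    "author": {"zzz": {3, 5}},
}
print(evaluate(tree, indexes))  # {3}
```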

SLIDE 42 Boolean AND (Sorted) Algorithm Choose the shortest list (why?) Create new list the same length as the short list –For each item in the short list, compare it with the next item in the longer list –If greater than – go to next item in longer list –If equal – add to new list and go to next item in both lists –If less than – go to next item in short list
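
The merge described above can be sketched as a two-pointer walk over both sorted lists. The function name `intersect_sorted` is made up for this example, and the sample lists are borrowed from the postings shown on the B-tree slide.

```python
def intersect_sorted(short, long):
    """AND of two ascending lists of document IDs."""
    result = []
    i = j = 0
    while i < len(short) and j < len(long):
        if short[i] == long[j]:      # equal: keep it, advance both lists
            result.append(short[i])
            i += 1
            j += 1
        elif short[i] > long[j]:     # greater than: advance the longer list
            j += 1
        else:                        # less than: advance the short list
            i += 1
    return result

print(intersect_sorted([8, 120], [2, 4, 8, 12]))  # [8]
```

One possible answer to the slide's "why?": the result can never be longer than the shorter list, so that list bounds the size of the output.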

SLIDE 43 Boolean AND Algorithm [Figure: worked example of two postings lists combined with AND]

SLIDE 44 Boolean OR (Sorted) Algorithm Choose the longer list Create new list the same length as both lists combined –For each item in the longer list: if less than or equal to the first item in the short list – add to new list; otherwise – add item from short list –Compare next items in short and long lists: if long item less than short item – add long item and go to next long item; otherwise – add from short list and go to next short item –Once the short list runs out, add the remaining items in the long list
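
A compact way to express the same merge is the two-pointer loop below. It is a sketch rather than a line-by-line transcription of the slide: `union_sorted` is a hypothetical name, and items present in both lists are emitted once.

```python
def union_sorted(a, b):
    """OR of two ascending lists of document IDs, without duplicates."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:         # in both lists: emit once, advance both
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:        # smaller item comes from a
            result.append(a[i])
            i += 1
        else:                    # smaller item comes from b
            result.append(b[j])
            j += 1
    result.extend(a[i:])         # one list has run out: copy the rest
    result.extend(b[j:])
    return result

print(union_sorted([2, 4, 8, 12], [5, 7, 200]))  # [2, 4, 5, 7, 8, 12, 200]
```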

SLIDE 45 Boolean OR Algorithm [Figure: worked example of two postings lists combined with OR]

SLIDE 46 Boolean AND NOT (Sorted) Algorithm Create new list the same length as the left-hand list –For each item in the left-hand list, compare it with the next item in the NOT list –If less than – add to new list and go to next item in left-hand list –If equal – go to next item in both lists –If greater than – go to next item in NOT list –Once the NOT list runs out, add the remaining items in the left-hand list
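
A minimal sketch of AND NOT over sorted lists follows. The name `difference_sorted` is hypothetical; a left-hand item is kept when the NOT list has run out or its current item is already larger, since in both cases the left-hand item cannot appear in the NOT list.

```python
def difference_sorted(left, excluded):
    """left AND NOT excluded, for two ascending lists of document IDs."""
    result = []
    i = j = 0
    while i < len(left):
        if j == len(excluded) or left[i] < excluded[j]:
            result.append(left[i])   # not in the NOT list: keep it
            i += 1
        elif left[i] == excluded[j]:
            i += 1                   # excluded: skip it in both lists
            j += 1
        else:
            j += 1                   # NOT-list item is smaller: advance it
    return result

print(difference_sorted([2, 4, 8, 12], [8, 120]))  # [2, 4, 12]
```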

SLIDE 47 Boolean AND NOT Algorithm [Figure: worked example of two postings lists combined with AND NOT]

SLIDE 48 Hashed Boolean AND (unsorted) Put each item in shortest list into hash table –For each item in other lists If hash entry exists, set flag in hash table entry (or increment counter) Scan hash table contents –If flag set (or counter == number of lists) add to new list

SLIDE 49 Hashed Boolean OR (unsorted) Put each item in EACH list into hash table –If match increment counter (optional) Scan hash table contents and add to new list

SLIDE 50 Hashed Boolean AND NOT (unsorted) Put each item in left-hand list into hash table –For each item in NOT list If hash entry exists, remove it Scan hash table contents and add to new list
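
The three hashed variants can be sketched with Python dicts standing in for the hash table. The function names are made up for this example, and the code assumes each postings list contains a given document ID at most once (the `set()` call in the AND case guards against duplicates in the other lists).

```python
def hashed_and(lists):
    """AND of several unsorted lists, using the counter variant."""
    k = min(range(len(lists)), key=lambda i: len(lists[i]))  # shortest list
    counts = dict.fromkeys(lists[k], 1)
    for i, other in enumerate(lists):
        if i == k:
            continue
        for doc_id in set(other):
            if doc_id in counts:
                counts[doc_id] += 1
    # Keep only the IDs that were seen in every list.
    return [d for d, c in counts.items() if c == len(lists)]

def hashed_or(lists):
    """OR of several unsorted lists; the counter records how many lists matched."""
    counts = {}
    for plist in lists:
        for doc_id in plist:
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return list(counts)

def hashed_and_not(left, excluded):
    """left AND NOT excluded for unsorted lists."""
    table = dict.fromkeys(left)
    for doc_id in excluded:
        table.pop(doc_id, None)   # remove the entry if it exists
    return list(table)

print(sorted(hashed_and([[12, 2, 8, 4], [120, 8]])))      # [8]
print(sorted(hashed_or([[12, 2, 8, 4], [120, 8]])))       # [2, 4, 8, 12, 120]
print(sorted(hashed_and_not([12, 2, 8, 4], [120, 8])))    # [2, 4, 12]
```

The results come back in hash-table order, so they are sorted here only for display.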

SLIDE 51 Boolean Summary Advantages –simple queries are easy to understand –relatively easy to implement Disadvantages –difficult to specify what is wanted, particularly in complex situations –too much returned, or too little –ordering not well determined Dominant IR model in commercial systems until the WWW

SLIDE 52 Basic Concepts for Extended Boolean Instead of binary values, terms in documents and queries have a weight (importance or some other statistical property) Instead of binary set membership, sets are “fuzzy” and the weights are used to determine degree of membership. Degree of set membership can be used to rank the results of a query

SLIDE 53 Fuzzy Sets Introduced by Zadeh in 1965. If set {A} has value v(A) and {B} has value v(B), where 0 ≤ v ≤ 1: v(A ∩ B) = min(v(A), v(B)) v(A ∪ B) = max(v(A), v(B)) v(~A) = 1 − v(A)

SLIDE 54 Fuzzy Sets If we have three documents and three terms… –D1 = (.4, .2, 1), D2 = (0, 0, .8), D3 = (.7, .4, 0) For search: t1 ∨ t2 ∨ t3 v(D1) = max(.4, .2, 1) = 1 v(D2) = max(0, 0, .8) = .8 v(D3) = max(.7, .4, 0) = .7 For search: t1 ∧ t2 ∧ t3 v(D1) = min(.4, .2, 1) = .2 v(D2) = min(0, 0, .8) = 0 v(D3) = min(.7, .4, 0) = 0
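
The same arithmetic can be reproduced with a few lines of Python. The helper names `fuzzy_or`, `fuzzy_and` and `fuzzy_not` are made up for this sketch; the document weights are the ones given above.

```python
docs = {
    "D1": (0.4, 0.2, 1.0),
    "D2": (0.0, 0.0, 0.8),
    "D3": (0.7, 0.4, 0.0),
}

def fuzzy_or(weights):
    return max(weights)      # membership in a union

def fuzzy_and(weights):
    return min(weights)      # membership in an intersection

def fuzzy_not(weight):
    return 1.0 - weight      # membership in the complement

or_scores = {name: fuzzy_or(w) for name, w in docs.items()}
and_scores = {name: fuzzy_and(w) for name, w in docs.items()}
ranking = sorted(or_scores, key=or_scores.get, reverse=True)

print(or_scores)   # {'D1': 1.0, 'D2': 0.8, 'D3': 0.7}
print(and_scores)  # {'D1': 0.2, 'D2': 0.0, 'D3': 0.0}
print(ranking)     # ['D1', 'D2', 'D3']
```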

SLIDE 55 Fuzzy Sets Fuzzy set membership of term to document is f(A) ∈ [0,1] D1 = {(mesons, .8), (scattering, .4)} D2 = {(mesons, .5), (scattering, .6)} Query = MESONS AND SCATTERING RSV(D1) = MIN(.8, .4) = .4 RSV(D2) = MIN(.5, .6) = .5 D2 is ranked before D1 in the result set.

SLIDE 56 Fuzzy Sets The set membership function can be, for example, the relative term frequency within a document, the IDF or any other function providing weights to terms This means that the fuzzy methods use sets of criteria for term weighting that are the same or similar to those used in other ranked retrieval methods (e.g., vector and probabilistic methods)

SLIDE 57 Robertson’s Critique of Fuzzy Sets D1 = {(mesons, .4), (scattering, .4)} D2 = {(mesons, .39), (scattering, .99)} Query = MESONS AND SCATTERING RSV(D1) = MIN(.4, .4) = .4 RSV(D2) = MIN(.39, .99) = .39 However, consistent with the Boolean model: –Query = t1 ∧ t2 ∧ t3 ∧ … ∧ t100 –If D not indexed by t1 then it fails, even if D is indexed by t2,…,t100

SLIDE 58 Robertson’s critique of Fuzzy Fuzzy sets suffer from nearly the same lack of discrimination among retrieval results as standard Boolean. The rank of a document depends entirely on the lowest weighted term in an AND operation, or the highest weighted term in an OR operation

SLIDE 59 Other Fuzzy Approaches As described in the Modern Information Retrieval (optional) text, a keyword correlation matrix can be used to determine set membership values, and algebraic sums and products can be used in place of MAX and MIN Not clear how this approach works in real applications (or in tests like TREC) because the testing has been on a small scale

SLIDE 60 Extended Boolean (P-Norm) Ed Fox’s Dissertation work with Salton Basic notion is that terms in a Boolean query, and the Boolean Operators themselves, can have weights assigned to them Binary weights mean that queries behave like standard Boolean 0 < Weights < 1 mean that queries behave like a ranking system The system requires similarity measures
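
The slide does not give the p-norm formulas. The sketch below uses the form usually attributed to Salton, Fox and Wu's extended Boolean model, so treat it as an assumption rather than the lecture's own definition; the function names and the choice of equal query weights are also made up for the example. It scores the two documents from Robertson's critique with p = 2.

```python
def pnorm_or(doc_weights, query_weights, p):
    """Assumed p-norm OR: weighted power mean of the document term weights."""
    num = sum((a * d) ** p for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return (num / den) ** (1.0 / p)

def pnorm_and(doc_weights, query_weights, p):
    """Assumed p-norm AND: one minus the weighted power mean of the shortfalls."""
    num = sum((a * (1.0 - d)) ** p for a, d in zip(query_weights, doc_weights))
    den = sum(a ** p for a in query_weights)
    return 1.0 - (num / den) ** (1.0 / p)

query = (1.0, 1.0)        # MESONS AND SCATTERING, equal term weights
d1 = (0.40, 0.40)
d2 = (0.39, 0.99)

score_d1 = pnorm_and(d1, query, p=2)
score_d2 = pnorm_and(d2, query, p=2)
print(round(score_d1, 3), round(score_d2, 3))
print(score_d2 > score_d1)  # True
```

Under fuzzy MIN, D1 (.4) outranks D2 (.39); with p = 2 the strong scattering weight in D2 is taken into account and the order reverses. With p = 1 the AND and OR scores coincide (a plain weighted average), and as p grows they approach MIN and MAX.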

SLIDE 61 Probabilistic Inclusion of Boolean Most probabilistic models attempt to predict the probability that, given a particular query Q and document D, the searcher would find D relevant If we assume that Boolean criteria are to be ANDed with a probabilistic query…

SLIDE 62 Rubric – Extended Boolean Scans full text of documents and stores them User develops a hierarchy of concepts which becomes the query Leaf nodes of the hierarchy are combinations of text patterns A “fuzzy calculus” is used to propagate values obtained at leaves up through the hierarchy to obtain a single retrieval status value (or “relevance” value) RUBRIC returns a ranked list of documents in descending order of “relevance” values.

SLIDE 63 RUBRIC Rules for Concepts & Weights
Team | event => World_Series
St._Louis_Cardinals | Milwaukee_Brewers => Team
“Cardinals” => St._Louis_Cardinals (0.7)
Cardinals_full_name => St._Louis_Cardinals (0.9)
Saint & “Louis” & “Cardinals” => Cardinals_full_name
“St.” => Saint (0.9)
“Saint” => Saint
“Brewers” => Milwaukee_Brewers (0.5)

SLIDE 64 RUBRIC Rules for Concepts & Weights
“Milwaukee Brewers” => Milwaukee_Brewers (0.9)
“World Series” => event
Baseball_championship => event (0.9)
Baseball & Championship => Baseball_championship
“ball” => Baseball (0.5)
“baseball” => Baseball
“championship” => Championship (0.7)

SLIDE 65 RUBRIC combination methods V(V1 or V2) = MAX(V1, V2) V(V1 and V2) = MIN(V1, V2) i.e., classic fuzzy matching, but with the addition… V(level n) = Cn * V(level n-1)

SLIDE 66 Rule Evaluation Tree
World_Series (0)
    Event (0)
        “World Series”
        Baseball_championship (0)
            Baseball (0)
                “baseball” (0)
                “ball” (0)
            Championship (0)
                “championship” (0)
    Team (0)
        St._Louis_Cardinals (0)
            “Cardinals” (0)
            Cardinals_full_name (0)
                Saint (0)
                    “Saint” (0)
                    “St.” (0)
                “Louis” (0)
                “Cardinals” (0)
        Milwaukee_brewers (0)
            “Milwaukee Brewers” (0)
            “Brewers” (0)

SLIDE 67 Rule Evaluation Tree Document containing “ball”, “baseball” & “championship”: “baseball” (1.0), “championship” (1.0), “ball” (1.0); all other nodes (0)

SLIDE 68 Rule Evaluation Tree Baseball (1.0), Championship (0.7); “baseball” (1.0), “championship” (1.0), “ball” (1.0); all other nodes (0)

SLIDE 69 Rule Evaluation Tree Baseball_championship (0.7); Baseball (1.0), Championship (0.7); “baseball” (1.0), “championship” (1.0), “ball” (1.0); all other nodes (0)

SLIDE 70 Rule Evaluation Tree Event (0.63); Baseball_championship (0.7); Baseball (1.0), Championship (0.7); “baseball” (1.0), “championship” (1.0), “ball” (1.0); all other nodes (0)

SLIDE 71 Rule Evaluation Tree World_Series (0.63); Event (0.63); Baseball_championship (0.7); Baseball (1.0), Championship (0.7); “baseball” (1.0), “championship” (1.0), “ball” (1.0); all other nodes (0)
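
The propagation shown in these trees can be reproduced with a short recursive evaluator over the World Series rules. This is a sketch, not RUBRIC's actual implementation. It assumes that a rule without a stated weight has weight 1.0 and that several rules for the same concept are combined with MAX, which matches the values in the trees; the names `RULES` and `evaluate` are made up, and the document is represented simply as the set of text patterns found in it.

```python
# (antecedents, operator, consequent, weight); quoted names are text patterns.
RULES = [
    (["Team", "event"], "or", "World_Series", 1.0),
    (["St._Louis_Cardinals", "Milwaukee_Brewers"], "or", "Team", 1.0),
    (['"Cardinals"'], "or", "St._Louis_Cardinals", 0.7),
    (["Cardinals_full_name"], "or", "St._Louis_Cardinals", 0.9),
    (["Saint", '"Louis"', '"Cardinals"'], "and", "Cardinals_full_name", 1.0),
    (['"St."'], "or", "Saint", 0.9),
    (['"Saint"'], "or", "Saint", 1.0),
    (['"Brewers"'], "or", "Milwaukee_Brewers", 0.5),
    (['"Milwaukee Brewers"'], "or", "Milwaukee_Brewers", 0.9),
    (['"World Series"'], "or", "event", 1.0),
    (["Baseball_championship"], "or", "event", 0.9),
    (["Baseball", "Championship"], "and", "Baseball_championship", 1.0),
    (['"ball"'], "or", "Baseball", 0.5),
    (['"baseball"'], "or", "Baseball", 1.0),
    (['"championship"'], "or", "Championship", 0.7),
]

def evaluate(concept, found):
    """Return the value of a concept given the set of text patterns found."""
    if concept.startswith('"'):              # leaf: a text pattern
        return 1.0 if concept in found else 0.0
    best = 0.0
    for antecedents, op, consequent, weight in RULES:
        if consequent != concept:
            continue
        values = [evaluate(a, found) for a in antecedents]
        combined = min(values) if op == "and" else max(values)
        best = max(best, weight * combined)  # V(level n) = Cn * V(level n-1)
    return best

found = {'"ball"', '"baseball"', '"championship"'}
print(round(evaluate("Baseball_championship", found), 2))  # 0.7
print(round(evaluate("World_Series", found), 2))           # 0.63
```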

SLIDE 72 RUBRIC Terrorism Query [Figure: concept hierarchy rooted at Terrorism, with nodes Event, Actor, Effect, Reason, Takeover, Killing, Bombing, Encounter, Device, Explosion, Slaying, Shooting, Specific actor, General actor, Kidnapping, Ransom, Kidnap event]

SLIDE 73 Non-Boolean IR Need to measure some similarity between the query and the document The basic notion is that documents that are somehow similar to a query are likely to be relevant responses for that query We will revisit this notion and see how the Language Modelling approach to IR has taken it to a new level

SLIDE 74 Similarity Measures (Set-based) Assuming that Q and D are the sets of terms associated with a Query and Document: Simple matching (coordination level match) Dice’s Coefficient Jaccard’s Coefficient Cosine Coefficient Overlap Coefficient
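
The formulas themselves are not present in this transcript. The sketch below uses the commonly cited set-based definitions of the five measures, which should be read as an assumption about what the slide showed; the function names and the example sets are made up, and both sets are assumed to be non-empty.

```python
from math import sqrt

def simple_matching(q, d):
    return len(q & d)                             # |Q ∩ D|

def dice(q, d):
    return 2 * len(q & d) / (len(q) + len(d))     # 2|Q ∩ D| / (|Q| + |D|)

def jaccard(q, d):
    return len(q & d) / len(q | d)                # |Q ∩ D| / |Q ∪ D|

def cosine(q, d):
    return len(q & d) / sqrt(len(q) * len(d))     # |Q ∩ D| / sqrt(|Q| * |D|)

def overlap(q, d):
    return len(q & d) / min(len(q), len(d))       # |Q ∩ D| / min(|Q|, |D|)

q = {"cat", "dog"}
d = {"cat", "collar", "leash"}
print(simple_matching(q, d), dice(q, d), jaccard(q, d), overlap(q, d))  # 1 0.4 0.25 0.5
print(round(cosine(q, d), 3))                                           # 0.408
```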

SLIDE 75 Next Week Moving beyond Boole… The vector space model