CMPS 561 Boolean Retrieval

CMPS 561 Boolean Retrieval
Ryan Benton August 30, 2010

Agenda Indices IR System Models Processing Boolean Query
Algorithms for Intersection

Indices

Indices Question: How do we store documents and terms such that we can retrieve documents Efficiently Effectively With reasonable space requirements?

Term-Document Matrix Create table Official Name: Rows: Terms
Columns: Document Ids Official Name: Term-Document Incidence Matrix Also called: Inverted View of Collection What are terms: indexed Units (can think as words, but not always).

Term-Document Incidence Matrix
Record 1 Record 2 Record 3 Record 4 Term 1 1 Term 2 Term 3 Term 4

Term-Document Matrix Why
Rows  Vectors of documents containing term X. Columns  Vectors of terms contained by document Y. Technically, take the transpose of Matrix to get the columns vectors. What are terms: indexed Units (can think as words, but not always).

Document-Term Incidence Matrix
Record 1 1 Record 2 Record 3 Record 4 Note, some people would prefer this view instead. However, the term-Document is useful for other reasons  to be seen.

Term-Document Matrix Naïve Way of Building Some Calculations
Create and store the Matrix Some Calculations 500,000 Terms 1,000,000 Documents ½ trillion entries : 500,000,000,00 All 0’s and 1’s. Memory Impact As documents and/or term list grows Can’t keep in memory

Term-Document Matrix Observation: Term-Document Matrix  Sparse
Typically, only a small number of terms in any given document. If typical document contains 1,000 terms Matrix, in previous example, has 1 billion 1’s 1,000,000,000 Thus, 99.8% of matrix has 0’s.

Inverted Index Also called: Inverted File Dictionary of Terms
Vocabulary Lexicon Each term List of documents in which it appears. Each document sometimes called a posting.

Term-Document Incidence Matrix
Record 1 Record 2 Record 3 Record 4 Term 1 1 Term 2 Term 3 Term 4

Inverted Index Term 1 Record 1 Record 3 Term 2 Record 1 Record 2
4 Term 4 Record 1 Record 2 Record 3 Record 4

Inverted Index Note: Storage: Dictionary sorted alphabetically
Each ‘posting list’  sorted by ID Storage: Dictionary kept in memory Postings  Depends on space. In memory on disk.

Inverted Index, Some Change
Term 1: 2 Record 1 Record 3 Term 2: 2 Record 1 Record 2 Term 3: 3 Record 2 Record 3 Record 4 Term 4: 4 Record 1 Record 2 Record 3 Record 4

IR System Models

Model S = (D, Q, T, V, F) Retrieval Status Values (RSV)
d Î D q Î Q F: D C Q  V Fq: D  V Retrieval Status Values (RSV) T: “index terms” Note F and Fq are the same in purpose – just different ways of expressing.

Model S = (D, Q, T, V, F) £ defined over elements of V is simple order
£¢ defined over elements of D by F is weak order Breaks element of D into number of subsets Each subset are simply ordered

Subject Catalog Model S = (D, Q, T, V, F) T = set of subject headings
D = 2T V = { 0, 1 } Fq(d) where q Î Q, d Î D 1, if q Î d 0, otherwise

Coordination Level System
S = (D, Q, T, V, F) Q = 2T D = 2T V = { 0, 1 } Fq(d) where 1, if q Í d 0, otherwise F¢q(d) where 1, if |q Ç d| > k

Boolean Systems S = (D, Q, T, V, F) D = 2T Q = E V = { 0, 1 }
Fq(d) where 1, if q evaluates to True With respect to document 0, otherwise

What is E? Let t Î T, Then If e Î E, Then If e1, e2 Î E, Then
t Î E If e Î E, Then Øe Î E If e1, e2 Î E, Then e1 Ú e2 Î E e1 Ù e2 Î E Nothing else is in E!

Document Representation
Set of Document IDs D = {da} a=1,2,…,p Set of all term IDs: T = {ti} i = 1,2,…,n

Document Representation
Relation D = { < da, ti, mD(da, ti)> } mD: D x T  {0,1} mD(da, ti) 1, if da contains ti 0, otherwise D t = {da Î D | mD(da, t) = 1} d º D d = {ti Î T | mD(d, ti) = 1}

Retrieval Function Retrieval Status Value (RSV) º F
RSVt(da) = mD(da, t) RSVØe(da) = 1 - RSVe(da) RSVe1Ùe2(da) = RSVe1(da) Ù RSVe2(da) RSVe1Úe2(da) = RSVe1(da) Ú RSVe2(da)

Processing Boolean Queries

Boolean example q = Ø(d Ú e) Ù (c Ú (a Ù b)) Ù Ø Ú Ú c Ù b d e a

Boolean Query Example (Method 1- based on documents )
q = Ø(d Ú e) Ù (c Ú (a Ù b)) Dda = {a,c} RSVq(da) = 1 Ú Ù d Ø e c b a 1 1 1 1

Boolean Query Example (Method 2- based on inverted lists)
D t1Ùt2 = {da Î D | da Î Dt1 Ù da Î Dt2} D t1Út2 = {da Î D | da Î Dt1 Ú da Î Dt2} Dt = set of Documents containing term t T = {a, b, c, d, e} Da, Db, Dc, Dd, De,

Boolean Query Example (Method 2)
Output Da Ç Db Input a Ù b b a Ù

Processing Boolean Query (Method 2)
Output Dt De1 De2 De1Ùe2 De1Úe2 D \ De1 Query t e1 e2 e1Ùe2 e1Úe2 Øe1

Boolean Queries (Method 2)
and-queries (tiÙtj) Construct a merged list M for Dti and Dtj. Transfer all duplicated records Od on merge list to output or-queries (tiÚtj) Transfer all unique records Ou on merge list to output.

Boolean Queries (Method 2)
not-queries (tiÙØtj) Construct a merged list M for Dti and Dtj. FIRST_List Remove all items appearing only once from First List Transfer remaining items to output (i.e. Od ). Create merge list composed of First_List and list composed of Dti SECOND_List Remove items appearing more than once from SECOND_List Transfer remaining items to output (i.e. Oa).

Reminder - Inverted Index
Term 1 Record 1 Record 3 Term 2 Record 1 Record 2 Term 3 Record 2 Record 3 Record 4 Term 4 Record 1 Record 2 Record 3 Record 4

Example Query ((t1Út2) ÙØ t3) Let’s do the first part (t1Út2)
Dt1: {R1, R3) Dt2: {R1, R2) M(Dt1, Dt2) : {R1, R1, R2, R3} Ou(t1Út2) : {R1, R2, R3} M  Merging Operation, O  Output Selection

Example Query (cont’d)
Now, let’s handle the second part ((t1Út2) ÙØ t3) Dt3: {R2, R3, R4) M(Dt1Út2, Dt3) : {R1, R2, R2, R3, R3, R4} Od((t1Út2)Ùt3) : {R2 , R3} M(Dt1Út2, D(t1Út2)Ùt3) : {R1, R2, R2, R3, R3} Oa((t1Út2)ÙØt3) : {R1}

Algorithms for Intersection

Algorithms – Basic Intersection (aka Merging)
Intersect(p1, p2) answer  {} While (p1 != NIL) and (p2 != NIL) Do if docID(p1) = docID(p2) Then ADD(answer, docID(p1)) p1  next(p1) p2  next(p2) Else if (docID(p1) < docID(p2)) Then p1 next(p1) Else p2  next(p2) Return answer

Algorithms – Intersection
Complexity: O(x + y) For any given two posting lists List A has size x List B has size y Note, this is upper bound. Formally, Complexity: Q(N) N can be either Number of documents in collection Note, this is a tight bound.

Observation In many cases, Boolean queries
Conjunctive in nature Allows for a possible improvement based on posting size (term frequency)

Algorithms – Conjunctive Query Merging
IntersectConjunct(t1, t2, …, tz) Terms  SortByIncreasingFrequency((t1, t2, …, tz)) Results  postings(first(Terms)) Terms  rest(Terms) while (Terms != NIL) and (Results != NIL) Do Results  Intersect(result, postings(first(Terms))) Return Results

Why? By using least frequent term In practice
All results guaranteed to be no larger than least frequent term In practice The ‘intermediate’ list always places upper bounds on the size.

Variations on Boolean Extended Boolean Fuzzy Has standard operations:
AND, OR and NOT Plus Term Proximity Within X words, sentences, paragraphs Wildcard Matching Fuzzy Allow for range Function F no longer restricted to {0,1}

Thank-you Questions?

References Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Chapter 1, 2008. Abraham Bookstein and William Cooper, “A General Mathematical Model for Information Retrieval Systems”, The Library Quarterly, Vol 26, no. 2, pp Vijay V. Raghavan’s Notes/Lecture Material Material in Slides ued with permission

CMPS 561 Boolean Retrieval

Similar presentations

Presentation on theme: "CMPS 561 Boolean Retrieval"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CMPS 561 Boolean Retrieval

Similar presentations

Presentation on theme: "CMPS 561 Boolean Retrieval"— Presentation transcript:

Similar presentations

About project

Feedback