CMPS 561 Boolean Retrieval


CMPS 561 Boolean Retrieval Ryan Benton August 30, 2010

Agenda
  Indices
  IR System Models
  Processing Boolean Queries
  Algorithms for Intersection

Indices

Indices
Question: How do we store documents and terms so that documents can be retrieved efficiently, effectively, and with reasonable space requirements?

Term-Document Matrix
Create a table:
  Rows: terms
  Columns: document IDs
Official name: Term-Document Incidence Matrix
Also called: Inverted View of the Collection
What are terms? Indexed units (you can think of them as words, but they are not always words).

Term-Document Incidence Matrix

          Record 1  Record 2  Record 3  Record 4
Term 1       1         0         1         0
Term 2       1         1         0         0
Term 3       0         1         1         1
Term 4       1         1         1         1

Term-Document Matrix
Why?
  Rows → vectors of the documents containing term X.
  Columns → vectors of the terms contained in document Y.
Technically, take the transpose of the matrix to get the column vectors.

Document-Term Incidence Matrix

          Term 1  Term 2  Term 3  Term 4
Record 1    1       1       0       1
Record 2    0       1       1       1
Record 3    1       0       1       1
Record 4    0       0       1       1

Note: some people prefer this view. However, the term-document view is useful for other reasons, to be seen.

Term-Document Matrix
Naïve way of building: create and store the whole matrix.
Some calculations:
  500,000 terms
  1,000,000 documents
  ½ trillion entries: 500,000,000,000
  All entries are 0s and 1s.
Memory impact: as the document collection and/or the term list grows, the matrix cannot be kept in memory.

Term-Document Matrix
Observation: the term-document matrix is sparse.
Typically, only a small number of terms appear in any given document.
If a typical document contains 1,000 terms, the matrix in the previous example has 1 billion (1,000,000,000) 1s.
Thus, 99.8% of the matrix entries are 0s.
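The sparsity figure can be checked with a few lines of arithmetic. This is only a sketch of the calculation on the slide; the variable names are illustrative, and the 1,000 terms per document is the slide's assumed typical value.

```python
# Sizes taken from the slide's example.
terms = 500_000
documents = 1_000_000
terms_per_document = 1_000  # assumed typical number of distinct terms per document

total_entries = terms * documents        # cells in the incidence matrix
ones = documents * terms_per_document    # cells set to 1
zero_fraction = 1 - ones / total_entries

print(f"{total_entries:,} entries, {ones:,} ones, {zero_fraction:.1%} zeros")
```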

Inverted Index
Also called: Inverted File
Dictionary of terms (also called the vocabulary or lexicon)
Each term has a list of the documents in which it appears.
Each document in such a list is sometimes called a posting.

Term-Document Incidence Matrix

          Record 1  Record 2  Record 3  Record 4
Term 1       1         0         1         0
Term 2       1         1         0         0
Term 3       0         1         1         1
Term 4       1         1         1         1

Inverted Index
Term 1: Record 1, Record 3
Term 2: Record 1, Record 2
Term 3: Record 2, Record 3, Record 4
Term 4: Record 1, Record 2, Record 3, Record 4

Inverted Index
Note:
  Sorting: the dictionary is sorted alphabetically; each posting list is sorted by document ID.
  Storage: the dictionary is kept in memory; where the postings live depends on space (in memory or on disk).

Inverted Index, Some Changes
Each dictionary entry now also stores the length of its posting list.
Term 1: 2 → Record 1, Record 3
Term 2: 2 → Record 1, Record 2
Term 3: 3 → Record 2, Record 3, Record 4
Term 4: 4 → Record 1, Record 2, Record 3, Record 4
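One way such an index could be built is sketched below. The function name `build_inverted_index`, the term labels `t1` to `t4`, and the use of integer record IDs are assumptions made for illustration; the record contents mirror the four-record example on the slides.

```python
from collections import defaultdict


def build_inverted_index(records):
    """Map each term to a sorted list of the record IDs that contain it."""
    index = defaultdict(set)
    for record_id, terms in records.items():
        for term in terms:
            index[term].add(record_id)
    # Dictionary sorted by term; each posting list sorted by record ID.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


# Hypothetical encoding of the slide's four records as sets of terms.
records = {
    1: {"t1", "t2", "t4"},
    2: {"t2", "t3", "t4"},
    3: {"t1", "t3", "t4"},
    4: {"t3", "t4"},
}

index = build_inverted_index(records)
for term, postings in index.items():
    # Print the term, its posting-list length, and the postings themselves.
    print(f"{term}: {len(postings)} -> {postings}")
```

Storing the posting-list length next to each term costs little and is what the conjunctive-query optimization later in the deck relies on.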

IR System Models

Model
S = (D, Q, T, V, F)
  d ∈ D (documents), q ∈ Q (queries)
  T: "index terms"
  V: Retrieval Status Values (RSV)
  F: D × Q → V
  F_q: D → V
Note: F and F_q serve the same purpose; they are just different ways of expressing it.

Model
S = (D, Q, T, V, F)
  ≤ defined over the elements of V is a simple order.
  ≤′ defined over the elements of D by F is a weak order:
    it breaks the elements of D into a number of subsets, and
    the subsets are simply ordered.

Subject Catalog Model
S = (D, Q, T, V, F)
  T = set of subject headings
  D = 2^T
  V = {0, 1}
  F_q(d), where q ∈ Q and d ∈ D:
    1, if q ∈ d
    0, otherwise

Coordination Level System
S = (D, Q, T, V, F)
  Q = 2^T
  D = 2^T
  V = {0, 1}
  F_q(d):
    1, if q ⊆ d
    0, otherwise
  F′_q(d):
    1, if |q ∩ d| > k
    0, otherwise
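The two retrieval functions of the coordination level system can be sketched with Python sets, treating both the query and the document as subsets of T. The function names and the sample terms are hypothetical, chosen only for illustration.

```python
def coordination_all(q, d):
    """F_q(d): 1 if every query term occurs in the document (q is a subset of d)."""
    return 1 if q <= d else 0


def coordination_level(q, d, k):
    """F'_q(d): 1 if more than k query terms occur in the document."""
    return 1 if len(q & d) > k else 0


doc = frozenset({"retrieval", "index", "boolean"})
query = frozenset({"boolean", "index", "fuzzy"})

print(coordination_all(query, doc))       # "fuzzy" is missing from doc
print(coordination_level(query, doc, 1))  # two query terms match, threshold k = 1
```

The second function relaxes the first: a document no longer has to contain every query term, only more than k of them.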

Boolean Systems
S = (D, Q, T, V, F)
  D = 2^T
  Q = E
  V = {0, 1}
  F_q(d):
    1, if q evaluates to True with respect to document d
    0, otherwise

What is E?
  Let t ∈ T. Then t ∈ E.
  If e ∈ E, then ¬e ∈ E.
  If e1, e2 ∈ E, then e1 ∨ e2 ∈ E and e1 ∧ e2 ∈ E.
  Nothing else is in E!

Document Representation
Set of document IDs: D = {d_a}, a = 1, 2, …, p
Set of all term IDs: T = {t_i}, i = 1, 2, …, n

Document Representation
Relation D = { <d_a, t_i, μ_D(d_a, t_i)> }
μ_D: D × T → {0, 1}
μ_D(d_a, t_i) =
  1, if d_a contains t_i
  0, otherwise
D_t = {d_a ∈ D | μ_D(d_a, t) = 1}
d ≡ D_d = {t_i ∈ T | μ_D(d, t_i) = 1}

Retrieval Function
Retrieval Status Value (RSV) ≡ F
  RSV_t(d_a) = μ_D(d_a, t)
  RSV_¬e(d_a) = 1 − RSV_e(d_a)
  RSV_(e1∧e2)(d_a) = RSV_e1(d_a) ∧ RSV_e2(d_a)
  RSV_(e1∨e2)(d_a) = RSV_e1(d_a) ∨ RSV_e2(d_a)

Processing Boolean Queries

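Boolean example
q = ¬(d ∨ e) ∧ (c ∨ (a ∧ b))
[Figure: expression tree for q. The root is ∧; its left subtree is ¬ over (d ∨ e) and its right subtree is c ∨ (a ∧ b).]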

Boolean Query Example (Method 1: based on documents)
q = ¬(d ∨ e) ∧ (c ∨ (a ∧ b))
D_da = {a, c}
RSV_q(d_a) = 1
[Figure: expression tree for q annotated with the value computed at each node for d_a.]
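Method 1 can be sketched as a recursive walk over the expression tree, evaluated against one document at a time. The nested-tuple encoding of expressions and the helper names `term` and `rsv` are assumptions made for this sketch, not notation from the slides; on {0, 1} values, `min` and `max` play the roles of ∧ and ∨.

```python
def term(name):
    """Leaf node of the expression tree: a single index term."""
    return ("term", name)


def rsv(expr, doc_terms):
    """Retrieval status value (0 or 1) of a Boolean expression for one document."""
    op = expr[0]
    if op == "term":
        return 1 if expr[1] in doc_terms else 0
    if op == "not":
        return 1 - rsv(expr[1], doc_terms)
    if op == "and":
        return min(rsv(expr[1], doc_terms), rsv(expr[2], doc_terms))
    if op == "or":
        return max(rsv(expr[1], doc_terms), rsv(expr[2], doc_terms))
    raise ValueError(f"unknown operator: {op!r}")


# q = NOT(d OR e) AND (c OR (a AND b))
q = ("and",
     ("not", ("or", term("d"), term("e"))),
     ("or", term("c"), ("and", term("a"), term("b"))))

print(rsv(q, {"a", "c"}))  # the slide's document d_a = {a, c}
```

Because this method touches every document once per query, it scales with the size of the collection, which is part of the motivation for working from inverted lists instead.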

Boolean Query Example (Method 2: based on inverted lists)
D_t = set of documents containing term t
D_(t1∧t2) = {d_a ∈ D | d_a ∈ D_t1 ∧ d_a ∈ D_t2}
D_(t1∨t2) = {d_a ∈ D | d_a ∈ D_t1 ∨ d_a ∈ D_t2}
T = {a, b, c, d, e}
Inverted lists: D_a, D_b, D_c, D_d, D_e

Boolean Query Example (Method 2)
Input: a ∧ b
Output: D_a ∩ D_b
[Figure: ∧ node with children a and b.]

Processing Boolean Query (Method 2)

Query      Output
t          D_t
e1         D_e1
e2         D_e2
e1 ∧ e2    D_(e1∧e2)
e1 ∨ e2    D_(e1∨e2)
¬e1        D \ D_e1
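The query-to-output mapping above can be sketched with Python sets: each sub-expression returns a set of document IDs, and the operators become set intersection, union, and difference from the full collection D. The tuple encoding of expressions and the name `docs_for` are assumptions for this sketch; the index is the four-term example from the earlier inverted index slide.

```python
def docs_for(expr, index, all_docs):
    """Return the set of document IDs that satisfy a Boolean expression."""
    op = expr[0]
    if op == "term":
        return set(index.get(expr[1], ()))
    if op == "not":
        return all_docs - docs_for(expr[1], index, all_docs)   # D \ D_e1
    if op == "and":
        return docs_for(expr[1], index, all_docs) & docs_for(expr[2], index, all_docs)
    if op == "or":
        return docs_for(expr[1], index, all_docs) | docs_for(expr[2], index, all_docs)
    raise ValueError(f"unknown operator: {op!r}")


index = {"t1": [1, 3], "t2": [1, 2], "t3": [2, 3, 4], "t4": [1, 2, 3, 4]}
all_docs = {1, 2, 3, 4}

# (t1 OR t2) AND NOT t3
query = ("and", ("or", ("term", "t1"), ("term", "t2")), ("not", ("term", "t3")))
print(sorted(docs_for(query, index, all_docs)))
```

Note that a bare negation needs the whole collection D, which is one reason practical systems prefer to evaluate NOT only in the form e1 ∧ ¬e2.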

Boolean Queries (Method 2)
and-queries (t_i ∧ t_j):
  Construct a merged list M for D_ti and D_tj.
  Transfer all duplicated records on the merged list to the output (O_d).
or-queries (t_i ∨ t_j):
  Construct a merged list M for D_ti and D_tj.
  Transfer all unique records on the merged list to the output (O_u).

Boolean Queries (Method 2)
not-queries (t_i ∧ ¬t_j):
  Construct a merged list M for D_ti and D_tj: FIRST_List.
  Remove all items appearing only once from FIRST_List.
  Transfer the remaining items to the output (i.e., O_d).
  Create a merged list from that output (O_d) and D_ti: SECOND_List.
  Remove the items appearing more than once from SECOND_List.
  Transfer the remaining items to the output (i.e., O_a).

Reminder - Inverted Index
Term 1: Record 1, Record 3
Term 2: Record 1, Record 2
Term 3: Record 2, Record 3, Record 4
Term 4: Record 1, Record 2, Record 3, Record 4

Example
Query: ((t1 ∨ t2) ∧ ¬t3)
Let's do the first part, (t1 ∨ t2):
  D_t1: {R1, R3}
  D_t2: {R1, R2}
  M(D_t1, D_t2): {R1, R1, R2, R3}
  O_u(t1 ∨ t2): {R1, R2, R3}
M → merging operation, O → output selection

Example (cont'd)
Query: ((t1 ∨ t2) ∧ ¬t3)
Now, let's handle the second part:
  D_t3: {R2, R3, R4}
  M(D_(t1∨t2), D_t3): {R1, R2, R2, R3, R3, R4}
  O_d((t1 ∨ t2) ∧ t3): {R2, R3}
  M(D_(t1∨t2), D_((t1∨t2)∧t3)): {R1, R2, R2, R3, R3}
  O_a((t1 ∨ t2) ∧ ¬t3): {R1}
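The merge-and-select steps of this example can be traced in code. The sketch below assumes sorted posting lists of integer record IDs; the names `merged`, `select_duplicated`, `select_unique`, and `select_single` are hypothetical labels for M, O_d, O_u, and O_a.

```python
from collections import Counter
from heapq import merge


def merged(a, b):
    """M: merge two sorted posting lists, keeping duplicates."""
    return list(merge(a, b))


def select_duplicated(m):
    """O_d: records that appear more than once on the merged list."""
    return [rec for rec, count in Counter(m).items() if count > 1]


def select_unique(m):
    """O_u: every distinct record on the merged list, listed once."""
    return list(Counter(m))


def select_single(m):
    """O_a: records that appear exactly once on the merged list."""
    return [rec for rec, count in Counter(m).items() if count == 1]


d_t1, d_t2, d_t3 = [1, 3], [1, 2], [2, 3, 4]

t1_or_t2 = select_unique(merged(d_t1, d_t2))        # t1 OR t2
both = select_duplicated(merged(t1_or_t2, d_t3))    # (t1 OR t2) AND t3
answer = select_single(merged(t1_or_t2, both))      # (t1 OR t2) AND NOT t3
print(t1_or_t2, both, answer)
```

`Counter` keeps keys in first-seen order, so the outputs stay sorted as long as the merged list is sorted.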

Algorithms for Intersection

Algorithms – Basic Intersection (aka Merging)
Intersect(p1, p2)
  answer ← {}
  While (p1 != NIL) and (p2 != NIL) Do
    If docID(p1) = docID(p2) Then
      ADD(answer, docID(p1))
      p1 ← next(p1)
      p2 ← next(p2)
    Else If docID(p1) < docID(p2) Then
      p1 ← next(p1)
    Else
      p2 ← next(p2)
  Return answer
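A minimal Python rendering of the pseudocode above, assuming each posting list is a Python list of document IDs sorted in increasing order. Integer indices `i` and `j` stand in for the pointers p1 and p2; that substitution is a choice made for this sketch.

```python
def intersect(p1, p2):
    """Intersect two posting lists that are sorted by document ID."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            # Same document in both lists: keep it and advance both pointers.
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1  # p1 is behind, so advance it
        else:
            j += 1  # p2 is behind, so advance it
    return answer


print(intersect([1, 3], [1, 2]))
print(intersect([2, 3, 4], [1, 2, 3, 4]))
```

Each loop iteration advances at least one pointer, so the loop runs at most x + y times for lists of sizes x and y.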

Algorithms – Intersection
Complexity: O(x + y) for any two given posting lists, where
  list A has size x and
  list B has size y.
Note: this is an upper bound.
Formally, complexity: Θ(N), where N can be the number of documents in the collection.
Note: this is a tight bound.

Observation
In many cases, Boolean queries are conjunctive in nature.
This allows for a possible improvement based on posting list size (the number of documents in which a term appears).

Algorithms – Conjunctive Query Merging
IntersectConjunct(t1, t2, …, tz)
  Terms ← SortByIncreasingFrequency((t1, t2, …, tz))
  Results ← postings(first(Terms))
  Terms ← rest(Terms)
  While (Terms != NIL) and (Results != NIL) Do
    Results ← Intersect(Results, postings(first(Terms)))
    Terms ← rest(Terms)
  Return Results
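The conjunctive procedure can be sketched as follows, assuming the index is a dictionary from terms to sorted posting lists and using posting-list length as the frequency to sort by. The function names and the dictionary layout are assumptions for this sketch; the pairwise `intersect` is repeated so that the block runs on its own.

```python
def intersect(p1, p2):
    """Pairwise intersection of two posting lists sorted by document ID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer


def intersect_conjunct(terms, index):
    """AND together all terms, starting from the shortest posting list."""
    if not terms:
        return []
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    results = list(index.get(ordered[0], []))
    for t in ordered[1:]:
        if not results:
            break  # an empty intermediate result cannot grow again
        results = intersect(results, index.get(t, []))
    return results


index = {"t1": [1, 3], "t2": [1, 2], "t3": [2, 3, 4], "t4": [1, 2, 3, 4]}
print(intersect_conjunct(["t4", "t3", "t1"], index))
```

Starting from the shortest list keeps every intermediate result small, and the early exit skips the remaining terms once the result is empty.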

Why?
By starting with the least frequent term, all results are guaranteed to be no larger than that term's posting list.
In practice, the 'intermediate' list always places an upper bound on the size of the result.

Variations on Boolean
Extended Boolean
  Has the standard operations: AND, OR, and NOT
  Plus:
    Term proximity (within X words, sentences, or paragraphs)
    Wildcard matching
Fuzzy
  Allows for a range
  Function F is no longer restricted to {0, 1}

Thank you. Questions?

References
Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to Information Retrieval, Chapter 1, 2008.
Abraham Bookstein and William Cooper, "A General Mathematical Model for Information Retrieval Systems", The Library Quarterly, Vol. 46, No. 2, pp. 153-167.
Vijay V. Raghavan's Notes/Lecture Material: http://www.cacs.louisiana.edu/~cmps561/561/notes/Model.pdf
Material in slides used with permission.