Space-Efficient Algorithms for Document Retrieval
Veli Mäkinen, University of Helsinki
Joint work with Niko Välimäki

Space-Efficient Document Retrieval
Introduction
Field                           Problem              Solution
Information Retrieval           Document Retrieval   Inverted Index
Combinatorial Pattern Matching  Text Indexing        Suffix tree
[Diagram labels: [PST06]; [Mut02]; practice: space limits; theory: time limits; [Sad07 & this paper]]

Text Indexing
Let T = t_1 t_2 ... t_n be a text string over an ordered alphabet Σ. The Text Indexing problem is to build an index structure for T that supports the following operations on a given pattern P = p_1 p_2 ... p_m:
- Count(P): How many times does P occur in T?
- List(P): List the occurrence positions of P in T.
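
As a rough illustration of Count(P) and List(P), the sketch below uses a plain, uncompressed suffix array built by naive sorting. The function names (`build_suffix_array`, `sa_interval`, `count`, `list_occurrences`) are hypothetical and chosen only for this example; the talk itself relies on compressed indexes instead.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions lexicographically.
    return sorted(range(len(text)), key=lambda i: text[i:])


def sa_interval(text, sa, pattern):
    """Return the half-open range [lo, hi) of suffixes that start with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose length-m prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose length-m prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return start, lo


def count(text, sa, pattern):
    lo, hi = sa_interval(text, sa, pattern)
    return hi - lo


def list_occurrences(text, sa, pattern):
    lo, hi = sa_interval(text, sa, pattern)
    return sorted(p + 1 for p in sa[lo:hi])  # 1-based positions, as in t_1 ... t_n


text = "to be, or not to be"
sa = build_suffix_array(text)
print(count(text, sa, "be"), list_occurrences(text, sa, "be"))
```

Both operations reduce to locating one contiguous interval of the suffix array, which is the property the later slides build on.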

Document Retrieval
Let D = {T_1, T_2, ..., T_k} be a set of text documents of total length n. The Document Retrieval problem is to build an index for D that supports the following operation on a given pattern P = p_1 p_2 ... p_m:
- Find(P): List the documents that contain P (in the order of relevance, ...)

Inverted Index & Document Retrieval
Creating an inverted file over Shakespeare's plays:
d1 (Hamlet): "To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep;"
d2 (Merchant of Venice): "PORTIA: If to do were as easy as to know what were good to do, chapels had been churches and poor men's cottages princes' palaces. It is a good divine that follows his own instructions: I can easier teach twenty what were good to be done, than be one of the twenty to follow mine own teaching."
...
be: (d1,4) (d1,18) ... (d2,74) (d2,139)
to: (d1,1) (d1,15) ... (d2,136)
...
Find("to be") = Remove duplicates((Find("to")+3) ∩ Find("be")) = d1 (Hamlet), d2 (Merchant of Venice), ...
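
The phrase query on this slide can be sketched as follows, assuming a positional inverted index keyed by lowercased words with 1-based character positions and exactly one separator character between the two words. The names `build_inverted_index` and `find_phrase`, the tokenizer regex, and the shortened second document are illustrative assumptions, not part of the original material.

```python
import re
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased word to a list of (doc_id, 1-based char position)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for match in re.finditer(r"[a-z']+", text.lower()):
            index[match.group()].append((doc_id, match.start() + 1))
    return index


def find_phrase(index, first, second):
    """Documents where `second` starts right after `first` plus one separator."""
    offset = len(first) + 1  # e.g. "to" plus one space gives +3
    shifted = {(d, pos + offset) for d, pos in index.get(first, [])}
    hits = shifted & set(index.get(second, []))
    return sorted({d for d, _ in hits})  # remove duplicates


docs = {
    "d1": "To be, or not to be: that is the question",
    "d2": "what were good to be done, than be one of the twenty",
}
index = build_inverted_index(docs)
print(index["be"][:2], find_phrase(index, "to", "be"))
```

Shifting the postings of "to" by 3 and intersecting with the postings of "be" mirrors the expression (Find("to")+3) ∩ Find("be") on the slide.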

Suffix Array & Document Retrieval (1/2)
Build the generalized suffix array of D:
"To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer The slings and arrows of outrageous fortune, Or to take arms against a sea of troubles, And by opposing end them? To die: to sleep;"
"PORTIA: If to do were as easy as to know what were good to do, chapels had been churches and poor men's cottages princes' palaces. It is a good divine that follows his own instructions: I can easier teach twenty what were good to be done, than be one of the twenty to follow mine own teaching."

Suffix Array & Document Retrieval (2/2)
Build the generalized suffix array of D.
Locate the interval containing all occurrences of pattern P ("to be").
Remove duplicates: d1 (Hamlet), d2 (Merchant of Venice), ...
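
A minimal sketch of these three steps, assuming documents are concatenated with a separator character that occurs in neither the documents nor the pattern. The helper names (`build_generalized_sa`, `interval`, `find`) and the naive sorting-based construction are assumptions made for illustration.

```python
def build_generalized_sa(docs, sep="\x00"):
    """Concatenate docs (each followed by sep) and return (text, sa, doc).

    doc[i] is the index of the document that contains suffix sa[i].
    """
    text = "".join(d + sep for d in docs)
    owner = []
    for doc_id, d in enumerate(docs):
        owner.extend([doc_id] * (len(d) + 1))
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    doc = [owner[p] for p in sa]
    return text, sa, doc


def interval(text, sa, pattern):
    """Half-open suffix array range [lo, hi) of suffixes starting with pattern."""
    m = len(pattern)

    def first_index(strictly_greater):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (strictly_greater and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return first_index(False), first_index(True)


def find(text, sa, doc, pattern):
    lo, hi = interval(text, sa, pattern)
    # Removing duplicates this way costs time proportional to the number of
    # occurrences, not to the number of matching documents.
    return sorted(set(doc[lo:hi]))


docs = ["to be, or not to be", "good to be done", "to die: to sleep"]
text, sa, doc = build_generalized_sa(docs)
print(find(text, sa, doc, "to be"))
```

The duplicate-removal step scans every occurrence in the interval, which is the cost that the next slides aim to avoid.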

Muthukrishnan's improvement
[Figure: the doc and prev arrays, the interval for "to be", and the minima (min) marked]

Time-Optimal Document Retrieval
Theorem [Mut02]: The document retrieval problem can be solved in the optimal O(m+ndoc) time using an index structure of size O(n log n) bits, where ndoc is the number of documents matching the query.
Observation: The solution is not space-optimal, as the document collection can be represented in n log |Σ| bits.
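
The document-listing step behind this theorem can be sketched as below, assuming 0-based arrays and prev[i] = -1 when doc[i] has no earlier occurrence. The sparse table used here for range minimum queries is a simple stand-in chosen for brevity (it needs O(n log n) words); the names `build_prev`, `SparseTableRMQ`, and `list_documents` are hypothetical.

```python
def build_prev(doc):
    """prev[i] = largest j < i with doc[j] == doc[i], or -1 if none."""
    last, prev = {}, []
    for i, d in enumerate(doc):
        prev.append(last.get(d, -1))
        last[d] = i
    return prev


class SparseTableRMQ:
    """Position of the minimum of values[l..r] in O(1) after O(n log n) setup."""

    def __init__(self, values):
        self.values = values
        self.table = [list(range(len(values)))]
        span = 1
        while 2 * span <= len(values):
            prev_row, row = self.table[-1], []
            for i in range(len(values) - 2 * span + 1):
                a, b = prev_row[i], prev_row[i + span]
                row.append(a if values[a] <= values[b] else b)
            self.table.append(row)
            span *= 2

    def argmin(self, l, r):
        level = (r - l + 1).bit_length() - 1
        a = self.table[level][l]
        b = self.table[level][r - (1 << level) + 1]
        return a if self.values[a] <= self.values[b] else b


def list_documents(doc, prev, rmq, l, r):
    """Distinct values in doc[l..r] (inclusive), each reported exactly once."""
    found, stack = [], [(l, r)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        i = rmq.argmin(lo, hi)
        if prev[i] >= l:
            continue  # every document in doc[lo..hi] already occurs earlier in [l, r]
        found.append(doc[i])  # i is the first occurrence of doc[i] within [l, r]
        stack.append((lo, i - 1))
        stack.append((i + 1, hi))
    return found


doc = [0, 1, 0, 2, 1, 1, 0, 2]
prev = build_prev(doc)
rmq = SparseTableRMQ(prev)
print(sorted(list_documents(doc, prev, rmq, 4, 6)))
```

Each range minimum query either reports a new document or ends that branch of the search, so the number of queries is proportional to ndoc rather than to the number of occurrences.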

Space-Optimal Document Retrieval
Theorem [Sad02]: The document retrieval problem can be solved in O(f(m,n) + ndoc·g(n)) time using an index structure of size |CSA| + 4n + o(n) + O(k log(n/k)) bits, where
- |CSA| ≤ n log |Σ| (1+o(1)) is the size of the compressed suffix array used;
- f(m,n) = O(m log n) is the pattern search time; and
- g(n) = Ω(log^ε n) is the time to decode a suffix array value.

Our Result: Space- and Time-Efficient Document Retrieval
Theorem: The document retrieval problem can be solved in the optimal O(m+ndoc) time using an index structure of size |CSA| + 2n + o(n) + n log k (1+o(1)) bits, when |Σ|, k = O(polylog(n)); for unbounded |Σ| and k, the time bound components become O(m log |Σ|) and O(ndoc log k), respectively.

Details of Our Result (1/3)
We use the alphabet-friendly FM-index [FMMN07] to find the suffix array interval containing the pattern occurrences.
We use the generalized wavelet tree [GGV03, FMMN07] to store document numbers according to the suffix array order.
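
To illustrate how an FM-index finds the suffix array interval, the sketch below runs standard backward search over the Burrows-Wheeler transform. It is deliberately simplified: it stores plain occurrence tables rather than the compressed, alphabet-friendly representation of [FMMN07], and the names `bwt_index` and `backward_search` as well as the end marker are assumptions for this example.

```python
def bwt_index(text, end="\x00"):
    """Return (C, occ, n) for text + end, where end is assumed not to occur in text."""
    text += end
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = [text[p - 1] for p in sa]  # for p == 0 this wraps around to the end marker
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    # C[c] = number of characters in the text that are smaller than c.
    C, total = {}, 0
    for ch in sorted(counts):
        C[ch] = total
        total += counts[ch]
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {ch: [0] for ch in counts}
    for b in bwt:
        for ch in occ:
            occ[ch].append(occ[ch][-1] + (ch == b))
    return C, occ, len(text)


def backward_search(C, occ, n, pattern):
    """Half-open suffix array interval [lo, hi) of suffixes starting with pattern."""
    lo, hi = 0, n
    for ch in reversed(pattern):  # one step per pattern character, right to left
        if ch not in C:
            return 0, 0
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return 0, 0
    return lo, hi


C, occ, n = bwt_index("to be, or not to be")
lo, hi = backward_search(C, occ, n, "to be")
print(hi - lo)
```

The search performs one step per pattern character, which is where the O(m) term (or O(m log |Σ|) for large alphabets) in the time bound comes from.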

Details of Our Result (2/3)
Observation: prev[i] = select_{doc[i]}(doc, rank_{doc[i]}(doc, i) - 1), where rank_{k'}(A, i) gives the number of times value k' appears in A[1,i], and select_{k'}(A, j) gives the position of the j-th occurrence of value k' in A.
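
The observation can be checked with naive rank and select, as in the sketch below. It uses 1-based positions as on the slide and assumes the convention that select with j = 0 returns 0, meaning "no previous occurrence"; that convention and the function names are assumptions for this example.

```python
def rank(value, array, i):
    """Number of times value occurs in array[1..i] (1-based, inclusive)."""
    return array[:i].count(value)


def select(value, array, j):
    """1-based position of the j-th occurrence of value; 0 if j == 0 (assumed)."""
    if j == 0:
        return 0
    seen = 0
    for pos, v in enumerate(array, start=1):
        if v == value:
            seen += 1
            if seen == j:
                return pos
    raise ValueError("fewer than j occurrences of value")


def prev_at(doc, i):
    """prev[i] computed on the fly, so the prev array never has to be stored."""
    d = doc[i - 1]  # doc[i] in the slide's 1-based notation
    return select(d, doc, rank(d, doc, i) - 1)


doc = [0, 1, 0, 2, 1, 1, 0, 2]
print([prev_at(doc, i) for i in range(1, len(doc) + 1)])
```

Because prev[i] is derived from doc alone, only the doc array needs an explicit representation.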

Details of Our Result (3/3)
The generalized wavelet tree representation of the doc array provides constant-time rank and select when k = O(polylog(n)).
Constant-time Range Minimum Queries (RMQ) on the implicit prev array can be supported using 2n + o(n) bits [FH07].
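
For intuition about rank on a wavelet tree, here is a sketch of the simpler binary variant, where a rank query takes O(log k) steps instead of constant time. It is not the generalized wavelet tree of [GGV03, FMMN07], and it stores prefix counts in Python lists rather than succinct bit vectors; the class name and layout are assumptions for this example.

```python
class WaveletTree:
    """Binary wavelet tree over integers in [lo, hi], built from a non-empty list."""

    def __init__(self, values, lo=None, hi=None):
        if lo is None:
            lo, hi = min(values), max(values)
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi or not values:
            return  # leaf, or an empty node that no query can reach with i > 0
        mid = (lo + hi) // 2
        # zeros[i] = how many of the first i values go to the left child.
        self.zeros = [0]
        for v in values:
            self.zeros.append(self.zeros[-1] + (v <= mid))
        self.left = WaveletTree([v for v in values if v <= mid], lo, mid)
        self.right = WaveletTree([v for v in values if v > mid], mid + 1, hi)

    def rank(self, value, i):
        """Number of occurrences of value among the first i values."""
        if value < self.lo or value > self.hi:
            return 0
        node = self
        while node.lo != node.hi:
            if node.left is None:
                return 0  # empty subtree
            mid = (node.lo + node.hi) // 2
            if value <= mid:
                i, node = node.zeros[i], node.left
            else:
                i, node = i - node.zeros[i], node.right
        return i


doc = [0, 1, 0, 2, 1, 1, 0, 2]
wt = WaveletTree(doc)
print(wt.rank(1, 6))
```

Each level of the tree halves the range of candidate document numbers, so a query visits about log k nodes.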

A simpler way to obtain the O(ndoc log k) result
[Figure: doc array]
Space: |CSA| + 2n + o(n) + n log k (1+o(1)) bits

Extensions
The approach can easily be extended to
- report the documents in relevance order under standard scoring schemes like TF*IDF; and
- show context around the first, several, or all occurrences in selected documents.

Small experiment
50 MB English text, k = 200
                 size     query time (m=3)   query time (m=4)
inverted index   98 MB    17.46 s            4.29 s
our index        169 MB   3.7 s              2.7 s