1
Document Retrieval Problems
S. Muthukrishnan
2
Storyline
Zvi Galil gave a talk on the 13th on 13 open problems he posed 13 years ago in string matching.
– Update on the status of open problems.
Eric Allender invited me to give a string matching talk at Rutgers U.
Gives me a chance to look through 30 years of history.
Fernand Braudel: History may be divided into three movements: what moves rapidly, what moves slowly, and what appears not to move at all.
3
The Key Problem
Occurrence listing: given a set of documents D to be preprocessed, a query must list all the locations in the documents where a given pattern occurs.
D = { aabaa, abaaa, bc }, i.e. d1 = aabaa, d2 = abaaa, d3 = bc
P = aa
O = { (1,1), (1,4), (2,3), (2,4) }
Document listing: given a set of documents D to be preprocessed, a query must list all the documents in which a given pattern occurs.
D = { aabaa, abaaa, bc }, i.e. d1 = aabaa, d2 = abaaa, d3 = bc
P = aa
O = { 1, 2 }
Muthu: Use this problem to frame the discussion.
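To pin down the two query types, here is a brute-force sketch that scans every document at query time. It only serves as a reference for the definitions (no preprocessing, so it is not the data structure the talk is about). The function names are made up for illustration, and positions are 1-based as on the slide.

```python
def find_all(text, pattern):
    """Yield the 1-based start position of every (possibly overlapping) match."""
    start = text.find(pattern)
    while start != -1:
        yield start + 1
        start = text.find(pattern, start + 1)


def occurrence_listing(docs, pattern):
    """All (document, position) pairs where the pattern occurs."""
    return [(d, pos)
            for d, text in enumerate(docs, start=1)
            for pos in find_all(text, pattern)]


def document_listing(docs, pattern):
    """All documents that contain the pattern at least once."""
    return [d for d, text in enumerate(docs, start=1) if pattern in text]


docs = ["aabaa", "abaaa", "bc"]
print(occurrence_listing(docs, "aa"))  # [(1, 1), (1, 4), (2, 3), (2, 4)]
print(document_listing(docs, "aa"))    # [1, 2]
```

The occurrence list can be much longer than the document list, which is why reporting documents in time proportional to the number of documents is the interesting target.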
4
Occurrence vs. Document Listing
Given n documents of total length N, occurrence listing can be solved with
– O(N) preprocessing, and
– O(m + output) time for a query pattern of size m.
– Elegant 1973 paper by Weiner introduced suffix trees and solved this problem: optimal and output-sensitive.
No such optimal result for document listing.
– O((m + out) log n) time query processing.
– log n loglog n by fractional cascading.
muthu: assuming you don’t hastily give the answer without looking at the entire document or the pattern!
5
Other Document Listing Problems
Find all documents that contain at least K occurrences of the given pattern. (mining)
Find all documents that contain two occurrences of the pattern separated by at most distance d. (proximity repeat)
Find all documents that do NOT contain the given pattern. (negative query)
Find all documents that contain pattern P but not Q. (boolean query)
Combinations thereof…
Muthu: Normally, negative queries are not selective, but they work within a selected subset or in conjunction with other patterns.
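As a reference for what each variant asks, here is a brute-force sketch of the four queries. It scans every document and is not meant as an efficient solution. The function names are hypothetical, and "separated by at most distance d" is read here as "start positions differ by at most d", which is an assumption about the slide's intent.

```python
def match_positions(text, pattern):
    """Sorted 1-based start positions of all (possibly overlapping) matches."""
    positions, start = [], text.find(pattern)
    while start != -1:
        positions.append(start + 1)
        start = text.find(pattern, start + 1)
    return positions


def at_least_k(docs, p, k):
    """Mining: documents with at least k occurrences of p."""
    return [d for d, t in enumerate(docs, 1) if len(match_positions(t, p)) >= k]


def proximity_repeat(docs, p, dist):
    """Documents with two occurrences of p whose starts differ by at most dist."""
    result = []
    for d, t in enumerate(docs, 1):
        pos = match_positions(t, p)
        # Positions are sorted, so the closest pair is always an adjacent pair.
        if any(b - a <= dist for a, b in zip(pos, pos[1:])):
            result.append(d)
    return result


def negative_query(docs, p):
    """Documents that do not contain p."""
    return [d for d, t in enumerate(docs, 1) if p not in t]


def p_but_not_q(docs, p, q):
    """Boolean query: documents that contain p but not q."""
    return [d for d, t in enumerate(docs, 1) if p in t and q not in t]


docs = ["aabaa", "abaaa", "bc"]
print(at_least_k(docs, "aa", 2))        # [1, 2]
print(proximity_repeat(docs, "aa", 1))  # [2]
print(negative_query(docs, "aa"))       # [3]
print(p_but_not_q(docs, "a", "aab"))    # [2]
```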
6
Nature of Document Retrieval Problems
Document listing versions are natural.
– Occurrence listing versions primarily studied in Computational Biology and Data Mining.
No optimal algorithms previously known.
– Bounds are off by factors of log n … n in the worst case depending on the problem.
We will provide (near) optimal algorithms.
– Optimal algorithm for key document listing problem.
Theory following Practice? Inverted word index + variants, in IR.
Muthu: Motivated the discussion with this problem. It is also framed in history.
7
Talk Overview
Optimal algorithm for the document listing problem.
– List all documents that contain the given pattern.
Efficient algorithm for the document mining problem.
– List all documents that contain at least K occurrences of the given pattern.
Techniques.
– Colored range query data structural problems.
8
Preamble: Occurrence Listing
Construct a suffix tree (compressed trie) of all the documents.
D = { abaa, aabaa, bc }
S = { abaa#, baa#, aa#, a#, aabaa#, bc#, c# }
[Figure: suffix tree of S, with each leaf labeled by its (document, position) pairs.]
Leaves, left to right: a# → (1,4), (2,5); aa# → (1,3), (2,4); aabaa# → (2,1); abaa# → (1,1), (2,2); baa# → (1,2), (2,3); bc# → (3,1); c# → (3,2)
http://commfaculty.fullerton.edu/lester/writings/1000_words.html
9
Preamble: Occurrence Listing
Find all occurrences of pattern aa.
– Trace down the path aa and report all the leaves [Weiner 73].
Input: D = { abaa, aabaa, bc }
Output: (1,3), (2,4), (2,1)
[Figure: the suffix tree from the previous slide, with the path aa traced down to its leaves (1,3), (2,4) and (2,1).]
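The "trace down and report the leaves" step can be imitated without building a tree. The sketch below swaps the suffix tree for a plain sorted list of suffixes, in which the leaves below the path for a pattern form one contiguous range. This is a stand-in, not Weiner's construction: building the list this way takes far more than O(N) time. The function names are illustrative, and the terminator # is assumed to sort before every text character and to be absent from the pattern.

```python
from bisect import bisect_left


def build_suffix_list(docs, terminator="#"):
    """Sorted (suffix, document, position) entries, one per non-empty suffix."""
    entries = []
    for d, text in enumerate(docs, start=1):
        for i in range(len(text)):
            entries.append((text[i:] + terminator, d, i + 1))
    entries.sort()  # same left-to-right order as the leaves of the suffix tree
    return entries


def occurrences(entries, pattern):
    """Return the 1-based leaf range (l, r) for the pattern and its occurrences."""
    keys = [suffix for suffix, _, _ in entries]  # rebuilt per call only for brevity
    lo = bisect_left(keys, pattern)
    hi = lo
    while hi < len(keys) and keys[hi].startswith(pattern):
        hi += 1  # one step per reported occurrence
    return (lo + 1, hi), [(d, pos) for _, d, pos in entries[lo:hi]]


entries = build_suffix_list(["abaa", "aabaa", "bc"])
print(occurrences(entries, "aa"))     # ((3, 5), [(1, 3), (2, 4), (2, 1)])
print([d for _, d, _ in entries])     # [1, 2, 1, 2, 2, 1, 2, 1, 2, 3, 3]
```

The second printed list is the sequence of document numbers at the leaves, read left to right; the document listing slides treat these numbers as colors.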
10
Document Listing
Find all documents that contain pattern aa.
– Trace down the path aa and report the distinct “colors” on leaves.
Input: D = { abaa, aabaa, bc }
Colors: 1, 2, 3
Output sought: 1, 2
Challenge: Avoid reporting duplicate colors.
[Figure: the suffix tree with each leaf labeled by its document number (color).]
muthu: Use hot pink sparingly
11
Document Listing: Our Approach
[Figure: the suffix tree with colored leaves.]
Leaf colors, left to right: 1 2 1 2 2 1 2 1 2 3 3
Colored range query: Return distinct colors in given range.
Mathematics is the art of giving the same name to different things. --- Jules Henri Poincare
12
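Document Listing: Our Approach
Position:   1   2   3   4   5   6   7   8   9  10  11
Color:      1   2   1   2   2   1   2   1   2   3   3
Chain:     -1  -1   1   2   4   3   5   6   7  -1  10
Each chain entry is the position of the previous leaf with the same color, or -1 if there is none.
For the range starting at position 3, "list distinct colors" becomes "list numbers less than 3" in the chain array.
Colors do not matter anymore.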
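A minimal sketch of the chaining step, assuming the leaf colors are given as a list in left-to-right order. The names are illustrative. The query shown here is a deliberately naive linear scan over the range, included only to show why the reduction is correct: a position in [l, r] holds the first occurrence of its color in the range exactly when its chain entry is less than l.

```python
def chain_array(colors):
    """chain[i] = 1-based position of the previous leaf with the same color, else -1."""
    last_seen = {}
    chain = []
    for position, color in enumerate(colors, start=1):
        chain.append(last_seen.get(color, -1))
        last_seen[color] = position
    return chain


def distinct_colors_naive(colors, chain, l, r):
    """Distinct colors in positions l..r (1-based, inclusive), each reported once."""
    return [colors[i - 1] for i in range(l, r + 1) if chain[i - 1] < l]


colors = [1, 2, 1, 2, 2, 1, 2, 1, 2, 3, 3]
chain = chain_array(colors)
print(chain)                                       # [-1, -1, 1, 2, 4, 3, 5, 6, 7, -1, 10]
print(distinct_colors_naive(colors, chain, 3, 5))  # [1, 2]
```

Position 8 has color 1, and the previous leaf of color 1 is at position 6, so its chain entry is 6.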
13
Document Listing: Our Approach
Position:   1   2   3   4   5   6   7   8   9  10  11
Chain:     -1  -1   1   2   4   3   5   6   7  -1  10
List numbers less than 3.
R = (l,r). Find all integers smaller than x in A[l,r]:
1. Perform rangemin(R) to determine i such that A[i] is smallest in A[l,r].
2. If A[i] is smaller than x, report A[i] and recurse on A[l,i-1] and A[i+1,r]; otherwise stop.
O(1) time per rangemin query, so O(output) time overall.
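The two steps above can be sketched as follows. For the O(1) rangemin, this sketch uses a sparse table, which needs O(n log n) preprocessing; that is a substitution made for brevity, not the linear-preprocessing structure that the O(N) bound relies on. Indices are 0-based in the code, the names are illustrative, and the recursion is written with an explicit stack.

```python
def build_sparse_table(A):
    """table[j][i] = index of a minimum of A[i : i + 2**j]."""
    n = len(A)
    table = [list(range(n))]
    j = 1
    while (1 << j) <= n:
        prev, half = table[-1], 1 << (j - 1)
        row = []
        for i in range(n - (1 << j) + 1):
            a, b = prev[i], prev[i + half]
            row.append(a if A[a] <= A[b] else b)
        table.append(row)
        j += 1
    return table


def range_argmin(A, table, l, r):
    """Index of a minimum of A[l..r] (inclusive) using two table lookups."""
    j = (r - l + 1).bit_length() - 1
    a, b = table[j][l], table[j][r - (1 << j) + 1]
    return a if A[a] <= A[b] else b


def report_less_than(A, table, l, r, x):
    """Indices i in [l, r] with A[i] < x; work is proportional to the output size."""
    out, stack = [], [(l, r)]
    while stack:
        lo, hi = stack.pop()
        if lo > hi:
            continue
        i = range_argmin(A, table, lo, hi)
        if A[i] < x:  # otherwise nothing in A[lo..hi] qualifies, so stop here
            out.append(i)
            stack.append((lo, i - 1))
            stack.append((i + 1, hi))
    return out


# Chain array for leaf colors 1 2 1 2 2 1 2 1 2 3 3 (values are 1-based positions).
colors = [1, 2, 1, 2, 2, 1, 2, 1, 2, 3, 3]
chain = [-1, -1, 1, 2, 4, 3, 5, 6, 7, -1, 10]
table = build_sparse_table(chain)
# 1-based range [3, 5] with threshold x = 3 is the 0-based range [2, 4].
hits = report_less_than(chain, table, 2, 4, 3)
print(sorted(colors[i] for i in hits))  # [1, 2]
```

Each rangemin call either reports an index or ends a branch, and every reported index starts at most two new branches, so the number of calls stays within a constant factor of the output size (plus one call for an empty answer).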
14
Document Listing: Summary
Given a set of documents of total size N, the document listing problem can be solved in
– O(N) time and space for preprocessing, and
– O(m + output) time for a query of size m.
– Uses Weiner’s O(N) time suffix tree construction.
Overview of techniques
– Reduce the problem to colored range searching.
– “Chain” occurrences of suffixes from each document.
Necessity is not necessarily the mother of invention. Ruth Benedict in Patterns of Culture.
Muthu: Now, let us get started with fun stuff.
15
Document Mining
Find all documents that contain at least K occurrences of the given pattern.
Find colors that appear at least K times in this range.
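As a baseline for the colored-range version of the problem, here is a sketch that simply counts colors inside the query range. It takes time proportional to the length of the range rather than to the size of the output, which is exactly what the approaches on the following slides try to avoid. The function name is illustrative.

```python
from collections import Counter


def colors_with_at_least_k(colors, l, r, k):
    """Colors that appear at least k times in positions l..r (1-based, inclusive)."""
    counts = Counter(colors[l - 1:r])
    return sorted(color for color, n in counts.items() if n >= k)


# Leaf colors of the running example; positions 1..7 are the suffixes starting with "a".
colors = [1, 2, 1, 2, 2, 1, 2, 1, 2, 3, 3]
print(colors_with_at_least_k(colors, 1, 7, 3))  # [1, 2]
print(colors_with_at_least_k(colors, 1, 7, 4))  # [2]
```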
16
Document Mining: First Approach
Fix K.
Chain to the Kth occurrence of red to the left.
Given range [l,r], determine all numbers in A[l,r] that are less than l.
Does not work: output * K
Yesterday it worked
Today it is not working
Windows is like that.
17
Document Mining: Second Approach
Chain to the Kth occurrence of red to the left. Replace by red intervals.
Given a set of colored intervals to be preprocessed, the query is some interval I, and we must determine the distinct colored intervals that are contained in I.
No optimal results known.
18
Document Mining: Fixed K
Mark LeastCommonAncestor(L, R) with the color red.
[Figure: two leaves L and R of the suffix tree and their least common ancestor.]
Each query: find the set of distinct colors in a subtree.
O(N) preprocessing, O(m + output) time per query.
19
Document Mining: Variable K
K is part of the query: o(NK) preprocessing?
[Figure: occurrences numbered 1, 2, 3, ..., K, K+1, K+2, ..., 2K-1.]
For a fixed K, all LCAs lie in paths separated by K occurrences.
Suffices to keep the lowest in each path.
muthu: that deserves the hot pink.
20
Document Mining: Variable K
For a fixed K, find the lowest LCA on each of the paths separated by O(K) occurrences of each document.
Preprocessing time: binary searching paths.
Query processing in O(m + output) time.
21
Summary
Solving other document listing problems.
– Optimal for negative query: list absent colors.
– (Near) optimal for proximity repeats: structural properties of “gaps.”
– Best known for two patterns: breaking the quadratic preprocessing bottleneck.
Techniques: Chaining, Colored range queries (7+ such problems in the paper), Combinatorial structure.
Muthu: Solving these colored range searching problems is of independent interest.
muthu: Hope that whetted your appetite for algorithmics.
22
Discussion “non” local chaining? –Find documents in which no two occurrences of the pattern are within distance K. OPEN Try it in IPScope: Interactive Patents Analysis System.