Published by Samuel Ryan. Modified over 9 years ago
1
Using Fingerprints in n-Gram Indices
Digital Libraries: Advanced Methods and Technologies, Digital Collections
Stefan Selbach
selbach@informatik.uni-wuerzburg.de
17.09.2009
2
Thursday, September 17, 2009
Using Fingerprints in n-Gram Indices
Overview
Introduction
–Inverted Index
–N-Gram Index
–Bitmaps
–Signature Files
n-Gram Fingerprints
n-Gram Fingerprints in Combination with Posting Lists
Fingerprint Compression
Conclusion and Future Work
3
INTRODUCTION
4
Inverted Index
Very common index structure
Term-oriented
Every term is linked to its postings
5
n-Gram Index
Uses n-Grams as indexing terms
Any kind of subsequence can be searched
An n-Gram is a subsequence of a text with length n
Postings for longer subsequences can be calculated from the postings of their n-Grams
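The calculation of postings for longer subsequences can be illustrated with a minimal sketch. It assumes that a posting is a (document ID, offset) pair and that a longer query is answered by intersecting the postings of its n-grams after shifting each one back by its position in the query. The function names `build_ngram_index` and `search` are hypothetical and not taken from the presentation.

```python
from collections import defaultdict

def build_ngram_index(docs, n):
    """Map every n-gram to the set of (doc_id, offset) postings where it starts."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for offset in range(len(text) - n + 1):
            index[text[offset:offset + n]].add((doc_id, offset))
    return index

def search(index, n, query):
    """Return the postings of `query` by intersecting shifted n-gram postings."""
    if len(query) < n:
        raise ValueError("query must be at least n characters long")
    result = None
    for k in range(len(query) - n + 1):
        gram = query[k:k + n]
        # Shift each posting back by k so it points at the start of the query.
        shifted = {(d, o - k) for (d, o) in index.get(gram, set())}
        result = shifted if result is None else result & shifted
        if not result:
            break  # no candidate survives, stop early
    return result

docs = ["dermatology", "rhinology"]
index = build_ngram_index(docs, 2)
hits = search(index, 2, "olog")
```

Because the overlapping n-grams cover every character of the query, the intersection in this sketch is exact; the cost lies in the size of the posting lists that have to be intersected.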
6
n-Gram Index
Index structure is very similar to an inverted index
Searching is more complex
7
Bitmaps
Bitmaps are occurrence maps
Each bit signals an occurrence of a specific term in a specific document
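As a minimal sketch of an occurrence bitmap, the example below keeps one Python integer per term and sets bit d when the term occurs in document d. Whitespace tokenization and the names `build_bitmaps` and `docs_with_all` are assumptions made for illustration only.

```python
def build_bitmaps(docs):
    """Return {term: int}; bit d of the int is set if the term occurs in document d."""
    bitmaps = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            bitmaps[term] = bitmaps.get(term, 0) | (1 << doc_id)
    return bitmaps

def docs_with_all(bitmaps, terms):
    """Documents containing every term: bitwise AND of the term bitmaps."""
    if not terms:
        return []
    combined = bitmaps.get(terms[0], 0)
    for term in terms[1:]:
        combined &= bitmaps.get(term, 0)
    # Decode the set bits back into document IDs.
    return [d for d in range(combined.bit_length()) if (combined >> d) & 1]

docs = ["skin rash treatment", "rash on the skin", "nail treatment"]
bitmaps = build_bitmaps(docs)
both = docs_with_all(bitmaps, ["skin", "rash"])
```

A conjunctive query then costs one AND per term, independent of how many postings each term has.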
8
Signature Files
9
N-GRAM FINGERPRINT
10
N-Gram Fingerprint
The idea: create fingerprints that
–Have a fixed size
–Contain information about the postings
11
N-Gram Fingerprint
A 2D-Fingerprint is a bit-matrix
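One way to picture such a bit-matrix is sketched below. It assumes that the row is the file ID modulo the number of rows and the column is the offset modulo the number of columns, which matches the residue classes named later in the talk but is not spelled out on this slide. The names `fingerprint` and `may_contain` are hypothetical.

```python
def fingerprint(postings, rows, cols):
    """Fixed-size bit-matrix: cell (file_id % rows, offset % cols) is 1 for each posting."""
    matrix = [[0] * cols for _ in range(rows)]
    for file_id, offset in postings:
        matrix[file_id % rows][offset % cols] = 1
    return matrix

def may_contain(matrix, file_id, offset):
    """False means the posting is certainly absent; True means it is only possible."""
    return matrix[file_id % len(matrix)][offset % len(matrix[0])] == 1

postings_of_gram = [(0, 1), (0, 6), (5, 2)]
fp = fingerprint(postings_of_gram, rows=4, cols=4)
```

The matrix size does not depend on the number of postings, so different postings can share a cell; a set bit therefore only indicates a candidate.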
12
N-Gram Fingerprint
Given two 1-grams w1 and w2 and their fingerprints B_w1 and B_w2, the fingerprint B_w1w2 can be approximated from B_w1 and B'_w2
B'_w2 is constructed by cyclically shifting each column of B_w2 by one position to the left
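The shift-and-combine step can be sketched as follows. The sketch assumes that rows are file ID residues, columns are offset residues, and that the approximation is the bitwise AND of B_w1 with the shifted B_w2; the combining operator is not preserved in the slide text, so the AND is an assumption. All function names are hypothetical.

```python
def postings(docs, gram):
    """All (doc_id, offset) pairs at which `gram` starts."""
    return [(d, o) for d, text in enumerate(docs)
            for o in range(len(text) - len(gram) + 1) if text.startswith(gram, o)]

def fingerprint(posting_list, rows, cols):
    """Bit-matrix with cell (file_id % rows, offset % cols) set for each posting."""
    matrix = [[0] * cols for _ in range(rows)]
    for file_id, offset in posting_list:
        matrix[file_id % rows][offset % cols] = 1
    return matrix

def shift_left(matrix, k=1):
    """Cyclically move every column k positions to the left."""
    k %= len(matrix[0])
    return [row[k:] + row[:k] for row in matrix]

def approximate(fp_w1, fp_w2):
    """Approximate the fingerprint of w1w2 (assumed: AND with the shifted B_w2)."""
    shifted = shift_left(fp_w2, 1)
    return [[a & b for a, b in zip(r1, r2)] for r1, r2 in zip(fp_w1, shifted)]

docs = ["abcab", "bca"]
fp_a = fingerprint(postings(docs, "a"), rows=2, cols=4)
fp_b = fingerprint(postings(docs, "b"), rows=2, cols=4)
approx_ab = approximate(fp_a, fp_b)
exact_ab = fingerprint(postings(docs, "ab"), rows=2, cols=4)
```

Under these assumptions every bit set in the exact fingerprint is also set in the approximation, so candidates can only be overestimated, which is why a verification step follows.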
13
N-Gram Fingerprint
14
N-Gram Fingerprint
Search Speed

Query      Bit-matrix  Time for verification  Hits
rhinolo    219 ms      94 ms                  18
sanfilipo  290 ms      0 ms                   0
itracon    266 ms      336 ms                 64
oxyuria    197 ms      48 ms                  6

Results from the “Online Encyclopedia of Dermatology from P. Altmeyer”
15
Term Frequencies and Query Probability
16
N-GRAM FINGERPRINTS IN COMBINATION WITH POSTING LISTS
17
Combining Fingerprints and Posting Lists
By combining fingerprints and posting lists, no verification step is needed
Posting lists are partitioned into smaller subsets
Each bit of the fingerprint corresponds to a separate posting list
Costs for the intersection of posting lists are reduced
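The partitioning idea can be sketched as follows, assuming that each fingerprint cell (file ID residue, offset residue) owns the subset of postings that fall into it. Only subsets whose cells line up after the one-position shift need to be intersected, and the result is exact. The names `partition` and `intersect_adjacent` are hypothetical.

```python
from collections import defaultdict

def partition(posting_list, rows, cols):
    """Split a posting list into one subset per fingerprint cell."""
    buckets = defaultdict(set)
    for file_id, offset in posting_list:
        buckets[(file_id % rows, offset % cols)].add((file_id, offset))
    return buckets

def intersect_adjacent(buckets_w1, buckets_w2, cols):
    """Exact postings of w1w2 from the partitioned postings of w1 and w2."""
    hits = set()
    for (r, c), subset1 in buckets_w1.items():
        # w2 must start one position later, i.e. in the next column (cyclically).
        subset2 = buckets_w2.get((r, (c + 1) % cols))
        if not subset2:
            continue  # the corresponding fingerprint bit is 0, nothing to intersect
        hits |= {(f, o) for (f, o) in subset1 if (f, o + 1) in subset2}
    return hits

# Postings of "a" and "b" in the documents "abcab" (file 0) and "bca" (file 1).
postings_a = [(0, 0), (0, 3), (1, 2)]
postings_b = [(0, 1), (0, 4), (1, 0)]
buckets_a = partition(postings_a, rows=2, cols=4)
buckets_b = partition(postings_b, rows=2, cols=4)
hits_ab = intersect_adjacent(buckets_a, buckets_b, cols=4)
```

Each intersection touches only two small subsets instead of two full posting lists, which is where the reduced intersection cost in this sketch comes from.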
18
Combining Fingerprints and Posting Lists
19
Managing n-Gram Posting Lists
A very large number of posting subsets has to be managed, for example:
–1024 residue classes for the fileID
–128 residue classes for the offset
–14,000 different n-grams
Subsets are stored in a hash
The hash value is a function of the residue classes
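With the example figures above there are 1024 × 128 × 14,000 = 1,835,008,000 possible subsets, so only non-empty ones can reasonably be stored. The sketch below shows one possible key function; the actual hash function is not given in the slides, and `subset_key` and `SubsetStore` are hypothetical names.

```python
FILE_CLASSES = 1024    # residue classes for the fileID (from the example above)
OFFSET_CLASSES = 128   # residue classes for the offset (from the example above)

def subset_key(gram_id, file_residue, offset_residue):
    """Hypothetical key: a unique integer per (n-gram, file residue, offset residue)."""
    return (gram_id * FILE_CLASSES + file_residue) * OFFSET_CLASSES + offset_residue

class SubsetStore:
    """Stores only the non-empty posting subsets in a hash table (a dict)."""

    def __init__(self):
        self._subsets = {}

    def add(self, gram_id, file_id, offset):
        key = subset_key(gram_id, file_id % FILE_CLASSES, offset % OFFSET_CLASSES)
        self._subsets.setdefault(key, []).append((file_id, offset))

    def get(self, gram_id, file_residue, offset_residue):
        return self._subsets.get(subset_key(gram_id, file_residue, offset_residue), [])

store = SubsetStore()
store.add(gram_id=7, file_id=2050, offset=130)   # residues: 2050 % 1024 = 2, 130 % 128 = 2
```

Because the key is computed directly from the residue classes, locating a subset needs no search through the n-gram's other subsets.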
20
Managing n-Gram Posting Lists
21
Managing n-Gram Posting Lists
22
Results
Performance improved by 40% compared to the setup without posting lists

Query      Bit-matrix  Time for verification  Hits
rhinolo    230 ms      10 ms                  18
sanfilipo  271 ms      0 ms                   0
itracon    245 ms      15 ms                  64
oxyuria    210 ms      12 ms                  6
23
FINGERPRINT COMPRESSION
24
Fingerprint Compression
Fingerprints with high or low densities do not contain much information
Fingerprints can be compressed by reducing the resolution
Dictionary-based compression
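Reducing the resolution can be sketched as folding the matrix onto itself. The sketch assumes that columns are offset residues and that halving the column count is done by OR-ing the two halves, so that a lookup with the smaller modulus still finds every set bit. The density rule, the default threshold and the names `density`, `fold_columns` and `compress` are assumptions for illustration.

```python
def density(matrix):
    """Fraction of bits that are set."""
    cells = sum(len(row) for row in matrix)
    return sum(sum(row) for row in matrix) / cells

def fold_columns(matrix):
    """Halve the column count by OR-ing column j with column j + cols/2."""
    half = len(matrix[0]) // 2
    return [[row[j] | row[j + half] for j in range(half)] for row in matrix]

def compress(matrix, threshold=0.05):
    """Fold only very sparse or very dense fingerprints (hypothetical rule)."""
    d = density(matrix)
    if len(matrix[0]) % 2 == 0 and (d <= threshold or d >= 1 - threshold):
        return fold_columns(matrix)
    return matrix

sparse = [[0, 0, 0, 0, 0, 1, 0, 0]] + [[0] * 8 for _ in range(3)]
balanced = [[1, 0, 1, 0], [0, 1, 0, 1]]
folded = compress(sparse)
unchanged = compress(balanced)
```

In the folded matrix the bit for offset residue 5 is found at column 5 % 4 = 1, so no posting is lost; folding can only add candidates.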
25
Fingerprint Compression
Results: Fingerprint convolution

Density threshold for convolution  Performance loss  Fingerprint index reduction
no convolution                     0 %
0-0.025 and 0.975-1                3.1 %             23 %
0-0.05 and 0.95-1                  3.2 %             27 %
0-0.1 and 0.9-1                    10 %              29 %
0-0.2 and 0.8-1                    25 %              31 %

In combination with the dictionary-based compression, the index size is reduced by an additional 30%
26
CONCLUSION AND FUTURE WORK
27
Conclusion
Fingerprints improve the scalability of n-gram indices
Fingerprints improve the performance of n-gram indices
The index structure can be adjusted to user behavior, so that common queries can be processed more efficiently
The fingerprints can be stored in a compressed index while losing only a minimum of performance
28
Future Work
Combination of a term-based inverted index and an n-Gram fingerprint index
Profit from the advantages of using both terms and n-Grams as indexing terms
–Substring search
–Ranking
–Thesaurus information
29
Digital Libraries: Advanced Methods and Technologies, Digital Collections 17.09.2009 Thank You!