Using Fingerprints in n-Gram Indices Digital Libraries: Advanced Methods and Technologies, Digital Collections Stefan Selbach

1 Using Fingerprints in n-Gram Indices Digital Libraries: Advanced Methods and Technologies, Digital Collections Stefan Selbach selbach@informatik.uni-wuerzburg.de 17.09.2009

2 Thursday, September 17, 2009 Using Fingerprints in n-Gram Indices
Overview
Introduction
– Inverted Index
– N-Gram Index
– Bitmaps
– Signature Files
n-Gram Fingerprints
n-Gram Fingerprints in Combination with Posting Lists
Fingerprint Compression
Conclusion and Future Work

3 INTRODUCTION

4 Inverted Index
Very common index structure
Term-oriented
Every term is linked to its postings
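A minimal Python sketch of this structure, assuming whitespace tokenization and postings stored as (document ID, word position) pairs. The names build_inverted_index and docs are illustrative and do not come from the talk.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to its postings: a list of (doc_id, word_position)."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Assumption: terms are lowercased, whitespace-separated tokens.
        for position, term in enumerate(text.lower().split()):
            index[term].append((doc_id, position))
    return index

docs = ["the skin is an organ", "skin disease of the scalp"]
index = build_inverted_index(docs)
print(index["skin"])  # postings of the term "skin"
```

Looking up a term is a single dictionary access; anything that is not a whole term (for example a substring) cannot be found this way.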

5 n-Gram Index
Uses n-grams as indexing terms
Any kind of subsequence can be searched
An n-gram is a subsequence of a text with length n
Postings for longer subsequences can be calculated from the postings of the n-grams they contain
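One common way to compute postings for a longer subsequence is to intersect the postings of its n-grams after shifting each by its position in the query. The sketch below assumes character n-grams with n = 3 and postings stored as (document ID, offset) pairs; build_ngram_index and search are hypothetical helper names, not the talk's implementation.

```python
from collections import defaultdict

N = 3  # n-gram length (an assumption; the talk does not fix n here)

def build_ngram_index(docs, n=N):
    """Map each n-gram to the set of (doc_id, offset) where it starts."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for offset in range(len(text) - n + 1):
            index[text[offset:offset + n]].add((doc_id, offset))
    return index

def search(index, query, n=N):
    """Return the start positions of `query` by intersecting shifted postings."""
    if len(query) < n:
        raise ValueError("query must be at least n characters long")
    result = None
    for k in range(len(query) - n + 1):
        gram = query[k:k + n]
        # Shift each posting back by k so it names a candidate start of the query.
        shifted = {(d, off - k) for d, off in index.get(gram, ())}
        result = shifted if result is None else result & shifted
        if not result:
            break  # no candidate survives, stop early
    return result

docs = ["rhinoplasty and rhinitis", "onychomycosis"]
index = build_ngram_index(docs)
print(search(index, "rhin"))
print(search(index, "mycosis"))
```

Every additional n-gram in the query costs one more posting-list intersection, which is why searching is more expensive than a single lookup in a term-based inverted index.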

6 n-Gram Index
The index structure is very similar to an inverted index
Searching is more complex

7 Bitmaps
Bitmaps are occurrence maps
Each bit signals an occurrence of a specific term in a specific document
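As a small illustration, the sketch below keeps one Python integer per term and uses bit d to record whether the term occurs in document d. The function names and the sample documents are assumptions made for this example.

```python
def build_bitmaps(docs):
    """Map each term to an int whose bit d is set iff the term occurs in document d."""
    bitmaps = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            bitmaps[term] = bitmaps.get(term, 0) | (1 << doc_id)
    return bitmaps

def docs_with_all(bitmaps, terms):
    """AND the bitmaps of all terms and decode the set bits into document IDs."""
    if not terms:
        return []
    combined = bitmaps.get(terms[0], 0)
    for term in terms[1:]:
        combined &= bitmaps.get(term, 0)
    return [d for d in range(combined.bit_length()) if (combined >> d) & 1]

docs = ["skin disease", "skin care", "nail disease"]
bitmaps = build_bitmaps(docs)
print(docs_with_all(bitmaps, ["skin", "disease"]))
```

A conjunctive query becomes a bitwise AND, but the bitmap only says that a term occurs in a document, not where.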

8 Signature Files

9 N-GRAM FINGERPRINT

10 N-Gram Fingerprint
The idea: create fingerprints that
– have a fixed size
– contain information about the postings

11 N-Gram Fingerprint
A 2D fingerprint is a bit matrix

12 N-Gram Fingerprint
Given two 1-grams w1 and w2 and their fingerprints B_w1 and B_w2, the fingerprint B_w1w2 can be approximated from B_w1 and B'_w2.
B'_w2 is constructed by cyclically shifting each column of B_w2 by one position to the left.
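The following Python sketch shows one way this approximation could work. It rests on several assumptions: rows of the bit matrix are taken to be file ID residue classes and columns offset residue classes (residue classes are mentioned later in the talk), the matrix size of 4 x 8 is chosen only for readability, and the combining operator is assumed to be a bitwise AND. None of the function names come from the talk.

```python
ROWS, COLS = 4, 8  # residue classes for file ID and offset (illustrative sizes)

def fingerprint(postings):
    """Set bit [file_id mod ROWS][offset mod COLS] for every posting."""
    bits = [[0] * COLS for _ in range(ROWS)]
    for file_id, offset in postings:
        bits[file_id % ROWS][offset % COLS] = 1
    return bits

def shift_columns_left(fp):
    """Cyclically move every column one position to the left."""
    return [row[1:] + row[:1] for row in fp]

def combine(fp_w1, fp_w2):
    """Approximate the fingerprint of w1w2 (assumed: AND with the shifted B_w2)."""
    shifted = shift_columns_left(fp_w2)
    return [[a & b for a, b in zip(r1, r2)] for r1, r2 in zip(fp_w1, shifted)]

b_r = fingerprint([(0, 3), (1, 2)])    # hypothetical postings of the 1-gram "r"
b_h = fingerprint([(0, 4), (5, 11)])   # hypothetical postings of the 1-gram "h"
b_rh = combine(b_r, b_h)
for row in b_rh:
    print(row)
```

In this example the bit at row 0, column 3 is a real occurrence of "rh" (file 0, offset 3). The bit at row 1, column 2 is a false positive: file 5 and file 1 share a residue class, so the fingerprint cannot tell them apart. This is why the result is only an approximation and candidates have to be verified.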

13 N-Gram Fingerprint

14 N-Gram Fingerprint
Search Speed
Query      Bit-matrix  Time for verification  Hits
rhinolo    219 ms      94 ms                  18
sanfilipo  290 ms      0 ms                   0
itracon    266 ms      336 ms                 64
oxyuria    197 ms      48 ms                  6
Results from the "Online Encyclopedia of Dermatology" by P. Altmeyer

15 Term Frequencies and Query Probability

16 N-GRAM FINGERPRINTS IN COMBINATION WITH POSTING LISTS

17 Combining Fingerprints and Posting Lists
By combining fingerprints and posting lists, no verification step is needed
Posting lists are partitioned into smaller subsets
Each bit of the fingerprint corresponds to a separate posting list
The cost of intersecting posting lists is reduced
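A rough Python sketch of the idea, under the assumption that each fingerprint bit is identified by the pair (file ID residue, offset residue) and that the posting subset behind a bit holds the exact (file ID, offset) pairs. The sizes and the names partition and find_bigram are illustrative only.

```python
from collections import defaultdict

ROWS, COLS = 4, 8  # residue classes for file ID and offset (illustrative sizes)

def partition(postings):
    """Split a posting list into subsets keyed by (file_id mod ROWS, offset mod COLS)."""
    buckets = defaultdict(set)
    for file_id, offset in postings:
        buckets[(file_id % ROWS, offset % COLS)].add((file_id, offset))
    return buckets

def find_bigram(buckets_w1, buckets_w2):
    """Exact start positions of w1 directly followed by w2."""
    hits = set()
    for (row, col), subset_w1 in buckets_w1.items():
        # An occurrence of w1 in cell (row, col) needs w2 in cell (row, col + 1).
        subset_w2 = buckets_w2.get((row, (col + 1) % COLS))
        if not subset_w2:
            continue  # the matching fingerprint bit is 0: nothing to intersect
        # Only two small subsets are compared, and the result is exact.
        hits |= {(f, o) for f, o in subset_w1 if (f, o + 1) in subset_w2}
    return hits

r = partition([(0, 3), (1, 2)])    # hypothetical postings of the 1-gram "r"
h = partition([(0, 4), (5, 11)])   # hypothetical postings of the 1-gram "h"
print(find_bigram(r, h))
```

Because the subsets contain real postings, a cell whose bits match only by coincidence yields an empty intersection instead of a candidate that must be checked against the text.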

18 Combining Fingerprints and Posting Lists

19 Managing n-Gram Posting Lists
A very large number of posting subsets has to be managed. For example:
– 1024 residue classes for the fileID
– 128 residue classes for the offset
– 14,000 different n-grams
Subsets are stored in a hash
The hash value is a function of the residue classes
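To give a feel for the numbers, the sketch below computes how many subsets the example implies and shows one possible key function. The three constants come from the slide; the key layout, the n-gram ID parameter, and the dictionary-based store are assumptions for illustration, since the talk's actual hash function is not given here.

```python
FILE_CLASSES = 1024    # residue classes for the fileID (from the slide)
OFFSET_CLASSES = 128   # residue classes for the offset (from the slide)
NGRAMS = 14_000        # number of different n-grams (from the slide)

def subset_key(gram_id, file_id, offset):
    """Hypothetical key: a unique integer per (n-gram, file class, offset class)."""
    file_class = file_id % FILE_CLASSES
    offset_class = offset % OFFSET_CLASSES
    return (gram_id * FILE_CLASSES + file_class) * OFFSET_CLASSES + offset_class

store = {}  # only non-empty subsets take up space

def add_posting(gram_id, file_id, offset):
    """Append a posting to the subset selected by its residue classes."""
    store.setdefault(subset_key(gram_id, file_id, offset), []).append((file_id, offset))

total_subsets = NGRAMS * FILE_CLASSES * OFFSET_CLASSES
print(total_subsets)  # 1835008000 possible subsets

add_posting(7, 1025, 129)
add_posting(7, 1, 1)
print(store[subset_key(7, 1, 1)])
```

With roughly 1.8 billion possible subsets, an array with one slot per subset would be mostly empty, which is one plausible reason for keeping the subsets in a hash.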

20 Managing n-Gram Posting Lists

21 Managing n-Gram Posting Lists

22 Results
Performance improved by 40% compared to the setup without posting lists
Query      Bit-matrix  Time for verification  Hits
rhinolo    230 ms      10 ms                  18
sanfilipo  271 ms      0 ms                   0
itracon    245 ms      15 ms                  64
oxyuria    210 ms      12 ms                  6

23 FINGERPRINT COMPRESSION

24 Fingerprint Compression
Fingerprints with high or low densities do not contain much information
Fingerprints can be compressed by reducing the resolution
Dictionary-based compression
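One plausible reading of "reducing the resolution" is to fold the bit matrix so that several cells share one bit. The sketch below measures the density of a fingerprint and, if it is very low or very high, halves the number of columns by OR-ing column c with column c + COLS/2. The folding scheme, the thresholds 0.05 and 0.95, and all function names are assumptions; an even column count is also assumed.

```python
def density(fp):
    """Fraction of set bits in the fingerprint."""
    cells = [bit for row in fp for bit in row]
    return sum(cells) / len(cells)

def fold_columns(fp):
    """Halve the column count by OR-ing column c with column c + half."""
    half = len(fp[0]) // 2
    return [[row[c] | row[c + half] for c in range(half)] for row in fp]

def compress(fp, low=0.05, high=0.95):
    """Fold only fingerprints whose density lies outside (low, high)."""
    d = density(fp)
    return fold_columns(fp) if d <= low or d >= high else fp

sparse = [[1, 0, 0, 0, 0, 0, 0, 0]] + [[0] * 8 for _ in range(3)]
mixed = [[1, 0, 1, 0, 1, 0, 1, 0] for _ in range(4)]

print(density(sparse), len(compress(sparse)[0]))  # low density: folded to 4 columns
print(density(mixed), len(compress(mixed)[0]))    # medium density: kept at 8 columns
```

OR-folding never clears a bit, so a folded fingerprint cannot miss a real occurrence; it can only produce more false positives, which matches the idea of trading a little performance for a smaller index.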

25 Fingerprint Compression
Results: Fingerprint convolution
Density threshold for convolution  Performance loss  Fingerprint index reduction
no convolution                     0 %
0-0.025 and 0.975-1                3.1 %             23 %
0-0.05 and 0.95-1                  3.2 %             27 %
0-0.1 and 0.9-1                    10 %              29 %
0-0.2 and 0.8-1                    25 %              31 %
In combination with the dictionary-based compression, the index size is reduced by an additional 30%

26 CONCLUSION AND FUTURE WORK

27 Conclusion
Fingerprints improve the scalability of n-gram indices
Fingerprints improve the performance of n-gram indices
The index structure can be adjusted to user behavior, so that common queries can be processed more efficiently
The fingerprints can be stored in a compressed index while losing only a minimum of performance

28 Future Work
Combination of a term-based inverted index and an n-Gram fingerprint index
Profit from the advantages of using both terms and n-Grams as indexing terms
– Substring search
– Ranking
– Thesaurus information

29 Digital Libraries: Advanced Methods and Technologies, Digital Collections 17.09.2009 Thank You!

