Published by Hannah Parks. Modified over 9 years ago.
1
Fast Two-Sided Error-Tolerant Search
Hannah Bast, Marjan Celikik
University of Freiburg, Germany
KEYS 2010
2
06.06.2010, Efficient Two-Sided Error-Tolerant Search
Motivation
Query side: users make mistakes when typing the query
- either due to mistyping
- or because they do not know the correct spelling (they have incomplete knowledge of the underlying data)
Handling uncertainty in text search is important
4
Motivation
Query side: users make mistakes when typing the query
- either due to mistyping
- or because they do not know the correct spelling or have incomplete knowledge of the underlying data
Document side: the documents themselves contain mistakes
- those who type the documents also make mistakes
- OCR errors
Handling uncertainty in text search is important
6
State Of The Art
- A lot of work has been done on approximate string matching / searching
- Not so much work on fast error-tolerant search
- There is prior work on document-side error tolerance
- Overall, only a few relevant papers in the literature
BASELINE: replace each query word by a disjunction of similar words
7
BASELINE is anything but efficient
Example: fast AND list AND intersction
becomes fast AND list AND (intersection OR interrsection OR intersession OR intersacitionn OR intrasection OR …)
- There can be hundreds of similar words!
- Large list merging and disk I/O overhead
But the current state of the art is not much faster than BASELINE …
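The BASELINE rewriting can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "similar" means Levenshtein distance within a threshold and finds similar words by scanning the whole vocabulary. The function names, the sample vocabulary and the threshold are all hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def similar_words(word, vocabulary, threshold):
    """All vocabulary words within `threshold` edits of `word` (naive scan)."""
    return sorted(w for w in vocabulary if edit_distance(word, w) <= threshold)


def baseline_query(keywords, vocabulary, threshold=1):
    """Replace each keyword by a disjunction of its similar words."""
    parts = []
    for kw in keywords:
        sims = similar_words(kw, vocabulary, threshold) or [kw]
        parts.append(sims[0] if len(sims) == 1 else "(" + " OR ".join(sims) + ")")
    return " AND ".join(parts)


# Hypothetical toy vocabulary.
vocab = {"fast", "fest", "list", "intersection"}
print(baseline_query(["fast", "list", "intersction"], vocab))
```

With a realistic vocabulary each disjunction can contain hundreds of words, and every one of them contributes a posting list that has to be read and merged, which is the overhead the slide refers to.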
9
Our Approach - Clustering
Based on a clustering of the vocabulary
- The vocabulary V is the set of all words in a corpus
- The clusters may overlap, i.e. a word can belong to a few clusters
Definition (cover)
Let q be a keyword, K a clustering of V, and $S_q$ the set of all words of V within a threshold T of q.
- An exact cover of $S_q$ is a set of clusters $C_1, \dots, C_n$ from K whose union contains $S_q$.
- An approximate cover of $S_q$ does not necessarily contain all of $S_q$.
- The number n of clusters in the cover is called the cover index.
- With $C = C_1 \cup \dots \cup C_n$, the precision of a cover is defined as $|S_q \cap C| / |C|$.
- The recall of a cover is defined as $|S_q \cap C| / |S_q|$.
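To make the three quantities concrete, here is a small sketch that computes cover index, precision and recall for a given cover. It assumes the usual set-based definitions: precision is the fraction of covered words that are similar to q, and recall is the fraction of similar words that are covered. The function name and the example clusters are hypothetical.

```python
def cover_metrics(similar, cover):
    """Cover index, precision and recall of `cover` (a list of word sets)
    with respect to `similar`, the set of words similar to the keyword."""
    covered = set().union(*cover)   # union of all clusters in the cover
    hits = similar & covered        # similar words that the cover reaches
    return {
        "cover_index": len(cover),
        "precision": len(hits) / len(covered) if covered else 1.0,
        "recall": len(hits) / len(similar) if similar else 1.0,
    }


# Hypothetical data: four similar words, two clusters.
similar = {"algorithm", "algoithm", "aglorithm", "logarithm"}
cover = [{"algorithm", "algoithm", "algorithmic"}, {"aglorithm"}]
print(cover_metrics(similar, cover))
```

In this toy case the cover misses "logarithm", which lowers recall, and includes "algorithmic", which lowers precision, so it is an approximate cover with cover index 2.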
10
Our Approach - Clustering
Compute a clustering so that for each q we can compute a good cover:
(C1) with cover index as small as possible
(C2) with recall as large as possible
(C3) with precision as large as possible
(C4) with frequency-weighted overlap as small as possible
11
Using the Clustering - Indexing
For each occurrence of a word, determine its clusters
  e.g. "house" in Doc. 7012 is in clusters 165 and 9823
Add corresponding artificial postings to the index by prepending the cluster ids, e.g.
  house          Doc. 7012
  C:165:house    Doc. 7012
  C:9823:house   Doc. 7012
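The indexing step can be sketched with a toy in-memory inverted index. This only illustrates the posting scheme from the slide; the dictionary-based index, the function name and the `clusters_of` mapping are assumptions and say nothing about how the real index is stored.

```python
from collections import defaultdict


def build_index(docs, clusters_of):
    """Toy inverted index: each word occurrence yields its ordinary posting
    plus one artificial posting "C:<cluster id>:<word>" per cluster."""
    index = defaultdict(list)
    for doc_id, words in docs.items():
        for word in words:
            index[word].append(doc_id)
            for cid in clusters_of.get(word, ()):
                index[f"C:{cid}:{word}"].append(doc_id)
    return index


# The example from the slide: "house" in Doc. 7012 lies in clusters 165 and 9823.
index = build_index({7012: ["house"]}, {"house": [165, 9823]})
for term in sorted(index):
    print(term, index[term])
```

Because all words of a cluster share the prefix `C:<cluster id>:`, the postings of a whole cluster can later be retrieved with a single prefix query.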
12
Using the Clustering - Query Time
- For each q, compute the set of similar words and all affected cluster ids
- Compute a cover with minimal cover index: for the given cover recall (and precision), there is no cover with smaller cover index (similar to the set cover problem)
- Transform q into a disjunction of prefix queries, e.g. C:59:* OR C:1017:* …
- Use efficient prefix search to process the transformed query (we use the HYB index)
[Figure: the similar words algoritm, algorithm, alggorithm, algoithm, algoirthm, alggorithluq, logarithm, aglorithm, algorithmica, algorithmic, … annotated with the cluster-id lists 59, 201 | 59, 221 | 59, 1017, 56 | 1017, 221 | 1017 | 61, 472 | 59, 201 | 1017 | 59, 472; cluster 59 and cluster 1017 are marked.]
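The query-time steps can be sketched as follows. The slides only state that finding a cover with minimal cover index resembles the set cover problem. The greedy heuristic used here is the textbook set-cover approximation, swapped in for illustration, and is not claimed to be the authors' algorithm. The function names and the cluster contents are hypothetical; only the cluster ids 59 and 1017 are taken from the slide.

```python
def greedy_cover(similar, clusters, min_recall=1.0):
    """Pick cluster ids greedily until at least `min_recall` of the similar
    words are covered. `clusters` maps cluster id -> set of words."""
    remaining = set(similar)
    chosen = []
    while len(similar) - len(remaining) < min_recall * len(similar):
        candidates = [cid for cid in sorted(clusters) if cid not in chosen]
        best = max(candidates,
                   key=lambda cid: len(clusters[cid] & remaining),
                   default=None)
        if best is None or not clusters[best] & remaining:
            break  # no remaining cluster covers any further similar word
        chosen.append(best)
        remaining -= clusters[best]
    return chosen


def to_prefix_query(cluster_ids):
    """Turn the chosen cluster ids into a disjunction of prefix queries."""
    return " OR ".join(f"C:{cid}:*" for cid in cluster_ids)


# Hypothetical cluster contents.
similar = {"algorithm", "alggorithm", "algoithm", "algoirthm", "aglorithm"}
clusters = {
    59: {"algorithm", "alggorithm", "algoithm"},
    201: {"algorithm"},
    1017: {"algoirthm", "aglorithm"},
}
print(to_prefix_query(greedy_cover(similar, clusters)))
```

Lowering `min_recall` trades recall for a smaller cover index, which is the tension between goals (C1) and (C2).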
13
Computing a Clustering
How to compute a clustering with favorable properties (C1) - (C4)?
- It's easy to optimize for (C1) alone, but then (C2) will suffer
- It's easy to optimize for (C1) - (C3) alone, but then (C4) will suffer
- etc.
[Figure: the word "algorithm" with misspellings (algoithm, aglorithmm, algoritw2, algortm, algoirtm, a1gor1thm, algoritluq); "algorithm" lies in clusters x, y, z, v and is indexed as C:x:algorithm, C:y:algorithm, C:z:algorithm, C:v:algorithm, …]
14
Experimental results

Collections
                   DBLP-Meta             DBLP-Full         Wikipedia
Collection size    1.3 million records   31,211 articles   9.8 million articles
Vocabulary size    250,000 words         1 million words   8.5 million words

Average query times
              DBLP-Meta   DBLP-Full   Wikipedia
ORDINARY      1.26 ms     6.8 ms      61.0 ms
OUR METHOD    2.0 ms      11.2 ms     112.6 ms
BASELINE      11.9 ms     121.9 ms    1468.2 ms

Average number of clusters and similar words
                     DBLP-Meta   DBLP-Full   Wikipedia
Avg. clusters        2.4         2.2         4.1
Avg. similar words   20          70          77
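The relative costs implied by the average query times can be worked out directly. The snippet below only divides the reported numbers; the dictionary layout and the function name are illustrative.

```python
# Average query times in milliseconds, as reported on the slide.
times_ms = {
    "ORDINARY":   {"DBLP-Meta": 1.26, "DBLP-Full": 6.8,   "Wikipedia": 61.0},
    "OUR METHOD": {"DBLP-Meta": 2.0,  "DBLP-Full": 11.2,  "Wikipedia": 112.6},
    "BASELINE":   {"DBLP-Meta": 11.9, "DBLP-Full": 121.9, "Wikipedia": 1468.2},
}


def slowdown(method, reference):
    """Per-collection ratio of the query time of `method` to `reference`."""
    return {c: times_ms[method][c] / times_ms[reference][c]
            for c in times_ms[method]}


for name, ratio in slowdown("OUR METHOD", "ORDINARY").items():
    print(f"OUR METHOD vs ORDINARY on {name}: {ratio:.2f}x")
for name, ratio in slowdown("BASELINE", "OUR METHOD").items():
    print(f"BASELINE vs OUR METHOD on {name}: {ratio:.2f}x")
```

On these figures the clustering-based method is roughly 1.6 to 1.8 times slower than ordinary (exact) search, and BASELINE is roughly 6 to 13 times slower than the clustering-based method.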
15
Experimental results

Average cover precision and recall
             DBLP-Full        Wikipedia
Frequency    high    low      high    low
Recall       0.96    0.99     0.93    0.95
Precision    0.69    0.61     0.63    0.57

Index sizes
              DBLP-Meta   DBLP-Full   Wikipedia
ORDINARY      91 MB       414 MB      8.4 GB
OUR METHOD    115 MB      472 MB      9.5 GB