Fast Two-Sided Error-Tolerant Search
Hannah Bast, Marjan Celikik
University of Freiburg, Germany
KEYS 2010



06.06.2010, Efficient Two-Sided Error-Tolerant Search

Motivation
- Query side: users make mistakes when typing the query
  - Either due to mistyping
  - Or because we do not know the correct spelling or have incomplete knowledge about the underlying data
- Handling uncertainty in text search is important
- Document side: mistakes in the documents
  - Those who type the documents also make mistakes
  - OCR errors

State of the Art
- A lot of work has been done on approximate string matching / searching
- Not so much work on fast error-tolerant search
- There is prior work on document-side error tolerance
- Overall, only a few relevant papers in the literature
- BASELINE: replace each query word by a disjunction of similar words

BASELINE is anything but efficient
- Example: the query
    fast AND list AND intersction
  becomes
    fast AND list AND (intersection OR interrsection OR intersession OR intersacitionn OR intrasection OR …)
- There can be hundreds of similar words!
- Large list merging and disk I/O overhead
- But the current state of the art is not much faster than BASELINE …
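The BASELINE expansion can be sketched in a few lines. This is only an illustration under stated assumptions: similarity is taken to be plain Levenshtein distance with a fixed threshold, and the names `edit_distance`, `similar_words` and `baseline_query` are hypothetical helpers, not taken from the talk.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def similar_words(q, vocabulary, threshold):
    """All vocabulary words within the distance threshold of q (assumed similarity measure)."""
    return sorted(w for w in vocabulary if edit_distance(q, w) <= threshold)


def baseline_query(keywords, vocabulary, threshold=2):
    """Replace every keyword by a disjunction of its similar words."""
    parts = []
    for q in keywords:
        words = similar_words(q, vocabulary, threshold) or [q]
        parts.append(words[0] if len(words) == 1 else "(" + " OR ".join(words) + ")")
    return " AND ".join(parts)


# Toy vocabulary; a real one has hundreds of similar words per keyword.
vocabulary = {"fast", "intersection", "interrsection", "intersession"}
print(baseline_query(["fast", "intersction"], vocabulary))
```

Every word in the disjunction has its own posting list, so the cost of the merge grows with the number of similar words. That is the overhead the slide refers to.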


Our Approach - Clustering
- Based on clustering of the vocabulary
- A vocabulary V is the set of all words in a corpus
- The clusters may overlap, i.e. a word can belong to a few clusters
- Definition (cover): Let q be a keyword, K a clustering of V, and S_q the set of all words within a threshold T of q. An exact cover of S_q is a set of clusters C_1, ..., C_n from K whose union C contains S_q. An approximate cover of S_q does not necessarily contain all of S_q.
- The number of sets n in the cover is called the cover index
- Precision of a cover is defined as |S_q ∩ C| / |C|
- Recall of a cover is defined as |S_q ∩ C| / |S_q|
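The cover notions above can be made concrete with a small sketch. It assumes the usual set-based reading: precision is the fraction of covered words that are similar to q, and recall is the fraction of similar words that are covered. The function name `cover_stats` and the toy clusters are hypothetical.

```python
def cover_stats(similar, cover):
    """Return (cover index, precision, recall) of a cover.

    similar: set of words similar to the keyword q
    cover:   list of clusters (each a set of words) chosen from the clustering
    """
    union = set().union(*cover) if cover else set()
    hit = similar & union
    precision = len(hit) / len(union) if union else 0.0
    recall = len(hit) / len(similar) if similar else 0.0
    return len(cover), precision, recall


# Toy example: two overlapping-free clusters covering all three similar words,
# but also pulling in two words that are not similar to q.
similar = {"algorithm", "algoithm", "algoirthm"}
cover = [{"algorithm", "algoithm", "logarithm"}, {"algoirthm", "algorithmic"}]
index, precision, recall = cover_stats(similar, cover)
print(index, precision, recall)
```

An exact cover always has recall 1. Its precision drops with every non-similar word the chosen clusters bring along.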

Our Approach - Clustering
- Compute a clustering so that for each q we can compute a good cover:
  (C1) with cover index as small as possible
  (C2) with recall as large as possible
  (C3) with precision as large as possible
  (C4) with frequency-weighted overlap as small as possible

Using the Clustering – Indexing
- For each occurrence of a word, determine its clusters
- Add corresponding artificial postings to the index by prepending the cluster ids, e.g. for an occurrence of house in Doc. 7012, with house in clusters 165 and 9823:
    house           Doc. 7012
    C:165:house     Doc. 7012
    C:9823:house    Doc. 7012
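The indexing step might look roughly as follows. This is a sketch over a toy in-memory inverted index, not the index structure used in the talk; `build_index`, `docs` and `clusters_of` are hypothetical names, and the `C:<cluster id>:<word>` key format follows the slide.

```python
from collections import defaultdict


def build_index(docs, clusters_of):
    """Build a toy inverted index with artificial cluster postings.

    docs:        doc id -> list of word occurrences
    clusters_of: word -> ids of the clusters containing that word
    """
    index = defaultdict(list)
    for doc_id, words in docs.items():
        for word in words:
            index[word].append(doc_id)  # ordinary posting
            for cid in clusters_of.get(word, ()):
                index[f"C:{cid}:{word}"].append(doc_id)  # artificial posting
    return dict(index)


docs = {7012: ["house"]}
clusters_of = {"house": [165, 9823]}
index = build_index(docs, clusters_of)
for key in sorted(index):
    print(key, index[key])
```

Because the cluster id is a prefix of the key, all words of one cluster are contiguous in the sorted vocabulary and can be fetched with a single prefix query.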

Using the Clustering – Query Time
- For each q, compute the set of similar words S_q and all affected cluster ids
- Compute a cover with minimal cover index: given a cover recall (and precision), there is no cover with smaller cover index (similar to the set cover problem)
- Transform q into a disjunction of prefix queries
- Use efficient prefix search to process the transformed query (we use the HYB index)

Example (figure): for the query word algoritm, the similar words are algorithm, alggorithm, algoithm, algoirthm, alggorithluq, logarithm, aglorithm, algorithmica, algorithmic, … Each word is annotated with the ids of its clusters (59, 201; 59, 221; 59, 1017, 56; 1017, 221; 1017; 61, 472; 59, 472; …). The cover shown consists of cluster 59 and cluster 1017, so the query becomes
    C:59:* OR C:1017:* …
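The query-time steps can be sketched as below. The slide only says that finding a cover with minimal cover index resembles the set cover problem; the greedy heuristic used here is a standard approximation for set cover and is an assumption of this sketch, not necessarily the authors' procedure. The names `greedy_cover` and `to_prefix_query` and the toy cluster contents are hypothetical.

```python
def greedy_cover(similar, clusters, min_recall=1.0):
    """Pick cluster ids greedily until the requested recall is reached.

    similar:  set of words similar to the keyword q
    clusters: cluster id -> set of words in that cluster
    """
    uncovered = set(similar)
    target = min_recall * len(similar)
    chosen = []
    while len(similar) - len(uncovered) < target:
        candidates = [cid for cid in sorted(clusters) if cid not in chosen]
        if not candidates:
            break
        # Take the cluster that covers the most still-uncovered similar words.
        best = max(candidates, key=lambda cid: len(clusters[cid] & uncovered))
        if not clusters[best] & uncovered:
            break  # no remaining cluster helps; the cover stays approximate
        chosen.append(best)
        uncovered -= clusters[best]
    return chosen


def to_prefix_query(cluster_ids):
    """Turn the chosen cluster ids into a disjunction of prefix queries."""
    return " OR ".join(f"C:{cid}:*" for cid in cluster_ids)


clusters = {
    59: {"algorithm", "algoithm", "algoirthm", "aglorithm"},
    221: {"algoithm"},
    1017: {"logarithm", "algorithmica", "algorithmic"},
}
similar = {"algorithm", "algoithm", "algoirthm", "aglorithm", "algorithmic"}
cover = greedy_cover(similar, clusters)
print(to_prefix_query(cover))
```

Lowering `min_recall` trades recall for a smaller cover index, which is the tension between (C1) and (C2).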

Computing a Clustering
- How to compute a clustering with favorable properties (C1) - (C4)?
- It's easy to optimize for (C1) alone, but then (C2) will suffer
- It's easy to optimize for (C1) - (C3) alone, but then (C4) will suffer, etc.

Example (figure): the word algorithm together with variants such as algoithm, aglorithmm, algoritw2, algortm, algoirtm, a1gor1thm, algoritluq. When algorithm belongs to clusters x, y, z, v, it is indexed as C:x:algorithm, C:y:algorithm, C:z:algorithm, C:v:algorithm, …

Experimental results

Average query times
                DBLP-Meta    DBLP-Full    Wikipedia
ORDINARY        1.26 ms      6.8 ms       61.0 ms
OUR METHOD      2.0 ms       11.2 ms      112.6 ms
BASELINE        11.9 ms      121.9 ms     1468.2 ms

Average number of clusters and similar words
                      DBLP-Meta    DBLP-Full    Wikipedia
Avg. clusters         2.4          2.2          4.1
Avg. similar words    20           70           77

Collections
                   DBLP-Meta              DBLP-Full          Wikipedia
Collection size    1.3 million records    31,211 articles    9.8 million articles
Vocabulary size    250,000 words          1 million words    8.5 million words
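To read the query-time table more easily, the ratios between the three methods can be computed directly from the reported numbers. The snippet below only does this arithmetic; the dictionary layout and variable names are illustrative.

```python
# Average query times in milliseconds, copied from the table above.
query_ms = {
    "DBLP-Meta": {"ORDINARY": 1.26, "OUR METHOD": 2.0, "BASELINE": 11.9},
    "DBLP-Full": {"ORDINARY": 6.8, "OUR METHOD": 11.2, "BASELINE": 121.9},
    "Wikipedia": {"ORDINARY": 61.0, "OUR METHOD": 112.6, "BASELINE": 1468.2},
}

ratios = {}
for collection, t in query_ms.items():
    speedup = t["BASELINE"] / t["OUR METHOD"]    # gain over BASELINE
    slowdown = t["OUR METHOD"] / t["ORDINARY"]   # cost relative to exact search
    ratios[collection] = (speedup, slowdown)
    print(f"{collection}: {speedup:.1f}x faster than BASELINE, "
          f"{slowdown:.2f}x the time of ORDINARY")
```

On these numbers the method is roughly 6 to 13 times faster than BASELINE while staying within a factor of two of ordinary exact search.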

Experimental results

Average cover precision and recall
              DBLP-Full         Wikipedia
Frequency     high     low      high     low
Recall        0.96     0.99     0.93     0.95
Precision     0.69     0.61     0.63     0.57

Index sizes
              DBLP-Meta    DBLP-Full    Wikipedia
ORDINARY      91 MB        414 MB       8.4 GB
OUR METHOD    115 MB       472 MB       9.5 GB
