Dictionary data structures for the Inverted Index


Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

This lecture (Ch. 3)
Dictionary data structures: exact search, prefix search
"Tolerant" retrieval: edit-distance queries, wild-card queries, spelling correction, Soundex

Basics (Sec. 3.1)
The dictionary data structure stores the term vocabulary. But in what data structure?

A naïve dictionary (Sec. 3.1)
An array of structs: char[20] (20 bytes) + int (4/8 bytes) + Postings* (4/8 bytes).
How do we store a dictionary in memory efficiently? How do we quickly look up elements at query time?
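As a minimal sketch (not from the slides), the naïve dictionary can be modeled as a sorted array of (term, document frequency, postings pointer) entries, mirroring the char[20] / int / Postings* struct, with binary search answering the lookup question:

```python
import bisect

class NaiveDictionary:
    """Sorted array of (term, df, postings_ptr) entries; lookup by binary search."""

    def __init__(self, entries):
        # entries: iterable of (term, df, postings_ptr); keep them sorted by term
        self.entries = sorted(entries)
        self.terms = [t for t, _, _ in self.entries]

    def lookup(self, term):
        # O(log K) comparisons over K terms
        i = bisect.bisect_left(self.terms, term)
        if i < len(self.terms) and self.terms[i] == term:
            return self.entries[i]
        return None

d = NaiveDictionary([("box", 2, 0), ("abaco", 1, 1), ("paolo", 3, 2)])
assert d.lookup("paolo") == ("paolo", 3, 2)
assert d.lookup("zoo") is None
```

The fixed char[20] field of the original struct wastes space on short terms and truncates long ones, which is part of why real systems move to the structures discussed next.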

Dictionary data structures (Sec. 3.1)
Main choices: hash table; tree; trie. Some IR systems use hashes, some use trees/tries.

Hashing with chaining (Prof. Paolo Ferragina, Univ. Pisa)
A widely used string hash is MurmurHash (currently version 3), which yields a 32-bit or 128-bit hash value.

Another "provably good" hash
Let L = bit length of the string and m = table size. Split the key k into r ≈ L / log2(m) chunks k_0, k_1, …, k_r of ≈ log2(m) bits each, and pick coefficients a_0, a_1, …, a_r, each selected at random in [0, m). Then
h(k) = (Σ_i a_i · k_i) mod m
with m prime; m need not be prime if one first reduces modulo a prime p, i.e. (… mod p) mod m.
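A minimal Python sketch of this scheme (an illustration, not from the slides): here each byte of the key plays the role of one chunk k_i, so m should exceed 256, and the coefficients a_i are drawn at random once, up front:

```python
import random

def universal_string_hash(key_bytes, m, a):
    """Chunked random-coefficient hash: h(k) = (sum_i a_i * k_i) mod m.

    key_bytes: the key, one byte per chunk (so chunks have ~8 bits each);
    a: list of coefficients, each drawn uniformly at random from [0, m).
    """
    return sum(ai * ki for ai, ki in zip(a, key_bytes)) % m

m = 2 ** 16                                    # table size (must exceed 256 here)
random.seed(42)                                # fixed seed just for reproducibility
a = [random.randrange(m) for _ in range(32)]   # supports keys of up to 32 chunks

h = universal_string_hash(b"paolo", m, a)
assert 0 <= h < m
```

Equal keys always hash equally once the coefficients are fixed; the "provably good" guarantee is that, over the random choice of the a_i, any two distinct keys collide with probability about 1/m.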

Prefix search

Prefix-string search
Given a dictionary D of K strings, of total length N, store them so that we can efficiently support prefix searches for a pattern P over them.
Example: pattern P = pa, Dict = {abaco, box, paolo, patrizio, pippo, zoo} → matches paolo and patrizio.

Trie: speeding-up searches
(Figure: a compacted trie over the strings systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo, with labeled edges and string offsets at the nodes.)
Pro: O(p) search time — a single root-to-node path scan.
Cons: space for edge labels, node labels, and the tree structure itself.

Tries (Sec. 3.1)
Many variants and implementations exist.
Pros: solves the prefix problem.
Cons: slower in practice — O(p) time but many cache misses; from 10 to 60 (or even more) bytes per node.
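The prefix search the slides describe can be sketched as a plain (non-compacted) trie in Python — an illustration only, using the slide's example dictionary; a compacted trie would merge unary paths to save space:

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # char -> TrieNode
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def prefix_search(root, prefix):
    """Walk the prefix in O(p) steps, then collect every word in the subtree."""
    node = root
    for ch in prefix:
        if ch not in node.children:
            return []
        node = node.children[ch]
    out = []
    def collect(n, acc):
        if n.is_word:
            out.append(acc)
        for ch, child in sorted(n.children.items()):
            collect(child, acc + ch)
    collect(node, prefix)
    return out

root = TrieNode()
for w in ["abaco", "box", "paolo", "patrizio", "pippo", "zoo"]:
    insert(root, w)
assert prefix_search(root, "pa") == ["paolo", "patrizio"]
```

Each lookup touches one node per pattern character, which is exactly the O(p)-with-cache-misses behavior the slide lists as the trie's trade-off.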

2-level indexing
Keep the full dictionary on disk, partitioned into buckets; keep in internal memory a compacted trie (CT) built on a sample of the strings.
Two advantages: search ≈ typically 1 I/O + in-memory computation; space ≈ a trie built over only a subset of the strings.
A disadvantage: a speed-vs-space trade-off (driven by the bucket size).
(Figure: disk buckets holding systile, syzygetic, syzygial, syzygy, szaibelyite, szczecin, szomo; internal memory holds the CT on a sample.)

Front-coding: squeezing the dict
Sorted strings share long prefixes, so store each string as (length of the prefix shared with the previous string, remaining suffix). E.g. systile, syzygetic, syzygial, syzygy → systile, (2, zygetic), (5, ial), (5, y).
Example over a sorted URL list (first string stored in full):
0 http://checkmate.com/All_Natural/
33 Applied.html
34 roma.html
38 1.html
38 tic_Art.html
34 yate.html
35 er_Soap.html
35 urvedic_Soap.html
33 Bath_Salt_Bulk.html
42 s.html
25 Essence_Oils.html
25 Mineral_Bath_Crystals.html
38 Salt.html
33 Cream.html
…
which decodes to:
http://checkmate.com/All_Natural/
http://checkmate.com/All_Natural/Applied.html
http://checkmate.com/All_Natural/Aroma.html
http://checkmate.com/All_Natural/Aroma1.html
http://checkmate.com/All_Natural/Aromatic_Art.html
http://checkmate.com/All_Natural/Ayate.html
http://checkmate.com/All_Natural/Ayer_Soap.html
http://checkmate.com/All_Natural/Ayurvedic_Soap.html
http://checkmate.com/All_Natural/Bath_Salt_Bulk.html
http://checkmate.com/All_Natural/Bath_Salts.html
http://checkmate.com/All/Essence_Oils.html
http://checkmate.com/All/Mineral_Bath_Crystals.html
http://checkmate.com/All/Mineral_Bath_Salt.html
http://checkmate.com/All/Mineral_Cream.html
http://checkmate.com/All/Natural/Washcloth.html
…
Gzip may be much better…
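Front-coding as described above can be sketched in a few lines of Python (an illustration, not the slides' implementation), using the slide's own four-string example:

```python
def front_encode(sorted_strings):
    """Encode each string as (shared-prefix length with previous, suffix)."""
    out = []
    prev = ""
    for s in sorted_strings:
        lcp = 0
        while lcp < min(len(prev), len(s)) and prev[lcp] == s[lcp]:
            lcp += 1
        out.append((lcp, s[lcp:]))
        prev = s
    return out

def front_decode(encoded):
    """Invert front_encode: rebuild each string from the previous one."""
    out, prev = [], ""
    for lcp, suffix in encoded:
        prev = prev[:lcp] + suffix
        out.append(prev)
    return out

words = ["systile", "syzygetic", "syzygial", "syzygy"]
enc = front_encode(words)
assert enc == [(0, "systile"), (2, "zygetic"), (5, "ial"), (5, "y")]
assert front_decode(enc) == words
```

Note that decoding is inherently sequential — to recover a string you must decode everything since the last fully stored string — which is why front-coding pairs naturally with the bucketed 2-level index: each bucket starts with a full string.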

2-level indexing with front-coding
Same scheme, but the buckets on disk are front-coded.
Two advantages: search ≈ typically 1 I/O + in-memory computation; space ≈ a trie built over a subset of the strings, plus front-coding over the buckets.
A disadvantage: a speed-vs-space trade-off (because of the bucket size).
(Figure: disk buckets storing front-coded strings, e.g. systile, (2, zygetic), (5, ial), (5, y), szaibelyite, (2, czecin), (2, omo); internal memory holds the CT on a sample.)

Spelling correction

Spell correction (Sec. 3.3)
Two principal uses: correcting document(s) being indexed; correcting queries to retrieve the "right" answers.
Two main flavors:
Isolated word: check each word on its own for misspelling. But what about from → form?
Context-sensitive correction is more effective: look at the surrounding words, e.g. "I flew form Heathrow to Narita."

Isolated word correction (Sec. 3.3.2)
Fundamental premise: there is a lexicon from which the correct spellings come. Two basic choices for it:
A standard lexicon, such as Webster's English Dictionary, or an "industry-specific", hand-maintained lexicon
The lexicon of the indexed corpus — e.g., all words on the web, all names, acronyms etc. (including the misspellings) — with mining algorithms to derive the possible corrections

Isolated word correction (Sec. 3.3.2)
Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q. What does "closest" mean? We'll study several measures: edit distance (Levenshtein distance), weighted edit distance, n-gram overlap.

Brute-force check of ED (Sec. 3.3.4)
Given query Q, enumerate all character sequences within a preset (weighted) edit distance (e.g., 2), intersect this set with the list of "correct" words, and show the surviving terms to the user as suggestions.
How does the time complexity grow with the number of errors allowed and the string length?

Edit distance (Sec. 3.3.3)
Given two strings S1 and S2, the minimum number of operations to convert one into the other. Operations are typically character-level: insert, delete, replace (possibly also transposition).
E.g., the edit distance from dof to dog is 1; from cat to act it is 2 (just 1 with transpose); from cat to dog it is 3. Generally computed by dynamic programming.

DynProg for edit distance
Let E(i, j) = edit distance to transform S1[1, i] into S2[1, j]. Example: cat versus dea.
Consider the possible last operations: insert, delete, substitute, match. Model edit distance as an alignment problem, where an insertion into S2 corresponds to a – in S1, and a deletion from S1 corresponds to a – in S2. If S1[i] = S2[j] the last op is a match, and it is not counted; otherwise the last op is subst(S1[i], S2[j]), ins(S2[j]) or del(S1[i]).
E(i, 0) = i, E(0, j) = j
E(i, j) = E(i–1, j–1) if S1[i] = S2[j]
E(i, j) = 1 + min{ E(i, j–1), E(i–1, j), E(i–1, j–1) } if S1[i] ≠ S2[j]

(Figure: the DP matrix E filled for an example pair of strings, with +1 arrows marking the insert/delete/substitute steps.)
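The recurrence above translates directly into a table-filling implementation — a straightforward sketch in Python, checked against the slides' own examples:

```python
def edit_distance(s1, s2):
    """Levenshtein distance via the DP recurrence E(i, j)."""
    # E[i][j] = min ops to turn s1[:i] into s2[:j]
    E = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
    for i in range(len(s1) + 1):
        E[i][0] = i                                # i deletions
    for j in range(len(s2) + 1):
        E[0][j] = j                                # j insertions
    for i in range(1, len(s1) + 1):
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                E[i][j] = E[i - 1][j - 1]          # match: not counted
            else:
                E[i][j] = 1 + min(E[i][j - 1],     # insert s2[j]
                                  E[i - 1][j],     # delete s1[i]
                                  E[i - 1][j - 1]) # substitute
    return E[len(s1)][len(s2)]

assert edit_distance("dof", "dog") == 1
assert edit_distance("cat", "act") == 2   # would be 1 if transposition were an op
assert edit_distance("cat", "dog") == 3
assert edit_distance("cat", "dea") == 3
```

Running time and space are O(|S1| · |S2|); space drops to O(min(|S1|, |S2|)) if only the distance (not the alignment) is needed, by keeping two rows.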

Weighted edit distance (Sec. 3.3.3)
As above, but the weight of an operation depends on the character(s) involved. Meant to capture keyboard errors: e.g., m is more likely to be mistyped as n than as q, so replacing m by n has a smaller cost than replacing it by q. Requires a weight matrix as input; modify the DP to handle the weights.

One-error correction

The problem (Sec. 3.3.5)
1 error = insertion, deletion or substitution of one single character:
Farragina → Ferragina (substitution)
Feragina → Ferragina (insertion)
Ferrragina → Ferragina (deletion)
For a string of length L over an alphabet of A chars: #variants = L(A–1) + (L+1)A + L = A(2L+1).
You could have many candidates; you can still exploit keyboard layout or other statistical information (web, Wikipedia, …).
Do we need to perform the enumeration?

A possible approach (Sec. 3.3.5)
Assume a fast string-check in a dictionary. Create two dictionaries:
D1 = { strings }
D2 = { strings of D1 with one deletion }
E.g. D1 = {cat, cast, cst, dag}
D2 = { ag [dag], da [dag], dg [dag], ast [cast], cas [cast], cat [cast], cst [cast], cs [cst], ct [cst], st [cst], at [cat], ca [cat], ct [cat] }

An example (Sec. 3.3.5)
D1 = {cat, cast, cst, dag}
D2 = { ag [dag], ast [cast], at [cat], ca [cat], cas [cast], cat [cast], cs [cst], cst [cast], ct [cat; cst], da [dag], dg [dag], st [cst] }
Query("cat"):
Perfect match: search P in D1, 1 query → cat
P is 1 char short (insertion): search P in D2, 1 query → cat [cast]
P is 1 char long (deletion): search each of P minus one char in D1, p queries → {at, ct, ca} in D1: no match
Substitution: search each of P minus one char in D2, p queries → {at, ct, ca} in D2: at [cat], ct [cat], ct [cst], ca [cat]

A possible approach (Sec. 3.3.5)
Query(P):
Perfect match: search for P in D1, 1 query
P 1 char short: search for P in D2, 1 query
P 1 char long: search for P minus one char in D1, p queries
Substitution: search for P minus one char in D2, p queries
In total, 2p + 2 hash computations for P.
Pro: CPU-efficient — no cache misses for computing P's hashes; but O(p) cache misses to search in D1 and D2.
Cons: large space, because the many strings of D2 must be stored to search its hash table, unless collisions are avoided (perfect hashing). Also FALSE MATCHES: ctw matches both cat and cst as a substitution, even though its true edit distance from each is 2.
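The D1/D2 scheme can be sketched in Python (an illustration using the slides' example dictionary, with plain sets standing in for the hash tables):

```python
def deletions(s):
    """All strings obtained from s by deleting exactly one character."""
    return {s[:i] + s[i + 1:] for i in range(len(s))}

D1 = {"cat", "cast", "cst", "dag"}
# D2: every string of D1 with one character deleted, mapped back to its sources
D2 = {}
for w in D1:
    for d in deletions(w):
        D2.setdefault(d, set()).add(w)

def one_error_candidates(P):
    cands = set()
    if P in D1:                      # perfect match
        cands.add(P)
    if P in D2:                      # P is a dict word minus 1 char -> insertion fixes it
        cands |= D2[P]
    for d in deletions(P):
        if d in D1:                  # P minus 1 char is a dict word -> deletion fixes it
            cands.add(d)
        if d in D2:                  # both sides lose 1 char -> substitution candidate
            cands |= D2[d]
    # Note: the substitution branch can over-match (the slide's FALSE MATCHES);
    # a final true-edit-distance check should post-filter the returned set.
    return cands

assert one_error_candidates("cat") == {"cat", "cast", "cst"}
assert one_error_candidates("ctw") == {"cat", "cst"}   # false matches: true ED is 2
```

The query does the 2p + 2 membership tests the slide counts: two on P itself and p on each of its one-character deletions against D1 and D2.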

Overlap vs edit distance

K-gram index (useful for >1 errors) (Sec. 3.2.2)
The k-gram index contains, for every k-gram (here k = 2), all terms including that k-gram:
$m → mace, madden
mo → among, amortize
on → among, around
Append k–1 copies of the symbol $ at the front of each string, so that a string of length L generates exactly L k-grams.

Overlap distance (Sec. 3.3.4)
Enumerate all 2-grams in Q = mon: $m, mo, on.
$m → mace, moon
mo → among, amortize, moon
on → among, around, moon
Use the 2-gram index to rank all lexicon terms by the number of matching 2-grams: moon wins with 3.

Overlap distance versus edit distance (Sec. 3.3.4)
Select terms by a threshold on matching k-grams. If a term is L chars long, it consists of L k-grams; if E is the number of allowed errors, at most E·k of Q's k-grams can differ from the term's k-grams because of those E errors. So at least L – E·k of the k-grams in Q must match a dictionary term for it to be a candidate answer.
This condition is necessary but not sufficient! T = $mom, Q = $omo: L – E·k = 3 – 1·2 = 1 and T∩Q = {mo, om}, yet EditDist(mom, omo) = 2. If exact ED is required, post-filter the results with dynamic programming.

Example with trigrams (k = 3) (Sec. 3.3.4)
Suppose the term to compare is $$november; its trigrams are $$n, $no, nov, ove, vem, emb, mbe, ber.
The query is Q = $$december; its trigrams are $$d, $de, dec, ece, cem, emb, mbe, ber. The overlap is {emb, mbe, ber}, i.e. 3 trigrams.
|Q| = 8, k = 3: if E = 1 then L – E·k = 8 – 1·3 = 5 > 3 → NO; if E = 2 then L – E·k = 8 – 2·3 = 2 ≤ 3 → OK.
A post-filter on the true edit distance is still needed: ED(november, december) = 3 > 2, so the candidate is discarded.
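The k-gram candidate test can be sketched in Python (an illustration, not from the slides), reproducing the november/december numbers above:

```python
def kgrams(term, k=3):
    """Set of k-grams of term, with k-1 leading '$' pad characters."""
    padded = "$" * (k - 1) + term
    return {padded[i:i + k] for i in range(len(term))}   # exactly len(term) k-grams

def is_candidate(term, query, k=3, E=2):
    """Necessary (not sufficient) condition: >= L - E*k matching k-grams.

    Survivors must still be post-filtered with a true edit-distance check,
    since the overlap threshold can admit terms at distance > E.
    """
    L = len(query)
    overlap = len(kgrams(term, k) & kgrams(query, k))
    return overlap >= L - E * k

# november vs december, k = 3: overlap = {emb, mbe, ber}, i.e. 3 trigrams
assert kgrams("november") & kgrams("december") == {"emb", "mbe", "ber"}
assert is_candidate("november", "december", E=2)       # 3 >= 8 - 2*3 = 2
assert not is_candidate("november", "december", E=1)   # 3 <  8 - 1*3 = 5
```

In a real system the overlap count would come from merging the k-gram postings lists (as in the next slide) rather than from materializing each term's k-gram set at query time.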

Fast searching for k-grams (Sec. 3.3.4)
Assume we wish to match the bigrams (lo, or, rd) against a 2-gram index built over a dictionary of terms:
lo → alone, lord, sloth
or → border, lord, morbid
rd → ardent, border, card
A standard postings "merge" enumerates each term together with the number of lists it occurs in; then keep only the terms whose count is ≥ L – k·E, filtering out the rest.

Context-sensitive spell correction (Sec. 3.3.5)
Text: I flew from Heathrow to Narita. Consider the phrase query "flew form Heathrow". We'd like to respond: Did you mean "flew from Heathrow"? — because no docs matched the query phrase.

General issues in spell correction (Sec. 3.3.5)
We enumerate multiple alternatives and then need to figure out which to present to the user for "Did you mean?". Use heuristics: the alternative hitting most docs; query-log analysis plus tweaking, for especially popular, topical queries.
Spell correction is computationally expensive, so run it only on queries that matched few docs.

Other sophisticated queries

Wild-card queries: * (Sec. 3.2)
mon*: find all docs containing words beginning with "mon" — use a prefix-search data structure.
*mon: find words ending in "mon" — maintain a prefix-search data structure over the reversed terms.
How can we solve the wild-card query pro*cent?

What about * in the middle? (Sec. 3.2)
co*tion: we could look up co* AND *tion and intersect the two lists (expensive); se*ate AND fil*er may result in many Boolean ANDs.
The solution: transform wild-card queries so that the *'s occur at the end. This gives rise to the Permuterm index.

Permuterm index (Sec. 3.2.1)
For the term hello, index it under: hello$, ello$h, llo$he, lo$hel, o$hell, $hello — where $ is a special symbol.
Queries:
X → lookup on X$
X* → lookup on $X*
*X → lookup on X$*
*X* → lookup on X*
X*Y → lookup on Y$X*
X*Y*Z → ??? Exercise!

Permuterm query processing (Sec. 3.2.1)
Rotate the query wild-card to the right: P*Q → Q$P*. Now use the prefix-search data structure.
Permuterm problem: the lexicon grows to ≈ 4× its size (an empirical observation for English).
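The rotation trick can be sketched in a few lines of Python (an illustration, not from the slides): index every rotation of term$, and rewrite a single-wildcard query P*Q as the prefix Q$P:

```python
def permuterm_keys(term):
    """All rotations of term + '$'; each one goes into a prefix-search structure."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def wildcard_to_prefix(query):
    """Rotate a single-'*' wildcard query so the '*' lands at the end: P*Q -> Q$P*."""
    assert query.count("*") == 1
    P, Q = query.split("*")
    return Q + "$" + P          # the trailing '*' becomes an ordinary prefix search

keys = permuterm_keys("hello")
assert keys == ["hello$", "ello$h", "llo$he", "lo$hel", "o$hell", "$hello"]

prefix = wildcard_to_prefix("hel*o")        # -> "o$hel"
assert any(k.startswith(prefix) for k in keys)   # "o$hell" matches
```

Any term matching P*Q has some rotation beginning with Q$P, so one prefix search over the rotated keys answers the wildcard query; the ≈4× blow-up is exactly the cost of storing all rotations.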

Soundex

Soundex (Sec. 3.4)
A class of heuristics to expand a query into phonetic equivalents. Language specific — mainly for names. E.g., chebyshev → tchebycheff. Invented for the U.S. census in 1918.

Soundex — typical algorithm (Sec. 3.4)
Turn every token to be indexed into a reduced form consisting of 4 chars; do the same with query terms; build and search an index on the reduced forms.

Soundex — typical algorithm (Sec. 3.4)
Retain the first letter of the word: Herman → H…
Change all occurrences of A, E, I, O, U, H, W, Y to '0' (zero): Herman → H0rm0n
Change the remaining letters to digits as follows:
B, F, P, V → 1
C, G, J, K, Q, S, X, Z → 2
D, T → 3
L → 4
M, N → 5
R → 6
H0rm0n → H06505

Soundex contd
Remove all pairs of consecutive equal digits: H06505 → H06505 (unchanged here).
Remove all zeros from the resulting string: H06505 → H655.
Pad the result with trailing zeros and keep the first four positions, which will be of the form <uppercase letter><digit><digit><digit>.
E.g., Hermann also becomes H655.
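The steps above can be sketched directly in Python — a minimal transcription of the slides' rules (real Soundex implementations differ in small details, e.g. how H and W between same-coded consonants are treated):

```python
def soundex(name):
    """4-char Soundex code, following the slide's rules."""
    # Digit classes from the slide; vowels plus H, W, Y map to '0'
    codes = {**dict.fromkeys("BFPV", "1"), **dict.fromkeys("CGJKQSXZ", "2"),
             **dict.fromkeys("DT", "3"), "L": "4",
             **dict.fromkeys("MN", "5"), "R": "6"}
    s = name.upper()
    # Retain the first letter, encode the rest: Herman -> H06505
    digits = s[0] + "".join(codes.get(ch, "0") for ch in s[1:])
    # Collapse consecutive equal digits, then drop the zeros: H06505 -> H655
    out = [digits[0]]
    for ch in digits[1:]:
        if ch != out[-1]:
            out.append(ch)
    code = "".join(c for c in out if c != "0")
    # Pad with trailing zeros and keep 4 positions: <letter><digit><digit><digit>
    return (code + "000")[:4]

assert soundex("Herman") == "H655"
assert soundex("Hermann") == "H655"   # the doubled n collapses to one '5'
```

Indexing soundex(token) for every token, and searching on soundex(query), gives the phonetic index the previous slide describes.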

Soundex (Sec. 3.4)
Soundex is the classic algorithm, provided by most databases (Oracle, Microsoft, …). It is not very useful for information retrieval, though it is okay for "high recall" tasks (e.g., Interpol) — while being biased toward names of certain nationalities. Other phonetic-matching algorithms perform much better in the IR context.

Conclusion
We have: a positional inverted index (with skip pointers), a wild-card index, spell correction, and Soundex. We can now solve queries such as:
(SPELL(moriset) AND toron*to) OR SOUNDEX(chaikofski)