Dictionaries and Tolerant Retrieval Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.

Slides:



Advertisements
Similar presentations
Indexing DNA Sequences Using q-Grams
Advertisements

Spelling Correction for Search Engine Queries Bruno Martins, Mario J. Silva In Proceedings of EsTAL-04, España for Natural Language Processing Presenter:
Binary Trees CSC 220. Your Observations (so far data structures) Array –Unordered Add, delete, search –Ordered Linked List –??
Augmenting Data Structures Advanced Algorithms & Data Structures Lecture Theme 07 – Part I Prof. Dr. Th. Ottmann Summer Semester 2006.
File Processing : Hash 2015, Spring Pusan National University Ki-Joune Li.
B+-Trees (PART 1) What is a B+ tree? Why B+ trees? Searching a B+ tree
CpSc 881: Information Retrieval. 2 Dictionaries The dictionary is the data structure for storing the term vocabulary. Term vocabulary: the data Dictionary:
Inverted Index Hongning Wang
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 3 8/30/2010.
Introduction to Information Retrieval Introduction to Information Retrieval Adapted from Christopher Manning and Prabhakar Raghavan Tolerant Retrieval.
Introduction to Information Retrieval Introduction to Information Retrieval Hinrich Schütze and Christina Lioma Lecture 3: Dictionaries and tolerant retrieval.
1 Suffix Trees and Suffix Arrays Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 8)
1 Data structures for Pattern Matching Suffix trees and suffix arrays are a basic data structure in pattern matching Reported by: Olga Sergeeva, Saint.
2-dimensional indexing structure
BTrees & Bitmap Indexes
Tirgul 10 Rehearsal about Universal Hashing Solving two problems from theoretical exercises: –T2 q. 1 –T3 q. 2.
1 Query Languages. 2 Boolean Queries Keywords combined with Boolean operators: –OR: (e 1 OR e 2 ) –AND: (e 1 AND e 2 ) –BUT: (e 1 BUT e 2 ) Satisfy e.
Tirgul 8 Universal Hashing Remarks on Programming Exercise 1 Solution to question 2 in theoretical homework 2.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part B Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
1 Basic Text Processing and Indexing. 2 Document Processing Steps Lexical analysis (tokenizing) Stopwords removal Stemming Selection of indexing terms.
R-Trees 2-dimensional indexing structure. R-trees 2-dimensional version of the B-tree: B-tree of maximum degree 8; degree between 3 and 8 Internal nodes.
Modern Information Retrieval Chapter 4 Query Languages.
1 Database Tuning Rasmus Pagh and S. Srinivasa Rao IT University of Copenhagen Spring 2007 February 8, 2007 Tree Indexes Lecture based on [RG, Chapter.
Homework #3 Due Thursday, April 17 Problems: –Chapter 11: 11.6, –Chapter 12: 12.1, 12.2, 12.3, 12.4, 12.5, 12.7.
E.G.M. PetrakisB-trees1 Multiway Search Tree (MST)  Generalization of BSTs  Suitable for disk  MST of order n:  Each node has n or fewer sub-trees.
1 Query Languages. 2 Boolean Queries Keywords combined with Boolean operators: –OR: (e 1 OR e 2 ) –AND: (e 1 AND e 2 ) –BUT: (e 1 BUT e 2 ) Satisfy e.
Chapter 4 Query Languages.... Introduction Cover different kinds of queries posed to text retrieval systems Keyword-based query languages  include simple.
Text Search and Fuzzy Matching
Indexing Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata.
Binary Search Trees continued Trees Draw the BST Insert the elements in this order 50, 70, 30, 37, 43, 81, 12, 72, 99 2.
Indexing. Goals: Store large files Support multiple search keys Support efficient insert, delete, and range queries.
§6 B+ Trees 【 Definition 】 A B+ tree of order M is a tree with the following structural properties: (1) The root is either a leaf or has between 2 and.
Different Tree Data Structures for Different Problems
Comp 249 Programming Methodology Chapter 15 Linked Data Structure - Part B Dr. Aiman Hanna Department of Computer Science & Software Engineering Concordia.
Chapter 19: Binary Trees. Objectives In this chapter, you will: – Learn about binary trees – Explore various binary tree traversal algorithms – Organize.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 3: Dictionaries and tolerant retrieval Related to Chapter 3:
©Silberschatz, Korth and Sudarshan13.1Database System Concepts Chapter 13: Query Processing Overview Measures of Query Cost Selection Operation Sorting.
1 University of Palestine Topics In CIS ITBS 3202 Ms. Eman Alajrami 2 nd Semester
| 1 › Gertjan van Noord2014 Zoekmachines Lecture 3: tolerant retrieval.
Search A Basic Overview Debapriyo Majumdar Data Mining – Fall 2014 Indian Statistical Institute Kolkata October 20, 2014.
CS Data Structures Chapter 10 Search Structures.
Information Retrieval CSE 8337 Spring 2007 Query Languages & Matching Material for these slides obtained from: Modern Information Retrieval by Ricardo.
© 2006 Pearson Addison-Wesley. All rights reserved13 B-1 Chapter 13 (continued) Advanced Implementation of Tables.
12.1 Chapter 12: Indexing and Hashing Spring 2009 Sections , , Problems , 12.7, 12.8, 12.13, 12.15,
Starting at Binary Trees
Indexing and hashing Azita Keshmiri CS 157B. Basic concept An index for a file in a database system works the same way as the index in text book. For.
1 CS 430: Information Discovery Lecture 4 Files Structures for Inverted Files.
Spring 2003 ECE569 Lecture 05.1 ECE 569 Database System Engineering Spring 2003 Yanyong Zhang
ALGORITHMS.
Dictionaries and Tolerant retrieval
CS 367 Introduction to Data Structures Lecture 9.
Week 15 – Wednesday.  What did we talk about last time?  Review up to Exam 1.
Search Engines WS 2009 / 2010 Prof. Dr. Hannah Bast Chair of Algorithms and Data Structures Department of Computer Science University of Freiburg Lecture.
1 i206: Lecture 16: Data Structures for Disk; Advanced Trees Marti Hearst Spring 2012.
Week 9 - Monday.  What did we talk about last time?  Practiced with red-black trees  AVL trees  Balanced add.
1 Lexicographic Search:Tries All of the searching methods we have seen so far compare entire keys during the search Idea: Why not consider a key to be.
TU/e Algorithms (2IL15) – Lecture 4 1 DYNAMIC PROGRAMMING II
Information Storage & Retrieval Department of Information Management School of Information Engineering Nanjing University of Finance & Economics
1 i206: Lecture 17: Exam 2 Prep ; Intro to Regular Expressions Marti Hearst Spring 2012.
1 R-Trees Guttman. 2 Introduction Range queries in multiple dimensions: Computer Aided Design (CAD) Geo-data applications Support special data objects.
COMP261 Lecture 23 B Trees.
Azita Keshmiri CS 157B Ch 12 indexing and hashing
COMP108 Algorithmic Foundations Polynomial & Exponential Algorithms
CS 430: Information Discovery
Query Languages.
INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID
CSE 589 Applied Algorithms Spring 1999
Advanced Implementation of Tables
Data Structures in Ethereum
Minwise Hashing and Efficient Search
Presentation transcript:

Dictionaries and Tolerant Retrieval Debapriyo Majumdar Information Retrieval – Spring 2015 Indian Statistical Institute Kolkata

Pre-processing of a document 2 document text, word, XML, … simple text sequence of characters ASCII, UTF-8 sequence of tokens ASCII, UTF-8 decoding tokenizing sequence of processed tokens ASCII, UTF-8 linguistic processing The dictionary

Pre-processing of a document  Removal of stopwords: of, the, and, … – Modern search does not completely remove stopwords – Such words add meaning to sentences as well as queries  Stemming: words  stem (root) of words – Statistics, statistically, statistical  statistic (same root) – Loss of slight information (the form of the word also matters) – But unifies differently expressed queries on the same topic  Lemmatization: words  morphological root – saw  see, not saw  s  Normalization: unify equivalent words as much as possible – U.S.A, USA – Windows, windows  We will cover details of these later in this course  Left for you to read the book 3

The dictionary  User sends the query  The engine – Determine the query terms – Determine if each query term is present in the dictionary – Dictionary: Lookup – Search trees or hashing 4 Query Dictionary Posting lists ……… UserSearch engine

Binary search trees Binary search tree  Each node has two children  O(log M) comparisons if the tree is balanced Problem  Balancing the tree when terms are inserted and deleted 5 Root 0-9, a-k l-z aaai zzzz …….. M = number of terms log M

B-tree  Number of children for each node is between a and b for some predetermined a and b  O(log a M) comparisons  Very few rebalancing required B+ tree  Similar to B-tree  All data (pointers to posting lists) are in leaf nodes  Linear scan of data easier 6 Root 0-7 x-z aaai zzzz …….. M = number of terms log a M ……..

WILDCARD QUERIES 7

Wildcard queries  Wildcard: is a character that may be substituted for any of a defined subset of all possible characters  Wildcard queries: queries with wildcards – Sydney/Sidney: s*dney – Sankhadeep/Shankhadeep/Sankhadip: s*ankhad*p – Judicial/Judiciary: judicia*  Trailing wildcard queries – Simplest: search trees work well – Determine the node(s) which correspond to the range of terms specified by the query – Retrieve the posting lists for the set W of all the terms in the entire sub-tree under those nodes 8 Trailing wildcard query

Queries with a single *  Leading wildcard queries: *ata – Matching kolkata, bata, …  Use a reverse B-tree – B-tree obtained by considering terms backwards – Consider the leading wildcard query backwards, it becomes a trailing wildcard query – Lookup as before for a trailing wildcard query on a normal B-tree Queries with a single *  Queries of the form: s*dney – Matching sydney, sidney, …  Use a B-tree and a reverse B-tree – Use the B-tree to get the set W of all terms matching s* – Use the reverse B-tree to get the set R of all terms matching *dney – Intersect W and R 9 Root 0-7 x-z iaaa zzzz …….. M = number of terms log a M …….. ataklok

General wildcard queries The permuterm index  Special character $ as end of term  The term sydney  sydney$  Enumerate all rotations of sydney$ and have all of them in the B-tree, finally pointing to sydney Wildcard queries  A single *: sy*ey – Query ey$sy* to the B-tree – One rotation of sydney$ will be a match  General: s*dn*y – Query y$s* to the B-tree – Works equivalent to s*y, not all of the matches would have “dn” in the middle – Filter the others by exhaustive checks Problems  Blows up the dictionary  Empirically seen to be 10 times for English 10 sydney sydney$ ydney$s dney$sy ney$syd ey$sydn y$sydne $sydney

k-gram index for wildcard queries  k-gram: sequence of k characters  k-gram index:  words in which the k- gram appears, sorted lexicographically – Consider all words with the beginning and ending marker $ 11 etr beetrootmetric……retrievalsymmetry on$ aviation…… sonxeon $bo book…… boxboy k is predetermined

Wildcard queries with k-gram index  User query: re*ve – Send the Boolean query $re AND ve$ to the k-gram index – Will return terms such as revive, remove, … – Proceed with those terms and retrieve from inverted index  User query: red* – Send the query $re AND red to the 3-gram index – Returns all results starting with “re” and containing “red” – Post-processing to keep only the ones matching red*  Exercise: more general wildcard query s*dn*y – Can we do this using the k-gram index (assume 3-gram)? 12

Discussion on wildcard queries  Semantics – What does re*d AND fe*ri mean? – (Any term matching re*d) AND (Any term matching fe*ri) – Once the terms are identified, the operation on posting lists – ( … Union … ) Intersection ( … Union … ) – Expensive operations, particularly if there are many matching terms  Expensive even without Boolean combinations  Hidden functionality in search engines – Otherwise users would “play around” even when not necessary – For example, a query “s*” produces huge number of terms for which the union of posting lists need to be computed 13

Why search trees are better than hashing?  Possible hash collision  Prefix queries cannot be performed – red and re may be hashed to entirely different range of values  Almost similar terms do not hash to almost similar integers  A hash function designed now may not be suitable if the data grows to a much larger size 14

SPELLING CORRECTIONS Did you mean? 15

Misspelled queries  People type a lot of misspelled queries – britian spears, britney’s spears, brandy spears, prittany spears  britney spears  What to do? 1.Among the possible corrections, choose the “nearest” one 2.Among the possible “near” corrections, choose the most frequent one (probability of that being the user’s intention is the highest) 3.Context sensitive correction 4.The query may not be actually incorrect. Retrieve results for the original as well as possible correction of the query debapriyo majumder  returns results for debapriyo majumdar and majumder both  Approaches for spelling correction – Edit distance – k-gram overlap 16

Edit distance  Edit distance E(A,B) = minimum number of operations required to obtain B from A – Operations allowed: insertion, deletion, substitution  Example: E(food, money) = 4 – food  mood  mond  moned  money  Computing edit distance in O(|A|. |B|) time  Spelling correction – Given a (possibly misspelled) query term, need to find other terms (in the dictionary) with very small edit distance – Precomputing edit distance for all pairs of terms  absurd – Use several heuristics to limit possible pairs Only consider pairs of terms starting with same letter 17

Computing edit distance Observation:  E(food, money) = 4 – One sequence: food  mood  mond  moned  money  E(food, moned) =  Why? – If E(food, moned) < 3, then E(food, money) < 4 Prefix property: If we remove the last step of an optimal edit sequence then the remaining steps represent an optimal edit sequence for the remaining substrings 18 3

Computing edit distance  Fix the strings A and B. Let |A| = m, |B| = n.  Define: E(i, j) = E(A[1, …, i], B[1, …, j]) – That is, edit distance between the length i prefix of A and length j prefix of B  Note: E(m, n) = E(A,B)  Recursive formulation (a) E(i, 0) = i (b) E(0, j) = j  The last step: 4 possibilities – Insertion: E(i, j) = E(i, j – 1) + 1 – Deletion: E(i, j) = E(i – 1, j) + 1 – Substitution: E(i, j) = E(i – 1, j – 1) + 1 – No action: E(i, j) = E(i – 1, j – 1) 19

Computing edit distance: dynamic programming The recursion 20 P is an indicator variable P = 1 if A[i] ≠ B[j], 0 otherwise FOOD 0 1M 2O 3N 4E 5Y

Computing edit distance: dynamic programming The recursion 21 P is an indicator variable P = 1 if A[i] ≠ B[j], 0 otherwise FOOD M1 2O2 3N3 4E4 5Y5 Backtrace: Compute E(i, j) and also keep track of where E(i, j) came from

Computing edit distance: dynamic programming The recursion 22 P is an indicator variable P = 1 if A[i] ≠ B[j], 0 otherwise FOOD M11 2O2 3N3 4E4 5Y5 Backtrace: Compute E(i, j) and also keep track of where E(i, j) came from

Computing edit distance: dynamic programming The recursion 23 P is an indicator variable P = 1 if A[i] ≠ B[j], 0 otherwise FOOD M11 2O22 3N3 4E4 5Y5 Backtrace: Compute E(i, j) and also keep track of where E(i, j) came from

Computing edit distance: dynamic programming The recursion 24 P is an indicator variable P = 1 if A[i] ≠ B[j], 0 otherwise FOOD M O22 3N33 4E44 5Y55 Backtrace: Compute E(i, j) and also keep track of where E(i, j) came from

Computing edit distance: dynamic programming The recursion 25 P is an indicator variable P = 1 if A[i] ≠ B[j], 0 otherwise FOOD M O221 3N33 4E44 5Y55 Backtrace: Compute E(i, j) and also keep track of where E(i, j) came from

Computing edit distance: dynamic programming The recursion 26 P is an indicator variable P = 1 if A[i] ≠ B[j], 0 otherwise FOOD M O N332 4E443 5Y554 Backtrace: Compute E(i, j) and also keep track of where E(i, j) came from

Computing edit distance: dynamic programming The recursion 27 P is an indicator variable P = 1 if A[i] ≠ B[j], 0 otherwise FOOD M O N E Y55444 Backtrace: Compute E(i, j) and also keep track of where E(i, j) came from

Computing edit distance: dynamic programming The recursion 28 P is an indicator variable P = 1 if A[i] ≠ B[j], 0 otherwise FOOD M O N E Y55444 Backtrace: Compute E(i, j) and also keep track of where E(i, j) came from Backtrace to find an optimal edit path

Spelling correction using k-gram index  The k-grams are small portions of words  Misspelled word would still have some k-grams intact  Misspelled query: bord 29 bo or rd aboardboardroom……borderboring borderlord……morbidnorth aboardboardroom……borderhard  Intersect the list of words for k-grams  Problem: long words which contain the k-grams but are not good corrections

Phonetic correction  Some users misspell because they don’t know the spelling  Types as it “sounds”  Approach for correction: use a phonetic hash function – Hash similarly sounding terms to the same hash value  Soundex algorithm – Several variants 30

Soundex algorithm 1.Retain the first letter of the term 2.Change all A, E, I, O, U, H, W, Y  0 B, F, P, V  1. C, G, J, K, Q, S, X, Z  2. D,T to 3. L to 4. M, N to 5. R to 6. 3.Repeat: remove one of each pair of same digit 4.Remove all 0s. Pad the result with trailing 0s. Return the first 4 positions: one letter, 3 digits Example: Hermann  H  H06505  H655 31

References and acknowledgements  Primarily: IR Book by Manning, Raghavan and Schuetze:  The part on edit distance: lectures notes by John Reif, Duke University: s230/Lectures/L-04.pdfhttps:// s230/Lectures/L-04.pdf 32