Advanced Indexing Issues

Additional Indexing Issues
- Indexing for queries about the Web structure
- Indexing for queries containing wildcards
- Preprocessing of text
- Spelling correction

Indexing the Web Structure

Connectivity Server
Support for fast queries on the web graph:
- Which URLs point to a given URL?
- Which URLs does a given URL point to?
Applications:
- Crawl control
- Web graph analysis (connectivity, crawl optimization)
- Link analysis

WebGraph
"The WebGraph Framework I: Compression Techniques", Boldi and Vigna, WWW 2004
Goal: maintain the node adjacency lists in memory
For this, compressing the adjacency lists is the critical component

Adjacency lists: Naïve Solution
- The adjacency list of a node is the set of its neighbors
- Assume each URL is represented by an integer: for a 118-million-page web, we need 27 bits per node
- Naïvely, this demands 54 bits to represent each hyperlink (source + target)
- When links are stored in an adjacency list, we cut the size in half. Why? The source is stored once per list rather than once per link, leaving only the 27 bits of the target
- The method we will see achieves: 118M nodes, 1G links, with an average of 3 bits per link!
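To make the arithmetic concrete, here is a quick sanity check in Python (a sketch; the 118M-page figure comes from the slide above):

```python
import math

# Bits needed to name one node among 118 million pages.
nodes = 118_000_000
bits_per_node = math.ceil(math.log2(nodes))
print(bits_per_node)        # 27

# A link is a (source, target) pair: 54 bits if stored naively.
# An adjacency list stores the source once per list, not once per
# link, so each link then costs only the 27 bits of its target.
print(2 * bits_per_node)    # 54
```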

Observed Properties of Links
- Locality: most links lead the user to a page on the same host, so if URLs are sorted lexicographically, the indices of a link's source and target tend to be close
- Similarity: URLs that are lexicographically close have many common successors
- Consecutivity: often, many successors of a page have consecutive URLs

Naïve Representation
Why do we need the out-degree?

Gap Compression
Successor list of x: S(x) = {s1, s2, ..., sk} is stored as the gaps {s1 - x, s2 - s1 - 1, ..., sk - s(k-1) - 1}
For the first entry, we actually store:
- 2(s1 - x) if s1 - x >= 0
- 2|s1 - x| - 1 if s1 - x < 0
Why? Only the first gap can be negative (a successor may precede x), so this mapping folds the negative values into the non-negative integers that variable-length codes expect.
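A minimal Python sketch of this gap encoding (function names are illustrative; the sample successor list follows the style of the examples in the WebGraph paper):

```python
def encode_gaps(x, successors):
    """Gap-encode a sorted successor list S(x), as described above."""
    s = sorted(successors)
    first = s[0] - x
    # Only the first gap can be negative, so map it to a
    # non-negative integer: 2g for g >= 0, 2|g| - 1 for g < 0.
    gaps = [2 * first if first >= 0 else 2 * abs(first) - 1]
    for prev, cur in zip(s, s[1:]):
        gaps.append(cur - prev - 1)   # later gaps are always >= 0
    return gaps

def decode_gaps(x, gaps):
    g0 = gaps[0]
    first = g0 // 2 if g0 % 2 == 0 else -(g0 + 1) // 2
    out = [x + first]
    for g in gaps[1:]:
        out.append(out[-1] + g + 1)
    return out

# Round-trip check on a toy node.
s = [13, 15, 16, 17, 18, 19, 23, 24, 203]
assert decode_gaps(15, encode_gaps(15, s)) == s
```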

Reference Compression
- Instead of representing S(x) directly, we can code it as a modified version of S(y), for some y < x
- x - y is the reference number
- The copy list is a bit sequence of the same length as S(y), indicating which of the values in S(y) are also in S(x)
- We also store a list of extra nodes, S(x) - S(y)
- Usually we only consider references within a sliding window
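A toy sketch of reference compression in Python (illustrative names; a real implementation would store the copy list and the extras in compressed form):

```python
def reference_encode(s_x, s_y):
    """Code S(x) relative to a reference list S(y): a copy list
    (one bit per element of S(y)) plus the extra nodes."""
    x_set, y_set = set(s_x), set(s_y)
    copy_list = [1 if v in x_set else 0 for v in s_y]
    extra = [v for v in s_x if v not in y_set]
    return copy_list, extra

def reference_decode(copy_list, extra, s_y):
    copied = [v for bit, v in zip(copy_list, s_y) if bit]
    return sorted(copied + extra)

s_y = [13, 15, 16, 17, 18, 19, 23, 24, 203]   # reference list S(y)
s_x = [13, 15, 16, 17, 19, 22, 23, 24, 315]   # list to encode, S(x)
cl, ex = reference_encode(s_x, s_y)
print(cl, ex)   # [1, 1, 1, 1, 0, 1, 1, 1, 0] [22, 315]
assert reference_decode(cl, ex, s_y) == s_x
```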

Reference Compression (cont.)
A reference number of 0 indicates that no referencing is used, i.e., there is no copy list

Differential Compression
- Do not store a copy list that is the full length of S(y)
- Instead, view the copy list as an alternating sequence of 1-blocks and 0-blocks
- Store the number of blocks
- Store the length of the first block
- Store the lengths - 1 of all other blocks besides the last block (why? every block after the first is non-empty, so its length is at least 1; the last block's length is implied by the length of S(y))
- Always assume that the first block is a block of 1s (it may have length 0)
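A sketch of turning a copy list into block lengths (illustrative; the final encoding would also apply the length-1 adjustment and drop the last block, as described above):

```python
def to_copy_blocks(copy_list):
    """Run-length encode a copy list into blocks; the first block is
    assumed to be of 1-bits (possibly of length 0)."""
    blocks, cur_bit, run = [], 1, 0
    for bit in copy_list:
        if bit == cur_bit:
            run += 1
        else:
            blocks.append(run)
            cur_bit, run = 1 - cur_bit, 1
    # The last block runs to the end of S(y), so its length need not
    # be stored; we keep it here for clarity.
    blocks.append(run)
    return blocks

# Copy list 1 1 1 1 0 1 1 1 0 becomes blocks [4, 1, 3, 1]:
print(to_copy_blocks([1, 1, 1, 1, 0, 1, 1, 1, 0]))
```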

Using copy blocks

Compressing the Extra Nodes List
- Use consecutivity to compress the list of extra nodes
- Find subsequences of consecutive numbers of length >= a given threshold Lmin
- Store a list of integer intervals: for each interval, store the left extreme and the length
- Left extremes are compressed using the differences between successive extremes, minus 2
- Interval lengths are decremented by Lmin
- Store a list of residuals, compressed using gaps (with a variable-length encoding)
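A sketch of the interval-extraction step in Python (illustrative names; the subsequent difference and gap coding is omitted):

```python
def split_intervals(extras, lmin=2):
    """Split a sorted extra-nodes list into maximal runs of consecutive
    integers of length >= lmin (stored as intervals) plus residuals."""
    intervals, residuals = [], []
    i = 0
    while i < len(extras):
        j = i
        while j + 1 < len(extras) and extras[j + 1] == extras[j] + 1:
            j += 1
        run = extras[i:j + 1]
        if len(run) >= lmin:
            intervals.append((run[0], len(run)))  # left extreme, length
        else:
            residuals.extend(run)
        i = j + 1
    return intervals, residuals

# 15,16,17 and 22,23 become intervals; 13 and 203 remain residuals.
print(split_intervals([13, 15, 16, 17, 22, 23, 203], lmin=2))
# ([(15, 3), (22, 2)], [13, 203])
```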

Compressing intervals
- Lmin = 2 in this example
- Important property: these series of numbers are self-decodable
- Note: the node number is not actually stored

Indexing for Wildcard Queries

Finding Lexicon Entries
Given a term t, we can find its entry in the lexicon using binary search
What happens if we are given a term with a wildcard?
How can we find the following terms in the lexicon?
- lab*
- *or
- lab*r

Indexing Using n-grams
- Decompose all terms into n-grams, for some small value of n
- n-grams are sequences of n letters in a word; use $ to mark the beginning and end of a word
- A digram is a 2-gram. Example: the digrams of labor are $l, la, ab, bo, or, r$
- We store an additional structure that, for each digram, has a list of the terms containing that digram (actually, a list of pointers to entries in the lexicon)
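A small Python sketch of building a digram index over the example lexicon from the next slide and answering lab*r with it (illustrative code; the final filtering step is the post-processing discussed below):

```python
import re

def digrams(term):
    """All 2-grams of a term, with $ marking its beginning and end."""
    padded = "$" + term + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def build_digram_index(lexicon):
    """Map each digram to the set of term numbers containing it."""
    index = {}
    for num, term in enumerate(lexicon, start=1):
        for g in digrams(term):
            index.setdefault(g, set()).add(num)
    return index

lexicon = ["abhor", "bear", "laaber", "labor", "laborator",
           "labour", "lavacaber", "slab"]
index = build_digram_index(lexicon)

# lab*r: intersect the posting lists of its digrams ...
candidates = set.intersection(
    *(index.get(g, set()) for g in ["$l", "la", "ab", "r$"]))
# ... then post-process: a candidate like "laaber" contains all the
# digrams but does not actually match the pattern.
matches = [lexicon[n - 1] for n in sorted(candidates)
           if re.fullmatch(r"lab.*r", lexicon[n - 1])]
print(matches)   # ['labor', 'laborator']
```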

Example (both tables sorted alphabetically)

Term Number  Term
1            abhor
2            bear
3            laaber
4            labor
5            laborator
6            labour
7            lavacaber
8            slab

Digram  Term numbers
$a      1
$b      2
$l      3,4,5,6,7
$s      8
aa      3
ab      1,3,4,5,6,7,8
bo      4,5,6
la      3,4,5,6,7,8
or      1,4,5
ou      6
ra      5
ry      ?
r$      ?
sl      ?

Can you fill in the missing entries?

To find lab*r, you would look for the terms common to the lists for $l, la, ab, and r$, and then post-process to ensure that each candidate term actually matches the pattern.

Indexing Using Rotated Lexicons
- If wildcard queries are common, we can save time at the cost of more space
- A rotated index can find the matches of any query with a single wildcard in one binary search
- An index entry is stored for each letter of each term
- labor would have 6 pointers: one for each letter, plus one for the beginning of the word
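A sketch of a rotated lexicon in Python; for clarity it stores the rotated strings themselves, whereas, as the example below notes, a real implementation stores only (term number, offset) pairs:

```python
import bisect

def build_rotated_lexicon(terms):
    """All rotations of $term, kept sorted for binary search."""
    rotations = []
    for term in terms:
        padded = "$" + term
        for i in range(len(padded)):
            rotations.append((padded[i:] + padded[:i], term))
    rotations.sort()
    return rotations

def wildcard_lookup(rotated, pattern):
    """Single-wildcard lookup: rotate the pattern so that the * sits
    at the end, then binary-search for the resulting prefix."""
    head, tail = pattern.split("*")      # e.g. lab*r
    prefix = tail + "$" + head           # rotated form: r$lab
    lo = bisect.bisect_left(rotated, (prefix,))
    out = set()
    while lo < len(rotated) and rotated[lo][0].startswith(prefix):
        out.add(rotated[lo][1])
        lo += 1
    return sorted(out)

rotated = build_rotated_lexicon(["abhor", "bear", "labor", "labour"])
print(wildcard_lookup(rotated, "lab*r"))   # ['labor', 'labour']
print(wildcard_lookup(rotated, "*or"))     # ['abhor', 'labor']
```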

Partial Example

Term Number  Term
1            abhor
2            bear
4            labor
6            labour

Rotated Form  Address
$abhor        (1,0)
$bear         (2,0)
$labor        (4,0)
$labour       (6,0)
abhor$        (1,1)
abor$l        (4,2)
abour$l       (6,2)
r$abho        (1,5)
r$bea         (2,4)
r$labo        (4,5)
r$labou       (6,6)

Note: We do not actually store the rotated string in the rotated lexicon. The pair (term number, offset) is enough for binary search.

How would you find the terms for:
- lab*
- *or
- *ab*
- l*r
- l*b*r

How Much Space?
How much space does a digram index require?
How much space does a rotated lexicon require?

Summary
- We now know how to find the terms that match a wildcard query
- Basic steps of query evaluation for a wildcard query:
  1. Look up the wildcarded words in an auxiliary index, to find all possible matching terms
  2. Given these terms, proceed with normal query processing (as if this were not a wildcard query)
- But we have not yet explained how normal query processing should proceed! (later...)

Preprocessing the Data

Choosing What Data To Store
We would like the user to get as many relevant answers to their query as possible
Examples:
- Query: computer science. Should it match Computer Science?
- Query: data compression. Should it match compressing data?
- Query: Amir Perez. Should it match Amir Peretz?
The way we store the data in our lexicon affects our answers to these queries

Case Folding
- It is normally accepted practice to perform case folding, i.e., to reduce all words to lower-case form before storing them in the lexicon
- The user's query is transformed to lower case before looking up the terms in the index
- What effect does this have on the lexicon size?

Stemming
- Suppose a user is interested in finding pages about "running shoes"
- We may also want to return pages with shoe
- We may also want to return pages with run or runs
- Solution: use a stemmer. A stemmer returns the stem (root) of a word
- Note: this means that more relevant answers will be returned, but also more irrelevant ones!
- Example: cleary AND witten => clear AND wit

Porter Stemmer
A multi-step, longest-match stemmer. The paper introducing this stemmer can be found online.
Notation:
- v: vowel(s), i.e., A, E, I, O, U
- c: consonant(s)
- (vc)^m: vowel(s) followed by consonant(s), repeated m times
Any word can be written [c](vc)^m[v], where the bracketed parts are optional; m is called the measure of the word.
We discuss only the first few rules of the stemmer.

Porter Stemmer: Step 1a
Follow the first applicable rule:

Suffix  Replacement  Examples
sses    ss           caresses => caress
ies     i            ponies => poni, ties => ti
ss      ss           caress => caress
s       null         cats => cat

Porter Stemmer: Step 1b
Follow the first applicable rule:

Condition  Suffix  Replacement  Examples
(m > 0)    eed     ee           feed -> feed, agreed -> agree
(*v*)      ed      null         plastered -> plaster, bled -> bled
(*v*)      ing     null         motoring -> motor, sing -> sing

*v* means the stem contains a vowel
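A minimal Python sketch of Step 1a (first applicable rule, suffixes tried in longest-match order):

```python
def porter_step_1a(word):
    """Step 1a of the Porter stemmer: apply the first rule whose
    suffix matches."""
    for suffix, replacement in [("sses", "ss"), ("ies", "i"),
                                ("ss", "ss"), ("s", "")]:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + replacement
    return word

for w in ["caresses", "ponies", "ties", "caress", "cats"]:
    print(w, "=>", porter_step_1a(w))
# caresses => caress, ponies => poni, ties => ti,
# caress => caress, cats => cat
```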

Stop Words
- Stop words are very common words that generally are not of importance, e.g.: the, a, to
- Such words take up a lot of room in the index (why?)
- They slow down query processing (why?)
- They generally do not improve the results (why?)
- Some search engines do not store these words at all, and remove them from queries
- Is this always a good idea?

Spelling correction

Spell correction
Two principal uses:
- Correcting document(s) being indexed
- Retrieving matching documents when the query contains a spelling error
Types of spelling correction:
- Isolated-word spelling correction
- Context-sensitive spelling correction
Only the latter will catch: "I flew form Heathrow to Narita."

Document correction
- Primarily for OCR'ed documents, with correction algorithms tuned for this
- Goal: the index (dictionary) contains fewer OCR-induced misspellings
- Can use domain-specific knowledge. E.g., OCR confuses O and D more often than it confuses O and I (O and I are adjacent on the QWERTY keyboard, so they are more likely to be interchanged in typing)

Query mis-spellings
Our principal focus here, e.g., the query carot
We can either:
- retrieve documents indexed by the correct spelling, OR
- return several suggested alternative queries with the correct spelling ("Did you mean ...?")

Isolated word correction
Fundamental premise: there is a lexicon from which the correct spellings come
Two basic choices for this lexicon:
- A standard lexicon (e.g., Webster's English Dictionary, or an industry-specific lexicon)
- The lexicon of the indexed corpus (e.g., all words on the web, including the mis-spellings)

Isolated word correction
Problem: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q
What does "closest" mean? We'll study several alternatives:
- Edit distance
- Weighted edit distance
- n-gram overlap

Edit distance
- Given two strings S1 and S2, the edit distance is the minimum number of basic operations needed to convert one into the other
- Basic operations are typically character-level: insert, delete, replace
- The edit distance from cat to dog is 3. What is the edit distance from cat to dot?
- What is the maximal edit distance between s and t?
- Generally computed by dynamic programming

Computing Edit Distance: Intuition
- Suppose we want to compute the edit distance of s and t
- Create a matrix d with columns 0,...,|s| and rows 0,...,|t|
- The entry d[i,j] is the edit distance between:
  - the prefix of s of length j
  - the prefix of t of length i
- Where will the edit distance of s and t be placed?

Computing Edit Distance (1)
To compute the edit distance of s and t:
- n := length(s); m := length(t)
- If n = 0, return m and exit. If m = 0, return n and exit.
- Construct a matrix with rows 0..m and columns 0..n.
- Initialize the first row to 0..n and the first column to 0..m.

Computing Edit Distance (2)
For each character of s (i from 1 to n):
  For each character of t (j from 1 to m):
    If s[i] equals t[j], c := 0; otherwise c := 1.
    d[i,j] := min(d[i-1,j] + 1, d[i,j-1] + 1, d[i-1,j-1] + c)
Return d[n,m].
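The same algorithm as runnable Python (a direct transcription of the pseudocode above):

```python
def edit_distance(s, t):
    """Dynamic-programming edit distance, following the slides above."""
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n
    # d[i][j] = edit distance between s[:i] and t[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # delete
                          d[i][j - 1] + 1,      # insert
                          d[i - 1][j - 1] + c)  # replace / match
    return d[n][m]

print(edit_distance("GUMBO", "GAMBOL"))   # 2
print(edit_distance("cat", "dog"))        # 3
```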

Example: s = GUMBO, t = GAMBOL
The matrix after initialization and fill; the edit distance is the bottom-right entry, 2:

      G  U  M  B  O
   0  1  2  3  4  5
G  1  0  1  2  3  4
A  2  1  1  2  3  4
M  3  2  2  1  2  3
B  4  3  3  2  1  2
O  5  4  4  3  2  1
L  6  5  5  4  3  2

Weighted edit distance
- As above, but the weight of an operation depends on the character(s) involved
- Meant to capture keyboard errors, e.g., m is more likely to be mistyped as n than as q; therefore, replacing m by n contributes a smaller edit distance than replacing m by q
- (The same ideas are usable for OCR, but with different weights)
- Requires a weight matrix as input
- Modify the dynamic programming to handle weights

Edit distance to all dictionary terms?
- Given a (mis-spelled) query, do we compute its edit distance to every dictionary term? Expensive and slow
- How do we cut down the set of candidate dictionary terms? Here we use n-gram overlap

n-gram overlap
- Enumerate all the n-grams in the query string as well as in the lexicon
- Use the n-gram index (recall wildcard search) to retrieve all lexicon terms matching any of the query n-grams
- Threshold by the number of matching n-grams

Example with trigrams
- Suppose the text is november; its trigrams are nov, ove, vem, emb, mbe, ber
- The query is december; its trigrams are dec, ece, cem, emb, mbe, ber
- So 3 trigrams overlap (of 6 in each term)
- How can we turn this into a normalized measure of overlap?

One option – Jaccard coefficient
- A commonly used measure of overlap
- Let X and Y be two sets; then the Jaccard coefficient is |X ∩ Y| / |X ∪ Y|
- Equals 1 when X and Y have the same elements, and 0 when they are disjoint
- X and Y don't have to be of the same size
- Always assigns a number between 0 and 1
- Now threshold to decide if you have a match, e.g., if J.C. > 0.8, declare a match
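A short Python sketch applying the Jaccard coefficient to the november/december trigram example above:

```python
def trigrams(word):
    return {word[i:i + 3] for i in range(len(word) - 2)}

def jaccard(x, y):
    """J(X, Y) = |X intersect Y| / |X union Y|"""
    return len(x & y) / len(x | y)

t1, t2 = trigrams("november"), trigrams("december")
print(sorted(t1 & t2))              # ['ber', 'emb', 'mbe']
print(round(jaccard(t1, t2), 3))    # 3 / 9 = 0.333
```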

Matching trigrams
Consider the query lord: we wish to identify words matching at least 2 of its 3 bigrams (lo, or, rd)

lo -> alone, lord, sloth
or -> border, lord, morbid
rd -> ardent, border, card, lord

A standard postings "merge" will enumerate the candidates; adapt this to use the Jaccard (or another) measure.
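A sketch of the postings-merge idea in Python, counting how many of the query's grams each candidate term matches (illustrative; it uses the simple threshold of 2 from the slide rather than a Jaccard threshold):

```python
from heapq import merge
from itertools import groupby

# Postings for the query bigrams of "lord", as in the slide.
postings = {
    "lo": ["alone", "lord", "sloth"],
    "or": ["border", "lord", "morbid"],
    "rd": ["ardent", "border", "card", "lord"],
}

# A k-way merge of the (sorted) postings lists; a term that appears
# in >= 2 lists matches at least 2 of the query's bigrams.
merged = merge(*postings.values())
matches = [term for term, grp in groupby(merged) if len(list(grp)) >= 2]
print(matches)   # ['border', 'lord']
```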

Computational cost
- Spell correction is computationally expensive
- Avoid running it routinely on every query?
- Run it only on queries that matched few documents