Advanced Indexing Issues

Advanced Indexing Issues

Additional Indexing Issues
Indexing for queries about the Web structure Indexing for queries containing wildcards Preprocessing of text Spelling correction

Indexing the Web Structure

Connectivity Server Support for fast queries on the web graph
Which URLs point to a given URL? Which URLs does a given URL point to? Applications Crawl control Web graph analysis (Connectivity, crawl optimization) Link analysis

WebGraph The WebGraph Framework I: Compression Techniques. Boldi and Vigna, WWW2004 Goal: maintain node adjacency lists in memory For this, compressing the adjacency lists is the critical component

Adjacency lists: Naïve Solution
The set of neighbors of a node Assume each URL represented by an integer for a 118 million page web, need 27 bits per node Naively, this demands 54 bits to represent each hyperlink When stored in an adjacency list, we cut size in half. Why? Method we will see achieved: 118M nodes, 1G links with an average of 3 bits per link!!!

Observed Properties of Links
Locality: Most links lead the user to a page on the same host = if URLs are stored lexicographically the index and source and target are close Similarity: URLs that are lexicographically close have many common successors Consecutivity: Often many successors of a page have consecutive URLs

Naïve Representation Why do we need out-degree?

Gap Compression Successor list of x: S(x) = {s1-x, s2-s1-1, ..., sk-sk-1-1} For the first entry, we actually store: 2(s1-x) if s1-x>=0 2|s1-x| -1 if s1-x<0 Why?

Reference Compression
Instead of representing S(x), we can code it as a modified version of S(y), for some y<x x-y is the reference number The copy list is a bit sequence the length of S(y) which indicates which of the values in S(y) are also in S(x) Also store a list of extra nodes S(x)-S(y) Usually we only consider references that are within a sliding window

Reference Compression
Ref of 0 indicates that no referencing is used, i.e., no copy list

Differential Compression
Do not store a copy list that is the length of S(y) Look at the copy list as a an alternating sequence of 1 and 0 blocks Store the number of blocks Store the length of the first block Store the lengths - 1 of all other blocks besides the last block (why?) Always assume that the first block is of 1 (may be a block of length 0)

Using copy blocks

Compressing Extra Nodes List
Use consecutivity to compress list of extra nodes We find subsequences containing consecutive numbers of length >= given threshold Lmin Store a list of integer intervals: for each interval we store left extreme and length Left extremes are compressed using differences between extremes – 2 Interval lengths are decremented by Lmin Store a list of residuals compressed using gaps (and using a variable length encoding)

Compressing intervals
Lmin = 2 in this example Important property: These series of numbers are self-decodable Note: Node number is not actually stored

Indexing for Wildcard Queries

Finding Lexicon Entries
Suppose that we are given a term t, we can find its entry in the lexicon using binary search What happens if we are given a term with a wild card? How can we find the following terms in the lexicon? lab* *or lab*r

Indexing Using n-grams
Decompose all terms into n-grams for some small value of n n-grams are sequences of n letters in a word use $ to mark beginning and end of word digram = 2-gram Example: Digrams of labor are $l, la, ab, bo, or, r$ We store an additional structure, that, for each digram, points has a list of terms containing that digram actually, store a list of pointers to entries in the lexicon

alphabetically sorted
Digram Term numbers $a 1 $b 2 $l 3,4,5,6,7 $s 8 aa 3 ab 1,3,4,5,6,7,8 bo 4,5,6 la 3,4,5,6,7,8 or 1,4,5 ou 6 ra 5 ry r$ sl Example Term Number Term 1 abhor 2 bear 3 laaber 4 labor 5 laborator 6 labour 7 lavacaber 8 slab alphabetically sorted Can you fill this in?

Term Number Term 1 abhor 2 bear 3 laaber 4 labor 5 laborator 6 labour 7 lavacaber 8 slab Digram Term numbers $a 1 $b 2 $l 3,4,5,6,7 $s 8 aa 3 ab 1,3,4,5,6,7,8 bo 4,5,6 la 3,4,5,6,7,8 or 1,4,5 ou 6 ra 5 ry r$ sl To find lab*r, you would look for terms in common with $l, la, ab, r$ and then post-process to ensure that the term does match

Indexing Using Rotated Lexicons
In wildcard queries are common, we can save on time at the cost of more space Rotated indexed can find the matches of any wildcard query with a single wildcard in one binary search An index entry is stored for each letter of each term labor, would have 6 pointers: one for each letter + one for the beginning of the word

Partial Example Term Number Term 1 abhor 2 bear 4 labor 6 labour Rotated Form Address $abhor (1,0) $bear (2,0) $labor (4,0) $labour (5,0) abhor$ (1,1) abor$l (4,2) abour$l (6,2) r$abho (1,5) r$bea (2,4) r$labo (4,5) r$labou (6,6) Note: We do not actually store the rotated string in the rotated lexicon. The pair of numbers is enough for binary search

How would you find the terms for:
Partial Example Term Number Term 1 abhor 2 bear 4 labor 6 labour Rotated Form Address $abhor (1,0) $bear (2,0) $labor (4,0) $labour (5,0) abhor$ (1,1) abor$l (4,2) abour$l (6,2) r$abho (1,5) r$bea (2,4) r$labo (4,5) r$labou (6,6) How would you find the terms for: lab* *or *ab* l*r l*b*r

How Much Space? How much space does a digram index require?
How much space does a rotated lexicon require?

Summary We now know how to find terms that match a wildcard query
Basic steps for query evaluation for a wildcard query lookup the wildcard-ed words in an auxiliary index, to find all possible matching terms given these terms proceed with normal query processing (as if this was not a wildcard query) But, we have not yet explained how normal query processing should proceed! (later…)

Preprocessing the Data

Choosing What Data To Store
We would like the user to be able to get as many relevant answers to his query Examples: Query: computer science. Should it match Computer Science? Query: data compression. Should it match compressing data? Query: Amir Perez. Should it match Amir Peretz? The way we store the data in our lexicon will affect our answers to the queries

Case Folding Normally accepted to perform case folding, i.e., to reduce all words to lower-case form before storing in the lexicon Use’s query is transformed to lower case before looking for the terms in the index What affect does this have on the lexicon size?

Stemming Suppose that a user is interested in finding pages about “running shoes” We may also want to return pages with shoe We may also want to return pages with run or runs Solution: Use a stemmer Stemmer returns the stem (שורש) of the word Note: This means that more relevant answers will be returned, as well as more irrelevant answers! Example: cleary AND witten => clear AND wit

Porter Stemmer A multi-step, longest-match stemmer. Notation
Paper introducing this stemmer can be found online Notation v vowel(s) תנועות=AEIOU c consonant(s) עיצורים (vc)m vowel(s) followed by consonant(s), repeated m times Any word can be written: [c](vc)m[v] brackets are optional m is called the measure of the word We discuss only the first few rules of the stemmer

Follow first applicable rule
Porter Stemmer: Step 1a Follow first applicable rule Suffix Replacement Examples sses ss caresses => caress ies i ponies => poni ties => ti caress => caress s null cats => cat

Follow first applicable rule
Porter Stemmer: Step 1b Follow first applicable rule Conditions Suffix Replacement Examples (m > 0) eed ee feed -> feed agreed -> agree (*v*) ed null plastered -> plaster bled -> bled (*v*) ing null motoring -> motor sing -> sing *v* - the stem contains a vowel

Stop Words Stop words are very common words that generally are not of importance, e.g.: the, a, to Such words take up a lot of room in the index (why?) They slow down query processing (why?) They generally do not improve the results (why?) Some search engines do not store these words at all, and remove them from queries Is this always a good idea?

Spelling correction

Spell correction Two principal uses: Types of spelling correction:
Correcting document(s) being indexed Retrieve matching documents when query contains a spelling error Types of spelling correction: Isolated word spelling correction Context-sensitive spelling correction Only latter will catch: I flew form Heathrow to Narita.

Document correction Primarily for OCR’ed documents
Correction algorithms tuned for this Goal: the index (dictionary) contains fewer OCR-induced misspellings Can use domain-specific knowledge E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing).

Query mis-spellings Our principal focus here We can either
E.g., the query carot We can either Retrieve documents indexed by the correct spelling, OR Return several suggested alternative queries with the correct spelling Did you mean … ?

Isolated word correction
Fundamental premise – there is a lexicon from which the correct spellings come Two basic choices for this A standard lexicon such as (e.g., Webster’s English Dictionary, industry-specific lexicon) The lexicon of the indexed corpus (e.g., all words on the web, including the mis-spellings)

Isolated word correction
Problem: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q What’s “closest”? We’ll study several alternatives Edit distance Weighted edit distance n-gram overlap

Edit distance Edit Distance: Given two strings S1 and S2, the minimum number of basic operations to convert one to the other Basic operations are typically character-level Insert, Delete. Replace The edit distance from cat to dog is 3. What is the edit distance from cat to dot? What is the maximal edit distance between s and t? Generally found by dynamic programming.

Computing Edit Distance: Intuition
Suppose we want to compute edit distance of s and t Create a matrix d with 0,…,|s| columns and 0,…,|t| rows The entry d[i,j] is the edit distance between the words: sj (i.e. prefix of s of size j) ti (i.e. prefix of s of size i) Where will the edit distance of s and t be placed?

Computing Edit Distance (1)
To compute the edit distance of s and t: n := length(s) m := length(t) If n = 0, return m and exit. If m = 0, return n and exit. Construct a matrix containing 0..m rows and 0..n columns. Initialize the first row to 0..n. Initialize the first column to 0..m.

Computing Edit Distance (2)
for each character of s (i from 1 to n). for each character of t (j from 1 to m). If s[i] equals t[j], c := If s[i] doesn't equal t[j], c := 1. d[i,j] := min(d[i-1,j]+1, d[i,j-1]+1, d[i-1,j-1]+c). Return d[n,m].

Example: s=GUMBO, t=GAMBOL
Steps 1 and 2: G U M B O 1 2 3 4 5 A L 6 1 2 3 4 1 1 2 3 4 2 2 1 2 3 3 3 2 1 2 4 4 3 2 1 5 5 4 3 2

Weighted edit distance
As above, but the weight of an operation depends on the character(s) involved Meant to capture keyboard errors, e.g. m more likely to be mis-typed as n than as q Therefore, replacing m by n is a smaller edit distance than by q (Same ideas usable for OCR, but with different weights) Require weight matrix as input Modify dynamic programming to handle weights

Edit distance to all dictionary terms?
Given a (mis-spelled) query – do we compute its edit distance to every dictionary term? Expensive and slow How do we cut the set of candidate dictionary terms? Here we use n-gram overlap for this

n-gram overlap Enumerate all the n-grams in the query string as well as in the lexicon Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams Threshold by number of matching n-grams

Example with trigrams Suppose the text is november
Trigrams are nov, ove, vem, emb, mbe, ber. The query is december Trigrams are dec, ece, cem, emb, mbe, ber. So 3 trigrams overlap (of 6 in each term) How can we turn this into a normalized measure of overlap?

One option – Jaccard coefficient
A commonly-used measure of overlap Let X and Y be two sets; then the J.C. is Equals 1 when X and Y have the same elements and zero when they are disjoint X and Y don’t have to be of the same size Always assigns a number between 0 and 1 Now threshold to decide if you have a match E.g., if J.C. > 0.8, declare a match

Matching trigrams Consider the query lord – we wish to identify words matching 2 of its 3 bigrams (lo, or, rd) lo alone lord sloth or border lord morbid rd ardent border card lord Standard postings “merge” will enumerate … Adapt this to using Jaccard (or another) measure.

Computational cost Spell-correction is computationally expensive
Avoid running routinely on every query? Run only on queries that matched few docs

Advanced Indexing Issues

Similar presentations

Presentation on theme: "Advanced Indexing Issues"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Advanced Indexing Issues

Similar presentations

Presentation on theme: "Advanced Indexing Issues"— Presentation transcript:

Similar presentations

About project

Feedback