Advanced Indexing Issues 1. Additional Indexing Issues Indexing for queries about the Web structure Indexing for queries containing wildcards Preprocessing.

Advanced Indexing Issues 1

Additional Indexing Issues Indexing for queries about the Web structure Indexing for queries containing wildcards Preprocessing of text Spelling correction 2

Indexing the Web Structure 3

4 Connectivity Server Support for fast queries on the web graph –Which URLs point to a given URL? –Which URLs does a given URL point to? Applications –Crawl control –Web graph analysis (Connectivity, crawl optimization) –Link analysis

5 WebGraph The WebGraph Framework I: Compression Techniques. Boldi and Vigna, WWW2004 Goal: maintain node adjacency lists in memory –For this, compressing the adjacency lists is the critical component

6 Adjacency lists: Naïve Solution The set of neighbors of a node Assume each URL represented by an integer –for a 118 million page web, need 27 bits per node Naively, this demands 54 bits to represent each hyperlink –When stored in an adjacency list, we cut size in half. Why? Method we will see achieved: 118M nodes, 1G links with an average of 3 bits per link!!!

7 Observed Properties of Links Locality: Most links lead the user to a page on the same host = if URLs are stored lexicographically the index and source and target are close Similarity: URLs that are lexicographically close have many common successors Consecutivity: Often many successors of a page have consecutive URLs

Naïve Representation Why do we need out-degree? 8

Gap Compression Successor list of x: S(x) = {s 1 -x, s 2 -s 1 -1,..., s k -s k-1 -1} For the first entry, we actually store: – 2x if x>=0 –2|x| -1 if x<0 Why?

Reference Compression Instead of representing S(x), we can code it as a modified version of S(y), for some y<x y-x is the reference number The copy list is a bit sequence the length of S(y) which indicates which of the values in S(y) are also in S(x) Also store a list of extra nodes S(x)-S(y) Usually we only consider references that are within a sliding window 10

Reference Compression Ref of 0 indicates that no referencing is used, i.e., no copy list

Differential Compression Do not store a copy list that is the length of S(y) Look at the copy list as a an alternating sequence of 1 and 0 blocks Store the number of blocks Store the length of the first block Store the lengths - 1 of all other blocks besides the last block (why?) Always assume that the first block is of 1 (may be a block of length 0) 12

Using copy blocks

Compressing Extra Nodes List Use consecutivity to compress list of extra nodes –We find subsequences containing consecutive numbers of length >= given threshold L min Store a list of integer intervals: for each interval we store left extreme and length –Left extremes are compressed using differences between left extremes – 2 –Intervals are decremented by L min Store a list of residuals compressed using gaps (and using a variable length encoding) 14

Compressing intervals Important property: These series of numbers are self- decodable Note: Node number is not actually stored

16 Indexing for Wildcard Queries

17 Finding Lexicon Entries Suppose that we are given a term t, we can find its entry in the lexicon using binary search What happens if we are given a term with a wild card? How can we find the following terms in the lexicon? –lab* –*or –lab*r

18 Indexing Using n-grams Decompose all terms into n-grams for some small value of n –n-grams are sequences of n letters in a word –use $ to mark beginning and end of word –digram = 2-gram Example: Digrams of labor are $l, la, ab, bo, r$ We store an additional structure, that, for each digram, points has a list of terms containing that digram –actually, store a list of pointers to entries in the lexicon

19 Example Term NumberTerm 1abhor 2bear 3laaber 4labor 5laborator 6labour 7lavacaber 8slab DigramTerm numbers $a1 $b2 $l3,4,5,6,7 $s8 aa3 ab1,3,4,5,6,7,8 bo4,5,6 la3,4,5,6,7,8 or1,4,5 ou6 ra5 ry5 r$ sl Can you fill this in? alphabetically sorted

20 Term NumberTerm 1abhor 2bear 3laaber 4labor 5laborator 6labour 7lavacaber 8slab DigramTerm numbers $a1 $b2 $l3,4,5,6,7 $s8 aa3 ab1,3,4,5,6,7,8 bo4,5,6 la3,4,5,6,7,8 or1,4,5 ou6 ra5 ry5 r$ sl To find lab*r, you would look for terms in common with $l, la, ab, r$ and then post- process to ensure that the term does match

21 Indexing Using Rotated Lexicons In wildcard queries are common, we can save on time at the cost of more space Rotated indexed can find the matches of any wildcard query with a single wildcard in one binary search An index entry is stored for each letter of each term –labor, would have 6 pointers: one for each letter + one for the beginning of the word

22 Partial Example Term NumberTerm 1abhor 2bear 4labor 6labour Rotated FormAddress $abhor(1,0) $bear(2,0) $labor(4,0) $labour(5,0) abhor$(1,1) abor$l(4,2) abour$l(6,2) r$abho(1,5) r$bea(2,4) r$labo(4,5) r$labou(6,6) Note: We do not actually store the rotated string in the rotated lexicon. The pair of numbers is enough for binary search

23 Partial Example Term NumberTerm 1abhor 2bear 4labor 6labour Rotated FormAddress $abhor(1,0) $bear(2,0) $labor(4,0) $labour(5,0) abhor$(1,1) abor$l(4,2) abour$l(6,2) r$abho(1,5) r$bea(2,4) r$labo(4,5) r$labou(6,6) How would you find the terms for: lab* *or *ab* l*r l*b*r

24 Summary We now know how to find terms that match a wildcard query Basic steps for query evaluation for a wildcard query –lookup the wildcard-ed words in an auxiliary index, to find all possible matching terms –given these terms proceed with normal query processing (as if this was not a wildcard query) But, we have not yet explained how normal query processing should proceed! This is the next topic.

Homework We would like to support retrieval of documents containing phrases –E.g., given “information retrieval”, return all documents containing “information” and then “retrieval” immediately afterwards Proposed Solution: Biword index. –Should efficiently support the following: given two wordIds w 1, w 2, efficiently return all words containing w 1 and then immediately afterwards w 2 25

Homework (cont) 1.Suggest a data structure for storing the biword index. Describe the structure precisely. 2.Draw the structure assuming that the following documents are given: –בהצלחה רבה במבחן –עידו עבר בהצלחה רבה –קל במבחן קל 3.Analyze the search complexity of the structure. 4.Analyze the size complexity of the structure. 26

27 Preprocessing the Data

28 Choosing What Data To Store We would like the user to be able to get as many relevant answers to his query Examples: –Query: computer science. Should it match Computer Science? –Query: data compression. Should it match compressing data? –Query: Amir Perez. Should it match Amir Peretz? The way we store the data in our lexicon will affect our answers to the queries

29 Case Folding Normally accepted to perform case folding, i.e., to reduce all words to lower-case form before storing in the lexicon Use’s query is transformed to lower case before looking for the terms in the index What affect does this have on the lexicon size?

30 Stemming Suppose that a user is interested in finding pages about “running shoes” –We may also want to return pages with shoe –We may also want to return pages with run or runs Solution: Use a stemmer –Stemmer returns the stem (שורש) of the word Note: This means that more relevant answers will be returned, as well as more irrelevant answers! –Example: cleary AND witten => clear AND wit

31 Porter Stemmer A multi-step, longest-match stemmer. –Paper introducing this stemmer can be found onlineonline Notation –v vowel(s) –cconstant(s) –(vc) m vowel(s) followed by constant(s), repeated m times Any word can be written: [c](vc) m [v] –brackets are optional –m is called the measure of the word We discuss only the first few rules of the stemmer

32 Porter Stemmer: Step 1a Follow first applicable rule SuffixReplacementExamples ssessscaresses => caress iesi ponies => poni ties => ti ss caress => caress snullcats => cat

33 Porter Stemmer: Step 1b ConditionsSuffixReplacementExamples (m > 0)eedeefeed -> feed agreed -> agree (*v*)ednullplastered -> plaster bled -> bled (*v*)ingnullmotoring -> motor sing -> sing *v* - the stem contains a vowel Follow first applicable rule

34 Stop Words Stop words are very common words that generally are not of importance, e.g.: the, a, to Such words take up a lot of room in the index (why?) They slow down query processing (why?) They generally do not improve the results (why?) Some search engines do not store these words at all, and remove them from queries –Is this always a good idea?

Spelling correction

Spell correction Two principal uses: 1.Correcting document(s) being indexed 2.Retrieve matching documents when query contains a spelling error Types of spelling correction: –Isolated word spelling correction –Context-sensitive spelling correction –Only latter will catch: I flew form Heathrow to Narita.

Document correction Primarily for OCR’ed documents –Correction algorithms tuned for this Goal: the index (dictionary) contains fewer OCR- induced misspellings Can use domain-specific knowledge –E.g., OCR can confuse O and D more often than it would confuse O and I (adjacent on the QWERTY keyboard, so more likely interchanged in typing).

Query mis-spellings Our principal focus here –E.g., the query carot We can either –Retrieve documents indexed by the correct spelling, OR –Return several suggested alternative queries with the correct spelling Did you mean … ?

Isolated word correction Fundamental premise – there is a lexicon from which the correct spellings come Two basic choices for this –A standard lexicon such as (e.g., Webster’s English Dictionary, industry-specific lexicon) – The lexicon of the indexed corpus (e.g., all words on the web, including the mis-spellings)

Isolated word correction Problem: Given a lexicon and a character sequence Q, return the words in the lexicon closest to Q What’s “closest”? We’ll study several alternatives –Edit distance –Weighted edit distance –n-gram overlap

Edit distance Edit Distance: Given two strings S 1 and S 2, the minimum number of basic operations to convert one to the other Basic operations are typically character-level –Insert, Delete. Replace The edit distance from cat to dog is 3. –What is the edit distance from cat to dot? –What is the maximal edit distance between s and t? Generally found by dynamic programming.

Computing Edit Distance: Intuition Suppose we want to compute edit distance of s and t Create a matrix d with 0,…,|s| columns and 0,…,|t| rows The entry d[i,j] is the edit distance between the words: –s j (i.e. prefix of s of size j) –t i (i.e. prefix of s of size i) Where will the edit distance of s and t be placed?

Computing Edit Distance (1) To compute the edit distance of s and t: 1.n := length(s) m := length(t) If n = 0, return m and exit. If m = 0, return n and exit. 2.Construct a matrix containing 0..m rows and 0..n columns. Initialize the first row to 0..n. Initialize the first column to 0..m.

Computing Edit Distance (2) 3.for each character of s (i from 1 to n). 4. for each character of t (j from 1 to m). 5. If s[i] equals t[j], c := 0. If s[i] doesn't equal t[j], c := 1. 6. d[i,j] := min(d[i-1,j]+1, d[i,j-1]+1, d[i-1,j-1]+c). 7.Return d[n,m].

Example: s=GUMBO, t=GAMBOL Steps 1 and 2: GUMBO 012345 G1 A2 M3 B4 O5 L6 Continue on blackboard!

Weighted edit distance As above, but the weight of an operation depends on the character(s) involved –Meant to capture keyboard errors, e.g. m more likely to be mis-typed as n than as q –Therefore, replacing m by n is a smaller edit distance than by q –(Same ideas usable for OCR, but with different weights) Require weight matrix as input Modify dynamic programming to handle weights

Edit distance to all dictionary terms? Given a (mis-spelled) query – do we compute its edit distance to every dictionary term? –Expensive and slow How do we cut the set of candidate dictionary terms? Here we use n-gram overlap for this

n-gram overlap Enumerate all the n-grams in the query string as well as in the lexicon Use the n-gram index (recall wild-card search) to retrieve all lexicon terms matching any of the query n-grams Threshold by number of matching n-grams

Example with trigrams Suppose the text is november –Trigrams are nov, ove, vem, emb, mbe, ber. The query is december –Trigrams are dec, ece, cem, emb, mbe, ber. So 3 trigrams overlap (of 6 in each term) How can we turn this into a normalized measure of overlap?

One option – Jaccard coefficient A commonly-used measure of overlap Let X and Y be two sets; then the J.C. is Equals 1 when X and Y have the same elements and zero when they are disjoint X and Y don’t have to be of the same size Always assigns a number between 0 and 1 –Now threshold to decide if you have a match –E.g., if J.C. > 0.8, declare a match

Matching trigrams Consider the query lord – we wish to identify words matching 2 of its 3 bigrams (lo, or, rd) lo or rd alonelordsloth lordmorbid bordercard border ardent Standard postings “merge” will enumerate … Adapt this to using Jaccard (or another) measure. Can we compute Jaccard coefficient if words are not stored in n-gram index?

Computational cost Spell-correction is computationally expensive Avoid running routinely on every query? –Run only on queries that matched few docs

Advanced Indexing Issues 1. Additional Indexing Issues Indexing for queries about the Web structure Indexing for queries containing wildcards Preprocessing.

Similar presentations

Presentation on theme: "Advanced Indexing Issues 1. Additional Indexing Issues Indexing for queries about the Web structure Indexing for queries containing wildcards Preprocessing."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Advanced Indexing Issues 1. Additional Indexing Issues Indexing for queries about the Web structure Indexing for queries containing wildcards Preprocessing.

Similar presentations

Presentation on theme: "Advanced Indexing Issues 1. Additional Indexing Issues Indexing for queries about the Web structure Indexing for queries containing wildcards Preprocessing."— Presentation transcript:

Similar presentations

About project

Feedback