Advanced Indexing Issues

Additional Indexing Issues

– Indexing for queries about the Web structure
– Indexing for queries containing wildcards
– Preprocessing of text
– Spelling correction

Indexing the Web Structure

Connectivity Server

Support for fast queries on the web graph:
– Which URLs point to a given URL?
– Which URLs does a given URL point to?

Applications:
– Crawl control
– Web graph analysis (connectivity, crawl optimization)
– Link analysis

WebGraph

"The WebGraph Framework I: Compression Techniques." Boldi and Vigna, WWW 2004.

Goal: maintain node adjacency lists in memory.
– For this, compressing the adjacency lists is the critical component.

Adjacency Lists: Naïve Solution

The adjacency list of a node is the set of its neighbors. Assume each URL is represented by an integer:
– for a 118 million page web, we need 27 bits per node.

Naively, this demands 54 bits (source plus target) to represent each hyperlink.
– When stored in an adjacency list, we cut the size in half. Why?

The method we will see achieves: 118M nodes, 1G links, with an average of 3 bits per link!

Observed Properties of Links

Locality: most links lead the user to a page on the same host, so if URLs are numbered lexicographically, the indices of source and target are close.

Similarity: URLs that are lexicographically close have many common successors.

Consecutivity: often, many successors of a page have consecutive URLs.

Naïve Representation

Each node is stored with its out-degree followed by its successor list. Why do we need the out-degree?

Gap Compression

Successor list of x: S(x) = {s_1 − x, s_2 − s_1 − 1, ..., s_k − s_{k-1} − 1}

For the first entry, whose value v = s_1 − x may be negative, we actually store:
– 2v if v >= 0
– 2|v| − 1 if v < 0

Why?
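To make the scheme concrete, here is a minimal Python sketch of gap-encoding a successor list, including the signed-to-unsigned mapping for the first entry (the function name is illustrative; the variable-length bit codes used by the real WebGraph format are omitted). Note how the mapping interleaves 0, −1, 1, −2, 2, ... onto 0, 1, 2, 3, 4, ..., so the one gap that can be negative still becomes a non-negative integer suitable for such codes.

```python
def encode_gaps(x, successors):
    """Gap-encode the sorted successor list S(x) of node x.
    A sketch: real WebGraph feeds these values to variable-length codes."""
    out = []
    prev = None
    for i, s in enumerate(successors):
        if i == 0:
            v = s - x                      # the only gap that can be negative
            out.append(2 * v if v >= 0 else 2 * abs(v) - 1)
        else:
            out.append(s - prev - 1)       # >= 0 because the list is sorted
        prev = s
    return out

# e.g. node 15 with successors [13, 15, 16, 17, 18, 19, 23, 24, 203]
print(encode_gaps(15, [13, 15, 16, 17, 18, 19, 23, 24, 203]))
# -> [3, 1, 0, 0, 0, 0, 3, 0, 178]
```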

Reference Compression

Instead of representing S(x) directly, we can code it as a modified version of S(y), for some y < x.
– x − y is the reference number.

The copy list is a bit sequence of the length of S(y), which indicates which of the values in S(y) are also in S(x).

We also store a list of extra nodes: S(x) − S(y).

Usually we only consider references within a sliding window.

Reference Compression

A reference number of 0 indicates that no referencing is used, i.e., there is no copy list.

Differential Compression

Instead of storing a copy list of the length of S(y):
– Look at the copy list as an alternating sequence of 1-blocks and 0-blocks.
– Store the number of blocks.
– Store the length of the first block.
– Store the lengths − 1 of all other blocks besides the last block (why?).
– Always assume that the first block is a 1-block (it may have length 0).

A sketch of this encoding follows.
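This sketch derives the copy blocks from a reference list S(y) and a successor list S(x), following the rules above (names are illustrative, and the bit packing of a real implementation is omitted):

```python
def copy_blocks(ref_list, succ_x):
    """Encode which entries of the reference list S(y) also occur in S(x)
    as copy blocks. Returns (block count, stored lengths, extra nodes)."""
    members = set(succ_x)
    copy_list = [1 if v in members else 0 for v in ref_list]
    lengths, cur, run = [], 1, 0           # the first block describes 1s
    for bit in copy_list:
        if bit == cur:
            run += 1
        else:
            lengths.append(run)
            cur, run = bit, 1
    lengths.append(run)
    stored = lengths[:-1]                  # last length is implied by |S(y)|
    # Blocks after the first have length >= 1, hence the "- 1" (the 'why?').
    stored = stored[:1] + [l - 1 for l in stored[1:]]
    ref_set = set(ref_list)
    extras = [v for v in succ_x if v not in ref_set]
    return len(stored), stored, extras

# S(y) = [13,15,16,17,18,19,23,24,203], S(x) = [15,16,17,22,23,24]
# copy list 0 1 1 1 0 0 1 1 0 -> blocks [0,1,3,2,2,1] -> stored [0,0,2,1,1]
print(copy_blocks([13, 15, 16, 17, 18, 19, 23, 24, 203],
                  [15, 16, 17, 22, 23, 24]))
# -> (5, [0, 0, 2, 1, 1], [22])
```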

Using copy blocks

Compressing the Extra Nodes List

Use consecutivity to compress the list of extra nodes:
– We find subsequences of consecutive numbers of length >= a given threshold L_min.

Store a list of integer intervals; for each interval we store the left extreme and the length.
– Left extremes are compressed using the differences between left extremes, minus 2.
– Interval lengths are decremented by L_min.

Store a list of residuals, compressed using gaps (and a variable-length encoding).
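A small sketch of the interval-extraction step, assuming the extra-node list is sorted (the left-extreme differencing and the gap coding of residuals are left out for clarity):

```python
def split_intervals(extras, L_min=2):
    """Split a sorted extra-node list into maximal runs of consecutive
    integers of length >= L_min (kept as (left, length) intervals)
    and leftover residuals."""
    intervals, residuals = [], []
    i = 0
    while i < len(extras):
        j = i
        while j + 1 < len(extras) and extras[j + 1] == extras[j] + 1:
            j += 1                         # extend the consecutive run
        run = j - i + 1
        if run >= L_min:
            intervals.append((extras[i], run))
        else:
            residuals.extend(extras[i:j + 1])
        i = j + 1
    return intervals, residuals

print(split_intervals([0, 1, 2, 5, 8, 9, 12]))
# -> ([(0, 3), (8, 2)], [5, 12])
```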

Compressing Intervals

Important property: this series of numbers is self-decodable.

Note: the node number is not actually stored.

Indexing for Wildcard Queries

Finding Lexicon Entries

Given a term t, we can find its entry in the lexicon using binary search. What happens if we are given a term with a wildcard? How can we find the following terms in the lexicon?
– lab*
– *or
– lab*r

Indexing Using n-grams

Decompose all terms into n-grams for some small value of n:
– n-grams are sequences of n letters in a word
– use $ to mark the beginning and end of a word
– digram = 2-gram

Example: the digrams of labor are $l, la, ab, bo, or, r$.

We store an additional structure that, for each digram, has a list of the terms containing that digram.
– Actually, we store a list of pointers to entries in the lexicon.

Example

Term number   Term
1             abhor
2             bear
3             laaber
4             labor
5             laborator
6             labour
7             lavacaber
8             slab

Digram   Term numbers
$a       1
$b       2
$l       3, 4, 5, 6, 7
$s       8
aa       3
ab       1, 3, 4, 5, 6, 7, 8
bo       4, 5, 6
la       3, 4, 5, 6, 7, 8
or       1, 4, 5
ou       6
ra       5
ry       5
r$       ?
sl       ?

The digrams are alphabetically sorted. Can you fill in the two missing rows?

To find lab*r, you would look up the terms common to $l, la, ab, and r$, and then post-process to ensure that each term actually matches the pattern.
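A minimal Python sketch of this scheme, reproducing the digram table above and the lab*r lookup (helper names such as wildcard_lookup are illustrative):

```python
from collections import defaultdict
from fnmatch import fnmatchcase

terms = ["abhor", "bear", "laaber", "labor", "laborator",
         "labour", "lavacaber", "slab"]

def digrams(word):
    padded = "$" + word + "$"              # $ marks word start and end
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

# digram -> sorted list of term numbers (1-based, as in the table above)
index = defaultdict(list)
for num, term in enumerate(terms, start=1):
    for d in sorted(digrams(term)):
        index[d].append(num)

def pattern_digrams(pattern):
    # 'lab*r' -> pieces '$lab' and 'r$' -> {'$l', 'la', 'ab', 'r$'}
    pieces = ("$" + pattern + "$").split("*")
    return {p[i:i + 2] for p in pieces for i in range(len(p) - 1)}

def wildcard_lookup(pattern):
    candidates = set.intersection(
        *(set(index[g]) for g in pattern_digrams(pattern)))
    # post-process: digram overlap admits false drop-ins such as 'laaber'
    return [terms[n - 1] for n in sorted(candidates)
            if fnmatchcase(terms[n - 1], pattern)]

print(wildcard_lookup("lab*r"))   # ['labor', 'laborator', 'labour']
```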

Indexing Using Rotated Lexicons

If wildcard queries are common, we can save time at the cost of more space. A rotated index can find the matches of any query containing a single wildcard with one binary search.

An index entry is stored for each letter of each term:
– labor would have 6 pointers: one for each letter, plus one for the beginning of the word.

Partial Example

Term number   Term
1             abhor
2             bear
4             labor
6             labour

Rotated form   Address
$abhor         (1,0)
$bear          (2,0)
$labor         (4,0)
$labour        (6,0)
abhor$         (1,1)
abor$l         (4,2)
abour$l        (6,2)
r$abho         (1,5)
r$bea          (2,4)
r$labo         (4,5)
r$labou        (6,6)

Note: we do not actually store the rotated string in the rotated lexicon; the pair (term number, offset) is enough for binary search.

Partial Example (cont)

How would you find the terms for the following queries? A lookup sketch follows the list.
– lab*
– *or
– *ab*
– l*r
– l*b*r
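A sketch of the rotated-lexicon lookup for single-wildcard queries: rotate the query so the wildcard lands at the end, then prefix-search the sorted rotations. For clarity this materializes the rotated strings, whereas, as noted above, a real implementation stores only (term, offset) pairs.

```python
import bisect

terms = ["abhor", "bear", "labor", "labour"]

# All rotations of term + '$', sorted; one entry per letter plus one for '$'.
rotations = sorted(
    ((t + "$")[i:] + (t + "$")[:i], t)
    for t in terms for i in range(len(t) + 1))
keys = [r for r, _ in rotations]

def rotated_lookup(pattern):
    """Answer a single-wildcard query by rotating '*' to the end and
    running one prefix (binary) search over the sorted rotations."""
    w = pattern + "$"
    star = w.index("*")
    prefix = w[star + 1:] + w[:star]        # e.g. 'lab*' -> '$lab'
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix + "\uffff")
    return sorted({t for _, t in rotations[lo:hi]})

print(rotated_lookup("lab*"))   # ['labor', 'labour']
print(rotated_lookup("*or"))    # ['abhor', 'labor']
```

Queries with two wildcards, such as l*b*r, need an extra post-filtering step, much like the digram index.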

Summary

We now know how to find the terms that match a wildcard query. The basic steps of evaluating a wildcard query are:
– look up the wildcarded words in an auxiliary index, to find all possible matching terms;
– given these terms, proceed with normal query processing (as if this were not a wildcard query).

But we have not yet explained how normal query processing should proceed! This is the next topic.

Homework

We would like to support retrieval of documents containing phrases.
– E.g., given "information retrieval", return all documents containing "information" followed immediately by "retrieval".

Proposed solution: a biword index.
– It should efficiently support the following: given two word IDs w_1 and w_2, return all documents containing w_1 followed immediately by w_2.

Homework (cont)

1. Suggest a data structure for storing the biword index. Describe the structure precisely.
2. Draw the structure assuming that the following documents are given:
– great success on the exam
– Ido passed with great success
– easy on an easy exam
3. Analyze the search complexity of the structure.
4. Analyze the size complexity of the structure.

Preprocessing the Data

Choosing What Data to Store

We would like the user to get as many relevant answers to their query as possible. Examples:
– Query: computer science. Should it match Computer Science?
– Query: data compression. Should it match compressing data?
– Query: Amir Perez. Should it match Amir Peretz?

The way we store the data in our lexicon will affect our answers to these queries.

Case Folding

It is normally accepted practice to perform case folding, i.e., to reduce all words to lower-case form before storing them in the lexicon. The user's query is likewise transformed to lower case before its terms are looked up in the index.

What effect does this have on the lexicon size?

Stemming

Suppose that a user is interested in finding pages about "running shoes".
– We may also want to return pages with shoe.
– We may also want to return pages with run or runs.

Solution: use a stemmer.
– A stemmer returns the stem (root) of a word.

Note: this means that more relevant answers will be returned, but also more irrelevant ones!
– Example: cleary AND witten => clear AND wit

Porter Stemmer

A multi-step, longest-match stemmer.
– The paper introducing this stemmer can be found online.

Notation:
– v: vowel(s)
– c: consonant(s)
– (vc)^m: vowel(s) followed by consonant(s), repeated m times

Any word can be written: [c](vc)^m[v]
– bracketed parts are optional
– m is called the measure of the word

We discuss only the first few rules of the stemmer.
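A sketch of computing the measure m from this decomposition. It is simplified: y is treated as a consonant throughout, whereas the real Porter stemmer treats y as a vowel when it follows a consonant.

```python
def measure(word):
    """Compute m in the decomposition [c](vc)^m[v] described above."""
    collapsed = ""
    for ch in word.lower():
        kind = "v" if ch in "aeiou" else "c"
        if not collapsed or collapsed[-1] != kind:
            collapsed += kind              # e.g. 'troubles' -> 'cvcvc'
    return collapsed.count("vc")           # number of (vc) repetitions

for w in ["tree", "trouble", "troubles", "oaten"]:
    print(w, measure(w))                   # 0, 1, 2, 2
```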

Porter Stemmer: Step 1a

Follow the first applicable rule:

Suffix   Replacement   Examples
sses     ss            caresses => caress
ies      i             ponies => poni, ties => ti
ss       ss            caress => caress
s        null          cats => cat

Porter Stemmer: Step 1b

Follow the first applicable rule:

Condition   Suffix   Replacement   Examples
(m > 0)     eed      ee            feed -> feed, agreed -> agree
(*v*)       ed       null          plastered -> plaster, bled -> bled
(*v*)       ing      null          motoring -> motor, sing -> sing

*v* means: the stem contains a vowel.
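Steps 1a and 1b translate directly from the tables above. This sketch reuses the measure function from the previous sketch and omits the clean-up sub-rules that follow step 1b in the full stemmer (e.g., restoring a trailing e).

```python
def step_1a(word):
    """Porter step 1a: apply the first rule whose suffix matches."""
    for suffix, repl in [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]:
        if word.endswith(suffix):
            return word[:len(word) - len(suffix)] + repl
    return word

def has_vowel(stem):
    return any(ch in "aeiou" for ch in stem)   # the (*v*) condition

def step_1b(word):
    """Porter step 1b; once a suffix matches, its rule decides the outcome
    (so 'feed' is left alone even though it also ends in 'ed')."""
    if word.endswith("eed"):
        stem = word[:-3]
        return stem + "ee" if measure(stem) > 0 else word
    if word.endswith("ed") and has_vowel(word[:-2]):
        return word[:-2]
    if word.endswith("ing") and has_vowel(word[:-3]):
        return word[:-3]
    return word

for w in ["caresses", "ponies", "cats", "agreed", "feed", "plastered", "motoring"]:
    print(w, "->", step_1b(step_1a(w)))
```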

Stop Words

Stop words are very common words that are generally unimportant, e.g., the, a, to.
– Such words take up a lot of room in the index (why?).
– They slow down query processing (why?).
– They generally do not improve the results (why?).

Some search engines do not store these words at all, and remove them from queries.
– Is this always a good idea?

Spelling Correction

Spell Correction

Two principal uses:
1. Correcting document(s) being indexed
2. Retrieving matching documents when the query contains a spelling error

Types of spelling correction:
– Isolated-word spelling correction
– Context-sensitive spelling correction
– Only the latter will catch: I flew form Heathrow to Narita.

Document Correction

Primarily for OCR'ed documents.
– Correction algorithms are tuned for this.

Goal: the index (dictionary) contains fewer OCR-induced misspellings.

Can use domain-specific knowledge.
– E.g., OCR can confuse O and D more often than it would confuse O and I (which are adjacent on the QWERTY keyboard, and so more likely interchanged in typing).

Query Misspellings

Our principal focus here.
– E.g., the query carot

We can either:
– retrieve documents indexed by the correct spelling, OR
– return several suggested alternative queries with the correct spelling: "Did you mean ... ?"

Isolated-Word Correction

Fundamental premise: there is a lexicon from which the correct spellings come. Two basic choices for this:
– a standard lexicon (e.g., Webster's English Dictionary, or an industry-specific lexicon)
– the lexicon of the indexed corpus (e.g., all words on the web, including the misspellings)

Isolated-Word Correction

Problem: given a lexicon and a character sequence Q, return the words in the lexicon closest to Q.

What does "closest" mean? We'll study several alternatives:
– edit distance
– weighted edit distance
– n-gram overlap

Edit Distance

Edit distance: given two strings S_1 and S_2, the minimum number of basic operations needed to convert one into the other. Basic operations are typically character-level:
– insert, delete, replace

The edit distance from cat to dog is 3.
– What is the edit distance from cat to dot?
– What is the maximal edit distance between s and t?

Generally found by dynamic programming.

Computing Edit Distance: Intuition

Suppose we want to compute the edit distance of s and t. Create a matrix d with columns 0, ..., |s| and rows 0, ..., |t|. The entry d[i,j] is the edit distance between:
– s_j (the prefix of s of size j)
– t_i (the prefix of t of size i)

Where will the edit distance of s and t itself be placed?

Computing Edit Distance (1)

To compute the edit distance of s and t:

1. n := length(s), m := length(t).
   If n = 0, return m and exit. If m = 0, return n and exit.
2. Construct a matrix containing rows 0..m and columns 0..n.
   Initialize the first row to 0..n and the first column to 0..m.

Computing Edit Distance (2)

3. For each character of s (i from 1 to n):
4.   For each character of t (j from 1 to m):
5.     If s[i] equals t[j], c := 0; otherwise c := 1.
6.     d[i,j] := min(d[i-1,j]+1, d[i,j-1]+1, d[i-1,j-1]+c).
7. Return d[n,m].
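The steps translate directly into Python; a sketch:

```python
def edit_distance(s, t):
    """Dynamic-programming edit distance, following steps 1-7 above."""
    n, m = len(s), len(t)
    if n == 0:
        return m
    if m == 0:
        return n
    # d[i][j] = distance between the first i chars of s and first j of t
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # delete s[i]
                          d[i][j - 1] + 1,       # insert t[j]
                          d[i - 1][j - 1] + c)   # replace (or match)
    return d[n][m]

print(edit_distance("GUMBO", "GAMBOL"))   # 2
print(edit_distance("cat", "dog"))        # 3
```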

Example: s = GUMBO, t = GAMBOL

After steps 1 and 2 (first row and column initialized):

        G  U  M  B  O
    0   1  2  3  4  5
G   1
A   2
M   3
B   4
O   5
L   6

Continue on the blackboard!

Weighted Edit Distance

As above, but the weight of an operation depends on the character(s) involved.
– Meant to capture keyboard errors, e.g., m is more likely to be mistyped as n than as q.
– Therefore, replacing m by n is a smaller edit distance than replacing m by q.
– (The same idea is usable for OCR, but with different weights.)

Requires a weight matrix as input; modify the dynamic programming to handle the weights.

Edit Distance to All Dictionary Terms?

Given a (misspelled) query, do we compute its edit distance to every dictionary term?
– Expensive and slow.

How do we cut down the set of candidate dictionary terms? Here we use n-gram overlap.

n-gram Overlap

Enumerate all the n-grams in the query string as well as in the lexicon. Use the n-gram index (recall wildcard search) to retrieve all lexicon terms matching any of the query n-grams. Threshold by the number of matching n-grams.

Example with Trigrams

Suppose the text is november.
– Its trigrams are nov, ove, vem, emb, mbe, ber.

The query is december.
– Its trigrams are dec, ece, cem, emb, mbe, ber.

So 3 trigrams overlap (of 6 in each term). How can we turn this into a normalized measure of overlap?

One Option: the Jaccard Coefficient

A commonly used measure of overlap. Let X and Y be two sets; then the Jaccard coefficient is |X ∩ Y| / |X ∪ Y|.
– It equals 1 when X and Y have the same elements, and 0 when they are disjoint.
– X and Y don't have to be the same size.
– It always assigns a number between 0 and 1.

Now threshold to decide if you have a match.
– E.g., if J.C. > 0.8, declare a match.
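A sketch putting the last two slides together: the trigram sets of november and december and their Jaccard coefficient.

```python
def trigrams(word):
    return {word[i:i + 3] for i in range(len(word) - 2)}

def jaccard(x, y):
    return len(x & y) / len(x | y)

a, b = trigrams("november"), trigrams("december")
print(sorted(a & b))            # ['ber', 'emb', 'mbe']
print(round(jaccard(a, b), 3))  # 3 shared of 9 distinct -> 0.333
```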

Matching Trigrams

Consider the query lord. We wish to identify words matching 2 of its 3 bigrams (lo, or, rd):

lo: alone, lord, sloth
or: border, lord, morbid
rd: ardent, border, card

A standard postings "merge" will enumerate the matches. Adapt this to use the Jaccard (or another) measure. Can we compute the Jaccard coefficient if the words themselves are not stored in the n-gram index?
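A sketch of the thresholded lookup for the lord example; here a Counter stands in for the k-way merge over sorted postings lists that a real system would use.

```python
from collections import Counter

postings = {                       # bigram -> sorted list of words
    "lo": ["alone", "lord", "sloth"],
    "or": ["border", "lord", "morbid"],
    "rd": ["ardent", "border", "card"],
}

def candidates(query_grams, threshold=2):
    """Keep every word that matches at least `threshold` query n-grams."""
    counts = Counter(w for g in query_grams for w in postings.get(g, []))
    return sorted(w for w, c in counts.items() if c >= threshold)

print(candidates(["lo", "or", "rd"]))   # ['border', 'lord']
```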

Computational Cost

Spell correction is computationally expensive. Avoid running it routinely on every query?
– Run it only on queries that matched few docs.