CS276A Text Information Retrieval, Mining, and Exploitation Lecture 2 1 Oct 2002

Course structure & admin. CS276 spans two quarters this year: CS276A covers IR and the web (link algorithms, infovis, XML, P2P); CS276B covers clustering, categorization, IE, and bio. Website: Course staff, textbooks, required work. Questions?

Today’s topics: inverted index storage (continued); compressing dictionaries in memory; processing Boolean queries (optimizing term processing, skip list encoding); wild-card queries; positional/phrase/proximity queries; evaluating IR systems, Part I.

Dictionary and postings files: a fast, compact inverted index. The dictionary is usually in memory; the postings are gap-encoded, on disk.

Inverted index storage. Last time: postings compression by gap encoding. This time: dictionary storage. Keeping the dictionary in main memory and postings on disk is common, especially for something like a search engine where high throughput is essential, but one can also store most of the dictionary on disk with a small in-memory index. There are tradeoffs between compression and query processing speed, addressed by a cascaded family of techniques.

How big is the lexicon V? It grows with corpus size, but more slowly. An empirically okay model (Heaps’ law): V = kN^b, where b ≈ 0.5, k ≈ 30–100, and N = number of tokens. For instance, TREC disks 1 and 2 (2 GB; 750,000 newswire articles) have ~500,000 terms. The number is decreased by case-folding and stemming; indexing all numbers could make it extremely large (so one usually doesn’t). Spelling errors contribute a fair bit of the size. Exercise: can one derive this from Zipf’s Law?
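To make the V = kN^b model concrete, here is a minimal sketch; the constants k and b are illustrative picks from the stated ranges, not values given in the lecture:

```python
# Heaps'-law-style estimate V = k * N^b. k = 44 and b = 0.49 are
# illustrative constants within the stated ranges (k in 30..100,
# b near 0.5); they are not fitted to TREC.

def vocab_size(n_tokens, k=44.0, b=0.49):
    return int(k * n_tokens ** b)

# On the order of 10^8 tokens for 750K newswire articles.
print(vocab_size(250_000_000))   # a few hundred thousand terms
```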

Dictionary storage, first cut: an array of fixed-width entries. 500,000 terms at 28 bytes/term = 14 MB. This allows fast binary search into the dictionary. (Per entry: 20 bytes for the term, plus 4 bytes each for the frequency and the postings pointer.)

Exercises. Is binary search really a good idea? What are the alternatives?

Fixed-width terms are wasteful. Most of the bytes in the Term column are wasted: we allot 20 bytes even for 1-letter terms, and still can’t handle supercalifragilisticexpialidocious. Written English averages ~4.5 characters per word. Exercise: why is/isn’t this the number to use for estimating the dictionary size? (Short words dominate token counts; the average word type in English is ~8 characters.) Instead, store the dictionary as one long string of characters, with a pointer per term; the pointer to the next word marks the end of the current one. Hope to save up to 60% of dictionary space. A minimal sketch follows.
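A minimal sketch of the dictionary-as-a-string scheme; the data layout and names are ours, for illustration (a real system would pack the offsets into fixed-width binary fields):

```python
# Dictionary as one concatenated string plus per-term offsets.
# A term's end is the next entry's offset: "the pointer to the
# next word marks the end of the current one".

def build_string_dict(sorted_terms):
    big_string = "".join(sorted_terms)
    offsets, pos = [], 0
    for t in sorted_terms:
        offsets.append(pos)
        pos += len(t)
    return big_string, offsets

def term_at(big_string, offsets, i):
    end = offsets[i + 1] if i + 1 < len(offsets) else len(big_string)
    return big_string[offsets[i]:end]

def lookup(big_string, offsets, query):
    # Binary search over the offset array, materializing the term
    # at each probe position for comparison.
    lo, hi = 0, len(offsets) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        t = term_at(big_string, offsets, mid)
        if t == query:
            return mid
        if t < query:
            lo = mid + 1
        else:
            hi = mid - 1
    return -1
```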

Compressing the term list: ….systilesyzygeticsyzygialsyzygyszaibelyiteszczecinszomo…. Binary search these pointers. Total string length = 500K terms × 8 bytes = 4 MB. Pointers must resolve 4M positions: log₂ 4M = 22 bits ≈ 3 bytes.

Total space for the compressed list: 4 bytes per term for the frequency, 4 bytes per term for the pointer to postings, 3 bytes per term pointer, and an average of 8 bytes per term in the term string. 500K terms → 9.5 MB. Term storage is now avg. 11 bytes/term (3-byte pointer + 8-byte string), not 20.

Blocking. Store pointers only to every kth term in the string, and store term lengths (1 extra byte each): ….7systile9syzygetic8syzygial6syzygy11szaibelyite8szczecin9szomo…. With k = 4 we save 9 bytes on 3 pointers, but lose 4 bytes on term lengths. A sketch follows.
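A sketch of blocked storage with k = 4; the encoding is ours for illustration (a one-character length prefix standing in for the one extra byte the slide describes):

```python
# Blocking with k = 4: keep a pointer only to every 4th term; within
# a block, each term carries a length prefix instead of a pointer.

K = 4

def build_blocked(sorted_terms):
    s, block_starts, pos = [], [], 0
    for i, t in enumerate(sorted_terms):
        if i % K == 0:
            block_starts.append(pos)
        s.append(chr(len(t)))   # 1-byte length prefix (terms < 256 chars)
        s.append(t)
        pos += 1 + len(t)
    return "".join(s), block_starts

def block_terms(s, start):
    """Decode the up-to-K terms of the block beginning at offset start."""
    terms, pos = [], start
    while pos < len(s) and len(terms) < K:
        n = ord(s[pos])
        terms.append(s[pos + 1 : pos + 1 + n])
        pos += 1 + n
    return terms
```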

Exercise. Estimate the space usage (and savings compared to 9.5 MB) with blocking, for block sizes of k = 4, 8, and 16.

Impact on search: binary search down to the 4-term block, then linear search through the terms in the block. With 8 terms in a balanced binary tree, avg. = (1 + 2·2 + 4·3 + 4)/8 ≈ 2.6 compares; with blocks of 4 (binary tree over blocks), avg. = (1 + 2·2 + 2·3 + 2·4 + 5)/8 = 3 compares.

Extreme compression (see MG). Front-coding: sorted words commonly share a long prefix, so store differences only (e.g., for 3 terms in every 4). Using perfect hashing to store terms “within” their pointers is not good for vocabularies that change. Or partition the dictionary into pages and use a B-tree on the first terms of pages, paying a disk seek to grab each page; if we’re paying 1 disk seek anyway to get the postings, it’s “only” another seek per query term. A front-coding sketch follows.
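A sketch of front-coding; the (prefix-length, suffix) representation is ours, for illustration (a real dictionary would pack these into bytes):

```python
# Front-coding: each term stored as (length of prefix shared with the
# previous term, remaining suffix).

def front_code(sorted_terms):
    coded, prev = [], ""
    for t in sorted_terms:
        p = 0
        while p < min(len(t), len(prev)) and t[p] == prev[p]:
            p += 1
        coded.append((p, t[p:]))
        prev = t
    return coded

def front_decode(coded):
    terms, prev = [], ""
    for p, suffix in coded:
        t = prev[:p] + suffix
        terms.append(t)
        prev = t
    return terms

# front_code(["syzygetic", "syzygial", "syzygy"])
# -> [(0, 'syzygetic'), (5, 'ial'), (5, 'y')]
```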

Compression: two alternatives. Lossless compression: all information is preserved but encoded compactly; this is what IR people mostly do. Lossy compression: discard some information. Using a stoplist can be thought of in this way, and techniques such as Latent Semantic Indexing (17 Oct) can be viewed as lossy compression. One could also prune from postings the entries unlikely to turn up in the top-k list for any query on the word; this is especially applicable to web search, with huge numbers of documents but short queries (e.g., Carmel et al., SIGIR 2002).

Boolean queries: exact match. An algebra of queries using AND, OR, and NOT together with query words; this is what we used in the examples in the first class. It uses a “set of words” document representation and is precise: a document either matches the condition or it doesn’t. Boolean retrieval was the primary commercial retrieval tool for 3 decades. Researchers had long argued the superiority of ranked IR systems, but these were not much used in practice until the spread of web search engines. Professional searchers still like Boolean queries: you know exactly what you’re getting. Cf. Google’s Boolean AND criterion.

Query optimization. Consider a query that is an AND of t terms. The idea: for each of the t terms, get its postings from the index, then AND them together. Process the terms in order of increasing frequency: start with the smallest set, then keep cutting further. This is why we kept the frequency in the dictionary. A minimal sketch is below.
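A minimal sketch of this strategy, assuming postings lists are sorted lists of doc IDs (the function names are ours):

```python
# AND of t terms, processed in increasing order of document frequency
# (this is exactly why freq is kept in the dictionary).
# `postings` maps term -> sorted list of doc IDs.

def intersect(p1, p2):
    """Linear merge of two sorted doc-ID lists (no 0/1 vectors built)."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def and_query(terms, postings):
    terms = sorted(terms, key=lambda t: len(postings[t]))  # rarest first
    result = postings[terms[0]]
    for t in terms[1:]:
        result = intersect(result, postings[t])
        if not result:       # the candidate set only shrinks; stop early
            break
    return result
```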

Query processing exercises. If the query is friends AND romans AND (NOT countrymen), how could we use the freq of countrymen? How can we perform the AND of two postings entries without explicitly building the 0/1 term-doc incidence vector?

General query optimization, e.g., (madding OR crowd) AND (ignoble OR strife). Can put any Boolean query into CNF. Get frequencies for all terms; estimate the size of each OR by the sum of its constituents’ frequencies (a conservative overestimate); process in increasing order of OR sizes.

Exercise. Recommend a query processing order for (tangerine OR trees) AND (marmalade OR skies) AND (kaleidoscope OR eyes).

Speeding up postings merges: insert skip pointers. Say our current list of candidate docs for an AND query is 8, 13, 21 (having done a bunch of ANDs), and we want to AND it with the postings entry 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22. A linear scan is slow.

Augment postings with skip pointers (at indexing time): 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, … At query time, as we walk the current candidate list, concurrently walk the inverted file entry, skipping ahead where possible (e.g., from 8 toward 21). Skip size: about √(list length) is recommended. A sketch follows.
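A sketch of a skip-pointer merge, assuming skips are placed at indexing time with spacing ~√(list length); the representation (a parallel array of skip targets) is ours, for illustration:

```python
import math

def add_skips(postings):
    """Return (postings, skip) where skip[i] is the index reachable by
    a skip pointer from i, or None. Skips are spaced ~sqrt(L) apart."""
    step = max(1, int(math.sqrt(len(postings))))
    skip = [None] * len(postings)
    for i in range(0, len(postings), step):
        if i + step < len(postings):
            skip[i] = i + step
    return postings, skip

def intersect_with_skips(a, b):
    (pa, sa), (pb, sb) = add_skips(a), add_skips(b)
    out, i, j = [], 0, 0
    while i < len(pa) and j < len(pb):
        if pa[i] == pb[j]:
            out.append(pa[i]); i += 1; j += 1
        elif pa[i] < pb[j]:
            if sa[i] is not None and pa[sa[i]] <= pb[j]:
                i = sa[i]          # take the skip: it doesn't overshoot
            else:
                i += 1             # ordinary step
        else:
            if sb[j] is not None and pb[sb[j]] <= pa[i]:
                j = sb[j]
            else:
                j += 1
    return out

# The slide's example: candidates AND an even-numbered postings list.
print(intersect_with_skips([2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22],
                           [8, 13, 21]))   # [8]
```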

Caching. If 25% of your users are searching for Britney Spears, then you probably do need spelling correction, but you don’t need to keep on intersecting those two postings lists. The web query distribution is extremely skewed, and you can usefully cache the results for common queries.

Query vs. index expansion. Recall from lecture 1: thesauri for term equivalents, soundex for homonyms. How do we use these? We can “expand” the query to include equivalences, e.g., the query car tyres → car tyres automobile tires. Or we can expand the index: index docs containing car under automobile as well.

Query expansion. One usually does query expansion. Pro: no index blowup. Cons: query processing is slowed down, especially when docs frequently contain the equivalences, and it may retrieve more junk (puma → jaguar). Hence: carefully controlled wordnets.

Wild-card queries: *. mon*: find all docs containing any word beginning “mon”. Easy with a binary tree (or B-tree) lexicon: retrieve all words w in the range mon ≤ w < moo. *mon: find words ending in “mon”: harder. Permuterm index: for the word hello, index it under its rotations hello$, ello$h, llo$he, lo$hel, o$hell, $hello (where $ marks the end of the term). Queries: X → lookup on X$; X* → lookup on $X*; *X → lookup on X$*; *X* → lookup on X*; X*Y → lookup on Y$X*; X*Y*Z → ??? Exercise!
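A minimal permuterm sketch; the data layout (a sorted list of rotation/term pairs) is ours, for illustration:

```python
import bisect

# Permuterm index: store every rotation of term + '$'; answer a
# wildcard query by rotating it so the * lands at the end, then
# doing a prefix lookup over the sorted rotations.

def rotations(term):
    t = term + "$"
    return [t[i:] + t[:i] for i in range(len(t))]

def build_permuterm(terms):
    entries = []   # (rotation, original term), sorted for prefix search
    for term in terms:
        entries.extend((rot, term) for rot in rotations(term))
    entries.sort()
    return entries

def prefix_lookup(entries, prefix):
    lo = bisect.bisect_left(entries, (prefix,))
    hits = set()
    while lo < len(entries) and entries[lo][0].startswith(prefix):
        hits.add(entries[lo][1])
        lo += 1
    return hits

# mon* -> rotate mon*$ so * is at the end: prefix lookup on "$mon".
idx = build_permuterm(["month", "moon", "lemon", "monday"])
print(prefix_lookup(idx, "$mon"))   # {'month', 'monday'}
```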

Wild-card queries. Permuterm problem: it roughly quadruples the lexicon size. Another way: index all k-grams occurring in any word (any sequence of k chars). E.g., from the text “April is the cruelest month” we get the 2-grams (bigrams) $a, ap, pr, ri, il, l$, $i, is, s$, $t, th, he, e$, $c, cr, ru, ue, el, le, es, st, t$, $m, mo, on, nt, h$, where $ is a special word boundary symbol.

Processing n-gram wild-cards. The query mon* can now be run as $m AND mo AND on: fast and space-efficient. But we’d also get a match on moon, so we must post-filter these results against the query. Further wild-card refinements: cut down on pointers by using blocks; wild-card queries tend to have few bigrams, so keep their postings on disk. Exercise: given a trigram index, how do you process an arbitrary wild-card query? A sketch of the bigram case follows.
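A sketch of the bigram case for prefix queries like mon*; the names and index layout are ours, for illustration:

```python
from collections import defaultdict

def kgrams(term, k=2):
    """All k-grams of a term, with $ as the word boundary symbol."""
    t = "$" + term + "$"
    return {t[i:i + k] for i in range(len(t) - k + 1)}

def build_kgram_index(vocab, k=2):
    index = defaultdict(set)   # k-gram -> set of vocabulary terms
    for term in vocab:
        for g in kgrams(term, k):
            index[g].add(term)
    return index

def wildcard_prefix(prefix, index, k=2):
    # Grams of "$mon" only -- no trailing $, since the word need not
    # end there. For mon* this yields {$m, mo, on}.
    padded = "$" + prefix
    grams = {padded[i:i + k] for i in range(len(padded) - k + 1)}
    candidates = set.intersection(*(index[g] for g in grams))
    # Post-filter: the gram match is necessary but not sufficient
    # (moon also contains $m, mo, and on).
    return {t for t in candidates if t.startswith(prefix)}

idx = build_kgram_index(["month", "moon", "monday", "lemon"])
print(wildcard_prefix("mon", idx))   # {'month', 'monday'}
```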

Phrase search. Search for “to be or not to be”. It no longer suffices to store only <term: docs> entries. But one could just do this anyway and then post-filter (i.e., grep) for phrase matches; this is viable if phrase matches are uncommon. Alternatively, store for each term entries of the form <number of docs containing term; doc1: position1, position2, …; doc2: position1, position2, …; etc.>

Positional index example: <be: …; 1: 7, 18, 33, 72, 86, 231; 2: 3, 149; 4: 17, 191, 291, 430, 434; 5: 363, 367, …> Which of these docs could contain “to be or not to be”? We can compress the position values/offsets as we did with docs in the last lecture; nevertheless, a positional index expands the postings lists substantially.

Processing a phrase query. Extract the inverted index entries for each distinct term: to, be, or, not. Merge their doc:position lists to enumerate all positions where “to be or not to be” begins: to: 2: 1, 17, 74, 222, 551; 4: 8, 27, 101, 429, 433; 7: 13, 23, 191; … be: 1: 17, 19; 4: 17, 191, 291, 430, 434; 5: 14, 19, 101; … The same general method works for proximity searches. A sketch of the pairwise positional merge is below.
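A sketch of the pairwise positional merge, using the slide’s data; the doc-to-positions representation is ours, and a full phrase query chains such pairwise merges, offsetting positions by one each step:

```python
def phrase_pair(pos_a, pos_b):
    """Docs (and start positions) where term B occurs exactly one
    position after term A. pos_a, pos_b: {doc: sorted positions}."""
    hits = {}
    for doc in pos_a.keys() & pos_b.keys():
        bpos = set(pos_b[doc])
        starts = [p for p in pos_a[doc] if p + 1 in bpos]
        if starts:
            hits[doc] = starts
    return hits

to = {2: [1, 17, 74, 222, 551], 4: [8, 27, 101, 429, 433], 7: [13, 23, 191]}
be = {1: [17, 19], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]}
print(phrase_pair(to, be))   # {4: [429, 433]} -- "to be" starts at 429, 433
```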

Example: WestLaw, the largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992). About 7 terabytes of data; 700,000 users. The majority of users still use Boolean queries. Example query: “What is the statute of limitations in cases involving the federal tort claims act?” becomes LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM. Queries are long and precise, use proximity operators, and are developed incrementally: not like web search.

Evaluating an IR system, Part I. What are some measures of an IR system’s performance? Speed of indexing; index/corpus size ratio; speed of query processing; “relevance” of results. Note: an information need is translated into a Boolean query, and relevance is assessed relative to the information need, not the query.

Standard relevance benchmarks. TREC: the National Institute of Standards and Technology (NIST) has run a large IR testbed for many years. Reuters and other benchmark sets are also used. “Retrieval tasks” are specified, sometimes as queries. Human experts mark, for each query and each doc, “relevant” or “not relevant”, or at least do so for the subset of docs that some system returned.

Precision and recall. Precision: the fraction of retrieved docs that are relevant = P(relevant|retrieved). Recall: the fraction of relevant docs that are retrieved = P(retrieved|relevant). Contingency table:

                 Relevant    Not relevant
  Retrieved         tp            fp
  Not retrieved     fn            tn

Precision P = tp/(tp + fp); Recall R = tp/(tp + fn).
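The definitions directly in code, with made-up counts for illustration:

```python
def precision(tp, fp):
    return tp / (tp + fp)   # P(relevant | retrieved)

def recall(tp, fn):
    return tp / (tp + fn)   # P(retrieved | relevant)

# Made-up counts: 40 relevant docs retrieved, 10 junk docs retrieved,
# 60 relevant docs missed.
tp, fp, fn = 40, 10, 60
print(precision(tp, fp), recall(tp, fn))   # 0.8 0.4
```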

Why not just use accuracy? Consider how to build a “highly accurate” search engine on a low budget: have it return nothing for every query. Since almost all documents are non-relevant to any one query, the fraction of correct classifications (accuracy) is very high, yet the engine is useless. People doing information retrieval want to find something, and have a certain tolerance for junk. [Slide shows a mock search box.]

Precision/recall. One can get high recall (but low precision) by retrieving all docs for all queries! Recall is a non-decreasing function of the number of docs retrieved, while precision usually decreases (in a good system). Difficulties in using precision/recall: one should average over large corpus/query ensembles; human relevance judgements are needed; results are heavily skewed by corpus/authorship.

A combined measure: F. The combined measure that assesses this tradeoff is the F measure, a weighted harmonic mean: F = 1 / (α/P + (1−α)/R) = (β² + 1)PR / (β²P + R), where β² = (1−α)/α. People usually use the balanced F1 measure, i.e., with β = 1 (equivalently α = 1/2). The harmonic mean is a conservative average. See C. J. van Rijsbergen, Information Retrieval.
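The F measure in code, continuing the made-up P = 0.8, R = 0.4 from the sketch above:

```python
# Weighted harmonic mean F; beta = 1 gives the balanced F1.
def f_measure(p, r, beta=1.0):
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)

print(f_measure(0.8, 0.4))   # ~0.533: pulled toward the smaller of P, R
```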

F1 and other averages. [Graph from slide not transcribed.]

Resources for today’s lecture: Managing Gigabytes, Chapter 4, Sections 4.0–4.3 and 4.5; Modern Information Retrieval, Chapter 3; Princeton WordNet.

Glimpse of what’s ahead: building indices; term weighting and vector space queries; probabilistic IR; user interfaces and visualization; link analysis in hypertext; web search; global connectivity analysis on the web; XML data; large enterprise issues.