1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 2 March 26, 2006

Presentation transcript:

1 Algorithms for Large Data Sets Ziv Bar-Yossef Lecture 2 March 26,

2 Information Retrieval

3 Information Retrieval Setting
[Diagram: a user with an “information need” sends a query to an IR system over a document collection, receives a ranked list, and gives feedback.]
Information need: “I want information about Michael Jordan, the machine learning expert”
Query: +“Michael Jordan” -basketball
Ranked list of retrieved documents: 1. Michael I. Jordan’s homepage 2. NBA.com 3. Michael Jordan on TV
Feedback: “No. 1 is good, rest are bad”
Revised ranked list of retrieved documents: 1. Michael I. Jordan’s homepage 2. M.I. Jordan’s pubs 3. Graphical Models

4 Information Retrieval vs. Data Retrieval
Information Retrieval System: a system that allows a user to retrieve documents that match her “information need” from a large corpus.
  • Ex: Get documents about Michael Jordan, the machine learning expert.
Data Retrieval System: a system that allows a user to retrieve all documents that match her query from a large corpus.
  • Ex: SELECT doc FROM corpus WHERE (doc.text CONTAINS “Michael Jordan”) AND NOT (doc.text CONTAINS “basketball”).

5 Information Retrieval vs. Data Retrieval
                 Data Retrieval                                Information Retrieval
Data             Database tables, structured                   Free text, unstructured
Queries          SQL, relational algebras                      Keywords, natural language
Results          Exact matches                                 Approximate matches
Results          Unordered                                     Ordered by relevance
Accessibility    Knowledgeable users or automatic processes    Non-expert humans

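6 Information Retrieval Systems
[Diagram: IR system architecture.]
Components: text processor, indexer, index, query processor, ranking procedure
Indexing path: Corpus → raw docs → text processor → tokenized docs → indexer → postings → index
Query path: User → user query → query processor → system query → index → retrieved docs → ranking procedure → ranked retrieved docs → User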

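7 Search Engines
[Diagram: search engine architecture.]
Components: crawler, repository, global analyzer, text processor, indexer, index, query processor, ranking procedure
Data labels: user query, system query, tokenized docs, postings, retrieved docs, ranked retrieved docs
The document source is the Web rather than a fixed corpus.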

8 Classical IR vs. Web IR
                      Web IR                  Classical IR
Volume                Huge                    Large
Data quality          Noisy, dups             Clean, no dups
Data change rate      In flux                 Infrequent
Data accessibility    Partially accessible    Accessible
Format diversity      Widely diverse          Homogeneous
Documents             Hypertext               Text
# of matches          Large                   Small
IR techniques         Link-based              Content-based

9 Outline
  • Abstract formulation
  • Models for relevance ranking
  • Retrieval evaluation
  • Query languages
  • Text processing
  • Indexing and searching

10 Abstract Formulation
Ingredients:
  • D: document collection
  • Q: query space
  • f: D × Q → R: relevance scoring function
  • For every q in Q, f induces a ranking (partial order) ≤_q on D
Functions of an IR system:
  • Preprocess D and create an index I
  • Given q in Q, use I to produce a permutation π on D
Goals:
  • Accuracy: π should be “close” to ≤_q
  • Compactness: index should be compact
  • Response time: answers should be given quickly

11 Document Representation
T = {t_1, …, t_k}: a “token space”
  • (a.k.a. “feature space” or “term space”)
  • Ex: all words in English
  • Ex: phrases, URLs, …
A document: a real vector d in R^k
  • d_i: “weight” of token t_i in d
  • Ex: d_i = normalized # of occurrences of t_i in d

12 Classic IR (Relevance) Models
  • The Boolean model
  • The Vector Space Model (VSM)

13 The Boolean Model
A document: a boolean vector d in {0,1}^k
  • d_i = 1 iff t_i belongs to d
A query: a boolean formula q over tokens
  • q: {0,1}^k → {0,1}
  • Ex: “Michael Jordan” AND (NOT basketball)
  • Ex: +“Michael Jordan” -basketball
Relevance scoring function: f(d,q) = q(d)
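The Boolean model above can be sketched in a few lines of Python. This is only an illustration: the token space `TOKENS`, the documents `d1` and `d2`, and all function names are hypothetical, and phrases are treated as single tokens for simplicity.

```python
from typing import Callable, Dict, Set

# Hypothetical token space t_1..t_k (phrases treated as single tokens).
TOKENS = ["michael jordan", "basketball", "machine learning"]

def to_boolean_vector(doc_tokens: Set[str]) -> Dict[str, int]:
    """d_i = 1 iff token t_i belongs to the document."""
    return {t: int(t in doc_tokens) for t in TOKENS}

def query(d: Dict[str, int]) -> int:
    """The formula "Michael Jordan" AND (NOT basketball) as a map {0,1}^k -> {0,1}."""
    return int(d["michael jordan"] == 1 and d["basketball"] == 0)

def score(doc_tokens: Set[str], q: Callable[[Dict[str, int]], int]) -> int:
    """Relevance scoring function f(d, q) = q(d)."""
    return q(to_boolean_vector(doc_tokens))

# Hypothetical documents, given as sets of tokens.
d1 = {"michael jordan", "machine learning"}
d2 = {"michael jordan", "basketball"}
print(score(d1, query), score(d2, query))
```

Every document scores either 0 or 1, so documents that match the formula cannot be ranked against each other.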

14 The Boolean Model: Pros & Cons
Advantages:
  • Simplicity for users
Disadvantages:
  • Relevance scoring is too coarse

15 The Vector Space Model (VSM)
A document: a real vector d in R^k
  • d_i = weight of t_i in d (usually TF-IDF score)
A query: a real vector q in R^k
  • q_i = weight of t_i in q
Relevance scoring function: f(d,q) = sim(d,q)
  • “similarity” between d and q

16 Popular Similarity Measures
L_1 or L_2 distance: ||d - q||
  • d, q are first normalized to have unit norm
Cosine similarity: cos θ = ⟨d, q⟩ / (||d|| · ||q||), where θ is the angle between d and q
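A minimal sketch of the two measures, assuming documents and queries are plain Python lists of weights of equal length. The function names are illustrative, not from the lecture.

```python
import math
from typing import List, Sequence

def norm(v: Sequence[float]) -> float:
    """Euclidean (L2) norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit L2 norm."""
    n = norm(v)
    if n == 0:
        raise ValueError("cannot normalize the zero vector")
    return [x / n for x in v]

def l2_distance(d: Sequence[float], q: Sequence[float]) -> float:
    """L2 distance between d and q after normalizing both to unit norm."""
    dn, qn = normalize(d), normalize(q)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(dn, qn)))

def cosine_similarity(d: Sequence[float], q: Sequence[float]) -> float:
    """Cosine of the angle between d and q."""
    dot = sum(a * b for a, b in zip(d, q))
    return dot / (norm(d) * norm(q))
```

For unit vectors the two are tied together by ||d - q||^2 = 2 - 2 cos θ, so ranking by increasing L_2 distance and ranking by decreasing cosine similarity give the same order.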

17 TF-IDF Score: Motivation
Motivating principle:
  • A term t_i is relevant to a document d if:
      t_i occurs many times in d relative to other terms that occur in d
      t_i occurs many times in d relative to its number of occurrences in other documents
Examples
  • 10 out of 100 terms in d are “java”
  • 10 out of 10,000 terms in d are “java”
  • 10 out of 100 terms in d are “the”

18 TF-IDF Score: Definition
n(d,t_i) = # of occurrences of t_i in d
N = Σ_i n(d,t_i) (# of tokens in d)
D_i = # of documents containing t_i
D = # of documents in the collection
TF(d,t_i): “Term Frequency”
  • Ex: TF(d,t_i) = n(d,t_i) / N
  • Ex: TF(d,t_i) = n(d,t_i) / (max_j { n(d,t_j) })
IDF(t_i): “Inverse Document Frequency”
  • Ex: IDF(t_i) = log(D/D_i)
TFIDF(d,t_i) = TF(d,t_i) x IDF(t_i)
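The definitions above translate directly into code. The sketch below uses the first TF variant, n(d,t_i) / N, and IDF = log(D/D_i). The tiny collection `docs` is made up for illustration, and returning 0 for a term that appears in no document is an assumption the slide does not cover.

```python
import math
from collections import Counter
from typing import List

def tf(doc: List[str], term: str) -> float:
    """TF(d, t) = n(d, t) / N, where N is the number of tokens in d."""
    if not doc:
        return 0.0
    return Counter(doc)[term] / len(doc)

def idf(collection: List[List[str]], term: str) -> float:
    """IDF(t) = log(D / D_t), with D_t the number of documents containing t."""
    containing = sum(1 for doc in collection if term in doc)
    if containing == 0:
        return 0.0  # assumption: unseen terms get weight 0
    return math.log(len(collection) / containing)

def tfidf(doc: List[str], collection: List[List[str]], term: str) -> float:
    """TFIDF(d, t) = TF(d, t) x IDF(t)."""
    return tf(doc, term) * idf(collection, term)

# Hypothetical tokenized collection.
docs = [
    ["the", "java", "compiler", "java"],
    ["the", "coffee", "shop"],
    ["the", "java", "island"],
]
print(tfidf(docs[0], docs, "java"), tfidf(docs[0], docs, "the"))
```

Here “the” occurs in every document, so its IDF is log(1) = 0 and it contributes nothing, which is the behavior the motivating examples ask for.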

19 VSM: Pros & Cons
Advantages:
  • Better granularity in relevance scoring
  • Good performance in practice
  • Efficient implementations
Disadvantages:
  • Assumes term independence

20 Retrieval Evaluation
Notations:
  • D: document collection
  • D_q: documents in D that are “relevant” to query q
      Ex: f(d,q) is above some threshold
  • L_q: list of results on query q
Recall: |L_q ∩ D_q| / |D_q|
Precision: |L_q ∩ D_q| / |L_q|

21 Recall & Precision: Example
Relevant docs: d_123, d_56, d_9, d_25, d_3
List A: 1. d_[…] 2. d_84 3. d_56 4. d_6 5. d_8 6. d_9 7. d_[…] 8. d_[…] 9. d_[…] 10. d_25
  Recall(A) = 80%, Precision(A) = 40%
List B: 1. d_81 2. d_74 3. d_56 4. d_[…] 5. d_[…] 6. d_25 7. d_9 8. d_[…] 9. d_3 10. d_5
  Recall(B) = 100%, Precision(B) = 50%
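Recall and precision for a result list can be computed as in the sketch below. Recall is the fraction of relevant documents that were retrieved, and precision is the fraction of retrieved documents that are relevant. The relevant set is the one given on the slide, but the result list `results` is hypothetical (the ids `x1` to `x6` are invented non-relevant documents), chosen so that it reproduces the 80% / 40% figures stated for List A.

```python
from typing import List, Set, Tuple

def recall_precision(results: List[str], relevant: Set[str]) -> Tuple[float, float]:
    """Return (recall, precision) of a non-empty result list against a non-empty relevant set."""
    hits = len(set(results) & relevant)
    return hits / len(relevant), hits / len(results)

relevant = {"d123", "d56", "d9", "d25", "d3"}
# Hypothetical ranked list: 4 relevant documents among 10 results.
results = ["d123", "x1", "d56", "x2", "x3", "d9", "x4", "x5", "x6", "d25"]
print(recall_precision(results, relevant))
```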

22 Recall@k and Precision@k
Notations:
  • D_q: documents in D that are “relevant” to q
  • L_{q,k}: top k results on the list
Recall@k: |L_{q,k} ∩ D_q| / |D_q|
Precision@k: |L_{q,k} ∩ D_q| / k

23 Example: Lists A and B as on slide 21 [figure not recoverable]

24 Example: Lists A and B as on slide 21 [figure not recoverable]

25 “Interpolated” Precision
Notations:
  • D_q: documents in D that are “relevant” to q
  • r: a recall level (e.g., 20%)
  • k(r): first k so that recall@k >= r
Interpolated precision at recall level r = max { precision@k : k >= k(r) }
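A sketch of recall@k, precision@k and interpolated precision, read as "the best precision achievable at any cutoff k that already reaches recall level r". The ranked list is hypothetical (ids `x1` to `x6` are invented), and returning 0 when the list never reaches recall level r is an assumption.

```python
from typing import List, Set

def recall_at_k(results: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents found among the top k results."""
    return sum(1 for d in results[:k] if d in relevant) / len(relevant)

def precision_at_k(results: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    return sum(1 for d in results[:k] if d in relevant) / k

def interpolated_precision(results: List[str], relevant: Set[str], r: float) -> float:
    """max { precision@k : k >= k(r) }, where k(r) is the first k with recall@k >= r."""
    n = len(results)
    k_r = next((k for k in range(1, n + 1)
                if recall_at_k(results, relevant, k) >= r), None)
    if k_r is None:
        return 0.0  # assumption: recall level r is never reached
    return max(precision_at_k(results, relevant, k) for k in range(k_r, n + 1))

relevant = {"d123", "d56", "d9", "d25", "d3"}
# Hypothetical ranked list, not one of the slide's lists.
results = ["d123", "x1", "d56", "x2", "x3", "d9", "x4", "x5", "x6", "d25"]
for level in (0.2, 0.4, 0.6, 0.8, 1.0):
    print(level, interpolated_precision(results, relevant, level))
```

Taking the maximum over all later cutoffs makes the precision-versus-recall curve non-increasing, which is why the interpolated form is used for plotting.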

26 Precision vs. Recall: Example. Lists A and B as on slide 21 [figure not recoverable]

27 Query Languages: Keyword-Based
Single-word queries
  • Ex: Michael Jordan machine learning
Context queries
  • Phrases. Ex: “Michael Jordan” “machine learning”
  • Proximity. Ex: “Michael Jordan” at distance of at most 10 words from “machine learning”
Boolean queries
  • Ex: +“Michael Jordan” -basketball
Natural language queries
  • Ex: “Get me pages about Michael Jordan, the machine learning expert.”

28 Query Languages: Pattern Matching
Prefixes
  • Ex: prefix:comput
Suffixes
  • Ex: suffix:net
Regular Expressions
  • Ex: [0-9]+th world-wide web conference

29 Text Processing
Lexical analysis & tokenization
  • Split text into words, downcase letters, filter out punctuation marks, digits, hyphens
Stopword elimination
  • Better retrieval accuracy, more compact index
  • Ex: “to be or not to be”
Stemming
  • Ex: “computer”, “computing”, “computation” → comput
Index term selection
  • Keywords vs. full text
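The first three steps can be sketched as a small pipeline. Both the stopword list and the suffix-stripping "stemmer" below are toy assumptions made for illustration. A real system would use a curated stopword list and a proper stemming algorithm.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"to", "be", "or", "not", "the", "a", "of", "is", "at"}
# Toy suffix list, checked in this order. NOT a real stemming algorithm.
SUFFIXES = ("ation", "ing", "er", "ed", "s")

def tokenize(text: str) -> List[str]:
    """Downcase and keep only runs of letters (drops punctuation, digits, hyphens)."""
    return re.findall(r"[a-z]+", text.lower())

def stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least 3 leading characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def process(text: str) -> List[str]:
    """Tokenize, drop stopwords, then stem."""
    return [stem(w) for w in tokenize(text) if w not in STOPWORDS]

print(process("Computer, computing, computation"))
print(process("To be, or not to be"))
```

The second call shows the cost of stopword elimination: the whole phrase “to be or not to be” disappears, so it can no longer be searched for.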

30 Inverted Index
d_1: Michael(1) Jordan(2), the(3) author(4) of(5) “graphical(6) models(7)”, is(8) a(9) professor(10) at(11) U.C.(12) Berkeley(13).
d_2: The(1) famous(2) NBA(3) legend(4) Michael(5) Jordan(6) liked(7) to(8) date(9) models(10).
Vocabulary: Postings
  author: (d_1,4)
  berkeley: (d_1,13)
  date: (d_2,9)
  famous: (d_2,2)
  graphical: (d_1,6)
  jordan: (d_1,2), (d_2,6)
  legend: (d_2,4)
  like: (d_2,7)
  michael: (d_1,1), (d_2,5)
  model: (d_1,7), (d_2,10)
  nba: (d_2,3)
  professor: (d_1,10)
  uc: (d_1,12)
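A minimal sketch of building a positional inverted index for the two example documents. To stay short it only lowercases and strips punctuation. It does no stopword removal or stemming, so unlike the slide it keeps entries such as “the” and stores “models” and “liked” unstemmed. The function name `build_index` is illustrative.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

def build_index(docs: Dict[str, str]) -> Dict[str, List[Tuple[str, int]]]:
    """Map each term to its postings: (doc id, 1-based word position) pairs."""
    index: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, raw in enumerate(text.split(), start=1):
            term = re.sub(r"[^a-z0-9]", "", raw.lower())  # "U.C." -> "uc"
            if term:
                index[term].append((doc_id, pos))
    return dict(index)

docs = {
    "d1": 'Michael Jordan, the author of "graphical models", is a professor at U.C. Berkeley.',
    "d2": "The famous NBA legend Michael Jordan liked to date models.",
}
index = build_index(docs)
print(index["jordan"])
print(index["uc"])
```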

31 Inverted Index Structure
Vocabulary file: term1, term2, … (usually fits in main memory)
Postings file: postings list 1, postings list 2, … (stored on disk)

32 Searching an Inverted Index
Given:
  • t_1, t_2: query terms
  • L_1, L_2: corresponding posting lists
Need to get a ranked list of docs in the intersection of L_1, L_2
Solution 1: If L_1, L_2 are comparable in size, “merge” L_1 and L_2 to find docs in their intersection, and then order them by rank. (running time: O(|L_1| + |L_2|))
Solution 2: If L_1 is considerably shorter than L_2, binary search each posting of L_1 in L_2 to find the intersection, and then order them by rank. (running time: O(|L_1| x log(|L_2|)))
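Both solutions can be sketched as follows, assuming each posting list is a sorted list of integer doc ids. The final "order by rank" step is omitted, and the function names are illustrative.

```python
from bisect import bisect_left
from typing import List

def intersect_merge(l1: List[int], l2: List[int]) -> List[int]:
    """Solution 1: linear merge of two sorted lists, O(|L1| + |L2|)."""
    i = j = 0
    out = []
    while i < len(l1) and j < len(l2):
        if l1[i] == l2[j]:
            out.append(l1[i])
            i += 1
            j += 1
        elif l1[i] < l2[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_binary_search(short: List[int], long: List[int]) -> List[int]:
    """Solution 2: binary search each posting of the short list, O(|L1| log |L2|)."""
    out = []
    for doc_id in short:
        k = bisect_left(long, doc_id)
        if k < len(long) and long[k] == doc_id:
            out.append(doc_id)
    return out

l1 = [2, 5, 9, 14]
l2 = [1, 2, 3, 5, 8, 13, 14, 21]
print(intersect_merge(l1, l2), intersect_binary_search(l1, l2))
```

Binary search wins only when |L_1| log |L_2| is well below |L_1| + |L_2|, which is the case when one term is much rarer than the other.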

33 Search Optimization
Improvement: order docs in posting lists by static rank (e.g., PageRank). The top matches can then be output without scanning the whole lists.

34 Index Construction
  • Given a stream of documents, store (did,tid,pos) triplets in a file
  • Sort and group file by tid
  • Extract posting lists
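The three steps can be sketched in memory as below. This assumes the triplets fit in RAM. For the file-based version on the slide the sort would be an external (on-disk) sort, which is not shown here. The name `build_postings` and the sample triplets are illustrative.

```python
from itertools import groupby
from typing import Dict, Iterable, List, Tuple

Triplet = Tuple[int, str, int]  # (did, tid, pos)

def build_postings(triplets: Iterable[Triplet]) -> Dict[str, List[Tuple[int, int]]]:
    """Sort triplets by term id, group by term id, and extract (did, pos) posting lists."""
    # Sort by tid first, then did and pos so each posting list comes out ordered.
    ordered = sorted(triplets, key=lambda t: (t[1], t[0], t[2]))
    return {
        tid: [(did, pos) for did, _, pos in group]
        for tid, group in groupby(ordered, key=lambda t: t[1])
    }

# Hypothetical stream of (did, tid, pos) triplets in document order.
stream = [(1, "michael", 1), (1, "jordan", 2), (2, "michael", 5), (2, "jordan", 6)]
print(build_postings(stream))
```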

35 Index Maintenance
Naïve updates of inverted index can be very costly
  • Require random access
  • A single change may cause many insertions/deletions
Batch updates
Two indices
  • Main index (created in batch, large, compressed)
  • “Stop-press” index (incremental, small, uncompressed)

36 Index Maintenance
If a page d is inserted/deleted, the “signed” postings (did,tid,pos,I/D) are added to the stop-press index.
Given a query term t, fetch its list L_t from the main index, and two lists L_{t,+} and L_{t,-} from the stop-press index.
Result is: (L_t ∪ L_{t,+}) \ L_{t,-}
When the stop-press index grows too large, it is merged into the main index.
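A sketch of answering a one-term query against a main index plus a stop-press index, computing "main list plus inserted postings, minus deleted postings". For brevity, postings are reduced to bare doc ids and the stop-press index is modeled as two dictionaries, one for insertions and one for deletions. All names here are hypothetical.

```python
from typing import Dict, List

def query_term(term: str,
               main_index: Dict[str, List[int]],
               inserted: Dict[str, List[int]],
               deleted: Dict[str, List[int]]) -> List[int]:
    """Return (L_t union L_{t,+}) minus L_{t,-} as a sorted list of doc ids."""
    l_t = set(main_index.get(term, ()))
    l_plus = set(inserted.get(term, ()))
    l_minus = set(deleted.get(term, ()))
    return sorted((l_t | l_plus) - l_minus)

main_index = {"michael": [1, 5, 9]}
inserted = {"michael": [12]}   # signed postings marked I
deleted = {"michael": [5]}     # signed postings marked D
print(query_term("michael", main_index, inserted, deleted))
```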

37 Index Compression
Delta compression
  • Saves a lot for popular terms
  • Doesn’t save much for rare terms (but these don’t take much space anyway)
Doc ids:  michael: (…,5), (…,12), (…,77), (…,88), …
Gaps:     michael: (…,5), (2,12), (4,77), (22,88), …
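Delta compression replaces each doc id after the first with its gap from the previous one. A minimal sketch follows. The absolute doc ids in the example are invented and only chosen so that the gaps come out as 2, 4, 22, matching the slide.

```python
from itertools import accumulate
from typing import List

def to_gaps(doc_ids: List[int]) -> List[int]:
    """Keep the first doc id, then store differences between consecutive sorted ids."""
    return [b - a for a, b in zip([0] + doc_ids[:-1], doc_ids)]

def from_gaps(gaps: List[int]) -> List[int]:
    """Invert to_gaps with a running sum."""
    return list(accumulate(gaps))

doc_ids = [1000007, 1000009, 1000013, 1000035]  # hypothetical ids
gaps = to_gaps(doc_ids)
print(gaps)
print(from_gaps(gaps) == doc_ids)
```

A popular term appears in many documents, so its gaps are small numbers that need far fewer bits than full doc ids.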

38 Variable Length Encodings
How to encode gaps succinctly?
  • Option 1: Fixed-length binary encoding.
      Effective when all gap lengths are equally likely
      No savings over storing doc ids.
  • Option 2: Unary encoding.
      Gap x is encoded by x-1 1’s followed by a 0
      Effective when large gaps are very rare (Pr(x) = 1/2^x)
  • Option 3: Gamma encoding.
      Gap x is encoded by (ℓ_x, b_x), where b_x is the binary encoding of x and ℓ_x is the length of b_x, encoded in unary.
      Encoding length: about 2 log(x).
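Options 2 and 3 can be sketched with bit strings. The gamma code below follows the description literally: the length of the binary encoding in unary, followed by the full binary encoding. The textbook Elias gamma code additionally drops the leading 1 bit of the binary part, which saves one bit per gap. Using Python strings of "0"/"1" instead of packed bits is a simplification for readability.

```python
from typing import List

def unary_encode(x: int) -> str:
    """Gap x >= 1 is encoded by x-1 ones followed by a zero."""
    if x < 1:
        raise ValueError("gaps must be >= 1")
    return "1" * (x - 1) + "0"

def gamma_encode(x: int) -> str:
    """Unary length of the binary encoding of x, then the binary encoding itself."""
    if x < 1:
        raise ValueError("gaps must be >= 1")
    binary = bin(x)[2:]
    return unary_encode(len(binary)) + binary

def gamma_decode_stream(bits: str) -> List[int]:
    """Decode a concatenation of gamma codes produced by gamma_encode."""
    out = []
    i = 0
    while i < len(bits):
        length = 1
        while bits[i] == "1":  # read the unary length prefix
            length += 1
            i += 1
        i += 1  # skip the terminating 0 of the unary part
        out.append(int(bits[i:i + length], 2))
        i += length
    return out

encoded = "".join(gamma_encode(g) for g in [2, 4, 22])
print(encoded, gamma_decode_stream(encoded))
```

Because the unary prefix announces how many bits follow, codes can be concatenated without separators and still decoded unambiguously.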

39 End of Lecture 2