Copyright © 2000-2009 D.S.Weld10/25/2015 10:38 AM1 CSE 454 Index Compression Alta Vista PageRank.

Slides:



Advertisements
Similar presentations
Information Retrieval in Practice
Advertisements

Memory.
Chapter 5: Introduction to Information Retrieval
Indexing. Efficient Retrieval Documents x terms matrix t 1 t 2... t j... t m nf d 1 w 11 w w 1j... w 1m 1/|d 1 | d 2 w 21 w w 2j... w 2m 1/|d.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part C Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
File Processing - Indirect Address Translation MVNC1 Hashing Indirect Address Translation Chapter 11.
Intelligent Information Retrieval 1 Vector Space Model for IR: Implementation Notes CSC 575 Intelligent Information Retrieval These notes are based, in.
Information Retrieval in Practice
Indexing UCSB 290N. Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides ©Addison Wesley, 2008.
Xyleme A Dynamic Warehouse for XML Data of the Web.
Inverted Indices. Inverted Files Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching.
Anatomy of a Large-Scale Hypertextual Web Search Engine (e.g. Google)
Database Implementation Issues CPSC 315 – Programming Studio Spring 2008 Project 1, Lecture 5 Slides adapted from those used by Jennifer Welch.
Text Operations: Coding / Compression Methods. Text Compression Motivation –finding ways to represent the text in fewer bits –reducing costs associated.
9/10: Indexing & Tolerant Dictionaries Make-up Class: 10:30  11:45AM The pdf image slides are from Hinrich Schütze’s slides,Hinrich Schütze.
Quick Review of Apr 15 material Overflow –definition, why it happens –solutions: chaining, double hashing Hash file performance –loading factor –search.
Document and Query Forms Chapter 2. 2 Document & Query Forms Q 1. What is a document? A document is a stored data record in any form A document is a stored.
Database System Concepts, 5th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Chapter 11: Storage and.
Overview of Search Engines
Copyright © D.S.Weld8/8/2015 9:34 PM1 CSE Case Studies Design of Alta Vista Based on a talk by Mike Burrows.
C o n f i d e n t i a l Developed By Nitendra NextHome Subject Name: Data Structure Using C Title: Overview of Data Structure.
Indexing. Overview of the Talk zInverted File Indexing zCompression of inverted files zSignature files and bitmaps zComparison of indexing methods zConclusion.
Chapter 13 File Structures. Understand the file access methods. Describe the characteristics of a sequential file. After reading this chapter, the reader.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Presented By: Sibin G. Peter Instructor: Dr. R.M.Verma.
CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.
Search engines 2 Øystein Torbjørnsen Fast Search and Transfer.
The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin & Lawrence Page Presented by: Siddharth Sriram & Joseph Xavier Department of Electrical.
©Silberschatz, Korth and Sudarshan11.1Database System Concepts Chapter 11: Storage and File Structure File Organization Organization of Records in Files.
Copyright © D.S.Weld12/3/2015 8:49 PM1 Link Analysis CSE 454 Advanced Internet Systems University of Washington.
CE Operating Systems Lecture 17 File systems – interface and implementation.
1 Chapter 7 Skip Lists and Hashing Part 2: Hashing.
1 Information Retrieval LECTURE 1 : Introduction.
Evidence from Content INST 734 Module 2 Doug Oard.
Chapter 8 Physical Database Design. Outline Overview of Physical Database Design Inputs of Physical Database Design File Structures Query Optimization.
Chapter 5 Record Storage and Primary File Organizations
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
General Architecture of Retrieval Systems 1Adrienn Skrop.
CSE 454 Indexing. Todo A bit repetitive – cut some slides Some inconsistencie – eg are positions in the index or not. Do we want nutch as case study instead.
Information Retrieval in Practice
University of Maryland Baltimore County
CSE 454 Indexing.
Storage and File Organization
Why indexing? For efficient searching of a document
Information Retrieval in Practice
Module 11: File Structure
Efficient Multi-User Indexing for Secure Keyword Search
Search Engine Architecture
Text Indexing and Search
Indexing & querying text
Chapter 11: Storage and File Structure
Implementation Issues & IR Systems
Database Implementation Issues
Indexing and Hashing Basic Concepts Ordered Indices
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
Based on a talk by Mike Burrows
Lecture 2- Query Processing (continued)
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
Based on a talk by Mike Burrows
Chapter 13: Data Storage Structures
DATABASE IMPLEMENTATION ISSUES
CSE 454 Crawlers & Indexing.
Database Implementation Issues
Based on a talk by Mike Burrows
Efficient Retrieval Document-term matrix t1 t tj tm nf
Chapter 11: Indexing and Hashing
Chapter 13: Data Storage Structures
Chapter 13: Data Storage Structures
Database Implementation Issues
Presentation transcript:

Copyright © D.S.Weld10/25/ :38 AM1 CSE 454 Index Compression Alta Vista PageRank

Copyright © D.S.Weld Administrivia No class Tues 10/26 –Instead go to today’s colloquium –Group Meetings Never-Ending Language Learning –Today 3:30pm EEB /25/ :38 AM2

Copyright © D.S.Weld Information Extraction INet Advertising Security Cloud Computing Revisiting Class Overview Network Layer Crawling IR - Ranking Indexing Query processing Other Cool Stuff Content Analysis

Copyright © D.S.Weld 4 Review Vector Space Representation –Dot Product as Similarity Metric TF-IDF for Computing Weights –w ij = f(i,j) * log(N/n i ) –Where q = … word i … –N = |docs| n i = |docs with word i | But How Process Efficiently? documents terms q didi t1t1 t2t2 djdj

Copyright © D.S.Weld 5 Retrieval Document-term matrix t 1 t 2... t j... t m nf d 1 w 11 w w 1j... w 1m 1/|d 1 | d 2 w 21 w w 2j... w 2m 1/|d 2 | d i w i1 w i2... w ij... w im 1/|d i | d n w n1 w n2... w nj... w nm 1/|d n | w ij is the weight of term t j in document d i Most w ij ’s will be zero.

Copyright © D.S.Weld 6 Inverted Files for Multiple Documents DOCID OCCUR POS 1 POS “jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document LEXICON OCCURENCE INDEX …

Copyright © D.S.Weld 7 Many Variations Possible Address space (flat, hierarchical) –Alta Vista uses flat approach Record term-position information Precalculate TF-IDF info Stored header, font & tag info Compression strategies

Copyright © D.S.Weld 8 Compression What Should We Compress? –Repository –Lexicon –Inv Index What properties do we want? –Compression ratio –Compression speed –Decompression speed –Memory requirements –Pattern matching on compressed text –Random access

Copyright © D.S.Weld 9 Inverted File Compression Each inverted list has the form A naïve representation results in a storage overhead of This can also be stored as Each difference is called a d-gap. Since each pointer requires fewer than Trick is encoding …. since worst case …. bits. Assume d-gap representation for the rest of the talk, unless stated otherwise Slides adapted from Tapas Kanungo and David Mount, Univ Maryland

Copyright © D.S.Weld 10 Text Compression Two classes of text compression methods Symbolwise (or statistical) methods –Estimate probabilities of symbols - modeling step –Code one symbol at a time - coding step –Use shorter code for the most likely symbol –Usually based on either arithmetic or Huffman coding Dictionary methods –Replace fragments of text with a single code word –Typically an index to an entry in the dictionary. eg: Ziv-Lempel coding: replaces strings of characters with a pointer to a previous occurrence of the string. –No probability estimates needed Symbolwise methods are more suited for coding d-gaps

Copyright © D.S.Weld 11 Classifying d-gap Compression Methods: Global: each list compressed using same model –non-parameterized: probability distribution for d-gap sizes is predetermined. –parameterized: probability distribution is adjusted according to certain parameters of the collection. Local: model is adjusted according to some parameter, like the frequency of the term By definition, local methods are parameterized.

Copyright © D.S.Weld 12 Conclusion Local methods best Parameterized global models ~ non-parameterized –Pointers not scattered randomly in file In practice, best index compression algorithm is: –Local Bernoulli method (using Golomb coding) Compressed inverted indices usually faster+smaller than –Signature files –Bitmaps Local < Parameterized Global < Non-parameterized Global Not by much

Copyright © D.S.Weld10/25/ :38 AM13 CSE Case Studies Design of Alta Vista Based on a talk by Mike Burrows

Copyright © D.S.Weld10/25/ :38 AM14 AltaVista: Inverted Files Map each word to list of locations where it occurs Words = null-terminated byte strings Locations = 64 bit unsigned ints –Layer above gives interpretation for location URL Index into text specifying word number Slides adapted from talk by Mike Burrows

Copyright © D.S.Weld10/25/ :38 AM15 Documents A document is a region of location space –Contiguous –No overlap –Densely allocated (first doc is location 1) All document structure encoded with words –enddoc at last location of document –begintitle, endtitle mark document title Document 1 Document 2... enddoc

Copyright © D.S.Weld10/25/ :38 AM16 Format of Inverted Files Words ordered lexicographically Each word followed by list of locations Common word prefixes are compressed Locations encoded as deltas –Stored in as few bytes as possible –2 bytes is common –Sneaky assembly code for operations on inverted files Pack deltas into aligned 64 bit word First byte contains continuation bits Table lookup on byte => no branch instructs, no mispredicts 35 parallelized instructions/ 64 bit word = 10 cycles/word Index ~ 10% of text size

Copyright © D.S.Weld10/25/ :38 AM17 Index Stream Readers (ISRs) Interface for –Reading result of query –Return ascending sequence of locations –Implemented using lazy evaluation Methods –loc(ISR) return current location –next(ISR)advance to next location –seek(ISR, X)advance to next loc after X –prev(ISR)return previous location !

Copyright © D.S.Weld10/25/ :38 AM18 Processing Simple Queries User searches for “mp3” Open ISR on “mp3” –Uses hash table to avoid scanning entire file Next(), next(), next() –returns locations containing the word

Copyright © D.S.Weld10/25/ :38 AM19 Combining ISRs AndCompare locs on two streams Or file ISR or ISR and ISR genesis fixx mp3 Genesis OR fixx (genesis OR fixx) AND mp3 OrMerges two or more ISRs NotNotReturns locations not in ISR (lazily) Problem!!!

Copyright © D.S.Weld What About File Boundaries? 10/25/ :38 AM20 fixx enddoc mp3

Copyright © D.S.Weld10/25/ :38 AM21 ISR Constraint Solver Inputs: –Set of ISRs: A, B,... –Set of Constraints Constraint Types –loc(A)  loc(B) + K –prev(A)  loc(B) + K –loc(A)  prev(B) + K –prev(A)  prev(B) + K For example: phrase “a b” –loc(A)  loc(B), loc(B)  loc(A) + 1 a a b a a b a b

Copyright © D.S.Weld10/25/ :38 AM22 Two words on one page Let E be ISR for word enddoc Constraints for conjunction a AND b –prev(E)  loc(A) –loc(A)  loc(E) –prev(E)  loc(B) –loc(B)  loc(E) b a e b a b e b prev(E) loc(A) loc(E)loc(B) What if prev(E) Undefined?

Copyright © D.S.Weld10/25/ :38 AM23 Advanced Search Field query: a in Title of page Let BT, ET be ISRP of words begintitle, endtitle Constraints: –loc(BT)  loc(A) –loc(A)  loc(ET) –prev(ET)  loc(BT) et a bt a et a bt et prev(ET) loc(BT) loc(A) loc(ET)

Copyright © D.S.Weld Implementing the Solver 10/25/ :38 AM24 Constraint Types –loc(A)  loc(B) + K –prev(A)  loc(B) + K –loc(A)  prev(B) + K –prev(A)  prev(B) + K

Copyright © D.S.Weld10/25/ :38 AM25 Remember: Index Stream Readers Methods –loc(ISR) return current location –next(ISR)advance to next location –seek(ISR, X)advance to next loc after X –prev(ISR)return previous location

Copyright © D.S.Weld10/25/ :38 AM26 Solver Algorithm To satisfy: loc(A)  loc(B) + K –Execute: seek(B, loc(A) - K) while (unsatisfied_constraints) satisfy_constraint(choose_unsat_constraint()) loc(ISR) return cur loc next(ISR) adv to nxt loc return it seek(ISR, X) adv to nxt loc > X return it prev(ISR) return pre loc

Copyright © D.S.Weld10/25/ :38 AM27 Solver Algorithm To satisfy: loc(A)  loc(B) + K –Execute: seek(B, loc(A) - K) To satisfy: prev(A)  loc(B) + K –Execute: seek(B, prev(A) - K) while (unsatisfied_constraints) satisfy_constraint(choose_unsat_constraint()) loc(ISR) return cur loc next(ISR) adv to nxt loc return it seek(ISR, X) adv to nxt loc > X return it prev(ISR) return pre loc a a b a a b a b K A B PAPA K b

Copyright © D.S.Weld10/25/ :38 AM28 Solver Algorithm To satisfy: loc(A)  loc(B) + K –Execute: seek(B, loc(A) - K) To satisfy: prev(A)  loc(B) + K –Execute: seek(B, prev(A) - K) To satisfy: loc(A)  prev(B) + K –Execute: seek(B, loc(A) - K), – next(B) while (unsatisfied_constraints) satisfy_constraint(choose_unsat_constraint()) loc(ISR) return cur loc next(ISR) adv to nxt loc return it seek(ISR, X) adv to nxt loc > X return it prev(ISR) return pre loc a a b a a b a b K AB PBPB b

Copyright © D.S.Weld10/25/ :38 AM29 Solver Algorithm To satisfy: loc(A)  loc(B) + K –Execute: seek(B, loc(A) - K) To satisfy: prev(A)  loc(B) + K –Execute: seek(B, prev(A) - K) To satisfy: loc(A)  prev(B) + K –Execute: seek(B, loc(A) - K), – next(B) To satisfy:prev(A)  prev(B) + K –Execute: seek(B, prev(A) - K) – next(B) while (unsatisfied_constraints) satisfy_constraint(choose_unsat_constraint()) loc(ISR) return cur loc next(ISR) adv to nxt loc return it seek(ISR, X) adv to nxt loc > X return it prev(ISR) return pre loc

Copyright © D.S.Weld10/25/ :38 AM30 Solver Algorithm To satisfy: loc(A)  loc(B) + K –Execute: seek(B, loc(A) - K) To satisfy: prev(A)  loc(B) + K –Execute: seek(B, prev(A) - K) To satisfy: loc(A)  prev(B) + K –Execute: seek(B, loc(A) - K), – next(B) To satisfy:prev(A)  prev(B) + K –Execute: seek(B, prev(A) - K) – next(B) while (unsatisfied_constraints) satisfy_constraint(choose_unsat_constraint()) Heuristic: Which choice advances a stream the furthest?

Copyright © D.S.Weld10/25/ :38 AM31 Update Can’t insert in the middle of an inverted file Must rewrite the entire file –Naïve approach: need space for two copies –Slow since file is huge Split data along two dimensions –Buckets solve disk space problem –Tiers alleviate small update problem

Copyright © D.S.Weld10/25/ :38 AM32 Buckets & Tiers Each word is hashed to a bucket Add new documents by adding a new tier –Periodically merge tiers, bucket by bucket Tier 1 Hash bucket(s) for word a Hash bucket(s) for word b Hash bucket(s) for word zebra Tier 2... older newer bigger smaller

Copyright © D.S.Weld10/25/ :38 AM33 What if Word Removed from Doc? Delete documents by adding deleted word Expunge deletions when merging tier 1 Tier 1 Hash bucket(s) for word b Hash bucket(s) for deleted Tier 2... older newer bigger smaller a deleted

Copyright © D.S.Weld10/25/ :38 AM34 Scaling How handle huge traffic? –AltaVista Search ranked #16 –10,674,000 unique visitors (Dec’99) Scale across N hosts 1. Ubiquitous index. Query one host 2. Split N ways. Query all, merge results 3. Ubiquitous index. Host handles subrange of locations. Query all, merge results 4. Hybrids

Copyright © D.S.Weld10/25/ :38 AM35 Front ends –Alpha workstations Back ends –4-10 CPU Alpha servers 8GB RAM, 150GB disk –Organized in groups of 4-10 machines Each with 1/N th of index AltaVista Structure Back end index servers FDDI switch FDDI switch To alternate site Border routers Front end HTTP servers