Search engines 2. Øystein Torbjørnsen, Fast Search and Transfer.

Outline
Inverted index
Constructing inverted indexes
Compression
Succinct index (Holger Bast)
Hierarchical inverted indexes
Skip lists

Inverted index
[Figure: a dictionary (sample entries: a, cal, dark, darker, drill, excellent, zebra) whose entries point into a posting file. Each posting list entry holds docid, frequency and position list.]

Inverted index
Posting list is sorted on docid
Usually 2 disk IOs to look up one term, O(1)
– One to read the dictionary entry
– One to read the posting list (possibly large)
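
A minimal in-memory sketch of the structure described above: a dictionary mapping each term to a posting list sorted on docid, where each posting holds docid, frequency and position list. The function name `build_inverted_index`, the whitespace tokenizer and the sample documents are assumptions made for illustration; a real engine keeps the posting file on disk.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict mapping docid (int) -> text. Returns term -> posting list."""
    index = defaultdict(list)
    for docid in sorted(docs):  # ascending docid keeps every posting list sorted
        positions = defaultdict(list)
        for pos, term in enumerate(docs[docid].lower().split()):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            # One posting = (docid, frequency, position list)
            index[term].append((docid, len(pos_list), pos_list))
    return dict(index)

docs = {1: "dark zebra", 2: "a darker zebra a zebra", 3: "excellent drill"}
index = build_inverted_index(docs)
print(index["zebra"])  # [(1, 1, [1]), (2, 2, [2, 4])]
```

Looking up a term is one dictionary access followed by one read of its posting list, which mirrors the two disk IOs on the slide.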

Construction
Create sorted subfiles
Merge the subfiles into one large file
Needs twice the disk storage of the final index
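
The two construction steps can be sketched as follows, with Python lists standing in for the sorted subfiles on disk. The helper names `make_subfile` and `merge_subfiles`, and the choice to store only (docid, frequency) per posting, are assumptions for this example.

```python
import heapq
from itertools import groupby

def make_subfile(docs):
    """Build one sorted run of (term, docid, frequency) entries from a batch."""
    entries = []
    for docid, text in docs.items():
        counts = {}
        for term in text.lower().split():
            counts[term] = counts.get(term, 0) + 1
        entries.extend((term, docid, freq) for term, freq in counts.items())
    return sorted(entries)

def merge_subfiles(subfiles):
    """Merge sorted runs into term -> posting list sorted on docid."""
    merged = heapq.merge(*subfiles)  # streams the runs, never sorts them again
    return {term: [(docid, freq) for _, docid, freq in group]
            for term, group in groupby(merged, key=lambda entry: entry[0])}

run1 = make_subfile({1: "dark zebra", 2: "zebra drill zebra"})
run2 = make_subfile({3: "dark drill"})
final_index = merge_subfiles([run1, run2])
print(final_index["zebra"])  # [(1, 1), (2, 2)]
```

During the merge both the subfiles and the final file exist at once, which is where the doubled storage requirement comes from.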

Compression
Basic idea:
– Use knowledge of value distribution to compress data
Costly to compress and decompress, but
– Less disk IO
– More data fits in main memory
– Better locality in memory
Many different schemes:
– Delta coding
– vByte
– PFOR-DELTA
– Huffman, Golomb, Rice, Simple9, Simple16

Delta coding
Works on sorted lists
Encoded as difference from previous entry
To be combined with other compression
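
A small sketch of delta coding on a sorted docid list. The function names are illustrative, and the first value is stored as its difference from 0, which is one common convention rather than something the slides specify.

```python
def delta_encode(sorted_ids):
    """Replace each value by its difference from the previous one."""
    gaps, prev = [], 0
    for value in sorted_ids:
        gaps.append(value - prev)
        prev = value
    return gaps

def delta_decode(gaps):
    """Rebuild the original sorted list by running sum."""
    values, total = [], 0
    for gap in gaps:
        total += gap
        values.append(total)
    return values

docids = [3, 7, 8, 20, 1000]
gaps = delta_encode(docids)
print(gaps)  # [3, 4, 1, 12, 980]
```

The gaps are mostly small numbers, which is why delta coding pays off only once a second scheme stores small numbers in few bits.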

vByte
Variable-byte encoding
Using full bytes
1 marker bit + 7 value bits
Fast encoding and decoding
[Figure: three bytes, each split into a byte end marker bit and a 7-bit value. The values 76, 57 and 106 combine as 76*128*128 + 57*128 + 106 = 1252586.]
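
A hedged sketch of variable-byte coding with 1 marker bit and 7 value bits per byte. Two details are assumptions not fixed by the slide: the marker is taken to be the high bit, set on the last byte of each value, and the most significant 7-bit group is written first (matching the 76, 57, 106 example). Other vByte variants invert either choice.

```python
def vbyte_encode(value):
    """Encode one non-negative integer as a sequence of full bytes."""
    groups = [value & 0x7F]
    value >>= 7
    while value:
        groups.append(value & 0x7F)
        value >>= 7
    groups.reverse()      # most significant 7-bit group first
    groups[-1] |= 0x80    # end marker bit on the final byte
    return bytes(groups)

def vbyte_decode(data):
    """Decode a byte string holding any number of concatenated values."""
    values, current = [], 0
    for byte in data:
        current = (current << 7) | (byte & 0x7F)
        if byte & 0x80:   # end marker reached: value is complete
            values.append(current)
            current = 0
    return values

encoded = vbyte_encode(76 * 128 * 128 + 57 * 128 + 106)
print(list(encoded))  # [76, 57, 234]  (234 = 106 with the marker bit set)
```

Decoding needs only shifts and masks on whole bytes, which is what makes the scheme fast compared with bit-aligned codes.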

PFOR-DELTA
Combination of three techniques
– P = Prefix suppression
– FOR = Frame Of Reference
– DELTA = delta coding
Blocks of e.g. 128 values
Fixed number of bits per value
Exception list for outliers
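
The sketch below illustrates only the frame-of-reference and exception-list ideas on one block of delta-coded gaps. It is a simplification under stated assumptions: slots are kept as Python integers rather than packed into `bits` bits each, the block minimum is used as the reference value, and the names `pfor_encode` and `pfor_decode` are made up for the example.

```python
def pfor_encode(block, bits):
    """Store each value as an offset from the block minimum.

    Offsets that do not fit in `bits` bits become exceptions,
    kept in a separate list together with their slot index.
    """
    base = min(block)
    limit = 1 << bits
    slots, exceptions = [], []
    for i, value in enumerate(block):
        offset = value - base
        if offset < limit:
            slots.append(offset)
        else:
            slots.append(0)                  # placeholder slot
            exceptions.append((i, offset))   # outlier stored separately
    return base, slots, exceptions

def pfor_decode(base, slots, exceptions):
    """Rebuild the block, then patch in the outliers."""
    values = [base + slot for slot in slots]
    for i, offset in exceptions:
        values[i] = base + offset
    return values

gaps = [3, 4, 1, 12, 980, 2]
base, slots, exceptions = pfor_encode(gaps, bits=4)
print(base, slots, exceptions)  # 1 [2, 3, 0, 11, 0, 1] [(4, 979)]
```

Choosing `bits` so that most values fit keeps the fixed-width part small, while the few outliers pay the cost of an exception entry.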

Succinct index
Variation of inverted index
Index ranges of words
Prefix and range search
Smaller dictionary
Longer lists to process
Better compression
Fewer disk IOs
– Disk position vs. transfer times

Hierarchical inverted indexes
Incremental indexing
Build vs lookup time

Never merge
Just keep subfiles and never merge into large file
Construction is O(n)
Fastest possible construction time
Slow lookup with many files, O(n)

Hierarchy
[Figure: index files arranged in three levels (Level 1, Level 2, Level 3), n=3.]

Merging strategy
Merge into same level
Merge to level above
[Figure labels: m=2, n=3.]
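
A sketch of the "merge to level above" strategy. The slides do not define n, so this example assumes n is the number of files a level may hold before they are merged into a single file on the next level. The class `HierarchicalIndex` and its methods are hypothetical, and sorted lists of (term, docid) pairs stand in for index files.

```python
import bisect
import heapq

class HierarchicalIndex:
    def __init__(self, n=3):
        self.n = n            # assumed: files per level that trigger a merge
        self.levels = []      # levels[0] is Level 1, levels[1] is Level 2, ...

    def add_subfile(self, entries):
        self._add(0, sorted(entries))

    def _add(self, level, run):
        if level == len(self.levels):
            self.levels.append([])
        self.levels[level].append(run)
        if len(self.levels[level]) == self.n:
            # Level is full: merge its files into one file on the level above
            merged = list(heapq.merge(*self.levels[level]))
            self.levels[level] = []
            self._add(level + 1, merged)

    def file_count(self):
        return sum(len(runs) for runs in self.levels)

    def lookup(self, term):
        """Every file on every level must be consulted."""
        hits = []
        for runs in self.levels:
            for run in runs:
                i = bisect.bisect_left(run, (term,))
                while i < len(run) and run[i][0] == term:
                    hits.append(run[i][1])
                    i += 1
        return sorted(hits)

idx = HierarchicalIndex(n=3)
for docid in range(4):
    idx.add_subfile([("zebra", docid), ("dark", docid)])
print(idx.file_count())  # 2: one merged file on Level 2, one new file on Level 1
```

The number of files a lookup has to touch changes as subfiles arrive, so lookup cost rises and falls between merges.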

Issues
Needs twice the space
Merge of upper layer takes a long time
Larger initial files lead to fewer merges
Lookup time varies over time, depending on the number of files at each level

Column organization
Field selection
– Based on query
Phrase queries and proximity scoring need positions
Simple boolean queries do not need position and frequency
Relevance scoring needs frequency
– Don’t decompress what you don’t need
– Don’t read from disk what you don’t need
– Locality

More than text search
Context info
Meta data
Values
[Figure: column layouts keyed on docid. Text postings hold docid, frequency, position list and context. Further columns pair docid with date, size, owner, person, zip code, company position and URI.]

Skipping
Search engine and skipping
– Used in merging (AND queries)
– Semi sequential access
– Direct lookup
– Disk based
Skip list
Vs Btree
Variants
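
To illustrate skipping in an AND query, the sketch below intersects a short posting list with a long one, using a separate skip table over the long list to jump ahead instead of scanning every entry. The skip interval `step` and both function names are assumptions chosen for the example.

```python
def build_skip_table(postings, step):
    """Record (docid, index) for every `step`-th posting."""
    return [(postings[i], i) for i in range(0, len(postings), step)]

def intersect_with_skips(short, long, step=4):
    """AND-merge two sorted docid lists, skipping through `long`."""
    skips = build_skip_table(long, step)
    result = []
    s = 0  # current entry in the skip table
    j = 0  # current position in the long list
    for docid in short:
        # Follow skip entries as long as they do not overshoot docid
        while s + 1 < len(skips) and skips[s + 1][0] <= docid:
            s += 1
            j = max(j, skips[s][1])
        # Short sequential scan inside the segment that may hold docid
        while j < len(long) and long[j] < docid:
            j += 1
        if j < len(long) and long[j] == docid:
            result.append(docid)
    return result

even_docids = list(range(0, 100, 2))
print(intersect_with_skips([8, 40, 41, 100], even_docids))  # [8, 40]
```

Access to the long list is semi sequential: positions only move forward, but most postings between two matches are never touched.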

Skip list
0 < p < 1 (e.g. p=1/2 or p=1/4)
Lookup and insertion are O(log n)
Size vs speed
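
A compact sketch of a probabilistic skip list, where each new node is promoted to the next level with probability p. Lookup and insertion take O(log n) expected time. The class layout, the `MAX_LEVEL` cap and the `seed` parameter are assumptions for this example, not details from the slides.

```python
import random

class _Node:
    __slots__ = ("key", "forward")

    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # one next-pointer per level

class SkipList:
    MAX_LEVEL = 16  # assumed cap on tower height

    def __init__(self, p=0.5, seed=None):
        self.p = p
        self.rng = random.Random(seed)
        self.level = 1
        self.head = _Node(None, self.MAX_LEVEL)

    def _random_level(self):
        level = 1
        while level < self.MAX_LEVEL and self.rng.random() < self.p:
            level += 1
        return level

    def _predecessors(self, key):
        """For each level, the last node whose key is smaller than `key`."""
        update = [self.head] * self.MAX_LEVEL
        node = self.head
        for i in range(self.level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node
        return update

    def contains(self, key):
        candidate = self._predecessors(key)[0].forward[0]
        return candidate is not None and candidate.key == key

    def insert(self, key):
        update = self._predecessors(key)
        candidate = update[0].forward[0]
        if candidate is not None and candidate.key == key:
            return  # key already present
        level = self._random_level()
        self.level = max(self.level, level)
        node = _Node(key, level)
        for i in range(level):
            node.forward[i] = update[i].forward[i]
            update[i].forward[i] = node

    def __iter__(self):
        node = self.head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]

sl = SkipList(p=0.5, seed=1)
for key in [30, 10, 20, 10]:
    sl.insert(key)
print(list(sl))  # [10, 20, 30]
```

A smaller p, such as 1/4, gives fewer pointers per node at the cost of longer scans on each level, which is the size versus speed trade-off.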

Issues
Compression
Can be skewed

Skip list vs B-Tree
Skip list
– Main-memory structure
– Less space
B-Tree
– Disk based structure
– Better locality

Variations
Deterministic skip list
1 level skips
Separate skip table