Copyright © 2000-2005 D.S.Weld8/8/2015 9:34 PM1 CSE 454 - Case Studies Design of Alta Vista Based on a talk by Mike Burrows.

Slides:



Advertisements
Similar presentations
Information Retrieval in Practice
Advertisements

Memory.
© Copyright 2012 STI INNSBRUCK Apache Lucene Ioan Toma based on slides from Aaron Bannert
Topics covered: CPU Architecture CSE 243: Introduction to Computer Architecture and Hardware/Software Interface.
Operating Systems Lecture 10 Issues in Paging and Virtual Memory Adapted from Operating Systems Lecture Notes, Copyright 1997 Martin C. Rinard. Zhiqing.
File Processing - Organizing file for Performance MVNC1 Organizing Files for Performance Chapter 6 Jim Skon.
CS4432: Database Systems II Hash Indexing 1. Hash-Based Indexes Adaptation of main memory hash tables Support equality searches No range searches 2.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Part C Part A:  Index Definition in SQL  Ordered Indices  Index Sequential.
©Silberschatz, Korth and Sudarshan12.1Database System Concepts Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree.
File Processing - Indirect Address Translation MVNC1 Hashing Indirect Address Translation Chapter 11.
Intelligent Information Retrieval 1 Vector Space Model for IR: Implementation Notes CSC 575 Intelligent Information Retrieval These notes are based, in.
Lucene Part3‏. Lucene High Level Infrastructure When you look at building your search solution, you often find that the process is split into two main.
Inverted Indices. Inverted Files Definition: an inverted file is a word-oriented mechanism for indexing a text collection in order to speed up the searching.
Database Implementation Issues CPSC 315 – Programming Studio Spring 2008 Project 1, Lecture 5 Slides adapted from those used by Jennifer Welch.
FALL 2004CENG 3511 Hashing Reference: Chapters: 11,12.
1 Optimizing Malloc and Free Professor Jennifer Rexford
C o n f i d e n t i a l Developed By Nitendra NextHome Subject Name: Data Structure Using C Title: Overview of Data Structure.
1 Lecture 7: Data structures for databases I Jose M. Peña
Introduction to Parallel Programming MapReduce Except where otherwise noted all portions of this work are Copyright (c) 2007 Google and are licensed under.
Detecting Near-Duplicates for Web Crawling Manku, Jain, Sarma
The Anatomy of a Large-Scale Hypertextual Web Search Engine Presented By: Sibin G. Peter Instructor: Dr. R.M.Verma.
WHAT IS A SEARCH ENGINE A search engine is not a physical engine, instead its an electronic code or a software programme that searches and indexes millions.
File Systems CSCI What is a file? A file is information that is stored on disks or other external media.
Search engines 2 Øystein Torbjørnsen Fast Search and Transfer.
DBMS Implementation Chapter 6.4 V3.0 Napier University Dr Gordon Russell.
Copyright © D.S.Weld10/25/ :38 AM1 CSE 454 Index Compression Alta Vista PageRank.
8.1 Silberschatz, Galvin and Gagne ©2013 Operating System Concepts – 9 th Edition Paging Physical address space of a process can be noncontiguous Avoids.
LECTURE 34: MAPS & HASH CSC 212 – Data Structures.
Indexing and Hashing By Dr.S.Sridhar, Ph.D.(JNUD), RACI(Paris, NICE), RMR(USA), RZFM(Germany) DIRECTOR ARUNAI ENGINEERING COLLEGE TIRUVANNAMALAI.
Virtual Memory 1 1.
Copyright © D.S.Weld12/3/2015 8:49 PM1 Link Analysis CSE 454 Advanced Internet Systems University of Washington.
1 Chapter 7 Skip Lists and Hashing Part 2: Hashing.
Operating Systems ECE344 Ashvin Goel ECE University of Toronto Virtual Memory Hardware.
Spring 2003 ECE569 Lecture 05.1 ECE 569 Database System Engineering Spring 2003 Yanyong Zhang
Evidence from Content INST 734 Module 2 Doug Oard.
CPSC 252 Hashing Page 1 Hashing We have already seen that we can search for a key item in an array using either linear or binary search. It would be better.
Chapter 8 Physical Database Design. Outline Overview of Physical Database Design Inputs of Physical Database Design File Structures Query Optimization.
Indexing Database Management Systems. Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files File Organization 2.
Spring 2004 ECE569 Lecture 05.1 ECE 569 Database System Engineering Spring 2004 Yanyong Zhang
Bigtable: A Distributed Storage System for Structured Data
Chapter 5 Record Storage and Primary File Organizations
1 Overview of Query Evaluation Chapter Outline  Query Optimization Overview  Algorithm for Relational Operations.
General Architecture of Retrieval Systems 1Adrienn Skrop.
MapReduce: Simplied Data Processing on Large Clusters Written By: Jeffrey Dean and Sanjay Ghemawat Presented By: Manoher Shatha & Naveen Kumar Ratkal.
W4118 Operating Systems Instructor: Junfeng Yang.
SQL IMPLEMENTATION & ADMINISTRATION Indexing & Views.
Lecture 6 of Computer Science II
Information Retrieval in Practice
Module 11: File Structure
Indexing UCSB 293S, 2017 Mainly based on slides from the text books of Croft/Metzler/Strohman and Manning/Raghavan/Schutze All slides ©Addison Wesley,
Indexing & querying text
C++ Standard Library.
File System Implementation
Hash-Based Indexes Chapter 11
OUTLINE Basic ideas of traditional retrieval systems
Optimizing Malloc and Free
Database Implementation Issues
Hash-Based Indexes Chapter 10
Indexing and Hashing Basic Concepts Ordered Indices
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
Developing a Model-View-Controller Component for Joomla Part 3
Based on a talk by Mike Burrows
CH 9.2 : Hash Tables Acknowledgement: These slides are adapted from slides provided with Data Structures and Algorithms in C++, Goodrich, Tamassia and.
Based on a talk by Mike Burrows
DATABASE IMPLEMENTATION ISSUES
Database Implementation Issues
Based on a talk by Mike Burrows
Database Implementation Issues
Virtual Memory 1 1.
Presentation transcript:

Copyright © D.S.Weld8/8/2015 9:34 PM1 CSE Case Studies Design of Alta Vista Based on a talk by Mike Burrows

Copyright © D.S.Weld8/8/2015 9:34 PM2 AltaVista: Inverted Files Map each word to list of locations where it occurs Words = null-terminated byte strings Locations = 64 bit unsigned ints –Layer above gives interpretation for location URL Index into text specifying word number Slides adapted from talk by Mike Burrows

Copyright © D.S.Weld8/8/2015 9:34 PM3 Documents A document is a region of location space –Contiguous –No overlap –Densely allocated (first doc is location 1) All document structure encoded with words –enddoc at last location of document –begintitle, endtitle mark document title Document 1 Document 2... enddoc

Copyright © D.S.Weld8/8/2015 9:34 PM4 Format of Inverted Files Words ordered lexicographically Each word followed by list of locations Common word prefixes are compressed Locations encoded as deltas –Stored in as few bytes as possible –2 bytes is common –Sneaky assembly code for operations on inverted files Pack deltas into aligned 64 bit word First byte contains continuation bits Table lookup on byte => no branch instructs, no mispredicts 35 parallelized instructions/ 64 bit word = 10 cycles/word Index ~ 10% of text size

Copyright © D.S.Weld8/8/2015 9:34 PM5 Index Stream Readers (ISRs) Interface for –Reading result of query –Return ascending sequence of locations –Implemented using lazy evaluation Methods –loc(ISR) return current location –next(ISR)advance to next location –seek(ISR, X)advance to next loc after X –prev(ISR)return previous location !

Copyright © D.S.Weld8/8/2015 9:34 PM6 Processing Simple Queries User searches for “mp3” Open ISR on “mp3” –Uses hash table to avoid scanning entire file Next(), next(), next() –returns locations containing the word

Copyright © D.S.Weld8/8/2015 9:34 PM7 Combining ISRs AndCompare locs on two streams Or file ISR or ISR and ISR genesis fixx mp3 Genesis OR fixx (genesis OR fixx) AND mp3 OrMerges two or more ISRs NotNotReturns locations not in ISR (lazily) Problem!!!

Copyright © D.S.Weld What About File Boundaries? 8/8/2015 9:34 PM8 fixx enddoc mp3

Copyright © D.S.Weld8/8/2015 9:34 PM9 ISR Constraint Solver Inputs: –Set of ISRs: A, B,... –Set of Constraints Constraint Types –loc(A)  loc(B) + K –prev(A)  loc(B) + K –loc(A)  prev(B) + K –prev(A)  prev(B) + K For example: phrase “a b” –loc(A)  loc(B), loc(B)  loc(A) + 1 a a b a a b a b

Copyright © D.S.Weld8/8/2015 9:34 PM10 Two words on one page Let E be ISR for word enddoc Constraints for conjunction a AND b –prev(E)  loc(A) –loc(A)  loc(E) –prev(E)  loc(B) –loc(B)  loc(E) b a e b a b e b prev(E) loc(A) loc(E)loc(B) What if prev(E) Undefined?

Copyright © D.S.Weld8/8/2015 9:34 PM11 Advanced Search Field query: a in Title of page Let BT, ET be ISRP of words begintitle, endtitle Constraints: –loc(BT)  loc(A) –loc(A)  loc(ET) –prev(ET)  loc(BT) et a bt a et a bt et prev(ET) loc(BT) loc(A) loc(ET)

Copyright © D.S.Weld Implementing the Solver 8/8/2015 9:34 PM12 Constraint Types –loc(A)  loc(B) + K –prev(A)  loc(B) + K –loc(A)  prev(B) + K –prev(A)  prev(B) + K

Copyright © D.S.Weld8/8/2015 9:34 PM13 Remember: Index Stream Readers Methods –loc(ISR) return current location –next(ISR)advance to next location –seek(ISR, X)advance to next loc after X –prev(ISR)return previous location

Copyright © D.S.Weld8/8/2015 9:34 PM14 Solver Algorithm To satisfy: loc(A)  loc(B) + K –Execute: seek(B, loc(A) - K) To satisfy: prev(A)  loc(B) + K –Execute: seek(B, prev(A) - K) To satisfy: loc(A)  prev(B) + K –Execute: seek(B, loc(A) - K), – next(B) To satisfy:prev(A)  prev(B) + K –Execute: seek(B, prev(A) - K) – next(B) while (unsatisfied_constraints) satisfy_constraint(choose_unsat_constraint()) Heuristic: Which choice advances a stream the furthest?

Copyright © D.S.Weld8/8/2015 9:34 PM15 Update Can’t insert in the middle of an inverted file Must rewrite the entire file –Naïve approach: need space for two copies –Slow since file is huge Split data along two dimensions –Buckets solve disk space problem –Tiers alleviate small update problem

Copyright © D.S.Weld8/8/2015 9:34 PM16 Buckets & Tiers Each word is hashed to a bucket Add new documents by adding a new tier –Periodically merge tiers, bucket by bucket Delete documents by adding deleted word –Expunge deletions when merging tier 0 Tier 1 Hash bucket(s) for word a Hash bucket(s) for word b Hash bucket(s) for word zebra Tier 2... older newer bigger smaller

Copyright © D.S.Weld8/8/2015 9:34 PM17 What if Word Removed from Doc? Delete documents by adding deleted word Expunge deletions when merging tier 0 Tier 1 Hash bucket(s) for word b Hash bucket(s) for deleted Tier 2... older newer bigger smaller a deleted

Copyright © D.S.Weld8/8/2015 9:34 PM18 Scaling How handle huge traffic? –AltaVista Search ranked #16 –10,674,000 unique visitors (Dec’99) Scale across N hosts 1. Ubiquitous index. Query one host 2. Split N ways. Query all, merge results 3. Ubiquitous index. Host handles subrange of locations. Query all, merge results 4. Hybrids

Copyright © D.S.Weld8/8/2015 9:34 PM19 Front ends –Alpha workstations Back ends –4-10 CPU Alpha servers 8GB RAM, 150GB disk –Organized in groups of 4-10 machines Each with 1/N th of index AltaVista Structure Back end index servers FDDI switch FDDI switch To alternate site Border routers Front end HTTP servers