Based on a talk by Mike Burrows

Slides:

Advertisements

Similar presentations

Information Retrieval in Practice

Advertisements

Indexing & Tolerant Dictionaries The pdf image slides are from Hinrich Schütze’s slides,Hinrich Schütze L'Homme qui marche Alberto Giacometti (sold for.

Chapter 5: Introduction to Information Retrieval

Modern information retrieval Modelling. Introduction IR systems usually adopt index terms to process queries IR systems usually adopt index terms to process.

Indexing. Efficient Retrieval Documents x terms matrix t 1 t 2... t j... t m nf d 1 w 11 w w 1j... w 1m 1/|d 1 | d 2 w 21 w w 2j... w 2m 1/|d.

Introduction to Information Retrieval

The Inside Story Christine Reilly CSCI 6175 September 27, 2011.

File Processing - Indirect Address Translation MVNC1 Hashing Indirect Address Translation Chapter 11.

Intelligent Information Retrieval 1 Vector Space Model for IR: Implementation Notes CSC 575 Intelligent Information Retrieval These notes are based, in.

Inverted Index Hongning Wang

Information Retrieval in Practice

Anatomy of a Large-Scale Hypertextual Web Search Engine (e.g. Google)

Database Implementation Issues CPSC 315 – Programming Studio Spring 2008 Project 1, Lecture 5 Slides adapted from those used by Jennifer Welch.

9/10: Indexing & Tolerant Dictionaries Make-up Class: 10:30  11:45AM The pdf image slides are from Hinrich Schütze’s slides,Hinrich Schütze.

1 Basic Text Processing and Indexing. 2 Document Processing Steps Lexical analysis (tokenizing) Stopwords removal Stemming Selection of indexing terms.

Web Search – Summer Term 2006 II. Information Retrieval (Basics) (c) Wolfgang Hürst, Albert-Ludwigs-University.

Overview of Search Engines

Copyright © D.S.Weld8/8/2015 9:34 PM1 CSE Case Studies Design of Alta Vista Based on a talk by Mike Burrows.

C o n f i d e n t i a l Developed By Nitendra NextHome Subject Name: Data Structure Using C Title: Overview of Data Structure.

Introduction n Keyword-based query answering considers that the documents are flat i.e., a word in the title has the same weight as a word in the body.

Detecting Near-Duplicates for Web Crawling Manku, Jain, Sarma

Chapter 13 File Structures. Understand the file access methods. Describe the characteristics of a sequential file. After reading this chapter, the reader.

The Anatomy of a Large-Scale Hypertextual Web Search Engine Presented By: Sibin G. Peter Instructor: Dr. R.M.Verma.

Chapter 2 Architecture of a Search Engine. Search Engine Architecture n A software architecture consists of software components, the interfaces provided.

CSE 6331 © Leonidas Fegaras Information Retrieval 1 Information Retrieval and Web Search Engines Leonidas Fegaras.

Search engines 2 Øystein Torbjørnsen Fast Search and Transfer.

Copyright © D.S.Weld10/25/ :38 AM1 CSE 454 Index Compression Alta Vista PageRank.

The Anatomy of a Large-Scale Hypertextual Web Search Engine Sergey Brin & Lawrence Page Presented by: Siddharth Sriram & Joseph Xavier Department of Electrical.

IR Homework #2 By J. H. Wang Mar. 31, Programming Exercise #2: Query Processing and Searching Goal: to search relevant documents for a given query.

1 FollowMyLink Individual APT Presentation Third Talk February 2006.

CPSC 252 Hashing Page 1 Hashing We have already seen that we can search for a key item in an array using either linear or binary search. It would be better.

Introduction n IR systems usually adopt index terms to process queries n Index term: u a keyword or group of selected words u any word (more general) n.

Chapter 5 Record Storage and Primary File Organizations

General Architecture of Retrieval Systems 1Adrienn Skrop.

CSE 454 Indexing. Todo A bit repetitive – cut some slides Some inconsistencie – eg are positions in the index or not. Do we want nutch as case study instead.

Information Retrieval in Practice

University of Maryland Baltimore County

CSE 454 Indexing.

Why indexing? For efficient searching of a document

Information Retrieval in Practice

Module 11: File Structure

Indexing & querying text

Modified from Stanford CS276 slides Lecture 4: Index Construction

Implementation Issues & IR Systems

OUTLINE Basic ideas of traditional retrieval systems

CSE 454 Advanced Internet Systems University of Washington

CSE 454 Advanced Internet Systems University of Washington

Inverted Indices (with Compression & LSI)

Database Implementation Issues

Inverted Indicies (with Compression & LSI)

CSE 454 Advanced Internet Systems University of Washington

CSE 454 Advanced Internet Systems University of Washington

Anatomy of a search engine

Hash-Based Indexes Chapter 10

Indexing and Hashing Basic Concepts Ordered Indices

Implementation Based on Inverted Files

Based on a talk by Mike Burrows

DATABASE IMPLEMENTATION ISSUES

Chapter 5: Information Retrieval and Web Search

CSE 454 Crawlers & Indexing.

File Organization.

Database Implementation Issues

Based on a talk by Mike Burrows

Efficient Retrieval Document-term matrix t1 t tj tm nf

Chapter 11 Instructor: Xin Zhang

Recuperação de Informação B

Recuperação de Informação B

Recuperação de Informação B

Advanced information retrieval

Database Implementation Issues

Presentation transcript:

Based on a talk by Mike Burrows CSE 454 - Case Studies Design of Alta Vista Based on a talk by Mike Burrows 1/18/2019 10:44 AM 1

Information Extraction Class Overview Information Extraction INet Advertising Security Cloud Computing Revisiting Other Cool Stuff Query processing Indexing IR - Ranking Content Analysis Crawling Network Layer

Review Vector Space Representation TF-IDF for Computing Weights q dj Vector Space Representation Dot Product as Similarity Metric TF-IDF for Computing Weights wij = f(i,j) * log(N/ni) Where q = … wordi… N = |docs| ni = |docs with wordi| But How Process Efficiently? t2 terms documents 3 3

Retrieval Document-term matrix t1 t2 . . . tj . . . tm nf d1 w11 w12 . . . w1j . . . w1m 1/|d1| d2 w21 w22 . . . w2j . . . w2m 1/|d2| . . . . . . . . . . . . . . di wi1 wi2 . . . wij . . . wim 1/|di| dn wn1 wn2 . . . wnj . . . wnm 1/|dn| wij is the weight of term tj in document di Most wij’s will be zero. 4 4

Inverted Files for Multiple Documents LEXICON “jezebel” occurs 6 times in document 34, 3 times in document 44, 4 times in document 56 . . . DOCID OCCUR POS 1 POS 2 . . . . . . … OCCURENCE INDEX 5 5

Many Variations Possible Address space (flat, hierarchical) Alta Vista uses flat approach Record term-position information Precalculate TF-IDF info Stored header, font & tag info Compression strategies 6 6

AltaVista: Inverted Files Map each word to list of locations where it occurs Words = null-terminated byte strings Locations = 64 bit unsigned ints Layer above gives interpretation for location URL Index into text specifying word number Slides adapted from talk by Mike Burrows 1/18/2019 10:44 AM 7

Documents A document is a region of location space Contiguous No overlap Densely allocated (first doc is location 1) All document structure encoded with words enddoc at last location of document begintitle, endtitle mark document title Document 1 Document 2 enddoc ... 0 1 2 3 4 5 6 7 8 ... 1/18/2019 10:44 AM 8

Format of Inverted Files Words ordered lexicographically Each word followed by list of locations Common word prefixes are compressed Locations encoded as deltas Stored in as few bytes as possible 2 bytes is common Sneaky assembly code for operations on inverted files Pack deltas into aligned 64 bit word First byte contains continuation bits Table lookup on byte => no branch instructs, no mispredicts 35 parallelized instructions/ 64 bit word = 10 cycles/word Index ~ 10% of text size 1/18/2019 10:44 AM 9

Index Stream Readers (ISRs) Interface for Reading result of query Return ascending sequence of locations Implemented using lazy evaluation Methods loc(ISR) return current location next(ISR) advance to next location seek(ISR, X) advance to next loc after X prev(ISR) return previous location ! 1/18/2019 10:44 AM 10

Processing Simple Queries User searches for “mp3” Open ISR on “mp3” Uses hash table to avoid scanning entire file Next(), next(), next() returns locations containing the word 1/18/2019 10:44 AM 11

Problem!!! Combining ISRs And Compare locs on two streams Or Or Merges two or more ISRs Not Not Returns locations not in ISR (lazily) file ISR file ISR file ISR fixx genesis Problem!!! or ISR mp3 Genesis OR fixx and ISR (genesis OR fixx) AND mp3 1/18/2019 10:44 AM 12

What About File Boundaries? enddoc mp3 fixx 1/18/2019 10:44 AM 13

ISR Constraint Solver Inputs: Constraint Types Set of ISRs: A, B, ... Set of Constraints Constraint Types loc(A)  loc(B) + K prev(A)  loc(B) + K loc(A)  prev(B) + K prev(A)  prev(B) + K For example: phrase “a b” loc(A)  loc(B), loc(B)  loc(A) + 1 a a b a a b a b 1/18/2019 10:44 AM 14

Two words on one page Let E be ISR for word enddoc Constraints for conjunction a AND b prev(E)  loc(A) loc(A)  loc(E) prev(E)  loc(B) loc(B)  loc(E) What if prev(E) Undefined? b a e b a b e b prev(E) loc(B) loc(E) loc(A) 1/18/2019 10:44 AM 15

Advanced Search Field query: a in Title of page Let BT, ET be ISRP of words begintitle, endtitle Constraints: loc(BT)  loc(A) loc(A)  loc(ET) prev(ET)  loc(BT) et a bt a et a bt et loc(ET) prev(ET) loc(A) loc(BT) 1/18/2019 10:44 AM 16

Implementing the Solver Constraint Types loc(A)  loc(B) + K prev(A)  loc(B) + K loc(A)  prev(B) + K prev(A)  prev(B) + K 1/18/2019 10:44 AM 17

Remember: Index Stream Readers Methods loc(ISR) return current location next(ISR) advance to next location seek(ISR, X) advance to next loc after X prev(ISR) return previous location 1/18/2019 10:44 AM 18

Solver Algorithm To satisfy: loc(A)  loc(B) + K loc(ISR) return cur loc next(ISR) adv to nxt loc return it seek(ISR, X) adv to nxt loc > X prev(ISR) return pre loc while (unsatisfied_constraints) satisfy_constraint(choose_unsat_constraint()) To satisfy: loc(A)  loc(B) + K Execute: seek(B, loc(A) - K) 1/18/2019 10:44 AM 19 19

Solver Algorithm To satisfy: loc(A)  loc(B) + K loc(ISR) return cur loc next(ISR) adv to nxt loc return it seek(ISR, X) adv to nxt loc > X prev(ISR) return pre loc while (unsatisfied_constraints) satisfy_constraint(choose_unsat_constraint()) To satisfy: loc(A)  loc(B) + K Execute: seek(B, loc(A) - K) To satisfy: prev(A)  loc(B) + K Execute: seek(B, prev(A) - K) a a b a a b a b b K K B PA A 1/18/2019 10:44 AM 20 20

Solver Algorithm To satisfy: loc(A)  loc(B) + K loc(ISR) return cur loc next(ISR) adv to nxt loc return it seek(ISR, X) adv to nxt loc > X prev(ISR) return pre loc while (unsatisfied_constraints) satisfy_constraint(choose_unsat_constraint()) To satisfy: loc(A)  loc(B) + K Execute: seek(B, loc(A) - K) To satisfy: prev(A)  loc(B) + K Execute: seek(B, prev(A) - K) To satisfy: loc(A)  prev(B) + K Execute: seek(B, loc(A) - K), next(B) a a b a a b a b b K PB A B 1/18/2019 10:44 AM 21 21

Solver Algorithm To satisfy: loc(A)  loc(B) + K loc(ISR) return cur loc next(ISR) adv to nxt loc return it seek(ISR, X) adv to nxt loc > X prev(ISR) return pre loc while (unsatisfied_constraints) satisfy_constraint(choose_unsat_constraint()) To satisfy: loc(A)  loc(B) + K Execute: seek(B, loc(A) - K) To satisfy: prev(A)  loc(B) + K Execute: seek(B, prev(A) - K) To satisfy: loc(A)  prev(B) + K Execute: seek(B, loc(A) - K), next(B) To satisfy: prev(A)  prev(B) + K 1/18/2019 10:44 AM 22 22

Solver Algorithm To satisfy: loc(A)  loc(B) + K Heuristic: Which choice advances a stream the furthest? while (unsatisfied_constraints) satisfy_constraint(choose_unsat_constraint()) To satisfy: loc(A)  loc(B) + K Execute: seek(B, loc(A) - K) To satisfy: prev(A)  loc(B) + K Execute: seek(B, prev(A) - K) To satisfy: loc(A)  prev(B) + K Execute: seek(B, loc(A) - K), next(B) To satisfy: prev(A)  prev(B) + K 1/18/2019 10:44 AM 23 23

Update Can’t insert in the middle of an inverted file Must rewrite the entire file Naïve approach: need space for two copies Slow since file is huge Split data along two dimensions Buckets solve disk space problem Tiers alleviate small update problem 1/18/2019 10:44 AM 24

Buckets & Tiers Each word is hashed to a bucket Add new documents by adding a new tier Periodically merge tiers, bucket by bucket older newer bigger smaller . . . Tier 1 Tier 2 Hash bucket(s) for word a . . . Hash bucket(s) for word b Hash bucket(s) for word zebra 1/18/2019 10:44 AM 25

What if Word Removed from Doc? Delete documents by adding deleted word Expunge deletions when merging tier 1 deleted a older newer bigger smaller . . . Tier 1 Tier 2 . . . Hash bucket(s) for word b Hash bucket(s) for deleted 1/18/2019 10:44 AM 26