7CCSMWAL Algorithmic Issues in the WWW Lecture 7

Text searching
- To search for a keyword within text, we can either scan the text sequentially or build an index over it.
- Scan the text sequentially when:
  - the text is small, e.g., a few MB
  - the text collection is very volatile (modified very frequently)
  - no extra space is available (for building indices)
- Build a data structure over the text (called an index) to speed up the search when:
  - the text collection is large, and
  - static or semi-static (can be updated at reasonably regular intervals)

Inverted files
- Also called inverted indices.
- Mainly composed of two elements:
  - Dictionary (or vocabulary, or lexicon): the set of all different words (tokens, index terms).
  - Postings list (or inverted list): each word has a list of the positions where the word appears; "positions" are used here to refer to the documents containing the term.
- The postings file is the set of all postings lists.
- The postings lists are much larger than the dictionary, so the dictionary is commonly kept in memory while the postings lists are normally kept on disk.
- The structure of the postings lists can vary (it is problem dependent); the indexed unit could be, e.g., each page of this presentation.

Example (see Introduction to Information Retrieval)
- The dictionary is sorted alphabetically into terms.
- Each postings list is sorted by document ID.
- The numbers are the documents in which the term occurs (or lines in a page or book, or whatever unit is indexed).

Construct an inverted file
- Input: a list of normalized tokens for each document.
- Example:
  - Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
  - Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:

Construct an inverted index
- The core indexing step is sorting the list of tokens so that the terms are in alphabetical order.
- Multiple occurrences of the same term from the same document are then merged, and the term frequency (number of occurrences of the term in the document) is recorded.
- Instances of the same term are then grouped, and the result is split into a dictionary (of terms) and postings (lists of the documents containing each term).
- The dictionary also records some statistics, such as the number of documents which contain each term (the document frequency) and the total number of occurrences.
- A small worked sketch of these steps is given below.
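To make the steps concrete, here is a minimal Python sketch (illustrative only, not the course's reference code) that builds a dictionary and postings from the two example documents; tokenization is simplified to lower-casing and a regular expression.

    from collections import defaultdict
    import re

    docs = {
        1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious:",
    }

    # 1. Produce (term, docID) pairs from the normalized tokens.
    pairs = []
    for doc_id, text in docs.items():
        for token in re.findall(r"[a-z']+", text.lower()):
            pairs.append((token, doc_id))

    # 2. Sort the pairs so that equal terms (and equal documents) become adjacent.
    pairs.sort()

    # 3. Merge duplicates: build postings lists and per-document term frequencies.
    postings = defaultdict(list)      # term -> sorted list of docIDs
    term_freq = defaultdict(int)      # (term, docID) -> #occurrences in that doc
    for term, doc_id in pairs:
        term_freq[(term, doc_id)] += 1
        if not postings[term] or postings[term][-1] != doc_id:
            postings[term].append(doc_id)

    # 4. Dictionary statistics: document frequency and total number of occurrences.
    for term in sorted(postings):
        df = len(postings[term])
        total = sum(term_freq[(term, d)] for d in postings[term])
        print(f"{term:10s} df={df} total={total} postings={postings[term]}")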

Example: the inverted file built from Doc 1 and Doc 2 (figure).

Data structures for postings lists: fixed-length arrays
- Each postings list is a fixed-length array.
- The arrays may not be fully utilized (wasted space).
- Reading a postings list is time-efficient because it is stored in contiguous locations.

Data structures for postings lists: singly linked lists
- Each postings list is a singly linked list.
- There are no empty entries, but there is the space overhead of the pointers.
- A small sketch of the linked-list representation follows.
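A minimal sketch of the linked-list representation (illustrative only; the class names are mine). The fixed-length-array alternative would simply be a pre-allocated list of docIDs.

    class PostingNode:
        """One document entry in a singly linked postings list."""
        def __init__(self, doc_id):
            self.doc_id = doc_id
            self.next = None

    class PostingsList:
        """Append-only singly linked postings list for one term."""
        def __init__(self):
            self.head = None
            self.tail = None

        def append(self, doc_id):
            node = PostingNode(doc_id)
            if self.tail is None:
                self.head = node
            else:
                self.tail.next = node   # pointer overhead: one link per posting
            self.tail = node

        def doc_ids(self):
            node = self.head
            while node is not None:
                yield node.doc_id
                node = node.next

    plist = PostingsList()
    for d in (1, 3, 6, 7):
        plist.append(d)
    print(list(plist.doc_ids()))        # [1, 3, 6, 7]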

Methods to index the terms
- Various approaches include:
  - external sort and merge
  - in-memory indexing based on hashing, followed by merging
  - distributed indexing: Google (for example) processes documents using MapReduce

Hardware constraints
- The indexing algorithm is governed by hardware constraints.
- Characteristics of computer hardware:
  - access to data in memory is much faster than access to data on disk
  - it takes a few clock cycles to access a byte in memory, but much longer to transfer it from disk
- We want to keep as much data as possible in memory, especially the data that we need to access frequently.

Index construction with disk
- The list of tokens may be too large to be stored and sorted in memory.
- External sorting algorithms minimize the number of random disk seeks during sorting.
- Blocked sort-based indexing (BSBI):
  - segment the collection into parts of equal size (step 4 of the pseudocode)
  - construct the intermediate inverted file for each part in memory (step 5); this step is the same as when the whole list of tokens fits in memory and the inverted file is constructed there
  - store the intermediate inverted files on disk (step 6)
  - merge all intermediate inverted files into the final index (step 7)

BSBI pseudocode

    BSBIndexConstruction()
    1  n ← 0
    2  while (all documents have not been processed)
    3  do n ← n + 1
    4     block ← ParseNextBlock()
    5     BSBI-Invert(block)
    6     WriteBlockToDisk(block, f_n)
    7  MergeBlocks(f_1, ..., f_n; f_merge)
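A hedged Python sketch of the per-block inversion step (BSBI-Invert) and the write to disk; the JSON file format and the function names are illustrative choices of this sketch, not part of BSBI itself.

    import json
    from itertools import groupby

    def bsbi_invert(block):
        """block: a list of (term, docID) pairs small enough to fit in memory."""
        block.sort()                                  # sort by term, then by docID
        inverted = {}
        for term, group in groupby(block, key=lambda pair: pair[0]):
            doc_ids = sorted({doc_id for _, doc_id in group})
            inverted[term] = doc_ids                  # postings list for this term
        return inverted

    def write_block_to_disk(inverted, filename):
        with open(filename, "w") as f:
            json.dump(inverted, f, sort_keys=True)

    block = [("caesar", 2), ("brutus", 1), ("caesar", 1), ("brutus", 2)]
    write_block_to_disk(bsbi_invert(block), "block_1.json")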

Merging intermediate inverted files
- Two blocks ("postings lists to be merged") are loaded from disk into memory, merged in memory ("merged postings lists"), and written back to disk.
- Example: merging the two blocks
    brutus: d1,d3     caesar: d1,d2,d4   noble: d5     with: d1,d2,d3,d5
    brutus: d6,d7     caesar: d8,d9      julius: d10   killed: d8
  gives the merged postings lists
    brutus: d1,d3,d6,d7   caesar: d1,d2,d4,d8,d9   julius: d10   killed: d8   noble: d5   with: d1,d2,d3,d5
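A sketch of this merge under the same illustrative JSON block format as above, assuming two intermediate files block_1.json and block_2.json have already been written; for a large collection this would be a streaming multi-way merge rather than loading whole blocks into memory.

    import json

    def merge_blocks(filenames, out_filename):
        merged = {}
        for fn in filenames:
            with open(fn) as f:
                block = json.load(f)
            for term, doc_ids in block.items():
                # Postings in each block are sorted; keep the merged list sorted too.
                merged[term] = sorted(set(merged.get(term, [])) | set(doc_ids))
        with open(out_filename, "w") as f:
            json.dump(merged, f, sort_keys=True)

    merge_blocks(["block_1.json", "block_2.json"], "final_index.json")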

Single-pass in-memory indexing (SPIMI)
- No sorting of tokens is required; the tokens are processed one by one in memory.
- When a term occurs for the first time, it is added to the dictionary (implemented as a hash table) and a new postings list is created for it; otherwise, the corresponding postings list is looked up.
- The docID of the token is then added to that postings list.
- The process continues until the memory is full; the dictionary is then sorted and written to disk.
- Note: the dictionary is much shorter than the complete list of all tokens which occur (or that's the idea).

Hash table
- A data structure that supports the operations Lookup and Insert (and possibly Delete) in expected constant time.
- Can be thought of as a table of data: each term is stored in one of the entries of the table.
- A hash function determines which data is stored in which table entry; typically, the hash function maps a string (an index term, or key) to an integer (a table entry).
- More details in the hash table slides later in this lecture.

Example: a hash table (picture from Wikipedia).

SPIMI pseudocode

    SPIMI-Invert(token_stream)
        output_file = NewFile()
        dictionary = NewHash()
        while (free memory available)
        do token ← next(token_stream)
           if term(token) ∉ dictionary
              then postings_list = AddToDictionary(dictionary, term(token))
              else postings_list = GetPostingsList(dictionary, term(token))
           if full(postings_list)
              then postings_list = DoublePostingsList(dictionary, term(token))
           AddToPostingsList(postings_list, docID(token))
        sorted_terms ← SortTerms(dictionary)
        WriteBlockToDisk(sorted_terms, dictionary, output_file)
        return output_file

- The pseudocode only shows how one intermediate inverted file is constructed.
- The final merging of the intermediate inverted files is the same as in BSBI.
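A hedged Python sketch of one SPIMI pass; memory pressure is approximated by a simple posting-count budget (a real implementation would track actual memory use), and the output format follows the illustrative JSON convention used in the BSBI sketch above.

    import json
    from collections import defaultdict

    def spimi_invert(token_stream, output_file, max_postings=1_000_000):
        """token_stream yields (term, docID) pairs; writes one intermediate file."""
        dictionary = defaultdict(list)        # hash table: term -> postings list
        n_postings = 0
        for term, doc_id in token_stream:
            postings_list = dictionary[term]  # created on the term's first occurrence
            if not postings_list or postings_list[-1] != doc_id:
                postings_list.append(doc_id)
                n_postings += 1
            if n_postings >= max_postings:    # stand-in for "memory is full"
                break
        # The dictionary is sorted only once, when the block is written out.
        sorted_block = {term: dictionary[term] for term in sorted(dictionary)}
        with open(output_file, "w") as f:
            json.dump(sorted_block, f)
        return output_file

    tokens = [("did", 1), ("enact", 1), ("julius", 1), ("caesar", 1), ("so", 2), ("did", 2)]
    spimi_invert(iter(tokens), "spimi_block_1.json")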

Example
- List of tokens (read from the documents): (did, 1), (enact, 1), (julius, 1), (caesar, 1), (so, 2), (did, 2), (it, 2), (the, 3), (you, 3), (hold, 3), ...
- The tokens are read one by one and inserted into the hash table in main memory until the memory is full; at that point the (unsorted) hash table might contain, e.g.: caesar -> 1, enact -> 1, so -> 2, julius -> 1, did -> 1,2.
- The entries of the hash table are then sorted by term and written to disk as an inverted file: caesar -> 1, did -> 1,2, enact -> 1, julius -> 1, so -> 2.

Distributed indexing
- Perform the indexing on a large computer cluster.
- A computer cluster is a group of tightly coupled computers that work closely together; the group may consist of hundreds or thousands of nodes (computers), and individual nodes can fail at any time.
- The result of the construction process is a distributed index that is partitioned across several machines, either according to term or according to document.
- We focus on the term-partitioned index.

Distributed indexing
- MapReduce: a general architecture for distributed computing.
- A master node (computer) directs the process of:
  - dividing the work up into small tasks
  - assigning the tasks to individual nodes
  - re-assigning tasks in case of node failure

Distributed indexing
- The master node breaks the input documents into splits.
- Each split is a subset of documents (corresponding to the partitions of the list of tokens made in BSBI/SPIMI).
- There are two sets of tasks: parsers and inverters.

Parsers
- The master assigns a split to an idle parser node.
- The parser reads one document at a time and produces (term, doc) pairs.
- The parser writes the pairs into j partitions for passing on to the inverters.
- Each partition covers a range of the terms' first letters, e.g., a-f, g-p, q-z (here j = 3).

Inverters
- The inverters complete the index inversion.
- The parsers pass the term partitions to the inverters (or can send the (term, doc) pairs one at a time).
- An inverter collects all (term, doc) pairs (= postings) for its term partition.
- It then sorts them and writes the postings lists. A small single-machine simulation of this pipeline follows the data-flow diagram below.

Data flow (diagram): the master assigns splits of documents to parsers and term partitions to inverters; each parser writes its (term, doc) pairs into the a-f, g-p and q-z partitions; each inverter is responsible for one partition (a-f, g-p or q-z) and produces the postings for that range of terms.
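A minimal single-process simulation of the term-partitioned MapReduce flow (purely illustrative; in a real system the parsers and inverters run on separate machines and the master handles assignment and failures). The three first-letter partitions a-f, g-p, q-z follow the example above.

    from collections import defaultdict
    import re

    PARTITIONS = [("a", "f"), ("g", "p"), ("q", "z")]

    def partition_of(term):
        """Map a term to the index of the partition covering its first letter."""
        for i, (lo, hi) in enumerate(PARTITIONS):
            if lo <= term[0] <= hi:
                return i
        return len(PARTITIONS) - 1          # fallback for unexpected characters

    def parser(split):
        """Map phase: emit (term, docID) pairs grouped by partition."""
        out = defaultdict(list)
        for doc_id, text in split:
            for term in re.findall(r"[a-z]+", text.lower()):
                out[partition_of(term)].append((term, doc_id))
        return out

    def inverter(pairs):
        """Reduce phase: sort the pairs of one partition into postings lists."""
        postings = defaultdict(list)
        for term, doc_id in sorted(pairs):
            if not postings[term] or postings[term][-1] != doc_id:
                postings[term].append(doc_id)
        return dict(postings)

    split = [(1, "Brutus killed Caesar"), (2, "the noble Brutus")]
    partitioned = parser(split)
    for i in sorted(partitioned):
        print(PARTITIONS[i], inverter(partitioned[i]))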

Dynamic indexing
- Up to now, we have assumed that collections are static. They rarely are:
  - new documents come in over time and need to be inserted
  - documents are deleted and modified
- This means that the dictionary and the postings have to be modified:
  - postings updates for terms that are already in the dictionary
  - new terms added to the dictionary

Simplest approach: block update
- Maintain a "big" main index on disk.
- New documents go into a "small" auxiliary index in memory.
- Merge the auxiliary index and the main index when the auxiliary index is bigger than a threshold value.
- Assume that the threshold value for flushing the auxiliary index is a large constant n. A sketch of this scheme is given below.
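A hedged sketch of block update, with in-memory Python dictionaries standing in for the on-disk main index (illustrative only; in a real system the main index lives on disk and the merge streams the postings).

    from collections import defaultdict

    class BlockUpdateIndex:
        def __init__(self, threshold_n):
            self.n = threshold_n
            self.main = defaultdict(list)   # "big" main index (would be on disk)
            self.aux = defaultdict(list)    # "small" auxiliary index (in memory)
            self.aux_postings = 0

        def add_document(self, doc_id, terms):
            for term in terms:
                if not self.aux[term] or self.aux[term][-1] != doc_id:
                    self.aux[term].append(doc_id)
                    self.aux_postings += 1
            if self.aux_postings >= self.n:
                self._merge()

        def _merge(self):
            # Merge every auxiliary postings list into the main index, then reset.
            for term, doc_ids in self.aux.items():
                self.main[term] = sorted(set(self.main[term]) | set(doc_ids))
            self.aux.clear()
            self.aux_postings = 0

    idx = BlockUpdateIndex(threshold_n=4)
    idx.add_document(1, ["brutus", "killed", "caesar"])
    idx.add_document(2, ["noble", "brutus"])
    print(dict(idx.main))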

Example (the symbol ⊕ represents the merge operation)
- 1st merge: n postings (auxiliary) ⊕ 0 postings (main)  ->  new main index of n postings
- 2nd merge: n postings ⊕ n postings  ->  2n postings
- 3rd merge: n postings ⊕ 2n postings  ->  3n postings
- ...
- k-th merge: n postings ⊕ (k-1)n postings  ->  kn postings

Time complexity
- To process T = kn postings requires k = T/n merges.
- Merging two sorted lists of sizes n and jn takes O(n + jn) = O(jn) time.
- Building a main index with T postings therefore needs the merges for j = 1, ..., T/n, which takes O(1n + 2n + 3n + ... + (T/n)n) = O(T^2/n) time; the sum is worked out below.
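For reference, the arithmetic series behind the O(T^2/n) bound, written out in LaTeX (c is the constant hidden in the O-notation):

    \sum_{j=1}^{T/n} c\,jn = c\,n \sum_{j=1}^{T/n} j = c\,n \cdot \frac{(T/n)(T/n + 1)}{2} = \frac{c\,T^{2}}{2n} + \frac{c\,T}{2} = O\!\left(\frac{T^{2}}{n}\right)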

Logarithmic merge
- Basic idea: don't merge the auxiliary and main index directly. This speeds up merging and index construction in dynamic indexing.
- Maintain a series of indexes, each twice as large as the previous one:
  - keep the smallest index (Z0) in memory
  - keep the larger indexes (I0, I1, ...) on disk, with sizes doubling: I0 of size n, I1 of size 2n, I2 of size 4n, and so on
- The scheme for merging:
  - if Z0 gets too big (>= n), either write it to disk as I0, or merge it with I0 (if I0 already exists) into Z1
  - either write Z1 to disk as I1 (if there is no I1) or merge it with I1 to form Z2, and so on

Pseudocode of logarithmic merging

    LMergeAddToken(indexes, Z0, token)
        Z0 ← Merge(Z0, {token})
        if |Z0| = n
          then for i ← 0 to ∞
               do if Ii ∈ indexes
                    then Zi+1 ← Merge(Ii, Zi)
                         (Zi+1 is a temporary index on disk)
                         indexes ← indexes − {Ii}
                    else Ii ← Zi   (Zi becomes the permanent index Ii)
                         indexes ← indexes ∪ {Ii}
                         break
               Z0 ← ∅

    LogarithmicMerge()
        Z0 ← ∅   (Z0 is the in-memory index)
        indexes ← ∅
        while true
        do LMergeAddToken(indexes, Z0, GetNextToken())

Example (the symbol ⊕ represents the merge operation)
- 1st time |Z0| = n: I0 ← Z0; Z0 ← ∅.  Indexes on disk: I0
- 2nd time |Z0| = n: Z1 ← I0 ⊕ Z0; I1 ← Z1; remove I0; Z0 ← ∅.  Indexes: I1
- 3rd time |Z0| = n: I0 ← Z0; Z0 ← ∅.  Indexes: I0, I1
- 4th time |Z0| = n: Z1 ← I0 ⊕ Z0; Z2 ← I1 ⊕ Z1; I2 ← Z2; remove I0, I1; Z0 ← ∅.  Indexes: I2
- 5th time |Z0| = n: Indexes: I0, I2
- 6th time |Z0| = n: Indexes: I1, I2
- 7th time |Z0| = n: Indexes: I0, I1, I2
- In general, after the k-th time, the indexes present correspond to the 1-bits in the binary representation of k (bit i set means Ii exists); a small simulation of this pattern follows.
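A tiny simulation (illustrative; it tracks only index sizes, not their contents) that reproduces the pattern above and the binary-representation correspondence:

    def logarithmic_merge_pattern(num_flushes):
        indexes = {}                      # level i -> size of index Ii, in units of n
        for k in range(1, num_flushes + 1):
            carry, level = 1, 0           # Z0 holds n postings when it is flushed
            while level in indexes:       # Ii exists: merge it into the carry
                carry += indexes.pop(level)
                level += 1
            indexes[level] = carry        # Zi becomes the permanent index Ii
            present = ", ".join(f"I{i}" for i in sorted(indexes))
            print(f"after flush {k} (binary {k:b}): {present}")

    logarithmic_merge_pattern(7)
    # The last line printed is: after flush 7 (binary 111): I0, I1, I2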

Time complexity
- Size doubling: for T postings, the series of indexes consists of at most log(T/n) + 1 indexes I0, I1, ..., since k = log2(T/n) levels suffice to hold (2^k)n = T postings.
- To build a main index with T postings, the overall construction time is O(T log T): each posting is processed (i.e., merged) at most once on each of the (at most) O(log T) levels, because whenever a posting takes part in a merge it moves up one level.
- So logarithmic merge is more efficient for index construction than block update, since T log T grows more slowly than T^2/n.

Searching the Index

Search structures for dictionaries
- Given a keyword of a query, determine whether the keyword exists in the vocabulary (dictionary); if so, identify the pointer to the corresponding postings list.
- If no search structure exists, we have to check the terms of the dictionary one by one until a match is found or all terms are exhausted, which takes O(n) time, where n is the number of terms in the dictionary.
- Search structures help speed up this vocabulary lookup operation.

Search structures for dictionaries
- Two main choices: the hash table (introduced earlier) and the search tree.
- Factors affecting the choice:
  - How many terms are we likely to have?
  - Is the number likely to remain static, or to change a lot?
  - Are we likely only to have new terms inserted, or also to have some terms in the dictionary deleted?
  - What are the relative frequencies with which the various terms will be accessed?
- Generally speaking, a hash table is preferable for more static data, while a search tree handles dynamic data more efficiently.

Hash table
- A hash table is an array together with a hash function and collision management.
- It is mainly operated by the hash function, which determines where to find (or insert) a term: the hash function maps a term to an integer between 0 and N-1, where N is the number of entries of the hash table.
- Hashing is reproducible randomness: it looks as if a term is mapped to a random array index, but every time we map the same term we get the same index.

An example of a hash function
- Suppose the dictionary consists of terms that are composed of lower-case letters or white space only, and each term consists of at most 20 characters.
- Let f() be a function that maps white space to 0, 'a' to 1, 'b' to 2, ..., 'z' to 26, and let N be a large prime number.
- The hash function F(word) can then be defined as
    F(word) = [ f(1st character) + f(2nd character)*26 + f(3rd character)*26^2 + f(4th character)*26^3 + ... ] mod N

Hash function example
- Suppose N = 13. (Powers of 26: 1, 26, 676, 17576, 456976.)
- For the term 'caesar':
    F('caesar') = (3 + 1*26 + 5*26^2 + 19*26^3 + 1*26^4 + 18*26^5) mod 13 = 214659097 mod 13 = 3
- For the term 'enact':
    F('enact') = (5 + 14*26 + 1*26^2 + 3*26^3 + 20*26^4) mod 13 = 9193293 mod 13 = 5
- Exercise: compute F for 'the', 'let', 'it', 'best'.
- Resulting table (entries 0-12): entry 3 = caesar, 4 = did, 5 = enact, 6 = so, 7 = the, 9 = i, 10 = julius, 11 = killed; the other entries are empty.
- A short coded version of this hash function is given below.
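A hedged Python version of this hash function (the names f and F follow the slide; the implementation details are mine), which reproduces the values above:

    def f(ch):
        """Map white space to 0 and 'a'..'z' to 1..26, as on the slide."""
        return 0 if ch == " " else ord(ch) - ord("a") + 1

    def F(word, N=13):
        """Hash the first (up to) 20 characters in base 26, then reduce mod N."""
        total = 0
        for i, ch in enumerate(word[:20]):
            total += f(ch) * 26 ** i
        return total % N

    print(F("caesar"))   # 3
    print(F("enact"))    # 5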

Collision
- A collision occurs when two different terms are mapped to the same entry.
- For example, for the term 'was': F('was') = (23 + 1*26 + 19*26^2) mod 13 = 12893 mod 13 = 10, so 'was' is mapped to the same entry (10) as 'julius'.
- Collisions can be resolved by auxiliary structures (e.g., chaining), a secondary hash function, or rehashing; a chaining sketch is given below.
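Building on the F sketch above, a minimal separate-chaining table (illustrative only; the class and method names are mine, not from the slides):

    class ChainedHashTable:
        def __init__(self, N=13):
            self.N = N
            self.buckets = [[] for _ in range(N)]    # one chain per table entry

        def insert(self, term, postings_pointer):
            bucket = self.buckets[F(term, self.N)]
            for i, (t, _) in enumerate(bucket):
                if t == term:                         # term already present: update
                    bucket[i] = (term, postings_pointer)
                    return
            bucket.append((term, postings_pointer))   # colliding terms share a chain

        def lookup(self, term):
            for t, ptr in self.buckets[F(term, self.N)]:
                if t == term:
                    return ptr
            return None

    table = ChainedHashTable()
    table.insert("julius", [1])
    table.insert("was", [1, 2])      # collides with 'julius' at entry 10
    print(table.lookup("was"))       # [1, 2]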

Search trees
- Two options: the binary search tree and the B-tree.
- The terms are in sorted order in the in-order traversal of the tree.
- The binary search tree is only practical for in-memory operations.
- (The B-tree material is included for interest only.)

Binary search tree (BST)
- Binary tree: a tree in which every node has at most two children.
- Binary search tree: every node is associated with a key (term) such that
  - the term associated with the left child is lexicographically smaller than that of the parent node, and
  - the term associated with the parent is lexicographically smaller than that of the right child.
- E.g., a node 'did' with left child 'caesar' and right child 'enact'.

Example (figure): a binary search tree over the vocabulary caesar, did, enact, i, julius, killed, so, the; each term in the vocabulary points to its postings list, i.e., the documents containing the term.

Searching in a BST
- Start from the root node; the search proceeds into one of the two subtrees below by comparing the term being searched for with the term associated with the current node.
- The search stops when a match is found or a leaf node is reached.
- The search (or lookup) operation takes O(log T) time, where T is the number of terms, provided that the BST is balanced.
- Balance criterion, e.g.: the numbers of terms under the two subtrees of any node are either equal or differ by 1.
- A lookup sketch in code is given below.
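A minimal BST insert/lookup sketch (illustrative; the node layout and names are mine):

    class Node:
        def __init__(self, term, postings):
            self.term = term
            self.postings = postings       # pointer to the postings list
            self.left = None
            self.right = None

    def bst_insert(root, term, postings):
        if root is None:
            return Node(term, postings)
        if term < root.term:
            root.left = bst_insert(root.left, term, postings)
        elif term > root.term:
            root.right = bst_insert(root.right, term, postings)
        return root

    def bst_lookup(root, term):
        """Walk down the tree, going left or right by lexicographic comparison."""
        while root is not None:
            if term == root.term:
                return root.postings
            root = root.left if term < root.term else root.right
        return None                        # term not in the dictionary

    root = None
    for term, postings in [("did", [1]), ("caesar", [1, 2]), ("enact", [1])]:
        root = bst_insert(root, term, postings)
    print(bst_lookup(root, "caesar"))      # [1, 2]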

B-tree
- The number of subtrees under an internal node varies in a fixed interval [a, b], where a <= b are positive integers.
- The number of terms associated with an internal node, except the root, is between a-1 and b-1.
- A B-tree can be viewed as "collapsing" multiple levels of a binary tree into one.
- This is good for the case where the dictionary is disk resident, in which case this collapsing serves the function of prefetching imminent binary tests.
- The integers a and b are determined by the sizes of disk blocks.

Example (figure): a B-tree with a = 2 and b = 4 over the vocabulary of Doc 1 and Doc 2 (ambitious, be, brutus, caesar, capitol, did, enact, hath, I, i', it, julius, killed, let, me), with each term pointing to its postings list.

Reminder
- Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
- Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious: