Select Operation- disk access and Indexing *Some info on slides from Dr. S. Son, U. Va.

Disk access DBs traditionally stored on disk Cheaper to store on disk than in memory Costs for: –Seek time, latency, data transfer time Disk access is page (block) oriented KB page size

Access time Access time is the time to randomly access a page System initially determines if page in memory buffer (page tables, etc.) Large disparity between disk access and memory access

Select operation using table scan
Reading the entire table for a select is a table scan
Improvements to table scan of disk:
–Parallel access
–Sequential prefetch

Parallel access
Linear search - all data rows read in from disk
–I/O parallelism can be used (RAID)
multiple I/O read requests satisfied at the same time
stripe the data across different disks
–Problems with parallelism?
must balance disk arm load to gain maximum parallelism
requires the same total number of random I/Os, but each device is used for a shorter time

Sequential prefetch I/O
Retrieve one disk page after another (on same track)
–32 pages at a time in DB2, varies in Oracle
Seek time no longer a problem
Must know in advance to read 32 successive pages
Speed up of I/O by a factor of ≈10 (500 I/O's per second vs. 70)

Access time
Seek time
–as low as 4 ms (server)
Latency time
–as low as 1 ms or less
Data transfer time
–.4-2 ms
Solid state disks up to 100,000 I/Os per sec.
–still expensive

Access time for fast I/O: RIO vs. Seq. Prefetch
Components of each access:
–Seek - disk arm to cylinder
–Latency - platter to sector
–Data transfer - 1 page (RIO) vs. 32 pages (Seq. Prefetch)
Time for 32 pages: .176* seconds with RIO, .021 seconds with Seq. Prefetch
*.0055 × 32 = .176 for 32 pages of RIO vs. .021 for 32 pages of Seq. Prefetch
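The two totals on this slide can be reproduced with a small calculation. The sketch below is illustrative only: the component times (4 ms seek, 1 ms latency, 0.5 ms transfer per page) are assumptions chosen because they fall inside the ranges quoted earlier and add up to the slide's .0055 and .021 figures; the function names are hypothetical.

```python
# Assumed component times in seconds (not stated on the slide; they are
# picked so that one random I/O costs .0055 s, as the slide's footnote says).
SEEK = 0.004       # move disk arm to cylinder
LATENCY = 0.001    # wait for platter to rotate to sector
TRANSFER = 0.0005  # transfer one page


def random_io_time(pages):
    """Every page pays seek + latency + transfer."""
    return pages * (SEEK + LATENCY + TRANSFER)


def sequential_prefetch_time(pages):
    """One seek and one latency, then the pages stream off the same track."""
    return SEEK + LATENCY + pages * TRANSFER


rio_32 = random_io_time(32)
sp_32 = sequential_prefetch_time(32)
print(f"32 pages by RIO:           {rio_32:.3f} s")
print(f"32 pages by seq. prefetch: {sp_32:.3f} s")
```

The only difference between the two functions is whether seek and latency are paid once or once per page, which is the whole argument for prefetching.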

Organizing disk space
How should data be stored to minimize access time when the entire table is read?

Disk allocation Disk Resource Allocation for Databases (DBA has control) Goal – contiguous sectors on disk - want data as close together as possible to minimize seek time No standard SQL approach, but general way to deal with allocation Some OS allow specification of size of file and disk device

Types of Files
Heap files (unordered – sequential)
Sorted files (ordered – sort key)
Hash files (hash key, hash function)
B+-trees
Storage Area Networks (SAN)
–ERP (enterprise resource planning) and DW (data warehouses)
–Storage devices configured as nodes in network – can attach/detach

Tablespace Tablespace is: Allocation medium for tables and indexes for ORACLE, DB2, etc. Can put >1 table in a table space if accessed together Tablespace corresponds to 1 or more OS files and can span disk devices Usually relations cannot span disk devices

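DB storage structures
[Figure: storage hierarchy for database DBCompany. Tablespaces (tspace1, system) correspond to OS files (fname1, fname2, fname3) and hold the tables Empl, Dept, Proj, Dep and the index EmpIndx; each table or index is stored as a data or index segment, and each segment is made up of extents.]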

Tablespace
ORACLE DBs contain several tablespaces, including one called system - data description + indexes + user-defined tables
Default tablespace given to each user
If multiple tablespaces - better control over load balancing
Can take some disk space off-line

Extent
Relation composed of 1 or more extents
Extent - contiguous storage on disk
When data segment or index segment first created, given an initial extent from tablespace: 10KB (5 pages)
If need more space, given next contiguous extent

Extent
Can increase the size by a positive % (cannot decrease)
–initial n - size of initial extent
–next n - size of next extent
–max extents - maximum number of extents
–min extents - number of extents initially allocated
–pct increase n - % by which next extent grows over previous one

Oracle create tablespace

Create table
Create table statement - can specify tablespace, no. of extents
–When initial extent full, new extent allocated
–pctfree - determines how much space in a page is held back from inserts of new rows
if pctfree = 10%, inserts stop when page is 90% full
»Uses another page
–pctused - determines when new inserts start again
inserts resume if used space falls below this percentage of the page, default pctused = 40%
pctfree + pctused < 100
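The interaction of pctfree and pctused can be sketched as a small state machine. This is a simplified model for illustration, not Oracle's actual space management; the `Page` class and its byte-level accounting are hypothetical.

```python
class Page:
    """Toy model of one data page governed by pctfree and pctused."""

    def __init__(self, size=2048, pctfree=10, pctused=40):
        assert pctfree + pctused < 100
        self.size = size
        self.pctfree = pctfree
        self.pctused = pctused
        self.used = 0
        self.accepting = True  # is the page open for new inserts?

    def insert(self, row_bytes):
        """Return True if the row was placed on this page."""
        limit = self.size * (100 - self.pctfree) / 100
        if not self.accepting or self.used + row_bytes > limit:
            self.accepting = False  # caller must use another page
            return False
        self.used += row_bytes
        return True

    def delete(self, row_bytes):
        """Free space; reopen the page once usage drops below pctused."""
        self.used -= row_bytes
        if self.used < self.size * self.pctused / 100:
            self.accepting = True


page = Page(size=1000, pctfree=10, pctused=40)
filled = [page.insert(100) for _ in range(10)]  # the tenth insert is refused
```

The gap between the two thresholds keeps a page from flipping between open and closed on every delete followed by an insert.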

Rows
Row layout on each disk page:
Header info | Row directory (1 2 3 … N) | free space | data rows (Row N, Row N-1, … Row 1)
Row directory – row number and page byte offset
–Row number is row number in page – also called slot#
–Page byte offset – with varchar, row size not constant
To identify a particular row use RID (RowID) – page #, slot # [file#]
slot# is number in row directory (logical #)
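A minimal sketch of the row directory idea follows, assuming a page that only ever appends rows. The `SlottedPage` class is hypothetical and ignores headers, free-space management and deletes; it only shows why a RID of (page #, slot #) stays valid when rows have variable length.

```python
class SlottedPage:
    """Toy page: variable-length rows plus a row directory of byte offsets."""

    def __init__(self, page_no):
        self.page_no = page_no
        self.data = bytearray()  # row bytes packed one after another
        self.directory = []      # slot# -> (byte offset, length)

    def insert(self, row: bytes):
        """Store the row and return its RID as (page#, slot#)."""
        offset = len(self.data)
        self.data += row
        self.directory.append((offset, len(row)))
        return (self.page_no, len(self.directory) - 1)

    def fetch(self, rid):
        """Look the slot up in the directory, then read the bytes."""
        page_no, slot = rid
        if page_no != self.page_no:
            raise KeyError("RID refers to a different page")
        offset, length = self.directory[slot]
        return bytes(self.data[offset:offset + length])


page = SlottedPage(page_no=7)
rid_a = page.insert(b"Ann")
rid_b = page.insert(b"Roberto")  # different length, same lookup path
```

Because callers hold the slot number rather than the byte offset, the page is free to move a row internally and update only its directory entry.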

Differences in DBMSs re: rows
ROWID can be retrieved in ORACLE but not DB2 (violates relational model rule)
ORACLE rows can be split between pages (row record fragmentation)
Can have rows from multiple tables on same page
DB2, no splitting, entire row moved to new page, need forwarding pointer

Select operation using Indexes Alternative to table scan

Why use an index?
If use a select (or join) on the same attribute frequently, want a way to improve performance - use indexes
–For example: Select from Employee where ssn =

B+-tree
Most commonly used index structure type in DBs today
Based on B-tree
Good for equality and range searches
B+ tree: dynamic, adjusts gracefully under inserts and deletes
Used to minimize disk I/O
Available in DB2; ORACLE also has hash cluster; Ingres has heap structure, B-tree, isam (chain together new nodes)

Structure of B+ Trees
Leaf level holds pointers to data (RIDs)
The remaining nodes are directory (index) nodes that point to other index nodes
[Figure: index entries (direct search) above data entries ("sequence set")]

Example of B+Tree Points to data

Characteristics of B+ Tree
Order of tree (fan out) – max number of child nodes
Minimum 50% occupancy (except for root). Each node contains d/2 <= m <= d-1 entries.
–Where the parameter d is the order of the tree.
Insert/delete at log_F N cost; keep tree height-balanced. (F = fanout, N = # leaf pages)
Supports equality and range-searches efficiently

Cost of I/O for B+-tree One index node is one page If tree with depth of 3, 3 I/Os to get pointer to data Read in index node can remain in memory –likely since frequent access to upper -level nodes of actively used B+-trees

B+ Trees in Practice
Typical order: between children
Typical fill-factor: 2/3 full (66.6%)
–average fanout = 133 (if 200 children)
Typical capacities:
–Height 4: 133^4 ≈ 312,900,700 records
–Height 3: 133^3 = 2,352,637 records
Can often hold top levels in buffer pool:
–Level 1 = 1 page = 8 Kbytes
–Level 2 = 133 pages = 1 Mbyte
–Level 3 = 17,689 pages = 133 MBytes
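The capacity figures follow directly from the average fanout of 133. A short sketch, with hypothetical helper names and the 8 KB page size taken from the slide:

```python
FANOUT = 133   # average fanout from the slide
PAGE_KB = 8    # page size from the slide


def capacity(fanout, height):
    """Number of records reachable from a tree of the given height."""
    return fanout ** height


def pages_at_level(fanout, level):
    """Level 1 is the root; each level has fanout times more pages."""
    return fanout ** (level - 1)


for height in (3, 4):
    print(f"height {height}: {capacity(FANOUT, height):,} records")

for level in (1, 2, 3):
    pages = pages_at_level(FANOUT, level)
    print(f"level {level}: {pages:,} pages, {pages * PAGE_KB / 1024:.1f} MB")
```

The exact value of 133^4 is 312,900,721, so the slide's figure for height 4 is a rounded one, and the megabyte figures on the slide are likewise approximate.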

Why B+-tree
Directory structure - retrieve range of values efficiently
–search for leftmost index entry S_i such that X <= S_i
Index entries always in sequence by value - can use sequential prefetch on index
Index entries shorter than data rows - less I/O

B+-tree Balancing of B+-trees - insert, delete Nodes usually not full Utilities to reorganize to lower disk I/O Most systems allow nodes to become depopulated- no automatic algorithm to balance Average node below root level 71% full in active growing B+-trees

Duplicate key values
Duplicate key values in index
–leaf nodes have sibling pointers
–but a delete of a row that has a heavily duplicated key entails a long search through the leaf level of the B+-tree
Index compression - with multiple duplicates
| header info | PrX keyval RID RID ... RID | PrX keyval RID ... RID |
where PrX is count of RID values
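The compressed leaf layout stores each key value once, preceded by a count of its RIDs. A minimal sketch, assuming leaf entries arrive as (key, RID) pairs already sorted by key; the function name and the tuple layout are illustrative, not a real DBMS page format.

```python
from itertools import groupby


def compress_leaf(entries):
    """Turn sorted (key, rid) pairs into (count, key, [rids]) groups.

    The count plays the role of PrX on the slide.
    """
    compressed = []
    for key, group in groupby(entries, key=lambda entry: entry[0]):
        rids = [rid for _, rid in group]
        compressed.append((len(rids), key, rids))
    return compressed


# RIDs are (page#, slot#) pairs; the data is made up for the example.
leaf_entries = [
    ("Boston", (1, 0)),
    ("Boston", (4, 2)),
    ("Chicago", (2, 1)),
]
compressed_leaf = compress_leaf(leaf_entries)
```

With heavily duplicated keys this saves the space of repeating the key value, at the price of a scan through the RID list when one specific row is deleted.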

Create Index Options: multiple columns tablespace storage - initial extents, etc. percent free default = 10 % of each page left unfilled (creation) free page (1 free page for every n index pages during creation)

Types of indexes (textbook)
Primary index - key field is a candidate key (must be unique)
–data file ordered by key field
Clustering index - key field is not unique, data file is ordered
–all records with same values on same pages
Secondary index - non-clustering index
–data file not ordered
First record in the data page (or block) is called the anchor record
Non-dense index - pointer in index entry points to anchor
Dense index - pointer to every record in the file

Clustering Efficiency advantage read in a page, get all of the rows with the same value clustering is useful for range queries e.g. between keyval1 and keyval2

Clustering
Can only cluster table by 1 clustering index at a time
In SQL Server
–creates clustered index on PK automatically if no other clustered index on table and PK nonclustered index not specified
In DB2
–if the table is empty, rows sorted as placed on disk
–subsequent insertions not clustered, must use REORG
In Oracle
–Cluster index – now available for PK in 10g
–Define a cluster to create cluster index for 2 tables

Indexes vs. table scan
To illustrate the difference between table scan, secondary index (non-clustered) and clustered index
Assume 10M customers, 200 cities
2KB/page, row = 100 bytes, 20 rows/page
Select * From Customers Where city = 'Birmingham'
If assume selectivity = 1/200: 1/200 * 10M = 50,000 customers in a city

Rules of Thumb for I/O
Assume slightly slower times than before:
–Random I/O (RIO) – 160 pages/second, .00625 seconds per page
–Sequential prefetch I/O (SP) – 1600 pages/second, .000625 seconds per page
Will discuss later:
–List prefetch I/O (LP) – 400 pages/second, .0025 seconds per page

Table Scan
Table scan - read entire table
10,000,000/20 = 500,000 pages
If used random I/O (RIO) – WHICH ONE WOULD NEVER DO
500,000*RIO = 3125 seconds
Instead, it makes more sense to use sequential prefetch (SP), reading 32 pages at a time
500,000*SP = 312.5 seconds

Clustering Index
Clustering index – all entries for B'ham clustered on same pages
50,000/20 = 2500 data pages (with 20 rows per page)
Assume 3 upper nodes of the tree
Assume 1000 index entries per leaf node, read 50,000/1000 = 50 index pages
3 + 50 + 50,000/20 = 2,553 pages to access
If top 3 levels of tree in memory, count access time as 0
Access time: (3*0) + (50*SP) + (2500*SP) = 2,550 * .000625 = 1.6 seconds

Secondary Index
In the worst case 1 entry for B'ham per page
50,000 data pages (10M/200)
3 + 50 + 50,000 = 50,053 pages to access
(3*0) + (50*SP) + (50,000*RIO) = 312.5 seconds access time
REALLY slow – see next slide for a better solution!
Use List Prefetch instead of RIO

List Prefetch – Better solution
Create list of data pages to access
Pages not necessarily in contiguous sequential order
System orders pages to minimize disk I/O
E.g. elevator algorithm for disk request scheduling
Using list prefetch (LP): 0 + (50*SP) + (50,000*LP) = 125.03 seconds access time
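The four access plans for the Birmingham query can be compared side by side. The sketch below uses the rule-of-thumb rates given earlier (160, 1600 and 400 pages per second) and the same assumptions as the slides: 10M rows, 20 rows per page, selectivity 1/200, 1000 index entries per leaf page, and upper index levels already in memory. The variable names are illustrative.

```python
RIO = 1 / 160    # seconds per page, random I/O
SP = 1 / 1600    # seconds per page, sequential prefetch
LP = 1 / 400     # seconds per page, list prefetch

ROWS = 10_000_000
ROWS_PER_PAGE = 20
MATCHING_ROWS = ROWS // 200            # 50,000 customers in the city
LEAF_PAGES = MATCHING_ROWS // 1000     # 50 index leaf pages to read

table_pages = ROWS // ROWS_PER_PAGE    # 500,000 pages in the table
clustered_pages = MATCHING_ROWS // ROWS_PER_PAGE  # 2,500 data pages

costs = {
    "table scan, random I/O": table_pages * RIO,
    "table scan, sequential prefetch": table_pages * SP,
    "clustering index": (LEAF_PAGES + clustered_pages) * SP,
    # Worst case for a secondary index: every matching row on its own page.
    "secondary index, random I/O": LEAF_PAGES * SP + MATCHING_ROWS * RIO,
    "secondary index, list prefetch": LEAF_PAGES * SP + MATCHING_ROWS * LP,
}

for plan, seconds in costs.items():
    print(f"{plan:32s} {seconds:10.2f} s")
```

Under these assumptions a secondary index read with random I/O costs about the same as a prefetched table scan, which is why list prefetch matters for non-clustered access.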

% Free Redo the previous calculations assuming relations created with 50% free option specified.

Creating Indexes
When determining what indexes to create consider:
–workload - mix of queries and frequencies of requests, e.g. 20% of requests are updates
–can create lots of indexes but:
cost to create
insertions
initial load time high if a large table
index entries can become longer and longer as multiple columns included

Multiple Indexes More than one index on a relation –e.g. age – one index, class - one index, gender - one index

Composite Index One index based on more than one attribute Create Index index_name on Table (col1, col2,... coln) Composite index entry - values for each attribute age, class, gender entry in index is: C1, C2, C3, RID

Using Indexes System must decide if to use index What if more than one index, which one? What if composite index?

Plans using Indexes
Can use an index if index matches select condition in where clause:
1. A matching index scan – only have to access a limited number of contiguous leaf entries to access data
2. Predicate screening with matching index scan – use index entries to eliminate RIDs
3. Non-matching index scan – use index to identify RIDs
4. Index-only retrieval – don't access data, RIDs only
5. Multiple index retrieval – use >1 index to identify RIDs

Matching index scan
Definition of a matching index scan - only have to access contiguous leaf nodes
1) Single where clause and index matches
Create index Idx1 on T1 (C1)
Select * from T1 where C1=10
Search B+-tree to leaf level for leftmost entry having specified values
Useful for =, between

Matching Index Scan
2) If multiple where clauses and all '='
Select * from T1 where C1=10 and C2=5
i) if there is a composite index and select columns match all index columns, e.g.
Create index Idx2 on T1 (C1, C2)
only have to read contiguous leaf pages
ii) if there is a separate index for each clause, e.g.
Create index idx3 on T1(C1); Create index idx4 on T1(C2);
must choose one or more of the indexes (later)

Matching Index Scan - Rules
A matching scan can be used ONLY IF one of the columns in select is the first column of index
Decide how many attributes to match in a composite index after the first column, so can read in a small contiguous range of leaf entries in B+-tree to get RIDs
Match first column of composite index then:
–look at index columns from left to right
–match ends when no predicate found
–if range (<=, like, between) for a column, match terminates thereafter
easier to scan all entries for range – process rest of entries using predicate screening
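The left-to-right matching rule can be written as a short function. This is a sketch of the rule as stated on the slide, not any particular optimizer's implementation; the function names and the predicate encoding (a dict from column name to operator) are assumptions.

```python
def matching_columns(index_columns, predicates):
    """Return the leading index columns that define the contiguous leaf range.

    predicates maps a column name to its operator, e.g. {"C1": "="}.
    """
    matched = []
    for column in index_columns:          # look at index columns left to right
        operator = predicates.get(column)
        if operator is None:              # match ends when no predicate found
            break
        matched.append(column)
        if operator != "=":               # a range predicate is the last match
            break
    return matched


def screening_columns(index_columns, predicates):
    """Index columns with a predicate that were not matched: used to screen."""
    matched = matching_columns(index_columns, predicates)
    return [c for c in index_columns if c in predicates and c not in matched]


# C1=10 and C2=3 and C4=20 against an index on (C1, C2, C3, C4)
first = matching_columns(["C1", "C2", "C3", "C4"],
                         {"C1": "=", "C2": "=", "C4": "="})
screen = screening_columns(["C1", "C2", "C3", "C4"],
                           {"C1": "=", "C2": "=", "C4": "="})
```

An empty result from `matching_columns` means the first index column has no predicate, so only a non-matching scan of all leaf pages is possible.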

Matching Index Scan with Predicate screening
1) If select conditions match some index columns of composite index
Create index idx6 on T1(C1, C2, C3, C4);
Select * from T1 where C1=10 and C2=3 and C4=20
Access contiguous leaf pages, but not all results on contiguous leaf pages
Must examine index entries to determine if in the result - called predicate screening

Matching Index Scan with Predicate screening
Another example:
2) If all select conditions match composite index columns and some selects are a range
Create index idx7 on T1(C1, C2, C3);
Select * from T1 where C1=10 and C2 between 1 and 5 and C3='F'

Advantages to Predicate screening discard RIDs based on values (for index) will access fewer tuples because RIDs used to eliminate potential tuples

Non-matching index scan
Not always used by DBMSs
Attributes in where clause don't include initial attribute of index
Create index idx3 on T1(C1, C2, C3);
Select * from T1 where C2=2 and C3='M'
Search leaf entries of index and compare values for entries
Must read in all index leaf pages to find C2, C3 value (so why do it?)
–50 index pages vs 500,000 data pages

Index only retrieval
Elements retrieved in select clause are attributes of composite index
Don't need to access rows (actual data)
Create index idx5 on T1(C1, C3);
Select C1, C3 from T1 where C1=5 and C3 between 2 and 5
Select sum(C3) from T1

Multiple Index Access
If conjunctive conditions (and) in where clause, can use >1 index
–Extract RIDs from each index satisfying matching predicate
–Intersect lists of RIDs (and them) from each index
–Final list - satisfies all predicates indexed
If disjunctive conditions (or)
–Union the two lists of RIDs
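Combining RID lists is ordinary set intersection and union. The sketch below assumes each index lookup has already produced a list of (page #, slot #) RIDs; the function names and the sample RIDs are made up. Sorting the result by page number is one way to prepare it for list prefetch.

```python
def and_rid_lists(*rid_lists):
    """RIDs present in every list (conjunctive predicates)."""
    common = set(rid_lists[0])
    for rids in rid_lists[1:]:
        common &= set(rids)
    return sorted(common)   # page order, ready for list prefetch


def or_rid_lists(*rid_lists):
    """RIDs present in at least one list (disjunctive predicates)."""
    combined = set()
    for rids in rid_lists:
        combined |= set(rids)
    return sorted(combined)


age_rids = [(3, 1), (8, 0), (12, 4)]     # rows matching the first predicate
hobby_rids = [(8, 0), (9, 2), (12, 4)]   # rows matching the second predicate

both = and_rid_lists(age_rids, hobby_rids)
either = or_rid_lists(age_rids, hobby_rids)
```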

Some Query optimizer rules for using RID-lists (then use list prefetch) 1. predicted active resulting RIDs must not be > 50% of RID pool 2. Limit to any single RID list the size of the RID memory pool (16M RIDs) 3. RID list cannot be generated by screening predicates

Rules for multiple index Access Optimizer determines diminishing returns using multiple index access 1. List indexes with matching predicates in where clause 2. Place indexes in order by increasing filter factor 3. For successive indexes, extract RID list only if reduced cost for final row returned e.g. no sense reading 100's of pages of a new index to get number of rows to only 1 tuple

Example: Using RID lists with Multiple Indexes Prospects Table : 50M rows - 10 rows per page Pages in table: 5,000,000 There are 4 Indexes: age – 50 values (1000 entries per page) zipcode – 100,000 values (100 entries per page) hobby – 100 values (1000 entries per page) incomeclass – 10 values (1000 entries per page)

Problem cont’d
Select name, straddr from prospects where zipcode between and and age = 40 and hobby = ‘chess’ and incomeclass = 10;
Compute FF (make sure in ascending order):
FF(zipcode) = 500/100,000 = 1/200
FF(hobby) = 1/100
FF(age) = 1/50
FF(incomeclass) = 1/10

Problem cont’d Data rows read if use indexes: (1) 50,000,000/200 = 250,000 (1,2) 250,000/100 = 2500 (1,2,3) 2500/50 = 50 (1,2,3,4) 50/10 = 5 How much time will this take? Is it cost effective to use all of these indexes?

Problem cont’d
I/O costs:
–Random I/O: RIO = 1/160 = .00625
–Sequential Prefetch: SP = 1/1600 = .000625
–List Prefetch: LP = 1/400 = .0025
Note:
–Some textbooks assume if read <= 3 pages use RIO
–They also assume non-leaf nodes RIO, we assume in memory so it takes 0 disk access time

Problem cont’d
Table scan: 50M/10 per page * SP
Total time: 5,000,000 * .000625 = 3125
Using index 1: (100 entries per page)
data: 50M*FF*LP
250,000 * .0025 = 625
index: non-leaf pages + (# leaf entries * FF / entries per page) * SP
(3*0) + (50,000,000/200/100) * .000625 = 1.56
Total time: 625 + 1.56 = 626.56

Problem cont’d
Using indexes 1&2:
data: 250,000/100 * LP
2500 * .0025 = 6.25
index 2: (1000 entries per page)
(3*0) + (50,000,000/100/1000) * .000625 = .3125
To use both indexes: 1.56 + .3125 = 1.8725
Total time: 6.25 + 1.8725 = 8.1225

Problem cont’d
Using indexes 1,2,3:
data: 50 * .0025 = .125
index 3: (1000 entries per page)
(3*0) + (50,000,000/50/1000) * .000625 = .625
To use 3 indexes: 1.8725 + .625 = 2.4975
Total time: .125 + 2.4975 = 2.6225
Using indexes 1,2,3,4:
data: 5 * .0025 = .0125
index 4: (1000 entries per page)
(3*0) + (50,000,000/10/1000) * .000625 = 3.125
To use 4 indexes: 2.4975 + 3.125 = 5.6225
Total time: .0125 + 5.6225 = 5.635
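The whole worked problem can be replayed in a loop that adds one index at a time. This is a sketch under the slides' assumptions: 50M rows, filter factors of 1/200, 1/100, 1/50 and 1/10, data pages fetched by list prefetch at one row per page in the worst case, index leaf pages read by sequential prefetch, and non-leaf pages in memory. The function and variable names are illustrative.

```python
SP = 1 / 1600   # seconds per page, sequential prefetch
LP = 1 / 400    # seconds per page, list prefetch
ROWS = 50_000_000

# (index name, 1 / filter factor, index entries per leaf page)
INDEXES = [
    ("zipcode", 200, 100),
    ("hobby", 100, 1000),
    ("age", 50, 1000),
    ("incomeclass", 10, 1000),
]


def plan_costs():
    """Cost of using the first 1, 2, 3 and 4 indexes, in that order."""
    results = []
    rows = ROWS
    index_cost = 0.0
    for name, inverse_ff, entries_per_page in INDEXES:
        leaf_pages = ROWS // inverse_ff // entries_per_page
        index_cost += leaf_pages * SP       # cumulative index I/O
        rows //= inverse_ff                 # rows left after this predicate
        data_cost = rows * LP               # one page per remaining row
        results.append((name, rows, data_cost, index_cost,
                        data_cost + index_cost))
    return results


plans = plan_costs()
for name, rows, data_cost, index_cost, total in plans:
    print(f"+{name:12s} rows={rows:>7,} data={data_cost:8.4f} "
          f"index={index_cost:7.4f} total={total:9.4f}")

best = min(plans, key=lambda plan: plan[4])
```

Computed without intermediate rounding the four-index total is 5.6375 rather than the slide's 5.635, because the slide rounds the first index cost from 1.5625 to 1.56. Either way the total is lowest with three indexes: reading the incomeclass index costs 3.125 seconds to save about 0.11 seconds of data I/O.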

Problem cont’d
Index used | Data rows | I/O cost  | Index I/O cost | Trade off if use index
None       | 50M       | 3125 sec  |                |
1          | 250,000   | 625 sec   | 1.56 sec       | Decrease 3125 to 625 sec with 1.56 additional sec
1,2        | 2500      | 6.25 sec  | .3125 sec      | Decrease 625 to 6.25 sec with .3125 additional sec
1,2,3      | 50        | .125 sec  | .625 sec       | Decrease 6.25 to .125 sec with .625 additional sec
1,2,3,4    | 5         | .0125 sec | 3.125 sec      | Decrease .125 to .0125 sec with 3.125 additional sec

Indexes and Information Retrieval Some information on slides taken from CS245 – Stanford Univ.

Query: Get employees in (Toy Dept) ^ (2nd floor)
[Figure: Dept. index entry "Toy" and Floor index entry "2nd" both point into EMP]
Intersect toy RIDs and 2nd Floor RIDs to get set of matching EMP’s

This idea used in text information retrieval Documents...the cat is fat......was raining cats and dogs......Fido the dog... Inverted lists cat dog

IR QUERIES Find articles with “cat” and “dog” Find articles with “cat” or “dog” Find articles with “cat” and not “dog” Find articles with “cat” in title Find articles with “cat” and “dog” within 5 words
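The first three query types map onto set operations over inverted lists. A minimal sketch, assuming whitespace tokenization with no stemming (so "cats" and "cat" are different terms); the fourth document is hypothetical, added so that the "and" query has a hit.

```python
from collections import defaultdict

documents = {
    1: "the cat is fat",
    2: "it was raining cats and dogs",
    3: "fido the dog",
    4: "the dog chased the cat",   # made-up document for the example
}


def build_inverted_lists(docs):
    """Map each term to the set of document ids that contain it."""
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            inverted[term].add(doc_id)
    return inverted


inverted = build_inverted_lists(documents)

cat_and_dog = inverted["cat"] & inverted["dog"]      # "cat" and "dog"
cat_or_dog = inverted["cat"] | inverted["dog"]       # "cat" or "dog"
cat_not_dog = inverted["cat"] - inverted["dog"]      # "cat" and not "dog"
```

Queries such as "cat in title" or "within 5 words" need more than document ids in each list entry, for example field markers or word positions.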

IR – Web search problems
–Crawling and indexing share similar characteristics and requirements
–Both are offline problems, no need for real-time
–Tolerable for a few minutes delay before content searchable
–OK to run smaller-scale index updates frequently
–Querying is an online problem
–Demands sub-second response time
–Low latency, high throughput
–Loads can vary greatly

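Architecture of IR Systems
[Figure: offline, documents pass through a representation function to give the document representation, which is stored in the index; online, the query passes through a representation function to give the query representation; a comparison function matches the query representation against the index and returns hits]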

How do we represent text? “Bag of words” –Treat all the words in a document as index terms for that document –Assign a “weight” to each term based on “importance” –Disregard order, structure, meaning, etc. of the words –Simple, yet effective! Assumptions –Term occurrence is independent –Document relevance is independent –“Words” are well-defined

Stop Word List
Words filtered out
Common words
Match on common word not as useful as match on rare words
Not one definite list

Representing Documents
Document 1: The quick brown fox jumped over the lazy dog’s back.
Document 2: Now is the time for all good men to come to the aid of their party.
Stopword list: the, is, for, to, of
Terms: quick, brown, fox, over, lazy, dog, back, now, time, all, good, men, come, jump, aid, their, party
[Table: term-by-document matrix marking which terms occur in Document 1 and Document 2]

Inverted Index
Inverted indexing is fundamental to all IR models
Consists of postings lists, one for each term in the collection
Posting list – document id and payload
–Payload can be term frequency (number of times the term occurs in the document), position of occurrence, properties, etc.
–Can be ordered by document id, page rank, etc.
–Data structure necessary to map from document id to e.g. URL
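A postings list with a term-frequency payload can be built in a few lines. The sketch below uses the two example sentences and the five stop words from the earlier slide; the tokenizer (lowercase letters, possessive "'s" dropped, no stemming) is an assumption, so "jumped" is indexed as written rather than as "jump".

```python
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "is", "for", "to", "of"}

documents = {
    1: "The quick brown fox jumped over the lazy dog's back.",
    2: "Now is the time for all good men to come to the aid of their party.",
}


def tokenize(text):
    """Lowercase, keep letter runs, drop possessives and stop words."""
    terms = []
    for token in re.findall(r"[a-z']+", text.lower()):
        if token.endswith("'s"):
            token = token[:-2]
        if token and token not in STOP_WORDS:
            terms.append(token)
    return terms


def build_postings(docs):
    """Map term -> list of (document id, term frequency), by document id."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term, frequency in sorted(Counter(tokenize(docs[doc_id])).items()):
            postings[term].append((doc_id, frequency))
    return postings


postings = build_postings(documents)
```

Processing documents in id order keeps every postings list sorted by document id, which is what makes merging two lists a single linear pass.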

Inverted Index quick brown fox over lazy dog back now time all good men come jump aid their party Term Postings

Posting: an entry in inverted list. Represents occurrence of term in article
Size of a list (in postings): 1 for rare words or misspellings, 10^6 for common words
Size of a posting: bits (compressed)

Process query Given a query, fetch posting lists associated with query, traverse postings to compute result set Query document scores must be computed Partial scores stored in accumulators Top k documents extracted Optimization strategies to reduce # postings must examine
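The accumulator approach can be sketched as follows, assuming each posting carries a precomputed weight as its payload and that a document's score is simply the sum of the weights of the query terms it contains. The scoring function, the `top_k` name and the sample postings are all illustrative assumptions.

```python
import heapq
from collections import defaultdict


def top_k(query_terms, postings, k):
    """Score documents term by term, then extract the k best."""
    accumulators = defaultdict(float)      # document id -> partial score
    for term in query_terms:
        for doc_id, weight in postings.get(term, []):
            accumulators[doc_id] += weight
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])


# Made-up postings lists: term -> [(document id, weight), ...]
sample_postings = {
    "cat": [(1, 2.0), (4, 1.0)],
    "dog": [(3, 1.0), (4, 1.5)],
}

results = top_k(["cat", "dog"], sample_postings, k=2)
```

Only documents that appear in at least one fetched postings list ever get an accumulator, so the work is proportional to the postings traversed rather than to the size of the collection.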

Indexing: Performance Analysis The indexing problem –Must be relatively fast, but need not be real time –For Web, incremental updates are important How large is the inverted index? –Size of vocabulary –Size of postings Fundamentally, a large sorting problem –Terms usually fit in memory –Postings usually don’t

Index Size of index depends on payload Well-optimized inverted index can be 1/10 of size of original document collection If store position info, could be several times larger Usually can hold entire vocabulary in memory (using front-coding) Postings lists usually too large to store in memory Query evaluation involves random disk access and decoding postings –Try to minimize random seeks