Proactive Index Design Using QUBE. Lauri Pietarinen, courtesy of Tapio Lahdenmäki. IDUG 2010, November 2010.

Presentation transcript:

QUBE: Quick Upper-Bound Estimate
A simple formula for estimating the CPU time and elapsed time of queries, created by Tapio Lahdenmäki and others at IBM Finland.

Index Basics
SELECT INO, CNAME FROM INVOICE WHERE CNO = :CNO AND IDATE > :IDATE
INVOICE: 1,000,000 invoices. Per customer: max 10,000 invoices, max 300 recent invoices.
Indexes shown: (CNO); (CNO, IDATE); (CNO, IDATE, INO, CNAME).
Touches shown (T = Touch): 1,000,000 T, 10,000 T, 300 T.

B-Tree Index
Leaf pages and non-leaf pages: levels continue until the last level is a single page. Normally 3 levels if the table has 1,000,000 rows. The number of non-leaf pages is much lower than the number of leaf pages. Non-leaf pages tend to stay in the pool with current hardware. Reasonable (2010): ignore the cost of non-leaf page processing.
Example predicates on index column COL: WHERE COL = 12; WHERE COL BETWEEN 2 AND 10.

Recommended Mental Image
COL is an M column (matching column). The predicate COL BETWEEN 2 AND 10 defines the index slice (matching predicate).
T = Touch. TR = number of random touches. TS = number of sequential touches.

Request Tracking (insurance company)
Screen: requests of one branch office (for example BO = 15) with columns DEADLINE, STATUS, RPK, CNO, ...; 20 rows per screen.
1,000 requests per day. Average: 5 STATUS changes per request.
Primary key of REQUEST = RPK; foreign keys CNO and BO.
BO = branch office (100 branch offices; the largest one covers 3% of requests).
STATUS: 9 = Closed (99%).
DL = deadline.

Common Transaction
REQUEST: 1,000,000 rows, average length 300 bytes. Indexes: RPK, CNO, BO.
SELECT DL, STATUS, RPK, CNO, C1, C2 FROM REQUEST WHERE STATUS < 9 AND BO = :BO ORDER BY DL
FF = filter factor, in %. BO = :BO: FF = 0...3%. STATUS < 9: FF = 1%.
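The filter factors translate directly into row counts. The sketch below shows that arithmetic for the worst-case branch office; the function name `qualifying_rows` is hypothetical, and multiplying the two filter factors assumes the predicates are independent, which is how the slides arrive at a combined 0.03%.

```python
# Figures taken from the slide: table size and worst-case filter factors.
TABLE_ROWS = 1_000_000
FF_BO = 0.03       # BO = :BO for the largest branch office
FF_STATUS = 0.01   # STATUS < 9 (open requests)

def qualifying_rows(table_rows, *filter_factors):
    """Rows left after applying predicates assumed to be independent."""
    rows = float(table_rows)
    for ff in filter_factors:
        rows *= ff
    return round(rows)

bo_slice = qualifying_rows(TABLE_ROWS, FF_BO)                  # rows for one BO
result_rows = qualifying_rows(TABLE_ROWS, FF_BO, FF_STATUS)    # open requests for that BO
print(bo_slice, result_rows)
```

These two numbers (the size of the BO slice and the size of the result) are the inputs to every estimate that follows.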

Which One Is Faster?
REQUEST: 1,000,000 rows. Indexes: RPK, CNO, BO.
Alternative 1: read 1,000,000 table rows; pick the rows that satisfy both predicates; sort 300 result rows.
Alternative 2: read the index slice for BO (3%); read 30,000 table rows; pick the rows that satisfy STATUS < 9; sort 300 result rows.

Sequential Read in 2010
I/O time: The DBMS and the disk subsystem read ahead, fetching many pages with one rotation. They do not read all pages at once; the aim is to stay ahead, so that when the program needs a page it is already in the buffer pool. If sequential read speed is 40 MB/s, I/O time per 4K page is 0.1 ms; with 10 rows per page, I/O time per row is 10 us (microseconds).
CPU time: Rule of thumb: CPU time per examined row is 5 us with sequential read. FETCH (moving a qualifying row to the application program) may take 50 us of CPU time.
Diagram: processor with CPU cache; RAM with buffer pool; read cache.
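The per-page and per-row I/O times follow from the read speed alone. A minimal sketch of that arithmetic, assuming a 4K page means 4,096 bytes and 1 MB means 1,000,000 bytes (the function name is hypothetical):

```python
PAGE_BYTES = 4 * 1024  # assumption: "4K page" = 4,096 bytes

def sequential_io_time_us(read_speed_mb_per_s, rows_per_page, page_bytes=PAGE_BYTES):
    """Return (microseconds per page, microseconds per row) for sequential read."""
    bytes_per_second = read_speed_mb_per_s * 1_000_000
    page_us = page_bytes / bytes_per_second * 1_000_000
    return page_us, page_us / rows_per_page

# Slide figures: 40 MB/s, 10 rows per page.
page_us, row_us = sequential_io_time_us(40, 10)
print(f"{page_us:.1f} us per page, {row_us:.1f} us per row")
```

The result is roughly 0.1 ms per page and 10 us per row, the rounded values used in the QUBE coefficient for sequential touches.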

Random Read in 2010
Disk I/O time: If the needed page is not in the pool, a disk read is required. If the needed page is in the read cache, I/O time may be 1 ms. A random read from the disk drive may take 10 ms (serious).
CPU time: Retrieving a row and evaluating it may take 50 us of CPU time (random read). FETCH of one row may take 50 us of CPU time, as with sequential read.
Diagram: processor with CPU cache; RAM with buffer pool; read cache.

Random Read from Disk Drive
Total I/O time 10 ms = queuing (Q) 3 ms + seek 4 ms + half a rotation 2 ms + transfer 1 ms.
Queuing time depends on drive busy: Q = (u / (1 - u)) x S, where Q = average queuing time, u = average drive busy, S = average service time.
One random read keeps a drive busy for 6 ms, so S = 6 ms.
With 50 random reads a second: u = 50 read/s x 0.006 s/read = 0.3, and Q = (0.3 / (1 - 0.3)) x 6 ms = 3 ms (rounded).
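The queuing formula is easy to evaluate for other drive loads. A minimal sketch, with a hypothetical function name, that derives drive busy from the read rate and the 6 ms service time and then applies Q = (u / (1 - u)) x S:

```python
def queuing_time_ms(reads_per_second, service_ms):
    """Return (drive busy u, average queuing time Q in ms)."""
    u = reads_per_second * service_ms / 1000.0
    if not 0 <= u < 1:
        raise ValueError("drive busy must be in [0, 1)")
    return u, u / (1 - u) * service_ms

# Slide example: 50 random reads a second, 6 ms service time.
u_50, q_50 = queuing_time_ms(50, 6)
# Doubling the load to 100 reads a second gives 60% drive busy.
u_100, q_100 = queuing_time_ms(100, 6)
print(f"u={u_50:.1f} Q={q_50:.1f} ms; u={u_100:.1f} Q={q_100:.1f} ms")
```

Doubling drive busy from 30% to 60% roughly triples the queuing time, which is why random reads get slower as drives fill up with more data per spindle.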

Disk Drives: the Bottleneck
Drive capacities on the timeline: 3 GB, 72 GB, 145 GB, 300 GB, TB (2007), 2 TB (2009).
Storage density grows dramatically. Sequential I/O is getting faster. Random I/O remains slow (and may even become slower).
When drive busy u rises from 30% to 60%, Q rises from 3 ms to 9 ms and a random read from 10 ms to 16 ms.
Q = (u / (1 - u)) x S, where Q = average queuing time, u = average drive busy, S = average service time.

Quick Upper-Bound Estimate (QUBE)
ET = TR x 10 ms + TS x 0.01 ms + F x 0.1 ms
CPU = TR x 50 us + TS x 5 us + F x 50 us
ET = elapsed time (SQL). CPU = CPU time (SQL). TR = number of random touches. TS = number of sequential touches. F = number of rows returned to the program (fetches).
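The two formulas can be wrapped in one small function. This is a minimal sketch using the coefficients from the slide; the function name `qube` and the sample touch counts are hypothetical, chosen only to show the call.

```python
def qube(tr, ts, f):
    """Quick Upper-Bound Estimate.

    tr: random touches, ts: sequential touches, f: fetched rows.
    Returns (elapsed time in ms, CPU time in ms).
    """
    et_ms = tr * 10 + ts * 0.01 + f * 0.1      # 10 ms, 0.01 ms, 0.1 ms
    cpu_ms = tr * 0.05 + ts * 0.005 + f * 0.05  # 50 us, 5 us, 50 us
    return et_ms, cpu_ms

# Hypothetical access path: 1 random touch, 1,000 sequential touches, 10 fetches.
et_ms, cpu_ms = qube(tr=1, ts=1_000, f=10)
print(f"ET {et_ms:.1f} ms, CPU {cpu_ms:.2f} ms")
```

Because a random touch costs 1,000 times as much elapsed time as a sequential touch, TR usually dominates ET, while TS and F tend to dominate CPU.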

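Alternative 1: no index, scan REQUEST (1,000,000 rows)
Worst input: F = 300.
Touches: table TR = 1, TS = 1,000,000 (1 M); no index touches.
ET = (TR + TS/1000 + F/100) x 10 ms = (1 + 1,000 + 3) x 10 ms = about 10 s
CPU = (TR + TS/10 + F) ms / 20 = (1 + 100,000 + 300) ms / 20 = about 5 s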

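Alternative 2A: index BO, non-clustering (REQUEST: 1,000,000 rows; indexes RPK, CNO, BO)
Worst input: BO slice 3% = 30,000 rows, F = 300.
Touches: index TR = 1, TS = 30,000; table TR = 30,000.
ET = (TR + TS/1000 + F/100) x 10 ms = (30,001 + 30 + 3) x 10 ms = about 300 s
CPU = (TR + TS/10 + F) ms / 20 = (30,001 + 3,000 + 300) ms / 20 = about 2 s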

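Alternative 2B: index BO, clustering (C) (REQUEST: 1,000,000 rows; indexes RPK, CNO, BO)
Worst input: BO slice 3% = 30,000 rows, F = 300.
Touches: index TR = 1, TS = 30,000; table TR = 1, TS = 30,000.
ET = (TR + TS/1000 + F/100) x 10 ms = (2 + 60 + 3) x 10 ms = 650 ms, about 1 s
CPU = (TR + TS/10 + F) ms / 20 = (2 + 6,000 + 300) ms / 20 = about 0.3 s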
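The three alternatives can be compared side by side. In this sketch the touch counts are my reading of the slides (full scan: 1 random plus 1,000,000 sequential touches; non-clustering BO index: 30,000 random table touches; clustering BO index: the table rows are read sequentially), so treat them as assumptions rather than the authors' exact figures.

```python
def qube(tr, ts, f):
    """Return (elapsed_ms, cpu_ms) using the QUBE coefficients from the slides."""
    et_ms = tr * 10 + ts * 0.01 + f * 0.1
    cpu_ms = tr * 0.05 + ts * 0.005 + f * 0.05
    return et_ms, cpu_ms

F = 300          # worst-input result rows
SLICE = 30_000   # 3% of 1,000,000 rows

# (total TR, total TS) per alternative, index and table touches added together.
alternatives = {
    "1: full table scan": (1, 1_000_000),
    "2A: index BO, non-clustering": (1 + SLICE, SLICE),
    "2B: index BO, clustering": (1 + 1, SLICE + SLICE),
}

estimates = {name: qube(tr, ts, F) for name, (tr, ts) in alternatives.items()}
for name, (et_ms, cpu_ms) in estimates.items():
    print(f"{name}: ET {et_ms / 1000:.2f} s, CPU {cpu_ms / 1000:.2f} s")
```

With these inputs the non-clustering index is far slower than the full scan, because 30,000 random table touches at 10 ms each swamp everything else.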

Basic Question
Are all predicate columns in one index? If yes, the index is semi-fat: the table is touched only when the WHERE clause is true.
WHERE STATUS < 9 AND BO = :BO
Index (STATUS, BO) on REQUEST: STATUS is the matching column, so STATUS < 9 defines the index slice; BO is a screening column, so BO = :BO is evaluated in the index.

Semi-Fat Index
Index (BO, STATUS) on REQUEST: both columns are matching. STATUS < 9 AND BO = 2 defines the index slice. The index slice contains only qualifying index rows.

QUBE for Semi-Fat Index: Your Turn!
SELECT DL, STATUS, RPK, CNO, C1, C2 FROM REQUEST WHERE STATUS < 9 AND BO = :BO ORDER BY DL
Index (BO, STATUS) on REQUEST, both columns matching (M). FF: BO = :BO 3%, STATUS < 9 1%, combined 0.03%.
MC = 2 (matching columns), SC = 0 (screening columns), IXONLY = N, SORT = Y, F = 300.
Fill in TR and TS for the index and the table, then compute:
ET = (TR + TS/1000 + F/100) x 10 ms
CPU = (TR + TS/10 + F) ms / 20

QUBE for Semi-Fat Index: Solution
SELECT DL, STATUS, RPK, CNO, C1, C2 FROM REQUEST WHERE STATUS < 9 AND BO = :BO ORDER BY DL
Index (BO, STATUS) on REQUEST, both columns matching (M). FF: BO = :BO 3%, STATUS < 9 1%, combined 0.03%.
MC = 2, SC = 0, IXONLY = N, SORT = Y, F = 300.
Touches: index TR = 1, TS = 300; table TR = 300.
ET = (TR + TS/1000 + F/100) x 10 ms = (301 + 0.3 + 3) x 10 ms = about 3 s
CPU = (TR + TS/10 + F) ms / 20 = (301 + 30 + 300) ms / 20 = about 30 ms
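The exercise can be checked numerically. A minimal sketch, assuming the semi-fat index slice holds exactly the 300 qualifying index rows and that each of them costs one random table touch (IXONLY = N); the variable names are hypothetical.

```python
def qube(tr, ts, f):
    """Return (elapsed_ms, cpu_ms) using the QUBE coefficients from the slides."""
    return tr * 10 + ts * 0.01 + f * 0.1, tr * 0.05 + ts * 0.005 + f * 0.05

RESULT_ROWS = 300                 # 1,000,000 rows x 3% x 1%

index_tr, index_ts = 1, RESULT_ROWS   # one random touch to reach the slice, then scan it
table_tr, table_ts = RESULT_ROWS, 0   # one random table touch per qualifying row

semi_fat_et_ms, semi_fat_cpu_ms = qube(index_tr + table_tr, index_ts + table_ts, RESULT_ROWS)
print(f"ET {semi_fat_et_ms / 1000:.1f} s, CPU {semi_fat_cpu_ms:.0f} ms")
```

Almost all of the elapsed time comes from the 300 random table touches; the index scan itself contributes only a few milliseconds.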

Still Too Long: What Next?
The problem: 300 random table touches, 300 x 10 ms = 3 s.
Options: a fat index (no table touches), or 20 FETCHes (20 table touches?).

When Do Touches Take Place?
Program flow: DECLARE CURSOR ...; OPEN CURSOR; FETCH CURSOR (while found); CLOSE CURSOR.
Access path without sort: each FETCH touches one result row.
Access path with sort: OPEN CURSOR touches all result rows.
Sort is very fast today (say, 10 us CPU per row), but ...

No Sort, 20 FETCHes (first screen)
SELECT DL, STATUS, RPK, CNO, C1, C2 FROM REQUEST WHERE STATUS < 9 AND BO = :BO ORDER BY DL, RPK FETCH FIRST 20 ROWS ONLY
Index (BO, DL, RPK, STATUS). FF: BO = :BO 3%, STATUS < 9 1%.
MC = 1, SC = 1, IXONLY = N, SORT = N, F = 20.
ET = (TR + TS/1000 + F/100) x 10 ms
CPU = (TR + TS/10 + F) ms / 20
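A rough sketch of the first-screen estimate follows. The number of index rows scanned before 20 qualifying rows are found is not legible in the transcript, so `index_rows_scanned` is a hypothetical parameter; the value 2,000 below assumes open requests (FF = 1%) are spread evenly through the BO slice, which is my assumption and not a figure from the slides.

```python
def qube(tr, ts, f):
    """Return (elapsed_ms, cpu_ms) using the QUBE coefficients from the slides."""
    return tr * 10 + ts * 0.01 + f * 0.1, tr * 0.05 + ts * 0.005 + f * 0.05

def first_screen_estimate(rows_per_screen, index_rows_scanned):
    """No sort: index rows arrive in ORDER BY order, so stop after one screen."""
    index_tr, index_ts = 1, index_rows_scanned
    table_tr = rows_per_screen          # one random table touch per displayed row
    return qube(index_tr + table_tr, index_ts, rows_per_screen)

# Assumption: 20 rows / 1% filter factor = 2,000 index rows screened.
screen_et_ms, screen_cpu_ms = first_screen_estimate(20, index_rows_scanned=2_000)
print(f"ET {screen_et_ms:.0f} ms, CPU {screen_cpu_ms:.2f} ms")
```

Under this assumption the 20 random table touches account for about 200 ms of the elapsed time, so the estimate is not very sensitive to the exact number of index rows screened.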

Worst-Input Estimates
Index; elapsed time (ET); CPU time; note
No index; 10 s; 5 s
BO (non-C); 300 s; 2 s
BO (C); 1 s; 0.3 s
BO, STATUS; 3 s; 0.03 s; semi-fat
BO, DL, RPK, STATUS; 0.2 s; ... s; no sort, first screen, modify pgm

Fat Index with Sort
SELECT DL, STATUS, RPK, CNO, C1, C2 FROM REQUEST WHERE STATUS < 9 AND BO = :BO ORDER BY DL
Index (BO, STATUS, RPK, DL, CNO, C1, C2). FF: BO = :BO 3%, STATUS < 9 1%.
MC = 2, SC = 0, IXONLY = Y, SORT = Y, F = 300.
Touches: index TR = 1, TS = 300; no table touches.
ET = (TR + TS/1000 + F/100) x 10 ms = (1 + 0.3 + 3) x 10 ms = 43 ms
CPU = (TR + TS/10 + F) ms / 20 = (1 + 30 + 300) ms / 20 = about 17 ms
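The effect of index-only access can be shown by running the same 300-row worst case with and without table touches. This is a sketch under the assumption that the fat index slice holds 300 rows and that IXONLY = Y means zero table touches; the helper names are hypothetical.

```python
def qube(tr, ts, f):
    """Return (elapsed_ms, cpu_ms) using the QUBE coefficients from the slides."""
    return tr * 10 + ts * 0.01 + f * 0.1, tr * 0.05 + ts * 0.005 + f * 0.05

RESULT_ROWS = 300

def index_slice_estimate(index_only):
    """Scan a 300-row index slice; add one random table touch per row unless index-only."""
    table_tr = 0 if index_only else RESULT_ROWS
    return qube(1 + table_tr, RESULT_ROWS, RESULT_ROWS)

fat_et_ms, fat_cpu_ms = index_slice_estimate(index_only=True)
semi_et_ms, semi_cpu_ms = index_slice_estimate(index_only=False)
print(f"fat: ET {fat_et_ms:.0f} ms, CPU {fat_cpu_ms:.1f} ms")
print(f"semi-fat: ET {semi_et_ms:.0f} ms, CPU {semi_cpu_ms:.1f} ms")
```

Removing the table touches cuts the elapsed-time estimate by roughly two orders of magnitude even though the sort is still needed, while CPU time only about halves because the fetches remain.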

Worst-Input Estimates
Index; elapsed time (ET); CPU time; note
No index; 10 s; 5 s
BO (non-C); 300 s; 2 s
BO (C); 1 s; 0.3 s
BO, STATUS; 3 s; 0.03 s; semi-fat
BO, DL, RPK, STATUS; 0.2 s; ... s; no sort, first screen, modify pgm
BO, STATUS, RPK...; ... s; 0.02 s; fat

Too Expensive?
Replace index (BO) with fat index (BO, STATUS, RPK, DL, CNO, C1, C2): 1,000,000 rows, 80 bytes per row.
Workload: 1,000 new rows per day; 5,000 STATUS updates per day.
Costs to consider: disk space; RAM (non-leaf pages); INSERT, UPDATE, DELETE; index reorg (rebuild); something else?

The Cost of Adding an Index
INSERT: roughly 10 ms per added row; add 10 ms if split.
DELETE: roughly 10 ms per removed row.
UPDATE: roughly 10 ms per row when columns of the new index are updated; add 10 ms if move; add 10 ms if split.
Assumptions: upper index levels in DB cache; leaf page not in DB cache.
Asynchronous writes: drive busy goes up.

The Cost of Adding an Index Column
INSERT: none if there is adequate distributed free space (!).
DELETE: none (!).
UPDATE: roughly 10 ms when only the new column is updated; add 10 ms if move; add 10 ms if split.
I/O moves the whole page, not a row.

So, Too Expensive?
No. The cost is low compared to the dramatic reduction in response time and cost of SELECT.
Replace index (BO) with (BO, STATUS, RPK, DL, CNO, C1, C2): 1,000,000 rows, 80 bytes per row; 1,000 new rows per day; 5,000 STATUS updates per day.
Checked: disk space; RAM (non-leaf pages); INSERT, UPDATE, DELETE; index reorg (rebuild); something else?
The only issue: index slice read time increases if the index reorg interval is too long.
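The daily maintenance cost of the fat index can be estimated from the workload and the rule of thumb of roughly 10 ms per random index touch. In this sketch the function name is hypothetical, and `touches_per_update=2` assumes every STATUS update moves the index row to a new position (the "add 10 ms if move" case); page splits are ignored.

```python
RANDOM_TOUCH_MS = 10  # rule of thumb from the slides

def daily_index_maintenance_s(inserts, updates, touches_per_insert=1, touches_per_update=2):
    """Elapsed time per day spent maintaining one index, in seconds."""
    touches = inserts * touches_per_insert + updates * touches_per_update
    return touches * RANDOM_TOUCH_MS / 1000

# Slide workload: 1,000 new rows and 5,000 STATUS updates per day.
maintenance_s = daily_index_maintenance_s(inserts=1_000, updates=5_000)
print(f"about {maintenance_s:.0f} s of extra I/O time per day")
```

Spread over a day, this is a small load compared with saving seconds of elapsed time on every execution of a frequently run SELECT.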

Obsolete rules: Do not index volatile columns (STATUS). Max N indexes per table. Max 5 columns per index. Etc.
Questions to ask instead:
INSERT and DELETE fast enough after the index is added? TR = 1 per added or removed index row.
UPDATE fast enough after index columns are added? TR = 1 or 2 per updated index column.
Index reorg requirement OK? Watch for long index rows (more than 5% of a leaf page) and hot spots (except at the end of the index).
Drive load caused by index maintenance: an issue if there are dozens of random index row inserts or deletes a second.
Index storage cost (disk and RAM): diminishing year after year.
Underindexing: a common mistake.

Index BO Was Not Adequate for This SELECT
SELECT DL, STATUS, RPK, CNO, C1, C2 FROM REQUEST WHERE STATUS < 9 AND BO = :BO ORDER BY DL
Who should have seen this? When?

Summary
QUBE is a way of thinking about indexes. It can be used to prevent performance problems. It can be used in conjunction with other tools. It can be used to understand and analyze performance problems.