Database Systems Data Access Structures (Index) Query planning and execution Gergely Lukács Pázmány Péter Catholic University Faculty of Information Technology Budapest, Hungary lukacs@itk.ppke.hu
External/Conceptual/Internal Views External View: A user is anyone who needs to access some portion of the data. Conceptual/Logical View: An abstract representation of the entire information content of the database. Internal/Physical View: Describes how the data are stored and organised physically (indexes, access paths, block size, …)
Data access structures, Binary Search Tree, B+ tree ("Index")
Binary tree each node can have up to two successor nodes The successor nodes of a node are called its children The predecessor node of a node is called its parent The "beginning" node is called the root (has no parent) A node without children is called a leaf
Binary tree - 2 Nodes are organized in levels (indexed from 0). Level (or depth) of a node: number of edges in the path from the root to that node. Height of a tree h: number of levels (some books define h as #levels - 1). Full tree: every node that is not a leaf has exactly two children, and all the leaves are on the same level.
Binary tree - 3 Max. #nodes at level l: 2^l. Height h of a full tree with N nodes: h = log2(N+1), i.e. O(log2 N). Tree operations (e.g., insert, delete, retrieve) depend on h: h determines the running time!
Binary Search Tree (BST) - 4 The value stored at a node is greater than every value stored in its left subtree and less than every value stored in its right subtree. How to search a BST? We begin by examining the root node. If the tree is null, the key we are searching for does not exist in the tree. Otherwise, if the key equals that of the root, the search is successful and we return the node. If the key is less than that of the root, we search the left subtree. Similarly, if the key is greater than that of the root, we search the right subtree. This process is repeated until the key is found or the remaining subtree is null. If the searched key is not found before a null subtree is reached, then the item is not present in the tree. Tree operations (e.g., insert, delete, retrieve) depend on the height of the tree: h determines the running time! [Figure: example binary search tree with the keys 13, 9, 44, 4, 11, 30, 2, 15, 34, 6]
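The search procedure described above can be sketched in a few lines of Python. This is only an illustration: the `Node` class, the `insert` helper and the insertion order of the keys are assumptions made here, not part of the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: int) -> Node:
    """Insert key into the subtree rooted at root and return its root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root  # duplicate keys are ignored


def search(root: Optional[Node], key: int) -> Optional[Node]:
    """Walk down from the root; every comparison descends one level."""
    node = root
    while node is not None:
        if key == node.key:
            return node
        node = node.left if key < node.key else node.right
    return None  # a null subtree was reached: the key is not present


# Keys taken from the slide; the insertion order is an assumption.
root = None
for k in [13, 9, 44, 4, 11, 30, 2, 15, 34, 6]:
    root = insert(root, k)

print(search(root, 30) is not None)  # key present
print(search(root, 7) is not None)   # key absent
```

The loop in `search` runs at most once per level, which is why the height h bounds the running time.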
Disk storage Data is organised in blocks! Data is transferred between disk and main memory in blocks (e.g., 4/8/16/32 kB), no matter whether 1 byte or the whole block is needed. Significant cost factor in a DBMS: # blocks accessed from disk. Sequential access is much faster than random access. 3600 RPM: 16.7 ms per rotation, enough time for X00 000 instructions!!
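The 16.7 ms figure follows directly from the rotation speed. A minimal check of the arithmetic (the function name is made up for this sketch):

```python
def rotation_time_ms(rpm: float) -> float:
    """Time for one full disk rotation in milliseconds."""
    return 60_000.0 / rpm  # 60 000 ms per minute divided by rotations per minute


full_rotation = rotation_time_ms(3600)
# On average the requested block is half a rotation away from the head.
average_rotational_latency = full_rotation / 2

print(round(full_rotation, 1))
print(round(average_rotational_latency, 1))
```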
Table storage, blocks
Binary search tree – vs. disk storage Assume: X0(0) million values Deep tree Idea: more branches and thus reduce the height of the tree! Values at different levels Idea: values only in the leaf nodes
B+ tree as database index B+ tree nodes ~ blocks: high fanout (ca. 100-?00). E.g., 4 KB block, one value (separator) of 40 bytes: ~100 values in a block! Flat tree (log_100 N instead of log_2 N levels: ca. 3-5 levels)
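The effect of the fanout on the number of levels can be checked with a short calculation. The value count of 100 million used below is an assumption chosen for illustration; the block and entry sizes are the ones from the slide.

```python
def values_per_block(block_bytes: int, entry_bytes: int) -> int:
    """How many fixed-size entries fit into one block."""
    return block_bytes // entry_bytes


def levels_needed(n_values: int, fanout: int) -> int:
    """Smallest number of levels L such that fanout**L >= n_values."""
    levels, capacity = 1, fanout
    while capacity < n_values:
        capacity *= fanout
        levels += 1
    return levels


N = 100_000_000  # assumed number of values

print(values_per_block(4096, 40))  # entries per 4 KB block
print(levels_needed(N, 100))       # levels with fanout 100
print(levels_needed(N, 2))         # levels with a binary tree
```

Since each level costs roughly one block access, the difference between the two level counts is the difference in disk reads per lookup.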
B+ tree Data (pointer) only in leaf nodes Leaf nodes chained Pointer on the data (table)
B+ tree – select (search) Searching for Key Value 6
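A search in a B+ tree can be sketched as follows. The node classes, the record identifiers and the convention that a separator equals the smallest key of the subtree to its right are assumptions of this sketch; real systems store nodes in disk blocks.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Leaf:
    keys: List[int]
    rids: List[str]                # hypothetical record identifiers (pointers to table rows)
    next: Optional["Leaf"] = None  # leaf nodes are chained


@dataclass
class Inner:
    keys: List[int]                        # separators
    children: List[Union["Inner", Leaf]]   # always len(keys) + 1 children


def find_leaf(node, key):
    """Descend from the root to the leaf that may contain key."""
    while isinstance(node, Inner):
        node = node.children[bisect_right(node.keys, key)]
    return node


def lookup(root, key):
    """Exact-match search: return the record identifier or None."""
    leaf = find_leaf(root, key)
    i = bisect_left(leaf.keys, key)
    if i < len(leaf.keys) and leaf.keys[i] == key:
        return leaf.rids[i]
    return None


def range_scan(root, lo, hi):
    """Range search lo <= key <= hi: find the first leaf, then follow the chain."""
    out = []
    leaf = find_leaf(root, lo)
    while leaf is not None:
        for k, rid in zip(leaf.keys, leaf.rids):
            if k > hi:
                return out
            if k >= lo:
                out.append(rid)
        leaf = leaf.next
    return out


# Small hand-built tree with three chained leaves.
l3 = Leaf([8, 9], ["r8", "r9"])
l2 = Leaf([5, 6], ["r5", "r6"], next=l3)
l1 = Leaf([1, 3], ["r1", "r3"], next=l2)
root = Inner([5, 8], [l1, l2, l3])

print(lookup(root, 6))
print(range_scan(root, 3, 8))
```

The leaf chain is what makes range queries cheap: after one descent, the scan only moves sideways.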
B+ tree: inserting new element Search bucket If the bucket is not full, add the record. Otherwise, split the bucket. Allocate new leaf and move half the bucket's elements to the new bucket. Insert the new leaf's smallest key and address into the parent. If the parent is full, split it too. Add the middle key to the parent node. Repeat until a parent is found that need not split. If the root splits, create a new root which has one key and two pointers. Propagating splits up, as far as necessary
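The insertion steps above can be sketched as a recursive function that reports a split to its caller. This is a simplified illustration under stated assumptions: `ORDER` (maximum keys per node) is chosen arbitrarily, only keys are stored (no record pointers), and leaf chaining is omitted.

```python
from bisect import bisect_right, insort

ORDER = 4  # assumed maximum number of keys per node


class Node:
    def __init__(self, leaf):
        self.leaf = leaf
        self.keys = []
        self.children = []  # used by inner nodes only


def _insert(node, key):
    """Insert into a subtree; return (separator, new_right_node) if node split."""
    if node.leaf:
        insort(node.keys, key)
        if len(node.keys) <= ORDER:
            return None
        mid = len(node.keys) // 2
        right = Node(leaf=True)
        right.keys = node.keys[mid:]      # move half of the entries to the new leaf
        node.keys = node.keys[:mid]
        return right.keys[0], right       # smallest key of the new leaf goes to the parent
    i = bisect_right(node.keys, key)
    split = _insert(node.children[i], key)
    if split is None:
        return None
    sep, right = split
    node.keys.insert(i, sep)
    node.children.insert(i + 1, right)
    if len(node.keys) <= ORDER:
        return None
    mid = len(node.keys) // 2
    up = node.keys[mid]                   # middle key moves up to the parent
    new_right = Node(leaf=False)
    new_right.keys = node.keys[mid + 1:]
    new_right.children = node.children[mid + 1:]
    node.keys = node.keys[:mid]
    node.children = node.children[:mid + 1]
    return up, new_right


def insert(root, key):
    """Insert key and return the (possibly new) root."""
    split = _insert(root, key)
    if split is None:
        return root
    sep, right = split
    new_root = Node(leaf=False)           # root split: one key, two pointers
    new_root.keys = [sep]
    new_root.children = [root, right]
    return new_root


def leaf_keys(node):
    """All keys stored in the leaves, from left to right."""
    if node.leaf:
        return list(node.keys)
    return [k for child in node.children for k in leaf_keys(child)]


def height(node):
    h = 1
    while not node.leaf:
        node = node.children[0]
        h += 1
    return h


root = Node(leaf=True)
for k in range(1, 101):
    root = insert(root, k)
print(height(root), leaf_keys(root)[:5])
```

Returning the split from the recursive call is one way to express "propagating splits up, as far as necessary" without parent pointers.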
Example Inserting new data with key 74 [Figure: B+ tree with separators 25, 45, 65, 75 and leaf entries 10, 20, 30, 40, 50, 60, 70, 74, 80, 90]
Example Inserting new data with key 51: overflow [Figure: key 51 arriving at the B+ tree with separators 25, 45, 65, 75 and leaf entries 10, 20, 30, 40, 50, 60, 70, 74, 80, 90]
Example Inserting new data with key 51: overflow, new separator: 55 [Figure: B+ tree with separators 25, 45, 65, 75 and leaf entries 10, 20, 30, 40, 50, 51, 60, 70, 74, 80, 90]
Example Inserting new data with key 51 [Figure: B+ tree after the split, with separator 55 above the separators 25, 45, 65, 75 and leaf entries 10, 20, 30, 40, 50, 51, 60, 70, 74, 80, 90]
Example Deleting data with key 60: underflow [Figure: B+ tree with separator 55 above the separators 25, 45, 65, 75; the entry 60 is removed and its leaf underflows]
Example Deleting data with key 60: underflow [Figure: B+ tree with separator 55 above the separators 25, 45, 65, 75; the underflow propagates to the level above]
Example Deleting data with key 60 [Figure: B+ tree after the deletion, with separators 25, 45, 55, 75 and leaf entries 10, 20, 30, 40, 50, 51, 70, 74, 80, 90]
B+ tree as database index -- pointers in index to records in table
Creating and using indexes CREATE INDEX Idx_Emp_LName ON Employees ("Last Name") Exact match query ... WHERE "Last Name" = 'Doe' Range query ... WHERE "Last Name" > 'DA' AND "Last Name" < 'DC' Selectivity of query (condition) Index only for highly selective queries
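To see whether a query actually uses the index, one can ask the database for its plan. The sketch below uses SQLite from Python's standard library as a stand-in (the slides do not prescribe a system); the table contents are made up, and the exact wording of the plan output differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE Employees ("Last Name" TEXT, "First Name" TEXT)')
conn.executemany(
    "INSERT INTO Employees VALUES (?, ?)",
    [("Doe", "John"), ("Davis", "Ann"), ("Smith", "Eve")],  # made-up rows
)
conn.execute('CREATE INDEX Idx_Emp_LName ON Employees ("Last Name")')


def plan(sql):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


exact = plan("""SELECT * FROM Employees WHERE "Last Name" = 'Doe'""")
rng = plan("""SELECT * FROM Employees WHERE "Last Name" > 'DA' AND "Last Name" < 'DC'""")

print(exact)  # plan of the exact match query
print(rng)    # plan of the range query
```

If the plan mentions `Idx_Emp_LName`, the optimizer chose the index for that query.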
Selectivity in a query Selectivity of query (condition): index useful only for highly selective queries. Low selectivity: nearly all data blocks have to be read, some even several times!
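A rough cost comparison shows why the index only pays off for highly selective conditions. The cost model is an assumption made for this sketch (cost = number of blocks read, every matching row in a different block in the worst case, three index levels); "selectivity" is taken here as the fraction of rows that match.

```python
def full_scan_cost(table_blocks: int) -> int:
    """A full table scan reads every block once."""
    return table_blocks


def index_scan_cost(matching_rows: int, index_levels: int = 3) -> int:
    """Worst case: one block read per matching row, plus the index descent."""
    return index_levels + matching_rows


def prefer_index(total_rows, rows_per_block, selectivity, index_levels=3):
    table_blocks = -(-total_rows // rows_per_block)  # ceiling division
    matching_rows = round(total_rows * selectivity)
    return index_scan_cost(matching_rows, index_levels) < full_scan_cost(table_blocks)


# 1 000 000 rows, 100 rows per block: 10 000 blocks in the table.
print(prefer_index(1_000_000, 100, 0.0001))  # 100 matching rows
print(prefer_index(1_000_000, 100, 0.5))     # 500 000 matching rows
```

In the second case the index access would touch far more blocks than the table contains, which is the "even several times" effect.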
Multiple-Column Indexes CREATE INDEX Idx_Emp_Name ON Employees ("Last Name", "First Name") Useful for ... WHERE "Last Name" = 'Doe' ... WHERE "Last Name" = 'Doe' AND "First Name" = 'John' ... WHERE "First Name" = 'John' AND "Last Name" = 'Doe' Cannot be used for ... WHERE "First Name" = 'John'
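The rule behind these cases is that the conditions must cover a leading prefix of the index columns. A simplified sketch of that check (the function name is hypothetical, and only equality conditions are considered):

```python
def usable_prefix(index_columns, equality_columns):
    """Return the leading index columns that are fixed by equality conditions."""
    prefix = []
    for col in index_columns:
        if col not in equality_columns:
            break  # the first unconstrained column ends the usable prefix
        prefix.append(col)
    return prefix


IDX_EMP_NAME = ["Last Name", "First Name"]

print(usable_prefix(IDX_EMP_NAME, {"Last Name"}))
print(usable_prefix(IDX_EMP_NAME, {"First Name", "Last Name"}))
print(usable_prefix(IDX_EMP_NAME, {"First Name"}))
```

The order of the conditions in the WHERE clause does not matter; only the order of the columns in the index does.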
Index Useful for: query conditions, join conditions, sorted results. Space requirement: similar to the table. Maintenance cost (updates, inserts): similar to or greater than that of the table. Bulk updates, inserts: deactivate or drop the index first, then reactivate or recreate it.
Cost based query optimization
Query execution SQL: very high level, declarative What to retrieve, not how! SQL query is translated by the query processor into a low level program – the execution plan
Query optimisation One SQL query: many (!) different execution plans (execution alternatives). Index: use it or not? Tendency: use it for selective queries. Which index? Join order? (Join execution?) Dramatically different costs! Query optimizer: chooses a relatively good execution plan
Query execution SQL → Query parser → internal query description → Transformation → internal query description → Query optimizer → query execution plan → Query execution
Query execution (detailed) Query parser: parsing, syntactic check (tables, attributes – Data Dictionary), view expansion
Query execution (detailed) Transformation: standardized description (operator tree)
Query execution (detailed) Query optimizer, optimisation (cost-based): setting up execution plans, estimating their costs, selecting a cheap execution plan
Reminder: Key Idea: Algebraic Optimization N = ((n*5)+(n*2)+0)/1 Algebraic laws: 1. (+) identity: x+0=x 2. (/) identity: x/1=x 3. (*) distributes: (y*x)+(z*x)=(y+z)*x 4. (*) commutes: y*x=x*y Rules 1, 3, 4, 2: N=(5+2)*n. The same idea is applied to queries using relational algebra
Query execution (detailed) Query execution: execution of the selected execution plan
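Written out step by step, one possible order of applying the four laws to the expression is the following (the commutative law is used before the distributive law so that the common factor n stands on the right):

```latex
\begin{align*}
N &= \bigl((n \cdot 5) + (n \cdot 2) + 0\bigr) / 1 \\
  &= \bigl((n \cdot 5) + (n \cdot 2)\bigr) / 1 && \text{(+) identity: } x + 0 = x \\
  &= \bigl((5 \cdot n) + (2 \cdot n)\bigr) / 1 && \text{(*) commutes: } y \cdot x = x \cdot y \\
  &= \bigl((5 + 2) \cdot n\bigr) / 1           && \text{(*) distributes: } (y \cdot x) + (z \cdot x) = (y + z) \cdot x \\
  &= (5 + 2) \cdot n                            && \text{(/) identity: } x / 1 = x
\end{align*}
```

The rewritten expression needs one multiplication instead of two multiplications, an addition of 0 and a division by 1.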
Transformation, operator tree select A1, A2, ..., An from R1, R2, ..., Rm where B is translated into π_{A1, A2, ..., An}(σ_B(R1 × R2 × ... × Rm))
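The canonical translation (cross product, then selection, then projection) can be executed literally on small in-memory relations. The helper names and the sample rows are assumptions of this sketch; attribute names are assumed to be distinct across relations.

```python
from itertools import product


def cross(*relations):
    """R1 x R2 x ... x Rm: merge one row of each relation into a single dict."""
    rows = []
    for combo in product(*relations):
        merged = {}
        for row in combo:
            merged.update(row)
        rows.append(merged)
    return rows


def select(rows, predicate):
    """sigma_B: keep the rows that satisfy the condition."""
    return [r for r in rows if predicate(r)]


def project(rows, attrs):
    """pi_A: keep only attrs and remove duplicates (set semantics of the algebra)."""
    seen, out = set(), []
    for r in rows:
        t = tuple(r[a] for a in attrs)
        if t not in seen:
            seen.add(t)
            out.append(dict(zip(attrs, t)))
    return out


R1 = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
R2 = [{"rid": 1, "val": 10}, {"rid": 1, "val": 20}, {"rid": 2, "val": 30}]

# select name, val from R1, R2 where id = rid and val > 10
intermediate = cross(R1, R2)
result = project(
    select(intermediate, lambda r: r["id"] == r["rid"] and r["val"] > 10),
    ["name", "val"],
)
print(len(intermediate), result)
```

The size of `intermediate` is the product of the relation sizes, which is what the later transformation steps try to avoid.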
Basic idea Keep intermediate results as small as possible: execute σ and π early, and ⋈, ×, ∪, … late, as σ and π reduce the volume of data, while ⋈, ×, … often result in large intermediate results. Heuristics + cost estimation
Equivalences
1. Selection: σ_{c1∧c2∧...∧cn}(R) ≡ σ_{c1}(σ_{c2}(…(σ_{cn}(R))…))
2. σ is commutative: σ_{c1}(σ_{c2}(R)) ≡ σ_{c2}(σ_{c1}(R))
3. π-cascades: if L1 ⊆ L2 ⊆ … ⊆ Ln, then π_{L1}(π_{L2}(…(π_{Ln}(R))…)) ≡ π_{L1}(R)
4. Exchanging σ and π: if the selection only refers to the projected attributes A1, …, An, selection and projection can be exchanged: σ_c(π_{A1, …, An}(R)) ≡ π_{A1, …, An}(σ_c(R))
Equivalences 2
5. ×, ⋈, ∪ and ∩ are commutative
6. A Cartesian product followed by a selection referring to both operands can be replaced by a join: σ_c(R × S) ≡ R ⋈_c S
....
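The equivalence between a selection over a Cartesian product and a join can be checked on a tiny example. The relations are made up, and the hash join is just one possible join implementation chosen for this sketch; the slides do not name a join algorithm.

```python
from itertools import product

R = [(1, "x"), (2, "y"), (3, "z")]   # first field is the join attribute
S = [(1, 10), (1, 11), (3, 30)]      # first field is the join attribute


def cross_then_select(r, s):
    """sigma_c(R x S): build all pairs first, then filter."""
    pairs = list(product(r, s))
    result = [(a, b) for a, b in pairs if a[0] == b[0]]
    return result, len(pairs)        # result and size of the intermediate result


def hash_join(r, s):
    """R join_c S: only matching pairs are ever produced."""
    buckets = {}
    for b in s:
        buckets.setdefault(b[0], []).append(b)
    result = [(a, b) for a in r for b in buckets.get(a[0], [])]
    return result, len(result)


res1, size1 = cross_then_select(R, S)
res2, size2 = hash_join(R, S)
print(sorted(res1) == sorted(res2), size1, size2)
```

Both plans return the same tuples, but the first one materialises |R| * |S| pairs on the way.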
Example select Lname from Employee, WorksOn, Project where Pname = 'GOM' and Pnumber = PNO and ESSN = SSN and Bdate > 58.04.16 (Select the last name of each employee born after 16.04.58 and working on the project "GOM")
Selection as early as possible. Tree before this step: π_Lname(σ_{Pname='GOM' ∧ Pnumber=PNO ∧ ESSN=SSN ∧ Bdate>58.04.16}(EMPLOYEE × WORKS_ON × PROJECT))
Restrictive joins early. Tree before this step: π_Lname(σ_{Pnumber=PNO}(σ_{ESSN=SSN}(σ_{Bdate>58.04.16}(EMPLOYEE) × WORKS_ON) × σ_{Pname='GOM'}(PROJECT)))
Cross product and selection => join. Tree before this step: π_Lname(σ_{ESSN=SSN}(σ_{Pnumber=PNO}(σ_{Pname='GOM'}(PROJECT) × WORKS_ON) × σ_{Bdate>58.04.16}(EMPLOYEE)))
Projections as early as possible (attributes for result and intermediate results kept). Tree before this step: π_Lname((σ_{Pname='GOM'}(PROJECT) ⋈_{Pnumber=PNO} WORKS_ON) ⋈_{ESSN=SSN} σ_{Bdate>58.04.16}(EMPLOYEE))
Resulting tree: π_Lname(π_ESSN(π_Pnumber(σ_{Pname='GOM'}(PROJECT)) ⋈_{Pnumber=PNO} π_{ESSN,PNO}(WORKS_ON)) ⋈_{ESSN=SSN} π_{SSN,Lname}(σ_{Bdate>58.04.16}(EMPLOYEE)))
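The benefit of these transformations can be made visible by counting intermediate tuples on a toy instance of the example. All rows below are made up, and the ISO date string used as cut-off is an assumed rendering of the date in the query.

```python
from itertools import product

EMPLOYEE = [  # (SSN, Lname, Bdate), made-up rows
    (1, "Smith", "1955-01-09"),
    (2, "Wong", "1960-12-08"),
    (3, "Zelaya", "1968-07-19"),
]
WORKS_ON = [(1, 10), (2, 10), (3, 20), (2, 20)]  # (ESSN, PNO)
PROJECT = [(10, "GOM"), (20, "Other")]            # (Pnumber, Pname)
CUTOFF = "1958-04-16"  # assumed ISO form of the date on the slide


def canonical():
    """Cross product of all three tables, then one big selection, then projection."""
    inter = list(product(EMPLOYEE, WORKS_ON, PROJECT))
    names = [e[1] for e, w, p in inter
             if p[1] == "GOM" and p[0] == w[1] and w[0] == e[0] and e[2] > CUTOFF]
    return sorted(set(names)), len(inter)


def optimized():
    """Selections and projections first, then the two joins."""
    pnumbers = [p[0] for p in PROJECT if p[1] == "GOM"]      # sigma, pi on PROJECT
    essns = [w[0] for w in WORKS_ON if w[1] in pnumbers]      # join Pnumber = PNO, keep ESSN
    young = {e[0]: e[1] for e in EMPLOYEE if e[2] > CUTOFF}   # sigma, pi on EMPLOYEE
    names = [young[s] for s in essns if s in young]           # join ESSN = SSN
    largest = max(len(pnumbers), len(essns), len(young), len(names))
    return sorted(set(names)), largest


print(canonical())
print(optimized())
```

Both plans return the same last names; they differ only in the size of the largest intermediate result.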
Cost based selection Query optimizer, optimisation (cost-based): setting up execution plans, estimating their costs, selecting a cheap execution plan
Statistics in databases Optimizer needs statistics on the data to make decisions! E.g., Size of a table? Selectivity of a (join) condition?
Equivalences - …
σ_{c1∧c2∧...∧cn}(R) ≡ σ_{c1}(σ_{c2}(…(σ_{cn}(R))…))
σ_{c1}(σ_{c2}(R)) ≡ σ_{c2}(σ_{c1}(R))
π_{L1}(π_{L2}(…(π_{Ln}(R))…)) ≡ π_{L1}(R)
σ_c(π_{A1, …, An}(R)) ≡ π_{A1, …, An}(σ_c(R))
R ∘ S ≡ S ∘ R (∘ ∈ {⋈, ×, ∪, ∩})
σ_c(R × S) ≡ R ⋈_c S
σ_c(R ∘ S) ≡ σ_c(R) ∘ S (∘ ∈ {⋈, ×})
σ_c(R ∘ S) ≡ σ_{c1}(R) ∘ σ_{c2}(S) (∘ ∈ {⋈, ×})
π_L(R ⋈_c S) ≡ (π_{A1, …, An}(R)) ⋈_c (π_{B1, …, Bn}(S))
π_L(R ⋈_c S) ≡ π_L((π_{A1, …, An, A1', …, An'}(R)) ⋈_c (π_{B1, …, Bn, B1', …, Bn'}(S)))
(R ∘ S) ∘ T ≡ R ∘ (S ∘ T) (∘ ∈ {⋈, ×, ∪, ∩})
σ_c(R ∘ S) ≡ (σ_c(R)) ∘ (σ_c(S)) (∘ ∈ {∪, ∩, −})
π_L(R ∘ S) ≡ (π_L(R)) ∘ (π_L(S)) (∘ ∈ {∪})
Join-order Many joins; joins are expensive. Left-oriented (left-deep) join trees, greedy search... [Figure: alternative join trees over R1, R2, R3, R4]
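A greedy search for a left-deep join order can be sketched as follows. Everything here is an assumption for illustration: the cardinalities, the join selectivities, the cost measure (sum of intermediate result sizes) and the rule "always add the relation that yields the smallest next intermediate result". A greedy choice is a heuristic and is not guaranteed to find the cheapest order.

```python
def greedy_left_deep(cards, selectivity):
    """cards: {relation: row count}
    selectivity: {frozenset({a, b}): join selectivity}; a missing pair is
    treated as a cross product (selectivity 1.0).
    Returns (join order, sum of estimated intermediate result sizes)."""

    def sel(joined, rel):
        s = 1.0
        for j in joined:
            s *= selectivity.get(frozenset({j, rel}), 1.0)
        return s

    remaining = set(cards)
    first = min(remaining, key=lambda r: (cards[r], r))  # start with the smallest relation
    order, size, cost = [first], float(cards[first]), 0.0
    remaining.remove(first)
    while remaining:
        # choose the relation giving the smallest estimated intermediate result
        nxt = min(remaining, key=lambda r: (size * cards[r] * sel(order, r), r))
        size = size * cards[nxt] * sel(order, nxt)
        cost += size
        order.append(nxt)
        remaining.remove(nxt)
    return order, cost


cards = {"R1": 1000, "R2": 100, "R3": 10, "R4": 10000}
selectivity = {
    frozenset({"R1", "R2"}): 0.01,
    frozenset({"R2", "R3"}): 0.5,
    frozenset({"R3", "R4"}): 0.001,
    frozenset({"R1", "R4"}): 0.0001,
}
order, cost = greedy_left_deep(cards, selectivity)
print(order, round(cost))
```

With n relations there are n! left-deep orders, so restricting the search to such trees and proceeding greedily keeps planning time small.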
Example select FlugNr from (select F.*, FT.*, count(TicketNr) from FLUG F, FLUGZEUGTYP FT, BUCHUNG B where B.FlugNr = F.FlugNr group by F.*, FT.*, Datum) as DFT(F.*, FT.*, count) where F.FtypId = FT.FtypId and FT.First+FT.Business+FT.Economy < DFT.count
Example Operator tree over FLUG F, FLUGZEUGTYP FT, BUCHUNG B: π_{F.FlugNr}(σ_{F.FtypId = FT.FtypId ∧ FT.First+FT.Business+FT.Economy < count}(ρ_{F.*, FT.*, count}(π_{F.*, FT.*, count(TicketNr)}(γ_{F.*, FT.*, Datum}(σ_{B.FlugNr = F.FlugNr}(FLUG F × FLUGZEUGTYP FT × BUCHUNG B))))))
Statistics in databases Oracle: TABLES: num_rows, num_blocks, avg_row_len; TAB_COL_STATISTICS: num_distinct, num_nulls, num_buckets; INDEXES: leaf_blocks, blevel
Calculation of statistics Task of the database administrator (expensive task!)
analyze table relation compute statistics for columns attribute, ..., attribute size value
analyze table relation estimate statistics sample value percent
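How such statistics feed into an estimate can be illustrated with the simplest textbook rule: under a uniformity assumption, an equality condition on a column matches (num_rows - num_nulls) / num_distinct rows. This is a sketch of the idea only, not the formula Oracle actually uses; the function names are made up.

```python
def estimate_equality_rows(num_rows, num_distinct, num_nulls=0):
    """Expected rows for `column = constant`, assuming every distinct
    value occurs equally often (uniformity assumption)."""
    if num_distinct == 0:
        return 0.0
    return (num_rows - num_nulls) / num_distinct


def estimate_selectivity(num_rows, num_distinct, num_nulls=0):
    """Estimated fraction of the table that matches the condition."""
    if num_rows == 0:
        return 0.0
    return estimate_equality_rows(num_rows, num_distinct, num_nulls) / num_rows


# Assumed statistics: 10 000 rows, 50 distinct values, no NULLs.
print(estimate_equality_rows(10_000, 50))
print(estimate_selectivity(10_000, 50))
```

Histograms (num_buckets) exist precisely because real data often violates the uniformity assumption.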
Oracle SQL Developer, Explain Plan