Gergely Lukács Pázmány Péter Catholic University

Database Systems: Data Access Structures (Index), Query Planning and Execution
Gergely Lukács, Pázmány Péter Catholic University, Faculty of Information Technology, Budapest, Hungary

External/Conceptual/Internal Views
External View: A user is anyone who needs to access some portion of the data.
Conceptual/Logical View: An abstract representation of the entire information content of the database.
Internal/Physical View: Describes how the data are stored and organised physically (indexes, access paths, block size, …).

Data access structures, Binary Search Tree B+ tree („Index“)

Binary tree
Each node can have up to two successor nodes.
The successor nodes of a node are called its children.
The predecessor node of a node is called its parent.
The "beginning" node is called the root (it has no parent).
A node without children is called a leaf.

Binary tree - 2
Nodes are organized in levels (indexed from 0).
Level (or depth) of a node: number of edges in the path from the root to that node.
Height of a tree h: number of levels L (some books define h as #levels - 1).
Full tree: every node that is not a leaf has exactly two children, and all the leaves are on the same level.

Binary tree - 3
Max. #nodes at level l: 2^l
Height h of a full tree with N nodes: h = log2(N+1), i.e. O(log2 N)
Tree operations (e.g., insert, delete, retrieve) depend on h: h determines the running time!
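The relation between the number of nodes and the height can be checked with a few lines of Python. This is only a sketch; the function names are made up for illustration, and h counts levels (root = level 0), as on the slide.

```python
def max_nodes_at_level(level: int) -> int:
    """Maximum number of nodes at a given level (the root is level 0)."""
    return 2 ** level


def full_tree_height(n_nodes: int) -> int:
    """Number of levels h of a full binary tree with n_nodes nodes.

    Uses h = log2(N + 1); N + 1 is a power of two for a full tree,
    so the logarithm is computed exactly with integer arithmetic.
    """
    return (n_nodes + 1).bit_length() - 1


for levels in range(1, 6):
    # A full tree with `levels` levels has 2^0 + 2^1 + ... nodes.
    n = sum(max_nodes_at_level(l) for l in range(levels))
    print(f"levels={levels}  nodes={n}  h={full_tree_height(n)}")
```

Doubling the number of nodes adds only one level, which is why the running time of the tree operations grows so slowly with N.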

Binary Search Tree (BST) - 4
The value stored at a node is greater than every value in its left subtree and less than every value in its right subtree.
How to search a BST?
We begin by examining the root node. If the tree is null, the key we are searching for does not exist in the tree.
Otherwise, if the key equals that of the root, the search is successful and we return the node.
If the key is less than that of the root, we search the left subtree. Similarly, if the key is greater than that of the root, we search the right subtree.
This process is repeated until the key is found or the remaining subtree is null. If the searched key is not found before a null subtree is reached, then the item is not present in the tree.
Tree operations (e.g., insert, delete, retrieve) depend on the height of the tree: h determines the running time!
[Figure: example binary search tree with the keys 13, 9, 44, 4, 11, 30, 2, 15, 34, 6]
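The search procedure described above can be sketched in Python as follows. The class and function names are illustrative. The tree is built by inserting the keys listed on the slide in the order they appear, which is an assumption: the shape of the tree in the original figure may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    key: int
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root: Optional[Node], key: int) -> Node:
    """Insert key into the BST rooted at root and return the (new) root."""
    if root is None:
        return Node(key)
    if key < root.key:
        root.left = insert(root.left, key)
    elif key > root.key:
        root.right = insert(root.right, key)
    return root  # duplicates are ignored


def search(root: Optional[Node], key: int) -> Optional[Node]:
    """Walk down from the root; return the node holding key, or None."""
    node = root
    while node is not None:
        if key == node.key:
            return node
        # Smaller keys are in the left subtree, larger keys in the right one.
        node = node.left if key < node.key else node.right
    return None


root = None
for k in [13, 9, 44, 4, 11, 30, 2, 15, 34, 6]:
    root = insert(root, k)

print(search(root, 11))        # a Node with key 11
print(search(root, 12))        # None: a null subtree was reached
```

Each loop iteration moves one level down, so the number of comparisons is bounded by the height of the tree.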

Disk storage
Data is organised in blocks!
Data is transferred between disk and main memory in blocks (e.g., 4/8/16/32 kB), no matter whether 1 byte or the whole block is needed.
Significant cost factor in a DBMS: number of blocks accessed from disk.
Sequential access is much faster than random access.
3600 RPM: 16.7 ms per rotation ~ X instructions!!
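The 16.7 ms figure is the time of one full rotation at 3600 RPM. A small sketch of the arithmetic (function names are illustrative; the average rotational latency of half a rotation is the usual textbook approximation, not a value from the slide):

```python
def rotation_time_ms(rpm: float) -> float:
    """Time of one full disk rotation in milliseconds."""
    return 60_000.0 / rpm


def avg_rotational_latency_ms(rpm: float) -> float:
    """On average the requested block is half a rotation away."""
    return rotation_time_ms(rpm) / 2


for rpm in (3600, 7200, 15000):
    print(rpm, round(rotation_time_ms(rpm), 1), round(avg_rotational_latency_ms(rpm), 1))
```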

Table storage, blocks

Binary search tree – vs. disk storage
Assume: X0(0) million values
Problem: deep tree. Idea: more branches, and thus reduce the height of the tree!
Problem: values at different levels. Idea: values only in the leaf nodes

B+ tree as database index
B+ tree – nodes ~ blocks
High fanout (ca. 100-?00)
E.g., 4 KB block, one value (separator) of 40 bytes: ~100 values in a block!
Flat tree (log100 N instead of log2 N levels: ca. 3-5 levels)
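The effect of the fanout on the height can be estimated with a short sketch. The block and entry sizes are the ones from the slide; the 10 million keys are an assumed example value, and the function names are illustrative.

```python
def fanout(block_bytes: int, entry_bytes: int) -> int:
    """How many separator entries fit into one block."""
    return block_bytes // entry_bytes


def levels_needed(n_keys: int, branching: int) -> int:
    """Smallest number of levels h with branching**h >= n_keys."""
    levels = 1
    capacity = branching
    while capacity < n_keys:
        capacity *= branching
        levels += 1
    return levels


N = 10_000_000                        # assumed number of keys
print(fanout(4096, 40))               # entries per 4 KB block
print(levels_needed(N, 100))          # tree with fanout 100
print(levels_needed(N, 2))            # binary tree
```

With one block read per level, the flat tree needs only a handful of disk accesses where the binary tree would need a couple of dozen.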

B+ tree
Data (pointers) only in leaf nodes
Leaf nodes are chained
Leaf entries hold a pointer to the data (table)

B+ tree – select (search)
Searching for Key Value 6
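A search for key value 6 can be sketched as below. The tree is a tiny hypothetical one (three leaves under one inner node), not the tree from the slide's figure, and the class names and the "row-…" strings standing in for table pointers are assumptions.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Leaf:
    keys: List[int]
    pointers: List[str]                  # stand-ins for pointers to table rows
    next_leaf: Optional["Leaf"] = None   # leaves are chained


@dataclass
class Inner:
    separators: List[int]
    children: list                       # len(separators) + 1 child nodes


def search(node, key: int) -> Optional[str]:
    """Descend from the root to a leaf, then look for the key in that leaf."""
    while isinstance(node, Inner):
        # Child i holds the keys k with separators[i-1] <= k < separators[i].
        node = node.children[bisect_right(node.separators, key)]
    for k, pointer in zip(node.keys, node.pointers):
        if k == key:
            return pointer
    return None


leaf3 = Leaf([10, 12], ["row-10", "row-12"])
leaf2 = Leaf([6, 8], ["row-6", "row-8"], leaf3)
leaf1 = Leaf([2, 4], ["row-2", "row-4"], leaf2)
root = Inner([6, 10], [leaf1, leaf2, leaf3])

print(search(root, 6))    # row-6
print(search(root, 5))    # None
```

Only one node per level is visited, so the number of block reads equals the height of the tree.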

B+ tree: inserting new element
1. Search for the bucket (leaf) the new record belongs to.
2. If the bucket is not full, add the record.
3. Otherwise, split the bucket: allocate a new leaf and move half the bucket's elements to the new leaf. Insert the new leaf's smallest key and address into the parent.
4. If the parent is full, split it too and add the middle key to its parent node.
5. Repeat until a parent is found that need not split.
6. If the root splits, create a new root which has one key and two pointers.
Splits propagate up as far as necessary.
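The leaf-level part of the insertion (add the record, or split on overflow) might look like the sketch below. It covers a single leaf only; propagating the split to the parent is left out. The capacity of 2 keys per leaf and the function name are assumptions for illustration.

```python
def insert_into_leaf(leaf, key, capacity):
    """Insert key into a leaf given as a sorted list of keys.

    Returns (left, right, separator). If the leaf does not overflow,
    right and separator are None. On overflow the leaf is split and the
    smallest key of the new (right) leaf is returned as the separator
    that has to be inserted into the parent.
    """
    keys = sorted(leaf + [key])
    if len(keys) <= capacity:
        return keys, None, None
    mid = (len(keys) + 1) // 2           # left half keeps the extra key
    left, right = keys[:mid], keys[mid:]
    return left, right, right[0]


print(insert_into_leaf([70], 74, capacity=2))      # fits, no split
print(insert_into_leaf([50, 60], 51, capacity=2))  # overflow, split
```

Any value that separates the two halves works as a separator; this sketch uses the smallest key of the new leaf, as stated in the steps above.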

Example: inserting new data with key 74
[Figure: B+ tree with the separator keys 25, 45, 65, 75; key 74 is added at the leaf level, which then holds 10, 20, 30, 40, 50, 60, 70, 74, 80, 90.]

Example: inserting new data with key 51 (overflow)
[Figure: B+ tree with the separator keys 25, 45, 65, 75 and the leaf keys 10, 20, 30, 40, 50, 60, 70, 74, 80, 90; inserting 51 makes a leaf overflow.]

Example: inserting new data with key 51 (split)
[Figure: the overflowing leaf is split, 50 and 51 stay together and 60 moves to a new leaf; new separator: 55. The node holding 25, 45, 65, 75 overflows in turn.]

Example: inserting new data with key 51 (result)
[Figure: 55 becomes the key of a new root above the separator keys 25, 45, 65, 75; leaf keys: 10, 20, 30, 40, 50, 51, 60, 70, 74, 80, 90.]

Example: deleting data with key 60 (underflow)
[Figure: B+ tree with root key 55 and the separator keys 25, 45, 65, 75; deleting 60 causes an underflow in its leaf.]

Example: deleting data with key 60 (underflow propagates)
[Figure: the underflowing leaf is removed (X); the underflow continues one level up (X).]

Example: deleting data with key 60 (result)
[Figure: a single node with the separator keys 25, 45, 55, 75 remains; leaf keys: 10, 20, 30, 40, 50, 51, 70, 74, 80, 90.]

B+ tree as database index -- pointers in index to records in table

Creating and using indexes
CREATE INDEX Idx_Emp_LName ON Employees ("Last Name")
Exact match query: ... WHERE "Last Name" = 'Doe'
Range query: ... WHERE "Last Name" > 'DA' AND "Last Name" < 'DC'
Selectivity of query (condition): index only for highly selective queries
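Whether an index is actually used can be inspected in the execution plan. The sketch below uses SQLite from Python's standard library as a stand-in, because it runs without a database server; the table columns other than "Last Name", the sample rows, and the helper name query_plan are assumptions, and the plan text of other systems (e.g., Oracle) looks different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Employees ("
    'id INTEGER PRIMARY KEY, "Last Name" TEXT, "First Name" TEXT, Salary INTEGER)'
)
conn.executemany(
    'INSERT INTO Employees ("Last Name", "First Name", Salary) VALUES (?, ?, ?)',
    [("Doe", "John", 100), ("Dabrowski", "Anna", 120), ("Smith", "Eve", 90)],
)
conn.execute('CREATE INDEX Idx_Emp_LName ON Employees ("Last Name")')


def query_plan(sql: str) -> list:
    """Return the textual steps of SQLite's plan for the given query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[-1] for row in rows]    # last column holds the description


exact_plan = query_plan("SELECT * FROM Employees WHERE \"Last Name\" = 'Doe'")
range_plan = query_plan(
    "SELECT * FROM Employees WHERE \"Last Name\" > 'DA' AND \"Last Name\" < 'DC'"
)
print(exact_plan)
print(range_plan)
```

If the index is chosen, its name appears in the plan step; a plan that only scans the table indicates that the index was not used.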

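Selectivity in a query
Selectivity of a query (condition): the fraction of rows that satisfy it; a highly selective query returns only a small fraction of the rows.
Index useful only for highly selective queries.
Low selectivity: nearly all data blocks have to be read, some of them even several times!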
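Why low selectivity defeats an index can be seen by counting the data blocks that contain at least one matching row. A sketch with assumed numbers (10,000 rows, 100 rows per block, row i stored in block i // 100); the function name is illustrative.

```python
def blocks_touched(matching_rows, rows_per_block: int) -> int:
    """Number of distinct data blocks containing at least one matching row."""
    return len({row // rows_per_block for row in matching_rows})


N_ROWS = 10_000
ROWS_PER_BLOCK = 100                      # so the table has 100 blocks

# Only 1 % of the rows match, but they are spread evenly over the table.
one_percent = range(0, N_ROWS, 100)
# A highly selective condition: three matching rows.
three_rows = [17, 4242, 9001]

print(blocks_touched(one_percent, ROWS_PER_BLOCK))   # every block is needed
print(blocks_touched(three_rows, ROWS_PER_BLOCK))    # only three blocks
```

In this setting a condition matching just 1 % of the rows already touches every block, so reading the table sequentially is cheaper than following index pointers in random order.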

Multiple-Column Indexes
CREATE INDEX Idx_Emp_Name ON Employees ("Last Name", "First Name")
Useful for:
... WHERE "Last Name" = 'Doe'
... WHERE "Last Name" = 'Doe' AND "First Name" = 'John'
... WHERE "First Name" = 'John' AND "Last Name" = 'Doe'
Cannot be used for:
... WHERE "First Name" = 'John'
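The reason lies in the sort order of the index entries: they are ordered by last name first, and by first name only within the same last name. The sketch below models the index as a sorted list of (last name, first name) pairs; the sample names and function names are made up.

```python
from bisect import bisect_left

# Entries of a two-column index, kept in sorted order.
index = sorted([
    ("Doe", "Jane"), ("Doe", "John"), ("Smith", "John"),
    ("Adams", "John"), ("Brown", "Ann"),
])


def by_last_name(last: str) -> list:
    """Binary search to the first entry with this last name, then read on."""
    i = bisect_left(index, (last, ""))
    result = []
    while i < len(index) and index[i][0] == last:
        result.append(index[i])
        i += 1
    return result


def by_full_name(last: str, first: str) -> bool:
    """Both columns given: one binary search finds the entry."""
    i = bisect_left(index, (last, first))
    return i < len(index) and index[i] == (last, first)


def by_first_name(first: str) -> list:
    """Matching first names are scattered, so every entry must be inspected."""
    return [entry for entry in index if entry[1] == first]


print(by_last_name("Doe"))
print(by_full_name("Doe", "John"))
print(by_first_name("John"))
```

The order in which the two conditions are written in the WHERE clause does not matter; what matters is that the leading column of the index is constrained.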

Index
Useful for: query conditions, join conditions, sorted results
Space requirement: ~ similar to the table
Maintenance (updates, inserts): ~ (>) the cost for the table
Bulk updates, inserts: deactivate or drop the index first, then reactivate or recreate it

Cost based query optimization

Query execution
SQL: very high level, declarative. What to retrieve, not how!
An SQL query is translated by the query processor into a low-level program: the execution plan.

Query optimisation
One SQL query: many (!) different execution plans (execution alternatives)
Index: use it or not? Tendency: use it for selective queries. Which index?
Join order? (Join execution?)
Dramatically different costs!
Query optimizer: chooses a relatively good execution plan

Query execution
SQL → Query parser → internal query description → Transformation → internal query description → Query optimizer → query execution plan → Query execution

Query execution (detailed)
Query parser: parsing, syntactic check (tables, attributes – Data Dictionary), view expansion

Transformation: standardized description (operator tree)

Query optimizer: optimisation (cost-based): setting up execution plans, estimating their costs, selecting a cheap execution plan

Reminder: Key Idea: Algebraic Optimization
N = ((n*5)+(n*2)+0)/1
Algebraic laws:
1. (+) identity: x+0 = x
2. (/) identity: x/1 = x
3. (*) distributes: (y*x)+(z*x) = (y+z)*x
4. (*) commutes: y*x = x*y
Rules 1, 3, 4, 2: N = (5+2)*n
The same idea, using relational algebra

Query execution: execution of the selected execution plan

Transformation, operator tree
select A1, A2, ..., An from R1, R2, ..., Rm where B
π_{A1, A2, ..., An}(σ_B(R1 × R2 × ... × Rm))

Basic idea
Keep intermediate results as small as possible.
Execute σ and π early and ⋈, ×, ∪, … late, as σ and π reduce the volume of data, while ⋈, … often result in large intermediate results.
Heuristics + cost estimation

Equivalences
1. Selection: σ_{c1 ∧ c2 ∧ ... ∧ cn}(R) ≡ σ_c1(σ_c2(…(σ_cn(R))…))
2. σ is commutative: σ_c1(σ_c2(R)) ≡ σ_c2(σ_c1(R))
3. π-cascades: if L1 ⊆ L2 ⊆ … ⊆ Ln, then π_L1(π_L2(…(π_Ln(R))…)) ≡ π_L1(R)
4. Exchanging σ and π: if the selection only refers to the projected attributes A1, …, An, selection and projection can be exchanged: σ_c(π_{A1, …, An}(R)) ≡ π_{A1, …, An}(σ_c(R))

Equivalences 2
5. ×, ⋈, ∪ and ∩ are commutative
6. A Cartesian product followed by a selection referring to both operands can be replaced by a join: σ_c(R × S) ≡ R ⋈_c S
....
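Rule 6 can be checked on small relations. In the sketch below relations are lists of dictionaries; the sample tuples and the helper names cross, select and join are assumptions for illustration.

```python
from itertools import product

R = [{"SSN": 1, "Lname": "Doe"}, {"SSN": 2, "Lname": "Smith"}, {"SSN": 3, "Lname": "Jones"}]
S = [{"ESSN": 1, "PNO": 10}, {"ESSN": 1, "PNO": 20}, {"ESSN": 3, "PNO": 10}]


def cross(r, s):
    """Cartesian product: every tuple of r combined with every tuple of s."""
    return [{**a, **b} for a, b in product(r, s)]


def select(rel, cond):
    """Selection: keep the tuples satisfying cond."""
    return [t for t in rel if cond(t)]


def join(r, s, cond):
    """Join: combine only the matching pairs, without storing the full product."""
    return [{**a, **b} for a in r for b in s if cond({**a, **b})]


def cond(t):
    return t["SSN"] == t["ESSN"]


intermediate = cross(R, S)
print(len(intermediate))                            # size of the intermediate result
print(select(intermediate, cond) == join(R, S, cond))
```

Both sides produce the same tuples, but the left-hand side first materialises |R| · |S| tuples, which is exactly the large intermediate result the optimizer tries to avoid.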

Example
select Lname from Employee, WorksOn, Project where Pname = 'GOM' and Pnumber = PNO and ESSN = SSN and Bdate > …
(Select the last name of an employee born after … and working on the project "GOM")

Selection as early as possible
[Operator tree] π_Lname(σ_{Pname='GOM' ∧ Pnumber=PNO ∧ ESSN=SSN ∧ Bdate>…}((EMPLOYEE × WORKS_ON) × PROJECT))

Restrictive joins early
[Operator tree] π_Lname(σ_{Pnumber=PNO}(σ_{ESSN=SSN}(σ_{Bdate>…}(EMPLOYEE) × WORKS_ON) × σ_{Pname='GOM'}(PROJECT)))

Cross product and selection => join
[Operator tree] π_Lname(σ_{ESSN=SSN}(σ_{Pnumber=PNO}(σ_{Pname='GOM'}(PROJECT) × WORKS_ON) × σ_{Bdate>…}(EMPLOYEE)))

Projections as early as possible (attributes for result and intermediate results kept)
[Operator tree] π_Lname((σ_{Pname='GOM'}(PROJECT) ⋈_{Pnumber=PNO} WORKS_ON) ⋈_{ESSN=SSN} σ_{Bdate>…}(EMPLOYEE))

[Operator tree with early projections] π_Lname(π_ESSN(π_Pnumber(σ_{Pname='GOM'}(PROJECT)) ⋈_{Pnumber=PNO} π_{ESSN,PNO}(WORKS_ON)) ⋈_{ESSN=SSN} π_{SSN,Lname}(σ_{Bdate>58.04.16}(EMPLOYEE)))

Cost based selection
Optimisation (cost-based): setting up execution plans, estimating their costs, selecting a cheap execution plan
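The three steps (set up plans, estimate costs, select a cheap plan) can be illustrated with two alternative plans for a single-table query. The cost formulas below are deliberately crude assumptions (cost = number of block reads, one block read per matching row for the index plan) and are not the cost model of any particular DBMS; all numbers and names are hypothetical.

```python
def full_scan_cost(num_blocks: int) -> int:
    """Read every block of the table once."""
    return num_blocks


def index_scan_cost(blevel: int, matching_rows: int) -> int:
    """Descend the index (blevel inner levels plus one leaf),
    then, pessimistically, read one data block per matching row."""
    return blevel + 1 + matching_rows


def choose_plan(num_blocks: int, blevel: int, matching_rows: int):
    """Estimate the cost of each plan and pick the cheapest one."""
    costs = {
        "full table scan": full_scan_cost(num_blocks),
        "index scan": index_scan_cost(blevel, matching_rows),
    }
    best = min(costs, key=costs.get)
    return best, costs


# Hypothetical table: 10,000 blocks, index with 2 inner levels.
print(choose_plan(num_blocks=10_000, blevel=2, matching_rows=100))
print(choose_plan(num_blocks=10_000, blevel=2, matching_rows=500_000))
```

The same query shape gets a different plan depending on the estimated number of matching rows, which is why the optimizer depends on statistics.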

Statistics in databases
The optimizer needs statistics on the data to make decisions!
E.g., size of a table? Selectivity of a (join) condition?

Equivalences - …
σ_{c1 ∧ c2 ∧ ... ∧ cn}(R) ≡ σ_c1(σ_c2(…(σ_cn(R))…))
σ_c1(σ_c2(R)) ≡ σ_c2(σ_c1(R))
π_L1(π_L2(…(π_Ln(R))…)) ≡ π_L1(R)
σ_c(π_{A1, …, An}(R)) ≡ π_{A1, …, An}(σ_c(R))
R θ S ≡ S θ R, for θ in {⋈, ×, ∪, ∩}
σ_c(R × S) ≡ R ⋈_c S
σ_c(R × S) ≡ σ_c(R) × S, if c refers only to R (likewise for ⋈)
σ_c(R × S) ≡ σ_c1(R) × σ_c2(S), if c = c1 ∧ c2, c1 refers only to R and c2 only to S (likewise for ⋈)
π_L(R ⋈_c S) ≡ (π_{A1, …, An}(R)) ⋈_c (π_{B1, …, Bn}(S))
π_L(R ⋈_c S) ≡ π_L((π_{A1, …, An, A1', …, An'}(R)) ⋈_c (π_{B1, …, Bn, B1', …, Bn'}(S)))
(R θ S) θ T ≡ R θ (S θ T), for θ in {⋈, ×, ∪, ∩}
σ_c(R θ S) ≡ (σ_c(R)) θ (σ_c(S)), for θ in {∪, ∩, −}
π_L(R ∪ S) ≡ (π_L(R)) ∪ (π_L(S))

Join order
Many joins; joins are expensive.
Left-oriented join trees, greedy search...
[Figure: alternative join trees over R1, R2, R3, R4]
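One possible greedy strategy for building a left-oriented join tree is sketched below: start with the smallest relation and always add the relation that gives the smallest estimated intermediate result. This is just one simple heuristic, not necessarily the one used by a given optimizer; the table sizes, join selectivities and function name are invented for the example.

```python
def greedy_left_deep_order(sizes: dict, selectivity: dict):
    """Return (join order, estimated sizes of the intermediate results).

    sizes:       {relation name: number of rows}
    selectivity: {frozenset({a, b}): fraction of the cross product of a and b
                  that survives the join condition}; pairs without an entry
                  are treated as a cross product (selectivity 1.0).
    """
    remaining = dict(sizes)
    first = min(remaining, key=remaining.get)   # start with the smallest relation
    order = [first]
    current = remaining.pop(first)
    intermediates = []
    while remaining:
        def result_size(name):
            size = current * remaining[name]
            for joined in order:
                size *= selectivity.get(frozenset({joined, name}), 1.0)
            return size

        best = min(remaining, key=result_size)  # greedy choice
        current = result_size(best)
        intermediates.append(current)
        order.append(best)
        del remaining[best]
    return order, intermediates


sizes = {"R1": 1000, "R2": 100, "R3": 10, "R4": 10_000}
selectivity = {
    frozenset({"R1", "R2"}): 0.01,
    frozenset({"R2", "R3"}): 0.1,
    frozenset({"R3", "R4"}): 0.01,
}
order, intermediates = greedy_left_deep_order(sizes, selectivity)
print(order)
print(intermediates)
```

With n relations there are n! left-oriented orders, so a greedy search trades the guarantee of finding the best order for a running time that stays manageable with many joins.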

Example
select FlugNr
from (select F.*, FT.*, count (TicketNr) from FLUG F, FLUGZEUGTYP FT, BUCHUNG B where B.FlugNr = F.FlugNr group by F.*, FT.*, Datum) as DFT(F.*, FT.*, count)
where F.FtypId = FT.FtypId and FT.First+FT.Business+FT.Economy < DFT.count
(FLUG = flight, FLUGZEUGTYP = aircraft type, BUCHUNG = booking, Datum = date)

Example
[Operator tree over FLUG F, FLUGZEUGTYP FT and BUCHUNG B; written as an expression:]
π_{F.FlugNr}(σ_{F.FtypId = FT.FtypId ∧ FT.First+FT.Business+FT.Economy < count}(ρ_{F.*, FT.*, count}(π_{F.*, FT.*, count(TicketNr)}(γ_{F.*, FT.*, Datum}(σ_{B.FlugNr = F.FlugNr}(FLUG F × FLUGZEUGTYP FT × BUCHUNG B))))))

Statistics in databases
Oracle:
TABLES: num_rows, num_blocks, avg_row_len
TAB_COL_STATISTICS: num_distinct, num_nulls, num_buckets
INDEXES: leaf_blocks, blevel
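How such statistics can feed an estimate: under the simplifying assumption that values are uniformly distributed, an equality condition on a column is expected to match num_rows / num_distinct rows. The sketch below shows only this textbook estimate, not Oracle's actual estimator; the function name and the numbers are hypothetical.

```python
def estimated_rows_equality(num_rows: int, num_distinct: int, num_nulls: int = 0) -> float:
    """Expected number of rows matching `column = value`,
    assuming all distinct values occur equally often (NULLs never match)."""
    return (num_rows - num_nulls) / num_distinct


# Hypothetical column: 1,000,000 rows, 50,000 distinct values, no NULLs.
print(estimated_rows_equality(1_000_000, 50_000))
# Hypothetical column with only 2 distinct values: a poor candidate for an index.
print(estimated_rows_equality(1_000_000, 2))
```

Skewed data breaks the uniformity assumption, which is where histogram statistics (num_buckets) come in.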

Calculation of statistics
Task of the database administrator (expensive task!)
analyze table relation compute statistics for columns attribute, ..., attribute size value
analyze table relation estimate statistics sample value percent

Oracle SQL Developer, Explain Plan
