Final Exam Review Lecture 31. Administrivia Office hours 1:15 – 2:15 today –also available via

Final Exam Review Lecture 31

Administrivia Office hours 1:15 – 2:15 today –also available via e-mail: ebenh@eecs.berkeley.eduebenh@eecs.berkeley.edu –TAs will have extra office hours, see class web page, news group Final Exam May 16 8-11 a.m. –Location: 55 Warren –Closed book, 2 pages of notes, both sides –IEEE will provide pastries and juice starting at 7:30!

Final Exam Topics Up to midterm 1 (25%) –Relational Model & Query Languages (Roth) Relational Algebra and Calculus SQL –Database Implementation (Haber) Disks, buffers, files Indexes: B-Trees, Hash Indexes Between midterm 1 & midterm 2 (25%) –Query Execution Relational Operators (Haber) Sorting (Haber) Joining (Haber) Query Optimization (Roth) Since midterm 2 (48%) –Database Design (Haber) The ER Model Functional Dependencies & Normalization –Transactions, Concurrency Control, & Recovery (Roth) Guest Lectures (2%)

Why are databases interesting? Theoretical foundation –Modelling structure of information Relations: sets of identically structured tuples Design constraints: FDs and correct decompositions –Formal query languages Algebra: operators on relations, return relations Calculus: declarative specification of query result Practical application of theory –Using computer structures pages, files, memory, buffer pools, indexes –ACID properties (xacts, concur. control, recovery) –Reasonable efficiency

Review Outline Up to midterm 1 –Relational Model & Query Languages Relational Algebra and Calculus SQL –Database Implementation Disks, buffers, files Indexes: B-Trees, Hash Indexes Between midterm 1 & midterm 2 –Query Execution Relational Operators Sorting Joining Query Optimization Since midterm 2 –Database Design The ER Model Functional Dependencies & Normalization –Transactions, Concurrency Control, & Recovery

DBMS components Query Optimization and Execution Relational Operators Access Methods Buffer Management Disk Space Management DB Makes efficient use of disk space -> Think 300,000,000 accounts! Makes efficient use of RAM -> Think 1,000,000 simultaneous requests! Provides generic ways to combine data -> Do you want a list of customers and accounts or the total account balance of all customers? Figures out the best way to answer a question -> There is always more than 1 way to skin a cat…! Provides efficient ways to extract data -> Do you need 1 record or a bunch? Database application Talks to DBMS to manage data for a specific task -> e.g. app to withdraw/deposit money or provide a history of the account

The Storage Hierarchy Source: Operating Systems Concepts 5th Edition –Main memory (RAM) for currently used data. –Disk for the main database (secondary storage). –Tapes for archiving older versions of the data (tertiary storage). Smaller, Faster Bigger, Slower

Disks are slow. Why? Time to access (read/write) a disk block: –seek time ( moving arms to position disk head on track ) –rotational delay ( waiting for block to rotate under head ) –transfer time ( actually moving data to/from disk surface ) Arm movement Seek time Rotational delay Transfer time

Disk Space Manager Lowest layer of DBMS software manages space on disk (using OS file system or not?). Higher levels call upon this layer to: –allocate/de-allocate a page –read/write a page Best if a request for a sequence of pages is satisfied by pages stored sequentially on disk! –Responsibility of disk space manager. –Higher levels don’t know how this is done, or how free space is managed. –Though they may make performance assumptions! Hence disk space manager should do a decent job.

Buffer Management in a DBMS DB MAIN MEMORY DISK disk page free frame Page Requests from Higher Levels BUFFER POOL choice of frame dictated by replacement policy Buffer pool information table contains:

Buffer Management Keeps a group a disk pages in memory Records whether each is pinned –What happens when all pages pinned? –Whan happens when a page is unpinned? Replacement –When all frames used, but not pinned, and new page requested? –How is the replaced page chosen? –Least Recently Used (LRU) –Most Recently Used (MRU) –Clock –Advantages? Disadvantages?

What is in Database Pages? Database contains files, which are made up of… Pages, which are made up of… Records, which are made up of… Fields, which hold single values.

How are records/pages organized? depends on whether fields variable, or fixed length In Minibase, array of type/offsets, followed by data. depends on whether records variable, fixed length. Minibase: slot array at beginning of page, records compacted at end of page. F1 F2 F3 F4 Array of Field Offsets Page i Rid = (i,N) Rid = (i,2) Rid = (i,1) Pointer to start of free space SLOT DIRECTORY N... 2 1 20 1624 N # slots

How are files organized? Unordered Heap File: chained directory pages, containing records that point to data pages. Other possibilities: sorted files, clustered indexes, unclustered index + heap file –Many tradeoffs between them Data Page 1 Data Page 2 Data Page N Header Page DIRECTORY

I/O Cost of Operations B: The number of data pages R: Number of records per page F: Fanout of B-Tree S: Time required for equality search * Don’t Use Index Heap FileSorted FileClustered TreeUnclustered TreeHash Index Scan all records BB1.5 BB* Get all in sort order 4BB1.5 B4B* Equality Search 0.5 Blog 2 Blog F (1.5 B)log F (.15 B) + 1 2 Range Search BS + #matching pages S + #matching records B* Insert 2S + BS + 1S + 24 Delete 0.5B + 1S + B0.5B + 1S + 2

Indexes Can be used to store data records (alt 1), or be an auxillary data structure that referrs to existing file of records (alt 2, 3) Many types of index (B-Tree, Hash Table, R-Tree, etc.) How do you choose the right index? Difference between clustered and unclustered indexes? Index entries Data entries direct search for (Index File) (Data file) Data Records data entries Data entries Data Records CLUSTERED UNCLUSTERED

Review: Query Processing Queries start out as SQL Database translates SQL to one or more Relational Algebra plans Plan is a tree of operations, with access path for each Access path is how each operator gets tuples –If working directly on table, can use scan, index –Some operators, like sort-merge join, or group-by, need tuples sorted –Often, operators pipelined, getting tuples that are output from earlier operators in the tree Database estimates cost for various plans, chooses least expensive

Cost of Operations Selections Projections Sorting, a.k.a. Order By Removing duplicates, a.k.a. Select Distinct Joins

Selections: “age < 20”, “fname = Bob”, etc No index –Do sequential scan over all tuples –Cost: N I/Os Sorted data –Do binary search –Cost: log 2 (N) I/Os Clustered B-Tree –Cost: 2 or 3 to find first record + 1 I/O for each #qualifying pages Unclustered B-Tree –Cost: 2 or 3 to find first RID + ~1 I/O for each qualifying tuple Clustered Hash Index –Cost: ~1.2 I/Os to find bucket, all tuples inside Unclustered Hash Index –Cost: ~1.2 I/Os to find bucket, + ~1 I/O for each matching tuple

Projection Expensive when eliminating duplicates Can do this via: –Sorting: cost no more than external sort Cheaper if you project columns in initial pass, since more projected tuples fit in each page. –Hashing: build a hash table, duplicates will end up in the same bucket

Sorting External Merge Sort –Minimum amount of memory: 3 pages Initial runs of 3 pages Then 2-way merge of sorted runs (2 pages for inputs, one for outputs) #of passes: 1 +  log 2 (N/3)  –With more memory, fewer passes With B pages, #of passes: 1 +  log (B-1) (N/B)  –I/O Cost = 2N * (# of passes) Using B+ Trees for Sorting –Idea: Retrieve records in order by traversing leaf pages. –Is this a good idea? Cases to consider: B+ tree is clusteredGood idea! B+ tree is not clusteredCould be a very bad idea! –I/O Cost Clustered tree: ~ 1.5N Unclustered tree: 1 I/O per tuple, worst case!

Remove duplicates with Hashing Idea: –Many ops don’t need the data ordered –e.g.: removing duplicates in DISTINCT –e.g.: finding matches in JOIN Good enough to match all tuples with equal values Hashing does this! –And may be cheaper than sorting! (Hmmm…!) –But how to do it for data sets bigger than memory?? If we can hash in two passes -> cost is 4N How big of a table can we hash in two passes? –B-1 “partitions” result from Phase 0 –Each should be no more than B pages in size –2 passes possible if table smaller than B(B-1) i.e.: can hash a table of size N pages in about √N space –Note: assumes hash function distributes records evenly! Have a bigger table? Recursive partitioning!

Sorting vs Hashing Based on our simple analysis: –Same memory requirement for 2 passes –Same IO cost Digging deeper … Sorting pros: –Great if input already sorted (or almost sorted) –Great if need output to be sorted anyway –Not sensitive to “data skew” or “bad” hash functions Hashing pros: –Highly parallelizable –Can exploit extra memory to reduce # IOs

Nested Loops Joins R, with M pages, joins S, with N Pages Nested Loops –Simple nested loops Insanely inefficient M + P R *M*n –Paged nested loops – only 3 pages of memory M + M*N –Blocked nested loops – B pages of memory M +  M/(B-2)  * N If M fits in memory (B-2), cost only M + N –Index nested loops M + P R *M* index cost Only good in M very small

Sort-Merge Join Simple case: –sort both tables on join column –Merge –Cost: external sort cost + merge cost 2M*(1 +  log (B-1) (M/B)  ) + 2N*(1 +  log (B-1) (N/B)  ) + M + N Optimized Case: –If we have enough memory, do final merge and join in same pass. This avoids final write pass from sort, and read pass from merge –Can we merge on 2 nd pass? Only in #runs from 1 st pass < B –#runs for R is  M/B . #runs for S is  N/B . Total #runs ~~ (M+N)/B –Can merge on 2 nd pass if M+N/B < B, or M+N < B 2 –Cost: 3(M+N)

Cost of Hash Join Partitioning phase: read+write both relations  2(|R|+|S|) I/Os Matching phase: read+write both relations  |R|+|S| I/Os Total cost of 2-pass hash join = 3(|R|+|S|) Q: what is cost of 2-pass merge-sort join? Q: how much memory needed for 2-pass sort join? Q: how much memory needed for 2-pass hash join?

Summary: Hashing vs. Sorting Sorting pros: –Good if input already sorted, or need output sorted –Not sensitive to data skew or bad hash functions Hashing pros: –Often cheaper due to hybrid hashing –For join: # passes depends on size of smaller relation –Highly parallelizable

Review: Database Design Requirements Analysis – user needs; what must database do? Conceptual Design – high level descr (often done w/ER model) Logical Design – translate ER into DBMS data model Schema Refinement – consistency, normalization Physical Design - indexes, disk layout Security Design - who accesses what

Review: the ER Model Entities and Entity Set (boxes) Relationships and Relationship sets (diamonds) –binary –n-ary Key constraints (1-1,1-M, M-N, arrows on 1 side) Participation constraints (bold for Total) Weak entities - require strong entity for key age name age pname Dependents Employees ssn Policy cost

ISA (`is a’) Hierarchies Contract_Emps name ssn Employees lot hourly_wages ISA Hourly_Emps contractid hours_worked v attributes inherited. v If we declare A ISA B, every A entity is also considered to be a B entity. Overlap constraints: Can Simon be an Hourly_Emps as well as a Contract_Emps entity? (Allowed/disallowed) Covering constraints: Does every Employees entity also have to be an Hourly_Emps or a Contract_Emps entity? (Yes/no) Conversions between Relational schema ER Diagram

Review: Functional Dependencies –Properties of the real world –Decide when to decompose relations –Help us find keys –Help us evaluate Design Tradeoffs Want to reduce redundancy, avoid anomalies Want reasonable efficiency Must avoid lossy decompositions –F+: closure, all dependencies that can be inferred from a set F –A+: attribute closure, all attributes functionally determined by the set of attributes A –G: minimal cover, smallest set of FDs such that G+ == F+

Problems Due to R  W Update anomaly: Can we modify W in only the 1st tuple of SNLRWH? Insertion anomaly: What if we want to insert an employee and don’t know the hourly wage for his or her rating? (or we get it wrong?) Deletion anomaly: If we delete all employees with rating 5, we lose the information about the wage for rating 5! Hourly_Emps

Review: Normal Forms A property of a single relation Tells us something about redundancy in reln Reln R with FDs F is in BCNF if, for all X  A in F + A  X (called a trivial FD), or X is a superkey for R. Reln R with FDs F is in 3NF if, for all X  A in F + A  X (called a trivial FD), or X is a superkey of R, or A is part of some candidate key (not superkey!) for R. (sometimes stated as “A is prime”)

Review: Decomposition If reln violates normal form, decompose –but must have lossless decomposition Lossless decomposition: –decomposition of R into X and Y is lossless if and only if X  Y is a key for either X or Y –If W  Z holds over R and (W  Z) is empty, then decomposition of R into R-Z and WZ is loss-less. Algorithm: –For each FD W  Z in R that violates normal form, decompose R into R-Z and WZ. Repeat as needed. –Order not important, but can produce very different results

Review: Dependency Preservation –decompose too much, and it might be necessary to join tables to check FDs –decomposition of R into X and Y is dependency preserving if (F X  F Y ) + = F + F X is all FDs involving only attributes in X F Y is all FDs involving only attributes in Y –Not always obvious ABC, A  B, B  C, C  A, decomposed into AB and BC. Is this dependency preserving? Is C  A preserved? –note: F + contains F  {A  C, B  A, C  B}, so… F AB contains A  B and B  A; F BC contains B  C and C  B So, (F AB  F BC ) + contains C  A

Minimal Cover for a Set of FDs G: minimal cover, smallest set of FDs such that G+ == F+ –Closure of F = closure of G. –Right hand side of each FD in G is a single attribute. –If we modify G by deleting an FD or by deleting attributes from an FD in G, the closure changes. Every FD in G is needed, and ``as small as possible’’ in order to get the same closure as F. e.g., F+ = {A  B, B  C, C  A, B  A, C  B, A  C} –several minimal covers: {A  B, B  A, C  B, B  C} (AB + BC) –or {A  C, C  A, B  C, C  B} (AC + BC) –or {A  B, B  A, C  A, A  C} (AB + AC) e.g., A  B, ABCD  E, EF  GH, ACDF  EG minimal cover: –A  B, ACD  E, EF  G and EF  H

BCNF and Dependency Preservation In general, there may not be a dependency preserving decomposition into BCNF. But, you can always find dependency-preserving decomposition into 3NF –Top down: Decompose until it is in 3NF Compute minimal cover for FDs If minimal cover contains a FD X  Y is not preserved, add reln XY –Bottom up: Compute minimal cover For each FD X  Y in minimal cover, create reln XY –Why does this work? Minimal cover doesn’t include redundant transitive dependencies, which don’t need to be preserved

Questions? Up to midterm 1 –Relational Model & Query Languages Relational Algebra and Calculus SQL –Database Implementation Disks, buffers, files Indexes: B-Trees, Hash Indexes Between midterm 1 & midterm 2 –Query Execution Relational Operators Sorting Joining Query Optimization Since midterm 2 –Database Design The ER Model Functional Dependencies & Normalization –Transactions, Concurrency Control, & Recovery

Thank you! See you next Wednesday.

Final Exam Review Lecture 31. Administrivia Office hours 1:15 – 2:15 today –also available via

Similar presentations

Presentation on theme: "Final Exam Review Lecture 31. Administrivia Office hours 1:15 – 2:15 today –also available via"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Final Exam Review Lecture 31. Administrivia Office hours 1:15 – 2:15 today –also available via

Similar presentations

Presentation on theme: "Final Exam Review Lecture 31. Administrivia Office hours 1:15 – 2:15 today –also available via"— Presentation transcript:

Similar presentations

About project

Feedback