The Design of POSTGRES Storage System Author: M. Stonebraker Speaker: Abhishek Shrivastava.

1 The Design of POSTGRES Storage System Author: M. Stonebraker Speaker: Abhishek Shrivastava

2 Problems in Other Systems
- Recovery from failures is log-based
- Most systems use a write-ahead log (WAL)
- WAL crash recovery code is complicated
- Recovery code must be error-free

3 Alternatives
- A no-overwrite storage system
- An asynchronous archiving system
- No crash recovery code

4 POSTGRES Storage Manager
- All updates are insertions rather than in-place changes to tuple values
- No recovery code needs to run after a crash
- Vacuum cleaner: an asynchronous process that moves archival records off the magnetic disk and onto the archival storage system

5 Magnetic Disk System
Records are changed by database transactions:
- Increment and grab the current global unique transaction ID (XID)
- Do the processing
- Change the status to committed in the log (more on this)
- Force the data to disk (or move it to stable main memory), then the log to stable storage, in that order

6 Magnetic Disk System contd.
Transaction log:
- The tail of the log (from the oldest active transaction to the present) needs 2 bits per transaction to record its state (committed, aborted, or in progress)
- The body of the log needs only 1 bit per transaction (committed or aborted)
- At 1 transaction per second, 1 year of transactions fits in about 4 MB of log space
- A Bloom filter may be used to compress the log so that it represents only the aborted transactions (lossy compression)
- With just a little NVRAM, the log essentially never needs forcing
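The sizing claim on this slide can be checked with a little arithmetic, and the 2-bit tail can be sketched as a packed byte array. This is only an illustration: the state codes and the `TailLog` class are hypothetical names, not the POSTGRES on-disk layout.

```python
# Back-of-the-envelope sizing for the log body (1 bit per transaction).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def body_log_bytes(tps: float, years: float = 1.0, bits_per_xact: int = 1) -> float:
    """Bytes needed for the log body at the given transaction rate."""
    return tps * years * SECONDS_PER_YEAR * bits_per_xact / 8


# Hypothetical 2-bit state codes for the tail of the log (illustrative only).
IN_PROGRESS, COMMITTED, ABORTED = 0, 1, 2


class TailLog:
    """Packs four 2-bit transaction states into each byte."""

    def __init__(self, n_xacts: int) -> None:
        self.bits = bytearray((n_xacts + 3) // 4)

    def set(self, xid: int, state: int) -> None:
        byte, slot = divmod(xid, 4)
        shift = slot * 2
        cleared = self.bits[byte] & ~(0b11 << shift) & 0xFF
        self.bits[byte] = cleared | (state << shift)

    def get(self, xid: int) -> int:
        byte, slot = divmod(xid, 4)
        return (self.bits[byte] >> (slot * 2)) & 0b11


if __name__ == "__main__":
    # 31,536,000 bits / 8 = 3,942,000 bytes, i.e. just under 4 MB per year.
    print(body_log_bytes(tps=1))
```

At 1 transaction per second the body comes to 3,942,000 bytes per year, which is where the "about 4 MB" figure comes from.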

7 Magnetic Disk System contd.
Each tuple has a set of system fields:
- OID: a database-wide unique ID across all time
- Xmin: XID of the inserter
- Tmin: commit time of Xmin
- Cmin: command ID of the inserter
- Xmax: XID of the deleter (if any)
- Tmax: commit time of Xmax (if any)
- Cmax: command ID of the deleter (if any)
- PTR: pointer to the chain of updated records
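As a rough picture of these system fields, here is a minimal sketch of a record header. The field names follow the slide; the Python types, the use of `None` for "not set", and the helper method are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TupleHeader:
    """System fields carried by every tuple (types are illustrative)."""
    oid: int                    # database-wide unique ID across all time
    xmin: int                   # XID of the inserter
    cmin: int                   # command ID of the inserter
    tmin: Optional[int] = None  # commit time of xmin, once known
    xmax: Optional[int] = None  # XID of the deleter, if any
    cmax: Optional[int] = None  # command ID of the deleter, if any
    tmax: Optional[int] = None  # commit time of xmax, if any
    ptr: Optional[int] = None   # link to the chain of updated records

    def has_deleter(self) -> bool:
        """True once some transaction has deleted or replaced this tuple."""
        return self.xmax is not None
```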

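8 Magnetic Disk System contd.
Updates work as follows:
- Xmax and Cmax of the old record are set to the updater's XID and command ID
- A new replacement tuple is appended to the DB with:
  - the OID of the old record
  - Xmin and Cmin set to the updater's XID and command ID
- The replacement is stored as a delta off the original tuple
Deleters simply set Xmax and Cmax to their own XID and command ID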
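The update and delete rules can be sketched on an append-only list of records. This is a simplified model under stated assumptions: `Record`, `Relation`, and list positions used as pointers are hypothetical, and a delta is modeled as a dictionary holding only the changed fields.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Record:
    oid: int
    xmin: int
    cmin: int
    data: Dict[str, Any]       # full tuple for an anchor, changed fields for a delta
    xmax: Optional[int] = None
    cmax: Optional[int] = None
    ptr: Optional[int] = None  # position of the next record in the update chain


class Relation:
    """Append-only store: data values are never overwritten in place."""

    def __init__(self) -> None:
        self.records: List[Record] = []

    def insert(self, oid: int, data: Dict[str, Any], xid: int, cid: int) -> int:
        self.records.append(Record(oid, xid, cid, dict(data)))
        return len(self.records) - 1

    def delete(self, pos: int, xid: int, cid: int) -> None:
        # Deleters only stamp the record with their XID and command ID.
        rec = self.records[pos]
        rec.xmax, rec.cmax = xid, cid

    def update(self, pos: int, changes: Dict[str, Any], xid: int, cid: int) -> int:
        old = self.records[pos]
        old.xmax, old.cmax = xid, cid
        # The replacement keeps the old OID and is stored as a delta.
        new_pos = self.insert(old.oid, changes, xid, cid)
        old.ptr = new_pos
        return new_pos

    def reconstruct(self, anchor_pos: int) -> Dict[str, Any]:
        """Rebuild the newest version by following forward links from the anchor."""
        rec = self.records[anchor_pos]
        row = dict(rec.data)
        while rec.ptr is not None:
            rec = self.records[rec.ptr]
            row.update(rec.data)
        return row
```

For example, inserting `{"name": "Ann", "salary": 100}` and then updating with `{"salary": 120}` leaves two records in the relation; `reconstruct` on the anchor applies the delta and returns the current values.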

9 Magnetic Disk System contd.
Time management:
- Time is a 32-bit integer (internal to POSTGRES)
- There is a TIME relation that stores the commit time of every transaction
- A timestamp is assigned to a record when a transaction starts and is updated by each transaction
- Transactions are processed in timestamp order
- Concurrency control uses two-phase locking
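One way to read the TIME relation together with the Xmin and Xmax fields is as the basis for "as of time T" lookups. The sketch below assumes a record is valid at time T when its inserter committed at or before T and its deleter, if any, had not yet committed at T; that rule and the dictionary standing in for the TIME relation are assumptions for illustration, not something these slides spell out.

```python
from typing import Dict, Optional

# Hypothetical stand-in for the TIME relation: XID -> commit time (an integer).
TimeRelation = Dict[int, int]


def commit_time(time_rel: TimeRelation, xid: Optional[int]) -> Optional[int]:
    """Commit time of a transaction, or None if it is unset or has not committed."""
    if xid is None:
        return None
    return time_rel.get(xid)


def visible_as_of(time_rel: TimeRelation, xmin: int,
                  xmax: Optional[int], t: int) -> bool:
    """Assumed rule: inserted at or before t, and not yet deleted at t."""
    tmin = commit_time(time_rel, xmin)
    if tmin is None or tmin > t:
        return False
    tmax = commit_time(time_rel, xmax)
    return tmax is None or tmax > t
```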

10 Magnetic Disk System contd.
Record access:
- Sequential scan of a relation in a POSTGRES-determined order
- By following forward links
- A reverse pointer is provided so that query plans can be executed forward or backward
- Once the anchor point is located, the record can be constructed by following the pointer and decompressing the data fields

11 Magnetic Disk System contd.
Archiving:
- Three levels of archiving:
  - no archive: old versions are not needed
  - light archive: old versions are not accessed often
  - heavy archive: old versions are accessed regularly
- Archiving is done by the vacuum cleaner

12 Magnetic Disk System contd.
Historical data can be forced to the archive via the vacuum cleaner:
- Write the archive record(s) and their associated index records
- Write a new anchor record to the current database
- Reclaim the space occupied by the old anchor and deltas
Crash during vacuum?
- Indexes may lose archive records: this is discovered at runtime and fixed via a sequential scan
- Duplicate records may be forced to the archive: this is OK because POSTGRES does not support multisets
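The three vacuum steps, and why a duplicate archive write is harmless, can be sketched as follows. This is a toy model under stated assumptions: each update chain is a list of dictionaries (anchor first, then deltas), the archive is a Python set so duplicates collapse, and index records and timestamps are left out.

```python
from typing import Dict, List, Set, Tuple

Snapshot = Tuple[Tuple[str, object], ...]  # hashable copy of one full record


def snapshot(row: dict) -> Snapshot:
    return tuple(sorted(row.items()))


def vacuum(chains: Dict[int, List[dict]],
           archive: Set[Tuple[int, Snapshot]], oid: int) -> None:
    """Archive the old versions of one record and keep only a fresh anchor."""
    # Rebuild every historical version by applying the deltas in order.
    row: dict = {}
    versions: List[dict] = []
    for rec in chains[oid]:
        row.update(rec)
        versions.append(dict(row))

    # Step 1: write the archive records. Set semantics make a repeated
    # write after a crash a no-op.
    for old in versions[:-1]:
        archive.add((oid, snapshot(old)))

    # Steps 2 and 3: write a new anchor holding the current version and
    # drop the old anchor and deltas.
    chains[oid] = [versions[-1]]
```

If a crash happens after step 1 but before the space is reclaimed, running `vacuum` again re-adds the same archive entries and the archive is unchanged, which mirrors the argument on the slide.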

13 Magnetic Disk System contd.
Indexing:
- Conventional indexing is used on magnetic disk
- An additional index on the time interval 'I' can be kept
- An R-tree structure can be used for indexing on 'I'

14 Performance Comparison against WAL
Assumptions:
- Enough non-volatile main memory
- CPU instructions are not a critical resource
- Records fit on a single page
- Delta records live on the same page as their anchors
- Single-record transactions
- WAL requires 3 log records: one each for begin transaction, data modification, and end transaction

15 Performance Comparison against WAL
Analysis of three possible situations:
- Large-SM: an ample amount of stable main memory is available
- Small-SM: a modest amount of stable main memory is available
- No-SM: no stable main memory is available

19 Conclusions
1. Instantaneous recovery from crashes
2. Ability to keep archival records on an archival medium
3. Housekeeping chores done asynchronously
4. Concurrency control based on conventional locking

20 Questions?

