Published by Edwina Bates. Modified over 9 years ago.
1
The Design of the POSTGRES Storage System
Author: M. Stonebraker
Speaker: Abhishek Shrivastava
2
Problems in Other Systems
- Recovery from failures is log-based
- Most systems use a write-ahead log (WAL)
- WAL crash recovery code is complicated
- Recovery code must be error-free
3
Alternatives
- A no-overwrite storage system
- An asynchronous archiving system
- No crash recovery code
4
POSTGRES Storage Manager
- All updates are insertions rather than in-place changes to tuple values
- No recovery code needs to run after a crash
- Vacuum cleaner: an asynchronous process that moves archival records off the magnetic disk and onto the archival storage system
5
Magnetic Disk System
- Records are changed by database transactions. Each transaction:
  - Increments and grabs the current global unique transaction ID (XID)
  - Does its processing
  - Changes its status to committed in the log (more on this later)
  - Forces data to disk or moves it to stable main memory, then forces the log to stable storage (in that order)
6
Magnetic Disk System (contd.)
Transaction log
- The tail of the log (oldest active transaction to present) needs 2 bits per transaction to record state (committed, aborted, in progress)
- The body of the log needs only 1 bit per transaction (committed or aborted)
- At 1 transaction per second, 1 year of transactions fits in about 4 MB of log space
- A Bloom filter may be used to compress the log by representing only the aborted transactions (lossy compression)
- With just a little NVRAM, the log essentially never needs forcing
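The sizing claim on this slide can be checked with a few lines of arithmetic, and the 2-bit tail can be pictured as a packed array. The sketch below is illustrative only: `log_body_bytes`, `StatusLog`, and the numeric encoding of the three states are hypothetical names and choices, not taken from POSTGRES.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def log_body_bytes(tx_per_second: float, years: float = 1.0, bits_per_tx: int = 1) -> float:
    """Bytes needed to keep `bits_per_tx` status bits for every transaction."""
    transactions = tx_per_second * SECONDS_PER_YEAR * years
    return transactions * bits_per_tx / 8


class StatusLog:
    """Hypothetical packed array holding 2 status bits per transaction (log tail)."""

    # Assumed encoding; the slide only names the three states.
    IN_PROGRESS, COMMITTED, ABORTED = 0, 1, 2

    def __init__(self, capacity: int) -> None:
        self._bits = bytearray((capacity * 2 + 7) // 8)

    def set(self, xid: int, status: int) -> None:
        byte, shift = divmod(xid * 2, 8)
        cleared = self._bits[byte] & ~(0b11 << shift) & 0xFF  # zero the old 2 bits
        self._bits[byte] = cleared | (status << shift)

    def get(self, xid: int) -> int:
        byte, shift = divmod(xid * 2, 8)
        return (self._bits[byte] >> shift) & 0b11
```

At 1 transaction per second and 1 bit per transaction, `log_body_bytes(1)` gives 3,942,000 bytes, which is where the "about 4 MB per year" figure comes from.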
7
Magnetic Disk System (contd.)
Each tuple has a set of system fields:
- OID: a database-wide unique ID across all time
- Xmin: XID of inserter
- Tmin: commit time of Xmin
- Cmin: command ID of inserter
- Xmax: XID of deleter (if any)
- Tmax: commit time of Xmax (if any)
- Cmax: command ID of deleter (if any)
- PTR: pointer to chain of updated records
8
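Magnetic Disk System (contd.)
- Updates work as follows:
  - Xmax and Cmax of the old record are set to the updater's XID and command ID
  - A new replacement tuple is appended to the DB with:
    - the OID of the old record
    - Xmin and Cmin set to the updater's XID and command ID
  - The replacement is stored as a delta off the original tuple
- Deleters simply set Xmax and Cmax to their XID and command ID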
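The update and delete rules above can be sketched as a tiny in-memory model. This is a minimal illustration under stated assumptions, not the POSTGRES implementation: `Record`, `Heap`, and their methods are hypothetical names, a record's position in a Python list stands in for PTR, and Tmin/Tmax are omitted because the slides fill them in from commit times.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Record:
    oid: int                     # shared by every version of the same tuple
    data: Dict[str, object]      # full tuple for the anchor, changed fields for a delta
    xmin: int                    # XID of inserter
    cmin: int                    # command ID of inserter
    xmax: Optional[int] = None   # XID of deleter/updater, if any
    cmax: Optional[int] = None   # command ID of deleter/updater, if any
    ptr: Optional[int] = None    # position of the next newer version, if any


class Heap:
    """Hypothetical no-overwrite store: user data is only ever appended."""

    def __init__(self) -> None:
        self.records: List[Record] = []
        self._next_oid = 1

    def insert(self, data: Dict[str, object], xid: int, cid: int) -> int:
        self.records.append(Record(self._next_oid, dict(data), xid, cid))
        self._next_oid += 1
        return len(self.records) - 1

    def delete(self, pos: int, xid: int, cid: int) -> None:
        rec = self.records[pos]
        rec.xmax, rec.cmax = xid, cid  # only system fields are stamped

    def update(self, pos: int, changes: Dict[str, object], xid: int, cid: int) -> int:
        old = self.records[pos]
        old.xmax, old.cmax = xid, cid
        # The replacement keeps the old OID and stores only the changed fields.
        self.records.append(Record(old.oid, dict(changes), xid, cid))
        old.ptr = len(self.records) - 1
        return old.ptr

    def reconstruct(self, anchor_pos: int) -> Dict[str, object]:
        """Rebuild the newest version by applying deltas along the chain."""
        pos: Optional[int] = anchor_pos
        result: Dict[str, object] = {}
        while pos is not None:
            result.update(self.records[pos].data)
            pos = self.records[pos].ptr
        return result
```

For example, inserting `{"name": "x", "salary": 10}` and then updating with `{"salary": 20}` leaves two records with the same OID; the first still holds the original values, and `reconstruct` on the anchor yields the merged current tuple.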
9
Magnetic Disk System (contd.)
Time management
- Time is a 32-bit integer (internal to POSTGRES)
- There is a TIME relation which stores the commit time of every transaction
- A timestamp is assigned to a record at the time a transaction is started and is updated by each transaction
- Transactions are processed in order of timestamps
- Concurrency control uses two-phase locking
10
Magnetic Disk System (contd.)
Record access
- Sequential scan of a relation in a POSTGRES-determined order, by following forward links
- A reverse pointer is provided so that query plans can be executed forwards or backwards
- Once the anchor point is located, the record can be constructed by following the pointer and decompressing the data fields
11
Magnetic Disk System (contd.)
Archiving
- Three levels of archiving:
  - no archive: old versions are not needed
  - light archive: old versions are not accessed often
  - heavy archive: old versions are accessed regularly
- Archiving is done by the vacuum cleaner
12
Magnetic Disk System (contd.)
- Historical data can be forced to the archive via the vacuum cleaner, which:
  - writes the archive record(s) and their associated index records
  - writes a new anchor record to the current database
  - reclaims the space occupied by the old anchor and deltas
- What if there is a crash during vacuum?
  - Indexes may lose archive records: this is discovered at runtime and fixed via a sequential scan
  - Duplicate records may be forced to the archive: this is OK because POSTGRES doesn't do multi-sets
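The argument that duplicate archive records are harmless can be illustrated with a toy model in which the archive has set semantics. This is only a sketch under simplifying assumptions: `Version`, `vacuum`, and the `crash_before_reclaim` flag are hypothetical, any record with Xmax set is treated as historical, and the step that writes a new anchor record is left out because the model has no deltas.

```python
from typing import List, NamedTuple, Optional, Set


class Version(NamedTuple):
    oid: int
    xmin: int
    xmax: Optional[int]  # None means the version is still current
    payload: str


def vacuum(current: List[Version], archive: Set[Version],
           crash_before_reclaim: bool = False) -> None:
    """Move historical versions to the archive, then reclaim their space."""
    historical = [v for v in current if v.xmax is not None]

    # Step 1: write archive records. The archive is a set, so writing the
    # same record twice (after a crash and restart) has no extra effect.
    archive.update(historical)

    if crash_before_reclaim:
        raise RuntimeError("simulated crash between archiving and reclaiming")

    # Step 2: reclaim the space the historical versions occupied.
    current[:] = [v for v in current if v.xmax is None]
```

If the simulated crash fires, the historical versions are already in the archive but still present in the current database. Running `vacuum` again archives them a second time, and the set semantics collapse the duplicates, so no dedicated recovery step is needed.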
13
Magnetic Disk System (contd.)
Indexing
- Conventional indexing is used on magnetic disk
- An additional index on the time interval 'I' can be kept
- An R-tree structure can be used for indexing on 'I'
14
Performance Comparison against WAL
Assumptions
- Enough non-volatile main memory
- CPU instructions are not a critical resource
- Records fit on a single page
- Delta records live on the same page as their anchors
- Single-record transactions
- WAL requires 3 log records: one each for begin transaction, data modification, and end transaction
15
Performance Comparison against WAL
Analysis of three possible situations
- Large-SM: an ample amount of stable main memory is available
- Small-SM: a modest amount of stable main memory is available
- No-SM: no stable main memory is available
19
Conclusions
1. Instantaneous recovery from crashes
2. Ability to keep archival records on an archival medium
3. Housekeeping chores done asynchronously
4. Concurrency control based on conventional locking
20
Questions?