Chapter 1.3: Data Models and DBMS Architecture Title: Anatomy of a Database System Authors: J. Hellerstein, M. Stonebraker Pages: 43-95
Anatomy of a Database System Problem –Problem Statement –Why is this problem important? –Why is this problem hard? Approaches –Approach description, key concepts –Contributions (novelty, improved) –Assumptions
Problem Statement – DBMS Architecture Given –A data model –Platform, i.e. operating system, computer hardware architecture Find –A DBMS architecture –A set of building-block components –Interactions among building blocks Objectives –Efficiency, Scalability –Extensibility Constraints –Relational Data Model
Why is this problem important? Why review Relational DBMS architectural innovations? –Backbone of infrastructure applications Banking, airline reservation, medical records, CRM, SCM, … –Well-understood point of reference for new extensions and future revolutions Architecture allows –Analysis of properties Availability, fault-tolerance, reliability –Mapping of multiple views User requirements to components - validation and acceptance tests Software developers, maintainers, … Software operational support group
Why is this problem hard? Complexity –Mid-1970s – Efficient implementation of a Relational DBMS –Declarative Query Language –Logical and physical independence Changes –Platforms evolve Computer Hardware, Languages, Operating Systems Storage: Tapes → Disks (1960s) → RAID (1990s) → SAN → … CPUs: Mainframe → Mini → Desktops → Multi-core CPUs (2000s) → … –Integrate many views Enterprise – performance level, transaction reliability, … Data Processing Needs – data warehouses, reports, OLTP, Web, … …
Contributions, Validation Methodology Contributions –A simple yet relatively comprehensive RDBMS architecture –Decomposition into 4 components –Identification of dependencies Validation –Ability to explain academic and commercial RDBMSs –Expert opinion, authors have architected multiple DBMSs
Proposed Approach Four Components (Figure 1, p. 44) –A Process Manager –Query Processing Engine –Transactional Storage Subsystem –Shared Utilities, e.g. Disk space management Interactions among components –Not explicit in Figure 1 –Implicit: Top-left to lower-right flow
Component 1 – Process Manager Responsibilities - Organization of processes Platform: Uni-processor, High-performance OS threads Two Options –Process per user (connection) Issues - scalability –Server Process (+ I/O Process per disk) Dispatcher thread, log manager thread Pool of worker threads Shared data (e.g. log, I/O buffer) in common heap space Issues – asynchronous I/O, protection across threads, … Client – Server communication –network socket Q? What is new in this paper relative to Parallel Database paper by DeWitt et al.?
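The server-process option above (a dispatcher feeding a pool of worker threads that share data in a common heap) can be sketched as follows. This is a minimal illustration, not the paper's design: `run_server`, the `(conn_id, payload)` request shape, and the `payload.upper()` stand-in for query execution are all assumptions made for the example.

```python
import queue
import threading

def run_server(requests, num_workers=4):
    """Dispatch each (conn_id, payload) request to a fixed pool of worker threads."""
    work = queue.Queue()             # requests handed from dispatcher to workers
    results = {}                     # shared data living in the common heap
    results_lock = threading.Lock()  # protects the shared dict across threads

    def worker():
        while True:
            item = work.get()
            if item is None:         # sentinel: no more work, shut down
                return
            conn_id, payload = item
            answer = payload.upper() # hypothetical stand-in for running a query
            with results_lock:
                results[conn_id] = answer

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for req in requests:             # dispatcher role: enqueue incoming requests
        work.put(req)
    for _ in threads:                # one sentinel per worker
        work.put(None)
    for t in threads:
        t.join()
    return results
```

The pool size is fixed regardless of the number of connections, which is the scalability argument against process-per-connection. The lock around `results` hints at the "protection across threads" issue listed on the slide.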
Component 1 – Issues Mapping DBMS threads to OS Processes –Absence of OS threads – page 50 – Commercial examples – last para, sec , page 51 Parallelism (Figures 5-7, pp ) –Shared memory – previous architectures port easily –Shared nothing Query processing parallelizes w/ horizontal data partitioning Two-phase commit needs communication Partial failure –Shared disk Distributed lock manager, cache coherency protocol, … Admission Control –Avoid thrashing (working set > memory buffers) –Control number of connections, number of queries
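The admission-control bullets above can be illustrated with a small sketch that admits a query only while the estimated working set fits in memory and the number of active queries stays under a cap. The class name `AdmissionController`, the per-query memory estimate, and the FIFO waiting policy are assumptions for illustration; real systems use their own estimates and policies.

```python
from collections import deque

class AdmissionController:
    """Admit queries only while their combined memory estimate fits the budget."""

    def __init__(self, memory_budget, max_active):
        self.memory_budget = memory_budget
        self.max_active = max_active
        self.active = {}        # query_id -> estimated memory in use
        self.waiting = deque()  # FIFO queue of (query_id, estimated memory)

    def _fits(self, est_memory):
        return (len(self.active) < self.max_active
                and sum(self.active.values()) + est_memory <= self.memory_budget)

    def submit(self, query_id, est_memory):
        """Return True if the query starts now, False if it is queued."""
        # Respect FIFO order: never jump ahead of queries already waiting.
        if not self.waiting and self._fits(est_memory):
            self.active[query_id] = est_memory
            return True
        self.waiting.append((query_id, est_memory))
        return False

    def finish(self, query_id):
        """Release a query's memory and admit as many waiters as now fit."""
        del self.active[query_id]
        admitted = []
        while self.waiting and self._fits(self.waiting[0][1]):
            qid, mem = self.waiting.popleft()
            self.active[qid] = mem
            admitted.append(qid)
        return admitted
```

Note that in this sketch a query whose estimate exceeds the whole budget would wait forever and block the queue behind it; a fuller version would reject it or run it with a reduced memory grant.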
Component 2 – Query Processor Responsibility: –SQL query execution plan (Fig. 8, p. 64) Subcomponents –Parsing and Authorization –Catalogs –Query rewrite – views, constant expressions, semantic optimization, sub-query flattening –Optimizer – plan space, selectivity estimation, search, parallelism, extensibility, auto-tuning, … –Executor – iterator model (Figure 9, p. 68) Q? What is new in the optimizer since Selinger?
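The executor's iterator model can be sketched as operators that all expose the same small interface and pull rows from their children one at a time. The method names `init()`, `get_next()` and `close()` follow the usual textbook naming. The operator classes and the `run` helper below are hypothetical and greatly simplified (rows are plain dicts, there is no buffer management).

```python
class Scan:
    """Leaf operator: returns stored rows one at a time."""
    def __init__(self, rows):
        self.rows = rows
    def init(self):
        self.pos = 0
    def get_next(self):
        if self.pos >= len(self.rows):
            return None              # None signals end of stream
        row = self.rows[self.pos]
        self.pos += 1
        return row
    def close(self):
        pass

class Filter:
    """Passes through only the child rows that satisfy the predicate."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def init(self):
        self.child.init()
    def get_next(self):
        while True:
            row = self.child.get_next()
            if row is None or self.predicate(row):
                return row
    def close(self):
        self.child.close()

class Project:
    """Keeps only the named columns of each child row."""
    def __init__(self, child, columns):
        self.child, self.columns = child, columns
    def init(self):
        self.child.init()
    def get_next(self):
        row = self.child.get_next()
        if row is None:
            return None
        return {c: row[c] for c in self.columns}
    def close(self):
        self.child.close()

def run(plan):
    """Drive the root operator until it is exhausted and collect its output."""
    plan.init()
    out = []
    row = plan.get_next()
    while row is not None:
        out.append(row)
        row = plan.get_next()
    plan.close()
    return out
```

A plan is just a tree of such objects, e.g. `Project(Filter(Scan(rows), pred), ["name"])`. Because every operator has the same interface, operators compose freely and rows flow through the plan without intermediate results being materialized.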
Component 2 – Query Processor Issues Data Modification Statements –Plans are more complex –Ex. Halloween problem (Fig. 10, p. 71) Access Methods –Unordered files, B+-tree, R-tree and bitmap indexes –API methods – init(), get_next(), … –Search by logical conditions (sarg) or record-id –Interacts with concurrency and recovery sub-components
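The Halloween problem can be reproduced in a few lines: an update that scans an index on the very column it modifies may meet the same row again after moving it ahead of the scan cursor. The sketch below models the salary index as a sorted list and compares a pipelined update with one common remedy, collecting the qualifying rows first and updating afterwards. The function names, the flat raise amount, and the list-based index are assumptions made for the example.

```python
import bisect

def pipelined_raise(salaries, threshold, amount):
    """Scan the salary index and update it in place (exhibits the problem)."""
    index = sorted((sal, emp) for emp, sal in salaries.items())
    updates = 0
    # The scan cursor sits at the lowest unvisited key. Each update deletes
    # the entry under the cursor and reinserts it further along the index,
    # where the same scan will encounter it again.
    while index and index[0][0] < threshold:
        sal, emp = index.pop(0)
        bisect.insort(index, (sal + amount, emp))
        updates += 1
    return {emp: sal for sal, emp in index}, updates

def two_phase_raise(salaries, threshold, amount):
    """Collect qualifying employees first, then update each exactly once."""
    targets = [emp for emp, sal in salaries.items() if sal < threshold]
    result = dict(salaries)
    for emp in targets:
        result[emp] += amount
    return result, len(targets)
```

With salaries of 10000, 15000 and 25000, a threshold of 20000 and a raise of 3000, the pipelined version keeps raising the first two employees until both exceed the threshold, while the two-phase version raises each of them once.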
Component 3 – Transactional Storage Manager Responsibilities – ACID properties Subcomponents –Lock Manager Serializability, 2PL, Isolation levels (p. 76) –Log Manager WAL – 3 rules (p. 78), performance tuning –Buffer pool –Access methods Latches in B+trees (p. 80) – conservative, latch-coupling, right-link Predicate locks – next-key locking
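The write-ahead logging discipline can be sketched with a toy store in which "disk" and "stable log" are just Python containers. The three rules encoded here are the commonly stated ones (log record durable before its data page, log flushed in order, commit record durable before commit returns); treat that phrasing as an assumption rather than a quotation of p. 78. The class `WalStore` and its fields are hypothetical.

```python
class WalStore:
    """Toy storage manager that enforces write-ahead logging order."""

    def __init__(self):
        self.log_tail = []    # log records still in memory
        self.stable_log = []  # log records that have reached stable storage
        self.buffer = {}      # page_id -> (value, lsn of last update)
        self.disk = {}        # page_id -> value on stable storage

    def _next_lsn(self):
        return len(self.stable_log) + len(self.log_tail)

    def update(self, txn, page_id, value):
        """Every modification first generates a log record."""
        lsn = self._next_lsn()
        self.log_tail.append((lsn, txn, page_id, value))
        self.buffer[page_id] = (value, lsn)
        return lsn

    def flush_log(self, up_to_lsn):
        """Log records reach stable storage strictly in LSN order."""
        while self.log_tail and self.log_tail[0][0] <= up_to_lsn:
            self.stable_log.append(self.log_tail.pop(0))

    def flush_page(self, page_id):
        """A page may be written only after its log records are stable."""
        value, lsn = self.buffer[page_id]
        self.flush_log(lsn)
        self.disk[page_id] = value

    def commit(self, txn):
        """Commit returns only after the commit record is stable."""
        lsn = self._next_lsn()
        self.log_tail.append((lsn, txn, None, "COMMIT"))
        self.flush_log(lsn)
        return lsn
```

Note that `commit` forces the log but not the data pages: a committed update may exist only in the log until its page is flushed, which is what makes recovery from the log necessary and log flushing the main target of performance tuning.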
Component 3 – Transactional Storage Manager Interdependencies among subcomponents –Lock Manager, Log Manager WAL assumes strict 2PL (p. 82) Q? What would happen without strict 2PL? –Concurrency control, Access Methods Methods are unique to index types
Component 4 – Shared Utilities Sub-components –Memory allocator (p. 84) –Disk management subsystem Map tables to devices or files New issues with RAIDs (p ) –Replication services Physical, trigger based, log-based –Batch utilities Optimizer statistics gathering, backup/export, physical reorg and index construction
Summary Paper’s focus –DBMS Architectures – components and dependencies Insights - Four Components (Figure 1, p. 44) –A Process Manager –Query Processing Engine –Transactional Storage Subsystem –Shared Utilities, e.g. Disk space management Interactions among components –Not explicit in Figure 1 –Q. List a few discussed in the paper!
Assumptions, Rewrite today Assumptions –Focus on Relational DBMS –Centralized DBMS (Recall T2.6 on R*) –Four component architecture reminds one of Ingres! –Lessons translate over to new domains Rewrite today –Cover a post-relational DBMS, e.g. Stream or XML –Illustrate how lessons translate over to web services, repositories, network monitors, etc.