In-memory Columnar Storage
Hari Babu, Fujitsu Australia Software Technology

About me
I work for "Fujitsu Australia Software Technology Pvt Ltd". I am a member of the team that develops "FUJITSU Enterprise Postgres", which is based on PostgreSQL. I have over nine years of experience developing database products (both in-memory and disk-based).

Contents
Introduction
Need for columnar storage
Need for in-memory
Architecture
Overview of the columnar storage approach
Write optimized storage
Read optimized storage
Data movement from WOS to ROS
SQL operations
In-memory approach
Performance comparison with other commercial databases
Current state of the patch
Further improvements

Introduction
We at Fujitsu are working towards adding analytical and enterprise capabilities to PostgreSQL. Columnar storage is the first outcome of this goal of providing analytical capabilities. The Fujitsu Laboratories database team designed and developed the columnar storage.

Need for Columnar Storage
During analytical query processing on large data sets, row-wise storage means that data from many unwanted columns is scanned and discarded, which hurts performance. Row-wise storage also increases disk I/O on large data sets, even when the row data is compressed. By storing data in columns rather than rows, the database can access precisely the data it needs to answer a query, and compression becomes more efficient: it is well known that a column of similar data, dates for example, can be compressed more efficiently than disparate data across rows.
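To make the two advantages above concrete, here is a small illustrative sketch (not the VCI implementation): the same table stored row-wise and column-wise, showing that a single-column scan only touches one array in the columnar layout, and that similar adjacent values within a column collapse well under a simple run-length encoding.

```python
rows = [(1, "2024-01-01", 10.0), (2, "2024-01-01", 12.5), (3, "2024-01-02", 9.9)]

# Row store: every whole tuple is touched even if only the date column is needed.
dates_from_rows = [r[1] for r in rows]

# Column store: each column lives in its own array; a date scan reads one array.
columns = {
    "id": [r[0] for r in rows],
    "date": [r[1] for r in rows],
    "price": [r[2] for r in rows],
}
dates_from_columns = columns["date"]

def run_length_encode(values):
    """Similar adjacent values (common within a column) collapse to [value, count]."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1][1] += 1
        else:
            encoded.append([v, 1])
    return encoded

print(run_length_encode(dates_from_columns))
```

The repeated dates compress to two runs column-wise, while in the row layout they are interleaved with unrelated values.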

Need for In-memory
In-memory data access means querying data while it resides in the computer's random access memory, as opposed to querying data stored on physical disks. This results in faster query response times. Allowing transactional and analytical applications to access the same database simultaneously is another way to provide real-time analytics capabilities, but it can cause performance problems because of resource conflicts, largely due to the latency of accessing data records stored on disk.

Architecture

Overview of the Columnar Storage Approach
Columnar storage is implemented as an extension, with minimal changes to the backend code. The extension needs to be loaded via shared_preload_libraries at server start. The extension adds a new index access method called VCI (Vertical Clustered Index). The user creates a VCI index on the columns that are to be part of the columnar storage.

Overview of the Columnar Storage Approach
To support proper columnar storage without affecting the performance of write operations, while still providing a good performance improvement for read operations, the storage is split into two different types:
WOS (write optimized storage) - a temporary storage type that holds the modifications made by write operations.
ROS (read optimized storage) - the permanent storage type for the columns that are part of the columnar storage.
To support this design, whenever a VCI index is created, it internally creates several relations.

Write Optimized Storage
The write optimized storage is where the data of all columns that are part of the columnar storage is stored temporarily, in row-wise format. Two relations make up the WOS:
Data WOS - contains the tuple data of all columns that are part of the VCI.
Whiteout WOS - contains the set of TIDs whose tuples have been marked for deletion.
The write optimized storage is responsible for MVCC behaviour in this design; we adopted this approach in order to simplify the data structure of the ROS.

Write Optimized Storage
The following diagrams explain the data page layout of the Data WOS and Whiteout WOS relations. All newly added or deleted data is stored in the WOS relation along with its xmin/xmax information. If the user updates or deletes newly added data, performance is affected far less than it would be by deleting the data from the columnar storage.

Write Optimized Storage
Tuples that no longer have multiple versions, i.e. frozen data, are moved from WOS to ROS periodically by a background worker called the WOS-to-ROS converter process. Each column's data is stored separately in its own relation file. No transaction information is present in the ROS; data in the ROS is referenced by tuple ID.

Read Optimized Storage
The read optimized storage is where the actual column relation data is represented. The following slides discuss the details that are important for the ROS:
Overview of ROS
ROS extent
ROS column data access
ROS delete vector

Read Optimized Storage
The read optimized storage is where each column's data is stored individually, in separate data files. Unlike the WOS, ROS data is not managed by the TID used in the original table. When data is transferred from WOS to ROS, an internal ID called the Columnar Record Identifier (CRID) is assigned to each set of columnar data (one datum per column). A CRID is the combination of an extent number and the offset of the record within the extent.

Read Optimized Storage
The CRID gives the logical position of the columnar data and is generated in increasing order of record registration. Columnar data is stored in column data relations, one per column, and the CRID is used to find the position of the data that composes a record in each column data relation. The extent is introduced as the unit of data management in the ROS: one extent holds a fixed number of consecutive CRIDs. When a large number of records is transferred during WOS-to-ROS conversion, the storing is divided and executed in units of extents.

Read Optimized Storage
ROS extent: an extent is a logical data block in the ROS. One extent contains 262,144 records. So that a position can be obtained from a CRID with just a couple of arithmetic operations, the number of records per extent is fixed, including unused CRIDs. Since column-element sizes differ from each other, the sizes of extents differ as well, as shown in the figure.

Read Optimized Storage
The position of an extent within its relation, the compression algorithm for the ROS data, and the dictionaries for compression are all specified per extent. WOS-to-ROS conversion is also performed extent by extent, and garbage collection of deleted rows is performed per extent. Unused CRIDs gathered at the end of an extent are reused when new rows are appended to the extent. An extent whose columns have fixed column-element lengths occupies a fixed number of DB pages; therefore, a fixed number of contiguous DB pages is assigned to such an extent, to prevent fragmentation.

Read Optimized Storage
The following figures explain the mapping of an extent to DB pages and the information stored in an extent.

Read Optimized Storage
In PostgreSQL, very long variable-length data is TOASTed externally. In order to reduce the data size in columns, the information of the TOAST links is stored in the data area of the extent, and the highest two bits of the column-element length are used to indicate whether the datum is a link or raw data. This is shown in the figure and table below.

Value of two bits | Kind of data
00                | Raw data
10                | External TOAST
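The tagging scheme above can be sketched as follows. This is an illustrative assumption, not the actual VCI on-disk format: it assumes a 32-bit length word whose top two bits carry the tag and whose remaining 30 bits carry the length.

```python
RAW = 0b00             # datum is stored inline, raw
EXTERNAL_TOAST = 0b10  # datum is a link to externally TOASTed data

LENGTH_BITS = 30  # hypothetical word width: top two bits of 32 carry the tag

def pack_length(kind, length):
    """Fold the two-bit kind tag into the high bits of the length word."""
    assert 0 <= length < (1 << LENGTH_BITS)
    return (kind << LENGTH_BITS) | length

def unpack_length(word):
    """Recover (kind, length) from a packed length word."""
    return word >> LENGTH_BITS, word & ((1 << LENGTH_BITS) - 1)
```

Reading a column element would first unpack the word, then either use the inline bytes (RAW) or follow the TOAST link (EXTERNAL_TOAST).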

Read Optimized Storage
ROS column data access: the following figure explains how a column data relation is reached from the VCI main relation.

Read Optimized Storage
For fixed-length column datatypes, both compressed and raw, the column size is known, and the position of the data is calculated directly from the CRID: the extent ID is obtained by dividing the CRID by the number of rows in an extent, and the remainder is the position within the extent. For columns with variable lengths, the offset of each column-element's data location is recorded in the CRID offset, through which the data can be referred to.
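The fixed-length position arithmetic described above can be sketched directly, using the 262,144 records-per-extent figure from the earlier slide:

```python
ROWS_PER_EXTENT = 262_144  # fixed, so the position math is two operations

def crid_to_position(crid):
    """Split a CRID into (extent number, offset within the extent)."""
    return divmod(crid, ROWS_PER_EXTENT)

def position_to_crid(extent, offset):
    """Inverse mapping: rebuild the CRID from extent number and offset."""
    return extent * ROWS_PER_EXTENT + offset
```

For a fixed-length column the byte address then follows as extent base plus offset times the column-element size; variable-length columns need the per-element offsets instead.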

Read Optimized Storage
ROS delete vector: the delete vector is a bit vector representing whether each row has been deleted from the ROS. The delete vector itself is a fixed-length column datum, with the bits for eight consecutive rows packed into one byte. When processing queries, only rows without deletion marks are used. Periodically, deleted rows are collected by copying garbage collection: the live rows are copied from the beginning of an empty extent, which as a result gathers the consecutive unused CRIDs together at the end of the extent.
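The packing of eight rows per byte can be sketched as below. The bit order within a byte is an assumption for illustration, not taken from the VCI format.

```python
class DeleteVector:
    """Fixed-length bitmap: one bit per row, eight rows packed per byte.

    Bit order within a byte (low bit = lowest row) is an illustrative
    assumption, not the actual on-disk layout.
    """

    def __init__(self, nrows):
        self.bits = bytearray((nrows + 7) // 8)  # round up to whole bytes

    def mark_deleted(self, row):
        self.bits[row // 8] |= 1 << (row % 8)

    def is_deleted(self, row):
        return bool(self.bits[row // 8] & (1 << (row % 8)))
```

A scan would consult `is_deleted` and skip marked rows, matching the rule that only rows without deletion marks are used in query processing.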

Read Optimized Storage
The unused CRIDs are reused for future WOS-to-ROS transfers. The reason for using copying garbage collection is that query processing can continue during garbage collection: only the live data of an extent is copied into an unused contiguous area inside the relation. Queries issued before the copy access data in the original source extent area, while queries issued after the copy access data in the copied destination extent area. As soon as no queries are left that access data in the source extent area, the source extent area is reclaimed. Thus, queries can access data without stopping, even during garbage collection.
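The copy step itself can be sketched as a simple compaction (concurrency and reclamation omitted): live rows are written from the beginning of a fresh extent, leaving the freed slots as consecutive unused CRIDs at its end.

```python
def copy_collect(src_extent, deleted):
    """Copy only live rows of an extent into a fresh one.

    src_extent: list of row values; deleted[i] is True when row i carries
    a deletion mark. Returns the compacted destination extent; the slots
    after the live rows are the consecutive unused CRIDs to be reused.
    """
    return [row for i, row in enumerate(src_extent) if not deleted[i]]
```

In the real design the source extent keeps serving queries that started before the copy and is reclaimed only once none remain; this sketch shows just the data movement.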

Read Optimized Storage
The following diagram explains how the garbage collection works.

Data Movement from WOS to ROS
A background worker process called the ROS daemon, similar to the autovacuum launcher, launches worker processes that do the following:
1. WOS-to-ROS conversion
2. Update the delete vector (Whiteout WOS to delete vector conversion)
3. Collect deleted rows in an extent
4. Update the TID-CRID relation with the TID-CRID update list
5. Collect unused extents

SQL Operations
All INSERT/DELETE operations take place directly in the WOS relations, and the data is periodically transferred to the ROS.

SQL Operations

SQL Operations
During each query execution, the Data WOS and Whiteout WOS corresponding to the columnar storage table are converted into a Local ROS. The life of the Local ROS ends with the query execution. The extent numbers of the Local ROS are taken downwards from -1. All the visible data from the Data WOS and Whiteout WOS is transformed into the Local ROS at the start of query execution.

In-memory Approach
As discussed under the need for in-memory, for the columnar storage to work well we need to make sure that most of its data resides in shared buffers even alongside OLTP operations. Instead of a separate in-memory mechanism, we can reuse the shared buffers logic. To achieve this, we can do something like the following: reserve some shared buffer space for columnar storage tables; the shared buffer pages used by the columnar storage tables are then recycled only when their usage crosses the specified reserve ratio. This ensures that most of the columnar table pages reside in shared buffers.

In-memory Approach
A new GUC configuration parameter would specify the reserve ratio for columnar storage tables: reserve_buffer_ratio (0 - 75). Alternatively, a separate shared buffer pool could be created for columnar storage tables, similar to Oracle's multiple buffer pools, but this needs proper design changes. A new reloption can be added to mark a table as needing the stable buffer option.
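The proposed recycling rule can be sketched as a simple policy check. This is a hypothetical illustration of the proposal, not code from the patch; the parameter name and value mirror the reserve_buffer_ratio idea above.

```python
RESERVE_BUFFER_RATIO = 0.25  # illustrative value for the proposed GUC (0-0.75)

def may_evict_columnar_page(columnar_pages_in_pool, total_pool_pages,
                            reserve_ratio=RESERVE_BUFFER_RATIO):
    """Allow recycling a columnar-table page only once columnar tables
    already occupy more than their reserved share of the buffer pool."""
    return columnar_pages_in_pool / total_pool_pages > reserve_ratio
```

Under this rule, columnar pages below the reserved share are never chosen for eviction, so most columnar table pages stay resident even while OLTP traffic churns the rest of the pool.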

Performance comparison with other commercial databases
[Table: execution time (ms) for TPC-H-style queries Q1-Q22 (Q15 absent), comparing PostgreSQL (VCI off), VCI at 1 CPU and 40 CPUs, and RDBMS1/RDBMS2/RDBMS3 in row, column, and in-memory configurations at 1 CPU and 40 CPUs. The numeric cells did not survive transcription.]

Performance comparison with other commercial databases
[Table: VCI speed improvement rate relative to PostgreSQL for the same queries, against RDBMS1, RDBMS2, and RDBMS3 (row and in-memory options, 1 CPU and 40 CPUs), with a geometric average overall and a geometric average excluding queries slower than PostgreSQL. The numeric cells did not survive transcription.]

Current State of the Patch
The IMCS patch is implemented as a new index access method called the vertical clustered index. All of the storage code is complete. Currently, plans that use the VCI index are generated with the help of a custom plan. Some workarounds were added to deal with HOT updates and vacuum operations, since IMCS is currently treated as an index.

Further Improvements
Integrate the storage changes with the proposed new columnar storage CREATE syntax.
Add planner changes to exploit the columnar storage directly, instead of using custom plan methods to generate the plan.
Remove all the workarounds that were added for HOT updates.

Questions?

Copyright 2014 FUJITSU LIMITED