Dutch-Belgium DataBase Day University of Antwerp, MonetDB/x100 Peter Boncz, Marcin Zukowski, Niels Nes
Introduction What is x100 ? A new query processing engine developed for MonetDB
Contents Introduction CWI Database Group Motivation MonetDB/x100 Architecture Highlights Optimizing CPU performance Exploiting cache memories Enhancing disk bandwidth Conclusions Discussion
CWI Database Group Database Architecture DBMS design, implementation, evaluation Wide area; many sub-areas Data structures Query processing algorithms Modern computer architectures MonetDB at CWI open-source high-performance DBMS Future: X100, MonetDB 5.0
Motivation Multimedia retrieval TREC Video: 130 hours of news, growing each year Task: search for a given text (speech recognition) or video similar to a given image 3 TB of data (!)
Motivation Similar areas Data-mining OLAP, data warehousing Scientific applications (astronomy, biology…) Challenge: process really large datasets within DBMS efficiently
x100 Highlights Use computer architecture to guide this talk
CPU Actual data processing
CPU From CISC to hyper-pipelined: 1986: 8086: CISC; 1990: 486: 2 execution units; 1992: Pentium: 2 x 5-stage pipelined units; 1996: Pentium3: 3 x 7-stage pipelined units; 2000: Pentium4: 12 x 20-stage pipelined execution units. Each instruction executes in multiple steps… A -> A1, …, An … in (multiple) pipelines:
CPU But only if the instructions are independent! Otherwise: Problems: branches in program logic; accessing recently modified memory [ailamaki99, …]. DBMSs are bad at filling pipelines
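To make the branch problem concrete, here is a minimal sketch of a selection written twice: once with a data-dependent branch and once without. The slide only names branches as a pipeline hazard; the branch-free rewrite shown here is one common remedy and is an assumption of this example, not something the slide prescribes. The function names `select_lt_branch` and `select_lt_nobranch` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Branching version: whether the `if` is taken depends on the data,
 * so the CPU has to predict it for every value. */
size_t select_lt_branch(size_t n, size_t *out, const int *col, int bound) {
    assert(out != NULL && col != NULL);
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        if (col[i] < bound)
            out[k++] = i;
    }
    return k;
}

/* Branch-free version: always write the candidate position, then
 * advance the output cursor by the comparison result (0 or 1).
 * `out` must have room for n entries. */
size_t select_lt_nobranch(size_t n, size_t *out, const int *col, int bound) {
    assert(out != NULL && col != NULL);
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        out[k] = i;
        k += (size_t)(col[i] < bound);
    }
    return k;
}
```

Both functions return the number of qualifying positions and fill `out` with the same indexes; only the control flow inside the loop differs.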
x100: vectorized processing Primitives: vector at a time; very basic functionality; independent loop iterations; simple code. Optimization levels: compiler: loop pipelining; CPU: full pipelines. *(int,int): int -> *(int[],int[]): int[]
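The two signatures on the slide can be sketched in C as follows. This is an illustration only: the names `mul_int_int` and `map_mul_int_vec_int_vec` are hypothetical, and the use of `restrict` is an assumption about how one might tell the compiler that the vectors do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* *(int,int): int
 * Tuple-at-a-time primitive: one function call per pair of values. */
int mul_int_int(int a, int b) {
    return a * b;
}

/* *(int[],int[]): int[]
 * Vector-at-a-time primitive: one function call per vector.
 * Every iteration is independent of the others, which is what lets
 * the compiler pipeline the loop and keeps the CPU pipelines full. */
void map_mul_int_vec_int_vec(size_t n, int *restrict res,
                             const int *restrict a,
                             const int *restrict b) {
    assert(res != NULL && a != NULL && b != NULL);
    for (size_t i = 0; i < n; i++)
        res[i] = a[i] * b[i];
}
```

The call overhead of the vector version is paid once per vector rather than once per tuple, and the loop body contains nothing but the multiplication itself.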
x100: results (TPC-H Q1) Few CPU cycles per tuple e.g. MySQL spends ~100 cycles for such operators
Main memory Large, but not unlimited
Cache Faster, but very limited storage
Cache Memory Bottleneck Cache to hide memory access cost Different costs at different levels: L1 cache access: 1-2 cycles L2 cache access: 6-20 cycles main-memory access: cycles Consequences : random access into main-memory very expensive DBMS must buffer for CPU cache, not RAM cache-conscious query processing MonetDB research [VLDB99,00,02,04]
x100: pipelining Vectors fill the CPU cache; main-memory access only at the data sources and sinks. [Figure: X100 query processor evaluating Project( ) with -, *, + operators; data path: disk, X100 buffer mgr, RAM, cache, CPU] MonetDB uses much more main memory bandwidth
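The pipelining idea can be sketched with the three operators from the figure (-, *, +). The code below evaluates (a - b) * c + d one vector at a time, so the intermediate results live in two small buffers that are reused for every vector. The vector size of 1024 and all function names are assumptions made for this sketch, not values taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define VECTOR_SIZE 1024  /* assumed size, small enough to stay cache-resident */

static void vec_sub(size_t n, int *res, const int *a, const int *b) {
    for (size_t i = 0; i < n; i++) res[i] = a[i] - b[i];
}

static void vec_mul(size_t n, int *res, const int *a, const int *b) {
    for (size_t i = 0; i < n; i++) res[i] = a[i] * b[i];
}

static void vec_add(size_t n, int *res, const int *a, const int *b) {
    for (size_t i = 0; i < n; i++) res[i] = a[i] + b[i];
}

/* Computes out[i] = (a[i] - b[i]) * c[i] + d[i] for n values.
 * Main memory is touched only when reading the inputs (sources)
 * and writing `out` (sink); t1 and t2 are reused per vector. */
void project_sub_mul_add(size_t n, int *out, const int *a, const int *b,
                         const int *c, const int *d) {
    int t1[VECTOR_SIZE], t2[VECTOR_SIZE];
    assert(out != NULL);
    for (size_t off = 0; off < n; off += VECTOR_SIZE) {
        size_t len = (n - off < VECTOR_SIZE) ? n - off : VECTOR_SIZE;
        vec_sub(len, t1, a + off, b + off);
        vec_mul(len, t2, t1, c + off);
        vec_add(len, out + off, t2, d + off);
    }
}
```

A column-at-a-time evaluation would instead materialize each full intermediate column in RAM, which is the extra main-memory bandwidth the slide attributes to MonetDB.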
Disk Slow, but unlimited storage
Disk Random access hopeless Size grows faster than bandwidth
x100: problem - bandwidth MonetDB/x100 too fast for disks TPC-H queries need MB/s
Bandwidth improvements Three ideas: Vertical Fragmentation (MonetDB) new: Lightweight Compression new: Cooperative Scans
Vertical fragmentation DBMS disk access in data-intensive applications: only the relevant columns are read, which reduces disk bandwidth requirements
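A small sketch of why vertical fragmentation cuts the data volume: the same aggregate is computed over a row-wise layout and over a column-wise layout, and the helper functions report how many bytes each scan has to touch. The table schema (`id`, `quantity`, `price`, `comment`) and all names are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Row-wise layout: scanning one attribute drags whole rows along. */
typedef struct {
    int    id;
    int    quantity;
    double price;
    char   comment[44];
} row_t;

/* Vertically fragmented layout: one array per attribute.
 * (The comment column would be one more array; omitted here.) */
typedef struct {
    size_t  n;
    int    *id;
    int    *quantity;
    double *price;
} columns_t;

long sum_quantity_rows(const row_t *rows, size_t n) {
    assert(rows != NULL);
    long sum = 0;
    for (size_t i = 0; i < n; i++) sum += rows[i].quantity;
    return sum;
}

long sum_quantity_columns(const columns_t *t) {
    assert(t != NULL && t->quantity != NULL);
    long sum = 0;
    for (size_t i = 0; i < t->n; i++) sum += t->quantity[i];
    return sum;
}

/* Bytes a full scan has to read for SUM(quantity) in each layout. */
size_t bytes_scanned_rows(size_t n)    { return n * sizeof(row_t); }
size_t bytes_scanned_columns(size_t n) { return n * sizeof(int); }
```

Both sums are identical; the column layout simply reads `sizeof(int)` bytes per tuple instead of `sizeof(row_t)`.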
Lightweight Compression Compression is introduced not to reduce storage space but to increase disk bandwidth: thanks to efficient code, processing disk-based data uses only a few percent of CPU time. Part of the spare CPU time can be spent on decompressing data
Lightweight Compression Rationale: - Disk-to-RAM transfer uses DMA and does not need the CPU - (de)compress only a vector at a time, when the data is needed. [Figure: X100 query processor evaluating Project( ) with -, *, + operators; data path: disk, X100 buffer mgr, RAM, cache, CPU] Compress on the boundary between CPU cache and RAM
Lightweight Compression Standard compression won’t do: it compresses too well => too slow (100MB/s). Research question: devise lightweight (de)compression algorithms. Results so far: compression factor relatively small, up to 3.5; decompression speed: 3GB/sec (!); compression speed: 1GB/sec (!!!); perceived bandwidth 3 times bigger
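The slides do not say which lightweight algorithms were devised, so the sketch below uses frame-of-reference encoding purely as a stand-in: each block stores its minimum as a base plus one 8-bit offset per value. It illustrates the kind of trade-off described above (a modest compression factor in exchange for a trivially cheap decompression loop). The names `for_compress` and `for_decompress` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Frame-of-reference compression of one block (illustrative stand-in).
 * Stores the block minimum in *base and each value as an 8-bit offset.
 * Returns 0 on success, -1 if the value range does not fit in 8 bits. */
int for_compress(size_t n, const int32_t *in, int32_t *base, uint8_t *codes) {
    assert(n > 0 && in != NULL && base != NULL && codes != NULL);
    int32_t min = in[0], max = in[0];
    for (size_t i = 1; i < n; i++) {
        if (in[i] < min) min = in[i];
        if (in[i] > max) max = in[i];
    }
    if ((int64_t)max - (int64_t)min > UINT8_MAX)
        return -1;
    *base = min;
    for (size_t i = 0; i < n; i++)
        codes[i] = (uint8_t)(in[i] - min);
    return 0;
}

/* Decompression is a single branch-free loop with independent
 * iterations, so it can run vector-at-a-time between RAM and cache. */
void for_decompress(size_t n, int32_t base, const uint8_t *codes, int32_t *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = base + (int32_t)codes[i];
}
```

In this sketch each 4-byte value shrinks to 1 byte plus a shared base per block; a block whose values span more than 255 is rejected and would have to be stored uncompressed or with wider codes.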
Cooperative Scans Idea: use I/O bandwidth to satisfy multiple queries. Cooperative Scans: an Active Buffer Manager that is aware of concurrent scans on the same table. Research question: devise adaptive buffer management strategies. Benefits: I/O bandwidth is re-used by multiple queries; concurrent queries no longer fight for the disk arm
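The sharing idea can be illustrated with a toy simulation. Each chunk of a table is read from disk at most once and handed to every concurrent query that still needs it. This is a deliberately simplified sketch: the fixed chunk order, the `scan_query_t` structure, and the single-integer "chunk contents" are all assumptions of the example, whereas an adaptive buffer manager as described above would choose the load order at run time.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CHUNKS 64

typedef struct {
    bool needed[MAX_CHUNKS]; /* chunks this query still has to see */
    long sum;                /* running aggregate, stands in for query work */
} scan_query_t;

/* Walks the table chunk by chunk. A chunk is "loaded" once if at least
 * one query needs it, then delivered to all interested queries.
 * chunk_value[c] stands in for the contents of chunk c.
 * Returns the number of chunk loads (simulated disk reads). */
size_t cooperative_scan(size_t nchunks, const int *chunk_value,
                        scan_query_t *queries, size_t nqueries) {
    assert(nchunks <= MAX_CHUNKS);
    size_t loads = 0;
    for (size_t c = 0; c < nchunks; c++) {
        bool loaded = false;
        for (size_t q = 0; q < nqueries; q++) {
            if (!queries[q].needed[c])
                continue;
            if (!loaded) {          /* first interested query pays the I/O */
                loads++;
                loaded = true;
            }
            queries[q].sum += chunk_value[c];
            queries[q].needed[c] = false;
        }
    }
    return loads;
}
```

With three queries that need 4, 3 and 1 chunks of a 4-chunk table, independent scans would issue 8 chunk reads, while the shared scan issues 4.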
Cooperative Scans x100 and Cooperative Scans: >30 queries without performance degradation
x100 summary Original MonetDB is successful in the same application areas; however: sub-optimal CPU utilization; only efficient if the problem fits in RAM. x100 improves the architecture on all levels: better CPU utilization; better cache utilization; scales to non-memory-resident datasets; improves I/O bandwidth using compression and cooperative scans
Example results Performance close to hand-written C functions
TPC-H SF-1   x100    Oracle  MonetDB
Q1           0.54s   30s     9.4s
Q3           0.24s   10s     2.5s
Q6           0.15s   1.5s    2.5s
Q14          0.13s   2s      1.2s
x100 status First proof-of-concept implemented; full TPC-H benchmark executes. Future work: lots of engineering; new buffer manager; more vectorized algorithms; memory footprint tuning (for small devices); SQL front-end
More information CIDR’05 paper: “MonetDB/X100: Hyper-pipelining query execution”
Discussion ?