HBase Accelerated: In-Memory Flush and Compaction Eshcar Hillel, Anastasia Braginsky, Edward Bortnikov | HBase Meetup, San Francisco, December 8, 2016 Hello everybody! My name is Anastasia and I am from Yahoo Research. Today I am going to present two new enhancements to the in-memory component of HBase and how they affect HBase performance. This is work we have done together with Eshcar Hillel and Eddie Bortnikov, both from Yahoo Research.
Outline Motivation & Background In-Memory Basic Compaction In-Memory Eager Compaction Evaluation Here is the plan for this talk: First, I am going to give you some background and motivation for what we have done. And what we have done, in a nutshell, is (first) compact the data in memory and (second) reduce the per-cell overhead. In both cases the changes are only in the MemStore component. So after the background I am going to explain how we apply in-memory compaction and what we gain from it. The last item is about the twist we gave to the conventional index in the MemStore component.
HBase Accelerated: Mission Definition Goal: Real-time performance in persistent KV-stores How: Use the memory more efficiently, less I/O So what do we want? We want better performance, just like everyone. And how do we plan to get it? We plan to use the memory more efficiently, i.e., to squeeze more data into memory than before, so that the first flush to disk is delayed.
Use the memory more efficiently, less I/O Upon a write request, data is first written into the MemStore; when a certain memory limit is reached, the MemStore data is flushed into an HFile. (Diagram: writes accumulate in the MemStore in memory and are flushed to HFiles on HDFS.)
Use the memory more efficiently, less I/O With more efficient usage we let the MemStore keep more data within the same limited memory, so fewer flushes to disk are required. (Diagram: the same write flow, with more data held in the MemStore before flushing.)
Prolong the in-memory lifetime of the data before flushing to disk Reduce the number of new HFiles Reduce the write amplification effect (overall I/O) Reduce retrieval latencies
HBase Accelerated: Two Basic Ideas In-Memory Index Compaction Reduce the index memory footprint, less overhead per cell In-Memory Data Compaction Exploit redundancies in the workload to eliminate duplicates in memory As I have already said, in order to squeeze more into memory we have two ideas. The first one is compaction, which means eliminating in-memory duplications: if the same key is updated again and again, its obsolete versions can be removed while still in memory, with no need to flush them to disk and compact there. Following this idea, the more duplications you have, the more you benefit from in-memory compaction. The second idea is reduction. What can we reduce? If a cell is written and it is not a duplicate, then it needs to be there, right? Yes, that is right. We reduce only the overhead per cell, the metadata that serves the cell. In this case, the smaller the cells are, the better the relative gain. So what do we get? We keep the MemStore component small, so it takes longer until the flush to disk happens. Every computer science student knows that using less memory is better, and everybody knows that staying in RAM is better than going to disk. But I want to draw your attention to a few more benefits: a delayed flush to disk means fewer files on disk and fewer on-disk compactions, and those on-disk compactions are expensive. They consume disk bandwidth, CPU power, network between machines in the HDFS cluster, and so on.
HBase Writes (DefaultMemStore) (Diagram: writes go to the WAL and the active segment in memory; the snapshot is flushed to HFiles on HDFS, which are later merged by on-disk compaction.) Random writes are absorbed in an active segment. When the active segment is full, it becomes an immutable segment (snapshot), a new mutable (active) segment serves writes, and the snapshot is flushed to disk, truncating the WAL. On-disk compaction reads a few files, merge-sorts them, and writes back new files.
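To make this default flow concrete, here is a minimal, self-contained Java sketch of the path just described. The class and method names (SimpleMemStore, prepareForFlush, flushToDisk) are illustrative placeholders, not HBase's actual implementation, and the WAL append is omitted.

import java.util.concurrent.ConcurrentSkipListMap;

// Illustrative sketch of the default (pre-compacting) write path: writes land
// in a mutable active segment; when it is full it becomes an immutable
// snapshot and is flushed to a new HFile. All names are hypothetical.
public class SimpleMemStore {
    private ConcurrentSkipListMap<String, String> active = new ConcurrentSkipListMap<>();
    private ConcurrentSkipListMap<String, String> snapshot;   // immutable once created
    private final long flushThresholdBytes;
    private long activeSizeBytes = 0;

    public SimpleMemStore(long flushThresholdBytes) {
        this.flushThresholdBytes = flushThresholdBytes;
    }

    // Random writes are absorbed by the active segment (after the WAL append,
    // which is omitted here).
    public synchronized void put(String key, String value) {
        active.put(key, value);
        activeSizeBytes += key.length() + value.length();
        if (activeSizeBytes >= flushThresholdBytes) {
            prepareForFlush();
        }
    }

    // The full active segment becomes the snapshot; a fresh active segment
    // continues to serve writes while the snapshot is written to disk.
    private void prepareForFlush() {
        snapshot = active;
        active = new ConcurrentSkipListMap<>();
        activeSizeBytes = 0;
        flushToDisk(snapshot);   // in HBase this produces a new HFile and truncates the WAL
    }

    private void flushToDisk(ConcurrentSkipListMap<String, String> segment) {
        System.out.println("flushing " + segment.size() + " cells to a new HFile");
    }
}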
Outline Motivation & Background In-Memory Basic Compaction In-Memory Eager Compaction Evaluation Next: In-Memory Basic Compaction. Let's talk about the details of in-memory compaction. I will briefly recap how reads and writes work in HBase.
In-Memory Flush We introduce the notion of a segment: a set of cells inserted over a period of time, consisting of a key-to-cell map (index) and the cells' data. The MemStore is a collection of segments. (Diagram: the active segment.)
Segments in MemStore A segment is a set of cells inserted over a period of time, consisting of a key-to-cell map (index) and the cells' data; the MemStore is a collection of segments. Immutable segments are simpler and inherently thread-safe. (Diagram: active, immutable, and snapshot segments.)
HBase In-Memory Flush (Diagram: CompactingMemStore with an active segment, a compaction pipeline, and a snapshot in memory, flushing to HFiles on HDFS; WAL and block cache alongside.) New compaction pipeline: the active segment is flushed into the pipeline in memory; flush to disk only when needed.
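A rough sketch of the pipeline mechanics follows, under the simplifying assumption that cells are plain key/value strings. CompactingMemStoreSketch and its two cell-count thresholds are hypothetical names for illustration, not the real HBase class or its configuration.

import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

// Conceptual sketch of the compaction pipeline: the active segment is
// "flushed in memory" into a pipeline of immutable segments, and only the
// whole pipeline is flushed to disk when a larger threshold is crossed.
public class CompactingMemStoreSketch {
    private ConcurrentSkipListMap<String, String> active = new ConcurrentSkipListMap<>();
    private final LinkedList<List<String[]>> pipeline = new LinkedList<>(); // immutable segments, newest first
    private final int inMemoryFlushCells;   // when to move active -> pipeline
    private final int diskFlushCells;       // when to flush the pipeline to an HFile

    public CompactingMemStoreSketch(int inMemoryFlushCells, int diskFlushCells) {
        this.inMemoryFlushCells = inMemoryFlushCells;
        this.diskFlushCells = diskFlushCells;
    }

    public synchronized void put(String key, String value) {
        active.put(key, value);
        if (active.size() >= inMemoryFlushCells) {
            inMemoryFlush();
        }
    }

    // In-memory flush: the active segment becomes an immutable, flat segment
    // added to the pipeline (newest first); no disk I/O happens here.
    private void inMemoryFlush() {
        List<String[]> flat = new ArrayList<>();
        active.forEach((k, v) -> flat.add(new String[] {k, v}));  // already sorted
        pipeline.addFirst(flat);
        active = new ConcurrentSkipListMap<>();
        if (pipelineCells() >= diskFlushCells) {
            flushPipelineToDisk();
        }
    }

    private int pipelineCells() {
        return pipeline.stream().mapToInt(List::size).sum();
    }

    private void flushPipelineToDisk() {
        System.out.println("flushing " + pipelineCells() + " cells to a new HFile");
        pipeline.clear();
    }
}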
Efficient Index Representation Mutable segment: a Concurrent Skip-List Map is the "ordering" engine; it holds small cell-redirecting objects on the Java heap, while the bytes of the entire cell's information live in MSLAB chunks. (Diagram: skip-list index entries pointing into MSLAB byte buffers.)
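The idea can be sketched as follows, with hypothetical CellRef and arena names standing in for HBase's cell and MSLAB chunk classes: the skip-list orders small on-heap references, while the cell bytes sit in a pre-allocated buffer.

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ConcurrentSkipListMap;

// Sketch of the mutable-segment layout: a ConcurrentSkipListMap provides the
// dynamic ordering, while each map entry is only a small "cell reference"
// pointing into a shared chunk of bytes (an MSLAB-like arena).
public class MutableSegmentSketch {
    // A cell reference: offset + length into a shared byte chunk.
    static final class CellRef {
        final ByteBuffer chunk;
        final int offset, length;
        CellRef(ByteBuffer chunk, int offset, int length) {
            this.chunk = chunk; this.offset = offset; this.length = length;
        }
        String asString() {
            byte[] b = new byte[length];
            ByteBuffer dup = chunk.duplicate();
            dup.position(offset);
            dup.get(b);
            return new String(b, StandardCharsets.UTF_8);
        }
    }

    private final ByteBuffer arena = ByteBuffer.allocate(2 * 1024 * 1024); // MSLAB-like chunk
    private final ConcurrentSkipListMap<String, CellRef> index = new ConcurrentSkipListMap<>();

    // Copy the cell bytes into the arena and index a small reference to them.
    public synchronized void add(String rowKey, String cellBytes) {
        byte[] data = cellBytes.getBytes(StandardCharsets.UTF_8);
        int offset = arena.position();
        arena.put(data);
        index.put(rowKey, new CellRef(arena, offset, data.length));
    }

    public String get(String rowKey) {
        CellRef ref = index.get(rowKey);
        return ref == null ? null : ref.asString();
    }
}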
No Dynamic Ordering for an Immutable Segment New immutable segment: a Cell Array index; a binary search leads to the cell's information, while the cell bytes stay in their MSLAB chunks on the Java heap. (Diagram: the skip-list index replaced by a flat cell array over the same byte buffers.)
Exploit the Immutability of a Segment after In-Memory Flush New design: a flat layout for the immutable segments' index, with less overhead per cell; manage (allocate, store, release) data buffers off-heap. Pros: better utilization of memory and CPU, locality in access to the index, reduced memory fragmentation.
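A minimal sketch of such a flattened index, assuming string keys and ignoring versions; FlatSegment is an illustrative name, and the actual HBase implementation works over cell references into MSLAB chunks rather than strings.

import java.util.Arrays;
import java.util.SortedMap;

// Sketch of the flattened index for an immutable segment: since the segment
// never changes, the skip-list can be replaced by a plain sorted array of
// cell references, looked up with binary search and with no per-entry nodes.
public class FlatSegment {
    private final String[] keys;     // sorted row keys
    private final String[] values;   // cell data/references, parallel to keys

    // Flatten a sorted map (e.g. the former active segment) into parallel arrays.
    public FlatSegment(SortedMap<String, String> sortedCells) {
        keys = sortedCells.keySet().toArray(new String[0]);
        values = sortedCells.values().toArray(new String[0]);
    }

    // Point lookup by binary search: O(log n), no node objects per entry.
    public String get(String rowKey) {
        int idx = Arrays.binarySearch(keys, rowKey);
        return idx >= 0 ? values[idx] : null;
    }

    // Ordered scan is a simple array walk starting at the insertion point.
    public void scanFrom(String startRow, java.util.function.BiConsumer<String, String> consumer) {
        int idx = Arrays.binarySearch(keys, startRow);
        if (idx < 0) idx = -idx - 1;
        for (int i = idx; i < keys.length; i++) {
            consumer.accept(keys[i], values[i]);
        }
    }
}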
Outline Motivation & Background In-Memory Basic Compaction In-Memory Eager Compaction Evaluation Next: In-Memory Eager Compaction.
HBase In-Memory Compaction (Diagram: CompactingMemStore with an active segment, in-memory compaction over the compaction pipeline, and a snapshot in memory, flushing to HFiles on HDFS; WAL and block cache alongside.) New compaction pipeline: the active segment is flushed into the pipeline; pipeline segments are compacted in memory; flush to disk only when needed.
Remove Duplicate Cells While Still in Memory New design: eliminate redundancy while still in memory, so there is less I/O upon flush and less work for on-disk compaction. Pros: data stays in memory longer, larger files are flushed to disk, less write amplification. It also includes index compaction.
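Conceptually, the eager policy boils down to a merge that keeps only the newest version of each key. A simplified sketch under the assumptions of string keys and values, segments ordered newest first, and no timestamps, deletes, or multiple retained versions:

import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of eager in-memory compaction: the pipeline's segments are merged
// into a single segment, keeping only the latest version of each key, so
// obsolete versions never reach the disk. Types are simplified; real cells
// also carry timestamps, types, and version counts.
public class InMemoryCompactor {
    // Each segment maps row key -> latest value in that segment; segments are
    // ordered from newest to oldest, as in the compaction pipeline.
    public static Map<String, String> compact(List<Map<String, String>> newestFirstSegments) {
        Map<String, String> result = new TreeMap<>();
        for (Map<String, String> segment : newestFirstSegments) {
            // putIfAbsent keeps the value from the newest segment that has the
            // key, which is exactly the duplicate elimination described above.
            segment.forEach(result::putIfAbsent);
        }
        return result;   // a single, duplicate-free segment replacing the pipeline
    }
}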
Three Policies to Compact a MemStore NONE: no CompactingMemStore is used, everything is as before. BASIC: CompactingMemStore with flattening only (changing the index); each active segment is flattened upon in-memory flush, and multiple segments remain in the compaction pipeline. EAGER: CompactingMemStore with data compaction; each active segment is compacted with the compaction pipeline upon in-memory flush, and the result of the previous compaction is the single segment in the compaction pipeline.
How to Use? Globally and per column family. By default, all tables apply BASIC in-memory compaction. This global configuration can be overridden in hbase-site.xml, as follows:
<property>
  <name>hbase.hregion.compacting.memstore.type</name>
  <value>none|eager</value>
</property>
How to Use per Column Family? The policy can also be configured in the HBase shell per column family:
create '<tablename>', {NAME => '<cfname>', IN_MEMORY_COMPACTION => '<NONE|BASIC|EAGER>'}
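The same per-column-family setting can also be applied from client code. A sketch against the HBase 2.0 Java client API, with the table name "usertable" and family "cf" as placeholders:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.MemoryCompactionPolicy;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

// Creates a table whose single column family uses EAGER in-memory compaction.
public class CreateCompactingTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            admin.createTable(
                TableDescriptorBuilder.newBuilder(TableName.valueOf("usertable"))
                    .setColumnFamily(
                        ColumnFamilyDescriptorBuilder.newBuilder(Bytes.toBytes("cf"))
                            .setInMemoryCompaction(MemoryCompactionPolicy.EAGER)
                            .build())
                    .build());
        }
    }
}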
Outline Motivation & Background In-Memory Basic Compaction In-Memory Eager Compaction Evaluation Next: In-Memory Compaction Evaluation.
Metrics Write throughput; write latency; write amplification (data written to disk / data submitted by the client); read latency.
Framework Benchmarking tool: YCSB. Data: 1 table, 100 regions (50 per RS), cell size 100 bytes. Cluster: 3 SSD machines, 48GB RAM, 12 cores; HBase: 2 RS + Master; HDFS: 3 datanodes. Configuration: 16GB heap, of which 40% is allocated to the MemStore and 40% to the Block Cache; 10 flush threads.
Workload #1: 100% Writes Focus: write throughput, latency, amplification. Zipfian distribution sampled from 200M keys; 100GB of data written; 20 threads driving the workload.
Performance (chart omitted)
Data Written (chart omitted)
Write Amplification
Policy   Write Amplification   Improvement
None     2.201                 -
Basic    1.865                 15%
Eager    1.743                 21%
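To tie this back to the metric definition (data written to disk / data submitted by the client): with roughly 100GB submitted, the None policy writes about 2.201 x 100GB, i.e. around 220GB, to disk. The improvement column is relative to None, e.g. Basic: (2.201 - 1.865) / 2.201 = 0.15, about 15%, and Eager: (2.201 - 1.743) / 2.201 = 0.21, about 21%.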
A Very Different Experiment … HDD hardware; 1 table, 1 region; small heap (1GB, simulating contention at a large scale); 5GB of data; 50% writes, 50% reads.
Zoom-In: Read Latency over Time (charts omitted)
Read Latency (chart omitted)
Summary Targeted for HBase 2.0.0. Pros of the new design over the default implementation: less compaction on disk reduces the write amplification effect; less disk I/O and network traffic reduces the load on HDFS; a new space-efficient index representation; reduced retrieval latency by serving (mainly) from memory. We would like to thank the reviewers: Michael Stack, Anoop Sam John, Ramkrishna S. Vasudevan, Ted Yu.
Thank You!