Published by Anthony Singleton. Modified over 9 years ago.
1
Concurrent Cache-Oblivious B-trees Using Transactional Memory
Jim Sukha Bradley Kuszmaul MIT CSAIL June 10, 2006
2
Thought Experiment Imagine that, one day, you are assigned the following task: Enclosed is code for a serial, cache-oblivious B-tree. We want a reasonably efficient parallel implementation that works for disk-resident data. Attach: COB-tree.tar.gz PS. We want to be able to restore the data to a consistent state after a crash too. PPS. Our deadline is next week. Good luck!
3
Concurrent COB-tree? Question:
How can one program a concurrent, cache-oblivious B-tree? Approach: We employ transactional memory. What complications does I/O introduce?
4
Potential Pitfalls Involving I/O
Suppose our data structure resides on disk. We might need to make explicit I/O calls to transfer blocks between memory and disk. But a cache-oblivious algorithm doesn’t know the block size B! We might need buffer management code if the data doesn’t fit into main memory. We might need to unroll I/O if we abort a transaction that has already written to disk.
5
Our Solution: Libxac We have implemented Libxac, a page-based transactional memory system that operates on disk-resident data. Libxac supports ACID transactions on a memory-mapped file. Using Libxac, we are able to implement a complex data structure that operates on disk-resident data, e.g. a cache-oblivious B-tree.
6
Libxac Handles Transaction I/O
We might need to make explicit I/O calls to transfer blocks between memory and disk.
    Similar to mmap, Libxac provides a function xMmap. Thus, we can operate on disk-resident data without knowing the block size.
We might need buffer management code if the data doesn't fit into main memory.
    Like mmap, the OS automatically buffers pages in memory.
We might need to unroll I/O if we abort a transaction that has already written to disk.
    Since Libxac implements multiversion concurrency control, we still have the original version of a page even if a transaction aborts.
7
Outline Programming with Libxac Cache-Oblivious B-trees
8
Example Program with Libxac
int main(void) {
    int* x;
    int status = FAILURE;
    xInit("/logs", DURABLE);
    x = xMmap("input.db", 4096);
    while (status != SUCCESS) {
        xbegin();
        x[0]++;
        status = xend();
    }
    xMunmap(x);
    xShutdown();
    return 0;
}
xInit: Runtime initialization function. For durable transactions, logs are stored in the specified directory.*
xMmap: Transactionally maps the first page of the input file.
xbegin() to xend(): Transaction body. The body can be a complex function (e.g., a cache-oblivious B-tree insert!).
xMunmap: Unmap the region.
xShutdown: Shut down the runtime.
* Currently Libxac logs the transaction commits, but we haven't implemented the recovery program yet.
9
Libxac Memory Model
Aborted transactions are visible to the programmer (thus, the programmer must explicitly retry the transaction).
Control flow always proceeds from xbegin() to xend(). Thus, the transaction body can contain system/library calls.
At xend(), all changes to the xMmap'ed region are discarded on FAILURE, or committed on SUCCESS.
Aborted transactions always see a consistent state. Read-only transactions can always succeed.
int main(void) {
    int* x;
    int status = FAILURE;
    xInit("/logs", DURABLE);
    x = xMmap("input.db", 4096);
    while (status != SUCCESS) {
        xbegin();
        x[0]++;
        status = xend();
    }
    xMunmap(x);
    xShutdown();
    return 0;
}
* Libxac supports concurrent transactions on multiple processes, not threads.
10
Implementation Sketch
Libxac detects memory accesses by using a SIGSEGV handler to catch a memory protection violation on a page that has been mmap'ed. This mechanism is slow for normal transactions (time for mmap and the SIGSEGV handler: ~10 μs). It is efficient if we must perform disk I/O to log transaction commits, because the handler cost is small next to the time to access disk (~10 ms).
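The detection mechanism described above can be illustrated with a minimal sketch, assuming Linux and an anonymous mapping rather than a file. It is not Libxac's code: the names `track_accesses` and `on_segv` are hypothetical, and unlike Libxac it does not distinguish reads from writes. It also calls mprotect inside a signal handler, which works on Linux but is not on the POSIX list of async-signal-safe functions.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define REGION_PAGES 2

static char *region;                 /* start of the tracked mapping */
static size_t page_size;
static volatile sig_atomic_t faults; /* number of first-touch accesses seen */

/* On a fault inside the region, record the access and unprotect the page. */
static void on_segv(int sig, siginfo_t *info, void *ctx) {
    (void)sig;
    (void)ctx;
    uintptr_t addr = (uintptr_t)info->si_addr;
    uintptr_t base = (uintptr_t)region;
    if (addr < base || addr >= base + REGION_PAGES * page_size) {
        _exit(1);                    /* a genuine crash, not a tracked access */
    }
    uintptr_t page = addr & ~(uintptr_t)(page_size - 1);
    faults++;
    mprotect((void *)page, page_size, PROT_READ | PROT_WRITE);
}

/* Map protected pages, touch one of them twice, and return the fault count. */
static int track_accesses(void) {
    page_size = (size_t)sysconf(_SC_PAGESIZE);
    assert(page_size > 0);
    region = mmap(NULL, REGION_PAGES * page_size, PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) return -1;

    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0) return -1;

    volatile char *p = region;
    faults = 0;
    p[0] = 42;                       /* first touch of page 0: one fault */
    p[1] = 43;                       /* same page, already unprotected */
    int seen = (int)faults;
    munmap(region, REGION_PAGES * page_size);
    return seen;
}
```

Only the first access to a page pays for the fault; later accesses to the same page run at full speed, which is why the overhead is per page rather than per access.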
11
Is xMmap practical?
Experiment on a 4-processor AMD Opteron, performing 100,000 insertions of elements with random keys into a B-tree. Each insert is a separate transaction. Libxac and BDB both implement group commit. The B-tree and the COB-tree both use Libxac. Note that none of the three data structures has been properly tuned. Conclusion: We should achieve good performance.
12
Outline Programming with Libxac Cache-Oblivious B-trees
13
What is a Cache-Oblivious B-tree?
A cache-oblivious B-tree (e.g., [BDFC00]) is a dynamic dictionary data structure that supports searches, insertions/deletions, and range queries. A cache-oblivious algorithm/data structure does not know system parameters (e.g., the block size B). Theorem [FLPR99]: a cache-oblivious algorithm that is optimal for a two-level memory hierarchy is also optimal for a multi-level hierarchy.
14
Cache-Oblivious B-Tree Example
[Figure: static cache-oblivious tree (index keys 21; 10, 45; 4, 16, 38, 54; leaf keys 4, 10, 16, 21, 38, 45, 54, 83) indexing a packed memory array (PMA) that holds keys 1 through 83 in sorted order with gaps.]
The COB-tree can be divided into two pieces:
A packed memory array that stores the data in order, but contains gaps.
A static cache-oblivious binary tree that indexes the packed memory array.
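How the two pieces cooperate on a search can be sketched as follows. This is a simplified illustration, not the authors' implementation: the static tree is replaced by a sorted array of per-section maxima (the role played by the tree's leaf keys), every section is assumed non-empty, and `EMPTY_SLOT`, `SECTION`, `pma_search`, and the sample arrays are hypothetical. The sample keys match the figure, but the gap positions are assumed.

```c
#include <assert.h>
#include <stddef.h>

#define EMPTY_SLOT (-1)   /* hypothetical marker for a gap */
#define SECTION 4         /* slots per PMA section (illustrative) */

/* Sample PMA: keys in sorted order, with gaps at assumed positions. */
static const int sample_pma[32] = {
    1, 4, EMPTY_SLOT, EMPTY_SLOT,    6, 7, 10, EMPTY_SLOT,
    13, 15, 16, EMPTY_SLOT,          21, EMPTY_SLOT, EMPTY_SLOT, EMPTY_SLOT,
    23, 24, 31, 38,                  39, 40, 45, EMPTY_SLOT,
    48, 54, EMPTY_SLOT, EMPTY_SLOT,  56, 59, 70, 83
};
/* Largest key in each section; stands in for the index tree's leaf keys. */
static const int sample_max[8] = { 4, 10, 16, 21, 38, 45, 54, 83 };

/* Return the slot holding key, or -1 if the key is absent. */
static int pma_search(const int *pma, const int *section_max,
                      size_t num_sections, int key) {
    assert(num_sections > 0);
    /* Binary search for the first section whose maximum is >= key. */
    size_t lo = 0, hi = num_sections;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (section_max[mid] < key) lo = mid + 1; else hi = mid;
    }
    if (lo == num_sections) return -1;   /* key exceeds every stored key */
    /* Scan the chosen section; gaps never match a real key. */
    for (size_t i = lo * SECTION; i < (lo + 1) * SECTION; i++) {
        if (pma[i] == key) return (int)i;
    }
    return -1;
}
```

A search for 31 lands in the section whose maximum is 38 and finds the key there; a search for 37 lands in the same section and fails, which is also the section an insert of 37 would target.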
15
Cache-Oblivious B-Tree Insert
[Figure: static cache-oblivious tree (index keys 21; 10, 45; 4, 16, 38, 54; leaf keys 4, 10, 16, 21, 38, 45, 54, 83) over the packed memory array.]
To insert a key of 37:
16
Cache-Oblivious B-Tree Insert
[Figure: static cache-oblivious tree (index keys 21; 10, 45; 4, 16, 38, 54; leaf keys 4, 10, 16, 21, 38, 45, 54, 83) over the packed memory array, with the new key 37 shown at its target section.]
To insert a key of 37:
1. Find the correct section of the PMA using the static tree.
17
Cache-Oblivious B-Tree Insert
[Figure: static cache-oblivious tree (index keys 21; 10, 45; 4, 16, 38, 54; leaf keys 4, 10, 16, 21, 38, 45, 54, 83) over the packed memory array, with the new key 37 shown at its target section.]
To insert a key of 37:
1. Find the correct section of the PMA using the static tree.
2. Insert into the PMA. This step may cause a rebalance of the PMA.
18
Cache-Oblivious B-Tree Insert
[Figure: the static tree is unchanged (index keys 21; 10, 45; 4, 16, 38, 54; leaf keys 4, 10, 16, 21, 38, 45, 54, 83), while the packed memory array now contains 37 and its right half has been rebalanced.]
To insert a key of 37:
1. Find the correct section of the PMA using the static tree.
2. Insert into the PMA. This step possibly requires a rebalance.
3. Fix the static tree.
19
Cache-Oblivious B-Tree Insert
[Figure: the static tree after the fix (index keys 21; 10, 40; 4, 16, 37, 56; leaf keys 4, 10, 16, 21, 37, 40, 56, 83) over the rebalanced packed memory array containing 37.]
To insert a key of 37:
1. Find the correct section of the PMA using the static tree.
2. Insert into the PMA. This step possibly requires a rebalance.
3. Fix the static tree.
20
Cache-Oblivious B-Tree Insert
[Figure: the static tree after the fix (index keys 21; 10, 40; 4, 16, 37, 56; leaf keys 4, 10, 16, 21, 37, 40, 56, 83) over the rebalanced packed memory array containing 37.]
Insert is a complex operation. If we wanted to use locks, what is the locking protocol? What is the right (cache-oblivious?) lock granularity?
21
Conclusions
A page-based TM system such as Libxac:
Represents a good match for disk-resident data structures. The per-page overheads of TM are small compared to the cost of I/O.
Is easy to program with. Libxac allows us to program a concurrent, disk-resident data structure with ACID properties, as though it were stored in memory.
23
Semantics of Local Variables
int main(void) {
    int y=0, z=0, a=0, b=0;
    int* x;
    int status = FAILURE;
    xInit("/logs", DURABLE);
    x = xMmap("input.db", 4096);
    while (status != SUCCESS) {
        a++;
        xbegin();
        b = x[0];
        y++;
        x[0]++;
        z = x[0] - 1;
        status = xend();
    }
    xMunmap(x);
    xShutdown();
    return 0;
}
In this system, Libxac guarantees that after the loop completes, a == y. The value of a is the number of times the transaction is attempted.
We always have b == z because aborted transactions always see a consistent state, even if other programs concurrently access the first page of input.db.
24
TM System Improvements?
Possible improvements to Libxac:
Provide more efficient support for non-durable transactions by modifying the OS to track and report the pages accessed?
Integrate Libxac with another TM system to provide concurrency control on both multiple threads and multiple processes?
25
Implementation Sketch
Transaction: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();
[Figure: memory map of input.txt with pages starting at x[0], x[1024], x[2048], and x[3072], all marked PROT_NONE, alongside the buffer file and the log file.]
26
Implementation Sketch
Transaction: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();
[Figure: the read of x[0] causes a segmentation fault, and the page at x[0] changes from PROT_NONE to PROT_READ. The pages at x[1024], x[2048], and x[3072] remain PROT_NONE.]
27
Implementation Sketch
Transaction: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();
[Figure: the access to x[1024] causes a segmentation fault, and the page at x[1024] changes from PROT_NONE to PROT_READ. The page at x[0] is PROT_READ; the pages at x[2048] and x[3072] remain PROT_NONE.]
28
Implementation Sketch
Transaction: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();
[Figure: the write to x[1024] causes a second segmentation fault on the page at x[1024], which is currently PROT_READ.]
29
Implementation Sketch
Transaction: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();
[Figure: while handling the second segmentation fault, the contents of the page at x[1024] are copied, so the original version of the page is preserved.]
30
Implementation Sketch
Transaction: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();
[Figure: after the second segmentation fault is handled, the page at x[1024] becomes PROT_READ|PROT_WRITE.]
31
Implementation Sketch
Transaction: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();
[Figure: the write to x[1024] proceeds on the page marked PROT_READ|PROT_WRITE. The page at x[0] is PROT_READ; the pages at x[2048] and x[3072] remain PROT_NONE.]
32
Implementation Sketch
Transaction: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();
[Figure: at xend(), the transaction's changes are logged on disk in the log file.]
33
Implementation Sketch
Transaction: int a; xbegin(); a = x[0]; x[1024] += a+1; xend();
[Figure: as xend() completes, the committed contents are copied and every page of the memory map returns to PROT_NONE.]
34
Focus on PMA Rebalance
insert (tree, key, value) {
    xbegin();
    x = find_location_in_pma(tree->static_index, key);
    insert_into_pma(tree->pma, key, value, x);
    fix_static_index(tree->static_index);
    xend();
}
Callout: rebalance(tree->pma);
In this talk, we illustrate the problems of transaction I/O by considering a transactional rebalance of the packed memory array.
35
Rebalance of an In-Memory Array
void rebalance(int* x, int n) {
    int i;
    int count = 0;
    // Slide everything left
    for (i = 0; i < n; i++) {
        if (x[i] != EMPTY_SLOT) {
            x[count] = x[i];
            count++;
        }
    }
    // Redistribute items from right
    int j = count-1;
    double spacing = 1.0*n/count;
    for (i = n-1; i >= 0; i--) {
        if (floor(j*spacing) == i) {
            x[i] = x[j];
            j--;    // move on to the next item to place
        } else {
            x[i] = EMPTY_SLOT;
        }
    }
}
[Figure: an array of keys 1 through 6 with a gap, shown before the rebalance, after everything is slid left, and after the items are redistributed.]
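A self-contained variant of the in-memory rebalance is sketched below so the two passes can be traced on a concrete array. It swaps the floating-point `floor(j*spacing)` for the integer expression `j*n/count`, which picks the same target slots when the products fit in a `long`; `EMPTY_SLOT` is given a hypothetical value of -1.

```c
#include <assert.h>
#include <stddef.h>

#define EMPTY_SLOT (-1)   /* hypothetical gap marker; real keys must differ */

/* Spread the non-empty entries of x[0..n-1] evenly, preserving their order. */
static void rebalance(int *x, int n) {
    assert(x != NULL && n >= 0);
    int count = 0;
    /* Pass 1: slide every element to the left end of the array. */
    for (int i = 0; i < n; i++) {
        if (x[i] != EMPTY_SLOT) {
            x[count++] = x[i];
        }
    }
    /* Pass 2: walk right to left, moving element j to slot j*n/count.
     * Element j never sits to the right of its target, so no element is
     * overwritten before it has been moved. */
    int j = count - 1;
    for (int i = n - 1; i >= 0; i--) {
        if (j >= 0 && (long)j * n / count == i) {
            x[i] = x[j];
            j--;
        } else {
            x[i] = EMPTY_SLOT;
        }
    }
}
```

For an 8-slot array holding 1, 2, gap, 3, 4, gap, 5, 6, the first pass packs the six keys into slots 0 to 5 and the second pass moves them to slots 0, 1, 2, 4, 5, and 6, leaving gaps at slots 3 and 7.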
36
Rebalance with Explicit I/O
void rebalance_with_IO(int n) {
    int i;
    int count = 0;
    int* y;
    int* z;
    y = read_block(0);
    z = read_block(0);
    for (i = 0; i < n; i++) {
        if (i % B == 0) {
            y = read_block(i/B);
        }
        if (y[i%B] != EMPTY_SLOT) {
            if (count % B == 0) {
                z = read_block(count/B);
            }
            z[count%B] = y[i%B];
            count++;
            ...
Callout: write_block(…)
Why do we want to avoid performing explicit I/O to read/write data blocks? Issues:
What if the data does not fit into memory? We must have buffer management code somewhere.
A cache-oblivious algorithm does not know the value of B!
37
Rebalance using Memory Mapping
What if the data does not fit into memory? If we use memory mapping, then the OS automatically buffers pages that are accessed.
What value of B do we choose for a cache-oblivious algorithm? I/O is transparent to the user. B does not appear in the application code.
void rebalance_with_mmap(int n) {
    int i;
    int count = 0;
    int* x;
    x = mmap("input.db", n*sizeof(int));    // simplified call; the real mmap takes more arguments
    for (i = 0; i < n; i++) {
        if (x[i] != EMPTY_SLOT) {
            x[count] = x[i];
            count++;
        }
    }
    ...
    munmap(x, n*sizeof(int));
}
Using mmap, the code looks like the in-memory rebalance. But we still need concurrency control.
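The slide abbreviates the mapping call. A minimal sketch of what it stands for with the POSIX interface is shown below, assuming a Unix-like system; the helper names `map_ints` and `unmap_ints` are hypothetical and are not part of Libxac or of the original code.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map the first n ints of the file at path for reading and writing.
 * The file is created if missing and extended if it is too short.
 * Returns NULL on failure. */
static int *map_ints(const char *path, size_t n) {
    assert(n > 0);                    /* mmap rejects zero-length mappings */
    off_t need = (off_t)(n * sizeof(int));
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0 ||
        (st.st_size < need && ftruncate(fd, need) != 0)) {
        close(fd);
        return NULL;
    }
    /* MAP_SHARED makes stores reach the file; the OS pages data in and out. */
    void *p = mmap(NULL, n * sizeof(int), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);                        /* the mapping stays valid after close */
    return p == MAP_FAILED ? NULL : (int *)p;
}

/* Release a mapping created by map_ints. Returns 0 on success. */
static int unmap_ints(int *x, size_t n) {
    return munmap(x, n * sizeof(int));
}
```

Once mapped, the array is read and written with ordinary loads and stores, so the rebalance loop needs no block size and no buffer management code of its own.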
38
Concurrent Rebalance?
What happens if we want the rebalance to occur concurrently?
void rebalance_with_mmap(int n) {
    int i;
    int count = 0;
    int* x;
    x = mmap("input.db", n*sizeof(int));    // simplified call; the real mmap takes more arguments
    for (i = 0; i < n; i++) {
        if (x[i] != EMPTY_SLOT) {
            x[count] = x[i];
            count++;
        }
    }
    ...
    munmap(x, n*sizeof(int));
}
If we use locks, what do we choose as the locking granularity?
If we use transactions, will the system need to unroll I/O when a transaction aborts to ensure the data on disk is consistent?
39
Transactional Memory Mapping
void rebalance_with_xMmap(int n) {
    int i;
    int count = 0;
    int* x;
    x = xMmap("input.db", n*sizeof(int));
    xbegin();
    for (i = 0; i < n; i++) {
        if (x[i] != EMPTY_SLOT) {
            x[count] = x[i];
            count++;
        }
    }
    ...
    xend();
    xMunmap(x, n*sizeof(int));
}
Our solution: Replace mmap with xMmap, and use transactions for concurrency control. The transaction system maintains multiple versions of a page to avoid needing to unroll I/O.
Transactional memory mapping simplifies the code for a concurrent disk-resident data structure.
40
Cache-Oblivious B-Tree Insert
[Figure: static cache-oblivious tree (index keys 21; 10, 45; 4, 16, 38, 54; leaf keys 4, 10, 16, 21, 38, 45, 54, 83) over the packed memory array, with 37 added and the array not yet rebalanced. Packed memory array density thresholds shown: 2.5 ≤ n ≤ 7.5 and 6 ≤ n ≤ 14.]
2(a) Add 37 to packed memory array.
41
Cache-Oblivious B-Tree Insert
[Figure: static cache-oblivious tree (index keys 21; 10, 45; 4, 16, 38, 54; leaf keys 4, 10, 16, 21, 38, 45, 54, 83) over the packed memory array after the rebalance. Packed memory array density thresholds shown: 2.5 ≤ n ≤ 7.5 and 6 ≤ n ≤ 14.]
2(a) Add 37 to packed memory array.
2(b) Rebalance the PMA.
42
Dictionary Operations using a B+-Tree
[Figure: B+-tree with interior keys 4, 10, 20, 42 above leaf blocks of size B holding keys 1 through 85.]
The branching factor of the tree and the size of a block on disk are both Θ(B). For a B+-tree, the data is stored at the leaves. The keys at an interior node represent the maximum key of that node's subtree. But to build the tree, we need to know the value of B…
43
Cache-Oblivious B-tree [BDFC00]
Operation                 Cost in Block Transfers
Search(key)               O(log_B N)
Insert(key, value)        O(log_B N)**
Delete(key, value)        O(log_B N)**
RangeQuery(start, end)    O(log_B N + k/B)*
* Bound assumes the range query finds k items with keys between start and end.
** Amortized bound.
It is possible to support dictionary operations cache-obliviously, i.e., with a data structure that does not know the value of B. The cache-oblivious B-tree (COB-tree) achieves the same asymptotic (amortized) bounds as a B+-tree.
44
Cache-Oblivious B-Tree Overview
Static Cache-Oblivious Binary Tree The static tree is used as an index into a packed memory array. To perform an insert, insert into the packed memory array, and update the static tree. When the packed memory array becomes too full (empty), rebuild and grow (shrink) the entire data structure.
45
Static Cache-Oblivious Binary Tree
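[Figure: a tree of N nodes split into a top subtree of size Θ(N^(1/2)) and bottom subtrees 1, 2, 3, …, N^(1/2), each of size Θ(N^(1/2)); each of these is split again into pieces of size Θ(N^(1/4)), numbered 1, 2, 3, …, N^(1/4).]
Divide the tree into Θ(N^(1/2)) subtrees of size Θ(N^(1/2)). Lay out each subtree contiguously in memory, recursively.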
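The recursive layout described above can be sketched for a complete binary tree as follows. This is an illustration under stated assumptions, not the authors' code: nodes are named by their breadth-first index (root 1, children of v at 2v and 2v+1), the top subtree takes the rounded-down half of the height, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assign consecutive memory positions to the subtree of height h rooted at
 * breadth-first index root, starting from *next. pos[v] is node v's slot. */
static void layout_subtree(unsigned root, int h, unsigned *pos,
                           unsigned *next) {
    if (h == 1) {
        pos[root] = (*next)++;
        return;
    }
    int top = h / 2;          /* height of the top subtree */
    int bottom = h - top;     /* height of each bottom subtree */
    /* Lay out the top subtree contiguously first. */
    layout_subtree(root, top, pos, next);
    /* The bottom subtrees are rooted at the 2^top descendants that lie
     * `top` levels below root; their breadth-first indices are contiguous. */
    unsigned first = root << top;
    for (unsigned k = 0; k < (1u << top); k++) {
        layout_subtree(first + k, bottom, pos, next);
    }
}

/* Fill pos[1 .. 2^h - 1] for a complete binary tree with h levels.
 * pos must have room for 2^h entries; pos[0] is unused. */
static void recursive_layout(int h, unsigned *pos) {
    assert(h >= 1 && h < 31 && pos != NULL);
    unsigned next = 0;
    layout_subtree(1, h, pos, &next);
}
```

For a four-level tree of 15 nodes, the root and its two children occupy slots 0 to 2, and each of the four bottom subtrees (a node plus its two children) occupies three consecutive slots after that, so a root-to-leaf walk touches only two contiguous groups of three nodes.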
46
Packed Memory Array
A packed memory array uses a contiguous section of memory to store elements in order, but with gaps. For sections of size 2^k, gaps are spaced to maintain specified density thresholds that become arithmetically stricter as k increases.
[Figure: a packed memory array holding keys 1 through 83 with gaps, with nested sections labeled by density thresholds [4/16, 16/16], [5/16, 15/16], [6/16, 14/16], and [7/16, 13/16].]
Density Thresholds Example:
2^4: Density between 4/16 and 16/16.
2^5: Density between 5/16 and 15/16.
2^6: Density between 6/16 and 14/16.
2^7: Density between 7/16 and 13/16.
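The threshold pattern above can be written down directly. The sketch below assumes, as in the slide's example, that the bounds start at [4/16, 16/16] for the smallest section and tighten by 1/16 on each side per doubling; `density_t`, `thresholds`, and `within` are hypothetical names.

```c
#include <assert.h>

/* Density bounds for one section size, stored as numerators over 16. */
typedef struct {
    int lo_num;
    int hi_num;
} density_t;

/* Bounds for a section `level` doublings above the smallest section:
 * level 0 -> [4/16, 16/16], level 1 -> [5/16, 15/16], and so on. */
static density_t thresholds(int level) {
    assert(level >= 0 && level <= 6);   /* beyond 6 the bounds would cross */
    density_t d = { 4 + level, 16 - level };
    return d;
}

/* True if `count` elements in a section of `size` slots respect d.
 * Compares 16*count against numerator*size to stay in integer arithmetic. */
static int within(density_t d, int count, int size) {
    return 16 * count >= d.lo_num * size && 16 * count <= d.hi_num * size;
}
```

With these bounds, an 8-slot section one level up may hold between 2.5 and 7.5 elements, so 8 elements violate it, while a 16-slot section two levels up may hold between 6 and 14 elements. An insert that overfills a small section therefore widens the rebalance window until the enclosing section is within its bounds.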
47
Example Cache-Oblivious B-Tree
[Figure: static cache-oblivious tree (index keys 21; 10, 45; 4, 16, 38, 54; leaf keys 4, 10, 16, 21, 38, 45, 54, 83) over the packed memory array. Packed memory array density thresholds shown: [4/16, 16/16], [5/16, 15/16], [6/16, 14/16], [7/16, 13/16].]
48
Cache-Oblivious B-Tree Insert
[Figure: static cache-oblivious tree (index keys 21; 10, 45; 4, 16, 38, 54; leaf keys 4, 10, 16, 21, 38, 45, 54, 83) over the packed memory array.]
Insert 37:
49
Cache-Oblivious B-Tree Insert
[Figure: static cache-oblivious tree (index keys 21; 10, 45; 4, 16, 38, 54; leaf keys 4, 10, 16, 21, 38, 45, 54, 83) over the packed memory array, with the new key 37 shown at its target section.]
Insert 37:
1. Find the correct section of the PMA using the static tree.
50
Cache-Oblivious B-Tree Insert
[Figure: static cache-oblivious tree (index keys 21; 10, 45; 4, 16, 38, 54; leaf keys 4, 10, 16, 21, 38, 45, 54, 83) over the packed memory array, with the new key 37 shown at its target section.]
Insert 37:
1. Find the correct section of the PMA using the static tree.
2. Insert into the PMA. This step possibly requires a rebalance.
51
Example Cache-Oblivious B-Tree
[Figure: static cache-oblivious tree (index keys 21; 10, 45; 4, 16, 38, 54; leaf keys 4, 10, 16, 21, 38, 45, 54, 83) over the packed memory array, with 37 added and the array not yet rebalanced. Density thresholds shown: [5/16, 15/16] and [6/16, 14/16].]
2(a) Add 37 to packed memory array.
52
Example Cache-Oblivious B-Tree
[Figure: static cache-oblivious tree (index keys 21; 10, 45; 4, 16, 38, 54; leaf keys 4, 10, 16, 21, 38, 45, 54, 83) over the packed memory array after the rebalance. Density thresholds shown: [5/16, 15/16] and [6/16, 14/16].]
2(a) Add 37 to packed memory array.
2(b) Rebalance the PMA.
53
Cache-Oblivious B-Tree Insert
[Figure: the static tree after the fix (index keys 21; 10, 40; 4, 16, 37, 56; leaf keys 4, 10, 16, 21, 37, 40, 56, 83) over the rebalanced packed memory array containing 37.]
Insert 37:
1. Find the correct section of the PMA using the static tree.
2. Insert into the PMA. This step possibly requires a rebalance.
3. Fix the static tree.