Hathi: Durable Transactions for Memory using Flash


Hathi: Durable Transactions for Memory using Flash
Mohit Saxena, University of Wisconsin-Madison (work done at HP Labs)
Mehul A. Shah and Stavros Harizopoulos, Nou Data
Michael M. Swift, University of Wisconsin-Madison
Arif Merchant, Google

Durable Storage
Relational DBMS:
- Fine-grained control over durability
- High-level interface for structured relations
- Complex, heavy-weight ACID transaction management
File Systems:
- Low-level interface to read/write bytes in a file
- Coarse-grained control over durability
- Designed for secondary storage

What has changed?
Technology:
- Large main-memory sizes for in-core data sets
- Fast flash SSDs for persistence
- Multi-core processors for parallelism
Application workloads:
- Memory-resident: main-memory key-value stores, social network graphs, persistent logs for network servers, and massively multiplayer online games
These workloads need fast, scalable, and fine-grained durable storage.

Hathi: Rethinking Durable Storage
Persistent Heap:
- Convenient in-memory data structures
- Simple, low-latency memory loads/stores
Memory Transactions:
- Minimal design: software transactional memory for concurrency control and ACI
Durability Control:
- Fine-grained control over commit of memory transactions
- Optimized for high-speed flash SSDs
Positioning: Mnemosyne and NV-Heaps [ASPLOS '11] provide transactions on storage-class memory; Hathi provides them on flash SSDs.

Outline
- Introduction
- Design: transaction interface; partitioned logging & durability control; checkpoint-recovery
- Evaluation
- Conclusions

Hathi Design Goals
- Simple application interface: persistent heaps, memory transactions
- Fast, scaling with transaction threads: partitioned logging, TM-based checkpointing
- Durable, with fine-grained control: partitioned & split-phase commit, recovery
[Architecture figure: transaction interface over DRAM, the Hathi transaction manager, and flash packages on the SSD]

Persistent Heap
- createHeap: allocates memory segments and associates them with a checkpoint on SSD
- read_heap: reads from a given heap address into a user buffer
- write_heap: updates a thread-local copy
- Example data structures: graphs, trees, hash tables
Interface:
    Heap *hp = createHeap(size)
    read_heap(hp, offset, len, dstbuf)
    write_heap(hp, offset, len, srcbuf)
Data structure update (with compiler instrumentation):
    update_item_price(items, n, newprice) {
        if (newprice < 0) return FALSE;
        items->price[n] = newprice;
        return TRUE;
    }
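
As a concrete illustration, the sketch below stores and loads fixed-size records through the heap calls listed on the slide. Only the call names and argument order come from the slide; the declarations, return types, the item layout, and the store_item/load_item helpers are assumptions for illustration.

    #include <stdlib.h>

    /* Assumed declarations for the slide's heap API; exact types are guesses. */
    typedef struct Heap Heap;
    Heap *createHeap(size_t size);
    int   read_heap(Heap *hp, size_t offset, size_t len, void *dstbuf);
    int   write_heap(Heap *hp, size_t offset, size_t len, const void *srcbuf);

    struct item { int id; double price; };

    /* Store one item at a fixed slot inside the persistent heap. */
    int store_item(Heap *hp, size_t slot, const struct item *it)
    {
        return write_heap(hp, slot * sizeof(*it), sizeof(*it), it);
    }

    /* Read the item back into a caller-supplied buffer. */
    int load_item(Heap *hp, size_t slot, struct item *out)
    {
        return read_heap(hp, slot * sizeof(*out), sizeof(*out), out);
    }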

Durability Interface
Hathi transactions are ACI-D.
[Figure: transaction threads append log records to a memory log; tx_commit flushes them to the durable log]
Transactional update:
    update_item_price(items, n, newprice) {
        if (newprice < 0) return FALSE;
        tx_start();
        items->price[n] = newprice;
        tx_commit();
        return TRUE;
    }
The key question: how do we make tx_commit fast and scalable?

Split-Phase Commit
Challenge: achieve the durability of synchronous commit with the performance of asynchronous commit, borrowing the fsync idea for memory transactions.
Solution:
- Initiate lsn = tx_commit(async) early
- Optionally call isStable(lsn, wait) later
- This gives finer-grained durability control for transactions than fsync
Worker thread example:
    tx_start
    read(hp, offset, len, dstbuf)
    …
    write(hp, offset, len, srcbuf)
    lsn = tx_commit(async)
    isStable(lsn, wait)
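
A minimal sketch of how split-phase commit might be used, assuming an lsn_t type and that tx_commit(async) and isStable(lsn, wait) behave as the slide describes. The exact signatures and the batching helper are illustrative, and the sketch reuses the assumed Heap/write_heap declarations from the earlier persistent-heap example.

    typedef unsigned long lsn_t;

    /* Assumed signatures matching the slide's interface. */
    void  tx_start(void);
    lsn_t tx_commit_async(void);            /* slide: lsn = tx_commit(async) */
    int   isStable(lsn_t lsn, int wait);    /* wait != 0 blocks until lsn is durable */

    void update_prices(Heap *hp, const size_t *offsets, const double *prices, int n)
    {
        lsn_t last = 0;
        for (int i = 0; i < n; i++) {
            tx_start();
            write_heap(hp, offsets[i], sizeof(double), &prices[i]);
            last = tx_commit_async();       /* returns immediately, like an async write */
        }
        /* One durability wait covers the whole batch, analogous to a single fsync. */
        isStable(last, 1);
    }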

Partitioned Logging
Challenge: scale tx_commit with multiple cores and make tx_commit fast on SSDs.
Solution:
- write inserts log records into a per-thread memory log
- tx_commit flushes log records to a per-thread durable log
Advantages: lower contention; more concurrent requests to the SSD.
[Figure: partitioned memory logs flushed to partitioned durable logs]
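
The sketch below shows one way per-thread logging could be structured: each thread buffers records privately and flushes them to its own log file, so commits from different threads never contend on a shared log tail. The record layout, buffer size, and pwrite-based flush are assumptions for illustration, not Hathi's actual log format.

    #include <string.h>
    #include <unistd.h>

    #define LOG_BUF_SIZE (1 << 20)

    struct log_partition {
        int    fd;                    /* this thread's durable log file on the SSD */
        size_t used;                  /* bytes currently buffered in memory */
        off_t  file_off;              /* next write offset in the durable log */
        char   buf[LOG_BUF_SIZE];     /* this thread's in-memory log */
    };

    static __thread struct log_partition my_log;   /* one partition per thread */

    /* On write: append a record to the private buffer, no locking needed. */
    void log_append(const void *rec, size_t len)
    {
        memcpy(my_log.buf + my_log.used, rec, len);   /* overflow handling omitted */
        my_log.used += len;
    }

    /* On tx_commit: each thread issues its own I/O, so the SSD sees many
     * concurrent requests instead of a single serialized log stream. */
    void log_flush(void)
    {
        pwrite(my_log.fd, my_log.buf, my_log.used, my_log.file_off);
        my_log.file_off += (off_t)my_log.used;
        my_log.used = 0;
    }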

Partitioned Commit
Challenge: partitioned logging still requires synchrony across transaction threads at commit, which increases the latency of transaction commit.
Solution:
- Observation: partitioned data structures do not require isolation
- tx_commit(partition) flushes only the local memory buffer to that thread's durable log
[Figure: threads T1, T2, T3 each issuing tx_commit(partition) to their own durable log partition]
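
To make the difference concrete, the sketch below contrasts a commit that must coordinate across all logging threads with the per-partition commit the slide describes. The barrier helper and both function names are hypothetical, and log_flush is reused from the previous sketch.

    /* Hypothetical barrier across all committing threads. */
    void wait_for_all_commit_threads(void);

    /* Coordinated commit: every thread must reach the barrier before logs are
     * flushed, so one slow thread delays everyone. */
    void tx_commit_coordinated(void)
    {
        wait_for_all_commit_threads();
        log_flush();
    }

    /* Per-partition commit: when a transaction touches only its own partition,
     * the thread can flush its private log immediately, with no coordination. */
    void tx_commit_partition(void)
    {
        log_flush();
    }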

Checkpoint
Challenge: bound log size and recovery time; checkpointing the heap should not conflict with concurrent transactions.
Solution:
- Incremental checkpointing at chunk granularity minimizes conflicts
- STM protects chunk writes during checkpointing
Checkpoint thread:
    for each chunk i in heap hp do
        tx_start
        read(hp, i, chunksize, copyBuffer)
        chunkLSN = tx_commit(async)
    end for
    isStable(lastChunkLSN, true)
    update checkpoint header
    sleep(timer)
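
Rendered as C, the checkpoint loop might look like the sketch below. It reuses the assumed lsn_t, tx_start, tx_commit_async, isStable, Heap, and read_heap declarations from the earlier sketches; the checkpoint interval and the write_chunk_to_checkpoint and update_checkpoint_header helpers are hypothetical.

    #include <stdlib.h>
    #include <unistd.h>

    #define CKPT_INTERVAL_SEC 30    /* assumed checkpoint timer */

    void write_chunk_to_checkpoint(size_t off, const void *buf, size_t len); /* hypothetical */
    void update_checkpoint_header(void);                                     /* hypothetical */

    void checkpoint_thread(Heap *hp, size_t heap_size, size_t chunksize)
    {
        char *copyBuffer = malloc(chunksize);
        for (;;) {
            lsn_t lastChunkLSN = 0;
            for (size_t off = 0; off < heap_size; off += chunksize) {
                tx_start();                      /* the chunk copy runs as an STM transaction */
                read_heap(hp, off, chunksize, copyBuffer);
                lastChunkLSN = tx_commit_async();
                write_chunk_to_checkpoint(off, copyBuffer, chunksize);
            }
            isStable(lastChunkLSN, 1);     /* wait until the last chunk's txn is durable */
            update_checkpoint_header();    /* switch to the new checkpoint */
            sleep(CKPT_INTERVAL_SEC);
        }
    }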

Recovery
Challenge: inter-partition log and checkpoint dependencies need to be resolved during recovery.
Solution:
- Load checkpoint chunks into memory
- Merge log records from all partitions in LSN order
- Roll-forward replay until it reaches the end of one partition or a gap in the LSNs
[Figure: checkpoint chunks tagged with LSNs and on-flash log partitions merge-sorted into an ordered stream of log records]
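
The merge-and-replay step might be sketched as below, reusing the assumed lsn_t and Heap types from earlier. The record layout, the next_record/pop_record/apply_record helpers, and the assumption that replayed LSNs are contiguous are simplifications for illustration.

    struct log_record { lsn_t lsn; /* payload omitted */ };

    struct log_record *next_record(int partition);   /* peek oldest record, NULL if exhausted */
    void pop_record(int partition);
    void apply_record(Heap *hp, const struct log_record *rec);
    lsn_t first_lsn_after_checkpoint(void);          /* hypothetical */

    void recover(Heap *hp, int nparts)
    {
        /* Checkpoint chunks are assumed to be loaded into the heap already. */
        lsn_t expected = first_lsn_after_checkpoint();
        for (;;) {
            int best = -1;
            for (int p = 0; p < nparts; p++) {       /* pick the smallest LSN across partitions */
                struct log_record *r = next_record(p);
                if (r == NULL) return;               /* end of one partition: stop replay */
                if (best < 0 || r->lsn < next_record(best)->lsn)
                    best = p;
            }
            if (best < 0) return;
            struct log_record *r = next_record(best);
            if (r->lsn != expected) return;          /* gap in the LSNs: stop replay */
            apply_record(hp, r);
            pop_record(best);
            expected++;
        }
    }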

Outline
- Introduction
- Design
- Evaluation: durability cost; commit-mode performance
- Conclusions

Methodology
Systems for comparison:
- TinySTM: software transactional memory (ACI)
- Hathi: TinySTM + partitioned/single logging for durability (ACI-D), with group-commit support
Workloads: synthetic and OLTP
- Synthetic: each thread continuously executes transactions, with six random read/write word offsets per transaction
- OLTP: the STAMP travel reservation benchmark
Setups: two machines
- High-end server: 3.0 GHz Intel Xeon quad-core, 4 GB heap, 80 GB PCIe FusionIO ioDrive
- Mainstream: 2.5 GHz Intel Core 2 Quad, 1 GB heap, 80 GB Intel X25-M SSD

Durability Cost
[Figure: transaction throughput (1000 txns/s) vs. number of threads. Workload: each transaction memcpys six random words; HP ProLiant server with FusionIO ioDrive, async commit. Annotations: peak of 1.25 M txns/s, 38% short, 130% faster.]

Commit Mode Performance
[Figure: transaction throughput relative to async commit (%) across commit modes, STAMP workload on the mainstream setup. Annotations: 47% short, 15% faster.]

Summary
Hathi rethinks durable storage for main memory:
- Persistent heap: a simple programming interface for main-memory workloads
- Software transactional memory: fast memory transactions for concurrency control
- Partitioned & split-phase commit: better durability performance on flash SSDs

Thanks!
Hathi: Durable Transactions for Memory using Flash
Mohit Saxena, University of Wisconsin-Madison
Mehul Shah and Stavros Harizopoulos, Nou Data
Michael M. Swift, University of Wisconsin-Madison
Arif Merchant, Google