© 2010 Ippokratis Pandis Aether: A Scalable Approach to Logging VLDB 2010 Ryan Johnson Ippokratis Pandis Radu Stoica Manos Athanassoulis Anastasia Ailamaki.

Presentation transcript:

Aether: A Scalable Approach to Logging
VLDB 2010
Ryan Johnson, Ippokratis Pandis, Radu Stoica, Manos Athanassoulis, Anastasia Ailamaki
Carnegie Mellon University; École Polytechnique Fédérale de Lausanne

Scalability is key!
- Modern hardware needs software parallelism
- OLTP is inherently parallel at the request level
  - Very good at providing high concurrency
  - But internal serializations limit execution parallelism
Need for scalable OLTP components

Logging is crucial for OLTP
- Fault tolerance
  - Crash recovery
  - Transaction abort/rollback
- Performance
  - Log changes for durability (no in-place updates)
  - Write dirty pages back asynchronously
- E.g., Amazon outage: $$$
Need an efficient and scalable logging solution

Logging is a bottleneck for scalability
(1) At commit, must yield for log flush
  - Synchronous I/O on the critical path
  - Locks held for a long time
  - Two context switches per commit
(2) Must insert records into the log buffer
  - Centralized main-memory structure
  - Source of contention
Working around the bottlenecks:
  - Asynchronous commit
  - Replace logging with replication and fail-over
[Figure: CPU-1 … CPU-N with L1/L2 caches (CPU), Data and Log (RAM, HDD)]
Workarounds compromise durability

Does correct logging have to be so slow? No!
- Locks held for a long time
  - Not actually used during the flush
  - Indirect way to enforce isolation
- Two context switches per commit
  - Transactions nearly stateless at commit time
  - Easy to migrate transactions between threads
- Log buffer is a source of contention
  - Log orders incoming requests, not threads
  - Log records can be combined
Aether: uncompromised, yet scalable logging

Agenda
- Logging-related problems
- Aether logging
  - Reducing lock contention
  - Reducing context switching
  - Scalable log buffer implementation
- Conclusions

Bottleneck 1: Amplified lock contention
[Figure: timeline of Xct 1 and Xct 2 from Commit to Done!; legend: Working, Lock Mgr., Log Mgr., I/O Waiting]
Other transactions wait for locks while the log flush I/O completes

Early Lock Release (ELR) in the case of a single log
- Finish transaction
- Release locks before commit
- Insert transaction commit record
- Wait until log record is flushed
- Dependent transactions (xcts) are serialized at the log buffer
- No extra overhead; the idea has been around for 30 years
  - ...but nobody uses it so far...
With ELR, other transactions do not wait for locks held during log flushes
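The ELR commit path can be illustrated with a minimal Python sketch. Everything named here (`ToyLog`, `commit_with_elr`, the `trace` list) is hypothetical and not taken from Shore-MT or the talk. The sketch assumes the commit record is placed in the log buffer before the locks are dropped, which is one way to make sure a dependent transaction's commit record lands behind it in the single log; the slide does not spell out that ordering.

```python
import threading


class ToyLog:
    """A single in-memory log; flush() stands in for the synchronous log I/O."""

    def __init__(self):
        self._mutex = threading.Lock()
        self.records = []      # log buffer contents, in LSN order
        self.durable_lsn = 0   # records[:durable_lsn] count as flushed

    def insert(self, record):
        with self._mutex:
            self.records.append(record)
            return len(self.records)   # LSN just past this record

    def flush(self):
        with self._mutex:
            self.durable_lsn = len(self.records)


def commit_with_elr(xct_id, held_locks, log, trace):
    """Commit one transaction, releasing its locks before the log flush."""
    # Assumption of this sketch: the commit record enters the log buffer first.
    commit_lsn = log.insert(("commit", xct_id))
    trace.append("commit record inserted")
    # Early lock release: other transactions can proceed during the flush.
    for lock in held_locks:
        lock.release()
    trace.append("locks released")
    # Only now wait for durability; the client is answered after this point.
    log.flush()
    trace.append("log flushed")
    return log.durable_lsn >= commit_lsn


row_lock = threading.Lock()
row_lock.acquire()   # the transaction holds a lock while it works
trace = []
durable = commit_with_elr(1, [row_lock], ToyLog(), trace)
```

The point of the trace is only the ordering: the lock is free again before the flush step starts, while the transaction is still not reported as committed until the flush has covered its commit record.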

ELR benefits
- Sun Niagara T2 (64 HW contexts), 64 GB RAM
- Memory-resident TPC-B in Shore-MT
- Zipfian distribution on transaction inputs
ELR is simple and sometimes very useful

Bottleneck 2: Excessive context switching
- One context switch per log flush
- Pressure on the OS scheduler
- Setup: Sun Niagara T2 (64 HW contexts), memory-resident TPC-B in Shore-MT
[Figure: timeline of Xct 1 and Xct 2 with a context switch at Commit; legend: Working, Log Mgr., I/O Waiting]
Must decouple thread scheduling from log flushes

Flush Pipelining
- The scheduler is in the critical path and wastes CPU
  - Multi-core HW only amplifies the problem
- But transactions are nearly stateless at commit
  - Detach transaction state from the worker thread
  - Pass it to the log writer
- Worker threads do not block at commit time
[Figure: timeline of Thread 1, Thread 2 and the Log Writer handling Xct 1 to Xct 4]
Staged-like mechanism = low scheduling costs
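A rough Python sketch of the hand-off described on this slide follows. The `LogWriter` class, its `submit`/`stop` methods and the use of `threading.Event` as a completion ticket are all assumptions made for illustration; they are not the Shore-MT interfaces. A counter stands in for the actual log flush I/O.

```python
import queue
import threading


class LogWriter(threading.Thread):
    """Dedicated thread that flushes the log and completes detached commits."""

    def __init__(self):
        super().__init__(daemon=True)
        self.pending = queue.Queue()   # detached transaction state
        self.flushes = 0               # number of simulated flush I/Os
        self.committed = []            # transactions completed after a flush

    def submit(self, xct_id):
        """Called by a worker at commit time; returns without blocking."""
        done = threading.Event()
        self.pending.put((xct_id, done))
        return done

    def stop(self):
        self.pending.put(None)         # shutdown sentinel

    def run(self):
        running = True
        while running:
            batch = [self.pending.get()]          # wait for the first commit
            try:
                while True:                       # then drain whatever queued up
                    batch.append(self.pending.get_nowait())
            except queue.Empty:
                pass
            if None in batch:
                running = False
                batch = [item for item in batch if item is not None]
            if batch:
                self.flushes += 1                 # one flush covers the whole batch
                for xct_id, done in batch:
                    self.committed.append(xct_id)
                    done.set()                    # transaction is now durable


writer = LogWriter()
writer.start()
# The worker hands off five commits and is immediately free for new work.
tickets = [writer.submit(xct_id) for xct_id in range(5)]
for ticket in tickets:
    ticket.wait(timeout=5)
writer.stop()
writer.join(timeout=5)
```

In this sketch the worker never sleeps on the flush itself; only the party that needs the commit acknowledgement waits on the ticket, and a single flush can complete several transactions at once.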

Impact of Flush Pipelining
- Setup: Sun Niagara T2 (64 HW contexts), memory-resident TPC-B in Shore-MT
Matches Asynchronous Commit throughput without compromising durability

Bottleneck 3: Log buffer contention
- Centralized log buffer
- Contention depends on
  - the number of participating threads
  - the size of modifications (KiB in the case of physical logging)
[Figure: timeline of Xct 1, Xct 2 and Xct 3 waiting on the log buffer latch; legend: Working, Log Mgr., I/O Waiting, Log Buffer Latch Waiting]

Eliminating critical sections
- Inspiration: elimination-based backoff*
  - Critical sections can cancel each other out
  - E.g., stack push/pop operations
- Adapt elimination-based backoff for database logging
  - Attempt to acquire the mutex
  - If that fails, back off, waiting on an array
  - If someone else already waits there, eliminate the requests without acquiring the mutex
[Figure: push() and pop() requests meeting in a station area in front of the stack]
* D. Hendler, N. Shavit, and L. Yerushalmi. A Scalable Lock-free Stack Algorithm. In Proc. SPAA, 2004.

Accessing the log buffer
- Break log insert into three logical steps
  (a) Reserve space by updating head LSN
  (b) Copy log record (memcpy)
  (c) Make insert visible by updating tail LSN, in LSN order
- Steps (a) + (c) can be consolidated
  - Accumulate requests off the critical path
  - Send only the group leader to fight for the critical section
- Move (b) out of the critical section
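The three steps, with the copy in step (b) moved out of the critical section, might look roughly like the Python sketch below. The class name `DecoupledLogBuffer`, the use of a condition variable for step (c), and the lack of buffer wraparound are simplifications assumed here, not details from the talk.

```python
import threading


class DecoupledLogBuffer:
    """Toy log buffer: reserve under a mutex, copy outside, publish in order."""

    def __init__(self, capacity):
        self.buf = bytearray(capacity)
        self.head = 0    # next free byte; reservations advance it
        self.tail = 0    # every byte below tail is complete and visible
        self._mutex = threading.Lock()
        self._released = threading.Condition()

    def insert(self, record):
        # (a) Reserve space: the only work done while holding the mutex.
        with self._mutex:
            start = self.head
            end = start + len(record)
            if end > len(self.buf):
                raise BufferError("log buffer full (no wraparound in this sketch)")
            self.head = end
        # (b) Copy the record outside the critical section.
        self.buf[start:end] = record
        # (c) Make the insert visible in LSN order: wait for predecessors.
        with self._released:
            while self.tail != start:
                self._released.wait()
            self.tail = end
            self._released.notify_all()
        return start


# Eight threads insert records of the two sizes mentioned in the talk.
records = [bytes([i]) * (48 if i % 2 else 160) for i in range(8)]
log = DecoupledLogBuffer(capacity=4096)
offsets = {}


def worker(rec):
    offsets[rec] = log.insert(rec)


threads = [threading.Thread(target=worker, args=(rec,)) for rec in records]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the mutex now protects only a pointer update, the time spent holding it no longer grows with the record size; the price is the extra ordering step when releasing.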

Design evolution
- Designs: (B) Baseline; (C) Consolidation array; (D) Decoupled buffer insert; (CD) Hybrid design
- contention(work) = O(1)
- contention(# threads) = O(1)
[Figure: per-design thread timelines; legend: Mutex held, Start/finish, Copy into buffer, Waiting]
Decouple contention from the number of threads and the average log entry size
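The bookkeeping behind a consolidation array slot can be shown with a deliberately single-threaded Python sketch. `ConsolidationSlot`, `LogHead` and the `mutex_acquisitions` counter are hypothetical names; a real implementation would use atomic operations on the slot and an actual mutex, both of which are omitted here.

```python
class ConsolidationSlot:
    """Accumulates insert requests so that one leader reserves for the group."""

    def __init__(self):
        self.sizes = []

    def join(self, size):
        offset = sum(self.sizes)          # this request's offset inside the group
        self.sizes.append(size)
        is_leader = len(self.sizes) == 1  # first request to arrive leads the group
        return offset, is_leader


class LogHead:
    """Tracks the head LSN and counts how often the log mutex would be taken."""

    def __init__(self):
        self.head = 0
        self.mutex_acquisitions = 0

    def reserve(self, nbytes):
        self.mutex_acquisitions += 1      # stand-in for acquiring the log mutex
        base = self.head
        self.head += nbytes
        return base


slot = ConsolidationSlot()
joined = [slot.join(size) for size in (48, 160, 48)]   # three requests group up
log_head = LogHead()
base = log_head.reserve(sum(slot.sizes))               # done once, by the leader
lsns = [base + offset for offset, _ in joined]         # each member's start LSN
```

Three requests end up costing a single trip through the critical section, and each member can still compute where its own record goes from the group's base LSN and its offset in the slot.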

Performance as contention increases
- Microbenchmark
- Bimodal distribution: 48 B and 160 B (120 B average)
Hybrid solution combines the benefits of both

Sensitivity to slot count
[Figure: throughput (in MB/s, shown as color/height) as a function of # slots and # threads]
Relatively insensitive to slot count (3 or 4 slots are good enough for most cases)

Case against distributed logging
- Distributing TPC-C log records over 8 logs
- 1 ms wall time, ~200 in-flight transactions, 30 commits
- Figure key: horizontal blue line = 1 log; diagonal line = dependency (new = black, older = grey)
Large overhead from keeping track of dependencies and from over-flushing

Putting it all together
- Setup: Sun Niagara T2 (64 HW contexts), memory-resident TPC-B
- Figure annotations: +15%; +60% from Baseline; gap increases with # threads!
- Eliminate current log bottlenecks
- Future-proof system against contention

Conclusions
- Logging is an essential component for OLTP
  - Simplifies recovery and improves performance without the need to physically partition data
  - ...but all lurking bottlenecks need to be addressed
- Aether is a holistic approach to logging
  - Leverages existing techniques (Early Lock Release)
  - Reduces context switches (Flush Pipelining)
  - Eliminates log contention (Consolidation-based backoff)
  - Can achieve 2 GB/s of log throughput per node
Thank you!