Optimistic Intra-Transaction Parallelism on Chip-Multiprocessors Chris Colohan 1, Anastassia Ailamaki 1, J. Gregory Steffan 2 and Todd C. Mowry 1,3 1 Carnegie.

Slides:



Advertisements
Similar presentations
Critical Sections: Re-emerging Concerns for DBMS Ryan JohnsonIppokratis Pandis Anastasia Ailamaki Carnegie Mellon University École Polytechnique Féderale.
Advertisements

Optimistic Methods for Concurrency Control By : H.T. Kung & John T. Robinson Presenters: Munawer Saeed.
1/1/ / faculty of Electrical Engineering eindhoven university of technology Speeding it up Part 3: Out-Of-Order and SuperScalar execution dr.ir. A.C. Verschueren.
Concurrent programming: From theory to practice Concurrent Algorithms 2014 Vasileios Trigonakis Georgios Chatzopoulos.
CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.
1 Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers By Sreemukha Kandlakunta Phani Shashank.
DBMSs on a Modern Processor: Where Does Time Go? Anastassia Ailamaki Joint work with David DeWitt, Mark Hill, and David Wood at the University of Wisconsin-Madison.
ECE 454 Computer Systems Programming Parallel Architectures and Performance Implications (II) Ding Yuan ECE Dept., University of Toronto
Exploiting Distributed Version Concurrency in a Transactional Memory Cluster Kaloian Manassiev, Madalin Mihailescu and Cristiana Amza University of Toronto,
CS 162 Memory Consistency Models. Memory operations are reordered to improve performance Hardware (e.g., store buffer, reorder buffer) Compiler (e.g.,
Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.
5.1 Silberschatz, Galvin and Gagne ©2009 Operating System Concepts with Java – 8 th Edition Chapter 5: CPU Scheduling.
Erhan Erdinç Pehlivan Computer Architecture Support for Database Applications.
Is SC + ILP = RC? Presented by Vamshi Kadaru Chris Gniady, Babak Falsafi, and T. N. VijayKumar - Purdue University Spring 2005: CS 7968 Parallel Computer.
Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.
1 Cheriton School of Computer Science 2 Department of Computer Science RemusDB: Transparent High Availability for Database Systems Umar Farooq Minhas 1,
Thread-Level Transactional Memory Decoupling Interface and Implementation UW Computer Architecture Affiliates Conference Kevin Moore October 21, 2004.
What will my performance be? Resource Advisor for DB admins Dushyanth Narayanan, Paul Barham Microsoft Research, Cambridge Eno Thereska, Anastassia Ailamaki.
Slides 8d-1 Programming with Shared Memory Specifying parallelism Performance issues ITCS4145/5145, Parallel Programming B. Wilkinson Fall 2010.
More on Thread Level Speculation Anthony Gitter Dafna Shahaf Or Sheffet.
Lock-free Cache-friendly Software Queue for Decoupled Software Pipelining Student: Chen Wen-Ren Advisor: Wuu Yang 學生 : 陳韋任 指導教授 : 楊武 Abstract Multicore.
PARALLEL PROGRAMMING with TRANSACTIONAL MEMORY Pratibha Kona.
Meanwhile RAM cost continues to drop Moore’s Law on total CPU processing power holds but in parallel processing… CPU clock rate stalled… Because.
1 Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory.
A Scalable Approach to Thread-Level Speculation J. Gregory Steffan, Christopher B. Colohan, Antonia Zhai, and Todd C. Mowry Carnegie Mellon University.
1 Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures M. Aater Suleman* Onur Mutlu† Moinuddin K. Qureshi‡ Yale N. Patt* *The.
1 Lecture 7: Transactional Memory Intro Topics: introduction to transactional memory, “lazy” implementation.
1 Lecture 23: Transactional Memory Topics: consistency model recap, introduction to transactional memory.
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 10 Transaction Management and Concurrency Control.
CS510 Concurrent Systems Class 2 A Lock-Free Multiprocessor OS Kernel.
Tolerating Dependences Between Large Speculative Threads Via Sub-Threads Chris Colohan 1,2, Anastassia Ailamaki 2, J. Gregory Steffan 3 and Todd C. Mowry.
Language Support for Lightweight transactions Tim Harris & Keir Fraser Presented by Narayanan Sundaram 04/28/2008.
Multiscalar processors
Why The Grass May Not Be Greener On The Other Side: A Comparison of Locking vs. Transactional Memory Written by: Paul E. McKenney Jonathan Walpole Maged.
Highly Available ACID Memory Vijayshankar Raman. Introduction §Why ACID memory? l non-database apps: want updates to critical data to be atomic and persistent.
Optimistic Intra-Transaction Parallelism on Chip-Multiprocessors Chris Colohan 1, Anastassia Ailamaki 1, J. Gregory Steffan 2 and Todd C. Mowry 1,3 1 Carnegie.
Optimistic Intra-Transaction Parallelism using Thread Level Speculation Chris Colohan 1, Anastassia Ailamaki 1, J. Gregory Steffan 2 and Todd C. Mowry.
Simultaneous Multithreading: Maximizing On-Chip Parallelism Presented By: Daron Shrode Shey Liggett.
1 Fault Tolerance in the Nonstop Cyclone System By Scott Chan Robert Jardine Presented by Phuc Nguyen.
CS510 Concurrent Systems Jonathan Walpole. A Lock-Free Multiprocessor OS Kernel.
QoS-enabled Tree-based Distributed Mutexes James Edmondson Brian Sulcer.
The HDF Group Multi-threading in HDF5: Paths Forward Current implementation - Future directions May 30-31, 2012HDF5 Workshop at PSI 1.
Copyright © 1997 – 2014 Curt Hill Concurrent Execution of Programs An Overview.
Authors: Stavros HP Daniel J. Yale Samuel MIT Michael MIT Supervisor: Dr Benjamin Kao Presenter: For Sigmod.
Operating Systems ECE344 Ashvin Goel ECE University of Toronto Mutual Exclusion.
1 Lecture: Large Caches, Virtual Memory Topics: cache innovations (Sections 2.4, B.4, B.5)
Transactional Coherence and Consistency Presenters: Muhammad Mohsin Butt. (g ) Coe-502 paper presentation 2.
Carnegie Mellon Compiler Optimization of Memory-Resident Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory.
Precomputation- based Prefetching By James Schatz and Bashar Gharaibeh.
CSC Multiprocessor Programming, Spring, 2012 Chapter 11 – Performance and Scalability Dr. Dale E. Parson, week 12.
CS399 New Beginnings Jonathan Walpole. 2 Concurrent Programming & Synchronization Primitives.
Virtual Application Profiler (VAPP) Problem – Increasing hardware complexity – Programmers need to understand interactions between architecture and their.
Copyright © Curt Hill Concurrent Execution An Overview for Database.
Carnegie Mellon Compiler Optimization of Scalar Value Communication Between Speculative Threads Antonia Zhai, Christopher B. Colohan, J. Gregory Steffan.
Transaction Management Transparencies. ©Pearson Education 2009 Chapter 14 - Objectives Function and importance of transactions. Properties of transactions.
Brian Bershad, Thomas Anderson, Edward Lazowska, and Henry Levy Presented by: Byron Marohn Published: 1991.
On Transactional Memory, Spinlocks and Database Transactions Khai Q. Tran Spyros Blanas Jeffrey F. Naughton (University of Wisconsin Madison)
Introduction to Performance Tuning Chia-heng Tu PAS Lab Summer Workshop 2009 June 30,
Dynamic Region Selection for Thread Level Speculation Presented by: Jeff Da Silva Stanley Fung Martin Labrecque Feb 6, 2004 Builds on research done by:
Speculative Lock Elision
The University of Adelaide, School of Computer Science
Atomic Operations in Hardware
Predictive Performance
Lecture 22: Consistency Models, TM
Programming with Shared Memory Specifying parallelism
The University of Adelaide, School of Computer Science
CSC3050 – Computer Architecture
CS333 Intro to Operating Systems
Programming with Shared Memory Specifying parallelism
The University of Adelaide, School of Computer Science
Presentation transcript:

Optimistic Intra-Transaction Parallelism on Chip-Multiprocessors Chris Colohan 1, Anastassia Ailamaki 1, J. Gregory Steffan 2 and Todd C. Mowry 1,3 1 Carnegie Mellon University 2 University of Toronto 3 Intel Research Pittsburgh

Copyright 2005 Chris Colohan 2 Chip Multiprocessors are Here! 2 cores now, soon will have 4, 8, 16, or 32 Multiple threads per core How do we best use them? IBM Power 5 AMD Opteron Intel Yonah

Copyright 2005 Chris Colohan 3 Multi-Core Enhances Throughput Database ServerUsers TransactionsDBMSDatabase Cores can run concurrent transactions and improve throughput

Copyright 2005 Chris Colohan 4 Multi-Core Enhances Throughput Database ServerUsers TransactionsDBMSDatabase Can multiple cores improve transaction latency? Can multiple cores improve transaction latency?

Copyright 2005 Chris Colohan 5 Parallelizing transactions SELECT cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock; quantity--; UPDATE stock WITH quantity; INSERT item INTO order_line; } DBMS Intra-query parallelism Used for long-running queries (decision support) Does not work for short queries Short queries dominate in commercial workloads

Copyright 2005 Chris Colohan 6 Parallelizing transactions SELECT cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock; quantity--; UPDATE stock WITH quantity; INSERT item INTO order_line; } DBMS Intra-transaction parallelism Each thread spans multiple queries Hard to add to existing systems! Need to change interface, add latches and locks, worry about correctness of parallel execution…

Copyright 2005 Chris Colohan 7 Parallelizing transactions SELECT cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock; quantity--; UPDATE stock WITH quantity; INSERT item INTO order_line; } DBMS Intra-transaction parallelism Breaks transaction into threads Hard to add to existing systems! Need to change interface, add latches and locks, worry about correctness of parallel execution… Thread Level Speculation (TLS) makes parallelization easier. Thread Level Speculation (TLS) makes parallelization easier.

Copyright 2005 Chris Colohan 8 Thread Level Speculation (TLS) *p= *q= =*p =*q Sequential Time Parallel *p= *q= =*p =*q =*p =*q Epoch 1Epoch 2

Copyright 2005 Chris Colohan 9 Thread Level Speculation (TLS) *p= *q= =*p =*q Sequential Time *p= *q= =*p R2 Violation! =*p =*q Parallel Use epochs Detect violations Restart to recover Buffer state Worst case: Sequential Best case: Fully parallel Data dependences limit performance. Epoch 1Epoch 2

Copyright 2005 Chris Colohan 10 Transactions DBMS Hardware A Coordinated Effort TPC-C BerkeleyDB Simulated machine

Copyright 2005 Chris Colohan 11 Transaction Programmer DBMS Programmer Hardware Developer A Coordinated Effort Choose epoch boundaries Remove performance bottlenecks Add TLS support to architecture

Copyright 2005 Chris Colohan 12 So what’s new? Intra-transaction parallelism Without changing the transactions With minor changes to the DBMS Without having to worry about locking Without introducing concurrency bugs With good performance Halve transaction latency on four cores

Copyright 2005 Chris Colohan 13 Related Work Optimistic Concurrency Control (Kung82) Sagas (Molina&Salem87) Transaction chopping (Shasha95)

Copyright 2005 Chris Colohan 14 Outline Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions

Copyright 2005 Chris Colohan 15 Case Study: New Order (TPC-C) Only dependence is the quantity field Very unlikely to occur (1/100,000) GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; } 78% of transaction execution time

Copyright 2005 Chris Colohan 16 Case Study: New Order (TPC-C) GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; foreach(item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; } GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; TLS_foreach (item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; } GET cust_info FROM customer; UPDATE district WITH order_id; INSERT order_id INTO new_order; TLS_foreach (item) { GET quantity FROM stock WHERE i_id=item; UPDATE stock WITH quantity-1 WHERE i_id=item; INSERT item INTO order_line; }

Copyright 2005 Chris Colohan 17 Outline Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions

Copyright 2005 Chris Colohan 18 Dependences in DBMS Time

Copyright 2005 Chris Colohan 19 Dependences in DBMS Time Dependences serialize execution! Performance tuning: Profile execution Remove bottleneck dependence Repeat

Copyright 2005 Chris Colohan 20 Buffer Pool Management CPU Buffer Pool get_page(5) ref: 1 put_page(5) ref: 0

Copyright 2005 Chris Colohan 21 get_page(5) put_page(5) Buffer Pool Management CPU Buffer Pool get_page(5) ref: 0 put_page(5) Time get_page(5) put_page(5) TLS ensures first epoch gets page first. Who cares? TLS ensures first epoch gets page first. Who cares? get_page(5) put_page(5)

Copyright 2005 Chris Colohan 22 Buffer Pool Management CPU Buffer Pool get_page(5) ref: 0 put_page(5) Time get_page(5) put_page(5) = Escape Speculation Escape speculation Invoke operation Store undo function Resume speculation Escape speculation Invoke operation Store undo function Resume speculation get_page(5) put_page(5) get_page(5)

Copyright 2005 Chris Colohan 23 Buffer Pool Management CPU Buffer Pool get_page(5) ref: 0 put_page(5) Time get_page(5) put_page(5) get_page(5) put_page(5) Not undoable! get_page(5) put_page(5) = Escape Speculation

Copyright 2005 Chris Colohan 24 Buffer Pool Management CPU Buffer Pool get_page(5) ref: 0 put_page(5) Time get_page(5) put_page(5) get_page(5) put_page(5) Delay put_page until end of epoch Avoid dependence = Escape Speculation

Copyright 2005 Chris Colohan 25 Removing Bottleneck Dependences We introduce three techniques: Delay operations until non-speculative Mutex and lock acquire and release Buffer pool, memory, and cursor release Log sequence number assignment Escape speculation Buffer pool, memory, and cursor allocation Traditional parallelization Memory allocation, cursor pool, error checks, false sharing

Copyright 2005 Chris Colohan 26 Outline Introduction Related work Dividing transactions into epochs Removing bottlenecks in the DBMS Results Conclusions

Copyright 2005 Chris Colohan 27 Experimental Setup Detailed simulation Superscalar, out-of-order, 128 entry reorder buffer Memory hierarchy modeled in detail TPC-C transactions on BerkeleyDB In-core database Single user Single warehouse Measure interval of 100 transactions Measuring latency not throughput CPU 32KB 4-way L1 $ Rest of memory system CPU 32KB 4-way L1 $ CPU 32KB 4-way L1 $ CPU 32KB 4-way L1 $ 2MB 4-way L2 $

Copyright 2005 Chris Colohan 28 Optimizing the DBMS: New Order Sequential No Optimizations Latches Locks Malloc/Free Buffer Pool Cursor Queue Error Checks False Sharing B-Tree Logging Time (normalized) Idle CPU Violated Cache Miss Busy Cache misses increase Other CPUs not helping Can’t optimize much more 26% improvement

Copyright 2005 Chris Colohan 29 Optimizing the DBMS: New Order Sequential No Optimizations Latches Locks Malloc/Free Buffer Pool Cursor Queue Error Checks False Sharing B-Tree Logging Time (normalized) Idle CPU Violated Cache Miss Busy This process took me 30 days and <1200 lines of code. This process took me 30 days and <1200 lines of code.

Copyright 2005 Chris Colohan 30 Other TPC-C Transactions New OrderDeliveryStock LevelPaymentOrder Status Time (normalized) Idle CPU Failed Cache Miss Busy 3/5 Transactions speed up by 46-66%

Copyright 2005 Chris Colohan 31 Conclusions A new form of parallelism for databases Tool for attacking transaction latency Intra-transaction parallelism Without major changes to DBMS TLS can be applied to more than transactions Halve transaction latency by using 4 CPUs

Any questions? For more information, see:

Backup Slides Follow…

Copyright 2005 Chris Colohan 34 TPC-C Transactions on 2 CPUs New OrderDeliveryStock LevelPaymentOrder Status Time (normalized) Idle CPU Failed Cache Miss Busy 3/5 Transactions speed up by 30-44%

LATCHES

Copyright 2005 Chris Colohan 36 Latches Mutual exclusion between transactions Cause violations between epochs Read-test-write cycle  RAW Not needed between epochs TLS already provides mutual exclusion!

Copyright 2005 Chris Colohan 37 Latches: Aggressive Acquire Acquire latch_cnt++ …work… latch_cnt-- Homefree latch_cnt++ …work… (enqueue release) Commit work latch_cnt-- Homefree latch_cnt++ …work… (enqueue release) Commit work latch_cnt-- Release Large critical section

Copyright 2005 Chris Colohan 38 Latches: Lazy Acquire Acquire …work… Release Homefree (enqueue acquire) …work… (enqueue release) Acquire Commit work Release Homefree (enqueue acquire) …work… (enqueue release) Acquire Commit work Release Small critical sections

HARDWARE

Copyright 2005 Chris Colohan 40 TLS in Database Systems Non-Database TLS Time TLS in Database Systems Large epochs: More dependences Must tolerate More state Bigger buffers Large epochs: More dependences Must tolerate More state Bigger buffers Concurrent transactions

Copyright 2005 Chris Colohan 41 Feedback Loop I know this is parallel! for() { do_work(); } par_for() { do_work(); } Must…Make …Faster think feed back feed back feed back feed back feed back feed back feed back feed back feed back feed back feed

Copyright 2005 Chris Colohan 42 Violations == Feedback *p= *q= =*p =*q Sequential Time *p= *q= =*p R2 Violation! =*p =*q Parallel 0x0FD8  0xFD20 0x0FC0  0xFC18 Must…Make …Faster

Copyright 2005 Chris Colohan 43 Eliminating Violations *p= *q= =*p R2 Violation! =*p =*q Parallel *q= =*q Violation! Eliminate *p Dep. Time 0x0FD8  0xFD20 0x0FC0  0xFC18 Optimization may make slower?

Copyright 2005 Chris Colohan 44 Tolerating Violations: Sub-epochs Time *q= Violation! Sub-epochs =*q *q= =*q Violation! Eliminate *p Dep.

Copyright 2005 Chris Colohan 45 Sub-epochs Started periodically by hardware How many? When to start? Hardware implementation Just like epochs Use more epoch contexts No need to check violations between sub-epochs within an epoch *q= Violation! Sub-epochs =*q