Tolerating Dependences Between Large Speculative Threads Via Sub-Threads Chris Colohan 1,2, Anastassia Ailamaki 2, J. Gregory Steffan 3 and Todd C. Mowry.


Tolerating Dependences Between Large Speculative Threads Via Sub-Threads Chris Colohan 1,2, Anastassia Ailamaki 2, J. Gregory Steffan 3 and Todd C. Mowry 2,4 1 Google, Inc. 2 Carnegie Mellon University 3 University of Toronto 4 Intel Research Pittsburgh

Copyright 2006 Chris Colohan 2 Thread Level Speculation (TLS) [Figure: a sequential sequence of reads and writes through pointers *p and *q is split across Thread 1 and Thread 2, which run in parallel]

Copyright 2006 Chris Colohan 3 Thread Level Speculation (TLS) [Figure: in the parallel execution, Thread 2's premature read of *p conflicts with Thread 1's later write, raising a violation] Use threads; detect violations; restart to recover; buffer state. Worst case: sequential. Best case: fully parallel. Data dependences limit performance.
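The detect-and-restart cycle above can be made concrete with a small model. The following Python sketch is purely illustrative (the class and method names are invented, and real TLS does this in hardware at cache-line granularity): each speculative thread buffers its writes and records its read set, and a store by a logically earlier thread to an address a later thread has already read marks the later thread as violated.

```python
class Epoch:
    """One speculative thread, identified by its position in program order."""
    def __init__(self, eid):
        self.eid = eid          # logical (program) order
        self.reads = set()      # addresses speculatively read
        self.writes = {}        # buffered speculative writes
        self.violated = False

class TLS:
    def __init__(self, memory):
        self.memory = memory    # committed (non-speculative) state
        self.epochs = []        # speculative threads in logical order

    def spawn(self):
        e = Epoch(len(self.epochs))
        self.epochs.append(e)
        return e

    def load(self, e, addr):
        e.reads.add(addr)
        if addr in e.writes:    # read own buffered write first
            return e.writes[addr]
        return self.memory.get(addr)

    def store(self, e, addr, val):
        e.writes[addr] = val
        # a write by an older thread violates any younger thread
        # that already (prematurely) read this address
        for later in self.epochs[e.eid + 1:]:
            if addr in later.reads:
                later.violated = True
```

In the slide's example, Thread 2 reads *p before Thread 1's write to *p is visible, so Thread 2 is flagged as violated and must restart.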

Copyright 2006 Chris Colohan 4 Violations as a Feedback Signal [Figure: each violation reports the offending addresses (0x0FD8 → 0xFD20, 0x0FC0 → 0xFC18), telling the programmer which dependence to attack next: must… make… faster]


Copyright 2006 Chris Colohan 6 Eliminating Violations [Figure: eliminating the *p dependence removes one violation, but the remaining *q violation still squashes the entire thread] An optimization may even make things slower. All-or-nothing execution makes optimization harder.

Copyright 2006 Chris Colohan 7 Tolerating Violations: Sub-threads [Figure: with sub-threads, the *q violation rolls execution back only to the start of the violated sub-thread rather than to the beginning of the thread]

Copyright 2006 Chris Colohan 8 Sub-threads Periodic checkpoints of a speculative thread. Makes TLS work well with: large speculative threads; unpredictable, frequent dependences. Speeds up database transaction response time by a factor of 1.9 to 2.9.
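A toy model of the checkpointing idea (the fixed operation-count interval and all names here are assumptions for illustration, not the paper's hardware mechanism): record a checkpoint every `interval` operations, and on a violation rewind only to the newest checkpoint at or before the first speculative read of the violated address.

```python
class SubThreadedEpoch:
    """A speculative thread carved into sub-threads at periodic checkpoints."""
    def __init__(self, interval):
        self.interval = interval
        self.ops = []            # (index, kind, addr) in execution order
        self.checkpoints = [0]   # op indices where sub-threads begin

    def record(self, kind, addr):
        i = len(self.ops)
        if i > 0 and i % self.interval == 0:
            self.checkpoints.append(i)   # start a new sub-thread here
        self.ops.append((i, kind, addr))

    def rewind_point(self, violated_addr):
        # find the first speculative read of the violated address...
        first_read = min(i for i, kind, addr in self.ops
                         if kind == "load" and addr == violated_addr)
        # ...and the newest checkpoint at or before it
        return max(c for c in self.checkpoints if c <= first_read)
```

A violation on an address read late in the thread now discards only the most recent sub-thread's work instead of the whole thread.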

Copyright 2006 Chris Colohan 9 Overview TLS and database transactions Buffering large speculative threads Hardware support for sub-threads Results

Copyright 2006 Chris Colohan 10 Case Study: New Order (TPC-C)

GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
foreach (item) {
  GET quantity FROM stock WHERE i_id=item;
  UPDATE stock WITH quantity-1 WHERE i_id=item;
  INSERT item INTO order_line;
}

The loop is 78% of transaction execution time. The only dependence is the quantity field, and it is very unlikely to occur (1/100,000).

Copyright 2006 Chris Colohan 11 Case Study: New Order (TPC-C) The loop is marked for speculative parallelization by changing foreach to TLS_foreach:

GET cust_info FROM customer;
UPDATE district WITH order_id;
INSERT order_id INTO new_order;
TLS_foreach (item) {
  GET quantity FROM stock WHERE i_id=item;
  UPDATE stock WITH quantity-1 WHERE i_id=item;
  INSERT item INTO order_line;
}

Copyright 2006 Chris Colohan 12 Optimizing the DBMS: New Order [Chart: normalized time (busy, cache miss, failed speculation, idle CPU) as bottlenecks are removed one at a time: sequential, no optimizations, latches, locks, malloc/free, buffer pool, cursor queue, error checks, false sharing, B-tree, logging] This process took me 30 days and <1200 lines of code. Results from Colohan, Ailamaki, Steffan and Mowry, VLDB 2005.

Copyright 2006 Chris Colohan 13 Overview TLS and database transactions Buffering large speculative threads Hardware support for sub-threads Results

Copyright 2006 Chris Colohan 14 Threads from Transactions

Transaction | Thread size (dyn. instrs.) | Dependent loads
New Order | 62k | 75
New Order 150 | | 75
Delivery | 33k | 20
Delivery Outer | 490k | 34
Stock Level | 17k | 29

Challenge: buffering large threads.

Copyright 2006 Chris Colohan 15 Buffering Large Threads Prior work: Cintra et al. [ISCA'00]: the oldest thread in each chip can store state in the L2; Prvulovic et al. [ISCA'01]: speculative state can overflow into RAM. What we need: fast; deals well with many forward dependences; easy to add sub-thread support. Our approach: buffer speculative state in the shared L2.

Copyright 2006 Chris Colohan 16 L1 cache changes Add a Speculatively Modified (SM) bit per line, set when the line is modified by the current thread or an older thread. On a violation, invalidate all SM lines.
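A minimal sketch of this bookkeeping (hypothetical names; the real design is a single bit per hardware cache line): each L1 line carries an SM bit, and a violation simply invalidates every line whose SM bit is set.

```python
class L1Cache:
    def __init__(self):
        self.lines = {}   # addr -> (data, sm_bit)

    def fill(self, addr, data):
        self.lines[addr] = (data, False)   # clean line brought in from L2

    def spec_store(self, addr, data):
        self.lines[addr] = (data, True)    # SM: speculatively modified

    def on_violation(self):
        # invalidate every speculatively modified line; clean lines survive
        self.lines = {a: (d, sm) for a, (d, sm) in self.lines.items()
                      if not sm}
```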

Copyright 2006 Chris Colohan 17 L2 cache changes Add Speculatively Modified (SM) and Speculatively Loaded (SL) bits per line, one pair per speculative thread. If two threads modify a line, replicate it within the associative set. Add a small speculative victim cache to catch over-replicated lines.
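The per-thread versioning can be sketched the same way. This is purely illustrative (real hardware replicates lines within an associative set and tracks SM/SL bits); here each thread's speculative version of a line is just a dictionary entry:

```python
class SpecL2:
    def __init__(self, memory):
        self.memory = dict(memory)   # committed state
        self.versions = {}           # addr -> {tid: speculative value}

    def store(self, tid, addr, val):
        # if two threads modify the same line, each keeps its own replica
        self.versions.setdefault(addr, {})[tid] = val

    def load(self, tid, addr):
        reps = self.versions.get(addr, {})
        # a thread sees its own version, else the committed value
        return reps.get(tid, self.memory.get(addr))

    def replica_count(self, addr):
        return len(self.versions.get(addr, {}))

    def commit(self, tid):
        # the committing (oldest) thread's versions become architectural
        for addr, reps in self.versions.items():
            if tid in reps:
                self.memory[addr] = reps.pop(tid)
```

A fixed associativity bounds how many replicas one set can hold, which is why the design adds a small victim cache for over-replicated lines.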

Copyright 2006 Chris Colohan 18 Overview TLS and database transactions Buffering large speculative threads Hardware support for sub-threads Results

Copyright 2006 Chris Colohan 19 Sub-thread support Add one thread context per sub-thread. No dependence tracking is needed between the sub-threads of a single thread.

Copyright 2006 Chris Colohan 20 When to start new sub-threads? Right before "high-risk" loads? Results show that starting periodically works well.

Copyright 2006 Chris Colohan 21 Secondary Violations A sub-thread start table records how far to rewind on a secondary violation.


Copyright 2006 Chris Colohan 23 Overview TLS and database transactions Buffering large speculative threads Hardware support for sub-threads Results

Copyright 2006 Chris Colohan 24 Experimental Setup Detailed simulation: superscalar, out-of-order, 128-entry reorder buffer; memory hierarchy modeled in detail; 4 CPUs, each with a 32KB 4-way L1 cache, sharing a 2MB 4-way L2 cache. Workload: TPC-C transactions on BerkeleyDB, an in-core database, single user, single warehouse. We measure an interval of 100 transactions: latency, not throughput.

Copyright 2006 Chris Colohan 25 TPC-C on 4 CPUs [Chart: normalized time (busy, cache miss, failed speculation, idle CPU) for New Order, New Order 150, Delivery, Delivery Outer, Stock Level, Payment, and Order Status; N = no sub-thread support, S = with sub-threads, L = limit, ignoring violations (Amdahl's Law limit)]

Copyright 2006 Chris Colohan 26 TPC-C on 4 CPUs [Chart: same experiment] Sub-threads improve performance by limiting the impact of failed speculation.

Copyright 2006 Chris Colohan 27 TPC-C on 4 CPUs [Chart: same experiment] Sub-threads have minimal impact on cache misses.

Copyright 2006 Chris Colohan 28 Victim Cache Usage [Table: victim cache entries used by New Order, New Order 150, Delivery, Delivery Outer, and Stock Level at L2 associativities of 4, 8, and 16 ways] A small victim cache is sufficient.


Copyright 2006 Chris Colohan 30 Sub-thread size [Chart: normalized New Order time (busy, cache miss, failed speculation, idle CPU) with 2, 4, and 8 sub-threads per thread] Periodically starting sub-threads works surprisingly well.

Copyright 2006 Chris Colohan 31 Related Work Checkpointing: using the cache to simulate a larger reorder buffer [Martínez02]. Tolerating dependences: selective re-execution [Sarangi05]; predicting and synchronizing dependences [many papers]. Using speculation for manual parallelization: as applied to SPEC [Prabhu03]; TCC [Hammond04]. TLS and transactional memory: Multiscalar, IACOMA, Hydra, RAW.

Copyright 2006 Chris Colohan 32 Conclusion Sub-threads let TLS tolerate unpredictable dependences, making incremental, feedback-directed parallelization possible and making TLS with large threads practical: we can now parallelize database transactions. The hardware is a simple extension of previous TLS schemes, and it speeds up 3 of 5 TPC-C transactions by a factor of 1.9 to 2.9.

Any questions?

BACKUP SLIDES FOLLOW

Copyright 2006 Chris Colohan 35 Why Parallelize Transactions? Do not use this if you have no idle CPUs; database people only care about throughput! But some transactions are latency sensitive (e.g., financial transactions), and in lock-bound workloads, freeing up locks faster means more throughput!

Copyright 2006 Chris Colohan 36–41 Buffering Large Threads [Animation: two speculative threads issue the sequence store X, 0x00; store A, 0x01; load 0x00; load 0x01; store Y, 0x00; store B, 0x01, with the L1 and L2 cache contents shown after each step. The L2 keeps a store bit and a load bit per thread for each line; when a second thread stores to a line another thread has speculatively modified (store Y, 0x00 after store X, 0x00), the line is replicated, one version per thread.]

Copyright 2006 Chris Colohan 42–44 Sub-thread Support [Animation: the same instruction sequence divided into two sub-threads, a and b, with store and load bits kept per sub-thread. On a violation, only the violated sub-thread is rolled back rather than the whole thread.]

Copyright 2006 Chris Colohan 45 Buffer Pool Management [Figure: the CPU calls get_page(5) on the buffer pool, raising the page's reference count to 1, then put_page(5), dropping it back to 0]

Copyright 2006 Chris Colohan 46 get_page(5) put_page(5) Buffer Pool Management CPU Buffer Pool get_page(5) ref: 0 put_page(5) Time get_page(5) put_page(5) TLS ensures first thread gets page first. Who cares? TLS ensures first thread gets page first. Who cares? get_page(5) put_page(5)

Copyright 2006 Chris Colohan 47 Buffer Pool Management [Figure: get_page(5) runs as an escaped operation] To escape speculation: invoke the operation, store an undo function, and resume speculation.

Copyright 2006 Chris Colohan 48 Buffer Pool Management [Figure: the matching put_page(5) cannot be escaped the same way] put_page is not undoable!

Copyright 2006 Chris Colohan 49 Buffer Pool Management [Figure: put_page(5) is queued rather than performed] Delay put_page until the end of the thread to avoid the dependence.
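The two escape mechanisms on these slides can be summarized in a short sketch (class and method names invented for illustration): an undoable operation like get_page runs immediately outside speculation and logs an undo closure, while a non-undoable one like put_page is queued and performed only when the thread commits.

```python
class SpecThread:
    def __init__(self):
        self.undo_log = []   # undo closures, run (in reverse) on violation
        self.deferred = []   # non-undoable ops, run only at commit

    def escape(self, op, undo):
        op()                          # performed now, outside speculation
        self.undo_log.append(undo)

    def defer(self, op):
        self.deferred.append(op)      # e.g. put_page: delay to end of thread

    def commit(self):
        for op in self.deferred:
            op()
        self.deferred.clear()
        self.undo_log.clear()

    def violate(self):
        for undo in reversed(self.undo_log):
            undo()
        self.deferred.clear()
        self.undo_log.clear()
```

With this split, a violated thread can undo its get_page, and the reference count still behaves as if the thread held the page for its whole lifetime.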

Copyright 2006 Chris Colohan 50 TLS in Database Systems [Figure: non-database TLS uses many small threads; TLS in database systems runs concurrent transactions with much larger threads] Large threads mean more dependences, which must be tolerated, and more state, which needs bigger buffers.

Copyright 2006 Chris Colohan 51 Feedback Loop [Figure: the programmer thinks "I know this is parallel!", turns for() { do_work(); } into par_for() { do_work(); }, then iterates on the violation feedback: must… make… faster]

LATCHES

Copyright 2006 Chris Colohan 53 Latches Latches provide mutual exclusion between transactions, but they cause violations between speculative threads: the read-test-write cycle looks like a RAW dependence. They are not needed between threads; TLS already provides mutual exclusion!

Copyright 2006 Chris Colohan 54 Latches: Aggressive Acquire Non-speculative: Acquire; latch_cnt++; …work…; latch_cnt--; Release. While speculative, each thread increments latch_cnt, does its work, and enqueues its release; once homefree it commits its work and decrements latch_cnt. The latch is held across the speculative threads: one large critical section.

Copyright 2006 Chris Colohan 55 Latches: Lazy Acquire Non-speculative: Acquire; …work…; Release. While speculative: (enqueue acquire); …work…; (enqueue release); once homefree: Acquire; commit work; Release. Small critical sections.
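A sketch of the lazy scheme (illustrative names, not the system's actual interface): while speculative, acquire and release are merely enqueued; once the thread is homefree, each latch is actually taken and released back-to-back as the work commits, keeping critical sections small.

```python
class LazyLatches:
    def __init__(self):
        self.held = set()   # latches actually held
        self.queue = []     # pending ("acquire" | "release", latch) pairs

    def acquire(self, latch):
        self.queue.append(("acquire", latch))   # enqueue, don't take it yet

    def release(self, latch):
        self.queue.append(("release", latch))

    def commit(self):
        # homefree: replay the queued latch operations for real
        performed = []
        for op, latch in self.queue:
            if op == "acquire":
                self.held.add(latch)
            else:
                self.held.discard(latch)
            performed.append((op, latch))
        self.queue.clear()
        return performed
```

No latch is ever held while the thread is still speculative, so speculative threads cannot stall each other on latches.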

Copyright 2006 Chris Colohan 56 Applying TLS 1. Parallelize loop 2. Run benchmark 3. Remove bottleneck 4. Go to 2 Time