Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures
M. Aater Suleman*, Onur Mutlu†, Moinuddin K. Qureshi‡, Yale N. Patt*
*The University of Texas at Austin  †Carnegie Mellon University  ‡IBM Research
Background
To leverage CMPs, programs must be split into threads.
Mutual exclusion: threads are not allowed to update shared data concurrently.
– Accesses to shared data are encapsulated inside critical sections
– Only one thread can execute a critical section at a given time
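The mutual-exclusion idea above can be sketched in Python (an illustration, not from the talk; the list, the lock, and `append_items` are made-up stand-ins for arbitrary shared data):

```python
import threading

shared_list = []          # shared data: all updates must be serialized
lock = threading.Lock()   # guards the critical section

def append_items(items):
    for item in items:
        with lock:                     # only one thread may be inside at a time
            shared_list.append(item)   # critical section: update shared data

threads = [threading.Thread(target=append_items, args=(range(1000),))
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(len(shared_list))   # 4000: no updates were lost
```

Without the lock, concurrent appends could interleave and corrupt the shared structure; the critical section serializes them.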
Example of Critical Section from MySQL
[Figure: a list of open tables A–E shared by Threads 0–3; Thread 3 calls OpenTables(D, E) while Thread 2 calls CloseAllTables()]
Example Critical Section from MySQL
[Figure: the open-table list A–E annotated with the IDs of the threads that opened each table]
End of Transaction:
LOCK_open.acquire()
foreach (table opened by thread)
    if (table.temporary)
        table.close()
LOCK_open.release()
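The slide's pseudocode can be modeled in Python (a sketch only: `open_tables`, its dict fields, and `end_of_transaction` are hypothetical stand-ins for MySQL's internals):

```python
import threading

LOCK_open = threading.Lock()   # models MySQL's global open-table lock

open_tables = [                # hypothetical per-thread open-table list
    {"name": "A", "temporary": False},
    {"name": "B", "temporary": True},
    {"name": "C", "temporary": True},
]

def end_of_transaction(tables):
    """Close this thread's temporary tables; the whole scan is one
    critical section under LOCK_open."""
    with LOCK_open:                    # LOCK_open.acquire() ... release()
        for table in tables[:]:        # iterate over a copy while mutating
            if table["temporary"]:
                tables.remove(table)   # stands in for table.close()

end_of_transaction(open_tables)
print([t["name"] for t in open_tables])   # ['A']
```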
Contention for Critical Sections
[Figure: execution timelines t1–t7 for Threads 1–4, broken into critical-section, parallel, and idle time; when critical sections execute 2x faster, the waiting threads' idle time shrinks as well]
Accelerating critical sections not only helps the thread executing the critical section, but also the waiting threads.
Impact of Critical Sections on Scalability
[Figure: speedup of MySQL (oltp-1) vs. chip area in cores]
Contention for critical sections increases with the number of threads and limits scalability.
Outline
– Background
– Mechanism
– Performance Trade-Offs
– Evaluation
– Related Work and Summary
The Asymmetric Chip Multiprocessor (ACMP)
ACMP approach: one large core and many Niagara-like small cores.
– Execute the parallel part on the small cores for high throughput
– Accelerate the serial part using the large core
Conventional ACMP
[Figure: cores P1–P4 on an on-chip interconnect; the critical section is EnterCS() / PriorityQ.insert(…) / LeaveCS()]
1. P2 encounters a critical section
2. P2 sends a request for the lock
3. P2 acquires the lock
4. P2 executes the critical section
5. P2 releases the lock
Accelerating Critical Sections (ACS)
ACS approach: accelerate Amdahl's serial part and critical sections using the large core.
[Figure: the ACMP layout, with the large core augmented by a Critical Section Request Buffer (CSRB)]
Accelerated Critical Sections (ACS)
[Figure: small cores P2–P4 and large core P1 on the on-chip interconnect; P1 holds the Critical Section Request Buffer (CSRB)]
1. P2 encounters a critical section (EnterCS() / PriorityQ.insert(…) / LeaveCS())
2. P2 sends a CSCALL request to the CSRB
3. P1 executes the critical section
4. P1 sends a CSDONE signal
Architecture Overview
ISA extensions:
– CSCALL LOCK_ADDR, TARGET_PC
– CSRET LOCK_ADDR
The compiler/library inserts CSCALL/CSRET.
On a CSCALL, the small core:
– Sends a CSCALL request to the large core (arguments: lock address, target PC, stack pointer, core ID)
– Stalls and waits for CSDONE
The large core:
– Buffers requests in the Critical Section Request Buffer (CSRB)
– Executes the critical section and sends CSDONE to the requesting core
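The CSCALL/CSDONE protocol can be sketched as a ship-work-to-one-thread simulation in Python (an illustration of the idea, not the hardware mechanism; `cscall`, `large_core`, and the queue standing in for the CSRB are my own names):

```python
import queue
import threading

csrb = queue.Queue()   # models the Critical Section Request Buffer (CSRB)
shared_counter = 0     # shared data, touched only by the large core

def large_core():
    """Drain the CSRB: run each critical section, then signal CSDONE."""
    while True:
        target, args, done = csrb.get()   # next CSCALL request
        if target is None:                # shutdown sentinel
            break
        target(*args)                     # execute the critical section locally
        done.set()                        # CSDONE back to the requesting core

def cscall(target, *args):
    """Small core: ship the critical section to the large core and stall."""
    done = threading.Event()
    csrb.put((target, args, done))        # CSCALL
    done.wait()                           # stall until CSDONE arrives

def increment():                          # the critical-section body
    global shared_counter
    shared_counter += 1

server = threading.Thread(target=large_core)
server.start()
workers = [threading.Thread(target=lambda: [cscall(increment) for _ in range(100)])
           for _ in range(4)]
for w in workers: w.start()
for w in workers: w.join()
csrb.put((None, (), None))                # stop the large core
server.join()
print(shared_counter)                     # 400: every CSCALL ran to completion
```

Because only the server thread ever touches `shared_counter`, no lock is needed inside the critical section itself, and the shared data stays "local" to one executor, mirroring ACS's locality benefit.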
False Serialization
ACS can serialize independent critical sections.
Selective Acceleration of Critical Sections (SEL): uses saturating counters to track false serialization.
[Figure: the CSRB holding CSCALL(A) and CSCALL(B) requests from the small cores, with per-lock saturating counters for locks A and B]
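A rough sketch of SEL's saturating-counter idea in Python (the counter width, update policy, and threshold here are my assumptions for illustration; the paper's exact mechanism differs):

```python
SAT_MAX = 15   # assumed counter ceiling

class SelCounter:
    """Per-lock saturating counter for SEL: a high value suggests critical
    sections for this lock often wait behind independent ones in the CSRB
    (false serialization), so they should run locally, not via CSCALL."""

    def __init__(self, threshold=8):
        self.value = 0
        self.threshold = threshold

    def observe(self, waited_behind_independent_cs):
        if waited_behind_independent_cs:
            self.value = min(SAT_MAX, self.value + 1)   # saturate upward
        elif self.value > 0:
            self.value -= 1                             # decay otherwise

    def accelerate(self):
        return self.value < self.threshold   # below threshold: keep using ACS

sel = SelCounter()
for _ in range(10):       # this lock repeatedly waits behind independent requests
    sel.observe(True)
print(sel.accelerate())   # False: SEL stops accelerating this lock
```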
Outline
– Background
– Mechanism
– Performance Trade-Offs
– Evaluation
– Related Work and Summary
Performance Trade-Offs
Fewer threads vs. accelerated critical sections:
– Accelerating critical sections offsets the loss in throughput
– As the number of cores (threads) on chip increases, the fractional loss in parallel performance decreases, and increased contention for critical sections makes acceleration more beneficial
Overhead of CSCALL/CSDONE vs. better lock locality:
– ACS avoids "ping-ponging" of locks among caches by keeping them at the large core
More cache misses for private data vs. fewer misses for shared data
Cache Misses for Private Data
PriorityHeap.insert(NewSubProblems)
– Private data: NewSubProblems
– Shared data: the priority heap
(Puzzle benchmark)
Performance Trade-Offs
Fewer threads vs. accelerated critical sections:
– Accelerating critical sections offsets the loss in throughput
– As the number of cores (threads) on chip increases, the fractional loss in parallel performance decreases, and increased contention for critical sections makes acceleration more beneficial
Overhead of CSCALL/CSDONE vs. better lock locality:
– ACS avoids "ping-ponging" of locks among caches by keeping them at the large core
More cache misses for private data vs. fewer misses for shared data:
– Cache misses decrease if there is more shared data than private data
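A toy back-of-envelope model (my own simplification, not the paper's: it assumes critical sections fully serialize, a large core costing the area of 4 small cores, and a 2x speedup on critical sections) illustrates why the trade-off tips toward ACS at higher core counts:

```python
def scmp_time(n_cores, par_work, cs_work):
    """Symmetric CMP: parallel work splits across all cores;
    critical sections fully serialize (worst-case contention)."""
    return par_work / n_cores + cs_work

def acs_time(n_cores, par_work, cs_work, big_speedup=2.0, large_core_area=4):
    """ACS: trade 'large_core_area' small cores for one large core that
    runs the serialized critical sections 'big_speedup' times faster."""
    small_cores = n_cores - large_core_area
    return par_work / small_cores + cs_work / big_speedup

PAR, CS = 100.0, 20.0
print(scmp_time(8, PAR, CS), acs_time(8, PAR, CS))    # 32.5 vs 35.0: SCMP wins
print(scmp_time(32, PAR, CS), acs_time(32, PAR, CS))  # ACS wins at 32 cores
```

At 8 cores, losing 4 cores of parallel throughput outweighs the faster critical sections; at 32 cores, the fractional loss is small and the serialized term dominates, so ACS wins.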
Outline
– Background
– Mechanism
– Performance Trade-Offs
– Evaluation
– Related Work and Summary
Experimental Methodology
– SCMP: all small (Niagara-like) cores; conventional locking
– ACMP: one large core (area-equal to 4 small cores) plus small cores; conventional locking
– ACS: ACMP with a Critical Section Request Buffer (CSRB); accelerates critical sections
Experimental Methodology
Workloads:
– 12 critical-section-intensive applications from various domains
– 7 use coarse-grain locks and 5 use fine-grain locks
Simulation parameters:
– x86 cycle-accurate processor simulator
– Large core: similar to Pentium-M with 2-way SMT; 2 GHz, out-of-order, 128-entry ROB, 4-wide issue, 12-stage pipeline
– Small core: similar to Pentium 1; 2 GHz, in-order, 2-wide issue, 5-stage pipeline
– Private 32 KB L1, private 256 KB L2, 8 MB shared L3
– On-chip interconnect: bidirectional ring
Workloads with Coarse-Grain Locks
Equal-area comparison; number of threads = best number of threads for each configuration.
[Figure: results for the two equal-area chip configurations]
– Chip area = 16 cores: SCMP = 16 small cores; ACMP/ACS = 1 large and 12 small cores
– Chip area = 32 small cores: SCMP = 32 small cores; ACMP/ACS = 1 large and 28 small cores
Workloads with Fine-Grain Locks
Equal-area comparison; number of threads = best number of threads for each configuration.
– Chip area = 16 cores: SCMP = 16 small cores; ACMP/ACS = 1 large and 12 small cores
– Chip area = 32 small cores: SCMP = 32 small cores; ACMP/ACS = 1 large and 28 small cores
24
Equal-Area Comparisons 24 Speedup over a small core Chip Area (small cores) (a) ep(b) is(c) pagemine(d) puzzle(e) qsort(f) tsp (i) oltp-1(i) oltp-2(h) iplookup(k) specjbb (l) webcache (g) sqlite Number of threads = No. of cores ------ SCMP ------ ACMP ------ ACS
ACS on Symmetric CMP
The majority of ACS's benefit comes from the large core.
Outline
– Background
– Mechanism
– Performance Trade-Offs
– Evaluation
– Related Work and Summary
Related Work
Improving locality of shared data via thread migration and software prefetching (Sridharan+, Trancoso+, Ranganathan+)
– ACS not only improves locality but also uses a large core to accelerate critical section execution
Asymmetric CMPs (Morad+, Kumar+, Suleman+, Hill+)
– ACS accelerates not only the Amdahl's-law serial bottleneck but also critical sections
Remote procedure calls (Birrell+)
– ACS is for critical sections among shared-memory cores
Hiding Latency of Critical Sections
Transactional memory (Herlihy+)
– ACS does not require code modification
Transactional Lock Removal (Rajwar+) and Speculative Synchronization (Martinez+) hide critical section latency by increasing concurrency:
– They overlap execution of critical sections with no data conflicts; ACS reduces the latency of each critical section
– They accelerate only conflict-free critical sections; ACS accelerates ALL critical sections
– They do not improve locality of shared data; ACS does
ACS outperforms TLR (Rajwar+) by 18% (details in paper).
Conclusion
Critical sections reduce performance and limit scalability.
ACS accelerates critical sections by executing them on a powerful core.
ACS reduces average execution time by:
– 34% compared to an equal-area SCMP
– 23% compared to an equal-area ACMP
ACS improves the scalability of 7 of the 12 workloads.