Architectural Support for Scalable Speculative Parallelization in Shared-Memory Multiprocessors — Marcelo Cintra, José F. Martínez, Josep Torrellas, Department of Computer Science, University of Illinois at Urbana-Champaign


Architectural Support for Scalable Speculative Parallelization in Shared-Memory Multiprocessors
Marcelo Cintra, José F. Martínez, Josep Torrellas
Department of Computer Science, University of Illinois at Urbana-Champaign

Intl. Symp. on Computer Architecture - June

Speculative Parallelization
- Codes with access patterns not analyzable by the compiler
- Speculative parallelization can extract some parallelism from them
- Several designs of speculative CMPs exist
- Goal & contribution: a scalable speculative architecture using speculative CMPs as building blocks
  – trivially defaults to single-processor nodes
- Avg. speedup of 5.2 for 16 processors (for the dominant, non-analyzable code sections)

Outline
- Motivation
- Background
- Speculative CMP
- Scalable Speculation
- Evaluation
- Related Work
- Conclusions

Speculative Parallelization
- Assume no dependences and execute threads in parallel
- Track data accesses
- Detect violations
- Squash offending threads and restart them

Do I = 1 to N
  … = A(L(I)) + …
  A(K(I)) = …
EndDo

[figure: iteration J reads A(4) and writes A(5); iteration J+1 reads and writes A(2); iteration J+2 reads A(5) and writes A(6) — iteration J+2's read of A(5), which iteration J writes, is a RAW dependence]
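The four steps above can be sketched in software. This is an illustrative emulation only — the paper implements the mechanism in hardware — and all names (`speculative_run`, the index arrays `L` and `K`) are mine:

```python
# Illustrative software emulation of speculative parallelization for the
# loop on the slide:  Do I = 1 to N:  ... = A(L(I)) + ...;  A(K(I)) = ...
# All iterations are assumed to run in parallel; per-iteration load/store
# sets are tracked, and any iteration that read a location written by an
# earlier (less speculative) iteration is a RAW violation and is squashed.

def speculative_run(L, K):
    n = len(L)
    loads  = [{L[i]} for i in range(n)]   # addresses iteration i read
    stores = [{K[i]} for i in range(n)]   # addresses iteration i wrote
    squashed = set()
    for i in range(n):                    # i = less speculative iteration
        for j in range(i + 1, n):         # j = more speculative iteration
            if stores[i] & loads[j]:      # j may have read a stale value
                squashed.add(j)           # squash j (and re-execute it)
    return squashed
```

On the slide's example — iterations reading A(4), A(2), A(5) and writing A(5), A(2), A(6) — `speculative_run([4, 2, 5], [5, 2, 6])` flags the third iteration, matching the RAW arrow in the figure.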

Outline
- Motivation
- Background
- Speculative CMP
- Scalable Speculation
- Evaluation
- Related Work
- Conclusions

Speculative CMP
MDT: Memory Disambiguation Table [Krishnan and Torrellas, ICS '98]

Memory Disambiguation Table (MDT)
- Purpose: detect dependences + locate versions
- 1 entry per memory line touched
- Load and Store bits per word per processor

[figure: an MDT entry with a Tag (0x…) and Valid bit, plus per-word Load/Store bits for processors P0–P3; in the example, processor 1 has modified word 0 and processor 2 has loaded word 0]
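As a data structure, an MDT entry as described here might look like the following sketch (field and method names are mine, not the paper's; the sizes match the slide's figure):

```python
# Hedged sketch of one MDT entry: one entry per memory line, with Load
# and Store bits per word per processor, plus a tag and valid bit.

NUM_PROCS = 4        # P0..P3, as in the slide's figure
WORDS_PER_LINE = 2   # the figure shows words 0 and 1

class MDTEntry:
    def __init__(self, tag):
        self.tag = tag
        self.valid = True
        # load_bits[word][proc] / store_bits[word][proc]
        self.load_bits  = [[0] * NUM_PROCS for _ in range(WORDS_PER_LINE)]
        self.store_bits = [[0] * NUM_PROCS for _ in range(WORDS_PER_LINE)]

    def record_load(self, word, proc):
        self.load_bits[word][proc] = 1

    def record_store(self, word, proc):
        self.store_bits[word][proc] = 1

# The slide's example: processor 1 has modified word 0, and processor 2
# has loaded word 0.
entry = MDTEntry(tag=0x0)
entry.record_store(word=0, proc=1)
entry.record_load(word=0, proc=2)
```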

Handling Data Dependences: Loads
- On a load, use the S bits to find the most up-to-date version

[figure: a load that misses in L1 consults the MDT; the S bits identify which predecessor holds the most up-to-date version of the word]

Handling Data Dependences: Stores (I)
- Use the L and S bits to detect RAW violations and squash

[figure: a store that misses in L1 finds the L bit of a more speculative thread set — violation: squash]

Handling Data Dependences: Stores (II)
- Use the L and S bits to detect RAW violations and squash

[figure: here the more speculative thread's S bit is also set, so it loaded its own version — no violation: shielding]
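The store-side check in these two slides — squash on a set L bit, unless the successor's own store shields it — can be sketched as below. This is an assumption-laden software rendering: `threads_to_squash` and its arguments are my names, and I assume a successor's store shields all threads beyond it, since those threads read the successor's version rather than the writer's.

```python
# Hedged sketch of the squash decision on a store: a store by thread
# `writer` squashes any more speculative thread that loaded the word
# (L bit set) without first writing it itself (S bit clear). Thread
# order = program order; a higher index is more speculative.

def threads_to_squash(load_bits, store_bits, writer):
    """load_bits/store_bits: per-thread L/S bit lists for one word."""
    squash = []
    for t in range(writer + 1, len(load_bits)):
        if store_bits[t]:
            # t wrote its own version: t and all later threads are
            # shielded from this store -- no violation beyond here.
            break
        if load_bits[t]:
            squash.append(t)  # t read a stale value: RAW violation
    return squash
```

For example, with L bits `[0,1,0,1]` and S bits `[0,0,1,0]`, a store by thread 0 squashes only thread 1; thread 2's own store shields threads 2 and 3.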

Summary of Protocol
- Per-word speculative information
- Multiple versions

Squash?         RAW                   WAR                   WAW
                same-word   false     same-word   false     same-word   false
In-order        No          No        No          No        No          No
Out-of-order    Yes         No        No          No        No          No

Speculative CMP
- Storing versions:
  – L1 maintains versions of speculative threads
  – L2 maintains only safe versions
  – Commit: write back dirty versions from L1 to L2
- Static mapping and in-order scheduling of tasks
- L1 and MDT overflows cause stalls

Outline
- Motivation
- Background
- Speculative CMP
- Scalable Speculation
- Evaluation
- Related Work
- Conclusions

Scalable Speculative Multiprocessor
- Goals:
- 1. Use a largely unmodified speculative CMP in a scalable system
  – trivially defaults to single-processor nodes
- 2. Provide simple integration into CC-NUMA

Hierarchical Approach: CMPs as Nodes

[figure: CC-NUMA system built from CMP nodes]
DIR: Directory
NC: Network Controller
LMDT: Local Memory Disambiguation Table
GMDT: Global Memory Disambiguation Table

Hierarchy

                      Node        System
Spec versions         L1          L2
Safe versions         L2          Mem
Spec info             LMDT        GMDT
Granularity of info   Processors  Nodes
Tasks                 Thread      Chunk
Mapping of tasks      Static      Dynamic
Commits               Processor   Node

Hierarchical Approach: Loads

Version on chip:
1. Miss in L1
2. Identify on-chip version
3. Provide version

Version off chip:
1. Miss in L1
2. No on-chip version
3. Miss in L2
4. Identify off-chip version
5. Identify on-chip version (at the remote node)
6. Provide version
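The two lookup sequences can be sketched with dictionaries standing in for the LMDTs, the GMDT, and memory. This is purely illustrative — the hardware performs these steps with table lookups and network messages, and all names here are mine:

```python
# Hedged sketch of the two-level version lookup on a load miss:
# first the node's LMDT (on-chip version), then the GMDT (which node
# holds an off-chip version), falling back to the safe memory copy.

def load_word(addr, node, lmdts, gmdt, memory):
    # Steps 1-2: miss in L1, ask the local LMDT for an on-chip version.
    version = lmdts[node].get(addr)
    if version is not None:
        return version                    # step 3: provide version
    # Steps 3-4: miss in L2, ask the GMDT which node holds a version.
    owner = gmdt.get(addr)
    if owner is not None and owner != node:
        # Steps 5-6: the remote node's LMDT locates its on-chip version.
        return lmdts[owner][addr]
    # No speculative version anywhere: use the safe copy in memory.
    return memory[addr]

# Hypothetical example: node 1 holds a speculative version of 0x10.
lmdts  = [{}, {0x10: 7}]
gmdt   = {0x10: 1}
memory = {0x10: 3, 0x20: 9}
```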

Hierarchical Approach: Stores

[figure: store handling — the store is recorded in the node's LMDT S bits and forwarded to the GMDT, which records it at the system level and checks the L bits of more speculative nodes]

Mapping of Tasks
- Chunk = consecutive set of threads

             Node                              System
Assignment   Threads → Processors              Chunks → Nodes
Mapping      Static                            Dynamic
Reason       Minimize spec CMP modifications   Minimize load imbalance

Mapping of Tasks / Node Commits

[figures: the GMDT maintains a queue of chunks ordered from least to most speculative, assigned to nodes n0, n1, n2, …]
- When a node finishes its chunk of iterations, it writes back dirty L1 data to L2 and the GMDT queue is updated.
- When the least-speculative node finishes, it commits: dirty L2 data is written back to memory and the queue is updated.
- If the next node in the queue has already finished its chunk, it commits immediately: its dirty L2 data is written back to memory and the queue is updated.
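The commit sequence sketched above amounts to an in-order commit queue: chunks are assigned to nodes dynamically, but commit strictly from least to most speculative. A minimal software rendering (names mine, details assumed) is:

```python
# Hedged sketch of in-order chunk commit. A node that finishes early
# must wait until every less speculative chunk has committed; then its
# own commit follows immediately.
from collections import deque

class CommitQueue:
    def __init__(self, chunks):
        self.queue = deque(chunks)   # chunks, least -> most speculative
        self.finished = set()
        self.committed = []

    def finish(self, chunk):
        self.finished.add(chunk)
        # Commit every leading chunk that has finished, in order.
        while self.queue and self.queue[0] in self.finished:
            self.committed.append(self.queue.popleft())

q = CommitQueue([0, 1, 2])
q.finish(1)   # a node finishes chunk 1 early: nothing commits yet
q.finish(0)   # chunk 0 finishes: chunks 0 and then 1 commit in order
```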

GMDT Features
+ Allows displacement of clean speculative data from caches → 30% faster because of no stalls
+ Identifies versions to read with a single lookup
+ Selective invalidations with shielding → 50% fewer network messages
+ Squashes all faulting threads in parallel & restarts them in parallel → 5% faster in a 4-node system
– Commit and squash require synchronization of the GMDT (not of the processors)

Outline
- Motivation
- Background
- Speculative CMP
- Scalable Speculation
- Evaluation
- Related Work
- Conclusions

Simulation Environment
- Execution-driven simulation
- Detailed superscalar processor model
- Coherent + speculative memory back-end
- Scalable multiprocessor: 4 nodes
- Node: spec CMP + 1MB L2 + 2K-entry GMDT
- CMP: 4 × (processor + 32KB L1) + 512-entry LMDT
- Processor: 4-issue, dynamic

Applications
- Applications dominated by non-analyzable loops (subscripted subscripts):
  – Track, BDNA (PERFECT)
  – APSI (SPECfp95)
  – DSMC3D, Euler (HPF-2)
  – Tree (Univ. of Hawaii)
- Non-analyzable loops and accesses identified by the Polaris parallelizing compiler
- Results shown for the non-analyzable loops only; these loops take on avg. 51% of the sequential time

Overall Performance
- Avg. speedup = 4.4
- Memory time is the most significant performance bottleneck
- Squash time is only noticeable for DSMC3D
- L1 overflow is only noticeable for BDNA and APSI

Loop Unrolling
- Base: 1 iteration per processor; unrolling: 2 and 4 iterations per processor
- Increased performance: avg. speedup of 5.2
- Reduction in memory time
- Increase in L1 overflows

Granularity of Speculative State
- Line = 16 words (64 bytes)
- Tracking speculative state per line instead of per word leads to an increased number of squashes
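Why per-line state causes extra squashes can be shown with a tiny address-comparison sketch (illustrative only; `same_unit` is my name, and a 4-byte word is assumed):

```python
# Hedged sketch: with per-line (rather than per-word) speculative state,
# two threads touching *different* words of the same 64-byte line look
# like a dependence, triggering an unnecessary ("false") squash.

LINE_WORDS = 16   # 16 words x 4 bytes = one 64-byte line

def same_unit(word_addr_a, word_addr_b, per_word):
    """Do two word addresses conflict at the tracked granularity?"""
    if per_word:
        return word_addr_a == word_addr_b            # exact word match
    return word_addr_a // LINE_WORDS == word_addr_b // LINE_WORDS
```

Words 3 and 7 share a line, so a per-line protocol sees a conflict where a per-word protocol sees none.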

Related Work (I)
- CMP schemes: Multiscalar (Wisconsin), TLDS (CMU), Hydra (Stanford), MDT (Illinois), Superthreaded (Minnesota), Speculative Multithreading (UPC)
  – designed for tightly-coupled systems: not scalable

Related Work (II)
- Scalable schemes: TLDS (CMU), Zhang et al. (Illinois)
  – Speculative state dispersed along with the data
  – Flat view of processors
  – Zhang et al.: more sophisticated (handles reductions, load imbalance, large working sets) but complex

Conclusions
- Extended speculative parallelization to a scalable system
- Integrated a largely unmodified speculative CMP
  – trivially defaults to single-processor nodes
- Promising results: speedup of 5.2 for 16 processors
- Per-word speculative state is needed to avoid excessive squashes


Cross-Iteration Dependences

[table: for each application (BDNA, Track, APSI, DSMC3D, Euler), the number and average distance of cross-iteration RAW, WAR, and WAW dependences, each split into same-word and false categories]

False: dependence between different words of the same cache line


Application Behavior

Access pattern                    Multiple versions  Per-word state                     Appl.
1. Random, clustered              Yes                                                   Track, DSMC3D
2. Random, sparse                                                                       DSMC3D, Euler
3. Often write followed by read   Yes                No, but may suffer more squashes   APSI, BDNA, Tree

[figure: access patterns across iterations i0, i1, i2 with write (W) and read (R) marks]

Minimizing Squashes
- False: dependence between different words of the same cache line

Support for:
multiple versions   per-word state   Squash on:
Y                   Y                out-of-order (ooo) same-word RAW only
N                   Y                ooo same-word RAW, WAR, WAW
Y                   N                ooo same-word RAW, false WAW

Interaction GMDT–Directory
- In general: the GMDT operates on speculative data; the directory operates on coherent data
- However, the directory's sharing vector is kept up-to-date for speculative data, for two reasons:
  – smooth the transition into and out of speculative sections
  – further filter out invalidation messages on speculative stores

Multiprogramming
- Replicate the window for each job
- Add a job id to GMDT entries

Simulation Environment

Processor parameters:
Issue width: 4
Instruction window size: 64
No. functional units (Int,FP,Ld/St): 3,2,2
No. renaming registers (Int,FP): 32,32
No. pending memory ops. (Ld,St): 8,16

Memory parameters:
L1,L2,VC size: 32KB,1MB,64KB
L1,L2,VC assoc.: 2-way,4-way,8-way
L1,L2,VC line size: 64B,64B,64B
L1,L2,VC latency: 1,12,12 cycles
L1,L2,VC banks: 2,3,2
Local memory latency: 75 cycles
2-hop memory latency: 290 cycles
3-hop memory latency: 360 cycles
LMDT,GMDT size: 512,2K entries
LMDT,GMDT assoc.: 8-way,8-way
LMDT,GMDT lookup latency: 4,20 cycles
L1-to-LMDT latency: 3 cycles
LMDT-to-L2 latency: 8 cycles
Max. active window: 8 chunks

Application Characteristics
- Performance data reported refer to the non-analyzable loops only

Application   Loops to parallelize
Track         nfilt_
APSI          run_[20,30,40,60,100]
DSMC3D        move3_
Euler         dflux_[100,200], eflux_[100,200,300], psmoo_
BDNA          actfor_
Tree          accel_

[table also lists, per application, the % of sequential time and the speculative data footprint (KB); average: 51% of sequential time]

Loop Unrolling
- Avg. speedup = 4.7 (block 2), 4.6 (block 4)
- Memory time is reduced
- Potential increase in L1 overflows