Exploring Memory Consistency for Massively Threaded Throughput-Oriented Processors
Blake Hechtman and Daniel J. Sorin

Executive Summary
– Massively Threaded Throughput-Oriented Processors (MTTOPs) like GPUs are being integrated on chips with CPUs and used for general-purpose programming
– Conventional wisdom favors weak consistency on MTTOPs
– We implement a range of memory consistency models (SC, TSO, and RMO) on MTTOPs
– We show that strong consistency is viable for MTTOPs

What is an MTTOP?
Massively Threaded:
– 4-16 core clusters
– 8-64-thread-wide SIMD
– deep SMT
→ thousands of concurrent threads
Throughput-Oriented:
– sacrifice latency for throughput
– heavily banked caches and memories
– many cores, each of which is simple

Example MTTOP
(figure: core clusters, each with fetch, decode, SIMD execution lanes, and a shared L1, connected to banked L2 caches and a memory controller; cache-coherent shared memory)

What is Memory Consistency?
Initially A = B = 0
Thread 0: ST B = 1; LD r1, A
Thread 1: ST A = 1; LD r2, B
Sequential Consistency: {r1,r2} = 0,1; 1,0; 1,1 (the outcome 0,0 is forbidden)
Weak Consistency: {r1,r2} = 0,1; 1,0; 1,1; 0,0 → enables store buffering
MTTOP hardware concurrency seems likely to be constrained by Sequential Consistency (SC)
In this work, we explore hardware consistency models
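The slide's store-buffering example can be written directly as a GPU microbenchmark. Below is a minimal CUDA sketch (the kernel name and launch shape are illustrative, not taken from the paper): two threads in different blocks execute the two store/load pairs, and repeated runs probe which outcomes the hardware's consistency model actually permits.

// Store-buffering litmus test from the slide, as a CUDA sketch.
// A, B and the result slots live in global memory and start at 0;
// "volatile" forces real memory instructions instead of register caching.
__global__ void litmus(volatile int *A, volatile int *B, int *r1, int *r2) {
    if (blockIdx.x == 0) {      // Thread 0
        *B = 1;                 // ST B = 1
        *r1 = *A;               // LD r1, A
    } else {                    // Thread 1
        *A = 1;                 // ST A = 1
        *r2 = *B;               // LD r2, B
    }
}
// Launch as litmus<<<2, 1>>>(A, B, r1, r2) with all four locations zeroed.
// Under SC the outcome {r1,r2} = 0,0 is forbidden; a weak model with
// store buffering may expose it on some runs.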

(CPU) Memory Consistency Debate
Conclusion for CPUs: trade off ~10-40% performance for programmability
– "Is SC + ILP = RC?" (Gniady et al., ISCA 1999)
Performance: strong consistency is slower, weak consistency is faster
Programmability: strong consistency is easier, weak consistency is harder
But does this conclusion apply to MTTOPs?

Memory Consistency on MTTOPs
– GPUs have undocumented hardware consistency models
– Intel MIC uses x86-TSO for the full chip with a directory cache coherence protocol
– MTTOP programming languages provide weak ordering guarantees
  – OpenCL does not guarantee store visibility without a barrier or kernel completion
  – CUDA includes a memory fence that can enable global store visibility
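To make the CUDA fence mentioned above concrete, here is a minimal sketch (not from the paper) of the usual flag-based publish pattern; the symbol names are illustrative, and the spin loop assumes the two blocks are co-resident on the device.

// Block 0 publishes a value; block 1 waits for it. Without __threadfence(),
// the consumer could observe flag == 1 before the payload store is visible.
__device__ int payload;
__device__ volatile int flag = 0;   // publish flag

__global__ void publish_and_read(int *out) {
    if (blockIdx.x == 0) {
        payload = 42;           // store the payload
        __threadfence();        // order the payload store before the flag store
        flag = 1;               // publish
    } else {
        while (flag == 0) { }   // spin until the flag is set
        __threadfence();        // order the flag check before the payload load
        *out = payload;         // should observe 42
    }
}
// Launch as publish_and_read<<<2, 1>>>(out).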

MTTOP Conventional Wisdom
– Highly parallel systems benefit from less ordering
  – Graphics doesn't need ordering
– Strong consistency seems likely to limit MLP
– Strong consistency is likely to suffer extra latencies
Weak ordering helps CPUs, but does it help MTTOPs? It depends on how MTTOPs differ from CPUs ...

Diff 1: Ratio of Loads to Stores
Weak consistency reduces the impact of store latency on performance.
Prior work shows CPUs perform 2-4 loads per store; MTTOPs perform more loads per store (chart: loads per store, CPU vs. MTTOP workloads)
→ store latency optimizations will not be as critical to MTTOP performance

Diff 2: Outstanding L1 Cache Misses
Weak consistency enables more outstanding L1 misses per thread.
MTTOPs have more outstanding L1 cache misses than CPUs
→ the thread reordering enabled by weak consistency is less important for handling memory latency

Diff 3: Memory System Latencies
Weak consistency enables reductions in store latency.
(figure: CPU core pipeline with LSQ and ROB, showing L1 hits in 1-2 cycles and L2 hits in 5-20 cycles plus memory latency, versus an MTTOP core cluster with longer latencies at each level)
MTTOPs have longer memory latencies
→ small latency savings will not significantly improve performance

Diff 4: Frequency of Synchronization
Weak consistency only reorders memory ops between synchronizations.
Typical pattern on both CPUs and MTTOPs:
– split the problem into regions
– assign regions to threads
– do: work on local region, then synchronize
CPU local region: ~private cache size
MTTOP local region: ~private cache size / threads per cache
MTTOPs use more threads to compute a problem
→ each thread has fewer independent memory ops between syncs
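On an MTTOP this "work on local region, then synchronize" loop typically maps onto one kernel launch per iteration, with the kernel boundary acting as the global synchronization point. A minimal CUDA sketch of that structure (the kernel, its stencil, and the buffer names are illustrative, not from the paper):

#include <cuda_runtime.h>
#include <utility>

// Each thread owns a tiny local region (here a single element), so only a
// few independent memory operations occur between synchronizations.
__global__ void relax_step(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // this thread's region
    if (i > 0 && i < n - 1)
        out[i] = 0.5f * (in[i - 1] + in[i + 1]);     // work on local region
}

void run(float *a, float *b, int n, int iters) {     // a, b are device buffers
    int threads = 256, blocks = (n + threads - 1) / threads;
    for (int it = 0; it < iters; ++it) {
        relax_step<<<blocks, threads>>>(a, b, n);    // work on local regions
        cudaDeviceSynchronize();                     // synchronize
        std::swap(a, b);                             // ping-pong the buffers
    }
}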

Diff 5: RAW Dependences Through Memory
Weak consistency enables store-to-load forwarding.
CPUs: blocking for cache performance, frequent function calls, few architected registers → many RAW dependences through memory
MTTOPs: coalescing for cache performance, inlined function calls, many architected registers → few RAW dependences through memory
MTTOP algorithms have fewer RAW memory dependences
→ there is little benefit to being able to read from a write buffer
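To make the MTTOP column concrete, the small CUDA sketch below (illustrative, not one of the paper's benchmarks) keeps its running value in a register and reads memory with coalesced accesses, so the only potential read-after-write through memory is the single store at the end.

// Grid-stride partial dot product: consecutive threads read consecutive
// elements (coalescing), and the accumulator lives in a register, so there
// are no RAW dependences through memory until the final store.
__global__ void partial_dot(const float *x, const float *y,
                            float *partial, int n) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;
    float acc = 0.0f;                        // register-resident accumulator
    for (int i = tid; i < n; i += stride)
        acc += x[i] * y[i];                  // coalesced loads, no memory RAW
    partial[tid] = acc;                      // the only store this thread performs
}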

MTTOP Differences & Their Impact
Other differences are discussed in the paper.
How much do these differences affect the performance of memory consistency implementations on MTTOPs?

Memory Consistency Implementations (strongest to weakest):
– SC simple: no write buffer
– SC wb: per-lane FIFO write buffer, drained on LOADS
– TSO: per-lane FIFO write buffer, drained on FENCES
– RMO: per-lane CAM for outstanding write addresses
(figure: the four core-cluster pipelines with their write-buffer / CAM structures)
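The difference between the SC wb and TSO designs comes down to when the per-lane FIFO drains. The short host-side C++ sketch below is only a functional illustration of those two drain policies for a single lane; it is not the paper's hardware implementation, and the type names are invented for the example.

#include <cstdint>
#include <deque>
#include <unordered_map>
#include <utility>

// Functional model of one lane's FIFO write buffer under two drain policies.
struct WriteBuffer {
    enum class Drain { OnLoad /* SC wb */, OnFence /* TSO */ };
    Drain policy;
    std::deque<std::pair<uint64_t, uint32_t>> fifo;   // buffered (addr, value) stores
    std::unordered_map<uint64_t, uint32_t> *mem;      // shared memory image

    void store(uint64_t addr, uint32_t val) { fifo.emplace_back(addr, val); }

    uint32_t load(uint64_t addr) {
        if (policy == Drain::OnLoad) {
            drain();                          // SC wb: every load drains the buffer first
        } else {
            // TSO: a load may bypass older buffered stores to other addresses,
            // but must see this lane's own latest store to the same address.
            for (auto it = fifo.rbegin(); it != fifo.rend(); ++it)
                if (it->first == addr) return it->second;
        }
        auto hit = mem->find(addr);
        return hit == mem->end() ? 0u : hit->second;
    }

    void fence() { drain(); }                 // both policies drain on a fence

    void drain() {                            // release buffered stores in FIFO order
        for (auto &st : fifo) (*mem)[st.first] = st.second;
        fifo.clear();
    }
};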

Methodology
– Modified gem5 to support SIMT cores running a modified version of the Alpha ISA
– Looked at typical MTTOP workloads
  – Had to port workloads to run in the system model
– Ported Rodinia benchmarks: bfs, hotspot, kmeans, and nn
– Handwritten benchmarks: dijkstra, 2dconv, and matrix_mul

Target MTTOP System
– core clusters: 16 core clusters; 8-wide SIMD
– core: in-order, Alpha-like ISA, 64-deep SMT
– interconnection network: 2D torus
– coherence protocol: writeback MOESI
– L1I cache (shared by cluster): perfect, 1-cycle hit
– L1D cache (shared by cluster): 16KB, 4-way, 20-cycle hit, no local memory
– L2 cache (shared by all clusters): 256KB, 8 banks, 8-way, 50-cycle hit
Consistency model-specific features (which benefit the weaker models):
– write buffer (SC wb and TSO): perfect, instant access
– CAM for store address matching: perfect, instant access

Results (figure)

Results (figure)

Conclusions
– Strong consistency should not be ruled out for MTTOPs on the basis of performance
– Improving store performance with write buffers appears unnecessary
– Graphics-like workloads may get significant MLP from load reordering (dijkstra, 2dconv)
Conventional wisdom may be wrong about MTTOPs

Caveats and Limitations
– Results do not necessarily apply to all possible MTTOPs or MTTOP software
– Evaluation used writeback caches, whereas current MTTOPs use write-through caches
– Future workloads may become more CPU-like

Thank you! Questions?