Speculative Sequential Consistency with Little Custom Storage Impetus Group Computer Architecture Lab (CALCM) Carnegie Mellon University

Presentation transcript:

Speculative Sequential Consistency with Little Custom Storage Impetus Group Computer Architecture Lab (CALCM) Carnegie Mellon University Chris Gniady and Babak Falsafi

PACT 2002 Copyright 2002 Chris Gniady Speculative Sequential Consistency with Little Custom Storage

Distributed Shared Memory (DSM)
Logically shared but physically distributed memory
• Shared-memory programming
• Scalable
• Long shared memory access can be a bottleneck!
[Diagram labels: CPU, Cache, Memory Bus, Memory, DSM Hardware, Network]

Programming DSM
To achieve high performance:
• Release Consistency (RC)
• Relaxes memory order
• Software annotation
What programmers want:
• Sequential Consistency (SC)
• Intuitive
• Memory order enforced → slow
Prior work: Speculative SC (SC++) [ISCA’99]
• Hardware speculatively relaxes order
• High performance & intuitive
• Large custom “speculative history” queue
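To make the SC versus relaxed-order contrast concrete, here is a minimal Python sketch of the classic store-buffering litmus test. It is not from the talk: the thread contents, the function names, and the "relaxed" model (each thread's load may simply move ahead of its own earlier store) are illustrative assumptions, and the latter is far simpler than a real RC implementation.

```python
from itertools import combinations

# Two threads of the store-buffering litmus test (hypothetical example).
# Each operation is (kind, variable); memory starts with x = y = 0.
T0 = [("ST", "x"), ("LD", "y")]
T1 = [("ST", "y"), ("LD", "x")]


def sc_outcomes(t0, t1):
    """Enumerate every interleaving that preserves each thread's order."""
    n = len(t0) + len(t1)
    outcomes = set()
    for slots in combinations(range(n), len(t0)):
        i0, i1 = iter(t0), iter(t1)
        # Positions in `slots` take thread 0's next op, the rest thread 1's.
        order = [(0, next(i0)) if k in slots else (1, next(i1))
                 for k in range(n)]
        mem = {"x": 0, "y": 0}
        seen = {}
        for tid, (kind, var) in order:
            if kind == "ST":
                mem[var] = 1
            else:
                seen[tid] = mem[var]
        outcomes.add((seen[0], seen[1]))
    return outcomes


def relaxed_outcomes(t0, t1):
    """Assumed relaxation: a thread's load may pass its earlier store."""
    out = set()
    for a in (t0, t0[::-1]):
        for b in (t1, t1[::-1]):
            out |= sc_outcomes(a, b)
    return out


print(sorted(sc_outcomes(T0, T1)))
print(sorted(relaxed_outcomes(T0, T1)))
```

Under SC both loads can never return 0, because at least one store precedes both loads in any interleaving. Once the load is allowed to pass the store, that outcome appears, which is why relaxed models rely on software annotation to mark where ordering matters.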

This Talk’s Contributions
1. Characterize history size across apps
   • Varies from 16 to 8K entries!
   • Bursty: over 85% of time empty
2. Propose SC++Lite
   • Allocates history in memory hierarchy
   • Enhances scalability across apps & systems
   • Reduces custom storage from 51 KB to 2 KB
Result: Speculative SC (almost) for free!

Outline
• Overview
• Memory Ordering in RC
• Memory Ordering in SC++
• SC++Lite: SC++ with Little Custom Storage
• Results
• Conclusions

Memory Ordering in RC
• “LD A” & “ST A” retire out of order
• Overlaps “ST X”, “LD Y” & “LD Z” misses
• Software guarantees overlap is OK!
[Diagram labels: ST X, ST A, LD A, ALU; Retired out of order; Reorder Buffer: LD Y, LD Z, ...; LD/ST Queue: LD Z miss, ST X miss, LD Y miss, ...]

SC++: Hardware Relaxes Memory Order [ISCA’99]
• Speculatively retires instructions in hardware
• Rolls back when coherence messages hit in history
[Diagram labels: Speculative Retirement: ST X, ST A, LD A, ALU; Reorder Buffer: LD Y, LD Z, ...; Speculative History Queue; Coherence Messages: look up for potential rollback; LD/ST Queue: LD Z miss, ST X miss, LD Y miss, ...]
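The rollback rule on this slide can be sketched as a small software model. This is only an illustration under stated assumptions: the class and method names are hypothetical, each history entry is reduced to a sequence number plus a block address, and the state a real design would need in order to undo instructions is not modeled.

```python
from collections import deque


class SpeculativeHistory:
    """Toy model of a speculative history queue (names are hypothetical)."""

    def __init__(self):
        # Oldest entry first: (sequence number, block address).
        self.entries = deque()

    def retire_speculatively(self, seq, block):
        """Log an access that retired while older misses are outstanding."""
        self.entries.append((seq, block))

    def commit_through(self, seq):
        """Drop entries up to `seq` once they are no longer speculative."""
        while self.entries and self.entries[0][0] <= seq:
            self.entries.popleft()

    def on_coherence_message(self, block):
        """Return the sequence number to roll back to, or None on a miss."""
        hit = next((seq for seq, b in self.entries if b == block), None)
        if hit is None:
            return None
        # Squash the matching entry and everything younger than it.
        while self.entries and self.entries[-1][0] >= hit:
            self.entries.pop()
        return hit


history = SpeculativeHistory()
history.retire_speculatively(1, 0xA0)
history.retire_speculatively(2, 0xB0)
history.retire_speculatively(3, 0xC0)
print(history.on_coherence_message(0xB0))  # rollback point for a hit
print(list(history.entries))               # older entries survive
```

The lookup is the part that costs custom storage in hardware: every incoming coherence message must be checked against all live entries, so the queue has to be sized for the longest history an application can produce.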

SC++’s Implementation Overhead
Speculative History Queue:
• On-chip custom storage
• Grows up to subsequent missing load
• Size is application & system dependent: must assume worst-case size at design!
Can we (virtually) eliminate custom storage in SC++?

SC++Lite: SC++ with Little Custom Storage
Store history into memory hierarchy!
1. Queue allocated at boot time in physical memory
2. Use block buffer to pack history, ship to L2
3. Store ack updates head pointer (in LD/ST queue)
4. ROB retirement updates tail pointer
5. “Dead” history is not written back!
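The five steps above can be approximated by a toy model of a memory-backed history queue. Everything here is a sketch under assumptions: the class name, the sizes, and the interpretation that the head marks the oldest live entry while the tail marks the next free slot are not taken from the talk, and "dead" history is simplified to entries that are acknowledged before their block is shipped.

```python
class MemoryBackedHistory:
    """Toy model of history kept in a reserved memory region (hypothetical)."""

    def __init__(self, capacity_blocks=4, entries_per_block=4):
        self.entries_per_block = entries_per_block
        self.capacity = capacity_blocks * entries_per_block
        # Stands in for the region reserved at boot time (step 1).
        self.region = [None] * self.capacity
        self.head = 0            # oldest live entry, advanced by store acks
        self.tail = 0            # next free slot, advanced by retirement
        self.sbb = []            # block buffer packing entries (step 2)
        self.blocks_shipped = 0  # blocks actually sent toward L2

    def retire(self, entry):
        """ROB retirement appends an entry and advances the tail (step 4)."""
        if self.tail - self.head >= self.capacity:
            raise OverflowError("history region full; retirement must stall")
        self.sbb.append((self.tail, entry))
        self.tail += 1
        if len(self.sbb) == self.entries_per_block:
            self._ship()

    def _ship(self):
        """Write one packed block into the circular region."""
        for pos, entry in self.sbb:
            self.region[pos % self.capacity] = entry
        self.sbb.clear()
        self.blocks_shipped += 1

    def store_ack(self, n=1):
        """A store ack frees the n oldest entries (step 3)."""
        self.head = min(self.head + n, self.tail)
        # Entries that died while still buffered never leave the buffer.
        self.sbb = [(pos, e) for pos, e in self.sbb if pos >= self.head]

    def live(self):
        return self.tail - self.head


queue = MemoryBackedHistory(capacity_blocks=2, entries_per_block=4)
for i in range(3):
    queue.retire(f"entry{i}")
queue.store_ack(3)
print(queue.blocks_shipped, queue.live())
```

In this model a short history burst is absorbed entirely by the block buffer and generates no traffic, while a long burst spills block by block into the region. Only the two pointers and the buffer would need dedicated storage.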

Memory Ordering in SC++Lite
• Only history burst retires into L2
• History in L2 typically discarded
[Diagram labels: Speculative Retirement: ST A, LD A, ALU; Speculative Block Buffer; Cache block to L2; Location in L2; Reorder Buffer: LD Y, LD Z, ...; LD/ST Queue: LD Z miss, ST Z miss, LD Y miss, ..., ST X miss; ROB Index; Head; Coherence Messages: look up for potential rollback]

SC++Lite Design Requirements
Avoid perturbing application’s critical path!
SBB (Speculative Block Buffer):
• Size depends on L2 latency & retirement rate
• Large enough to filter store hits into L2
L2:
• Retirement rate proportional to required bandwidth
• Large blocks help
• Small blocks may need multiporting
• Head & tail registers reduce history traffic

Experimental Methodology
Using RSIM
• 16 nodes with 1 GHz, 8-issue CPU
• 128-entry ROB & LD/ST queue
• Average remote-to-local access ratio of ~2
• 32-Kbyte, direct-mapped L1 cache
• 512-Kbyte, 8-way L2 cache, 64 GB/s
• 256-entry Lookup Table
• 32-entry SBB

History Size Characterization
• System & application dependent: varies 16–4K
• History is bursty: non-empty < 15% time

Base RC, SC++ & SC++Lite
• Up to 80% gap between SC & RC
• 31% average speedup for SC++, 28% for SC++Lite

Sensitivity to 4x Network Latency
• SC++ requires 2x queue size to perform best
• SC++Lite’s performance remains stable

Custom Storage Requirements
SC++:
• ~51 KB of custom storage
• Doubles for 4x network latency
• Radix shows worst-case history
SC++Lite:
• ~2 KB of custom storage for all apps
• Performance insensitive to network latency

Conclusions
Previously showed [ISCA’99]:
• Speculative SC achieves RC’s performance
This talk:
• Proposed SC++Lite
• Allocates history in memory hierarchy
• Enhances scalability across apps & systems
Result: Speculative SC (almost) for free!

For More Information
Please visit our web site at
Impetus Group, Computer Architecture Lab (CALCM), Carnegie Mellon University