Fusion: Design Tradeoffs in Coherent Cache Hierarchies for Accelerators
Snehasish Kumar, Arrvindh Shriraman, and Naveen Vedula
School of Computing Sciences, Simon Fraser University, ISCA '15
Presented by: Keshav Mathur

Outline
- Motivation
- Proposed design
- Implementation details
- Evaluation methods
- Results
- Conclusion and comments
- References

Accelerators: We need them, but...
- Specialized hardware reduces power and improves performance; power and clock gating are easier
- Specialization spectrum: see "Dynamically Specialized Datapaths for Energy Efficient Computing," Venkatraman Govindaraju, Chen-Han Ho, Karthikeyan Sankaralingam, Vertical Research Group, University of Wisconsin-Madison
- Granularity of accelerators: fine-grained fixed-function vs. coarse-grained domain-specific
- Programmability concerns: how easy is discovery, and how frequent is invocation?

Fixed-Function Accelerators
- Fine-grained offloading of functions (func1(); func2(); func3(); func4();) to multiple accelerators behind the LLC, fed by DMA (modeled with Aladdin [5])
- + Enables datapath reuse and saves control-path power
- - Creates a producer-consumer scenario and incurs frequent data movement (DMA calls); see the sketch below
- Open questions: forwarding buffers? Co-located shared memory? Scratchpad? Cache? Stash? Both? Always?
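
To make the DMA cost concrete, here is a minimal runnable sketch of chaining two fixed-function accelerators through private scratchpads. The `dma_copy` and `acc_invoke` APIs are hypothetical stand-ins (stubbed so the sketch runs), not an interface from the paper:

```cpp
#include <cstdint>
#include <cstring>
#include <cstddef>

// Hypothetical driver-level API, stubbed so the sketch compiles and runs;
// real DMA engines and accelerator invocation interfaces differ.
static void dma_copy(void* dst, const void* src, size_t bytes) {
    std::memcpy(dst, src, bytes);                          // stand-in for a DMA transfer
}
static void acc_invoke(int acc_id, uint8_t* spm, size_t bytes) {
    for (size_t i = 0; i < bytes; ++i) spm[i] += acc_id;   // toy "kernel"
}

// Chaining func1 -> func2 through private scratchpads: every hand-off between
// accelerators pays extra DMA transfers through shared memory.
void offload_chain(uint8_t* shared_buf, size_t bytes, uint8_t* spm1, uint8_t* spm2) {
    dma_copy(spm1, shared_buf, bytes);   // host memory -> producer scratchpad
    acc_invoke(1, spm1, bytes);          // func1 on ACX1
    dma_copy(shared_buf, spm1, bytes);   // producer result -> shared memory
    dma_copy(spm2, shared_buf, bytes);   // shared memory -> consumer scratchpad
    acc_invoke(2, spm2, bytes);          // func2 on ACX2
    dma_copy(shared_buf, spm2, bytes);   // final result -> host memory
}
```

Every producer-consumer hand-off pays multiple trips through shared memory; this is the traffic the shared, coherent hierarchies in the rest of the talk aim to eliminate.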

Architecture: Tile + ?
- Tiled architecture: multiple fixed-function accelerators on a single tile
- Independent memory hierarchy and independent coherence protocol
- Multiple tiles per core

Caches:
- Non-deterministic behaviour (hit/miss)
- Coherent, non-polluting memory
- Capture locality, enable reuse
- Hardware address-translation energy and latency
- Implicit data movement, lazy writebacks

Software-managed scratchpads:
- Deterministic access
- Low load-use latency
- Efficient memory utilization
- Incoherent, private address space

Baseline Systems: SCRATCH and SHARED
SHARED:
- One shared L1 (L1X) per tile; the L1X takes part in MESI-based coherence
- The L2 at the host maintains inclusion with the L1X
- Captures spatial and temporal locality; provides a coherent view of memory
- But a customized hierarchy is needed for efficiency: size vs. latency tradeoffs, since the L1X needs to be sized for multiple software threads
SCRATCH:
- Private scratchpad, one per accelerator, with a DMA controller
- Good for compute-intensive workloads; a large scratchpad amortizes the DMA overhead
- But large size increases access latency and energy; high overhead for kernels of sequential programs with high locality
- Examples in SoC systems: ARM AXI [1], IBM CAPI [2]

Fusion Architecture
- Private L0X per accelerator, independently sized
- Banked L1X, shared across the tile
- ACC coherence (timestamp-based) between the accelerators' L0Xs and the L1X
- The L1X is kept coherent with the cores via a MESI protocol implementation
- Accelerators operate on virtual addresses (no TLB on their access path)
- PID tags in the caches distinguish accelerator processes (PID1, PID2)

Fusion: Data Flow
- Write forwarding avoids write-backs to the cache and exploits the producer-consumer scenario
- Reduces data migration by avoiding DMAs
- Without it, frequent write-backs between the L0X and L1X cause energy overhead
(A sketch of the forwarding decision follows this slide.)
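
A minimal sketch of the write-forwarding idea, assuming a hypothetical L1X-side table that records which accelerator will consume a line; the paper's actual mechanism may differ:

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

struct CacheLine { uint64_t vaddr; std::vector<uint8_t> data; };

// Hypothetical L1X-side forwarding table: virtual line address -> consumer L0X.
// How intent is registered (software hint, protocol state) is an assumption;
// the map is only there to illustrate the mechanism.
class WriteForwarder {
    std::unordered_map<uint64_t, int> consumer_of_;
public:
    void register_consumer(uint64_t vaddr, int l0x_id) { consumer_of_[vaddr] = l0x_id; }

    // Called on a dirty write-back from a producer L0X.
    void on_writeback(const CacheLine& line) {
        auto it = consumer_of_.find(line.vaddr);
        if (it != consumer_of_.end()) {
            push_to_l0x(it->second, line);   // forward directly to the consumer's L0X
        } else {
            install_in_l1x(line);            // normal write-back path
        }
    }
private:
    void push_to_l0x(int /*id*/, const CacheLine&) { /* model: fill consumer's L0X */ }
    void install_in_l1x(const CacheLine&)          { /* model: write into L1X bank */ }
};
```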

Virtual Memory: AX-TLB
(Figure: two ACX tiles, each with L0Xs and a shared L1X, alongside the host cores' L1Ds and the shared L2; the AX-TLB sits on the L1X miss path.)
- The accelerator caches are virtually addressed; the AX-TLB provides virtual-to-physical translation only on an L1X miss
- The translated address is used to index the shared L2 and participate in MESI actions

Virtual Memory: AX-RMAP
(Figure: the AX-RMAP maps a physical block address to a virtual cache-line pointer in the L1X.)
- Host requests are filtered at the L2 directory based on the sharer list
- For requests that pass the filter, the AX-RMAP translates the physical block address back to the virtual line pointer (a sketch follows this slide)
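
Conceptually, the AX-RMAP is a reverse map consulted only for probes that survive the L2 directory's sharer-list filter. A hedged software model, with structure names that are assumptions rather than the paper's:

```cpp
#include <cstdint>
#include <optional>
#include <unordered_map>

// Model of the AX-RMAP: the L1X is virtually indexed, so a coherence probe
// carrying a *physical* block address must be translated back to the virtual
// line it may cache before the L1X can be probed.
class AxRmap {
    std::unordered_map<uint64_t, uint64_t> phys_to_virt_;  // phys block -> virt line ptr
public:
    // Populated on an L1X fill, when the AX-TLB has just produced the PA.
    void insert(uint64_t phys_block, uint64_t virt_line) {
        phys_to_virt_[phys_block] = virt_line;
    }
    void erase(uint64_t phys_block) { phys_to_virt_.erase(phys_block); }

    // Called only for probes that pass the L2 directory's sharer-list filter,
    // so most host traffic never reaches this lookup.
    std::optional<uint64_t> lookup(uint64_t phys_block) const {
        auto it = phys_to_virt_.find(phys_block);
        if (it == phys_to_virt_.end()) return std::nullopt;
        return it->second;
    }
};
```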

Accelerator Coherence Protocol
- Timestamp-based, self-invalidation protocol; 2-hop, which saves energy
- Motivation: enable data migration between accelerators rather than concurrent sharing
- Supports sequential consistency for the accelerators
- Host side: 3-hop directory-based MESI
- Lease times are set based on the operation and the accelerator's known compute latency
- Each L0X cache line carries a local lease LTime; the line is valid while Time < LTime
- Each L1X cache line carries a global time GTime = max(LTime) over all L0X copies
(A sketch of the lease check follows this slide.)
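
A minimal sketch of the lease logic as described on this slide; this is my reading of the slide, not the paper's exact state machine:

```cpp
#include <cstdint>

// Lease-based self-invalidation: a line is usable only while the current
// logical time is below its lease. Expiry self-invalidates the copy with no
// invalidation message, which is what makes the protocol 2-hop.
struct L0xLine {
    uint64_t lease_ltime = 0;                                 // lease granted by the L1X
    bool valid(uint64_t now) const { return now < lease_ltime; }
};

struct L1xLine {
    uint64_t gtime = 0;                                       // max lease over all L0X copies

    // Grant a lease sized to the accelerator's known compute latency.
    uint64_t grant(uint64_t now, uint64_t compute_latency) {
        uint64_t ltime = now + compute_latency;
        if (ltime > gtime) gtime = ltime;                     // GTime = max(LTime)
        return ltime;
    }

    // A writer (or the host) can safely take the line once every lease expired:
    // all L0X copies have self-invalidated, so no invalidation round is needed.
    bool all_copies_expired(uint64_t now) const { return now >= gtime; }
};
```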

Accelerator Memory Operations
Request: Load A, #10
1. Misses in the L0X, then misses in the L1X
2. Virtual-to-physical translation in the AX-TLB
3. MESI read request issued at the L2
Response:
4. Data + physical block address returned; the L1X is added to the sharer list
5. Physical address -> line pointer mapping installed in the AX-RMAP
6. Data + line pointer installed at the L1X; the accelerator consumes the data
(A stub walk-through of this path follows.)
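
Tying the AX-TLB, the L2 directory, and the AX-RMAP together, here is a stub walk-through of the L1X miss path; all functions are toy placeholders for the structures sketched earlier, not a real simulator API:

```cpp
#include <cstdint>
#include <cstdio>

// Toy stand-ins so the walk-through compiles and runs.
static uint64_t ax_tlb_translate(uint64_t vaddr) { return vaddr ^ 0xF000; }  // fake V->P
struct L2Reply { uint64_t phys_block; uint64_t data; };
static L2Reply l2_mesi_read(uint64_t paddr) { return {paddr >> 6, 42}; }     // adds sharer
static void ax_rmap_insert(uint64_t pb, uint64_t vl) {
    std::printf("rmap %llx -> %llx\n",
                (unsigned long long)pb, (unsigned long long)vl);
}
static void l1x_fill(uint64_t vl, uint64_t data) { (void)vl; (void)data; }

// L0X miss -> L1X miss path, following the slide's request/response steps.
uint64_t l1x_miss_load(uint64_t vaddr, uint64_t virt_line) {
    uint64_t paddr = ax_tlb_translate(vaddr);  // step 2: AX-TLB translation on L1X miss
    L2Reply r = l2_mesi_read(paddr);           // step 3: MESI read at the L2 directory
    ax_rmap_insert(r.phys_block, virt_line);   // step 5: physical -> line-pointer map
    l1x_fill(virt_line, r.data);               // step 6: data + line pointer at the L1X
    return r.data;                             // consumed by the accelerator
}
```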

Host Requests

Fusion vs. Fusion-Dx

Evaluation Methods
- MacSim: simulator for heterogeneous computing systems
- GEMS: simulator for the memory hierarchy (host cores with L1Ds and a shared L2)
- An Aladdin-like flow models the accelerators, with gprof used to profile the kernels

Benchmark Characterisation
- Most memory-intensive: FFT, DISP, TRACK, HIST
- Compute-intensive: ADPCM, SUSAN, FILTER
- Most data reuse: FFT, Tracking, ADPCM, HIST
- Maximum accelerators per tile: 6 (FFT)

Evaluation Specs

Results (baseline: scratchpad)
- Observation: the shared L1X helps memory-intensive kernels but hurts compute-dominated kernels
- Observation: the private L0X captures the spatial locality in SUSAN and ADPCM better than SHARED does
- Observation: L0X caches reduce L2 access energy by filtering DMA calls for memory-intensive programs
- Observation: Fusion's L1X further filters L2 accesses, but the increased coherence messages result in no significant energy improvement

Results (continued)
- Write-through caches are energy-expensive
- Even large L1X and L0X caches may not capture all of the working sets, and hence fail to give any energy benefit
- Protocol extensions like write forwarding can reduce energy consumption in the Fusion model
- Address-translation overheads need to be mitigated; Fusion reduces them by removing the TLB from the critical path

Comments
- Not all benchmarks with a high share percentage are evaluated with write forwarding
- Should kernels that are candidates for write forwarding be designed as single accelerators?
- How does a private L0X (4-8 KB) per ACX scale beyond the 6 ACXs per tile evaluated here?
- Adding timestamp comparison and update logic to the cache is a major change in cache design; does it affect access latency?

References
[1] Goodridge. The Effect and Technique of System Coherence in ARM Multicore Technology.
[2] IBM. POWER8 Coherent Accelerator Processor Interface (CAPI). http://www-304.ibm.com/webapp/set2/sas/f/capi/home.html
[3] Y. S. Shao, S. Xi, V. Srinivasan, G.-Y. Wei, and D. Brooks. Toward Cache-Friendly Hardware Accelerators.
[4] R. Komuravelli, M. D. Sinclair, J. Alsop, M. Huzaifa, M. Kotsifakou, P. Srivastava, S. V. Adve, and V. S. Adve. Stash: Have Your Scratchpad and Cache It Too. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA '15), 707-719. DOI: http://dx.doi.org/10.1145/2749469.2750374
[5] Y. S. Shao, B. Reagen, G.-Y. Wei, and D. Brooks. Aladdin: A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures.
[6] B. Reagen, R. Adolf, Y. S. Shao, G.-Y. Wei, and D. Brooks. MachSuite: Benchmarks for Accelerator Design and Customized Architectures. In IEEE International Symposium on Workload Characterization (IISWC), 2014.

Thank you. Questions?