DASX : Hardware Accelerator for Software Data Structures. Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman (Simon Fraser University), Vijayalakshmi Srinivasan (IBM Research).

Presentation transcript:

DASX : Hardware Accelerator for Software Data Structures Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman (Simon Fraser University), Vijayalakshmi Srinivasan (IBM Research)

Slide 2: Executive Summary. A simple loop such as void simple() { for (int i = 0; i < size; ++i) { a[i] = b[i] + c[i]; } } compiles to an address-generation and load sequence (mov $400, %r1; mov $4, %r2; mul %r3, %r2; add %r1, %r2; ld (%r2), %r4) that the core re-executes for each array element. This extra work encumbers the core's reorder buffer, and the high-level information about the data structure is lost by the time the instructions reach the hardware. DASX: accelerate the access of, and compute on, software data structures.

Slide 3: Outline.
– Challenges of data-centric applications
– Existing mechanisms to address challenges
– DASX : Data Structure Accelerator
– Benchmarks and Evaluation

Slide 4: Challenge 1/3 : Instruction Overhead. Data-structure traversal costs instructions per element: a 1D vector takes 2, a 2D vector 3, and a 6D vector 12 instructions per element, and OLAP cubes [Gray et al. DMKD '96] go up to 15 dimensions. An unordered set averages 12 instructions per element, and a BTree takes 100s of instructions. Even for the simple() loop above, data-access instructions (roughly 66%) dominate compute instructions (roughly 9%).
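To make the per-element overhead concrete, here is a small sketch (illustrative only, not from the slides): for a 2D vector, the index arithmetic the core executes per element already outweighs the single add that is the actual work.

#include <cstddef>

// Hypothetical 2D traversal: per element, the core executes the index
// arithmetic (r * cols + c, then scaling by sizeof(int) inside a[...])
// on top of the one add that does the real work.
int sum2d(const int* a, std::size_t rows, std::size_t cols) {
    int total = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += a[r * cols + c];
    return total;
}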

Slide 5: Challenge 2/3 : Memory Level Parallelism. Each element is independent, yet the core must push the same mov / mul / add / ld sequence through its reorder buffer for every array element, so the instruction window fills with address generation and the core can't discover more MLP. Accessing multiple data structures makes this worse.

Slide 6: Challenge 3/3 : Managing Cache Space. (Figure: CPU, L1, L2, MEM hierarchy.) There is not enough space in the cache to hold the data.

Slide 7: Outline.
– Challenges of data-centric applications
– Existing mechanisms to address challenges
– DASX : Data Structure Accelerator
– Benchmarks and Evaluation

Slide 8: Existing Mechanisms – Prefetching.
+ Increases memory level parallelism
– Increases instruction count (software prefetching)
– Best effort only (hardware prefetching)
– Can cause cache thrashing
void simple() { for (int i = 0; i < size; ++i) { prefetch(a + k); prefetch(b + k); prefetch(c + k); a[i] += b[i] + c[i]; } }
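As a concrete version of the slide's software-prefetch pattern, here is a sketch using the GCC/Clang __builtin_prefetch intrinsic; the prefetch distance K is an assumed tuning parameter, and the snippet illustrates the drawbacks listed above: extra instructions every iteration, and benefit only if K happens to match the memory latency.

#include <cstddef>

// Software prefetching: issue prefetches K iterations ahead of the use.
// Three extra instructions per iteration, best effort only.
void simple_pf(int* a, const int* b, const int* c, std::size_t size) {
    const std::size_t K = 64;                      // assumed prefetch distance (elements)
    for (std::size_t i = 0; i < size; ++i) {
        if (i + K < size) {
            __builtin_prefetch(&a[i + K], 1);      // 1 = prefetch for write
            __builtin_prefetch(&b[i + K], 0);
            __builtin_prefetch(&c[i + K], 0);
        }
        a[i] += b[i] + c[i];
    }
}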

Slide 9: Existing Mechanisms – SIMD.
+ Reduces instruction count
– Requires algorithm changes
– Increases power
void simple() { for (int i = 0; i < size; i += k) { SIMD_LOAD(a[i]:a[i+k]); SIMD_LOAD(b[i]:b[i+k]); SIMD_LOAD(c[i]:c[i+k]); SIMD_ADD(a[…], b[…], c[…]); } }
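Spelled out as a sketch with x86 SSE intrinsics (assuming size is a multiple of 4; this is not the paper's code), the same loop shows both the instruction-count win and the restructuring around vector widths that the slide lists as a drawback.

#include <immintrin.h>
#include <cstddef>

// SIMD version of simple(): 4 ints per step, fewer instructions per element,
// but the loop must be rewritten around the vector width.
void simple_simd(int* a, const int* b, const int* c, std::size_t size) {
    for (std::size_t i = 0; i < size; i += 4) {
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&b[i]));
        __m128i vc = _mm_loadu_si128(reinterpret_cast<const __m128i*>(&c[i]));
        __m128i va = _mm_add_epi32(vb, vc);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(&a[i]), va);
    }
}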

Slide 10: Outline.
– Challenges of data-centric applications
– Existing mechanisms to address challenges
– DASX : Data Structure Accelerator
– Benchmarks and Evaluation

Slide 11: Our Approach – DASX. DASX is placed at the shared last-level cache, next to the OOO core and its private caches. It consists of a Collector, a data-structure-specific fetch engine, and Processing Elements (PEs), lightweight pipelines in which all instructions have fixed latency.

Slide 12: DASX – Sample Programmer's API. The original loop
void simple() { for (int i = 0; i < size; ++i) { a[i] = b[i] + c[i]; } }
is offloaded (the BEGIN SIMPLE / END SIMPLE markers delimit the region) in three steps.
Initialize collectors:
coll_a = new coll(ST, &a, INT, size, 0, VEC);
coll_b = new coll(LD, &b, INT, size, 0, VEC);
coll_c = new coll(LD, &c, INT, size, 0, VEC);
Group them so they run in lock-step:
group::add(coll_a, coll_b, coll_c);
Define the kernel and start processing:
auto kfn = [](auto i, auto j) { return i + j; };
start(kfn, size);

Slide 13: DASX – Data Structure Accelerator. One iteration proceeds in seven steps across the Collector, the PEs, the cache and memory: 1) translate keys and fetch elements, 2) allocate, 3) lock the iteration data, 4) fill local storage, 5) compute (SPMD), 6) write back dirty data, 7) unlock the iteration data. The PEs are held (STOP) until the iteration's data is in place and then released (GO) to compute.

Slide 14: DASX – Data Structure Accelerator. The same steps form a decoupled pipeline: the Collector performs the DECOUPLED ACCESS phase (steps 1–3: translate keys and fetch elements, allocate, lock iteration data) and fills the local storage (step 4), while the PEs perform the EXECUTE phase (steps 5–7: compute (SPMD), write back dirty data, unlock iteration data).
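The following is a rough software analogue of that decoupled flow, a sketch for intuition only: DASX performs these steps in hardware, and the ObjStore struct, chunking, and function names here are assumptions.

#include <vector>
#include <cstddef>

// Software caricature of one DASX iteration chunk.
struct ObjStore { std::vector<int> a, b, c; };   // per-iteration local storage

void run_chunk(ObjStore& store, int* a, const int* b, const int* c,
               std::size_t begin, std::size_t end) {
    std::size_t n = end - begin;
    // Decoupled access (steps 1-4): translate keys, allocate, lock, fill.
    store.a.assign(n, 0);
    store.b.assign(n, 0);
    store.c.assign(n, 0);
    for (std::size_t i = 0; i < n; ++i) {
        store.b[i] = b[begin + i];
        store.c[i] = c[begin + i];
    }
    // Execute (step 5): SPMD compute across PE lanes (serialised here).
    for (std::size_t i = 0; i < n; ++i)
        store.a[i] = store.b[i] + store.c[i];
    // Steps 6-7: write back dirty data and unlock the iteration data.
    for (std::size_t i = 0; i < n; ++i)
        a[begin + i] = store.a[i];
}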

Slide 15: Challenges Recap.
– Challenge 1: Reduce instruction overhead
– Challenge 2: Increase memory level parallelism
– Challenge 3: Better cache management

Slide 16: DASX – Processing Elements. Eight lanes (LANE 1 … LANE 8) share a 1 KB instruction memory, and each lane has its own 32-entry register file. Features: 3-stage pipeline; Single Program Multiple Data; each PE executes one iteration; no address generation; data is referenced using "keys". This "reduces instruction overhead" by using the SPMD model and removing address generation.
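A software caricature of the SPMD model described above (the eight-lane count comes from the slide; everything else, including the local-store arrays, is an assumption): each lane runs the same kernel on its own key, and the kernel body only indexes the small local store by key rather than generating full memory addresses.

#include <cstddef>

constexpr std::size_t kLanes = 8;   // PE lanes, per the slide

// Each lane executes one iteration of the same kernel on its own key.
// Operands come from local storage already filled by the Collector.
void pe_step(int* a_local, const int* b_local, const int* c_local,
             std::size_t base_key) {
    for (std::size_t lane = 0; lane < kLanes; ++lane) {   // in hardware: in parallel
        std::size_t key = base_key + lane;
        a_local[key] = b_local[key] + c_local[key];
    }
}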

Slide 17: DASX – Key Interface. Vector keys: LD Key == LD Iter * Size + Offset. Hash table keys: LD KEY. BTree keys: key/data pairs. In each case the key interface removes address-generation overhead from the PEs.
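For the vector case, the mapping on the slide is simple arithmetic; here is a minimal sketch of what the Collector computes so that the PEs never have to (the element size and the base-plus-scaled-key address form are assumptions).

#include <cstddef>
#include <cstdint>

// Vector key, per the slide: key = iteration * size + offset.
std::size_t vector_key(std::size_t iter, std::size_t size, std::size_t offset) {
    return iter * size + offset;
}

// The Collector, not the PE, turns a key into a load address.
std::uintptr_t key_to_addr(std::uintptr_t base, std::size_t key, std::size_t elem_size) {
    return base + key * elem_size;   // e.g. key 7 of an int vector -> base + 28
}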

Slide 18: DASX – Collector. The Collector is a data-structure fetch engine: it specializes the traversal and operates on user-defined elements.
Data structure / Collector HW op:
– Vector: address / stride calculation (ADD, CMP)
– Hash table: index calculation + bucket traversal (integer ALU)
– BTree: traversal (CMOV, ADD, CMP)
Tasks: 1) prefetch, 2) manage cache space.
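As a software sketch of the "index calculation + bucket traversal" that the hash-table Collector specializes (the chained-bucket layout here is an assumption, not the paper's), the work is exactly the integer-ALU compare-and-chase the table above describes.

#include <cstddef>

struct Node { int key; int value; Node* next; };

// Hash-table collection in software: compute the bucket index, then walk the
// chain with compares until the key matches; integer-ALU work only.
const int* ht_collect(Node* const* buckets, std::size_t nbuckets, int key) {
    const Node* n = buckets[static_cast<std::size_t>(key) % nbuckets];
    while (n != nullptr) {
        if (n->key == key) return &n->value;
        n = n->next;
    }
    return nullptr;
}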

Slide 19: Collector Task 1 : Prefetch. The Collector translates keys and fetches elements (step 1) and allocates space (step 2), running asynchronously with the compute. It reduces address-generation cost, accesses data at the granularity of a data-structure element, and enhances memory level parallelism.

Slide 20: Collector Task 2 : Manage Cache Space. The Collector locks the iteration data (step 3), fills the local storage (step 4), writes back dirty data (step 6), and unlocks the iteration data (step 7). It manages cache fill and replacement: the OBJ-Store is bulk-filled before an iteration, with per-element refill from the cache to the OBJ-Store.

Slide 21: Outline.
– Challenges of data-centric applications
– Existing mechanisms to address challenges
– DASX : Data Structure Accelerator
– Benchmarks and Evaluation

Slide 22: Benchmarks. Recommender, Text Search, Hash Table, OLAP Cubing, BTree, Black-Scholes.

Slide 23: Evaluation – Setup. DASX (8 PEs, 1 KB local storage) is compared against an out-of-order core with a 64 KB L1 and against a multithreaded configuration of in-order cores with 32 KB L1s (MT, 8 threads). Shared parameters: LLC – 4 MB, 16-way, NUCA; DRAM – DDR2-400, 16 GB, 4 channels.

Slide 24: Evaluation – Performance Breakdown. Execution time for Data Cubing (memory bound) and Black-Scholes (compute bound), normalized to the OOO core (lower is better), as DASX features are added: one in-order core at the LLC, + Collector (the data-structure engine, removing address generation), + local store, × 8 lanes; the 8-thread MT configuration is shown for comparison.

Slide 25: Evaluation – Performance. (Figure: performance of DASX compared against MT (8) across the benchmarks.)

Slide 26: Evaluation – Energy vs Performance. (Figure: energy versus execution cycles for Data-Cubing, comparing OOO, MT-8, MT-16, MT-32, DASX-4 and DASX-8; the best designs sit toward the low-energy, low-cycle corner.)

Slide 27: Summary.
– Highlighted the challenges of data-centric workloads
– Demonstrated the effectiveness of using data-structure-specific information
– A data-structure-aware hardware accelerator achieves a 4.4X performance improvement

Slide 28: Q & A.

Slide 29: Backup.
1. Percentage of data structure instructions – slide 30
2. Why collector groups? – slide 31
3. Energy breakdown – slide 32
4. OBJ-Store details – slide 33
5. Address translation for keys – slide 34

Slide 30: Percentage of data structure instructions.

Slide 31: Why collector groups?

Slide 32: Evaluation – Energy Reduction. (Figure: energy reduction, with annotations for streaming and cache-thrashing behaviour.)

Slide 33: DASX – OBJ-Store. Reduces energy by filtering accesses to the LLC. Organization: a decoupled sector cache (1 KB) that minimizes tag overhead for vectors and adapts to spatial locality (e.g. struct fields). Each entry holds a KEY, valid/invalid bits, an LLC tag pointer, and the data; PEs issue loads and stores against it, and dirty data is written back.
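A toy model of the decoupled-sector idea (the sector count and sizes are assumptions; the field names loosely follow the entry sketched on the slide): one key tag covers several sectors, and per-sector valid and dirty bits let the store hold only the parts of an object that are actually touched.

#include <array>
#include <cstddef>
#include <cstdint>

// Toy OBJ-Store entry: one tag per key, amortized over several data sectors.
struct ObjStoreEntry {
    std::uint64_t key = 0;
    bool tag_valid = false;
    std::array<bool, 4> sector_valid{};                    // assumed 4 sectors per entry
    std::array<bool, 4> sector_dirty{};
    std::array<std::array<unsigned char, 16>, 4> data{};   // assumed 16 B sectors
};

// Hit only if the tag matches and the requested sector is present; otherwise
// the Collector refills that sector from the LLC.
bool obj_store_hit(const ObjStoreEntry& e, std::uint64_t key, std::size_t sector) {
    return e.tag_valid && e.key == key && e.sector_valid[sector];
}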

Slide 34: DASX – Address Translation for Keys. Reduces energy overhead: keys are coalesced by the collector into cache lines, so only one translation is needed per line rather than per access, and no reverse translation is needed thanks to a back pointer (see the OBJ-Store slide).
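A back-of-the-envelope sketch of the saving (the 64-byte line size is an assumption; the element counts follow from it): coalescing keys into cache lines means sixteen consecutive 4-byte elements share one translation instead of needing sixteen.

#include <cstddef>

// Translations needed when keys are coalesced into cache lines:
// 16 four-byte elements per 64 B line -> one translation per 16 accesses.
std::size_t translations_needed(std::size_t num_elems, std::size_t elem_size,
                                std::size_t line_size = 64) {
    std::size_t elems_per_line = line_size / elem_size;
    return (num_elems + elems_per_line - 1) / elems_per_line;   // ceiling divide
}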