A new perspective on processing-in-memory architecture design

Dong Ping Zhang, Nuwan Jayasena, Alexander Lyashevsky, Joe Greathouse, Mitesh Meswani, Mark Nutter, Mike Ignatowski
AMD Research

These data are submitted with limited rights under Government Contract No. DE-AC52-8MA27344 and subcontract B from Advanced Micro Devices, Inc. on behalf of AMD Advanced Research LLC. These data may be reproduced and used by the Department of Energy on a need-to-know basis, with the express limitation that they will not, without written permission of AMD Advanced Research LLC, be used for purposes of manufacture nor disclosed outside the Government. This notice shall be marked on any reproduction of these data, in whole or in part.

PROCESSING-IN-MEMORY?

HIGHLIGHT
- The memory system is a key limiter
  – At 4 TB/s, the vast majority of node energy could be consumed by the memory system
- Prior PIM research was constrained by
  – Implementation technology
  – Non-traditional programming models
- Our focus
  – 3D die stacking
  – Use the base logic die(s) in the memory stack: general-purpose processors, support for familiar programming models, and arbitrary programs rather than a fixed set of special operations
[Figure: memory stack with a logic die containing the PIM]

PROCESSING-IN-MEMORY (PIM) --- OVERVIEW (I)
Moving compute close to memory promises significant gains: memory is a key limiter of both performance and power, and exascale goals call for 4 TB/s per node at <5 pJ/bit.
Prior research:
- Integration of caches and computation: "A logic-in-memory computer" (1970), at a time when no large-scale integration was possible.
- Logic in DRAM processes: in-memory processors with reduced performance, or highly specialized for a limited set of operations; reduced DRAM capacity due to the presence of compute logic.
- Embedded DRAM in logic processes: cannot cost-effectively accommodate sufficient memory capacity; reduced density of the embedded memory.

PROCESSING-IN-MEMORY (PIM) --- OVERVIEW (II)
New opportunity: a logic die stacked with memory. The logic die is needed anyway for signal redistribution and integrity, which creates the potential for non-trivial compute.
Key benefits:
- Reduced bandwidth bottlenecks
- Improved energy efficiency
- Increased compute for a fixed interposer area
- The processor can be optimized for a high bandwidth-to-compute ratio
Challenges:
- Programming models and interfaces
- Architectural tradeoffs
- Application refactoring

OUTLINE
- PIM architecture baseline
- API specification
- Emulator and performance models
- Application studies

NODE HARDWARE ORGANIZATION
- Single level of PIM-attached memory stacks
- Host has direct access to all memory
  – Non-PIM-enabled apps still work
- Unified virtual address space
  – Shared page tables between host and PIMs
- Low-bandwidth inter-PIM interconnect
[Figure: host processor on an interposer or board, connected to memory stacks, each with memory dies atop a logic die containing a PIM]

PIM API OVERVIEW
- Current focus is on a single PIM-enabled node
  – Current PIM hardware baseline is a general-purpose processor
- Key goals of the API
  – Facilitate data layout control
  – Dispatch compute to PIMs using standard parallel abstractions
- A "convenient" level of abstraction for early PIM evaluations
  – Aimed at PIM feasibility studies
  – Annotation of data and compute for PIM with reasonable programmer effort
- Key API features: discover devices, query device characteristics, manage locality, dispatch compute

PIM PTHREAD EXAMPLE: PARALLEL PREFIX SUM

    ...
    /* Discover PIM devices and query their characteristics. */
    list_of_pims = malloc(max_pims * sizeof(pim_device_id));
    failure = pim_get_device_ids(PIM_CLASS_0, max_pims, list_of_pims, &num_pims);
    for (i = 0; i < num_pims; i++) {
        failure = pim_get_device_info(list_of_pims[i], PIM_CPU_CORES,
                                      needed_size, device_info, NULL);
        ...
    }
    ...
    /* Allocate each chunk in the memory stack of the PIM that will process it. */
    for (i = 0; i < num_pims; i++) {
        parallel_input_array[i] = pim_malloc(sizeof(uint32_t) * chunk_size,
            list_of_pims[i], PIM_MEM_DEFAULT_FLAGS, PIM_PLATFORM_PTHREAD_CPU);
        parallel_output_array[i] = pim_malloc(sizeof(uint64_t) * chunk_size,
            list_of_pims[i], PIM_MEM_DEFAULT_FLAGS, PIM_PLATFORM_PTHREAD_CPU);
    }
    ...
    /* Spawn one pthread-style task per PIM. */
    for (i = 0; i < num_pims; i++) {
        pim_args[PTHREAD_ARG_THREAD] = &(pim_threads[i]);   /* pthread_t */
        arg_size[PTHREAD_ARG_THREAD] = sizeof(pthread_t);
        pim_args[PTHREAD_ARG_ATTR] = NULL;                  /* pthread_attr_t */
        arg_size[PTHREAD_ARG_ATTR] = sizeof(pthread_attr_t);
        pim_args[PTHREAD_ARG_INPUT] = &(thread_input[i]);   /* void * for thread input */
        arg_size[PTHREAD_ARG_INPUT] = sizeof(void *);

        pim_function.func_ptr = parallel_prefix_sum;
        spawn_error = pim_spawn(pim_function, pim_args, arg_size,
                                NUM_PTHREAD_ARGUMENTS, list_of_pims[i],
                                PIM_PLATFORM_PTHREAD_CPU);
    }
    for (i = 0; i < num_pims; i++) {
        pthread_join(pim_threads[i], NULL);
    }
    ...
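The spawned worker itself is not shown on the slide. For context, here is a minimal sketch of what a pthread-style entry point for the per-chunk scan could look like; the argument struct, its field names, and the inclusive-scan strategy are illustrative assumptions, not part of the published API:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical per-thread argument block; the layout that the slide's
     * thread_input[i] actually points to is not shown. */
    typedef struct {
        const uint32_t *in;   /* input chunk allocated with pim_malloc */
        uint64_t       *out;  /* running sums for this chunk */
        size_t          n;    /* chunk_size */
    } prefix_args_t;

    /* Entry point passed to pim_spawn via pim_function.func_ptr. Each PIM
     * scans only its own chunk, which lives in its local memory stack. */
    void *parallel_prefix_sum(void *arg)
    {
        prefix_args_t *a = (prefix_args_t *)arg;
        uint64_t sum = 0;
        for (size_t i = 0; i < a->n; i++) {
            sum += a->in[i];
            a->out[i] = sum;   /* inclusive prefix sum within the chunk */
        }
        return NULL;
    }

A second pass is still needed to propagate each chunk's total into the chunks that follow it, which is consistent with the two parallel phases ("Parallel 1" and "Parallel 2") in the timing results on the following slides.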

PIM EMULATION AND PERFORMANCE MODEL
- Phase 1 (emulation): native execution on commodity hardware (CPU/GPU/APU) under Linux; the user application runs on the PIM software stack while an event trace and performance stats are captured.
- Phase 2 (performance model): a trace post-processor combines the trace with an emulated machine description and performance models to predict the overall performance of host tasks and PIM tasks on future memories and processors.
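The talk does not give the model's equations. As a rough illustration of what the phase-2 post-processing could do, the toy model below splits each traced task into compute and memory components and rescales them for the emulated machine; the linear split, the function names, and the constants are assumptions for illustration only:

    #include <stdio.h>

    /* Toy linear model: a traced task's wall time is split into a compute
     * part (scaled by the host/PIM frequency ratio) and a memory part
     * (scaled by a latency factor). The split itself is an assumption. */
    typedef struct {
        double compute_s;   /* seconds attributed to computation */
        double memory_s;    /* seconds attributed to memory accesses */
    } task_trace_t;

    double predict_time(task_trace_t t, double freq_ratio, double mem_factor)
    {
        return t.compute_s * freq_ratio + t.memory_s * mem_factor;
    }

    int main(void)
    {
        /* A task measured on 4 GHz commodity hardware, replayed for a
         * 2 GHz PIM with a 30% memory-latency reduction. */
        task_trace_t t = { .compute_s = 0.10, .memory_s = 0.20 };
        printf("predicted: %.3f s\n", predict_time(t, 4.0 / 2.0, 0.7));
        return 0;
    }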

PARALLEL PREFIX SUM – HOST-ONLY VS HOST+PIM
[Timeline figure: initialization on one host CPU, then two parallel phases ("Parallel 1" and "Parallel 2") across four workers]
- Host-only (host frequency 4 GHz, latency same as present): initialization 0.466 s, followed by the two parallel phases (roughly 0.28–0.31 s per thread in the first phase). Total runtime: 0.943 s.
- Host+PIM (host 4 GHz, PIMs 2 GHz, host latency as today, PIM latency reduced by 30%): initialization 0.466 s on the host, parallel phases of roughly 0.22 s on PIM0–PIM3. Total runtime: 0.820 s, i.e. 15% faster.
- The PIM computation benefits from the reduced memory latency.

PARALLEL PREFIX SUM – HOST-ONLY VS HOST+PIM (CONTINUED)
[Timeline figure: same phases as the previous slide, under two PIM-latency assumptions]
- Host+PIM with PIM latency equal to host latency (host 4 GHz, PIMs 2 GHz): total runtime 0.958 s, 2% slower than host-only.
- Host+PIM with PIM latency reduced by 50%: total runtime 0.728 s, 30% faster than host-only.

WAXPBY – HOST-ONLY VS HOST+PIM
[Timeline figure: initialization on CPU0, then the waxpby kernel on the host GPU or on PIM0–PIM3]
- Host-only (host GPU at 1 GHz, 192 CUs, host DRAM bandwidth 2 TB/s across 4 stacks): waxpby kernel time 0.47 ms.
- Host+PIM (PIM GPUs at 800 MHz, 16 CUs per PIM, 1 TB/s per stack): 0.24 ms.
- Host+PIM with 2 TB/s per stack: 0.12 ms.
- Kernel performance scales with memory bandwidth.
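For reference, waxpby (as used in HPCG-style benchmarks) computes w = alpha*x + beta*y elementwise. It streams three vectors while doing one multiply-add per element, which is why the kernel is bandwidth-bound and its time on the slide halves when per-stack bandwidth doubles. A minimal C version, as a sketch rather than the code actually benchmarked:

    #include <stddef.h>

    /* WAXPBY: w = alpha*x + beta*y, elementwise. Three streamed vectors,
     * one multiply-add per element: memory-bandwidth bound. */
    void waxpby(size_t n, double alpha, const double *x,
                double beta, const double *y, double *w)
    {
        for (size_t i = 0; i < n; i++)
            w[i] = alpha * x[i] + beta * y[i];
    }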

SUMMARY AND FUTURE WORK
- PIM architecture baseline
- API specification
- Emulator and performance models
- Application studies
- Future work
  – Design-space study
  – Evaluation of the performance models
  – Further work on the API, execution model, applications, etc.

ACKNOWLEDGEMENTS: Lee Howes, Gabe Loh
QUESTIONS? FURTHER DISCUSSIONS: