A Case for Richer Cross-layer Abstractions: Bridging the Semantic Gap with Expressive Memory Nandita Vijaykumar Abhilasha Jain, Diptesh Majumdar, Kevin.

Slides:

Advertisements

Similar presentations

An Overview Of Virtual Machine Architectures Ross Rosemark.

Advertisements

Pooja ROY, Manmohan MANOHARAN, Weng Fai WONG National University of Singapore ESWEEK (CASES) October 2014 EnVM : Virtual Memory Design for New Memory Architectures.

Jaewoong Sim Alaa R. Alameldeen Zeshan Chishti Chris Wilkerson Hyesoon Kim MICRO-47 | December 2014.

Thread Criticality Predictors for Dynamic Performance, Power, and Resource Management in Chip Multiprocessors Abhishek Bhattacharjee Margaret Martonosi.

Data Marshaling for Multi-Core Architectures M. Aater Suleman Onur Mutlu Jose A. Joao Khubaib Yale N. Patt.

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri , Yoongu Kim,

Helper Threads via Virtual Multithreading on an experimental Itanium 2 processor platform. Perry H Wang et. Al.

1 Improving Hash Join Performance through Prefetching _________________________________________________By SHIMIN CHEN Intel Research Pittsburgh ANASTASSIA.

Address-Value Delta (AVD) Prediction Onur Mutlu Hyesoon Kim Yale N. Patt.

Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt.

The Memory Behavior of Data Structures Kartik K. Agaram, Stephen W. Keckler, Calvin Lin, Kathryn McKinley Department of Computer Sciences The University.

Adaptive Cache Compression for High-Performance Processors Alaa R. Alameldeen and David A.Wood Computer Sciences Department, University of Wisconsin- Madison.

Mitigating Prefetcher-Caused Pollution Using Informed Caching Policies for Prefetched Blocks Vivek Seshadri Samihan Yedkar ∙ Hongyi Xin ∙ Onur Mutlu Phillip.

1 Coordinated Control of Multiple Prefetchers in Multi-Core Systems Eiman Ebrahimi * Onur Mutlu ‡ Chang Joo Lee * Yale N. Patt * * HPS Research Group The.

NVM Programming Model. 2 Emerging Persistent Memory Technologies Phase change memory Heat changes memory cells between crystalline and amorphous states.

Computer Architecture ECE 4801 Berk Sunar Erkay Savas.

Korea Univ B-Fetch: Branch Prediction Directed Prefetching for In-Order Processors 컴퓨터 · 전파통신공학과 최병준 1 Computer Engineering and Systems Group.

Operating System Support for Virtual Machines Samuel T. King, George W. Dunlap,Peter M.Chen Presented By, Rajesh 1 References [1] Virtual Machines: Supporting.

CSC 7080 Graduate Computer Architecture Lec 12 – Advanced Memory Hierarchy 2 Dr. Khalaf Notes adapted from: David Patterson Electrical Engineering and.

Page Overlays An Enhanced Virtual Memory Framework to Enable Fine-grained Memory Management Vivek Seshadri Gennady Pekhimenko, Olatunji Ruwase, Onur Mutlu,

Adaptive Cache Partitioning on a Composite Core Jiecao Yu, Andrew Lukefahr, Shruti Padmanabha, Reetuparna Das, Scott Mahlke Computer Engineering Lab University.

Memory Hierarchy Adaptivity An Architectural Perspective Alex Veidenbaum AMRM Project sponsored by DARPA/ITO.

Memory Management memory hierarchy programs exhibit locality of reference - non-uniform reference patterns temporal locality - a program that references.

RFVP: Rollback-Free Value Prediction with Safe to Approximate Loads Amir Yazdanbakhsh, Bradley Thwaites, Hadi Esmaeilzadeh Gennady Pekhimenko, Onur Mutlu,

Computer Architecture Lecture 12: Virtual Memory I

Zorua: A Holistic Approach to Resource Virtualization in GPUs

Transparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Eiman Ebrahimi, Gwangsun Kim,

CE 454 Computer Architecture

ECE232: Hardware Organization and Design

Memory COMPUTER ARCHITECTURE

Lecture 3: MIPS Instruction Set

Transparent Offloading and Mapping (TOM) Enabling Programmer-Transparent Near-Data Processing in GPU Systems Kevin Hsieh Eiman Ebrahimi, Gwangsun Kim,

Adaptive Cache Partitioning on a Composite Core

GPU Memory Details Martin Kruliš by Martin Kruliš (v1.1)

From Address Translation to Demand Paging

Understanding Latency Variation in Modern DRAM Chips Experimental Characterization, Analysis, and Optimization Kevin Chang Abhijith Kashyap, Hasan Hassan,

Stefan Kaestle, Reto Achermann, Timothy Roscoe, Tim Harris ATC’15

Andes Technology Innovate SOC ProcessorsTM

Morgan Kaufmann Publishers Large and Fast: Exploiting Memory Hierarchy

CS510 Operating System Foundations

RFVP: Rollback-Free Value Prediction with Safe to Approximate Loads

Many-core Software Development Platforms

Gwangsun Kim Niladrish Chatterjee Arm, Inc. NVIDIA Mike O’Connor

Energy-Efficient Address Translation

Eiman Ebrahimi, Kevin Hsieh, Phillip B. Gibbons, Onur Mutlu

Eiman Ebrahimi, Kevin Hsieh, Phillip B. Gibbons, Onur Mutlu

Lecture 14 Virtual Memory and the Alpha Memory Hierarchy

From Address Translation to Demand Paging

Accelerating Dependent Cache Misses with an Enhanced Memory Controller

Milad Hashemi, Onur Mutlu, Yale N. Patt

COSC121: Computer Systems

Address-Value Delta (AVD) Prediction

ECE Dept., University of Toronto

CARP: Compression-Aware Replacement Policies

A Case for Richer Cross-layer Abstractions: Bridging the Semantic Gap with Expressive Memory Nandita Vijaykumar Abhilasha Jain, Diptesh Majumdar, Kevin.

Hyesoon Kim Onur Mutlu Jared Stark* Yale N. Patt

Henk Corporaal TUEindhoven 2011

Manjunath Shevgoor, Rajeev Balasubramonian, University of Utah

José A. Joao* Onur Mutlu‡ Yale N. Patt*

Summary 3 Cs: Compulsory, Capacity, Conflict Misses Reducing Miss Rate

Virtual Memory Prof. Eric Rotenberg

What is Computer Architecture?

Computer Architecture

Virtual Memory Use main memory as a “cache” for secondary (disk) storage Managed jointly by CPU hardware and the operating system (OS) Programs share main.

Lois Orosa, Rodolfo Azevedo and Onur Mutlu

A Case for Richer Cross-layer Abstractions:

Paging Andrew Whitaker CSE451.

Presented by Florian Ettinger

Stream-based Memory Specialization for General Purpose Processors

DSPatch: Dual Spatial pattern prefetcher

Presentation transcript:

A Case for Richer Cross-layer Abstractions: Bridging the Semantic Gap with Expressive Memory Nandita Vijaykumar Abhilasha Jain, Diptesh Majumdar, Kevin Hsieh, Gennady Pekhimenko Eiman Ebrahimi, Nastaran Hajinazar, Phillip B. Gibbons, Onur Mutlu

SW Optimization Functionality Performance? HW Optimization Software Performance? HW-SW Interfaces (ISA & Virtual Memory) Hardware HW Optimization

Higher-level information is not visible to HW Code Optimizations Access Patterns Data Structures Float Integer Software Data Type Char Hardware Instructions Memory Addresses 100011111… 101010011…

Performance Functionality Software Higher-level Program Semantics Expressive Memory “XMem” ISA Virtual Memory Hardware

Outline Why do we need a richer cross-layer abstraction? Designing Expressive Memory (XMem) Evaluation (with a focus on one use case)

Performance optimization in hardware

What we do today: We design hardware to infer and predict program behavior to optimize for performance Caches Memory Controllers Prefetchers

With a richer abstraction: SW can provide program information can significantly help hardware Float Integer Data Type/Layout Char Data Structures Access Patterns Software Hardware Data Placement Prefetcher Data Compression

Benefits of a richer abstraction: Optimizations: Cache Management Data Placement in DRAM Data Compression Approximation DRAM Cache Management NVM Management NUCA/NUMA Optimizations …. Express: Data structures Access semantics Data types Working set Reuse Access frequency …..

Optimizing for performance in software

What we do today: Use platform-specific optimizations to tune SW Example: SW-based cache optimizations SW optimizations make assumptions regarding HW resources Tune working set Significant portability and programmability challenges 8MB cache 6MB cache

With a richer interface: HW can alleviate burden on SW SW only expresses program information System/HW handles optimizing for specific system details e.g. exact cache size, memory organization, NUMA organization Working set Reuse

Benefits of a richer abstraction: Express: Data structures Access semantics Data types Working set Reuse Access frequency ….. Optimizations: Cache Management Data Placement in DRAM Data Compression Approximation DRAM Cache Management NVM Management NUCA/NUMA Optimizations …. Performance HW optimizations SW optimizations Programmability Portability

SW information in HW is proven to be useful, but.. Lots of research on hints/directives to hardware and HW-SW co-designs Cache hints, Prefetcher hints, Annotations for data placement, … Downsides: Not scalable – can’t add new instructions for each optimization Not portable – make assumptions about underlying resources These downsides significantly limit adoption of otherwise useful approaches

Design a rich, general, unifying abstraction Our Goal: Design a rich, general, unifying abstraction between HW and SW for performance

Outline Why do we need a richer cross-layer abstraction? Designing Expressive Memory (XMem) Evaluation (with a focus on one use case)

Key design goals Supplemental and hint-based only General and extensible Architecture-agnostic Low overhead

An Overview of Expressive Memory Application Interface to Application System to summarize and save program information Expressive Memory Interface to System/Architecture Prefetcher Memory Controller OS Caches . . . DRAM Cache

Challenge 1: Generality and architecture-agnosticism High-level Application Data structures, data type, access patterns, … Interface to Application Expressive Memory Architecture-specific, Low-level What to prefetch? Which data to cache? Interface to System/Architecture Prefetcher Memory Controller OS Caches . . . DRAM Cache

Challenge 2: Tracking changing program properties with low overhead Program behavior keeps changing: Data structures are accessed differently in different phases New data structures are allocated We want to convey lots of information! Application Expressive Memory Dynamic interface that continually tracks program behavior System/Architecture Potentially very high storage/communication overhead at run time

A new HW-SW abstraction Application Software Atom:1 Memory Region Atom: 2 Memory Region Atom: 3 Memory Region Hardware System/Architecture

The Atom: A closer look Atom: X Atom Unique Atom ID A hardware-software abstraction to convey program semantics Data Value Properties: INT, FLOAT, CHAR,… COMPRESSIBLE, APPROXIMABLE Access Properties: Read-Write Characteristics Access Pattern Access Intensity (“Hotness”) 3) Data Locality: Working Set Reuse 4) …. Atom: X Program Attributes Mapping State Atom Valid/invalid at current execution point

The three Atom operators Program Attributes Mapping State 2) MAP/UNMAP 1) CREATE 3) ACTIVATE/DEACTIVATE

Using Atoms to express program semantics A = malloc ( size ); Atom1 = CreateAtom( “INT”, “Regular”, …); MapAtom( Atom1, A, size ); ActivateAtom(Atom1); …. Atom2 = CreateAtom( “INT”, “Irregular”, …); UnMapAtom( Atom1, A, size); MapAtom( Atom2, A, size ); ActivateAtom(Atom2); Memory Region A (3D Array) Atom: 1 Attributes cannot be changed Atom: 2

Implementing the Atom Compile Time (CREATE) Load Time (CREATE) Run Time (MAP and ACTIVATE)

High overhead operations are handled at compile time Compile Time (CREATE) A = malloc ( size ); Atom1 = CreateAtom( “INT”, “Regular”, …); MapAtom( Atom1, A, size ); ActivateAtom(Atom1); …. Atom2 = CreateAtom( “INT”, “Irregular”, …); UnMapAtom( Atom1, A, size); MapAtom( Atom2, A, size ); ActivateAtom(Atom2); Atom: 1 Program Attributes 1 High overhead operations are handled at compile time Atom: 2 Program Attributes 2 Compiler Compiler Atom Segment in Object File

Load Time (CREATE) OS Atom Segment in Object File Architecture-agnostic, general Architecture-specific, low-level OS Attribute Translator Atom ID Attributes … Cache Controllers Memory Controller Prefetcher

Run Time (MAP and ACTIVATE) A = malloc ( size ); Atom1 = CreateAtom( “INT”, “Regular”, …); MapAtom( Atom1, A, size ); ActivateAtom(Atom1); …. Atom2 = CreateAtom( “INT”, “Irregular”, …); UnMapAtom( Atom1, A, size); MapAtom( Atom2, A, size ); ActivateAtom(Atom2); Application Design challenge: How to do this with low overhead? Expressive Memory New insts in ISA OS Caches Memory Controller . . . Prefetcher DRAM Cache

Outline Why do we need a richer cross-layer abstraction? Designing Expressive Memory (XMem) Evaluation (with a focus on one use case)

A fresh approach to traditional optimizations Express: Data structures Access semantics Data types Working set Reuse Access frequency ….. Optimizations: Cache Management Data Placement in DRAM Data Compression Approximation DRAM Cache Management NVM Management NUCA/NUMA Optimizations …. Performance HW optimizations SW optimizations Programmability Portability

Use Case 1: Improving portability of SW cache optimization SW-based cache optimizations try to fit the working set in the cache Examples: hash-join partitioning, cache tiling, stencil pipelining machine learning, computer vision, numerical analysis, etc.) A B C

Methodology (Use Case 1) Evaluation Infrastructure: Zsim, DRAMSim2 Workloads: Polybench System Parameters: Core: 3.6 GHz, Westmere-like OOO, 4-wide issue, 128-entry ROB L1 Cache: 32KB Inst and 32KB Data, 8 ways, 4 cycles, LRU L2 Cache: 128KB private per core, 8 ways, 8 cycles, DRRIP L3 Cache: 8MB (1MB/core, partitioned), 16 ways, 27 cycles, DRRIP Prefetcher: Multi-stride prefetcher at L3, 16 strides Memory: DRAM DDR3-1066, 2 channels, 1 rank/channel, 8 banks/rank, 17GB/s (2.1GB/s/core), FR-FCFS, open-row policy

Correctly sizing the working set is critical Gemm Cache Thrashing Execution Time 71% Bigger is better (more reuse) Tile Size (kB) Optimal tile size depends on available cache space: This causes portability and programmability challenges

Leveraging Expressive Memory for cache tiling SW expresses program-level semantic information Map tile to an atom, specifying high reuse and tile size HW manages cache space to optimize for performance If tile size < available cache space: default policy If tile size > available cache space: pin a part of the tile, prefetch the rest (avoid thrashing)

Cache tiling with Expressive Memory Gemm Execution Time 47% Tile Size (kB) Knowledge of locality semantics enables more intelligent cache management Improves portability and programmability

Results across more workloads Normalized Exec. Time Tile Size in kB Baseline Expressive Memory

More in the paper Use Case 2: Leveraging data structure semantics to enable more intelligent OS-based page placement More details on the implementation Overhead analysis Other use cases of XMem

Key program information to aid system/HW components in optimization Conclusion General and architecture-agnostic interface to SW to express program semantics Software Higher-level Program Semantics Expressive Memory “XMem” ISA Virtual Memory Key program information to aid system/HW components in optimization Hardware

A Case for Richer Cross-layer Abstractions: Bridging the Semantic Gap with Expressive Memory Nandita Vijaykumar Abhilasha Jain, Diptesh Majumdar, Kevin Hsieh, Gennady Pekhimenko Eiman Ebrahimi, Nastaran Hajinazar, Phillip B. Gibbons, Onur Mutlu