High Performance Computing (HIPC)


SDRM: Simultaneous Determination of Regions and Function-to-Region Mapping for Scratchpad Memories
Amit Pabalkar, Aviral Shrivastava, Arun Kannan and Jongeun Lee
Compiler and Micro-architecture Lab, School of Computing and Informatics, Arizona State University
High Performance Computing (HIPC), December 2008
http://www.public.asu.edu/~ashriva6

Agenda
- Motivation
- SPM Advantage
- SPM Challenges
- Previous Approach
- Code Mapping Technique
- Results
- Continuing Effort

Motivation - The Power Trend
- Within the same process technology, a new processor design with 1.5x to 1.7x the performance consumes 2x to 3x the die area [1] and 2x to 2.5x the power [2].
- For a given process technology with a fixed transistor budget, performance/power and performance/unit-area scale with the number of cores.
- Caches consume around 44% of total processor power.
- Cache architectures do not scale to many-core processors because cache-coherence traffic degrades performance.

From "The Landscape of Parallel Computing Research: A View From Berkeley" (Section 4.1.2, "Will we really fit 1000s of cores on one economical chip?"): the reduction in size and complexity of the basic processor building block means that many more cores can be economically implemented on a single die, and this number can double with each generation of silicon. The "manycore" progression might well be 128, 256, 512, ... cores instead of the current "multicore" plan of 2, 4, 8, ... cores over the same semiconductor process generations. There is strong empirical evidence that 1000 cores on a die are achievable at 30 nm (and as Intel has taped out a 45 nm chip, 30 nm is not distant). Cisco today embeds in its routers a network processor with 188 cores implemented in 130 nm technology [Eatherton 2005]: an 18 mm x 18 mm chip that dissipates 35 W at a 250 MHz clock rate and produces an aggregate 50 billion instructions per second. The individual cores are 5-stage Tensilica processors with very small caches, each 0.5 mm^2; about a third of the die is DRAM and special-purpose functions. Simply following Moore's-Law scaling would give 752 processors at 45 nm and 1504 at 30 nm. Power may not scale down with size, but there is ample room before reaching the 150 W limit of desktop or server applications.

Scratchpad Memory (SPM)
- High-speed SRAM internal memory for the CPU.
- Sits at the same level as the L1 cache in the memory hierarchy.
- Directly mapped into the processor's address space.
- Used for temporary storage of data and code in progress, giving single-cycle access by the CPU.

The SPM Advantage
[Figure: cache vs. SPM structure. A cache needs a tag array, data array, tag comparators and muxes plus an address decoder; an SPM needs only a data array and an address decoder.]
- About 40% less energy than a cache, due to the absence of tag arrays, comparators and muxes.
- About 34% less area than a cache of the same size.
- Simple hardware design (only a memory array and address-decoding circuitry).
- Faster access than a physically indexed and tagged cache.

Challenges in using SPMs
- The application has to explicitly manage SPM contents; in cache-based architectures, code/data mapping is transparent.
- Mapping challenges:
  - Partitioning the available SPM resource among different data.
  - Identifying data that will benefit from placement in the SPM.
  - Minimizing data movement between the SPM and external memory.
  - Optimal data allocation is an NP-complete problem.
- Binary compatibility: an application is compiled for a specific SPM size.
- Sharing the SPM in a multi-tasking environment.
- Completely automated solutions (read: compiler solutions) are needed.

Using SPM
Code is predictable due to spatial and temporal locality.

Original Code:

    int global;
    FUNC2() {
        int a, b;
        global = a + b;
    }
    FUNC1() {
        FUNC2();
    }

SPM-Aware Code:

    int global;
    FUNC2() {
        int a, b;
        DSPM.fetch.dma(global);
        global = a + b;
        DSPM.writeback.dma(global);
    }
    FUNC1() {
        ISPM.overlay(FUNC2);
        FUNC2();
    }

Why map code rather than data:
- Its size remains unchanged.
- The lifetime of a function extends from the start of the program to its last reference.
- Code is read-only, so there is no need to write it back to main memory.
- Greater scope of energy reduction by mapping code to the SPM.

Previous Work
- Static techniques [3,4]: the contents of the SPM do not change during program execution, leaving less scope for energy reduction.
- Profiling is widely used but has drawbacks [3,4,5,6,7,8]:
  - The profile may depend heavily on the input data set.
  - Profiling an application as a pre-processing step may be infeasible for many large applications.
  - It can be a time-consuming, complicated task.
- ILP solutions do not scale well with problem size [3,5,6,8].
- Some techniques demand architectural changes in the system [6,10].

Code Allocation on SPM
What to map?
- Segregate code between the cache and the SPM; eliminate code whose penalty is greater than its profit.
- This brings no benefit in architectures with a DMA engine, and is not an option in many architectures, e.g. the Cell.
Where to map?
- The address on the SPM where a function will be mapped to, and fetched from, at runtime.
- To use the SPM efficiently, it is divided into bins/regions and functions are mapped to regions. This raises two sub-problems:
  - What are the sizes of the SPM regions?
  - What is the mapping of functions to regions?
- Solving the two sub-problems independently leads to sub-optimal results.
Our approach is a pure-software dynamic technique based on static analysis that addresses the 'where to map' issue. It simultaneously solves the region-sizing and function-to-region mapping sub-problems.

Problem Formulation
Input:
- Set V = {v1, v2, ..., vf} of functions
- Set S = {s1, s2, ..., sf} of function sizes
- E_spm/access and E_cache/access, the energy per SPM and per cache access
- E_mbst, the energy per burst of the main memory
- E_ovm, the energy consumed by an overlay-manager instruction
Output:
- Set {S1, S2, ..., Sr} of sizes of regions R = {R1, R2, ..., Rr}, such that sum of Sr <= SPM_SIZE
- Function-to-region mapping X[f,r] = 1 if function f is mapped to region r, such that Sf x X[f,r] <= Sr
Objective: minimize energy consumption.
- E_hit(vi)  = n_hit(vi)  x (E_ovm + E_spm/access x si)
- E_miss(vi) = n_miss(vi) x (E_ovm + E_spm/access x si + E_mbst x (si + sj) / N_mbst)
- E_total    = sum over all vi of (E_hit(vi) + E_miss(vi))
Secondary objective: maximize runtime performance.
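To make the objective concrete, here is a minimal sketch in C that evaluates E_total for a candidate mapping. The struct layout, the energy constants, and the simplification that each miss displaces a single function of size sj are illustrative assumptions, not the paper's implementation.

    #include <stdio.h>

    /* Hypothetical per-event energy constants (nJ) -- illustrative only. */
    #define E_OVM   0.1   /* overlay-manager instruction        */
    #define E_SPM   0.05  /* SPM access, per byte               */
    #define E_MBST  5.0   /* one main-memory burst              */
    #define N_MBST  32    /* bytes transferred per burst        */

    typedef struct {
        double size;     /* si: function size in bytes                      */
        long   n_hit;    /* calls that find the function already resident   */
        long   n_miss;   /* calls that must DMA the function in             */
        double evicted;  /* sj: size of the function displaced on a miss    */
    } Func;

    /* E_total = sum(E_hit(vi) + E_miss(vi)) as defined on the slide. */
    double total_energy(const Func *v, int nfuncs) {
        double e = 0.0;
        for (int i = 0; i < nfuncs; i++) {
            double e_hit  = v[i].n_hit  * (E_OVM + E_SPM * v[i].size);
            double e_miss = v[i].n_miss * (E_OVM + E_SPM * v[i].size
                              + E_MBST * (v[i].size + v[i].evicted) / N_MBST);
            e += e_hit + e_miss;
        }
        return e;
    }

    int main(void) {
        Func v[] = { {2048, 990, 10, 3072}, {3072, 95, 5, 2048} };
        printf("E_total = %.1f nJ\n", total_energy(v, 2));
        return 0;
    }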

Overview
[Figure: compiler framework. Static analysis builds the GCCFG, assigns weights, and constructs the interference graph; the SDRM heuristic/ILP produces the function-to-region mapping; the link phase emits an instrumented binary, which is run on a cycle-accurate simulator to collect energy and performance statistics.]

Limitations of a Call Graph
[Figure: example program (MAIN, F1-F6 with loops and a conditional) and its call graph.]
Limitations:
- No information on the relative ordering among nodes (the call sequence).
- No information on the execution count of functions.
- Recovering this information by profiling has the problems noted under Previous Work.
These limitations motivate the Global Call Control Flow Graph (GCCFG).

Global Call Control Flow Graph (GCCFG)
[Figure: the same example program and its GCCFG, with F-nodes F1-F6, L-nodes L1-L3 for loops, I-nodes I1-I2 for branches, node weights 10/20/100/1000, a loop factor of 10 and a recursion factor of 2.]
Advantages:
- Strict ordering among the nodes: the left child is called before the right child.
- Control information is included (L-nodes for loops, I-nodes for branches).
- Node weights indicate the execution count of functions.
- Recursive functions are identified.
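As a rough illustration of the data structure such a compiler pass might build, here is a sketch in C; the field names, layout, and the weight-propagation rule are assumptions for illustration, not the authors' implementation.

    /* Minimal GCCFG node sketch. Node kinds follow the slide:
       F-nodes are functions, L-nodes loops, I-nodes branches. */
    typedef enum { F_NODE, L_NODE, I_NODE } NodeKind;

    typedef struct GNode {
        NodeKind kind;
        const char *name;        /* e.g. "F2", "L1", "I1"                  */
        long weight;             /* estimated execution count              */
        int is_recursive;        /* set for recursive F-nodes              */
        struct GNode **children; /* ordered: left child is called first    */
        int nchildren;
    } GNode;

    /* Weight assignment in the spirit of the slide: each enclosing loop
       multiplies the estimate by a loop factor, recursion by a recursion
       factor. */
    long assign_weight(long parent_weight, NodeKind kind, int recursive) {
        const int LOOP_FACTOR = 10, RECURSION_FACTOR = 2;
        long w = parent_weight;
        if (kind == L_NODE) w *= LOOP_FACTOR;
        if (recursive)      w *= RECURSION_FACTOR;
        return w;
    }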

Interference Graph
[Figure: I-graph for the example, with function sizes F2 = 2, F3 = 3, F4 = 1, F6 = 4 and edge weights such as 400, 500, 600, 700, 3000.]
Create the interference graph:
- The nodes of the I-graph are the functions (F-nodes) of the GCCFG.
- There is an edge between two F-nodes if they interfere with each other.
- Edges are classified as Caller-Callee-no-loop, Caller-Callee-in-loop, Callee-Callee-no-loop and Callee-Callee-in-loop.
Assign weights to the edges of the I-graph:
- Caller-Callee-no-loop: cost[i,j] = (si + sj) x wj
- Caller-Callee-in-loop: cost[i,j] = (si + sj) x wj
- Callee-Callee-no-loop: cost[i,j] = (si + sj) x wk, where wk = MIN(wi, wj)
- Callee-Callee-in-loop: weight as shown in the figure.
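A small sketch of the edge-cost computation for the three formulas given on the slide; treating the in-loop callee-callee case like the no-loop case is an assumption, since its formula appears only in the figure.

    typedef enum {
        CALLER_CALLEE_NO_LOOP,
        CALLER_CALLEE_IN_LOOP,
        CALLEE_CALLEE_NO_LOOP,
        CALLEE_CALLEE_IN_LOOP
    } EdgeKind;

    /* Edge weight per the slide's cost formulas. si/sj are function
       sizes, wi/wj execution-count weights from the GCCFG. */
    long edge_cost(EdgeKind k, long si, long sj, long wi, long wj) {
        long wk = (wi < wj) ? wi : wj;     /* wk = MIN(wi, wj) */
        switch (k) {
        case CALLER_CALLEE_NO_LOOP:
        case CALLER_CALLEE_IN_LOOP:
            return (si + sj) * wj;
        case CALLEE_CALLEE_NO_LOOP:
        case CALLEE_CALLEE_IN_LOOP:        /* assumed analogous */
            return (si + sj) * wk;
        }
        return 0;
    }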

SDRM Heuristic
[Figure: step-by-step walk-through of the heuristic on the example interference graph.]
Example: the function sizes are F2 = 2, F3 = 3, F4 = 1, F6 = 4 (KB); suppose the SPM size is 7 KB. A region's size is that of the largest function mapped to it. Placing F3 with F4 is cheaper (cost 400), but the regions then total 2 + 3 + 4 = 9 KB and do not fit; placing F3 with F6 fits within 7 KB:

Region | Routines | Size | Cost
R1     | F2       | 2    | -
R2     | F4       | 1    | -
R3     | F6, F3   | 4    | 700
Total  |          | 7    | 700
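A compact sketch of a greedy region-assignment loop in this spirit; this is an illustration under stated assumptions (functions visited in decreasing weight, a region sized by its largest member), not the paper's exact heuristic.

    #define MAX_F 16
    #define MAX_R 16

    typedef struct { int members[MAX_F]; int n; long size; } Region;

    /* Interference cost of adding function f to region r. */
    long placement_cost(const Region *r, int f, long inter[MAX_F][MAX_F]) {
        long c = 0;
        for (int k = 0; k < r->n; k++)
            c += inter[f][r->members[k]];
        return c;
    }

    /* Greedy sketch: each function goes into a new region when that fits,
       else into the existing region with the least added interference.
       Returns the total cost and fills region_of[f]; -1 if SPM too small. */
    long sdrm_greedy(int nf, const long size[], long inter[MAX_F][MAX_F],
                     long spm_size, int region_of[]) {
        Region regions[MAX_R];
        int nr = 0;
        long used = 0, total_cost = 0;
        for (int f = 0; f < nf; f++) {      /* assume sorted by weight */
            int best = -1;
            long best_cost = 0;
            /* option 1: open a new region (cost 0) if the function fits */
            if (nr < MAX_R && used + size[f] <= spm_size) { best = nr; }
            /* option 2: join an existing region if cheaper and it fits */
            for (int r = 0; r < nr; r++) {
                long grow = size[f] > regions[r].size
                          ? size[f] - regions[r].size : 0;
                if (used + grow > spm_size) continue;
                long c = placement_cost(&regions[r], f, inter);
                if (best < 0 || c < best_cost) { best = r; best_cost = c; }
            }
            if (best < 0) return -1;        /* cannot place this function */
            if (best == nr) { regions[nr].n = 0; regions[nr].size = 0; nr++; }
            Region *r = &regions[best];
            r->members[r->n++] = f;
            if (size[f] > r->size) { used += size[f] - r->size; r->size = size[f]; }
            region_of[f] = best;
            total_cost += best_cost;
        }
        return total_cost;
    }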

Flow Recap
[Figure: the compiler framework again. Static analysis (GCCFG, weight assignment, interference graph, SDRM heuristic/ILP) produces the function-to-region mapping; the link phase emits an instrumented binary; cycle-accurate simulation yields energy and performance statistics.]

Overlay Manager
[Figure: example call sequence main -> F1 -> F3 -> F2 through the SPM regions.]
Every call to an SPM-mapped function goes through the overlay manager:

    F1() {
        ISPM.overlay(F3);
        F3();
    }
    F3() {
        ISPM.overlay(F2);
        F2();
        ...
        ISPM.return;
    }

Overlay Table:

ID | Region | VMA     | LMA      | Size
F1 | 0      | 0x30000 | 0xA00000 | 0x100
F2 | 0      | 0x30000 | 0xA00100 | 0x200
F3 | 1      | 0x30200 | 0xA00300 | 0x1000
F4 | 1      | 0x30200 | 0xA01300 | 0x300
F5 | 2      | 0x31200 | 0xA01600 | 0x500

Region Table (function currently resident in each region):

Region | ID
0      | F1
1      | F3
2      | F5
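A conceptual sketch of what an ISPM.overlay(f) call does with these two tables; the table layout, names, and dma_copy stand-in are illustrative assumptions, not the actual runtime.

    #define NFUNCS   5
    #define NREGIONS 3

    typedef struct { int region; char *vma; char *lma; long size; } OverlayEntry;

    static OverlayEntry overlay_table[NFUNCS];           /* filled at link time   */
    static int region_table[NREGIONS] = { -1, -1, -1 };  /* resident function IDs */

    /* Stand-in for the DMA engine. */
    static void dma_copy(char *dst, const char *src, long n) {
        for (long i = 0; i < n; i++) dst[i] = src[i];
    }

    void ispm_overlay(int func_id) {
        OverlayEntry *e = &overlay_table[func_id];
        if (region_table[e->region] == func_id)
            return;                           /* hit: already resident, no DMA */
        dma_copy(e->vma, e->lma, e->size);    /* miss: fetch code into the SPM */
        region_table[e->region] = func_id;
    }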

Performance Degradation
- The scratchpad overlay manager itself is mapped to the cache.
- The branch target table has to be cleared between function overlays to the same region.
- Transfer of code from main memory to the SPM is on demand; hoisting the overlay request above independent computation lets the DMA overlap with it:

    /* overlay hoisted: the DMA overlaps the computation */
    FUNC1() {
        ISPM.overlay(FUNC2);
        computation;
        FUNC2();
    }

    /* on-demand: the call stalls until the DMA completes */
    FUNC1() {
        computation;
        ISPM.overlay(FUNC2);
        FUNC2();
    }

SDRM-prefetch
[Figure: the example GCCFG annotated with prefetch points (Q = 10, C = 10), node weights 1/10/100/1000 and available computation C1-C3, and the resulting SDRM vs. SDRM-prefetch function-to-region mappings.]
Modified cost function:
- cost_p[vi, vj] = (si + sj) x min(wi, wj) x latency_cycles_per_byte - (Ci + Cj)
- cost[vi, vj] = cost_e[vi, vj] x cost_p[vi, vj]

Here cost_e is the energy cost and cost_p the performance cost of mapping vi and vj to the same region, with Ci the computation available to overlap the fetch of vi.
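A direct sketch of the modified performance cost, assuming (as the figure suggests) that ci and cj are the computation cycles available to hide the two fetches:

    /* Prefetch-aware cost from the slide: DMA latency for the shared
       region, weighted by the less-frequent function, minus the
       computation available to overlap it. */
    double cost_prefetch(long si, long sj, long wi, long wj,
                         double lat_cyc_per_byte, long ci, long cj) {
        long wmin = (wi < wj) ? wi : wj;
        return (double)(si + sj) * wmin * lat_cyc_per_byte - (double)(ci + cj);
    }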

Energy Model
- E_TOTAL = E_SPM + E_I-CACHE + E_TOTAL-MEM
- E_SPM = N_SPM x E_SPM-ACCESS
- E_I-CACHE = E_IC-READ-ACCESS x (N_IC-HITS + N_IC-MISSES) + E_IC-WRITE-ACCESS x 8 x N_IC-MISSES
- E_TOTAL-MEM = E_CACHE-MEM + E_DMA
- E_CACHE-MEM = E_MBST x N_IC-MISSES
- E_DMA = N_DMA-BLOCK x E_MBST x 4
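The model transcribes directly into code. In this sketch the per-event energy constants are placeholders to be filled in from the target technology, and reading the factors 8 and 4 as "writes per line fill" and "bursts per DMA block" is an interpretation of the slide.

    typedef struct {
        double e_spm_access;  /* per SPM access           */
        double e_ic_read;     /* per I-cache read access  */
        double e_ic_write;    /* per I-cache write access */
        double e_mbst;        /* per main-memory burst    */
    } EnergyParams;

    double energy_total(const EnergyParams *p, long n_spm,
                        long n_ic_hits, long n_ic_misses, long n_dma_blocks) {
        double e_spm    = n_spm * p->e_spm_access;
        double e_icache = p->e_ic_read  * (n_ic_hits + n_ic_misses)
                        + p->e_ic_write * 8 * n_ic_misses;   /* line fill     */
        double e_mem    = p->e_mbst * n_ic_misses            /* cache refills */
                        + n_dma_blocks * p->e_mbst * 4;      /* DMA bursts    */
        return e_spm + e_icache + e_mem;
    }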

Performance Model
- chunks = (block_size + bus_width - 1) / bus_width, with a 64-bit (8-byte) bus
- mem_lat[0] = 18 cycles (first chunk)
- mem_lat[1] = 2 cycles (each subsequent chunk)
- total_lat = mem_lat[0] + mem_lat[1] x (chunks - 1)
- latency_cycles_per_byte = total_lat / block_size
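The same model as code, a minimal sketch assuming sizes are in bytes:

    /* The slide's memory-latency model. The bus is 64 bits (8 bytes),
       so chunks is a ceiling division over the block size. */
    double latency_cycles_per_byte(long block_size_bytes) {
        const long bus_width = 8;                  /* bytes per transfer */
        const long first_chunk_lat = 18, inter_chunk_lat = 2;
        long chunks = (block_size_bytes + bus_width - 1) / bus_width;
        long total_lat = first_chunk_lat + inter_chunk_lat * (chunks - 1);
        return (double)total_lat / block_size_bytes;
    }

For example, a 64-byte block needs 8 chunks, so total_lat = 18 + 2 x 7 = 32 cycles, i.e. 0.5 cycles per byte.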

SDRM is power efficient
Average energy reduction of 25.9% for SDRM.

Cache-Only vs. Split Architecture
- Architecture 1 (cache only): an X-byte instruction cache plus a data cache on chip.
- Architecture 2 (split): an X/2-byte instruction cache, an X/2-byte instruction SPM, and a data cache on chip.
Results: an average 35% energy reduction across all benchmarks, with an average 2.08% performance degradation.

SDRM with prefetching is better
- Average performance improvement: 6%.
- Average energy reduction: 32% (3% less).

Conclusion
- By splitting an instruction cache into an equal-sized SPM and I-cache, a pure-software technique like SDRM will always result in energy savings.
- There is a tradeoff between energy savings and performance improvement.
- SPMs are the way to go for many-core architectures.

Continuing Effort
- Improve the static analysis.
- Investigate the effect of function outlining on the mapping.
- Explore techniques to use and share the SPM in a multi-core, multi-tasking environment.

References
[1] New Microarchitecture Challenges for the Coming Generations of CMOS Process Technologies. Micro32.
[2] E. Grochowski, R. Ronen, J. Shen and H. Wang: Best of Both Latency and Throughput. IEEE International Conference on Computer Design (ICCD '04), 2004, 236-243.
[3] S. Steinke et al.: Assigning program and data objects to scratchpad memory for energy reduction.
[4] F. Angiolini et al.: A post-compiler approach to scratchpad mapping of code.
[5] B. Egger, S. L. Min et al.: A dynamic code placement technique for scratchpad memory using postpass optimization.
[6] B. Egger et al.: Scratchpad memory management for portable systems with a memory management unit.
[7] M. Verma et al.: Dynamic overlay of scratchpad memory for energy minimization.
[8] M. Verma and P. Marwedel: Overlay techniques for scratchpad memories in low power embedded processors.
[9] S. Steinke et al.: Reducing energy consumption by dynamic copying of instructions onto onchip memory.
[10] A. Udayakumaran and R. Barua: Dynamic Allocation for Scratch-Pad Memory using Compile-time Decisions.