Dynamic Code Mapping Techniques for Limited Local Memory Systems

Presentation transcript:

Dynamic Code Mapping Techniques for Limited Local Memory Systems
Seungchul Jung
Compiler Microarchitecture Lab
Department of Computer Science and Engineering
Arizona State University
12/9/2018

Multicore Landscape
Single-core architectures ran into power, temperature, heat, and reliability limits. Multicore architectures (from vendors such as Intel, IBM, and NVidia) have lower power density.

Multi-core Memory
Caches in distributed memory systems face critical issues: scalability and coherency protocols. As the number of cores increases, the number of local memories increases too, and managing those memories in hardware causes serious performance overhead: it consumes a lot of power and does not scale with the number of cores due to the data coherency protocol. Consequently, in a distributed memory system, software-managed local memory is the only option for memory scaling. One example of a Limited Local Memory (LLM) architecture is the Cell processor, where each SPU has a Local Store (LS) and a DMA engine, connected to the PPU, memory controller, and bus interface over the Element Interconnect Bus.

Core Memory Management
Local memory size is the limiting factor, so there is a need for automatic management of code, variables, heap, and stack; application developers are already busy. The compiler transforms accesses, for example:

    int global;
    F1(){
        int var1, var2;
        global = var1 + var2;
        F2();
    }

becomes:

    int global;
    F1(){
        int var1, var2;
        DLM.fetch(global);
        global = var1 + var2;
        DLM.writeback(global);
        ILM.fetch(F2);
        F2();
    }

This approach scales to any number of cores, because it is a direct, physical mapping onto the scratchpad memory (as on the Cell processor).

Code Management Mechanism
On the Cell processor, functions are placed into overlay regions through a linker script:

    SECTIONS {
        OVERLAY {
            F1.o
            F3.o
        }
        F2.o
    }

Given the application call graph, the code sections (along with stack, heap, and variables) reside in main memory and are loaded into local memory on demand; functions mapped to the same overlay region (here F1 and F3) share the same local-memory space.

Code Management Problem
Given the code section of the local memory, how do you map all the functions into the limited space? The problem is to determine the number of regions and the function-to-region mapping that minimize data transfer within the given space. The two extreme cases are one region per function and one region for all functions. Since the problem is NP-complete, code must be managed with good heuristics.

Related Work
Previous approaches cluster functions into regions so as to minimize intra-cluster interference. ILP formulations [2,3,4,5] are intractable for large applications, so heuristics have been proposed: Best-Fit [6], First-Fit [4], and SDRM [5].
- Egger et al. Scratchpad memory management.. EMSOFT '06
- Steinke et al. Reducing energy consumption.. ISSS '02
- Egger et al. A dynamic code placement.. CASES '06
- Verma et al. Overlay techniques.. VLSI '06
- Pabalkar et al. SDRM: Simultaneous.. HIPC '08
- Udayakumaran et al. Dynamic allocation for.. ACM Trans. '06

Limitations of Previous Works 1
The interference estimate depends on the information you start with. Prior work starts from the call graph plus profile information, which loses execution order: we need to know which function executes first. The GCCFG (Global Call Control Flow Graph) keeps this information; in particular, it records loops, which matter because they determine the call counts. Consider this example, where every function is 1KB:

    F1(){
        F2();
        F3();
    }
    F2(){
        for(i=0;i<10;i++){ F4(); }
        for(i=0;i<100;i++){ F5(); }
    }
    F3(){
        for(i=0;i<10;i++){ F6(); F7(); }
    }

The call graph annotates edges only with call counts (F4 called 10 times, F5 100 times, F6 and F7 10 times each). The GCCFG additionally records loop nodes (L1, L2, L3 with counts 10, 100, and 10) and the order in which calls occur.

Limitations of Previous Works 2
The interference cost of mapping a function depends on where the other functions are mapped, so it must be updated as the mapping is built. In the example, F1 (2KB), F2 (1.5KB), F3 (0.4KB), and F4 (0.2KB) are mapped into Region 0 (2KB) and Region 1 (1.5KB). Starting from an intermediate mapping with F1 in Region 0 and F2, F3 in Region 1, placing F4 without considering the other functions results in 7.6KB of data transfer, whereas considering them results in 6KB: a 21% reduction.

Our Approach
We propose two heuristics, FMUM and FMUP, which operate on the GCCFG. Running example:

    F1() {
        F2();
        for(int i = 0; i < 200; i++){ F3(); }
    }
    F2() {
        F5();
        for(int i = 0; i < 100; i++){
            F4();
        }
    }

The GCCFG for this example contains function nodes F1..F5 and loop nodes L1 (count 100) and L2 (count 200).

FMUM Heuristic
FMUM starts from the maximum layout, one region per function (7.5KB in the example), and merges regions step by step until the mapping fits the given size (5.5KB). In the example, the first step merges F1 and F5 into one 1.5KB region; the final layout also holds F2 (1.5KB), F3 (0.5KB), F4 (2KB), and F6 (1KB). The metric of evaluation is the amount of data transferred, which translates directly into performance.

FMUP Heuristic
FMUP works in the opposite direction: it starts from the minimum layout, all functions in a single region (2KB in the example), and repeatedly splits functions off into new regions as long as the mapping still fits the given size (5KB).

Interference Cost Calculation
The interference cost of two functions sharing a region depends on their relation in the GCCFG (using the running example, with all sizes 1KB):
- Direct caller-callee: (F2 + F3) x 3
- Caller-callee without a loop in between: (F1 + F3) x 1
- Caller-callee with a loop in between: (F2 + F4) x 100
- Ancestor (LCA) relation: (F3 + F5) x 1
In each case the cost is the sum of the two function sizes multiplied by a weight reflecting how often execution alternates between them.

Experiments Setup
From the application, its profiling information, and the given code size, we generate and apply a code mapping to produce three binaries: an FMUM binary, an FMUP binary, and an SDRM binary.

Typical Performance Result
Depending on the benchmark and the given code size, sometimes FMUM performs better and sometimes FMUP performs better.

Number of Times of Improvement
The metric is the percentage of measurement points at which our result beats SDRM. Picking the better of FMUM and FMUP gives a better result than SDRM 82% of the time.

On average, picking the better of FMUM and FMUP reduces runtime by 12%.

Utilizing Given Code Space
The given code space is fully utilized by our heuristics.

Efficient In-Loop-Function Mapping
Functions called inside loops are mapped to separate regions.

Increased Map-ability
100% mappability is guaranteed.

Impact of Different GCCFG Weight Assignments
Alternative weight assignments can reduce the compile-time overhead with little effect on performance (normalized results between 0.96 and 1.04).

Performance with Increased SPU Threads
The approach scales with an increased number of cores. One abnormality: ADPCM uses bigger input data, which overloads the EIB. Note also that 2 SPUs are reserved by the PS3 itself.

Conclusion
- The trend in computer architecture is multicore with limited local memory, where management of code, variables, heap, and stack is required to get better performance from limited resources.
- Previous works are limited by relying on the call graph and a fixed interference cost.
- We proposed two new heuristics, FMUM and FMUP, which increase overall performance by 12% with tolerable compile-time overhead.

Contributions and Outcomes
Contributions: problem formulation using the GCCFG, and updating the interference cost between functions as the mapping is built.
Outcomes: software release (www.public.asu.edu/~sjung) and a paper submission to ASAP 2010.
Plans: a journal submission is prepared.