Ke Bai, Aviral Shrivastava, Compiler Micro-architecture Lab


Heap Data Management for Limited Local Memory (LLM) Multicore Processors
Ke Bai, Aviral Shrivastava
Compiler Micro-architecture Lab

From multi- to many-core processors
- Simpler design and verification; the cores are reused
- Performance can improve without much increase in power, since each core can run at a lower frequency
- Thermal and reliability problems can be tackled at core granularity
- Moving to multi-core was inevitable to sustain the performance improvements we have enjoyed over the past two decades
- Early multi-cores were based on the shared memory architecture
- What happens if we have 100s of cores? Shared memory architectures are not scalable and can limit the achievable performance; cache coherency is a problem
- Within the same process technology, a new single-core design with 1.5x to 1.7x the performance consumes 2x to 3x the die area [1] and 2x to 2.5x the power [2]
- Dynamic power increases quadratically with frequency; leakage increases exponentially

Speaker notes: single cores face the power wall, frequency wall, and memory wall; the architecture design becomes complicated; high performance and high power efficiency cannot be achieved at the same time (caches consume 44% of the power in a core); power consumption does not scale well. Power has an active component and a passive component (gate leakage and sub-threshold source-drain leakage), and air cooling sets the limit. Example many-core designs: GeForce 9800 GT, IBM XCell 8i, Tilera TILE64.

2018/11/15 http://www.public.asu.edu/~ashriva6/cml

Memory Scaling Challenge
- In Chip Multi-Processors (CMPs), caches provide the illusion of a large unified memory: required data is brought into the cache from wherever it resides, and the application always gets the latest copy of the data
- Caches consume too much power: 44% of core power, and more than 34% of area
- Cache coherency protocols do not scale well: the Intel 48-core Single-chip Cloud Computer and the Intel 80-core processor have non-coherent caches
Figures: StrongARM 1100 die, Intel 80-core chip

Limited Local Memory Architecture
- Cores have small local memories (scratchpads)
- A core can only access its local memory
- Accesses to global memory happen through explicit DMAs in the program
- Example: the IBM Cell architecture, used in the Sony PS3
Figure: SPE 0 through SPE 7 (each an SPU with its Local Store) and the PPE, connected by the Element Interconnect Bus (EIB) to off-chip global memory. PPE: Power Processor Element; SPE: Synergistic Processor Element; LS: Local Store.

LLM Programming
Thread-based programming, with MPI-like communication.

Main core:

#include <libspe2.h>
extern spe_program_handle_t hello_spu;
int main(void) {
  int speid, status;
  speid = spe_create_thread(&hello_spu);
}

Each local core:

#include <spu_mfcio.h>
int main(speid, argp) {
  printf("Hello world!\n");
}

- Extremely power-efficient computation, if all code and data fit into the local memory of the cores

What if thread data is too large?
Two options:
1. Repartition and re-parallelize the application (e.g., turn two threads with 32 KB of data each into three threads that fit in 24 KB cores). This can be counter-intuitive and hard.
2. Manage the data so the thread executes within the limited memory of its core. This is easier and portable.
All data accessed by a core must be located in its local memory; if it fits, execution is efficient.

Managing data

Original Code:

int global;
f1(){
  int a,b;
  global = a + b;
  f2();
}

Local Memory Aware Code:

int global;
f1(){
  int a,b;
  DMA.fetch(global)
  global = a + b;
  DMA.writeback(global)
  DMA.fetch(f2)
  f2();
}
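The DMA.fetch / DMA.writeback calls in the slide are pseudocode; on a Cell SPE they would be mfc_get / mfc_put transfers. As a minimal sketch of the transformation (not the paper's implementation), the snippet below simulates the two memories with ordinary variables and uses memcpy in place of the DMA; all names are illustrative:

```c
#include <string.h>

/* Two simulated memories: on a real LLM core these would be the
 * off-chip global memory and the on-chip local store. */
static int global_copy = 0;   /* the variable's home in "global memory" */
static int local_copy;        /* its working copy in "local memory"     */

/* Stand-ins for DMA transfers (mfc_get / mfc_put on the Cell SPE). */
static void dma_fetch(void)     { memcpy(&local_copy, &global_copy, sizeof local_copy); }
static void dma_writeback(void) { memcpy(&global_copy, &local_copy, sizeof global_copy); }

/* Local-memory-aware f1(): the access to the global variable is
 * bracketed by an explicit fetch and writeback. */
int f1(int a, int b) {
    dma_fetch();                  /* DMA.fetch(global)      */
    local_copy = a + b;           /* global = a + b;        */
    dma_writeback();              /* DMA.writeback(global)  */
    return global_copy;           /* value visible in "global memory" */
}
```

The point of the transformation is that the core only ever computes on the local copy; the explicit transfers keep the global master copy consistent.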

Heap Data Management
- All code and data need to be managed: stack, heap, code, and global data
- This paper focuses on heap data management
- Heap data management is difficult:
  - Heap size is dynamic, while the sizes of code and global data are statically known
  - Heap data size can be unbounded; the heap grows opposite to the stack, and its size is data dependent
- The Cell programming manual suggests "use heap data at your own risk"
- Restricting heap usage is restrictive for programmers: malloc() has been in use for a long time, and banning it would force programmers to abandon many dynamic data structures and related algorithms, impeding their imagination and creativity
- If malloc() is used anyway, the programmer is responsible for the heap size: in the best case the program crashes, in the worst case it generates wrong results

main() {
  for (i = 0; i < N; i++) {
    item[i] = malloc(sizeof(Item));
  }
  F1();
}

Outline of the talk
- Motivation
- Related work on heap data management
- Our approach to heap data management
- Experiments

Related Work
- The local memories of LLM cores are similar to scratchpad memories (SPMs), and extensive work has been proposed for SPMs:
  - Stack: Udayakumaran2006, Dominguez2005, Kannan2009
  - Global: Avissar2002, Gao2005, Kandemir2002, Steinke2002
  - Code: Janapsatya2006, Egger2006, Angiolini2004, Pabalkar2008
  - Heap: Dominguez2005, Mcllroy2008
- In the ARM memory architecture the core can also access memory directly, so the SPM is an optimization; in the IBM Cell memory architecture the SPE can only reach global memory through DMA, so the local store is essential
Speaker notes: Dominguez2005 statically allocates heap data in the scratchpad memory; everything is decided at compile time. It (1) partitions the program into regions, e.g., at the start and end of every procedure; (2) determines the time order between regions by finding the possible predecessors and successors of each region; and (3) copies portions of heap variables into the scratchpad. Mcllroy2008 presents a memory management algorithm that uses a variety of techniques to reduce the size of the data structures required to manage memory. Both approaches are static: they decide at compile time which heap data should go where.

Our Approach

typedef struct {
  int id;
  float score;
} Student;

main() {
  for (i = 0; i < N; i++) {
    student[i] = malloc(sizeof(Student));
    student[i].id = i;
  }
}

Example: local heap size = 32 bytes and sizeof(Student) = 16 bytes, so the local heap (at heap pointer HP) holds two objects; the third malloc evicts the oldest object to global memory (at GM_HP).
- malloc() is replaced by mymalloc(), which allocates space in local memory
- It may need to evict older heap objects to global memory
- It may need to allocate more global memory
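A minimal sketch of the eviction idea, assuming the slide's sizes (a 32-byte local heap holding two 16-byte objects) and oldest-first eviction; memcpy stands in for the DMA write, and all names and policies here are illustrative rather than the paper's actual implementation:

```c
#include <string.h>

#define LOCAL_HEAP_SIZE 32                       /* bytes of heap in local memory */
#define OBJ_SIZE        16                       /* sizeof(Student) in the example */
#define MAX_OBJS        (LOCAL_HEAP_SIZE / OBJ_SIZE)

static unsigned char local_heap[LOCAL_HEAP_SIZE];   /* simulated local store    */
static unsigned char global_heap[1024];             /* simulated global memory  */
static int n_alloc = 0;                             /* objects allocated so far */
static int global_used = 0;                         /* bytes evicted so far     */

/* mymalloc(): allocate the next object in the local heap; when the heap
 * is full, evict the oldest resident object to global memory (memcpy in
 * place of a DMA write) to make room.  Returns the local slot index. */
int mymalloc(void) {
    int slot = n_alloc % MAX_OBJS;               /* circular local heap */
    if (n_alloc >= MAX_OBJS) {                   /* slot occupied: evict it */
        memcpy(global_heap + global_used,
               local_heap + slot * OBJ_SIZE, OBJ_SIZE);
        global_used += OBJ_SIZE;
    }
    n_alloc++;
    return slot;
}
```

With the example sizes, the first two allocations land in slots 0 and 1; the third reuses slot 0 after evicting its old contents, which is exactly the malloc1/malloc2/malloc3 picture on the slide.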

How to evict data to global memory?
- Option 1: the execution core uses DMA to transfer the heap object to global memory
  - DMA is very fast, with no core-to-core communication
  - But eventually you can overwrite some other data, so OS mediation is needed
- Option 2: the execution core asks the main core to malloc() global memory for it
  - Thread communication between cores is slow!

Hybrid DMA + Communication
- Combine both: DMA-write heap data from local memory to global memory, and use mailbox-based communication with the main core only when more global space is needed

malloc() {
  if (enough space in global memory)
    then write the heap data using DMA
  else
    request more space in global memory
}

- To get space for S bytes, the execution thread sends a mailbox request; the main core allocates >= S bytes of global memory and replies with the startAddr and endAddr of the new region
- free() frees global space; its communication is similar to malloc(): the global address is sent to the thread on the main core
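The request/reply round trip can be sketched as follows. This is a simulation under stated assumptions, not the paper's code: a direct function call stands in for the mailbox exchange (spu_write_out_mbox / spu_read_in_mbox on the Cell), the main core hands out fixed 1 KB grants, and all names are hypothetical:

```c
#include <stddef.h>

/* "Main core" side: a bump allocator over simulated global memory. */
static unsigned char global_mem[4096];
static size_t global_brk = 0;

/* The main core grants at least `size` bytes and returns the start
 * offset of the granted region (the mailbox reply). */
static size_t mailbox_request_space(size_t size) {
    size_t start = global_brk;
    global_brk += size;                       /* allocate >= S bytes */
    return start;
}

/* "Execution core" side: the window of global memory it currently owns. */
static size_t start_addr = 0, end_addr = 0;

/* Reserve room in global memory for a heap object of `size` bytes,
 * contacting the main core only when the current window runs out. */
size_t global_alloc(size_t size) {
    if (end_addr - start_addr < size) {            /* window exhausted */
        start_addr = mailbox_request_space(1024);  /* ask for a 1 KB grant */
        end_addr   = start_addr + 1024;
    }
    size_t addr = start_addr;                 /* object is DMA-written here */
    start_addr += size;
    return addr;
}
```

The design point the slide makes is visible here: the slow mailbox exchange happens once per grant, while the common case (space still available in the window) is handled locally and followed by a fast DMA.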

Address Translation Functions
(Example: heap size = 32 bytes, sizeof(Student) = 16 bytes)

main() {
  for (i = 0; i < N; i++) {
    student[i] = malloc(sizeof(Student));
    student[i] = p2s(student[i]);
    student[i].id = i;
    student[i] = s2p(student[i]);
  }
}

- The mapping from an SPU (local) address to a global address is one-to-many, so the global address cannot easily be recovered from an SPU address
- Therefore all heap accesses must happen through global addresses
- p2s() translates a global address to an SPU address, making sure the heap object is in local memory
- s2p() translates the SPU address back to a global address
- More details in the paper
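One plausible way to back p2s()/s2p() is a small residency table mapping global heap addresses to local-store slots; p2s does a lookup (and in the full scheme would DMA-fetch on a miss), while s2p is the reverse lookup. This is a sketch under that assumption, with the fetch-on-miss path omitted and all names invented for illustration:

```c
#include <stddef.h>

#define MAX_RESIDENT 2   /* local heap holds two objects in the slide's example */

/* Residency table: which global heap address currently occupies which
 * local-store slot. */
static size_t resident_global[MAX_RESIDENT];   /* global addr per slot */
static int    resident_valid[MAX_RESIDENT];

/* Record that the object at `global_addr` now occupies `slot`. */
void mark_resident(int slot, size_t global_addr) {
    resident_global[slot] = global_addr;
    resident_valid[slot]  = 1;
}

/* p2s: global (PPU-side) address -> local (SPU-side) slot index.
 * Returns -1 on a miss; the full scheme would then DMA-fetch the
 * object, possibly evicting another, before retrying. */
int p2s(size_t global_addr) {
    for (int i = 0; i < MAX_RESIDENT; i++)
        if (resident_valid[i] && resident_global[i] == global_addr)
            return i;
    return -1;
}

/* s2p: local slot index -> global address.  Valid only while the object
 * is resident, which is why pointers are stored as global addresses
 * between uses. */
size_t s2p(int slot) {
    return resident_global[slot];
}
```

Note the asymmetry the slide describes: s2p is a trivial per-slot lookup, while p2s must search (or hash) because many global addresses map, over time, to the same local slot.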

Heap Management API
- malloc(): allocates space in local memory and in global memory, and returns the global address
- free(): frees the space in global memory
- p2s(): ensures the heap variable exists in local memory and yields its SPU address
- s2p(): translates the SPU address back to the PPU (global) address

Original Code:

typedef struct {
  int id;
  float score;
} Student;

main() {
  for (i = 0; i < N; i++) {
    student[i] = malloc(sizeof(Student));
    student[i].id = i;
  }
}

Code with Heap Management:

typedef struct {
  int id;
  float score;
} Student;

main() {
  for (i = 0; i < N; i++) {
    student[i] = malloc(sizeof(Student));
    student[i] = p2s(student[i]);
    student[i].id = i;
    student[i] = s2p(student[i]);
  }
}

Our approach provides the illusion of unlimited heap space in the local memory!

Experimental Setup
- Sony PlayStation 3 running Fedora Core 9 Linux
- MiBench benchmark suite and other applications: http://www.public.asu.edu/~kbai3/publications.html
- Runtimes measured with spu_decrementer() on the SPE and _mftb() on the PPE, both provided with the IBM Cell SDK 3.1

Unrestricted Heap Size
- Runtimes are comparable to the original

Larger Heap Space → Lower Runtime

Runtime Decreases with Granularity
- Granularity: the number of heap objects combined into one transfer unit

Embedded Systems Optimization
- If the maximum heap space needed is known, no thread communication is needed: DMAs are sufficient
- Average 14% improvement

Scalability of Heap Management

Summary
- We are moving from multi-core to many-core systems, and scaling the memory architecture is a major challenge
- Limited Local Memory architectures are promising, but code and data must be managed when they cannot fit in the limited local memory
- We propose a heap data management scheme that:
  - manages any size of heap data in a constant space in local memory
  - is automatable, and can therefore increase programmer productivity
  - is scalable across different numbers of cores
  - has an overhead of about 4-20%
- Comparison with a software cache: a software cache does not support pointers, needs one cache per data type, and cannot be optimized any further