Computation and data migration in an embedded many-core SoC, January 20, 2015. Matthieu BRIEDA, Anca MOLNOS, Julien MOTTIN.

Presentation transcript:

Computation and data migration in an embedded many-core SoC. January 2015. Matthieu BRIEDA, Anca MOLNOS, Julien MOTTIN

Background

Figure: Simulated heat map of the Sthorm platform before (left) and after (right) activity migration. Source: Becher, M., Bensalem, S., & Pacull, F. (2014, February). Icy-Core Framework for Simulating Thermal Effects of Task Migration Algorithms on Multi- and Many-Core Architectures. In ICONS 2014, The Ninth International Conference on Systems.

Work context and objective
Objective: implement computation and data migration to enable thermal mitigation.
Figure: a host with a global memory attached to a many-core accelerator made of four clusters (Cluster 0 to Cluster 3), each containing processing elements (PEs) and a local memory, connected by NoC routers.

Problems
1. Task migration (between iterations): remote data access, causing performance loss
Figure: task T runs on a PE of one cluster and accesses its data locally; after T migrates to a PE of another cluster, its data stays in the original cluster's local memory, so every access T makes becomes a remote data access.

Problems
2. Data migration: pointer invalidation, causing an application error
Task T code: int *p = malloc(sizeof(int)); *p = 3; … int a = *p;
Figure: the accelerator's address space is partitioned among the cluster memories (Cluster 0 to Cluster 3). When T's data migrates to another cluster it moves to a different address range, but T still dereferences the old address held in p.

Solution overview
Figure: software stack on the host and the many-core accelerator, made of the application, the framework, the decision policy (e.g., temperature mitigation, …) and the OS (allocators, communication API, HAL). The diagram distinguishes contributions, goal, interfaces, and blocks.
Interfaces: application building interface; task and data mapping interface.
Contributions:
1. Application management
2. Migration protocol
3. Memory translation mechanism

Application building interface
– Task: task ID; Init(), executed once; Fire(), executed iteratively; End(), executed once; list of data IDs
– Data: data ID; status (shared/private); size
– Application control task: starts/stops the application
Figure: an application made of tasks plus an application control task, with inter-task shared data and inter-iteration shared data.

Run-time app initialization (1. Application management)
Figure: the application control task sends the app description (unmapped tasks and data) to the framework; the decision policy performs PE and memory attribution, producing mapped tasks and data; the framework then runs the initialization: shared memory allocation, private memory allocation, task creation and start, and filling of the local tables. Legend: data flow, control flow.

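Migration protocol
Example: migrating a task and its data from Cluster 1 to Cluster 3.
1. Allocate new memory
2. Pause task
3. Data copy
4. Free old memory
5. Resume task
Figure: timeline of successive fire() calls on the source PE and the destination PE, coordinated by a source controller and a destination controller; the task is paused between two fire() iterations and resumed on the destination PE. Legend: framework function, user function, trigger.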

Translation mechanism
Local table: (data ID, task ID) => address, queried by the task through task_get_addess.
Task code (identical before and after migration):
fire() { int *pointer = task_get_addess(dataID); if (iteration == 1) *pointer = 3; if (iteration == 2) { int a = *pointer; } }
Framework:
– Provides address virtualization in software
– Updates the translation when data migrates
User:
– Accesses data by ID
– Never allocates memory directly
=> Solves the pointer invalidation problem

Experimental Setup

Experimental Results
Migration cost per protocol step:
Step | Duration (cycles) | Depends on
1. Allocate new memory | | number of data
2. Pause task | 3166 | constant
3. Data copy | 2996 | data size
4. Free old memory | 9968 | number of data
5. Resume task | 1536 | number of data
Sum | 36327 |
Figure: the migration protocol timeline (source PE, destination PE, source and destination controllers) annotated with the total migration duration and the frozen task duration. Legend: framework function, user function, trigger.

Conclusion & Questions
Demonstration of a proof-of-concept task and data migration on a many-core SoC, aimed at enabling thermal mitigation at a reasonable cost.