A Programmable Processing Array Architecture Supporting Dynamic Task Scheduling and Module-Level Prefetching Junghee Lee , Hyung Gyu Lee , Soonhoi Ha.

Slides:

Advertisements

Similar presentations

Network II.5 simulator ..

Advertisements

Operating System.

Processes Management.

Middleware Support for RDMA-based Data Transfer in Cloud Computing Yufei Ren, Tan Li, Dantong Yu, Shudong Jin, Thomas Robertazzi Department of Electrical.

Chapter 8-1 : Multiple Processor Systems Multiple Processor Systems Multiple Processor Systems Multiprocessor Hardware Multiprocessor Hardware UMA Multiprocessors.

WHAT IS AN OPERATING SYSTEM? An interface between users and hardware - an environment "architecture ” Allows convenient usage; hides the tedious stuff.

Do We Need Wide Flits in Networks-On-Chip? Junghee Lee, Chrysostomos Nicopoulos, Sung Joo Park, Madhavan Swaminathan and Jongman Kim Presented by Junghee.

Cache Coherent Distributed Shared Memory. Motivations Small processor count –SMP machines –Single shared memory with multiple processors interconnected.

A Dynamic Operating System for Sensor Nodes (SOS) Source:The 3 rd International Conference on Mobile Systems, Applications, and Service (MobiSys 2005)

Continuously Recording Program Execution for Deterministic Replay Debugging.

04/14/2008CSCI 315 Operating Systems Design1 I/O Systems Notice: The slides for this lecture have been largely based on those accompanying the textbook.

Silberschatz, Galvin and Gagne  Operating System Concepts Chapter 13: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem.

Operating Systems High Level View Chapter 1,2. Who is the User? End Users Application Programmers System Programmers Administrators.

Operating System - Overview Lecture 2. OPERATING SYSTEM STRUCTURES Main componants of an O/S Process Management Main Memory Management File Management.

Mahapatra-Texas A&M-Fall'001 cosynthesis Introduction to cosynthesis Rabi Mahapatra CPSC498.

I/O Hardware n Incredible variety of I/O devices n Common concepts: – Port – connection point to the computer – Bus (daisy chain or shared direct access)

Active Messages: a Mechanism for Integrated Communication and Computation von Eicken et. al. Brian Kazian CS258 Spring 2008.

Chapter 13: I/O Systems I/O Hardware Application I/O Interface

Computer Organization and Architecture

HW/SW Co-Synthesis of Dynamically Reconfigurable Embedded Systems HW/SW Partitioning and Scheduling Algorithms.

Using Two Queues. Using Multiple Queues Suspended Processes Processor is faster than I/O so all processes could be waiting for I/O Processor is faster.

I/O Systems CSCI 444/544 Operating Systems Fall 2008.

Copyright Arshi Khan1 System Programming Instructor Arshi Khan.

I/O Tanenbaum, ch. 5 p. 329 – 427 Silberschatz, ch. 13 p

Programming the Cell Multiprocessor Işıl ÖZ. Outline Cell processor – Objectives – Design and architecture Programming the cell – Programming models CellSs.

Chapter 3 Operating Systems Introduction to CS 1 st Semester, 2015 Sanghyun Park.

Seaborg Cerise Wuthrich CMPS Seaborg  Manufactured by IBM  Distributed Memory Parallel Supercomputer  Based on IBM’s SP RS/6000 Architecture.

Hardware Definitions –Port: Point of connection –Bus: Interface Daisy Chain (A=>B=>…=>X) Shared Direct Device Access –Controller: Device Electronics –Registers:

I/O Systems I/O Hardware Application I/O Interface

Chapter 5 Operating System Support. Outline Operating system - Objective and function - types of OS Scheduling - Long term scheduling - Medium term scheduling.

Rensselaer Polytechnic Institute CSCI-4210 – Operating Systems CSCI-6140 – Computer Operating Systems David Goldschmidt, Ph.D.

Architectural Support for Fine-Grained Parallelism on Multi-core Architectures Sanjeev Kumar, Corporate Technology Group, Intel Corporation Christopher.

Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations Xin Huo, Vignesh T. Ravi, Gagan Agrawal Department of Computer Science and Engineering.

Lecture 3 Process Concepts. What is a Process? A process is the dynamic execution context of an executing program. Several processes may run concurrently,

Frontiers in Massive Data Analysis Chapter 3.  Difficult to include data from multiple sources  Each organization develops a unique way of representing.

Performance Prediction for Random Write Reductions: A Case Study in Modelling Shared Memory Programs Ruoming Jin Gagan Agrawal Department of Computer and.

Hardware-software Interface Xiaofeng Fan

Chapter 2 Processes and Threads Introduction 2.2 Processes A Process is the execution of a Program More specifically… – A process is a program.

Operating System Principles And Multitasking

Harmony: A Run-Time for Managing Accelerators Sponsor: LogicBlox Inc. Gregory Diamos and Sudhakar Yalamanchili.

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | SCHOOL OF COMPUTER SCIENCE | GEORGIA INSTITUTE OF TECHNOLOGY MANIFOLD Manifold Execution Model and System.

Silberschatz, Galvin and Gagne  Operating System Concepts Chapter 13: I/O Systems I/O Hardware Application I/O Interface Kernel I/O Subsystem.

Operating System Issues in Multi-Processor Systems John Sung Hardware Engineer Compaq Computer Corporation.

Computer Network Lab. Korea University Computer Networks Labs Se-Hee Whang.

Chapter 13 – I/O Systems (Pgs ). Devices  Two conflicting properties A. Growing uniformity in interfaces (both h/w and s/w): e.g., USB, TWAIN.

Lecture on Central Process Unit (CPU)

Overview von Neumann Architecture Computer component Computer function

Hardware-based Job Queue Management for Manycore Architectures and OpenMP Environments Junghee Lee, Chrysostomos Nicopoulos, Yongjae Lee, Hyung Gyu Lee.

1 Process Description and Control Chapter 3. 2 Process A program in execution An instance of a program running on a computer The entity that can be assigned.

System Architecture Directions for Networked Sensors.

Multiprocessor  Use large number of processor design for workstation or PC market  Has an efficient medium for communication among the processor memory.

Cloud Computing – UNIT - II. VIRTUALIZATION Virtualization Hiding the reality The mantra of smart computing is to intelligently hide the reality Binary->

Silberschatz, Galvin, and Gagne  Applied Operating System Concepts Module 12: I/O Systems I/O hardwared Application I/O Interface Kernel I/O.

1 Device Controller I/O units typically consist of A mechanical component: the device itself An electronic component: the device controller or adapter.

My Coordinates Office EM G.27 contact time:

Background Computer System Architectures Computer System Software.

Running Commodity Operating Systems on Scalable Multiprocessors Edouard Bugnion, Scott Devine and Mendel Rosenblum Presentation by Mark Smith.

Advanced Operating Systems CS6025 Spring 2016 Processes and Threads (Chapter 2)

 Operating system.  Functions and components of OS.  Types of OS.  Process and a program.  Real time operating system (RTOS).

Introduction to Operating Systems Concepts

Module 12: I/O Systems I/O hardware Application I/O Interface

Chapter 13: I/O Systems Modified by Dr. Neerja Mhaskar for CS 3SH3.

OPERATING SYSTEMS CS3502 Fall 2017

CMSC 611: Advanced Computer Architecture

Operating System Concepts

13: I/O Systems I/O hardwared Application I/O Interface

CS703 - Advanced Operating Systems

Multithreaded Programming

Chapter 13: I/O Systems I/O Hardware Application I/O Interface

Module 12: I/O Systems I/O hardwared Application I/O Interface

Presentation transcript:

A Programmable Processing Array Architecture Supporting Dynamic Task Scheduling and Module-Level Prefetching Junghee Lee *, Hyung Gyu Lee *, Soonhoi Ha †, Jongman Kim *, and Chrysostomos Nicopoulos ‡ Presented by Junghee Lee * †‡

2 Introduction Single Core Multi Core Many Core Fusion Programmable Hardware Accelerator Ex) GPGPU Powerful cores + H/W accelerator in a single die Ex) AMD Fusion Massively Parallel Processing Array

3 MPPA as Hardware Accelerator CPU I/O Massively Parallel Processing Array Core Tile Core Tile Host CPU Interface Device Memory Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Challenges Expressiveness Debugging Memory Hierarchy Design Challenges Expressiveness Debugging Memory Hierarchy Design

4 Related Works Expressiveness Debugging Memory GPGPU AMD Fusion GPGPU AMD Fusion Tilera Rigel Ambric SIMD Multiple debuggers Event graph Multiple debuggers Event graph Scratch-pad memory Cache Scratch-pad memory Cache Multi-threading Kahn process network Multiple debuggers Coherent cache Not addressed Software-managed cache Formal model Scratch-pad memory Proposed MPPA Event-driven model Inter-module debug Intra-module debug Inter-module debug Intra-module debug Scratch-pad memory Prefetching Scratch-pad memory Prefetching

5 Contents Introduction Execution Model Hardware Architecture Evaluation Conclusion

6 Execution Model Specification –Module = (b, P i, P o, C, F) b = Behavior of module P i = Input ports P o = Output ports C = Sensitivity list F = Prefetch list –Signal –Net = (d, K) d = Driver port K = A set of sink ports Semantics –A module is triggered when any signal connected to C changes –Function calls and memory accesses are limited to within a module –Non-blocking write and block read –The specification can be modified during run-time

7 Example Quick sort –Pivot is selected –The given array is partitioned so that The left segment should contain smaller elements than the pivot The right segment should contain larger elements than the pivot –Recursively partition the left and right segments Specifying quick sort –Multi-threading OK but hard to debug –SIMD Inefficient due to input dependency –Kahn process network Impossible due to the dynamic nature

8 Specify Quick Sort with Event-driven Model Partition module –b (behavior): select a pivot, partition the input array, instantiate another partition module if necessary –P i (input port): input array and its position –P o (output port): left and right segments and their position –C (sensitivity list): input array –P (prefetch list): input array Collection module –b (behavior): collect segments –P i (input port): sorted segments and intermediate result –P o (output port): final result and intermediate result –C (sensitivity list): sorted segments –P (prefetch list): sorted segments and intermediate result Partition Collection … Input array Final result Intermediate result

9 Contents Introduction Execution Model Hardware Architecture Evaluation Conclusion

10 MPPA Microarhitecture Core Tile Core Tile Host CPU Interface Device Memory Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile Core Tile E E Identical core tiles Consists of uCPU, scratch-pad memory, and peripherals that support the execution model One core tile is designated to an execution engine Identical core tiles Consists of uCPU, scratch-pad memory, and peripherals that support the execution model One core tile is designated to an execution engine Software running on a core tile Consists of scheduler, signal storage and interconnect directory Supports the execution model If necessary, it is split into multiple instances running on different core tiles Software running on a core tile Consists of scheduler, signal storage and interconnect directory Supports the execution model If necessary, it is split into multiple instances running on different core tiles

11 Core Tile Architecture Scratch Pad Memory For Current Module For Current Module For Next Module For Next Module Context Manager Input Signal Queue Output Signal Queue Message Queue Network Interface uCPU Prefetcher Message Handler Message Handler Generic small processor Treated as a black box Generic small processor Treated as a black box Software-managed on- chip SRAM Double buffering where one is for the current module and the other is for the next module to be prefetched Software-managed on- chip SRAM Double buffering where one is for the current module and the other is for the next module to be prefetched Prefetches the code and data of the next module while the current module is running on uCPU Counter-part of the prefetcher Sends data to the requester Counter-part of the prefetcher Sends data to the requester Switches the context when the current module finishes and the next module is ready Stores information about the modules Switches the context when the current module finishes and the next module is ready Stores information about the modules Stores the input data The actual data is stored in the SPM while its information is managed by this module Stores the input data The actual data is stored in the SPM while its information is managed by this module Stores the output data Notifies the update event to the interconnect directory when the output is updated Stores the output data Notifies the update event to the interconnect directory when the output is updated Handles the system messages NoC router

12 Execution Engine Most of its functionality is implemented in software while the hardware facilitates communication  Software implementation gives us flexibility in the number and location of the execution engine One way to visualize our MPPA is to regard the execution engine as an event-driven simulation kernel The execution engine interacts with modules running on other core tiles through messages TypeFromToPayload REQ_FETCH_MODULEPrefetcherSchedulerRequest a new module RES_FETCH_MODULESchedulerPrefetcherModule ID and list of input ports MODULE_INSTANCESchedulerPrefetcherCode of the module REQ_SIGNALPrefetcherInterconnectPort ID RES_SIGNALSignal storage or a node PrefetcherData

13 Components of Execution Engine Scheduler –Keeps track of the status and location of modules –Maintains three queues: wait, ready and run queue Signal storage –Stores signal values in the device memory –If a signal is updated but its value is still stored in the node, the signal storage invalidates its value and keeps the location of the latest value Interconnect directory –Keeps track of connectivity of signals and ports –Maintains the sensitivity list

14 Module-Level Prefetching Hides the overhead of the dynamic scheduling Prefetches the next module while the current module is running uCPU Prefetcher Scheduler Interconn. Directory Interconn. Directory Signal Storage Other Node Execute a module Memory access

15 Illustrative Example uCPU Prefetcher Out Sig Q Msg Handler uCPU Prefetcher Out Sig Q Msg Handler uCPU Prefetcher Out Sig Q Msg Handler Interconnect Directory Signal Storage Scheduler Wait Q Ready Q Run Q Partition 0 Partition 2 Partition 3 Partition 4 Partition 5 Collection Partition 4 Partition 1

16 Contents Introduction Execution Model Hardware Architecture Evaluation Conclusion

17 Benchmark Recognition, Synthesis and Mining (RMS) benchmark Fine-grained parallelism: dominated by short tasks –Small memory foot print –High run-time scheduling overhead Task-level parallelism: exhibits dependency –Hard to be implemented with GPGPU BenchmarkMinMaxAverage Forward Solve (FS) Backward Solve (BS) Cholesky Factorization (CF) Canny Edge Detection (CED) Binomial Tree (BT) Octree Partitioning (OP) Quick Sort (QS)

18 Simulator In-house cycle-level simulator Parameters ParameterValue Number of core tiles32 Memory access time1 cycle for scratch-pad memory 100 cycles for device memory Memory size8 KB scratch-pad memory 32 MB device memory Communication Delay4 cycles per hop

19 Utilization Benchmarks FSBSCFCEDBT Core utilization w/o prefetchingw/ prefetching OP QS

20 Scalability Number of core tiles Core utilization Util (1) Util (3) Execution time (cycles) Execution time (1) Execution time (3)

21 Conclusion This paper proposes a novel MPPA architecture that employs an event-driven execution model –Handles dependencies by dynamic scheduling –Hides dynamic scheduling overhead by module-level prefetching Future works –Supports applications that require larger memory footprint –Adjusts the number of execution engines dynamically –Supports inter-module debugging

22 Questions? Contact info Junghee Lee Electrical and Computer Engineering Georgia Institute of Technology

23 Thank you!