A Programmable Processing Array Architecture Supporting Dynamic Task Scheduling and Module-Level Prefetching
Junghee Lee *, Hyung Gyu Lee *, Soonhoi Ha †, Jongman Kim *, and Chrysostomos Nicopoulos ‡
Presented by Junghee Lee
2 Introduction
–Single core, multi core, many core, fusion
–Many core: programmable hardware accelerator (e.g., GPGPU)
–Fusion: powerful cores + H/W accelerator in a single die (e.g., AMD Fusion)
–Massively Parallel Processing Array
3 MPPA as Hardware Accelerator
[Figure: a CPU and I/O attached to a Massively Parallel Processing Array, which consists of a host CPU interface, device memory, and a grid of core tiles]
Challenges
–Expressiveness
–Debugging
–Memory hierarchy design
4 Related Works
Architecture | Expressiveness | Debugging | Memory
GPGPU, AMD Fusion | SIMD | Multiple debuggers, event graph | Scratch-pad memory, cache
Tilera | Multi-threading | Multiple debuggers | Coherent cache
Rigel | Multi-threading | Not addressed | Software-managed cache
Ambric | Kahn process network | Formal model | Scratch-pad memory
Proposed MPPA | Event-driven model | Inter-module debug, intra-module debug | Scratch-pad memory, prefetching
5 Contents Introduction Execution Model Hardware Architecture Evaluation Conclusion
6 Execution Model
Specification
–Module = (b, P i, P o, C, F)
b = Behavior of module
P i = Input ports
P o = Output ports
C = Sensitivity list
F = Prefetch list
–Signal
–Net = (d, K)
d = Driver port
K = A set of sink ports
Semantics
–A module is triggered when any signal connected to C changes
–Function calls and memory accesses are limited to within a module
–Non-blocking write and blocking read
–The specification can be modified during run-time
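The specification above can be written down as plain data structures. The following is a minimal sketch, assuming that ports are identified by globally unique strings; the class fields, the helper `triggered_modules`, and the port names in the usage note are all hypothetical and only illustrate the triggering rule.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List


@dataclass(frozen=True)
class Module:
    name: str
    behavior: Callable          # b: behavior of the module
    inputs: FrozenSet[str]      # P_i: input ports
    outputs: FrozenSet[str]     # P_o: output ports
    sensitivity: FrozenSet[str] # C: ports whose change triggers the module
    prefetch: FrozenSet[str]    # F: ports whose data is fetched in advance


@dataclass(frozen=True)
class Net:
    driver: str                 # d: driver port
    sinks: FrozenSet[str]       # K: set of sink ports


def triggered_modules(changed_port: str, nets: List[Net], modules: List[Module]) -> List[str]:
    """Return the names of modules triggered when the signal on changed_port changes."""
    woken: List[str] = []
    for net in nets:
        if net.driver != changed_port:
            continue
        for module in modules:
            # A module wakes up only if a sink of this net is in its sensitivity list.
            if module.sensitivity & net.sinks and module.name not in woken:
                woken.append(module.name)
    return woken
```

For example, with a net driven by `"src.out"` whose only sink is `"part.in"`, a module that lists `"part.in"` in C is triggered, while a module that merely reads that port without being sensitive to it is not.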
7 Example
Quick sort
–A pivot is selected
–The given array is partitioned so that
The left segment contains elements smaller than the pivot
The right segment contains elements larger than the pivot
–The left and right segments are recursively partitioned
Specifying quick sort
–Multi-threading: OK but hard to debug
–SIMD: inefficient due to input dependency
–Kahn process network: impossible due to the dynamic nature
8 Specify Quick Sort with Event-driven Model
Partition module
–b (behavior): select a pivot, partition the input array, instantiate another partition module if necessary
–P i (input port): input array and its position
–P o (output port): left and right segments and their positions
–C (sensitivity list): input array
–F (prefetch list): input array
Collection module
–b (behavior): collect segments
–P i (input port): sorted segments and intermediate result
–P o (output port): final result and intermediate result
–C (sensitivity list): sorted segments
–F (prefetch list): sorted segments and intermediate result
[Figure: Partition, …, Collection; signals labeled Input array, Final result, and Intermediate result]
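The two modules can be mimicked in software with a single event queue. This is only an illustrative sketch, not the MPPA implementation: the function name `event_driven_quicksort`, the choice of the first element as pivot, and the use of one sequential queue in place of many core tiles are assumptions.

```python
from collections import deque


def event_driven_quicksort(data):
    """Sort data by processing (position, segment) events, one per module activation."""
    result = [None] * len(data)        # intermediate result kept by the collection module
    events = deque([(0, list(data))])  # the input array and its position trigger the first partition

    while events:
        position, segment = events.popleft()
        if len(segment) <= 1:
            # Collection module: a segment of length 0 or 1 is already sorted,
            # so it is written into the intermediate result at its position.
            result[position:position + len(segment)] = segment
            continue
        # Partition module: select a pivot and split the input array.
        pivot = segment[0]
        left = [x for x in segment[1:] if x < pivot]
        right = [x for x in segment[1:] if x >= pivot]
        # Each output acts as a new signal: segments of length 2 or more
        # instantiate another partition, shorter ones go to the collection.
        events.append((position, left))
        events.append((position + len(left), [pivot]))
        events.append((position + len(left) + 1, right))

    return result
```

The number of partition activations depends on the input values, which is the dynamic behavior that a statically connected process network cannot express.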
9 Contents Introduction Execution Model Hardware Architecture Evaluation Conclusion
10 MPPA Microarchitecture
[Figure: host CPU interface, device memory, and a grid of core tiles; tiles marked E are execution engines]
Core tile
–Core tiles are identical
–Consists of uCPU, scratch-pad memory, and peripherals that support the execution model
–One core tile is designated as an execution engine
Execution engine
–Software running on a core tile
–Consists of scheduler, signal storage and interconnect directory
–Supports the execution model
–If necessary, it is split into multiple instances running on different core tiles
11 Core Tile Architecture
[Figure: block diagram of a core tile]
uCPU
–Generic small processor
–Treated as a black box
Scratch-pad memory (SPM)
–Software-managed on-chip SRAM
–Double buffering, where one buffer is for the current module and the other is for the next module to be prefetched
Prefetcher
–Prefetches the code and data of the next module while the current module is running on uCPU
Message handler
–Counterpart of the prefetcher
–Sends data to the requester
Context manager
–Switches the context when the current module finishes and the next module is ready
–Stores information about the modules
Input signal queue
–Stores the input data
–The actual data is stored in the SPM while its information is managed by this module
Output signal queue
–Stores the output data
–Notifies the interconnect directory of the update event when the output is updated
Message queue
–Handles the system messages
Network interface
–NoC router
12 Execution Engine
–Most of its functionality is implemented in software while the hardware facilitates communication
–Software implementation gives us flexibility in the number and location of the execution engines
–One way to visualize our MPPA is to regard the execution engine as an event-driven simulation kernel
–The execution engine interacts with modules running on other core tiles through messages
Type | From | To | Payload
REQ_FETCH_MODULE | Prefetcher | Scheduler | Request a new module
RES_FETCH_MODULE | Scheduler | Prefetcher | Module ID and list of input ports
MODULE_INSTANCE | Scheduler | Prefetcher | Code of the module
REQ_SIGNAL | Prefetcher | Interconnect | Port ID
RES_SIGNAL | Signal storage or a node | Prefetcher | Data
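The message table can be encoded directly. The sketch below assumes one plausible ordering of the messages when a prefetcher fetches a single module; the ordering, the `Message` fields, and the function `prefetch_handshake` are hypothetical, and only the message type names come from the table.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import List, Sequence


class MsgType(Enum):
    REQ_FETCH_MODULE = auto()
    RES_FETCH_MODULE = auto()
    MODULE_INSTANCE = auto()
    REQ_SIGNAL = auto()
    RES_SIGNAL = auto()


@dataclass(frozen=True)
class Message:
    type: MsgType
    src: str
    dst: str
    payload: object = None


def prefetch_handshake(module_id: int, input_ports: Sequence[str], code: bytes) -> List[Message]:
    """List the messages exchanged to prefetch one module (assumed ordering)."""
    msgs = [
        Message(MsgType.REQ_FETCH_MODULE, "prefetcher", "scheduler"),
        Message(MsgType.RES_FETCH_MODULE, "scheduler", "prefetcher", (module_id, tuple(input_ports))),
        Message(MsgType.MODULE_INSTANCE, "scheduler", "prefetcher", code),
    ]
    for port in input_ports:
        # One request/response pair per input port; the response carries the signal data,
        # which comes from the signal storage or from the node holding the latest value.
        msgs.append(Message(MsgType.REQ_SIGNAL, "prefetcher", "interconnect", port))
        msgs.append(Message(MsgType.RES_SIGNAL, "signal storage or a node", "prefetcher"))
    return msgs
```

Under this assumption a module with n input ports costs 3 + 2n messages to prefetch, all of which can overlap with the execution of the current module.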
13 Components of Execution Engine Scheduler –Keeps track of the status and location of modules –Maintains three queues: wait, ready and run queue Signal storage –Stores signal values in the device memory –If a signal is updated but its value is still stored in the node, the signal storage invalidates its value and keeps the location of the latest value Interconnect directory –Keeps track of connectivity of signals and ports –Maintains the sensitivity list
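The scheduler's three queues can be sketched as follows. The transition rules are assumptions made for illustration: a module moves from wait to ready when it is triggered, from ready to run when a prefetcher requests a module, and back to wait when it finishes. The class and method names are hypothetical.

```python
from collections import deque


class Scheduler:
    """Tracks module status (wait, ready, run) and the tile each running module is on."""

    def __init__(self):
        self.wait_q = set()     # modules waiting for a signal in their sensitivity list
        self.ready_q = deque()  # triggered modules, in trigger order
        self.run_q = {}         # module ID -> ID of the core tile running it

    def add_module(self, module_id):
        self.wait_q.add(module_id)

    def on_signal_change(self, module_id):
        # Called when the interconnect directory reports a change on a sensitive port.
        if module_id in self.wait_q:
            self.wait_q.remove(module_id)
            self.ready_q.append(module_id)

    def on_fetch_request(self, tile_id):
        # Serve a prefetcher request: hand out the oldest ready module, if any.
        if not self.ready_q:
            return None
        module_id = self.ready_q.popleft()
        self.run_q[module_id] = tile_id
        return module_id

    def on_finish(self, module_id):
        # Assumption: a finished module waits for its next trigger.
        del self.run_q[module_id]
        self.wait_q.add(module_id)
```

Because the scheduler records the tile of every running module, it can answer both status and location queries, as the slide requires.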
14 Module-Level Prefetching
–Hides the overhead of the dynamic scheduling
–Prefetches the next module while the current module is running
[Figure: diagram of the uCPU, prefetcher, scheduler, interconnect directory, signal storage, and another node; legend: execute a module, memory access]
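A simple timing model shows why overlapping the fetch with execution helps. This is a back-of-the-envelope sketch, not the authors' simulator: it assumes one core tile, a fixed execution order, and that each fetch can fully overlap with the execution of the preceding module. The function name and the cycle counts in the usage note are made up.

```python
def total_cycles(modules, prefetch):
    """Cycles to run modules in order; each entry is (fetch_cycles, exec_cycles)."""
    if not modules:
        return 0
    if not prefetch:
        # Every module is fetched, then executed, with no overlap.
        return sum(fetch + execute for fetch, execute in modules)
    total = modules[0][0]  # the first fetch cannot be hidden
    for i, (_, execute) in enumerate(modules):
        next_fetch = modules[i + 1][0] if i + 1 < len(modules) else 0
        # The uCPU executes module i while the prefetcher fetches module i + 1;
        # the tile moves on when both are done.
        total += max(execute, next_fetch)
    return total
```

For three modules with (fetch, execute) costs of (100, 300), (100, 50), and (100, 200) cycles, the model gives 850 cycles without prefetching and 700 with it. The fetch is fully hidden only when the current module runs at least as long as the next fetch takes.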
15 Illustrative Example
[Figure: three core tiles (each with uCPU, prefetcher, output signal queue, and message handler) interacting with the interconnect directory, signal storage, and scheduler (wait, ready, and run queues) while Partition 0 to Partition 5 and Collection modules are scheduled]
16 Contents Introduction Execution Model Hardware Architecture Evaluation Conclusion
17 Benchmark
Recognition, Mining and Synthesis (RMS) benchmark
Fine-grained parallelism: dominated by short tasks
–Small memory footprint
–High run-time scheduling overhead
Task-level parallelism: exhibits dependency
–Hard to implement with GPGPU
[Table: Min, Max, and Average per benchmark; values not preserved]
–Forward Solve (FS)
–Backward Solve (BS)
–Cholesky Factorization (CF)
–Canny Edge Detection (CED)
–Binomial Tree (BT)
–Octree Partitioning (OP)
–Quick Sort (QS)
18 Simulator
In-house cycle-level simulator
Parameter | Value
Number of core tiles | 32
Memory access time | 1 cycle for scratch-pad memory, 100 cycles for device memory
Memory size | 8 KB scratch-pad memory, 32 MB device memory
Communication delay | 4 cycles per hop
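The parameters can be collected in a small configuration object. The latency formula below is an assumption for illustration only (a request and a response that each cross the same number of hops, plus one device memory access); it is not taken from the simulator, and the names `SimParams` and `remote_read_cycles` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimParams:
    num_core_tiles: int = 32
    spm_access_cycles: int = 1             # scratch-pad memory access time
    device_mem_access_cycles: int = 100    # device memory access time
    spm_bytes: int = 8 * 1024              # 8 KB scratch-pad memory
    device_mem_bytes: int = 32 * 1024 * 1024  # 32 MB device memory
    hop_delay_cycles: int = 4              # communication delay per hop


def remote_read_cycles(params: SimParams, hops: int) -> int:
    """Estimated cycles for one read served from device memory over the NoC (assumed model)."""
    round_trip = 2 * hops * params.hop_delay_cycles
    return round_trip + params.device_mem_access_cycles
```

Under this assumption a read that is 3 hops away costs 124 cycles, compared with 1 cycle for data already in the scratch-pad memory, which is the gap that module-level prefetching aims to hide.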
19 Utilization
[Figure: core utilization for each benchmark (FS, BS, CF, CED, BT, OP, QS), without and with prefetching]
20 Scalability
[Figure: core utilization and execution time (cycles) versus number of core tiles; series: Util (1), Util (3), Execution time (1), Execution time (3)]
21 Conclusion
This paper proposes a novel MPPA architecture that employs an event-driven execution model
–Handles dependencies by dynamic scheduling
–Hides dynamic scheduling overhead by module-level prefetching
Future work
–Support applications that require a larger memory footprint
–Adjust the number of execution engines dynamically
–Support inter-module debugging
22 Questions? Contact info Junghee Lee Electrical and Computer Engineering Georgia Institute of Technology
23 Thank you!