Exploiting Execution Order and Parallelism from Processing Flow: Applying Pipeline-based Programming Method on Manycore Accelerators
Shinichi Yamagiwa, University of Tsukuba, Japan
Table of contents
1. Research backgrounds: flow-model based programming; graphical programming on accelerators using flow-models
2. Finding an execution order
3. Parallelism Extraction Algorithm
4. Performance evaluation using manycore accelerators
5. Conclusions
Background – programming on manycore accelerators
- The programmer must write both a CPU program and a GPU (kernel) program.
- The accelerator is attached to the CPU's peripheral bus (PCI Express).
- The CPU executes the controlling program: it downloads the kernel program to the accelerator, the kernel is executed, and the CPU reads back the results.
- We need a strategy for mapping/unmapping the kernel programs onto the accelerator in a suitable order.
Flow-model based programming: the Caravela platform
- Design a flow-model that embeds a kernel program written using DirectX, GLSL, CUDA, or OpenCL; the flow-model is stored in an XML file.
- Design a CPU program that maps the flow-model to the accelerator via the Caravela library; the flow-model is then executed.
Advantages:
- The programmer focuses on designing the flow-model.
- Flow-models are treated like libraries for stream computing.
- Execution timing is automatically optimized.
Graphical programming on manycore accelerators
- How do we assign manycore accelerators to flow-models and derive the execution flow automatically?
- How do we achieve optimized pipeline execution with concurrent execution?
Exploiting the execution order and parallelism from a pipeline flow
- Explicit parallelism: intuitively, these flow-models can be executed in parallel, so we assign multiple flow-models to the available accelerators.
- Intuitively, we can also see the execution order, so we can assign an accelerator to each flow-model one by one.
- Implicit parallelism: two flow-models can be executed in parallel when their buffers are used independently.
How can we exploit an execution order and the parallelism?
- Execution ordering: how do we decide the execution order? Loop detection?
- Elimination of buffer collisions: how do we know which flow-models are concurrently executable?
These questions arise when we consider continuous pipeline execution.
Research objective
Graphical programming using flow-models needs:
- Finding a deterministic execution order
- Extracting parallelism: implicit and explicit parallelism
- An automatic pipeline order defined for optimized pipeline execution
We propose two algorithms: (1) finding a deterministic execution order and (2) the Parallelism Extraction Algorithm.
Strategy
- Finding a deterministic execution order: finding the first executable flow-model
- Parallelism Extraction Algorithm:
  1. Finding an execution order
  2. Extracting the implicit parallelism
  3. Extracting the explicit parallelism
Basic execution condition: a flow-model can be executed when all of its input data are ready.
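The basic execution condition can be sketched as a small check (the class and attribute names here are illustrative, not Caravela's actual API):

```python
# Hypothetical sketch of the basic execution condition: a flow-model is
# executable only when every one of its input buffers holds ready data.
class FlowModel:
    def __init__(self, name, input_buffers):
        self.name = name
        self.input_buffers = input_buffers

    def is_executable(self, ready_buffers):
        # fires only when all input data are ready
        return all(b in ready_buffers for b in self.input_buffers)

fft = FlowModel("FFT2D", ["image_in"])
print(fft.is_executable(set()))          # False: no input ready yet
print(fft.is_executable({"image_in"}))   # True: all inputs ready
```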
Finding the first executable flow-model (Yamagiwa and Sousa, IJPEDS, 2008, World Scientific Pub.)
[Step 1] Enumerate all cyclic paths from all nodes.
[Step 2] Sort the cyclic paths by the number of nodes included in each path.
[Step 3] Reduce the cyclic paths to the minimum set.
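The three steps above can be sketched as follows; this is a minimal illustration with an assumed adjacency-list graph representation, not the published implementation:

```python
def enumerate_cycles(graph):
    """[Step 1] Enumerate all simple cyclic paths from all nodes,
    collected as node sets; [Step 2] sort them by node count."""
    cycles = set()

    def dfs(start, node, path):
        for nxt in graph.get(node, ()):
            if nxt == start:
                cycles.add(frozenset(path))   # closed a cycle back to start
            elif nxt not in path:
                dfs(start, nxt, path + [nxt])

    for n in graph:
        dfs(n, n, [n])
    return sorted(cycles, key=len)

def minimum_cycle_set(sorted_cycles):
    """[Step 3] Reduce to the minimum set: drop any cycle that
    contains a smaller cycle already kept."""
    minimal = []
    for c in sorted_cycles:
        if not any(kept <= c for kept in minimal):
            minimal.append(c)
    return minimal

# Small feedback graph: A -> B -> C -> A, plus the inner loop C -> B
g = {"A": ["B"], "B": ["C"], "C": ["A", "B"]}
print(minimum_cycle_set(enumerate_cycles(g)))
```

On this graph the two cycles {A, B, C} and {B, C} are found, and the reduction keeps only the smaller {B, C}.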
Parallelism Extraction Algorithm (PEA)
1. Define the execution order by grouping three flow-models and the sub-graphs.
2. Number the groups 0, 1 and 2.
3. List the flow-models with the same number in the execution list.
4. Recursively repeat the operations above on the sub-graphs.
Grouping three flow-models and the sub-graphs
- Group sub-graphs of one or more flow-models.
- Organize the graph into three sub-graphs.
Numbering 0, 1 and 2 to the groups
Number the sub-graphs 0, 1 and 2, starting from the first executable flow-model.
Listing the flow-models with the same number in the execution list
Parallelism is extracted from these sub-graphs.
Recursively repeating the previous operations on the sub-graphs
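As a simplified stand-in for PEA on an acyclic flow (the actual algorithm recursively groups the graph into three numbered sub-graphs), grouping flow-models by topological level yields the same kind of execution list: models at the same level have no mutual dependency and may execute concurrently.

```python
def execution_levels(graph):
    """Group flow-models into execution steps: each inner list holds
    models with no mutual dependency, so they may run concurrently.
    (Simplified stand-in for PEA; assumes an acyclic adjacency-list graph.)"""
    indeg = {n: 0 for n in graph}
    for succs in graph.values():
        for s in succs:
            indeg[s] = indeg.get(s, 0) + 1
    current = sorted(n for n, d in indeg.items() if d == 0)
    levels = []
    while current:
        levels.append(current)
        ready = set()
        for n in current:
            for s in graph.get(n, ()):
                indeg[s] -= 1
                if indeg[s] == 0:
                    ready.add(s)
        current = sorted(ready)
    return levels

# Diamond-shaped flow: B and C depend only on A; D joins them.
diamond = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(execution_levels(diamond))  # [['A'], ['B', 'C'], ['D']]
```

Here B and C land in the same execution step, so the maximum parallelism of this flow is 2.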
Implementation of the Parallelism Extraction Algorithm
We introduce:
- Execute matrix: ordering information saved in the columns; parallel flows (flow-models with the same number) saved in the rows
- Serialize array: marks the serialized pattern at every recursive iteration
- Batch matrix: saves the pipeline execution
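A possible shape for the Execute matrix is sketched below; the row/column layout follows the slide, while the padding with None is my own assumption:

```python
def to_execute_matrix(execution_list):
    """Build an Execute matrix: each column is one execution step
    (the ordering), each row holds flow-models that run in parallel."""
    rows = max(len(step) for step in execution_list)
    return [[step[r] if r < len(step) else None for step in execution_list]
            for r in range(rows)]

m = to_execute_matrix([["A"], ["B", "C"], ["D"]])
print(m[0])  # ['A', 'B', 'D']
print(m[1])  # [None, 'C', None]
```

Column 1 holds the two concurrently executable flow-models B and C, so the number of rows gives the maximum parallelism.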
Example: straight flow (A → B → C → D → E). Maximum parallelism is 3.
Example: flow with feedbacks. Maximum parallelism is 2.
Performance evaluation: image filtering
- Pipeline: 2D FFT → high/low-pass filter → 2D IFFT
- 13 flow-models are included in the pipeline; after IFFT2, the results are generated.
- Using PEA: determine the execution flow and extract the parallelism, then execute on CarSh.
CarSh: command-line interface for manycore accelerators (Yamagiwa and Zhang, ICCS 2013)
- exec/batch execA execB; exec/batch execC
- repeat 3: repeats an exec/batch three times
- execA & execB & sync execC: background execution with synchronization
The processing flow is expressed as a CarSh batch.
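The background-execution-plus-sync pattern that CarSh provides can be illustrated by a thread analogy (this is not CarSh code, only the synchronization pattern it expresses):

```python
import threading

log = []

def exec_flow(name):
    # stand-in for submitting one flow-model for execution
    log.append(name)

# execA & execB &  -> launch both in the background
a = threading.Thread(target=exec_flow, args=("execA",))
b = threading.Thread(target=exec_flow, args=("execB",))
a.start()
b.start()

# sync -> wait for the background executions to finish
a.join()
b.join()

# execC runs only after A and B have completed
exec_flow("execC")
print(log[-1])  # execC
```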
Applying PEA to the image filtering pipeline: maximum parallelism is 7.
Performance results (OpenCL on CPU and GPU)
We measured:
- the average time of a pipeline stage at every IFFT2
- the speedup with and without parallelization
CPU case: 4.9 times faster; GPU case: 1.4 times faster.
Conclusions and future direction
- Graphical programming for manycore accelerators
- Flow-model based programming needs: finding an execution flow; parallelism extraction in the pipeline flow
- Parallelism Extraction Algorithm: numbering 0, 1 and 2 to flow-models
- We are now implementing it on the GUI…
Eclipse plug-in for Caravela platform CarSh environment