Advanced Science and Technology Letters Vol.43 (Multimedia 2013), pp. 67-70
http://dx.doi.org/10.14257/astl.2013.43.13

Superscalar GP-GPU design of SIMT architecture for parallel processing in the embedded environment

Kwang-yeob Lee1, Nak-woong Eum2, Jae-chang Kwak1*

1 Dept. of Computer Engineering, Computer Science*, SeoKyeong University, Jeongneung 4-dong, Seongbuk-gu, Seoul, Korea
kylee@skuniv.ac.kr, jckwak@skuniv.ac.kr*
2 Multimedia Processor Research Team, ETRI, Daejeon, Korea
nweum@etri.re.kr

Abstract. A superscalar GP-GPU with an SIMT architecture is implemented on the VC707 SoC platform. Current-generation GP-GPUs are used for general-purpose processing as well as graphics, and have become a credible alternative to general-purpose processors for many applications. This paper proposes a superscalar GP-GPU design with an SIMT architecture for parallel processing in the embedded environment. An application is parallelized and its performance is compared on an existing embedded multi-core CPU and on the implemented GPU. Parallel processing with the implemented GP-GPU improved performance by about 65%.

Keywords: SIMT, GP-GPU, Parallel Processing, Embedded System

1 Introduction

In recent times, parallelization of applications has become essential in the embedded environment. However, because a CPU in an embedded environment has a low operating frequency and a small number of cores, the benefit of parallelization is limited. The graphics processing unit (GPU), which was initially used only for graphics processing, is increasingly being applied to general-purpose workloads. A GPU consists of tens to thousands of cores that are simpler in structure than CPU cores, which facilitates parallelization [1]. In this study, we developed a GP-GPU with a superscalar single-instruction multiple-thread (SIMT) structure for parallelization in the embedded environment. Comparing the performance of a conventional embedded CPU and that of the developed GPU when parallelizing an application confirmed that the GPU improved parallelization performance by 65% on average.

2 SIMT GP-GPU Architecture

The structure of the GPU developed in this study is shown in Fig. 1. The GPU has an SIMT structure with 16 stream processors (SPs) per core. Threads are grouped into units called warps [2], and two instructions are fetched per warp. Each SP is assigned an odd-numbered warp and an even-numbered warp and processes both. GPUs with the conventional single-instruction multiple-data (SIMD) structure improve parallel performance by operating SPs that contain several ALUs in parallel [4, 5]. Each SP in the SIMD structure receives a single instruction and processes several data elements simultaneously. However, because every SP can execute a different instruction independently, a module that controls the execution flow is required for each SP. In the SIMT structure, by contrast, every SP processes the same instruction, so a single SP control unit can control all SPs, reducing hardware resource usage and power consumption.

Fig. 1. GPU high-level architecture: warp scheduler, instruction cache, stream processors (ALU, SFU, LD/ST units), crossbar, interconnection network, L1 cache, and DDR3 memory

2.1 Latency Hiding

The cache hit rate tends to be low in streaming applications. If cache misses occur frequently while memory instructions are processed, the pipeline stalls and parallel performance drops. Moreover, because hundreds of cycles are needed to read data from memory into the cache, a cache miss leads to performance degradation [6]. This study addresses the problem with the multi-threading method shown in Fig. 2. When a thread of a warp misses in the cache, the warp is placed on a waiting list and the memory request of the following warp is processed until the missing data are available. Warps in which not all threads have hit remain on the waiting list. While instructions of other warps are processed, the waiting warps are examined in round-robin order to check whether every data element has arrived. A warp whose threads have all received their data is sent to the write-back module. Because instructions of the following warps continue to be processed despite a cache miss in a given warp, the pipeline keeps operating without a stall, and the long latency of memory accesses is hidden.

Fig. 2. Memory latency hiding method: warps flow through the instruction cache, decode, register file access, and ALU/LD-ST stages; a warp that misses in the shared memory/L1 cache waits until all of its data are hit before being passed to write-back
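To make the waiting-list scheme concrete, the following is a minimal software sketch of the behavior described above, written in C purely for illustration; the structure layout, the 16-warp count, and the per-cycle function boundaries are assumptions of the sketch, not the implemented hardware design.

/* Sketch of the warp waiting-list scheme: a warp that takes a cache miss
 * is parked, issue continues with the next ready warp, and parked warps
 * are re-checked round robin until all of their threads have data. */
#include <stdbool.h>

#define NUM_WARPS 16

typedef struct {
    bool waiting;   /* parked on the waiting list after a cache miss   */
    bool all_hit;   /* set once every thread of the warp has its data  */
} warp_t;

static warp_t warps[NUM_WARPS];
static int rr;      /* round-robin pointer over the warps              */

/* Issue stage: pick the next warp that is not waiting on memory, so the
 * pipeline keeps running even while some warps wait for their data.    */
int select_issue_warp(void)
{
    for (int i = 0; i < NUM_WARPS; i++) {
        int w = (rr + i) % NUM_WARPS;
        if (!warps[w].waiting)
            return w;
    }
    return -1;      /* every warp is waiting on memory                  */
}

/* Each cycle, waiting warps are re-examined in round-robin order; a warp
 * whose threads have all hit is released toward the write-back stage.   */
void release_ready_warps(void)
{
    for (int i = 0; i < NUM_WARPS; i++) {
        int w = (rr + i) % NUM_WARPS;
        if (warps[w].waiting && warps[w].all_hit)
            warps[w].waiting = false;   /* proceed to write-back */
    }
    rr = (rr + 1) % NUM_WARPS;
}

The point mirrored here is that issue never blocks on a parked warp, and a parked warp is released only once every one of its threads has its data.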

3 Experimental Results

The Xilinx VC707 FPGA platform was used as the experimental environment, with an operating frequency of 50 MHz. As the verification application, an integral image creation algorithm, which is frequently used in image processing and recognition, was parallelized.

Table 1. Comparison of runtime for parallelization of integral image creation (unit: ms)

Processor        1 Core    2 Core    3 Core    4 Core    Frequency
ARM Cortex-A9    31.24     19.46     14.57     11.46     1.4 GHz
ARM1176JZF       318.14    N/A       N/A       N/A       700 MHz
Ours (GPU)       150.89    -         -         -         50 MHz

For parallelization on the CPUs of the embedded platforms, OpenMP was used. An image of 307,200 pixels was converted into an integral image. The results are presented in Table 1. Because the runtimes differ owing to the different core operating frequencies of the platforms, the clocks used to process a single pixel were also compared; this comparison is shown in Fig. 3 (clocks per pixel = operating frequency × runtime / number of pixels). The experiment showed that the GPU increased parallelization performance by approximately 34% compared with parallelization on the quad-core Odroid-X embedded platform and by approximately 96% compared with the single-core Raspberry Pi, an average performance improvement of 65.7%.
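For reference, the CPU baselines parallelized the integral image creation with OpenMP. Below is a minimal sketch of one such kernel, assuming a straightforward two-pass formulation (row prefix sums, then column prefix sums) over an 8-bit grayscale input of 307,200 pixels; it illustrates the kind of loop-level parallelism involved, not the exact code used in the experiment.

#include <stdint.h>

/* Two-pass integral image: rows are independent in pass 1 and columns are
 * independent in pass 2, so each pass parallelizes with a simple omp for. */
void integral_image(const uint8_t *src, uint32_t *dst, int w, int h)
{
    /* Pass 1: cumulative sum along each row. */
    #pragma omp parallel for
    for (int y = 0; y < h; y++) {
        uint32_t acc = 0;
        for (int x = 0; x < w; x++) {
            acc += src[y * w + x];
            dst[y * w + x] = acc;
        }
    }

    /* Pass 2: cumulative sum down each column of the row sums. */
    #pragma omp parallel for
    for (int x = 0; x < w; x++) {
        uint32_t acc = 0;
        for (int y = 0; y < h; y++) {
            acc += dst[y * w + x];
            dst[y * w + x] = acc;
        }
    }
}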

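As a worked instance of the clocks-per-pixel formula, using the Table 1 values (a derived illustration, not a figure reported separately in the paper):

GPU (50 MHz): 50,000,000 × 0.15089 s / 307,200 pixels ≈ 24.6 clocks per pixel
ARM1176JZF (700 MHz): 700,000,000 × 0.31814 s / 307,200 pixels ≈ 725 clocks per pixel

The roughly 96% reduction for the GPU relative to the single-core ARM1176JZF is consistent with the improvement quoted above.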
Fig. 3. Results of comparison of clock cycles used for processing a single pixel

4 Conclusion

In this study, we developed a superscalar GP-GPU with an SIMT structure on the Xilinx Virtex-7 VC707 FPGA platform and compared its parallelization performance with that of a conventional CPU in an embedded environment. For the comparison, the execution speed of a parallelized integral image creation algorithm was measured. The parallelization performance improved by approximately 65% on average over the conventional embedded CPUs. Given this result, the GPU is expected to be used for parallelization in embedded environments. Furthermore, the developed GPU will be extended to a multi-core design to increase the effect of parallelization.

Acknowledgments. This work was supported by the IT R&D program of MOTIE/KEIT [10035152, Energy Scalable Vector Processor - Primary Technology].

References

1. Advanced Micro Devices, Inc.: ATI CTM Guide, 1.01 edition (2006)
2. Lindholm, E., Nickolls, J., Oberman, S., Montrym, J.: NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro 28(2), 39--55 (2008)
3. NVIDIA: CUDA Technology, http://www.nvidia.com/CUDA
4. Levinthal, A., Porter, T.: Chap - A SIMD Graphics Processor. In: SIGGRAPH, pp. 77--82 (1984)
5. Lorie, R. A., Strong, H. R.: Method for Conditional Branch Execution in SIMD Vector Processors. US Patent 4,435,758 (1984)
6. Fung, W. W. L., Sham, I., Yuan, G., Aamodt, T. M.: Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In: 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), pp. 407--420 (2007)