Advanced Science and Technology Letters Vol.43 (Multimedia 2013), pp. 67-70

Superscalar GP-GPU design of SIMT architecture for parallel processing in the embedded environment

Kwang-yeob Lee 1, Nak-woong Eum 2, Jae-chang Kwak 1*

1 Dept. of Computer Engineering, Dept. of Computer Science*, Seokyeong University, Jeongneung 4-dong, Seongbuk-gu, Seoul, Korea
2 Multimedia Processor Research Team, ETRI, Daejeon, Korea

Abstract. A superscalar GP-GPU with an SIMT architecture is implemented on the Xilinx VC707 FPGA platform. Current-generation GP-GPUs are used not only for graphics processing but also for general-purpose computation, and they have become a credible alternative to general-purpose processors for many applications. This paper proposes the design of a superscalar GP-GPU with an SIMT architecture for parallel processing in the embedded environment. An application is parallelized and its performance is compared between existing multi-core CPUs of the embedded environment and the implemented GPU. Parallel processing with the implemented GP-GPU improved performance by about 65%.

Keywords: SIMT, GP-GPU, Parallel Processing, Embedded System

1 Introduction

In recent times, parallelization of applications has become essential in the embedded environment. However, because a CPU in an embedded environment has a low operating frequency and a small number of cores, the degree of parallelization is limited. The graphics processing unit (GPU), which was initially used only for graphics data processing, is increasingly being considered for general-purpose use. A GPU consists of tens to thousands of simple cores and has a simpler structure than a CPU, which makes parallelization easier [1]. In this study, we developed a GP-GPU with a superscalar single-instruction, multiple-thread (SIMT) structure for parallelization in the embedded environment. Comparing the performance of a conventional CPU in the embedded environment with that of the developed GPU when parallelizing an application confirmed that the GPU improved parallelization performance by 65% on average.

2 SIMT GP-GPU Architecture

The structure of the GPU developed in this study is shown in Fig. 1. The GPU has an SIMT structure with 16 stream processors (SPs) per core. Threads are grouped into units called warps [2], and two instructions are fetched per warp. Each SP is assigned one odd-numbered warp and one even-numbered warp and processes them alternately. GPUs with the conventional single-instruction, multiple-data (SIMD) structure improve parallel performance by building SPs that contain several ALUs operating in parallel [4, 5]. Each SP in the SIMD structure receives a single instruction and processes several data elements simultaneously; however, because every SP can independently execute a different instruction, a control module that manages the execution flow must be provided for each SP. In the SIMT structure, by contrast, every SP processes the same instruction, so a single SP control unit can drive all SPs, reducing hardware resource usage and power consumption.

Fig. 1. GPU high-level architecture (warp scheduler, instruction cache, stream processors with crossbar-connected OC, ALU, SFU, and LD/ST units, L1 cache, interconnection network, DDR3 memory)
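To make the contrast concrete, the following is a minimal behavioral sketch of the SIMT issue model just described, not the RTL of the implemented design: one control unit decodes an instruction, every SP applies it to its own lane of the warp, and one even-numbered and one odd-numbered warp are issued per cycle. The type and function names (Instruction, Warp, executeOnAllLanes, issueCycle) and the simplified per-lane register file are illustrative assumptions.

```cpp
// Behavioral sketch of SIMT lockstep execution with a single control unit.
#include <array>
#include <cstdint>
#include <vector>

constexpr int NUM_SP = 16;          // stream processors per core (from the paper)

struct Instruction {                // simplified: add/mul on per-thread registers
    enum class Op { Add, Mul } op;
    int dst, srcA, srcB;
};

struct Warp {
    int id = 0;
    // one register-file slice per thread (lane); 8 registers per lane here
    std::array<std::array<int32_t, 8>, NUM_SP> regs{};
};

// One control unit drives all lanes: the same decoded instruction is applied
// by every SP to its own lane of the warp's register file (SIMT lockstep),
// so no per-SP control logic is needed.
void executeOnAllLanes(Warp& w, const Instruction& ins) {
    for (int lane = 0; lane < NUM_SP; ++lane) {
        auto& r = w.regs[lane];
        r[ins.dst] = (ins.op == Instruction::Op::Add)
                         ? r[ins.srcA] + r[ins.srcB]
                         : r[ins.srcA] * r[ins.srcB];
    }
}

// Dual issue: pick one even-numbered and one odd-numbered warp each cycle,
// mirroring the "odd warp and even warp per SP" scheduling in the text.
void issueCycle(std::vector<Warp>& evenWarps, std::vector<Warp>& oddWarps,
                const Instruction& insEven, const Instruction& insOdd,
                std::size_t cycle) {
    if (!evenWarps.empty()) executeOnAllLanes(evenWarps[cycle % evenWarps.size()], insEven);
    if (!oddWarps.empty())  executeOnAllLanes(oddWarps[cycle % oddWarps.size()], insOdd);
}
```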
2.1 Latency Hiding

The cache hit rate tends to be low in streaming applications. If cache misses occur frequently while memory instructions are processed, the pipeline stalls and parallel performance drops. Moreover, since hundreds of cycles are needed to read data from memory into the cache, a cache miss causes significant performance degradation [6]. This study addresses the problem with the multi-threading method shown in Fig. 2. When a thread of a warp misses in the cache, the warp is placed on a waiting list and the memory request of the following warp is processed. Warps whose threads have not all received their data remain on the waiting list. While instructions of other warps are being processed, the waiting warps are examined in round-robin order to check whether every data element has arrived; a warp whose threads have all hit is sent to the write-back module. Because instructions of subsequent warps continue to be processed despite a cache miss in a given warp, the pipeline keeps operating without a stall, and the long latency of memory accesses is hidden.

Fig. 2. Memory latency hiding method (pipeline stages: instruction cache, decode, register file access, shared memory / L1 cache, ALU / LD-ST, write back; warps with all data hit pass through the MUX to write back, while warps with misses remain on the waiting list)
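As a rough illustration of this waiting-list mechanism, the C++ sketch below models the behavior described above. It is a behavioral approximation under assumed names (PendingWarp, WarpScheduler, pollWaitingList, onDataReturn), not the actual hardware pipeline.

```cpp
// Behavioral model of the latency-hiding scheme in Section 2.1: a warp whose
// memory access misses in the cache is parked on a waiting list while later
// warps keep issuing; waiting warps are re-checked round-robin and released
// to write-back only once every thread's data has been returned.
#include <bitset>
#include <deque>
#include <iostream>

constexpr int WARP_SIZE = 16;

struct PendingWarp {
    int id = 0;
    std::bitset<WARP_SIZE> hit;     // per-thread "data arrived" flags
    bool allHit() const { return hit.all(); }
};

class WarpScheduler {
public:
    void onCacheMiss(PendingWarp w) { waiting_.push_back(w); }   // park the warp

    // Called each cycle while other warps execute: check one waiting warp
    // (round robin); if all of its data has arrived, send it to write-back,
    // otherwise rotate it to the back of the list and keep waiting.
    void pollWaitingList() {
        if (waiting_.empty()) return;
        PendingWarp w = waiting_.front();
        waiting_.pop_front();
        if (w.allHit())
            std::cout << "warp " << w.id << " -> write back\n";
        else
            waiting_.push_back(w);
    }

    // Memory responses gradually mark threads of a waiting warp as hit.
    void onDataReturn(int warpId, int lane) {
        for (auto& w : waiting_)
            if (w.id == warpId) w.hit.set(lane);
    }

private:
    std::deque<PendingWarp> waiting_;
};
```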
3 Experimental Results

The Xilinx VC707 FPGA platform was used as the experimental environment, with an operating frequency of 50 MHz. As a verification application, an integral image creation algorithm, which is frequently used in image processing and recognition, was parallelized.

Table 1. Comparison of runtime for parallelization of integral image creation (unit: ms)

  Processor       1 Core    2 Cores   3 Cores   4 Cores   Frequency
  ARM Cortex-A       -         -         -         -         - GHz
  ARM1176JZF      318.14      N/A       N/A       N/A      700 MHz
  Ours (GPU)      150.89      N/A       N/A       N/A       50 MHz

OpenMP was used to parallelize the workload on the embedded CPU platforms. An image of 307,200 pixels was converted into the integral image; the results are presented in Table 1. Because the runtimes differ owing to the different core operating frequencies of the platforms, the number of clock cycles used to process a single pixel was also compared (clock cycles per pixel = operating frequency × runtime / number of pixels); the result is shown in Fig. 3. The experiment showed that the GPU improved parallelization performance by approximately 34% over parallelization with the Odroid-X quad core on the embedded platform and by approximately 96% over the Raspberry Pi single core, an average performance improvement of 65.7%.
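For reference, the OpenMP CPU baseline can be pictured with a sketch along the following lines. The paper does not list its kernel, so the two-pass separable formulation, the names integralImage and cyclesPerPixel, and the compile command are assumptions used only to illustrate the comparison and the cycles-per-pixel formula.

```cpp
// Hedged sketch of an OpenMP-parallelized integral image: row prefix sums,
// then column prefix sums, with independent rows/columns spread over cores.
// Compile with e.g.: g++ -O2 -fopenmp integral.cpp
#include <cstdint>
#include <vector>

void integralImage(const std::vector<uint8_t>& src,
                   std::vector<uint32_t>& dst, int width, int height) {
    dst.assign(static_cast<std::size_t>(width) * height, 0);

    // Pass 1: prefix sum along each row (rows are independent -> parallel).
    #pragma omp parallel for
    for (int y = 0; y < height; ++y) {
        uint32_t rowSum = 0;
        for (int x = 0; x < width; ++x) {
            rowSum += src[y * width + x];
            dst[y * width + x] = rowSum;
        }
    }

    // Pass 2: prefix sum along each column (columns are independent).
    #pragma omp parallel for
    for (int x = 0; x < width; ++x)
        for (int y = 1; y < height; ++y)
            dst[y * width + x] += dst[(y - 1) * width + x];
}

// Cycles per pixel as defined in the text: frequency * runtime / pixels.
// With the reported GPU numbers: 50e6 Hz * 0.15089 s / 307,200 px ~ 24.6.
double cyclesPerPixel(double freqHz, double runtimeSec, double pixels) {
    return freqHz * runtimeSec / pixels;
}
```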

Fig. 3. Results of comparison of clock cycles used to process a single pixel

4 Conclusion

In this study, we developed a superscalar GP-GPU with an SIMT structure on the Xilinx Virtex-7 VC707 FPGA platform. The parallelization performance of the developed GPU was compared with that of conventional CPUs in the embedded environment, using the execution speed of a parallelized integral image creation algorithm. The parallelization performance improved by approximately 65% on average over the conventional embedded CPUs. Given these results, the GPU is expected to be used for parallelization in embedded environments. Furthermore, the developed GPU will be extended to a multi-core design to increase the benefit of parallelization.

Acknowledgments. This work was supported by the IT R&D program of MOTIE/KEIT [ , Energy Scalable Vector Processor - Primary Technology].

References

1. Advanced Micro Devices, Inc.: ATI CTM Guide, edition 1.01 (2006)
2. Lindholm, E., Nickolls, J., Oberman, S., Montrym, J.: NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, vol. 28, no. 2 (2008)
3. NVIDIA: CUDA Technology
4. Levinthal, A., Porter, T.: Chap - A SIMD Graphics Processor. In: SIGGRAPH (1984)
5. Lorie, R. A., Strong, H. R.: Method for Conditional Branch Execution in SIMD Vector Processors. US Patent 4,435,758 (1984)
6. Fung, W. W. L., Sham, I., Yuan, G., Aamodt, T. M.: Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. In: 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) (2007)