Architectural Optimizations David Ojika March 27, 2014.

Focus
From OpenCL to High-Performance Hardware on FPGAs, Tomasz S. Czajkowski, Altera OpenCL Optimization Group, IEEE 2012
OpenCL for FPGAs: Prototyping a Compiler, Tomasz S. Czajkowski, Altera OpenCL Optimization Group, ERSA 2012

Outline
Introduction
FPGA as an Acceleration Platform
HLS vs. OpenCL in Accelerators
OpenCL-to-FPGA Compiler
System Integration
Implementation Details
Optimization Strategies
Memory Issues
Experimental Setup
Results
Conclusion

Introduction
Traditionally, FPGAs have been the de facto standard for many high-performance applications
 Some of the largest and most complex circuits
 Applications in medical imaging, molecular dynamics, etc.
Nowadays, FPGAs are being used as accelerators to speed up computationally intensive applications
 Orders-of-magnitude reduction in runtime compared to traditional stand-alone computers
 Hardware parallelism enables many computations to be performed at the same time

FPGA as Acceleration Platform
FPGAs enable high application acceleration in many domains
 However, parallel computation at the hardware level requires great expertise
Poses limitations on the adoption of FPGAs as hardware accelerators
GPUs have been used as accelerators in computing for many years
 Interest from software programmers in using GPUs for general-purpose computing led to the development of new programming models
CUDA for GPUs
OpenCL for GPUs, CPUs, FPGAs, DSPs
FPGAs are starting to compete with GPGPUs for application acceleration
GPGPU: General-Purpose Graphics Processing Unit

HLS vs. OpenCL in Accelerators
Traditional HLS
Tools implement FPGA circuits from software-like languages
 Many circuits can simultaneously process data
Does not fully exploit parallelism in hardware
 Pipelining is not explicit in the C programming language
Performance may suffer compared to a VHDL/Verilog implementation
OpenCL
Computation is performed using a combination of a host and kernels
 Host is responsible for I/O and task setup
 Kernels perform computation on independent inputs
Kernels can be implemented as high-performance circuits
Kernels can be replicated to improve application performance
Challenge
 How to efficiently synthesize kernels into hardware circuits?
Solution
 A compiler that performs architectural optimizations!
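The host/kernel split above can be illustrated with a minimal, hypothetical OpenCL C kernel (a vector addition, not taken from the paper): each work-item computes one independent output, so the compiler is free to pipeline or replicate the kernel circuit.

```c
// Hypothetical OpenCL kernel: each work-item (thread) handles one element.
__kernel void vector_add(__global const float *a,
                         __global const float *b,
                         __global float *c)
{
    int i = get_global_id(0);  // this thread's independent input index
    c[i] = a[i] + b[i];
}
```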

OpenCL-to-FPGA Compiler
Framework based on the LLVM compiler infrastructure
 Input 1: OpenCL application with a set of kernels (.cl)
 Input 2: Host program (.c)
Kernel Compilation
 C-language parser produces LLVM IR for each kernel
 Representation is optimized to target the FPGA platform
 Optimized LLVM IR is converted to a CDFG
 CDFG is optimized to improve area and performance
 RTL generation produces Verilog HDL for each kernel
 HDL is compiled to generate a bitstream for programming
Host Compilation
 ACL Host Library: implements OpenCL function calls; allows exchange of information between host and FPGA
 Auto-Discovery module: allows the host program to discover the types of kernels on the FPGA
Output: Verilog HDL

System Integration
Host accesses the kernel
 Specifies workspace parameters (global and local work sizes)
 Sets kernel arguments
 Enqueues the kernel
Device replicates the kernel
 Runs the kernel
 Writes results to DDR memory
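The host-side steps above use standard OpenCL 1.0 API calls; a sketch follows, assuming a command queue, a built program, buffers (a_buf, b_buf, c_buf), and a hypothetical kernel named vector_add already exist (error handling omitted).

```c
/* Sketch of the host-side flow: create the kernel, set its arguments,
   specify the workspace, and enqueue it (error checks omitted). */
cl_int err;
cl_kernel kernel = clCreateKernel(program, "vector_add", &err);

clSetKernelArg(kernel, 0, sizeof(cl_mem), &a_buf);  /* kernel arguments */
clSetKernelArg(kernel, 1, sizeof(cl_mem), &b_buf);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &c_buf);

size_t global_size = 1024;  /* workspace: global work size */
size_t local_size  = 64;    /* workspace: local work size  */
clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                       &global_size, &local_size, 0, NULL, NULL);
clFinish(queue);            /* wait; results land in device DDR memory */
```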

Implementation Details
1. ACL Host Library
 Host library: implements the OpenCL data and control specifications
2. Kernel Compiler
 Kernel: compute-intensive workload offloaded from the CPU

ACL Host Library
Implements the Platform and Runtime APIs of the OpenCL 1.0 specification
Platform API
 Discover the FPGA accelerator device
 Manage execution contexts
Runtime API
 Manage command queues
 Memory objects
 Program objects
 Discover and invoke precompiled kernels
Consists of two layers
Platform-Independent Layer: performs device-independent processing
 Provides user-visible OpenCL API functions
 Performs generic book-keeping and scheduling
Hardware Abstraction Layer: adapts to platform specifics
 Provides low-level services to the platform-independent layer
Device and kernel discovery, raw memory allocation, memory transfers, kernel launch and completion
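The Platform API calls above are standard OpenCL 1.0; a minimal device-discovery sketch (variable names mine, error handling omitted) might look like this:

```c
/* Sketch: discovering an FPGA accelerator via the OpenCL Platform API. */
cl_platform_id platform;
cl_device_id device;
clGetPlatformIDs(1, &platform, NULL);
clGetDeviceIDs(platform, CL_DEVICE_TYPE_ACCELERATOR, 1, &device, NULL);

/* Create an execution context and a command queue on that device. */
cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);
```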

Kernel Compiler
Compiles each kernel into a hardware circuit using the LLVM open-source compiler
 Extends the compiler to target FPGA platforms
LLVM Flow:
 Represents the program as a sequence of instructions: load, add, store
 Each instruction has associated inputs and produces results that can be used in downstream computation
 Basic block: a group of instructions in a contiguous sequence
 Each basic block has a terminator instruction that ends the program or redirects execution to another basic block
 The compiler uses this representation to create a hardware implementation of each basic block; the blocks are put together to form the complete kernel circuit

Optimization Strategies
1. C-Language Front End
Produces the intermediate representation
2. Live Variable Analysis
Identifies the variables produced and consumed by each basic block
3. CDFG Generation
 Challenge: each operational node can be independently stalled when its successor node is unable to accept data
 Solution: two signals and gates; the valid signal into a node is the AND of the valid signals from its data sources
Loop Handling (handled in the basic-block building step)
 Challenge: it is entirely possible for a loop to stall
 Solution: insert an additional stage of registers into the merged node; this gives the pipeline an extra slot into which data can be placed

Optimization Strategies
4. Scheduling
 Challenge: no sharing of resources occurs; operations need to be sequenced into a pipeline
 Solution: apply the SDC (system of difference constraints) scheduling algorithm, which uses a system of linear constraints to schedule operations while minimizing a cost function
 Objective: area reduction and reducing the number of pipeline-balancing registers required
 Solution: minimize a cost function that reduces the number of bits required by the pipeline
5. Hardware Generation
 Challenge: a potentially large number of independent threads needs to execute using one kernel hardware instance
 Solution: implement each basic-block module as a pipelined circuit rather than a finite state machine with datapath (FSMD)

Memory Issues
Global memory
Local memory
Private memory

Global Memory
Designated for access by all threads
Large capacity for data storage; I/O by all threads
Long access latency
Optimization
 Accesses are coalesced when possible to increase throughput
 The compiler can detect when accesses are random or sequential and takes advantage of this
When accesses are sequential, a special read or write module that includes a large buffer is created for burst transactions
When accesses are random, a coalescing unit collects requests to see whether seemingly random accesses address the same 256-bit word in memory; if so, the requests are coalesced into a single data transfer, improving memory throughput

Local Memory
Accessed by threads in the same work-group
Short latency and multiple ports; efficient access by kernels
Enables synchronized exchange of data
Barriers/memory fences are used to force threads to wait on one another before proceeding
 Enables complex algorithms that require collaboration among threads, e.g. bitonic sort
Optimization
 Shared memory space for every load/store to any local memory
Map the logical ports required by the kernel onto a set of dual-ported on-chip memory modules
Accesses are split across multiple memory banks to enable faster data access
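A hypothetical OpenCL kernel (not from the paper) illustrates the synchronized exchange described above: each work-group stages data in local memory, and a barrier guarantees the tile is fully written before any thread reads a neighbour's element.

```c
// Hypothetical kernel: stage data in local memory, synchronize, then
// let every thread read an element written by a neighbouring thread.
__kernel void stage_and_sum(__global const float *in,
                            __global float *out,
                            __local float *tile)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    tile[lid] = in[gid];           // each thread loads one element
    barrier(CLK_LOCAL_MEM_FENCE);  // wait until the whole tile is filled

    // safe only after the barrier: read the next thread's element
    out[gid] = tile[lid] + tile[(lid + 1) % get_local_size(0)];
}
```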

Private Memory
Implemented with registers; stores data on a per-thread basis
Data is pipelined through the kernel to ensure each thread keeps the data it requires as it proceeds through the kernel
Optimization
 None

Experimental Setup
Accelerator
 Stratix IV on a Terasic DE4 board
Host
 Windows XP 64-bit machine
Offline Kernel Compiler
 ACDS (Altera Complete Design Suite)
Benchmarks
 Monte Carlo Black-Scholes
 Matrix Multiplication
 Finite Differences
 Particle Simulation

Results
Table 1: Circuit Summary and Application Throughput
Monte Carlo Black-Scholes
 OpenCL compared against Verilog code
Throughput: 2.2 Gsims/sec, compared to 1.8 Gsims/sec with 2 Stratix-III FPGAs
Matrix Multiplication
 1024 x 1024 matrix size
 Vectorized kernel
Improves thread access to memory; increased throughput by 4x
 Unrolled inner loop
Performs 64 floating-point multiplications and additions simultaneously
Results in 88.4 GFLOPS throughput
Finite Difference and Particle Simulation
 No closely comparable performance from previous work was recorded

Conclusion
Developed a compilation framework that includes a host library for communicating with kernels
Developed an OpenCL-to-FPGA compiler with optimizations
 Results show that the tool flow is not only feasible but also effective for designing high-performance circuits
Results indicate that OpenCL is well suited for automatic generation of efficient circuits for FPGAs
 Coverage of the OpenCL language: synchronization (barriers), support for local memory and floating-point operations
THANKS!