
Overview

Real-world NP-hard problems, such as fluid dynamics, calcium signaling in cells, and stomata networks in plant leaves, involve extensive computation and require unique solution methods. Field-Programmable Gate Arrays (FPGAs) are a potential solution to these problems, as they promote parallel processing, allow for reconfigurability, and increase processing speed.

Most natural distributed systems are modeled using a set of interconnected processing nodes that solve differential equations. Each node calculates a solution to the differential equation on each time step. Data from the previous time step of other nodes is combined with a node’s own previous state to determine its current state.

Runge-Kutta Approximation Algorithm

The fourth-order Runge-Kutta approximation calculates the current value of the solution of a differential equation, based on the previous value, using the standard formulae. In the usual notation, for $y' = f(t, y)$ and step size $h$:

$$
\begin{aligned}
k_1 &= f(t_n,\, y_n) \\
k_2 &= f\!\left(t_n + \tfrac{h}{2},\, y_n + \tfrac{h}{2} k_1\right) \\
k_3 &= f\!\left(t_n + \tfrac{h}{2},\, y_n + \tfrac{h}{2} k_2\right) \\
k_4 &= f(t_n + h,\, y_n + h k_3) \\
y_{n+1} &= y_n + \tfrac{h}{6}\left(k_1 + 2k_2 + 2k_3 + k_4\right)
\end{aligned}
$$

In other words, the differential function is evaluated four times to obtain each successive approximation.

Design Tools and Target Platform

The target hardware in this research is a single node (one Xilinx 2V6000 FPGA) on a Starbridge Systems HC62 Hyper Computer. To facilitate interprocessor communication, the FPGA is under-clocked at 66 MHz. The design software used is Viva 2.4.2, a polymorphic, graphical hardware design language that interfaces directly with the Xilinx ISE place-and-route tools.

Node and Interconnect Basics

The interconnections between processing nodes are based on cellular automata: each processing node can exchange data only with its four immediate neighbors. No concept of global data sharing exists. Each node contains hardware for performing calculations and a private RAM that only that node can access. Figure 1 depicts the connections between neighboring nodes, which are referred to as north, south, east, and west.

Two different node architectures for solving the differential equations were proposed and tested.
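The per-node update described above (a fourth-order Runge-Kutta step whose inputs are the node’s own previous state and its four neighbors’ previous states) can be sketched in software as follows. This is a minimal illustration, not the poster’s hardware design: the function names, the wrap-around grid edges, and the "relax toward the neighbor mean" dynamics are assumptions chosen only to make the example runnable; the actual stomatal equations are not reproduced in this transcript.

```python
from typing import Callable, List

Grid = List[List[float]]


def rk4_step(f: Callable[[float, float], float], t: float, y: float, h: float) -> float:
    """Advance y' = f(t, y) by one step of size h with classical RK4."""
    k1 = f(t, y)
    k2 = f(t + h / 2, y + h * k1 / 2)
    k3 = f(t + h / 2, y + h * k2 / 2)
    k4 = f(t + h, y + h * k3)
    return y + h * (k1 + 2 * k2 + 2 * k3 + k4) / 6


def step_grid(grid: Grid, t: float, h: float, coupling: float) -> Grid:
    """Update every node from the PREVIOUS time step of its N/S/E/W neighbors.

    Assumptions (not from the poster): edges wrap around, and each node
    relaxes toward the mean of its four neighbors at rate `coupling`.
    """
    rows, cols = len(grid), len(grid[0])
    new_grid = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            north = grid[(r - 1) % rows][c]
            south = grid[(r + 1) % rows][c]
            east = grid[r][(c + 1) % cols]
            west = grid[r][(c - 1) % cols]
            mean = (north + south + east + west) / 4

            # Hypothetical node dynamics; neighbor values are held fixed
            # during the step because they come from the previous time step.
            def f(_t: float, y: float, m: float = mean) -> float:
                return coupling * (m - y)

            new_grid[r][c] = rk4_step(f, t, grid[r][c], h)
    return new_grid
```

Because `step_grid` reads only the old grid and writes only the new one, every node update is independent of the others within a time step, which is the property that lets the nodes run in parallel.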
Details on each of the two types follow.

Architecture #1: Explicit Hardware Differential Equation Solver

Hardware was designed to solve a set of two equations relating to stomatal apertures in plants. The first of the two equations is a differential equation, which we solve using the fourth-order Runge-Kutta method, optimized for speed rather than physical circuit size. This method proved infeasible for two reasons: (1) The physical circuit size of one node consumed almost half of the Xilinx 2V6000 slices, preventing node duplication. (2) The hardware design is specific to a given equation set, meaning that the circuit must be modified if the equation set is changed.

FPGA-based Modeling of Spatio-temporal Interactive Systems
J. Phillips, V. Hariharan, A. Dasu, D. Peak
Reconfigurable Computing Group, Utah State University

Abstract

Systems in the real world, such as those in fluid dynamics, calcium signaling in cells, or stomata networks in plants, are examples of complicated problems that involve a spatial organization of nodes or processing elements that interact with each other over time and influence each other over a spatial neighborhood. Attempts to model these problems using traditional software approaches generally involve extremely lengthy execution times. Field-Programmable Gate Arrays (FPGAs) naturally promote parallel processing and have the potential to suit this type of spatial computing well. In this research work we present an implementation of a general, application-independent FPGA-based circuit for modeling differential equation-based, spatially distributed problems. The work focuses on the specifics of the design of such a system. The internals of a single processing node are discussed, including an implementation of a Runge-Kutta fourth-order differential equation approximation algorithm. The polymorphic hardware design language Viva is introduced, and examples are given of how it is used to implement the system.
The design platform of choice is a Hyper Computer from Starbridge Systems, which consists of several Xilinx Virtex II FPGAs. Details of the architecture are discussed. Techniques used to optimize the design are also discussed, including the reduction of floating-point multiplications and divisions and the tradeoff between physical circuit size and execution time. Difficulties and problems encountered are also mentioned, including issues with Viva floating-point implementations and the lack of support for double-precision floating-point numbers. Results are presented for an array of nodes that coordinate in space and time to solve simultaneous differential equations, with reference to a well-observed stomata network. Comparisons between our implementation and a traditional software implementation in terms of speed and accuracy are provided, along with a vision of future work.

Architecture #2: Generic Node with RISC Instruction Set

A SIMD architecture was designed. A single controller feeds instructions to all nodes simultaneously (the red lines in Figure 4). Each node processes its own data and stores results in its own memory. Nearest-neighbor data connections are shown by the blue lines in Figure 4. The controller consists of a program memory and a state machine that interprets instructions.

The internals of the processing node are shown in Figure 5. Four arithmetic instructions are available, as well as loads and stores. Instructions consist of 16-bit words, formatted as shown in Figure 6. There are fields for the opcode, the RAM address, and a select line for the input multiplexer.
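To make the 16-bit instruction word concrete, here is a minimal encode/decode sketch. The field positions, the 8/5/3-bit split, and the opcode names and numbering are all assumptions for illustration (the actual layout is in Figure 6 and the instruction list in Figure 7, neither of which is reproduced in this transcript). The widths were chosen so that the address field covers a 32-word RAM and the select field covers six multiplexer inputs.

```python
from enum import IntEnum
from typing import Tuple


class Op(IntEnum):
    """Hypothetical opcodes: loads, stores, and four arithmetic operations."""
    LOAD = 0
    STORE = 1
    ADD = 2
    SUB = 3
    MUL = 4
    DIV = 5


# Assumed layout (MSB to LSB): 8-bit opcode | 5-bit RAM address | 3-bit mux select
ADDR_BITS = 5
SEL_BITS = 3
ADDR_MASK = (1 << ADDR_BITS) - 1  # 0x1F
SEL_MASK = (1 << SEL_BITS) - 1    # 0x07
MAX_SELECT = 5                    # six mux inputs: four neighbors, external, accumulator


def encode(op: Op, addr: int, sel: int) -> int:
    """Pack one instruction into a 16-bit word under the assumed layout."""
    if not 0 <= addr <= ADDR_MASK:
        raise ValueError(f"RAM address out of range: {addr}")
    if not 0 <= sel <= MAX_SELECT:
        raise ValueError(f"mux select out of range: {sel}")
    return (int(op) << (ADDR_BITS + SEL_BITS)) | (addr << SEL_BITS) | sel


def decode(word: int) -> Tuple[Op, int, int]:
    """Split a 16-bit word back into (opcode, RAM address, mux select)."""
    op = Op(word >> (ADDR_BITS + SEL_BITS))
    addr = (word >> SEL_BITS) & ADDR_MASK
    sel = word & SEL_MASK
    return op, addr, sel
```

For example, `encode(Op.ADD, 7, 2)` packs an add that uses RAM address 7 and mux input 2, and `decode` recovers the same three fields. In a SIMD arrangement the controller would send the same encoded word to every node, and each node would apply it to its own RAM and accumulator.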
Figure 7 lists the available instructions. The RAM address field can hold a number between 0 and 31, as each node has a 32 x 32-bit RAM. The select field can hold a value between 0 and 5, accessing data from one of the four neighboring nodes, an external data source, or the node’s own accumulator. Figure 8 shows a sample assembly program that has been translated into machine language by an assembler, which was created using FLEX.

Results

The final design implementation yielded the following figures. The controller takes 1,013 slices on the Xilinx 2V6000. Processing nodes take 3,839 slices each. It is estimated that the adder/subtractor occupies around 600 slices, the multiplier around 900 slices, and the divider around 1,300 slices. Since only one controller node is needed, up to 9 processing nodes fit on the Xilinx 2V6000. If multiple FPGAs were available, the number of processing nodes could be increased proportionally.

Figure 9 shows the speed of the algorithm coded in C running on a 1.5 GHz Intel Centrino laptop versus that of the FPGA-based SIMD computer. Notice that the FPGA-based version has identical performance regardless of the number of processing nodes: one iteration takes about 19 microseconds. This is one of the benefits of a SIMD architecture running parallel processors. Notice also that the PC-based version requires an increasing amount of time to handle an increasing number of nodes. This increase is linear, as expected, at a rate of about 3.9 microseconds per node. From Figure 9, it can be deduced that for node arrays that exceed 5 nodes, the FPGA-based implementation will yield superior results.

Figure 1. Node Interconnections (labels: N, S, E, W).
Figure 2. Viva example of differential function hardware.
Figure 3. Dedicated hardware for Runge-Kutta solver (blocks: 4-to-1 input mux, F, two divide-by-2 units, two adders, decoder, K1, K2, K3, K4, previous value).
Figure 4. SIMD Architecture (blocks: controller and instruction memory, processing nodes).
Figure 5. Processing Node Internals.
Figure 6. Instruction word format.
Figure 7. Available Instructions.
Figure 8. Assembly instructions on the left are translated into the machine code on the right.
Figure 9. Time taken by processing nodes.

Conclusion

This research work has demonstrated the potential for FPGA use in the field of spatio-temporal interactive systems. We have shown that, for a system of differential equations that models the stomatal aperture in plant leaves, a single-instruction, multiple-data (SIMD) approach spread across multiple arithmetic units yields results greatly superior to those of a single processor. Future work will include recoding the node architecture in VHDL rather than Viva, since VHDL tends to create a more resource-efficient circuit. The instruction set will be augmented to handle conditional branches, and instructions may be added that involve multiple arithmetic operations (e.g. multiply-and-accumulate).

Phillips, Page 1, No. 161, MAPLD 2005
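The break-even point quoted in the Results can be checked with a short calculation. The sketch below uses only the two timings stated there (about 19 microseconds per FPGA iteration, about 3.9 microseconds per node on the PC). It assumes the PC time is directly proportional to the node count with no fixed overhead, which the poster does not state, so the numbers are approximate; the function names are illustrative.

```python
FPGA_ITERATION_US = 19.0  # constant, independent of node count (from Results)
PC_US_PER_NODE = 3.9      # linear growth rate on the PC (from Results)


def pc_time_us(nodes: int) -> float:
    """PC time per iteration, assuming zero fixed overhead (an assumption)."""
    return PC_US_PER_NODE * nodes


def fpga_time_us(nodes: int) -> float:
    """FPGA time per iteration; all SIMD nodes run in lockstep."""
    return FPGA_ITERATION_US


def first_node_count_where_fpga_wins(limit: int = 1000) -> int:
    """Smallest node count at which the FPGA iteration is faster than the PC."""
    for n in range(1, limit + 1):
        if pc_time_us(n) > fpga_time_us(n):
            return n
    raise ValueError("no crossover within limit")


def speedup(nodes: int) -> float:
    """Ratio of PC time to FPGA time for a given node count."""
    return pc_time_us(nodes) / fpga_time_us(nodes)
```

Under this zero-overhead assumption the crossover falls at 19 / 3.9, or roughly 4.9 nodes, which is in line with the threshold of about five nodes read from Figure 9. For the 9 nodes that fit on one FPGA, `speedup(9)` gives a ratio a little under 2.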