MSECE Thesis Presentation Paul D. Reynolds

Presentation transcript:

Algorithm Implementation in FPGAs Demonstrated Through Neural Network Inversion on the SRC-6e. MSECE Thesis Presentation, Paul D. Reynolds.

Algorithm Hardware Implementation. Algorithms: one output for each input; purely combinational. Problems: too large to be directly implemented; timing issues. Solution: clocked design; repeated use of hardware.

Implementation Hardware: SRC-6e Reconfigurable Computer. 2 Pentium III processors at 1 GHz. 2 Xilinx XC2V6000 FPGAs at 100 MHz (144 multipliers, 144 block RAMs). 6 memory blocks, 4 MB each.

Hardware Architecture

SRC-6e Development Environment: main.c. Written in C; executes on the Pentium processors; command line interface; hardware accessed as a function.

SRC-6e Development Environment: hardware.mc. Written in modified C or FORTRAN; executes in hardware; controls memory transfer; one file for each FPGA used; can hold the entire code or call hardware description functions.

Hardware Description: VHDL and Verilog. Reasons for use: to avoid C-compiler idiosyncrasies (latency added to certain loops; 16-bit multiplies converted to 32-bit multiplies); more control (fixed-point multiplication with truncation; pipelines and parallel execution simpler); IP cores usable; more efficient implementation.

Neural Network and Inversion Example

Problem Background. Goal: determine the optimal sonar setup to maximize the ensonification of a grid of water. Influences on ensonification: environmental conditions (temperature, wind speed); bathymetry (bottom type, shape of bottom); sonar system. A total of 27 different factors are accounted for.

Ensonification Example: 15 by 80 pixel grid. Red: high signal-to-interference ratio. Blue: low signal-to-interference ratio. Bottom: no signal.

Original Solution: take current conditions; match to previous optimum sonar setups with similar conditions; run the acoustic model using current conditions and previous optimum setups; use the sonar setup with the highest signal-to-interference ratio.

New Problem. Problem: one acoustic model run took tens of seconds. Solution: train a neural network on the acoustic model (APL & University of Washington).

Neural Network Overview. Inspired by the human ability to recognize patterns. A mathematical structure able to mimic a pattern. Trained using known data: show the network several examples and identify each example; the network learns the pattern; show the network a new case and let the network identify it.

Neural Network Structure. Each neuron is the squashed sum of the inputs to that neuron. A squash is a non-linear function that restricts outputs to between 0 and 1. Each arrow is a weight times a neuron output. Diagram labels: inputs, neuron, layer, weight, outputs.

Ensonification Neural Network. Taught using examples from the acoustical model. Recognizes a pattern between the 27 given inputs and the 15 by 80 grid output. 27-40-50-70-1200 architecture. Squash = (formula shown as an image on the slide).
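The structure described on the last two slides can be sketched in a few lines of Python. This is only an illustration: the logistic sigmoid used as the squash, the one extra bias weight per neuron, and the random placeholder weights are assumptions, not details taken from the slides.

```python
import math
import random

LAYER_SIZES = [27, 40, 50, 70, 1200]  # architecture quoted on the slide

def squash(s):
    """Logistic sigmoid (assumed squash): maps any real sum into the range 0..1."""
    return 1.0 / (1.0 + math.exp(-s))

def make_network(sizes, rng):
    """Random placeholder weights; each neuron gets one extra bias weight (assumption)."""
    return [[[rng.uniform(-1.0, 1.0) for _ in range(n_in + 1)] for _ in range(n_out)]
            for n_in, n_out in zip(sizes, sizes[1:])]

def forward(network, inputs):
    """Evaluate layer by layer: each neuron is the squashed weighted sum of its inputs."""
    activations = list(inputs)
    for layer in network:
        extended = activations + [1.0]  # constant input that multiplies the bias weight
        activations = [squash(sum(w * a for w, a in zip(neuron, extended)))
                       for neuron in layer]
    return activations

rng = random.Random(0)
net = make_network(LAYER_SIZES, rng)
grid = forward(net, [rng.random() for _ in range(27)])
weight_count = sum(len(neuron) for layer in net for neuron in layer)
print(len(grid), weight_count)  # 1200 91940
```

The 1200 outputs correspond to the 15 by 80 grid. With the assumed bias weights the count comes to 91,940, close to the roughly 92,000 weights quoted later in the deck.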

Did the neural network solve the problem? Yes: the neural network approximates the acoustic model in 1 ms. However, the method of locating the best setup is the same: run many possible setups through the neural network and choose the best. Problem: better, but still not real time.

How to find a good setup solution: Particle Swarm Optimization. Idea: several particles wandering over a fitness surface. Math: $x_{k+1} = x_k + v_k$; $v_{k+1} = v_k + \mathrm{rand} \cdot w_1 (G_b - x_k) + \mathrm{rand} \cdot w_2 (P_b - x_k)$. Theory: momentum pushes particles around the surface; each particle is pulled towards its personal best and towards the global best; eventually particles oscillate around the global best.

Particle Swarm Math. $x_{k+1} = x_k + v_k$; $v_{k+1} = v_k + \mathrm{rand} \cdot w_1 (G_b - x_k) + \mathrm{rand} \cdot w_2 (P_b - x_k)$. Next position = current position + current velocity. Next velocity = current velocity + global pull + personal pull.

Particle Swarm in Operation (link to particle swarm QuickTime file).
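A minimal sketch of these two update equations in Python follows. The function name, the weights `w1` and `w2`, the starting ranges, and the velocity clamp `vmax` are hypothetical choices made for illustration; the slides do not give the values used in the thesis.

```python
import random

def pso_minimize(fitness, dim, n_particles=10, steps=200,
                 w1=0.1, w2=0.1, vmax=1.0, seed=0):
    """Minimize `fitness` with the slide's update rule; returns (best_x, best_fitness)."""
    rng = random.Random(seed)
    xs = [[rng.uniform(-5.0, 5.0) for _ in range(dim)] for _ in range(n_particles)]
    vs = [[rng.uniform(-vmax, vmax) for _ in range(dim)] for _ in range(n_particles)]
    pbest = [list(x) for x in xs]              # personal best positions (Pb)
    pbest_fit = [fitness(x) for x in xs]
    best = min(range(n_particles), key=pbest_fit.__getitem__)
    gbest, gbest_fit = list(pbest[best]), pbest_fit[best]   # global best (Gb)

    for _ in range(steps):
        for i in range(n_particles):
            for d in range(dim):
                x_k, v_k = xs[i][d], vs[i][d]
                xs[i][d] = x_k + v_k           # next position
                v_next = (v_k + rng.random() * w1 * (gbest[d] - x_k)       # global pull
                              + rng.random() * w2 * (pbest[i][d] - x_k))   # personal pull
                vs[i][d] = max(-vmax, min(vmax, v_next))  # clamp is an added assumption
            fit = fitness(xs[i])
            if fit < pbest_fit[i]:
                pbest[i], pbest_fit[i] = list(xs[i]), fit
                if fit < gbest_fit:
                    gbest, gbest_fit = list(xs[i]), fit
    return gbest, gbest_fit

# Usage on a toy bowl-shaped surface (not the sonar fitness surface).
best_x, best_fit = pso_minimize(lambda x: sum(c * c for c in x), dim=3)
print(best_fit)
```

Because the best positions are only replaced by better ones, the reported fitness can never get worse as the number of steps grows.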

Particle Swarm Optimization. 27 inputs to the neural network; sonar system setup. Fitness surface: calculated from the neural network output. Two options: match a desired output (take the sum of the difference from the desired output and minimize the difference), or maximize the signal-to-interference ratio in an area (ignore output in undesired locations).
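The two fitness options could look roughly like the sketch below. The function names, the use of an absolute difference, and the boolean area mask are assumptions made for illustration.

```python
def match_fitness(output, desired):
    """Option 1: sum of absolute differences from a desired grid (lower is better)."""
    return sum(abs(o - d) for o, d in zip(output, desired))

def area_fitness(output, area_mask):
    """Option 2: total signal inside the area of interest (higher is better).

    Cells where the mask is False are ignored entirely.
    """
    return sum(o for o, inside in zip(output, area_mask) if inside)

# Tiny 4-cell example instead of the full 1200-cell grid.
output = [0.2, 0.8, 0.5, 0.1]
print(match_fitness(output, [0.0, 1.0, 0.5, 0.0]))
print(area_fitness(output, [False, True, True, False]))
```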

New Problem Enters. Time for a 100k-step particle swarm using a 2.2 GHz Pentium: nearly 2 minutes. Desire a real-time version. Solution: implement the neural network and particle swarm optimization in parallel on reconfigurable hardware.

Three Design Stages. Activation function design: sigmoid not efficient to calculate. Neural network design: parallel design. Particle swarm optimization: hardware implementation.

Activation Function Design: fixed-point design; sigmoid accuracy level; weight accuracy level.

Fixed Point Design. Versus floating point: easier; less area; faster. Data range of -50 to 85: 2's complement; 7 integer bits; 1 sign bit. Fractional portion: sigmoid outputs are less than 1; some number of fractional bits.

Sigmoid Accuracy Level

Weight Accuracy Level

Total Accuracy

Fixed Point Results. 16-bit number: 1 sign bit; 7 integer bits; 8 fractional bits. Advantages: 18 x 18 multipliers; 64-bit input banks.

Activation Function Approximation. Compared 4 designs: look-up table; shift and add; CORDIC; Taylor series.
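As an illustration of this 16-bit format (1 sign bit, 7 integer bits, 8 fractional bits), the sketch below converts values and multiplies them with truncation. The helper names and the saturation on overflow are assumptions, not details from the slides.

```python
FRAC_BITS = 8                      # 8 fractional bits, as on the slide
SCALE = 1 << FRAC_BITS             # 256 codes per unit
INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1

def saturate(raw):
    """Keep a value inside the signed 16-bit range (assumed overflow behaviour)."""
    return max(INT16_MIN, min(INT16_MAX, raw))

def to_fixed(value):
    """Convert a real number to the nearest 16-bit fixed-point code."""
    return saturate(int(round(value * SCALE)))

def to_float(raw):
    """Convert a fixed-point code back to a real number."""
    return raw / SCALE

def fixed_mul(a, b):
    """Multiply two codes: the product carries 16 fractional bits, so the low
    8 bits are dropped (truncation) to return to 8 fractional bits."""
    return saturate((a * b) >> FRAC_BITS)

print(to_fixed(85.0), to_fixed(-50.0))                   # 21760 -12800
print(to_float(fixed_mul(to_fixed(1.5), to_fixed(2.25))))  # 3.375
```

The representable range is -128 to just under +128 in steps of 1/256, which covers the -50 to 85 data range with room to spare.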

Look-up Table. Advantages: unlimited accuracy; short latency of 3. Disadvantages: an entirely on-chip design is desired; the table might use memory needed for weights.
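A look-up table for the sigmoid can be sketched as follows, assuming a 16-bit input code with 8 fractional bits and an 8-bit output. The table layout and names are illustrative assumptions, not the thesis design.

```python
import math

FRAC_BITS = 8
OFFSET = 1 << 15  # shifts signed 16-bit codes to table indices 0..65535

def build_sigmoid_table():
    """One entry per 16-bit input code; each entry is an 8-bit output code."""
    table = []
    for raw in range(-OFFSET, OFFSET):
        y = 1.0 / (1.0 + math.exp(-raw / (1 << FRAC_BITS)))
        table.append(min(255, int(round(y * 256))))  # clamp so the code fits in 8 bits
    return table

SIGMOID_TABLE = build_sigmoid_table()

def sigmoid_lut(raw):
    """Evaluate the sigmoid with a single memory read."""
    return SIGMOID_TABLE[raw + OFFSET]

print(len(SIGMOID_TABLE), sigmoid_lut(0))  # 65536 128
```

A full table like this needs 65,536 entries, which illustrates the disadvantage noted on the slide: the memory competes with the storage needed for the weights.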

Look-up Table

Shift and Add. $Y(x) = 2^{-n} x + b$. Advantages: small design; short latency of 5. Disadvantages: piecewise outputs; limited accuracy.
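One well-known instance of this idea is the PLAN piecewise-linear sigmoid approximation, whose slopes are all powers of two so that each segment needs only a shift and an add. The sketch below uses those published breakpoints on codes with 8 fractional bits; it is a swapped-in example and not necessarily the segmentation used in the thesis.

```python
import math

ONE = 256  # the value 1.0 expressed with 8 fractional bits

def sigmoid_shift_add(raw):
    """Piecewise-linear sigmoid: every segment is Y = 2^-n * x + b."""
    a = abs(raw)
    if a >= 5 * ONE:          # |x| >= 5      -> 1
        y = ONE
    elif a >= 608:            # |x| >= 2.375  -> x/32 + 0.84375
        y = (a >> 5) + 216
    elif a >= ONE:            # |x| >= 1      -> x/8 + 0.625
        y = (a >> 3) + 160
    else:                     # |x| < 1       -> x/4 + 0.5
        y = (a >> 2) + 128
    return y if raw >= 0 else ONE - y   # symmetry: f(-x) = 1 - f(x)

# Compare against the exact sigmoid over -8..8.
worst = max(abs(sigmoid_shift_add(r) / ONE - 1.0 / (1.0 + math.exp(-r / ONE)))
            for r in range(-2048, 2049))
print(worst)
```

The printed worst-case error stays in the low hundredths, which matches the slide's point: the design is small and fast, but its accuracy is limited by the piecewise segments.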

Shift and Add

CORDIC Computation: divide the argument by 2; a series of rotations gives sinh(x) and cosh(x); division for tanh(x); shift and add for the result.

CORDIC. Advantages: unlimited accuracy; real calculation. Disadvantages: long latency of 50; large design.
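The steps on this slide can be sketched in floating-point Python as below, using the identity sigmoid(x) = 0.5 + 0.5 tanh(x/2). The iteration count and function names are assumptions, and this basic form only converges for inputs of roughly |x| < 2.2; a hardware version would use shifts and a small table of atanh constants instead of `2.0 ** -i` and `math.atanh`.

```python
import math

def hyperbolic_indices(n):
    """Shift indices 1, 2, 3, 4, 4, 5, ...: indices 4, 13, 40, ... are repeated,
    which hyperbolic CORDIC needs in order to converge."""
    seq, i, repeat = [], 1, 4
    while len(seq) < n:
        seq.append(i)
        if i == repeat:
            seq.append(i)
            repeat = 3 * repeat + 1
        i += 1
    return seq[:n]

def cordic_sigmoid(x, iterations=16):
    z = x / 2.0                       # divide the argument by 2
    c, s = 1.0, 0.0                   # become K*cosh(z) and K*sinh(z)
    for i in hyperbolic_indices(iterations):
        d = 1.0 if z >= 0 else -1.0   # rotate towards z = 0
        t = 2.0 ** -i                 # a right shift by i bits in hardware
        c, s = c + d * s * t, s + d * c * t
        z -= d * math.atanh(t)        # constant from a small table in hardware
    tanh_half = s / c                 # division; the CORDIC gain K cancels
    return 0.5 + 0.5 * tanh_half      # shift and add for the result

print(cordic_sigmoid(1.0), 1.0 / (1.0 + math.exp(-1.0)))
```

Each extra iteration adds roughly one more bit of accuracy at the cost of one more stage of latency.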

CORDIC

Taylor Series. $Y(x) = a + b(x - x_0) + c(x - x_0)^2$. Advantages: unlimited accuracy. Average: latency of 10; medium-size design. Disadvantages: 3 multipliers.
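A second-order Taylor approximation of the sigmoid might be sketched as follows. The segment width `STEP` and the choice of expanding around the nearest segment centre are assumptions; in hardware the coefficients a, b, c would come from a small precomputed table rather than being calculated on the fly.

```python
import math

STEP = 0.5  # hypothetical spacing between expansion points x0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def taylor_coeffs(x0):
    """Coefficients of Y(x) = a + b(x - x0) + c(x - x0)^2 for the sigmoid."""
    s = sigmoid(x0)
    a = s                                    # function value
    b = s * (1.0 - s)                        # first derivative
    c = 0.5 * s * (1.0 - s) * (1.0 - 2.0 * s)  # half the second derivative
    return a, b, c

def taylor_sigmoid(x):
    x0 = round(x / STEP) * STEP              # nearest expansion point
    a, b, c = taylor_coeffs(x0)
    dx = x - x0
    return a + b * dx + c * (dx * dx)        # three multiplications, as on the slide

print(taylor_sigmoid(0.3), sigmoid(0.3))
```

Narrower segments or a higher-order term would improve accuracy further, at the cost of a larger coefficient table or more multipliers.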

Taylor Series

Neural Network Design. Desired: 27-40-50-70-1200 architecture; maximum parallel design; entirely on-chip design. Limitations: 92,000 16-bit weights in 144 RAMB16s; layers are serial; 144 18x18 multipliers.

Neural Network Design: Initial Test Design. Serial pipeline; one multiply per clock; 92,000 clocks; 1 ms, equivalent to the PC.

Test Output: FPGA output vs. real output.

Neural Network Design: Maximum Parallel Version. 71 multiplies in parallel. Zero weight padding: treat all layers as the same length, 71. 25 clock wait for the pipeline. Total: 1475 clocks per network evaluation; 15 microseconds; 60,000 network evaluations per second.
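The clock counts quoted for the serial and parallel designs can be turned into times with a short calculation, assuming both run at the 100 MHz FPGA clock listed on the hardware slide.

```python
FPGA_CLOCK_HZ = 100e6  # assumed: the 100 MHz FPGA clock from the hardware slide

def clocks_to_microseconds(clocks, clock_hz=FPGA_CLOCK_HZ):
    """Time taken by a given number of clock cycles, in microseconds."""
    return clocks / clock_hz * 1e6

serial_us = clocks_to_microseconds(92_000)   # serial design: one multiply per clock
parallel_us = clocks_to_microseconds(1_475)  # parallel design: 71 multiplies per clock
speedup = 92_000 / 1_475

print(f"serial: {serial_us:.0f} us, parallel: {parallel_us:.2f} us, "
      f"speedup: {speedup:.1f}x")
```

Under that assumption the serial design takes about 0.92 ms (the slide's 1 ms) and the parallel design 14.75 microseconds (the slide's 15 microseconds), a speedup of roughly 62 times.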

Neural Network Design

Particle Swarm Optimization: 2 chips in the SRC. Particle swarm chip: controls inputs; sends them to the fitness chip; receives a fitness back. Fitness function chip: calculates the network; compares to the desired output.

Particle Swarm Implementation. Problem: randomness in $v_{k+1} = v_k + \mathrm{rand} \cdot w_1 (G_b - x_k) + \mathrm{rand} \cdot w_2 (P_b - x_k)$. Solutions: remove randomness, giving $v_{k+1} = v_k + w_1 (G_b - x_k) + w_2 (P_b - x_k)$; linear feedback shift register; squared decimal implementation.

Random vs. Deterministic. Deterministic: blue. Random: green/red.

Linear Feedback Shift Register
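A linear feedback shift register produces pseudo-random bits using only shifts and XORs, which makes it cheap in hardware. The sketch below uses a common textbook 16-bit configuration (taps 16, 14, 13, 11); the register width, taps, and seed are assumptions, since the transcript does not record which were used in the thesis.

```python
def lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR by one step (taps 16, 14, 13, 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def lfsr_random(state):
    """Return (next_state, pseudo-random value in the range 0..1)."""
    state = lfsr16_step(state)
    return state, state / 65536.0

def lfsr_period(seed=0xACE1):
    """Count steps until the register returns to its seed."""
    state, count = lfsr16_step(seed), 1
    while state != seed:
        state = lfsr16_step(state)
        count += 1
    return count

print(lfsr_period())  # 65535: every non-zero 16-bit state is visited once
```

A value from `lfsr_random` could stand in for `rand` in the velocity update; the all-zero state must be avoided because the register would stay there forever.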

Squared Decimal

Randomness Results. Standard conventional swarm error: 1.9385 units per pixel. Deterministic swarm error: 2.3587 units per pixel. LFSR swarm error: 2.3522 units per pixel. Squared decimal error: 2.3694 units per pixel.

Randomness Results. The gain from randomness is not significant; the deterministic method is used. All are much higher than the conventional swarm. Approximated network: approximation error between networks, 1.423 units per pixel; deterministic error on the approximated network, 1.8055 units per pixel.

Particle Swarm Chip. 10 agents. Preset starting points and velocities: 8 from previous data, with random velocities; 1 at maximum range, aimed down; 1 at minimum range, aimed up. Restrictions: maximum velocity; range.

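Update Equation Implementation (block diagram). Inputs: XnDimk, VnDimk, PnDimk, Gk, Vmaxk, Xmaxk, Xmink. Position path: X+V, compare, new XnDimk. Velocity path: P-X, G-X, V + 1/8(P-X) + 1/16(G-X), new VnDimk. Equations: $x_{k+1} = x_k + v_k$; $v_{k+1} = v_k + w_1 (G_b - x_k) + w_2 (P_b - x_k)$.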
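The deterministic update for one dimension of one particle can be sketched with integer arithmetic, where the 1/8 and 1/16 weights become right shifts by 3 and 4 bits. Treating the compare stage as a clamp against the range and velocity limits is an assumption, as are the function and parameter names.

```python
def clamp(value, low, high):
    """Limit a value to the closed range low..high."""
    return max(low, min(high, value))

def update_dimension(x, v, p, g, vmax, xmin, xmax):
    """One deterministic update step on integer fixed-point codes.

    x, v: current position and velocity; p: personal best; g: global best.
    """
    new_x = clamp(x + v, xmin, xmax)                 # X+V, then range check
    pull = ((p - x) >> 3) + ((g - x) >> 4)           # 1/8 (P-X) + 1/16 (G-X)
    new_v = clamp(v + pull, -vmax, vmax)             # velocity limit
    return new_x, new_v

print(update_dimension(100, 10, 180, 260, 50, 0, 1000))  # (110, 30)
```

Using power-of-two weights avoids multipliers entirely in this path, leaving them free for the neural network evaluation.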

Results: Output Matching. 100k-iteration PSO: 1.76 s. Figure labels: swarm, real.

Particle Swarm: Area Specific. 100k-iteration PSO: 1.76 s.

ANY QUESTIONS?